Analysing Heritrix Crawl Reports

Python Libraries used

In [1]:
%matplotlib inline
from pandas import DataFrame, read_csv
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import sys
In [2]:
print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
print('Matplotlib version ' + mpl.__version__)
Python version 2.7.6 (default, Mar 22 2014, 22:59:56) 
[GCC 4.8.2]
Pandas version 0.15.2
Matplotlib version 1.4.2

Heritrix Report: seeds-report.txt

In [3]:
dfseeds = read_csv('seeds-report.txt', sep = ' ')
In [5]:
dfseeds
Out[5]:
[code] [status] [seed] [redirect]
0 0 NOTCRAWLED http://007007007.eu/ NaN
1 0 NOTCRAWLED http://11001100.eu/ NaN
2 0 NOTCRAWLED http://111proxy.eu/ NaN
3 0 NOTCRAWLED http://1234proxy.eu/ NaN
4 0 NOTCRAWLED http://123blogging.eu/ NaN
5 0 NOTCRAWLED http://192.168.5.4/index.htm NaN
6 0 NOTCRAWLED http://192.168.5.7/index.htm NaN
7 0 NOTCRAWLED http://192.168.5.8:8084/Default.htm NaN
8 0 NOTCRAWLED http://1987proxy.eu/ NaN
9 0 NOTCRAWLED http://1golf.eu/ NaN
10 0 NOTCRAWLED http://24cast.eu/ NaN
11 0 NOTCRAWLED http://250.eu/ NaN
12 0 NOTCRAWLED http://2links.eu/ NaN
13 0 NOTCRAWLED http://3d-meble.eu/ NaN
14 0 NOTCRAWLED http://3dunitygames.eu/ NaN
15 0 NOTCRAWLED http://3dwebdirectory.eu/ NaN
16 0 NOTCRAWLED http://3gb.eu/ NaN
17 0 NOTCRAWLED http://3uu.eu/ NaN
18 0 NOTCRAWLED http://4reprint.eu/ NaN
19 0 NOTCRAWLED http://5x5.eu/ NaN
20 0 NOTCRAWLED http://6perces.eu/ NaN
21 0 NOTCRAWLED http://7dite.eu/ NaN
22 0 NOTCRAWLED http://7edmicky.eu/ NaN
23 0 NOTCRAWLED http://84e-forum.eu/ NaN
24 0 NOTCRAWLED http://888proxy.eu/ NaN
25 0 NOTCRAWLED http://999proxy.eu/ NaN
26 0 NOTCRAWLED http://9gags.eu/ NaN
27 0 NOTCRAWLED http://9pic.eu/ NaN
28 0 NOTCRAWLED http://a2a.eu/ NaN
29 0 NOTCRAWLED http://a2aenergia.eu/ NaN
... ... ... ... ...
55472 200 CRAWLED https://www.tweaknews.eu/ NaN
55473 200 CRAWLED https://www.twinker.eu/ NaN
55474 200 CRAWLED https://www.ukauka.eu/et/ NaN
55475 200 CRAWLED https://www.ultimate-warez.eu/ NaN
55476 200 CRAWLED https://www.umg-cms.eu/ NaN
55477 200 CRAWLED https://www.unidomo.de/ NaN
55478 200 CRAWLED https://www.universign.eu/en/ NaN
55479 200 CRAWLED https://www.vanstechelman.eu/ NaN
55480 200 CRAWLED https://www.viacopter.eu/ NaN
55481 200 CRAWLED https://www.visitportugal.com/pt-pt NaN
55482 200 CRAWLED https://www.visurepra.eu/visure/ NaN
55483 200 CRAWLED https://www.vitaminedesk.eu/ NaN
55484 200 CRAWLED https://www.vitaminedesk.eu/noscript/ NaN
55485 200 CRAWLED https://www.webace.eu/ NaN
55486 200 CRAWLED https://www.webdom.sk/ NaN
55487 200 CRAWLED https://www.webspace4all.eu/ NaN
55488 200 CRAWLED https://www.werunbcn.eu/ NaN
55489 200 CRAWLED https://www.wmbroker.eu/ NaN
55490 200 CRAWLED https://www.xsnews.nl/en/index.html NaN
55491 200 CRAWLED https://www.xsocial.eu/ NaN
55492 200 CRAWLED https://www.xsonline.eu/ NaN
55493 200 CRAWLED https://www.you-create.eu/ NaN
55494 200 CRAWLED https://www.youtube.com/user/SuperBufla?sub_co... NaN
55495 200 CRAWLED https://www.zitmaxx.nl/ NaN
55496 200 CRAWLED https://xemix.eu/ NaN
55497 200 CRAWLED https://ydns.eu/ NaN
55498 200 CRAWLED https://you-create.eu/ NaN
55499 200 CRAWLED https://yousmoke.eu/ NaN
55500 200 CRAWLED https://zayna.eu/ NaN
55501 200 CRAWLED https://zfs.ecdc.europa.eu/ NaN

55502 rows × 4 columns
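The parsing step can be tried without the full report. The sketch below is only an illustration: it assumes the raw file is single-space separated and simply omits the [redirect] field when a seed was not redirected, which is inferred from the NaN values above and not confirmed. The two sample rows are taken from the listing; the names sample, seeds and crawled are made up for the example.

```python
import io
import pandas as pd

# Two sample lines in the ASSUMED raw layout: single-space separated,
# with the [redirect] field left out when there is no redirect.
sample = io.StringIO(
    u"[code] [status] [seed] [redirect]\n"
    u"0 NOTCRAWLED http://007007007.eu/\n"
    u"200 CRAWLED https://zayna.eu/\n"
)

# Rows shorter than the header are padded with NaN by read_csv.
seeds = pd.read_csv(sample, sep=' ')

# Keep only the seeds that were actually fetched.
crawled = seeds[seeds['[status]'] == 'CRAWLED']
```

Filtering on [status] in this way separates fetched seeds from the NOTCRAWLED ones before looking at return codes.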

Seeds Return Codes Distribution

In [6]:
codes = dfseeds['[code]'].value_counts()
codes
Out[6]:
 200     31361
 301     10836
 302      6200
 0        3306
 403      1797
-9998      993
 303       282
 404       207
 503       157
 500        86
 401        78
 307        68
-1          33
 406        30
 400        23
 522        15
 502         6
 410         6
 204         4
-6           3
 520         3
-2           2
 521         2
 501         1
-50          1
 202         1
 402         1
dtype: int64
In [7]:
codes_to_plot = codes[:6]
In [19]:
ax = codes_to_plot.plot(kind='pie', figsize=(6, 6), autopct='%.2f%%', fontsize=15)
ax.set_title('Seeds return codes weight', fontsize=25)
Out[19]:
<matplotlib.text.Text at 0x7fe9ebb53c50>
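With 27 distinct codes, grouping them into coarse classes can make the distribution easier to read. The following is a minimal sketch, not part of the original analysis: the helper code_class and its labels are hypothetical, and it assumes that negative values are crawler-internal statuses and not HTTP responses, and that 0 means the seed was never fetched (consistent with the NOTCRAWLED rows above). Only the six most frequent codes from the listing are used.

```python
import pandas as pd

def code_class(code):
    """Map a fetch status code to a coarse class label (hypothetical helper)."""
    if code < 0:
        return 'crawler error'   # assumption: negative codes are not HTTP statuses
    if code == 0:
        return 'not crawled'     # assumption: 0 means the fetch was never tried
    return '%dxx' % (code // 100)

# The six most frequent codes from the value_counts() output above.
codes = pd.Series({200: 31361, 301: 10836, 302: 6200,
                   0: 3306, 403: 1797, -9998: 993})

# groupby() with a function applies it to each index value (the code).
by_class = codes.groupby(code_class).sum()
```

Here the two redirect codes collapse into a single 3xx class, so the pie chart would show five slices in place of six.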

Heritrix Report: hosts-report.txt

In [43]:
dfhosts = read_csv('hosts-report.txt', delimiter = ' ', index_col=False)

URLs per host

In [73]:
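dfhosts['TOTAL_URLS'] = dfhosts['[#urls]'] + dfhosts['[#remaining]']
# Skip the first row, then order hosts by total URLs (on pandas >= 0.17, use sort_values(by='TOTAL_URLS')).
order_dfhosts = dfhosts.ix[1:].sort(columns='TOTAL_URLS', ascending=False)
order_dfhosts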
Out[73]:
[#urls] [#bytes] [host] [#robots] [#remaining] TOTAL_URLS
3209 9990 44443540 pro.annuairefrancais.fr 0 3597848 3607838
3034 9990 492318034 fbclick.eu 0 2227821 2237811
1453 9993 379573269 www.planet-elektronik.eu 0 1924447 1934440
5178 9983 885653131 www.casatableware.eu 0 1495235 1505218
7152 7407 1206814 red.saela.eu 0 1369163 1376570
6602 8128 758195207 www.europeana.eu 10 1311399 1319527
2900 9991 878086322 www.vodka-shop-wodka.eu 20 1306460 1316451
128879 56 1297666 inscrip.annuaire-francais.eu 222609 1309202 1309258
4854 9986 2211089432 www.colette.fr 5738 961414 971400
16 19986 455080060 www.youtube.com 1 950963 970949
1955 9992 426094345 www.europages.eu 0 945360 955352
6180 8726 612484314 www.kuebler.eu 0 889557 898283
1515 9992 108504631 cdn1.porniac.com 0 879125 889117
6167 8747 207610592 www.kto-dzwonil.eu 0 855690 864437
4075 9988 1029405934 archive.popurls.com 0 840335 850323
13432 3241 91342808 www.autobazar.eu 254 736414 739655
29929 1023 319221192 europeana.eu 4 733344 734367
6215 8680 2154739850 www.vidics.ch 0 675585 684265
3466 9990 276434310 www.pinporn.eu 0 637656 647646
80 18752 1345758705 www.kaiyuan.eu 3228 622248 641000
2354 9992 296958719 www.xnxxx.eu 0 627382 637374
71 19271 1391601338 ww.kaiyuan.eu 5909 616143 635414
535 10074 388117940 www.astroshop.eu 0 621649 631723
9528 5110 997490309 lilac-travel.eu 0 626549 631659
614 10002 735129969 kaiyuan.eu 3104 611364 621366
5770 9318 678941829 wwww.kaiyuan.eu 3180 574444 583762
3786 9989 1030166051 sitl.eu 28 567101 577090
2302 9992 262526698 www.turism-europa.eu 0 556741 566733
189 12964 1955383326 www.goreapparel.eu 0 539984 552948
3776 9989 1204738807 regioni.amnesty.it 0 534716 544705
... ... ... ... ... ... ...
1085357 0 0 peugeot-ludix.autobazar.eu 0 0 0
1085167 0 0 automark.autobazar.eu 0 0 0
1085359 0 0 rastislav-jelus.autobazar.eu 0 0 0
1085360 0 0 s-mackovic.autobazar.eu 0 0 0
1085243 0 0 euroservis2000.autobazar.eu 0 0 0
1085242 0 0 eurofinance.autobazar.eu 0 0 0
1085363 0 0 sbosacky.autobazar.eu 0 0 0
1085168 0 0 autonamiru.autobazar.eu 0 0 0
1085365 0 0 sharan-galaxy.autobazar.eu 0 0 0
1085289 0 0 honda-vision.autobazar.eu 0 0 0
1085367 0 0 sona1122.autobazar.eu 0 0 0
1085164 0 0 autoapk.autobazar.eu 0 0 0
1085163 0 0 auto-schwab.autobazar.eu 0 0 0
1085160 0 0 apauto.autobazar.eu 0 0 0
1085265 0 0 holubf.autobazar.eu 0 0 0
1085271 0 0 honda-cmx.autobazar.eu 0 0 0
1085270 0 0 honda-clr.autobazar.eu 0 0 0
1085269 0 0 honda-cg.autobazar.eu 0 0 0
1085268 0 0 honda-cbx.autobazar.eu 0 0 0
1085267 0 0 honda-ca.autobazar.eu 0 0 0
1085266 0 0 honda-black-widow.autobazar.eu 0 0 0
1085264 0 0 hb-autouvaly.autobazar.eu 0 0 0
1085252 0 0 fiat-616.autobazar.eu 0 0 0
1085263 0 0 hambalek.autobazar.eu 0 0 0
1085261 0 0 gaz-gazela.autobazar.eu 0 0 0
1085159 0 0 amaxeu.autobazar.eu 0 0 0
1085255 0 0 finecar.autobazar.eu 0 0 0
1085254 0 0 fiat-strada.autobazar.eu 0 0 0
1085253 0 0 fiat-panda-van.autobazar.eu 0 0 0
1085256 0 0 finky.autobazar.eu 0 0 0

1085444 rows × 6 columns

In [81]:
ax = order_dfhosts[['TOTAL_URLS','[host]']][:20].plot(kind='bar',x='[host]',figsize=(10,10))
ax.set_ylabel('Number of URLs', fontsize=15)
ax.set_xlabel('Hosts', fontsize=15)
ax.set_title('TOP 20 Hosts with most URLs', fontsize=25)
Out[81]:
<matplotlib.text.Text at 0x7fe9e4ebd8d0>
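When only the top N rows are needed, a full sort of more than a million hosts is not required. As a sketch, assuming a pandas version of 0.17 or later (where DataFrame.nlargest is available, so not the 0.15.2 used in this notebook), the selection could look like this. The three rows are copied from the table above; the frame name hosts is made up for the example.

```python
import pandas as pd

# Three rows copied from the hosts table above.
hosts = pd.DataFrame({
    '[host]': ['www.youtube.com', 'pro.annuairefrancais.fr', 'fbclick.eu'],
    '[#urls]': [19986, 9990, 9990],
    '[#remaining]': [950963, 3597848, 2227821],
})

# Total = URLs already crawled + URLs still queued for the host.
hosts['TOTAL_URLS'] = hosts['[#urls]'] + hosts['[#remaining]']

# Select the 2 largest rows without sorting the whole frame.
top = hosts.nlargest(2, 'TOTAL_URLS')
```

The result is already ordered from largest to smallest, so it can be passed straight to plot(kind='bar', x='[host]').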

Top 30 Hosts by URLs Crawled

In [85]:
dfhosts[['[#urls]','[host]']][1:31]
Out[85]:
[#urls] [host]
1 19994 www.mammoth-shop.eu
2 19992 kidswholesale.eu
3 19992 www.marksandspencer.eu
4 19992 www.notebookcheck.net
5 19991 www.elemental.eu
6 19991 www.gandi.net
7 19990 www.active24.cz
8 19989 moodrat.eu
9 19989 www.ill.eu
10 19988 www.m-s-v.eu
11 19988 www.ozbekperde.eu
12 19987 i.ytimg.com
13 19987 www.blogger.com
14 19986 easa.europa.eu
15 19986 www.shifters.eu
16 19986 www.youtube.com
17 19985 eurofound.europa.eu
18 19985 twitter.com
19 19985 www.eba.europa.eu
20 19985 www.ecoporio.eu
21 19984 www.menatwork.nl
22 19984 www.mywort.lu
23 19984 www.pricebreaker.eu
24 19984 www.skydsl.eu
25 19984 www.youtube-nocookie.com
26 19983 cdn3.primor.eu
27 19983 www.nazuby.eu
28 19982 community.ebay.com
29 19982 www.apteka-zdrowie.eu
30 19982 www.designshops.eu

Total URLs Remaining

In [13]:
dfhosts['[#remaining]'].sum()
Out[13]:
204640341

Top 20 hosts with the most URLs remaining

In [87]:
ax = dfhosts[['[host]','[#remaining]']].sort(columns='[#remaining]', ascending = False)[:20].plot(kind='bar',x='[host]',figsize=(10,10))
ax.set_ylabel('Number of URLs', fontsize=15)
ax.set_xlabel('Hosts', fontsize=15)
ax.set_title('TOP 20 hosts with most URLs remaining', fontsize=25)
Out[87]:
<matplotlib.text.Text at 0x7fe9db576090>

Remaining URLs distribution

In [15]:
dfhosts['[#remaining]'].describe()
Out[15]:
count    1085445.000000
mean         188.531285
std         7537.386922
min           -1.000000
25%            0.000000
50%            0.000000
75%            0.000000
max      3597848.000000
Name: [#remaining], dtype: float64

Crawled URLs distribution

In [19]:
dfhosts['[#urls]'].describe()
Out[19]:
count    1085445.000000
mean         151.320549
std         2145.629697
min            0.000000
25%            2.000000
50%            4.000000
75%           18.000000
max      2021136.000000
Name: [#urls], dtype: float64
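Both summaries show a strongly skewed distribution: the quartiles are tiny while the maximum is in the millions. A few extra numbers make that skew explicit. The sketch below uses a small hypothetical series (not values from the report) to show how upper quantiles, the share of hosts at zero, and the share held by the largest host could be computed.

```python
import pandas as pd

# HYPOTHETICAL per-host counts: most hosts have nothing left, a few have a lot.
remaining = pd.Series([0] * 97 + [10, 1000, 100000])

# Upper quantiles reveal the tail that the default quartiles hide.
quantiles = remaining.quantile([0.5, 0.9, 0.99])

# Fraction of hosts with no URLs remaining.
zero_share = (remaining == 0).mean()

# Fraction of all remaining URLs that belong to the single largest host.
top_share = remaining.nlargest(1).sum() / float(remaining.sum())
```

On the real data the same three lines would be applied to dfhosts['[#remaining]'] or dfhosts['[#urls]'].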

Crawl Logs: Links Extracted from crawl.log

Round 1: keeping only .eu seeds

In [93]:
df_urls = read_csv('seeds_extracted.txt', header=None, names=['URL'])
# Escape the dot so that only entries ending in '.eu' match.
df_eu_urls = df_urls[df_urls['URL'].str.contains(r'\.eu$')]
print("Number of seeds round1:")
df_eu_urls.count()
Number of seeds round1:

Out[93]:
URL    270452
dtype: int64

Round 2: removing identified spam seeds

In [99]:
spam_pattern = r'(?:dbquanti|autobazar|in-links|myface4u|share-with|prace-jobs|cutegirls)\.eu$'
df_eu_urls_2 = df_eu_urls[~df_eu_urls['URL'].str.contains(spam_pattern)]
print("Number of seeds round2:")
df_eu_urls_2.count()
Number of seeds round2:

Out[99]:
URL    239912
dtype: int64

Round 3: removing more spam detected later (e-mp3s.eu)

In [100]:
# Build the mask from df_eu_urls_2, the frame being filtered.
df_eu_urls_3 = df_eu_urls_2[~df_eu_urls_2['URL'].str.contains(r'\.e-mp3s\.eu$')]
print("Number of seeds round3:")
df_eu_urls_3.count()
Number of seeds round3:

Out[100]:
URL    199227
dtype: int64
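The three rounds all filter on a domain suffix. One way to keep such filters manageable is to build the pattern from a list, escaping each name. The sketch below is an alternative, not the notebook's method: the helper suffix_pattern is hypothetical, and unlike the patterns above it only matches at a label boundary, so a host such as notautobazar.eu (an invented name) would be kept.

```python
import re
import pandas as pd

def suffix_pattern(domains):
    """Regex matching hosts equal to, or subdomains of, any of `domains`."""
    alternatives = '|'.join(re.escape(d) for d in domains)
    # (?:^|\.) anchors the match at the start of the host or at a dot.
    return r'(?:^|\.)(?:%s)$' % alternatives

# Two spam domains named in the rounds above.
spam = ['autobazar.eu', 'e-mp3s.eu']

# Hypothetical host names for illustration.
hosts = pd.Series(['a.autobazar.eu', 'autobazar.eu', 'notautobazar.eu',
                   'x.e-mp3s.eu', 'europa.eu'])

kept = hosts[~hosts.str.contains(suffix_pattern(spam))]
```

Adding a newly detected spam domain then means appending one entry to the list, with no further round of filtering code.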

Top 20 Domains with the Most Subdomains

In [109]:
# Reduce each host name to its last two labels (domain + TLD), then count occurrences.
domains = df_eu_urls_3['URL'].apply(lambda host: '.'.join(host.split('.')[-2:]))
dfilter = domains.value_counts()
dfilter[:20]
Out[109]:
gabinet.eu                  22873
stronwww.eu                 13549
viaromania.eu                9454
softfree.eu                  8550
maxtrader.eu                 7300
5mp.eu                       4394
heimat.eu                    2221
npage.eu                     2176
zooburza.eu                  2149
jouwpagina.eu                1772
telesoft.eu                  1376
mastertopforum.eu            1231
pcburza.eu                   1006
islive.eu                     979
forfiter.eu                   861
czech-mountains.eu            828
hotelsbudapesthungary.eu      649
europa.eu                     634
blogsport.eu                  620
napredaj.eu                   565
dtype: int64
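The counts above include the bare domain itself whenever it appears in the seed list. If only true subdomains should be counted, the bare entries can be dropped first. This is a sketch with hypothetical host names; it relies on the last two labels forming the registered domain, which holds for plain .eu names but would not hold for suffixes such as co.uk.

```python
import pandas as pd

# Hypothetical host names for illustration.
hosts = pd.Series(['a.gabinet.eu', 'b.gabinet.eu', 'gabinet.eu', 'x.europa.eu'])

# Last two labels of each host, using vectorised string methods.
domain = hosts.str.split('.').str[-2:].str.join('.')

# A host is a subdomain when it differs from its own two-label domain.
is_subdomain = hosts != domain

# Count subdomains per domain, ignoring the bare domain entries.
sub_counts = domain[is_subdomain].value_counts()
```

In this sample gabinet.eu is counted twice and europa.eu once, because the bare gabinet.eu entry is excluded.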