Portuguese Web Archive Crawler

Find out whether your web site has been archived and, if you wish, how to prevent it from being archived.

 

What is the Portuguese Web Archive crawler?

The Portuguese Web Archive crawler is the system that automatically collects content from the web to be archived. This kind of system is also known as a spider or harvester.

How does it work?

A crawler harvests content from the web iteratively. It downloads a document and extracts the links embedded in it to find new content. A new crawl is bootstrapped with an initial set of web addresses, called seeds. In each new crawl, the home pages of all web sites under .PT that were successfully crawled before are used as seeds.
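The download-and-extract loop described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the crawler's actual code: the `crawl` function, the `fetch` callable, the `max_pages` limit and the in-memory `FAKE_WEB` are all hypothetical names introduced for the example, and the example.pt addresses are placeholders.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from the seeds.

    fetch(url) is a hypothetical callable that returns the HTML text of a
    page, or None when the download fails. Returns the URLs that were
    downloaded successfully, in visit order.
    """
    frontier = deque(seeds)   # addresses waiting to be downloaded
    seen = set(seeds)         # addresses already queued, to avoid repeats
    archived = []
    while frontier and len(archived) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        archived.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return archived


# A tiny in-memory "web" standing in for real HTTP downloads.
FAKE_WEB = {
    "http://example.pt/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.pt/a.html": '<a href="/b.html">B</a>',
    "http://example.pt/b.html": '<a href="/">home</a>',
}

visited = crawl(["http://example.pt/"], FAKE_WEB.get)
print(visited)
```

Starting from the single seed, the sketch reaches all three pages and visits each one only once, because addresses that were already queued are skipped.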

Have I been crawled?

Webmasters can tell whether their servers have been crawled by checking the server logs for requests identified by the following user agent:

Arquivo-web-crawler  (compatible; heritrix/1.12.1 +http://arquivo-web.fccn.pt)

If you detect any unexpected behavior, please contact us, indicating the full User-Agent identification, the dates of access and as thorough a description of the problem as possible.
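As a rough aid for checking logs, the sketch below filters access-log lines by the crawler's user agent and pulls out the access date and request, which are the details asked for in a problem report. It assumes the server writes the common "combined" log format; the `crawler_hits` function, the `LOG_LINE` pattern and the two sample lines are made up for illustration and are not real log data.

```python
import re

# Token taken from the user agent string shown above.
CRAWLER_TOKEN = "Arquivo-web-crawler"

# Assumed layout (combined log format):
# host ident user [date] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)


def crawler_hits(lines):
    """Return (date, request, user agent) for each request made by the crawler."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and CRAWLER_TOKEN in match.group("agent"):
            hits.append(
                (match.group("date"), match.group("request"), match.group("agent"))
            )
    return hits


# Invented sample lines, for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [12/Mar/2008:10:15:02 +0000] "GET /index.html HTTP/1.0" 200 5120 "-" '
    '"Arquivo-web-crawler  (compatible; heritrix/1.12.1 +http://arquivo-web.fccn.pt)"',
    '192.0.2.77 - - [12/Mar/2008:10:15:09 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64)"',
]

for date, request, agent in crawler_hits(SAMPLE_LOG):
    print(date, request, agent, sep=" | ")
```

Servers that use a different log layout would need a different pattern; a plain substring search for the user agent token is often enough for a first check.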

How was it implemented?

Its current version is based on Heritrix, a crawler created by the Internet Archive specifically to meet the requirements of web archiving.

Which contents are crawled?

The crawler is able to collect all kinds of web documents, so that as much information as possible is preserved for the future. However, to keep the crawler working properly when it visits malicious or malfunctioning web sites, some constraints must be imposed, such as a size limit on downloaded content.
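One way such a size limit can work is to stop reading a response once a byte cap is reached, so that an endless or oversized response cannot stall the crawl. The sketch below is a generic illustration, not the crawler's implementation; `read_capped` and its parameters are hypothetical, and the caps used in the example are arbitrary values, since the actual limit is not stated here.

```python
import io


def read_capped(stream, max_bytes, chunk_size=8192):
    """Read at most max_bytes from a binary stream.

    Returns (data, truncated), where truncated is True when the stream
    still had content after the cap was reached.
    """
    chunks = []
    remaining = max_bytes
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:
            # The stream ended before the cap was reached.
            return b"".join(chunks), False
        chunks.append(chunk)
        remaining -= len(chunk)
    # Cap reached: probe one more byte to see whether content was cut off.
    truncated = bool(stream.read(1))
    return b"".join(chunks), truncated


# In-memory streams stand in for HTTP response bodies.
small = read_capped(io.BytesIO(b"abc"), max_bytes=10)
large = read_capped(io.BytesIO(b"x" * 100), max_bytes=10)
print(small)
print(large)
```

The same function can be applied to any file-like object that supports `read`, including the response objects returned by `urllib.request.urlopen`.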

What is the frequency of requests made to my web site?

The crawler observes a courtesy pause of 10 seconds between requests to the same site, so that it does not overload web servers. This pause imposes a lower load than a browser does when it opens, for example, an HTML page and its corresponding images. If you detect any harmful behavior by our crawler, please let us know.
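The effect of the courtesy pause can be illustrated with a small per-site scheduler. This is a sketch under stated assumptions, not the crawler's code: the `PolitenessScheduler` class is hypothetical, and it treats "same site" as "same host name", which is an assumption. Times are passed in explicitly so the example runs without actually waiting.

```python
from urllib.parse import urlsplit

COURTESY_PAUSE = 10.0  # seconds between requests to the same site

# Upper bound implied by the pause: one request every 10 seconds.
MAX_REQUESTS_PER_DAY = int(24 * 60 * 60 // COURTESY_PAUSE)


class PolitenessScheduler:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, pause=COURTESY_PAUSE):
        self.pause = pause
        self._next_allowed = {}

    def wait_time(self, url, now):
        """Seconds still to wait before url may be requested."""
        host = urlsplit(url).hostname
        return max(0.0, self._next_allowed.get(host, now) - now)

    def record_request(self, url, now):
        """Note that url was requested at time now."""
        host = urlsplit(url).hostname
        self._next_allowed[host] = now + self.pause


scheduler = PolitenessScheduler()
scheduler.record_request("http://www.example.pt/index.html", now=0.0)
print(scheduler.wait_time("http://www.example.pt/logo.png", now=3.0))  # 7.0
print(scheduler.wait_time("http://other.example.pt/", now=3.0))        # 0.0
print(MAX_REQUESTS_PER_DAY)                                            # 8640
```

With a 10-second pause, a single site receives at most 6 requests per minute, or 8,640 per day. In a real crawler the `now` argument would come from a clock such as `time.monotonic()`.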

Can I prevent my web site from being visited?

Yes. The Portuguese Web Archive crawler respects the Robots Exclusion Protocol. If you want to prevent our crawler from visiting all or part of your web site, and therefore exclude that content from the Portuguese Web Archive, follow the instructions for compliance with the Robots Exclusion Protocol.
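As an illustration of the Robots Exclusion Protocol, the sketch below checks a sample robots.txt with Python's standard `urllib.robotparser` module. The rules, the `/private/` path and the example.pt addresses are invented for the example. The token `Arquivo-web-crawler` is taken from the user agent string given earlier; whether that is the exact token the crawler matches in robots.txt is an assumption, so the archive's own instructions take precedence.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt: keep the archive crawler out of /private/ only,
# and place no restrictions on other robots.
ROBOTS_TXT = """\
User-agent: Arquivo-web-crawler
Disallow: /private/

User-agent: *
Disallow:
"""

AGENT = "Arquivo-web-crawler"  # assumed robots.txt token, see the note above

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

blocked = parser.can_fetch(AGENT, "http://www.example.pt/private/report.html")
allowed = parser.can_fetch(AGENT, "http://www.example.pt/index.html")
print(blocked)  # False
print(allowed)  # True
```

Replacing `Disallow: /private/` with `Disallow: /` in the first group would exclude the whole site for that user agent.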

When will I be able to see the archived versions of my web site?

The development of the Portuguese Web Archive began in January 2008, so the historical collection of archived content is still being built and is currently meant mainly for system testing and experiments. We plan to launch a search service over the archived content within two years. Archived content will become available only after a minimum time delay, to reduce the possibility of competing with the original publisher's web site.

FCCN - Fundação para a Computação Científica Nacional; POSC - Programa Operacional Sociedade do Conhecimento; UMIC - Agência para a Sociedade do Conhecimento; UE - União Europeia - FEDER - Fundo Europeu de Desenvolvimento Regional