Content crawling and archiving

Crawling and archiving Portuguese Web contents

What is the Portuguese Web?

The Portuguese Web comprises everything hosted under the .pt domain, plus contents hosted outside this domain that are of clear interest to the Portuguese community.
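As a rough illustration of this definition, the check below treats a URL as part of the Portuguese Web if its host is under .pt or appears in a hand-maintained list of outside hosts. The function name and the EXTRA_HOSTS list are assumptions made for this sketch; the archive's actual selection process is not described here.

```python
from urllib.parse import urlsplit

# Hypothetical list of hosts outside .pt judged to be of interest
# to the Portuguese community (placeholder value).
EXTRA_HOSTS = {"example.org"}


def in_portuguese_web(url, extra_hosts=EXTRA_HOSTS):
    """Return True if the URL's host is under .pt or explicitly listed."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    return host == "pt" or host.endswith(".pt") or host in extra_hosts


print(in_portuguese_web("http://www.fccn.pt/"))      # host under .pt
print(in_portuguese_web("http://example.com/pt"))    # "pt" only in the path
print(in_portuguese_web("http://example.org/page"))  # listed outside host
```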

What is the Portuguese Web Archive crawler?

The Portuguese Web Archive crawler is the system that automatically collects contents from the web to be archived. This kind of system is also known as a spider or harvester.

How does it work?

A crawler iteratively harvests contents from the web. It downloads a piece of content and extracts its embedded links to find new contents. A new crawl is bootstrapped with an initial set of web addresses, called seeds. In each new crawl, the home pages of all web sites under .pt that were successfully crawled before are used as seeds.
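The loop described above can be sketched in a few lines. This is a minimal illustration, not the archive's crawler: the crawl function, the LinkExtractor class and the in-memory FAKE_WEB dictionary are all hypothetical, and the fake web stands in for real downloads so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch):
    """Breadth-first harvest; fetch(url) returns HTML text or None."""
    frontier = deque(seeds)
    seen = set(seeds)
    archived = {}
    while frontier:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue  # nothing to archive for this address
        archived[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return archived


# Hypothetical in-memory "web" used instead of real HTTP requests.
FAKE_WEB = {
    "http://a.pt/": '<a href="/news">news</a> <a href="http://b.pt/">b</a>',
    "http://a.pt/news": '<a href="/">home</a>',
    "http://b.pt/": "no links here",
}

archived = crawl(["http://a.pt/"], FAKE_WEB.get)
print(sorted(archived))
```

Starting from the single seed http://a.pt/, the sketch reaches the other two pages through extracted links, and the seen set keeps it from revisiting the home page.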

Have I been crawled?

Webmasters can detect whether their servers have been crawled by checking the logs for requests identified by the following user agent:

Arquivo-web-crawler  (compatible; heritrix/1.12.1 +http://arquivo-web.fccn.pt)

If you detect any unexpected behavior, please contact us, indicating the full User-Agent identification, the dates of access, and as thorough a description of the problem as possible.
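One way to look for such requests is to filter the server's access log for the crawler's name. The sketch below assumes a log where each line contains the User-Agent string; the function name and the sample log lines (addresses, dates, sizes) are made up for illustration.

```python
# Leading token of the user agent shown above.
UA_TOKEN = "Arquivo-web-crawler"


def crawler_requests(log_lines, token=UA_TOKEN):
    """Return the log lines whose text mentions the crawler's user agent."""
    needle = token.lower()
    return [line for line in log_lines if needle in line.lower()]


# Hypothetical access-log lines, for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2008:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Arquivo-web-crawler  (compatible; heritrix/1.12.1 +http://arquivo-web.fccn.pt)"',
    '192.0.2.77 - - [01/Jan/2008:10:00:05 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0"',
]

for line in crawler_requests(SAMPLE_LOG):
    print(line)
```

The matching lines already carry the full User-Agent and the access dates, which are the details requested when reporting a problem.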

What is the frequency of requests made to my web site?

The crawler respects a courtesy pause of 10 seconds between requests to the same site, so that its actions do not overload web servers. This pause imposes a lower load than a browser does when opening, for example, an HTML page and its corresponding images. If you detect any harmful behavior carried out by our crawler, please let us know.
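A courtesy pause of this kind is commonly enforced per host. The sketch below is one possible way to do it, not a description of the crawler's internals: the PolitenessGate class is hypothetical, and it assumes the pause is measured from the moment the previous request to that host was allowed to start. The clock and sleep functions are injectable so the demo runs instantly.

```python
import time
from urllib.parse import urlsplit


class PolitenessGate:
    """Delays requests so each host is contacted at most once per pause."""

    def __init__(self, pause_seconds=10.0, clock=time.monotonic, sleep=time.sleep):
        self.pause = pause_seconds
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = {}  # host -> earliest time of the next request

    def wait_for(self, url):
        """Block until the URL's host may be contacted; return the delay."""
        host = urlsplit(url).hostname or ""
        now = self.clock()
        delay = max(0.0, self.next_allowed.get(host, now) - now)
        if delay > 0:
            self.sleep(delay)
        self.next_allowed[host] = now + delay + self.pause
        return delay


# Demo with a fake clock so nothing actually sleeps.
fake_time = [0.0]


def fake_sleep(seconds):
    fake_time[0] += seconds


gate = PolitenessGate(10.0, clock=lambda: fake_time[0], sleep=fake_sleep)
print(gate.wait_for("http://a.pt/page1"))  # first request to a.pt
print(gate.wait_for("http://a.pt/page2"))  # same site, must wait
print(gate.wait_for("http://b.pt/"))       # different site
```

Only the second request to the same site is delayed; a request to a different site proceeds immediately.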

How often do you collect the Portuguese Web and how long does it take?

We perform 3 to 4 crawls per year. About 90% of the contents are crawled within 7 days. However, a crawl continues for longer on slower sites and on sites with a larger amount of contents. We plan to crawl selected Portuguese publications more frequently soon.

Do you collect the whole Portuguese Web?

No. Some constraints are imposed, for instance, on the:

maximum size of contents downloaded from the Web
number of contents per site
number of links the crawler follows from an initial address until it reaches the content
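The three constraints listed above could be expressed as a simple check applied to each candidate content. This is only a sketch: the CrawlLimits class, the should_archive function and all numeric defaults are placeholders, since the actual limits are not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlLimits:
    # Placeholder values; the real limits are not given in this document.
    max_bytes: int = 10 * 1024 * 1024  # maximum size of a downloaded content
    max_per_site: int = 10_000         # maximum number of contents per site
    max_hops: int = 5                  # maximum links followed from a seed


def should_archive(size_bytes, site_count, hops, limits=CrawlLimits()):
    """Return True if a content passes all three constraints.

    site_count is the number of contents already archived from the site;
    hops is the number of links followed from the initial address.
    """
    return (
        size_bytes <= limits.max_bytes
        and site_count < limits.max_per_site
        and hops <= limits.max_hops
    )


limits = CrawlLimits(max_bytes=100, max_per_site=2, max_hops=3)
print(should_archive(50, 0, 1, limits))  # within all limits
print(should_archive(50, 0, 4, limits))  # too many links from the seed
```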

In addition, the boundaries of the Portuguese Web are difficult to define accurately. Many contents are hosted outside the .pt domain, and identifying them requires particular effort. If you wish, you may suggest a site to be archived.

Which media types do you archive?

All media types.

What about the dynamically generated pages?

Dynamically generated pages are collected in the same way as static ones, as long as there is a link to them.

Do you archive restricted-access data?

No. The Portuguese Web Archive crawls only the public Web. Pages protected by password or other forms of access restriction are not crawled.

Does the Archive crawler fill in forms?

No. If you notice the crawler submitting forms, please let us know.

Can I prevent my web site from being visited?

Yes. The Portuguese Web Archive crawler respects the Robots Exclusion Protocol. If you want your web site to be partially or totally skipped by our crawler, and therefore excluded from the Portuguese Web Archive, follow the instructions for compliance with the Robots Exclusion Protocol.
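As an illustration, the sketch below checks a sample robots.txt with Python's standard urllib.robotparser module. It assumes the crawler matches the robots.txt user-agent name Arquivo-web-crawler, taken from the User-Agent string it sends; that name, the /private/ path and the example.pt host are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt: keep the archive crawler out of /private/,
# and leave everything open to all other robots.
ROBOTS_TXT = """\
User-agent: Arquivo-web-crawler
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

UA = "Arquivo-web-crawler"  # assumed robots.txt name for the crawler
print(parser.can_fetch(UA, "http://www.example.pt/private/report.html"))
print(parser.can_fetch(UA, "http://www.example.pt/index.html"))
```

With these rules the first address is refused and the second is allowed. Replacing /private/ with / in the Disallow line would exclude the whole site.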