
Robots Exclusion Protocol to indicate access restrictions

So that the Portuguese Web Archive can respect access restrictions, authors are advised to use the Robots Exclusion Protocol.

The Robots Exclusion Protocol (REP) can be used to list contents that should not be crawled and archived.

The access restrictions should be specified in a single file named robots.txt, hosted in the root directory of the site (e.g. http://arquivo.pt/robots.txt).
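
As a minimal sketch of where crawlers look for this file, the following Python snippet derives the robots.txt location from any page address on a site. The function name robots_txt_url and the sample page address are illustrative assumptions, not part of the Portuguese Web Archive's tooling.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that hosts page_url.

    Hypothetical helper: it keeps the scheme and host of the page and
    replaces the path with /robots.txt, dropping query and fragment.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # The page address below is a made-up example on the arquivo.pt host.
    print(robots_txt_url("http://arquivo.pt/some/page.html?lang=en"))
```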

Allow web archive crawlers to harvest all the files required to reproduce pages

  • Search engines only need to crawl textual contents to present results from a site.
  • Web archives need all the files embedded in a web page to reproduce it later (e.g. CSS, JavaScript or image files).
  • The default robots exclusion rules of some Content Management Systems (e.g. Joomla, Mambo) need to be changed to enable efficient web archiving.

Allowing the full crawl of a site by the Portuguese Web Archive

User-agent: Arquivo-web-crawler 
Disallow:
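
One way to check that these rules really leave page resources reachable is to test them with Python's standard urllib.robotparser module. This is only a sketch: the helper is_allowed and the example.com addresses are assumptions made for illustration, and the crawler itself may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The rules shown above: an empty Disallow value excludes nothing.
ROBOTS_TXT = """\
User-agent: Arquivo-web-crawler
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Hypothetical helper: parse robots.txt text and test one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Made-up resources that a page would embed (CSS, JavaScript, image).
    for path in ("/css/style.css", "/js/app.js", "/images/logo.png"):
        url = "http://example.com" + path
        print(path, is_allowed(ROBOTS_TXT, "Arquivo-web-crawler", url))
```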

Disallow contents that are harmful for crawlers

Authors can facilitate web archiving by using REP to identify contents that are irrelevant for archiving or harmful for crawlers. This way:

  • The visited sites save resources (e.g. bandwidth)
  • The Portuguese Web Archive saves resources (e.g. disk space)

Disallowing the crawl of a directory using robots.txt

For instance, a robots.txt file with the following instructions would forbid the Portuguese Web Archive from crawling all the contents under the folder /calendar/:

User-agent: Arquivo-web-crawler 
Disallow: /calendar/
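
The effect of this rule can be sketched with Python's standard urllib.robotparser module. The helper is_allowed, the example.com addresses and the agent name OtherBot are hypothetical and serve only to show that the exclusion covers the /calendar/ folder for the named crawler and nothing else.

```python
from urllib.robotparser import RobotFileParser

# The rules shown above, restricting only the /calendar/ folder.
ROBOTS_TXT = """\
User-agent: Arquivo-web-crawler
Disallow: /calendar/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Hypothetical helper: parse robots.txt text and test one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    agent = "Arquivo-web-crawler"
    # Anything under /calendar/ is excluded for the named crawler.
    print(is_allowed(ROBOTS_TXT, agent, "http://example.com/calendar/2024/"))
    # Other paths remain crawlable.
    print(is_allowed(ROBOTS_TXT, agent, "http://example.com/index.html"))
    # The rule names one user agent, so other robots are not affected by it.
    print(is_allowed(ROBOTS_TXT, "OtherBot", "http://example.com/calendar/2024/"))
```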

Disallowing the crawl and indexing of a page using the ROBOTS meta tag

Alternatively, access restrictions can be declared on each page by including the ROBOTS meta tag in its source code.

The following example would forbid the crawl and indexing of the page by all robots:

<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />

Exclusions defined through the ROBOTS meta tag apply to all robots, including search engines such as Google.
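
To illustrate how a robot might read this tag, the sketch below extracts the directives from a page with Python's standard html.parser module. The class RobotsMetaParser and the function robots_directives are hypothetical names; real crawlers use their own parsers. In common usage NOINDEX asks robots not to index the page and NOFOLLOW asks them not to follow its links.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Hypothetical parser that collects ROBOTS meta tag directives."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for directive in attributes.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html: str) -> set:
    """Return the lower-cased directives found in ROBOTS meta tags."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="ROBOTS" content="NOINDEX, NOFOLLOW" /></head></html>'
    print(sorted(robots_directives(page)))
```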
