Robots Exclusion Protocol to indicate access restrictions
For the Portuguese Web Archive to respect access restrictions, authors are advised to use the Robots Exclusion Protocol.
The Robots Exclusion Protocol (REP) can be used to list contents that should not be crawled and archived.
Authors can facilitate web archiving by using REP to identify contents that are irrelevant for archiving or harmful to crawlers. This way:
- The visited sites save resources (e.g. bandwidth)
- The Portuguese Web Archive saves resources (e.g. disk space)
Example of the Robots Exclusion Protocol using robots.txt
Access restrictions should be specified in a single file named robots.txt, hosted in the root directory of a site (e.g. http://arquivo.pt/robots.txt).
For instance, a robots.txt file with the following instructions would forbid the Portuguese Web Archive from crawling any content under the folder /calendar/:
User-agent: Arquivo-web-crawler
Disallow: /calendar/
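
To check how a crawler would interpret these rules, Python's standard urllib.robotparser module can be used. The following is a minimal sketch, not part of the Portuguese Web Archive's tooling; it parses the example rules directly, and the example.org URLs are illustrative:

from urllib import robotparser

# Rules copied from the example above
rules = [
    "User-agent: Arquivo-web-crawler",
    "Disallow: /calendar/",
]

parser = robotparser.RobotFileParser()
parser.parse(rules)  # parse the rules directly, without fetching anything

# /calendar/ is disallowed for Arquivo-web-crawler; other paths stay crawlable
print(parser.can_fetch("Arquivo-web-crawler", "http://example.org/calendar/2024"))  # False
print(parser.can_fetch("Arquivo-web-crawler", "http://example.org/about"))          # True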
Example of the Robots Exclusion Protocol using the meta-tag ROBOTS
Alternatively, access restrictions can be specified on each individual page by including the ROBOTS meta tag in its HTML source code.
The following example would forbid all robots from indexing the page and from following its links:
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />
The exclusions defined through the ROBOTS meta tag apply to all robots, including search engines such as Google.
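
For illustration, a crawler that honours the ROBOTS meta tag could extract its directives with Python's standard html.parser module. This is a minimal sketch; the RobotsMetaParser class is a hypothetical helper, not part of any crawler's API:

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives of any <meta name="robots"> tag on a page."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # The name attribute is matched case-insensitively ("ROBOTS" == "robots")
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "")
            self.directives.update(
                d.strip().upper() for d in content.split(",") if d.strip()
            )

page = '<html><head><meta name="ROBOTS" content="NOINDEX, NOFOLLOW" /></head></html>'
parser = RobotsMetaParser()
parser.feed(page)
print("NOINDEX" in parser.directives)   # True: do not index this page
print("NOFOLLOW" in parser.directives)  # True: do not follow its links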