Robots Exclusion Protocol to indicate access restrictions
So that the Portuguese Web Archive can respect access restrictions, authors are advised to use the Robots Exclusion Protocol.
The Robots Exclusion Protocol (REP) can be used to list contents that should not be crawled and archived.
The access restrictions should be specified in a single file named robots.txt, hosted in the root directory of the site (e.g. http://arquivo.pt/robots.txt).
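As a minimal sketch of where a crawler would look for this file, the following Python snippet derives the robots.txt address from any page address on the same site. The function name robots_txt_url and the sample page path are hypothetical and used only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (with port, if any); replace the path
    # with /robots.txt and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page path on the site used as an example in the text.
print(robots_txt_url("http://arquivo.pt/some/page.html?x=1"))
```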
Allow web archive crawlers to harvest all the files required to reproduce pages
- Search engines only need to crawl textual contents to present results from a site.
- Web archives need all the files embedded in a web page to reproduce it later (e.g. CSS, JavaScript or image files).
- The default robots exclusion rules of some Content Management Systems (e.g. Joomla, Mambo) need to be changed to enable efficient web archiving.
Allowing the full crawl of a site by the Portuguese Web Archive
User-agent: Arquivo-web-crawler
Disallow:
Disallow harmful contents for crawlers
Authors can facilitate web archiving by using REP to identify contents that are irrelevant for archiving or harmful to crawlers. This way:
- The visited sites save resources (e.g. bandwidth)
- The Portuguese Web Archive saves resources (e.g. disk space)
Disallowing the crawl of a directory using robots.txt
For instance, a robots.txt file with the following instructions would forbid the Portuguese Web Archive from crawling all the contents under the folder /calendar/:
User-agent: Arquivo-web-crawler
Disallow: /calendar/
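Authors who want to check the effect of such rules before publishing them could test them locally. The sketch below uses Python's standard urllib.robotparser module, which is a generic REP parser and not necessarily the one used by the Portuguese Web Archive. The helper name can_crawl and the example.org URLs are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Tell whether user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Rules from the text: block /calendar/ for the Portuguese Web Archive crawler.
BLOCK_CALENDAR = """\
User-agent: Arquivo-web-crawler
Disallow: /calendar/
"""

# Rules from the text: an empty Disallow line allows the full crawl.
ALLOW_ALL = """\
User-agent: Arquivo-web-crawler
Disallow:
"""

AGENT = "Arquivo-web-crawler"

# example.org is a placeholder site, not taken from the original text.
print(can_crawl(BLOCK_CALENDAR, AGENT, "http://example.org/calendar/2020/"))
print(can_crawl(BLOCK_CALENDAR, AGENT, "http://example.org/index.html"))
print(can_crawl(ALLOW_ALL, AGENT, "http://example.org/calendar/2020/"))
```

The first check should report that the URL under /calendar/ is blocked, while the other two should report that crawling is allowed.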
Disallowing the crawl and indexing of a page using the ROBOTS meta tag
Alternatively, access restrictions can be described on each page by including the ROBOTS meta tag in its source code.
The following example would forbid the crawl and indexing of the page by all robots:
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />
The exclusions defined through the ROBOTS meta tag are applied to all robots, including search engines, such as Google.
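To illustrate how a robot might read these per-page restrictions, here is a minimal sketch that extracts the directives of the ROBOTS meta tag with Python's standard html.parser module. The class and function names are hypothetical, and real crawlers may interpret the tag differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # The tag name is matched case-insensitively (ROBOTS, robots, ...).
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for token in content.split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html: str) -> set:
    """Return the lowercased ROBOTS meta tag directives present in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


# Sample page built around the meta tag shown in the text.
PAGE = (
    "<html><head>"
    '<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />'
    "</head><body></body></html>"
)

directives = robots_directives(PAGE)
print("noindex" in directives, "nofollow" in directives)
```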
 
  