
One link for each content

To crawl and archive a web site efficiently, each piece of content must have its own link.

Every piece of content on a web site, including images, videos and pages, must be directly referenced by a URL. For instance, the URL http://arquivo-web.fccn.pt/logo.jpg references the Portuguese Web Archive logo.

The Portuguese Web Archive (PWA) crawler can only find and archive content that is referenced by at least one link on a page of the web site. There are two cases that require particular attention:

  • Streaming videos: these are downloaded by specific applications, such as Flash Player, Windows Media Player or Real Player. However, crawlers can only download content available through the HTTP protocol. Hence, for a video on the web to be archivable, there must be a link to download the full video file.
  • Content hidden behind forms: crawlers cannot fill out forms. Therefore, content that is available only after authentication, acceptance of terms or any other kind of form cannot be archived, unless links on other pages give direct access to it.
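The link-based discovery described above can be illustrated with a minimal sketch. It is not the PWA crawler's actual implementation; it only assumes that a crawler collects URLs from `href` and `src` attributes. The class name `LinkExtractor`, the function `discover_urls`, the sample page and the video path are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects URLs found in href/src attributes (an assumed, simplified crawler rule)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Only explicit links are followed; form targets are ignored.
            if name in ("href", "src") and value:
                self.urls.append(urljoin(self.base_url, value))


def discover_urls(html, base_url):
    """Return the absolute URLs that the page links to, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls


# Hypothetical page: an image, a direct video download link and a form.
SAMPLE_PAGE = """
<html><body>
  <img src="/logo.jpg">
  <a href="/videos/interview.mp4">Download the full video</a>
  <form action="/search" method="post"><input name="q"></form>
</body></html>
"""

found = discover_urls(SAMPLE_PAGE, "http://arquivo-web.fccn.pt/")
for url in found:
    print(url)
```

In this sketch the image and the video file are discovered because a link points to each of them, while anything reachable only by submitting the form stays invisible to the extractor.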

Patch existing sites

Having each piece of content referenced by a URL brings many advantages. However, it might be impossible to restructure an existing site to comply with this best practice.

A possible solution is to provide alternative information about the location of content on the site through one of the following:

  • User sitemap: improves usability and enables the crawler to reach all the pages of a site;
  • RSS feed archive: RSS feeds are used to publish the latest updates on a web site. An archive of feeds helps crawlers find content to download;
  • XML Sitemap: a file containing information about each URL of a site (e.g. last modification date, priority, change frequency). Although the PWA does not process XML sitemaps yet, this protocol is supported by companies such as Google, Yahoo! and Microsoft.

It is crucial to keep the information contained in these files up to date.
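As an illustration of the XML Sitemap option, the following sketch builds a small sitemap with the Python standard library. It assumes the element names of the public Sitemaps protocol (`urlset`, `url`, `loc`, `lastmod`, `changefreq`, `priority`); the function name `build_sitemap`, the second URL and all dates and values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build an XML sitemap string from a list of dicts describing URLs.

    Each dict needs a "loc" key; "lastmod", "changefreq" and "priority"
    are optional and are written only when present.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        for field in ("lastmod", "changefreq", "priority"):
            if field in entry:
                ET.SubElement(url, field).text = str(entry[field])
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical entries: dates, frequencies and priorities are made up.
sitemap_xml = build_sitemap([
    {
        "loc": "http://arquivo-web.fccn.pt/logo.jpg",
        "lastmod": "2010-01-15",
        "changefreq": "yearly",
        "priority": 0.3,
    },
    {
        "loc": "http://arquivo-web.fccn.pt/videos/interview.mp4",
        "lastmod": "2010-02-01",
    },
])
print(sitemap_xml)
```

Regenerating the file from the site's current list of URLs whenever content changes is one simple way to keep it up to date.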

FCCN - Fundação para a Computação Científica Nacional UMIC - Agência para a Sociedade do Conhecimento POSC - Programa Operacional Sociedade do Conhecimento UE - União Europeia - FEDER - Fundo Europeu de Desenvolvimento Regional