Architecture

Overview of the system's architecture and functioning.

Figure 1. Architecture of the Portuguese Web Archive system.

Figure 1 presents an overview of the Portuguese Web Archive architecture. The archive comprises two sub-systems:

  • Gathering, storage and indexing: collects, stores and preserves information from the web. This sub-system can operate independently of the Global Search sub-system;
  • Global Search: provides search over all the archived information.

Archive Nodes

Archive Nodes collect, store and index the information from the web.

Mergers

Mergers access the indexes of the Archive Nodes to respond to global searches. For example: find all the archived pages that contain the terms “1998 elections”.
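
The role of a Merger can be illustrated with a minimal sketch. It assumes that each Archive Node exposes a search function over its own index and that results carry a relevance score; the names ArchiveNodeIndex, Hit and merge_search, as well as the scoring rule, are hypothetical and are not taken from the actual system.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Hit:
    url: str
    timestamp: str  # capture date of the archived page, e.g. "19981004"
    score: float    # hypothetical relevance score


class ArchiveNodeIndex:
    """Toy stand-in for the term index of one Archive Node (hypothetical)."""

    def __init__(self, pages: Dict[str, str], timestamp: str) -> None:
        self.pages = pages          # url -> page text
        self.timestamp = timestamp  # one capture date for the whole toy node

    def search(self, terms: List[str]) -> List[Hit]:
        hits = []
        for url, text in self.pages.items():
            words = text.lower().split()
            # A page matches only if it contains every query term.
            if all(term.lower() in words for term in terms):
                # Illustrative score: total occurrences of the query terms.
                score = float(sum(words.count(term.lower()) for term in terms))
                hits.append(Hit(url, self.timestamp, score))
        return hits


def merge_search(nodes: List[ArchiveNodeIndex], terms: List[str],
                 limit: int = 10) -> List[Hit]:
    """Query every node and merge the partial results into one ranked list."""
    all_hits: List[Hit] = []
    for node in nodes:
        all_hits.extend(node.search(terms))
    # Highest score first; ties are broken by URL to keep the order stable.
    all_hits.sort(key=lambda hit: (-hit.score, hit.url))
    return all_hits[:limit]


node_a = ArchiveNodeIndex({"http://a.pt/": "1998 elections results elections"}, "19981004")
node_b = ArchiveNodeIndex({"http://b.pt/": "1998 elections",
                           "http://c.pt/": "football news"}, "19990101")
results = merge_search([node_a, node_b], ["1998", "elections"])
```

In this sketch the nodes are queried one after the other for simplicity; a real Merger would more likely contact the nodes in parallel.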

Front-end

The Front-end component receives the results returned by the Mergers and presents them to the users. A search accepts two parameters: the term or URL to search for, and the date range to search within.
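
The two search parameters could be represented as in the sketch below. The SearchRequest class and its method names are hypothetical, and the rule used to tell a URL from a term is a simplifying assumption rather than the behaviour of the real Front-end.

```python
from dataclasses import dataclass
from datetime import date
from urllib.parse import urlparse


@dataclass(frozen=True)
class SearchRequest:
    query: str   # either search terms or a URL
    start: date  # first day of the date range (inclusive)
    end: date    # last day of the date range (inclusive)

    def __post_init__(self) -> None:
        if self.start > self.end:
            raise ValueError("start date must not be after end date")

    def is_url_query(self) -> bool:
        # Assumption: a query counts as a URL only if it has an
        # http(s) scheme and a host; anything else is treated as terms.
        parsed = urlparse(self.query)
        return parsed.scheme in ("http", "https") and bool(parsed.netloc)

    def covers(self, capture_date: date) -> bool:
        # True if an archived version from capture_date is inside the range.
        return self.start <= capture_date <= self.end


term_request = SearchRequest("1998 elections", date(1998, 1, 1), date(1998, 12, 31))
url_request = SearchRequest("http://www.fccn.pt/", date(1996, 1, 1), date(2000, 12, 31))
```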

Inside an archive node

 

Figure 2. Components of an Archive Node.

Figure 2 presents the internal architecture of an Archive Node. At the beginning of a crawl of the Portuguese web, a set of web sites is assigned to each Archive Node. The node crawls these sites and stores their contents in the Content Storage Volume.

New web sites under the .PT domain that are found during the crawl are saved as candidates for the next crawl. At the end of each crawl, the contents are indexed (Term and URL Indexes) to support efficient searches over the archived collection.
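
The two steps of a crawl cycle described above, assigning sites to nodes and collecting new .PT sites as candidates, can be sketched as follows. The source does not say how sites are distributed among nodes, so the hash-based assignment is purely an assumption, and the function names assign_sites and new_pt_candidates are hypothetical.

```python
import zlib
from typing import Dict, Iterable, List, Set
from urllib.parse import urlparse


def assign_sites(sites: Iterable[str], node_count: int) -> Dict[int, List[str]]:
    """Assign each site to exactly one node (assumed scheme: CRC32 of the host)."""
    assignment: Dict[int, List[str]] = {node: [] for node in range(node_count)}
    for site in sorted(sites):
        # CRC32 is stable across runs, so the assignment is reproducible.
        node = zlib.crc32(site.encode("utf-8")) % node_count
        assignment[node].append(site)
    return assignment


def new_pt_candidates(discovered_links: Iterable[str],
                      known_sites: Set[str]) -> Set[str]:
    """Return hosts under .pt that were linked to but are not yet known."""
    candidates: Set[str] = set()
    for link in discovered_links:
        host = urlparse(link).hostname  # lower-cased host, or None
        if host and host.endswith(".pt") and host not in known_sites:
            candidates.add(host)
    return candidates


sites = ["a.pt", "b.pt", "c.pt"]
assignment = assign_sites(sites, node_count=2)
candidates = new_pt_candidates(
    ["http://novo.pt/x", "http://a.pt/y", "http://example.com/", "mailto:x@y.pt"],
    known_sites=set(sites),
)
```

Here the candidates are kept as host names; they would be added to the list of sites to assign at the start of the next crawl.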

Each Archive Node provides a User Interface that enables human users to search the information stored in it. Thus, the information stored in each Archive Node can be accessed independently of the rest of the system, which increases its chances of preservation if external components fail.

The Query Servers respond to requests made by the Mergers to enable searches over the whole archived collection.

The paper Introducing the Portuguese Web Archive details the functioning of the system. If you want to learn more, visit our publications page.
