Architecture

Overview of the system's architecture and functioning.

Figure 1. Architecture of the Portuguese Web Archive system.

 

Figure 1 presents an overview of the Portuguese Web Archive architecture. The archive comprises two sub-systems:

  • Global Search: provides search over all the archived information;
  • Gathering, storage and indexing: collects, stores and preserves information from the web. This sub-system can operate independently of the former.

Archive Nodes

Archive Nodes collect, store and index the information from the web.

Mergers

Mergers access the indexes of the Archive Nodes to answer global searches, for example: find all the archived pages that contain the terms “1998 elections”.
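
The fan-out and merge step described above can be sketched as follows. This is a minimal illustration, not the archive's actual implementation: the names ArchiveNodeIndex, Hit and merge_search are hypothetical, the per-node index is an in-memory dictionary, and the term-frequency score is an assumption chosen only to make the ranking concrete.

```python
from dataclasses import dataclass
import heapq


@dataclass(frozen=True)
class Hit:
    url: str
    score: float
    node: str


class ArchiveNodeIndex:
    """Hypothetical stand-in for the term index of one Archive Node."""

    def __init__(self, name, pages):
        self.name = name
        self.pages = pages  # maps URL -> page text

    def search(self, terms):
        """Return a Hit for every page that contains all the given terms."""
        hits = []
        for url, text in self.pages.items():
            words = text.lower().split()
            if words and all(term in words for term in terms):
                # Assumed score: share of the page's words that match a term.
                score = sum(words.count(term) for term in terms) / len(words)
                hits.append(Hit(url, score, self.name))
        return hits


def merge_search(nodes, query, limit=10):
    """Query every node and merge the partial results into one ranking."""
    terms = query.lower().split()
    if not terms:
        return []
    all_hits = [hit for node in nodes for hit in node.search(terms)]
    return heapq.nlargest(limit, all_hits, key=lambda hit: hit.score)
```

Calling merge_search(nodes, "1998 elections") with two or more node indexes returns a single list ordered by score, whichever node each page is stored on.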

Front-end

The Front-end component receives the results returned by the Mergers and presents them to the users. A search takes two parameters: the term or URL to look for, and the date range to search within.
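
The search parameters named above (a term or URL, plus a date range) could be modelled as shown below. This is a sketch under stated assumptions: the SearchRequest class, its field names and the rule used to tell a URL from a term are all hypothetical and are not taken from the archive's code.

```python
from dataclasses import dataclass
from datetime import date
from urllib.parse import urlparse


@dataclass(frozen=True)
class SearchRequest:
    query: str   # either search terms or a URL
    start: date  # first day of the date range (inclusive)
    end: date    # last day of the date range (inclusive)

    def __post_init__(self):
        if not self.query.strip():
            raise ValueError("query must not be empty")
        if self.start > self.end:
            raise ValueError("start date must not be after end date")

    @property
    def is_url(self):
        """Assumed rule: a query is a URL if it has an http(s) scheme and a host."""
        parsed = urlparse(self.query.strip())
        return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

A front-end built this way would route requests where is_url is true to a URL lookup and all other requests to a term search.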

Inside an archive node

 

Figure 2. Components of an Archive Node.

 

Figure 2 presents the internal architecture of an Archive Node. At the beginning of a crawl of the Portuguese web, a set of web sites is assigned to each Archive Node. The node then crawls these sites and stores their contents in the Content Storage Volume.
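
The text does not say how web sites are divided among the Archive Nodes. One plausible scheme, shown here purely as an assumption, is to hash each site name so that the same site always lands on the same node; the function name assign_sites and the node names are hypothetical.

```python
import hashlib


def assign_sites(sites, node_names):
    """Assign each site to exactly one node using a stable hash of its name."""
    if not node_names:
        raise ValueError("at least one Archive Node is required")
    assignment = {name: [] for name in node_names}
    for site in sorted(set(sites)):
        digest = hashlib.sha256(site.encode("utf-8")).digest()
        # The first 8 bytes of the digest pick the node; the result is
        # the same on every run and on every machine.
        index = int.from_bytes(digest[:8], "big") % len(node_names)
        assignment[node_names[index]].append(site)
    return assignment
```

A stable hash keeps a site on the same node across crawls as long as the list of nodes does not change.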

New web sites under .PT found during the crawl are saved as candidates for the next crawl. At the end of each crawl, the contents are indexed (Term and URL Indexes) to provide efficient searches over the archived collection.
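
The two end-of-crawl steps described above, saving new .PT sites and building the indexes, can be sketched as follows. The function names and data layout are hypothetical. In particular, the assumption that the URL Index maps each URL to the dates on which it was crawled is ours and is not stated in the text.

```python
from collections import defaultdict
from urllib.parse import urlparse


def pt_candidates(discovered_urls, known_sites):
    """Return host names under .pt that are not yet among the crawled sites."""
    candidates = set()
    for url in discovered_urls:
        host = (urlparse(url).hostname or "").lower()
        if host.endswith(".pt") and host not in known_sites:
            candidates.add(host)
    return candidates


def build_indexes(pages):
    """Build a term index and a URL index from (url, crawl_date, text) tuples."""
    term_index = defaultdict(set)   # term -> URLs whose text contains it
    url_index = defaultdict(list)   # URL -> crawl dates (assumed layout)
    for url, crawl_date, text in pages:
        url_index[url].append(crawl_date)
        for term in text.lower().split():
            term_index[term].add(url)
    return term_index, url_index
```

With indexes of this shape, a term search is a set lookup per term and a URL search returns every archived version of a page.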

Each Archive Node provides a User Interface that enables human users to search the information stored in it. Thus, the information stored in each Archive Node can be accessed independently of the rest of the system, which increases its chances of preservation if external components fail.

The Query Servers respond to requests made by the Mergers to enable searches over the whole archived collection.

FCCN - Fundação para a Computação Científica Nacional UMIC - Agência para a Sociedade do Conhecimento