Technology
The main technologies used to implement the Portuguese Web Archive system.
Initially, the different web archiving initiatives worked independently, each developing its own system from scratch. This resulted in a huge waste of resources. The problems of web archiving were shared by everyone, but each initiative tried to solve them alone. Meanwhile, the web did not stop evolving and new problems kept appearing. It became clear that the initiatives would have to join efforts to archive the web successfully.
The software technologies used in the Portuguese Web Archive are provided mainly by the Archive-access project, which brings together several free and open-source tools developed to support web archiving. Open-source software helps enable the long-term maintenance of the system and the preservation of the archived information.
- The crawler was implemented using Heritrix and the Deduplicator add-on module;
- Search is based on the Wayback Machine, NutchWAX and the Lucene search engine;
- The spellchecker uses Hunspell;
- Distributed processing of data is done using Hadoop, a powerful free platform for parallel computing supported by the Apache Software Foundation;
- The operating system mainly used is Red Hat Enterprise Linux;
- The main programming language used is Java;
- The database management system is PostgreSQL;
- Development and web publishing systems are supported by Mantis, Plone, Apache HTTP Server, Tomcat, MediaWiki and Zope.
This open-source technology is a valuable basis for the development of the Portuguese Web Archive system. However, tools built specifically for web archiving are in permanent evolution and frequently cannot be used as off-the-shelf products. Often, the installation and operation processes are undocumented, and there are errors and incompatibilities between releases. Therefore, the decision to use the Archive-access tools requires us to take part in their improvement and to find solutions to web preservation problems.