Goals
Here we present the main goals of the Portuguese Web Archive project.
Goals of the Portuguese Web Archive
The creation of a Portuguese Web Archive represents a historic milestone in the preservation of knowledge for future generations. By creating a system that supports regular crawls of the Portuguese web, together with their long-term storage and access, we intend to provide the following services:
- Term search over the archived contents: it will identify archived contents, across the years, that contain given terms;
- URL search over the archived contents: it will make it possible to identify the several versions of the content gathered from a given URL;
- New search engine over the Portuguese web: the archive will enable searching over several Portuguese web collections. Providing a search service over the most recent collection, as current web search engines do, is attainable with relatively small additional effort and would be an interesting service for the Portuguese community;
- Historical collections of web contents for research purposes: the web holds information about the most varied subjects, reflecting changes in society over time. Researchers from different fields use the web as a source of information for their studies. Providing web collections will enable these researchers to store and process web data locally on their computers without having to crawl the web themselves;
- Characterization reports of the Portuguese web: a web archive system must be tuned according to the characteristics of the archived data. Therefore, characterizations of the Portuguese web must be generated periodically. As these studies are of interest to a broader audience, they will be published. Characterizing national webs is useful for measuring the spread of information technologies in different societies and the evolution of the web over time;
- Backup system of the archived information (rARC): it will be a distributed system that enables Internet users to provide disk space to store backup copies of the archived contents by installing a small application on their computers. If a failure happens in the central repository, the archived collection will be recovered from the backup copies stored on the users’ computers. Any individual or institution can contribute to preserving the web by providing some disk space on their computers;
- Archived data parallel processing system: it will allow researchers to execute their programs over the archived web data using several computers in parallel.
We also want to achieve the following goals:
- Train human resources in web archiving to enable the maintenance of the system in the future;
- Export know-how, experience and technology in web archiving to other countries, especially Portuguese-speaking ones;
- Contribute to increasing the number of domains registered under .PT: the free historical archiving of the information published under this domain could be an additional motivation for registrars;
- Publish scientific and technical papers to share the acquired knowledge and to receive feedback from the community on the work performed.