GAppA: Grid Appliance for the Archive (suspended)
GAppA is a software platform designed to provide remote access to the archived data and enable its cooperative processing by several computer clusters.
This project is suspended due to lack of resources. If you are interested in continuing it, please contact us.
The Portuguese Web Archive (PWA) periodically crawls and archives web data. This information is a valuable research resource that can be used across different fields, such as History, Sociology or Linguistics.
Access mechanisms must be provided to enable the processing of the archived data. However, because of the large amounts of data involved, this processing may have computational requirements that researchers cannot meet with their own resources.
GAppA addresses this by providing remote access to the archived data and enabling its cooperative processing by several computer clusters:
- Researchers will be able to execute their programs using the PWA and their own computers simultaneously.
- On the other hand, the PWA system will be able to extend its processing capacity by using external computers.
A computer joins the cluster by installing a client application that submits jobs to the cluster. GAppA implements security measures so that the execution of the jobs compromises neither the integrity of the archived data nor the underlying infrastructure. One of these measures is to run jobs submitted by clusters external to the PWA in virtual machines.
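The isolation rule described above can be sketched as a small routing policy. This is only an illustration, not GAppA's actual code: the `Job` type, the `PWA_CLUSTER` identifier and the `choose_sandbox` function are hypothetical names, and the assumption that jobs originating inside the PWA run outside a virtual machine is ours, since the text only states that external jobs are virtualized.

```python
from dataclasses import dataclass
from enum import Enum


class Sandbox(Enum):
    NATIVE = "native"        # assumed: run directly on the cluster nodes
    VIRTUAL_MACHINE = "vm"   # run isolated inside a virtual machine


@dataclass(frozen=True)
class Job:
    name: str
    origin_cluster: str  # hypothetical identifier of the submitting cluster


# Hypothetical identifier for the archive's own cluster.
PWA_CLUSTER = "pwa"


def choose_sandbox(job: Job) -> Sandbox:
    """Send every job that does not originate from the PWA to a virtual machine."""
    if job.origin_cluster == PWA_CLUSTER:
        return Sandbox.NATIVE
    return Sandbox.VIRTUAL_MACHINE
```

Defaulting to the virtual machine for any origin other than the PWA means an unrecognized cluster is isolated rather than trusted.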
The PWA processing cluster is implemented using Hadoop, so only jobs that can be implemented with this technology can be executed. A GAppA prototype was configured using the IPOP Grid Appliance developed by the Advanced Computing and Information Systems Laboratory of the University of Florida. Hadoop On-Demand (HOD), supported by the Apache Software Foundation, is a system for provisioning virtual Hadoop clusters over a large physical cluster.
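To give an idea of what "implementable with Hadoop" means in practice, the sketch below expresses a job in the map/reduce style that Hadoop executes, here counting archived pages per host. It runs locally in plain Python and does not call any Hadoop API. The record layout (pairs of URL and content) and the function names `map_record`, `reduce_counts` and `run_job` are assumptions made for illustration, not the PWA's actual data format.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_record(url, content):
    """Map step: emit (host, 1) for one archived page."""
    host = urlsplit(url).hostname or "unknown"
    yield host, 1


def reduce_counts(host, counts):
    """Reduce step: sum all counts emitted for one host."""
    return host, sum(counts)


def run_job(records):
    """Local stand-in for the shuffle phase: group mapped values by key, then reduce."""
    grouped = defaultdict(list)
    for url, content in records:
        for key, value in map_record(url, content):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())
```

Calling `run_job` with a list of `(url, content)` pairs returns a dictionary that maps each host to its number of pages. A job that cannot be split into independent map and reduce steps of this kind would not fit the processing model described above.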
The integration of the IPOP Grid Appliance with the Hadoop On-Demand platform is being studied to enable the dynamic extension of the virtual cluster of computers that execute the jobs.