XLDB Team    
XMLBase

XMLBase - Semi-Structured Data Management
(Gestão de Dados Semi-estruturados)

The XMLBase project researched methods for the analysis, design and implementation of systems that manage semi-structured data distributed over the Internet. XMLBase is a component-based framework for indexing and searching collections of XML (and HTML) documents. The framework was used to run performance measurements comparing strategies for storing, versioning, indexing and querying XML data collections. It served as a validation environment on which we built a prototype Web application (the Tumba! search engine), which we used as a benchmark for comparing alternative strategies for collecting, storing, indexing and querying Web data.
XMLBase has six main components:

Versus: a Web meta-data manager with versioning and parallel loading/updating capabilities.
WebStore: a repository for Web contents.
WebCAT: a Web Contents Analysis Tool.
ViúvaNegra: a Web crawler built upon Versus and WebCAT.
SIDRA: a text indexing and ranking system for Web pages.
XSuga: a combined XQuery/text engine for information selection from Web data and meta-data.

Some of the software developed under XMLBase is used as part of Tumba!, a search engine for the Portuguese Web.

XML standards enable the definition of complete data management systems for a universe of information much wider than that of current database management systems. XML will be at the core of the next generation of business intelligence systems, data integration and multi-lingual/multi-platform publishing systems.

Our research includes an implementation and a comparative evaluation of the alternatives for integrating and querying heterogeneous data across large-scale networks.

Research contributions of XMLBase include:

A specification of a model for an XML data repository, with meta-data describing the organization and supporting versioning of the managed information.
Query processing strategies for XML data repositories supporting XQuery and XSLT.

Research Team
André Santos
Bruno Martins
Daniel Gomes
Mário J. Silva (research advisor)
Miguel Costa

Funding

XMLBase received funding from FCT
Period: 1-Oct-02 to 30-Sep-04
Proj #: POSI / SRI / 40193 / 2001
Funding: € 35.000,00

Publications
Bruno Martins, Mário J. Silva,
Language Identification in Web Pages
ACM-SAC-DE, 20th ACM Symposium on Applied Computing, Document Engineering Track, pp. 764-768.
April 2005
Links: Document, Presentation, doi:10.1145/1066677.1066852
Daniel Gomes, Mário J. Silva,
Characterizing a National Community Web
ACM Transactions on Internet Technology (TOIT), 5(3), pp. 508-531.
August 2005
Links: Document, BibTex Entry, doi:10.1145/1084772.1084775

Miguel Costa,
SIDRA: a Flexible Web Search System
Master's dissertation. Also available as FCUL Technical Report DI/FCUL TR 4-17.
November 2004
Links: Technical Report
Daniel Gomes, André Santos, Mário J. Silva,
Webstore: A Manager for Incremental Storage of Contents
FCUL Technical Report DI/FCUL TR 4-15.
November 2004
Links: Document, BibTex Entry, Technical Report
Bruno Martins,
Inter-document similarity in Web Searches
Master's dissertation. Also available as FCUL Technical Report DI/FCUL TR 4-11.
October 2004
Links: Technical Report
Bruno Martins, Mário J. Silva,
Spelling Correction for Search Engine Queries
EsTAL - España for Natural Language Processing, Alicante, Spain.
October 2004
Links: Other
Bruno Martins, Mário J. Silva,
A Statistical Study of the Tumba! Corpus
EsTAL - España for Natural Language Processing, Alicante, Spain. Also available as FCUL Technical Report DI/FCUL TR 4-4.
May 2004
Links: Technical Report
Miguel Costa, Mário J. Silva,
Distributed Index Creation of Large Scale Web Collections in the Sidra System.
JISBD'2004, IX Jornadas de Ingeniería del Software y Bases de Datos. Málaga.
November 2004
Links: Document, Presentation, BibTex Entry, Conference Web Site
Miguel Costa, Mário J. Silva,
Optimizing Ranking Calculation in Web Search Engines: a Case Study
SBBD 2004, 19º Simpósio Brasileiro de Banco de Dados. Brasilia.
October 2004
Links: Document, BibTex Entry, Conference Web Site
Mário J. Silva,
The Case for a Portuguese Web Search Engine
IADIS International Conference WWW Internet 2003.
November 2003
Links: Document, Conference Web Site
Mário J. Silva,
Searching and Archiving the Web with Tumba!
CAPSI 2003 - 4a. Conferência da Associação Portuguesa de Sistemas de Informação.
November 2003
Links: Document, Conference Web Site
Miguel Costa, Mário J. Silva,
Sidra: a Flexible Distributed Indexing and Ranking Architecture for Web Search.
JISBD 2003 - VIII Jornadas de Ingeniería del Software y Bases de Datos, Alicante, Spain.
November 2003
Links: Document, Presentation, BibTex Entry, Conference Web Site
Daniel Gomes, Mário J. Silva,
A Characterization of the Portuguese Web
3rd ECDL Workshop on Web Archives. Trondheim, Norway.
August 2003
Links: Document, Presentation, BibTex Entry, Conference Web Site
Daniel Gomes, Mário J. Silva,
Collecting Statistics about the Portuguese Web
FCUL Technical Report DI/FCUL TR 03-10.
June 2003
Links: Document, BibTex Entry, Technical Report
Mário J. Silva,
The Case for a Portuguese Web Search Engine.
FCUL Technical Report DI/FCUL TR-03-3.
March 2003
Links: Document, Technical Report
João P. Campos,
Versus: a Web Data Repository with Time Support.
Master's dissertation. Faculdade de Ciências da Universidade de Lisboa. Also available as FCUL Technical Report DI/FCUL 03-8.
May 2003
Links: Technical Report

Daniel Gomes, João P. Campos, Mário J. Silva,
Versus: A Web Repository
WDAS-2002: Workshop on Distributed Data & Structures
March 2002
Links: Document, Presentation, BibTex Entry

João P. Campos, Mário J. Silva,
Versus: A Model for a Web Repository
CRC'01 - 4ª Conferência de Redes de Computadores, Covilhã, Portugal.
November 2001
Links: Document, Presentation
Miguel Costa, Mário J. Silva,
Ranking no Motor de Busca TUMBA
CRC'01 - 4ª Conferência de Redes de Computadores, Covilhã, Portugal.
November 2001
Links: Document, BibTex Entry
Bruno Martins, Mário J. Silva,
Is it Portuguese? Language detection in large document collections.
CRC'01 - 4ª Conferência de Redes de Computadores, Covilhã, Portugal.
November 2001
Links: Document
Daniel Gomes, Mário J. Silva,
Tarântula - Sistema de Recolha de Documentos da Web.
CRC'01 - 4ª Conferência de Redes de Computadores, Covilhã, Portugal.
November 2001
Links: Document, Presentation, BibTex Entry
Daniel Gomes,
Tarântula - Sistema de Recolha de Documentos na WWW.
Professional internship report, FCUL.
July 2001
Links: Document, Presentation, BibTex Entry