
WebClass: automatic content classification system

WebClass is a content classification system that automatically analyzes and assigns subjects to archived contents.

A web archive must provide alternative access methods to meet the requirements of different users. Subject classification is one such method: the system analyzes each archived content and assigns it subjects drawn from a predefined list of classes.

Classifying contents by subject may help users refine their searches and find more relevant results. For instance, users who want to find archived information about the football player named Figo are probably not interested in the International Federation of Gynecology and Obstetrics, also known as FIGO. If users restricted their search for ‘Figo’ to pages classified under the subject Sports, the FIGO organization page would not be presented.
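
The restriction described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical in-memory list of archived pages that already carry a subject assigned by the classifier; the `ArchivedPage` record, the example URLs, and the `search` function are invented for the example and are not part of WebClass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchivedPage:
    url: str
    text: str
    subject: str  # class assigned by the classifier (hypothetical field)


# Hypothetical archived pages, for illustration only.
PAGES = [
    ArchivedPage("http://example.org/figo-match",
                 "Figo scores twice for the national team", "Sports"),
    ArchivedPage("http://example.org/figo-congress",
                 "FIGO announces its world congress on obstetrics", "Health"),
]


def search(pages, term, subject=None):
    """Return pages whose text contains the term (case-insensitive),
    optionally restricted to a single subject."""
    term = term.lower()
    hits = [page for page in pages if term in page.text.lower()]
    if subject is not None:
        hits = [page for page in hits if page.subject == subject]
    return hits
```

Calling `search(PAGES, "Figo")` returns both pages, while `search(PAGES, "Figo", subject="Sports")` leaves out the page of the FIGO organization.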

On the other hand, users could also discover in how many different classes the word Figo was used. Classifying archived contents would also allow them to be grouped and presented in a class tree that could be navigated like a web directory, and would make it possible to list the documents belonging to a given class.
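
Both uses can be sketched with a few lines of code. The example below assumes hypothetical `(url, subject, text)` records and a single level of classes rather than a full tree; the function names are illustrative only.

```python
from collections import Counter, defaultdict

# Hypothetical classified documents: (url, subject, text).
DOCS = [
    ("http://example.org/a", "Sports", "Figo scores twice"),
    ("http://example.org/b", "Sports", "Figo joins a new club"),
    ("http://example.org/c", "Health", "FIGO world congress"),
    ("http://example.org/d", "Economy", "Markets close higher"),
]


def classes_for_word(docs, word):
    """Count, per class, the documents in which the word occurs."""
    word = word.lower()
    return Counter(
        subject for _, subject, text in docs if word in text.lower().split()
    )


def group_by_class(docs):
    """Group document URLs under their class, as a directory listing would."""
    groups = defaultdict(list)
    for url, subject, _ in docs:
        groups[subject].append(url)
    return dict(groups)
```

With this data, `classes_for_word(DOCS, "Figo")` reports two documents in Sports and one in Health, and `group_by_class(DOCS)["Sports"]` lists the two Sports URLs.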

One possible approach to implementing the classification system is to use Support Vector Machines. In a nutshell, the Support Vector Machine is trained with contents that belong to known classes and generates one classifier for each class. The classifiers are then applied to each content and return the affinity of that content with each class.
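
The training and scoring steps can be sketched as below. This is only an illustration under stated assumptions, not the WebClass implementation: it uses a linear SVM without a bias term, trained by plain subgradient descent on the hinge loss over bag-of-words counts, with one classifier per class (one-vs-rest). The training texts, class names, and hyperparameters are hypothetical; a real system would more likely rely on an established SVM library.

```python
from collections import Counter

# Hypothetical training contents labelled with newspaper-style sections.
TRAINING = [
    ("figo scores goal in football match", "Sports"),
    ("golf tournament won by champion player", "Sports"),
    ("parliament votes new budget law", "Politics"),
    ("minister announces election date", "Politics"),
    ("hospital opens obstetrics clinic", "Health"),
    ("doctors recommend vaccine for flu", "Health"),
]


def featurize(text):
    """Bag-of-words term counts."""
    return Counter(text.lower().split())


def score(weights, features):
    """Linear decision value, used here as the affinity with a class."""
    return sum(weights.get(t, 0.0) * n for t, n in features.items())


def train_binary_svm(examples, epochs=20, rate=0.1, reg=0.01):
    """Linear SVM (no bias) trained by subgradient descent on the
    L2-regularized hinge loss. Labels are +1 or -1."""
    weights = {}
    for _ in range(epochs):
        for features, label in examples:
            violated = label * score(weights, features) < 1.0
            for term in weights:  # regularization shrinks every weight
                weights[term] *= 1.0 - rate * reg
            if violated:  # hinge-loss step only on margin violations
                for term, n in features.items():
                    weights[term] = weights.get(term, 0.0) + rate * label * n
    return weights


def train_classifiers(training):
    """One-vs-rest: one binary classifier per class."""
    vectors = [(featurize(text), subject) for text, subject in training]
    subjects = sorted({subject for _, subject in training})
    return {
        subject: train_binary_svm(
            [(f, 1 if s == subject else -1) for f, s in vectors]
        )
        for subject in subjects
    }


def affinities(classifiers, text):
    """Apply every class classifier to one content."""
    features = featurize(text)
    return {subject: score(w, features) for subject, w in classifiers.items()}


CLASSIFIERS = train_classifiers(TRAINING)
```

For a content such as "figo plays football", `affinities(CLASSIFIERS, ...)` yields a positive value for Sports and negative values for Politics and Health, so Sports is the class with the highest affinity.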

In the first stage, the list of classes will be restricted to the sections commonly used in newspapers, because users are familiar with the meaning of these class labels and know which contents to expect in each of them. In addition, the training data sets can be obtained from distinct online newspapers that follow the same section structure.

Classifying contents at a finer granularity raises new problems. It becomes more difficult to obtain accurate training sets written in Portuguese, and overlap between classes becomes more frequent, which increases subject ambiguity.

For instance, consider the Sports class divided into several classes, one for each sport. It is not obvious to which class a newspaper article about the football player Figo playing golf should belong. Classifying the article as Football, Golf, or both is unlikely to be accepted by all users.
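
The three outcomes can be made concrete with a small sketch. It assumes that each class classifier returns a numeric affinity and that a class is assigned whenever its affinity exceeds a threshold; the affinity values, the threshold, and the `assign_classes` function are hypothetical and only illustrate how the choice of cut-off changes the result.

```python
def assign_classes(affinities, threshold=0.0):
    """Return every class whose affinity exceeds the threshold, best first."""
    chosen = [name for name, value in affinities.items() if value > threshold]
    return sorted(chosen, key=lambda name: affinities[name], reverse=True)


# Hypothetical affinities for an article about Figo playing golf.
ARTICLE = {"Football": 0.6, "Golf": 0.8, "Tennis": -0.9}
```

With the default threshold the article is assigned to both Golf and Football; raising the threshold to 0.7 leaves only Golf. Neither answer is objectively right, which is the ambiguity described above.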



FCCN - Fundação para a Computação Científica Nacional UMIC - Agência para a Sociedade do Conhecimento POSC - Programa Operacional Sociedade do Conhecimento UE - União Europeia - FEDER - Fundo Europeu de Desenvolvimento Regional