Abstract
Information extraction from websites is an important problem today, and it is
usually carried out by software modules called wrappers. A key requirement
is that the wrapper generation process should be automated to the
largest extent possible, in order to allow for large-scale extraction tasks
even in the presence of changes in the underlying sites. So far, however,
only semi-automatic proposals have appeared in the literature. We
present a novel approach to information extraction from websites,
which reconciles recent proposals for supervised wrapper induction
with the more traditional field of grammar inference. Grammar inference
provides a promising theoretical framework for the study of unsupervised (that
is, fully automatic) wrapper generation algorithms. However, due to
some unrealistic assumptions on the input, these algorithms are not
practically applicable to Web information extraction tasks. The main
contributions of the article lie in the definition of a class of
regular languages, called the prefix mark-up languages, that abstract
the structures usually found in HTML pages, and in the definition
of a polynomial-time unsupervised learning algorithm for this class.
The article shows that, unlike other known classes, prefix
mark-up languages and the associated algorithm can be practically
used for information extraction purposes. A system based on the techniques
described in the article has been implemented in a working prototype.
We present some experimental results on known websites, and discuss
opportunities and limitations of the proposed approach.