Automatic information extraction from large websites

V. Crescenzi and G. Mecca.
J. ACM, 51 (5): 731--779 (2004)
DOI: 10.1145/1017460.1017462

Abstract

Information extraction from websites is nowadays a relevant problem, usually performed by software modules called wrappers. A key requirement is that the wrapper generation process should be automated to the largest extent, in order to allow for large-scale extraction tasks even in presence of changes in the underlying sites. So far, however, only semi-automatic proposals have appeared in the literature. We present a novel approach to information extraction from websites, which reconciles recent proposals for supervised wrapper induction with the more traditional field of grammar inference. Grammar inference provides a promising theoretical framework for the study of unsupervised (that is, fully automatic) wrapper generation algorithms. However, due to some unrealistic assumptions on the input, these algorithms are not practically applicable to Web information extraction tasks. The main contributions of the article stand in the definition of a class of regular languages, called the prefix mark-up languages, that abstract the structures usually found in HTML pages, and in the definition of a polynomial-time unsupervised learning algorithm for this class. The article shows that, differently from other known classes, prefix mark-up languages and the associated algorithm can be practically used for information extraction purposes. A system based on the techniques described in the article has been implemented in a working prototype. We present some experimental results on known Websites, and discuss opportunities and limitations of the proposed approach.

BibTeX key: Crescenzi2004
entry type: article
address: New York, NY, USA
year: 2004
journal: J. ACM
number: 5
pages: 731--779
publisher: ACM
volume: 51
issn: 0004-5411
DOI: 10.1145/1017460.1017462

Cite this publication

%0 Journal Article
%1 Crescenzi2004
%A Crescenzi, Valter
%A Mecca, Giansalvatore
%C New York, NY, USA
%D 2004
%I ACM
%J J. ACM
%K imported
%N 5
%P 731--779
%R 10.1145/1017460.1017462
%T Automatic information extraction from large websites
%V 51
%X Information extraction from websites is nowadays a relevant problem, usually performed by software modules called wrappers. A key requirement is that the wrapper generation process should be automated to the largest extent, in order to allow for large-scale extraction tasks even in presence of changes in the underlying sites. So far, however, only semi-automatic proposals have appeared in the literature. We present a novel approach to information extraction from websites, which reconciles recent proposals for supervised wrapper induction with the more traditional field of grammar inference. Grammar inference provides a promising theoretical framework for the study of unsupervised (that is, fully automatic) wrapper generation algorithms. However, due to some unrealistic assumptions on the input, these algorithms are not practically applicable to Web information extraction tasks. The main contributions of the article stand in the definition of a class of regular languages, called the prefix mark-up languages, that abstract the structures usually found in HTML pages, and in the definition of a polynomial-time unsupervised learning algorithm for this class. The article shows that, differently from other known classes, prefix mark-up languages and the associated algorithm can be practically used for information extraction purposes. A system based on the techniques described in the article has been implemented in a working prototype. We present some experimental results on known Websites, and discuss opportunities and limitations of the proposed approach.

@article{Crescenzi2004,
  abstract = {Information extraction from websites is nowadays a relevant problem, usually performed by software modules called wrappers. A key requirement is that the wrapper generation process should be automated to the largest extent, in order to allow for large-scale extraction tasks even in presence of changes in the underlying sites. So far, however, only semi-automatic proposals have appeared in the literature. We present a novel approach to information extraction from websites, which reconciles recent proposals for supervised wrapper induction with the more traditional field of grammar inference. Grammar inference provides a promising theoretical framework for the study of unsupervised (that is, fully automatic) wrapper generation algorithms. However, due to some unrealistic assumptions on the input, these algorithms are not practically applicable to Web information extraction tasks. The main contributions of the article stand in the definition of a class of regular languages, called the prefix mark-up languages, that abstract the structures usually found in HTML pages, and in the definition of a polynomial-time unsupervised learning algorithm for this class. The article shows that, differently from other known classes, prefix mark-up languages and the associated algorithm can be practically used for information extraction purposes. A system based on the techniques described in the article has been implemented in a working prototype. We present some experimental results on known Websites, and discuss opportunities and limitations of the proposed approach.},
  added-at = {2013-08-04T14:38:52.000+0200},
  address = {New York, NY, USA},
  author = {Crescenzi, Valter and Mecca, Giansalvatore},
  biburl = {https://www.bibsonomy.org/bibtex/2176982d629f89150201511bb238aac01/francesco.k},
  doi = {10.1145/1017460.1017462},
  interhash = {59432b24e6a9d621c2886268de23cd04},
  intrahash = {176982d629f89150201511bb238aac01},
  issn = {0004-5411},
  journal = {J. ACM},
  keywords = {imported},
  number = 5,
  pages = {731--779},
  publisher = {ACM},
  timestamp = {2013-08-04T14:38:52.000+0200},
  title = {Automatic information extraction from large websites},
  volume = 51,
  year = 2004
}
