
Learning table extraction from examples

Proceedings of the 20th International Conference on Computational Linguistics, Stroudsburg, PA, USA, Association for Computational Linguistics, (2004)
DOI: 10.3115/1220355.1220497

Abstract

Information extraction from tables in web pages is a challenging problem due to the diverse nature of table formats and the vocabulary variants in attribute names. This paper presents a new approach to automated table extraction that exploits formatting cues in semi-structured HTML tables, learns lexical variants from training examples, and uses a vector space model to deal with non-exact matches among labels. We conducted experiments with this method on a set of tables collected from 157 university web sites and obtained an information extraction performance of 91.4% in the F1-measure, showing the effectiveness of the combined use of structural table parsing and example-based label learning.
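
As an illustration of the vector space idea mentioned in the abstract, the following is a minimal sketch of matching an unseen table label against label variants learned from examples, using cosine similarity over bag-of-words vectors. It is not the paper's implementation: the function names (`tokenize`, `cosine`, `best_match`), the raw term-count weighting, the 0.5 threshold, and the sample labels are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(label):
    # Lowercase the label and split it into alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", label.lower())


def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors (Counters).
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(label, known_variants, threshold=0.5):
    # known_variants maps an attribute name to label strings seen in training
    # (hypothetical data structure, assumed for this sketch).
    query = Counter(tokenize(label))
    best_attr, best_score = None, 0.0
    for attr, variants in known_variants.items():
        for variant in variants:
            score = cosine(query, Counter(tokenize(variant)))
            if score > best_score:
                best_attr, best_score = attr, score
    # Reject matches that fall below the (assumed) similarity threshold.
    if best_score >= threshold:
        return best_attr, best_score
    return None, best_score


# Hypothetical label variants, as might be learned from example tables.
known = {
    "instructor": ["Instructor", "Taught by", "Professor name"],
    "time": ["Meeting time", "Days and times"],
}
attr, score = best_match("Instructor name", known)
```

Here the unseen label "Instructor name" shares a token with the learned variant "Instructor", so it is assigned to the `instructor` attribute even though no exact string match exists. A label with no overlapping tokens returns `None`.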
