Unpublished,

XPath-Wrapper induction by generalizing tree traversal patterns

(2005)

Abstract

We introduce a wrapper induction algorithm for extracting information from tree-structured documents like HTML or XML. It derives XPath-compatible extraction rules from a set of annotated example documents. The approach builds a minimally generalized tree traversal pattern, and augments it with conditions. Another variant selects a subset of conditions so that (a) the pattern is consistent with the training data, (b) the pattern’s document coverage is minimized, and (c) conditions that match structures preceding the target nodes are preferred. We discuss the robustness of rules induced by this selection strategy and we illustrate how these rules exhibit knowledge of the target concept.
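
To make the idea of a generalized tree traversal pattern concrete, here is a minimal sketch in Python. It is not the paper's algorithm: it only assumes that a traversal pattern can be approximated by the root-to-target path of each annotated node, and that generalizing means keeping a positional index only where all examples agree. The helper names `path_to` and `generalize` and the sample document are hypothetical, chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the two annotated targets are the
# price cells of the "Tea" and "Coffee" rows.
DOC = """<html><body>
<table>
<tr><td>Name</td><td>Price</td></tr>
<tr><td>Tea</td><td>3.50</td></tr>
<tr><td>Coffee</td><td>4.20</td></tr>
<tr><td>Cocoa</td><td>2.90</td></tr>
</table>
</body></html>"""


def path_to(root, target):
    """Return [(tag, position), ...] for the steps leading from root to target.

    The position is 1-based and counted among siblings with the same tag,
    which matches how ElementTree interprets `tag[n]`.
    """
    parent = {child: node for node in root.iter() for child in node}
    steps = []
    node = target
    while node is not root:
        par = parent[node]
        same_tag = [c for c in par if c.tag == node.tag]
        steps.append((node.tag, same_tag.index(node) + 1))
        node = par
    return list(reversed(steps))


def generalize(paths):
    """Merge example paths into one XPath-like rule (simplified assumption).

    A step keeps its positional index only if every example agrees on it;
    if the tags themselves differ, the step becomes a wildcard.
    """
    if len({len(p) for p in paths}) != 1:
        raise ValueError("this sketch only handles paths of equal depth")
    steps = []
    for column in zip(*paths):
        tags = {tag for tag, _ in column}
        positions = {pos for _, pos in column}
        if len(tags) != 1:
            steps.append("*")
        elif len(positions) == 1:
            steps.append(f"{tags.pop()}[{positions.pop()}]")
        else:
            steps.append(tags.pop())
    return "./" + "/".join(steps)


root = ET.fromstring(DOC)
rows = root.findall("./body/table/tr")
examples = [rows[1][1], rows[2][1]]  # annotated target nodes

rule = generalize([path_to(root, node) for node in examples])
matches = [cell.text for cell in root.findall(rule)]
```

With these two examples the sketch yields the rule `./body[1]/table[1]/tr/td[2]`, which also matches the unseen "Cocoa" price but additionally picks up the header cell "Price". Over-matching of this kind is the sort of case where extra conditions on the pattern, as described in the abstract, become necessary.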
