Abstract

The purpose of this paper is to characterize a constituent boundary parsing algorithm, using an information-theoretic measure called generalized mutual information, which serves as an alternative to traditional grammar-based parsing methods. This method is based on the hypothesis that constituent boundaries can be extracted from a given sentence (or word sequence) by analyzing the mutual information values of the part-of-speech n-grams within the sentence. This hypothesis is supported by the performance of an implementation of this parsing algorithm which determines a recursive unlabeled bracketing of unrestricted English text with a relatively low error rate. This paper derives the generalized mutual information statistic, describes the parsing algorithm, and presents results and sample output from the parser.

Introduction

A standard approach to parsing a natural language is to characterize the language using a set of rules, a grammar. A grammar-based parsing algori...

Description

Parsing a Natural Language Using Mutual Information Statistics
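
The hypothesis stated in the abstract, that constituent boundaries show up as weak statistical association between adjacent part-of-speech tags, can be illustrated with a small sketch. This is not the paper's algorithm: the generalized mutual information statistic is not defined in this record, so the sketch substitutes ordinary pointwise mutual information over tag bigrams, and it marks only flat candidate boundaries rather than a recursive bracketing. The function names `bigram_pmi_table` and `candidate_boundaries`, and the toy scores in the usage note, are hypothetical.

```python
import math
from collections import Counter


def bigram_pmi_table(tag_sequences):
    """Estimate pointwise mutual information (in bits) for adjacent tag pairs.

    Assumption: plain bigram PMI stands in for the paper's generalized
    mutual information, which this record does not define.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tags in tag_sequences:
        unigrams.update(tags)
        bigrams.update(zip(tags, tags[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    pmi = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        pmi[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return pmi


def candidate_boundaries(tags, pmi, unseen=float("-inf")):
    """Return gap indices i (between tags[i] and tags[i + 1]) whose PMI
    score is a strict local minimum among the neighbouring gaps.

    Tag pairs missing from the table receive the `unseen` score.
    """
    scores = [pmi.get(pair, unseen) for pair in zip(tags, tags[1:])]
    boundaries = []
    for i, score in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("inf")
        right = scores[i + 1] if i + 1 < len(scores) else float("inf")
        if score < left and score < right:
            boundaries.append(i)
    return boundaries
```

As a usage example with hand-picked scores, calling `candidate_boundaries(["DT", "NN", "VB", "DT", "NN"], {("DT", "NN"): 3.0, ("NN", "VB"): 0.5, ("VB", "DT"): 1.0})` marks the single gap between the first `NN` and `VB`, since that pair has the lowest association relative to its neighbours. In practice the table would come from `bigram_pmi_table` run over a tagged corpus.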
