Article,

Bitext Maps and Alignment via Pattern Recognition

.
Comput. Linguist., 25 (1): 107--130 (March 1999)

Abstract

Texts that are available in two languages (bitexts) are becoming more and more plentiful, both in private data warehouses and on publicly accessible sites on the World Wide Web. As with other kinds of data, the value of bitexts largely depends on the efficacy of the available data mining tools. The first step in extracting useful information from bitexts is to find corresponding words and/or text segment boundaries in their two halves (bitext maps).This article advances the state of the art of bitext mapping by formulating the problem in terms of pattern recognition. From this point of view, the success of a bitext mapping algorithm hinges on how well it performs three tasks: signal generation, noise filtering, and search. The Smooth Injective Map Recognizer (SIMR) algorithm presented here integrates innovative approaches to each of these tasks. Objective evaluation has shown that SIMR's accuracy is consistently high for language pairs as diverse as French/English and Korean/English. If necessary, SIMR's bitext maps can be efficiently converted into segment alignments using the Geometric Segment Alignment (GSA) algorithm, which is also presented here.SIMR has produced bitext maps for over 200 megabytes of French-English bitexts. GSA has converted these maps into alignments. Both the maps and the alignments are available from the Linguistic Data Consortium.

Tags

Users

  • @filologanoga
  • @nlp
  • @dblp

Comments and Reviews