The Web as a parallel corpus

Abstract

Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.

BibTeX key: resnik2003parallel
entry type: article
address: Cambridge, MA, USA
year: 2003
month: sep
journal: Computational Linguistics
number: 3
pages: 349--380
publisher: MIT Press
volume: 29
issn: 0891-2017
acmid: 964753
numpages: 32
issue_date: September 2003
DOI: 10.1162/089120103322711578
url: http://dx.doi.org/10.1162/089120103322711578
