Abstract

In addition to ordinary words and names, real text contains non-standard "words" (NSWs), including numbers, abbreviations, dates, currency amounts and acronyms. Typically, one cannot find NSWs in a dictionary, nor can one find their pronunciation by an application of ordinary "letter-to-sound" rules. Non-standard words also have a greater propensity than ordinary words to be ambiguous with respect to their interpretation or pronunciation. In many applications, it is desirable to "normalize" text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. Typical technology for text normalization involves sets of ad hoc rules tuned to handle one or two genres of text (often newspaper-style text), with the expected result that the techniques do not usually generalize well to new domains. The purpose of the work reported here is to take some initial steps towards addressing deficiencies in previous approaches to text normalization. We developed a taxonomy of NSWs on the basis of four rather distinct text types: news text, a recipes newsgroup, a hardware-product-specific newsgroup, and real-estate classified ads. We then investigated the application of several general techniques, including n-gram language models, decision trees and weighted finite-state transducers, to the range of NSW types, and demonstrated that a systematic treatment can lead to better results than have been obtained by the ad hoc treatments that have typically been used in the past. For abbreviation expansion in particular, we investigated both supervised and unsupervised approaches. We report results in terms of word-error rate, which is standard in speech recognition evaluations, but which has only occasionally been used as an overall measure in evaluating text normalization systems.
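As a rough illustration of the two ideas the abstract combines (assigning NSW tokens to taxonomy categories, and scoring a normalizer by word-error rate), the following Python sketch is a minimal, hypothetical rendering. The regex-based category labels and the tiny worked example are simplifying assumptions for illustration only; they are not the paper's actual taxonomy, models, or data.

```python
import re

# Hypothetical, highly simplified NSW categories loosely inspired by the
# kinds of tokens the abstract lists; the paper's taxonomy is far richer.
NSW_PATTERNS = [
    ("CURRENCY", re.compile(r"^\$\d+(?:\.\d{2})?$")),   # e.g. $150
    ("DATE",     re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")),  # e.g. 3/15/97
    ("NUMBER",   re.compile(r"^\d+$")),
    ("ACRONYM",  re.compile(r"^[A-Z]{2,}$")),
]

def classify(token: str) -> str:
    """Return a coarse NSW label, or ORDINARY for dictionary-style words."""
    for label, pattern in NSW_PATTERNS:
        if pattern.match(token):
            return label
    return "ORDINARY"

def wer(reference: list, hypothesis: list) -> float:
    """Word-error rate: (substitutions + deletions + insertions) / |reference|,
    computed via standard Levenshtein alignment over word sequences."""
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[n][m] / n

if __name__ == "__main__":
    for tok in ["$150", "3/15/97", "IBM", "kitchen"]:
        print(tok, "->", classify(tok))
    # Toy evaluation: a normalizer that drops a word from the reference
    # expansion incurs one deletion, so WER = 1/4 = 0.25.
    ref = "one hundred fifty dollars".split()
    hyp = "one fifty dollars".split()
    print("WER:", wer(ref, hyp))
```

Note that word-error rate, as used here, is simply word-level edit distance normalized by the length of the reference, which is what makes it directly comparable to its use in speech recognition evaluations.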
