Abstract

In addition to ordinary words and names, real text contains non-standard "words" (NSWs), including numbers, abbreviations, dates, currency amounts and acronyms. Typically, one cannot find NSWs in a dictionary, nor can one find their pronunciation by an application of ordinary "letter-to-sound" rules. Non-standard words also have a greater propensity than ordinary words to be ambiguous with respect to their interpretation or pronunciation. In many applications, it is desirable to "normalize" text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. Typical technology for text normalization involves sets of ad hoc rules tuned to handle one or two genres of text (often newspaper-style text), with the expected result that the techniques do not usually generalize well to new domains. The purpose of the work reported here is to take some initial steps towards addressing deficiencies in previous approaches to text normalization. We developed a taxonomy of NSWs on the basis of four rather distinct text types: news text, a recipes newsgroup, a hardware-product-specific newsgroup, and real-estate classified ads. We then investigated the application of several general techniques, including n-gram language models, decision trees and weighted finite-state transducers, to the range of NSW types, and demonstrated that a systematic treatment can lead to better results than have been obtained by the ad hoc treatments that have typically been used in the past. For abbreviation expansion in particular, we investigated both supervised and unsupervised approaches. We report results in terms of word-error rate, which is standard in speech recognition evaluations, but which has only occasionally been used as an overall measure in evaluating text normalization systems.
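As a rough illustration of the two ideas the abstract combines (assigning NSW tokens to taxonomy categories, and scoring a normalizer by word-error rate), the following Python sketch is a minimal, hypothetical rendering. The regex-based category labels and the tiny worked example are simplifying assumptions for illustration only; they are not the paper's actual taxonomy, models, or data.

```python
import re

# Hypothetical, highly simplified NSW categories loosely inspired by the
# kinds of tokens the abstract lists; the paper's taxonomy is far richer.
NSW_PATTERNS = [
    ("CURRENCY", re.compile(r"^\$\d+(?:\.\d{2})?$")),   # e.g. $150
    ("DATE",     re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")),  # e.g. 3/15/97
    ("NUMBER",   re.compile(r"^\d+$")),
    ("ACRONYM",  re.compile(r"^[A-Z]{2,}$")),
]

def classify(token: str) -> str:
    """Return a coarse NSW label, or ORDINARY for dictionary-style words."""
    for label, pattern in NSW_PATTERNS:
        if pattern.match(token):
            return label
    return "ORDINARY"

def wer(reference: list, hypothesis: list) -> float:
    """Word-error rate: (substitutions + deletions + insertions) / |reference|,
    computed via standard Levenshtein alignment over word sequences."""
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[n][m] / n

if __name__ == "__main__":
    for tok in ["$150", "3/15/97", "IBM", "kitchen"]:
        print(tok, "->", classify(tok))
    # Toy evaluation: a normalizer that drops a word from the reference
    # expansion incurs one deletion, so WER = 1/4 = 0.25.
    ref = "one hundred fifty dollars".split()
    hyp = "one fifty dollars".split()
    print("WER:", wer(ref, hyp))
```

Note that word-error rate, as used here, is simply word-level edit distance normalized by the length of the reference, which is what makes it directly comparable to its use in speech recognition evaluations.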
