Techreport,

Text Extraction via an Edge-bounded Averaging and a Parametric Character Model

J. Fan.
HPL-2002-294. Hewlett Packard Laboratories, (2002)

Abstract

We present a text extraction algorithm that is deterministic and parametric. The algorithm relies on three basic assumptions about text: color/luminance uniformity of the interior region, closed boundaries of sharp edges and the consistency of local contrast. The algorithm is basically independent of the character alphabet, text layout, font size and orientation. The heart of this algorithm is an edge-bounded averaging for the classification of smooth regions that enhances robustness against noise without sacrificing boundary accuracy. We have also developed a verification process to clean up the residue of incoherent segmentation. Our framework provides a symmetric treatment for both regular and inverse text. We have proposed three heuristics for identifying the type of text from a cluster consisting of two types of pixel aggregates. Finally, we have demonstrated the advantages of the proposed algorithm over adaptive thresholding and block-based clustering methods in terms of boundary accuracy, segmentation coherency, and capability to identify inverse text and separate characters from background patches.

Notes: To be published in and presented at Electronic Imaging (SPIE) 2003, 23 January 2003, San Jose, CA

BibTeX key: Fan2002
entry type: techreport
year: 2002
institution: Hewlett Packard Laboratories
number: HPL-2002-294
pages: 20
file: :./HPL-2002-294.pdf:PDF
Document: http://www.hpl.hp.com/techreports/2002/HPL-2002-294.html; http://www.hpl.hp.com/techreports/2002/HPL-2002-294.pdf

Cite this publication

@techreport{Fan2002,
  abstract = {We present a text extraction algorithm that is deterministic and parametric. The algorithm relies on three basic assumptions about text: color/luminance uniformity of the interior region, closed boundaries of sharp edges and the consistency of local contrast. The algorithm is basically independent of the character alphabet, text layout, font size and orientation. The heart of this algorithm is an edge-bounded averaging for the classification of smooth regions that enhances robustness against noise without sacrificing boundary accuracy. We have also developed a verification process to clean up the residue of incoherent segmentation. Our framework provides a symmetric treatment for both regular and inverse text. We have proposed three heuristics for identifying the type of text from a cluster consisting of two types of pixel aggregates. Finally, we have demonstrated the advantages of the proposed algorithm over adaptive thresholding and block-based clustering methods in terms of boundary accuracy, segmentation coherency, and capability to identify inverse text and separate characters from background patches.
Notes: To be published in and presented at Electronic Imaging (SPIE) 2003, 23 January 2003, San Jose, CA},
  added-at = {2011-03-27T19:35:34.000+0200},
  author = {Fan, Jian},
  biburl = {https://www.bibsonomy.org/bibtex/23be6dd9c1d2981f25515b4290d599f91/cocus},
  file = {:./HPL-2002-294.pdf:PDF},
  institution = {Hewlett Packard Laboratories},
  interhash = {9da93834e0dd92852dd5bd428bb52186},
  intrahash = {3be6dd9c1d2981f25515b4290d599f91},
  keywords = {character compound document extraction; image optical processing; recognition segmentation; text},
  number = {HPL-2002-294},
  pages = 20,
  timestamp = {2011-03-27T19:35:39.000+0200},
  title = {Text Extraction via an Edge-bounded Averaging and a Parametric Character Model},
  url = {http://www.hpl.hp.com/techreports/2002/HPL-2002-294.html; http://www.hpl.hp.com/techreports/2002/HPL-2002-294.pdf},
  year = 2002
}
