Inproceedings,

Automatic document metadata extraction using support vector machines

H. Han, C. L. Giles, E. Manavoglu, H. Zha, Z. Zhang, and E. Fox.
In JCDL ’03: Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, pages 37--48. (2003)

Abstract

Automatic metadata generation provides scalability and usability for digital libraries and their collections. Machine learning methods offer robust and adaptable automatic metadata extraction. We describe a Support Vector Machine classification-based method for metadata extraction from the header part of research papers and show that it outperforms other machine learning methods on the same task. The method first classifies each line of the header into one or more of 15 classes. An iterative convergence procedure is then used to improve the line classification by using the predicted class labels of its neighbor lines in the previous round. Further metadata extraction is done by seeking the best chunk boundaries of each line. We found that discovery and use of the structural patterns of the data and domain-based word clustering can improve the metadata extraction performance. An appropriate feature normalization also greatly improves the classification performance. Our metadata extraction method was originally designed to improve the metadata extraction quality of the digital libraries Citeseer [17] and EbizSearch [24]. We believe it can be generalized to other digital libraries.
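
The two-stage line classification described in the abstract (an independent first pass, then repeated re-classification using the neighbor lines' labels from the previous round) can be illustrated with a minimal sketch. This is not the authors' implementation: hand-written rules stand in for the SVM classifiers, each line gets a single label rather than one or more of 15 classes, and the function names, label names, and sample header are all hypothetical.

```python
from typing import List


def classify_independent(line: str) -> str:
    """Stand-in for the per-line SVM: a few hand-written rules (assumption)."""
    text = line.strip().lower()
    if "@" in text:
        return "email"
    if any(word in text for word in ("university", "department", "institute")):
        return "affiliation"
    if text.startswith("abstract"):
        return "abstract"
    return "other"


def classify_contextual(line: str, prev_label: str, next_label: str) -> str:
    """Stand-in for the contextual classifier: line rules plus neighbor labels."""
    label = classify_independent(line)
    if label != "other":
        return label
    # A short unlabeled line directly above an affiliation or email line
    # is treated as an author line.
    if next_label in ("affiliation", "email") and len(line.split()) <= 6:
        return "author"
    # An unlabeled line directly above an author line is treated as the title.
    if next_label == "author":
        return "title"
    return "other"


def refine_labels(lines: List[str], max_rounds: int = 10) -> List[str]:
    """Relabel every line from its neighbors' previous-round labels until stable."""
    labels = [classify_independent(line) for line in lines]
    for _ in range(max_rounds):
        # Pad with sentinel labels so the first and last lines have neighbors.
        padded = ["start"] + labels + ["end"]
        new_labels = [
            classify_contextual(line, padded[i], padded[i + 2])
            for i, line in enumerate(lines)
        ]
        if new_labels == labels:  # converged: no label changed this round
            break
        labels = new_labels
    return labels


# Hypothetical header, for illustration only.
header = [
    "A Sample Paper Title About Digital Libraries",
    "Jane Doe and John Roe",
    "Department of Computer Science, Example University",
    "jane.doe@example.org",
    "Abstract",
]
print(refine_labels(header))
# ['title', 'author', 'affiliation', 'email', 'abstract']
```

In this toy header the author line is only recognized in the first refinement round, and the title only in the second, once its neighbor has been labeled; that dependence on the previous round's labels is why the procedure iterates until the labels stop changing.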

BibTeX key: Giles03automaticdocument
entry type: inproceedings
booktitle: JCDL ’03: Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries
year: 2003
pages: 37--48
url: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.147.3718

Cite this publication

@inproceedings{Giles03automaticdocument,
  abstract = {Automatic metadata generation provides scalability and usability for digital libraries and their collections. Machine learning methods offer robust and adaptable automatic metadata extraction. We describe a Support Vector Machine classification-based method for metadata extraction from the header part of research papers and show that it outperforms other machine learning methods on the same task. The method first classifies each line of the header into one or more of 15 classes. An iterative convergence procedure is then used to improve the line classification by using the predicted class labels of its neighbor lines in the previous round. Further metadata extraction is done by seeking the best chunk boundaries of each line. We found that discovery and use of the structural patterns of the data and domain-based word clustering can improve the metadata extraction performance. An appropriate feature normalization also greatly improves the classification performance. Our metadata extraction method was originally designed to improve the metadata extraction quality of the digital libraries Citeseer [17] and EbizSearch [24]. We believe it can be generalized to other digital libraries.},
  added-at = {2012-08-08T12:01:37.000+0200},
  author = {Han, Hui and Giles, C. Lee and Manavoglu, Eren and Zha, Hongyuan and Zhang, Zhenyue and Fox, Edward A.},
  biburl = {https://www.bibsonomy.org/bibtex/2d381bee9e9618dc5fc6a409ab91386bd/wla},
  booktitle = {JCDL '03: Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries},
  description = {CiteSeerX: Automatic document metadata extraction using support vector machines},
  interhash = {eadad89cb8e14541b87373f126281722},
  intrahash = {d381bee9e9618dc5fc6a409ab91386bd},
  keywords = {automatic document extraction metadata},
  pages = {37--48},
  timestamp = {2012-08-08T12:01:37.000+0200},
  title = {Automatic document metadata extraction using support vector machines},
  url = {http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.147.3718},
  year = {2003}
}
