Inproceedings,

An End-to-end Approach for Extracting and Segmenting High-Variance References from PDF Documents

Z. Boukhers, S. Ambhore, and S. Staab.
Proceedings of the 23rd ACM/IEEE Joint Conference on Digital Libraries, pages 1-10. ACM, June 2019.
DOI: 10.1109/JCDL.2019.00035

Abstract

This paper addresses the problem of extracting and segmenting references from PDF documents. The novelty of the presented approach lies in its capability to discover highly varying references mainly in terms of content, length and location in the document. Unlike existing works, the proposed method does not follow the classical pipeline that consists of sequential phases. It rather learns the different characteristics of references to be used in a coherent scheme that reduces the error accumulation by following a probabilistic approach. Contrary to conventional references, mentioning the sources of information in some publications, such as those of social science, is not subject to the same specifications such as being located in a unique reference section. Therefore, the proposed method aims to extract references of highly varying reference characteristics by relaxing the restrictions of existing methods. Additionally, we present in this paper a new challenging dataset of annotated references in German social science publications. The main purpose of this work is to serve the indexation of missing references by extracting them from challenging publications such as those of German social science. The effectiveness of the presented methods in terms of both extraction and segmentation is evaluated on different datasets, including the German social science set.

BibTeX key: BoukhersJCDL2019
entry type: inproceedings
booktitle: Proceedings of the 23rd ACM/IEEE Joint Conference on Digital Libraries
year: 2019
month: June
pages: 1-10
publisher: ACM
DOI: 10.1109/JCDL.2019.00035
url: https://www.researchgate.net/publication/332980244_An_End-to-end_Approach_for_Extracting_and_Segmenting_High-Variance_References_from_PDF_Documents

Cite this publication

@inproceedings{BoukhersJCDL2019,
  abstract = {This paper addresses the problem of extracting and segmenting references from PDF documents. The novelty of the presented approach lies in its capability to discover highly varying references mainly in terms of content, length and location in the document. Unlike existing works, the proposed method does not follow the classical pipeline that consists of sequential phases. It rather learns the different characteristics of references to be used in a coherent scheme that reduces the error accumulation by following a probabilistic approach. Contrary to conventional references, mentioning the sources of information in some publications, such as those of social science, is not subject to the same specifications such as being located in a unique reference section. Therefore, the proposed method aims to extract references of highly varying reference characteristics by relaxing the restrictions of existing methods. Additionally, we present in this paper a new challenging dataset of annotated references in German social science publications. The main purpose of this work is to serve the indexation of missing references by extracting them from challenging publications such as those of German social science. The effectiveness of the presented methods in terms of both extraction and segmentation is evaluated on different datasets, including the German social science set.},
  added-at = {2023-12-14T15:03:14.000+0100},
  author = {Boukhers, Zeyd and Ambhore, Shriharsh and Staab, Steffen},
  biburl = {https://www.bibsonomy.org/bibtex/2026fa6496b1a147ef84613e40f41e348/admin},
  booktitle = {Proceedings of the 23rd ACM/IEEE Joint Conference on Digital Libraries},
  doi = {10.1109/JCDL.2019.00035},
  interhash = {e69e724c88d2684521b1455995f5a7bd},
  intrahash = {026fa6496b1a147ef84613e40f41e348},
  keywords = {},
  month = {June},
  pages = {1--10},
  publisher = {ACM},
  timestamp = {2023-12-14T15:03:14.000+0100},
  title = {An End-to-end Approach for Extracting and Segmenting High-Variance References from PDF Documents},
  url = {https://www.researchgate.net/publication/332980244_An_End-to-end_Approach_for_Extracting_and_Segmenting_High-Variance_References_from_PDF_Documents},
  year = 2019
}
