Article

Efficient Extraction of Protein-Protein Interactions from Full-Text Articles.

J. Hakenberg, R. Leaman, N. Vo, S. Jonnalagadda, R. Sullivan, C. Miller, L. Tari, C. Baral, and G. Gonzalez.
IEEE/ACM Trans. Comput. Biology Bioinform., 7 (3): 481-494 (2010)

Abstract

Proteins and their interactions govern virtually all cellular processes, such as regulation, signaling, metabolism, and structure. Most experimental findings pertaining to such interactions are discussed in research papers, which, in turn, get curated by protein interaction databases. Authors, editors, and publishers benefit from efforts to alleviate the tasks of searching for relevant papers, evidence for physical interactions, and proper identifiers for each protein involved. The BioCreative II.5 community challenge addressed these tasks in a competition-style assessment to evaluate and compare different methodologies, to make aware of the increasing accuracy of automated methods, and to guide future implementations. In this paper, we present our approaches for protein named entity recognition, including normalization, and for extraction of protein-protein interactions from full text. Our overall goal is to identify efficient individual components, and we compare various compositions to handle a single full-text article in between 10 seconds and 2 minutes. We propose strategies to transfer document-level annotations to the sentence-level, which allows for the creation of a more fine-grained training corpus; we use this corpus to automatically derive around 5,000 patterns. We rank sentences by relevance to the task of finding novel interactions with physical evidence, using a sentence classifier built from this training corpus. Heuristics for paraphrasing sentences help to further remove unnecessary information that might interfere with patterns, such as additional adjectives, clauses, or bracketed expressions. In BioCreative II.5, we achieved an f-score of 22 percent for finding protein interactions, and 43 percent for mapping proteins to UniProt IDs; disregarding species, f-scores are 30 percent and 55 percent, respectively. On average, our best-performing setup required around 2 minutes per full text. All data and pattern sets as well as Java classes that extend third-party software are available as supplementary information (see Appendix).

BibTeX key: journals/tcbb/HakenbergLVJSMTBG10
entry type: article
year: 2010
journal: IEEE/ACM Trans. Comput. Biology Bioinform.
number: 3
pages: 481-494
volume: 7
ee: http://doi.acm.org/10.1145/1843144.1843154
url: http://www.computer.org/portal/c/document_library/get_file?uuid=ed4bbc07-42a3-41a1-a598-7c431c1f455c&groupId=525767

Cite this publication

%0 Journal Article
%1 journals/tcbb/HakenbergLVJSMTBG10
%A Hakenberg, Jörg
%A Leaman, Robert
%A Vo, Nguyen Ha
%A Jonnalagadda, Siddhartha
%A Sullivan, Ryan
%A Miller, Christopher
%A Tari, Luis
%A Baral, Chitta
%A Gonzalez, Graciela
%D 2010
%J IEEE/ACM Trans. Comput. Biology Bioinform.
%K
%N 3
%P 481-494
%T Efficient Extraction of Protein-Protein Interactions from Full-Text Articles.
%U http://www.computer.org/portal/c/document_library/get_file?uuid=ed4bbc07-42a3-41a1-a598-7c431c1f455c&groupId=525767
%V 7
%X Proteins and their interactions govern virtually all cellular processes, such as regulation, signaling, metabolism, and structure. Most experimental findings pertaining to such interactions are discussed in research papers, which, in turn, get curated by protein interaction databases. Authors, editors, and publishers benefit from efforts to alleviate the tasks of searching for relevant papers, evidence for physical interactions, and proper identifiers for each protein involved. The BioCreative II.5 community challenge addressed these tasks in a competition-style assessment to evaluate and compare different methodologies, to make aware of the increasing accuracy of automated methods, and to guide future implementations. In this paper, we present our approaches for protein named entity recognition, including normalization, and for extraction of protein-protein interactions from full text. Our overall goal is to identify efficient individual components, and we compare various compositions to handle a single full-text article in between 10 seconds and 2 minutes. We propose strategies to transfer document-level annotations to the sentence-level, which allows for the creation of a more fine-grained training corpus; we use this corpus to automatically derive around 5,000 patterns. We rank sentences by relevance to the task of finding novel interactions with physical evidence, using a sentence classifier built from this training corpus. Heuristics for paraphrasing sentences help to further remove unnecessary information that might interfere with patterns, such as additional adjectives, clauses, or bracketed expressions. In BioCreative II.5, we achieved an f-score of 22 percent for finding protein interactions, and 43 percent for mapping proteins to UniProt IDs; disregarding species, f-scores are 30 percent and 55 percent, respectively. On average, our best-performing setup required around 2 minutes per full text. All data and pattern sets as well as Java classes that extend third-party software are available as supplementary information (see Appendix).

@article{journals/tcbb/HakenbergLVJSMTBG10,
  abstract = {Proteins and their interactions govern virtually all cellular processes, such as regulation, signaling, metabolism, and structure. Most experimental findings pertaining to such interactions are discussed in research papers, which, in turn, get curated by protein interaction databases. Authors, editors, and publishers benefit from efforts to alleviate the tasks of searching for relevant papers, evidence for physical interactions, and proper identifiers for each protein involved. The BioCreative II.5 community challenge addressed these tasks in a competition-style assessment to evaluate and compare different methodologies, to make aware of the increasing accuracy of automated methods, and to guide future implementations. In this paper, we present our approaches for protein named entity recognition, including normalization, and for extraction of protein-protein interactions from full text. Our overall goal is to identify efficient individual components, and we compare various compositions to handle a single full-text article in between 10 seconds and 2 minutes. We propose strategies to transfer document-level annotations to the sentence-level, which allows for the creation of a more fine-grained training corpus; we use this corpus to automatically derive around 5,000 patterns. We rank sentences by relevance to the task of finding novel interactions with physical evidence, using a sentence classifier built from this training corpus. Heuristics for paraphrasing sentences help to further remove unnecessary information that might interfere with patterns, such as additional adjectives, clauses, or bracketed expressions. In BioCreative II.5, we achieved an f-score of 22 percent for finding protein interactions, and 43 percent for mapping proteins to UniProt IDs; disregarding species, f-scores are 30 percent and 55 percent, respectively. On average, our best-performing setup required around 2 minutes per full text. All data and pattern sets as well as Java classes that extend third-party software are available as supplementary information (see Appendix).},
  added-at = {2023-12-13T00:01:13.000+0100},
  author = {Hakenberg, Jörg and Leaman, Robert and Vo, Nguyen Ha and Jonnalagadda, Siddhartha and Sullivan, Ryan and Miller, Christopher and Tari, Luis and Baral, Chitta and Gonzalez, Graciela},
  biburl = {https://www.bibsonomy.org/bibtex/2735c44c7afcf846f0568d3b6a54a8948/admin},
  ee = {http://doi.acm.org/10.1145/1843144.1843154},
  interhash = {5da4b8c764126403f60aab90ac1d7849},
  intrahash = {735c44c7afcf846f0568d3b6a54a8948},
  journal = {IEEE/ACM Trans. Comput. Biology Bioinform.},
  keywords = {},
  number = 3,
  pages = {481-494},
  timestamp = {2023-12-13T00:01:13.000+0100},
  title = {Efficient Extraction of Protein-Protein Interactions from Full-Text Articles.},
  url = {http://www.computer.org/portal/c/document_library/get_file?uuid=ed4bbc07-42a3-41a1-a598-7c431c1f455c&groupId=525767},
  volume = 7,
  year = 2010
}
