PhD thesis,

Computational-Linguistic Approaches to Biological Text Mining

A. Clegg.
University of London (2008)

Abstract

As the body of published literature grows at an accelerating rate, increasingly sophisticated computational methods for natural language processing are required to manage and mine the written knowledge available to life sciences researchers. One important topic within this field is the problem of relationship extraction. Given a text about molecular biology, the challenge is to automatically retrieve the biophysical, biochemical or genetic interactions described therein. Much progress has been made on this problem and others like it by using statistical information retrieval techniques, regular expressions, finite state automata, sequence alignment and other relatively superficial approaches. However, there are a variety of more linguistically informed methods available which treat each sentence as a tree or graph rather than simply a collection or sequence of words. Various natural-language parsers are available which facilitate this kind of solution, and the experimental work in this thesis begins with a comparison of several of these on a standard molecular biology corpus using established benchmarking techniques. This is followed by some experiments using evaluation measures tailored to specific biologically important tasks. A processing pipeline is then described which uses the best of these parsers, along with several other open-source tools, to produce high-quality dependency graph representations of input sentences. Finally, three novel deterministic algorithms for relationship extraction are presented. Two of these take dependency graphs as input and return interactions between pre-tagged gene and protein entities, outperforming most existing methods on a standard publicly available test corpus; the other is a strong baseline method using no linguistic information.
An appendix discusses the related problems of entity recognition and identification, which, while outside the main scope of this thesis, are prerequisites for the development of relationship extraction applications.

BibTeX key: Clegg:2008
entry type: phdthesis
year: 2008
school: University of London
library: Bibsonomy (May 2009)

Cite this publication

%0 Thesis
%1 Clegg:2008
%A Clegg, Andrew B.
%D 2008
%K dependencies inf-extraction parsing biomedical
%T Computational-Linguistic Approaches to Biological Text Mining
%X As the body of published literature grows at an accelerating rate, increasingly sophisticated computational methods for natural language processing are required to manage and mine the written knowledge available to life sciences researchers. One important topic within this field is the problem of relationship extraction. Given a text about molecular biology, the challenge is to automatically retrieve the biophysical, biochemical or genetic interactions described therein. Much progress has been made on this problem and others like it by using statistical information retrieval techniques, regular expressions, finite state automata, sequence alignment and other relatively superficial approaches. However, there are a variety of more linguistically informed methods available which treat each sentence as a tree or graph rather than simply a collection or sequence of words. Various natural-language parsers are available which facilitate this kind of solution, and the experimental work in this thesis begins with a comparison of several of these on a standard molecular biology corpus using established benchmarking techniques. This is followed by some experiments using evaluation measures tailored to specific biologically important tasks. A processing pipeline is then described which uses the best of these parsers, along with several other open-source tools, to produce high-quality dependency graph representations of input sentences. Finally, three novel deterministic algorithms for relationship extraction are presented. Two of these take dependency graphs as input and return interactions between pre-tagged gene and protein entities, outperforming most existing methods on a standard publicly available test corpus; the other is a strong baseline method using no linguistic information.
An appendix discusses the related problems of entity recognition and identification, which, while outside the main scope of this thesis, are prerequisites for the development of relationship extraction applications.

@phdthesis{Clegg:2008,
  abstract = {As the body of published literature grows at an accelerating rate, increasingly sophisticated computational methods for natural language processing are required to manage and mine the written knowledge available to life sciences researchers. One important topic within this field is the problem of relationship extraction. Given a text about molecular biology, the challenge is to automatically retrieve the biophysical, biochemical or genetic interactions described therein. Much progress has been made on this problem and others like it by using statistical information retrieval techniques, regular expressions, finite state automata, sequence alignment and other relatively superficial approaches. However, there are a variety of more linguistically informed methods available which treat each sentence as a tree or graph rather than simply a collection or sequence of words. Various natural-language parsers are available which facilitate this kind of solution, and the experimental work in this thesis begins with a comparison of several of these on a standard molecular biology corpus using established benchmarking techniques. This is followed by some experiments using evaluation measures tailored to specific biologically important tasks. A processing pipeline is then described which uses the best of these parsers, along with several other open-source tools, to produce high-quality dependency graph representations of input sentences. Finally, three novel deterministic algorithms for relationship extraction are presented. Two of these take dependency graphs as input and return interactions between pre-tagged gene and protein entities, outperforming most existing methods on a standard publicly available test corpus; the other is a strong baseline method using no linguistic information.
An appendix discusses the related problems of entity recognition and identification, which, while outside the main scope of this thesis, are prerequisites for the development of relationship extraction applications.},
  added-at = {2009-07-22T09:57:26.000+0200},
  author = {Clegg, Andrew B.},
  biburl = {https://www.bibsonomy.org/bibtex/2d60cf1178f7d5c1d174fece60f574382/diego_ma},
  interhash = {1cb0d306a6012be0407119d8c663fade},
  intrahash = {d60cf1178f7d5c1d174fece60f574382},
  keywords = {dependencies inf-extraction parsing biomedical},
  library = {Bibsonomy (May 2009)},
  school = {University of London},
  timestamp = {2009-07-22T09:57:26.000+0200},
  title = {Computational-Linguistic Approaches to Biological Text Mining},
  year = 2008
}
