NSF IIS #0205470: Constructing Protein Ontologies Using Text Mining

(2002)

Abstract

Given the vast amounts of genomic and molecular data being generated by scientific research, there is a pressing need for advanced bioinformatics infrastructure for biological knowledge management. An ontology is a semantic model that contains a shared vocabulary and a classification of the concepts in a domain. Ontologies for biology are crucial for integrating data from multiple databases and for mining the literature for knowledge extraction and evidence attribution. This project focuses on developing an ontology of protein names, consisting of a data dictionary and links to more specific, more general, and synonymous protein names. Ontology development, however, currently requires substantial human effort. This project will exploit statistical and computational linguistics methods to induce an ontology of protein names from MEDLINE text corpora and a knowledge base developed at the Protein Information Resource (PIR); terms in the induced ontology will also be linked to the functional hierarchy of the Gene Ontology. The induced ontology can then be further edited by a human. The project aims to demonstrate that this domain-independent method of ontology induction is more cost-effective than having humans develop an ontology from scratch. The approach could therefore be of practical value in other domains that need ontologies linking text corpora to database nomenclature. Both the ontology and the software system developed in this project will be freely distributed to the scientific community via the PIR web site, in standard XML-based ontology interchange formats, to support intelligent literature mining and PubMed searching.
