New directions in biomedical text annotation: definitions, guidelines and corpus construction

BMC Bioinformatics (2006)

Abstract

Background: While biomedical text mining is emerging as an important research area, practical results have proven difficult to achieve. We believe that an important first step towards more accurate text-mining lies in the ability to identify and characterize text that satisfies various types of information needs. We report here the results of our inquiry into properties of scientific text that have sufficient generality to transcend the confines of a narrow subject area, while supporting practical mining of text for factual information. Our ultimate goal is to annotate a significant corpus of biomedical text and train machine learning methods to automatically categorize such text along certain dimensions that we have defined.

Results: We have identified five qualitative dimensions that we believe characterize a broad range of scientific sentences, and are therefore useful for supporting a general approach to text-mining: focus, polarity, certainty, evidence, and directionality. We define these dimensions and describe the guidelines we have developed for annotating text with regard to them. To examine the effectiveness of the guidelines, twelve annotators independently annotated the same set of 101 sentences that were randomly selected from current biomedical periodicals. Analysis of these annotations shows 70–80% inter-annotator agreement, suggesting that our guidelines indeed present a well-defined, executable and reproducible task.

Conclusion: We present our guidelines defining a text annotation task, along with annotation results from multiple independently produced annotations, demonstrating the feasibility of the task. The annotation of a very large corpus of documents along these guidelines is currently ongoing. These annotations form the basis for the categorization of text along multiple dimensions, to support viable text mining for experimental results, methodology statements, and other forms of information.
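The abstract reports 70–80% inter-annotator agreement but does not say here how that figure was computed. As a rough illustration only, the sketch below assumes plain pairwise percent agreement: for every pair of annotators, count the items on which their labels match. The function name `pairwise_agreement` and the toy labels are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def pairwise_agreement(annotations):
    """Fraction of (annotator pair, item) comparisons with identical labels.

    `annotations` is a list with one label sequence per annotator; all
    sequences must cover the same items in the same order.
    Assumption: simple percent agreement, not a chance-corrected measure.
    """
    if len(annotations) < 2:
        raise ValueError("need at least two annotators")
    n_items = len(annotations[0])
    if any(len(labels) != n_items for labels in annotations):
        raise ValueError("all annotators must label the same items")

    matches = 0
    comparisons = 0
    for first, second in combinations(annotations, 2):
        for label_a, label_b in zip(first, second):
            matches += int(label_a == label_b)
            comparisons += 1
    return matches / comparisons


# Toy data (hypothetical): three annotators, four sentences, focus tags only.
toy_labels = [
    ["S", "G", "M", "S"],
    ["S", "G", "S", "S"],
    ["S", "M", "M", "S"],
]
agreement = pairwise_agreement(toy_labels)
print(f"pairwise agreement: {agreement:.2%}")
```

Percent agreement is easy to read but does not correct for agreement expected by chance; a chance-corrected statistic would be a natural alternative if that matters for the analysis.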
We are currently developing machine learning methods, to be trained and tested on the annotated corpus, that would allow for the automatic categorization of biomedical text along the general dimensions that we have presented. The guidelines in full detail, along with annotated examples, are publicly available.

Focus

Each text fragment may convey one (and sometimes more) of:

  • Scientific content, findings and discovery. We refer to this type of information as Science, and indicate it by the tag S.
  • Generic-level information: the general state of knowledge and science outside the scope of the paper, the structure of the paper itself, or the state of the world. Such statements are not usually based on scientific experiment, and may reflect an opinion or an observation that would have been as truthful, and probably as valid, if made by a layperson. We refer to it as Generic, and denote it with the tag G.
  • Methodology that was used in an experiment or a study. We refer to it as Methodology, and denote it with the tag M.

We note that the focus of a statement may be viewed differently depending on the context (e.g. section, paragraph, sentence) in which it appears. What may be regarded as a scientific finding in one context is a methodology in another. In fact, most scientific methods are based on what were at one time reported scientific findings. Thus the annotator will inevitably face ambiguity in trying to distinguish science and methodology. Our approach is therefore to annotate methodology only when the sentence under annotation contains an indication that methodology is being discussed. In contrast to zone-based annotation schemes, we note that not every sentence appearing in a Methodology section discusses methodology, and not every sentence discussing methodology appears in the Methodology section. Further, nothing is gained if we annotate a sentence as methodology when it is indistinguishable from sentences discussing science. We are interested in learning how the text of a sentence itself signals that methodology is being discussed. See Appendices B–F for annotated examples [47].

Polarity

A fragment with any focus can be stated either positively (P) or negatively (N). For statements that convey lack of knowledge (e.g. "It is still unknown whether..."), the default assignment is P. The lack of knowledge in this case will be reflected by a certainty degree of 0, as explained in the next item. Every fragment should be annotated by its polarity, regardless of its focus or its certainty.

Certainty

Each fragment conveys a degree of certainty about the validity of the assertion it makes. Our annotation uses a scale in the range 0–3 as a measure of certainty, for both positive and negative statements. The lowest degree (0) represents complete uncertainty, that is, the fragment explicitly states that there is an uncertainty or lack of knowledge about a particular phenomenon ("it is unknown..." or "it is unclear whether..." etc.). The highest degree (3) represents complete certainty, reflecting an accepted, known and/or proven fact. The intermediate degree (1) represents a low certainty, while (2) is assigned to high-likelihood expressions that are still short of complete certainty.

Evidence

This dimension indicates, for any fragment, regardless of its focus and certainty, whether its assertion is supported by evidence. The existence, or the lack, of evidence is denoted by a tag starting with the letter E. The letter is followed by one or more digits, in the range 0–3, indicating the type of evidence or its absence:

  • E0: No indication of evidence in the fragment whatsoever, or an explicit statement in the text indicates lack of evidence.
  • E1: A claim of evidence, but no verifying information is explicitly given. Evidence is not shown within the annotated sentence/fragment, and no explicit reference to it is provided. The evidence is merely asserted to exist in some form, possibly in the preceding text, or in prior experiments, but its location is not explicitly stated. Note that in this case the indirect implication of evidence may not be explicit in the fragment, but implied by a use of terms referring to a previous fragment. For instance, a sentence may begin with the fragment "Previous experiments show that...", followed by the fragment "therefore, it is likely that ...". Both fragments are of evidence level 1: the first because it points to experiments without an explicit reference, and the second because of the term "therefore", which uses the previous assertion as indirect evidence.
  • E2: Evidence is not given within the sentence/fragment, but explicit reference is made to other papers (citations) to support the assertion.
  • E3: Evidence is provided, within the fragment, in one of the following forms:
      ○ A reference to experiments previously reported within the body of the paper by a direct description of the finding as an experimental result (e.g. "Our data indicates...", "...our results show...").
      ○ A verb (typically in the past tense) within the statement indicates an observation or an experimental finding which is described within the paper (e.g. "We found that...", "We see that...").
      ○ A reference to an experimental figure or a table of data given within the paper.

A statement about a certain finding may be assigned different levels of evidence depending on the wording used. For instance, something reported as a finding by the authors would be annotated as E3 (e.g. "Our data demonstrate that ICG-001 has no effect on AP1 ..."). In this case the words "Our data demonstrate" indicate the evidence. However, a similar statement may occur without any indication of evidence (e.g. "ICG-001 has no effect on AP1 ..."). In that case, stated without any support, it would be annotated as E0. This same statement would be annotated as E1 if accompanied by a non-explicit reference (e.g. "Previous studies suggest that ICG-001 has no effect on AP1 ..."). Finally, if explicit reference to the original work is given, as in "Previous studies suggest that ICG-001 has no effect on AP1 ... [25]", the tag would be E2. We note that it is not the scientific details themselves, be they ever so intricate, that constitute the evidence. Rather, it is the specific wording that points to a certain type of evidence.
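The four dimensions described above can be pictured as a small record per text fragment. The sketch below is only an illustration under stated assumptions: the class `FragmentAnnotation`, the compact string produced by `tag()`, and the cue-phrase function `guess_evidence_level` are all hypothetical, evidence is simplified to a single digit, and a bracketed number such as `[25]` is assumed to mark an explicit citation. The cue phrases are the examples quoted in the text. It is not the authors' annotation tool or their planned machine learning method.

```python
import re
from dataclasses import dataclass

FOCUS_TAGS = frozenset({"S", "G", "M"})  # Science, Generic, Methodology
POLARITY_TAGS = frozenset({"P", "N"})    # positive, negative


@dataclass(frozen=True)
class FragmentAnnotation:
    """One annotated fragment (hypothetical container, not from the paper)."""
    focus: frozenset   # one or more of S, G, M
    polarity: str      # "P" or "N"
    certainty: int     # 0 (complete uncertainty) .. 3 (complete certainty)
    evidence: int      # 0 .. 3; simplified here to a single level

    def __post_init__(self):
        if not self.focus or not self.focus <= FOCUS_TAGS:
            raise ValueError(f"invalid focus: {set(self.focus)}")
        if self.polarity not in POLARITY_TAGS:
            raise ValueError(f"invalid polarity: {self.polarity}")
        if self.certainty not in range(4):
            raise ValueError(f"invalid certainty: {self.certainty}")
        if self.evidence not in range(4):
            raise ValueError(f"invalid evidence: {self.evidence}")

    def tag(self) -> str:
        """Compact string form; this layout is an assumption for illustration."""
        focus = "".join(sorted(self.focus))
        return f"{focus}{self.polarity}{self.certainty}E{self.evidence}"


# Cue patterns built from the example phrases quoted in the text.
E3_CUES = re.compile(
    r"\b(our (data|results)|we (found|see)|(figure|fig\.|table)\s*\d+)",
    re.IGNORECASE,
)
E2_CUES = re.compile(r"\[\d+(\s*,\s*\d+)*\]")  # assumed citation format
E1_CUES = re.compile(r"\b(previous (experiments|studies)|therefore)\b",
                     re.IGNORECASE)


def guess_evidence_level(fragment: str) -> int:
    """Toy heuristic: pick an evidence level from wording cues alone."""
    if E3_CUES.search(fragment):
        return 3
    if E2_CUES.search(fragment):
        return 2
    if E1_CUES.search(fragment):
        return 1
    return 0


sentence = "Our data demonstrate that ICG-001 has no effect on AP1 ..."
annotation = FragmentAnnotation(
    focus=frozenset({"S"}),
    polarity="N",       # assumed for this example: the statement is negative
    certainty=3,        # assumed for this example
    evidence=guess_evidence_level(sentence),
)
print(annotation.tag())
```

The cue matcher mirrors the point made in the text that the wording, not the scientific detail, determines the evidence level. A handful of regular expressions cannot cover real sentences, so it should be read as a picture of the decision order (E3, then E2, then E1, else E0) rather than a usable classifier.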
