@jamesh

Role of Local Context in Automatic Deidentification of Ungrammatical, Fragmented Text

, and . In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL, 65, (2006)

Abstract

Deidentification of clinical records is a crucial step before these records can be distributed to non-hospital researchers. Most approaches to deidentification rely heavily on dictionaries and heuristic rules; these approaches fail to remove most personal health information (PHI) that cannot be found in dictionaries. They also can fail to remove PHI that is ambiguous between PHI and non-PHI. Named entity recognition (NER) technologies can be used for deidentification. Some of these technologies exploit both local and global context of a word to identify its entity type. When documents are grammatically written, global context can improve NER. In this paper, we show that we can deidentify medical discharge summaries using support vector machines that rely on a statistical representation of local context. We compare our approach with three different systems. Comparison with a rulebased approach shows that a statistical representation of local context contributes more to deidentification than dictionaries and hand-tailored heuristics. Comparison with two well-known systems, SNoW and IdentiFinder, shows that when the language of documents is fragmented, local context contributes more to deidentification than global context.

Links and resources

Tags