
The impact of vocabulary normalization

Journal of Software: Evolution and Process, 27 (4): 255--273 (2015)

Abstract

Software development, evolution, and maintenance depend on ever-increasing tool support. Recent tools have incorporated increasing analysis of the natural language found in source code, predominantly in the identifiers and comments. However, when coders combine abbreviations and acronyms to form multi-word identifiers, they in essence invent new vocabulary, making the source code's vocabulary differ from that of other software artifacts. This vocabulary mismatch is a potential problem for many techniques imported from information retrieval and natural language processing, which implicitly assume the use of a single common vocabulary. Vocabulary normalization aims to bring the vocabulary of the source in line with that of other artifacts. A prior small-scale experiment demonstrated the value of vocabulary normalization for C code. A more comprehensive experiment using Java code is presented where normalization fails to bring benefit. To investigate the potential underlying causes, over 20,000 non-dictionary words extracted from the program JabRef were normalized by hand (often requiring significant external information). The experiment, repeated using the hand-normalized identifiers, again found that normalization brought no improvement. In response to this unexpected result, the vocabulary differences between Java and C code are considered and used to help frame directions for future work. Copyright © 2015 John Wiley & Sons, Ltd.
