Generating an NLP Corpus from Java Source Code: The SSL Javadoc Doclet
N. Khamis, J. Rilling, and R. Witte. New Challenges for NLP Frameworks, pages 41--45. Valletta, Malta, ELRA (May 2010)
Abstract
Source code contains a large amount of natural language text, particularly in the form of comments, which makes it an emerging target of text analysis techniques. Due to the mix with program code, it is difficult to process source code comments directly within NLP frameworks such as GATE. Within this work we present an effective means for generating a corpus using information found in source code and in-line documentation, by developing a custom doclet for the Javadoc tool. The generated corpus uses a schema that is easily processed by NLP applications, which allows language engineers to focus their efforts on text analysis tasks, like automatic quality control of source code comments. The SSLDoclet is available as open source software.
%0 Conference Paper
%1 javadoclet2010
%A Khamis, Ninus
%A Rilling, Juergen
%A Witte, René
%B New Challenges for NLP Frameworks
%C Valletta, Malta
%D 2010
%I ELRA
%K doclet java javadoc nlp softwaremaintenance ssl
%P 41--45
%T Generating an NLP Corpus from Java Source Code: The SSL Javadoc Doclet
%X Source code contains a large amount of natural language text, particularly in the form of comments, which makes it an emerging target of text analysis techniques. Due to the mix with program code, it is difficult to process source code comments directly within NLP frameworks such as GATE. Within this work we present an effective means for generating a corpus using information found in source code and in-line documentation, by developing a custom doclet for the Javadoc tool. The generated corpus uses a schema that is easily processed by NLP applications, which allows language engineers to focus their efforts on text analysis tasks, like automatic quality control of source code comments. The SSLDoclet is available as open source software.
@inproceedings{javadoclet2010,
abstract = {Source code contains a large amount of natural language text, particularly in the form of comments, which makes it an emerging target of text analysis techniques. Due to the mix with program code, it is difficult to process source code comments directly within NLP frameworks such as GATE. Within this work we present an effective means for generating a corpus using information found in source code and in-line documentation, by developing a custom doclet for the Javadoc tool. The generated corpus uses a schema that is easily processed by NLP applications, which allows language engineers to focus their efforts on text analysis tasks, like automatic quality control of source code comments. The SSLDoclet is available as open source software.},
added-at = {2010-07-18T02:45:53.000+0200},
address = {Valletta, Malta},
author = {Khamis, Ninus and Rilling, Juergen and Witte, Ren\'{e}},
biburl = {https://www.bibsonomy.org/bibtex/2e605bae80e017c2a34c206dd608ac0c9/renew},
booktitle = {New Challenges for NLP Frameworks},
interhash = {fe4fca61bd6ad8ef26f74fbcf3b16b31},
intrahash = {e605bae80e017c2a34c206dd608ac0c9},
keywords = {doclet java javadoc nlp softwaremaintenance ssl},
month = {May 22},
pages = {41--45},
publisher = {ELRA},
timestamp = {2010-07-18T02:45:54.000+0200},
title = {{Generating an NLP Corpus from Java Source Code: The SSL Javadoc Doclet}},
year = 2010
}