"Developing Stop Word Lists for Natural Language Program Analysis" at WSRE 2014
Published at the 16th Workshop Software-Reengineering & Evolution
28–30 April 2014 in Bad Honnef, Germany
http://fg-sre.gi.de/wsre2014.html
Authors
Benjamin Klatt, Klaus Krogmann
FZI Forschungszentrum Informatik, Software Engineering
Haid-und-Neu-Str. 10-14, 76131 Karlsruhe, Germany
Volker Kuttruff
CAS software AG
CAS-Weg 1-5, 76131 Karlsruhe, Germany
Abstract
When implementing software, developers express conceptual knowledge (e.g. about a specific feature) not only in programming language syntax and semantics but also in linguistic information stored in identifiers (e.g. method or class names). Based on this habit, Natural Language Program Analysis (NLPA) is used to improve many different areas of software engineering, such as code recommendations or program analysis. In simplified terms, NLPA algorithms collect identifier names and apply term processing such as camel case splitting (e.g. "MyIdentifier" to "My" and "Identifier") or stemming (e.g. "records" to "record") before performing further analyses. For example, in our research context, we search for code locations that share similar terms in order to link them with each other.
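The term processing described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `split_identifier`, `naive_stem`, and `extract_terms` are hypothetical, and the plural-stripping rule merely stands in for a real stemmer such as a Porter stemmer.

```python
import re

# Matches acronym runs ("XML"), capitalized or lower-case words, and digit runs.
_TERM_PATTERN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier(identifier):
    """Split a camel case identifier, e.g. 'MyIdentifier' -> ['My', 'Identifier']."""
    return _TERM_PATTERN.findall(identifier)


def naive_stem(term):
    """Very rough stand-in for a stemmer: lower-case and strip a plural 's'."""
    term = term.lower()
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        return term[:-1]
    return term


def extract_terms(identifier):
    """Split an identifier and stem each resulting term."""
    return [naive_stem(term) for term in split_identifier(identifier)]


print(split_identifier("MyIdentifier"))
print(extract_terms("getCustomerRecords"))
```

Two code locations could then be linked when the sets returned by `extract_terms` for their identifiers overlap, which is one simple reading of "sharing similar terms".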
However, merely collecting, splitting, and stemming the identifier names can result in a list of terms of widely varying usefulness. For example, the terms "get" and "set" are used in most Java applications because of common coding conventions and not because they carry any conceptual knowledge. Taking these terms into account can corrupt the results of the program analysis.
To reduce this noise, a typical approach in natural language processing is to filter out terms that are known to be useless (so-called "stop words").
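A small sketch of such filtering, under stated assumptions: the stop word set contains only the two terms named in this abstract ("get" and "set"), the sample term list is made up for illustration, and `remove_stop_words` is a hypothetical helper name.

```python
from collections import Counter

# Only the two terms mentioned in the text; a real list would be longer.
STOP_WORDS = frozenset({"get", "set"})


def remove_stop_words(terms, stop_words=STOP_WORDS):
    """Return the terms that are not in the stop word list (case-insensitive)."""
    return [term for term in terms if term.lower() not in stop_words]


# Made-up terms, as they might come out of identifier splitting and stemming.
terms = ["get", "customer", "set", "customer", "get", "invoice", "get", "total"]

print(Counter(terms).most_common(1))                     # [('get', 3)]
print(Counter(remove_stop_words(terms)).most_common(1))  # [('customer', 2)]
```

Without filtering, the convention term "get" dominates the term frequencies; after filtering, the most frequent term is one that may actually carry conceptual knowledge.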
For natural languages, stop word lists are language-dependent, and many lists are publicly available for English, German, and other languages. However, as Host et al. identified, developers use a more specific language with a different vocabulary. Common natural-language stop word lists are therefore not suitable for program analysis; appropriate lists even depend on the domain, application type, developing company, and project settings.
In this paper, we propose an approach to develop reusable stop word lists to improve NLPA.
We i) propose to distinguish different scopes a stop word list applies to (i.e. programming language, technology, and domain) and ii) recommend types of sources for terms to include.
Our approach is not limited to a specific technology, but in line with our research context, the examples given are taken from the Java technology.
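One way to picture the scope distinction is to keep a separate stop word list per scope and combine only the lists that apply to the software under analysis. The sketch below is an assumption about how this could look, not the approach from the paper: the scope keys mirror the three scopes named above, `build_stop_word_list` is a hypothetical helper, and all list entries other than "get" and "set" are invented placeholders.

```python
# Hypothetical per-scope lists; only "get" and "set" come from the text above.
SCOPED_STOP_WORDS = {
    "programming_language": {"get", "set"},  # Java coding conventions
    "technology": {"servlet"},               # invented placeholder
    "domain": {"customer"},                  # invented placeholder
}


def build_stop_word_list(scopes, scoped_lists=SCOPED_STOP_WORDS):
    """Combine the stop word lists of the scopes that apply to a project."""
    unknown = [scope for scope in scopes if scope not in scoped_lists]
    if unknown:
        raise KeyError(f"unknown scope(s): {unknown}")
    combined = set()
    for scope in scopes:
        combined |= scoped_lists[scope]
    return frozenset(combined)


# A project that shares the language and technology scopes but not the domain.
stop_words = build_stop_word_list(["programming_language", "technology"])
print(sorted(stop_words))
```

Keeping the lists separate is what would make them reusable: a list for the programming language scope could be shared across projects, while a domain list would only be added where it applies.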