A Word Embedding Model for Fault Localization using Bug and Software Change Repositories

Aqib Rehman, Quaid-i-Azam University, Islamabad, Pakistan
Keywords: fault localization, word embedding, bug repositories


Software that is developed and then deployed in a real-world environment will inevitably exhibit some undesirable behavior. Developers therefore need to maintain the software so that the bugs causing this behavior can be fixed. Before a bug can be fixed, however, the suspicious part of the code must be identified, a task known as fault localization, which can be performed manually or automatically. Several fault localization techniques exist in the literature. Most of them are static techniques, because static techniques do not depend on a specific programming language, can work on software that is still under development, and offer some other benefits. These techniques are largely based on lexical matching of terms, which leads to term mismatches and poor precision because of the limited vocabulary of a programming language. Some techniques do consider semantics, but localizing faults in this way is computationally expensive. In this paper, we propose a fault localization technique based on the machine learning concept of word embedding. Our approach examines the relatedness between the terms of a bug report and the source code artifacts. We mined bug repositories and software change repositories and trained the word embedding model on the mined data. When a new bug arrives, the model is searched for the cluster of similar bugs, and the files that were changed to fix those bugs are retrieved from the software change repositories. We compared our approach with two context-aware techniques proposed in 2018, Pointwise Mutual Information (PMI) and Normalized Google Distance (NGD), as well as with the existing lexical technique Vector Space Model (VSM) and the semantics-based method Latent Semantic Indexing (LSI). We used the benchmark dataset “MoreBugs”, which has been widely used in this domain.
The results show that our approach outperforms all of the compared techniques.
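
The retrieval idea described above (embed past bug reports, find the past bugs most related to a new one, and return the files changed to fix them) can be illustrated with a minimal sketch. It is not the paper's implementation. Simple co-occurrence count vectors stand in for a trained word embedding model, a top-k nearest-neighbour search stands in for the cluster lookup, and the bug texts, file names, and function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy history: (bug report text, files changed by the fixing commit).
HISTORY = [
    ("null pointer exception when saving user profile",
     ["UserProfile.java", "ProfileDao.java"]),
    ("crash on saving profile with empty name", ["UserProfile.java"]),
    ("login page times out on slow network", ["LoginController.java"]),
    ("timeout while authenticating user at login",
     ["LoginController.java", "AuthService.java"]),
]


def tokenize(text):
    return text.lower().split()


def train_embeddings(reports, window=2):
    """Assumption: co-occurrence counts replace a learned embedding.
    Each word is represented by counts of its neighbours within `window`."""
    vectors = defaultdict(Counter)
    for report in reports:
        tokens = tokenize(report)
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def embed(text, vectors):
    """Represent a bug report as the sum of the vectors of its known words."""
    total = Counter()
    for word in tokenize(text):
        if word in vectors:
            total.update(vectors[word])
    return total


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def localize(new_bug, history, vectors, top_k=2):
    """Rank files by the summed similarity of the top_k most related past bugs."""
    query = embed(new_bug, vectors)
    scored = [(cosine(query, embed(text, vectors)), files)
              for text, files in history]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    file_scores = Counter()
    for similarity, files in scored[:top_k]:
        for name in files:
            file_scores[name] += similarity
    # Highest score first; ties broken alphabetically for determinism.
    return [name for name, _ in
            sorted(file_scores.items(), key=lambda kv: (-kv[1], kv[0]))]


vectors = train_embeddings([text for text, _ in HISTORY])
ranked = localize("exception saving profile", HISTORY, vectors)
print(ranked)
```

On this toy history, the query is most related to the two profile bugs, so the files changed to fix them are ranked ahead of the login files. In a realistic setting the count vectors would be replaced by a model trained on the mined repositories.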