When simplicity meets effectiveness: Detecting code comments coherence with word embeddings and LSTM

Igbomezie, Michael Dubem; Nguyen, Phuong; Di Ruscio, Davide

doi:10.1145/3661167.3661187

Code comments play a crucial role in software development, as they provide programmers with practical information, allowing them to understand better the intent and semantics of the underpinning code. Nevertheless, developers tend to leave comments unchanged after updating the code, resulting in a discrepancy between the two artifacts. Such a discrepancy may trigger misunderstanding and confusion among developers, impeding various activities, including code comprehension and maintenance. Thus, it is crucial to identify if, given a code snippet, its corresponding comment is coherent and reflects well the intent behind the code. Unfortunately, existing approaches to this problem, while obtaining an encouraging performance, either rely on heavily pre-trained models, or treat input data as text, neglecting the intrinsic features contained in comments and code, including word order and synonyms.We present Co3D as a practical approach to the detection of code comment coherence. We pay attention to internal meaning of words and sequential order of words in text while predicting coherence in code-comment pairs. We deployed a combination of Gensim word2vec encoding and a simple recurrent neural network, a combination of Gensim word2vec encoding and an LSTM model, and CodeBERT. The first two methods can be considered a manual version of the last two methods, with the last two methods involving the use of transfer learning from large pre-trained models. The experimental results show that Co3D obtains a promising prediction performance, thus outperforming well-established baselines. We conclude that depending on the context, using a simple architecture can introduce a satisfying prediction.

When simplicity meets effectiveness: Detecting code comments coherence with word embeddings and LSTM

Igbomezie, Michael Dubem;Nguyen, Phuong;Di Ruscio, Davide

2024-01-01

Abstract

Code comments play a crucial role in software development, as they provide programmers with practical information, allowing them to understand better the intent and semantics of the underpinning code. Nevertheless, developers tend to leave comments unchanged after updating the code, resulting in a discrepancy between the two artifacts. Such a discrepancy may trigger misunderstanding and confusion among developers, impeding various activities, including code comprehension and maintenance. Thus, it is crucial to identify if, given a code snippet, its corresponding comment is coherent and reflects well the intent behind the code. Unfortunately, existing approaches to this problem, while obtaining an encouraging performance, either rely on heavily pre-trained models, or treat input data as text, neglecting the intrinsic features contained in comments and code, including word order and synonyms.We present Co3D as a practical approach to the detection of code comment coherence. We pay attention to internal meaning of words and sequential order of words in text while predicting coherence in code-comment pairs. We deployed a combination of Gensim word2vec encoding and a simple recurrent neural network, a combination of Gensim word2vec encoding and an LSTM model, and CodeBERT. The first two methods can be considered a manual version of the last two methods, with the last two methods involving the use of transfer learning from large pre-trained models. The experimental results show that Co3D obtains a promising prediction performance, thus outperforming well-established baselines. We conclude that depending on the context, using a simple architecture can introduce a satisfying prediction.