GitHub has become a precious service for storing and managing software source code. Over the last year, 10M new developers have joined the GitHub community, contributing to more than 44M repositories. In order to help developers increase the reachability of their repositories, in 2017 GitHub introduced the possibility to classify them by means of topics. However, assigning wrong topics to a given repository can compromise the possibility of helping other developers approach it, and thus preventing them from contributing to its development. In this paper we investigate the application of Multinomial Naïve Bayesian (MNB) networks to automatically classify GitHub repositories. By analyzing the README file(s) of the repository to be classified and the source code implementing it, the conceived approach is able to recommend GitHub topics. To the best of our knowledge, this is the first supervised approach addressing the considered problem. Consequently, since there exists no suitable baseline for the comparison, we validated the approach by considering different metrics, aiming to study various quality aspects.
A Multinomial Naïve Bayesian (MNB) Network to Automatically Recommend Topics for GitHub Repositories
Di Sipio C.;Rubei R.;Di Ruscio D.;NGUYEN THANH PHUONG
2020-01-01
Abstract
GitHub has become a precious service for storing and managing software source code. Over the last year, 10M new developers have joined the GitHub community, contributing to more than 44M repositories. In order to help developers increase the reachability of their repositories, in 2017 GitHub introduced the possibility to classify them by means of topics. However, assigning wrong topics to a given repository can compromise the possibility of helping other developers approach it, and thus preventing them from contributing to its development. In this paper we investigate the application of Multinomial Naïve Bayesian (MNB) networks to automatically classify GitHub repositories. By analyzing the README file(s) of the repository to be classified and the source code implementing it, the conceived approach is able to recommend GitHub topics. To the best of our knowledge, this is the first supervised approach addressing the considered problem. Consequently, since there exists no suitable baseline for the comparison, we validated the approach by considering different metrics, aiming to study various quality aspects.Pubblicazioni consigliate
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.