Abstract:
GitHub's pull-based development model is widely used by software development teams to manage software complexity. Contributors create pull requests to merge changes into the main codebase, and integrators review these requests to maintain quality and stability. However, a high volume of pull requests can overburden integrators and delay feedback. Previous studies have applied machine learning and statistical techniques with tabular data as features, a representation that may lose meaningful information. Additionally, acceptance and latency alone may not be sufficient to evaluate a pull request, because reopened pull requests can add maintenance costs and burden already-busy developers. This thesis proposes a novel multi-output deep learning approach that predicts the acceptance, latency, and reopening of pull requests at an early stage and effectively handles multiple data sources, including tabular and textual data. Our approach also applies the Synthetic Minority Over-sampling Technique (SMOTE) and a variational autoencoder (VAE) to address the highly imbalanced nature of pull request reopening data. We evaluate our approach on 143,886 pull requests from 54 well-known projects across four popular programming languages. The experimental results show that our approach significantly outperforms the randomized baseline. Compared with the existing approach, it improves Accuracy by 8.68% and F1-Score by 6.77% in acceptance prediction, MMAE by 6.07% in latency prediction, and Balanced Accuracy by 9.43% and AUC by 9.37% in reopening prediction.