Predicting continuous integration build failures using evolutionary search

https://doi.org/10.1016/j.infsof.2020.106392

Abstract

Context: Continuous Integration (CI) is a common practice in modern software development, increasingly adopted in both open-source and industrial settings. CI aims at supporting developers in integrating code changes constantly and quickly through an automated build process. However, in such a context, the build process is typically time- and resource-consuming, and requires high maintenance effort to avoid build failures.

Objective: The goal of this study is to introduce an automated approach that reduces the cost of CI build time and supports developers by predicting the CI build outcome.

Method: In this paper, we address the problem of CI build failure by introducing a novel search-based approach based on Multi-Objective Genetic Programming (MOGP) to build a CI build failure prediction model. Our approach aims at finding the best combination of CI build features and their appropriate threshold values, based on two conflicting objective functions, to deal with both failed and passed builds.

Results: We evaluated our approach on a benchmark of 56,019 builds from 10 large-scale and long-lived software projects that use the Travis CI build system. The statistical results reveal that our approach outperforms state-of-the-art machine learning techniques by providing a better balance between failed and passed builds. Furthermore, we used the generated prediction rules to investigate which factors impact CI build results, and found that features related to (1) specific statistics about the project, such as team size, (2) information about the last build preceding the current build, and (3) the types of changed files are the most influential indicators of the potential failure of a given build.

Conclusion: This paper proposes a multi-objective search-based approach for the problem of CI build failure prediction. The performance of the models developed using our MOGP approach was statistically better than that of models developed using machine learning techniques. The experimental results show that our approach can effectively reduce both the false negative rate and the false positive rate of CI build failure prediction on highly imbalanced datasets.

Introduction

Continuous integration (CI) [1] is a set of software development practices that are widely adopted in industry and open-source environments [2]. A typical CI system, such as Travis CI, advocates continuously integrating code changes, introduced by different developers, into a shared repository branch. The key to making this possible, according to Fowler [3], is automating the process of building and testing, which reduces the cost and risk of delivering defective changes. From the academic side, the study of CI adoption has become an active research topic, and it has already been shown that CI improves developers’ productivity [4], helps to maintain code quality [2] and allows for a higher release frequency [5].

However, despite its valuable benefits, CI brings its own challenges. Hilton et al. [6] revealed that build failure is a major barrier that developers face when using CI. A build failure, i.e., failing to compile the software into machine-executable code, represents a blocker that prevents developers from proceeding further with development, as it requires an immediate action to resolve it. In addition, the build resolution may take hours or even days to complete, which severely affects both the speed of software development and the productivity of developers [7]. Such challenges motivated researchers and practitioners to develop techniques for preemptively detecting when a software state is most likely to trigger a failure when built, so that developers can take the necessary preventive actions to avoid it.

Existing studies leverage the history of previous build successes and failures in order to train machine learning (ML) models. Such models learn from the CI build history and use domain knowledge to extract features and predict the outcome of a given input build. For instance, Hassan and Wang [8] used Random Forest (RF) for the binary classification of build outcomes, and Ni and Li [9] adapted cascaded classifiers to improve the accuracy of CI build prediction. Although these works have advocated that predicting CI build outcomes is possible and beneficial, none of them accounted for the imbalanced distribution of the successful and failed classes when building their prediction models. This challenges their applicability, due to the performance bias that can occur when an imbalanced distribution of class examples is used in the learning process [10], [11], [12], [13]. Hence, the minority class instances, i.e., the failed builds in our case, are much more likely to be misclassified. However, in the CI context, good accuracy on failed build prediction is more important than accuracy on passed builds. Also, increasing the accuracy on the build failure class (known as the probability of detection) can also increase the number of passed builds incorrectly classified as failed (i.e., false alarms), which makes these two objectives conflicting [10], [14].
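The two conflicting measures discussed above can be computed from a standard confusion matrix, treating failed builds as the positive class. A minimal sketch (the function name and argument order are ours, not from the paper):

```python
def pd_pf(tp, fn, fp, tn):
    """Compute the two conflicting objectives for build failure prediction.

    tp: failed builds correctly predicted as failed
    fn: failed builds wrongly predicted as passed
    fp: passed builds wrongly predicted as failed
    tn: passed builds correctly predicted as passed
    """
    pd = tp / (tp + fn)  # probability of detection: recall on failed builds
    pf = fp / (fp + tn)  # probability of false alarm: passed builds flagged as failed
    return pd, pf
```

A classifier that flags more builds as "failed" raises both values at once, which is precisely why the two objectives conflict and must be traded off rather than summed.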

To deal with the above-mentioned challenges, Evolutionary Multi-Objective Optimization (EMO) [15], [16], [17], [18], [19] has been found useful for developing software engineering predictive models [20], [21]. Researchers have advocated that the use of EMO is appropriate because it allows adapting the fitness function to evolve classifiers with good classification ability across both the minority and majority classes, e.g., a balance between failed and passed builds. This is accomplished by treating the conflicting objectives independently in the learning process using the notion of Pareto dominance. Additionally, to deal with the imbalanced nature of the dataset, a Multi-Objective Genetic Programming (MOGP) approach [22] that promotes diversity between solutions equally on both minority and majority classes allows the imbalanced training data to be used directly in the learning process, i.e., without relying on sampling techniques to re-balance the data [12], [23], which suggests that MOGP approaches are well suited for binary classification tasks with imbalanced data [10].
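The notion of Pareto dominance mentioned above can be made concrete with a small sketch (our own illustrative code): a solution dominates another if it is no worse on both objectives and strictly better on at least one, where here the probability of detection is maximized and the probability of false alarm is minimized.

```python
def dominates(a, b):
    """True if solution a Pareto-dominates solution b.

    Each solution is a pair (pd, pf):
    pd (probability of detection) is to be maximized,
    pf (probability of false alarm) is to be minimized.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better
```

Two solutions where neither dominates the other (e.g., one with higher detection but also more false alarms) are kept side by side on the Pareto front instead of being collapsed by a single weighted score.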

In this paper, we introduce a novel MOGP approach to predict CI build outcomes. The idea is based on the adaptation of the Non-dominated Sorting Genetic Algorithm (NSGA-II) [24] with a tree-based solution representation, in order to generate rules from historical data of CI builds using two competing objectives in the learning process, namely the probability of detection and the probability of false alarms. As a solution to this binary classification problem, a candidate rule is expressed as a combination of metrics and their appropriate threshold values, and should cover as many build results as possible from the base of build results. In a nutshell, our approach takes a given build as input, calculates a set of metrics that are fed into our rule, previously generated from the history of builds, and whose binary output predicts whether the input build is most likely to succeed or fail, based on its similarity to successful or failed builds.
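To illustrate the shape of such a rule, the sketch below hard-codes one hypothetical tree combining metric/threshold comparisons with boolean operators. The feature names and threshold values are our own illustrative assumptions, not rules reported by the paper; the evolved rules are trees of this general form.

```python
def rule_predicts_failure(build):
    """Evaluate one hypothetical evolved rule on a build's metrics.

    The rule is a boolean tree whose leaves compare a build feature
    against a threshold; feature names and thresholds are illustrative.
    """
    return (build["last_build_failed"] and build["src_churn"] > 120) or \
           (build["team_size"] > 15 and build["config_files_changed"] > 0)

# Example build described by its metric values (hypothetical features):
build = {"last_build_failed": True, "src_churn": 200,
         "team_size": 5, "config_files_changed": 0}
# rule_predicts_failure(build) evaluates the tree on these metrics (True here:
# the previous build failed and the source churn exceeds the threshold).
```

The search evolves both the structure of such trees and the numeric thresholds at their leaves, scoring each candidate by its detection and false-alarm rates over the build history.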

To evaluate our approach, we conducted an empirical study on a benchmark composed of 56,019 build instances from 10 open-source projects that use the Travis CI system, one of the most popular CI systems. We compare our predictive performance to existing Genetic Programming (GP) algorithms and three widely-used ML techniques, namely Random Forest, Decision Tree and Naive Bayes. The statistical results reveal that our approach advances the state of the art by outperforming existing prediction models. Moreover, we examine the features most important to our generated rules in indicating the correct CI build outcome, in order to provide practitioners with useful insights on how to avoid build failures. In summary, the contributions of this work are the following:

  • A novel formulation of CI build prediction as a multi-objective optimization problem to handle the imbalanced nature of CI builds as well as to achieve good predictive performance on both classes (passed and failed). To the best of our knowledge, this is the first attempt to use a search-based approach for CI build prediction.

  • An empirical study of our MOGP technique compared to different existing approaches based on a benchmark of 10 large and long-lived projects. The obtained results reveal that our proposal is more efficient than existing techniques, with a median AUC (Area Under the Curve) of 68% compared to 61% achieved by existing ML techniques to which we applied re-sampling. Additionally, our approach is able to strike a better balance between both failed and passed builds, achieving an improvement of at least 15% for the balance metric [25]. These are interesting and actionable results considering the highly imbalanced nature of the studied projects, with an average failure rate of 19% in the minority class.

  • Qualitative evidence of the potential reasons behind build failure through a novel feature ranking approach. The rule analysis shows that the metrics related to (1) specific statistics about the project, such as team size, (2) information about the last build preceding the current build, and (3) the types of changed files are the most influential indicators of the potential failure of a given build.

  • A comprehensive dataset [26] collected from 10 long-lived software projects, containing 56,019 records of build results.
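The balance metric cited in our contributions [25] is commonly defined as one minus the normalized Euclidean distance from the ideal point (probability of detection PD = 1, probability of false alarm PF = 0); a minimal sketch under that assumption:

```python
import math

def balance(pd, pf):
    """Balance metric: 1 minus the normalized Euclidean distance from
    the ideal classifier (PD = 1, PF = 0); higher is better, max 1.0.

    pd: probability of detection (recall on failed builds)
    pf: probability of false alarm (passed builds flagged as failed)
    """
    return 1 - math.sqrt(((0 - pf) ** 2 + (1 - pd) ** 2) / 2)
```

Because it penalizes distance from the ideal point on both axes at once, a classifier cannot score well on balance by sacrificing one class for the other, which is why it is a natural summary measure for imbalanced build data.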

Replication Package. The comprehensive dataset collected and used in our study is publicly available in [26] for future replications and extensions. We also provide full details of the validation results as well as illustrative examples of the generated rules for the research community.

Paper Organization. The remainder of this paper is organized as follows. Section 2 provides an overview of the CI build process and the related work. We present our approach in Section 3. Section 4 shows the experimental setup of our empirical study. Section 5 presents the results and findings of our studied research questions. Section 6 discusses the implications of our findings for developers, researchers and tool builders. Section 7 reviews the threats to the validity of our results. Finally, Section 8 concludes the paper and outlines avenues for future work.

Section snippets

Background and related work

In this section, we provide an overview of CI and the related work.

Search-based prediction of CI build failure

In this section, we describe our approach that uses multi-objective GP based on an adaptation of NSGA-II.

Validation

In this section, we report the results of a large-scale empirical study on a benchmark of 56,019 build instances. The comprehensive dataset collected and used in our study is publicly available in [26] for future replications and extensions.

Fig. 5 provides an overview of our experimental design used in the validation of our approach. First, we evaluate our predictive performance against existing approaches in the two first questions. At this step, we run search-based algorithms and non

Experimental results

This section presents the experimental results obtained for RQ1-3.

Discussion

In this section, we discuss our findings and their implications for developers, researchers and tool builders.

Threats to validity

This section describes the threats to the validity of our experiments.

Internal validity. One threat to internal validity is related to training and test sets selection. As an attempt to mitigate this issue, we considered online validation which is a realistic scenario as it considers the chronological order of CI builds and mimics what happens during the continuous integration process. Future work is planned to validate our approach considering other scenarios such as cross-project validation.

Conclusions and future work

In this article, we introduced a new search-based approach for CI build failure prediction. In our genetic programming (GP) adaptation, prediction rules are represented as a combination of metrics and threshold values that should correctly predict as much as possible the failed builds extracted from a base of real world examples. Considering online validation, the statistical analysis of the obtained results provides evidence that our approach outperforms three Machine Learning (ML) techniques,

CRediT authorship contribution statement

Islem Saidani: Conceptualization, Data curation, Formal analysis, Methodology, Software, Validation, Investigation, Writing - original draft. Ali Ouni: Conceptualization, Validation, Supervision, Resources, Writing - review & editing, Funding acquisition, Project administration. Moataz Chouchen: Software, Validation, Investigation. Mohamed Wiem Mkaouer: Methodology, Validation, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).

References (75)

  • Y. Zhao et al.

    The impact of continuous integration on other software development practices: A large-scale empirical study

    32nd IEEE/ACM International Conference on Automated Software Engineering

    (2017)
  • M. Hilton et al.

    Trade-offs in continuous integration: assurance, security, and flexibility

    11th Joint Meeting on Foundations of Software Engineering

    (2017)
  • R. Abdalkareem et al.

    Which commits can be CI skipped?

    IEEE Trans. Software Eng.

    (2019)
  • F. Hassan et al.

    Change-aware build prediction model for stall avoidance in continuous integration

    ACM/IEEE International Symposium on Empirical Software Engineering and Measurement

    (2017)
  • A. Ni et al.

    Cost-effective build outcome prediction using cascaded classifiers

    2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR)

    (2017)
  • U. Bhowan et al.

    Evolving ensembles in multi-objective genetic programming for classification with unbalanced data

    Annual conference on Genetic and evolutionary computation (GECCO)

    (2011)
  • U. Bhowan et al.

    Genetic programming for classification with unbalanced data

    European Conference on Genetic Programming

    (2010)
  • U. Bhowan et al.

    Reusing genetic programming for ensemble selection in classification of unbalanced data

    IEEE Trans. Evol. Comput.

    (2013)
  • I. Saidani et al.

    On the prediction of continuous integration build failures using search-based software engineering

    Proceedings of the 2020 Genetic and Evolutionary Computation Conference Companion

    (2020)
  • R. Malhotra et al.

    An exploratory study for software change prediction in object-oriented systems using hybridized techniques

    Automated Software Engineering

    (2017)
  • M. Harman et al.

    Search-based software engineering: trends, techniques and applications

    ACM Computing Surveys (CSUR)

    (2012)
  • J. Nam et al.

    Heterogeneous defect prediction

    IEEE Trans. Software Eng.

    (2017)
  • A. Ouni et al.

    Maintainability defects detection and correction: a multi-objective approach

    Automated Software Engineering

    (2013)
  • J. Chen et al.

    "Sampling" as a baseline optimizer for search-based software engineering

    IEEE Trans. Software Eng.

    (2018)
  • M. Kessentini et al.

    Detecting android smells using multi-objective genetic programming

    International Conference on Mobile Software Engineering and Systems

    (2017)
  • Z. Eckart et al.

    Improving the strength Pareto evolutionary algorithm for multiobjective optimization

    EUROGEN, Evol. Method Des. Optim. Control Ind. Problem

    (2001)
  • Y. Jin et al.

    Pareto-based multiobjective machine learning: an overview and case studies

    IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews)

    (2008)
  • U. Bhowan et al.

    Evolving diverse ensembles using genetic programming for classification with unbalanced data

    IEEE Trans. Evol. Comput.

    (2012)
  • K. Deb et al.

    A fast and elitist multiobjective genetic algorithm: NSGA-II

    (2002)
  • Dataset for CI build prediction, 2020. (Available at: ...
  • J. Xia et al.

    Could we predict the result of a continuous integration build? an empirical study

    2017 IEEE International Conference on Software Quality, Reliability and Security Companion (QRS-C)

    (2017)
  • Z. Xie et al.

    Cutting the software building efforts in continuous integration by semi-supervised online AUC optimization.

    IJCAI

    (2018)
  • J. Xia et al.

    An empirical study on the cross-project predictability of continuous integration outcomes

    2017 14th Web Information Systems and Applications Conference (WISA)

    (2017)
  • T. Rausch et al.

    An empirical analysis of build failures in the continuous integration workflows of java-based open-source software

    Proceedings of the 14th International Conference on Mining Software Repositories

    (2017)
  • M. Beller et al.

    Oops, my tests broke the build: An explorative analysis of Travis CI with GitHub

    IEEE/ACM International Conference on Mining Software Repositories

    (2017)
  • Y. Luo et al.

    What are the factors impacting build breakage?

    2017 14th Web Information Systems and Applications Conference (WISA)

    (2017)
  • A. Atchison et al.

    A time series analysis of TravisTorrent builds: to everything there is a season

    2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR)

    (2017)