Predicting continuous integration build failures using evolutionary search

https://doi.org/10.1016/j.infsof.2020.106392

Abstract

Context: Continuous Integration (CI) is a common practice in modern software development, increasingly adopted in both open-source and industrial settings. CI aims at supporting developers in integrating code changes constantly and quickly through an automated build process. However, in such a context, the build process is typically time- and resource-consuming, and requires high maintenance effort to avoid build failures.

Objective: The goal of this study is to introduce an automated approach that reduces the cost of CI build time and supports developers by predicting the CI build outcome.

Method: In this paper, we address the problem of CI build failure by introducing a novel search-based approach based on Multi-Objective Genetic Programming (MOGP) to build a CI build failure prediction model. Our approach aims at finding the best combination of CI build features and their appropriate threshold values, based on two conflicting objective functions, to deal with both failed and passed builds.

Results: We evaluated our approach on a benchmark of 56,019 builds from 10 large-scale and long-lived software projects that use the Travis CI build system. The statistical results reveal that our approach outperforms state-of-the-art machine learning techniques by providing a better balance between failed and passed builds. Furthermore, we used the generated prediction rules to investigate which factors impact CI build results, and found that features related to (1) specific statistics about the project, such as team size, (2) information about the last build preceding the current build, and (3) the types of changed files are the most influential indicators of the potential failure of a given build.

Conclusion: This paper proposes a multi-objective search-based approach for the problem of CI build failure prediction. The performance of the models developed using our MOGP approach was statistically better than that of models developed using machine learning techniques. The experimental results show that our approach can effectively reduce both the false negative rate and the false positive rate of CI build failure prediction on highly imbalanced datasets.

Introduction

Continuous integration (CI) [1] is a set of software development practices that are widely adopted in industry and open-source environments [2]. A typical CI system, such as Travis CI, advocates continuously integrating code changes, introduced by different developers, into a shared repository branch. The key to making this possible, according to Fowler [3], is automating the process of building and testing, which reduces the cost and risk of delivering defective changes. From the academic side, the study of CI adoption has become an active research topic, and it has already been shown that CI improves developers’ productivity [4], helps to maintain code quality [2] and allows for a higher release frequency [5].

However, despite its valuable benefits, CI brings its own challenges. Hilton et al. [6] revealed that build failure is a major barrier that developers face when using CI. A build failure, i.e., failing to compile the software into machine-executable code, represents a blocker that prevents developers from proceeding further with development, as it requires an immediate action to resolve it. In addition, the build resolution may take hours or even days to complete, which severely affects both the speed of software development and the productivity of developers [7]. Such challenges motivated researchers and practitioners to develop techniques for preemptively detecting when a software state is most likely to trigger a failure when built, so that developers can take the necessary preventive actions to avoid it.

Existing studies leverage the history of previous build successes and failures in order to train machine learning (ML) models. Such models learn from the CI build history and use domain knowledge to extract features and predict the outcome of a given input build. For instance, Hassan and Wang [8] used Random Forest (RF) for the binary classification of build outcomes, and Ni and Li [9] adapted cascaded classifiers to improve the accuracy of CI build prediction. Although these works have advocated that predicting CI build outcomes is possible and beneficial, none of them accounted for the imbalanced distribution of the successful and failed classes when building their prediction models. This challenges their applicability, due to the performance bias that can occur when an imbalanced distribution of class examples is used in the learning process [10], [11], [12], [13]. Hence, the minority class instances, i.e., the failed builds in our case, are much more likely to be misclassified. However, in the CI context, good accuracy on failed build prediction is more important than accuracy on passed builds. Also, increasing the accuracy on the build failure class (known as the probability of detection) can also increase the number of passed builds incorrectly classified as failed (i.e., false alarms), which makes these two objectives conflicting [10], [14].
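The two conflicting measures discussed above can be computed from a standard confusion matrix, treating failed builds as the positive class. A minimal sketch (the function name and argument order are ours, not from the paper):

```python
def pd_pf(tp, fn, fp, tn):
    """Compute the two conflicting objectives for build failure prediction.

    tp: failed builds correctly predicted as failed
    fn: failed builds wrongly predicted as passed
    fp: passed builds wrongly predicted as failed
    tn: passed builds correctly predicted as passed
    """
    pd = tp / (tp + fn)  # probability of detection: recall on failed builds
    pf = fp / (fp + tn)  # probability of false alarm: passed builds flagged as failed
    return pd, pf
```

A classifier that flags more builds as "failed" raises both values at once, which is precisely why the two objectives conflict and must be traded off rather than summed.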

To deal with the above-mentioned challenges, Evolutionary Multi-Objective Optimization (EMO) [15], [16], [17], [18], [19] has been found useful for developing software engineering predictive models [20], [21]. Researchers have advocated that the use of EMO is appropriate because it allows adapting the fitness function to evolve classifiers with good classification ability across both the minority and majority classes, e.g., a balance between failed and passed builds. This is accomplished by treating the conflicting objectives independently in the learning process using the notion of Pareto dominance. Additionally, to deal with the imbalanced nature of the dataset, a Multi-Objective Genetic Programming (MOGP) approach [22] that promotes diversity between solutions equally on both minority and majority classes allows the imbalanced training data to be used directly in the learning process, i.e., without relying on sampling techniques to re-balance the data [12], [23], which suggests that MOGP approaches are well suited for binary classification tasks with imbalanced data [10].
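The notion of Pareto dominance mentioned above can be made concrete with a small sketch (our own illustrative code): a solution dominates another if it is no worse on both objectives and strictly better on at least one, where here the probability of detection is maximized and the probability of false alarm is minimized.

```python
def dominates(a, b):
    """True if solution a Pareto-dominates solution b.

    Each solution is a pair (pd, pf):
    pd (probability of detection) is to be maximized,
    pf (probability of false alarm) is to be minimized.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better
```

Two solutions where neither dominates the other (e.g., one with higher detection but also more false alarms) are kept side by side on the Pareto front instead of being collapsed by a single weighted score.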

In this paper, we introduce a novel MOGP approach to predict CI build outcomes. The idea is based on the adaptation of the Non-dominated Sorting Genetic Algorithm (NSGA-II) [24] with a tree-based solution representation, in order to generate rules from historical data of CI builds using two competing objectives in the learning process, namely the probability of detection and the probability of false alarms. As a solution to this binary classification problem, a candidate rule is expressed as a combination of metrics and their appropriate threshold values, and should cover as many build results as possible from the base of build results. In a nutshell, our approach takes a given build as input, calculates a set of metrics that are fed into our rule, previously generated from the history of builds, and whose binary output predicts whether the input build is most likely to succeed or fail, based on its similarity to successful or failed builds.
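To illustrate the shape of such a rule, the sketch below hard-codes one hypothetical tree combining metric/threshold comparisons with boolean operators. The feature names and threshold values are our own illustrative assumptions, not rules reported by the paper; the evolved rules are trees of this general form.

```python
def rule_predicts_failure(build):
    """Evaluate one hypothetical evolved rule on a build's metrics.

    The rule is a boolean tree whose leaves compare a build feature
    against a threshold; feature names and thresholds are illustrative.
    """
    return (build["last_build_failed"] and build["src_churn"] > 120) or \
           (build["team_size"] > 15 and build["config_files_changed"] > 0)

# Example build described by its metric values (hypothetical features):
build = {"last_build_failed": True, "src_churn": 200,
         "team_size": 5, "config_files_changed": 0}
# rule_predicts_failure(build) evaluates the tree on these metrics (True here:
# the previous build failed and the source churn exceeds the threshold).
```

The search evolves both the structure of such trees and the numeric thresholds at their leaves, scoring each candidate by its detection and false-alarm rates over the build history.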

To evaluate our approach, we conducted an empirical study on a benchmark composed of 56,019 build instances from 10 open-source projects that use the Travis CI system, one of the most popular CI systems. We compare our predictive performance to existing Genetic Programming (GP) algorithms and three widely-used ML techniques, namely Random Forest, Decision Tree and Naive Bayes. The statistical results reveal that our approach advances the state of the art by outperforming existing prediction models. Moreover, we examine the features most important to our generated rules in indicating the correct CI build outcome, in order to provide practitioners with useful insights on how to avoid build failures. In summary, the contributions of this work are the following:

  • A novel formulation of CI build prediction as a multi-objective optimization problem to handle the imbalanced nature of CI builds as well as to achieve good predictive performance on both classes (passed and failed). To the best of our knowledge, this is the first attempt to use a search-based approach for CI build prediction.

  • An empirical study of our MOGP technique compared to different existing approaches based on a benchmark of 10 large and long-lived projects. The obtained results reveal that our proposal is more efficient than existing techniques, with a median AUC (Area Under the Curve) of 68% compared to 61% achieved by existing ML techniques to which we applied re-sampling. Additionally, our approach is able to strike a better balance between both failed and passed builds, achieving an improvement of at least 15% for the balance metric [25]. These are interesting and actionable results considering the highly imbalanced nature of the studied projects, with an average failure rate of 19% in the minority class.

  • Qualitative evidence of the potential reasons behind build failure through a novel feature ranking approach. The rule analysis shows that the metrics related to (1) specific statistics about the project, such as team size, (2) information about the last build preceding the current build, and (3) the types of changed files are the most influential indicators of the potential failure of a given build.

  • A comprehensive dataset [26] collected from 10 long-lived software projects, containing 56,019 records of build results.
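The balance metric cited in our contributions [25] is commonly defined as one minus the normalized Euclidean distance from the ideal point (probability of detection PD = 1, probability of false alarm PF = 0); a minimal sketch under that assumption:

```python
import math

def balance(pd, pf):
    """Balance metric: 1 minus the normalized Euclidean distance from
    the ideal classifier (PD = 1, PF = 0); higher is better, max 1.0.

    pd: probability of detection (recall on failed builds)
    pf: probability of false alarm (passed builds flagged as failed)
    """
    return 1 - math.sqrt(((0 - pf) ** 2 + (1 - pd) ** 2) / 2)
```

Because it penalizes distance from the ideal point on both axes at once, a classifier cannot score well on balance by sacrificing one class for the other, which is why it is a natural summary measure for imbalanced build data.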

Replication Package. The comprehensive dataset collected and used in our study is publicly available in [26] for future replications and extensions. We also provide full details of the validation results as well as illustrative examples of the generated rules for the research community.

Paper Organization. The remainder of this paper is organized as follows. Section 2 provides an overview of the CI build process and the related work. We present our approach in Section 3. Section 4 shows the experimental setup of our empirical study. Section 5 presents the results and findings of our studied research questions. Section 6 discusses the implications of our findings for developers, researchers and tool builders. Section 7 reviews the threats to the validity of our results. Finally, Section 8 concludes the paper and outlines avenues for future work.

Section snippets

Background and related work

In this section, we provide an overview of CI and the related work.

Search-based prediction of CI build failure

In this section, we describe our approach that uses multi-objective GP based on an adaptation of NSGA-II.

Validation

In this section, we report the results of a large-scale empirical study on a benchmark of 56,019 build instances. The comprehensive dataset collected and used in our study is publicly available in [26] for future replications and extensions.

Fig. 5 provides an overview of our experimental design used in the validation of our approach. First, we evaluate our predictive performance against existing approaches in the two first questions. At this step, we run search-based algorithms and non

Experimental results

This section presents the experimental results obtained for RQ1-3.

Discussion

In this section, we discuss our findings and their implications for developers, researchers and tool builders.

Threats to validity

This section describes the threats to the validity of our experiments.

Internal validity. One threat to internal validity is related to training and test sets selection. As an attempt to mitigate this issue, we considered online validation which is a realistic scenario as it considers the chronological order of CI builds and mimics what happens during the continuous integration process. Future work is planned to validate our approach considering other scenarios such as cross-project validation.

Conclusions and future work

In this article, we introduced a new search-based approach for CI build failure prediction. In our genetic programming (GP) adaptation, prediction rules are represented as a combination of metrics and threshold values that should correctly predict as much as possible the failed builds extracted from a base of real world examples. Considering online validation, the statistical analysis of the obtained results provides evidence that our approach outperforms three Machine Learning (ML) techniques,

CRediT authorship contribution statement

Islem Saidani: Conceptualization, Data curation, Formal analysis, Methodology, Software, Validation, Investigation, Writing - original draft. Ali Ouni: Conceptualization, Validation, Supervision, Resources, Writing - review & editing, Funding acquisition, Project administration. Moataz Chouchen: Software, Validation, Investigation. Mohamed Wiem Mkaouer: Methodology, Validation, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).

References (75)

  • Y. Zhao et al.

    The impact of continuous integration on other software development practices: A large-scale empirical study

    32nd IEEE/ACM International Conference on Automated Software Engineering

    (2017)
  • M. Hilton et al.

    Trade-offs in continuous integration: assurance, security, and flexibility

    11th Joint Meeting on Foundations of Software Engineering

    (2017)
  • R. Abdalkareem et al.

    Which commits can be CI skipped?

    IEEE Trans. Software Eng.

    (2019)
  • F. Hassan et al.

    Change-aware build prediction model for stall avoidance in continuous integration

    ACM/IEEE International Symposium on Empirical Software Engineering and Measurement

    (2017)
  • A. Ni et al.

    Cost-effective build outcome prediction using cascaded classifiers

    2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR)

    (2017)
  • U. Bhowan et al.

    Evolving ensembles in multi-objective genetic programming for classification with unbalanced data

    Annual conference on Genetic and evolutionary computation (GECCO)

    (2011)
  • U. Bhowan et al.

    Genetic programming for classification with unbalanced data

    European Conference on Genetic Programming

    (2010)
  • U. Bhowan et al.

    Reusing genetic programming for ensemble selection in classification of unbalanced data

    IEEE Trans. Evol. Comput.

    (2013)
  • I. Saidani et al.

    On the prediction of continuous integration build failures using search-based software engineering

    Proceedings of the 2020 Genetic and Evolutionary Computation Conference Companion

    (2020)
  • R. Malhotra et al.

    An exploratory study for software change prediction in object-oriented systems using hybridized techniques

    Automated Software Engineering

    (2017)
  • M. Harman et al.

    Search-based software engineering: trends, techniques and applications

    ACM Computing Surveys (CSUR)

    (2012)
  • J. Nam et al.

    Heterogeneous defect prediction

    IEEE Trans. Software Eng.

    (2017)
  • A. Ouni et al.

    Maintainability defects detection and correction: a multi-objective approach

    Automated Software Engineering

    (2013)
  • J. Chen et al.

    "Sampling" as a baseline optimizer for search-based software engineering

    IEEE Trans. Software Eng.

    (2018)
  • M. Kessentini et al.

    Detecting android smells using multi-objective genetic programming

    International Conference on Mobile Software Engineering and Systems

    (2017)
  • Z. Eckart et al.

    Improving the strength Pareto evolutionary algorithm for multiobjective optimization

    EUROGEN, Evol. Method Des. Optim. Control Ind. Problem

    (2001)
  • Y. Jin et al.

    Pareto-based multiobjective machine learning: an overview and case studies

    IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews)

    (2008)
  • U. Bhowan et al.

    Evolving diverse ensembles using genetic programming for classification with unbalanced data

    IEEE Trans. Evol. Comput.

    (2012)
  • K. Deb et al.

    A fast and elitist multiobjective genetic algorithm: NSGA-II

    (2002)
  • Dataset for CI build prediction, 2020. (Available at: ...
  • J. Xia et al.

    Could we predict the result of a continuous integration build? an empirical study

    2017 IEEE International Conference on Software Quality, Reliability and Security Companion (QRS-C)

    (2017)
  • Z. Xie et al.

    Cutting the software building efforts in continuous integration by semi-supervised online AUC optimization.

    IJCAI

    (2018)
  • J. Xia et al.

    An empirical study on the cross-project predictability of continuous integration outcomes

    2017 14th Web Information Systems and Applications Conference (WISA)

    (2017)
  • T. Rausch et al.

    An empirical analysis of build failures in the continuous integration workflows of java-based open-source software

    Proceedings of the 14th International Conference on Mining Software Repositories

    (2017)
  • M. Beller et al.

    Oops, my tests broke the build: An explorative analysis of Travis CI with GitHub

    IEEE/ACM International Conference on Mining Software Repositories

    (2017)
  • Y. Luo et al.

    What are the factors impacting build breakage?

    2017 14th Web Information Systems and Applications Conference (WISA)

    (2017)
  • A. Atchison et al.

    A time series analysis of TravisTorrent builds: to everything there is a season

    2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR)

    (2017)