Prediction of faults-slip-through in large software projects: an empirical evaluation

Software Quality Journal

Abstract

A large percentage of the cost of rework can be avoided by finding more faults earlier in a software test process. Therefore, determining which software test phases to focus improvement work on is of considerable industrial interest. We evaluate a number of prediction techniques for predicting the number of faults slipping through to the unit, function, integration, and system test phases of a large industrial project. The objective is to quantify the improvement potential in different test phases by striving to find faults in the right phase. The results show that a range of techniques are useful in predicting the number of faults slipping through to the four test phases; however, the search-based techniques (genetic programming, gene expression programming, artificial immune recognition systems, and particle swarm optimization–based artificial neural networks) consistently give better predictions, being represented at all of the test phases. Human predictions are consistently better at two of the four test phases. We conclude that human predictions of the number of faults slipping through to the various test phases can be well supported by the use of search-based techniques. A combination of human judgement and an automated search mechanism (such as any of the search-based techniques) has the potential to provide improved prediction results.


Notes

  1. According to IEEE Standard Glossary of Software Engineering Terminology (IEEE 1990), a fault is a manifestation of a human mistake.

  2. Statistical techniques (multiple regression, pace regression), tree-structured techniques (M5P, REPTree), nearest neighbor techniques (K-Star, K-nearest neighbor), ensemble techniques (bagging and rotation forest), machine-learning techniques (support vector machines and back-propagation artificial neural networks), search-based techniques (genetic programming, artificial immune recognition systems, particle swarm optimization based artificial neural networks and gene expression programming), and expert judgement.

References

  • Afzal, W. (2009). Search-based approaches to software fault prediction and software testing. Blekinge Institute of Technology Licentiate Series No. 2009:06, Ronneby, Sweden.

  • Afzal, W. (2010). Using faults-slip-through metric as a predictor of fault-proneness: Proceedings of the 21st Asia Pacific Software Engineering Conference (APSEC’10), IEEE.

  • Afzal, W., & Torkar, R. (2008). A comparative evaluation of using genetic programming for predicting fault count data: Proceedings of the 3rd International Conference on Software Engineering Advances (ICSEA’08), IEEE.

  • Afzal, W., Torkar, R., Feldt, R., & Wikstrand, G. (2010). Search-based prediction of fault-slip-through in large software projects: Proceedings of the 2nd International Symposium on Search-Based Software Engineering (SSBSE’10), IEEE Computer Society. pp. 79–88.

  • Aha, D. W., Kibler, D., & Albert, M. K. (1991). Instance-based learning algorithms. Machine Learning, 6, 37–66.

  • Antolić, Ž. (2007). Fault slip through measurement process implementation in CPP software implementation: Proceedings of the 30th Jubilee International Convention (MIPRO’07). Ericsson Nikola Tesla.

  • Arisholm, E., Briand, L. C., & Johannessen, E. B. (2010). A systematic and comprehensive investigation of methods to build and evaluate fault prediction models. Journal of Systems and Software, 83(1), 2–17.

  • Blickle, T. (1996). Theory of evolutionary algorithms and application to system synthesis. PhD thesis. Zurich, Switzerland: Swiss Federal Institute of Technology.

  • Boehm, B., & Basili, V.R. (2001). Software defect reduction top 10 list. Computer, 34(1), 135–137.

  • Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140.

  • Briand, L., Emam, K., Freimut, B., & Laitenberger, O. (2000). A comprehensive evaluation of capture-recapture models for estimating software defect content. IEEE Transactions on Software Engineering, 26(6).

  • Canu, S., Grandvalet, Y., Guigue, V., & Rakotomamonjy, A. (2005). SVM and kernel methods toolbox. Perception Systèmes et Information, INSA de Rouen, Rouen, France.

  • Catal, C., & Diri, B. (2009). A systematic review of software fault prediction studies. Expert Systems with Applications, 36(4), 7346–7354.

  • Challagulla, V., Bastani, F., Yen, I., & Paul, R. (2005). Empirical assessment of machine learning based software defect prediction techniques. Proceedings of the 10th IEEE workshop on object oriented real-time dependable systems.

  • Cleary, J. G., & Trigg, L. E. (1995). K*: An instance-based learner using an entropic distance measure. 12th International Conference on Machine Learning (ICML’95).

  • Damm, L. O. (2007). Early and cost-effective software fault detection—Measurement and implementation in an industrial setting. PhD thesis, Blekinge Institute of Technology.

  • Damm, L. O., Lundberg, L., & Wohlin, C. (2006). Faults-slip-through—a concept for measuring the efficiency of the test process. Software Process: Improvement & Practice, 11(1), 47–59.

  • Fenton, N. E., & Neil, M. (1999). A critique of software defect prediction models. IEEE Transactions on Software Engineering, 25(5), 675–689.

  • Ferreira, C. (2001). Gene expression programming: A new adaptive algorithm for solving problems. Complex Systems, 13(2).

  • Gyimothy, T., Ferenc, R., & Siket, I. (2005). Empirical validation of object-oriented metrics on open source software for fault prediction. IEEE Transactions on Software Engineering, 31(10), 897–910.

  • Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations Newsletter, 11(1), 10–18.

  • Harman, M. (2007). The current state and future of search based software engineering. Proceeding of the Future of Software Engineering (FOSE’07). Washington, DC, USA: IEEE Computer Society, pp. 342–357.

  • Harman, M. (2010). The relationship between search based software engineering and predictive modeling: Proceedings of the 6th International Conference on Predictive Models in Software Engineering (PROMISE’10). New York, NY, USA: ACM.

  • Harman, M., & Jones, B. (2001). Search-based software engineering. Information and Software Technology, 43(14), 833–839.

  • Hribar, L. (2008) Implementation of FST in design phase of the project: Proceedings of the 31st Jubilee International Convention (MIPRO’08). Ericsson Nikola Tesla.

  • Hughes, R. T. (1996). Expert judgement as an estimating method. Information and Software Technology, 38(2), 67–75.

  • IEEE. (1990) IEEE standard glossary of software engineering terminology—IEEE Std 610.12-1990. Standards Coordinating Committee of the Computer Society of the IEEE, IEEE Standards Board. New York, USA: The Institute of Electrical and Electronic Engineers, Inc.

  • Ioannidis, J. P. A. (2005) Why most published research findings are false. PLoS Medicine, 2(8), 696–701.

  • Jha, G. K., Thulasiraman, P., & Thulasiram, R. K. (2009). PSO based neural network for time series forecasting: International Joint Conference on Neural Networks.

  • Jørgensen, M., Kirkebøen, G., Sjøberg, D. I. K., Anda, B., & Bratthall, L. (2000). Human judgement in effort estimation of software projects: Proceedings of the workshop on using multi-disciplinary approaches in empirical software engineering research, co-located with ICSE’00. Ireland: Limerick.

  • Juristo, N., & Moreno, A. M. (2001). Basics of software engineering experimentation. Dordrecht: Kluwer Academic Publishers.

  • Kachigan, S. K. (1982). Statistical analysis—an interdisciplinary introduction to univariate and multivariate methods. New York: Radius Press.

  • Kitchenham, B., Pickard, L., MacDonell, S., & Shepperd, M. (2001). What accuracy statistics really measure? IEE Proceedings Software, 148(3), 81–85.

  • Lavesson, N., & Davidsson, P. (2008). Generic methods for multi-criteria evaluation: Proceedings of the SIAM International Conference on Data Mining (SD’08).

  • Lessmann, S., Baesens, B., Mues, C., & Pietsch, S. (2008). Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Transactions on Software Engineering, 34(4), 485–496.

  • Liu, Y., Khoshgoftaar, T., & Seliya, N. (2010). Evolutionary optimization of software quality modeling with multiple repositories. IEEE Transactions on Software Engineering (Article in print).

  • Lyu, M.R. (ed) (1996). Handbook of software reliability engineering. Hightstown, NJ: McGraw-Hill Inc.

  • Mohagheghi, P., Conradi, R., Killi, O. M., & Schwarz, H. (2004). An empirical study of software reuse vs. defect-density and stability: Proceedings of the 26th International Conference on Software Engineering (ICSE’04). Washington, DC, USA: IEEE Computer Society.

  • Nagappan, N., & Ball, T. (2005). Static analysis tools as early indicators of pre-release defect density: Proceedings of the 27th international conference on Software engineering (ICSE’05). New York, NY, USA: ACM.

  • Nagappan, N., Murphy, B., & Basili, V. (2008). The influence of organizational structure on software quality: An empirical case study: Proceedings of the 30th International Conference on Software Engineering (ICSE’08). New York, NY, USA: ACM.

  • Pickard, L., Kitchenham, B., & Linkman, S. (1999). An investigation of analysis techniques for software datasets. In: Proceedings of the 6th International Software Metrics Symposium (METRICS’99). Los Alamitos, USA: IEEE Computer Society.

  • Poli, R., Langdon, W. B., & McPhee, N. F. (2008). A field guide to genetic programming. Published via http://lulu.com and freely available at http://www.gp-field-guide.org.uk.

  • Rakitin, S. R. (2001). Software verification and validation for practitioners and managers (2nd ed.). 685 Canton Street, Norwood, MA, USA: Artech House, Inc.

  • Rodriguez, J. J., Kuncheva, L. I., & Alonso, C. J. (2006). Rotation forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28, 1619–1630.

  • Runeson, P., Andersson, C., Thelin, T., Andrews, A., & Berling, T. (2006). What do we know about defect detection methods? IEEE Software, 23(3), 82–90.

  • Russell, S., & Norvig, P. (2003). Artificial intelligence—a modern approach. Prentice Hall Series in Artificial Intelligence, USA.

  • Shepperd, M., Cartwright, M., & Kadoda, G. (2000). On building prediction systems for software engineers. Empirical Software Engineering, 5(3), 175–182.

  • STD. (2008) IEEE standard 12207-2008 systems and software engineering—Software life cycle processes. Software and systems engineering standards committee of the IEEE computer society. New York, USA: The Institute of Electrical and Electronic Engineers, Inc.

  • Staron, M., & Meding, W. (2008). Predicting weekly defect inflow in large software projects based on project planning and test status. Information & Software Technology, 50(7–8), 782–796.

  • Tian, J. (2004). Quality-evaluation models and measurements. IEEE Software, 21(3), 84–91.

  • Tomaszewski, P., Håkansson, J., Grahn, H., & Lundberg, L. (2007). Statistical models vs. expert estimation for fault prediction in modified code—an industrial case study. Journal of Systems and Software, 80(8), 1227–1238.

  • Trelea, I. C. (2003). The PSO algorithm: Convergence analysis and parameter selection. Information Processing Letters, 85(6).

  • Veevers, A., & Marshall, A. C. (1994). A relationship between software coverage metrics and reliability. Software Testing, Verification and Reliability, 4(1), 3–8.

  • Wagner, S. (2006). A literature survey of the quality economics of defect-detection techniques: Proceedings of the ACM/IEEE International Symposium on Empirical Software Engineering (ISESE’06)

  • Wang, Y. (2000). A new approach to fitting linear models in high dimensional spaces. PhD thesis. New Zealand: Department of Computer Science, University of Waikato.

  • Wang, Y, & Witten, I. H. (1996). Induction of model trees for predicting continuous classes. Technical report, University of Waikato, Department of Computer Science, Hamilton, New Zealand, URL http://www.cs.waikato.ac.nz/pubs/wp/1996/uow-cs-wp-1996-23.pdf.

  • Watkins, A., Timmis, J., & Boggess, L. (2004). Artificial immune recognition system (AIRS): An immune-inspired supervised learning algorithm. Genetic programming and Evolvable Machines, 5(3), 291–317.

  • Weka Documentation (2010) Class REPTree.Tree. URL http://www.dbs.ifi.lmu.de/~zimek/diplomathesis/implementations/EHNDs/doc/weka/classifiers/trees/REPTree.Tree.html.

  • Weyuker, E. J., Ostrand, T. J., & Bell, R. M. (2010). Comparing the effectiveness of several modeling methods for fault prediction. Empirical Software Engineering, 15(3), 277–295.

  • Witten, I., & Frank, E. (2005). Data mining—practical machine learning tools and techniques. USA: Morgan Kaufmann Publishers.

  • Zhong, S., Khoshgoftaar, T. M., & Seliya, N. (2004). Unsupervised learning for expert-based software quality estimation: Proceedings of the 8th IEEE International Symposium on High Assurance Systems Engineering (HASE’04).

Acknowledgment

We are grateful to Prof. Anneliese Andrews, University of Denver, for reading and commenting on the initial concept paper.

Author information

Correspondence to Wasif Afzal.

Appendix: Model training and testing procedure

This section discusses the parameter settings that have been considered for different techniques during model selection. These settings may be used for a future replication of this study and to quantify the impact of changing the parameter settings, perhaps using different data sets. As given in Sect. 3.1, we use data from 45 weeks of the baseline project to train the models, while the results are evaluated on the data from 15 weeks of an ongoing project. The experimental evaluation process is also summarized in Procedure 1.

[Procedure 1: experimental evaluation process (pseudocode figure not reproduced here)]
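For readers who want a concrete starting point, the following minimal Python sketch illustrates the general train-and-evaluate workflow summarized in Procedure 1: fit each candidate model on the 45 weeks of baseline data and rank the models by their error on the 15 weeks of the ongoing project. It is not the authors' implementation; the synthetic data, the choice of candidate models, and the definition of ARE as mean absolute relative error are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

def absolute_relative_error(actual, predicted):
    """Assumed definition of ARE: mean absolute relative error."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs(actual - predicted) / np.maximum(np.abs(actual), 1e-9))

# Hypothetical data: 45 weekly observations from the baseline project (training)
# and 15 weekly observations from the ongoing project (testing).
rng = np.random.default_rng(1)
X_train, y_train = rng.random((45, 3)), rng.poisson(10, 45)
X_test, y_test = rng.random((15, 3)), rng.poisson(10, 15)

candidates = {
    "multiple regression": LinearRegression(),
    "k-NN (k=3)": KNeighborsRegressor(n_neighbors=3),
}

# Train each candidate on the baseline data and rank by ARE on the ongoing project.
for name, model in candidates.items():
    model.fit(X_train, y_train)
    print(name, absolute_relative_error(y_test, model.predict(X_test)))
```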

Least-squares multiple regression does not require selection of parameters; rather, the coefficients are determined from the training data. For pace regression, the different estimators implemented in the WEKA machine-learning tool (Hall et al. 2009) have been evaluated, including empirical Bayes, ordinary least squares, Akaike’s information criterion (AIC), and the risk inflation criterion (RIC). The estimator giving the least ARE is selected as the best pace regression model.
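Pace regression is a WEKA-specific learner without a direct counterpart in most other libraries. The sketch below, which reuses the data and ARE helper from the sketch after Procedure 1, only illustrates the selection idea (fit several estimator variants and keep the one with the least ARE) using ordinary scikit-learn linear models as stand-ins; it does not reproduce the pace regression estimators themselves.

```python
from sklearn.linear_model import BayesianRidge, LinearRegression, Ridge

# Stand-ins for the estimator variants evaluated for pace regression;
# the actual estimators used in the study are those listed in the text above.
estimators = {"OLS": LinearRegression(),
              "ridge": Ridge(alpha=1.0),
              "Bayesian": BayesianRidge()}

# Keep the estimator with the least ARE on the ongoing-project data.
best_name, best_model = min(
    estimators.items(),
    key=lambda kv: absolute_relative_error(
        y_test, kv[1].fit(X_train, y_train).predict(X_test)))
print("selected estimator:", best_name)
```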

The M5P technique requires setting the minimum number of instances at a leaf node, which has been varied in the range [2, 4, …, 10] with pruning and smoothing enabled. The model with the minimum ARE is retained. The REPTree technique requires setting the maximum depth of the tree, the minimum total weight of the instances in a leaf, the minimum variance proportion at a node required for splitting, the number of folds of data used for pruning, and the seed value used for randomizing the data. We have imposed no restriction on the maximum depth of the tree, while the minimum total weight of the instances in a leaf is varied in the range [2, 4, …, 10]. The minimum variance proportion at a node, the number of folds of data used for pruning, and the seed value used for randomization are kept constant at their default values of 0.0010, 3, and 1, respectively.
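As a hedged illustration of this kind of leaf-size sweep (again reusing the data and ARE helper from the first sketch), the following uses scikit-learn's DecisionTreeRegressor as a stand-in, since M5P and REPTree are WEKA learners with no exact equivalent there.

```python
from sklearn.tree import DecisionTreeRegressor

# Sweep the minimum number of instances at a leaf over [2, 4, ..., 10] and
# keep the tree with the lowest ARE (mirrors the M5P/REPTree selection).
tree_scores = {}
for min_leaf in range(2, 11, 2):
    tree = DecisionTreeRegressor(min_samples_leaf=min_leaf, random_state=1)
    tree.fit(X_train, y_train)
    tree_scores[min_leaf] = absolute_relative_error(y_test, tree.predict(X_test))
best_leaf = min(tree_scores, key=tree_scores.get)
```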

The K* instance-based technique requires setting the blending parameter, which takes a value between 0 and 100 %. This parameter has been varied in the range [0, 20, 40, …, 100]. For k-NN, the number of neighbors has been varied in the range [1, 3, 5, …, 15].
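A corresponding sweep over the number of neighbors might look as follows; the K* blending parameter has no scikit-learn analogue, so only the k-NN part is sketched (data and ARE helper reused from the first sketch).

```python
from sklearn.neighbors import KNeighborsRegressor

# k-NN: vary the number of neighbors over [1, 3, 5, ..., 15].
knn_scores = {}
for k in range(1, 16, 2):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    knn_scores[k] = absolute_relative_error(y_test, knn.predict(X_test))
best_k = min(knn_scores, key=knn_scores.get)
```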

For SVM, two parameters have to be set by the user: the epsilon parameter, \(\varepsilon\), and the regularization parameter, C. Setting the value of C near the range of the output values has been found to be a successful heuristic; we therefore vary C within the range [1, 3, …, 11]. The value of \(\varepsilon\) is varied in the range [0.001, 0.003], while the kernel used is the radial basis function. Training an artificial neural network (ANN) requires deciding on the number of layers and the number of nodes in each layer. We considered an ANN architecture with 1 input layer, 2 hidden layers, and 1 output layer. The number of independent variables in the problem determined the number of input nodes. The two hidden layers used a varied number of nodes in the range [1, 3, 5, 7], while the output layer used a single node. The hyperbolic tangent sigmoid and linear transfer functions have been used for the hidden and output nodes, respectively. Finally, the number of epochs used is 500, and the weights are updated using a learning rate of 0.3 and a momentum of 0.2.
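A rough sketch of both grids is given below, using scikit-learn's SVR and MLPRegressor as stand-ins; MLPRegressor does not expose exactly the same transfer functions or momentum-based weight updates as the original setup, so the settings are illustrative only (data and ARE helper reused from the first sketch).

```python
from itertools import product
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

# SVM regression: grid over C in [1, 3, ..., 11] and epsilon in {0.001, 0.003},
# with a radial basis function kernel.
svr_scores = {}
for C, eps in product(range(1, 12, 2), (0.001, 0.003)):
    svr = SVR(kernel="rbf", C=C, epsilon=eps).fit(X_train, y_train)
    svr_scores[(C, eps)] = absolute_relative_error(y_test, svr.predict(X_test))

# ANN: two hidden layers whose sizes are varied over [1, 3, 5, 7],
# tanh activation in the hidden layers, linear output, 500 training epochs.
ann_scores = {}
for h1, h2 in product((1, 3, 5, 7), repeat=2):
    ann = MLPRegressor(hidden_layer_sizes=(h1, h2), activation="tanh",
                       max_iter=500, random_state=1)
    ann.fit(X_train, y_train)
    ann_scores[(h1, h2)] = absolute_relative_error(y_test, ann.predict(X_test))
```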

Model selection for Bagging involves deciding upon the size of the bag, as a percentage of the training set size, and the number of iterations to be performed. These two parameters have been varied in the ranges [25, 50, 75, 100] and [5, 10, 15], respectively, with the REPTree technique as the base learner. For rotation forest, the number of iterations has been varied in the range [5, 10, 15], and the base learner used is again the REPTree technique.
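The bagging sweep can be sketched with scikit-learn's BaggingRegressor, whose default base learner is a regression tree rather than REPTree; rotation forest has no scikit-learn implementation and is therefore omitted from the sketch (data and ARE helper reused from the first sketch).

```python
from itertools import product
from sklearn.ensemble import BaggingRegressor

# Bag size as a fraction of the training set in {25, 50, 75, 100} % and
# number of iterations in {5, 10, 15}; the default base learner is a tree.
bag_scores = {}
for pct, n_iter in product((25, 50, 75, 100), (5, 10, 15)):
    bag = BaggingRegressor(max_samples=pct / 100.0, n_estimators=n_iter,
                           random_state=1)
    bag.fit(X_train, y_train)
    bag_scores[(pct, n_iter)] = absolute_relative_error(y_test, bag.predict(X_test))
```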

GP requires setting a number of control parameters. Although the effect of changing these control parameters on the end solution is still an active area of research, we nevertheless experimented with different function and terminal sets. Initially, we used a minimal set of functions and a terminal set containing only the independent variable. We then incrementally extended the function set with additional functions and later complemented the terminal set with a random constant. The model with the best fitness was chosen from all runs of the GP system over these variations of function and terminal sets. The GP programs were evaluated according to the sum of absolute differences between the obtained and expected results over all fitness cases, \(\sum\nolimits_{i=1}^n |e_{i}-e_{i}^{\prime}|,\) where \(e_i\) is the actual fault count data, \(e_i^{\prime}\) is the estimated value of the fault count data, and n is the size of the data set used to train the GP models. The control parameters chosen for the GP system are shown in Table 9. For GEP, the solutions are evaluated for fitness using the mean squared error, and the control parameters are shown in Table 10.

The AIRS algorithm also requires setting a number of parameters. While it is not feasible to experiment with all combinations of these parameters, the value of k for the majority voting has been varied in the range [1, 3, 5, …, 15]. The rest of the parameters were set as follows: affinity threshold = 0.2, clonal rate = 10, hypermutation rate = 2, mutation rate = 0.1, stimulation value = 0.9, and total resources = 150. For PSO-ANN, an architecture similar to the basic ANN is used, except that the weights are optimized using PSO with 25 particles in the swarm and the number of iterations varied in the range [500, 1,000, 1,500, 2,000]. The mean squared error is used as the fitness function.
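The two fitness measures used by the search-based models are simple to state as code. The functions below are a minimal plain-Python sketch of the GP and GEP fitness computations described above, not the authors' GP/GEP implementations; the example weekly fault counts are invented.

```python
import numpy as np

def gp_fitness(actual, predicted):
    """GP fitness: sum of absolute differences over all fitness cases,
    i.e. sum_i |e_i - e'_i| (lower is better)."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.sum(np.abs(a - p)))

def gep_fitness(actual, predicted):
    """GEP fitness: mean squared error over the training cases (lower is better)."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean((a - p) ** 2))

# A candidate program predicting 10 faults every week would be scored as follows:
weekly_faults = [8, 12, 9, 11, 10]
print(gp_fitness(weekly_faults, [10] * 5))   # 6.0
print(gep_fitness(weekly_faults, [10] * 5))  # 2.0
```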

Table 9 GP control parameters
Table 10 GEP control parameters


About this article

Cite this article

Afzal, W., Torkar, R., Feldt, R. et al. Prediction of faults-slip-through in large software projects: an empirical evaluation. Software Qual J 22, 51–86 (2014). https://doi.org/10.1007/s11219-013-9205-3
