Abstract
Despite general acceptance that software engineering datasets often contain noisy, irrelevant, or redundant variables, very few benchmark studies of feature subset selection (FSS) methods on real-life data from software projects have been conducted. This paper provides an empirical comparison of state-of-the-art FSS methods on five fault prediction datasets from the PROMISE data repository: information gain attribute ranking (IG), Relief (RLF), principal component analysis (PCA), correlation-based feature selection (CFS), consistency-based subset evaluation (CNS), wrapper subset evaluation (WRP), and an evolutionary computation method, genetic programming (GP). Two diverse learning algorithms, C4.5 and naïve Bayes (NB), are used to test the attribute sets selected by each FSS method. For each FSS method-dataset combination, the area under the receiver operating characteristic curve (AUC), averaged over 10-fold cross-validation runs, was calculated before and after FSS. The results show that although there are no statistically significant differences between the AUC values for the different FSS methods with either C4.5 or NB, a smaller set of FSS methods (IG, RLF, GP) consistently selects fewer attributes without degrading classification accuracy. We conclude that FSS is generally beneficial, as it helps improve the classification accuracy of NB and C4.5. There is no single best FSS method for all datasets, but IG, RLF, and GP consistently select fewer attributes without degrading classification accuracy within statistically significant boundaries.
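The evaluation procedure described above can be sketched in a few lines. The study itself used WEKA and a GP toolbox; the following is only an illustrative approximation in scikit-learn, using mutual information as a stand-in for information gain ranking, a synthetic dataset in place of the PROMISE data, and NB as the learner.

```python
# Illustrative sketch (not the paper's WEKA/GP pipeline): compare mean AUC
# over 10-fold cross-validation before and after a filter-style feature
# subset selection, with naive Bayes as the learner.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a fault dataset: 21 metrics, binary fault label.
X, y = make_classification(n_samples=500, n_features=21, n_informative=5,
                           n_redundant=8, random_state=1)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

# Baseline: all attributes.
auc_all = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="roc_auc").mean()

# After FSS: keep the 8 attributes ranked highest by mutual information
# (an information-gain-style criterion); selection is fit inside each fold
# so the held-out data never influences which attributes are kept.
fss_nb = make_pipeline(SelectKBest(mutual_info_classif, k=8), GaussianNB())
auc_fss = cross_val_score(fss_nb, X, y, cv=cv, scoring="roc_auc").mean()

print(f"AUC, all attributes: {auc_all:.3f}")
print(f"AUC, after FSS:      {auc_fss:.3f}")
```

Fitting the selector inside each fold, rather than once on the full dataset, avoids selection bias in the cross-validated AUC estimate.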
Notes
1. The requirement that the number of training data points be an exponential function of the feature dimension.
2. Section 4 provides more details about AUC.
Copyright information
© 2016 Springer International Publishing Switzerland
Cite this chapter
Afzal, W., Torkar, R. (2016). Towards Benchmarking Feature Subset Selection Methods for Software Fault Prediction. In: Pedrycz, W., Succi, G., Sillitti, A. (eds) Computational Intelligence and Quantitative Software Engineering. Studies in Computational Intelligence, vol 617. Springer, Cham. https://doi.org/10.1007/978-3-319-25964-2_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25962-8
Online ISBN: 978-3-319-25964-2