Abstract
Many industrial and real-world datasets suffer from the unavoidable problem of missing values. The ability to deal with missing values is an essential requirement for classification, because inadequate treatment of missing values may lead to large classification errors. The problem of missing data has been addressed extensively in the statistics literature and, to a lesser extent, in the classification literature. One of the most popular approaches to dealing with missing data is to use imputation methods to fill missing values with plausible values. Some powerful imputation methods, such as the regression-based imputations in MICE [36], are often suitable for batch imputation tasks. However, they are often expensive when used to impute missing values for every single incomplete instance in the unseen set for classification. This paper proposes a genetic programming-based imputation (GPI) method for classification with missing data that uses genetic programming as a regression method to impute missing values. Experiments on six benchmark datasets and five popular classifiers compare GPI with five other popular and advanced regression-based imputation methods in MICE on two measures: classification accuracy and computation time. The results show that, in most cases, GPI achieves classification accuracy at least as good as the other imputation methods, and sometimes significantly better. Moreover, using GPI to impute missing values for every single incomplete instance is dramatically faster than the other imputation methods.
References
Agapitos, A., Brabazon, A., O’Neill, M.: Controlling overfitting in symbolic regression based on a bias/variance error decomposition. In: Coello, C.A.C., Cutello, V., Deb, K., Forrest, S., Nicosia, G., Pavone, M. (eds.) PPSN 2012, Part I. LNCS, vol. 7491, pp. 438–447. Springer, Heidelberg (2012)
Andridge, R.R., Little, R.J.: A review of hot deck imputation for survey non-response. Int. Stat. Rev. 78, 40–64 (2010)
Asuncion, A., Newman, D.: UCI machine learning repository (2007). http://www.ics.uci.edu/~mlearn/MLRepository.html
Augusto, D.A., Barbosa, H.J.: Symbolic regression via genetic programming. In: Sixth Brazilian Symposium on Neural Networks, 2000, Proceedings, pp. 173–178 (2000)
Barmpalexis, P., Kachrimanis, K., Tsakonas, A., Georgarakis, E.: Symbolic regression via genetic programming in the optimization of a controlled release pharmaceutical formulation. Chemometr. Intell. Lab. Syst. 107, 75–82 (2011)
Barnard, J., Meng, X.L.: Applications of multiple imputation in medical studies: from AIDS to NHANES. Stat. Methods Med. Res. 8, 7–36 (1999)
Berger, J.O.: Statistical Decision Theory and Bayesian Analysis. Springer Science & Business Media, New York (2013)
Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and Regression Trees. CRC Press, Boca Raton (1984)
Buuren, S., Groothuis-Oudshoorn, K.: MICE: multivariate imputation by chained equations in R. J. Stat. Soft. 45, 1–67 (2011)
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20, 273–297 (1995)
Cunningham, P., Delany, S.J.: k-Nearest Neighbour classifiers. In: Multiple Classifier Systems, pp. 1–17 (2007)
Draper, N.R., Smith, H., Pownell, E.: Applied Regression Analysis, vol. 3. Wiley, New York (1966)
Farhangfar, A., Kurgan, L., Dy, J.: Impact of imputation of missing values on classification error for discrete data. Pattern Recogn. 41, 3692–3705 (2008)
Farhangfar, A., Kurgan, L.A., Pedrycz, W.: A novel framework for imputation of missing values in databases. IEEE Trans. Syst. Man Cybern. Part A: Syst. Hum. 37, 692–709 (2007)
García-Laencina, P.J., Sancho-Gómez, J.L., Figueiras-Vidal, A.R.: Pattern classification with missing data: a review. Neural Comput. Appl. 19, 263–282 (2010)
Graham, J.W.: Missing data analysis: making it work in the real world. Ann. Rev. Psychol. 60, 549–576 (2009)
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newslett. 11, 10–18 (2009)
Han, J., Kamber, M., Pei, J.: Data Mining, Southeast Asia Edition: Concepts and Techniques. Morgan Kaufmann, San Francisco (2006)
Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359–366 (1989)
Keijzer, M.: Improving symbolic regression with interval arithmetic and linear scaling. In: Genetic programming, pp. 70–82 (2003)
Kleinbaum, D., Kupper, L., Nizam, A., Rosenberg, E.: Applied regression analysis and other multivariable methods. Cengage Learning (2013)
Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge (1992)
Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2, 18–22 (2002)
Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data. Wiley-Interscience, New York (2002)
Luke, S., Panait, L., Balan, G., Paus, S., Skolicki, Z., Bassett, J., Hubley, R., Chircop, A.: ECJ: A Java-based evolutionary computation research system (2006). Downloadable versions and documentation can be found at http://cs.gmu.edu/eclab/projects/ecj
Minka, T.: Bayesian linear regression. Technical report, 3594 Security Ticket Control (1999)
Murphy, K.P.: Naive Bayes classifiers. University of British Columbia (2006)
Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco (1993)
Schafer, J.L.: Analysis of Incomplete Multivariate Data. Monographs on Statistics & Applied Probability. Chapman & Hall/CRC, New York (1997)
Schafer, J.L.: Analysis of Incomplete Multivariate Data. CRC Press, New York (1997)
Silva, S., Dignum, S., Vanneschi, L.: Operator equalisation for bloat free genetic programming and a survey of bloat control methods. Genet. Program. Evolvable Mach. 13, 197–238 (2012)
Topchy, A., Punch, W.F.: Faster genetic programming based on local gradient search of numeric leaf values. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), pp. 155–162 (2001)
Tran, C.T., Zhang, M., Andreae, P.: Multiple imputation for missing data using genetic programming. In: Proceedings of the 2015 on Genetic and Evolutionary Computation Conference, pp. 583–590 (2015)
Uy, N.Q., Hoai, N.X., O’Neill, M., Mckay, R.I., Galván-López, E.: Semantically-based crossover in genetic programming: application to real-valued symbolic regression. Genet. Program. Evolvable Mach. 12, 91–119 (2011)
Van Buuren, S., Oudshoorn, C.: Multivariate imputation by chained equations. MICE V1. 0 user’s manual. Leiden: TNO Preventie en Gezondheid (2000)
Van Buuren, S., Oudshoorn, K.: Flexible multivariate imputation by MICE. Technical report, PG/VGZ/99.054: TNO Prevention and Health, Leiden (1999)
Vladislavleva, E.J., Smits, G.F., Den Hertog, D.: Order of nonlinearity as a complexity measure for models generated by symbolic regression via pareto genetic programming. IEEE Trans. Evol. Comput. 13, 333–349 (2009)
White, I.R., Royston, P., Wood, A.M.: Multiple imputation using chained equations: issues and guidance for practice. Stat. Med. 30, 377–399 (2011)
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Tran, C.T., Zhang, M., Andreae, P. (2016). A Genetic Programming-Based Imputation Method for Classification with Missing Data. In: Heywood, M., McDermott, J., Castelli, M., Costa, E., Sim, K. (eds) Genetic Programming. EuroGP 2016. Lecture Notes in Computer Science, vol 9594. Springer, Cham. https://doi.org/10.1007/978-3-319-30668-1_10
DOI: https://doi.org/10.1007/978-3-319-30668-1_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-30667-4
Online ISBN: 978-3-319-30668-1
eBook Packages: Computer Science, Computer Science (R0)