A Genetic Programming-Based Imputation Method for Classification with Missing Data

Tran, Cao Truong; Zhang, Mengjie; Andreae, Peter

doi:10.1007/978-3-319-30668-1_10

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9594)

Included in the following conference series: European Conference on Genetic Programming

Abstract

Many industrial and real-world datasets suffer from the unavoidable problem of missing values. The ability to deal with missing values is an essential requirement for classification because inadequate treatment of missing values may lead to large classification errors. The problem of missing data has been addressed extensively in the statistics literature and, to a lesser extent, in the classification literature. One of the most popular approaches to dealing with missing data is to use imputation methods to fill missing values with plausible values. Some powerful imputation methods, such as the regression-based imputations in MICE [36], are often suitable for batch imputation tasks. However, they are often expensive when used to impute missing values for each individual incomplete instance in the unseen set to be classified. This paper proposes a genetic programming-based imputation (GPI) method for classification with missing data that uses genetic programming as a regression method to impute missing values. Experiments on six benchmark datasets and five popular classifiers compare GPI with five other popular and advanced regression-based imputation methods in MICE on two measures: classification accuracy and computation time. The results show that, in most cases, GPI achieves classification accuracy at least as good as the other imputation methods, and sometimes significantly better. Moreover, using GPI to impute missing values for each individual incomplete instance is dramatically faster than using the other imputation methods.
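
As a rough illustration of the idea in the abstract, the Python sketch below evolves one small expression tree per feature from the complete training instances, then reuses those trees to fill a missing value in a new instance with a single tree evaluation. It is a toy under stated assumptions, not the authors' implementation: the function set, the mutation-only search, the population settings, and all names (`evolve`, `fit_imputers`, `impute`) are hypothetical, and the sketch handles only one missing value per instance.

```python
import math
import operator
import random

# Assumed function set; the abstract does not say which operators GPI uses.
FUNCS = [operator.add, operator.sub, operator.mul]


def random_tree(n_vars, depth, rng):
    """Grow a random expression tree over variables and random constants."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.7:
            return ("var", rng.randrange(n_vars))
        return ("const", rng.uniform(-1.0, 1.0))
    return ("op", rng.choice(FUNCS),
            random_tree(n_vars, depth - 1, rng),
            random_tree(n_vars, depth - 1, rng))


def evaluate(tree, row):
    """Evaluate an expression tree on one row of input values."""
    if tree[0] == "var":
        return row[tree[1]]
    if tree[0] == "const":
        return tree[1]
    _, fn, left, right = tree
    return fn(evaluate(left, row), evaluate(right, row))


def mse(tree, X, y):
    """Mean squared error of a tree; non-finite errors count as infinitely bad."""
    total = 0.0
    for row, target in zip(X, y):
        diff = evaluate(tree, row) - target
        total += diff * diff
    err = total / len(y)
    return err if math.isfinite(err) else float("inf")


def mutate(tree, n_vars, rng):
    """Replace one randomly chosen subtree with a fresh random subtree."""
    if tree[0] != "op" or rng.random() < 0.3:
        return random_tree(n_vars, 2, rng)
    _, fn, left, right = tree
    if rng.random() < 0.5:
        return ("op", fn, mutate(left, n_vars, rng), right)
    return ("op", fn, left, mutate(right, n_vars, rng))


def evolve(X, y, rng, pop_size=40, generations=20):
    """Mutation-only GP with elitism (a simplification of full GP)."""
    n_vars = len(X[0])
    pop = [random_tree(n_vars, 3, rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda t: mse(t, X, y))
        elite = pop[: pop_size // 4]
        children = [mutate(rng.choice(elite), n_vars, rng)
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    return min(pop, key=lambda t: mse(t, X, y))


def fit_imputers(train, rng):
    """Evolve one regression tree per feature from the complete training rows."""
    complete = [r for r in train if None not in r]
    models = {}
    for j in range(len(train[0])):
        X = [r[:j] + r[j + 1:] for r in complete]
        y = [r[j] for r in complete]
        models[j] = evolve(X, y, rng)
    return models


def impute(row, models):
    """Fill a single missing value (None) using the tree evolved for that feature."""
    missing = [j for j, v in enumerate(row) if v is None]
    filled = list(row)
    if len(missing) != 1:
        return filled  # this sketch only handles exactly one missing value
    j = missing[0]
    filled[j] = evaluate(models[j], row[:j] + row[j + 1:])
    return filled


# Synthetic data (made up for illustration): the third feature is the sum of the first two.
rng = random.Random(0)
train = []
for _ in range(20):
    a, b = rng.uniform(0, 2), rng.uniform(0, 2)
    train.append([a, b, a + b])

models = fit_imputers(train, rng)      # expensive step, done once on training data
print(impute([1.0, 0.5, None], models))  # cheap step, one tree evaluation per instance
```

The split between `fit_imputers` and `impute` mirrors the cost argument in the abstract: the search for a regression function happens once, and imputing each unseen incomplete instance afterwards only requires evaluating an already evolved expression.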

References

Agapitos, A., Brabazon, A., O’Neill, M.: Controlling overfitting in symbolic regression based on a bias/variance error decomposition. In: Coello, C.A.C., Cutello, V., Deb, K., Forrest, S., Nicosia, G., Pavone, M. (eds.) PPSN 2012, Part I. LNCS, vol. 7491, pp. 438–447. Springer, Heidelberg (2012)
Andridge, R.R., Little, R.J.: A review of hot deck imputation for survey non-response. Int. Stat. Rev. 78, 40–64 (2010)
Asuncion, A., Newman, D.: UCI machine learning repository (2007). http://www.ics.uci.edu/~mlearn/MLRepository.html
Augusto, D.A., Barbosa, H.J.: Symbolic regression via genetic programming. In: Sixth Brazilian Symposium on Neural Networks, 2000, Proceedings, pp. 173–178 (2000)
Barmpalexis, P., Kachrimanis, K., Tsakonas, A., Georgarakis, E.: Symbolic regression via genetic programming in the optimization of a controlled release pharmaceutical formulation. Chemometr. Intell. Lab. Syst. 107, 75–82 (2011)
Barnard, J., Meng, X.L.: Applications of multiple imputation in medical studies: from AIDS to NHANES. Stat. Methods Med. Res. 8, 7–36 (1999)
Berger, J.O.: Statistical Decision Theory and Bayesian Analysis. Springer Science & Business Media, New York (2013)
Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and Regression Trees. CRC Press, Boca Raton (1984)
Buuren, S., Groothuis-Oudshoorn, K.: MICE: multivariate imputation by chained equations in R. J. Stat. Soft. 45, 1–67 (2011)
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20, 273–297 (1995)
Cunningham, P., Delany, S.J.: k-Nearest Neighbour classifiers. In: Multiple Classifier Systems, pp. 1–17 (2007)
Draper, N.R., Smith, H., Pownell, E.: Applied Regression Analysis, vol. 3. Wiley, New York (1966)
Farhangfar, A., Kurgan, L., Dy, J.: Impact of imputation of missing values on classification error for discrete data. Pattern Recogn. 41, 3692–3705 (2008)
Farhangfar, A., Kurgan, L.A., Pedrycz, W.: A novel framework for imputation of missing values in databases. IEEE Trans. Syst. Man Cybern. Part A: Syst. Hum. 37, 692–709 (2007)
García-Laencina, P.J., Sancho-Gómez, J.L., Figueiras-Vidal, A.R.: Pattern classification with missing data: a review. Neural Comput. Appl. 19, 263–282 (2010)
Graham, J.W.: Missing data analysis: making it work in the real world. Ann. Rev. Psychol. 60, 549–576 (2009)
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newslett. 11, 10–18 (2009)
Han, J., Kamber, M., Pei, J.: Data Mining, Southeast Asia Edition: Concepts and Techniques. Morgan Kaufmann, San Francisco (2006)
Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359–366 (1989)
Keijzer, M.: Improving symbolic regression with interval arithmetic and linear scaling. In: Genetic programming, pp. 70–82 (2003)
Kleinbaum, D., Kupper, L., Nizam, A., Rosenberg, E.: Applied regression analysis and other multivariable methods. Cengage Learning (2013)
Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge (1992)
Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2, 18–22 (2002)
Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data. Wiley-Interscience, New York (2002)
Luke, S., Panait, L., Balan, G., Paus, S., Skolicki, Z., Bassett, J., Hubley, R., Chircop, A.: ECJ: a Java-based evolutionary computation research system (2006). http://cs.gmu.edu/eclab/projects/ecj
Minka, T.: Bayesian linear regression. Technical report, 3594 Security Ticket Control (1999)
Murphy, K.P.: Naive Bayes classifiers. University of British Columbia (2006)
Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco (1993)
Schafer, J.L.: Analysis of Incomplete Multivariate Data. Monographs on Statistics & Applied Probability. Chapman & Hall/CRC, New York (1997)
Schafer, J.L.: Analysis of Incomplete Multivariate Data. CRC Press, New York (1997)
Silva, S., Dignum, S., Vanneschi, L.: Operator equalisation for bloat free genetic programming and a survey of bloat control methods. Genet. Program. Evolvable Mach. 13, 197–238 (2012)
Topchy, A., Punch, W.F.: Faster genetic programming based on local gradient search of numeric leaf values. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), pp. 155–162 (2001)
Tran, C.T., Zhang, M., Andreae, P.: Multiple imputation for missing data using genetic programming. In: Proceedings of the 2015 on Genetic and Evolutionary Computation Conference, pp. 583–590 (2015)
Uy, N.Q., Hoai, N.X., O’Neill, M., Mckay, R.I., Galván-López, E.: Semantically-based crossover in genetic programming: application to real-valued symbolic regression. Genet. Program. Evolvable Mach. 12, 91–119 (2011)
Van Buuren, S., Oudshoorn, C.: Multivariate imputation by chained equations. MICE V1. 0 user’s manual. Leiden: TNO Preventie en Gezondheid (2000)
Van Buuren, S., Oudshoorn, K.: Flexible multivariate imputation by MICE. Technical report, PG/VGZ/99.054: TNO Prevention and Health, Leiden (1999)
Vladislavleva, E.J., Smits, G.F., Den Hertog, D.: Order of nonlinearity as a complexity measure for models generated by symbolic regression via pareto genetic programming. IEEE Trans. Evol. Comput. 13, 333–349 (2009)
White, I.R., Royston, P., Wood, A.M.: Multiple imputation using chained equations: issues and guidance for practice. Stat. Med. 30, 377–399 (2011)

Author information

Authors and Affiliations

School of Engineering and Computer Science, Victoria University of Wellington, PO Box 600, Wellington, 6140, New Zealand
Cao Truong Tran, Mengjie Zhang & Peter Andreae

Corresponding author

Correspondence to Cao Truong Tran.

Editor information

Editors and Affiliations

Dalhousie University, Halifax, Nova Scotia, Canada
Malcolm I. Heywood
University College Dublin, Dublin, Ireland
James McDermott
Universidade Nova de Lisboa, Lisboa, Portugal
Mauro Castelli
CIUSC Dept Informatics Engg, University of Coimbra, Coimbra, Portugal
Ernesto Costa
Edinburgh Napier University, Edinburgh, United Kingdom
Kevin Sim

Cite this paper

Tran, C.T., Zhang, M., Andreae, P. (2016). A Genetic Programming-Based Imputation Method for Classification with Missing Data. In: Heywood, M., McDermott, J., Castelli, M., Costa, E., Sim, K. (eds) Genetic Programming. EuroGP 2016. Lecture Notes in Computer Science, vol 9594. Springer, Cham. https://doi.org/10.1007/978-3-319-30668-1_10

DOI: https://doi.org/10.1007/978-3-319-30668-1_10
Published: 24 March 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-30667-4
Online ISBN: 978-3-319-30668-1
eBook Packages: Computer Science, Computer Science (R0)
