Skip to main content

A Genetic Programming-Based Imputation Method for Classification with Missing Data

  • Conference paper
  • First Online:
Genetic Programming (EuroGP 2016)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 9594))

Included in the following conference series:

Abstract

Many industrial and real-world datasets suffer from an unavoidable problem of missing values. The ability to deal with missing values is an essential requirement for classification because inadequate treatment of missing values may lead to large errors on classification. The problem of missing data has been addressed extensively in the statistics literature, and also, but to a lesser extent in the classification literature. One of the most popular approaches to deal with missing data is to use imputation methods to fill missing values with plausible values. Some powerful imputation methods such as regression-based imputations in MICE [36] are often suitable for batch imputation tasks. However, they are often expensive to impute missing values for every single incomplete instance in the unseen set for classification. This paper proposes a genetic programming-based imputation (GPI) method for classification with missing data that uses genetic programming as a regression method to impute missing values. The experiments on six benchmark datasets and five popular classifiers compare GPI with five other popular and advanced regression-based imputation methods in MICE on two measures: classification accuracy and computation time. The results showed that, in most cases, GPI achieves classification accuracy at least as good as the other imputation methods, and sometimes significantly better. However, using GPI to impute missing values for every single incomplete instance is dramatically faster than the other imputation methods.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Agapitos, A., Brabazon, A., O’Neill, M.: Controlling overfitting in symbolic regression based on a bias/variance error decomposition. In: Coello, C.A.C., Cutello, V., Deb, K., Forrest, S., Nicosia, G., Pavone, M. (eds.) PPSN 2012, Part I. LNCS, vol. 7491, pp. 438–447. Springer, Heidelberg (2012)

    Chapter  Google Scholar 

  2. Andridge, R.R., Little, R.J.: A review of hot deck imputation for survey non-response. Int. Stat. Rev. 78, 40–64 (2010)

    Article  Google Scholar 

  3. Asuncion, A., Newman, D.: UCI machine learning repository (2007). http://www.ics.uci.edu/~mlearn/MLRepository.html

  4. Augusto, D.A., Barbosa, H.J.: Symbolic regression via genetic programming. In: Sixth Brazilian Symposium on Neural Networks, 2000, Proceedings, pp. 173–178 (2000)

    Google Scholar 

  5. Barmpalexis, P., Kachrimanis, K., Tsakonas, A., Georgarakis, E.: Symbolic regression via genetic programming in the optimization of a controlled release pharmaceutical formulation. Chemometr. Intell. Lab. Syst. 107, 75–82 (2011)

    Article  Google Scholar 

  6. Barnard, J., Meng, X.L.: Applications of multiple imputation in medical studies: from AIDS to NHANES. Stat. Methods Med. Res. 8, 7–36 (1999)

    Article  Google Scholar 

  7. Berger, J.O.: Statistical Decision Theory and Bayesian Analysis. Springer Science & Business Media, New York (2013)

    Google Scholar 

  8. Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and Regression Trees. CRC Press, Boca Raton (1984)

    MATH  Google Scholar 

  9. Buuren, S., Groothuis-Oudshoorn, K.: MICE: multivariate imputation by chained equations in R. J. Stat. Soft. 45, 1–67 (2011)

    Article  Google Scholar 

  10. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20, 273–297 (1995)

    MATH  Google Scholar 

  11. Cunningham, P., Delany, S.J.: k-Nearest Neighbour classifiers. In: Multiple Classifier Systems, pp. 1–17 (2007)

    Google Scholar 

  12. Draper, N.R., Smith, H., Pownell, E.: Applied Regression Analysis, vol. 3. Wiley, New York (1966)

    Google Scholar 

  13. Farhangfar, A., Kurgan, L., Dy, J.: Impact of imputation of missing values on classification error for discrete data. Pattern Recogn. 41, 3692–3705 (2008)

    Article  MATH  Google Scholar 

  14. Farhangfar, A., Kurgan, L.A., Pedrycz, W.: A novel framework for imputation of missing values in databases. IEEE Trans. Syst. Man Cybern. Part A: Syst. Hum. 37, 692–709 (2007)

    Article  Google Scholar 

  15. García-Laencina, P.J., Sancho-Gómez, J.L., Figueiras-Vidal, A.R.: Pattern classification with missing data: a review. Neural Comput. Appl. 19, 263–282 (2010)

    Article  Google Scholar 

  16. Graham, J.W.: Missing data analysis: making it work in the real world. Ann. Rev. Psychol. 60, 549–576 (2009)

    Article  Google Scholar 

  17. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newslett. 11, 10–18 (2009)

    Article  Google Scholar 

  18. Han, J., Kamber, M., Pei, J.: Data Mining, Southeast Asia Edition: Concepts and Techniques. Morgan Kaufmann, San Francisco (2006)

    MATH  Google Scholar 

  19. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359–366 (1989)

    Article  Google Scholar 

  20. Keijzer, M.: Improving symbolic regression with interval arithmetic and linear scaling. In: Genetic programming, pp. 70–82 (2003)

    Google Scholar 

  21. Kleinbaum, D., Kupper, L., Nizam, A., Rosenberg, E.: Applied regression analysis and other multivariable methods. Cengage Learning (2013)

    Google Scholar 

  22. Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge (1992)

    MATH  Google Scholar 

  23. Liaw, A., Wiener, M.: Classification and regression by randomforest. R News 2, 18–22 (2002)

    Google Scholar 

  24. Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data. Wiley-Interscience, New York (2002)

    Book  MATH  Google Scholar 

  25. Luke, S., Panait, L., Balan, G., Paus, S., Skolicki, Z., Bassett, J., Hubley, R., Chircop, A.: ECJ: A java-based evolutionary computation research system (2006) Downloadable versions and documentation can be found at the following http://cs.gmu.edu/eclab/projects/ecj

  26. Minka, T.: Bayesian linear regression. Technical report, 3594 Security Ticket Control (1999)

    Google Scholar 

  27. Murphy, K.P.: Naive Bayes classifiers. University of British Columbia (2006)

    Google Scholar 

  28. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco (1993)

    Google Scholar 

  29. Schafer, J.L.: Analysis of Incomplete Multivariate Data. Monographs on Statistics & Applied Probability. Chapman & Hall/CRC, New York (1997)

    Book  MATH  Google Scholar 

  30. Schafer, J.L.: Analysis of Incomplete Multivariate Data. CRC Press, New York (1997)

    Book  MATH  Google Scholar 

  31. Silva, S., Dignum, S., Vanneschi, L.: Operator equalisation for bloat free genetic programming and a survey of bloat control methods. Genet. Program. Evolvable Mach. 13, 197–238 (2012)

    Article  Google Scholar 

  32. Topchy, A., Punch, W.F.: Faster genetic programming based on local gradient search of numeric leaf values. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), vol. 155162 (2001)

    Google Scholar 

  33. Tran, C.T., Zhang, M., Andreae, P.: Multiple imputation for missing data using genetic programming. In: Proceedings of the 2015 on Genetic and Evolutionary Computation Conference, pp. 583–590 (2015)

    Google Scholar 

  34. Uy, N.Q., Hoai, N.X., O’Neill, M., Mckay, R.I., Galván-López, E.: Semantically-based crossover in genetic programming: application to real-valued symbolic regression. Genet. Program. Evolvable Mach. 12, 91–119 (2011)

    Article  Google Scholar 

  35. Van Buuren, S., Oudshoorn, C.: Multivariate imputation by chained equations. MICE V1. 0 user’s manual. Leiden: TNO Preventie en Gezondheid (2000)

    Google Scholar 

  36. Van Buuren, S., Oudshoorn, K.: Flexible multivariate imputation by MICE. Technical report, PG/VGZ/99.054: TNO Prevention and Health, Leiden (1999)

    Google Scholar 

  37. Vladislavleva, E.J., Smits, G.F., Den Hertog, D.: Order of nonlinearity as a complexity measure for models generated by symbolic regression via pareto genetic programming. IEEE Trans. Evol. Comput. 13, 333–349 (2009)

    Article  Google Scholar 

  38. White, I.R., Royston, P., Wood, A.M.: Multiple imputation using chained equations: issues and guidance for practice. Stat. Med. 30, 377–399 (2011)

    Article  MathSciNet  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Cao Truong Tran .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Tran, C.T., Zhang, M., Andreae, P. (2016). A Genetic Programming-Based Imputation Method for Classification with Missing Data. In: Heywood, M., McDermott, J., Castelli, M., Costa, E., Sim, K. (eds) Genetic Programming. EuroGP 2016. Lecture Notes in Computer Science(), vol 9594. Springer, Cham. https://doi.org/10.1007/978-3-319-30668-1_10

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-30668-1_10

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-30667-4

  • Online ISBN: 978-3-319-30668-1

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics