Abstract
Data incompleteness is a serious issue in real-world applications of machine learning. Imputation methods are algorithms that restore missing values in the data based on the other available entries. Because the imputation method influences learning performance on incomplete data, choosing the right one plays an important role when constructing prediction models. It is common to use a single imputation method to impute all the incomplete features. However, an imputation method that works well for some features might not be suitable for others, so it would be more useful to select the right imputation method for each feature. In fact, selecting one imputation method for a whole data set is still a challenging issue, let alone selecting different imputation methods for all incomplete features. Therefore, this work proposes the use of genetic programming (GP) to search for the right combination of imputation methods for symbolic regression. The role of GP is to select imputation methods for the incomplete features and to evolve symbolic regression models. It incorporates a heterogeneous set of imputation methods as part of the symbolic regression process. The results show that the proposed method can automatically find the most effective combination of imputation methods for a variety of incomplete regression data sets.
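The idea of choosing a separate imputation method for each incomplete feature, and judging the combination by regression error, can be illustrated with a minimal sketch. This is not the method proposed in the paper: exhaustive enumeration stands in for the GP search, a fixed hand-written expression (`model`) stands in for an evolved symbolic regression model, and the imputer pool (`IMPUTERS`), the helper names, and the toy data are all hypothetical.

```python
from itertools import product
from statistics import mean, median

# Hypothetical pool of simple imputers: each maps the observed values
# of one feature to a single fill value.
IMPUTERS = {"mean": mean, "median": median, "min": min}


def impute(rows, combo):
    """Return a copy of rows with None entries filled column by column.

    combo maps a column index to the name of the imputer used for it.
    """
    filled = [list(r) for r in rows]
    for col, name in combo.items():
        observed = [r[col] for r in rows if r[col] is not None]
        fill = IMPUTERS[name](observed)
        for r in filled:
            if r[col] is None:
                r[col] = fill
    return filled


def mse(model, rows, targets):
    """Mean squared error of model(row) against the targets."""
    return mean([(model(r) - t) ** 2 for r, t in zip(rows, targets)])


def best_combination(rows, targets, model):
    """Try every assignment of imputers to the incomplete columns.

    Exhaustive search is an assumption made for brevity; it replaces
    the evolutionary search described in the abstract.
    """
    incomplete = sorted({c for r in rows for c, v in enumerate(r) if v is None})
    best = None
    for names in product(IMPUTERS, repeat=len(incomplete)):
        combo = dict(zip(incomplete, names))
        score = mse(model, impute(rows, combo), targets)
        if best is None or score < best[1]:
            best = (combo, score)
    return best


# Toy data: None marks a missing value.
rows = [
    [1.0, 10.0],
    [2.0, None],
    [None, 30.0],
    [4.0, 100.0],
    [100.0, 20.0],
]
targets = [11.0, 22.0, 33.0, 104.0, 120.0]


def model(r):
    # Hypothetical fixed expression standing in for an evolved model.
    return r[0] + r[1]


combo, score = best_combination(rows, targets, model)
print(combo, score)
```

The number of combinations grows exponentially with the number of incomplete features, so enumeration is only feasible for toy cases; that growth is one reason a heuristic search such as GP is attractive for this problem.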
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Al-Helali, B., Chen, Q., Xue, B., Zhang, M. (2020). Genetic Programming-Based Selection of Imputation Methods in Symbolic Regression with Missing Values. In: Gallagher, M., Moustafa, N., Lakshika, E. (eds) AI 2020: Advances in Artificial Intelligence. AI 2020. Lecture Notes in Computer Science, vol 12576. Springer, Cham. https://doi.org/10.1007/978-3-030-64984-5_13
DOI: https://doi.org/10.1007/978-3-030-64984-5_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-64983-8
Online ISBN: 978-3-030-64984-5
eBook Packages: Computer Science, Computer Science (R0)