Abstract
Symbolic regression via genetic programming has been used successfully for empirical modeling from given data sets. However, real-world data sets might contain missing values. Although there are different approaches to dealing with incomplete data sets for classification, symbolic regression with missing values has been rarely investigated. Similarly, only a few studies have been conducted on feature selection for symbolic regression, but none of them addresses the incompleteness issue. In this work, a genetic programming-based method for simultaneous imputation and feature selection is developed. This method selects the predictive features for the incomplete features whilst constructing their imputation models. Such models are designed to be suitable for data sets with mixed numerical and categorical features. The performance of the proposed method is compared with state-of-the-art widely used imputation methods from three aspects: the imputation accuracy, the feature selection effectiveness, and the symbolic regression performance.
Keywords
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsReferences
Al-Helali, B., Chen, Q., Xue, B., Zhang, M.: A hybrid GP-KNN imputation for symbolic regression with missing values. In: Mitrovic, T., Xue, B., Li, X. (eds.) AI 2018. LNCS (LNAI), vol. 11320, pp. 345–357. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-03991-2_33
Arslan, S., Ozturk, C.: Multi hive artificial bee colony programming for high dimensional symbolic regression with feature selection. Appl. Soft Comput. 78, 515–527 (2019)
Austel, V., et al.: Globally optimal symbolic regression. arXiv preprint arXiv:1710.10720 (2017)
Bhardwaj, H., Sakalle, A., Bhardwaj, A., Tiwari, A., Verma, M.: Breast cancer diagnosis using simultaneous feature selection and classification: a genetic programming approach. In: 2018 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 2186–2192. IEEE (2018)
Buuren, S.V., Groothuis-Oudshoorn, K.: MICE: multivariate imputation by chained equations in R. J. Stat. softw. 15, 1–68 (2010)
Chen, Q., Zhang, M., Xue, B.: Feature selection to improve generalization of genetic programming for high-dimensional symbolic regression. IEEE Trans. Evol. Comput. 21(5), 792–806 (2017)
Davidson, J.W., Savic, D.A., Walters, G.A.: Symbolic and numerical regression: experiments and applications. Inf. Sci. 150(1–2), 95–117 (2003)
Donders, A.R.T., Van Der Heijden, G.J., Stijnen, T., Moons, K.G.: A gentle introduction to imputation of missing values. J. Clin. Epidemiol. 59(10), 1087–1091 (2006)
Fortin, F.A., Rainville, F.M.D., Gardner, M.A., Parizeau, M., Gagné, C.: Deap: evolutionary algorithms made easy. J. Mach. Learn. Res. 13(Jul), 2171–2175 (2012)
García-Laencina, P.J., Sancho-Gómez, J.L., Figueiras-Vidal, A.R.: Pattern classification with missing data: a review. Neural Comput. Appl. 19(2), 263–282 (2010)
Koza, J.R.: Genetic Programming II, Automatic Discovery of Reusable Subprograms. MIT Press, Cambridge (1992)
Loh, P.L., Wainwright, M.J.: High-dimensional regression with noisy and missing data: provable guarantees with non-convexity. In: Advances in Neural Information Processing Systems, pp. 2726–2734 (2011)
Nag, K., Pal, N.R.: Genetic programming for classification and feature selection. In: Bansal, J.C., Singh, P.K., Pal, N.R. (eds.) Evolutionary and Swarm Intelligence Algorithms. SCI, vol. 779, pp. 119–141. Springer, Cham (2019). https://doi.org/10.1007/978-3-319-91341-4_7
Quinlan, J.R.: C4.5: Programs for Machine Learning. Elsevier, San Francisco (2014)
Tran, C.T., Zhang, M., Andreae, P.: Multiple imputation for missing data using genetic programming. In: Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, pp. 583–590. ACM (2015)
Tran, C.T., Zhang, M., Andreae, P.: A genetic programming-based imputation method for classification with missing data. In: Heywood, M.I., McDermott, J., Castelli, M., Costa, E., Sim, K. (eds.) EuroGP 2016. LNCS, vol. 9594, pp. 149–163. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-30668-1_10
Tran, C.T., Zhang, M., Andreae, P., Xue, B.: Multiple imputation and genetic programming for classification with incomplete data. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 521–528. ACM (2017)
Vanschoren, J., Van Rijn, J.N., Bischl, B., Torgo, L.: Openml: networked science in machine learning. ACM SIGKDD Explor. Newsl. 15(2), 49–60 (2014)
Viegas, F., et al.: A genetic programming approach for feature selection in highly dimensional skewed data. Neurocomputing 273, 554–569 (2018)
Xue, B., Zhang, M.: Evolutionary computation for feature manipulation: key challenges and future directions. In: 2016 IEEE Congress on Evolutionary Computation (CEC), pp. 3061–3067. IEEE (2016)
Xue, B., Zhang, M.: Evolutionary feature manipulation in data mining/big data. ACM SIGEVOlution 10(1), 4–11 (2017)
Zhang, M., Ciesielski, V.: Genetic programming for multiple class object detection. In: Foo, N. (ed.) AI 1999. LNCS (LNAI), vol. 1747, pp. 180–192. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-46695-9_16
Author information
Authors and Affiliations
Corresponding authors
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Al-Helali, B., Chen, Q., Xue, B., Zhang, M. (2020). Genetic Programming-Based Simultaneous Feature Selection and Imputation for Symbolic Regression with Incomplete Data. In: Palaiahnakote, S., Sanniti di Baja, G., Wang, L., Yan, W. (eds) Pattern Recognition. ACPR 2019. Lecture Notes in Computer Science(), vol 12047. Springer, Cham. https://doi.org/10.1007/978-3-030-41299-9_44
Download citation
DOI: https://doi.org/10.1007/978-3-030-41299-9_44
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-41298-2
Online ISBN: 978-3-030-41299-9
eBook Packages: Computer ScienceComputer Science (R0)