Hessian Complexity Measure for Genetic Programming-Based Imputation Predictor Selection in Symbolic Regression with Incomplete Data

Al-Helali, Baligh; Chen, Qi; Xue, Bing; Zhang, Mengjie

doi:10.1007/978-3-030-44094-7_1

Hessian Complexity Measure for Genetic Programming-Based Imputation Predictor Selection in Symbolic Regression with Incomplete Data

Baligh Al-Helali¹²,
Qi Chen¹²,
Bing Xue¹² &
…
Mengjie Zhang¹²

Conference paper
First Online: 09 April 2020

806 Accesses
8 Citations

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 12101))

Abstract

Missing values bring several challenges when learning from real-world data sets. Imputation is a widely adopted approach to estimating missing values. However, it has not been adequately investigated in symbolic regression. When imputing the missing values in an incomplete feature, the other features that are used in the prediction process are called imputation predictors. In this work, a method for imputation predictor selection using regularized genetic programming (GP) models is presented for symbolic regression tasks on incomplete data. A complexity measure based on the Hessian matrix of the phenotype of the evolving models is proposed. It is employed as a regularizer in the fitness function of GP for model selection and the imputation predictors are selected from the selected models. In addition to the baseline which uses all the available predictors, the proposed selection method is compared with two GP-based feature selection variations: the standard GP feature selector and GP with feature selection pressure. The trends in the results reveal that in most cases, using the predictors selected by regularized GP models could achieve a considerable reduction in the imputation error and improve the symbolic regression performance as well.

This is a preview of subscription content, log in via an institution.

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

References

Al-Helali, B., Chen, Q., Xue, B., Zhang, M.: A hybrid GP-KNN imputation for symbolic regression with missing values. In: Mitrovic, T., Xue, B., Li, X. (eds.) AI 2018. LNCS (LNAI), vol. 11320, pp. 345–357. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-03991-2_33
Chapter Google Scholar
Arslan, S., Ozturk, C.: Multi hive artificial bee colony programming for high dimensional symbolic regression with feature selection. Appl. Soft Comput. 78, 515–527 (2019)
Article Google Scholar
Burnham, K.P., Anderson, D.R.: Model Selection and Multi-model Inference: A Practical Information-Theoretic Approach, 2nd edn. Springer, New York (2002). https://doi.org/10.1007/b97636
Book MATH Google Scholar
Camargos, V.P., César, C.C., Caiaffa, W.T., Xavier, C.C., Proietti, F.A.: Multiple imputation and complete case analysis in logistic regression models: a practical assessment of the impact of incomplete covariate data. Cadernos de saude publica 27(12), 2299–2313 (2011)
Article Google Scholar
Chen, Q.: Improving the generalisation of genetic programming for symbolic regression. Ph.D. thesis, Victoria University of Wellington (2018)
Google Scholar
Chen, Q., Xue, B., Shang, L., Zhang, M.: Improving generalisation of genetic programming for symbolic regression with structural risk minimisation. In: Proceedings of the Genetic and Evolutionary Computation Conference 2016, pp. 709–716. ACM (2016)
Google Scholar
Chen, Q., Zhang, M., Xue, B.: Feature selection to improve generalization of genetic programming for high-dimensional symbolic regression. IEEE Trans. Evol. Comput. 21(5), 792–806 (2017)
Article Google Scholar
Chen, Q., Zhang, M., Xue, B.: Structural risk minimisation-driven genetic programming for enhancing generalisation in symbolic regression. IEEE Trans. Evol. Comput. (2018)
Google Scholar
Donders, A.R.T., Van Der Heijden, G.J., Stijnen, T., Moons, K.G.: A gentle introduction to imputation of missing values. J. Clin. Epidemiol. 59(10), 1087–1091 (2006)
Article Google Scholar
Dubčáková, R.: Eureqa: software review. Genet. Program. Evolvable Mach. 12(2), 173–178 (2011). https://doi.org/10.1007/s10710-010-9124-z
Article Google Scholar
Fortin, F.A., Rainville, F.M.D., Gardner, M.A., Parizeau, M., Gagné, C.: DEAP: evolutionary algorithms made easy. J. Mach. Learn. Res. 13, 2171–2175 (2012)
MathSciNet Google Scholar
Heidt, K.: Comparison of imputation methods for mixed data missing at random (2019)
Google Scholar
Keijzer, M.: Improving symbolic regression with interval arithmetic and linear scaling. In: Ryan, C., Soule, T., Keijzer, M., Tsang, E., Poli, R., Costa, E. (eds.) EuroGP 2003. LNCS, vol. 2610, pp. 70–82. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-36599-0_7
Chapter Google Scholar
Korns, M.F., May, T.: Strong typing, swarm enhancement, and deep learning feature selection in the pursuit of symbolic regression-classification. In: Banzhaf, W., Spector, L., Sheneman, L. (eds.) Genetic Programming Theory and Practice XVI. GEC, pp. 59–84. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-04735-1_4
Chapter Google Scholar
Koyré, A.: The Astronomical Revolution: Copernicus-Kepler-Borelli. Routledge, New York (2013)
Book Google Scholar
Koza, J.R.: Genetic Programming II, Automatic Discovery of Reusable Subprograms. MIT Press, Cambridge (1992)
Google Scholar
Le, N., Xuan, H.N., Brabazon, A., Thi, T.P.: Complexity measures in genetic programming learning: a brief review. In: IEEE Congress on Evolutionary Computation (CEC), pp. 2409–2416. IEEE (2016)
Google Scholar
Lin, W.-C., Tsai, C.-F.: Missing value imputation: a review and analysis of the literature (2006–2017). Artif. Intell. Rev. 53, 1487–1509 (2020)
Article Google Scholar
Little, R.J., Rubin, D.B.: Statistical Analysis with Missing Data, vol. 793. Wiley, New York (2019)
MATH Google Scholar
van der Loo, M.: Simputation: Simple Imputation. R package version 0.2.2 (2017)
Google Scholar
Meurer, A., et al.: SymPy: Symbolic computing in Python. PeerJ Comput. Sci. 3, e103 (2017)
Article Google Scholar
Murray, K., Conner, M.M.: Methods to quantify variable importance: implications for the analysis of noisy ecological data. Ecology 90(2), 348–355 (2009)
Article Google Scholar
Ni, J., Drieberg, R.H., Rockett, P.I.: The use of an analytic quotient operator in genetic programming. IEEE Trans. Evol. Comput. 17(1), 146–152 (2012)
Article Google Scholar
Ni, J., Rockett, P.: Tikhonov regularization as a complexity measure in multiobjective genetic programming. IEEE Trans. Evol. Comput. 19(2), 157–166 (2014)
Article Google Scholar
Nikolaev, N.Y., Iba, H.: Regularization approach to inductive genetic programming. IEEE Trans. Evol. Comput. 5(4), 359–375 (2001)
Article Google Scholar
Niyogi, P., Girosi, F.: On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Comput. 8(4), 819–842 (1996)
Article Google Scholar
Pornprasertmanit, S., Miller, P., Schoemann, A., Quick, C., Jorgensen, T., Pornprasertmanit, M.S.: Package ‘SIMSEM’ (2016)
Google Scholar
Raymond, C., Chen, Q., Xue, B., Zhang, M.: Genetic programming with Rademacher complexity for symbolic regression. In: IEEE Congress on Evolutionary Computation (CEC), pp. 2657–2664. IEEE (2019)
Google Scholar
Tran, C.T., Zhang, M., Andreae, P.: A genetic programming-based imputation method for classification with missing data. In: Heywood, M.I., McDermott, J., Castelli, M., Costa, E., Sim, K. (eds.) EuroGP 2016. LNCS, vol. 9594, pp. 149–163. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-30668-1_10
Chapter Google Scholar
Udrescu, S.M., Tegmark, M.: Ai Feynman: a physics-inspired method for symbolic regression. arXiv preprint arXiv:1905.11481 (2019)
Vanschoren, J., Van Rijn, J.N., Bischl, B., Torgo, L.: OpenML: networked science in machine learning. ACM SIGKDD Explor. Newsl. 15(2), 49–60 (2014)
Article Google Scholar
Vladislavleva, E., Smits, G., Den Hertog, D.: On the importance of data balancing for symbolic regression. IEEE Trans. Evol. Comput. 14(2), 252–277 (2010)
Article Google Scholar
Vladislavleva, E.J., Smits, G.F., Den Hertog, D.: Order of nonlinearity as a complexity measure for models generated by symbolic regression via pareto genetic programming. IEEE Trans. Evol. Comput. 13(2), 333–349 (2008)
Article Google Scholar
Wu, Y., Lu, J., Sun, Y.: Genetic programming based on an adaptive regularization method. In: International Conference on Computational Intelligence and Security, vol. 1, pp. 324–327. IEEE (2006)
Google Scholar
Xue, B., Zhang, M.: Evolutionary feature manipulation in data mining/big data. ACM SIGEVOlution 10(1), 4–11 (2017)
Article Google Scholar
Yeun, Y.S., Lee, K.H., Han, S.M., Yang, Y.S.: Smooth fitting with a method for determining the regularization parameter under the genetic programming algorithm. Inf. Sci. 133(3–4), 175–194 (2001)
Article Google Scholar
Zhang, M., Ciesielski, V.: Genetic programming for multiple class object detection. In: Foo, N. (ed.) AI 1999. LNCS (LNAI), vol. 1747, pp. 180–192. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-46695-9_16
Chapter Google Scholar

Download references

Author information

Authors and Affiliations

School of Engineering and Computer Science, Victoria University of Wellington, PO Box 600, Wellington, 6400, New Zealand
Baligh Al-Helali, Qi Chen, Bing Xue & Mengjie Zhang

Authors

Baligh Al-Helali
View author publications
You can also search for this author in PubMed Google Scholar
Qi Chen
View author publications
You can also search for this author in PubMed Google Scholar
Bing Xue
View author publications
You can also search for this author in PubMed Google Scholar
Mengjie Zhang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Baligh Al-Helali , Qi Chen , Bing Xue or Mengjie Zhang .

Editor information

Editors and Affiliations

Queen's University, Kingston, ON, Canada
Ting Hu
University of Coimbra, Coimbra, Portugal
Nuno Lourenço
University of Trieste, Trieste, Italy
Eric Medvet
Pablo de Olavide University, Seville, Spain
Federico Divina

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Al-Helali, B., Chen, Q., Xue, B., Zhang, M. (2020). Hessian Complexity Measure for Genetic Programming-Based Imputation Predictor Selection in Symbolic Regression with Incomplete Data. In: Hu, T., Lourenço, N., Medvet, E., Divina, F. (eds) Genetic Programming. EuroGP 2020. Lecture Notes in Computer Science(), vol 12101. Springer, Cham. https://doi.org/10.1007/978-3-030-44094-7_1

Download citation

DOI: https://doi.org/10.1007/978-3-030-44094-7_1
Published: 09 April 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-44093-0
Online ISBN: 978-3-030-44094-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics