Hessian Complexity Measure for Genetic Programming-Based Imputation Predictor Selection in Symbolic Regression with Incomplete Data

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12101)

Abstract

Missing values pose several challenges when learning from real-world data sets. Imputation is a widely adopted approach to estimating missing values, but it has not been adequately investigated in symbolic regression. When the missing values in an incomplete feature are imputed, the other features used in the prediction process are called imputation predictors. This work presents a method for imputation predictor selection using regularized genetic programming (GP) models for symbolic regression tasks on incomplete data. A complexity measure based on the Hessian matrix of the phenotype of the evolving models is proposed. It is employed as a regularizer in the fitness function of GP for model selection, and the imputation predictors are then taken from the selected models. In addition to a baseline that uses all the available predictors, the proposed selection method is compared with two GP-based feature selection variants: the standard GP feature selector and GP with feature selection pressure. The results indicate that, in most cases, using the predictors selected by regularized GP models considerably reduces the imputation error and also improves symbolic regression performance.
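
The idea summarized above can be illustrated with a minimal sketch. The abstract does not give the exact form of the Hessian complexity measure or of the fitness function, so everything below is an assumption made for illustration: complexity is taken as the mean Frobenius norm of a finite-difference Hessian over the training points, fitness is mean squared error plus a weighted complexity term, and the names `Candidate`, `hessian_complexity`, `regularized_fitness`, `select_predictors` and the weight `lam` are hypothetical. A real GP system would evolve the candidate models; here two hand-written candidates stand in for them.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Candidate:
    """Hypothetical stand-in for an evolved GP model."""
    func: Callable[[Sequence[float]], float]
    features: Tuple[int, ...]  # indices of the input features the model uses


def _shifted(f, x, i, j, di, dj):
    """Evaluate f at x with coordinate i moved by di and coordinate j by dj."""
    p = list(x)
    p[i] += di
    p[j] += dj
    return f(p)


def hessian(f, x, h=1e-3) -> List[List[float]]:
    """Central finite-difference approximation of the Hessian of f at x."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            H[i][j] = (
                _shifted(f, x, i, j, h, h)
                - _shifted(f, x, i, j, h, -h)
                - _shifted(f, x, i, j, -h, h)
                + _shifted(f, x, i, j, -h, -h)
            ) / (4.0 * h * h)
    return H


def hessian_complexity(f, X, h=1e-3) -> float:
    """Assumed measure: mean Frobenius norm of the Hessian over the rows of X."""
    total = 0.0
    for row in X:
        H = hessian(f, row, h)
        total += math.sqrt(sum(v * v for r in H for v in r))
    return total / len(X)


def regularized_fitness(model, X, y, lam=0.1) -> float:
    """Assumed fitness: mean squared error plus lam times the complexity."""
    mse = sum((model.func(row) - t) ** 2 for row, t in zip(X, y)) / len(X)
    return mse + lam * hessian_complexity(model.func, X)


def select_predictors(models, X, y, lam=0.1) -> List[int]:
    """Return the features used by the model with the lowest regularized fitness."""
    best = min(models, key=lambda m: regularized_fitness(m, X, y, lam))
    return sorted(best.features)


# Toy data: three complete features; y is the feature to be imputed.
X = [[0.0, 1.0, 2.0], [1.0, 0.5, -1.0], [2.0, -1.0, 0.5]]
y = [2.0 * r[0] + r[2] for r in X]

linear = Candidate(lambda r: 2.0 * r[0] + r[2], (0, 2))
quadratic = Candidate(lambda r: r[0] ** 2 + r[1], (0, 1))

predictors = select_predictors([linear, quadratic], X, y)
```

In this toy setting the linear candidate has a near-zero Hessian and fits the target, so its features become the imputation predictors. Finite differences are used only to keep the sketch dependency-free; a symbolic Hessian of the model expression would be a natural alternative.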

Author information

Corresponding authors

Correspondence to Baligh Al-Helali, Qi Chen, Bing Xue or Mengjie Zhang.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Al-Helali, B., Chen, Q., Xue, B., Zhang, M. (2020). Hessian Complexity Measure for Genetic Programming-Based Imputation Predictor Selection in Symbolic Regression with Incomplete Data. In: Hu, T., Lourenço, N., Medvet, E., Divina, F. (eds) Genetic Programming. EuroGP 2020. Lecture Notes in Computer Science, vol 12101. Springer, Cham. https://doi.org/10.1007/978-3-030-44094-7_1

  • DOI: https://doi.org/10.1007/978-3-030-44094-7_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-44093-0

  • Online ISBN: 978-3-030-44094-7

  • eBook Packages: Computer Science, Computer Science (R0)
