
Shapley Value Based Feature Selection to Improve Generalization of Genetic Programming for High-Dimensional Symbolic Regression

  • Conference paper
Data Science and Machine Learning (AusDM 2023)

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 1943))


Abstract

Symbolic regression (SR) on high-dimensional data is a challenging problem that often leads to poor generalization performance. While feature selection can improve the generalization ability and efficiency of learning methods, it remains a hard problem for genetic programming (GP) on high-dimensional SR. The Shapley value has been used in additive feature attribution methods to attribute the difference between a model's output and an average baseline to the input features. Owing to its solid game-theoretic foundations, the Shapley value can fairly quantify the importance of each feature. In this paper, we propose a novel Shapley value based feature selection algorithm to select informative features in GP for high-dimensional SR. Experiments on ten high-dimensional regression datasets show that, compared with standard GP, the proposed algorithm achieves better learning and generalization performance on most of the datasets. Further analysis shows that the proposed method evolves more compact models that contain highly informative features.
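The additive attribution idea behind the Shapley value can be illustrated with a small self-contained sketch (not the authors' implementation): exact Shapley values for one prediction, where the worth of a coalition of features is the model's output with the remaining features set to a baseline. The function `shapley_values` and the toy linear model below are hypothetical illustrations; for a linear model, each feature's Shapley value reduces to its coefficient times its deviation from the baseline.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values for one prediction of model f.

    v(S) is f evaluated with features in coalition S taken from the
    instance x and all other features taken from the baseline.
    """
    n = len(x)

    def v(S):
        z = [x[i] if i in S else baseline[i] for i in range(n)]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                # Shapley weight for a coalition of size k
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += w * (v(set(S) | {i}) - v(set(S)))
    return phi

# Toy linear model: only features 0 and 2 carry information.
f = lambda z: 2.0 * z[0] + 0.0 * z[1] + 1.0 * z[2]
x = [3.0, 5.0, 4.0]
baseline = [1.0, 1.0, 1.0]
phi = shapley_values(f, x, baseline)  # ≈ [4.0, 0.0, 3.0]
```

By construction the attributions sum to the gap between the prediction and the baseline output, and the uninformative feature receives zero importance, which is the property a Shapley-based feature selector exploits.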



Acknowledgement

This work is supported in part by the Marsden Fund of the New Zealand Government under Contracts MFP-VUW2016, MFP-VUW1914, and MFP-VUW1913.

Author information


Correspondence to Qi Chen.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Wang, C., Chen, Q., Xue, B., Zhang, M. (2024). Shapley Value Based Feature Selection to Improve Generalization of Genetic Programming for High-Dimensional Symbolic Regression. In: Benavides-Prado, D., Erfani, S., Fournier-Viger, P., Boo, Y.L., Koh, Y.S. (eds) Data Science and Machine Learning. AusDM 2023. Communications in Computer and Information Science, vol 1943. Springer, Singapore. https://doi.org/10.1007/978-981-99-8696-5_12


  • DOI: https://doi.org/10.1007/978-981-99-8696-5_12

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8695-8

  • Online ISBN: 978-981-99-8696-5

  • eBook Packages: Computer Science (R0)
