Abstract
Symbolic regression (SR) on high-dimensional data is a challenging problem that often leads to poor generalization performance. While feature selection can improve both the generalization ability and the efficiency of learning methods, it remains a hard problem in genetic programming (GP) for high-dimensional SR. The Shapley value has been used in additive feature attribution methods to attribute the difference between a model's output and an average baseline to the input features. Owing to its solid game-theoretic foundations, the Shapley value can fairly quantify the importance of each feature. In this paper, we propose a novel feature selection algorithm based on the Shapley value to select informative features in GP for high-dimensional SR. Experiments on ten high-dimensional regression datasets show that, compared with standard GP, the proposed algorithm achieves better learning and generalization performance on most of the datasets. Further analysis shows that the proposed method evolves more compact models containing highly informative features.
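The additive attribution idea described above can be illustrated with a minimal sketch. This is not the authors' algorithm: it is a generic Monte Carlo approximation of Shapley values, where each sampled permutation measures a feature's marginal contribution as it joins the coalition of features preceding it, starting from a baseline input. The toy `model` and all names are assumptions for illustration only.

```python
import random

# Toy "model": a function of 5 features where only x[0] and x[2] matter.
def model(x):
    return 3.0 * x[0] + 2.0 * x[2]

def shapley_values(model, x, baseline, n_samples=2000, seed=0):
    """Monte Carlo approximation of Shapley values: for each sampled
    permutation of features, accumulate each feature's marginal
    contribution when it joins the coalition of features before it."""
    rng = random.Random(seed)
    d = len(x)
    phi = [0.0] * d
    for _ in range(n_samples):
        order = list(range(d))
        rng.shuffle(order)
        z = list(baseline)          # start from the baseline input
        prev = model(z)
        for j in order:
            z[j] = x[j]             # feature j joins the coalition
            cur = model(z)
            phi[j] += cur - prev
            prev = cur
    return [p / n_samples for p in phi]

x = [1.0] * 5
baseline = [0.0] * 5
phi = shapley_values(model, x, baseline)
# Efficiency property: contributions sum to model(x) - model(baseline).
ranked = sorted(range(5), key=lambda j: -abs(phi[j]))
print([round(p, 2) for p in phi], ranked[:2])
```

Ranking features by the magnitude of their attributions, as in `ranked`, is one simple way such scores could drive feature selection before or during a GP run.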
Acknowledgement
This work was supported in part by the Marsden Fund of the New Zealand Government under Contracts MFP-VUW2016, MFP-VUW1914, and MFP-VUW1913.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wang, C., Chen, Q., Xue, B., Zhang, M. (2024). Shapley Value Based Feature Selection to Improve Generalization of Genetic Programming for High-Dimensional Symbolic Regression. In: Benavides-Prado, D., Erfani, S., Fournier-Viger, P., Boo, Y.L., Koh, Y.S. (eds) Data Science and Machine Learning. AusDM 2023. Communications in Computer and Information Science, vol 1943. Springer, Singapore. https://doi.org/10.1007/978-981-99-8696-5_12
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8695-8
Online ISBN: 978-981-99-8696-5
eBook Packages: Computer Science (R0)