Abstract
Symbolic regression (SR) on high-dimensional data is a challenging problem that often leads to poor generalization performance. While feature selection can improve both the generalization ability and the efficiency of learning methods, it remains a hard problem in genetic programming (GP) for high-dimensional SR. The Shapley value has been used in additive feature attribution methods to attribute the difference between a model's output and an average baseline to the input features. Owing to its solid game-theoretic foundations, the Shapley value can fairly quantify the importance of each feature. In this paper, we propose a novel feature selection algorithm based on the Shapley value to select informative features in GP for high-dimensional SR. Experiments on ten high-dimensional regression datasets show that, compared with standard GP, the proposed algorithm achieves better learning and generalization performance on most of the datasets. Further analysis shows that the proposed method evolves more compact models containing highly informative features.
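The additive attribution idea described above can be illustrated with a minimal sketch. This is not the authors' algorithm: it is a generic Monte Carlo approximation of Shapley values, where each sampled permutation measures a feature's marginal contribution as it joins the coalition of features preceding it, starting from a baseline input. The toy `model` and all names are assumptions for illustration only.

```python
import random

# Toy "model": a function of 5 features where only x[0] and x[2] matter.
def model(x):
    return 3.0 * x[0] + 2.0 * x[2]

def shapley_values(model, x, baseline, n_samples=2000, seed=0):
    """Monte Carlo approximation of Shapley values: for each sampled
    permutation of features, accumulate each feature's marginal
    contribution when it joins the coalition of features before it."""
    rng = random.Random(seed)
    d = len(x)
    phi = [0.0] * d
    for _ in range(n_samples):
        order = list(range(d))
        rng.shuffle(order)
        z = list(baseline)          # start from the baseline input
        prev = model(z)
        for j in order:
            z[j] = x[j]             # feature j joins the coalition
            cur = model(z)
            phi[j] += cur - prev
            prev = cur
    return [p / n_samples for p in phi]

x = [1.0] * 5
baseline = [0.0] * 5
phi = shapley_values(model, x, baseline)
# Efficiency property: contributions sum to model(x) - model(baseline).
ranked = sorted(range(5), key=lambda j: -abs(phi[j]))
print([round(p, 2) for p in phi], ranked[:2])
```

Ranking features by the magnitude of their attributions, as in `ranked`, is one simple way such scores could drive feature selection before or during a GP run.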
Acknowledgement
This work was supported in part by the Marsden Fund of the New Zealand Government under Contracts MFP-VUW2016, MFP-VUW1914, and MFP-VUW1913.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wang, C., Chen, Q., Xue, B., Zhang, M. (2024). Shapley Value Based Feature Selection to Improve Generalization of Genetic Programming for High-Dimensional Symbolic Regression. In: Benavides-Prado, D., Erfani, S., Fournier-Viger, P., Boo, Y.L., Koh, Y.S. (eds) Data Science and Machine Learning. AusDM 2023. Communications in Computer and Information Science, vol 1943. Springer, Singapore. https://doi.org/10.1007/978-981-99-8696-5_12
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8695-8
Online ISBN: 978-981-99-8696-5
eBook Packages: Computer Science (R0)