Genetic Programming with Synthetic Data for Interpretable Regression Modelling and Limited Data

  • Conference paper
  • In: Machine Learning, Optimization, and Data Science (LOD 2023)

Abstract

A trained regression model can be used to create new synthetic training data by drawing from a distribution over the independent variables and calling the model to produce a prediction for the dependent variable. We investigate how this idea can be used together with genetic programming (GP) to address two important issues in regression modelling: interpretability and limited data. In particular, we have two hypotheses. (1) Given a trained but non-interpretable regression model (e.g., a neural network (NN) or random forest (RF)), GP can be used to create an interpretable model while maintaining accuracy, by training on synthetic data formed from the existing model’s predictions. (2) In the context of limited data, an initial regression model (e.g., NN, RF, or GP) can be trained and then used to create abundant synthetic data for training a second regression model (again, NN, RF, or GP), and this second model can perform better than it would if trained on the original data alone. We carry out experiments on four well-known regression datasets, comparing results between an initial model and a model trained on the initial model’s outputs; we find some results that support each hypothesis and some that do not. We also investigate the effect of the size of the limited dataset on the final results.
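The following is a minimal sketch of the idea described in the abstract, not the authors' actual pipeline: scikit-learn models stand in for the paper's NN/RF/GP choices, the synthetic inputs are drawn from per-feature Gaussians fit to the limited training data (the abstract does not specify the sampling distribution), and a depth-limited decision tree plays the role of the interpretable student in place of GP.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# A small "limited data" regression problem (synthetic, for illustration only).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: train an initial, non-interpretable model on the limited data.
teacher = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# Step 2: draw abundant synthetic inputs from a distribution over the
# independent variables (here: independent Gaussians fit to each feature),
# and label them with the initial model's predictions.
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_syn = rng.normal(mu, sigma, size=(5000, X_train.shape[1]))
y_syn = teacher.predict(X_syn)

# Step 3: train a second model on the synthetic data. The paper uses GP
# (symbolic regression) here; a shallow decision tree is a simple stand-in
# for an interpretable student.
student = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_syn, y_syn)

print("initial model R^2:", r2_score(y_test, teacher.predict(X_test)))
print("student model R^2:", r2_score(y_test, student.predict(X_test)))

A GP library could be substituted for the decision tree in Step 3 without changing the rest of the sketch, since only fit and predict calls are used.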



Acknowledgement

This publication has emanated from research conducted with the financial support of Science Foundation Ireland under Grant number 18/CRT/6223.

Author information


Corresponding author

Correspondence to Fitria Wulandari Ramlan.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ramlan, F.W., McDermott, J. (2024). Genetic Programming with Synthetic Data for Interpretable Regression Modelling and Limited Data. In: Nicosia, G., Ojha, V., La Malfa, E., La Malfa, G., Pardalos, P.M., Umeton, R. (eds) Machine Learning, Optimization, and Data Science. LOD 2023. Lecture Notes in Computer Science, vol 14505. Springer, Cham. https://doi.org/10.1007/978-3-031-53969-5_12

  • DOI: https://doi.org/10.1007/978-3-031-53969-5_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-53968-8

  • Online ISBN: 978-3-031-53969-5

  • eBook Packages: Computer Science, Computer Science (R0)
