Data Aggregation for Reducing Training Data in Symbolic Regression

Kammerer, Lukas; Kronberger, Gabriel; Kommenda, Michael

doi:10.1007/978-3-030-45093-9_46

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12013)

Included in the following conference series:

International Conference on Computer Aided Systems Theory

Abstract

The growing volume of data makes computationally intensive machine learning techniques, such as symbolic regression with genetic programming, increasingly impractical. This work discusses methods to reduce the training data and thereby the runtime of genetic programming. The data is aggregated in a preprocessing step before the actual machine learning algorithm is run. K-means clustering and data binning are used for data aggregation and are compared with random sampling, the simplest data reduction method. We analyze the achieved speed-up in training and the effects on the trained models' test accuracy for every method on four real-world data sets. The performance of genetic programming is compared with random forests and linear regression. It is shown that k-means and random sampling lead to a very small loss in test accuracy when the data is reduced to only 30% of the original data, while the speed-up is proportional to the size of the data set. Binning, on the contrary, leads to models with very high test error.
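The three reduction methods named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the convention that the last column holds the target, clustering on all standardised columns, equal-width bins over the feature columns, and averaging the rows of each group are all assumptions made for illustration.

```python
import numpy as np

def random_sample(data, fraction, rng):
    """Keep a random subset of rows, drawn without replacement."""
    n_keep = max(1, int(round(fraction * len(data))))
    idx = rng.choice(len(data), size=n_keep, replace=False)
    return data[idx]

def _group_means(data, labels):
    """Average the rows of `data` that share a label (one output row per label)."""
    _, inverse = np.unique(labels, return_inverse=True, axis=0)
    inverse = inverse.ravel()
    n_groups = inverse.max() + 1
    sums = np.zeros((n_groups, data.shape[1]))
    np.add.at(sums, inverse, data)  # unbuffered per-group accumulation
    counts = np.bincount(inverse, minlength=n_groups)
    return sums / counts[:, None]

def kmeans_aggregate(data, k, rng, n_iter=50):
    """Lloyd-style k-means; each non-empty cluster becomes one averaged row.

    Assumption: clustering runs on all columns (features and target) after
    standardisation, and the averages are taken in the original units.
    """
    scale = data.std(axis=0)
    scale[scale == 0] = 1.0
    z = (data - data.mean(axis=0)) / scale
    centers = z[rng.choice(len(z), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance of every row to every center.
        dists = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = z[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    # Empty clusters are dropped, so fewer than k rows may be returned.
    return _group_means(data, labels)

def bin_aggregate(data, n_bins):
    """Equal-width binning of the feature columns; each occupied cell is averaged.

    Assumption: the last column is the target and is not used for binning.
    """
    features = data[:, :-1]
    lo, hi = features.min(axis=0), features.max(axis=0)
    width = np.where(hi > lo, (hi - lo) / n_bins, 1.0)
    cells = np.minimum(((features - lo) / width).astype(int), n_bins - 1)
    return _group_means(data, cells)

# Synthetic example data (hypothetical, not one of the paper's data sets).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(1000, 2))
y = x[:, 0] ** 2 + x[:, 1]
data = np.column_stack([x, y])

sampled = random_sample(data, 0.3, rng)             # 30% of the rows
clustered = kmeans_aggregate(data, k=300, rng=rng)  # at most 300 rows
binned = bin_aggregate(data, n_bins=10)             # at most 10 * 10 rows
print(sampled.shape, clustered.shape, binned.shape)
```

In each case the reduced array would replace the original training set before model fitting. Note that with a fixed number of bins per feature, the number of occupied cells, and therefore the size of the reduced data set, depends on how the data is distributed.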

Acknowledgements

The authors gratefully acknowledge support by the Austrian Research Promotion Agency (FFG) within project #867202, as well as the Christian Doppler Research Association and the Federal Ministry of Digital and Economic Affairs within the Josef Ressel Centre for Symbolic Regression.

Author information

Authors and Affiliations

Josef Ressel Center for Symbolic Regression, Heuristic and Evolutionary Algorithms Laboratory, University of Applied Sciences Upper Austria, Hagenberg, Austria
Lukas Kammerer, Gabriel Kronberger & Michael Kommenda

Corresponding author

Correspondence to Lukas Kammerer.

Editor information

Editors and Affiliations

University of Las Palmas de Gran Canaria, Las Palmas de Gran Canaria, Spain
Roberto Moreno-Díaz
Johannes Kepler University Linz, Linz, Austria
Franz Pichler
University of Las Palmas de Gran Canaria, Las Palmas de Gran Canaria, Spain
Alexis Quesada-Arencibia

Cite this paper

Kammerer, L., Kronberger, G., Kommenda, M. (2020). Data Aggregation for Reducing Training Data in Symbolic Regression. In: Moreno-Díaz, R., Pichler, F., Quesada-Arencibia, A. (eds) Computer Aided Systems Theory – EUROCAST 2019. EUROCAST 2019. Lecture Notes in Computer Science, vol 12013. Springer, Cham. https://doi.org/10.1007/978-3-030-45093-9_46

DOI: https://doi.org/10.1007/978-3-030-45093-9_46
Published: 15 April 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-45092-2
Online ISBN: 978-3-030-45093-9
eBook Packages: Computer Science, Computer Science (R0)