
Benchmarking state-of-the-art symbolic regression algorithms

Jan Žegklitz · Petr Pošík

Genetic Programming and Evolvable Machines 22, 5–33 (2021)

Abstract

Symbolic regression (SR) is a powerful method for building predictive models from data without assuming any model structure. Traditionally, genetic programming (GP) was used as the SR engine. However, purely evolutionary methods found it hard even to accommodate the function to the range of the data, so training was inefficient and slow. Recently, several SR algorithms have emerged that employ multiple linear regression, which allows them to create models with relatively small error right from the beginning of the search. Such algorithms are claimed to be orders of magnitude faster than SR algorithms based on classic GP. However, a systematic comparison of these algorithms on a common set of problems is still missing, and there is no basis on which to decide which algorithm to use. In this paper we conceptually and experimentally compare several representatives of such algorithms: GPTIPS, FFX, and EFS. We also include GSGP-Red, an enhanced version of geometric semantic genetic programming, an important algorithm in the field of SR. The methods are applied as off-the-shelf, ready-to-use techniques, mostly with their default settings, and are compared on several synthetic SR benchmark problems as well as real-world problems ranging from civil engineering to aerodynamics and acoustics. Their performance is also related to that of three conventional machine learning algorithms: multiple regression, random forests, and support vector regression. The results suggest that across all the problems the algorithms have comparable performance. We provide basic recommendations to the user regarding the choice of algorithm.


Notes

  1. By “vanilla GP” we mean the original system presented by Koza in [14], or derived systems that rely solely on tree manipulation to evolve the final model. However, we do not consider all tree-based GP systems “vanilla”. An example is GPTIPS (mentioned later in the Introduction and discussed further in Sect. 2.2.1), which is tree-based and uses tree manipulation as the main driver of structural changes, but also has features beyond tree manipulation alone.

  2. However, we cannot call GSGP-Red a technology because the code [9] does not actually output the model, only the metrics needed for evaluation of the algorithm. Nevertheless, we included the algorithm because it otherwise fulfills the criteria and is an important algorithm in the field of SR.

  3. By an internal constant we mean a constant other than a coefficient of the top-level linear combination. Example: in \(3x^2 + 6\sin (1.3x)\), the “3” and “6” are not internal constants, because they are tuned by the top-level multiple regression, while the “1.3” is an internal constant (part of the nonlinear basis function). A minimal code sketch of this top-level fit is given after these notes.

  4. When we tried to run PGE, it crashed several times for reasons that were not obvious, which prevents the user from running experiments systematically. An example of a bug we found is on line 27 of the source file expand.py (see https://github.com/verdverm/pypge/blob/a6a031fb/pypge/expand.py#L27): a typo in a variable name prevents the use of the square-root function node. We did not track down the causes of the other crashes.

  5. For details about the implementation and parameters see http://scikit-learn.org/0.17/modules/generated/sklearn.ensemble.RandomForestRegressor.html.

  6. For details about the implementation and parameters see http://scikit-learn.org/0.17/modules/generated/sklearn.grid_search.GridSearchCV.html.

  7. For details about the implementation and parameters see http://scikit-learn.org/0.17/modules/generated/sklearn.svm.SVR.html. A combined usage sketch for notes 5–7 appears after these notes.

  8. The only exception is EFS: we changed the round variable to false (it was originally hard-coded to true), following the issue we opened on the algorithm’s GitHub repository; see https://github.com/flexgp/efs/issues/1.

  9. FFX has a built-in 50 s timeout for performing the fit of the elastic net. If the elastic net fails to fit within this time, a constant model is returned for that particular fit. Note, however, that FFX fits the elastic net multiple times in each of its multiple runs (see Sect. 2.2.3), and there is no support for timing out this combined run and returning the results obtained so far. A sketch of an external workaround appears after these notes.

  10. The number of nodes is used as a simple common measure of complexity across all the algorithms, for reporting purposes only. The individual algorithms use their own measures of complexity to find the best model. A sketch of such a node count appears after these notes.
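
To make note 3 concrete, the following minimal sketch (ours, not from the paper) shows a top-level multiple regression: the internal constant 1.3 stays fixed inside its basis function, while the outer coefficients 3 and 6 are recovered by ordinary least squares.

```python
import numpy as np

# Data generated from the example model in note 3: 3*x^2 + 6*sin(1.3*x).
rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, 200)
y = 3.0 * x**2 + 6.0 * np.sin(1.3 * x)

# Design matrix of nonlinear basis functions. The 1.3 is an *internal*
# constant: it lives inside a basis function and is not touched by the fit.
Phi = np.column_stack([x**2, np.sin(1.3 * x)])

# The top-level multiple regression tunes only the outer coefficients.
coeffs, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(coeffs)  # approximately [3.0, 6.0]
```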
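
As a usage sketch for notes 5–7, the conventional baselines can be set up as follows. This assumes a recent scikit-learn (in version 0.17, the one used in the paper, GridSearchCV lived in sklearn.grid_search rather than sklearn.model_selection), and the SVR parameter grid is illustrative, not the one from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in 0.17
from sklearn.svm import SVR

# Toy data standing in for one of the benchmark problems.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, (200, 3))
y = X[:, 0] ** 2 - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(200)

# Random forest with default parameters (note 5).
rf = RandomForestRegressor().fit(X, y)

# SVR with hyperparameters chosen by grid search (notes 6 and 7);
# this particular grid is hypothetical.
svr = GridSearchCV(SVR(), param_grid={"C": [0.1, 1.0, 10.0],
                                      "gamma": [0.01, 0.1, 1.0]}).fit(X, y)

print(rf.score(X, y), svr.score(X, y))  # R^2 on the training data
```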
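
Regarding note 9, an overall wall-clock budget can only be imposed from outside FFX, and killing the run discards partial results, which is exactly the limitation described. A minimal sketch, assuming the run() entry point of the ffx package [6]:

```python
import multiprocessing as mp

def _fit(queue, train_X, train_y, test_X, test_y, varnames):
    import ffx  # assumes the ffx package from PyPI, reference [6]
    queue.put(ffx.run(train_X, train_y, test_X, test_y, varnames))

def fit_with_budget(train_X, train_y, test_X, test_y, varnames, budget_s=600):
    """Run FFX in a child process and enforce an overall wall-clock budget."""
    queue = mp.Queue()
    proc = mp.Process(target=_fit,
                      args=(queue, train_X, train_y, test_X, test_y, varnames))
    proc.start()
    proc.join(budget_s)
    if proc.is_alive():    # budget exceeded: kill the combined run...
        proc.terminate()
        proc.join()
        return None        # ...and all partial results are lost
    return queue.get()
```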
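
For note 10, a node count over an expression tree might look like the following sketch; the nested-tuple encoding is ours, not any algorithm's actual internal representation.

```python
def node_count(expr):
    """Count every operator, variable and constant node in an expression tree."""
    if not isinstance(expr, tuple):  # leaf: a variable name or a constant
        return 1
    # one node for the operator plus the nodes of all its arguments
    return 1 + sum(node_count(arg) for arg in expr[1:])

# The example from note 3, 3*x^2 + 6*sin(1.3*x), as a nested tuple:
expr = ("+", ("*", 3, ("^", "x", 2)), ("*", 6, ("sin", ("*", 1.3, "x"))))
print(node_count(expr))  # 12
```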

References

  1. I. Arnaldo, K. Krawiec, U.M. O’Reilly, Multiple regression genetic programming, in Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, GECCO ’14 (ACM, New York, 2014), pp. 879–886. https://doi.org/10.1145/2576768.2598291

  2. I. Arnaldo, U.M. O’Reilly, K. Veeramachaneni, Building predictive models via feature synthesis, in Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, GECCO ’15 (ACM, New York, 2015), pp. 983–990. https://doi.org/10.1145/2739480.2754693

  3. K. Bache, M. Lichman, UCI machine learning repository (2013). http://archive.ics.uci.edu/ml. Accessed 30 Jan 2016

  4. V.V. De Melo, Kaizen programming, in Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, GECCO ’14 (ACM, New York, 2014), pp. 895–902. https://doi.org/10.1145/2576768.2598264

  5. EFS commit 6d991fa. http://github.com/flexgp/efs/tree/6d991fa. Accessed 12 Oct 2015

  6. FFX 1.3.4. http://pypi.python.org/pypi/ffx/1.3.4. Accessed 27 Aug 2015

  7. J. Friedman, T. Hastie, R. Tibshirani, Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1–22 (2010)

  8. A. Garg, A. Garg, K. Tai, A multi-gene genetic programming model for estimating stress-dependent soil water retention curves. Comput. Geosci. 18(1), 45–56 (2013). https://doi.org/10.1007/s10596-013-9381-z

  9. GSGP-Red commit 0e5f4d5. https://github.com/laic-ufmg/GSGP-Red/tree/0e5f4d5. Accessed 6 Dec 2018

  10. M. Hinchliffe, H. Hiden, B. McKay, M. Willis, M. Tham, G. Barton, Modelling chemical process systems using a multi-gene genetic programming algorithm, in Late Breaking Papers at GP’96, Stanford, USA (1996), pp. 56–65

  11. J.H. Holland, Adaptation in Natural and Artificial Systems (MIT Press, Cambridge, 1992)

  12. M. Keijzer, Scaled symbolic regression. Genet. Program. Evolvable Mach. 5(3), 259–269 (2004). https://doi.org/10.1023/B:GENP.0000030195.77571.f9

  13. M.F. Korns, Accuracy in symbolic regression, in Genetic Programming Theory and Practice IX (Springer, New York, 2011), pp. 129–151. https://doi.org/10.1007/978-1-4614-1770-5_8

  14. J.R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection (MIT Press, Cambridge, 1992)

  15. Evolved Analytics LLC, DataModeler [Software] (2016). http://www.evolved-analytics.com/. Accessed 14 Dec 2019

  16. S. Luke, L. Panait, Lexicographic parsimony pressure, in Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’02 (Morgan Kaufmann Publishers Inc., San Francisco, 2002), pp. 829–836. http://dl.acm.org/citation.cfm?id=646205.682619

  17. J.F.B.S. Martins, L.O.V.B. Oliveira, L.F. Miranda, F. Casadei, G.L. Pappa, Solving the exponential growth of symbolic regression trees in geometric semantic genetic programming, in Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’18 (ACM, New York, 2018), pp. 1151–1158. https://doi.org/10.1145/3205455.3205593

  18. T. McConaghy, FFX: fast, scalable, deterministic symbolic regression technology, in Genetic Programming Theory and Practice IX, Genetic and Evolutionary Computation, ed. by R. Riolo, E. Vladislavleva, J.H. Moore (Springer, New York, 2011), pp. 235–260. https://doi.org/10.1007/978-1-4614-1770-5_13

  19. J. McDermott, D.R. White, S. Luke, L. Manzoni, M. Castelli, L. Vanneschi, W. Jaskowski, K. Krawiec, R. Harper, K. De Jong, U.M. O’Reilly, Genetic programming needs better benchmarks, in Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation, GECCO ’12 (ACM, New York, 2012), pp. 791–798. https://doi.org/10.1145/2330163.2330273

  20. A. Moraglio, K. Krawiec, C. Johnson, Geometric semantic genetic programming, in Parallel Problem Solving from Nature - PPSN XII, Lecture Notes in Computer Science, vol. 7491, ed. C. Coello, V. Cutello, K. Deb, S. Forrest, G. Nicosia, M. Pavone (Springer, Berlin, 2012), pp. 21–31. https://doi.org/10.1007/978-3-642-32937-1_3

  21. J. Ni, R.H. Drieberg, P.I. Rockett, The use of an analytic quotient operator in genetic programming. IEEE Trans. Evol. Comput. 17(1), 146–152 (2013). https://doi.org/10.1109/TEVC.2012.2195319

  22. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  23. scikit-learn 0.17.1. https://pypi.python.org/pypi/scikit-learn/0.17.1. Accessed 21 Jun 2016

  24. M. Schmidt, H. Lipson, Distilling free-form natural laws from experimental data. Science 324(5923), 81–85 (2009)

  25. M. Schmidt, H. Lipson, Eureqa (Version 0.98 beta) [Software] (2014). www.nutonian.com

  26. D.P. Searson, GPTIPS 2 (2015). http://sites.google.com/site/gptips4matlab. Accessed 9 Jun 2015

  27. D.P. Searson, GPTIPS 2: An Open-Source Software Platform for Symbolic Data Mining (Springer International Publishing, Cham, 2015), pp. 551–573. https://doi.org/10.1007/978-3-319-20883-1_22

  28. D.P. Searson, D.E. Leahy, M.J. Willis, GPTIPS: an open source genetic programming toolbox for multigene symbolic regression, in Proceedings of the International MultiConference of Engineers and Computer Scientists, vol. 1 (2010), pp. 77–80

  29. G.F. Smits, M. Kotanchek, Pareto-front exploitation in symbolic regression, in Genetic Programming Theory and Practice II (Springer US, Boston, 2005), pp. 283–299. https://doi.org/10.1007/0-387-23254-0_17

  30. A. Tsanas, A. Xifara, Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools. Energy Build. 49, 560–567 (2012). https://doi.org/10.1016/j.enbuild.2012.03.003

  31. E. Vladislavleva, G. Smits, D. den Hertog, Order of nonlinearity as a complexity measure for models generated by symbolic regression via Pareto genetic programming. IEEE Trans. Evol. Comput. 13(2), 333–349 (2009). https://doi.org/10.1109/TEVC.2008.926486

  32. E. Vladislavleva, G. Smits, M. Kotanchek, Better Solutions Faster: Soft Evolution of Robust Regression Models in Pareto Genetic Programming (Springer US, Boston, 2008), pp. 13–32. https://doi.org/10.1007/978-0-387-76308-8_2

  33. T. Worm, Pypge. https://github.com/verdverm/pypge. Accessed 13 Dec 2019

  34. T. Worm, K. Chiu, Prioritized grammar enumeration: symbolic regression by dynamic programming, in Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, GECCO ’13 (ACM, New York, 2013), pp. 1021–1028. https://doi.org/10.1145/2463372.2463486

  35. I.C. Yeh, Modeling of strength of high-performance concrete using artificial neural networks. Cem. Concr. Res. 28(12), 1797–1808 (1998). https://doi.org/10.1016/S0008-8846(98)00165-3

  36. H. Zou, T. Hastie, Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 67(2), 301–320 (2005). https://doi.org/10.1111/j.1467-9868.2005.00503.x

Acknowledgements

Jan Žegklitz was supported by the Czech Science Foundation project No. 15-22731S. Petr Pošík was supported by the Grant Agency of the Czech Technical University in Prague, Grant No. SGS14/194/OHK3/3T/13.

Author information

Correspondence to Jan Žegklitz.


Cite this article

Žegklitz, J., Pošík, P. Benchmarking state-of-the-art symbolic regression algorithms. Genet Program Evolvable Mach 22, 5–33 (2021). https://doi.org/10.1007/s10710-020-09387-0
