
Benchmarking state-of-the-art symbolic regression algorithms

Jan Žegklitz · Petr Pošík

Genetic Programming and Evolvable Machines 22, 5–33 (2021)

Abstract

Symbolic regression (SR) is a powerful method for building predictive models from data without assuming any model structure. Traditionally, genetic programming (GP) was used as the SR engine. However, purely evolutionary methods found it hard even to accommodate the function to the range of the data, so training was inefficient and slow. Recently, several SR algorithms have emerged that employ multiple linear regression, which allows them to create models with relatively small error right from the beginning of the search. Such algorithms are claimed to be orders of magnitude faster than SR algorithms based on classic GP. However, a systematic comparison of these algorithms on a common set of problems is still missing, and there is no basis on which to decide which algorithm to use. In this paper we conceptually and experimentally compare several representatives of such algorithms: GPTIPS, FFX, and EFS. We also include GSGP-Red, an enhanced version of geometric semantic genetic programming, an important algorithm in the field of SR. The methods are applied as off-the-shelf, ready-to-use techniques, mostly with their default settings, and are compared on several synthetic SR benchmark problems as well as real-world problems ranging from civil engineering to aerodynamics and acoustics. Their performance is also related to that of three conventional machine learning algorithms: multiple regression, random forests, and support vector regression. The results suggest that across all the problems the algorithms have comparable performance. We provide basic recommendations to the user regarding the choice of algorithm.


Notes

  1. By “vanilla GP” we mean the original system presented by Koza in [14], or derived systems that rely solely on tree manipulation to evolve the final model. However, we do not consider all tree-based GP systems “vanilla”. An example is GPTIPS (mentioned later in the Introduction and discussed further in Sect. 2.2.1), which is tree-based and uses tree manipulation as the main driver of structural changes, but also has features beyond tree manipulation alone.

  2. However, we cannot call GSGP-Red a technology because the code [9] does not actually output the model, only the metrics needed for evaluation of the algorithm. Nevertheless, we included the algorithm because it otherwise fulfills the criteria and is an important algorithm in the field of SR.

  3. By an internal constant we mean a constant other than a coefficient of the top-level linear combination. Example: in \(3x^2 + 6\sin (1.3x)\), the “3” and “6” are not internal constants, because they are tuned by the top-level multiple regression, while the “1.3” is an internal constant (part of the nonlinear basis function). A minimal code sketch of this top-level fit is given after these notes.

  4. When we tried to run PGE, it crashed several times for reasons that were not obvious, which prevents the user from running experiments systematically. An example of a bug we found is on line 27 of the source file expand.py (see https://github.com/verdverm/pypge/blob/a6a031fb/pypge/expand.py#L27): a typo in a variable name prevents the use of the square-root function node. We did not track down the causes of the other crashes.

  5. For details about the implementation and parameters see http://scikit-learn.org/0.17/modules/generated/sklearn.ensemble.RandomForestRegressor.html.

  6. For details about the implementation and parameters see http://scikit-learn.org/0.17/modules/generated/sklearn.grid_search.GridSearchCV.html.

  7. For details about the implementation and parameters see http://scikit-learn.org/0.17/modules/generated/sklearn.svm.SVR.html. A combined usage sketch for notes 5–7 appears after these notes.

  8. The only exception is EFS: we changed the round variable to false (it was originally hard-coded to true), following the issue we opened on the algorithm’s GitHub repository; see https://github.com/flexgp/efs/issues/1.

  9. FFX has a built-in 50 s timeout for performing the fit of the elastic net. If the elastic net fails to fit within this time, a constant model is returned for that particular fit. Note, however, that FFX fits the elastic net multiple times in each of its multiple runs (see Sect. 2.2.3), and there is no support for timing out this combined run and returning the results obtained so far. A sketch of an external workaround appears after these notes.

  10. The number of nodes is used as a simple common measure of complexity across all the algorithms, for reporting purposes only. The individual algorithms use their own measures of complexity to find the best model. A sketch of such a node count appears after these notes.
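
To make note 3 concrete, the following minimal sketch (ours, not from the paper) shows a top-level multiple regression: the internal constant 1.3 stays fixed inside its basis function, while the outer coefficients 3 and 6 are recovered by ordinary least squares.

```python
import numpy as np

# Data generated from the example model in note 3: 3*x^2 + 6*sin(1.3*x).
rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, 200)
y = 3.0 * x**2 + 6.0 * np.sin(1.3 * x)

# Design matrix of nonlinear basis functions. The 1.3 is an *internal*
# constant: it lives inside a basis function and is not touched by the fit.
Phi = np.column_stack([x**2, np.sin(1.3 * x)])

# The top-level multiple regression tunes only the outer coefficients.
coeffs, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(coeffs)  # approximately [3.0, 6.0]
```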
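
As a usage sketch for notes 5–7, the conventional baselines can be set up as follows. This assumes a recent scikit-learn (in version 0.17, the one used in the paper, GridSearchCV lived in sklearn.grid_search rather than sklearn.model_selection), and the SVR parameter grid is illustrative, not the one from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in 0.17
from sklearn.svm import SVR

# Toy data standing in for one of the benchmark problems.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, (200, 3))
y = X[:, 0] ** 2 - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(200)

# Random forest with default parameters (note 5).
rf = RandomForestRegressor().fit(X, y)

# SVR with hyperparameters chosen by grid search (notes 6 and 7);
# this particular grid is hypothetical.
svr = GridSearchCV(SVR(), param_grid={"C": [0.1, 1.0, 10.0],
                                      "gamma": [0.01, 0.1, 1.0]}).fit(X, y)

print(rf.score(X, y), svr.score(X, y))  # R^2 on the training data
```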
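
Regarding note 9, an overall wall-clock budget can only be imposed from outside FFX, and killing the run discards partial results, which is exactly the limitation described. A minimal sketch, assuming the run() entry point of the ffx package [6]:

```python
import multiprocessing as mp

def _fit(queue, train_X, train_y, test_X, test_y, varnames):
    import ffx  # assumes the ffx package from PyPI, reference [6]
    queue.put(ffx.run(train_X, train_y, test_X, test_y, varnames))

def fit_with_budget(train_X, train_y, test_X, test_y, varnames, budget_s=600):
    """Run FFX in a child process and enforce an overall wall-clock budget."""
    queue = mp.Queue()
    proc = mp.Process(target=_fit,
                      args=(queue, train_X, train_y, test_X, test_y, varnames))
    proc.start()
    proc.join(budget_s)
    if proc.is_alive():    # budget exceeded: kill the combined run...
        proc.terminate()
        proc.join()
        return None        # ...and all partial results are lost
    return queue.get()
```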
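
For note 10, a node count over an expression tree might look like the following sketch; the nested-tuple encoding is ours, not any algorithm's actual internal representation.

```python
def node_count(expr):
    """Count every operator, variable and constant node in an expression tree."""
    if not isinstance(expr, tuple):  # leaf: a variable name or a constant
        return 1
    # one node for the operator plus the nodes of all its arguments
    return 1 + sum(node_count(arg) for arg in expr[1:])

# The example from note 3, 3*x^2 + 6*sin(1.3*x), as a nested tuple:
expr = ("+", ("*", 3, ("^", "x", 2)), ("*", 6, ("sin", ("*", 1.3, "x"))))
print(node_count(expr))  # 12
```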

References

  1. I. Arnaldo, K. Krawiec, U.M. O’Reilly, Multiple regression genetic programming, in Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, GECCO ’14 (ACM, New York, 2014), pp. 879–886. https://doi.org/10.1145/2576768.2598291

  2. I. Arnaldo, U.M. O’Reilly, K. Veeramachaneni, Building predictive models via feature synthesis, in Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, GECCO ’15 (ACM, New York, 2015), pp. 983–990. https://doi.org/10.1145/2739480.2754693

  3. K. Bache, M. Lichman, UCI machine learning repository (2013). http://archive.ics.uci.edu/ml. Accessed 30 Jan 2016

  4. V.V. De Melo, Kaizen programming, in Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, GECCO ’14 (ACM, New York, 2014), pp. 895–902. https://doi.org/10.1145/2576768.2598264

  5. EFS commit 6d991fa. http://github.com/flexgp/efs/tree/6d991fa. Accessed 12 Oct 2015

  6. FFX 1.3.4. http://pypi.python.org/pypi/ffx/1.3.4. Accessed 27 Aug 2015

  7. J. Friedman, T. Hastie, R. Tibshirani, Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1–22 (2010)

  8. A. Garg, A. Garg, K. Tai, A multi-gene genetic programming model for estimating stress-dependent soil water retention curves. Comput. Geosci. 18(1), 45–56 (2013). https://doi.org/10.1007/s10596-013-9381-z

  9. GSGP-Red commit 0e5f4d5. https://github.com/laic-ufmg/GSGP-Red/tree/0e5f4d5. Accessed 6 Dec 2018

  10. M. Hinchliffe, H. Hiden, B. McKay, M. Willis, M. Tham, G. Barton, Modelling chemical process systems using a multi-gene genetic programming algorithm, in Late Breaking Papers at GP’96, Stanford, USA (1996), pp. 56–65

  11. J.H. Holland, Adaptation in Natural and Artificial Systems (MIT Press, Cambridge, 1992)

  12. M. Keijzer, Scaled symbolic regression. Genet. Program. Evolvable Mach. 5(3), 259–269 (2004). https://doi.org/10.1023/B:GENP.0000030195.77571.f9

  13. M.F. Korns, Accuracy in symbolic regression, in Genetic Programming Theory and Practice IX (Springer, New York, 2011), pp. 129–151. https://doi.org/10.1007/978-1-4614-1770-5_8

  14. J.R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection (MIT Press, Cambridge, 1992)

  15. Evolved Analytics LLC, DataModeler [Software] (2016). http://www.evolved-analytics.com/. Accessed 14 Dec 2019

  16. S. Luke, L. Panait, Lexicographic parsimony pressure, in Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’02 (Morgan Kaufmann Publishers Inc., San Francisco, 2002), pp. 829–836. http://dl.acm.org/citation.cfm?id=646205.682619

  17. J.F.B.S. Martins, L.O.V.B. Oliveira, L.F. Miranda, F. Casadei, G.L. Pappa, Solving the exponential growth of symbolic regression trees in geometric semantic genetic programming, in Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’18 (ACM, New York, 2018), pp. 1151–1158. https://doi.org/10.1145/3205455.3205593

  18. T. McConaghy, FFX: fast, scalable, deterministic symbolic regression technology, in Genetic Programming Theory and Practice IX, Genetic and Evolutionary Computation, ed. by R. Riolo, E. Vladislavleva, J.H. Moore (Springer, New York, 2011), pp. 235–260. https://doi.org/10.1007/978-1-4614-1770-5_13

  19. J. McDermott, D.R. White, S. Luke, L. Manzoni, M. Castelli, L. Vanneschi, W. Jaskowski, K. Krawiec, R. Harper, K. De Jong, U.M. O’Reilly, Genetic programming needs better benchmarks, in Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation, GECCO ’12 (ACM, New York, 2012), pp. 791–798. https://doi.org/10.1145/2330163.2330273

  20. A. Moraglio, K. Krawiec, C. Johnson, Geometric semantic genetic programming, in Parallel Problem Solving from Nature - PPSN XII, Lecture Notes in Computer Science, vol. 7491, ed. C. Coello, V. Cutello, K. Deb, S. Forrest, G. Nicosia, M. Pavone (Springer, Berlin, 2012), pp. 21–31. https://doi.org/10.1007/978-3-642-32937-1_3

  21. J. Ni, R.H. Drieberg, P.I. Rockett, The use of an analytic quotient operator in genetic programming. IEEE Trans. Evol. Comput. 17(1), 146–152 (2013). https://doi.org/10.1109/TEVC.2012.2195319

  22. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  23. scikit-learn 0.17.1. https://pypi.python.org/pypi/scikit-learn/0.17.1. Accessed 21 Jun 2016

  24. M. Schmidt, H. Lipson, Distilling free-form natural laws from experimental data. Science 324(5923), 81–85 (2009)

  25. M. Schmidt, H. Lipson, Eureqa (Version 0.98 beta) [Software] (2014). www.nutonian.com

  26. D.P. Searson, GPTIPS 2 (2015). http://sites.google.com/site/gptips4matlab. Accessed 9 Jun 2015

  27. D.P. Searson, GPTIPS 2: An Open-Source Software Platform for Symbolic Data Mining (Springer International Publishing, Cham, 2015), pp. 551–573. https://doi.org/10.1007/978-3-319-20883-1_22

  28. D.P. Searson, D.E. Leahy, M.J. Willis, GPTIPS: an open source genetic programming toolbox for multigene symbolic regression, in Proceedings of the International MultiConference of Engineers and Computer Scientists, vol. 1 (2010), pp. 77–80

  29. G.F. Smits, M. Kotanchek, Pareto-front exploitation in symbolic regression, in Genetic Programming Theory and Practice II (Springer US, Boston, 2005), pp. 283–299. https://doi.org/10.1007/0-387-23254-0_17

  30. A. Tsanas, A. Xifara, Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools. Energy Build. 49, 560–567 (2012). https://doi.org/10.1016/j.enbuild.2012.03.003

  31. E. Vladislavleva, G. Smits, D. den Hertog, Order of nonlinearity as a complexity measure for models generated by symbolic regression via Pareto genetic programming. IEEE Trans. Evol. Comput. 13(2), 333–349 (2009). https://doi.org/10.1109/TEVC.2008.926486

  32. E. Vladislavleva, G. Smits, M. Kotanchek, Better Solutions Faster: Soft Evolution of Robust Regression Models in Pareto Genetic Programming (Springer US, Boston, 2008), pp. 13–32. https://doi.org/10.1007/978-0-387-76308-8_2

  33. T. Worm, Pypge. https://github.com/verdverm/pypge. Accessed 13 Dec 2019

  34. T. Worm, K. Chiu, Prioritized grammar enumeration: symbolic regression by dynamic programming, in Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, GECCO ’13 (ACM, New York, 2013), pp. 1021–1028. https://doi.org/10.1145/2463372.2463486

  35. I.C. Yeh, Modeling of strength of high-performance concrete using artificial neural networks. Cem. Concr. Res. 28(12), 1797–1808 (1998). https://doi.org/10.1016/S0008-8846(98)00165-3

  36. H. Zou, T. Hastie, Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 67(2), 301–320 (2005). https://doi.org/10.1111/j.1467-9868.2005.00503.x

Acknowledgements

Jan Žegklitz was supported by the Czech Science Foundation project No. 15-22731S. Petr Pošík was supported by the Grant Agency of the Czech Technical University in Prague, Grant No. SGS14/194/OHK3/3T/13.

Author information

Correspondence to Jan Žegklitz.


Cite this article

Žegklitz, J., Pošík, P. Benchmarking state-of-the-art symbolic regression algorithms. Genet Program Evolvable Mach 22, 5–33 (2021). https://doi.org/10.1007/s10710-020-09387-0
