
Symbolic Regression via Control Variable Genetic Programming

  • Conference paper
  • First Online:
Machine Learning and Knowledge Discovery in Databases: Research Track (ECML PKDD 2023)

Abstract

Learning symbolic expressions directly from experiment data is a vital step in AI-driven scientific discovery. Nevertheless, state-of-the-art approaches are limited to learning simple expressions; regressing expressions involving many independent variables remains out of reach. Motivated by the control variable experiments widely used in science, we propose Control Variable Genetic Programming (CVGP) for symbolic regression over many independent variables. CVGP expedites symbolic expression discovery via customized experiment design, rather than learning from a fixed dataset collected a priori. CVGP starts by fitting simple expressions involving a small set of independent variables using genetic programming, under controlled experiments in which the other variables are held constant. It then extends the expressions learned in previous generations by adding new independent variables, using new control variable experiments in which these variables are allowed to vary. Theoretically, we show that CVGP, as an incremental building approach, can yield an exponential reduction in the search space when learning a class of expressions. Experimentally, CVGP outperforms several baselines in learning symbolic expressions involving multiple independent variables.
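The abstract outlines CVGP's incremental, control-variable loop. The sketch below illustrates that loop in Python under strong simplifications: the genetic-programming search is replaced by a tiny least-squares fit over a fixed term library so the example stays self-contained, and all names (`generate_controlled_data`, `fit_over_term_library`, `control_variable_regression`) are hypothetical placeholders rather than the authors' API; their actual implementation is linked in the Notes below.

```python
# Minimal, illustrative sketch of the control-variable loop described in the
# abstract. The GP search step is replaced by a least-squares fit over a small
# term library purely to keep the example runnable; names are hypothetical.
import itertools
import numpy as np


def generate_controlled_data(true_fn, n_vars, free_vars, n_samples=200, seed=0):
    """Controlled experiment: variables in `free_vars` vary, all others are
    held constant (here at 1.0)."""
    rng = np.random.default_rng(seed)
    X = np.ones((n_samples, n_vars))
    for j in free_vars:
        X[:, j] = rng.uniform(0.1, 2.0, size=n_samples)
    return X, true_fn(X)


def fit_over_term_library(X, y, free_vars, max_terms=2):
    """Stand-in for the GP search: choose the best small subset of candidate
    terms (built only from the currently free variables) by least squares."""
    terms = [("1", np.ones(len(X)))]
    for j in free_vars:
        terms += [(f"x{j}", X[:, j]), (f"x{j}^2", X[:, j] ** 2)]
    for j, k in itertools.combinations(free_vars, 2):
        terms.append((f"x{j}*x{k}", X[:, j] * X[:, k]))
    best = None
    for subset in itertools.combinations(range(len(terms)), max_terms):
        A = np.column_stack([terms[i][1] for i in subset])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        err = np.mean((A @ coef - y) ** 2)
        if best is None or err < best[0]:
            best = (err, [terms[i][0] for i in subset], coef)
    return best


def control_variable_regression(true_fn, n_vars):
    """Free one variable at a time, gather fresh controlled data, and refit an
    expression that extends the previous one, as the abstract describes."""
    free_vars, model = [], None
    for j in range(n_vars):
        free_vars.append(j)  # allow one more independent variable to vary
        X, y = generate_controlled_data(true_fn, n_vars, free_vars, seed=j)
        model = fit_over_term_library(X, y, free_vars, max_terms=len(free_vars) + 1)
        print(f"free vars {free_vars}: terms={model[1]}, mse={model[0]:.2e}")
    return model


if __name__ == "__main__":
    # Ground-truth expression with three independent variables: 2*x0*x1 + x2.
    control_variable_regression(lambda X: 2.0 * X[:, 0] * X[:, 1] + X[:, 2], n_vars=3)
```

Running it on the three-variable ground truth above shows the pattern the abstract describes: each round frees one more variable, collects new controlled data, and refits an expression that builds on the one learned in the previous round.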


Notes

  1. The code is at: https://github.com/jiangnanhugo/cvgp/. Please refer to the extended version (https://arxiv.org/abs/2306.08057) for the Appendix.


Acknowledgments

We thank all the reviewers for their constructive comments. This research was supported by NSF grant CCF-1918327.

Author information

Corresponding author

Correspondence to Nan Jiang.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Jiang, N., Xue, Y. (2023). Symbolic Regression via Control Variable Genetic Programming. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol 14172. Springer, Cham. https://doi.org/10.1007/978-3-031-43421-1_11

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-43421-1_11

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43420-4

  • Online ISBN: 978-3-031-43421-1

  • eBook Packages: Computer Science, Computer Science (R0)
