Automated learning of interpretable models with quantified uncertainty

https://doi.org/10.1016/j.cma.2022.115732

Abstract

Interpretability and uncertainty quantification in machine learning can provide justification for decisions, promote scientific discovery and lead to a better understanding of model behavior. Symbolic regression provides inherently interpretable machine learning, but relatively little work has focused on the use of symbolic regression on noisy data and the accompanying necessity to quantify uncertainty. A new Bayesian framework for genetic-programming-based symbolic regression (GPSR) is introduced that uses model evidence (i.e., marginal likelihood) to formulate replacement probability during the selection phase of evolution. Model parameter uncertainty is automatically quantified, enabling probabilistic predictions with each equation produced by the GPSR algorithm. Model evidence is also quantified in this process, and its use is shown to increase interpretability, improve robustness to noise, and reduce overfitting when compared to a conventional GPSR implementation on both numerical and physical experiments.

Introduction

Machine learning (ML) has become ubiquitous in scientific disciplines. In some applications, accurate data-driven predictions are all that is required; however, in many others, the interpretability and explainability of the model are equally important. Interpretability and explainability can provide justification for decisions, promote scientific discovery, and ultimately lead to better control and improvement of models [1], [2]. In a complementary fashion, ML models can provide further insight by conveying their level of uncertainty in predictions [3]. Especially in cases of low risk tolerance, this type of insight is crucial for building trust in ML models [4].

Rather than focus on black-box ML methods (e.g., neural networks or Gaussian process regression) combined with post hoc explainability tools, the current work focuses on inherently interpretable methods. Interpretable ML methods can be competitive with black-box ML in terms of accuracy and do not require a separate explainability toolkit [4], [5]. Symbolic regression is one such inherently interpretable form of ML wherein an analytic equation is produced that best models input data. Symbolic regression has been successful in a range of scientific applications such as deriving conservation laws in physics [6], inferring dynamic relationships [7], [8], and producing interpretable mechanics models [9]. Unfortunately, little attention has been paid to the use of symbolic regression on noisy data and the consideration of uncertainty.

Schmidt and Lipson [10] tackled the problem of noisy training data in symbolic regression by including uniform random variables in model formation. Though uniform random variables can be transformed to represent more complex distributions, doing so drastically increases the complexity of the equations that must be produced, making the symbolic regression process less tractable and less interpretable.

Hirsh et al. [11] incorporated Bayesian uncertainty quantification into the sparse identification of nonlinear dynamics (SINDy) method through the use of sparsifying priors. In this technique, a linear combination of candidate terms (i.e., simple functions of the input data) is produced with random coefficients that are estimated through Bayesian inference. The reliance on candidate terms and linear combinations thereof constitutes only a limited form of symbolic regression (as opposed to the more traditional free-form symbolic regression). As such, the form of the resulting equation may be overly constrained and less insightful.
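To make this linear-in-parameters setup concrete, the following is a minimal sketch of library-based Bayesian model discovery: coefficients of a fixed set of candidate terms are inferred by Bayesian linear regression. For simplicity it uses an ordinary Gaussian prior in place of the sparsifying priors of Hirsh et al. [11]; the library, synthetic data, and precision values `alpha` and `beta` are illustrative assumptions, not the authors' choices.

```python
import numpy as np

# Synthetic data from y = 1.5*x - 0.8*x^3 plus Gaussian noise (illustrative).
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=200)
y = 1.5 * x - 0.8 * x**3 + rng.normal(0.0, 0.1, size=x.size)

# Candidate-term library: the model is constrained to linear combinations of these.
Theta = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Conjugate Gaussian prior and likelihood give a closed-form posterior:
#   covariance S = (alpha*I + beta*Theta^T Theta)^-1,  mean m = beta * S Theta^T y
alpha, beta = 1.0, 1.0 / 0.1**2  # prior precision, noise precision (assumed known)
S = np.linalg.inv(alpha * np.eye(4) + beta * Theta.T @ Theta)
m = beta * S @ Theta.T @ y

# Each coefficient comes with a posterior standard deviation, i.e., quantified uncertainty.
for name, mean, std in zip(["1", "x", "x^2", "x^3"], m, np.sqrt(np.diag(S))):
    print(f"coef[{name}] = {mean:+.3f} +/- {std:.3f}")
```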

Others have implemented Bayesian methods in symbolic regression [12], [13]; however, they focused on improving the efficiency of symbolic regression rather than on producing probabilistic models with quantified uncertainty. For instance, Jin et al. [12] used a form of Markov chain Monte Carlo as a means of equation production, and Zhang [13] used a Bayesian framework to influence the population dynamics in genetic programming for faster evolution and decreased complexity.

In the current work, a new Bayesian framework for genetic-programming-based symbolic regression (GPSR) is developed. In this framework, Bayesian inference is applied to infer unknown distributions of parameters in free-form equations. The marginal likelihoods of the equations are then used in a Bayesian model selection scheme to influence evolution towards equations for which the data provides the most evidence. The result is a GPSR framework that can produce interpretable models with quantified uncertainty. Additionally, the Bayesian framework provides regularization with several benefits compared to standard GPSR: increased interpretability, increased robustness to noise, and a reduced tendency to overfit.
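As an illustration of how model evidence can drive selection, the sketch below computes a replacement probability for a child equation competing against its parent from their log marginal likelihoods, assuming equal prior model probabilities. This is one plausible form of such a rule, not necessarily the exact formulation used in the proposed framework.

```python
import numpy as np

def replacement_probability(log_evidence_child: float,
                            log_evidence_parent: float) -> float:
    """Posterior probability of the child model under equal model priors:
    p(child | D) = p(D | child) / (p(D | child) + p(D | parent)).
    Computed in log space for numerical stability."""
    m = max(log_evidence_child, log_evidence_parent)
    num = np.exp(log_evidence_child - m)
    den = num + np.exp(log_evidence_parent - m)
    return num / den

# A child whose marginal likelihood is e^2 times the parent's
# replaces it with probability ~0.881.
p = replacement_probability(-10.0, -12.0)
print(f"replacement probability: {p:.3f}")
```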

Section snippets

Methods

Symbolic regression is the search for analytic equations that best describe some dataset: i.e., attempting to find a function $f:\mathbb{R}^d \rightarrow \mathbb{R}$ such that $f(\mathbf{x}) = y$ given a dataset $\mathcal{D} \equiv \{(\mathbf{x}_i, y_i)\}_{i=0}^{N}$ with $d$-dimensional input features $\mathbf{x}$ and label $y$. Several methodologies have been applied to the task of free-form symbolic regression such as genetic programming [14], prioritized grammar enumeration [15], Markov chain Monte Carlo [12], divide and conquer [16], and deep learning [17], [18]. Genetic programming-based …
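For intuition, a minimal sketch of how genetic programming typically represents candidate equations: as expression trees that can be evaluated on input data. This is a toy encoding for illustration, not the representation used by any particular GPSR library.

```python
import numpy as np

def evaluate(node, x):
    """Recursively evaluate a nested-tuple expression tree at inputs x."""
    op, *args = node
    if op == "x":
        return x
    if op == "const":
        return np.full_like(x, args[0])
    if op == "add":
        return evaluate(args[0], x) + evaluate(args[1], x)
    if op == "mul":
        return evaluate(args[0], x) * evaluate(args[1], x)
    raise ValueError(f"unknown operator: {op}")

# f(x) = 1.5*x + x, encoded as a tree; GP evolves such trees via
# crossover and mutation, while constants (here 1.5) are fit to data.
tree = ("add", ("mul", ("const", 1.5), ("x",)), ("x",))
x = np.linspace(0.0, 1.0, 5)
print(evaluate(tree, x))  # [0.    0.625 1.25  1.875 2.5  ]
```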

Experiments and discussion

In this section, Bayesian GPSR is applied first to a numerical example, then to benchmark problems from SRBENCH [19], [36], and finally to an experimental example. The ability of Bayesian GPSR to produce interpretable models and make probabilistic predictions is illustrated. Comparisons with conventional GPSR demonstrate several benefits of the Bayesian extension beyond its ability to produce probabilistic predictions.

Conclusion

A new Bayesian framework for genetic-programming-based symbolic regression (GPSR) was developed. For each equation in the population, Bayesian inference was used to estimate probability density functions of the unknown constants given the available data. This automatic quantification of uncertainty meant that any equation could be used to make probabilistic predictions using, for example, Monte Carlo simulation. As a byproduct of this process, the normalized marginal likelihood of the …
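As an illustration of such Monte Carlo prediction, the sketch below propagates hypothetical posterior samples of an equation's constant through the equation to obtain a predictive mean and credible band. Both the equation f(x; c) = c*x^2 and the Gaussian posterior are assumptions for illustration; in practice the samples would come from Bayesian inference over the constants of a GPSR-produced equation.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical posterior samples of a single constant c (stand-in for
# samples produced by Bayesian inference, e.g. MCMC).
posterior_samples = rng.normal(loc=1.5, scale=0.1, size=1000)

x = np.linspace(0.0, 2.0, 50)
preds = np.array([c * x**2 for c in posterior_samples])  # shape (1000, 50)

# Predictive mean and 95% credible band at each input location.
mean = preds.mean(axis=0)
lo, hi = np.percentile(preds, [2.5, 97.5], axis=0)
print(mean[-1], lo[-1], hi[-1])
```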

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (44)

  • Bomarito, G.F., et al., Development of interpretable, data-driven plasticity models with symbolic regression, Comput. Struct. (2021)
  • Adadi, Amina, et al., Peeking inside the black-box: A survey on explainable artificial intelligence (XAI), IEEE Access (2018)
  • Du, Mengnan, et al., Techniques for interpretable machine learning, Commun. ACM (2019)
  • Umang Bhatt, Alice Xiang, Shubham Sharma, Adrian Weller, Ankur Taly, Yunhan Jia, Joydeep Ghosh, Ruchir Puri, José M.F. …
  • Rudin, C., Please stop explaining black box models for high stakes decisions (2018)
  • Rudin, Cynthia, et al., Why are we using black box models in AI when we don't need to? A lesson from an explainable AI competition, Harvard Data Sci. Rev. (2019)
  • Schmidt, Michael, et al., Distilling free-form natural laws from experimental data, Science (2009)
  • Brunton, Steven L., et al., Discovering governing equations from data by sparse identification of nonlinear dynamical systems, Proc. Natl. Acad. Sci. (2016)
  • Galioto, Nicholas, et al., Bayesian system ID: Optimal management of parameter, model, and measurement uncertainty, Nonlinear Dynam. (2020)
  • Michael D. Schmidt, Hod Lipson, Learning noise, in: Proceedings of the 9th Annual Conference on Genetic and …
  • Hirsh, Seth M., et al., Sparsifying priors for Bayesian uncertainty quantification in model discovery (2021)
  • Jin, Ying, et al., Bayesian symbolic regression (2019)
  • Zhang, Byoung-Tak, Bayesian methods for efficient genetic programming, Genet. Program. Evol. Mach. (2000)
  • Koza, John R., et al., Genetic Programming: On the Programming of Computers by Means of Natural Selection, Vol. 1 (1992)
  • Tony Worm, Kenneth Chiu, Prioritized grammar enumeration: Symbolic regression by dynamic programming, in: Proceedings …
  • Udrescu, Silviu-Marian, et al., AI Feynman: A physics-inspired method for symbolic regression, Sci. Adv. (2020)
  • Petersen, Brenden K., et al., Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients (2019)
  • Valipour, Mojtaba, et al., SymbolicGPT: A generative transformer model for symbolic regression (2021)
  • La Cava, William, et al., Contemporary symbolic regression methods and their relative performance (2021)
  • Bomarito, Geoffrey, Bingo (2022)
  • Michael Schmidt, Hod Lipson, Comparison of tree and graph encodings as function of problem complexity, in: Proceedings …
  • Michael Kommenda, Gabriel Kronberger, Stephan Winkler, Michael Affenzeller, Stefan Wagner, Effects of constant …