Automated learning of interpretable models with quantified uncertainty
Introduction
Machine learning (ML) has become ubiquitous in scientific disciplines. In some applications, accurate data-driven predictions are all that is required; however, in many others, the interpretability and explainability of the model are equally important. Interpretability and explainability can provide justification for decisions, promote scientific discovery and ultimately lead to better control/improvement of models [1], [2]. In a complementary fashion, ML models can provide further insight by conveying their level of uncertainty in predictions [3]. Especially in cases of low risk tolerance, this type of insight is crucial for building trust in ML models [4].
Rather than focus on black-box ML methods (e.g., neural networks or Gaussian process regression) combined with post hoc explainability tools, the current work focuses on inherently interpretable methods. Interpretable ML methods can be competitive with black-box ML in terms of accuracy and do not require a separate explainability toolkit [4], [5]. Symbolic regression is one such inherently interpretable form of ML wherein an analytic equation is produced that best models input data. Symbolic regression has been successful in a range of scientific applications such as deriving conservation laws in physics [6], inferring dynamic relationships [7], [8], and producing interpretable mechanics models [9]. Unfortunately, little attention has been paid to the use of symbolic regression on noisy data and the consideration of uncertainty.
Schmidt and Lipson [10] tackled the problem of noisy training data in symbolic regression through the inclusion of uniform random variables in model formation. Though uniform random variables can be transformed to represent more complex distributions, doing so drastically increases the complexity of the equations that must be produced. This can make the symbolic regression process less tractable and less interpretable.
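As a concrete illustration of such a transformation, a uniform random variable can be mapped to an arbitrary target distribution by inverse transform sampling. The sketch below (plain NumPy; the exponential target and rate are illustrative choices, not from the cited work) shows the extra functional machinery an equation must carry to encode even a simple non-uniform distribution:

```python
import numpy as np

# Inverse transform sampling: if U ~ Uniform(0, 1) and F is a target CDF,
# then X = F^{-1}(U) follows the target distribution. Example target:
# an exponential distribution with rate lam, where F^{-1}(u) = -ln(1-u)/lam.
rng = np.random.default_rng(0)
lam = 2.0
u = rng.uniform(0.0, 1.0, size=100_000)
x = -np.log(1.0 - u) / lam

# The sample mean should be close to the exponential mean, 1/lam = 0.5.
print(x.mean())
```

Every such transform (here, a logarithm and a rescaling) becomes additional structure the evolutionary search must discover, which is the tractability cost noted above.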
Hirsh et al. [11] incorporated Bayesian uncertainty quantification into the sparse identification of nonlinear dynamics (SINDy) method through the use of sparsifying priors. In this technique, a linear combination of candidate terms (i.e., simple functions of the input data) is produced with random coefficients that are estimated through Bayesian inference. The reliance on candidate terms and linear combinations thereof constitutes only a limited form of symbolic regression (as opposed to the more traditional free-form symbolic regression). As such, the form of the resulting equation may be overly constrained and less insightful.
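The candidate-library idea underlying SINDy-style methods can be sketched as follows. This is only a minimal illustration using ordinary least squares with sequential thresholding (as in the original SINDy); Hirsh et al. instead place sparsifying priors on the coefficients and infer them with Bayesian methods. The data-generating equation, noise level, and threshold are illustrative assumptions:

```python
import numpy as np

# Candidate-library regression: model y as a sparse linear combination of
# fixed candidate terms. Sparsity here comes from sequential thresholding;
# a Bayesian variant would place sparsifying priors on the coefficients.
rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, size=200)
y = 1.5 * x - 0.5 * x**3 + rng.normal(0.0, 0.05, size=x.size)  # assumed truth

theta = np.column_stack([np.ones_like(x), x, x**2, x**3])  # library [1, x, x^2, x^3]
coef, *_ = np.linalg.lstsq(theta, y, rcond=None)
for _ in range(5):                       # zero out small terms, then refit
    coef[np.abs(coef) < 0.1] = 0.0
    active = coef != 0.0
    sol, *_ = np.linalg.lstsq(theta[:, active], y, rcond=None)
    coef[active] = sol

print(coef)  # sparse coefficients for [1, x, x^2, x^3]
```

Because the model is constrained to linear combinations of the prescribed library terms, the recovered equation can only ever be as expressive as the library, which is the limitation noted above relative to free-form symbolic regression.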
Others have implemented Bayesian methods in symbolic regression [12], [13]; however, they focused on improving the efficiency of symbolic regression rather than on producing probabilistic models with quantified uncertainty. For instance, Jin et al. [12] used a form of Markov chain Monte Carlo as a means for equation production, and Zhang [13] used a Bayesian framework to influence the population dynamics in genetic programming for improved evolution speed and decreased complexity.
In the current work, a new Bayesian framework for genetic-programming-based symbolic regression (GPSR) is developed. In this framework, Bayesian inference is applied to infer unknown distributions of parameters in free-form equations. The marginal likelihood of each equation is then used in a Bayesian model selection scheme to steer evolution toward equations for which the data provide the most evidence. The result is a GPSR framework that can produce interpretable models with quantified uncertainty. Additionally, the Bayesian framework provides regularization with several benefits over standard GPSR: increased interpretability, increased robustness to noise, and less tendency to overfit.
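The role of the marginal likelihood in model selection can be sketched for the special case of equations that are linear in their unknown constants, where the evidence has a closed form (the standard Bayesian linear regression evidence; the prior precision `alpha` and noise precision `beta` are illustrative assumptions, and the paper's framework is not restricted to this linear-in-constants case):

```python
import numpy as np

def log_evidence(Phi, y, alpha=1.0, beta=100.0):
    """Log marginal likelihood of y = Phi @ w + noise with Gaussian
    prior w ~ N(0, alpha^-1 I) and noise ~ N(0, beta^-1 I)."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi        # posterior precision
    m = beta * np.linalg.solve(A, Phi.T @ y)          # posterior mean
    E = 0.5 * beta * np.sum((y - Phi @ m) ** 2) + 0.5 * alpha * m @ m
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta) - E
            - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))

rng = np.random.default_rng(2)
x = rng.uniform(-2.0, 2.0, size=200)
y = 2.0 * x + rng.normal(0.0, 0.1, size=x.size)       # assumed true model: y = 2x

ev_linear = log_evidence(np.column_stack([x]), y)
ev_cubic = log_evidence(np.column_stack([x, x**2, x**3]), y)
print(ev_linear > ev_cubic)
```

The evidence automatically penalizes the superfluous terms of the cubic candidate, which is the Occam-razor effect that drives the evolution toward equations the data best support.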
Methods
Symbolic regression is the search for analytic equations that best describe a dataset: i.e., attempting to find a function f such that y ≈ f(x), given a dataset {(x_i, y_i)} with d-dimensional input features x_i and labels y_i. Several methodologies have been applied to the task of free-form symbolic regression, such as genetic programming [14], prioritized grammar enumeration [15], Markov chain Monte Carlo [12], divide and conquer [16], and deep learning [17], [18]. Genetic-programming-based symbolic regression (GPSR), the approach extended here, evolves a population of candidate equations through operations analogous to natural selection.
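A toy illustration of symbolic regression as search over expressions: real GPSR evolves expression trees with genetic operators, whereas this sketch simply enumerates depth-two combinations of an assumed primitive set and keeps the best fit:

```python
import itertools
import numpy as np

# Minimal symbolic-regression search: enumerate small expressions over a
# primitive set and select the one minimizing mean squared error.
UNARY = {"sin": np.sin, "exp": np.exp, "neg": np.negative}
BINARY = {"add": np.add, "mul": np.multiply}

def candidates(x):
    """Yield (description, prediction) pairs for depth-two expressions."""
    terminals = {"x": x, "x^2": x * x, "1": np.ones_like(x)}
    for name, arr in terminals.items():
        yield name, arr
    for (na, a), (nb, b) in itertools.product(terminals.items(), repeat=2):
        for op, fn in BINARY.items():
            yield f"{op}({na},{nb})", fn(a, b)
    for na, a in terminals.items():
        for op, fn in UNARY.items():
            yield f"{op}({na})", fn(a)

x = np.linspace(-1.0, 1.0, 50)
y = x * x + x                                  # target: x^2 + x
best = min(candidates(x), key=lambda c: np.mean((c[1] - y) ** 2))
print(best[0])  # an exact fit such as add(x,x^2)
```

Exhaustive enumeration is only tractable for tiny primitive sets and depths; genetic programming replaces it with a guided stochastic search over the same space of expression trees.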
Experiments and discussion
In this section, Bayesian GPSR is applied first to a numerical example, then to benchmark problems from SRBENCH [19], [36], and finally to an experimental example. The ability of Bayesian GPSR to produce interpretable models and make probabilistic predictions is illustrated. Comparisons with conventional GPSR illustrate several benefits of the Bayesian extension beyond its ability to produce probabilistic predictions.
Conclusion
A new Bayesian framework for genetic-programming-based symbolic regression (GPSR) was developed. For each equation in the population, Bayesian inference was used to estimate probability density functions of the unknown constants given the available data. This automatic quantification of uncertainty meant that any equation could be used to make probabilistic predictions using, for example, Monte Carlo simulation. As a byproduct of this process, the normalized marginal likelihood of each equation was obtained and used in a Bayesian model selection scheme to steer the evolution toward equations for which the data provide the most evidence.
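The Monte Carlo prediction step described above can be sketched as follows. The Gaussian posterior over the equation's constants and the example equation c1*x + c2*x^3 are stand-in assumptions for illustration, not outputs of the paper's inference:

```python
import numpy as np

# Propagate posterior uncertainty in an equation's constants to its
# predictions: sample the constants, evaluate the equation per sample,
# then summarize the resulting predictive distribution.
rng = np.random.default_rng(3)
posterior_mean = np.array([1.5, -0.5])         # assumed posterior for (c1, c2)
posterior_cov = np.diag([0.02**2, 0.01**2])    # assumed posterior covariance

x = np.linspace(-2.0, 2.0, 5)
draws = rng.multivariate_normal(posterior_mean, posterior_cov, size=5000)
preds = draws[:, [0]] * x + draws[:, [1]] * x**3     # (5000, 5) predictions

mean = preds.mean(axis=0)                            # predictive mean
lo, hi = np.percentile(preds, [2.5, 97.5], axis=0)   # 95% credible band
print(mean, lo, hi)
```

Any equation in the population can be treated this way once its constants have posteriors, which is what makes every candidate a probabilistic model.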
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (44)
- et al., Development of interpretable, data-driven plasticity models with symbolic regression, Comput. Struct. (2021)
- et al., Peeking inside the black-box: A survey on explainable artificial intelligence (XAI), IEEE Access (2018)
- et al., Techniques for interpretable machine learning, Commun. ACM (2019)
- Umang Bhatt, Alice Xiang, Shubham Sharma, Adrian Weller, Ankur Taly, Yunhan Jia, Joydeep Ghosh, Ruchir Puri, José M.F. ..., Please stop explaining black box models for high stakes decisions (2018)
- et al., Why are we using black box models in AI when we don't need to? A lesson from an explainable AI competition, Harvard Data Sci. Rev. (2019)
- et al., Distilling free-form natural laws from experimental data, Science (2009)
- et al., Discovering governing equations from data by sparse identification of nonlinear dynamical systems, Proc. Natl. Acad. Sci. (2016)
- et al., Bayesian system ID: Optimal management of parameter, model, and measurement uncertainty, Nonlinear Dynam. (2020)
- Michael D. Schmidt, Hod Lipson, Learning noise, in: Proceedings of the 9th Annual Conference on Genetic and ...
- Sparsifying priors for Bayesian uncertainty quantification in model discovery
- Bayesian symbolic regression
- Bayesian methods for efficient genetic programming, Genet. Program. Evol. Mach.
- Genetic Programming: On the Programming of Computers by Means of Natural Selection, vol. 1
- AI Feynman: A physics-inspired method for symbolic regression, Sci. Adv.
- Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients
- SymbolicGPT: A generative transformer model for symbolic regression
- Contemporary symbolic regression methods and their relative performance
- Bingo
Cited by (4)
- Discovering stochastic partial differential equations from limited data using variational Bayes inference, Computer Methods in Applied Mechanics and Engineering (2024)
- Model-driven identification framework for optimal constitutive modeling from kinematics and rheological arrangement, Computer Methods in Applied Mechanics and Engineering (2023)
- Priors For Symbolic Regression, GECCO 2023 Companion - Proceedings of the 2023 Genetic and Evolutionary Computation Conference Companion (2023)
- Priors For Symbolic Regression, arXiv (2023)
1 These authors contributed equally.