Mechanism discovery and model identification using genetic feature extraction and statistical testing
Introduction
The lack of mechanistic or causal interpretation and explanation in conventional machine learning (ML) regression approaches is a serious drawback in general, and a particular concern in safety-critical applications. Machine learning results often offer no insight into how they were obtained, hence their ‘black-box’ character (Venkatasubramanian, 2019; Psichogios and Ungar, 1992; Gawthrop et al., 1993; Gray et al., 1998). The limitations of popular model-free parametric regression strategies such as neural networks and genetic programming symbolic regression (GPSR) are two-fold: (i) they provide complicated and uninterpretable input-output relationships, and (ii) they do not capture parameter sensitivity. These limitations hinder the usability of such models, especially for identifying underlying mechanisms. Such models often develop intractable functional transformations (Wang et al., 2019), which further limits explainability and promotes overfitting. Neural networks, for example, might use relatively simple activation functions, but their multi-layered architectures often lead to complicated hidden representations that can be hard to interpret and explain (Sivaram, Das, Venkatasubramanian, 2020a; Sivaram et al., 2020; Tu, 1996). GPSR often results in complicated functional forms that have no first-principles mechanisms to support them (Augusto and Barbosa, 2000).
One way to address the interpretability issue is to develop hybrid models (Psichogios, Ungar, 1992; Thompson, Kramer, 1994; Sundaram, Ghosh, Caruthers, Venkatasubramanian, 2001; Penha, 2002; Tan, Li, 2002; Caruthers, Lauterbach, Thomson, Venkatasubramanian, Snively, Bhan, Katare, Oskarsdottir, 2003; te Braake, 1997; Johansen, 1994). Hybrid models combine first-principles concepts with data-driven techniques (Penha, 2002; Tholudur, Ramirez, 1996). Typically, first-principles relations such as conservation equations are augmented with data-driven models like neural networks (Tholudur, Ramirez, 1996; Simutis, Lubbert, 1997; Can, Hellinga, Luyben, Heijnen, Braake, 1996; Madar, Abonyi, Szeifert, 2005) and fuzzy logic (van Lith, Betlem, Roffel, 2002; van Lith, Betlem, Roffel, 2003) for parameter estimation. This preserves a first-principles understanding of the phenomena, while confining much of the uninterpretability to the identification of feature-parameter relationships. This, however, comes at the cost of insight into the causal relationships in the data.
In this paper, we address a different challenge: can we develop a machine learning system that automatically identifies the first-principles-based models that capture the underlying physical, chemical, and/or biological mechanisms generating the data? In some sense, this is what science is all about: understanding the world around us in terms of a few fundamental principles and mechanisms expressed mathematically. This challenge differs from hybrid modeling in that the first-principles model itself is not known a priori but must be determined from data (Schmidt and Lipson, 2009a). We address it in a manner similar to how humans reason about and discover mechanisms and models, guided by intelligent guesses, heuristics, and inferences. For example, in reaction kinetics, the reaction rates are driven by the concentrations of the various species in the system. It is hence advisable to seek relationships between species concentrations and rates, even when the actual functional form is unknown. In many chemical engineering applications, phenomena driven by the underlying first principles of physics and chemistry yield equations that are composed of elementary functional forms. Even for more complicated cases, the proposed approach could yield simpler, reduced-order models that are easier to interpret and to gain insights from, without sacrificing too much accuracy.
We use a genetic algorithm to identify the simpler, elementary functional forms of a model, guided by domain knowledge, that can be directly coupled to first-principles-based physico-chemical mechanisms. For instance, in reaction engineering, our domain knowledge informs us that rate variables, concentration variables, and certain mathematical functional forms are important. In our framework, we exploit this understanding by incorporating such knowledge in the creation of the library, or pool, of elementary functions to work with. This is augmented by statistical regression using the identified functions (Rogers, Rogers, Hopfinger, 1994). In conventional genetic function approximation methods, the tree representation of the structure gives rise to unnecessarily complex functions, which we would like to avoid. The use of statistical methods, such as ordinary least squares (OLS), partial least squares (PLS) or least absolute shrinkage and selection operator (LASSO) regression, provides the coefficients needed to determine the final functional form, along with an estimation of errors. Thus, one can understand which inputs are most strongly predictive of an outcome based on the coefficients of the functions in the obtained functional form. This approach has the benefit of restricting the complexity of the model obtained. Excessive complexity is often a problem with conventional GPSR algorithms, where the resultant composite functions must be simplified to a great extent to uncover the actual function (McKay, Willis, Barton, 1997; Babu, Karthik, 2007; Barmpalexis, Kachrimanis, Tsakonas, Georgarakis, 2011; Goldstein, Coco, Murray, 2013; Yang, Li, Wang, Lian, Ma, 2015). Further, such algorithms do not determine confidence levels for the parameter estimates.
For example, Eureqa® (Schmidt and Lipson, 2013), an evolutionary algorithm-based model identification engine, predicts functional forms from data, often complex ones (Goldstein, Coco, Murray, 2013; Tinoco, Goldstein, Coco, 2015), but it is unable to provide the statistical significance of the estimated parameters.
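The workflow described in the two paragraphs above (a library of elementary functions, a genetic search over combinations of them, and least-squares regression with significance estimates) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the library contents, the names `LIBRARY`, `fit`, `score` and `ga_search`, the BIC-like fitness, the cap of two terms, and the synthetic second-order rate data are all hypothetical choices made for the example.

```python
import math
import random

# Hypothetical library of elementary functions of a concentration C.
LIBRARY = {
    "C": lambda c: c,
    "C^2": lambda c: c * c,
    "sqrt(C)": math.sqrt,
    "1/C": lambda c: 1.0 / c,
    "exp(-C)": lambda c: math.exp(-c),
    "ln(C)": math.log,
}
NAMES = list(LIBRARY)
MAX_TERMS = 2  # assumed cap on model size, to keep the functional form simple


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit(names, C, y):
    """OLS without intercept; returns coefficients, RSS and t-statistics."""
    cols = [[LIBRARY[nm](c) for c in C] for nm in names]
    k, n = len(cols), len(y)
    XtX = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    Xty = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    beta = solve(XtX, Xty)
    pred = [sum(b * col[i] for b, col in zip(beta, cols)) for i in range(n)]
    rss = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    s2 = rss / (n - k)  # residual variance estimate
    # j-th diagonal entry of (X^T X)^-1, one linear solve per coefficient
    diag = [solve(XtX, [float(i == j) for i in range(k)])[j] for j in range(k)]
    t = [b / math.sqrt(s2 * d) for b, d in zip(beta, diag)]
    return beta, rss, t


def score(mask, C, y):
    """BIC-like score (lower is better): goodness of fit plus a size penalty."""
    names = [nm for nm, on in zip(NAMES, mask) if on]
    if not 1 <= len(names) <= MAX_TERMS:
        return float("inf")
    _, rss, _ = fit(names, C, y)
    n = len(y)
    return n * math.log(rss / n + 1e-12) + len(names) * math.log(n)


def ga_search(C, y, generations=30, pop_size=20, seed=0):
    """Genetic search over 0/1 masks that switch library functions on or off."""
    rng = random.Random(seed)
    L = len(NAMES)
    # Start from every single-term model, then pad with random masks.
    pop = [tuple(int(i == j) for i in range(L)) for j in range(L)]
    while len(pop) < pop_size:
        pop.append(tuple(rng.randint(0, 1) for _ in range(L)))
    for _ in range(generations):
        pop.sort(key=lambda m: score(m, C, y))
        parents = pop[: pop_size // 2]  # truncation selection; keeps the elite
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            j = rng.randrange(L)
            child[j] = 1 - child[j]  # single bit-flip mutation
            children.append(tuple(child))
        pop = parents + children
    return min(pop, key=lambda m: score(m, C, y))


# Synthetic data (assumed): second-order kinetics, rate = 2 C^2, plus small noise.
noise = random.Random(1)
C = [0.5 + 2.5 * i / 29 for i in range(30)]
rate = [2.0 * c * c + noise.gauss(0.0, 0.02) for c in C]

best = ga_search(C, rate)
terms = [nm for nm, on in zip(NAMES, best) if on]
beta, rss, t = fit(terms, C, rate)
for nm, b, tj in zip(terms, beta, t):
    print(f"{nm}: coefficient={b:.3f}, t-statistic={tj:.1f}")
```

A large absolute t-statistic marks a term whose coefficient is well determined by the data, while a small one flags a term that may be fitting noise. Capping the number of terms and penalizing model size is one simple way to restrict complexity; other fitness definitions would serve the same purpose.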
This paper is organized as follows. We first motivate the need for elementary functional forms commonly seen in chemical engineering, such as in reaction systems, and then discuss our Genetic Feature Extraction and Statistical Testing (GFEST) framework. We demonstrate it on a series of case studies for reaction systems, starting with single-input single-output systems and followed by multiple-input single-output systems. We conclude the article with an emphasis on future directions.
Section snippets
GFEST methodology
In our approach, the simplest model form (as justified by Occam’s Razor) that fits the data adequately is obtained using a genetic algorithm (GA) that searches through a space of allowed elementary functions and their combinations. The elementary functions in the GA function pool, or library, were selected based on our domain knowledge of chemical engineering, that is, on knowing which functional forms and combinations of them often appear in our models. These, in turn, are based on our
Case studies: Noiseless data
In this section, we discuss the physico-chemical mechanisms on which GFEST has been tested. These span fundamental areas of chemical engineering such as reaction engineering and transport phenomena. The objective was that, given data, the algorithm must be able to obtain the best functional form (accurate, yet not too complex) that maps the input variable to the feature(s), and thus assist in identifying the underlying mechanism. The meta-parameters used are the number of features (m
Case studies: Noisy data
In many practical applications, the measurement data is noisy or erroneous, posing a considerable challenge for mechanism and model identification. Noisy outlier data, for example, can lead to incorrect functional forms that are quite different from the true model. For instance, exponentials can be approximated by higher-order polynomials. As a result, this incorrect identification could lead to an incorrect mechanism.
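To make the exponential-versus-polynomial point concrete, the short sketch below measures how closely a truncated Taylor polynomial tracks exp(-x) on the interval [0, 1]. The function name `taylor_exp_neg`, the interval, and the use of a Taylor polynomial (rather than a least-squares polynomial fit, which would track the curve at least as closely in the least-squares sense) are assumptions chosen for illustration.

```python
import math


def taylor_exp_neg(x, degree):
    """Taylor polynomial of exp(-x) about x = 0, truncated at the given degree."""
    return sum((-x) ** j / math.factorial(j) for j in range(degree + 1))


# Evaluate on a grid over [0, 1] and record the worst-case gap per degree.
xs = [i / 100 for i in range(101)]
worst = {}
for degree in (1, 2, 3, 4):
    worst[degree] = max(abs(math.exp(-x) - taylor_exp_neg(x, degree)) for x in xs)
    # Alternating-series bound on the truncation error for 0 <= x <= 1.
    bound = 1.0 / math.factorial(degree + 1)
    print(f"degree {degree}: max gap = {worst[degree]:.5f}, bound = {bound:.5f}")
```

On this interval the degree-4 gap stays below 1/120, so once the measurement noise is of that magnitude or larger, the data alone can hardly distinguish the quartic from the true exponential.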
Table 2 presents the results of the case-studies in Section 3 with
Discussion and conclusion
In most of the cases mentioned in Table 1, one can see that the algorithm correctly identified the underlying model structure. However, nonlinear mechanisms (the Michaelis-Menten equation and the Monod equation) are identified with some error. The model forms obtained for these cases are lower-order representations of the complex nonlinear functions.
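As an illustration of how a lower-order representation can arise (an assumption for exposition, not the forms reported in Table 1), consider the Michaelis-Menten rate law written with the conventional symbols $r$ for the rate, $V_{\max}$ for the maximum rate, $K_m$ for the Michaelis constant and $S$ for the substrate concentration; none of these symbols is defined in the text above. For $S < K_m$ the rate expands as a power series, so a model built from a few low-order terms reproduces it only approximately:

```latex
r = \frac{V_{\max} S}{K_m + S}
  = \frac{V_{\max}}{K_m}\, S \left(1 + \frac{S}{K_m}\right)^{-1}
  = \frac{V_{\max}}{K_m}\, S
    - \frac{V_{\max}}{K_m^{2}}\, S^{2}
    + \frac{V_{\max}}{K_m^{3}}\, S^{3}
    - \cdots,
  \qquad 0 \le S < K_m .
```

The Monod equation has the same algebraic form, with a maximum specific growth rate and a half-saturation constant in place of $V_{\max}$ and $K_m$, so the same expansion applies to it.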
All example pathways such as common cause (Trambouze 3.1.5), common effect (Fractional Kinetics 3.2.1) and causal chains (Series Reactions 3.2.3) have been
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (37)
- et al. Application of adaptive Savitzky–Golay filter for EEG signal processing. Perspect. Sci. (2016)
- et al. Symbolic regression via genetic programming in the optimization of a controlled release pharmaceutical formulation. Chemom. Intell. Lab. Syst. (2011)
- et al. Catalyst design: knowledge extraction from high-throughput experimentation. J. Catal. (2003)
- et al. Prediction of wave ripple characteristics using genetic programming. Cont. Shelf Res. (2013)
- et al. Nonlinear model structure identification using genetic programming. Control Eng. Pract. (1998)
- et al. A structured modeling approach for dynamic hybrid fuzzy-first principles models. J. Process Control (2002)
- et al. Combining prior knowledge with data driven modeling of a batch distillation column including start-up. Comput. Chem. Eng. (2003)
- et al. Feedback linearizing control using hybrid neural networks identified by sensitivity approach. Eng. Appl. Artif. Intell. (2005)
- et al. Steady-state modelling of chemical process systems using genetic programming. Comput. Chem. Eng. (1997)
- et al. Hidden representations in deep neural networks: part 1. Classification problems. Comput. Chem. Eng. (2020)
- Grey-box model identification via evolutionary computing. Control Eng. Pract.
- Advantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomes. J. Clin. Epidemiol.
- Modeling oil production based on symbolic regression. Energy Policy
- Symbolic regression via genetic programming. Proceedings. Vol.1. Sixth Brazilian Symposium on Neural Networks
- Genetic programming for symbolic regression of chemical process systems. Eng. Lett.
- Neural Control of Biotechnological Processes
- Strategy for dynamic process modeling based on neural networks in macroscopic balances. AIChE J.
- Understanding interactions among genetic algorithm parameters. FOGA