Mechanism discovery and model identification using genetic feature extraction and statistical testing

doi:10.1016/j.compchemeng.2020.106900

Computers & Chemical Engineering

Volume 140, 2 September 2020, 106900

https://doi.org/10.1016/j.compchemeng.2020.106900

Highlights

• Instead of black-box models which are hard to interpret, a machine learning system is proposed to identify mechanistic models from data to gain insights.
• Genetic programming is used to extract model features based on elementary functions that are founded on physio-chemical mechanisms in many chemical engineering applications.
• Statistical testing is done during the course of feature extraction to identify significant features more efficiently.

Abstract

One main drawback of many machine learning-based regression models is that they are difficult to interpret and explain. Mechanism-based first-principles models, on the other hand, can be interpreted and are hence preferable. However, as they are often quite challenging to develop, the appeal of machine learning-based black-box models is natural. Here, we report a genetic algorithm-based machine learning system that automatically discovers mechanistic models from data using limited human guidance. The advantage of this approach is that it yields simple, interpretable features and can be used to identify model forms and fundamental mechanisms that are often seen in chemical engineering. We demonstrate our system on several case studies in reaction kinetics and transport phenomena, and discuss its strengths and limitations.

Introduction

The lack of mechanistic or causal interpretation and explanation in conventional machine learning (ML) regression approaches is a serious drawback in general, and worrisome in safety-critical applications in particular. Machine learning results often leave us with no insight into how they were obtained, hence their ‘black-box’ character (Venkatasubramanian, 2019; Psichogios and Ungar, 1992; Gawthrop et al., 1993; Gray et al., 1998). The limitations of popular model-free parametric regression strategies such as neural networks and genetic programming symbolic regression (GPSR) are two-fold: (i) they provide complicated and uninterpretable input-output relationships, and (ii) they do not capture parameter sensitivity. These limitations hinder the usability of such models, specifically for the identification of underlying mechanisms. Such models often develop intractable functional transformations (Wang et al., 2019), further limiting explainability and promoting overfitting. Neural networks, for example, might use relatively simple activation functions, but their multi-layered architectures often lead to complicated hidden representations that can be hard to interpret and explain (Sivaram et al., 2020a; Sivaram et al., 2020; Tu, 1996). GPSR often results in complicated functional forms that have no first-principles mechanisms to support them (Augusto and Barbosa, 2000).

One way to address the interpretability issue is to develop hybrid models (Psichogios and Ungar, 1992; Thompson and Kramer, 1994; Sundaram et al., 2001; Penha, 2002; Tan and Li, 2002; Caruthers et al., 2003; te Braake, 1997; Johansen, 1994). Hybrid models incorporate first-principles concepts with data-driven techniques (Penha, 2002; Tholudur and Ramirez, 1996). Typically, first-principles relations such as conservation equations are augmented with data-driven models like neural networks (Tholudur and Ramirez, 1996; Simutis and Lubbert, 1997; Can et al., 1996; Madar et al., 2005) and fuzzy logic (van Lith et al., 2002; van Lith et al., 2003) for parameter estimation. This preserves a first-principles understanding of the phenomena while confining much of the uninterpretability to the identification of the feature-parameter relationships. It does so, however, at the expense of insight into the causal relationships in the data.

In this paper, we address a different challenge: can we develop a machine learning system that automatically identifies the first-principles-based models that capture the underlying physical, chemical, and/or biological mechanisms generating the data? In some sense, this is what science is all about: understanding the world around us in terms of a few fundamental principles and mechanisms expressed mathematically. Unlike in hybrid modeling, the first-principles model itself is not known a priori here, but needs to be determined from data (Schmidt and Lipson, 2009a). We address this challenge in a manner similar to how humans go about reasoning and discovering mechanisms and models, guided by intelligent guesses, heuristics, and inferences. For example, in reaction kinetics, the reaction rates are driven by the concentrations of the various species in the system. It is hence advisable to seek relationships between the species concentrations and the rates, even when the actual functional form is not known. In many chemical engineering applications, phenomena driven by the underlying first principles of physics and chemistry yield equations that are composed of elementary functional forms. Even for more complicated cases, the proposed approach could yield simpler, reduced-order models that are easier to interpret and to gain insights from, without sacrificing too much accuracy.

We use a genetic algorithm to identify the simpler, elementary functional forms of a model, guided by domain knowledge, that can be directly coupled to first-principles-based physio-chemical mechanisms. For instance, in reaction engineering, our domain knowledge informs us that rate variables, concentration variables, and certain mathematical functional forms are important. In our framework, we exploit this understanding by incorporating such knowledge in the creation of the library, or pool, of elementary functions to work with. This is augmented by statistical regression using the identified functions (Rogers and Hopfinger, 1994). In conventional genetic function approximation methods, the tree representation of the structure gives rise to unnecessarily complex functions, which we would like to avoid. The use of statistical methods, such as ordinary least squares (OLS), partial least squares (PLS), or least absolute shrinkage and selection operator (LASSO) regression, provides the coefficients needed to determine the final functional form, along with an estimate of their errors. Thus, one can understand which inputs are most strongly predictive of an outcome based on the coefficients of the functions in the obtained functional form. This approach also restricts the complexity of the model obtained. Excess complexity is often a problem with conventional GPSR algorithms, where the resultant composite functions must be simplified to a great extent to uncover the actual function (McKay et al., 1997; Babu and Karthik, 2007; Barmpalexis et al., 2011; Goldstein et al., 2013; Yang et al., 2015). Further, such algorithms do not determine confidence levels in the parameter estimates. For example, although Eureqa® (Schmidt and Lipson, 2013), an evolutionary algorithm-based model identification engine, predicts functional forms from data (and often yields complex functional forms (Goldstein et al., 2013; Tinoco et al., 2015)), it is unable to provide the statistical significance of the estimated parameters.
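The combination described above (a genetic search over a library of elementary functions, followed by regression that reports coefficient errors) can be illustrated with a minimal sketch. It is not the authors' implementation: the library contents, the per-feature complexity penalty, the GA settings, the synthetic data, and the rule of thumb that |t| well above 2 indicates significance are all assumptions made for this example.

```python
import numpy as np

# Hypothetical library of elementary functions (illustrative, not the paper's pool).
LIBRARY = {
    "x": lambda x: x,
    "x^2": lambda x: x**2,
    "exp(-x)": lambda x: np.exp(-x),
    "log(x)": lambda x: np.log(x),
    "1/x": lambda x: 1.0 / x,
}
NAMES = list(LIBRARY)


def ols(X, y):
    """Ordinary least squares: coefficients, standard errors, residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rss = float(resid @ resid)
    dof = max(len(y) - X.shape[1], 1)
    cov = (rss / dof) * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(np.diag(cov)), rss


def fitness(mask, feats, y, penalty=20.0):
    """Lower is better: log-error term plus an assumed per-feature complexity penalty."""
    if not mask.any():
        return np.inf
    _, _, rss = ols(feats[:, mask], y)
    return len(y) * np.log(rss / len(y) + 1e-300) + penalty * mask.sum()


def ga_select(feats, y, pop_size=30, generations=40, p_mut=0.2, seed=0):
    """Tiny genetic algorithm over boolean masks marking which library functions are used."""
    rng = np.random.default_rng(seed)
    n_feat = feats.shape[1]
    pop = rng.random((pop_size, n_feat)) < 0.5
    for _ in range(generations):
        scores = np.array([fitness(m, feats, y) for m in pop])
        children = [pop[np.argmin(scores)].copy()]  # elitism: best mask survives
        while len(children) < pop_size:
            # Tournament selection of two parents.
            i, j, k, l = rng.integers(0, pop_size, size=4)
            pa = pop[i] if scores[i] < scores[j] else pop[j]
            pb = pop[k] if scores[k] < scores[l] else pop[l]
            child = np.where(rng.random(n_feat) < 0.5, pa, pb)  # uniform crossover
            child = child ^ (rng.random(n_feat) < p_mut)        # bit-flip mutation
            children.append(child)
        pop = np.array(children)
    scores = np.array([fitness(m, feats, y) for m in pop])
    return pop[np.argmin(scores)]


# Synthetic data from an assumed "true" model y = 2x + 3exp(-x) with small noise.
rng = np.random.default_rng(1)
x = np.linspace(0.5, 5.0, 60)
y = 2.0 * x + 3.0 * np.exp(-x) + rng.normal(0.0, 1e-3, x.size)
feats = np.column_stack([f(x) for f in LIBRARY.values()])

best = ga_select(feats, y)
selected = [name for name, keep in zip(NAMES, best) if keep]
beta, se, _ = ols(feats[:, best], y)
t_stats = beta / se  # large |t| suggests the coefficient differs from zero
for name, b, t in zip(selected, beta, t_stats):
    print(f"{name}: coef={b:.3f}, t={t:.1f}")
```

Because the search only toggles membership in a fixed library and the final form is linear in its coefficients, each selected term comes with a standard error and a t-statistic, which is the kind of information a tree-based symbolic regression does not provide directly.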

This paper is organized as follows. We first motivate the need for elementary functional forms commonly seen in chemical engineering, such as in reaction systems, followed by a discussion of our Genetic Feature Extraction and Statistical Testing (GFEST) framework. We demonstrate a series of case studies for reaction systems, starting with single-input single-output systems, followed by multiple-input single-output systems. We conclude the article with an emphasis on future directions.

Section snippets

GFEST methodology

In our approach, the simplest model form (as justified by Occam’s Razor) that fits the data adequately is obtained using a genetic algorithm (GA) that searches through a space of allowed elementary functions and their combinations. The elementary functions in the GA function pool, or library, were selected based on our domain knowledge of chemical engineering, knowing which functional forms and combinations often appear in our models. These, in turn, are based on our

Case studies: Noiseless data

In this section, we discuss the physio-chemical mechanisms on which GFEST has been tested. These span fundamental areas of chemical engineering such as reaction engineering and transport phenomena. The objective was that, given data, the algorithm must be able to obtain the best functional form (accurate, yet not too complex) that maps the input variable to the feature(s), and thus assist in identifying the underlying mechanism. The meta-parameters used are the number of features (m

Case studies: noisy data

In many practical applications, the measurement data are noisy or erroneous, posing a considerable challenge for mechanism and model identification. Noisy data and outliers, for example, can lead to incorrect functional forms that are quite different from the true model. For instance, an exponential can be closely approximated by a higher-order polynomial, so the polynomial may be selected in its place. Such an incorrect identification could, in turn, suggest an incorrect mechanism.
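The polynomial-versus-exponential ambiguity can be made concrete with a small numerical sketch. The decay model, the sampling range, the noise level, and the choice of a quartic are assumptions for illustration only and are not taken from the case studies.

```python
import numpy as np
from numpy.polynomial import Polynomial

x = np.linspace(0.0, 2.0, 50)
clean = np.exp(-x)   # assumed "true" mechanism: first-order decay
sigma = 0.01         # assumed measurement-noise standard deviation
rng = np.random.default_rng(0)
noisy = clean + rng.normal(0.0, sigma, x.size)


def rms(residual):
    """Root-mean-square of a residual vector."""
    return float(np.sqrt(np.mean(residual**2)))


# Least-squares quartic fitted to the noisy measurements.
poly = Polynomial.fit(x, noisy, deg=4)
rms_poly = rms(noisy - poly(x))   # misfit of the polynomial surrogate
rms_true = rms(noisy - clean)     # misfit of the true exponential

# Distance between a quartic and the noise-free exponential on this range.
approx_err = float(np.max(np.abs(Polynomial.fit(x, clean, deg=4)(x) - clean)))

print(f"RMS residual, quartic:     {rms_poly:.4f}")
print(f"RMS residual, exponential: {rms_true:.4f}")
print(f"max |quartic - exp(-x)| without noise: {approx_err:.1e}")
```

On this range the quartic differs from the exponential by less than the assumed noise level, so goodness of fit alone cannot separate the two forms. This is one reason a complexity penalty or a significance test on the individual terms is useful when the data are noisy.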

Table 2 presents the results of the case-studies in Section 3 with

Discussion and conclusion

In most of the cases listed in Table 1, the algorithm correctly identified the underlying model structure. However, nonlinear mechanisms (the Michaelis-Menten equation and the Monod equation) are identified with some error. The model forms obtained for these cases are lower-order representations of the complex nonlinear functions.
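As an illustration of what a lower-order representation of such a rate law can look like, consider the standard Michaelis-Menten form (the Monod equation has the same structure). The symbols $r$, $S$, $V_{\max}$ and $K_m$ are the usual textbook notation and are assumed here; the specific reduced forms found by GFEST are not shown in this excerpt.

```latex
r = \frac{V_{\max} S}{K_m + S}
  = \frac{V_{\max}}{K_m}\, S \left( 1 - \frac{S}{K_m} + \frac{S^2}{K_m^2} - \cdots \right),
  \qquad S < K_m ,
\\[4pt]
r \approx \frac{V_{\max}}{K_m}\, S \quad (S \ll K_m),
  \qquad
r \approx V_{\max} \quad (S \gg K_m).
```

A library restricted to elementary terms such as $S$ and $S^2$ can therefore reproduce the saturating rate law over a limited concentration range by truncating the series, which would be consistent with a low-order form being identified with some error.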

All example pathways such as common cause (Trambouze 3.1.5), common effect (Fractional Kinetics 3.2.1) and causal chains (Series Reactions 3.2.3) have been

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (37)

D. Acharya et al. Application of adaptive Savitzky–Golay filter for EEG signal processing. Perspect. Sci. (2016)
P. Barmpalexis et al. Symbolic regression via genetic programming in the optimization of a controlled release pharmaceutical formulation. Chemom. Intell. Lab. Syst. (2011)
J. Caruthers et al. Catalyst design: knowledge extraction from high-throughput experimentation. J. Catal. (2003)
E.B. Goldstein et al. Prediction of wave ripple characteristics using genetic programming. Cont. Shelf Res. (2013)
G.J. Gray et al. Nonlinear model structure identification using genetic programming. Control Eng. Pract. (1998)
P.F. van Lith et al. A structured modeling approach for dynamic hybrid fuzzy-first principles models. J. Process Control (2002)
P.F. van Lith et al. Combining prior knowledge with data driven modeling of a batch distillation column including start-up. Comput. Chem. Eng. (2003)
J. Madar et al. Feedback linearizing control using hybrid neural networks identified by sensitivity approach. Eng. Appl. Artif. Intell. (2005)
B. McKay et al. Steady-state modelling of chemical process systems using genetic programming. Comput. Chem. Eng. (1997)
A. Sivaram et al. Hidden representations in deep neural networks: part 1. Classification problems. Comput. Chem. Eng. (2020)
K. Tan et al. Grey-box model identification via evolutionary computing. Control Eng. Pract. (2002)
J.V. Tu. Advantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomes. J. Clin. Epidemiol. (1996)
G. Yang et al. Modeling oil production based on symbolic regression. Energy Policy (2015)
D. Augusto et al. Symbolic regression via genetic programming. Proceedings. Vol.1. Sixth Brazilian Symposium on Neural Networks (2000)
B. Babu et al. Genetic programming for symbolic regression of chemical process systems. Eng. Lett. (2007)
H. te Braake. Neural Control of Biotechnological Processes (1997)
H.J.L.V. Can et al. Strategy for dynamic process modeling based on neural networks in macroscopic balances. AIChE J. (1996)
K. Deb et al. Understanding interactions among genetic algorithm parameters. FOGA (1998)

