Mechanism discovery and model identification using genetic feature extraction and statistical testing

https://doi.org/10.1016/j.compchemeng.2020.106900

Highlights

  • Instead of black-box models which are hard to interpret, a machine learning system is proposed to identify mechanistic models from data to gain insights.

  • Genetic programming is used to extract model features based on elementary functions that are founded on physico-chemical mechanisms in many chemical engineering applications.

  • Statistical testing is done during the course of feature extraction to identify significant features more efficiently.

Abstract

One main drawback of many machine learning-based regression models is that they are difficult to interpret and explain. Mechanism-based first-principles models, on the other hand, can be interpreted and are hence preferable. However, because they are often quite challenging to develop, the appeal of machine learning-based black-box models is natural. Here, we report a genetic algorithm-based machine learning system that automatically discovers mechanistic models from data using limited human guidance. The advantage of this approach is that it yields simple, interpretable features and can be used to identify model forms and fundamental mechanisms that are often seen in chemical engineering. We demonstrate our system on several case studies in reaction kinetics and transport phenomena, and discuss its strengths and limitations.

Introduction

The lack of mechanistic or causal interpretation and explanation in conventional machine learning (ML) regression approaches is a serious drawback in general, and is particularly worrisome in safety-critical applications. Machine learning results often leave us with no insight into how they were obtained, hence their ‘black-box’ character (Venkatasubramanian, 2019; Psichogios and Ungar, 1992; Gawthrop et al., 1993; Gray et al., 1998). The limitations of popular model-free parametric regression strategies such as neural networks and genetic programming symbolic regression (GPSR) are two-fold: (i) they provide complicated and uninterpretable input-output relationships, and (ii) they do not capture parameter sensitivity. These limitations hinder the usability of such models, specifically for the identification of underlying mechanisms. The models often develop intractable functional transformations (Wang et al., 2019), further limiting explainability and promoting overfitting. Neural networks, for example, might use relatively simple activation functions, but their multi-layered architectures often lead to complicated hidden representations that can be hard to interpret and explain (Sivaram et al., 2020a; Sivaram et al., 2020; Tu, 1996). GPSR often results in complicated functional forms that do not have first-principles mechanisms to support them (Augusto and Barbosa, 2000).

One way to address the interpretability issue is to develop hybrid models (Psichogios and Ungar, 1992; Thompson and Kramer, 1994; Sundaram et al., 2001; Penha, 2002; Tan and Li, 2002; Caruthers et al., 2003; te Braake, 1997; Johansen, 1994). Hybrid models incorporate first-principles concepts with data-driven techniques (Penha, 2002; Tholudur and Ramirez, 1996). Typically, first principles such as conservation equations are augmented with data-driven models like neural networks (Tholudur and Ramirez, 1996; Simutis and Lubbert, 1997; Can et al., 1996; Madar et al., 2005) and fuzzy logic (van Lith et al., 2002; van Lith et al., 2003) for parameter estimation. This preserves a first-principles understanding of the phenomena, while confining much of the uninterpretability to the identification of feature-parameter relationships. This, however, comes at the cost of insight into the causal relationships in the data.

In this paper, we address a different challenge: can we develop a machine learning system that automatically identifies the first-principles-based models that capture the underlying physical, chemical, and/or biological mechanisms generating the data? In some sense, this is what science is all about: understanding the world around us in terms of a few fundamental principles and mechanisms expressed mathematically. Unlike in hybrid modeling, the first-principles model itself is not known a priori here and must be determined from data (Schmidt and Lipson, 2009a). We address this challenge in a manner similar to how humans reason about and discover mechanisms and models, guided by intelligent guesses, heuristics, and inferences. For example, in reaction kinetics, the reaction rates are driven by the concentrations of the various species in the system. It is hence advisable to seek relationships between the species concentrations and the rates, even when the actual functional form is unknown. In many chemical engineering applications, phenomena driven by the underlying first principles of physics and chemistry yield equations that are composed of elementary functional forms. Even for more complicated cases, the proposed approach could yield simpler, reduced-order models that are easier to interpret and gain insights from, without sacrificing too much accuracy.

We use a genetic algorithm to identify the simpler, elementary functional forms of a model, guided by domain knowledge, that can be directly coupled to first-principles-based physico-chemical mechanisms. For instance, in reaction engineering, our domain knowledge informs us that rate variables, concentration variables, and certain mathematical functional forms are important. In our framework, we exploit this understanding by incorporating such knowledge in the creation of the library or pool of elementary functions to work with. This is augmented by statistical regression using the identified functions (Rogers and Hopfinger, 1994). In conventional genetic function approximation methods, the tree representation of the structure gives rise to unnecessarily complex functions, which we would like to avoid. The use of statistical methods, such as ordinary least squares (OLS), partial least squares (PLS), or least absolute shrinkage and selection operator (LASSO) regression, provides the coefficients needed to determine the final functional form, along with an estimation of errors. Thus, one can understand which inputs are most strongly predictive of an outcome based on the coefficients of the functions in the obtained functional form. This approach has the benefit of restricting the complexity of the model obtained. Excessive complexity is often a problem with conventional GPSR algorithms, where the resultant composite functions must be simplified to a great extent to uncover the actual function (McKay et al., 1997; Babu and Karthik, 2007; Barmpalexis et al., 2011; Goldstein et al., 2013; Yang et al., 2015). Further, such algorithms do not determine confidence levels for the parameter estimates. For example, although Eureqa® (Schmidt and Lipson, 2013), an evolutionary algorithm-based model identification engine, predicts functional forms from data (which are often complex (Goldstein et al., 2013; Tinoco et al., 2015)), it cannot provide the statistical significance of the estimated parameters.
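The regression step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the GFEST implementation: the function library `LIBRARY`, the helper `ols_with_t_stats`, and the synthetic data are all hypothetical. It fits a model that is linear in coefficients over a chosen set of elementary functions by OLS and reports a t-statistic for each coefficient, which is the kind of significance information the paragraph refers to.

```python
import numpy as np

# Hypothetical library of elementary functions (an assumption for illustration;
# not the function pool used in the paper).
LIBRARY = {
    "x": lambda x: x,
    "x^2": lambda x: x**2,
    "1/x": lambda x: 1.0 / x,
    "ln(x)": lambda x: np.log(x),
}

def ols_with_t_stats(x, y, names):
    """Fit y = b0 + sum_j b_j * f_j(x) by OLS; return coefficients and t-statistics."""
    # Design matrix: intercept column plus one column per selected feature.
    X = np.column_stack([np.ones_like(x)] + [LIBRARY[n](x) for n in names])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]                # residual degrees of freedom
    sigma2 = resid @ resid / dof             # unbiased noise-variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)    # coefficient covariance matrix
    t = beta / np.sqrt(np.diag(cov))         # t-statistic per coefficient
    return beta, t

# Synthetic data (assumed): the true model is y = 2 * x^2 plus small noise.
rng = np.random.default_rng(0)
x = np.linspace(0.5, 5.0, 60)
y = 2.0 * x**2 + rng.normal(scale=0.1, size=x.size)

features = ["x^2", "ln(x)"]
beta, t_stats = ols_with_t_stats(x, y, features)
for name, b, t in zip(["intercept"] + features, beta, t_stats):
    print(f"{name:>9}: coef={b:+.3f}, t={t:+.1f}")
```

In this sketch, the coefficient on `x^2` should come out close to 2 with a very large t-statistic, while the spurious `ln(x)` term should have a small one, so a significance test would flag it for removal.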

This paper is organized as follows. We first motivate the need for elementary functional forms commonly seen in chemical engineering, such as in reaction systems, followed by a discussion of our Genetic Feature Extraction and Statistical Testing (GFEST) framework. We demonstrate a series of case studies for reaction systems, starting with single-input single-output systems, followed by multiple-input single-output systems. We conclude the article with an emphasis on future directions.

Section snippets

GFEST methodology

In our approach, the simplest model form (as justified by Occam’s Razor) that fits the data adequately is obtained using the genetic algorithm (GA) that searches through a space of allowed elementary functions and their combinations. These elementary functions in the GA function pool or library were selected based on our domain knowledge of chemical engineering, knowing what kinds of functional forms and their combination often appear in our models. These, in turn, are based on our

Case studies: noiseless data

In this section, we discuss the physico-chemical mechanisms on which GFEST has been tested. These span fundamental areas of chemical engineering such as reaction engineering and transport phenomena. The objective was that, given data, the algorithm must be able to obtain the best functional form (accurate, yet not too complex) that maps the input variable to the feature(s), and thus assist in identifying the underlying mechanism. The meta-parameters used are the number of features (m

Case studies: noisy data

In many practical applications, the measurement data is often noisy or erroneous, posing a considerable challenge for mechanism and model identification. Noisy outlier data, for example, can lead to incorrect functional forms that are quite different from the true model. For instance, exponentials can be approximated by higher order polynomials. As a result, this incorrect identification could lead to an incorrect mechanism.
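The remark that exponentials can be approximated by higher-order polynomials can be checked numerically. The following is a rough sketch on assumed synthetic data (the decay model, noise level, and polynomial order are illustrative choices, not values from the paper): it fits both the true exponential form and a fourth-order polynomial to noisy samples and compares their residual errors.

```python
import numpy as np

rng = np.random.default_rng(1)
noise_sd = 0.05                      # assumed measurement-noise level
x = np.linspace(0.0, 3.0, 50)
y = 3.0 * np.exp(-x) + rng.normal(scale=noise_sd, size=x.size)

def rmse(y_true, y_pred):
    """Root-mean-square error between observations and predictions."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Candidate 1: the true elementary form y = a * exp(-x), which is linear in a.
basis = np.exp(-x)
a_hat = float(basis @ y / (basis @ basis))   # closed-form least-squares estimate
rmse_exp = rmse(y, a_hat * basis)

# Candidate 2: a fourth-order polynomial with five free coefficients.
poly_coeffs = np.polyfit(x, y, deg=4)
rmse_poly = rmse(y, np.polyval(poly_coeffs, x))

print(f"exponential fit: a = {a_hat:.3f}, RMSE = {rmse_exp:.4f}")
print(f"polynomial fit:  RMSE = {rmse_poly:.4f}")
```

Both residuals should land near the noise level, so goodness of fit alone cannot separate the two forms here. Some additional criterion, such as a complexity penalty or a significance test on the coefficients, is needed to prefer the one-parameter exponential.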

Table 2 presents the results of the case-studies in Section 3 with

Discussion and conclusion

In most of the cases mentioned in Table 1, one can see that the algorithm correctly identified the underlying model structure. However, nonlinear mechanisms (the Michaelis-Menten equation and the Monod equation) are identified with some error. The model forms obtained for these cases are lower-order representations of the complex nonlinear functions.
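One way to see how a lower-order representation of a Michaelis-Menten rate law can arise is a series expansion at low substrate concentration. The snippet above does not state which form GFEST actually returned, so the expansion below is only an illustration. The symbols r (rate), S (substrate concentration), V_max, and K_m follow the usual textbook convention and are assumptions here.

```latex
r = \frac{V_{\max} S}{K_m + S}
  = \frac{V_{\max}}{K_m}\, S \left(1 + \frac{S}{K_m}\right)^{-1}
  = \frac{V_{\max}}{K_m}\, S - \frac{V_{\max}}{K_m^{2}}\, S^{2} + O(S^{3}),
  \qquad 0 \le S < K_m .
```

A library restricted to low-order polynomial terms can therefore reproduce the rate law only approximately, and only over a limited concentration range, which is consistent with these mechanisms being identified with some error.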

All example pathways such as common cause (Trambouze 3.1.5), common effect (Fractional Kinetics 3.2.1) and causal chains (Series Reactions 3.2.3) have been

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (37)

  • K. Tan et al. Grey-box model identification via evolutionary computing. Control Eng. Pract. (2002)
  • J.V. Tu. Advantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomes. J. Clin. Epidemiol. (1996)
  • G. Yang et al. Modeling oil production based on symbolic regression. Energy Policy (2015)
  • D. Augusto et al. Symbolic regression via genetic programming. Proceedings. Vol.1. Sixth Brazilian Symposium on Neural Networks (2000)
  • B. Babu et al. Genetic programming for symbolic regression of chemical process systems. Eng. Lett. (2007)
  • H. te Braake. Neural Control of Biotechnological Processes (1997)
  • H.J.L.V. Can et al. Strategy for dynamic process modeling based on neural networks in macroscopic balances. AIChE J. (1996)
  • K. Deb et al. Understanding interactions among genetic algorithm parameters. FOGA (1998)