Inference of compact nonlinear dynamic models by epigenetic local search

https://doi.org/10.1016/j.engappai.2016.07.004

Abstract

We introduce a method to enhance the inference of meaningful dynamic models from observational data by genetic programming (GP). This method incorporates an inheritable epigenetic layer that specifies active and inactive genes for a more effective local search of the model structure space. We define several GP implementations using different features of epigenetics, such as passive structure, phenotypic plasticity, and inheritable gene regulation. To test these implementations, we use hundreds of data sets generated from nonlinear ordinary differential equations (ODEs) in several fields of engineering and from randomly constructed nonlinear ODE models. The results indicate that epigenetic hill climbing consistently produces more compact dynamic equations with better fitness values, and that it identifies the exact solution of the system more often, validating the categorical improvement of GP by epigenetic local search. The results further indicate that when faced with complex dynamics, epigenetic hill climbing reduces the computational effort required to infer the correct underlying dynamics. We then apply the method to the identification of three real-world systems: a cascaded tanks system, a chemical distillation tower, and an industrial wind turbine. We analyze its solutions in comparison to theoretical and black-box approaches in terms of accuracy and intelligibility. Finally, we analyze population homology to evaluate the efficiency of the method. The results indicate that the epigenetic implementations provide protection from premature convergence by maintaining diversity in silenced portions of programs.

Introduction

A major goal of science is to characterize analytically the dynamic behavior of natural phenomena associated with biological, ecological, social, and economic systems, as well as the dynamics of artifacts such as wind turbines, robots, and aircraft. Dynamic behaviors are usually characterized by differential equations which in aggregate represent the dynamic model of the system. These dynamic models are the essence of the simulations that estimate/predict system behavior for policy decisions, design, optimization, control, and/or automation. This paper presents a method for construction of concise and mechanistically meaningful dynamic models from observations.

Dynamic models are preferably formulated according to first principles, to embody the knowledge of the process. However, first-principles models often cannot fully characterize the nonlinear dynamics of the process, as represented by process observations. As a recourse, first-principles models may be abandoned in favor of empirical models such as neural networks (Narendra and Parthasarathy, 1990, Gregorčič and Lightbody, 2008), linear or nonlinear autoregressive moving average with exogenous inputs (ARMAX) models (Ljung, 1999, Billings, 2013), or others (Ni et al., 1996, Sadollah et al., 2015), that have the structural flexibility to accommodate the measured process observations. Although these empirical models provide an effective basis for estimation/prediction, they have two major drawbacks. One is their ‘black-box’ format, which obscures the knowledge of the process acquired through adaptation. The second is their case-specificity, which makes them potentially deficient in representing the process under conditions (inputs) not encompassed by the measured observations. To remedy the black-box nature of these empirical models, dynamic models consisting of differential equations can be defined in algebraic form by symbolic regression (Gray et al., 1998, Cao et al., 2000, Bongard and Lipson, 2007), wherein both the structure (topology) and parameters (constants) are inferred from measured observations. Since these symbolic models are intelligible, they have the capacity to elucidate the process physics. Symbolic regression is typically conducted using genetic programming (GP) (Koza, 1992), which is a bio-inspired machine learning technique that constructs candidate models from mathematical building blocks and proceeds with selection, recombination, and mutation over several generations before converging on a model that best fits the process observations.

In comparison to system identification methods that presume fixed model structures, symbolic regression can be computationally expensive because of its expanded search space. Furthermore, when guided solely by an error metric, it can yield unwieldy equations that are elusive to physical interpretation. To remedy these shortcomings, this paper introduces a new method of symbolic regression that fine-tunes candidate model structures by local search (La Cava et al., 2015). This fine tuning is enabled by the addition of an epigenetic layer for selection of program components (consisting of variables and instructions) to be included in the model. The incorporation of this epigenetic layer is motivated by two hypotheses: first, that the benefits of epigenetic regulation observed in biology may confer analogous improvements on GP systems; and second, that generalized local search methods enabled by epigenetics may improve the ability of GP to find correct model structures.

As to the first hypothesis, despite the highly regulated nature of biological genes, the role of epigenetics in regulating gene expression is traditionally ignored in GP (with some exceptions, e.g., Ferreira, 2001). However, epigenetic processes may provide several evolutionary benefits. For example, because epigenetic processes allow the underlying genotype to encode various expressions and lead to neutral variation through crossover and mutation of non-coding segments, they may allow populations to avoid evolutionary bottlenecks or let them respond to changing evolutionary pressures (Jablonka and Lamb, 2002). Also, because they provide for phenotypic plasticity that enables gene expression to change in response to environmental pressure (Dias and Ressler, 2013), they may allow gene expression adaptations to be inherited by offspring without explicit changes to the genotype. This property legitimizes, via epigenetic processes, the once-discredited ideas of Lamarck pertaining to the heritability of lifetime adaptations (Jablonka and Lamb, 2002, Holliday, 2006).

Regarding the second hypothesis, although local search methods have been developed and integrated into evolutionary algorithms (Gruau and Whitley, 1993, Whitley et al., 1994, Jeong and Lee, 1996, Ross, 1999, Giraud-Carrier, 2002), especially in genetic algorithms (GAs) through prescribed changes to the genotype, the role of structure optimization in symbolic regression is typically left to the GP process. Aside from some recent developments (Arnaldo et al., 2014), local search is traditionally conducted at the genome level. More generic local search methods, like tree snipping (Bongard and Lipson, 2007), focus on improving secondary metrics like size or legibility, whereas the traditional search methods, like stochastic hill-climbing (Bongard and Lipson, 2007), linear (Iba and Sato, 1994) or non-linear regression (Topchy and Punch, 2001) are confined to constant optimization. Although these local search methods improve symbolic regression performance, they cannot aid the search for program topology.

Epigenetics, on the other hand, provides a natural basis for performing local search at the structural (i.e., program topology) level. Motivated by this benefit of epigenetics, we introduce in this paper an epigenetics-enabled GP system to conduct topological optimization of programs at the level of gene expression. The contributions of this method are twofold. First, it introduces a generic method of topological search of the space of individual genotypes via modifications to gene expression. Second, it improves programs without affecting the genotype and without discarding the knowledge acquired through evolution, thereby lowering the risk of premature convergence observed in previous studies (Whitley et al., 1994). These contributions are achieved by conducting local search on the epigenome rather than the genome and making these adaptations inheritable via evolutionary processes.
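The idea of searching over gene expression rather than over the genotype can be illustrated with a small sketch. This is not the authors' ELGP implementation. It assumes a postfix (stack-based) program, a three-operator instruction set, mean absolute error as the fitness, and a deterministic first-improvement sweep over the on/off flags. The names `express`, `mean_abs_error`, and `epigenetic_hill_climb`, as well as the toy data, are hypothetical and chosen only for illustration.

```python
import operator

# Hypothetical instruction set for this sketch (an assumption, not ELGP's).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def express(genome, epigenome, env):
    """Run only the active genes of a postfix program on a stack."""
    stack = []
    for gene, active in zip(genome, epigenome):
        if not active:
            continue  # silenced gene: stays in the genotype, is not executed
        if gene in OPS:
            if len(stack) >= 2:  # operators lacking arguments act as no-ops
                b, a = stack.pop(), stack.pop()
                stack.append(OPS[gene](a, b))
        else:
            # strings are variable names, anything else is a constant
            stack.append(env[gene] if isinstance(gene, str) else float(gene))
    return stack[-1] if stack else 0.0


def mean_abs_error(genome, epigenome, samples):
    """Fitness: mean absolute error over (environment, target) pairs."""
    total = sum(abs(express(genome, epigenome, env) - target)
                for env, target in samples)
    return total / len(samples)


def epigenetic_hill_climb(genome, epigenome, samples, sweeps=1):
    """Toggle one expression flag at a time; the genome is never modified.

    A change is kept if it lowers the error, or if it leaves the error
    unchanged while silencing more genes (favoring compact models).
    """
    best = list(epigenome)
    best_err = mean_abs_error(genome, best, samples)
    for _ in range(sweeps):
        for i in range(len(best)):
            trial = list(best)
            trial[i] = not trial[i]
            err = mean_abs_error(genome, trial, samples)
            smaller = sum(trial) < sum(best)
            if err < best_err or (err == best_err and smaller):
                best, best_err = trial, err
    return best, best_err


# Toy problem: the fully expressed genome encodes x*y + x, the target is x*y.
genome = ["x", "y", "*", "x", "+"]
points = [(1, 1), (2, 3), (4, 2), (3, 5)]
samples = [({"x": x, "y": y}, x * y) for x, y in points]

mask, err = epigenetic_hill_climb(genome, [True] * len(genome), samples)
active = [g for g, on in zip(genome, mask) if on]
print(active, err)  # prints ['x', 'y', '*'] 0.0
```

In this toy run the search silences the trailing `x` and `+` genes, which recovers the target structure while leaving the genotype untouched, so the silenced genes remain available to later variation. A stochastic acceptance scheme or a different fitness measure could be substituted without changing the overall idea.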

The proposed Epigenetic Linear Genetic Programming (ELGP) method is tested on a large array of data generated from nonlinear ordinary differential equations (ODEs), as well as from three real-world processes, to evaluate the quality of its solutions. The paper is organized as follows. We formulate in Section 2 the identification problem and describe in Section 3 the ELGP method and its application to inference of dynamic models. We also review the relevant work in the context of GP and nonlinear dynamics modeling in Section 4. We then present the experimental analysis of different epigenetic implementations on a series of increasingly complex problems in Section 5. We begin by testing the method on a large set of data obtained from simulated nonlinear ODEs in different engineering fields, in order to illustrate its breadth of application. We then perform identification on hundreds of randomly constructed nonlinear systems, varying in complexity and dimensionality, to evaluate the scalability of the method in comparison to traditional GP approaches. Finally, we apply the ELGP method to three real-world problems, including the identification of (1) a benchmark cascaded tanks system (Wigren and Schoukens, 2013), (2) a chemical distillation tower, and (3) an industrial wind turbine. The results are presented in Section 6 and include comparisons of ELGP's performance in relation to other linear and nonlinear identification methods. We finish this discussion with an analysis of population diversity to study how gene expression evolves for each ELGP implementation.

Problem statement

The underlying assumption of symbolic regression is that there exists an analytical model of the system that would generate the measured observations $y(t_k)$ at the sample times $t_k = t_1, \ldots, t_N$ under the input $u(t)$, as

$$y(t_k, u) = \hat{y}\left(t_k, M^*(x, u, \Theta^*)\right) + \nu; \quad k = 1, \ldots, N$$

where $\hat{y}$ is the model output, $\nu$ represents measurement noise in $y$, $x = [x_1, \ldots, x_n]^T$ is the vector of state variables, and $M^*(x, u, \Theta^*)$ is the correct model form embodied by the correct parameter values $\Theta^*$, written $M^*$ hereafter for brevity. In the search for

Epigenetic Linear Genetic Programming (ELGP)

In symbolic regression, the search for candidate models is conducted by GP, wherein a population of computer programs, consisting of variables and instructions that produce models of the process, is evolved. Mathematical building blocks compose the genotype of each program that is optimized by an evolutionary algorithm. The operation steps of ELGP, outlined in Fig. 1, start with randomly constructed programs that comprise an

Related work

There has been some work to incorporate epigenetic learning into GP, notably by Tanev and Yuta (2008). In that case the focus was to model histone modification through a double cell representation as demonstrated in a predator-prey problem. Unlike our approach, Tanev did not treat lifetime epigenetic modifications as inheritable, as is supported by recent studies in biology (Turner, 2000, Kaati et al., 2002, Dias and Ressler, 2014). There have also been a number of studies on the effects of

Experimental methods

We describe in this section the evolutionary framework to which ELGP is applied and the settings that are used to conduct the experiments, followed by a description of the set of problems that are used to compare the performance of each GP treatment. Section 5.1 describes the algorithms used to perform selection and search operations within GP, which build upon previous symbolic regression research. In Section 5.2, we describe implementation optimizations related to efficiently performing hill

Results and discussion

We first present results obtained on the textbook ODE problems in Section 6.1. Comparisons include the number of exact solutions, the fitness of the best solutions, and complexity of the best models found by each method. Next we analyze in Section 6.2 the ODE suite results according to fitness as a function of point evaluations in training and testing over the entire suite. To give a sense of the scalability of the methods, we group the results by target complexity and compare the number of

Conclusions

The results suggest that epigenetic local search is a significant addition to GP. We find that epigenetic methods, especially EHC methods, outperform a baseline implementation of GP in terms of fitness minimization, exact solutions, and equation intelligibility on textbook nonlinear ODE systems and randomly generated dynamic systems. Furthermore we show in comparison to other nonlinear approaches that ELGP is able to return concise and accurate models in three real-world applications. Our study

Acknowledgments

We thank Nicholas McPhee and the Hampshire Computational Intelligence lab for helping improve this paper. This work is partially supported by the NSF-sponsored IGERT: Offshore Wind Energy Engineering, Environmental Science, and Policy (Grant no. 1068864), as well as Grant nos. 1017817, 1129139, and 1331283. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References (68)

  • Billings, S.A., 2013. Nonlinear System Identification: NARMAX Methods in the Time, Frequency, and Spatio-temporal Domains.
  • Bongard, J., et al., 2007. Automated reverse engineering of nonlinear dynamical systems. Proc. Natl. Acad. Sci.
  • Brameier, M., Banzhaf, W., 2007. Linear Genetic Programming, vol. 1. Springer, 1...
  • Cao, H., et al., 2000. Evolutionary modeling of systems of ordinary differential equations with genetic programming. Genet. Program. Evol. Mach.
  • Chen, S., et al., 1989. Orthogonal least squares methods and their application to non-linear system identification. Int. J. Control
  • Cornforth, T.W., et al., 2013. Inference of hidden variables in systems of differential equations with genetic programming. Genet. Program. Evol. Mach.
  • Deb, K., et al. A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II.
  • Dias, B.G., et al., 2013. PACAP and the PAC1 receptor in post-traumatic stress disorder. Neuropsychopharmacology
  • Dias, B.G., et al., 2014. Parental olfactory experience influences behavior and neural structure in subsequent generations. Nat. Neurosci.
  • Ferreira, C., 2001. Gene expression programming: a new adaptive algorithm for solving problems. Complex Syst.
  • Fleming, P., Van Wingerden, J.-W., Wright, A.D., 2011. Comparing State-space Multivariable Controls to Multi-SISO...
  • Fontana, A., 2011. Epigenetic tracking. In: Kampis, G., Karsai, I., Szathmáry, E. (Eds.), Advances in Artificial Life,...
  • Giraud-Carrier, C., 2002. Unifying learning with evolution through Baldwinian evolution and Lamarckism. In: Advances in...
  • Gruau, F., et al., 1993. Adding learning to the cellular development of neural networks: evolution and the Baldwin effect. Evolut. Comput.
  • Holliday, R., 2006. Epigenetics: a historical overview. Epigenetics
  • Iba, H., Sato, T., 1994. Genetic Programming with Local Hill-Climbing. Technical Report ETL-TR-94-4, Electrotechnical...
  • Jablonka, E., et al., 2002. The changing concept of epigenetics. Ann. N.Y. Acad. Sci.
  • Kaati, G., et al., 2002. Cardiovascular and diabetes mortality determined by nutrition during parents' and grandparents' slow growth period. Eur. J. Hum. Genet.
  • Keijzer, M. Improving symbolic regression with interval arithmetic and linear scaling.
  • Keijzer, M., 2013. Push-forth: a light-weight, strongly-typed, stack-based genetic programming language. In:...
  • Kommenda, M., et al. Effects of constant optimization by nonlinear least squares minimization in symbolic regression.
  • Kommenda, M., Kronberger, G., Affenzeller, M., Winkler, S.M., Burlacu, B., 2015. Evolving simple symbolic regression...
  • Koza, J.R., 1992. Genetic Programming: on the Programming of Computers by Means of Natural Selection.
  • Koza, J.R., et al., 1997. Automated synthesis of analog electrical circuits by means of genetic programming. IEEE Trans. Evolut. Comput.