Inference of compact nonlinear dynamic models by epigenetic local search
Introduction
A major goal of science is to characterize analytically the dynamic behavior of natural phenomena associated with biological, ecological, social, and economic systems, as well as the dynamics of artifacts such as wind turbines, robots, and aircraft. Dynamic behaviors are usually characterized by differential equations which in aggregate represent the dynamic model of the system. These dynamic models are the essence of the simulations that estimate/predict system behavior for policy decisions, design, optimization, control, and/or automation. This paper presents a method for construction of concise and mechanistically meaningful dynamic models from observations.
Dynamic models are preferably formulated according to first principles, to embody the knowledge of the process. However, first-principles models often cannot fully characterize the nonlinear dynamics of the process, as represented by process observations. As a recourse, first-principles models may be abandoned in favor of empirical models such as neural networks (Narendra and Parthasarathy, 1990, Gregorčič and Lightbody, 2008), linear or nonlinear autoregressive moving average with exogenous input (ARMAX) models (Ljung, 1999, Billings, 2013), or others (Ni et al., 1996, Sadollah et al., 2015), that have the structural flexibility to accommodate the measured process observations. Although these empirical models provide an effective basis for estimation/prediction, they have two major drawbacks. One is their ‘black-box’ format, which obscures the knowledge of the process acquired through adaptation. The second is their case-specificity, which makes them potentially deficient in representing the process under conditions (inputs) not encompassed by the measured observations. To remedy the black-box nature of these empirical models, dynamic models consisting of differential equations can be defined in algebraic form by symbolic regression (Gray et al., 1998, Cao et al., 2000, Bongard and Lipson, 2007), wherein both the structure (topology) and parameters (constants) are inferred from measured observations. Since these symbolic models are intelligible, they have the capacity to elucidate the process physics. Symbolic regression is typically conducted using genetic programming (GP) (Koza, 1992), which is a bio-inspired machine learning technique that constructs candidate models from mathematical building blocks and proceeds with selection, recombination and mutation over several generations before converging on a model that best fits the process observations.
In comparison to system identification methods that presume fixed model structures, symbolic regression can be computationally expensive because of its expanded search space. Furthermore, when guided solely by an error metric, it can yield unwieldy equations that elude physical interpretation. To remedy these shortcomings, this paper introduces a new method of symbolic regression that fine-tunes candidate model structures by local search (La Cava et al., 2015). This fine-tuning is enabled by the addition of an epigenetic layer that selects the program components (consisting of variables and instructions) to be included in the model. The incorporation of this epigenetic layer is motivated by two hypotheses: first, that the benefits of epigenetic regulation observed in biology may confer analogous improvements on GP systems; and second, that generalized local search methods enabled by epigenetics may improve the ability of GP to find correct model structures.
As to the first hypothesis, despite the highly regulated nature of biological genes, the role of epigenetics in regulating gene expression is traditionally ignored in GP (with some exceptions, e.g. (Ferreira, 2001)). However, epigenetic processes may provide several evolutionary benefits. For example, because epigenetic processes allow the underlying genotype to encode various expressions and lead to neutral variation through crossover and mutation of non-coding segments, they may allow populations to avoid evolutionary bottlenecks or let them respond to changing evolutionary pressures (Jablonka and Lamb, 2002). Also, because they provide for phenotypic plasticity that enables gene expression to change in response to environmental pressure (Dias and Ressler, 2013), they may allow gene expression adaptations to be inherited by offspring without explicit changes to the genotype. This property legitimizes, via epigenetic processes, the once-discredited ideas of Lamarck pertaining to the inheritability of lifetime adaptations (Jablonka and Lamb, 2002, Holliday, 2006).
Regarding the second hypothesis, although local search methods have been developed and integrated into evolutionary algorithms (Gruau and Whitley, 1993, Whitley et al., 1994, Jeong and Lee, 1996, Ross, 1999, Giraud-Carrier, 2002), especially in genetic algorithms (GAs) through prescribed changes to the genotype, the role of structure optimization in symbolic regression is typically left to the GP process. Aside from some recent developments (Arnaldo et al., 2014), local search is traditionally conducted at the genome level. More generic local search methods, like tree snipping (Bongard and Lipson, 2007), focus on improving secondary metrics like size or legibility, whereas the traditional search methods, like stochastic hill-climbing (Bongard and Lipson, 2007), linear (Iba and Sato, 1994) or non-linear regression (Topchy and Punch, 2001) are confined to constant optimization. Although these local search methods improve symbolic regression performance, they cannot aid the search for program topology.
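The limitation of constant optimization can be illustrated with a minimal sketch. It assumes a hypothetical fixed two-term model structure, and the function name `fit_two_terms` and the test signal are illustrative choices, not taken from the cited works. Linear least squares recovers the constants exactly when the structure is right, but no choice of constants removes the error when the structure is wrong.

```python
def fit_two_terms(f1, f2, xs, ys):
    """Least-squares fit of y ~= a*f1(x) + b*f2(x) via the 2x2 normal equations.

    The structure (f1, f2) is fixed; only the constants a and b are tuned.
    """
    p = [f1(x) for x in xs]
    q = [f2(x) for x in xs]
    spp = sum(v * v for v in p)
    sqq = sum(v * v for v in q)
    spq = sum(u * v for u, v in zip(p, q))
    spy = sum(u * y for u, y in zip(p, ys))
    sqy = sum(v * y for v, y in zip(q, ys))
    det = spp * sqq - spq * spq
    a = (spy * sqq - sqy * spq) / det
    b = (sqy * spp - spy * spq) / det
    sse = sum((a * u + b * v - y) ** 2 for u, v, y in zip(p, q, ys))
    return a, b, sse


# Hypothetical noise-free observations generated by y = 2x - 0.5x^3.
xs = [0.5, 1.0, 1.5, 2.0, 2.5]
ys = [2.0 * x - 0.5 * x ** 3 for x in xs]

# Correct structure {x, x^3}: the constants are recovered.
a_ok, b_ok, sse_ok = fit_two_terms(lambda x: x, lambda x: x ** 3, xs, ys)

# Wrong structure {x, x^2}: the best constants still leave a residual.
a_bad, b_bad, sse_bad = fit_two_terms(lambda x: x, lambda x: x ** 2, xs, ys)
```

In the second call the residual can only be removed by changing the terms themselves, which is a search over topology rather than over constants.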
Epigenetics, on the other hand, provides a natural basis for performing local search at the structural (i.e., program topology) level. Motivated by this benefit of epigenetics, we introduce in this paper an epigenetics-enabled GP system to conduct topological optimization of programs at the level of gene expression. The contributions of this method are twofold. First, it introduces a generic method of topological search of the space of individual genotypes via modifications to gene expression. Second, it improves programs without affecting the genotype and without discarding the knowledge acquired through evolution, thereby lowering the risk of premature convergence observed in previous studies (Whitley et al., 1994). These contributions are achieved by conducting local search on the epigenome rather than the genome and making these adaptations inheritable via evolutionary processes.
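The idea of searching the epigenome while leaving the genome untouched can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the genome is assumed to be a postfix (stack-based) list of instructions, the epigenome is assumed to be an on/off flag per gene, and the acceptance rule (steepest descent on squared error, with ties broken in favor of fewer active genes) is a choice made for this example. The names `run`, `error`, and `epigenetic_hill_climb` are hypothetical.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def run(genome, mask, env):
    """Execute only the expressed genes of a postfix genome on a stack."""
    stack = []
    for gene, active in zip(genome, mask):
        if not active:
            continue  # silenced gene: kept in the genotype, not executed
        if gene in OPS:
            if len(stack) < 2:
                continue  # too few arguments: treated as a no-op
            b, a = stack.pop(), stack.pop()
            stack.append(OPS[gene](a, b))
        elif isinstance(gene, str):
            stack.append(env[gene])  # variable lookup
        else:
            stack.append(float(gene))  # numeric constant
    return stack[-1] if stack else 0.0


def error(genome, mask, data):
    """Sum of squared errors of the expressed program over the data."""
    return sum((run(genome, mask, env) - target) ** 2 for env, target in data)


def epigenetic_hill_climb(genome, mask, data):
    """Flip one on/off flag at a time; the genome itself is never modified."""
    def key(m):
        return (error(genome, m, data), sum(m))  # fit first, then size

    best, best_key = list(mask), key(mask)
    while best:
        trials = []
        for i in range(len(best)):
            trial = best.copy()
            trial[i] = not trial[i]
            trials.append(trial)
        candidate = min(trials, key=key)
        if key(candidate) >= best_key:
            break  # no single flip improves fit or size
        best, best_key = candidate, key(candidate)
    return best, best_key[0]


# Hypothetical example: the genome encodes x*y + x, the target is x*y.
genome = ["x", "y", "*", "x", "+"]
points = [(1.0, 2.0), (2.0, 3.0), (3.0, 1.0), (0.5, 4.0)]
data = [({"x": x, "y": y}, x * y) for x, y in points]
mask, err = epigenetic_hill_climb(genome, [True] * len(genome), data)
```

Here the search silences the trailing `x` and `+` genes, so the expressed program becomes `x*y` while the full genome remains available to later variation operators.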
The proposed Epigenetic Linear Genetic Programming (ELGP) method is tested on a large array of data generated from nonlinear ordinary differential equations (ODEs), as well as from three real-world processes, to evaluate the quality of its solutions. The paper is organized as follows. We formulate in Section 2 the identification problem and describe in Section 3 the ELGP method and its application to inference of dynamic models. We also review the relevant work in the context of GP and nonlinear dynamics modeling in Section 4. We then present the experimental analysis of different epigenetic implementations on a series of increasingly complex problems in Section 5. We begin by testing the method on a large set of data obtained from simulated nonlinear ODEs in different engineering fields, in order to illustrate its breadth of application. We then perform identification on hundreds of randomly constructed nonlinear systems, varying in complexity and dimensionality, to evaluate the scalability of the method in comparison to traditional GP approaches. Finally, we apply the ELGP method to three real-world problems, including the identification of (1) a benchmark cascaded tanks system (Wigren and Schoukens, 2013), (2) a chemical distillation tower, and (3) an industrial wind turbine. The results are presented in Section 6 and include comparisons of ELGP's performance in relation to other linear and nonlinear identification methods. We finish this discussion with an analysis of population diversity to study how gene expression evolves for each ELGP implementation.
Problem statement
The underlying assumption of symbolic regression is that there exists an analytical model of the system that would generate the measured observations $y(t_k)$ at the sample times $t_k$ under the input, $u(t)$, as

$$y(t_k) = \hat{y}(t_k) + \nu, \qquad \hat{y}(t_k) = M^*\big(x(t_k), u(t_k), \Theta^*\big)$$

where $\hat{y}$ is the model output, $\nu$ represents measurement noise in $y$, $x$ is the vector of state variables, and $M^*(\Theta^*)$ is the correct model form embodied by the correct parameter values $\Theta^*$, written hereafter $M^*$ for brevity. In the search for
Epigenetic Linear Genetic Programming (ELGP)
In symbolic regression, the search for candidate models is conducted by GP, wherein a population of computer programs, consisting of variables and instructions that produce models of the process, is evolved. Mathematical building blocks compose the genotype of each program that is optimized by an evolutionary algorithm. The operation steps of ELGP, outlined in Fig. 1, start with randomly constructed programs that comprise an
Related work
There has been some work to incorporate epigenetic learning into GP, notably by Tanev and Yuta (2008). In that case the focus was to model histone modification through a double cell representation as demonstrated in a predator-prey problem. Unlike our approach, Tanev did not treat lifetime epigenetic modifications as inheritable, as is supported by recent studies in biology (Turner, 2000, Kaati et al., 2002, Dias and Ressler, 2014). There have also been a number of studies on the effects of
Experimental methods
We describe in this section the evolutionary framework to which ELGP is applied and the settings that are used to conduct the experiments, followed by a description of the set of problems that are used to compare the performance of each GP treatment. Section 5.1 describes the algorithms used to perform selection and search operations within GP, which build upon previous symbolic regression research. In Section 5.2, we describe implementation optimizations related to efficiently performing hill
Results and discussion
We first present results obtained on the textbook ODE problems in Section 6.1. Comparisons include the number of exact solutions, the fitness of the best solutions, and complexity of the best models found by each method. Next we analyze in Section 6.2 the ODE suite results according to fitness as a function of point evaluations in training and testing over the entire suite. To give a sense of the scalability of the methods, we group the results by target complexity and compare the number of
Conclusions
The results suggest that epigenetic local search is a significant addition to GP. We find that epigenetic methods, especially EHC methods, outperform a baseline implementation of GP in terms of fitness minimization, exact solutions, and equation intelligibility on textbook nonlinear ODE systems and randomly generated dynamic systems. Furthermore, we show in comparison to other nonlinear approaches that ELGP is able to return concise and accurate models in three real-world applications. Our study
Acknowledgments
We thank Nicholas McPhee and the Hampshire Computational Intelligence lab for helping improve this paper. This work is partially supported by the NSF-sponsored IGERT: Offshore Wind Energy Engineering, Environmental Science, and Policy (Grant no. 1068864), as well as Grant nos. 1017817, 1129139, and 1331283. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the National Science Foundation.
References (68)
- et al., 1998. Nonlinear model structure identification using genetic programming. Control Eng. Pract.
- et al., 2008. Nonlinear system identification: from multiple-model networks to Gaussian processes. Eng. Appl. Artif. Intell.
- et al., 1990. Structure identification of nonlinear dynamic systems—a survey on input/output approaches. Automatica
- et al., 1996. Adaptive simulated annealing genetic algorithm for system identification. Eng. Appl. Artif. Intell.
- et al., 1996. A new method for identification and control of nonlinear dynamic systems. Eng. Appl. Artif. Intell.
- et al., 2015. Approximate solving of nonlinear ordinary differential equations using least square weight function and metaheuristic algorithms. Eng. Appl. Artif. Intell.
- et al., 2008. Epigenetic programming: genetic programming incorporating epigenetic learning through modification of histones. Inf. Sci.
- 2006. Recursive prediction error identification and scaling of non-linear state space models using a restricted black box parameterization. Automatica
- Arnaldo, I., Krawiec, K., O'Reilly, U.-M., 2014. Multiple regression genetic programming. In: Proceedings of the 2014...
- Banzhaf, W., 1994. Genotype-phenotype-mapping and neutral variation – a case study in genetic programming. In: Parallel...
- Nonlinear System Identification: NARMAX Methods in the Time, Frequency, and Spatio-temporal Domains
- Automated reverse engineering of nonlinear dynamical systems. Proc. Natl. Acad. Sci.
- Evolutionary modeling of systems of ordinary differential equations with genetic programming. Genet. Program. Evol. Mach.
- Orthogonal least squares methods and their application to non-linear system identification. Int. J. Control
- Inference of hidden variables in systems of differential equations with genetic programming. Genet. Program. Evol. Mach.
- A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II
- PACAP and the PAC1 receptor in post-traumatic stress disorder. Neuropsychopharmacology
- Parental olfactory experience influences behavior and neural structure in subsequent generations. Nat. Neurosci.
- Gene expression programming: a new adaptive algorithm for solving problems. Complex Syst.
- Adding learning to the cellular development of neural networks: evolution and the Baldwin effect. Evolut. Comput.
- Epigenetics: a historical overview. Epigenetics
- The changing concept of epigenetics. Ann. N.Y. Acad. Sci.
- Cardiovascular and diabetes mortality determined by nutrition during parents' and grandparents' slow growth period. Eur. J. Human. Genet.
- Improving symbolic regression with interval arithmetic and linear scaling
- Effects of constant optimization by nonlinear least squares minimization in symbolic regression
- Genetic Programming: on the Programming of Computers by Means of Natural Selection
- Automated synthesis of analog electrical circuits by means of genetic programming. IEEE Trans. Evolut. Comput.