Population diversity and inheritance in genetic programming for symbolic regression

Burlacu, Bogdan; Yang, Kaifeng; Affenzeller, Michael

doi:10.1007/s11047-022-09934-x

Open access
Published: 17 January 2023

Natural Computing (2023)

Abstract

In this work we aim to empirically characterize two important dynamical aspects of GP search: the evolution of diversity and the propagation of inheritance patterns. Diversity is calculated at the genotypic and phenotypic levels using efficient similarity metrics. Inheritance information is obtained via a full genealogical record of evolution as a directed acyclic graph and a set of methods for extracting relevant patterns. Advances in processing power enable our approach to handle previously infeasible graph sizes of millions of arcs and vertices. To enable a more comprehensive analysis we employ three closely-related but different evolutionary models: canonical GP, offspring selection and age-layered population structure. Our analysis reveals that a relatively small number of ancestors are responsible for producing the majority of descendants in later generations, leading to diversity loss. We show empirically across a selection of five benchmark problems that each configuration is characterized by different rates of diversity loss and different inheritance patterns, in support of the idea that each new problem may require a unique approach to solve optimally.

1 Introduction

Genetic programming (GP) is an evolutionary metaheuristic that performs a guided, stochastic search over a space of computer programs: syntactical constructs that encode a domain-specific way to solve a given problem.

The idea of allowing the computer to evolve programs itself can be traced back to the works of Turing (1950), Samuel (1959) and von Neumann (1966) in the 1950s and 1960s, and saw its first applications in the 1980s by Forsyth (1981), Cramer (1985) and Hicklin (1986). Later, genetic programming became popular and well known to the public thanks to John Koza’s contributions (Koza 1990, 1992) in the 1990s. Each of these developments has been inspired by the earlier works of Holland (1975) and De Jong (1975a).

Similar to other Evolutionary Algorithms (EAs), GP evolves a population of solution candidates by following the Darwinian principles of evolution (i.e., the survival of the fittest) and utilizing biologically-inspired operators (e.g., crossover, mutation, and selection). The feature that distinguishes GP from other EAs (e.g., genetic algorithm, evolution strategies) is its variable-length solution representation as S-expressions,^{Footnote 1} which inherently enables it to search for free-form structures using observed data without any domain knowledge.

The versatility and generality of S-expressions and the robustness of population-based evolutionary algorithms make GP particularly well-suited for solving regression problems by searching over the space of mathematical expressions. The application of GP for regression is known as symbolic regression (SR) (Tackett 1995). In SR, the algorithm can evolve mathematical expressions with the goal of fitting an unknown function and return the best expression from the population as a result. Compared to black-box regression methods, this approach has the advantage of producing an understandable model from a human’s perspective and is therefore useful in application areas such as system identification.

1.1 Motivation

As a general method that requires no a-priori knowledge, GP has several degrees of freedom which determine its search behavior. Parameters such as population size, generational limit, selection pressure, genetic operator probabilities, S-expression depth and length limits and the primitive set all contribute to the success of the search. Finding appropriate values for these settings is a difficult process and, furthermore, these settings cannot be easily generalized across different problems (Smit and Eiben 2009). Ideally, an optimal parameterization is characterized by an appropriate balance between the exploratory and exploitative aspects of the search, but different problems require different strategies for exploration and exploitation (Eiben and Schippers 1998; Črepinšek et al. 2013; Michalewicz 1996). Parameter tuning approaches range from trial and error to following general guidelines and performing hyperparameter optimization (De Jong 2007; Michalewicz and Schmidt 2007; Eiben et al. 2007; Lobo and Lima 2007; Meyer-Nieberg and Beyer 2007).

In the context of GP, exploratory behavior enables recombination operators to sample different regions of the search space while exploitative behavior enables the gradual, incremental refinement of existing solution candidates (Črepinšek et al. 2013). An ideal search typically starts with a short exploratory phase in the initial stages followed by a longer exploitative phase, increasing in intensity as the algorithm converges. The transition between the two phases takes place gradually under the effects of selection pressure as successful individuals multiply their own genes in the population while unsuccessful ones become extinct.

Since surviving genes propagate through inheritance, the study of inheritance patterns in GP can provide valuable insight into the effectiveness of the genetic search, the loss of diversity and occurrence of repeated patterns. This information can potentially help devise better strategies for the parameterization or automatic tuning of the algorithm.

Furthermore, more recently proposed evolutionary models such as offspring selection (OSGP) (Affenzeller et al. 2005) and age-layered population structure (ALPS) (Hornby 2006) have been specifically designed to alleviate the issue of diversity loss in GP populations and should be considered as alternatives to the canonical algorithm as originally defined by Koza.

OSGP introduces a secondary selection step called offspring selection which enforces the rule that a generated child must be at least to some degree better than its parents in order to be accepted into the population. This ensures that only adaptive changes (e.g. mutations and crossovers that increase fitness) are accepted, thus encouraging fit alleles to surface earlier in the search. Its design makes OSGP less sensitive to the maximum number of generations parameter, since the search can terminate automatically when offspring that outperform their parents can no longer be produced.

ALPS assigns individuals an age value that is incremented with each function evaluation. This value is then used to divide the population into different age layers (or groups) organized hierarchically and evolved independently. The layers represent non-overlapping age intervals ordered from youngest to oldest. The interval size is defined as a function of the age gap parameter. Offspring inherit their parents’ ages after genetic operators. When an individual’s age exceeds the maximum age of layer $t, t\in \{ 0, 1, \ldots \}$, this individual is migrated to layer $t+1$, which has a higher maximum age in the hierarchy. Randomly generated, fresh individuals are constantly introduced at the bottom of the hierarchy (into the youngest age layer) to ensure a constant trickle of new genes into the population. The algorithm nurtures these individuals by restricting the competition for survival to individuals in the same age layer. Through this concept, ALPS aims, on the one hand, to promote population diversity and, on the other hand, to achieve the idea of open-ended evolution, i.e. an evolutionary process which, through a constant influx of new genes, avoids premature convergence and can evolve indefinitely.

This empirical study of different evolutionary models on a collection of regression benchmark problems aims to highlight the different behaviors caused by different approaches to diversity preservation. Diversity is considered the most important dynamical aspect of GP. This knowledge can, on the one hand, help guide researchers of evolutionary algorithm design by informing on the expected effects of certain kinds of algorithmic extensions (e.g. age-, similarity- or genealogy-based), and on the other hand, it can help practitioners decide which approach is best suited for a particular problem and application domain.

The idea to identify and analyze patterns in GP populations is deeply related to properties such as diversity and convergence and has been thoroughly investigated both theoretically and empirically since the early 1990s. In the following we summarize the relevant approaches.

1.2 Schema-theoretic and empirical approaches

A schema is a genetic template or pattern that contains wildcard symbols and can match multiple individuals in the population. A schema theorem is a mathematical model for predicting the evolution of schema frequencies, where frequency is simply defined as the proportion of matching individuals. Originally formulated by John Holland for the genetic algorithm, the GA Schema Theorem offers an elegant probabilistic explanation for the propagation of fitter than average genetic patterns (schemata) within populations of binary strings (where schema fitness is the average fitness of matching individuals). Since the binary strings are all of the same length, the equations for conditional probabilities can be easily derived by taking into account the probabilities of the genetic operators themselves. The GA Schema Theorem helped establish genetic algorithms in the field of metaheuristics by proving that the evolutionary search converges and is superior to random search.
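
The definitions of schema frequency and schema fitness given above can be illustrated with a minimal sketch for fixed-length binary strings. The wildcard character `*`, the helper names and the toy "count the ones" fitness are assumptions made only for this example; they are not taken from the paper.

```python
def matches(schema: str, individual: str) -> bool:
    """True if the binary string fits the schema ('*' matches either bit)."""
    return len(schema) == len(individual) and all(
        s == "*" or s == b for s, b in zip(schema, individual)
    )


def schema_stats(schema, population, fitness):
    """Return (frequency, schema fitness) of a schema in a population.

    Frequency is the proportion of matching individuals; schema fitness is
    the average fitness of the matching individuals (None if nothing matches).
    """
    matching = [ind for ind in population if matches(schema, ind)]
    frequency = len(matching) / len(population)
    mean_fitness = (
        sum(fitness(ind) for ind in matching) / len(matching) if matching else None
    )
    return frequency, mean_fitness


def onemax(individual: str) -> int:
    """Toy fitness (assumed for illustration): number of '1' bits."""
    return individual.count("1")


population = ["1101", "1001", "0111", "1111"]
frequency, mean_fitness = schema_stats("1**1", population, onemax)
print(frequency, mean_fitness)
```

Three of the four strings match `1**1`, so the schema frequency is 0.75 and its fitness is the mean fitness of those three strings.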

However, the mathematical models for the propagation of schemata during the evolutionary run become significantly more complex when dealing with variable-length S-expressions. The additional complexity introduced by variable-length encodings is a hard limiting factor for the scope and explanation power of schema theorems for GP.

Thanks to the work of Poli and collaborators, several schema theorems for GP have found applications in the study of bloat, operator behavior, population initialization or even parameterization of GP (Poli and Mcphee 2009, 2009; Poli et al. 2010). However, the proposed mathematical models are large, hard to handle computationally and consequently hard to put into practice. These limitations have instead led to a plethora of empirical analysis methods that are intrinsically connected to the notions of population diversity and algorithm convergence. These methods attempt to record information about the search process and investigate whether it confirms the effects predicted by schema theorems.

Pattern mining approaches use frequency counting and tree matching algorithms to analyze changes in the genetic make-up of the population and measure loss of diversity. Overall they confirm schema-theoretical assumptions, for example that selection increases the frequency of fitter-than-average schemata (Poli and Langdon 1997; Majeed 2005; Smart et al. 2007) while causing the occurrence of large amounts of repeated patterns (semantic and syntactic) (Wilson and Heywood 2005; Ciesielski and Li 2007; Langdon and Banzhaf 2008; Kameya et al. 2008; Joó 2010).

Genealogy analysis approaches aggregate information about lineages and hereditary relationships to investigate aspects of inheritance, diversity or operator behavior (e.g. the ratio of ancestors to descendants, the fitness of offspring compared to their parents). They find that few individuals from the initial population have surviving descendants in the last generation and that only a small fraction of ancestors contribute genes to the best solution (Donatucci et al. 2014; Burlacu et al. 2015a, b; McPhee et al. 2016, 2017). Additionally, network theory approaches indicate that algorithm convergence is preceded by the formation of large connected components within the genealogy graph (Kuber et al. 2014). The cited studies suggest that “modules” or “building blocks” emerge as a result of recombination and selection, and that in general, the choice of operators (selection in particular) has a big impact on search performance.

1.3 Genealogy analysis

In this approach, the genealogy of the best individual (or alternatively, of the entire population in the last generation) is analysed using notions of graph theory and GP-specific heuristics (e.g., graph traversal methods that distinguish between crossover root and non-root parents, are able to handle mutations or consider elitism). The genealogy is typically stored as a directed acyclic graph where vertices represent individuals and arcs represent hereditary relationships between them. The graph structure also stores relevant metadata such as fitness values, the types of genetic operators that were applied and the coordinates of genotype changes. Graph traversal mechanisms are integrated with the approach in order to aggregate the necessary information, typically by performing a bottom-up traversal (i.e. from descendant to ancestor). Dynamical aspects such as the evolution of diversity over time can then be correlated with various quantifiable properties of the genealogy graph.
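
As a rough illustration of the data structure described above, the following sketch stores a genealogy as a directed acyclic graph and collects all ancestors of an individual with a bottom-up traversal. The class name, method names and metadata keys are hypothetical and do not reflect the authors' implementation.

```python
from collections import defaultdict


class Genealogy:
    """Genealogy DAG: vertices are individuals, arcs point from child to parent."""

    def __init__(self):
        self.parents = defaultdict(list)  # child id -> list of parent ids
        self.meta = {}                    # individual id -> metadata dict

    def add(self, individual, parents=(), **metadata):
        """Register an individual, its parents and metadata (e.g. operator)."""
        self.parents[individual].extend(parents)
        self.meta[individual] = metadata

    def ancestors(self, individual):
        """Bottom-up traversal: every vertex reachable from descendant to ancestor."""
        seen, stack = set(), [individual]
        while stack:
            for parent in self.parents[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


g = Genealogy()
for founder in "abcd":                       # initial population
    g.add(founder, operator="init")
g.add("e", parents=["a", "b"], operator="crossover")
g.add("f", parents=["c"], operator="mutation")
g.add("g", parents=["e", "f"], operator="crossover")

contributing = g.ancestors("g")
founders_used = sorted(contributing & set("abcd"))
print(founders_used)
```

In this toy genealogy, founder `d` has no surviving descendants, which is the kind of observation the genealogy studies cited in this section aggregate over entire runs.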

Kuber et al. (2014) used graph theory to analyze the properties of “ancestral networks” trying to identify large connected components as a sign of algorithm convergence, on the assumption that “the natural behavior of most species is to form cohesive communities (networks)”. Based on this information, they introduced a reseeding heuristic that reinitialized a fraction of the population if the size of the connected component exceeded an arbitrary limit, with the end goal of tuning the balance between exploration and exploitation during the run.

Donatucci et al. (2014) recorded GP genealogies using graph database software and found that the percentage of initial individuals with direct descendants in the final generation is extremely small. They also observed that success rates of crossover and mutation are quite small (dropping to about 2% for mutation in the final generations).

Burlacu et al. (2015a, 2015b) recorded the genealogy of the entire population together with contextual information about the positions of subtrees swapped by crossover. This enabled the identification of the most frequently sampled subtrees in the entire genealogy, by working backwards from the final population. They find that a) most swaps done by crossover are ineffective as diversity decreases, b) a very small percentage of ancestors contribute genes to the final generation descendants and c) the most sampled subtrees are correlated to the problem building blocks.

McPhee et al. (2016, 2017) used graph databases to record genealogical information for entire runs using PushGP, a GP variant for software synthesis. They explored full genealogical records for 100 runs by working backwards from the last generation, and found that genealogy information can unveil significant new details about search behavior, particularly about the impact of selection. Algorithmic runs (and corresponding genealogies) can differ significantly between different selection methods (lexicase and tournament selection), significantly influencing convergence behavior. The main drawback of this approach is that the generated graph database is very large (e.g., a tournament selection dataset with 100 runs is 31 GB).

Genealogy analysis can help discover specific details about the process of evolution, such as patterns of inheritance (e.g., origins of useful genes in the ancestor population, which genes act in concert with other genes to form fit subtrees or building blocks, the mechanisms by which they are inherited), common ancestries (e.g., degree of relatedness between solutions), or the propagation of schemata.

1.4 Population diversity

The importance of population diversity has been widely recognized within the GP community (Michalewicz 1996). As McPhee and Hopper note, “progress in evolution depends fundamentally on the existence of variation in the population” (McPhee and Hopper 1999). Diversity refers to the differences between individuals, which can be quantified at the structural (genotype) level, at the semantic (phenotype) level or as a mixture of the two.

Some studies focused on genotype diversity counted frequencies of genes (De Jong 1975b; D’haeseleer and Bluming 1994) or different genotypes (Langdon 1998). McPhee and Hopper proposed three different criteria by contrasting a population with its ancestry population (e.g., the initial population). Using canonical GP subtree crossover with no mutation operator, they concluded that most individuals from the initial population do not contribute any genetic material to the final population (McPhee and Hopper 1999). Burke et al. extended this idea by computing a ratio of unique subtrees to total subtrees (Burke et al. 2002). Other approaches to quantify genotype diversity include distance-based methods, using the Hamming metric (Eshelman and Schaffer 1993; Shimodaira 1997), Euclidean metric (Mc Ginley et al. 2011; Bessaou et al. 2000), edit distance (de Jong et al. 2001; Ekárt and Németh 2000), structural distance through an information hypertree (iTree) (Ekárt and Gustafson 2004), bottom-up distance (Affenzeller et al. 2017) or entropy-based methods (Li et al. 2004; Burke et al. 2004; Misevičius 2011). The problem of computing the edit distance between unordered labeled fixed-degree $k \ge 2$ trees has been shown to be NP-hard for optimization problems (Zhang et al. 1992; Bille 2005).
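
To make the ratio of unique subtrees to total subtrees mentioned above concrete, here is a minimal sketch. It assumes trees are stored as nested tuples of the form `(function, child, ...)` with strings as leaves, and it compares subtrees purely syntactically (argument order matters); both choices are simplifications made for this example and may differ from the cited measure.

```python
def subtrees(tree):
    """Yield the subtree rooted at every node of a nested-tuple tree."""
    yield tree
    if isinstance(tree, tuple):
        for child in tree[1:]:  # tree[0] is the function symbol
            yield from subtrees(child)


def unique_subtree_ratio(population):
    """Number of distinct subtrees divided by the total number of subtrees."""
    all_subtrees = [s for tree in population for s in subtrees(tree)]
    return len(set(all_subtrees)) / len(all_subtrees)


population = [
    ("+", "x1", "x2"),
    ("+", "x1", "x2"),
    ("*", "x1", ("+", "x1", "x2")),
]
print(unique_subtree_ratio(population))
```

A ratio close to 1 indicates that almost every subtree is distinct, while a ratio close to 0 indicates that the population is dominated by repeated genetic material.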

One point of criticism for genotype diversity is that a genetically diverse population does not guarantee behavioral diversity, since many trees can evaluate to the same expression or behavior (Burks and Punch 2018). This is due to the complex and non-injective genotype-to-phenotype mapping typical of GP. Instead, phenotype diversity was suggested to better serve the purpose of ensuring diverse behaviors in the population. Some approaches to quantify phenotype diversity consist of calculating ranks of individuals (Luerssen 2005), computing the Hamming distance-to-average-point (Ursem 2002), entropy-based methods (Rosca 1995; Burke et al. 2004) or correlation distance (Neshatian and Zhang 2009; Shirkhorshidi et al. 2015; Winkler et al. 2018; Agapitos et al. 2019).

Many different approaches for maintaining diversity at either the genotype or phenotype level, or for more effectively exploiting the remaining diversity, have been studied (Jackson 2010; Črepinšek et al. 2013; Burke et al. 2004; Burks and Punch 2018). Promoting diversity as a secondary goal has also been shown to work very well in the multi-objective case (de Jong et al. 2001; Burlacu et al. 2019).

In this paper we focus on diversity at both genotype and phenotype levels. We note cases where the studied algorithms behave differently in terms of diversity and correlate these behaviors with observed search outcomes. Diversity is computed using appropriate distance measures based on tree isomorphism and Pearson’s correlation for genotypes and phenotypes, respectively. The remainder of this work is structured as follows: Sect. 2 introduces the algorithms that are tested in this paper. Section 3 defines the dynamic analysis methodology. Section 4 describes the test problems and parameter settings of algorithms, and shows experimental results. Section 5 concludes with a summary and discusses future work.

2 Algorithm description

In this section, the algorithmic variants introduced in Sect. 1.1 – OSGP and ALPS – are described in technical detail. Both algorithms are seen as extensions to canonical GP which introduce specific augmentations (i.e. an additional selection step for OSGP, age-based niching for ALPS) in order to improve population diversity, as well as the overall algorithm’s ability to exploit existing diversity and generate adaptive change. Our analysis will then consider these augmentations and quantify their effects on evolutionary dynamics, from the perspective of diversity (measured at the genotype and phenotype level) and properties of the resulting genealogy graphs.

2.1 Genetic programming

Genetic programming is capable of solving optimization problems which can be formulated as:

$$\begin{aligned} \arg \min f({\textbf{x}}), \quad {\textbf{x}} \in {\mathcal {X}} \end{aligned} \tag{2.1}$$

where ${\textbf{x}} = (x_1, \ldots , x_l)^\top $ represents a decision vector (also known as an individual in EAs), ${\mathcal {X}}$ is the search space for the decision vector ${\textbf{x}}$, and $f(\cdot )$ is a function that measures how good or bad a decision vector ${\textbf{x}}$ is.

Unlike the canonical Genetic Algorithm (Holland 1975) and Evolution Strategies (Rechenberg 1965), the search space ${\mathcal {X}}$ in GP is not merely a binary space ${\mathbb {D}}$ or a continuous space ${\mathbb {R}}$, but the union of a function space ${\mathcal {F}}$ and a terminal node space $\mathcal{T}\mathcal{N} = \{ s, c\} $, where $s \in {\mathcal {S}} $ represents a symbol and $c \in {\mathbb {R}}$ represents a constant. The function set ${\mathcal {F}}$ includes the available symbolic functions (e.g., ${\mathcal {F}} =\{+, -, \times , \div , \exp , \log \}$), each of which requires at least one argument. A terminal node in $\mathcal{T}\mathcal{N}$ can be either a symbolic variable $s$ or a constant $c$. Therefore, compared with GA and ES, the structure of a GP individual (i.e., the relationships among variables) can be evolved and optimized. To emphasize this ability to evolve structure, a GP individual is denoted as $T$ in this paper, where $T = (t_1, \ldots , t_l)^\top $ and $t_i \in {\mathcal {F}} \cup \mathcal{T}\mathcal{N}$, $i\in \{ 1, \ldots , l\}$, is a node in the tree encoding.

Many other encoding strategies may be used to compose a meaningful symbolic expression, such as linear encoding (Holmes and Barclay 1996), graph-based encoding (Poli and Langdon 1998), grammar-based encoding (McKay et al. 2010) and others. However, the canonical tree encoding (Koza 1992) remains the most popular due to its simplicity (allowing for straightforward implementation of genetic operators) and is used in this paper.

In a syntax tree, an internal node represents a mathematical function or logic operator in ${\mathcal {F}}$ and a leaf node is a terminal such as a variable or constant in $\mathcal{T}\mathcal{N}$. An individual $T=(t_1, \ldots , t_l)^\top $ is an ordered sequence of function symbols and terminals, where the first element represents a function in ${\mathcal {F}}$ and the following elements represent arguments for the first element. Some other related terminologies are introduced here:

1. A root node is the topmost node of a tree.
2. A tree depth (d) is the number of edges on the longest path from a root to a leaf.
3. A tree length (l) is the total number of nodes (including the root node) contained in the tree.
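
The depth and length definitions above can be computed directly on the ordered (prefix) sequence $T=(t_1, \ldots , t_l)^\top $. The sketch below is one possible way to do so; the arity table and function names are assumptions for illustration, and any symbol missing from the table is treated as a terminal.

```python
# Assumed arities for an illustrative function set; unknown symbols are terminals.
ARITY = {"+": 2, "-": 2, "*": 2, "/": 2, "exp": 1, "log": 1}


def tree_length(prefix):
    """Tree length l: total number of nodes, including the root."""
    return len(prefix)


def tree_depth(prefix):
    """Tree depth d: number of edges on the longest root-to-leaf path."""

    def walk(pos):
        # Returns (depth of the subtree at pos, index just past that subtree).
        nxt, deepest_child = pos + 1, -1
        for _ in range(ARITY.get(prefix[pos], 0)):
            child_depth, nxt = walk(nxt)
            deepest_child = max(deepest_child, child_depth)
        return deepest_child + 1, nxt

    depth, end = walk(0)
    if end != len(prefix):
        raise ValueError("sequence is not a single well-formed tree")
    return depth


T1 = ["*", "+", "x1", "x2", "-", "x3", "x4"]  # (x1 + x2) * (x3 - x4)
print(tree_length(T1), tree_depth(T1))
```

A single terminal has depth 0 and length 1 under these definitions.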

Similar to other EAs, three basic operators are applied in GP: crossover, mutation, and selection. The pseudocode of GP is shown in Algorithm 1. Our GP implementation employs elitist selection, where the best individual of the current generation is automatically transferred to the next generation. We use the notation $B(p)$ to denote a single-outcome Bernoulli trial:

$$\begin{aligned} B(p) = {\left\{ \begin{array}{ll} 1 &{} \text { with probability }p\\ 0 &{} \text { with probability }1-p \end{array}\right. } \end{aligned}$$

Crossover is an operator that generates offspring by swapping parts of the parents’ genotypes and is an important exploration technique in EAs. In a tree-encoded individual, the most common crossover is the subtree swapping crossover, where two nodes $t'$ and $t^{''}$ are randomly sampled from two parents $T_1$ and $T_2$, respectively. Then the subtrees ${\mathcal {T}}'$ and ${\mathcal {T}}^{''}$, with root nodes $t'$ and $t^{''}$, are exchanged between parents $T_1$ and $T_2$. Typically, as suggested by Koza (1992), the crossover operator performs a biased selection of nodes to be swapped where internal (function) nodes are prioritized, as swapping leaf nodes leads to less meaningful results.

Example 1

Suppose two parents $T_1 = (\times , +, x_1, x_2, -, x_3, x_4 )^\top $ and $T_2 = (-, \times , x_1, x_3, \times , x_2, x_4 )^\top $ are given, and the randomly sampled nodes are $t' =\{-\} $ and $t^{''} = \{ \times \}$ for $T_1$ and $T_2$, respectively. Then, one of the offspring generated by the subtree swapping crossover for $T_1$ and $T_2$ can be $(\times , +, x_1, x_2, \times , x_1, x_3)^\top $, as shown in Fig. 1.
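
Example 1 can be reproduced with a short sketch of subtree swapping on the prefix encoding. The helper names, the arity table and the `internal_prob=0.9` bias toward function nodes are assumptions for illustration; the source only states that internal nodes are prioritized.

```python
import random

ARITY = {"+": 2, "-": 2, "*": 2, "/": 2}  # assumed; other symbols are terminals


def subtree_end(prefix, start):
    """Index one past the last node of the subtree rooted at prefix[start]."""
    pending, pos = 1, start
    while pending:
        pending += ARITY.get(prefix[pos], 0) - 1
        pos += 1
    return pos


def swap_subtree(receiver, i, donor, j):
    """Offspring: receiver with the subtree at i replaced by donor's subtree at j."""
    return (
        receiver[:i]
        + donor[j:subtree_end(donor, j)]
        + receiver[subtree_end(receiver, i):]
    )


def sample_node(prefix, rng, internal_prob=0.9):
    """Biased node choice: prefer function nodes with probability internal_prob."""
    internal = [k for k, s in enumerate(prefix) if ARITY.get(s, 0) > 0]
    leaves = [k for k, s in enumerate(prefix) if ARITY.get(s, 0) == 0]
    pool = internal if internal and rng.random() < internal_prob else leaves
    return rng.choice(pool)


T1 = ["*", "+", "x1", "x2", "-", "x3", "x4"]
T2 = ["-", "*", "x1", "x3", "*", "x2", "x4"]

# Example 1: t' is the '-' node of T1 (index 4), t'' the first '*' of T2 (index 1).
child = swap_subtree(T1, 4, T2, 1)
print(child)

# A randomly sampled crossover using the biased node selection.
rng = random.Random(0)
random_child = swap_subtree(T1, sample_node(T1, rng), T2, sample_node(T2, rng))
```

The fixed-index call yields the offspring given in Example 1; the second offspring of the swap is obtained by exchanging the roles of receiver and donor.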

Mutation in EAs acts mostly as a provider of genetic diversity to work against premature convergence. We allow mutation to perform random changes to the tree structure, such as removing subtrees, replacing subtrees, or altering node types or node parameters.

In GP problems, an objective function quantitatively measures the difference between predictions and actual responses on a specific dataset. For this purpose, some common objective functions for supervised learning problems are the mean squared error, normalized mean squared error, coefficient of determination, or the Pearson correlation coefficient. In this paper, the Pearson correlation coefficient squared^{Footnote 2} is used to measure a model’s prediction error.
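
For reference, the squared Pearson correlation coefficient between predictions and targets can be computed as in the sketch below. The function name and the choice to return 0 for a constant output are assumptions made here, not details confirmed by the source.

```python
import math


def pearson_r2(y_true, y_pred):
    """Squared Pearson correlation coefficient between targets and predictions."""
    n = len(y_true)
    mean_t, mean_p = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        return 0.0  # assumption: a constant vector is treated as uncorrelated
    return cov * cov / (var_t * var_p)


y = [1.0, 2.0, 3.0, 4.0]
scaled = [2.0 * v + 5.0 for v in y]  # differs from y only by scale and offset
print(math.isclose(pearson_r2(y, scaled), 1.0))
```

Because the correlation is invariant to linear scaling and shifting of the predictions, a model that differs from the target only by scale and offset still attains a value of 1.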

2.2 Offspring selection with GP

Offspring selection GP (Affenzeller et al. 2005) conditionally accepts a new offspring into the population based on its fitness. The pseudocode of general offspring selection, which relates to the loop between lines 7 and 14 in Algorithm 1, is shown in Algorithm 2.

In Algorithm 2, a new offspring child is generated through normal mate selection and crossover/mutation operators. The mate selection operator can be any canonical mate selection operator, such as tournament selection or roulette wheel selection.

A comparison factor $c\in [0,1]$ controls the threshold to determine whether an offspring child can survive or not, where the threshold is within a closed interval between the parents’ fitness values $f(P_1)$ and $f(P_2)$ at line 7^{Footnote 3}.

An offspring child is added to offspring population $Pop(t+1)$ if it meets the criterion. Otherwise, this offspring child will be added to POOL at line 11, which collects the “discarded” offspring by the offspring selection. The discarded population pool POOL is initialized as an empty set in the first generation and is cumulatively updated in the later generations.

Parameter r defines the target success ratio, i.e. the ratio of the new population that should be filled with individuals that are better than their parents (according to the offspring selection criterion, line 4 in Algorithm 2).

Selection pressure limit $p_{sel}$ represents the ratio of the maximum number of generated offspring to the population size $n_{pop}$. We call the parameter setting $c=1$ and $r=1$ strict OS because every accepted child must then be strictly better than both of its parents: $\forall i \in \{1,2,\ldots , n_{pop} \}, f(child) < \min \big ( f(P_1), f(P_2) \big )$. Further variants of offspring selection and their influence on various problems are explained and studied in Affenzeller et al. (2009, 2014).
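
The interplay of the comparison factor $c$, the success ratio $r$ and the selection pressure limit $p_{sel}$ can be sketched as follows for a minimization problem. This is not Algorithm 2: the linear interpolation used for the threshold, the uniform random mate selection, and the per-call (rather than cumulative) pool are simplifying assumptions, and all function names are hypothetical.

```python
import random


def os_threshold(f_p1, f_p2, c):
    """Acceptance threshold between the parents' fitness values (minimization).

    c = 0 gives the worse parent's fitness, c = 1 the better parent's fitness.
    """
    worse, better = max(f_p1, f_p2), min(f_p1, f_p2)
    return worse + c * (better - worse)


def offspring_selection_step(pop, fitness, make_child,
                             c=1.0, r=1.0, p_sel=100.0, rng=random):
    """Build one new population; returns (new population, selection pressure)."""
    n_pop = len(pop)
    accepted, pool, generated = [], [], 0
    while generated < p_sel * n_pop:          # selection pressure limit
        quota_met = len(accepted) >= r * n_pop
        if quota_met and len(accepted) + len(pool) >= n_pop:
            break
        p1, p2 = rng.choice(pop), rng.choice(pop)  # stand-in for mate selection
        child = make_child(p1, p2)
        generated += 1
        if fitness(child) < os_threshold(fitness(p1), fitness(p2), c):
            accepted.append(child)            # successful offspring
        else:
            pool.append(child)                # "discarded" offspring
    # Successful offspring first, remaining slots filled from the pool.
    return (accepted + pool)[:n_pop], generated / n_pop


pop = [4.0, 2.0, 8.0, 6.0]
new_pop, pressure = offspring_selection_step(
    pop, fitness=abs, make_child=lambda a, b: 0.5 * min(a, b)
)
print(len(new_pop), pressure)
```

In the toy run every child beats its better parent, so the population is filled after exactly $n_{pop}$ offspring and the observed selection pressure is 1. When successful offspring become rare, the number of generated offspring grows until the limit $p_{sel} \cdot n_{pop}$ is reached, which is the condition that allows the search to terminate automatically.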

2.3 Age layer population structure based GP

Age-layered population structure (ALPS) was proposed by Hornby (2006) as an approach to promote diversity via niching and periodic reinitialization. ALPS achieves this goal by introducing a new property, age, for every individual. A set of simple age update rules define how individuals can “grow old”:

A new randomly initialized individual is assigned an age value of zero.
Offspring produced by crossover or mutation inherit the age of the oldest parent.
An individual’s age is updated by an increment of one upon fitness evaluation.
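
The three age update rules listed above can be expressed in a few lines. The `Individual` class and the function names are hypothetical and serve only to make the rules concrete.

```python
from dataclasses import dataclass


@dataclass
class Individual:
    genome: list
    age: int = 0  # rule 1: a randomly initialized individual starts at age zero


def make_offspring(genome, *parents):
    """Rule 2: offspring inherit the age of the oldest parent."""
    return Individual(genome, age=max(p.age for p in parents))


def evaluate(individual, fitness):
    """Rule 3: age is incremented by one upon fitness evaluation."""
    individual.age += 1
    return fitness(individual.genome)


young = Individual(["x1"])
old = Individual(["x2"], age=3)
child = make_offspring(["x1", "x2"], young, old)
score = evaluate(child, fitness=len)  # toy fitness, assumed for illustration
print(child.age, score)
```

Here the child starts at the age of its older parent and is one unit older after its first evaluation.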

The working principle in ALPS is then to restrict competition for survival only between individuals within the same age layer, in order to prevent “young” individuals from being driven to extinction by older individuals (that have had time to evolve better fitness). This form of niching promotes diversity by making the competition for survival “more fair”.

A new parameter AgeGap is introduced to define the age layers of the population. An age layer i is expressed as a half-closed integer interval:

$$\begin{aligned} \begin{aligned} L_i&= [ \underline{\text {age}}^{(i)}, \overline{\text {age}}^{(i)} ),\,\text {where}\\ \underline{\text {age}}^{(i)}&= {\left\{ \begin{array}{ll} \overline{\text {age}}^{(i-1)} &{} \text { if } i > 0\\ 0 &{} \text { otherwise }\\ \end{array}\right. }\\ \overline{\text {age}}^{(i)}&= AgeLimit (i, AgeGap )\\ \end{aligned} \end{aligned}$$

The function AgeLimit is a user-defined function that defines the age intervals separating the layers; its value typically increases with the layer index according to a chosen scheme (e.g., linear, Fibonacci, polynomial or exponential). The layer hierarchy is defined from the bottom up, such that the youngest age layer (or “bottom” layer) is at the bottom of the hierarchy while the oldest layer is at the top. The entire population of ALPS is the union of the populations in all active layers.
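
One possible shape for the AgeLimit function and the resulting half-open layer intervals $L_i$ is sketched below. The concrete multiplier sequences (for instance squares for the polynomial scheme) are assumptions chosen for illustration; the source leaves AgeLimit user-defined.

```python
def age_limit(i, age_gap, scheme="linear"):
    """Exclusive upper age bound of layer i (0-based) for a given scheme."""
    if age_gap <= 0:
        raise ValueError("age_gap must be positive")
    if scheme == "linear":
        factor = i + 1                 # 1, 2, 3, 4, ...
    elif scheme == "fibonacci":
        a, b = 1, 2
        for _ in range(i):
            a, b = b, a + b
        factor = a                     # 1, 2, 3, 5, 8, ...
    elif scheme == "polynomial":
        factor = (i + 1) ** 2          # 1, 4, 9, 16, ... (assumed form)
    elif scheme == "exponential":
        factor = 2 ** i                # 1, 2, 4, 8, ...
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return factor * age_gap


def layer_bounds(i, age_gap, scheme="linear"):
    """Half-open interval [lower, upper) of ages belonging to layer i."""
    lower = 0 if i == 0 else age_limit(i - 1, age_gap, scheme)
    return lower, age_limit(i, age_gap, scheme)


def layer_of(age, age_gap, scheme="linear"):
    """Index of the layer whose interval contains the given age."""
    i = 0
    while age >= age_limit(i, age_gap, scheme):
        i += 1
    return i


print([age_limit(i, 10, "fibonacci") for i in range(5)])
print(layer_bounds(3, 10, "fibonacci"), layer_of(30, 10, "fibonacci"))
```

With an age gap of 10 and the Fibonacci scheme, an individual of age 30 no longer fits into layer 2, whose interval is $[20, 30)$, and therefore belongs to layer 3.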

Initially, there will be a single age layer. However, age update rules will quickly lead to a heterogeneous distribution of age values. Once an individual’s age exceeds the maximum age of a layer, the individual will be pushed to the next layer and a new individual will be inserted in its place (either transferred from the previous layer, if it exists, or randomly initialized). In addition, the bottom layer will be periodically reinitialized with new individuals, every AgeGap generations. This ensures that new genetic material periodically trickles into the process, making its way up the layer hierarchy and ensuring that diversity levels are maintained during the evolution.

As a result, ALPS is a unidirectional hierarchical structure that pushes individuals from a bottom layer to a top layer. This idea is different from the Hierarchical Fair Competition structure (HFC) (Hu and Goodman 2002), which uses fitness information to segregate layers. It can be argued that age segregation in ALPS is more fair as it does not consider fitness.

The pseudocode of ALPS is shown in Algorithm 3. ALPS starts with the initialization of its parameters (lines 1–3). The main loop of the ALPS framework (lines 4–29) cycles through each active layer and creates a new layer until a termination criterion is satisfied at line 4. Some common stopping criteria include the number of function evaluations and the maximum number of active layers. In the main loop, a child’s age is updated (line 10). If this child’s age does not exceed the age limit of the current layer, the child survives in the current layer. Otherwise, it is removed from the current layer and used to initialize (lines 13–15) or update (lines 17–20) the population in the next layer. The next layer becomes active when the number of individuals in it equals the population size (lines 24–25). Once the main loop terminates, the best solution $T^*$ is selected from the union of all layers.

Any EA can be integrated with ALPS, for example, to produce ALPS with GA (Hornby 2006; Fleck 2015), ALPS with GP (Patel and Clack 2007; Mateiu 2019) and ALPS with NSGA-II (Lichtberger 2019) for multi-objective optimization problems. The age-based segregation into layers in ALPS is a form of niching, a technique commonly used for diversity preservation in EAs. Other popular niching techniques in GP are:

fitness-based, using explicit (Goldberg et al. 1987) and implicit (Smith et al. 1993) fitness sharing, clustering (Yin and Germay 1993) or clearing (Pétrowski 1996; Singh and Deb 2006)
replacement-based, using deterministic (Mahfoud 1996), probabilistic (Mengshoel and Goldberg 1999) or local tournament-based (Mengshoel and Goldberg 2008) crowding or restricted tournament selection (Harik 1995)
preservation-based, using species conservation (Li et al. 2002)
hybrid, which implement adaptive versions or ensembles of fitness-, replacement- or preservation-based methods (Leung and Liang 2003; Yu and Suganthan 2010).

The form of niching implemented by ALPS belongs to the category of preservation-based methods. Burks and Punch (2017) compared ALPS with other preservation-based methods (e.g., Age-Fitness Pareto optimization algorithm (Schmidt and Lipson 2011), subtree semantic geometric crossover (Nguyen et al. 2016), and genetic marker diversity algorithm for GP (Burks and Punch 2015)) and found that ALPS has the highest behavioral diversity in the entire population. However, a detailed analysis of the population diversity in each layer is not available in Burks and Punch (2017).

The algorithms presented in this section represent conceptually sound augmentations of canonical GP meant to improve its performance by promoting – and more efficiently exploiting – population diversity. However, the resulting changes and their respective impact on the algorithm’s convergent behavior cannot be easily compared (for instance by attempting to replicate identical benchmarking conditions, using a fixed evaluation budget or fixing as many common parameters as possible). The effects produced by changes in the internal mechanisms of selection or the topology of the population need to be considered within a methodological framework able to perform a detailed analysis both from a diversity and a genealogical perspective.

In the following, we introduce an analysis methodology capable of producing a detailed description of the evolutionary process, in terms of diversity (structural and semantic), inheritance, subtree frequencies and building blocks.

3 Analysis of evolutionary dynamics

3.1 Population diversity

In Sect. 1, we have argued the importance of understanding dynamical aspects of GP and the deep connection between algorithm convergence and population diversity. Many previous works used a single measure to describe diversity (e.g., structural/distance-based, behavioral, based on the number of unique individuals, genetic markers, etc.).

In order to have a more complete picture of the interplay between selection and recombination, in this work, we use two complementary measures (genotypic and phenotypic), corresponding to the two levels of interaction between genetic operators: selection acting on fitness at phenotype level, crossover and mutation acting on tree syntax at genotype level.

We employ computationally efficient methods for computing diversity. Our measures are normalized to the interval [0, 1] and have a natural correspondence to the concept of similarity between two trees. This in turn allows for a more natural graphical representation of diversity loss in the empirical section, using a quantity that increases over time. We use the notation $Sim_g(\cdot )$ and $Sim_p(\cdot )$ to denote genotypic and phenotypic similarities, respectively. These similarities can be equivalently formulated as distances by simply subtracting the respective quantities from 1 (e.g., taking $1 - Sim_g(\cdot )$ and $1 - Sim_p(\cdot )$ as genotypic and phenotypic distances, respectively).

Since the measures are symmetric, the similarity matrix associated with the population is symmetric as well, and only the $\frac{n_\text {pop}(n_\text {pop}-1)}{2}$ pairwise similarities above the diagonal must be computed. The average similarity ${\overline{Sim}}$ of the population is defined as the average of the similarity matrix. This definition applies to both genotypic similarity and phenotypic similarity.
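As a rough sketch of this computation, the average can be taken over the unordered pairs of individuals only. The function name `average_similarity` and the decision to leave the diagonal (self-similarity) out of the average are assumptions made for illustration; `sim` stands for either $Sim_g$ or $Sim_p$.

```python
from itertools import combinations


def average_similarity(population, sim):
    """Average a symmetric similarity over all unordered pairs.

    Only n*(n-1)/2 similarities are evaluated, one per pair above the
    diagonal. Excluding the diagonal is an assumption of this sketch.
    """
    pairs = list(combinations(population, 2))
    if not pairs:
        raise ValueError("need at least two individuals")
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


# Toy similarity: 1 if the two individuals are identical, 0 otherwise.
def exact_match(a, b):
    return 1.0 if a == b else 0.0


print(average_similarity(["x", "x", "y"], exact_match))
```

With three individuals there are three pairs, of which only one matches, so the printed average is one third.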

3.1.1 Genotypic similarity

For genotypic diversity, we use a tree isomorphism algorithm which works in a bottom-up fashion similar to Valiente (2001). However, in our case, we resort to hashing (Burlacu et al. 2019), a more efficient algorithmic approach in which each tree node is assigned a unique hash value determined by its structure. Subtree isomorphism is then reduced to finding nodes with the same hash value.

We employ the Jaccard index to obtain a similarity value knowing the number of matching nodes between two trees.

$$\begin{aligned} Sim_{g}(T_i, T_j) = \frac{ |M (T_i, T_j) |}{|T_i| + |T_j| - |M (T_i, T_j) |} \end{aligned}$$

(3.1)

where $M$ is a mapping of isomorphic subtrees from $T_i$ to $T_j$, $|M (T_i, T_j) |$ is the number of matched nodes (the size of the intersection of $T_i$ and $T_j$), and $|T_i|, |T_j|$ are the lengths of the two trees. Note that the corresponding Jaccard distance $1 - Sim_g(T_i, T_j)$ is a metric.
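The following is a minimal sketch of Eq. (3.1), not the implementation of Burlacu et al. (2019). It assumes trees are nested tuples of the form `(label, child, ...)`, hashes every node bottom-up from its label and the hashes of its children (in order, so commutative operators are not canonicalized), and counts matching nodes as the multiset intersection of the two hash collections. The names `collect_hashes` and `sim_g` are hypothetical.

```python
from collections import Counter


def collect_hashes(tree, counts):
    """Hash `tree` bottom-up and record one hash value per node in `counts`."""
    label, *children = tree
    child_hashes = tuple(collect_hashes(c, counts) for c in children)
    h = hash((label, child_hashes))  # equal subtrees get equal hashes
    counts[h] += 1
    return h


def sim_g(t1, t2):
    """Jaccard index over matched nodes, following Eq. (3.1)."""
    c1, c2 = Counter(), Counter()
    collect_hashes(t1, c1)
    collect_hashes(t2, c2)
    matched = sum((c1 & c2).values())          # |M(T_i, T_j)|
    size1, size2 = sum(c1.values()), sum(c2.values())
    return matched / (size1 + size2 - matched)


t1 = ("+", ("x",), ("*", ("x",), ("y",)))   # x + x * y
t2 = ("-", ("x",), ("*", ("x",), ("y",)))   # x - x * y
print(sim_g(t1, t2))
```

The two example trees have five nodes each and share every node except the root, so four nodes match and the similarity is 4 / (5 + 5 - 4).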

3.1.2 Phenotypic similarity

We compute phenotypic similarity between two tree models as the squared Pearson correlation coefficient between their responses on the same training data D. As the correlation is not defined when the standard deviation is zero for either one of the responses, we introduce a couple of extra rules to make sure that the similarity measure is well defined:

if both standard deviations are zero, then the similarity is one.
if only one of the standard deviations is zero, then the similarity is zero.

This is formalized below. Let $\sigma _i$ and $\sigma _j$ be the standard deviations of the responses of the two tree models $T_i$ and $T_j$:

$$\begin{aligned} Sim_{p}(T_i, T_j) = {\left\{ \begin{array}{ll} 1 \qquad \qquad \text {if } \sigma _i = \sigma _j = 0 \\ 0 \qquad \qquad \text {if } \sigma _i = 0, \sigma _j \ne 0 \text { or } \sigma _i \ne 0, \sigma _j = 0\\ r^2_{T_i, T_j} \qquad \text {otherwise} \end{array}\right. } \end{aligned}$$

(3.2)

As we are using the squared correlation coefficient, the value of $Sim_{p}(T_i,T_j) \in [0,1]$.
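Eq. (3.2) can be sketched in a few lines. The function name `sim_p` is hypothetical, the two arguments are assumed to be equally long sequences of model responses on the same data, and the zero-variance test uses exact comparison, which a practical implementation would likely replace with a small tolerance.

```python
def sim_p(y1, y2):
    """Squared Pearson correlation with the two zero-variance rules of Eq. (3.2)."""
    n = len(y1)
    m1, m2 = sum(y1) / n, sum(y2) / n
    # Sums of squared deviations; zero exactly when the response is constant.
    s1 = sum((a - m1) ** 2 for a in y1)
    s2 = sum((b - m2) ** 2 for b in y2)
    if s1 == 0 and s2 == 0:
        return 1.0            # both responses constant
    if s1 == 0 or s2 == 0:
        return 0.0            # exactly one response constant
    cov = sum((a - m1) * (b - m2) for a, b in zip(y1, y2))
    return cov * cov / (s1 * s2)


print(sim_p([1, 2, 3], [3, 2, 1]))   # perfectly anti-correlated responses
print(sim_p([5, 5, 5], [1, 2, 3]))   # one constant response
```

Because the correlation is squared, perfectly anti-correlated responses are treated as fully similar, which reflects that one model can be turned into the other by a linear transformation of its output.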

3.2 Inheritance patterns and building block propagation

The propagation of subtrees from one generation to the next represents an important aspect of GP evolutionary search behavior. In this context, we consider building blocks to be highly fit subtrees that are inherited by offspring individuals and increase their frequency in the population.

The occurrence of building blocks can be investigated using a frequentist approach where, in a first phase, the action of genetic operators (crossover, mutation, selection) is recorded in the form of a genealogy graph, such that mating events and their outcomes are represented as vertices and arcs (Burlacu et al. 2015a).

Mining this body of information in a second phase can lead to the discovery of inheritance patterns, common ancestries, or frequently sampled subtrees. We describe in the following a practical approach for constructing a genealogy graph containing this information, as well as analysis methods on the graph for the discovery of frequent schemas and building blocks.

3.2.1 Subtree tracing

In the context of subtree tracing, we use the term fragment $fg(T_i)$ to refer to an inherited subtree in an individual $T_i$, which is passed on to $T_i$ from one of its parents. Fragments are saved in the genealogy graph G together with positional information identifying them in the parent and offspring trees. The positional coordinates are represented using indices into the node sequence given by a prefix traversal of the respective tree.
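To make the positional coordinates concrete, the sketch below linearizes a tree in prefix order and stores, for each node, the length of the subtree rooted at it. The nested-tuple tree encoding and the names `prefix_nodes` and `subtree_span` are assumptions for illustration only.

```python
def prefix_nodes(tree):
    """Return (label, subtree_length) pairs in prefix (pre-order) order."""
    label, *children = tree
    nodes = [None]                      # slot for the root, filled in below
    for child in children:
        nodes.extend(prefix_nodes(child))
    nodes[0] = (label, len(nodes))      # length includes the root itself
    return nodes


def subtree_span(nodes, index):
    """Half-open range of prefix indices covered by the subtree at `index`."""
    return index, index + nodes[index][1]


tree = ("+", ("x",), ("*", ("x",), ("y",)))   # x + x * y
nodes = prefix_nodes(tree)
print(nodes)
print(subtree_span(nodes, 2))
```

A subtree is thus fully identified by its prefix index and its length: here the subtree rooted at `*` starts at index 2 and covers indices 2, 3 and 4.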

The storage scheme shown in Fig. 2 describes a situation where crossover and mutation are applied in succession. In this case, the offspring generated by crossover represents an intermediate individual that will not be present in the offspring population. In order to preserve continuity, this intermediate individual is included as a vertex in the genealogy graph together with the respective fragments. Using the prefix indices stored in the genealogy graph, all fragments inherited via crossover can be identified. In the case of mutation, new genes can be introduced spontaneously. Thus, we consider the case when there is an overlap between the fragment and the traced subtree. In this case it is still useful to follow the origin of the subtree at the site of mutation in order to gain a deeper understanding of the effects of the genetic operators.

The concept is illustrated in Fig. 3 where two parent individuals participate in crossover. The prefix indices of the swapped subtree within the non-root parent and the child individual are recorded in the genealogy graph.

A genealogy graph offers the ability to traverse an individual’s ancestry following the inherited fragments until the initial population is reached. Consequently, subtree tracing makes it possible to identify, for every subtree in the population, the exact sequence of genetic operations that has led to its creation.

The logic for tracing any subtree in the population relies on a set of simple arithmetic rules for processing prefix indices according to the relationship between the subtree being traced ${\mathcal {T}}$ and the fragment $f_g(T)$, relative to the containing individual T. Table 1 illustrates these rules and provides the foundation for a recursive procedure that navigates the genealogy graph and constructs a so-called trace graph, containing in its vertices only those ancestor individuals of the current tree which have contributed a genetic fragment to its genotype.
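The kind of index arithmetic involved can be illustrated with a small interval comparison. This sketch does not reproduce the rules of Table 1; it only shows, under the assumption that a subtree is described by its prefix index and its length, how the relationship between a traced subtree and a fragment can be decided. The function name `relation` and the returned labels are hypothetical.

```python
def relation(traced_index, traced_length, frag_index, frag_length):
    """Classify how a traced subtree relates to a fragment in the same tree.

    Both are given by prefix index and length, so each occupies the
    half-open index range [index, index + length).
    """
    t_start, t_end = traced_index, traced_index + traced_length
    f_start, f_end = frag_index, frag_index + frag_length
    if f_start <= t_start and t_end <= f_end:
        return "traced_in_fragment"     # traced subtree lies inside the fragment
    if t_start <= f_start and f_end <= t_end:
        return "fragment_in_traced"     # traced subtree contains the fragment
    return "disjoint"                   # the two subtrees share no node


print(relation(3, 1, 2, 3))   # node 3 inside a fragment covering 2..4
print(relation(0, 5, 2, 3))   # whole tree contains the fragment
print(relation(1, 1, 2, 3))   # node 1 lies outside the fragment
```

Since two subtrees of the same tree are always either nested or disjoint in prefix order, these three outcomes cover every case.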

Table 1 Prefix indices arithmetic rules for subtree inclusion. Here, fg(T) is a tree fragment, ${\mathcal {T}}$ is the traced subtree, the ${\mathcal {I}}(.)$ operator returns the prefix index and ${\mathcal {L}}(.)$ returns the length of the fragment or subtree
