1 Introduction

Genetic programming (GP) is an evolutionary metaheuristic that performs a guided, stochastic search over a space of computer programs: syntactical constructs that encode a domain-specific way to solve a given problem.

The idea of allowing the computer to evolve programs itself can be traced back to the works of Turing (1950), Samuel (1959) or von Neumann (1966) in the 1950s and saw its first applications in the 1980s by Forsyth (1981), Cramer (1985) and Hicklin (1986). Later, genetic programming became popular and well known to the public thanks to John Koza’s contributions  Koza (1990, 1992) in the 1990s. Each of these developments have been inspired by the earlier works by Holland (1975) and De Jong (1975a) in 1975.

Similar to other Evolutionary Algorithms (EAs), GP evolves a population of solution candidates by following the Darwinian principles of evolution (i.e., the survival of the fittest) and utilizing biologically-inspired operators (e.g., crossover, mutation, and selection). The feature that distinguishes GP from other EAs (e.g., genetic algorithm, evolution strategies) is its variable-length solution representation as S-expressions,Footnote 1 which inherently enables it to search for free-form structures using observed data without any domain knowledge.

The versatility and generality of S-expressions and the robustness of population-based evolutionary algorithms make GP particularly well-suited for solving regression problems by searching over the space of mathematical expressions. The application of GP for regression is known as symbolic regression (SR) (Tackett 1995). In SR, the algorithm can evolve mathematical expressions with the goal of fitting an unknown function and return the best expression from the population as a result. Compared to black-box regression methods, this approach has the advantage of producing an understandable model from a human’s perspective and is therefore useful in application areas such as system identification.

1.1 Motivation

As a general method that requires no a-priori knowledge, GP has several degrees of freedom which determine its search behavior. Parameters such as population size, generational limit, selection pressure, genetic operator probabilities, S-expression depth and length limits or primitive set all contribute to the success of the search. Finding appropriate values for these settings is a difficult process and furthermore these settings cannot be easily generalized across different problems (Smit and Eiben 2009). Ideally, an optimal parameterization is characterized by an appropriate balance between both the exploratory and exploitative aspects of the search, but different problems require different strategies for exploration and exploitation (Eiben and Schippers 1998; Črepinšek et al. 2013; Michalewicz 1996). Parameter tuning approaches range from trial and error to following general guidelines and performing hyper parameter optimization (De Jong 2007; Michalewicz and Schmidt 2007; Eiben et al. 2007; Lobo and Lima 2007; Meyer-Nieberg and Beyer 2007).

In the context of GP, exploratory behavior enables recombination operators to sample different regions of the search search space while exploitative behavior enables the gradual, incremental refinement of existing solution candidates (Črepinšek et al. 2013). An ideal search typically starts with a short exploratory phase in the initial stages followed by a longer exploitative phase, increasing in intensity as the algorithm converges. The transition between the two phases takes place gradually under the effects of selection pressure as successful individuals multiply their own genes in the population while unsuccessful ones become extinct.

Since surviving genes propagate through inheritance, the study of inheritance patterns in GP can provide valuable insight into the effectiveness of the genetic search, the loss of diversity and occurrence of repeated patterns. This information can potentially help devise better strategies for the parameterization or automatic tuning of the algorithm.

Furthermore, more recently proposed evolutionary models such as offspring selection (OSGP) (Affenzeller et al. 2005) and age-layered population structure (ALPS) (Hornby 2006) have been specifically designed to alleviate the issue of diversity loss in GP populations and should be considered as alternatives to the canonical algorithm as originally defined by Koza.

OSGP introduces a secondary selection step called offspring selection which enforces the rule that a generated child must be at least to some degree better than its parents in order to be accepted into the population. This ensures that only adaptive changes (e.g. mutations and crossovers that increase fitness) are accepted, thus encouraging fit alleles to surface earlier in the search. Its design makes OSGP less sensitive to the maximum number of generations parameter, since the search can terminate automatically when offspring that outperform their parents can no longer be produced.

ALPS assigns individuals an age value that is incremented with each function evaluation. This value is then used to divide the population into different age layers (or groups) organized hierarchically and evolved independently. The layers represent non-overlapping age intervals ordered from youngest to oldest. The interval size is defined as a function of the age gap parameter. Offspring inherit their parents’ ages after genetic operators. When an individual’s age is older than the maximum age of layer \(t, t\in \{ 0, 1, \ldots \}\), this individual will be migrated to layer \(t+1\) of higher maximum age in the hierarchy. Randomly generated, fresh individuals are constantly introduced at the bottom of the hierarchy (into the youngest age layer) to ensure a constant trickle of new genes into the population at the bottom layer. The algorithm nurtures these individuals by restricting the competition for survival only in the same age layer. Through this concept, ALPS aims to, on the one hand, promote population diversity and on the other hand, to achieve the idea of open-ended evolution, e.g. an evolutionary process which, through a constant influx of new genes, avoids premature convergence and can evolve indefinitely.

This empirical study of different evolutionary models on a collection of regression benchmark problems aims to highlight the different behaviors caused by different approaches to diversity preservation. Diversity is considered the most important dynamical aspect of GP. This knowledge can, on the one hand, help guide researchers of evolutionary algorithm design by informing on the expected effects of certain kinds of algorithmic extensions (e.g. age-, similarity- or genealogy-based), and on the one hand, it can help practitioners decide which approach is best suited for a particular problem and application domain.

The idea to identify and analyze patterns in GP populations is deeply related to properties such as diversity and convergence and has been thoroughly investigated both theoretically and empirically since the early 1990s. In the following we summarize the relevant approaches.

1.2 Schema-theoretic and empirical approaches

A schema is a genetic template or pattern that contains wildcard symbols and can match multiple individuals in the population. A schema theorem is a mathematical model for predicting the evolution of schema frequencies, where frequency is simply defined as the proportion of matching individuals. Originally formulated by John Holland for the genetic algorithm, the GA Schema Theorem offers an elegant probabilistic explanation for the propagation of fitter than average genetic patterns (schemata) within populations of binary strings (where schema fitness is the average fitness of matching individuals). Since the binary strings are all of the same length, the equations for conditional probabilities can be easily derived by taking into account the probabilities of the genetic operators themselves. The GA Schema Theorem helped establish genetic algorithms in the field of metaheuristics by proving that the evolutionary search converges and is superior to random search.

However, the mathematical models for the propagation of schemata during the evolutionary run become significantly more complex when dealing with variable-length S-expressions. The additional complexity introduced by variable-length encodings is a hard limiting factor for the scope and explanation power of schema theorems for GP.

Thanks to the work of Poli and collaborators, several schema theorems for GP have found applications in the study of bloat, operator behavior, population initialization or even parameterization of GP (Poli and Mcphee 2009, 2009; Poli et al. 2010). However, the proposed mathematical models are large, hard to handle computationally and consequently hard to put into practice. These limitations have instead led to a plethora of empirical analysis methods that are intrinsically connected to the notions of population diversity and algorithm convergence. These methods attempt to record information about the search process and investigate whether it confirms the effects predicted by schema theorems.

Pattern mining approaches use frequency counting and tree matching algorithms to analyze changes in the genetic make-up of the population and measure loss of diversity. Overall they confirm schema-theoretical assumptions, for example that selection increases the frequency of fitter-than-average schemata (Poli and Langdon 1997; Majeed 2005; Smart et al. 2007) while causing the occurrence of large amounts of repeated patterns (semantic and syntactic) (Wilson and Heywood 2005; Ciesielski and Li 2007; Langdon and Banzhaf 2008; Kameya et al. 2008; Joó 2010).

Genealogy analysis approaches aggregate information about lineages and hereditary relationships to investigate aspects of inheritance, diversity or operator behavior (e.g. the ratio of ancestors to descendants, the fitness of offspring compared to their parents). They find that few individuals from the initial population have surviving descendants in the last generation and that only a small fraction of ancestors contribute genes to the best solution (Donatucci et al. 2014; Burlacu et al. 2015a, b; McPhee et al. 2016, 2017). Additionally, network theory approaches indicate that algorithm convergence is preceded by the formation of large connected components within the genealogy graph (Kuber et al. 2014). The cited studies suggest that “modules” or “building blocks” emerge as a result of recombination and selection, and that in general, the choice of operators (selection in particular) has a big impact on search performance.

1.3 Genealogy analysis

In this approach, the genealogy of the best individual (or alternatively, of the entire population in the last generation) is analysed using notions of graph theory and GP-specific heuristics (e.g., graph traversal methods that distinguish between crossover root and non-root parents, are able to handle mutations or consider elitism). The genealogy is typically stored as a directed acyclic graph where vertices represent individuals and arcs represent hereditary relationships between them. The graph structure also stores relevant metadata such as fitness values or types of genetic operators that were applied and the coordinates of genotype changes. Graph traversal mechanisms are integrated with the approach in order to aggregate the necessary information, typically by performing a bottom-up traversal (i.e. from descendant to ancestor). Dynamical aspects such as the evolution of diversity over time are then able to be correlated with various quantifiable properties of the genealogy graph.

Kuber et al. (2014) used graph theory to analyze the properties of “ancestral networks”  trying to identify large connected components as a sign of algorithm convergence, on the assumption that “the natural behavior of most species is to form cohesive communities (networks)”. Based on this information, they introduced a reseeding heuristic that reinitialized a fraction of the population if the size of the connected component exceeded an arbitrary limit, with the end goal of tuning the balance between exploration and exploitation during the run.

Donatucci et al. (2014) recorded GP genealogies using graph database software and found that the percentage of initial individuals with direct descendants in the final generation is extremely small. They also observed that success rates of crossover and mutation are quite small (dropping to about 2% for mutation in the final generations).

Burlacu et al. (2015a, 2015b) recorded the genealogy of the entire population together with contextual information about the positions of subtrees swapped by crossover. This enabled the identification of the most frequently sampled subtrees in the entire genealogy, by working backwards from the final population. They find that a) most swaps done by crossover are ineffective as diversity decreases, b) a very small percentage of ancestors contribute genes to the final generation descendants and c) the most sampled subtrees are correlated to the problem building blocks.

McPhee et al. (2016, 2017) used graph databases to record genealogical information for entire runs using PushGP, a GP variant for software synthesis. They explored full genealogical records for 100 runs by working backwards from the last generation, and found that genealogy information can unveil significant new details about search behavior, particularly about the impact of selection. Algorithmic runs (and corresponding genealogies) can differ significantly between different selection methods (lexicase and tournament selection), significantly influencing convergence behavior. The main drawback of this approach is that the generated graph database is very large (e.g., a tournament selection dataset with 100 runs is 31 GB).

Genealogy analysis can help discover specific details about the process of evolution, such as patterns of inheritance (e.g., origins of useful genes in the ancestor population, which genes act in concert with other genes to form fit subtrees or building blocks, the mechanisms by which they are inherited), common ancestries (e.g., degree of relatedness between solutions), or the propagation of schemata.

1.4 Population diversity

The importance of population diversity has been widely recognized within the GP community (Michalewicz 1996), as noted by McPhee and Hopper “progress in evolution depends fundamentally on the existence of variation in the population” (McPhee and Hopper 1999). Diversity refers to the differences between individuals, which can be quantified at the structural (genotype) level, at the semantic (phenotype) level or as a mixture of the two.

Some studies focused on genotype diversity counted frequencies of genes (De Jong 1975b; D’haeseleer and Bluming 1994) or different genotypes (Langdon 1998). McPhee and Hopper proposed three different criteria by contrasting a population with its ancestry population (e.g., the initial population). They concluded that most individuals from the initial population do not contribute any genetic material to the final population by using a canonical GP subtree crossover with no mutation operator (McPhee and Hopper 1999). Burke extended this idea by computing a ratio of unique subtrees to total subtrees (Burke et al. 2002). Other approaches to quantify genotype diversity include distance-based methods, using the Hamming metric (Eshelman and Schaffer 1993; Shimodaira 1997), Euclidean metric (Mc Ginley et al. 2011; Bessaou et al. 2000), edit distance (de Jong et al. 2001; Ekárt and Németh 2000)), structural distance through an information hypertree (iTree) (Ekárt and Gustafson 2004), bottom-up distance (Affenzeller et al. 2017) or entropy-based methods (Li et al. 2004; Burke et al. 2004; Misevičius 2011). The problem of computing the edit distance between unordered labeled fixed-degree \(k \ge 2\) trees has been shown to be NP-hard for optimization problems (Zhang et al. 1992; Bille 2005).

One point of criticism for genotype diversity is that a genetically diverse population does not guarantee behavioral diversity, since many trees can evaluate to the same expression or behavior (Burks and Punch 2018). This is due to the complex and non-injective genotype-to-phenotype mapping typical of GP. Instead, phenotype diversity was suggested to better serve the purpose of ensuring diverse behaviors in the population. Some approaches to quantify phenotype diversity consist of calculating ranks of individuals (Luerssen 2005), computing the Hamming distance-to-average-point (Ursem 2002), entropy-based methods (Rosca 1995; Burke et al. 2004) or correlation distance (Neshatian and Zhang 2009; Shirkhorshidi et al. 2015; Winkler et al. 2018; Agapitos et al. 2019).

Many different approaches for maintaining diversity at either the genotype or phenotype levels, or more effectively exploiting remaining diversity have been studied (Jackson 2010; Črepinšek et al. 2013; Burke et al. 2004; Burks and Punch 2018). Promoting diversity as a secondary goal has also been shown to work very well in the multi-objective case (de Jong et al. 2001; Burlacu et al. 2019).

In this paper we focus on diversity at both genotype and phenotype levels. We note cases where the studied algorithms behave differently in terms of diversity and correlate these behaviors with observed search outcomes. Diversity is computed using appropriate distance measures based on tree isomorphism and Pearson’s correlation for genotypes and phenotypes, respectively. The remainder of this work is structured as follows: Sect. 2 introduces the algorithms that are tested in this paper. Section 3 defines the dynamic analysis methodology. Section 4 describes the test problems and parameter settings of algorithms, and shows experimental results. Section 5 concludes with a summary and discusses future work.

2 Algorithm description

In this section, the algorithmic variants introduced in Sect. 1.1 – OSGP and ALPS – are described in technical detail. Both algorithms are seen as extensions to canonical GP which introduce specific augmentations (i.e. an additional selection step for OSGP, age-based niching for ALPS) in order to improve population diversity, as well as the overall algorithm’s ability to exploit existing diversity and generate adaptive change. Our analysis will then consider these augmentations and quantify their effects on evolutionary dynamics, from the perspective of diversity (measured at the genotype and phenotype level) and properties of the resulting genealogy graphs.

2.1 Genetic programming

Genetic programming is capable of solving optimization problems which can be formulated as:

$$\begin{aligned} \arg \min f({\textbf{x}}), \quad {\textbf{x}} \in {\mathcal {X}} \end{aligned}$$
(2.1)

where \({\textbf{x}} = (x_1, \ldots , x_l)^\top \) represents a decision vector (also known as individual) in EAs, \({\mathcal {X}}\) is the search space for the decision vector \({\textbf{x}}\), and f(.) is a function to measure how good/bad a decision vector \({\textbf{x}}\) is.

Unlike the canonical Genetic Algorithm (Holland 1975) and Evolution Strategies (Rechenberg 1965), the search space \({\mathcal {X}}\) in GP is not merely a binary space \({\mathbb {D}}\) or a continuous space \({\mathbb {R}}\), but a union set of a function space \({\mathcal {F}}\) and a terminal node space \(\mathcal{T}\mathcal{N} = \{ s, c\} \), where \(s \in {\mathcal {S}} \) represents a symbol and \(c \in {\mathbb {R}}\) represents a constant number. The function set \({\mathcal {F}}\) includes available symbolic functions ( e.g., \({\mathcal {F}} =\{+, -, \times , \div , \exp , \log \}\) ) which require at least one terminal node as an argument. A terminal node in \(\mathcal{T}\mathcal{N}\) can be either a symbol variable x or a constant number c. Therefore, compared with GA and ES, the structure of a GP individual (i.e., relationships among variables) can be evolved and optimized. To emphasize the ability of structure evolvement, a GP individual is denoted as T in this paper, where \(T = (t_1, \ldots , t_l)^\top \) and \(t_i \in \{{\mathcal {F}} \cup \mathcal{T}\mathcal{N}\}|_{i\in \{ 1, \ldots , l\} }\) is a node in the tree encoding.

Many other encoding strategies may be used to compose a meaningful symbolic expression such as linear encoding  (Holmes and Barclay 1996), graph-based encoding  (Poli and Langdon 1998), a grammar-based encoding (McKay et al. 2010) and others. However, the canonical tree encoding (Koza 1992) remains the most popular due to its simplicity (allowing for straightforward of genetic operators) and is used in this paper.

In a syntax tree, an internal node represents a mathematical function or logic operator in \({\mathcal {F}}\) and a leaf node is a terminal such as a variable or constant in \(\mathcal{T}\mathcal{N}\). An individual \(T=(t_1, \ldots , t_l)^\top \) is an ordered sequence of function symbols and terminals, where the first element represents a function in \({\mathcal {F}}\) and the following elements represent arguments for the first element. Some other related terminologies are introduced here:

  1. 1.

    A root node is the topmost node of a tree.

  2. 2.

    A tree depth (d) is the number of edges on the longest path from a root to a leaf.

  3. 3.

    A tree length (l) is the total number of nodes (including the root node) contained in the tree.

Similar to other EAs, three basic operators are applied in GP crossover, mutation, and selection. The pseudocode of GP is shown in Algorithm 1. Our GP implementation employs elitist selection, where the best individual of the current generation is automatically transferred to the next generation. We use the notation B(p) to denote a single-outcome Bernoulli trial:

$$\begin{aligned} B(p) = {\left\{ \begin{array}{ll} 1 &{} \text { with probability }p\\ 0 &{} \text { with probability }1-p \end{array}\right. } \end{aligned}$$
figure a

Crossover is an operator that generates offspring by swapping parts of the parents’ genotypes and is an important exploration technique in EAs. In a tree-encoded individual, the most common crossover is the subtree swapping crossover, where two nodes \(t'\) and \(t^{''}\) are randomly sampled from two parents \(T_1\) and \(T_2\), respectively. Then the subtrees \({\mathcal {T}}'\) and \({\mathcal {T}}^{''}\), with root nodes \(t'\) and \(t^{''}\), are exchanged between parents \(T_1\) and \(T_2\). Typically, as suggested by Koza in  Koza (1992), the crossover operator performs a biased selection of nodes to be swapped where internal (function) nodes are prioritized, as swapping leaf nodes leads to less meaningful results.

Example 1

Suppose two parents \(T_1 = (\times , +, x_1, x_2, -, x_3, x_4 )^\top \) and \(T_2 = (-, \times , x_1, x_3, \times , x_2, x_4 )^\top \), two random nodes are \(t' =\{-\} \) and \(t^{''} = \{ \times \}\) for \(T_1\) and \(T_2\), respectively. Then, one of the offspring generated by the subtree swapping crossover for \(T_1\) and \(T_2\) can be \((\times , +, x_1, x_2, \times , x_1, x_3)^\top \), as shown in Fig. 1.

Fig. 1
figure 1

An example of the swapping tree crossover in GP

Mutation in EA acts mostly as a provider of genetic diversity to work against premature convergence. We allow mutation to perform random changes to the tree structure, such as removing subtrees, replacing subtrees, or altering node types or node parameters.

In GP problems, an objective function quantitatively measures the difference between predictions and actual responses on a specific dataset. For this purpose, some common objective functions for supervised learning problems are the mean squared error, normalized mean squared error, coefficient of determination, or the Pearson correlation coefficient. In this paper, the Pearson correlation coefficient squaredFootnote 2 is used to measure a model’s prediction error.

2.2 Offspring selection with GP

Offspring selection GP  Affenzeller et al. (2005) conditionally accepts a new offspring into the population based on its fitness. The pseudocode of general offspring selection, which relates to the loop between line 7 and 14 in Algorithm 1, is shown in Algorithm 2.

figure b

In Algorithm 2, a new offspring child is generated through a normal mate selection and crossover/mutation operators. The mate selection operator can be any canonical mate selection operator, like tournament selectio or roulette wheel selection.

A comparison factor \(c\in [0,1]\) controls the threshold to determine whether an offspring child can survive or not, where the threshold is within a closed interval between the parents’ fitness values \(f(P_1)\) and \(f(P_2)\) at line 7Footnote 3.

An offspring child is added to offspring population \(Pop(t+1)\) if it meets the criterion. Otherwise, this offspring child will be added to POOL at line 11, which collects the “discarded” offspring by the offspring selection. The discarded population pool POOL is initialized as an empty set in the first generation and is cumulatively updated in the later generations.

Parameter r defines the target success ratio, i.e. the ratio of the new population that should be filled with individuals that are better than their parents (according to the offspring selection criterion, line 4 in Algorithm 2).

Selection pressure limit \(p_{sel}\) represents a ratio of the maximum number of generated offspring to a population size \(n_{pop}\). We call the parameter setting \(c=1\) and \(r=1\) strict OS because \(\forall i=\{1,2,\ldots , n_{pop} \}, f(child) < \min \big ( f(P_1), f(P_2) \big )\). More variants of the offspring selection and their impacts and influence on various problems were well explained and studied in  Affenzeller et al. (2009, 2014).

2.3 Age layer population structure based GP

Age-layered population structure (ALPS) was proposed by Hornby (2006) as an approach to promote diversity via niching and periodic reinitialization. ALPS achieves this goal by introducing a new property, age, for every individual. A set of simple age update rules define how individuals can “grow old”:

  • A new randomly initialized individual is assigned an age value of zero.

  • Offspring produced by crossover or mutation inherit the age of the oldest parent.

  • An individual’s age is updated by an increment of one upon fitness evaluation.

The working principle in ALPS is then to restrict competition for survival only between individuals within the same age layer, in order to prevent “young” individuals from being driven to extinction by older individuals (that have had time to evolve better fitness). This form of niching promotes diversity by making the competition for survival “more fair”.

A new parameter AgeGap is introduced to define the age layers of the population. An age layer i is expressed as a half-closed integer interval:

$$\begin{aligned} \begin{aligned} L_i&= [ \underline{\text {age}}^{(i)}, \overline{\text {age}}^{(i)} ),\,\text {where}\\ \underline{\text {age}}^{(i)}&= {\left\{ \begin{array}{ll} \overline{\text {age}}^{(i-1)} &{} \text { if } i > 0\\ 0 &{} \text { otherwise }\\ \end{array}\right. }\\ \overline{\text {age}}^{(i)}&= AgeLimit (i, AgeGap )\\ \end{aligned} \end{aligned}$$

The function AgeLimit is a user-defined function that defines the age intervals separating the layers and usually increases as the increment of layers by different strategies (e.g., linear, Fibonacci, polynomial or exponential). The layer hierarchy is defined from the bottom up, such that the youngest age layer (or “bottom” layer) will be at the bottom of the hierarchy while the oldest layer will be at the top. The entire population of ALPS is the union of the populations in all active layers .

Initially, there will be a single age layer. However, age update rules will quickly lead to a heterogeneous distribution of age values. Once an individual’s age exceeds the maximum age of a layer, the individual will be pushed to the next layer and a new individual will be inserted in its place (either transferred from the previous layer, it if exists, or randomly initialized). In addition, the bottom layer will be periodically reinitialized with new individuals, every AgeGap generations. This ensures that new genetic material periodically trickles into the process, making its way up the layer hierarchy and ensuring that diversity levels are maintained during the evolution.

As a result, ALPS is a unidirectional hierarchical structure that pushes individuals from a bottom layer to a top layer. This idea is different from the Hierarchical Fair Competition structure (HFC)  Hu and Goodman (2002) which uses fitness information to segregate layers. It can be argued that age seggregation in ALPS is more fair as it doesn’t consider fitness.

The Pseudocode of ALPS is shown in Algorithm 3. ALPS starts with the initialization of its parameters (line 1 - 3). The main loop of the ALPS framework (line 4 - 29) cycles through each active layer and creates a new layer until a termination criterion is satisfied at line 4. Some common stop criteria include the number of function evaluations, the maximum number of active layers, and etc. In the main loop, a child’s age is updated (line 10). If this child’s age does not exceed the age limit in this specific layer, this child will survive in the current layer. Otherwise, this child will be removed from the current layer and be used to initialize (lines 13–15) or update the population (17–20) in the next layer. The next will be active when the number of individuals in the next layer equals the population size (lines 24–25). Once the main loop terminates, the optimal solution \(T^*\) is selected from the union of all layers.

Any EA can be integrated with ALPS, for example, to produce ALPS with GA (Hornby 2006; Fleck 2015), ALPS with GP (Patel and Clack 2007; Mateiu 2019) and ALPS with NSGA-II (Lichtberger 2019) for multi-objective optimization problems. The age-based segregation into layers in ALPS is a form of niching, a technique commonly used for diversity preservation in EAs. Other popular niching techniques in GP are:

  • fitness-based, using explicit (Goldberg et al. 1987) and implicit (Smith et al. 1993) fitness sharing, clustering (Yin and Germay 1993) or clearing (Pétrowski 1996; Singh and Deb 2006)

  • replacement-based, using deterministic (Mahfoud 1996), probabilistic (Mengshoel and Goldberg 1999) or local tournament-based (Mengshoel and Goldberg 2008) crowding or restricted tournament selection (Harik 1995)

  • preservation-based, using species conservation (Li et al. 2002)

  • hybrid, which implement adaptive versions or ensembles of fitness-, replacement- or preservation-based methods (Leung and Liang 2003; Yu and Suganthan 2010).

figure c

The form of niching implemented by ALPS belongs to the category of preservation-based methods. Burks and Punch (2017) compared ALPS with other preservation-based methods (e.g., Age-Fitness Pareto optimization algorithm (Schmidt and Lipson 2011), subtree semantic geometric crossover (Nguyen et al. 2016), and genetic marker diversity algorithm for GP (Burks and Punch 2015)) and found that ALPS has the highest behavioral diversity in the entire population. However, a detailed analysis of the population diversity in each layer is not available in Burks and Punch (2017).

The algorithms presented in this section represent conceptually sound augmentations of canonical GP meant to improve its performance by promoting – and more efficiently exploiting – population diversity. However, the resulting changes and their respective impact on the algorithm’s convergent behavior cannot be easily compared (for instance by attempting to replicate identical benchmarking conditions, using a fixed evaluation budget or fixing as many common parameters as possible). The effects produced by changes in the internal mechanisms of selection or the topology of the population need to be considered within a methodological framework able to perform a detailed analysis both from a diversity and a genealogical perspective.

In the following, we introduce an analysis methodology capable of producing a detailed description of the evolutionary process, in terms of diversity (structural and semantic), inheritance, subtree frequencies and building blocks.

3 Analysis of evolutionary dynamics

3.1 Population diversity

In Sect. 1, we have argued the importance of understanding dynamical aspects of GP and the deep connection between algorithm convergence and population diversity. Many previous works used a single measure to describe diversity (e.g., structural/distance-based, behavioral, based on the number of unique individuals, genetic markers, etc.).

In order to have a more complete picture of the interplay between selection and recombination, in this work, we use two complementary measures (genotypic and phenotypic), corresponding to the two levels of interaction between genetic operators: selection acting on fitness at phenotype level, crossover and mutation acting on tree syntax at genotype level.

We employ computationally efficient methods for computing diversity. Our methods are normalized between [0, 1] and have a natural correspondence to the concept of similarity between two trees. This in turn also allows for a more natural graphical representation of diversity loss in the empirical section, using a quantity that increases over time. We use the notation \(Sim_g(\cdot )\) and \(Sim_p(\cdot )\) to denote genotypic and phenotypic similarities, respectively. These similarities can be equivalently formulated as distances by simply subtracting the respective quantities from 1 (e.g., taking \(1 - Sim_g(\cdot )\) and \(1 - Sim_p(\cdot )\) as genotypic and phenotypic distances, respectively).

Since the measures are symmetric, the similarity matrix associated with the population is triangular and only \(\frac{n_\text {pop}(n_\text {pop}-1)}{2}\) similarities must be computed. The average similarity \({\overline{Sim}}\) of is defined as the average of the similarity matrix. This definition applies to both genotypic similarity and phenotypic similarity.

3.1.1 Genotypic similarity

For genotypic diversity, we use a tree isomorphism algorithm which works in a bottom-up fashion similar to Valiente (2001). However, in our case, we resort to hashing (Burlacu et al. 2019), a more efficient algorithmic approach in which each tree node is assigned a unique hash value determined by its structure. Subtree isomorphism is then reduced to finding nodes with the same hash value.

We employ the Jaccard index to obtain a similarity value knowing the number of matching nodes between two trees.

$$\begin{aligned} Sim_{g}(T_i, T_j) = \frac{ |M (T_i, T_j) |}{|T_i| + |T_j| - |M (T_i, T_j) |} \end{aligned}$$
(3.1)

where \(|M (T_i, T_j) |\) is the size of the intersection of \(T_i\) and \(T_j\), M represents a mapping of isomorphic subtrees from \(T_i\) to \(T_j\), and \(|T_i|, |T_j|\) are the lengths of the two trees. Note that Jaccard similarity is a metric.

3.1.2 Phenotypic similarity

We compute phenotypic similarity between two tree models as the squared Pearson correlation coefficient between their responses on the same training data D. As the correlation is not defined when the standard deviation is zero for either one of the responses, we introduce a couple of extra rules to make sure that the similarity measure is well defined:

  • if both standard deviations are zero, then the similarity is one.

  • if only one of the standard deviations is zero, then the similarity is zero.

This is formalized below. Let \(\sigma _i\) and \(\sigma _j\) be the standard deviations of the two tree models \(T_i\) and \(T_j\):

$$\begin{aligned} Sim_{p}(T_i, T_j) = {\left\{ \begin{array}{ll} 1 \qquad \qquad \text {if } \sigma _i = \sigma _j = 0 \\ 0 \qquad \qquad \text {if } \sigma _i = 0, \sigma _j \ne 0 \text { or } \sigma _i \ne 0, \sigma _j = 0\\ r^2_{T_i, T_j} \qquad \text {otherwise} \end{array}\right. } \end{aligned}$$
(3.2)

As we are using the squared correlation coefficient, the value of \(Sim_{p}(T_i,T_j) \in [0,1]\).

3.2 Inheritance patterns and building block propagation

The propagation of subtrees from one generation to the next represents an important aspect of GP evolutionary search behavior. In this context, we consider building blocks to be highly fit subtrees that are inherited by offspring individuals and increase their frequency in the population.

The occurrence of building blocks can be investigated using a frequentist approach where the action of genetic operators (crossover, mutation, selection) is in a first phase recorded in the form of a genealogy graph such that mating events and their outcomes are represented as vertices and arcs (Burlacu et al. 2015a).

Mining this body of information in a second phase can lead to the discovery of inheritance patterns, common ancestries, or frequently sampled subtrees. We describe in the following a practical approach for constructing a genealogy graph containing this information, as well as analysis methods on the graph for the discovery of frequent schemas and building blocks.

3.2.1 Subtree tracing

In the context of subtree tracing, we use the term fragment \(fg(T_i)\) to refer to an inherited subtree in \(T_i\) individual, which is passed on to \(T_i\) from one of its parents. Fragments are saved in the genealogy graph G together with positional information identifying them in the parent and offspring trees. The positional coordinates are represented using indices into the node sequence given by a prefix traversal of the respective tree.

The storage scheme shown in Fig. 2 describes a situation where crossover and mutation are applied in succession. In this case, the offspring generated by crossover represents an intermediate individual that will not be present in the offspring population. In order to preserve continuity this intermediate individual is included as a vertex in the genealogy graph together with the respective fragments. Using the prefix indices stored in the genealogy graph, all fragments inherited via crossover can be identified. In the case of mutation, new genes can be introduced spontaneously. Thus, we consider the case when there is an overlap between the fragment and the traced subtree. In this case it is still useful to follow the origin of the subtree at the site of mutation in order to gain a deeper understanding of effects of the genetic operators.

The concept is illustrated in Fig. 3 where two parent individuals participate in crossover. The prefix indices of the swapped subtree within the non-root parent and the child individual are recorded in the genealogy graph.

A genealogy graph offers the ability to traverse an individual’s ancestry following the inherited fragments until the initial population is reached. Consequently, subtree tracing enables the possibility to identify for every subtree in the population the exact sequence of genetic operations that has lead to its creation.

The logic for tracing any subtree in the population relies on a set of simple arithmetic rules for processing prefix indices according to the relationship between the subtree being traced \({\mathcal {T}}\) and the fragment \(f_g(T)\), respective to the containing tree individual T. Table 1 illustrates these rules and provides the foundation for a recursive procedure that navigates the genealogy graph and constructs a so-called trace graph, containing in its vertices only those ancestor individuals of the current tree which have contributed a genetic fragment to its genotype.

Table 1 Prefix indices arithmetic rules for subtree inclusion. Here, fg(T) is a tree fragment, \({\mathcal {T}}\) is the traced subtree, the \({\mathcal {I}}(.)\) operator returns the prefix index and \({\mathcal {L}}(.)\) returns the length of the fragment or subtree
Fig. 2
figure 2

Genealogy graph vertices representing a crossover operation followed by mutation

Fig. 3
figure 3

Schematic of a crossover representation. Prefix indices are shown under each node

A trace graph can be seen as a compact representation of the inheritance information aggregated with respect to a given subtree. If we consider an individual’s genealogy graph \(G = (V,A)\), where V is the set of vertices (representing individuals) and A is the set of arcs (representing inheritance relationships), and an individual’s trace graph \(G_T = (V_T, A_T)\), the following are true:

  • G is a digraph and \(G_T\) is a multidigraph. The reason for this is the existence of inbreeding in the GP population. Then, we have multiple intersecting evolutionary pathways tracing down to common ancestors, where multiple arcs between a pair of vertices represent trace paths tracing different fragments.

  • \(V_T \subset V\). For the purpose of compactness, the trace graph only includes ancestors from which elements of the genotype of the traced individual originated, without considering the propagation of said elements due to selection.

  • A is a set of ordered vertex pairs from V, while \(A_T\) is a multiset of ordered vertex pairs from \(V_T\). This is explained by the fact that single ancestor individuals generate multiple descendants who become subsequently intermingled, leading to intercrossing inheritance patterns.

3.2.2 Frequency-based identification of building blocks

For reasons enumerated above, the decomposition of a traced subtreeFootnote 4 results in a record of genotypic operations (crossover, mutation) and associated set of genetic fragments transferred from ancestor to descendant individuals. We can then quantify the propagation of subtrees in the GP population by applying the tracing algorithm to the entire population and counting the number of times each subtree was sampled by a genetic operator.

Algorithmically, the counting is performed by the tracing procedure by incrementing a counter for all the nodes of each visited subtree. Since the incrementing scheme leads to child node counts that are always greater or equal than parent node counts, we define a node’s sample count as the difference between its own count and the count of its parent, as illustrated in Fig. 4.

Together with this node numbering scheme, the tracing algorithm can directly quantify the sampling frequencies of all the subtrees in the GP population via the calculated sample counts. This allows the discovery of building blocks as the most frequently sampled subtrees in the population.

Fig. 4
figure 4

Node counts (above each node) and sample counts (bold font, inside parentheses)

3.2.3 Contribution ratio

Previous work has already shown that the percentage of ancestor individuals having descendants in the last generation is low and that only a small fraction of ancestors contribute genetic material to the solutions in the last generation (Donatucci et al. 2014; McPhee et al. 2015, 2016). We quantify the fraction of contributing ancestors as the ratio between the size of the trace graph (in which vertices represent ancestors that have contributed a part of their genotype to their descendants in the final population) and the size of the genealogy graph. Additionally, the ratio of arcs to vertices in the trace graph indicates the number of times subtrees are traced to the same ancestor individual and can help identify the so-called progenitors in the genealogy. This represents a systematic approach of characterizing algorithm convergence, as low contribution ratios mean that most recombination events shuffled subtrees around in the population without making actual progress in terms of achieving adaptive change.

4 Empirical experiments

4.1 Test problems

We illustrate the dynamic analysis methods described in Sect. 3 on five symbolic regression benchmarks Poly-10 (Poli and McPhee 2003), GP-Challenge (Widera et al. 2010), Friedman-I (Friedman 1991), Friedman-II (Friedman 1991), and Aircraft Maximum Lift Coefficient problem (Chen et al. 2018).

The test problems have been selected based on the following considerations. Firstly, they are well-established in the field of GP, additionally they are included (or voted to be included) into a standard benchmark set for symbolic regression (White et al. 2013) (GP-Challenge, Friedman spline regression problems). Secondly, problems where the structure of the target function is known (Poly-10, Friedman-I, Friedman-II, Aircraft) allow us to validate the methodology, by checking if subtree tracing finds building blocks that correspond to terms of the target function. The Poly-10 problem in particular makes it easy to compare the identified subtrees with the linear terms of the formula. The problems were also chosen to be the right level of difficulty for the chosen algorithmic settings.

4.1.1 Aircraft maximum lift coefficient

In aircraft design, the maximum lift coefficient of a whole aircraft can be approximately modeled as:

$$\begin{aligned} f({\textbf{x}}) =&x_1 - 0.25 x_4 x_5 x_6 \big [4 + 0.1 \frac{x_2}{x_3} - \left( \frac{x_2}{x_3}\right) ^2\big ] + x_{13} \frac{x_{14}}{x_{15}} x_{18} x_{7} - \\&x_{13} \frac{x_{14}}{x_{15}} x_{8} + x_{13} \frac{x_{14}}{x_{15}} x_{9} + x_{16} \frac{x_{17}}{x_{15}} x_{18} x_{10} - \\&x_{16} \frac{x_{17}}{x_{15}} x_{11} + x_{16} \frac{x_{17}}{x_{15}} x_{12} \end{aligned}$$

where \(x_1\in [0.4,0.8], x_{4,13,16}\in [2,5], x_{2}\in [3,4], x_{3}\in [20,30], x_{8,11,14,17}\in [1,1.5], x_{15}\in [5,7], x_{18}\in [10,20], x_{9,12}\in [1,2], x_{7,10}\in [0.5,1.5], x_5\in [0,20], x_6=1\). The dataset consist of 200 samples, where the first 100 samples are used for training and the rest are use for the test.

4.1.2 Friedman-I and Friedman-II

Friedman-I contains 10 continuous predictor variables, where five variables are independent and the rest five variables are redundant. Friedman-I is defined as:

$$\begin{aligned} f({\textbf{x}})=0.1e^{4x_1} + \frac{4}{1+e^{-20x_2+10}} + 3x_3 + 2x_4 + x_5 \end{aligned}$$

Friedman-II contains 10 continuous predictor variables including three non-linear and interacting variables, two linear variables, and five irrelevant variables. Friedman-II is defined as:

$$\begin{aligned} f({\textbf{x}}) = 10\sin (\pi x_1 x_2) + 20( x_3 - 0.5)^2 + 10x_4 + 5x_5 \end{aligned}$$

Both Friedman-I and Friedman-II datasets consist of 10000 samples, where the first 5000 samples are used for training, and the rest samples are for the test.

4.1.3 GP-challenge

The GP-Challenge problem consists of data generated artificially from protein decoy energy terms, where the goal is to predict the protein energy function used for protein structure prediction (Widera et al. 2010). The data consists of 82965 data points, where the first 5000 data points are reserved for training and the rest of the data points are used as test data. The input feature set consists of the variables:

  • \(t_1,...,t_3\): three short-range potentials between \(C_\alpha \) atoms \(E_{13}\), \(E_{14}\) and \(E_{15}\).

  • \(t_4\): local stiffness potential \(E_\text {stiff}\)

  • \(t_5\): hydrogen bonds potential \(E_\text {HB}\)

  • \(t_6\): long-range pairwise potential between side chain centres of mass \(E_\text {pair}\)

  • \(t_7\): electrostatic interactions potential \(E_\text {electro}\)

  • \(t_8\): environment profile potential \(E_{env}\)

  • \(\text {RMSD}\): root mean square deviation between 3D coordinates of \(C_\alpha \) atoms

  • \(\text {rank}\): rank of each decoy based on increasing order of RMSD

4.1.4 Poly-10

This problem is a synthetic benchmark problem where the target is defined as a sum of non-linear terms:

$$\begin{aligned} f({\textbf{x}}) = x_1 x_2 + x_3 x_4 + x_5 x_6 + x_1 x_7 x_9 + x_3 x_6 x_{10} \end{aligned}$$

The data consists of 500 data points of which the first half are used for training and the rest are used as test data.

4.2 Parameter settings

The parameters for an individual are defined in Table 2. The parameter settings for GP, OSGP, and ALPS-GP are shown in Table 3 where \(p_m\) and \(p_c\) stand for mutation and crossover probabilities, respectively. The selection operator was parameterized for each algorithm according to suggested settings: tournament selection with group size 5 for GP, gender-specific selection (Affenzeller et al. 2005) for OSGP, and generalized rank selection (Tate and Smith 1995) for ALPS. The termination criterion for GP and ALPS-GP is 500 generations while for OSGP the termination criterion is a maximum selection pressure \(p_{sel} \) of 100. Each experiment consisted of 33 repetitions of the same configuration in order to account stochastic effects.

Table 2 Parameter settings for an individual T

In this paper, the algorithms have not been limited in runtime or evaluation budget since each algorithm converges differently and our goal is to analyze the different resulting evolutionary dynamics. These limitations make sense in benchmarking scenarios where the goal is to obtain the best model within certain computational constraints, but in our case imposing the same limits would have hindered our ability to investigate the different behaviors.

Table 3 Parameter settings for algorithms

4.3 Empirical results

We describe algorithm performance in terms of the normalized mean squared error (NMSE) of the best solution on the training and test datasets. Metrics based on the sum of squares (e.g. MSE, NMSE, RMSE) are typically used to assess model quality as they are more intuitive to understand (always non-negative, values close to zero are better). Note that the metric used here to describe results is not related to the fitness measure (Pearson’s \(R^2\)) used by the algorithms. Within GP, it is recommended to use either correlation or mean squared error with linear scaling for fitness (Keijzer 2003).

$$\begin{aligned} \text {NMSE}({\hat{y}}, y)) = \frac{\text {MSE}({\hat{y}}, y)}{\text {Var}(y)} \end{aligned}$$
(4.1)

where \({\hat{y}}\) is the model prediction, y is the actual regression target, and \(\text {Var}(y)\) is the variance of the target. The errors shown in Table 4 are expressed as median \(\hbox {NMSE}\pm \hbox {IQR}\) over the 33 repetitions, where IQR is the interquartile range.

Table 4 Modeling results expressed as median normalized mean squared error ± interquartile range
Table 5 Statistical significance p values according to the Kruskal test

Statistical analysis based on a Kruskal test (\(p<0.05\)) is shown in Table 5. On the complex problems with more than 10 relevant variables (e.g., the Aircraft and the Poly-10 problems), there exist small but significant differences between ALPS and GP. On Friedman-I and GP-Chalenge problems, significant differences exist between ALPS and OSGP.

4.4 Dynamic analysis

In this section, we demonstrate how similarly-performing algorithms in terms of solution quality achieve their results through very different internal dynamics of the search process. Our findings can help users determine whether specific evolutionary models are more suited for solving certain kinds of problems. For example, we assume that:

  • ALPS would be more suitable for dynamic environments characterized by dynamic changes in the operating regime of the analyzed system. In such cases, the constant influx of new genetic material due to the periodic reinitialization of layer 0 would enable the algorithm to react to these changes meaningfully.

  • OSGP would be more suitable for difficult problems with a high risk of premature convergence if certain alleles run extinct before they can be combined into meaningful building blocks. This is because the more focused exploitation of existing diversity by offspring selection minimizes the risk of losing important alleles.

As extensions to canonical version, both ALPS and OSGP are spected to perform equal or better than standard GP. This kind of analysis helps better understand the effects of certain parameterizations on run dynamics and can lead to algorithmic developments were aspects like selection pressure, comparison factor (OSGP), age gap (ALPS), etc. could be dynamically adjusted in order to optimally exploit the properties of the problem and the fitness landscape.

First, we illustrate the differences in the evolution of population diversity and solution quality. Then, we illustrate the tracing methodology and enumerate the subtrees ranked according to their sample counts calculated by the tracing procedure. Note that, in order to improve the flow of the paper, the referenced figures are found in the “Appendix”.

4.4.1 Quality and similarity

In this section, we analyze the dynamic analysis w.r.t. convergent similarities and qualities. Besides, the evolution of the joint distributions over genotype and phenotype similarities are also analyzed for the GP and OSGP.Footnote 5

4.4.2 GP and OSGP

Figure 5 represents the dynamic analysis of similarity and model quality on the Aircraft, Friedman-I, Friedman-II, GP-Challenge and Poly-10 problems. The plots illustrate the evolution of average phenotype and genotype similarities alongside with the best and average fitness (1 - NMSE). The first two columns correspond to GP while the last two columns correspond to OSGP. The values are aggregated over the 33 algorithm repetitions. Since OSGP runs are adaptively terminated according to a maximum selection pressure criterion (\(p_{sel} = 100\)), the number of generations can vary in each run. For this reason, similarity and quality values for OSGP are aggregated over the 33 repetitions up to the median generation number, leading to some instability as observed in the right hand side of the charts.

An important observation here is that although phenotype and genotype similarity values are represented on the same graph and their respective metrics are in the same range, these are not directly comparable since they measure diversity on different levels. On the phenotype level, since solution candidates are selected based on the “similarity” of their predictions to the target variable (according to the fitness measure), an increase in phenotype similarity is expected as a side effect of the optimization process.

The differences between GP and OSGP are explained by the additional offspring selection step in OSGP rejecting the outcome of recombination when the child is worse than its parents. In this way, deleterious changes to the parent genotypes are not accepted. This single rule leads to significant changes in the algorithm behavior.

In the case of GP, we note a pronounced loss of diversity in the first few generations, after which the value stabilizes as a result of structural homogeneity in the population balanced by a constant influx of new genetic material due to mutation. We note that after the initial diversity loss accompanied by a dramatic increase in fitness, the quality curves remain mostly flat. This supports the hypothesis that genetic improvement can only occur when enough diversity is present in the population. When this diversity is used up, the search begins to stagnate.

In the case of OSGP, the more focused selection strategy leads to a more gradual genotype similarity increase between offspring and their parents, noticeable on problems from the first and the third columns in Fig. 5, and a smoother transition between the exploratory and the exploitative phases of the search. This suggests that OSGP is better able to utilize the diversity in the population and as a consequence, does not rely on mutation that much. Strict offspring selection also causes phenotype similarity to increase towards 1, as only offspring with an improved fitness (compared to their parents) are accepted.

The almost identical shape of the genotype similarity curves for each problem suggests that structural diversity depends rather more strongly on selection pressure than on the characteristics of the fitness landscape. On the other hand, phenotype similarity is clearly influenced by the problem characteristics, as observed for both GP and OSGP. The evolution of average similarity curves already highlights some important differences between the tested algorithms. However, averages tend to obscure the real distribution of similarity values in the population.

Fig. 5
figure 5

Similarities and qualities of the GP and the OSGP on the five test problems

The joint distributions of genotype and phenotype similarity values on the five tested problems are shown for GP and OSGP in Figs. 6, 7, 8, 9 and 10. The snapshots shown in these figures represent joint probability plots aggregated from similarity points from all 33 repetitions. For canonical GP, in each figure, from left to right, we showed 4 snapshots of the joint distributions of the populations with \(2\%\), \(36\%\), \(70\%\) and \(100\%\) function evaluations. For OSGP, since run lengths are not uniform, the number of evaluations is extrapolated from the selection pressure per generation, as we consider selection pressure a better indicator of convergence status. The hexbin color indicates point density and the x and y coordinates are calculated as an individual’s average similarity with the rest of the population, by taking the row-wise averages of the genotype and phenotype similarity matrices, respectively.

The shape of the univariate distributions (or the marginal distributions of the multi-variate distribution) of the two similarities resembles a normal curve as a result of the central limit theorem. The observed distributions correlate with our interpretation of how GP and OSGP make use and exploit genetic diversity in the population. Requiring adaptive change via offspring selection causes an abrupt increase in phenotype similarity. In other words, selecting strictly for adaptive change lowers the diversity of the overall behaviors in OSGP. This brings in turn more variance in tree structures, as indicated by the spread-out shape of the genotype similarity marginal distributions. For GP, the distribution of phenotype similarities is left-skewed, with a long tail in the area of high behavioral diversity, meaning that although most individuals exhibit similar behavior, some individuals are diverse, likely due to lower fitness. The distribution of genotype similarities is normal-shaped and similar across problems.

From these charts, we can conclude that an evolutionary model focused on adaptive change (OSGP) leads to a more intense focus on building blocks, as the building block hypothesis implicitly relies on parents’ ability to produce children of comparable or even better fitness. This focus in turn leads to decreased phenotype diversity as behaviors are less likely to deviate from the parents and more likely to be similar between descendants.

4.4.3 ALPS

The dynamic analysis of the ALPS on the problems are shown from Figs. 11, 12, 13, 14 and 15, respectively. A periodic pattern for both the genotype and the phenotype similarities occurs in the layer \(L_0\) because the bottom layer is re-initialized every 20 generations to ensure a constant influx of genetic material in the population. The population re-initialization at the bottom layer affects other layers by introducing a periodic pattern. However, this periodic effect diminishes gradually in the upper layers, because a larger age gap exists in the higher layers which allows the layer to converge to a “stable” population. Therefore, the top layer (\(L_5\) in this case) is the most stable layer in terms of the genotype and the phenotype changes. Furthermore, compared with similarities of the GP and the OSGP in Fig. 5, the top layer \(L_5\) still possesses the most diverse population in the steady stage. The periodic pattern of the similarities also occurs in quality analysis in terms of the best quality and the average quality in a specific layer population. The best quality of every layer diminishes because of unidirectional movement of an individual from a lower layer to a upper layer. The decrease of the best quality in the top layer \(L_5\) does not occur as there is no individual movement in the top layer.

4.4.4 Subtree frequencies and building block identification

We employ the methodology described in Sect. 3.2 to calculate subtree frequencies for the GP, OSGP and ALPS algorithms. In the case of ALPS, a genealogy graph is generated for each layer. Due to the unidirectional movement of individuals as they age, the layer genealogy graphs overlap to a significant degree.

For each algorithm we select the best solution out of the 33 repetitions based on the normalized mean squared error. Then the tracing methodology is applied to the population, according to the following steps:

  1. 1.

    The genealogy graph is constructed online during the evolution.

  2. 2.

    A trace graph is calculated for the entire final population in a post-analysis step. During tracing, subtree sample counts are updated as described in Sect. 3.2.2.

  3. 3.

    A post-processing step hashes each subtree ignoring coefficients for terminal nodes in order to focus on structure. This decision was made since the leaf coefficients are real-valued and including them in the hashing would have generated too many variations. The sample counts of isomorphic subtrees are then added together in order to obtain the final ranking.

The resulting genealogy graphs are illustrated in Table 6. The number of vertices |V| is determined by the population size, number of generations and mutation probability. For GP and OSGP, we have:

$$\begin{aligned} |V| = n_{pop} + \left( n_{pop} \cdot (p_c + p_m) \right) \cdot n_{gen} \end{aligned}$$
(4.2)

In the case of ALPS, due to the layered population structure, the number of vertices |V| given by 4.2 is increased by the term \(n_{pop} \cdot \textit{AgeGap}\). Additionally, vertices are not removed from the graph when corresponding individuals are moved out of the layer, because the information is necessary for computing an accurate history. Therefore, the upper layers grow larger in size, since they will additionally include the vertices corresponding to the individuals that are being moved from the previous layer after aging. For this reason, the graph size reported for ALPS in Table 6 are those corresponding to the final layer (Layer 5).

Table 6 Genealogy graph sizes for the tested configurations. The first line represents the number of vertices and the second line represents the number of arcs
Table 7 Trace graph sizes for the tested configurations. The first line represents the number of vertices and the second line represents the number of arcs

The differences between the theoretical limits and the values reported in Table 6 are due to stochastic effects. Table 8 shows the contribution ratios, defined in Sect. 3.2.3, of each algorithm on the five test problems. The value is calculated as the ratio between a trace graph’s size and a genealogy graph’s size. In the case of ALPS, the final population solutions are traced in each layer, and the results are aggregated together to obtain a total trace graph size which is then divided by the size of the genealogy graph in the last layer (since this graph already includes vertices for all the individuals migrated from the previous layers). The aggregated contribution ratio value is then comparable to the contribution ratios of the GP and OSGP algorithms.

Table 8 Solution contribution ratios for the tested configurations

According to Table 8, OSGP features the largest amount of contributing ancestors, followed by GP and then ALPS. This is consistent with the assumption that strict offspring selection determines a smoother and more gradual convergence, due to the more granular and effective utilization of the existing diversity. This can be also observed in the OSGP quality charts from Fig. 5 where the increase is more gradual compared to canonical GP.

Standard GP features a lower contribution ratio since the standard evolutionary model makes it easier for highly-fit “super individuals” to quickly dominate the population in subsequent generations, becoming the ancestors of all the descendants recorded in the genealogy graph. A lower contribution ratio corresponds to a smaller region of the genotype space sampled by the genetic operators during the algorithm’s attempt to produce adaptive change. This region is continuously reduced by loss of diversity (every generation), up to the point where fitness is locally maximized and the available pool of genetic material is limited to a reduced common set of inherited genes, originating from a small number of “super ancestors”.

ALPS features contribution ratios similar with GP, although in this case we have to take into account the fact that many new individuals are introduced in the genealogy graph by the reseeding of the bottom layer. This suggests that ALPS layers and the standard GP population evolve in a similar way, with the added benefit of a constant influx of diversity for ALPS that allows it to avoid premature convergence. Therefore, instead of evolving a single large population, it seems more beneficial to evolve a number of smaller populations structured into age groups.

Overall, our results confirm previous observations (Kuber et al. 2014; Donatucci et al. 2014; McPhee et al. 2016, 2017) that only a small fraction of ancestor individuals actually contribute genetic material to the best solution. GP algorithms can potentially be made more robust by enforcing a more granular and less wasteful (in terms of diversity) exploitation of the alleles present in the randomly initialized starting population. Niching techniques, based for instance on age or inheritance information could be used to ensure that diversity is used more efficiently during the search.

4.4.5 Building blocks

Empirical measurements show that the most frequently sampled subtrees originate from ancestor individuals identified by the tracing methodology. Compared with the mathematical expression of the regression target, the identified subtrees include all the relevant input variables. However, since selection in GP does not consider expression complexity, the algorithms have a tendency towards more complex expressions, semantically equivalent to the corresponding formula terms.

Tables 10, 11, 12, 13, 14 and 15 in the “Appendix” show the subtrees identified by each algorithm on the five test problems in Sect. 4.1. The top 10 identified subtrees and their corresponding sample counts are represented in these figures. Table 9 is a summary of Tables 10, 11, 12, 13, 14 and 15 and shows the identified subtrees that correspond to the test problem formulas, except for the GP-Challenge problems, as there is no explicit formula for this problem. A subtree T is in the bold format if T exactly matches one term or a sub-term of one term in the problem formula.

The difference in the actual frequencies reported for each algorithm are determined, on the one hand, by differences in each algorithm’s parameters (e.g., population size, maximum number of generations, selection scheme) and, on the other hand, by differences in their inherent search behavior. For this reason, subtree sample counts cannot be directly compared between algorithms, and instead we compare the relative ranking. We do observe, however, an inverse correlation between an identified subtree’s size and its sampling frequency, as smaller subtrees with a simpler structure tend to be the most sampled.

For clarity as well as space reasons, we focus our discussion on the Poly-10 problem, since its target formula is more explicit and relatively simpler than the other test problems (only involving addition and multiplication operations). The main difficulty in this synthetic benchmark problem consists in finding the appropriate three-variable products: \(x_1 x_7 x_9\) and \(x_3 x_6 x_{10}\). Since three of the variables (\(x_7\), \(x_9\), \(x_{10}\)) do not occur at all in the other three terms of the formula (\(x_1 x_2\), \(x_3 x_4\), \(x_5 x_6\)), there is a non-trivial probability for them to go extinct from the population before the algorithm can assemble them into meaningful building blocks, leading to premature convergence.

Both GP and OSGP are able to discover all three two-variable terms of the formula but find only of the two three-variable terms. The term \(x_3 x_6 x_{10}\) is not found therefore the two algorithms do not completely solve the problem. While GP samples shorter and more simple terms, OSGP tends to sample larger, more complex terms that can simplify to sums of two variable products. The larger and more complex building blocks assembled by OSGP are determined by offspring selection, which ensures that already existing good building blocks in the structure of the root parent are not disrupted by crossover or mutation. This leads us to conclude that GP exihibits more granular inherited patterns (smaller subtrees) while in OSGP these patterns become larger due to the more constructive approach leading to the reuse of fit alleles from parent individuals.

In the case of ALPS, the search becomes more focused on the terms of the formula as the population ages and individuals migrate to the upper layers. In the bottom layer, since the population is only partially converged due to the periodic reseeding, the most sampled subtrees are simple structures containing relevant variables not necessarily assembled in an optimal relationship. However, moving to the upper layers we notice that the identified subtrees gradually begin to approximate more exactly the terms of the target expression, growing in complexity until the last layer. In this case, the layered evolution is able to correctly assemble the building blocks of the problem, finding the term \(x_3 x_6 x_{10}\) that the other two algorithms failed to find. In terms of the complexity of the sampled subtrees, ALPS behaves more like standard GP in the lower layers and more like OSGP in the upper layers.

In summary, the contribution ratio calculated on the basis of the genotypes in the population and their hereditary relationships gives a good indication of the amount of genetic diversity lost through the action of selection during the evolutionary search. At the same time, it acts as a measure of search efficiency or yield as the algorithm evolves towards the best solution, since it directly quantifies how many of the total genotype oprations that acted on ancestors of the individuals in the final population had a direct contribution to their structure.

A high contribution ratio implies frequent adaptive changes across an individual’s ancestry, corresponding to incremental improvement. Since the changes are adaptive (they improve fitness), they are maintained by selection and present in the trace graph. On the other hand, a low contribution ratio corresponds to neutral or deleterious changes which are not fixated by selection and may get rejected or simply drift away.

Empirical observation shows that once diversity goes below a certain threshold, most of these operations fail to produce any meaningful new improvement, since the exchanged genetic material (e.g. by crossover) is already predominant in the population. From this perspective, we can rank the algorithms in terms of search efficiency – purely from the perspective of genetic convergence – as ALPS being the most efficient, followed by OSGP and lastly by GP. In practice, improving yield often comes at the cost of increased energy expenditure, analogous in this case with the spent evaluation budget.

The observed structures and frequencies of the identified building blocks represent additional evidence in support of the idea that a more efficient search is more easily able to assemble increasingly complex building blocks, as is the case for OSGP and ALPS.

5 Conclusions and future work

This paper analyzed population diversity and genealogy aspects for canonical GP, offspring selection GP (OSGP) and age-layered population GP (ALPS) in terms of population diversities, building block occurrences, and subtree inheritance patterns from the initialized population to the final population. The experimental benchmark set consisted of five different test problems, including three artificial benchmarks and two real-world problems.

The experimental results reveal the different evolutionary search characteristics of the three GPs. More specifically, canonical GP extracts diversity from the population comparatively fast and maintains the search mainly via mutation. Therefore, canonical GP flavors shorter subtrees. On the other hand, OSGP ultimately leads to a less diverse population because of the strict criterion in the offspring selection, but the rate of diversity loss is reduced as the convergence is smoother. The OSGP relies comparatively less on the mutation to maintain and assemble relevant building block information. Therefore, OSGP tends to produce longer subtrees, and the contribution ratios of the OSGP are the highest among all the three GPs. ALPS exhibits a unique character due to the continuous transfer of every new genetic diversity across the younger age layers. Due to the periodic population initialization, a periodic pattern of diversity evolution occurs in the bottom layer. This periodic pattern is less pronounced as the layer number is increased. Regarding identified subtrees, its behavior is similar to that of the canonical GP in the lower layers (e.g., layer 0 and layer 1). In higher layers, ALPS trends to longer subtrees because of the cumulative effect of fitness comparison when an individual moves to the next layer in the previous layers.

The proposed similarity metrics can be efficiently implemented with minimal overhead such that the corresponding quantities can be computed during the evolutionary run. The combination of genotype and phenotype measures helps paint a more complete picture of diversity loss and reveal the different characteristics of the evolutionary search between different GP flavors. The tracing algorithm based on information provided by a genealogy graph represents a useful instrument for identifying relevant solution building blocks and characterizing the efficiency of the evolutionary search. A straightforward way to build the genealogy graph during evolution is by instrumentation of the recombination operators such that each new offspring is inserted into the graph together with relevant inheritance information. The analysis proposed in this paper reveals different inheritance patterns between the tested algorithmic variants as well as between problems solved by the same algorithm, supporting the idea that search behavior is highly dependent not only on algorithm parameterization but also on problem characteristics.

For future works, it will be interesting to investigate strategies for dynamically adjusting hyperparameters based on observed dynamical aspects, for example the distribution of similarity values and building block frequencies during the search process. In addition, it is also worth to try to improve the effectiveness of the GPs by reducing the space of functional operators \({\mathcal {F}}\). Space \({\mathcal {F}}\) can be reduced by identifying/counting the useful functional operators in the most sampled subtrees and removing the “irrelevant” operators that never occur in the identified subtrees.