A similarity measure for Straight Line Programs and its application to control diversity in Genetic Programming

https://doi.org/10.1016/j.eswa.2021.116415

Highlights

  • A metric is developed to measure the differences between two Straight Line Programs.

  • The proposed metric is used in a CHC algorithm to improve diversity.

  • The method is evaluated using data from public buildings of the University of Granada.

Abstract

Finding a balance between diversity and convergence plays an important role in evolutionary algorithms, both to avoid premature convergence and to explore the search space more thoroughly. In the case of Genetic Programming, and more specifically for symbolic regression problems, different mechanisms have been devised to control diversity, ranging from novel crossover and/or mutation procedures to the design of distance measures that help genetic operators to increase diversity in the population. In this paper, we start from previous works where Straight Line Programs are used as an alternative representation to expression trees for symbolic regression, and develop a similarity measure based on edit distance in order to determine how different the Straight Line Programs in the population are. This measure is used in combination with the CHC algorithm strategy to control diversity in the population, and therefore to avoid local optima when solving symbolic regression problems. The proposal is first validated in a controlled scenario of benchmark datasets, where it is compared with previous approaches to promote diversity in Genetic Programming. After that, the approach is also evaluated on a real-world dataset of energy consumption data from a set of buildings of the University of Granada.

Introduction

In evolutionary computation, diversity and convergence play an important role in the exploration of the search space. As many authors have argued (Badran and Rockett, 2007, Burke et al., 2004, Chen et al., 2009), finding a balance between diversity and convergence is critical in genetic algorithms, since premature convergence causes the evolution to end in local optima, while uncontrolled divergence may reduce exploitation of the search space. Two main mechanisms that guide the exploration of the search space in genetic algorithms have been identified in the literature (Smit & Eiben, 2010): variation, which promotes diversity, and selection, which reinforces convergence. A suitable combination of these mechanisms helps to explore the search space without falling into local optima (Črepinšek et al., 2013). Indeed, many approaches have emerged to tackle the problem of premature convergence in genetic algorithms, either by proposing new evolutionary algorithms or by improving genetic operators. For example, Mc Ginley et al. (2011) implemented ACROMUSE, a genetic algorithm that adapts crossover, mutation and parameter selection to preserve diversity in the population. Lozano et al. (2008) proposed replacement strategies that consider both the fitness quality and the diversity of an individual in the population, in order to maintain individuals with high fitness and diversity for the next generation. Aslam et al. (2018) presented a selection operator that determines whether two individuals can be recombined considering their distance. Other authors established criteria to select the individuals that will be combined, e.g. techniques based on neighborhoods such as niching methods (Martín et al., 2016) or approaches that consider behavior similarities by using fitness sharing (Ekárt and Németh, 2000, Ekárt and Németh, 2002). A recent article showed how multi-objective optimization can be used to promote diversity in the population, considering both fitness and a diversity measure as objectives to be optimized (Segura et al., 2017).

In addition to the aforementioned problems of diversity and convergence, Genetic Programming (GP) has to deal with additional issues regarding the solution encoding (Koza, 1992). As the encodings used in GP have a non-linear structure, such as trees, diversity is harder to control (Burke et al., 2004). The problem of uncontrolled tree growth, known as the bloating problem in GP, leads to premature convergence (dal Piccol Sotto & de Melo, 2016). Preventing this problem is an implicit goal for researchers in GP, and different authors have proposed modifying genetic operators, fitness evaluation or selection schemes (Alfaro-Cid et al., 2008, de Jong et al., 2001, Liu et al., 2007) to solve the bloating problem while maintaining diversity in the population. Diversity measures may be classified into three main categories: (a) behavioral or phenotypic diversity, which considers differences in solution performance (fitness value) (Hildebrandt and Branke, 2015, Kalkreuth et al., 2015, Li et al., 2016), (b) syntactic or genotypic diversity, which computes structural differences between individuals (shape and content of solutions in the population) (Ferdjoukh et al., 2017, Qu et al., 2015) and (c) a combination of both previous approaches (Affenzeller et al., 2017, Kelly et al., 2019).

In this piece of research, we focus on improving diversity in formal grammar evolution by studying a combination of both phenotypic and genotypic approaches. As solutions in GP are traditionally encoded using tree data structures, genotypic diversity measures focus on this type of representation (Burks and Punch, 2017, Burlacu et al., 2019, Ekárt and Németh, 2000, Ekárt and Németh, 2002, Kulunchakov and Strijov, 2017, Pawlik and Augsten, 2016). We may classify the cited methods as distance measures or metrics: whereas a metric satisfies the properties of non-negativity, identity, symmetry, and triangle inequality, the remaining distance measures fail to satisfy one or more of these properties (usually the triangle inequality). They can nevertheless provide a value that estimates how distant two encoded solutions are, and they have provided good results in the problems to which they have been applied. Examples of (non-metric) distance measures are described in Burks and Punch (2017), who implement a density measure that considers a portion of each tree and determines how genetic material is distributed in the population, and in Burlacu et al. (2019), who use isomorphic properties to measure structural diversity between two trees as the number of common nodes. Regarding metric proposals, Pawlik and Augsten (2016) developed a metric that determines the differences between two individuals as the minimum-cost sequence of operations needed to transform one tree into another. In addition, Ekárt and Németh (2000, 2002) described a metric that computes the structural difference of two encoded programs, distinguishing terminal and operator nodes.

Regarding phenotypic or behavioral diversity in genetic programming, the literature offers a wide variety of works that obtain semantic information from individuals during the evolutionary process and use it to improve the search space exploration in GP. These works range from classical methods, such as the traditional Ramped Half and Half method to prevent the insertion of duplicate trees into the population (Koza, 1992), to more recent works such as Castelli et al. (2015), who proposed the Geometric Semantic Genetic Programming (GSGP) algorithm (Moraglio et al., 2012), which designs an operator that measures semantic differences between two individuals to guide the search space exploration, or Uy et al. (2010), who developed a Semantic Similarity Crossover (SSC) that adds semantic knowledge to control changes in the semantics of individuals by comparing similarities of random subtrees. In summary, the main works proposed to directly or indirectly control diversity in Genetic Programming range from the structures used to represent the population to genetic operators and measures to control the population growth (Ursem, 2002).

In this piece of research, we focus on improving diversity in GP in two ways: (i) studying alternative structures to classical trees and (ii) developing measures to control diversity during the genetic procedure for these alternative structures. In previous works, we studied an alternative representation scheme to tree encoding, using Straight Line Programs (SLP) (Rueda et al., 2019), and we concluded that using this representation may help to overcome limitations of classic tree encoding and to escape local optima. In this article, our main objective is to develop a metric based on edit distance that allows us to quantify how different two SLPs are, and to use this metric to measure diversity in a population of SLPs, in order to find a balance between diversity and convergence that improves the exploration of the search space. More specifically, we combine the developed metric with the CHC algorithm (Eshelman, 1991) to achieve a balance between exploration and exploitation of the search space. Thus, the main novelty presented in this manuscript is the design of the similarity measure for Straight Line Programs, the proof that this measure is a metric, and its application in combination with a well-tested evolutionary scheme such as CHC to demonstrate its practical value. We remark that the classic edit distance is applied over sequences, whereas the similarity measure proposed in this work is adapted to grammars as formal languages. As no previous works have proposed a way to quantify the distance between Straight Line Programs, we test our approach against tree-based encodings as baseline methods.
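
For readers unfamiliar with the representation, the following is a minimal sketch of how a Straight Line Program can be stored and evaluated. It assumes the usual formulation in which each instruction applies one operator to arguments that are input variables, constants, or results of earlier instructions, and in which the program output is the value of the last instruction. The names `Instruction`, `OPS` and `evaluate_slp`, as well as the restriction to three binary operators, are illustrative assumptions and are not taken from the article.

```python
import operator
from typing import Callable, Dict, List, Tuple, Union

# An argument is a variable name, a numeric constant, or ("u", k),
# which refers to the result of the k-th earlier instruction (0-based).
Arg = Union[str, float, Tuple[str, int]]
Instruction = Tuple[str, Arg, Arg]

# Hypothetical operator set, chosen only for this illustration.
OPS: Dict[str, Callable[[float, float], float]] = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
}


def evaluate_slp(program: List[Instruction], inputs: Dict[str, float]) -> float:
    """Evaluate the instructions in order and return the last result."""
    results: List[float] = []

    def value(arg: Arg) -> float:
        if isinstance(arg, tuple):   # reference to an earlier instruction
            return results[arg[1]]
        if isinstance(arg, str):     # input variable
            return inputs[arg]
        return float(arg)            # numeric constant

    for op, left, right in program:
        results.append(OPS[op](value(left), value(right)))
    return results[-1]


program: List[Instruction] = [
    ("*", "x", "x"),             # u0 = x * x
    ("+", ("u", 0), "x"),        # u1 = u0 + x
    ("*", ("u", 1), ("u", 0)),   # u2 = u1 * u0, reusing u0 without recomputing it
]
print(evaluate_slp(program, {"x": 2.0}))  # → 24.0
```

The third instruction reuses `u0` directly, which is the kind of sharing of sub-expressions that a tree encoding would have to duplicate.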

The remainder of the manuscript is structured as follows: Section 2 describes the background of our research, introducing the fundamentals of Symbolic Regression, the representation problem and an outline of the classic CHC algorithm. Section 3 works out the proposed similarity measure. Section 4 applies the proposed metric in combination with the CHC algorithm to control diversity and convergence in genetic programming. Section 5 shows the experimental results on synthetic data and real energy consumption data and discusses the comparative study of the proposal with state-of-the-art algorithms. Finally, Section 6 summarizes the conclusions obtained and describes future work.

Section snippets

Symbolic regression and the representation problem

Regression analysis (Harrell, 2015) is a statistical method for finding the relationships between dependent and independent variables. More specifically, regression analysis is composed of a model hypothesis f(x̄,w̄)+ε, a set of input data x̄={x1,x2,…,xn}, a set of output data ȳ={y1,y2,…,ym}, a set of constant parameters w̄={w1,w2,…,wk}, and an error ε that represents the part of the data that the model f(x̄,w̄) is unable to explain. The main goal of regression analysis is to approximate

Similarity measure for Straight Line Programs

Our goal is to define a similarity measure that provides the structural difference between SLPs and can be computed efficiently. We are inspired by the edit distance metric (Ristad & Yianilos, 1998). Hence, the proposed similarity measure provides the minimum number of operations required to transform one SLP into another. As in edit distance, the available operations to compute such transformation are insertions, deletions and substitutions. Then, the more similar two SLPs are, the lower the
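
As a point of reference, the classic edit distance over sequences that inspires the measure can be sketched as below. This is the standard unit-cost dynamic programming formulation for plain sequences, not the SLP measure developed in the article, whose definition is not reproduced in this snippet. The function name `edit_distance` is an illustrative choice.

```python
from typing import Hashable, Sequence


def edit_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] is the distance between the items of a processed so far
    # (initially none) and the first j items of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning the first i items of a into an empty sequence: i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute x by y (free when equal)
            ))
        prev = curr
    return prev[-1]


# Two token sequences that differ in the operator and in the constant.
print(edit_distance(["+", "x", "1"], ["*", "x", "2"]))  # → 2
```

With unit costs this distance is zero only for identical sequences and is symmetric, which matches the behaviour described above: the more similar the two inputs, the lower the value.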

Control of diversity of Straight Line Programs evolution with CHC

In this section we describe an application of the proposed metric to control diversity in Genetic Programming using SLPs as the representation for symbolic regression problems. More specifically, we use the proposed distance as a diversity measure in an adapted CHC evolutionary algorithm, where it acts as an incest prevention mechanism. We selected the CHC algorithm since it is a classic approach that balances diversity and convergence and it has been widely tested in the literature. We name our
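
The idea of incest prevention can be illustrated with the following minimal sketch, modeled on the classic CHC scheme: individuals are paired at random and a pair is allowed to recombine only when its distance exceeds a threshold, which is lowered when a generation yields no accepted offspring. The adapted algorithm of the article is not shown in this snippet, so the function names, the strict `>` comparison and the fixed decrement `step` are assumptions made for illustration only.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def incest_prevention_pairs(
    population: Sequence[T],
    distance: Callable[[T, T], float],
    threshold: float,
    rng: random.Random,
) -> List[Tuple[T, T]]:
    """Pair individuals at random and keep only pairs farther apart than threshold."""
    shuffled = list(population)
    rng.shuffle(shuffled)
    pairs = []
    # Walk the shuffled list two at a time; an odd leftover individual is skipped.
    for first, second in zip(shuffled[0::2], shuffled[1::2]):
        if distance(first, second) > threshold:
            pairs.append((first, second))
    return pairs


def update_threshold(threshold: float, offspring_accepted: int, step: float = 1.0) -> float:
    """Lower the threshold by step when no offspring entered the population."""
    return threshold - step if offspring_accepted == 0 else threshold
```

Any distance function can be plugged in as `distance`; in the setting described above it would be the proposed SLP metric. Lowering the threshold over time lets increasingly similar individuals mate as the population converges.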

Experimentation

The main goal of this experimentation is to test if the proposed metric together with the CHC algorithm can improve the exploration and exploitation of the Symbolic Regression solution space in Genetic Programming. As no previous works have been devised to measure similarities between SLPs, we compare the approach with classic metrics for tree representation (Ekárt and Németh, 2000, Ekárt and Németh, 2002) and Genetic Programming evolution of SLPs with no diversity control. In particular, we

Conclusions

This manuscript has addressed the problem of maintaining a balance between diversity and convergence in genetic programming, and more specifically in symbolic regression. The outcomes of this piece of research encompass both theory and practice. Regarding the theoretical dimension, we have developed a metric that can be used to calculate the structural distance between symbolic regression expressions represented as Straight Line Programs (SLP). This metric works similarly to the edit distance, and it is

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the projects PID2020-112495RB-C21 and B-TIC-42-UGR20.

References (51)

  • Waterman, M. et al. Some biological sequence metrics. Advances in Mathematics (1976)

  • Affenzeller, M. et al. Dynamic observation of genotypic and phenotypic diversity for different symbolic regression GP variants

  • Alfaro-Cid, E. et al. Prune and plant: A new bloat control method for genetic programming

  • Alonso, C.L. et al. Straight line programs: A new linear genetic programming approach

  • Angeline, P.J. Genetic programming and emergent intelligence

  • Badran, K. et al. The roles of diversity preservation and mutation in preventing population collapse in multiobjective genetic programming

  • Billard, L. et al. Symbolic regression analysis

  • Brameier, M. et al. Evolving teams of predictors with linear genetic programming. Genetic Programming and Evolvable Machines (2001)

  • Burke, E.K. et al. Diversity in genetic programming: an analysis of measures and correlation with fitness. IEEE Transactions on Evolutionary Computation (2004)

  • Burks, A.R. et al. An analysis of the genetic marker diversity algorithm for genetic programming. Genetic Programming and Evolvable Machines (2017)

  • Burlacu, B. et al. Parsimony measures in multi-objective genetic programming for symbolic regression

  • Chen, G. et al. Preserving and exploiting genetic diversity in evolutionary programming algorithms. IEEE Transactions on Evolutionary Computation (2009)

  • Claude, F. et al. Indexing straight-line programs (2008)

  • Črepinšek, M. et al. Exploration and exploitation in evolutionary algorithms: A survey. ACM Computing Surveys (2013)

  • dal Piccol Sotto, L.F. et al. Studying bloat control and maintenance of effective code in linear genetic programming for symbolic regression. Neurocomputing (2016)