Semantic approximation for reducing code bloat in Genetic Programming

https://doi.org/10.1016/j.swevo.2020.100729

Abstract

Code bloat is a phenomenon in Genetic Programming (GP) characterized by growth in individual size during the evolutionary process without a corresponding improvement in fitness. Bloat negatively affects GP performance, since large individuals are more time-consuming to evaluate and harder to interpret. In this paper, we propose two approaches for reducing GP code bloat based on a semantic approximation technique. The first approach replaces a random subtree in an individual with a smaller tree of approximate semantics. The second approach replaces a random subtree with a smaller tree that approximates the desired semantics of that subtree. We evaluated the proposed methods on a large number of regression problems. The experimental results show that our methods significantly reduce code bloat and improve the performance of GP compared to standard GP and several recent bloat control methods. Furthermore, the performance of the proposed approaches is competitive with the best of the four tested machine learning algorithms.

Introduction

Genetic Programming (GP) is an evolutionary method that searches for solutions to a problem in the form of computer programs [20]. A GP system starts by initializing a population of individuals. The population is then evolved for a number of generations using genetic operators such as crossover and mutation. At each generation, the individuals are evaluated using a fitness function, and a selection scheme is used to choose better individuals to create the next population. The evolutionary process continues until a desired solution is found or the maximum number of generations is reached.
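The loop described above can be sketched in a few lines of Python. This is a generic illustration, not the paper's implementation: the representation, fitness function, and variation operators below are placeholder assumptions (a toy numeric "individual" stands in for a program tree).

```python
import random

def evolve(init_pop, fitness, crossover, mutate, max_gens):
    """Generic evolutionary loop: evaluate, select, vary, repeat."""
    population = list(init_pop)
    for gen in range(max_gens):
        # evaluate every individual; smaller fitness (error) is better
        scored = [(fitness(ind), ind) for ind in population]
        scored.sort(key=lambda pair: pair[0])

        # tournament selection: the better of two random individuals
        def select():
            a, b = random.sample(scored, 2)
            return a[1] if a[0] < b[0] else b[1]

        # create the next population with crossover and mutation
        population = [mutate(crossover(select(), select()))
                      for _ in range(len(population))]
    return min(population, key=fitness)

# toy usage: "individuals" are numbers, fitness is the distance to 7
random.seed(0)
best = evolve(init_pop=[random.uniform(0, 20) for _ in range(30)],
              fitness=lambda x: abs(x - 7),
              crossover=lambda a, b: (a + b) / 2,
              mutate=lambda x: x + random.gauss(0, 0.1),
              max_gens=50)
```

In a real GP system the individuals would be expression trees, and crossover and mutation would operate on subtrees; only the overall evaluate-select-vary skeleton carries over.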

Although GP has been successfully applied to many real-world problems, it is still not as widely accepted as other machine learning approaches, e.g., Support Vector Machines or Linear Regression [27]. This is due to some important shortcomings of GP such as its poor local structure, ill-defined fitness landscape, and code bloat [34]. Among them, the bloat phenomenon is one of the most studied. Bloat happens when individuals grow too large without a corresponding improvement in fitness [40,52]. Bloat causes several problems for GP: the evolutionary process becomes more time-consuming, the solutions are harder to interpret, and larger solutions are prone to overfitting. To date, many techniques have been proposed to address bloat, ranging from limiting individual size to designing specific genetic operators [8,11,14,25,26,36,42,43].

In this paper, we present a novel approach to controlling GP bloat that builds on our previous studies [8,9]. Our methods are based on the Semantic Approximation Technique (SAT), which grows a small tree whose semantics approximates a target semantic vector. Using SAT, we propose two methods for lessening GP code bloat. The first method is Subtree Approximation (SA), in which a random subtree is selected in an individual and replaced by a small tree of approximate semantics. The second method is Desired Approximation (DA), where the new tree is grown to approximate the desired semantics of the subtree instead of its own semantics. The performance of these bloat control strategies is examined on a large set of regression problems comprising both benchmark and UCI problems. We observe that the new methods significantly reduce code growth and improve the generalization ability of the evolved solutions compared to standard GP and state-of-the-art methods for controlling code bloat in GP. Moreover, the proposed methods are also better than three popular machine learning models and competitive with the fourth.

The main contribution of this paper is to demonstrate how the idea of semantic approximation (presented in Section 4) can be utilized to reduce code growth in GP. Compared to [8,9], where we originally proposed this technique, here we present a generalized version of semantic approximation and a better technique for controlling code bloat. Moreover, the proposed methods are thoroughly examined and compared with a number of GP and non-GP systems on a wider range of regression problems.

The remainder of this paper is organized as follows. In the next section, we present the background of the paper. Section 3 reviews the related work on managing code bloat in GP. The semantic approximation technique and two strategies for lessening code bloat are presented in Section 4. Section 5 presents the experimental settings adopted in the paper. Section 6 analyses and compares the performance of the proposed strategies with standard GP and some recent related methods. Some properties of our proposed methods are analyzed in Section 7. Section 8 compares the proposed methods with four popular machine learning algorithms. Finally, Section 9 concludes the paper and highlights some future work.


Background

This section presents some important concepts used in this paper, including the semantics of a GP individual and the semantic backpropagation algorithm.
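For reference, the semantics of a GP individual is commonly defined as the vector of its outputs on the fitness cases. The following minimal example is our own illustration of that definition, not the paper's code; the nested-tuple tree encoding is an assumption made for brevity.

```python
# Semantics of a program = the vector of its outputs on all fitness cases.
# The expression-tree encoding (nested tuples) is an illustrative assumption.

def evaluate(tree, x):
    """Evaluate an expression tree on a single input x."""
    if tree == 'x':
        return x
    if isinstance(tree, (int, float)):
        return tree
    op, left, right = tree
    a, b = evaluate(left, x), evaluate(right, x)
    return {'+': a + b, '-': a - b, '*': a * b}[op]

def semantics(tree, fitness_cases):
    """Semantic vector: the tree's output on every fitness case."""
    return [evaluate(tree, x) for x in fitness_cases]

cases = [0.0, 1.0, 2.0, 3.0]
tree = ('+', ('*', 'x', 'x'), 1.0)   # represents x*x + 1
print(semantics(tree, cases))        # [1.0, 2.0, 5.0, 10.0]
```

Two programs of very different sizes can have identical or near-identical semantic vectors, which is exactly the property the bloat control methods in this paper exploit.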

Related work

Due to the negative impact of code bloat, many approaches have been proposed to control bloat and lessen its impact on GP performance. Generally, the bloat control methods can be divided into three main groups: constraining individual size, adjusting selection techniques and designing genetic operators.

Methods

This section presents a new technique for constructing a program that approximates a given semantic vector and two methods for reducing GP code bloat based on that.
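As a generic sketch of the idea (the paper's actual SAT is detailed in Section 4 and may differ), one simple way to approximate a target semantic vector with a small tree is to scale a candidate tree's semantics by a least-squares coefficient θ = ⟨s, t⟩ / ⟨s, s⟩ and keep the candidate whose scaled semantics is closest to the target. All names below are our own.

```python
# Illustrative sketch only: approximate a target semantic vector t by
# theta * s(tree), where theta is the least-squares scaling coefficient
# <s, t> / <s, s>, and pick the small candidate tree with the lowest error.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def approximate(target, candidates):
    """Return (tree, theta, error) minimizing ||theta * s(tree) - target||^2."""
    best = None
    for tree, sem in candidates:
        denom = dot(sem, sem)
        if denom == 0:            # constant-zero semantics: cannot be scaled
            continue
        theta = dot(sem, target) / denom
        err = sum((theta * s - t) ** 2 for s, t in zip(sem, target))
        if best is None or err < best[2]:
            best = (tree, theta, err)
    return best

# each candidate: (label of a small tree, its semantics on the fitness cases)
library = [('x',   [0.0, 1.0, 2.0, 3.0]),
           ('x*x', [0.0, 1.0, 4.0, 9.0])]
target = [0.0, 2.0, 4.0, 6.0]     # semantics of 2*x
tree, theta, err = approximate(target, library)  # picks 'x' with theta = 2
```

In SA the target would be the semantics of the replaced subtree; in DA it would be the subtree's desired semantics, obtained via semantic backpropagation.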

Experimental settings

We tested SA and DA on nine GP benchmark problems recommended in the literature [53] and nine additional real-world problems taken from the UCI machine learning repository [3]. The abbreviation, name, number of features, and numbers of training and testing samples for each problem are presented in Table 1. The GP parameters used in our experiments are shown in Table 2. The raw fitness is the root mean squared error on all fitness cases; therefore, smaller values are better.
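The raw fitness above is the root mean squared error over all fitness cases; as a quick reference, it can be computed as:

```python
import math

def rmse(predicted, targets):
    """Root mean squared error over all fitness cases; smaller is better."""
    n = len(targets)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, targets)) / n)

# example: predictions off by exactly 1 on each of four cases -> RMSE of 1.0
print(rmse([2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0, 4.0]))  # 1.0
```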

Performance analysis

This section analyses the performance of the proposed methods using four popular metrics: training error, testing error, solution size and running time.

Bloat, overfitting and complexity analysis

This section presents a deeper analysis of the properties of the tested methods using three quantitative metrics: bloat, overfitting, and functional complexity [50]. Due to space limitations, we only present the results on four typical problems (F1, F7, F11 and F17) and for two configurations (SA20 and DA20). The results on the other problems and the remaining versions of SA and DA are given in Supplement 1 of this paper.

Comparing with machine learning algorithms

This section compares the results of the proposed methods with some popular machine learning models. Four machine learning algorithms are used in this experiment: Linear Regression (LR) [17], Support Vector Regression (SVR) [44], Random Forest (RF) [24], and Orthogonal Polynomial Expanded Random Vector Functional Link Neural Network (ORNN) [19,51]. For LR, SVR and RF, the implementations of these regression algorithms in scikit-learn [37], a popular Python machine learning package, are used.

Conclusion and future work

This section summarizes the paper, discusses the limitations and highlights some future research.

Declaration of competing interest

None.

Acknowledgement

This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.05-2019.05.

References (53)

  • M.J. Cavaretta et al., Data mining using genetic programming: the implications of parsimony on generalization error.

  • T.H. Chu et al., Reducing code bloat in genetic programming based on subtree substituting technique.

  • T.H. Chu et al., Semantics based substituting technique for reducing code bloat in genetic programming.

  • T.H. Chu et al., Tournament selection based on statistical test in genetic programming.

  • J. Derrac et al., A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms, Swarm Evol. Comput. (2011).

  • G. Dick et al., Controlling bloat through parsimonious elitist replacement and spatial structure.

  • S. Dignum et al., Operator equalisation and bloat free GP, Lect. Notes Comput. Sci. (2008).

  • J.V.C. Fracasso et al., Multi-objective semantic mutation for genetic programming.

  • M.-A. Gardner et al., Controlling code growth by dynamically shaping the genotype size distribution, Genet. Program. Evolvable Mach. (2015).

  • J.M. Hilbe, Logistic Regression Models (2009).

  • P. Juárez-Smith et al., Integrating local search within neat-GP.

  • R. Katuwal et al., Random Vector Functional Link Neural Network Based Ensemble Deep Learning (2019).

  • J.R. Koza, Genetic programming as a means for programming computers by natural selection, Stat. Comput. (1994).

  • K. Krawiec et al., Approximating geometric crossover in semantic space.

  • K. Krawiec et al., Locally geometric semantic crossover.

  • V. Leonardo et al., A survey of semantic methods in genetic programming, Genet. Program. Evolvable Mach. (2014).