Semantic approximation for reducing code bloat in Genetic Programming
Introduction
Genetic Programming (GP) is an evolutionary method that searches for solutions to a problem in the form of computer programs [20]. A GP system starts by initializing a population of individuals. The population is then evolved for a number of generations using genetic operators such as crossover and mutation. At each generation, the individuals are evaluated using a fitness function, and a selection scheme is used to choose better individuals to create the next population. The evolutionary process continues until a desired solution is found or the maximum number of generations is reached.
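The loop described above can be sketched in a few lines. This is a generic, illustrative skeleton, not the authors' system; the function names (`init_population`, `select`, etc.) are placeholders supplied by the caller.

```python
import random

def evolve(init_population, fitness, crossover, mutate, select,
           max_generations=50, target_error=1e-6):
    """Generic GP loop: evaluate, select, vary, repeat."""
    population = init_population()
    best = min(population, key=fitness)          # best-so-far individual
    for _ in range(max_generations):
        if fitness(best) <= target_error:        # desired solution found
            break
        next_population = []
        while len(next_population) < len(population):
            p1 = select(population, fitness)
            p2 = select(population, fitness)
            child = mutate(crossover(p1, p2))    # genetic operators
            next_population.append(child)
        population = next_population
        best = min(population + [best], key=fitness)
    return best
```

Any representation works as long as the supplied operators agree on it; for tree-based GP the individuals would be expression trees rather than the scalars a toy test might use.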
Although GP has been successfully applied to many real-world problems, it is still not as widely adopted as other machine learning approaches, e.g., Support Vector Machines or Linear Regression [27]. This is due to some important shortcomings of GP such as its poor local structure, ill-defined fitness landscape and code bloat [34]. Among them, bloat is one of the most studied problems: it occurs when individuals grow excessively large without a corresponding improvement in fitness [40,52]. Bloat causes several problems for GP: the evolutionary process becomes more time-consuming, the solutions are harder to interpret, and larger solutions are prone to overfitting. To date, many techniques have been proposed to address bloat, ranging from limiting individual size to designing specific genetic operators for GP [8,11,14,25,26,36,42,43].
In this paper, we present a novel approach to controlling GP bloat that builds on our previous studies [8,9]. Our methods are based on the Semantic Approximation Technique (SAT), which grows a small tree whose semantics approximate a target semantic vector. Using SAT, we propose two methods for lessening GP code bloat. The first method is Subtree Approximation (SA), in which a random subtree is selected in an individual and replaced by a small tree of approximate semantics. The second method is Desired Approximation (DA), where the new tree is grown to approximate the desired semantics of the subtree instead of its current semantics. The performance of the bloat control strategies is examined on a large set of regression problems comprising both benchmark and UCI problems. We observe that the new methods significantly reduce code growth and improve the generalization ability of the evolved solutions compared to standard GP and state-of-the-art methods for controlling code bloat in GP. Moreover, the proposed methods are also better than three popular machine learning models and competitive with a fourth.
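One plausible simplified reading of the approximation step can be sketched as follows. This is illustrative only, not the authors' exact SAT: it approximates a target semantic vector by a scaled small tree `c * t(X)`, choosing `t` from a pool of pre-evaluated candidate trees and fitting the scale `c` by least squares.

```python
import numpy as np

def approximate(target, candidate_semantics):
    """Return (index, scale, error) of the scaled candidate closest to target.

    candidate_semantics: list of semantic vectors, one per small candidate tree.
    """
    best = None
    for i, s in enumerate(candidate_semantics):
        denom = float(np.dot(s, s))
        # Least-squares scale c minimizing ||target - c * s||.
        c = float(np.dot(s, target)) / denom if denom > 0 else 0.0
        err = float(np.linalg.norm(target - c * s))
        if best is None or err < best[2]:
            best = (i, c, err)
    return best
```

In SA the `target` would be the semantics of the replaced subtree; in DA it would be the subtree's desired semantics obtained via semantic backpropagation.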
The main contribution of this paper is to demonstrate how the idea of semantic approximation (presented in Section 4) can be utilized to reduce code growth in GP. Compared to [8,9], where we originally proposed this technique, here we present a generalized version of semantic approximation and a better technique for controlling code bloat. Moreover, the proposed methods are thoroughly examined and compared with a number of GP and non-GP systems on a wider range of regression problems.
The remainder of this paper is organized as follows. In the next section, we present the background of the paper. Section 3 reviews the related work on managing code bloat in GP. The semantic approximation technique and two strategies for lessening code bloat are presented in Section 4. Section 5 presents the experimental settings adopted in the paper. Section 6 analyses and compares the performance of the proposed strategies with standard GP and some recent related methods. Some properties of our proposed methods are analyzed in Section 7. Section 8 compares the proposed methods with four popular machine learning algorithms. Finally, Section 9 concludes the paper and highlights some future work.
Section snippets
Background
This section presents some important concepts used in this paper, including the semantics of a GP individual and the semantic backpropagation algorithm.
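The semantics of a GP individual is commonly defined as the vector of its outputs on the fitness cases, which a short example makes concrete (the program and cases below are illustrative):

```python
def semantics(program, fitness_cases):
    """Semantic vector: the program's output on every fitness case."""
    return [program(x) for x in fitness_cases]

cases = [-1.0, 0.0, 1.0, 2.0]
p = lambda x: x * x + 1        # the tree (+ (* x x) 1)
semantics(p, cases)            # [2.0, 1.0, 2.0, 5.0]
```

Two syntactically different trees with the same semantic vector are indistinguishable to semantic methods, which is what makes replacing a large subtree by a small semantically similar one possible.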
Related work
Due to the negative impact of code bloat, many approaches have been proposed to control bloat and lessen its impact on GP performance. Generally, the bloat control methods can be divided into three main groups: constraining individual size, adjusting selection techniques and designing genetic operators.
Methods
This section presents a new technique for constructing a program that approximates a given semantic vector and two methods for reducing GP code bloat based on that.
Experimental settings
We tested SA and DA on nine GP benchmark problems recommended in the literature [53] and nine additional real-world problems taken from the UCI machine learning repository [3]. The abbreviation, name, number of features, and numbers of training and testing samples of each problem are presented in Table 1. The GP parameters used in our experiments are shown in Table 2. The raw fitness is the root mean squared error on all fitness cases; therefore, smaller values are better. For each problem and
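The raw fitness measure described above is the standard root mean squared error over the fitness cases:

```python
import math

def rmse(predicted, target):
    """Root mean squared error over all fitness cases (smaller is better)."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target))
```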
Performance analysis
This section analyses the performance of the proposed methods using four popular metrics: training error, testing error, solution size and running time.
Bloat, overfitting and complexity analysis
This section presents a deeper analysis of the properties of the tested methods using three quantitative metrics: bloat, overfitting and functional complexity [50]. Due to space limitations, we only present the results on four typical problems (F1, F7, F11 and F17) and for two configurations (SA20 and DA20). The results on the other problems and the remaining versions of SA and DA are provided in Supplement 1 of this paper.
Comparing with machine learning algorithms
This section compares the results of the proposed methods with some popular machine learning models. Four machine learning algorithms, namely Linear Regression (LR) [17], Support Vector Regression (SVR) [44], Random Forest (RF) [24], and the Orthogonal Polynomial Expanded Random Vector Functional Link Neural Network (ORNN) [19,51], are used in this experiment. For LR, SVR and RF, the implementation of these regression algorithms in a popular Python machine learning package, scikit-learn [37], is
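A minimal sketch of the scikit-learn baselines named above, using default hyperparameters (the paper's exact settings may differ, and ORNN has no scikit-learn implementation, so it is omitted here):

```python
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def evaluate_baselines(X_train, y_train, X_test, y_test):
    """Fit each baseline and return its test RMSE, keyed by model name."""
    models = {
        "LR": LinearRegression(),
        "SVR": SVR(),
        "RF": RandomForestRegressor(random_state=0),
    }
    return {
        name: mean_squared_error(y_test, m.fit(X_train, y_train).predict(X_test)) ** 0.5
        for name, m in models.items()
    }
```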
Conclusion and future work
This section summarizes the paper, discusses the limitations and highlights some future research.
Declaration of competing interest
None.
Acknowledgement
This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.05-2019.05.
References (53)
- et al., Semantic tournament selection for genetic programming based on statistical analysis of error vectors, Inf. Sci. (2018)
- et al., Neat genetic programming: controlling bloat naturally, Inf. Sci. (2016)
- et al., On the roles of semantic locality of crossover in genetic programming, Inf. Sci. (2013)
- et al., A comprehensive experimental evaluation of orthogonal polynomial expanded random vector functional link neural networks for regression, Appl. Soft Comput. (2018)
- et al., Prune and plant: a new bloat control method for genetic programming
- et al., Bloat control operators and diversity in genetic programming: a comparative study, Evol. Comput. (2010)
- et al., UCI Machine Learning Repository (2013)
- et al., Semantic analysis of program initialisation in genetic programming, Genet. Program. Evolvable Mach. (2009)
- et al., Semantically driven mutation in genetic programming
- Evolution of visual feature detectors
- Data mining using genetic programming: the implications of parsimony on generalization error
- Reducing code bloat in genetic programming based on subtree substituting technique
- Semantics based substituting technique for reducing code bloat in genetic programming
- Tournament selection based on statistical test in genetic programming
- A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms, Swarm Evol. Comput.
- Controlling bloat through parsimonious elitist replacement and spatial structure
- Operator equalisation and bloat free GP, Lect. Notes Comput. Sci.
- Multi-objective semantic mutation for genetic programming
- Controlling code growth by dynamically shaping the genotype size distribution, Genet. Program. Evolvable Mach.
- Logistic Regression Models
- Integrating local search within neat-GP
- Random Vector Functional Link Neural Network Based Ensemble Deep Learning
- Genetic programming as a means for programming computers by natural selection, Stat. Comput.
- Approximating geometric crossover in semantic space
- Locally geometric semantic crossover
- A survey of semantic methods in genetic programming, Genet. Program. Evolvable Mach.