Elsevier

Information Sciences

Volume 502, October 2019, Pages 418-433
Genetic programming performance prediction and its application for symbolic regression problems

https://doi.org/10.1016/j.ins.2019.06.040

Highlights

  • Theoretical analysis of GP performance prediction.

  • Suggestion of upper bounds for GP performance.

  • Guide GP search with the proposed upper bound.

Abstract

Predicting the performance of Genetic Programming (GP) helps identify whether GP is an appropriate approach for the problem at hand. However, previous studies show that measuring the difficulty of a problem for GP and predicting GP performance are challenging tasks. This paper presents a theoretical analysis of the GP performance prediction problem and suggests an upper bound for GP performance: the error of the best solution found by GP for a given problem is less than the proposed upper bound. To evaluate the proposed upper bound experimentally, a wide range of synthetic and real symbolic regression problems with different dimensions are solved by GP, yielding a large collection of actual GP performances. Comparing these actual performances with their corresponding upper bounds shows that the proposed upper bounds are not violated on either the synthetic or the real symbolic regression problems. The proposed upper bound is then used to guide GP search. The results show that the proposed approach finds better results than Multi Gene Genetic Programming (MGGP).

Introduction

The No Free Lunch Theorem [1], [2] states that there is no single best learning algorithm for all problems; different algorithms perform differently depending on the problem at hand. Therefore, selecting an appropriate learning algorithm among all possible approaches for a specific problem is a very important task [3]. To select an appropriate approach (i.e., one with a good success rate), an estimate or prediction of each approach's performance on a given problem is preferable to determining the performance by actually running the algorithm, especially for time-consuming or computationally expensive algorithms.

GP is one of the evolutionary computation techniques that can solve a wide range of problems. Symbolic regression, one of the problems most commonly solved by GP, is the focus of this paper. For a fixed GP configuration (fixed representation, genetic operators, function set, population size, number of generations, fitness function, etc.), the performance of GP depends on the characteristics of the symbolic regression problem: some problems are hard for GP (i.e., GP has a low success rate when solving them) and some are easy (i.e., GP has a high success rate on them). However, as stated in [4], identifying the difficulty of a particular problem for a GP system is hard.

Koza [5] defined the computational effort required to solve a given problem, which is suitable for comparing different evolutionary systems or algorithms; it is also a measure of problem difficulty. The fitness landscape metaphor can sometimes indicate the hardness of a problem, but not always; for high-dimensional problems, for example, the fitness landscape is of no use [4]. For GP, whose high-dimensional search space makes individuals' neighborhoods difficult to define, sketching the fitness landscape is impossible. Instead, measures of problem difficulty based on the fitness-landscape concept, such as Fitness Distance Correlation (FDC) [6], [7], [8] and Negative Slope Coefficient (NSC) [9], have been proposed.

However, these measures cannot predict the fitness of the final solution or the success rate of GP on a particular problem. Graff and Poli [10], [11] estimated the performance of GP based on problem distances: they predicted the performance of GP on a given problem from the similarity between that problem and a set of reference problems. They later improved this approach [12], [13] by predicting GP performance from finite-difference difficulty indicators. The disadvantage of FDC and NSC is that they require extensive sampling of the search space, which can cost as much as running GP to obtain the actual performance. On the other hand, the performance prediction models proposed by Graff and Poli are not reliable for new problems that are not sufficiently similar to the problems in the training set [11].

This paper presents a theoretical analysis of the GP performance prediction problem and suggests an upper bound for GP performance: the error of the best solution found by GP for a given problem is less than the proposed upper bound. The proposed upper bounds are evaluated experimentally on a wide range of synthetic and real symbolic regression problems with different dimensions. The actual GP performances obtained from the experiments are compared with their corresponding upper bounds, and the results show that the bounds are not violated for either the synthetic or the real symbolic regression problems. As an application of performance prediction, the best proposed upper bound is used to guide GP search; the results show that the proposed approach finds better results than MGGP [14], [15]. In summary, this paper makes the following contributions:

  • The upper bounds proposed in this paper are general and can be used for any symbolic regression problem, i.e., they are applicable to both one-dimensional and multidimensional symbolic regression problems with any sample size and any sampling strategy.

  • The proposed upper bounds are fast to compute, because their computation depends neither on sampling the search space nor on the actual GP performance on a large set of benchmark problems.

  • The proposed upper bounds are also reliable when they predict a small (good) value for GP performance.

  • The proposed upper bounds are analyzed theoretically and experimentally.

  • The best proposed upper bound is also used to guide GP search and to evolve subtrees that improve the performance of GP.

The rest of this paper is organized as follows. Section 2 presents a literature review. Section 3 describes several theoretical notes on GP performance prediction. Section 4 introduces the proposed upper bounds. Section 5 consists of several experiments that confirm the proposed upper bounds; it also contains an experiment in which the best proposed upper bound guides GP search. Finally, Section 6 draws conclusions and outlines possible future work.

Section snippets

Related work

Koza [5] analyzed the performance of GP over a large number of runs (to minimize the effect of GP's random behavior) and suggested calculating the number of runs that succeed on or before the ith generation, divided by the total number of runs. This quantity is called P(M, i), where M is the population size. Note that P(M, i) depends on the population size M and the generation number i, but it is obtained experimentally and is not a closed-form formula in M and i. Given P(M, i) one can
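The quantities above can be sketched in code. The sketch below computes P(M, i) from per-generation success counts and applies Koza's standard estimate R(M, i, z) = ⌈log(1 − z) / log(1 − P(M, i))⌉ for the number of independent runs needed to reach overall success probability z; the run statistics in the example are illustrative only:

```python
import math

def cumulative_success_probability(successes_by_gen, total_runs):
    """P(M, i): fraction of runs that succeed on or before generation i.

    successes_by_gen[i] counts the runs whose first success occurred
    exactly at generation i; M (the population size) is fixed across runs.
    """
    p, cum = [], 0
    for s in successes_by_gen:
        cum += s
        p.append(cum / total_runs)
    return p

def runs_required(p_mi, z=0.99):
    """R(M, i, z): independent runs needed so that at least one of them
    succeeds by generation i with probability z, given P(M, i) = p_mi."""
    if p_mi >= 1.0:
        return 1
    if p_mi <= 0.0:
        return math.inf
    return math.ceil(math.log(1 - z) / math.log(1 - p_mi))

# Illustrative data: of 10 runs, 1 first succeeds at gen 0, 2 at gen 1, 2 at gen 2.
p = cumulative_success_probability([1, 2, 2], total_runs=10)  # [0.1, 0.3, 0.5]
r = runs_required(p[2])  # runs needed for 99% confidence by generation 2
```

Multiplying R(M, i, z) by the individuals processed per run, M · (i + 1), gives Koza's computational effort for that generation.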

Theoretical background

Koza [5] defined the computational effort required to solve a given problem, which is suitable for comparing different evolutionary systems or algorithms. It is also a measure of problem difficulty, i.e., easier problems need less computational effort to be solved. But what if the problem is not solvable? For example, a solution with zero error may not be found for a given symbolic regression problem. Here, instead of computing the computational effort required to find a

Proposed upper bounds

Suppose that the data set DS = {(x_{i1}, ..., x_{im}; y_i)}_{i=1}^{n} contains n samples. Define the output vector Y = [y_i]_{i=1}^{n} and the input matrix X = [X_1 ... X_m], where X_j = [x_{ij}]_{i=1}^{n} and m is the dimension of the input matrix X. Assume that Ŷ = [ŷ_i]_{i=1}^{n} is the output of a GP solution; then NRMSE is the square root of the Mean Squared Error between Ŷ and Y, normalized by the output variance, Eq. (6):

MSE(Y, Ŷ) = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²

NRMSE(Y, Ŷ) = (MSE(Y, Ŷ) / Var[Y])^{1/2}    (6)
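The NRMSE of Eq. (6) can be computed directly from the two vectors; a minimal NumPy sketch (the function name is ours, and Var[Y] is taken as the population variance):

```python
import numpy as np

def nrmse(y, y_hat):
    """NRMSE: square root of the MSE between Y and Y-hat,
    normalized by the variance of the target vector Y (Eq. (6))."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    mse = np.mean((y - y_hat) ** 2)
    return np.sqrt(mse / np.var(y))
```

Note that predicting the constant mean of Y yields NRMSE = 1, so values below 1 indicate a model better than the mean predictor.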

Keijzer [25], [26] proposed scaled symbolic regression that

Experimental results

In this section, several experiments are designed to evaluate the upper bounds proposed in Section 4. First, 500 synthetic problems are generated. For each synthetic problem, the proposed upper bounds and the GP performance are obtained and the relationship between them is analyzed. In this paper, the median of the best fitness in the last generation of GP over 31 independent runs is used as the performance of GP on synthetic problems. The proposed upper bounds and GP
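The performance measure used above, the median over 31 independent runs of the best last-generation fitness, can be sketched as follows; `run_gp` is a hypothetical stand-in for a complete GP run that returns the best (lowest-error) fitness of its final generation:

```python
import statistics

def gp_performance(run_gp, n_runs=31, seed0=0):
    """Performance of GP on one problem: the median, over n_runs
    independent runs, of the best fitness in the last generation.

    run_gp(seed) is a placeholder for a full GP run seeded with `seed`,
    returning the best last-generation fitness of that run.
    """
    best_fitnesses = [run_gp(seed0 + r) for r in range(n_runs)]
    return statistics.median(best_fitnesses)
```

The median (rather than the mean) is robust to the occasional diverging run, which matters because GP fitness distributions over runs are often heavy-tailed.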

Conclusion and future work

In this paper, a theoretical analysis is presented for GP performance prediction, which is a challenging issue. This analysis is performed under certain conditions, e.g., a fixed number of fitness function evaluations, and as a result, an upper bound on the error of the best solution found by GP is introduced. Several experiments are conducted to analyze the relationship between the proposed upper bounds and the performance of GP on symbolic regression problems. Furthermore, the best upper

Compliance with ethical standards

  • This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

  • Declarations of interest: none. The authors declare that they have no conflict of interest.

  • This article does not contain any studies with human participants or animals performed by any of the authors.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors wish to acknowledge the assistance they received from using the UCI Machine Learning Repository.

References (39)

  • J.R. Koza

    Genetic Programming: on the Programming of Computers by Means of Natural Selection

    (1992)
  • T. Jones et al.

    Fitness distance correlation as a measure of problem difficulty for genetic algorithms

    Proceedings of the 6th International Conference on Genetic Algorithms

    (1995)
  • M. Tomassini et al.

    A study of fitness distance correlation as a difficulty measure in genetic programming

    Evol. Comput.

    (2005)
  • Z. Zhang et al.

    Predictive models of problem difficulties for differential evolutionary algorithm based on fitness landscape analysis

    Proceedings of the 37th Chinese Control Conference (CCC)

    (2018)
  • L. Vanneschi et al.

    Negative slope coefficient: a measure to characterize genetic programming fitness landscapes

  • M. Graff et al.

    Practical model of genetic programming’s performance on rational symbolic regression problems

    Genetic Programming

    (2008)
  • M. Graff et al.

    Performance models for evolutionary program induction algorithms based on problem difficulty indicators

    Proceedings of the European Conference on Genetic Programming

    (2011)
  • M. Graff et al.

    Models of performance of evolutionary program induction algorithms based on indicators of problem difficulty

    Evol. Comput.

    (2013)
  • D.P. Searson et al.

    GPTIPS: an open source genetic programming toolbox for multigene symbolic regression

    Proceedings of the International Multiconference of Engineers and Computer Scientists

    (2010)