Introduction

Genetic programming (GP) [1] is a branch of evolutionary computation that can generate computer programs, and it is widely applied in numerous fields [2,3,4,5,6,7,8,9,10]. Evolutionary programming (EP) is a black-box optimiser in which mutation is the only operator. Researchers have recommended different probability distributions as mutation operators and analysed their characteristics. For example, Yao et al. [11] point out that a Cauchy mutation performs better than a Gaussian mutation by virtue of a higher probability of making large jumps, although large step sizes are typically detrimental towards the end of the search, when the current search points are close to the global optimum. Hong et al. [12] note that with a non-adaptive mutation operator in EP, more offspring usually survive in the early generations of a run and fewer survive in the later generations. Researchers have also proposed different mutation strategies to improve EP efficiency [12,13,14,15,16,17]. ‘A hyper-heuristic is a search method or learning mechanism for selecting or generating heuristics to solve computational search problems’ [18]. Hyper-heuristics are classified by the feedback source in the learning process: online learning hyper-heuristics learn from a single instance of a problem, whereas offline learning hyper-heuristics learn from a set of training instances and generalise to unseen instances [18]. Both online [2,3,4,5] and offline hyper-heuristics [6,7,8,9,10] have been applied across various research fields.

Hyper-heuristics are an effective and popular technique that has been applied to a wide range of problem domains. Cowling et al. [2] used an online learning hyper-heuristic to minimise the number of delegates who actually attend a sales summit out of a number of possible attendees. Dowsland et al. [3] used an online learning hyper-heuristic to design and evaluate a heuristic solution to the problem of selecting a set of shippers that minimises the total annual volume of space required to accommodate a given set of products with known annual shipment quantities. Ochoa et al. [4] described a number of extensions to the HyFlex framework that enable the implementation of more robust and effective adaptive search heuristics. Pisinger et al. [5] presented a unified heuristic able to solve five different variants of the vehicle routing problem. Shao et al. [6] used multiobjective genetic programming as an offline learning hyper-heuristic for feature learning in image classification. Hong et al. used GP as an offline learning hyper-heuristic to automatically design a mutation operator for EP [7, 8] and to automatically design more general mutation operators for EP [9]. Ross et al. [10] used an offline learning hyper-heuristic as a step towards a new way of using evolutionary algorithms that may solve some problems of acceptability for real-world use.

In this study, the heuristics are adaptive mutation operators generated by GP within an offline hyper-heuristic. In other words, we use a GP-based offline hyper-heuristic to redesign a portion of the EP algorithm (i.e. the probability distribution) to improve the overall performance of EP. The contribution of this study is the ‘automatic’ design of ‘adaptive’ mutation operators, realising both ‘automatic’ and ‘adaptive’ at the same time. Previous studies either automatically designed static/non-adaptive mutation operators [7,8,9], or manually designed adaptive mutation operators/mutation strategies for EP [13, 16, 17, 19,20,21,22]. To achieve this, we propose a set of adaptive factors that collect and update historical information during the evolutionary process. The proposed method also contributes a group of automatically designed adaptive mutation operators for function classes. In essence, these adaptive factors are variables provided in the GP terminal set; GP uses them to automatically design a mutation operator (a random number generator), which replaces the human designed mutation operator in the EP algorithm. In EP, each individual is taken as a pair of real-valued vectors. The variables partly reflect the individuals’ current positions and the evolutionary status. They change during evolution and affect the characteristics of mutation, hence we call them adaptive factors. For example, \(CUR\_MIN\_X\) is the minimum absolute value of the best individual up to the current generation. \(N(\mu , CUR\_MIN\_X)\) is then a mutation operator in which \(CUR\_MIN\_X\) is updated in each EP generation, so the size of the jumps the mutation operator can make is also updated in each generation. The adaptive factors are collected and employed by EP in both the training and testing stages.
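To make the idea concrete, the following minimal Python sketch (our own illustration, not the paper's implementation) shows how an adaptive factor such as \(CUR\_MIN\_X\) can parameterise a Gaussian mutation whose jump sizes change as the factor is refreshed each generation; the function and variable names are hypothetical, and the factor is used here simply as the spread of the distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def mutate(x, cur_min_x, mu=0.0):
    """Sample one mutation step from N(mu, cur_min_x).

    cur_min_x is an adaptive factor: the minimum absolute component
    value of the best individual seen so far. Because it is refreshed
    every EP generation, the spread of the jumps changes as the
    search proceeds.
    """
    sigma = max(cur_min_x, 1e-12)        # guard against a zero spread
    return x + rng.normal(mu, sigma, size=x.shape)

# toy illustration: as better individuals are found the factor shrinks,
# so the same operator makes smaller jumps in later generations
x = np.zeros(3)
for cur_min_x in [5.0, 1.0, 0.05]:       # values the factor might take over a run
    print(mutate(x, cur_min_x))
```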

The hypothesis of this study, inspired by [8, 9, 11, 12], is that for a specific class of functions GP can design an adaptive mutation operator for EP that outperforms an automatically designed non-adaptive mutation operator. The adaptive factors collected during the EP run allow the size of the jumps to differ across generations. A set of adaptive mutation operators can thus be discovered, one per function class. For more details, please refer to Sect. 4.

The outline of this paper is as follows: Sect. 2 describes function optimisation and the basic EP algorithm. Section 3 describes the adaptive mutation operator. Section 4 describes the connections between GP and EP. Section 5 describes the benchmark functions and their classes, how a mutation operator is trained, and the test results; here, we contrast the performance of previously proposed automatically designed non-adaptive mutation operators (ADMs) with automatically designed adaptive mutation operators (ADAMs). In Sect. 6, we analyse and compare the testing results, and in Sect. 7, we summarise and conclude the paper.

The basic EP algorithm

The EP algorithm evolves a population of numerical vectors to find near-optimal function values. Mutation is the only operator in EP, and recent EP research has focussed primarily on manually designing mutation operators or smart strategies for using them [11,12,13,14,15,16, 23].

Minimisation can be formalised as a pair \((S, f)\), where \(S \subseteq {\mathbb {R}}^{n}\) is a bounded set and \(f:S\longrightarrow {\mathbb {R}}\) is an n-dimensional real-valued function. S is the search space of the problem (the domain of f). The aim of EP is to find a point \(x_{\min } \in S\) such that \(f(x_{\min })\) is a global minimum on S, or a close approximation to it. More specifically, the requirement is to find an \(x_{\min }\in S\) such that

$$\begin{aligned} \forall x \in S : f(x_{\min })\le f(x) \end{aligned}$$

f does not need to be continuous or differentiable, but it must be bounded (i.e. S is bounded). The EP’s mutation process is represented by the following equations:

$$\begin{aligned} x_{i}^{\prime }(j) = x_{i}(j) + \eta _{i}(j)D_{j}, \end{aligned}$$
(1)
$$\begin{aligned} \eta _{i}^{\prime }(j) = \eta _{i}(j)\exp \left( \gamma ^{\prime }N(0,1) + \gamma N_{j}(0,1)\right) . \end{aligned}$$
(2)

In the above equations, each individual is taken as a pair of real-valued vectors (\(x_{i}\), \(\eta _{i}\)), \(\forall i\in \{1, \ldots , \mu \}\), where \(\mu \) is the number of individuals in the population, n is the dimensionality of f, and j denotes the j-th component of the vectors \(x_{i}\), \(x_{i}^{\prime }\), \(\eta _{i}\), and \(\eta _{i}^{\prime }\). The factors \(\gamma \) and \(\gamma ^{\prime }\) are set to \((\sqrt{2\sqrt{n}})^{-1}\) and \((\sqrt{2n})^{-1}\), respectively. \(D_{j}\) is the mutation operator; researchers usually use a Cauchy, Gaussian, or Lévy distribution \(L_{\alpha , \gamma }(y)\) as the mutation operator [11, 14, 23]. Lee et al. [14] point out that the Lévy distribution with \(\alpha = 1.0\) is the Cauchy distribution, and with \(\alpha = 2.0\) it is the Gaussian distribution. For a complete description of EP, refer to [24].
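As an illustration, the following sketch implements one EP mutation step according to Eqs. (1) and (2), with \(D_{j}\) drawn from a standard Cauchy distribution (one of the common choices cited above). The code is our own minimal reading of the equations, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

def ep_mutate(x, eta):
    """One EP mutation step following Eqs. (1) and (2).

    x, eta : vectors of length n (an individual and its strategy
    parameters). D_j is drawn here from a standard Cauchy distribution.
    """
    n = len(x)
    gamma       = 1.0 / np.sqrt(2.0 * np.sqrt(n))   # gamma  = (sqrt(2*sqrt(n)))^-1
    gamma_prime = 1.0 / np.sqrt(2.0 * n)            # gamma' = (sqrt(2n))^-1

    d = rng.standard_cauchy(n)                      # D_j, j = 1..n
    x_new = x + eta * d                             # Eq. (1)

    # Eq. (2): one global N(0,1) sample shared by all components,
    # plus an independent N_j(0,1) sample per component
    eta_new = eta * np.exp(gamma_prime * rng.normal()
                           + gamma * rng.normal(size=n))
    return x_new, eta_new

x, eta = rng.uniform(-5, 5, 30), np.full(30, 3.0)   # sigma initialised to 3.0
x, eta = ep_mutate(x, eta)
```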

In this study, the hyper-heuristic framework designs an adaptive mutation operator (can also be seen as an adaptive mutation strategy) and replaces a probability distribution \(D_{j}\). The EP algorithm uses this candidate mutation operator on functions generated from the function classes.

Adaptive mutation operator

This study focuses on automatically designing adaptive heuristics (i.e. random number generators used as mutation operators in EP). Human designed adaptive heuristics have, however, already been proposed in numerous studies; below are examples of human designed heuristics relating to adaptiveness, adaptive mutation operators, and mutation strategies:

The concept of adaptiveness is widely used in many technologies: [25] proposed new local search procedures combined in an adaptive large neighbourhood search meta-heuristic to generate solutions for the gate matrix layout problem. An adaptive charged system search algorithm was developed to solve economic dispatch problems in [26]. To make the terminal phase of a standard global positioning system and inertial navigation system (GPS/INS) landing system more precise, an adaptive fuzzy data fusion algorithm was developed, yielding more accurate state estimates as the vehicle approaches the landing surface [27].

A novel adaptive strategy is developed to dynamically tune the control parameters of the random learning operator so that the improved adaptive human learning optimisation can efficiently accelerate the convergence at the beginning of an iteration, develop diversity in the middle of the searching process to better explore the solution space, and perform an accurate local search at the end of the search to find the optima [28].

A hybrid adaptive evolutionary algorithm was introduced to improve the performance of the search operators across the various stages of the search/optimisation process of evolutionary algorithms [29]. A neighbourhood-adaptive differential evolution method was also proposed; in this framework, multiple neighbourhood relationships are defined for each individual and the neighbourhood relationships are then adaptively selected for specific functions during the evolutionary process [30].

An adaptive switching particle swarm optimisation algorithm using a hybrid update sequence is proposed, which can automatically switch to synchronous or asynchronous updating during the evolutionary process [31]. [32] presented a method for reusing the valuable information available from previous individuals to guide later search; in this approach, prior useful information was fed back to the updating process.

The concept of adaptive mutation operators has been proposed in numerous studies. The adaptive mutation operator proposed for particle swarm optimisation in [33] uses three mutation operators (Cauchy, Gaussian, and Lévy); mutation operators that produce offspring with lower fitness have their selection ratios decreased, while those producing offspring with higher fitness have their selection ratios increased. In [34], an adaptive mutation was developed for particle swarm optimisation in airfoil aerodynamic design. In [35], an adaptive strategy was proposed for differential evolution using a Cauchy distribution (\(F_{m}\), 0.1) with a fixed scale parameter of 0.1 and an adaptive location parameter \(F_{m}\); the Gaussian (\(C_{\mathrm{{rm}}}\), 0.1) has a fixed standard deviation of 0.1 and a mean of \(C_{\mathrm{{rm}}}\).

Liu et al. [21] investigated operator adaptation in EP both at the population and individual levels. The operator adaptation at the population level aims to update the rates of the operator based on the operator performance over the entire population during the evolution [21].

In [17], a mixed mutation strategy based on a local fitness landscape was proposed, in which the local fitness landscape is a key factor determining mutational behaviour. In [16], ensemble strategies with adaptive EP were proposed: EP with an ensemble of Gaussian and Cauchy mutation operators, where each mutation operator has its own population and parameters. In [12], we introduced a strategy in which mutation operators with different parameters are selected in each EP generation according to the step size of the jumps made by individuals, generating long step size variants at the beginning of the search and short step size variants later on; the size of the jumps the mutation operator can make thus keeps changing during the EP evolutionary process. These studies can all be considered human designed adaptive mutation strategies.

Regardless of which adaptive or non-adaptive mutation operators are used in state-of-the-art algorithms, the essence of an adaptive mutation operator is the following:

  • Use different mutation operators, or a combination of them, for each generation.

  • Change the jump sizes across generations according to feedback from the current generation or historical information (i.e. each EP generation has a best fitness value; these per-generation best values can be stored in an array, and that array of fitness values is an example of historical information); a minimal sketch follows this list.
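The sketch below illustrates the second point under a simple, hypothetical rule of our own: the per-generation best values are kept in an array, and the jump size shrinks once improvement stalls. Real strategies such as [12] are more sophisticated; this only shows the feedback mechanism.

```python
import numpy as np

rng = np.random.default_rng(1)
best_history = []                     # best fitness value of each EP generation

def adaptive_scale(generation_fitnesses):
    """Turn historical feedback into a jump size (illustrative rule only).

    Keeps the per-generation best values in an array (the 'historical
    information') and shrinks the scale once improvement stalls.
    """
    best_history.append(min(generation_fitnesses))
    improving = len(best_history) < 2 or best_history[-1] < best_history[-2]
    return 1.0 if improving else 0.1  # large jumps while improving, else small

# feedback from two consecutive generations
scale = adaptive_scale([3.2, 4.7, 2.9])   # generation 1: best = 2.9
scale = adaptive_scale([3.0, 2.9, 3.4])   # generation 2: no improvement -> 0.1
step = scale * rng.standard_cauchy()      # the jump now reflects the feedback
```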

The proposed hyper-heuristic GP algorithm

In this section, we describe how GP is used to build EP mutation operators. In previous work, the GP framework successfully designed non-adaptive mutation operators for EP [7, 9]: Hong et al. used GP as an offline learning hyper-heuristic to automatically design a mutation operator for EP and, later, more general mutation operators for EP. In [7], a group of mutation operators was designed for function classes. In [9], the function classes were divided into several groups, each assigned an independently and automatically designed mutation operator. In both works, the automatically designed mutation operator, a probability distribution, is fixed at every EP generation: once the probability distribution is designed, the form of the equation and its values are fixed when the EP is run, so the search region is fixed at each EP generation.

However, dynamically updating the probability distribution during the evolutionary process can clearly lead to a more efficient search. In this study, we employ the improved framework in Fig. 1, with richer GP terminal settings, using GP as an offline hyper-heuristic to automatically design adaptive mutation operators for EP on specific function classes. In the framework, GP sits at the hyper-level to generate a piece of code (which acts as a mutation operator by generating random numbers according to a probability distribution), and functions are generated from the function classes to optimise. The automatically generated program is in fact an automatically designed probability distribution containing adaptive factors. For example, \(CUR\_MIN\_X\) represents the absolute minimum value of the best individual in an EP run up to the current generation and is updated at each EP generation; thus, the jump sizes of the ADAM change dynamically across EP generations. The adaptive factors that may appear in a mutation operator are listed in Table 1. At the base-level, EP optimises functions and is treated as a fitness function by GP. The fitness value used in GP is the value calculated by EP, averaged over nine runs.

Fig. 1
Overview of the improved hyper-heuristic framework proposed: GP sits at the hyper-level to generate heuristics, and EP sits at the base-level to validate the heuristics

In Fig. 1, the blue arrow between ‘Function class’ and ‘Function to optimise’ indicates that functions are generated from the function class. The blue arrow between ‘Function to optimise’ and ‘Evolutionary programming’ indicates that the functions are taken by EP and optimised. The blue arrow between ‘Evolutionary programming’ and ‘Adaptive mutation operator generator’ represents the insertion of an ADAM into EP: the adaptive mutation operator is generated by GP and inserted into EP, where it is tested on a set of functions.

In the experiments, we compare the two types of automatically designed mutation operators: a non-adaptive mutation operator (ADM), which is a random number generator following a fixed probability distribution, and an adaptive mutation operator (ADAM), which is a random number generator following a probability distribution that changes dynamically during the EP run, meaning the mutation operator can differ at each EP generation. To test the hypothesis proposed in this study, in contrast with the framework we proposed in [7, 9], we make two significant improvements:

  • The adaptive factors, recalculated at each EP generation, are proposed and added to the GP terminal set.

  • In [7,8,9], the GP framework uses EP as a fitness function to evaluate the performance of GP individuals through non-adaptive mutation operators. In this work, the adaptive factor values are used as part of the adaptive mutation operator for EP. Because adaptive factors are added, EP needs to calculate and update their values at each generation; the updated values change the probability distribution of the possible step sizes that the mutation operator can make.

Experimental design

We call a random number generator produced by GP without adaptive factors an automatically designed mutation operator (ADM) [7], and a random number generator produced by GP with adaptive factors an automatically designed adaptive mutation operator (ADAM). The experiment is designed to test the following hypothesis: an ADAM can search different regions at different EP generations, and it achieves better performance than an ADM for EP on specific function classes. The adaptive factors are therefore calculated and used by EP during the evolutionary process. In the experiment, 168 functions are generated from each function class: 18 functions are used in the training stage, 50 in the testing stage, and 100 in the exchange testing (50 for ADAM and 50 for ADM).

Table 1 The adaptive factors included in the GP terminal set

The training stage

The algorithms proposed in [7, 9] demonstrated that automatically designed non-adaptive mutation operators perform better than human designed mutation operators. In these experiments, we train mutation operators for each function class with two different terminal sets. In one experiment, the GP terminal set contains the adaptive factors (see Table 1) generated by EP; in the other, the terminal set excludes the adaptive factors (see Table 4). Both training processes use the same set of functions generated from the function classes. In the experiments, ADAM denotes an automatically designed mutation operator that includes adaptive factors. The tailored ADM automatically designed for function class \(F_{g}\) is denoted \(\hbox {ADM}_{{g}}\), where g is the function class index, and is called a dedicated ADM for \(F_{g}\). Likewise, the tailored ADAM automatically designed for function class \(F_{g}\) is denoted \(\hbox {ADAM}_{{g}}\) and is called a dedicated ADAM for \(F_{g}\).

In the experiments, where subtree crossover is applied, nodes are randomly selected from both parent trees and the related branches are exchanged, creating two offspring. One-point mutation with the grow initialisation method is applied in GP to generate new offspring [1, 38]. During training, the GP fitness value is derived from the fitness values of nine EP runs: each \(\hbox {ADM}_{{g}}\) or \(\hbox {ADAM}_{{g}}\) is used as the EP mutation operator on nine functions drawn from a given function class, and its fitness is the average of the best values obtained in the nine individual EP runs on those nine functions. We use the same nine functions from each function class for the entire GP run on a given function class. In total, for one function class, 18 functions are taken for training: nine are used to calculate the fitness value, and the other nine to monitor over-fitting.
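A sketch of how this fitness computation could look, under the assumption that a helper `run_ep` encapsulates one full EP run and returns its best value; the function and argument names are ours, not the paper's code.

```python
import numpy as np

def gp_fitness(candidate_operator, training_functions, run_ep):
    """GP fitness of one evolved mutation operator (illustrative sketch).

    candidate_operator : the GP-evolved random number generator
    training_functions : the nine functions drawn from one function class
    run_ep             : callable that runs EP once, using the candidate
                         operator on one function, and returns the best
                         value found over all EP generations
    """
    best_values = [run_ep(candidate_operator, f) for f in training_functions]
    return float(np.mean(best_values))    # average over the nine EP runs
```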

To make the experimental data more traceable and easier to compare, the function classes and the number of generations for each function class follow the settings of [9]. The EP parameters follow [11, 14, 23] and are presented in Table 2: the population size is set to 100, the tournament size to 10, and the initial standard deviation to 3.0. The settings for the dimension n and the domain S are listed in Table 6. To reduce the cost of training, the maximum number of EP generations was set to 1000 for \(F_{1}\)–\(F_{13}\) and \(F_{15}\), and to 100 for \(F_{14}\) and \(F_{16}\)–\(F_{23}\).

Table 2 Parameter settings for EP
Table 3 Parameter settings for GP
Table 4 Terminal set of GP without the adaptive factors
Table 5 Function set for GP

One terminal set includes the adaptive factors outlined in Table 1; the other, outlined in Table 4, excludes them. In both tables, N is a normal distribution whose \(\mu \) is a random number in \([-2, 2]\); this value may cause the designed mutation operator not to be Y-axis symmetric. \(\sigma ^2\) is a random number in [0, 5]. U is the uniform distribution on the range [0, 3]. In Table 1, GEN is the current EP generation index: an integer in the range [1, 1000] or [1, 100], depending on the function class selected. \(CUR\_MIN\_Y\) is the best fitness value of the individuals up to the current generation, and the framework records the following for a given EP individual (each individual is taken as a pair of real-valued vectors in EP) [23]: \(CUR\_MIN\_X\), the absolute minimum value of this individual; \(CUR\_MAX\_X\), the absolute maximum value of this individual; and \(CUR\_STD\_X\), the standard deviation of this individual. \(CUR\_MAX\_Y\) is the worst fitness value of the individuals in the current generation, and \(CUR\_STD\_Y\) is the standard deviation of all fitness values in the current generation. All of these values constitute useful information and change during the EP runs.
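The following sketch shows one plausible way to recompute these adaptive factors at each EP generation. The dictionary keys mirror Table 1, while the function itself and its arguments are our own illustration of the bookkeeping, not the authors' code.

```python
import numpy as np

def update_adaptive_factors(gen, population, fitnesses, best_so_far):
    """Recompute the Table 1 adaptive factors for one EP generation.

    population  : array of shape (mu, n), the real-valued vectors x_i
    fitnesses   : array of shape (mu,), f(x_i) for each individual
    best_so_far : best fitness value observed up to this generation
    """
    best = population[np.argmin(fitnesses)]      # best individual this generation
    return {
        "GEN":       gen,                        # current EP generation index
        "CUR_MIN_Y": min(best_so_far, fitnesses.min()),  # best fitness so far
        "CUR_MAX_Y": fitnesses.max(),            # worst fitness this generation
        "CUR_STD_Y": fitnesses.std(),            # spread of current fitnesses
        "CUR_MIN_X": np.abs(best).min(),         # abs. min component of best
        "CUR_MAX_X": np.abs(best).max(),         # abs. max component of best
        "CUR_STD_X": best.std(),                 # std of the best individual
    }
```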

The GP parameter settings are listed in Table 3. depthnodes is set to 2, indicating that restrictions are applied to the tree size (number of nodes) [36]. The GP function set is listed in Table 5. The other GP settings are a population size of 20 and a maximum of 25 generations. The algorithm framework to automatically design ADAMs is described in Algorithms 1 and 2.

Testing the \(\hbox {ADAM}_{{g}}\), \(\hbox {ADM}_{{g}}\), and human designed mutation operators

We use the ADAMs (Table 9), ADMs (Table 10), and human designed EP mutation operators, and test them on each function class \(F_{g}\). For each ADAM or ADM, we record 50 values from 50 independent EP runs, each being the lowest value over all EP generations, and then average them: this is called the mean best value. The testing results in Tables 7, 8, and 11 are based on the same 50 functions generated from each function class \(F_{g}\). Thus, in total, 68 functions are used across the training and testing stages.

Algorithm 1

The testing stage

In [11, 13, 17, 23], a suite of 23 benchmark functions is commonly used to test the different EP variants, where \(f_{1}\)–\(f_{7}\) are unimodal functions, \(f_{8}\)–\(f_{13}\) are multimodal functions with many local optima, and \(f_{14}\)–\(f_{23}\) are multimodal functions with a few local optima [11].

In this study, we built function classes based on these benchmark functions, following [7]. We use function classes instead of single functions for benchmark function optimisation; in other words, we specially design an ADAM for a class of functions.

The set of functions can be considered as a set of training instances of offline learning hyper-heuristics. The framework generates functions (instances) from the function class to train an ADAM or ADM. We then draw an independent set of functions (unseen instances) from the function class to test the ADAMs and ADMs produced from the training stage.

Based on the 23 functions, we constructed the 23 function classes in Table 6, with the index of each function class corresponding to the original function of [23]. In Table 6, \(a_{i}\), \(b_{i}\), and \(c_{i}\) are uniformly distributed in [1, 2], \([-1, 1]\), and \([-1, 1]\), respectively. The definition of a function class was proposed in [7, 9]. In this study, we use the symbol \(f_{g}\) for a function and \(F_{g}\) for a function class.

To observe the performance of \(\hbox {ADAM}_{{g}}\), \(\hbox {ADM}_{{g}}\), and the human designed mutation operators, we tested them on functions drawn from \(F_{g}\). The mean best values and standard deviations are listed in Table 7; the best mean values are in boldface, and each mean value is averaged over 50 runs. To ease comparison, we retain more decimal places for \(F_{6}\), \(F_{16}\), \(F_{17}\), \(F_{18}\), and \(F_{19}\).

Algorithm 2

Table 6 Function Classes with n dimensions and domain S, \(a_{i} \in [1 ,2]\), \(b_{i}, c_{i} \in [-1, 1]\)

We also performed the Wilcoxon signed-rank test for \(\hbox {ADAM}_{{g}}\) versus \(\hbox {ADM}_{{g}}\) and the Lévy distributions (\(\alpha \) = 1.0, 1.2, 1.4, 1.6, 1.8, 2.0); the results are given in Table 8. The Lévy distribution with \(\alpha \) = 1.0 is a Cauchy distribution, whereas \(\alpha \) = 2.0 yields a Gaussian distribution [14]. Owing to their diverse features, the different distributions perform differently on different benchmark functions.
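For reference, such a comparison can be run with SciPy's implementation of the test. The arrays below are placeholders standing in for the 50 paired best values (one per test function); only the call pattern is the point here.

```python
from scipy.stats import wilcoxon
import numpy as np

rng = np.random.default_rng(7)
# placeholder data: 50 paired best values, one per test function
adam_results = rng.normal(0.010, 0.002, 50)   # e.g. ADAM_g on F_g
adm_results  = rng.normal(0.012, 0.002, 50)   # e.g. ADM_g on the same functions

stat, p = wilcoxon(adam_results, adm_results)
significant = p < 0.05                        # the paper's 5% significance level
print(f"W={stat:.1f}, p={p:.4f}, significant={significant}")
```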

Exchange testing for \(\hbox {ADAM}_{{g}}\) and \(\hbox {ADM}_{{g}}\) on each function class

The exchange testing evaluates the performance of \(\hbox {ADAM}_{{g}}\) and \(\hbox {ADM}_{{g}}\) on all function classes. An \(\hbox {ADAM}_{{g}}\) designed for function class \(F_{g}\) is called a tailored adaptive mutation operator, while an \(\hbox {ADAM}_{{g}}\) tested on \(F_{j}\), with \(g \not = j\), is called a non-tailored adaptive mutation operator; likewise, an \(\hbox {ADM}_{{g}}\) designed for \(F_{g}\) is a tailored non-adaptive mutation operator, and an \(\hbox {ADM}_{{g}}\) tested on \(F_{j}\) is a non-tailored one. For example, \(\hbox {ADAM}_{{1}}\) is a tailored adaptive mutation operator for \(F_{1}\), but a non-tailored adaptive mutation operator for \(F_{2}\); \(\hbox {ADM}_{{1}}\) is a tailored non-adaptive mutation operator for \(F_{1}\), but a non-tailored non-adaptive mutation operator for \(F_{2}\). To observe the performance of both tailored (or dedicated) and non-tailored (or non-dedicated) \(\hbox {ADAM}_{{g}}\) and \(\hbox {ADM}_{{g}}\) on \(F_{j}\), we tested them on \(F_{j}\) over 50 runs, with a function generated from the function class in each run: 50 functions for \(\hbox {ADAM}_{{g}}\) and 50 for \(\hbox {ADM}_{{g}}\), 100 in total. The mean best values and standard deviations are given in Tables 11 and 12. We display more decimal places for \(F_{6}\), \(F_{16}\), \(F_{17}\), \(F_{18}\), and \(F_{19}\), as the results are otherwise too close to distinguish. The values of the tailored ADAM and ADM are in boldface, as are any values that are better than those of the tailored ADAM and ADM.

Table 7 Results of \(\hbox {ADAM}_{{g}}\), \(\hbox {ADM}_{{g}}\), and Lévy(1.0, 1.2, 1.4, 1.6, 1.8, 2.0) on the trained function classes \(F_{g}\). Mean indicates the mean of the best values found over all generations over 50 runs. Std Dev denotes standard deviations
Table 8 Wilcoxon signed-rank test of \(\hbox {ADAM}_{{g}}\) versus \(\hbox {ADM}_{{g}}\) and Lévy distribution (with \(\alpha \) = 1.0, 1.2, 1.4, 1.6, 1.8, 2.0)
Table 9 ADAMs discovered by GP
Table 10 ADMs discovered by GP

Analysis of the performance of \(\hbox {ADAM}_{{g}}\) and \(\hbox {ADM}_{{g}}\)

In this section, we compare \(\hbox {ADAM}_{{g}}\), \(\hbox {ADM}_{{g}}\), and the human designed mutation operators. From the experiment, both \(\hbox {ADAM}_{{g}}\) and \(\hbox {ADM}_{{g}}\) achieve better performance than the human designed mutation operators on \(F_{g}\), and \(\hbox {ADAM}_{{g}}\) achieves an outstanding performance on most of \(F_{g}\).

In Table 7, \(\hbox {ADAM}_{{7}}\), \(\hbox {ADAM}_{{12}}\), \(\hbox {ADAM}_{{19}}\), and \(\hbox {ADAM}_{{20}}\) are the exceptions: \(\hbox {ADM}_{{7}}\), \(\hbox {ADM}_{{12}}\), and \(\hbox {ADM}_{{20}}\) achieve better performance than \(\hbox {ADAM}_{{7}}\), \(\hbox {ADAM}_{{12}}\), and \(\hbox {ADAM}_{{20}}\), while \(\hbox {ADAM}_{{19}}\) and \(\hbox {ADM}_{{19}}\) perform the same. In these particular cases, the adaptive factors selected may not match \(F_{7}\), \(F_{12}\), \(F_{19}\), and \(F_{20}\) well. In the future, we will analyse and collect more adaptive factors for EP and import the new adaptive factors to design ADAMs for \(F_{7}\), \(F_{12}\), \(F_{19}\), and \(F_{20}\).

Table 11 The mean best values (averaged over 50 runs) and standard deviations (in the parentheses) for \(\hbox {ADAM}_{{g}}\) tested on different function classes, the fitness value of the \(\hbox {ADAM}_{{g}}\) is in bold, and those other fitness values that are better than the fitness value of the \(\hbox {ADAM}_{{g}}\) are also in bold
Table 12 The mean best values (averaged over 50 runs) and standard deviations (in parentheses) for \(\hbox {ADM}_{{g}}\) tested on different function classes, the fitness value of the \(\hbox {ADM}_{{g}}\) is in bold, and those other fitness values that are better than the fitness value of the \(\hbox {ADM}_{{g}}\) are also in bold

Table 8 lists the results of the Wilcoxon signed-rank test at the 5% significance level, comparing a tailored \(\hbox {ADAM}_{{g}}\) with \(\hbox {ADM}_{{g}}\) and with human designed mutation operators using Lévy distributions (\(\alpha \) = 1.0, 1.2, 1.4, 1.6, 1.8, 2.0). In this table, ‘\(\ge \)’ indicates that \(\hbox {ADAM}_{{g}}\) performs better on average on \(F_{g}\) than \(\hbox {ADM}_{{g}}\) or the human designed mutation operators, ‘>’ means that the difference is statistically significant, ‘\(=\)’ means that there is no difference, and ‘\(\le \)’ indicates that \(\hbox {ADAM}_{{g}}\) performs worse on \(F_{g}\) than \(\hbox {ADM}_{{g}}\). In the majority of cases, \(\hbox {ADAM}_{{g}}\) outperforms \(\hbox {ADM}_{{g}}\) and the human designed mutation operators.

Table 11 lists the experimental results of testing all ADAMs given in Table 9 on all function classes. ‘N/A’ means that the \(\hbox {ADAM}_{{g}}\) generated values out of the range of the machine’s representation on non-tailored function classes in certain generations. From this table, we find that tailored ADAMs achieve much better performance than non-tailored ADAMs on \(F_{j}\) in most cases. The reason is the following: ADAMs can make different jump sizes in different EP generations; however, non-tailored ADAMs usually do not fit the function classes, and once a non-tailored ADAM fails to search the entire space in the early EP generations, it falls into local optima with no ability to subsequently escape, as the values it generates are probably too small to allow individuals to jump out. In the future, we can search for smarter adaptive factors to design ADAMs that avoid falling into local optima early, or establish a mechanism to avoid this problem, such as occasionally importing long step size mutation operators in later EP generations.

The results in Tables 11 and 12 demonstrate that \(\hbox {ADAM}_{{1}}\), \(\hbox {ADAM}_{{3}}\), \(\hbox {ADAM}_{{4}}\), \(\hbox {ADAM}_{{5}}\), \(\hbox {ADAM}_{{6}}\), \(\hbox {ADAM}_{{8}}\), \(\hbox {ADAM}_{{9}}\), \(\hbox {ADAM}_{{10}}\), \(\hbox {ADAM}_{{11}}\), \(\hbox {ADAM}_{{14}}\), \(\hbox {ADAM}_{{16}}\), \(\hbox {ADAM}_{{17}}\), \(\hbox {ADAM}_{{18}}\), \(\hbox {ADAM}_{{19}}\), \(\hbox {ADAM}_{{21}}\), \(\hbox {ADAM}_{{22}}\), and \(\hbox {ADAM}_{{23}}\) achieve the best performance when acting as tailored operators, i.e. acting on their own \(F_{g}\). \(\hbox {ADAM}_{{2}}\), \(\hbox {ADAM}_{{7}}\), \(\hbox {ADAM}_{{12}}\), \(\hbox {ADAM}_{{13}}\), \(\hbox {ADAM}_{{15}}\), and \(\hbox {ADAM}_{{20}}\) achieve satisfactory, albeit not the best, performance.

Fig. 2
EP evolutionary process for the ADAM, ADM, Gaussian, Cauchy, and Lévy distributions with different values of \(\alpha \). The X-axis represents the EP generation, and the Y-axis represents the fitness value of EP

Fig. 3
EP evolutionary process for the ADAM, ADM, Gaussian, Cauchy, and Lévy distributions with different values of \(\alpha \). The X-axis represents the EP generation, and the Y-axis represents the fitness value of EP

Fig. 4
EP evolutionary process for the ADAM, ADM, Gaussian, Cauchy, and Lévy distributions with various values of \(\alpha \). The X-axis represents the EP generation, and the Y-axis represents the fitness value of EP

A large variation in a single mutation enables EP to globally search a wider area of the search space [14]. Researchers have attempted to design mutation strategies that generate larger variations in the early EP generations, especially when local optima have deep and wide basins, and smaller variations in the later generations. Alternatively, mutation strategies can generate smaller variations in the early EP generations and larger ones later, so that the search still has a chance to jump out of local optima in the later evolutionary process if it falls into them early. The ADAM can perform well on functions generated from function classes.

Table 13 Absolute minimum, absolute maximum, and mean absolute values generated by ADAMs for generations 1, 250, 500, 1000

In Figs. 2, 3 and 4, \(\hbox {ADAM}_{{3}}\), \(\hbox {ADAM}_{{4}}\), \(\hbox {ADAM}_{{11}}\), \(\hbox {ADAM}_{{13}}\), \(\hbox {ADAM}_{{21}}\), \(\hbox {ADAM}_{{22}}\), and \(\hbox {ADAM}_{{23}}\) achieve significant performance for EP. \(\hbox {ADAM}_{{2}}\), \(\hbox {ADAM}_{{6}}\), \(\hbox {ADAM}_{{12}}\), and \(\hbox {ADAM}_{{14}}\) do not perform well in the early EP generations; however, they accelerate the convergence rate in the later EP generations owing to the effectiveness of the adaptive factors. The experiments also demonstrate that these adaptive factors have a positive impact, especially in the later EP generations.

Let us analyse the ADAMs further. In Fig. 3, \(\hbox {ADAM}_{{12}}\) does not perform well in the early EP generations and shows the worst performance among all mutation operators there (where the generation index is less than approximately 500); this changes in the later generations. The adaptive factors in \(\hbox {ADAM}_{{12}}\) do not help EP establish efficient convergence in the early generations. The values in Table 13 show that the random numbers generated by \(\hbox {ADAM}_{{12}}\) are large in the early EP generations and much smaller in the later ones. Clearly, the fit of \(\hbox {ADAM}_{{12}}\) improves in the later EP generations, as the jump sizes the mutation operator makes become significantly more precise owing to the changing mutation operator. We also notice that the convergence curves of \(\hbox {ADM}_{{22}}\) appear as ‘punctuated equilibria’; \(F_{22}\) is a multimodal function class with a few local optima, and the evolutionary process makes no improvement in some generations.

For further discussion, consider the two components \(x^{g}(j)\), \(j = 1, 2\), of an individual at generation g, as below:

$$\begin{aligned} x^{g}(1) = x^{g-1}(1) + \beta Y_{1}^{g-1}, \end{aligned}$$
(3)
$$\begin{aligned} x^{g}(2) = x^{g-1}(2) + \beta Y_{2}^{g-1}. \end{aligned}$$
(4)

Then, we use each ADAM to generate the variants \(Y_{j}^{g-1}\), and plot the variables \(x^g(1)\) and \(x^g(2)\) in Figs. 5, 6 and 7 for each ADAM at four different generations: 1, 250, 500, 1000, or alternatively 1, 25, 50, 100. The figures demonstrate that the sizes of the jumps the mutation operator can make differ across EP generations. For example, in the early EP generations, the ADAMs search a wider space and can be seen as searching the space globally, whereas in the later EP generations they search a very small space. The ADAMs thus search widely different spaces in the early and later EP generations, and this feature leads to their outperformance over the ADMs and human designed mutation operators.
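A sketch of how such plots can be produced: accumulate 1000 steps per Eqs. (3) and (4) from the origin and scatter-plot the result. A standard Cauchy sampler stands in for a trained ADAM at a fixed generation, since the evolved operators themselves are not reproduced here.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
beta = 1.0

def trajectory(sample_step, n_points=1000):
    """Accumulate x^g(1), x^g(2) per Eqs. (3) and (4), from x^0 = (0, 0)."""
    steps = np.array([sample_step() for _ in range(n_points)])  # (Y_1, Y_2) pairs
    return np.cumsum(beta * steps, axis=0)

# a standard Cauchy sampler stands in for an ADAM at one generation
xy = trajectory(lambda: rng.standard_cauchy(2))
plt.scatter(xy[:, 0], xy[:, 1], s=2)
plt.xlabel("$x^g(1)$")
plt.ylabel("$x^g(2)$")
plt.show()
```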

We use the ADAMs to generate 3000 random values at different EP generations. The minimum, maximum, and mean values are given in Tables 13 and 14. The values generated by the ADAMs differ widely across generations: in most cases, the minimum and mean values are much smaller at large generation indices, and thus the ADAMs search a wider space in the early EP generations and a smaller space in the later ones.

Fig. 5
Plot of the 2-dimensional variables from the ADAMs at various generations (1, 250, 500, 1000). The original components are set to \(x^0(1) = x^0(2) = 0\), plotting 1000 points with \(\beta = 1\)

Fig. 6
Plot of the 2-dimensional variables from the ADAMs at various generations. The original components are set to \(x^0(1) = x^0(2) = 0\), plotting 1000 or 100 points with \(\beta = 1\)

Fig. 7
Plot of the 2-dimensional variables from the ADAMs at various generations (1, 25, 50, 100). The original components are set to \(x^0(1) = x^0(2) = 0\), plotting 100 points with \(\beta = 1\)

Table 14 Absolute minimum, absolute maximum, and mean absolute values generated by ADAMs in generations 1, 25, 50, 100

Conclusions and future work

One of the advantages of the proposed method is that it passes the responsibility for the design process of discovering ADAMs for EP from the human researcher to GP. Unlike [7,8,9], the mutation operators discovered by the method proposed in this paper are both ‘automatically’ designed and ‘adaptive’. GP sits at the hyper-level to generate heuristics as ADAMs for EP on function classes; in other words, it tailors an ADAM to a function class. In this work, we compare two types of automatically designed mutation operators (ADAMs with adaptive factors and ADMs without) as well as human designed mutation operators. Adaptive factors may capture the current status and present positions of individuals, or the landscape of the search space; mutation operators with these factors enable EP to search in a targeted way at different stages. One disadvantage of the framework is the computational cost of running the EP algorithm at the base level to evolve both ADAMs and ADMs. However, this is a one-off, up-front cost, and whether the trade-off is acceptable depends on the application, the required solution quality, and the number of times we expect to solve problems from the target class. A second disadvantage is the need for a number of training problems. In conventional optimisation, an algorithm is designed and applied to a single instance of a problem, and that is the entire optimisation process. In our case of automatic design, we employ a machine learning approach that requires separate sets of training and testing instances; in general, this requires a fair number of training instances, which may not be available.

A large variation in a single mutation enables EP to globally search a wider region of the search space [14]. Researchers have attempted to design mutation strategies that generate large variations in the early generations and small variations in the later EP generations, or small variations early and large variations later. The ADAM can perform well on functions generated from function classes. In Figs. 2, 3 and 4, \(\hbox {ADAM}_{{2}}\), \(\hbox {ADAM}_{{6}}\), \(\hbox {ADAM}_{{13}}\), and \(\hbox {ADAM}_{{23}}\) significantly underperform in the early EP generations, as these ADAMs are not well matched with the functions in those generations.

However, their convergence is faster in the later generations owing to the adaptive factors, which demonstrate great effectiveness in EP. The experimental data also demonstrate that the adaptive factors collected are useful and guide the mutation operator’s search from wider regions to smaller regions. According to Figs. 5, 6 and 7 and Tables 13 and 14, an ADAM behaves in one of two ways: (1) it searches a wider space in the early generations, which can be seen as searching the space globally, and then a very small space in the later generations; or (2) it searches a very small space early and a wider space later, as do \(\hbox {ADAM}_{{3}}\), \(\hbox {ADAM}_{{4}}\), and \(\hbox {ADAM}_{{22}}\). The first behaviour is expected; the second is more interesting. The search from narrow spaces to wider spaces is due to the change in the mutation operator, meaning that even if the search falls into local optima, the ADAM still has the ability to jump out of them.

In summary, the regions ADAMs search vary widely between early and later EP generations, and this feature leads to the superior performance of ADAMs over ADMs and human designed mutation operators. The random values generated by the ADAMs vary significantly across generations: in most cases, the minimum and mean values are much smaller at large generation numbers, and thus the ADAMs search wider spaces in the early EP generations and smaller spaces in the later ones.

In this study, we use an offline hyper-heuristic to automatically design ADAMs for specific function classes; previously, researchers used offline hyper-heuristics to tailor ADMs [8] to specific function classes. We conducted tests to evaluate the performance of the tailored ADAMs, the tailored ADMs, and human designed mutation operators on specific function classes. The primary conclusions of this study are as follows. First, a tailored ADAM designed by GP can make jumps in the search space whose sizes depend on the EP generation. Second, the tailored ADAMs outperformed the tailored ADMs and the human designed mutation operators on most function classes, and achieved the second best performance on a few function classes. Third, an ADAM tailored to a function class performs well on that class compared with ADAMs trained on different function classes.

In the future, several directions can be investigated. At this point, although the mutation operators are adaptive, the relationships among the adaptive factors are not considered in the GP design; for example, whether the minimum value and the standard deviation are connected in certain ways at different generations. It is possible to improve the framework and parameters to automatically design self-adaptive mutation operators. In the experimental tests, some ADAMs did not demonstrate outstanding performance, which suggests that the current adaptive factors may not fit those function classes well; thus, more adaptive factors could be incorporated and their effects investigated. Given the successful application of the work presented in this study, the idea could also be applied to other meta-heuristics, including particle Metropolis-Hastings [39, 40], water wave optimisation [41], differential evolution, and particle swarm optimisation.