Introduction

Convolutional neural networks (CNNs) are a representative class of neural network architectures. Owing to their powerful feature extraction capability, they have shown clear advantages on a variety of vision-related tasks, including image classification [1, 2], text detection [3], industrial data analysis [4], etc. In the past few years, researchers have conducted a variety of studies on CNNs, covering both the design of architectures (depth, width, etc.) and the enhancement of learning capabilities (feature extraction and exploitation, loss propagation, etc.). Representative CNNs include AlexNet [5], GoogLeNet [6], VGG [7], ResNet [8], and DenseNet [9], all of which have achieved remarkable results.

Although these CNNs have been designed with great success, constructing a CNN architecture is by no means an easy task. For machine learning practitioners, a CNN is like a finely crafted work of art: its mechanisms and parameters must be tuned continually until a good combination of components, a suitable regularization scheme, and well-optimized parameters are found. This process demands extensive prior knowledge and a huge amount of work, and many scholars have put considerable effort into designing even a single satisfactory network architecture [10].

Fortunately, the potential of evolutionary algorithms (EAs) for neural architecture search (ENAS) has attracted considerable attention in recent years and has yielded very promising results [11, 12]. Systematic reviews of ENAS can be found in [13, 14], where several representative algorithms are introduced, such as Genetic CNN and Evo-CNN (CNN architecture search based on genetic algorithms) [15, 16], EAS and Meta-QNN (CNN architecture search based on Q-learning) [17, 18], Large-scale Evolution [19], CGP-CNN (CNN architecture search based on Cartesian genetic programming) [20], NAS (CNN architecture search based on reinforcement learning) [21], and CNN-GA and AE-CNN (CNN architecture search based on genetic algorithms and block encoding) [22, 23]. Experimental results show that these methods perform well in searching for optimal network structures and have achieved excellent results, but several important limitations remain. First, some of these algorithms still require considerable expertise; e.g., the EAS algorithm starts from a primary network, and selecting that primary network still requires substantial empirical knowledge. Second, the performance of some ENAS algorithms is highly dependent on computational resources; e.g., NAS takes 28 days to train on CIFAR10 even with 800 GPUs. Third, some of these methods, such as Genetic CNN, employ an encoding strategy with a fixed length or a fixed width, which means that the depth or width of the CNN is fixed. Since the performance of CNNs depends heavily on their depth and width, it is desirable that the architecture of CNNs be flexible and versatile enough to generalize well across different tasks.

The purpose of this paper is to design an algorithm that can autonomously evolve neural networks for image classification and overcome the limitations of the existing methods described above. In view of the excellent performance of genetic programming in many practical applications, the directed-graph representation of Cartesian genetic programming (CGP) [24] is adopted in this paper to encode CNN architectures. The main contributions of this paper are as follows:

  (1) A flexible coding strategy of variable length and width is proposed based on CGP, together with 22 alternative function blocks. The tree-like structure of CGP, with few structural constraints, can represent the topology of CNNs well, while the efficient function blocks make the algorithm more efficient and expand the search space, providing more opportunities to find high-quality architectures.

  (2) A multi-objective genetic programming algorithm with a leader–follower mechanism is designed, where an external archive of non-dominated solutions and an elite population act as the leader and the follower, respectively. This mechanism speeds up convergence while largely preventing the algorithm from getting trapped in local optima.

  (3) The effectiveness of the proposed algorithm is validated on eight benchmark datasets that are widely adopted in image classification tasks and on a real-world industrial dataset. Computational results illustrate the superior performance of the proposed algorithm over various state-of-the-art algorithms.

The rest of this paper is organized as follows. The next section introduces the background and related work on the design of CNNs. The details of the proposed LF-MOGP are then presented, followed by the experiment design used to evaluate its performance. After that, the experimental results and analysis are given. The penultimate section presents a case study in which LF-MOGP is applied to a real-world industrial problem of slab number recognition. Finally, conclusions and future work are presented.

Background

Cartesian genetic programming algorithm

As an evolutionary computation technique, genetic programming (GP) can automatically evolve models to solve real-world problems based on the principles of evolution and natural selection in the biological world [25]. It is well known for its flexibility and high interpretability compared to other evolutionary algorithms and is widely adopted in image classification, scheduling, and regression tasks [26]. As a highly representative branch of GP, CGP can flexibly encode various computing structures while avoiding the bloat problem of GP [27,28,29]. It achieves high robustness and generalization due to its remarkable self-organization, self-learning, and self-adaptive properties, and is more effective than other GP variants on many complex problems such as parameter optimization, scheduling, resource allocation, and complex network analysis [30, 31].

Fig. 1 General form of Cartesian GP

The general form of CGP is shown in Fig. 1, represented by a directed graph with indexed nodes. In this form, there are n inputs and m outputs, and the outputs are obtained from the nodes of the last column. The size of the directed graph shown in Fig. 1 is \( n \times c\). Nodes within the same column cannot be connected to each other, and connections between columns are restricted by the level-back parameter (e.g., level-back = 3 means that a node in the \(i\)th column can connect back at most to column \(i-3\)). The red dashed path 'a-b-c-d' represents a simple neural network with five nodes.
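To make this representation concrete, the following minimal Python sketch samples a random Cartesian genotype under the level-back constraint described above. The function name, the convention that program inputs are always reachable, and the parameter values are illustrative assumptions rather than the exact implementation used in this paper.

```python
import random

def sample_cgp_genotype(n_inputs, n_rows, n_cols, n_funcs, level_back):
    """Sample a random CGP genotype on an n_rows x n_cols grid.

    Each node stores (function_id, input_a, input_b). A node in column i
    may only connect to the program inputs or to nodes in the previous
    `level_back` columns, never to nodes in its own column.
    """
    genotype = []
    for col in range(n_cols):
        first_allowed_col = max(0, col - level_back)
        allowed = list(range(n_inputs))  # assume program inputs are always reachable
        for c in range(first_allowed_col, col):
            for r in range(n_rows):
                allowed.append(n_inputs + c * n_rows + r)
        for _ in range(n_rows):
            func_id = random.randrange(n_funcs)
            in_a, in_b = random.choice(allowed), random.choice(allowed)
            genotype.append((func_id, in_a, in_b))
    return genotype

# Example: 2 inputs, 3 rows, 5 columns, 10 candidate functions, level-back = 3
print(sample_cgp_genotype(2, 3, 5, 10, 3)[:5])
```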

ResNet block

The creation of ResNet is a landmark in the development of CNNs, and it has made extraordinary contributions to mitigating gradient vanishing and to feature reuse. The success of ResNet is mainly attributed to the design of its building blocks, especially the shortcut connections. During forward propagation, the shortcut structure allows the input signal to propagate directly from a lower layer to a higher layer, and the loss gradient is not attenuated by any intermediate weight matrix. For the shortcut, when the input and output dimensions are the same, the corresponding feature maps can be added directly without any other operation; if the two dimensions are inconsistent, the smaller one is expanded by zero-padding or a \(1 \times 1\) convolution to make the input and output dimensions consistent. Figure 2 shows a typical ResNet block consisting of three convolution layers and a skip connection, where the curved line illustrates how the loss propagates across the skip connection.

Fig. 2 Examples of ResNet
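As a concrete illustration, a minimal PyTorch sketch of such a residual block is given below. It is not the exact block of Fig. 2 (which uses three convolution layers, while this sketch stacks two), and the hyperparameter values are illustrative; it only shows the conv-BN-ReLU body plus a shortcut that applies a \(1 \times 1\) convolution when the channel counts differ.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Minimal residual block: conv-BN-ReLU stack plus a shortcut connection.

    When the input and output channel counts differ, the shortcut uses a
    1x1 convolution to match dimensions before the element-wise addition.
    """
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2  # "same" padding for stride-1 convolutions
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size, padding=pad, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 convolution on the shortcut only when the dimensions mismatch
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, 1, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))

x = torch.randn(1, 32, 28, 28)
print(ResBlock(32, 64)(x).shape)  # torch.Size([1, 64, 28, 28])
```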

Proposed algorithm

Algorithm overview with leader-and-follower mechanism

[Algorithm 1]

The framework of the proposed LF-MOGP is shown in Algorithm 1, which consists of the following steps. In the beginning, items such as \(\mathbb {E}\), \(\mathbb {F}_{t}\), and \(\mathbb {P}\) are initialized, where the initialization of the population based on the basic function blocks is explained in detail in Algorithm 2, and the basic blocks designed in our algorithm are described in Sect. 3.2.

Then, the CNN corresponding to each solution in the population is constructed and evaluated separately. Two evaluation metrics are used: maximizing the classification accuracy (Acc) and minimizing the complexity of the model; their detailed descriptions can be found in Sect. 3.6.

Fig. 3 Illustrative diagram of the leader–follower mechanism

After that, the non-dominated relationships among all solutions are compared, and the leader–follower mechanism comes into play; its flow diagram is shown in Fig. 3 (the red line represents the leader \(\mathbb {E}\), the yellow circles represent the follower \(\mathbb {F}_{t}\), and the black arrows represent updates of the solution sets). Specifically, the non-dominated solutions are stored in \(\mathbb {E}\), which acts as the leader during evolution. From the remaining solutions in \(\mathbb {P}\), K elite solutions are selected based on the crowding distance and stored in \(\mathbb {F}_{t}\), which acts as the follower. In the early stage of the algorithm, the non-dominated solutions in the external archive \(\mathbb {E}\) lead the search; that is, the parent solutions are mainly selected from \(\mathbb {E}\), which helps to achieve a fast convergence speed. In the later stage, the parent solutions are selected from both the leader \(\mathbb {E}\) and the follower \(\mathbb {F}_{t}\), which improves search diversity and thus prevents the algorithm from falling into local optima. In the main loop of LF-MOGP, the leader and the follower are updated iteratively and collaborate to optimize the two metrics, i.e., higher accuracy and lower model complexity, as shown in the third graph of Fig. 3, where the front migrates toward the upper right corner.

For the generation of new solutions, both mutation and crossover operators are designed based on the characteristics of CGP encoding. The leader–follower mechanism allows the evolutionary process to focus on non-dominated solutions rather than the whole population in the early stage, which greatly reduces the required computational resources and speeds up convergence. The newly generated solutions are also used to update \(\mathbb {F}_{t}\), allowing \(\mathbb {F}_{t}\) to follow the changes of \(\mathbb {E}\), which improves the diversity of individuals in the later stage of the algorithm. In this way, the leader–follower mechanism achieves a better balance between exploration and exploitation while reducing computational cost.
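The following sketch illustrates one way such a leader–follower parent selection could be organized; the linear probability schedule and the function names are assumptions made for illustration and are not taken from Algorithm 1.

```python
import random

def select_parents(archive, followers, gen, max_gen, n_parents=2):
    """Pick parents under a leader-follower scheme (illustrative sketch).

    Early in the run, parents come almost exclusively from the external
    archive of non-dominated solutions (the leader); later, the elite
    follower population is mixed in to preserve search diversity.
    """
    # probability of drawing from the leader decays with the generation
    p_leader = 1.0 - 0.5 * (gen / max_gen)   # assumed schedule, not from the paper
    parents = []
    for _ in range(n_parents):
        pool = archive if random.random() < p_leader or not followers else followers
        parents.append(random.choice(pool))
    return parents

# toy usage: solutions are just identifiers here
leader, follower = ['a', 'b', 'c'], ['d', 'e']
print(select_parents(leader, follower, gen=5, max_gen=100))
print(select_parents(leader, follower, gen=95, max_gen=100))
```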

Encoding and decoding strategy

Since the optimal structure of a CNN (including its width and depth) is unknown when dealing with a specific problem, the encoding strategy must allow the depth and width of the CNN to vary, so that the proposed LF-MOGP algorithm has the opportunity to find the optimal structure without the search space being limited. CGP has exactly this flexibility, and each node function in CGP can easily be replaced with an efficient function block, which makes CGP a good representation of CNN structures. In addition, the CGP-based encoding strategy places fewer restrictions on the crossover and mutation operations: it does not impose the traditional constraints on the length or width of the parent individuals, which further expands the search space and increases the possibility of finding the optimal CNN structure.

As mentioned above, each node in CGP is replaced with an efficient function block. To increase the efficiency of the algorithm, the ResNet block mentioned in Sect. 2.2 is introduced as a basic block. Besides, four other types of function blocks, namely ConvBlock, Pooling, Concat, and Sum, are also designed. These blocks contain several sub-blocks depending on their internal parameter settings. Table 1 shows the specific parameter design of each function block and its corresponding sub-blocks.

These function blocks follow the naming rules below. Taking ConvBlocks as an example, their names range from C1 to C9, with the number and size of the convolution kernels increasing accordingly; that is, the nine ConvBlocks CB_32_1, CB_32_3, CB_32_5, CB_64_1, CB_64_3, CB_64_5, CB_128_1, CB_128_3, and CB_128_5 are denoted as C1, C2, ..., C9 in that order. ResNet blocks are named in a similar way. Average and max pooling are denoted by P1 and P2, respectively. The names of the other function blocks are consistent with their symbols.

Table 1 Detailed settings of the alternate blocks

ConvBlock is used for feature extraction. The parameters of the convolution are as follows: the kernel size is chosen from {\(1 \times 1\), \(3 \times 3\), \(5 \times 5\)}, the stride is set to 1, and the input is zero-padded before the convolution. To alleviate gradient dispersion during training, batch normalization is performed after each convolution.

The ResNet block uses standard convolutions. The kernel size is selected from {\(1 \times 1\), \(3 \times 3\), \(5 \times 5\)}, with padding equal to half the kernel size. After each convolution, batch normalization and the ReLU function are applied. The specific form of the ResNet block is shown in Fig. 2 above.

Pooling includes max pooling and average pooling. The filter size is set to \(2 \times 2\), and the stride is set to 2. Since pooling downscales the features, the number of pooling layers is bounded by the input image size \(d \times d\): the maximum number of pooling blocks is \(\log_2 d\).

Concat is designed to merge feature maps at the channel level. If the two feature maps to be concatenated have the same number of rows and columns, they are merged directly along the channel dimension; otherwise, the larger feature map is down-sampled by max pooling so that the two maps have the same size. The final feature map \(\mathcal {F}\) can then be represented as follows:

$$\begin{aligned} \mathcal {F}=\min (M_1, M_2) \times \min (N_1, N_2) \times (C_1 + C_2), \end{aligned}$$
(1)

where \(M_1 \times N_1 \times C_1\) and \(M_2 \times N_2 \times C_2\) are the dimensions of the two input feature maps.

Sum is designed to merge feature maps at the pixel level. Similar to the Concat block, if the two feature maps to be summed differ in the number of rows or columns, the larger one is down-sampled by max pooling so that the two maps have the same size. In addition, if the two feature maps have different numbers of channels, the one with fewer channels is expanded with a \(1\times 1\) convolution so that the two have the same dimension. The output feature map \(\mathcal {F}\) can be represented as follows:

$$\begin{aligned} \mathcal {F}=\min (M_1, M_2) \times \min (N_1, N_2) \times \max (C_1,C_2), \end{aligned}$$
(2)

where \(M_1 \times N_1 \times C_1\) and \(M_2 \times N_2 \times C_2\) are the dimensions of the two input feature maps.
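The two merge blocks can be illustrated with the following PyTorch sketch, which reproduces the shape arithmetic of Eqs. (1) and (2): max pooling aligns mismatched spatial sizes, and for Sum a \(1 \times 1\) convolution expands the smaller channel count. The helper assumes square feature maps whose sizes differ by powers of two, and the \(1 \times 1\) convolution is created on the fly purely to demonstrate shapes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def match_spatial(a, b):
    """Max-pool the spatially larger feature map until both maps have the
    same height/width (assumes square maps whose sizes differ by powers of 2)."""
    while a.shape[-1] > b.shape[-1]:
        a = F.max_pool2d(a, 2)
    while b.shape[-1] > a.shape[-1]:
        b = F.max_pool2d(b, 2)
    return a, b

def concat_block(a, b):
    a, b = match_spatial(a, b)
    return torch.cat([a, b], dim=1)            # channels add up: C1 + C2

def sum_block(a, b):
    a, b = match_spatial(a, b)
    if a.shape[1] != b.shape[1]:               # expand the smaller channel count
        small, large = (a, b) if a.shape[1] < b.shape[1] else (b, a)
        expand = nn.Conv2d(small.shape[1], large.shape[1], kernel_size=1)
        a, b = expand(small), large            # untrained 1x1 conv, shapes only
    return a + b                               # channels: max(C1, C2)

a = torch.randn(1, 32, 16, 16)
b = torch.randn(1, 64, 8, 8)
print(concat_block(a, b).shape)  # torch.Size([1, 96, 8, 8])
print(sum_block(a, b).shape)     # torch.Size([1, 64, 8, 8])
```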

Fig. 4 Illustrative diagram of the encoding and decoding strategy

Figure 4 illustrates how the genotype of a solution is decoded into its corresponding CNN architecture. As shown in Fig. 4, a solution consists of three items: \({node\_id}\), \({is\_active}\), and gene. Specifically, the gene of the first node is (C4, 0, 0); according to the naming rules above, C4 indicates that a ConvBlock is selected with 64 convolution kernels of size \(1 \times 1\). The blocks and their symbols are listed in Table 1. The pooling block in the dotted box of the corresponding CNN architecture below does not actually take effect, because the sixth value of \({is\_active}\) is False.
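The decoding step can be summarized by the following sketch; only the gene (C4, 0, 0) of the first node is taken from the text, and the remaining genotype entries are invented for illustration.

```python
# Illustrative genotype in the style of Fig. 4: each node carries a gene
# (block symbol, first input node, second input node) and an is_active flag.
genotype = [
    {"node_id": 0, "is_active": True,  "gene": ("C4", 0, 0)},   # ConvBlock, 64 kernels of 1x1
    {"node_id": 1, "is_active": True,  "gene": ("R2", 0, 0)},   # illustrative ResNet block
    {"node_id": 2, "is_active": True,  "gene": ("P1", 1, 1)},   # average pooling
    {"node_id": 3, "is_active": False, "gene": ("P2", 2, 2)},   # inactive node: skipped
    {"node_id": 4, "is_active": True,  "gene": ("Sum", 2, 1)},
]

def decode(genotype):
    """Keep only the active nodes; each entry states which block to build
    and which earlier nodes feed it."""
    layers = []
    for node in genotype:
        if not node["is_active"]:
            continue                      # inactive nodes do not appear in the CNN
        symbol, in_a, in_b = node["gene"]
        layers.append((node["node_id"], symbol, (in_a, in_b)))
    return layers

for node_id, symbol, inputs in decode(genotype):
    print(f"node {node_id}: {symbol} <- {inputs}")
```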

Population initialization

The initialization procedure of the population is illustrated in Algorithm 2, where each solution is produced according to the employed encoding and decoding strategy.

[Algorithm 2]

The generation of a solution consists of the following steps. First, the parameters and an empty solution are initialized. Next, the type and connections of each block of the solution are selected. Note that, since pooling is a down-sampling operation, the number of pooling blocks must be less than \(\log _2d\) [20]; otherwise, b must be re-selected from the remaining blocks excluding the pooling ones. Furthermore, each block can only be connected to nodes that precede it, i.e., both \(c_m\) and \(c_n\) must be less than j. When the genotypes of all k nodes are determined, they are encoded into the corresponding blocks according to the encoding strategy described in Sect. 3.2. Finally, the solution \(S_i\) corresponding to a CNN network is obtained by combining the gene and \(is\_active\) items.
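The constraints described above can be illustrated as follows; the block symbols follow the naming rules of Table 1 (assuming nine ResNet blocks R1–R9, which reproduces the 22-block count), and the convention of using index -1 for the network input is an assumption made for this sketch.

```python
import math
import random

BLOCKS = ([f"C{i}" for i in range(1, 10)] + [f"R{i}" for i in range(1, 10)]
          + ["P1", "P2", "Concat", "Sum"])          # 22 symbols, as in Table 1
POOLING = {"P1", "P2"}

def init_solution(n_nodes, image_size):
    """Randomly build one solution: for every node pick a block type and two
    connections to earlier nodes, keeping the pooling count below log2(d)."""
    max_pools = int(math.log2(image_size))
    n_pools = 0
    solution = []
    for j in range(n_nodes):
        b = random.choice(BLOCKS)
        while b in POOLING and n_pools >= max_pools:
            b = random.choice([blk for blk in BLOCKS if blk not in POOLING])
        if b in POOLING:
            n_pools += 1
        # both connections must point to earlier nodes (index -1 is the input)
        c_m = random.randrange(-1, j) if j > 0 else -1
        c_n = random.randrange(-1, j) if j > 0 else -1
        solution.append({"gene": (b, c_m, c_n), "is_active": True})
    return solution

print(init_solution(n_nodes=6, image_size=32))
```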

Fig. 5 Illustration of the mutation process implementation

Mutation

The mutation operation in our method includes three types: Adding, Removing, and Modifying. The detailed process is presented in Algorithm 3 and consists of the following steps. First, the number of nodes in the parent individual p is counted and one position is randomly selected for mutation. Second, a mutation type is randomly chosen from the three alternatives and applied. Note that mutation may cause a dimension mismatch in the offspring. To handle this case, a dimension-matching discrimination and repair module is proposed to determine whether the dimensions or image sizes mismatch in the mutated offspring; if so, a \(1 \times 1\) convolution is adopted to make the two dimensions consistent, and max pooling is used to down-sample the larger input so that the two inputs have the same size. Finally, the mutated offspring q is obtained and returned.
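A simplified sketch of the three mutation types on a list-of-nodes genotype is given below; it omits the dimension-matching and repair step of Algorithm 3, and details such as how a newly added node is wired are illustrative assumptions.

```python
import copy
import random

def mutate(parent, blocks):
    """Apply one of the three mutation types (add / remove / modify) at a
    random position. The dimension-repair step of Algorithm 3 is omitted."""
    child = copy.deepcopy(parent)
    pos = random.randrange(len(child))
    op = random.choice(["add", "remove", "modify"])
    if op == "add":
        b = random.choice(blocks)
        child.insert(pos, {"gene": (b, pos - 1, pos - 1), "is_active": True})
    elif op == "remove" and len(child) > 1:
        child.pop(pos)
    else:  # modify either the block configuration or one connection
        b, c_m, c_n = child[pos]["gene"]
        if random.random() < 0.5:
            b = random.choice(blocks)                          # change block type/config
        else:
            c_m = random.randrange(-1, pos) if pos > 0 else -1 # rewire a connection
        child[pos]["gene"] = (b, c_m, c_n)
    return child

parent = [{"gene": ("C2", -1, -1), "is_active": True},
          {"gene": ("P1", 0, 0), "is_active": True}]
print(mutate(parent, ["C1", "C2", "R1", "P1", "Sum"]))
```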

[Algorithm 3]

Figure 5 illustrates the implementation of the mutation process, where the third mutation strategy, Modifying, includes two options: modifying the connection nodes of a block and modifying the configuration of a block (the size or number of convolution kernels, etc.).

Crossover

Unlike traditional crossover operations, which require the two parent individuals to have the same length, in our algorithm the two parents may have different lengths. With fewer constraints, a larger search space is obtained. The detailed procedure of the crossover operation is presented in Algorithm 4.

[Algorithm 4]

Specifically, the crossover operation consists of the following steps. First, a single-point crossover is performed at two randomly chosen positions, one in each parent. After restructuring, new offspring are preliminarily generated; a simple example of the crossover process is presented in Fig. 6. Next, dimension-matching discrimination is carried out, and if necessary the dimensions are repaired in a similar way as in the mutation operator. Finally, the offspring solutions generated by the crossover operation are obtained.
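The variable-length single-point crossover can be sketched as follows, with the genotypes reduced to plain sequences; the subsequent dimension-matching and repair step is omitted.

```python
import random

def crossover(p1, p2):
    """Single-point crossover with independent cut points, so the two
    parents (and therefore the offspring) may have different lengths."""
    cut1 = random.randrange(1, len(p1))
    cut2 = random.randrange(1, len(p2))
    child1 = p1[:cut1] + p2[cut2:]
    child2 = p2[:cut2] + p1[cut1:]
    return child1, child2

a = list("ABCDEFG")   # genotypes stand in as simple sequences here
b = list("12345")
print(crossover(a, b))
```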

Fig. 6 Illustration of the crossover process implementation

Fitness evaluation

The evaluation of a solution involves two conflicting objectives: the classification accuracy (Acc) and the complexity of the corresponding CNN. The classification accuracy is calculated as the ratio of the number of correctly classified images to the total number of images, which is widely adopted in image classification tasks. The complexity is measured by the number of trainable parameters of the corresponding CNN, as done by Sun et al. [22]. The purpose of choosing these two metrics is to obtain CNN architectures with high accuracy but low complexity.
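In PyTorch terms, the two objectives can be computed as in the following sketch; the function name and data-loader interface are illustrative.

```python
import torch

def evaluate(model, loader, device="cpu"):
    """Return the two objectives: validation accuracy (to maximize) and the
    number of trainable parameters (to minimize)."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    acc = correct / total
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return acc, n_params
```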

Experiment design

Benchmark datasets

In the experiments, eight benchmark datasets that are widely used for image classification tasks are adopted to evaluate the performance of the proposed LF-MOGP. These benchmark datasets include CIFAR10, CIFAR100, Fashion, MB, MBI, MRB, MRD, and MRDBI. CIFAR10 and CIFAR100 contain color images of objects such as cars and boats; the difference between them is that CIFAR100 covers 100 object categories, and each image in CIFAR100 carries a fine label in addition to its superclass label. Fashion contains grayscale images of fashion items such as coats and shirts. MNIST and its variants are established for the classification of the ten hand-written digits (i.e., 0-9). The variants MBI, MRB, MRD, and MRDBI add different obstacles to MB (e.g., rotation, random noise, background images), which significantly increase the difficulty of the classification tasks. Examples of these benchmark datasets are shown in Figs. 7, 8, and 9, respectively. A more detailed description of these datasets is provided in Table 2.

Table 2 Summary of benchmark datasets used in the experiments
Fig. 7 Examples from the CIFAR10 and CIFAR100 benchmark datasets

Fig. 8 Examples from the Fashion benchmark dataset

Fig. 9 Examples from the MNIST benchmark dataset and its variants

Experimental setting

The LF-MOGP algorithm is implemented in PyTorch, and all experiments are carried out on a personal computer with one GeForce RTX 3090 GPU, an Intel(R) Xeon(R) Silver 4110 CPU, and 32 GB of RAM. The relevant experimental details are as follows. The maximum numbers of rows and columns, the level-back, and the minimum and maximum numbers of active nodes allowed in the neural networks are set to 5, 30, 10, 7, and 30, respectively, based on preliminary experiments. Additionally, 22 alternative basic blocks are designed, as shown in Table 1. The population size of \(\mathbb {P}\) is set to 30, and the maximum number of generations is set to 100. The maximum number of training epochs is set to 100. The learning rate is initially set to 1e-3 and is adjusted by a cosine schedule with a period of 80 epochs. Adam, with betas set to (0.9, 0.999), is chosen as the optimizer.
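For reference, the stated training configuration corresponds to a PyTorch setup of the following form; the placeholder model and the skeleton loop are illustrative, and only the optimizer and scheduler settings reflect the values given above.

```python
import torch
import torch.nn as nn

# placeholder model; in LF-MOGP this is the CNN decoded from a solution
model = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=80)

for epoch in range(100):
    # ... one training pass over the data would go here ...
    optimizer.step()       # placeholder; normally called per batch after backward()
    scheduler.step()       # cosine adjustment of the learning rate per epoch
```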

Experiment results

Overall results

To verify the effectiveness of the proposed LF-MOGP algorithm, a series of comparison experiments is conducted against 36 powerful competitor algorithms, including state-of-the-art ones.

Since re-implementations of the competitor algorithms may not achieve the same performance reported in the original papers, to make a fair comparison we collected the experimental results of these algorithms from the original papers for each dataset. Furthermore, the competitor algorithms were often evaluated on different datasets, so different competitors may be chosen for different datasets. More specifically, the best classification results of the proposed LF-MOGP and its competitors are shown in Tables 3, 4, and 5, which correspond to MNIST and its variants, Fashion, and CIFAR, respectively. Note that all results for the competitors are those reported in their papers, where the symbol '−' indicates that no result was publicly reported by the corresponding peer competitor.

In Table 5, not only the classification error but also the number of trainable parameters and 'GPU Days' are reported, to evaluate both accuracy and complexity. Note that 'GPU Days' is only a rough indicator of computational cost, because the performance of different GPUs differs; the specific experimental environment of each algorithm is given in the last row of Table 5. Specifically, if an algorithm employs 3 GPUs and runs for 7 days, the corresponding 'GPU Days' is 21. The classification errors and training epochs are given in Table 4, while Table 3 only gives classification errors, because the competitors reported only the best classification errors of their algorithms and no other relevant results.

A. MNIST and its variants

As shown in Table 3, in terms of the best classification performance, LF-MOGP outperforms all compared competitors on MNIST and its variants, except for achieving the third-best performance on the MRD dataset. More specifically, the best classification errors for MB, MRB, MBI, and MRDBI obtained by the existing ENAS algorithms are 0.79%, 2.44%, 4.06%, and 17.92%, respectively, while LF-MOGP further reduces them to 0.52%, 2.41%, 3.08%, and 14.98%. In particular, on the MB dataset LF-MOGP achieves a classification accuracy of nearly 99.5%. Moreover, on MRDBI, the most difficult of these classification tasks, the best result among the compared algorithms is 17.92%, obtained by SEECNN, whereas LF-MOGP reduces the classification error to 14.98%, which demonstrates the advantage of LF-MOGP in dealing with complex classification tasks. Compared with the two GP-based algorithms (IEGP and FGP), the proposed LF-MOGP also shows a clear advantage, with significantly lower classification errors on all datasets for which results are available.

Table 3 The best classification error rates of LF-MOGP and its competitors on MNIST and its variants

B. Fashion

Nine peer competitors, including four methods (2C1P2F, 2C1P, 3C2F, 3C1P2F+Dropout) collected from the website of the Fashion dataset (https://github.com/zalandoresearch/fashion-mnist), are adopted here to evaluate the performance of LF-MOGP, and the statistical results are shown in Table 4. As shown in Table 4, LF-MOGP obtains the second-best classification error; the lowest classification error among all competitors, 3.09%, is obtained by Fine-Tuning DARTS. However, it is worth pointing out that Fine-Tuning DARTS is a handcrafted model in which the cutout and random erasing data augmentation techniques were adopted during training. Besides, Fine-Tuning DARTS is a DARTS-based fine-tuning algorithm, whereas the proposed LF-MOGP uses no additional data augmentation and is trained from scratch rather than fine-tuned. Apart from Fine-Tuning DARTS, LF-MOGP reduces the two best classification errors, obtained by VGG16 and GoogleNet, by 2.52% and 2.72%, respectively. Compared to the three ENAS algorithms (EvoCNN, SEECNN, and FPSO), LF-MOGP achieves the highest classification accuracy without significantly increasing the number of parameters. Compared with FPSO, the number of parameters of our model increases by 0.12M, but the accuracy is higher. Compared to EvoCNN, LF-MOGP reduces the classification error by 1.68% while using 1.24M fewer parameters, which is very promising, and it reduces the classification error by 1.6% compared to SEECNN.

Table 4 The best classification error rates, parameters, and training epochs of LF-MOGP and its competitors on the Fashion dataset

C. CIFAR10 and CIFAR100

The comparison results of LF-MOGP against its competitors are presented in Table 5. For CIFAR10, the classification error of LF-MOGP is lower than that of all the handcrafted models. Besides, the best CNN evolved by LF-MOGP has fewer parameters, which demonstrates the significant superiority of the proposed algorithm over the handcrafted models. Compared to the ENAS methods on CIFAR10, the proposed LF-MOGP is not inferior, except that its classification error is 0.72% higher than that of PNAS, while its running time is only 4% of that of PNAS.

Regarding GPU Days and parameters, the best results among the ENAS algorithms are 1.65 GPU Days and 0.7M parameters, obtained by FPSO. However, its accuracy is not very promising, as its classification error is 2.15% higher than that of our LF-MOGP.

Apart from PNAS, the five ENAS algorithms with the smallest classification errors are CNN-GA, Genetic CNN, Large-scale Evolution, EvoCNN, and NATS-Bench, with errors of 4.78%, 5.01%, 5.40%, 5.47%, and 5.63%, respectively. Compared with them, the proposed LF-MOGP further reduces the classification error by 0.65%, 0.88%, 1.27%, 1.34%, and 1.50%, respectively. The CNN evolved by LF-MOGP is also more lightweight, with only 1.07M parameters, which is conducive to extending the algorithm to practical applications. Moreover, LF-MOGP has a significant advantage in terms of 'GPU Days': among the ENAS algorithms above, it takes the shortest running time to search for the optimal CNN architecture except for FPSO.

For CIFAR100, the best error obtained by LF-MOGP is about 26.37%. Compared with the handcrafted algorithms, LF-MOGP outperforms all of them except DenseNet. More specifically, although the classification error of LF-MOGP is slightly higher than that of DenseNet, the number of parameters of the model obtained by LF-MOGP is only one-third of that of DenseNet. Compared with the ENAS algorithms, the classification error of LF-MOGP is slightly inferior to those of CNN-GA, Large-scale Evolution, and Genetic CNN; however, the number of parameters of our evolved model is much smaller than those obtained by these three algorithms. Moreover, LF-MOGP reduces the error by 0.77%, 0.74%, and 0.12% compared to the ME-HDSS, MetaQNN, and NATS-Bench algorithms, respectively. It can thus be seen that LF-MOGP remains very competitive with these state-of-the-art algorithms on the CIFAR100 dataset.

Table 5 The best classification error rates, parameters, GPU Days, and experimental environment of LF-MOGP and its competitors on CIFAR10 and CIFAR100

Analysis of strategy effectiveness and evolutionary behavior

Effectiveness regarding the leader–follower mechanism

The leader–follower mechanism (detailed in Sect. 3.1) is an important strategy proposed in this study. To verify its necessity and effectiveness, further experiments are carried out on CIFAR10 in this section. The comparison of the evolutionary behavior of LF-MOGP with and without the leader–follower mechanism is illustrated in Fig. 10.

Fig. 10 Evolutionary behavior of LF-MOGP with and without the leader–follower mechanism

As can be seen from Fig. 10a, the Pareto front obtained by LF-MOGP is much better than that obtained without the leader–follower mechanism. In addition, according to Fig. 10b, the evolution of the mean accuracy of the top 5 individuals obtained by LF-MOGP also outperforms that obtained without the leader–follower mechanism. These results show that the leader–follower mechanism is effective and can guide the search toward more promising regions of the search space.

Evolutionary behavior

To analyze the evolutionary behavior (especially the convergence) of the proposed LF-MOGP, we show the evolution of the Pareto front obtained by our algorithm in Fig. 11, using the MBI dataset as an example. The figure contains three parts: the evolution of the Pareto front (sampled every 20 generations) and the convergence trajectories of the accuracy and complexity (number of parameters) metrics.

As can be seen from Fig. 11a, the quality of the Pareto front steadily improves as the number of generations increases, and the highest accuracy obtained by the non-dominated solutions on the validation set reaches 97.6%. According to Fig. 11b, both the mean accuracy of the population and the best individual accuracy increase steadily with the generations and tend to converge around the 80th generation. Similarly, Fig. 11c shows that the number of parameters of the most accurate individual, as well as the mean number of parameters of the population, decreases obviously as the generations increase, and their trajectories also converge at around the 80th generation.

Fig. 11 LF-MOGP evolutionary behavior on MBI

Fig. 12 Best CNNs evolved by LF-MOGP

Discussion

In summary, LF-MOGP achieves promising performance on the eight datasets. Compared with the ENAS algorithms, LF-MOGP performs best on five datasets, second best on two datasets, and third best on one dataset. The advantages are even more obvious when compared with the handcrafted algorithms. The main reasons for the good performance of LF-MOGP can be summarized as follows:

  (1) LF-MOGP benefits from the variable length-width encoding strategy. This encoding strategy is conducive to feature extraction and ensures that richer features can be extracted, and rich features are the basic guarantee for completing the classification tasks. In addition, LF-MOGP is essentially a block-based algorithm, while CGP is a tree-like structure with few structural constraints on the CNNs, so encoding the blocks with CGP provides a larger search space for the algorithm.

  (2) The proposed leader–follower mechanism accelerates the convergence of the algorithm and requires fewer resources. An external archive of non-dominated solutions acts as the leader, and the evolutionary operations are mainly performed on solutions from this archive, which greatly reduces the computational resources required. In addition, an elite population is updated with new solutions that cannot enter the external archive, so it can be viewed as a follower of the external archive. During evolution, if the diversity of the external archive tends to deteriorate, solutions from the elite population are selected to generate new, diverse solutions, which helps the algorithm avoid getting trapped in local optima.

Best CNNs evolved by LF-MOGP

Evolved CNN architectures

The best CNNs evolved by LF-MOGP on the seven benchmark datasets are presented in Fig. 12. Based on these best CNNs, the following conclusions can be drawn. First, the depth and width of the CNNs differ across datasets, which indicates that LF-MOGP can design CNNs in a targeted manner according to the data characteristics of different tasks. Second, LF-MOGP breaks the limitations of traditional CNN construction. For example, the fully connected layer can also use features from earlier layers, in contrast to the classical fully connected layer, which performs classification only on the features obtained from the last layer; this improves feature utilization. In addition, in conventional CNNs pooling after convolution is the regular pattern, whereas here convolution and pooling can be performed simultaneously, which makes the scale of the features richer and thus makes the CNNs constructed by LF-MOGP more flexible and powerful in feature extraction and utilization. Finally, the best CNNs evolved by LF-MOGP on the seven benchmark datasets are relatively lightweight and thus friendlier to resources and hardware; such CNNs are more suitable for practical applications such as deployment on mobile devices.

Fig. 13 Convergence performance of the best CNNs

Convergence performance of the best CNNs

To better understand the convergence performance of the best CNNs evolved by LF-MOGP, we plot the trajectories of the classification accuracy and loss value on the validation set during training. Note that the validation set is used rather than the test set, because the test set is not allowed to participate in the training process. To keep the paper concise, the convergence curves of six representative CNNs with the best performance are provided in Fig. 13. From these convergence curves, the following conclusions can be drawn. First, these CNNs converge within 100 epochs, so the convergence rate is relatively fast. Second, the curves indicate that the relevant hyperparameter settings, such as the number of training epochs, are feasible and ensure that each model is adequately trained. In addition, the learning rate and its adjustment strategy are reasonable, enabling the CNNs to converge as quickly as possible.

Real-world application

Nowadays, intelligent methods based on computer vision are widely used in industrial applications [52], so in this section LF-MOGP is further validated on the online classification of real-world industrial slab numbers. The slab number is the unique identifier of each slab in the hot rolling process; it helps the site operator put the slab into the designated heating furnace for heating and then complete hot rolling production. Figure 14 shows an example of industrial slab number image data, where the slab number sequence is on the left and the segmented slab number character images are on the right. At present, slab numbers are identified by operators 24 hours a day, which is labor-intensive and inefficient, and any mistake can cause serious economic losses to the production line. Therefore, it is important to design a stable and efficient intelligent slab number identification algorithm to reduce labor costs and improve production efficiency.

Fig. 14 Example of real-world industrial slab data

Table 6 Classification results on the real-world slab numbers dataset
Fig. 15 Examples of identification results of LF-MOGP and several comparative algorithms (red characters indicate recognition errors)

Table 6 shows the classification results of the proposed LF-MOGP and several comparative algorithms on the real-world slab dataset. As can be seen from Table 6, the best CNN architecture evolved by LF-MOGP achieves the highest classification accuracy of 98.07% without a significant increase in the number of parameters compared with the rival algorithms; in fact, it has fewer parameters than VGG and ResNet. Real-world application scenarios generally have limited computing resources, so the evolved CNNs, with their high accuracy and low complexity, have very high practical value.

The identification results for some practical slab numbers are presented in Fig. 15, where the characters marked in red are misidentified. From this figure, it can be seen that the CNN evolved by LF-MOGP identifies the slab numbers correctly in most cases, even for slabs with very low-quality characters; such slab numbers are hard to distinguish even for experienced human experts. It is worth noting that LF-MOGP is designed for single-label image classification. When dealing with slab number sequence recognition, our method first recognizes the segmented character images in a batch of size 10 without disturbing their order, and then converts the recognition results back into a slab number sequence.
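A sketch of this sequence-recognition step is given below; it assumes the class indices map directly to the slab-number characters, and the stand-in model is used purely for illustration (the real model is the CNN evolved by LF-MOGP).

```python
import torch
import torch.nn as nn

def recognize_slab_number(model, char_images):
    """char_images: tensor of shape (10, C, H, W) holding the ten segmented
    character images of one slab number, kept in their original order."""
    model.eval()
    with torch.no_grad():
        logits = model(char_images)      # one forward pass over the batch of 10
        digits = logits.argmax(dim=1)    # predicted class index per character
    return "".join(str(int(d)) for d in digits)

# toy usage with a stand-in classifier and random character crops
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(1 * 32 * 32, 10))
chars = torch.randn(10, 1, 32, 32)
print(recognize_slab_number(toy_model, chars))
```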

Conclusion and future work

In this paper, a CGP-based autonomous evolutionary convolutional neural network search algorithm (LF-MOGP) was proposed to evolve good CNNs for image classification tasks. In this algorithm, a flexible variable length-width encoding strategy was designed based on CGP and 22 basic function blocks, which helps to expand the search space. To accelerate convergence and reduce computational resources, a leader–follower strategy was proposed to guide the evolution process. The proposed LF-MOGP was tested on eight benchmark datasets and a real-world industrial dataset, and the experimental results illustrate that LF-MOGP outperforms 35 existing algorithms in the literature in terms of classification accuracy, model complexity, and computational resource requirements. Since the CNNs constructed by LF-MOGP are relatively lightweight, they have great potential for industrial applications, which is the main direction of our future work.