1 Introduction

Steel corrosion in concrete is the most common reason for the degradation and poor durability of reinforced concrete (Angst 2018; Ben Seghier et al. 2021c). This destructive process consists of corrosion initiation and corrosion propagation stages (Ben Seghier et al. 2021b). The first stage is accompanied by ingress of carbon dioxide and carbonation of concrete cover (Nguyen et al. 2015) and increment of chloride content exceeds a critical value (Cao et al. 2019) or threshold level (Ann and Song 2007) in the surrounding concrete. Under these conditions, the concrete pH around reinforcement decreases then the protective film on the rebar surface breaks, and corrosion of steel rebar will begin. Before the corrosion initiation, the presence of a very thin and dense oxide film of iron on the surface of steel rebars is an excellent barrier to the corrosion of reinforcing bars which is stable in concrete with high pH (Berrocal et al. 2015; Al-majidi et al. 2019).

In the next stage, corrosion is occurred and corrosion products have been formed. The volume of this product (rust) is about 3 to 7 times greater than that of common steel volume and this causes high tensile stresses around the reinforcement and subsequently the concrete cover cracks (Tuutti 1982; Nguyen et al. 2015). The cracks increase access to destructive atmospheres such as oxygen, carbon dioxide and chlorine ions, thus staining, lamination, and degradation of the concrete cover occur and reducing the durability, safety, and useful service life of the Reinforced Concrete (RC) structures (Taffese and Sistonen 2017; Al-Akhras and Al-Mashraqi 2021).

Unforeseen RC failures not only cause economic losses due to repairs and increase energy consumption, but also affect the safety and quality of human life (Angst 2018). For example, the cost of maintenance and repair of RC structures in the United States is multibillion dollars per year and in Western Europe is 5 billion euros per year (Taffese and Sistonen 2017).

Therefore, the forecast of the useful service life and the design of the durable RC are major challenges to avoid the cost of premature damage and extend the service life of such structures. Corrosion in reinforced concrete (or penetration of chlorine ions and ingress of carbon dioxide into the concrete structure) is a complicated phenomenon involving many factors that are difficult to control (Ben Seghier et al. 2021c), so it cannot be predicted by conventional mathematical solutions or traditional computing methods.

Machine learning has been proven to be an efficient and reliable source for solving multiple engineering problems and it has many applications to solve nonlinear, multivariate and complex issues (Taffese and Sistonen 2017; Ben Seghier et al. 2021a). Artificial neural networks (ANNs) are one of the greatest commonly used methods of machine learning in the analysis and modeling of durability, service life prediction and corrosion problems of RC structures (Morcous and Lounis 2005; Parthiban et al. 2005; Ukrainczyk et al. 2007; Ukrainczyk and Ukrainczyk 2008; Topçu et al. 2009; Li et al. 2019; Roxas and Lejano 2019). Also, besides the fuzzy logic method (Anoop et al. 2002; Nehdi 2009; Sobhani and Ramezanianpour 2011), particle swarm optimization and support vector machine methods (Lv et al. 2020), genetic algorithm (Lee and Kim 2007) and a combination of these methods (Chatterjee et al. 2017) have been studied by several researchers for the behavior modeling of RC systems. For example; Parthiban and his coworkers (Parthiban et al. 2005) evaluated the performance of ANN in predicting the reinforcement potential of steel in reinforced concrete using experimental laboratory data and the back propagation neural network (BPNN) method. Ukrainczyk et al. (Ukrainczyk et al. 2007) suggested an ANN-based model for the prediction of the degree of corrosion destruction of reinforced concrete. Topçu et al. (Topçu et al. 2009) also used an ANN method to predict the corrosion current of reinforced concrete.

In spite of the advantages of the mentioned methods, they do not usually give a definite equation for calculating the output of the model. For example, the ANNs structure including the transfer function, the number of hidden layers and the number of neurons per layer must be known in advance (Gandomi and Alavi 2012a). Moreover, for a complicated database with various input parameters, accurate results may not be provided by ANN (Mai et al. 2020). This issue can be overcome by employing supervised machine learning procedures which give a particular mathematic equation to estimate the outcome such as genetic programming (GP) (Koza 1994). As an example of employing this approach to model the service life prediction of the tunnel structures, the implemented research by Gao et al. is considerable (Gao et al. 2019). multi-gene genetic programming (MGGP) is a modern technique of genetic programming that has been successfully employed in the analysis of various engineering issues (Gandomi and Alavi 2012a, b; Kaydani et al. 2014b; Garg and Lam 2015; Hoang et al. 2017). The benefits of this procedure are high learning ability andflexibility and the capability to deal with complex problems (Kaydani et al. 2014a). Furthermore, the accuracy of MGGP is more than a common GP and no researcher has been used from this approach to model the degree of corrosion damage in RC structures. Therefore, in this study, the new MGGP algorithm was used to formulate a mathematical relationship between the degree of damage in reinforced concrete structures located in the corrosive area of the Persian Gulf and input variables. The multi-gene genetic programming is also compared with an ANN model. Moreover, a large field database including various parameters was utilized, which have a significant impact on the corrosion process in the RC structures.

2 Methods and data collection

In this regard, 492 data were gathered from different regions located along in the Persian Gulf shown in Fig. 1. General information of each structure (point) including lifetime, concrete repair history, distance from the sea and elevation respect the sea level based on the technical reports and a global positioning system (GPS) were recorded. Also, distinctive properties were measured as: compression resistance of concrete and chloride penetration to concrete according to EN 13,791 standard and ASTM C114 standard, respectively; rebar diameter and concrete cover depth using an ultrasonic equipment; rebar potential which assessed by ASTM C876 standard and corrosion current density of rebar using pulse galvanostatic technique and ohmic resistance by a Randleʼs equivalent circuit. Accordingly, 12 parameters were considered as input variables for MGGP and neural network. The list of these parameters is shown in Table 1.

Fig. 1
figure 1

Different regions located along in the Persian Gulf (1–10) for data collection

Table 1 Effective factors on the corrosion of reinforced concrete in this study

Statistical information of the mentioned parameters in Table 1 including the mode, median, mean, minimum, maximum, standard deviation, and range of them is listed in Table 2. Furthermore, the histograms of the input parameters are illustrated in Fig. 2. From the statistical information in Table 2 and also the variable pattern of each parameter in the histograms of Fig. 2, it can be found that the problem is complicated.

Table 2 Statistical information of the mentioned parameters in Table 1
Fig. 2
figure 2

Histograms of the input parameters: a Age b Concrete repairing history c Height above the sea level d Distance from sea e Concrete compressive strength f Rebar diameter g Concrete cover depth h Corrosion potential i Corrosion current density j Concrete electrical resistivity k Chloride ion concentration l Alkalinity of concrete

Since the range of input variables x1x12 is not the same (i.e. we have different scales of data), the network considers large-scale variables more important than the small-scale ones. Therefore, it is necessary to normalize all the input variables to reduce the effect of scaling. All the input variable values are normalized within the interval (−1, 1). Table 3 shows the categorization criteria of reinforced concrete corrosion and the abundance of input data in the defined categories. According to Table 3, the data are categorized into four groups: the first group relates to the state that no corrosion has been observed in the evaluated structure. However, the fourth group relates to the structures that severe corrosion is occurred in them. The severity of corrosion in second and third groups is regarded between the aforementioned groups. By attention to the abundance of each categorize, it can be stated that a nearly comprehensive database is considered for the modeling.

Table 3 Categorization criteria of reinforced concrete corrosion and abundant of input data

2.1 The determination of output variables

To represent the four categories given in Table 3, four target variables are considered, i.e., t1, t2, t3, and t4 along with one-hot encoding. Accordingly, for representing the ith category, the variable ti is set to 1, and the others are set to 0; e.g. for indicating category 1 (the first class presented in this study) t1 = 1, t2 = 0, t3 = 0, and t4 = 0. Therefore, based on the above-given descriptions, an overall view of the dataset presented in this study is given in Table 4.

Table 4 Overall schema of the dataset, presented in this study

Categorization feature in ANNs is an effective tool that can be used for grouping a given set of correlated inputs. Feature categorization expressed whether the sample belongs to a particular class or not (Ukrainczyk et al. 2007; Ukrainczyk and Ukrainczyk 2008; Chatterjee et al. 2017).

Neural network architecture for pattern-classification is shown in Fig. 3. In this study, the corrosion category is designated as a distinct output. For this purpose, the input data are classified into four classes according to ASTM standard and visual observation (based on the intensity of corrosion).

Fig. 3
figure 3

Classification network architecture for 4 categories

The output variables for ANN and MGGP models are considered as y1, y2, y3, and y4 where t1, t2, t3, and t4 are the corresponding targets, respectively. Target variables t1, t2, t3, and t4 are the last four columns of the datasets, presented in this study.

3 Framework and methodology

3.1 Brief overview of GP and MGGP methods

GP can generally be defined as a generated extension of the genetic algorithm (GA), though GP and GA have a significant difference. GP searches a program space (tree structures) rather than a data space. The algorithm initially generates a random population of individuals models based on the Darwinian principle to find the best/optimal solution. These individuals are hierarchically structured trees comprising terminal and function sets. The function set consists of mathematical functions, arithmetic symbols, Boolean operations, iterated functions and, conditional expressions. Terminator sets contain function relationships, variables, and constants (Koza 1994; Gandomi and Alavi 2012a). An example of a simple GP solution (i.e. a model tree) with output y and input variables such as a and b is shown in Fig. 4.

Fig. 4
figure 4

A form of a simple GP solution

The MGGP method is a new and vigorous type of GP that combines the capabilities of GP modeling with the ability to estimate the parameters of classic regression (Searson et al. 2010). Representation of standard GP is based on the expression of a single tree, while in the MGGP method, each individual derives from several genes that every tree can be considered as a gene, and the overall model is represented as a linear combination of weight of genes (Mohammadi Bayazidi et al. 2014).

The basic steps of the MGGP procedure are as follows; (a) First, a random primary population of computer programs (individuals) is generated, function sets and terminal sets are selected based on the problem. Then, the components of these two sets are randomly mingled to generate genes. The maximum depth of the tree and the number of genes must be specified by the user. (b) Second, evaluation of the initial population: determine the fitness function to evaluate the performance of every individual in the initial population (the fitness function is the objective function that must be optimized). (c) Third, generation of the new population: if termination condition such as the threshold error or the maximum number of generations of the model is not satisfied for an individual of the population, selection, crossover and mutation (the genetic operations) are applied to the models for the evolution of the new population. (d) Fourth, according to the minimum training error or the best-so-far solution, the best model is chosen (the repetitive process of creating a new population proceeds till the termination condition is attained) (Gandomi and Alavi 2012a; Garg and Lam 2015). Figure 5 shows the flowchart of the proposed MGGP approach. The dataset is randomly divided into two sub-categories, 80% of the data is used in the training process and 20% is used in the test to analyze the MGGP model. This Figure shows the evolutionary process of MGGP that is started by an initial population of chromosomes being randomly generated. In MGGP, each chromosome contains multiple genes and each gene is a mathematical expression (represented as a tree); in this research, 21 genes (trees) are considered to be in each chromosome. For adapting the MGGP to our four-class classification task, for the jth class, the output variable yj is considered. Then, for each output variable yj, 21 trees in each chromosome are combined by weights that are calculated using the least square estimator as follows:

Fig. 5
figure 5

Flowchart of proposed MGGP approach

$$ y_{{\text{j}}} {\text{ }} = {\text{ }}d_{0} {\text{ + }}d_{1} \left( {{\text{tree1}}} \right){\text{ + }}d_{2} \left( {{\text{tree2}}} \right){\text{ + }}d_{3} \left( {{\text{tree3}}} \right){\text{ + }} \ldots {\text{ + }}d_{{21}} \left( {{\text{tree21}}} \right){\text{ for }}j{\text{ = 1}}{ - }{\text{4}} $$
(1)

The weights or linear coefficients (d0, d1, …, d21) are automatically computed from the training data set by applying the least square method for each multigene individual (Rahdari et al. 2016; Hoang et al. 2017). Figure 6 illustrates the symbolic model that predicts 4 output variables using input variables x1x12 and 21 trees or genes.

Fig. 6
figure 6

The schematic structure of the symbolic model that is coded in each chromosome

For each individual, the fitness is calculated by the cross entropy that is described in the following. The fitness value for every individual is measured by Cross-Entropy (XE) given by:

$$\text{XE} = -{\sum_{{\text{i}= 1}}^{\text{N}}}{\sum_{{\text{j}=1}}^{4}{{\text{t}}}_{\text{j}}^{\text{i}}}\,{\text{log}}\,{\text{y}}_{\text{j}}^{\text{i}}$$
(2)

where N is the number of train data, \({\text{y}}_{\text{j}}^{\text{i}}\) is the jth model output for the ith data sample and \({\text{t}}_{\text{j}}^{\text{i}}\) is the corresponding jth target.

3.2 Setting parameters of MGGP algorithm

In this study, a free open-source MATLAB toolbox called GPTIPS (Searson 2009) is utilized to accomplish MGGP. Various parameters that affected the model generalization ability of MGGP must be adjusted. The optimal parameter settings are listed in Table 5. The parameters are chosen according to a trial-and-error method. The victory of the MGGP algorithm ordinarily increments with enhancing the maximum permissible number of genes in an individual and maximum depth of the tree but the intricacy of the evolved function and running time increase (Kaydani et al. 2014b). Therefore, they must be set to optimal values (maximum number of genes = 21; maximum depth of tree = 4).

Table 5 Parameter settings for the MGGP method

The remaining parameters of the MGGP model are selected as follows: the population size = 30; the maximum number of generations = 400; the tournament size = 8; cross over probability = 0.7; Mutation probability = 0.01. Tournament selection strategy with 8 tournament size is approved for choosing the parent genes. The next step is to choose the set of functions as well as the set of terminals, so use x1-x12 as terminals and “times, minus, plus, sqrt, square, tan, cot, exp, add3, mult3, pdiv, sin, cos, msig” as functions. “msig” is sigmoid function and for the variable u is defined as follows:

$${\text{F}}\left({\text{u}}\right) \, = \text{ } \frac{1}{\text{1} + {\text{e}}^{\text{-u}}}$$
(3)

3.3 Artificial Neural Network framework

To estimate the best ANN model or to determine the optimum number of neurons in the hidden layer, model choice processes were utilized according to the K-fold cross-validation. In practice, this method often works well and is one of the most used methods in model selection. In K-fold CV, the total data are categorized into K folds (approximately equal size). Then (K-1) fold is assumed as a training data set and the residual fold is chosen as a test data set. The process is repeated K times and the test cross-entropy (XEi) is calculated at each time. Finally, the average of K tests XEi is a final error for each model. The model with the lowest average test cross-entropy (AXE) is selected as the best model (Shalev-Shwartz and Ben-David 2013). In this research, 10-Fold Cross Validation is used and a schematic of this method is shown in Fig. 7. The neural network containing 15 neurons in the hidden layer was selected as the best model. The ANN architecture is illustrated in Fig. 8 (12 input variables, 15 neurons in the hidden layer, and 4 output variables).

Fig. 7
figure 7

A schematic illustration of fivefold cross-validation

Fig. 8
figure 8

The best neural network structure obtained by tenfold CV

In order to train the best neural network, the dataset was randomly divided into two sets: 80% for training and 20% for test. The steps of the ANN approach are shown in Fig. 9. A common and popular back-propagation neural net (Perera et al. 2014; El et al. 2021) was employed as a topology of the ANN. It consists of two phases: (1) feedforward phase and (2) backpropagation phase and it has three main layers: an input layer, an intermediate hidden layer, and an output layer. Since the Levenberg–Marquardt training algorithm is often very fast and its prediction accuracy is high (Hadi 2003; Golafshani et al. 2012; Li 2015), so it was used for updating weights and biases value, although this algorithm needs more memory than others. The activation function of tan-sigmoid was used for both hidden and output layers. The maximum number of epochs for training is set as 1000 and the ANN model is trained until the maximum number of epochs is reached.

Fig. 9
figure 9

Flowchart of ANN approach

3.4 Comparing the structures of ANN and MGGP approaches

By comparing the proposed structures of ANN and MGGP, it can be apparently seen that the number of parameters for the ANN is bigger than those of MGGP. Since MGGP tunes only 22 × 4 = 88 parameters while the number of weights for ANN model are (12 × 15) + 15 = 195 weight parameters for the first layer and (15 × 4) + 4 = 64 parameters for the second layer (Totally 195 + 64 = 259). Another worth noting point is to compare and contrast the search mechanisms used for tuning the parameters and making the model structures. ANN backpropagation algorithm is employed as a gradient based method that uses the gradient of loss function for adjusting the aforementioned 259 parameters where this algorithm is a local search and it has the risk of trapping in local optima. In contrast, MGGP is an evolutionary algorithm and acts as a global search mechanism for finding the solutions through an iterative process of simulated evolution. Therefore, MGGP as a global search takes more computational efforts compared to the local search approaches like backpropagation but it is more likely that the global optimum of the loss function to be found by global search techniques like MGGP. As mentioned in Sect. 3.3, the structure of ANN model (number of layers and number of neurons in each layer) was obtained by employing cross-validation method, while MGGP determines the structure of model through the evolution process. MGGP automatically removes redundant and non-relevant variables to the outputs during iterations. Generally, it can be concluded that MGGP has the ability to escape from local optima in the search space in comparison with the ANN approach, while it is more computationally prohibitive in contrast to ANN. As the final important point, MGGP gives a mathematical formula as the final model where ANN is a black-box model.

4 Experiments and analysis

4.1 Performance metrics

Some commonly used metrics to evaluate the performance of the classification-type problems are accuracy, precision, recall, and F-measure (F1). The formulas of them can be seen in Eqs. (4)–(7) (Gandomi and Alavi 2011; Chatterjee et al. 2017), respectively.

$$\text{Accuracy = }\frac{\text{TP+TN}}{\text{P+N}}$$
(4)
$$\text{Precision = }\frac{\text{TP}}{\text{TP+FP}}$$
(5)
$$\text{Recall = }\frac{\text{TP}}{\text{TP+FN}}$$
(6)
$$ {\text{F}}1{\text{ }} = {\text{ }}\frac{{2 \times {\text{precision}} \times {\text{recall}}}}{{{\text{precision }} + {\text{ recall}}}} $$
(7)

where TP is true positive (The number of data accurately classified as positive), TN is true negative (The number of data accurately classified as negative), P is the total of samples correctly classified as positive, and samples inaccurately classified as negative (TP + FN), N is the total of TN and FP (The number of data that is incorrectly classified as positive). The relationship between these parameters constitutes the confusion matrix. F1 is the compatible mean of recall as well as accuracy and it is usually more useful than accuracy because it considers both precision and recall.

4.2 Numerical results and discussions

Figure 10 shows the changes in the best and mean fitness over the number of generations. As indicated, the fitness value reduced as the number of generations increased, and the best fitness was obtained in 400 generations (best fitness value = 1.9). It should be noted that the program did not produce better results for more generations.

Fig. 10
figure 10

The best and mean changes in fitness in terms of the number of generations

The vector TR is created by putting the obtained trees (tree1 to tree 21 given in Table 6) into a column vector as well as 1; TR = [1, tree1, tree2 … tree21]T. Also, considering the four column vectors β1 to β4 given in Table 7, the best anticipated equations obtained from the MGGP model to classify the degree of damage in reinforced concrete structures, reported as follows:

$$ y_{1} {\text{ = }}\beta _{1} ^{{\text{T}}} {\text{TR , }}{\mkern 1mu} y_{2} {\text{ = }}\beta _{2} ^{{\text{T}}} {\text{TR , }}y_{3} {\text{ = }}\beta _{3} ^{{\text{T}}} {\text{TR}}{\mkern 1mu} {\text{and}}{\mkern 1mu} y_{4} {\text{ = }}\beta _{4} ^{{\text{T}}} {\text{TR}}{\mkern 1mu} $$

where the model outputs y1-y4 and input variables x1-x12 have been defined previously and tree 1 to tree 21 are the best trees obtained by MGGP, given in Table 6.

Table 6 Equation of each tree or gene
Table 7 Four vectors of coefficients calculated by least square to combine trees

Table 8 illustrates the efficiency of derived models on training and test data for MGGP and ANN. By comparing the results, both models learned the influential correlation between input and output process parameters with a high accuracy. Although the value of performance measures given in Table 8 are close to each other for both ANN and MGGP, it is worth noting that the performance of MGGP on test data is better than ANN. The better performance of MGGP on test data shows its generalization ability compared to ANN. Therefore, the obtained results justify the superiority of MGGP in comparison with the ANN since it uses a global search mechanism and less parameters.

Table 8 The efficiency of obtaining models of training and test data for MGGP and ANN methods

4.3 Sensitivity analysis

For determining the most essential parameter in anticipating the class of corrosion and therefore investigating the sensitivity analysis, one can use from Pearson correlation coefficient (PCC) method. It is a vigorous statistical estimate of the direction and strength of a linear dependency between two variables and also different indices can be evaluated by it in the science of materials (Chen et al. 2021). The PCC (r) regarding two variables X = {x1, x2, …, xn} and Y {y1, y2, …, yn}, defined as follows:

$${r}_{xy}= \frac{\sum_{i=1}^{n}({x}_{i}-\overline{x })({y}_{i}-\overline{y })}{\sqrt{\sum_{i=1}^{n}{({x}_{i}-\overline{x })}^{2}}\sqrt{\sum_{i=1}^{n}{({y}_{i}-\overline{y })}^{2}}}$$
(8)

where \(\overline{x }\) denotes the mean of X and \(\overline{y }\) signifies the mean of Y, n is the sample size, and r is between −1 and 1. The two variables are directly related if the sign of r is positive, whereas they are inversely related if the sign of r is negative. The stronger the linear correction can be inferred provided that the r is close to + 1 or −1.

The PCC values between each pair of input variables x1 to x12 and class of corrosion are calculated and presented as a heat map in Fig. 11. As seen in the last row of Fig. 11, x6 and x7 are the least correlated variables to class corrosion. Concrete compressive strength (x5), corrosion current density (x9) and concrete electrical resistivity (x10) are the most remarkable variables to forecast the degree of corrosion damage. Corrosion current density is directly related to class corrosion, while concrete compressive strength and concrete electrical resistivity are inversely related to the output.

Fig. 11
figure 11

Pearsonsʼs correlation of inputs to output parameters

One of the most significant characters that control the acceleration of degradation of RC structures is corrosion current density (x9) (i.e. corrosion rate) and it inversely relates to concrete resistivity (x10). By decreasing the concrete resistivity, the corrosion current density augments. Moreover, the compressive strength of concrete (x5) is a noticeable mechanical property that is directly proportional to electrical resistivity for a similar concrete mixture design (Azarsa and Gupta 2017; Rodrigues et al. 2020).

Comparing the model given by MGGP with PCC analysis, the following facts have been concluded: Electrical resistivity of concrete (x10) is the most frequent parameter in the obtained equations of genes, while it is the most correlated feature with class corrosion (model output). Three features x6, x7 and x9 have a relatively strong negative correlation with x10 (according to Fig. 11) and thereby there is high redundancy between them, so they are omitted in the MGGP model (Fig. 6). However, there is a feeble redundancy between x1 and x11 with x10 (their correlation values are low) and consequently they are seen in the MGGP equations. These evidences justify the ability of MGGP as a powerful soft computing tool for modeling. As it can be seen from obtained equations of MGGP, the important input variables are considered in MGGP equations and redundant ones are automatically removed during its evolution process.

Furthermore, the electrical resistivity has been utilized in the progression of service life models due to the preparation of information about the electrochemical process (this process is completely related to corrosion of steel) and the widespread indicator of the microstructure of concrete (Andrade 2010; Mendes et al. 2018; Balestra et al. 2019).

5 Conclusions

This paper reports a new and effective technique, namely MGGP, for the formulation of the degree of damage in reinforced concrete structures located in the corrosive region of the Persian Gulf. This approach is based on genetic programming and it can be modeled the nonlinear complex behavior of the problems under study. The performance of MGGP is compared to ANN. To train these methods of machine learning, a data set containing 492 data has been collected. After running the program, results demonstrate that both models perform well with a high degree of accuracy. Although the MGGP method can generate nonlinear treatment without the necessity to predefine the structure, the ANN network structure must be identified in advance. Furthermore, MGGP gives a definite equation for calculating the output of the model. Therefore, the MGGP model can be used to evaluate design parameters in the planning and design phases of reinforced concrete structures for prediction service life and maintenance of them. The sensitivity analysis confirmed that MGGP has the ability to select the most relevant variables to class corrosion, while there is not any redundancy between them and consequently MGGP does not need any feature selection method for reducing the input variables as a preprocessing task.

Eventually, it is concluded that models like MGGP that employ a global search approach for constructing models will have better generalization performance on unseen data. Furthermore, the obtained models by symbolic regression approaches like MGGP (mathematical expressions) are more interpretable than black box models like ANN. MGGP has been usually used in regression problems and this study was one of the successful adaptations of this algorithm for a classification task in corrosion engineering. As a future work, this method can be used in gray box modeling of engineering problems, where we have an imprecise white box model of a process (e.g. governing differential equations). An imprecise white box model can be modified into an accurate one by adding some mathematical expressions being generated by symbolic regression methods through given data.