
Compositional kernel learning using tree-based genetic programming for Gaussian process regression

  • Research Paper
  • Published in Structural and Multidisciplinary Optimization

Abstract

Although Gaussian process regression (GPR) is a powerful Bayesian nonparametric regression model for engineering problems, its predictive performance depends strongly on the kernel used as the covariance function of GPR. However, choosing a proper kernel is still challenging even for experts. To choose a proper kernel automatically, this study proposes compositional kernel (CPK) learning using tree-based genetic programming (GEP). The optimal structure of the kernel is defined as a compositional representation based on sums and products of eight base-kernels. The CPK can be encoded as a tree structure, so tree-based GEP is employed to discover an optimal tree structure of the CPK. To avoid overly complex solutions in GEP, the proposed method introduces a dynamic maximum tree-depth technique. The novelty of the proposed method lies in its more flexible and efficient capability to learn the relationship between input and output than existing methods. To evaluate the learning capability of the proposed method, seven test functions were first investigated with various noise levels, and the predictive accuracy was compared with that of existing methods. Reliability problems in both parallel and series systems were introduced to evaluate the performance of the proposed method for efficient reliability assessment. The results show that the proposed method generally outperforms, or performs similarly to, the best of the existing methods. In addition, it is shown that a proper kernel function can significantly improve the performance of GPR as the training data increase. Stated differently, the proposed method can learn the function being fitted efficiently with fewer training samples than existing methods. In this context, the proposed method enables powerful and automatic predictive modeling based on GPR in engineering problems.





Replication of results

The codes are available from the following GitHub repository: https://github.com/seungsab/CPKL_using_Tree-GEP.

Funding

This research was supported by a grant from a Strategic Research Project (Smart Monitoring System for Concrete Structures Using FRP Nerve Sensor) funded by the Korea Institute of Civil Engineering and Building Technology.

Author information


Corresponding author

Correspondence to Seung-Seop Jin.

Ethics declarations

Conflict of interest

The author declares no conflict of interest.

Additional information

Responsible Editor: Mehmet Polat Saka

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1. Base kernel for covariance function

The distance-based similarity between function values is fundamental to nonparametric modeling in regression. The properties of this similarity are encoded by the kernel. Therefore, the kernel used as the covariance function is a crucial ingredient of GPR, since it encodes our prior assumptions about the function we wish to learn. This appendix briefly reviews the eight kernels used for richer representation and examines their properties. Hereafter, these kernels are referred to as base-kernels. Their properties and expressible structures are shown in Fig. 24.

Fig. 24

Base-kernels and their expressible structures (WN, white noise; CON, constant; LIN, linear; PER, periodic; SE, squared exponential; RQ, rational quadratic; ME3, Matérn-3/2; ME5, Matérn-5/2)

1.1 Squared exponential kernel (SE)

SE is the most popular kernel for the covariance function of GPR. Since the SE kernel is infinitely differentiable, it encodes a prior assumption of a strongly smooth function. It is also known as the “radial basis” or “Gaussian correlation” kernel, and it is given by

$$ {k}_{\mathrm{SE}}\left(\boldsymbol{x},{\boldsymbol{x}}^{\prime };{\sigma}^2,\boldsymbol{l}\right)={\sigma}^2\exp \left(-\frac{{\left(\boldsymbol{x}-{\boldsymbol{x}}^{\prime}\right)}^2}{2{\boldsymbol{l}}^2}\right) $$
(23)

where σ2 and l are the scale factor and the length scale for each input, respectively; σ2 determines the average distance of the function away from its mean, while l defines the characteristic correlation length scale for each input (the length of the wiggles in the function).
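For illustration, a minimal NumPy sketch of the SE kernel in (23) for one-dimensional inputs is given below; the function name and default hyperparameters are illustrative assumptions and are not taken from the author's released code.

```python
import numpy as np

def k_se(x, x2, sigma2=1.0, ell=1.0):
    """Squared exponential kernel, Eq. (23), for 1-D input vectors x and x2."""
    r = x[:, None] - x2[None, :]                  # pairwise differences
    return sigma2 * np.exp(-r**2 / (2.0 * ell**2))
```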

1.2 Rational quadratic kernel (RQ)

RQ is equivalent to adding together many SE kernels with different length scales. The RQ kernel therefore encodes functions that vary smoothly across many length scales, representing smoothness with multiscale variation. This kernel is given by

$$ {k}_{\mathrm{RQ}}\left(\boldsymbol{x},{\boldsymbol{x}}^{\prime };{\sigma}^2,\boldsymbol{l},\alpha \right)={\sigma}^2{\left(1+\frac{{\left(\boldsymbol{x}-{\boldsymbol{x}}^{\prime}\right)}^2}{2\alpha {\boldsymbol{l}}^2}\right)}^{-\alpha } $$
(24)

where σ2, l, and α are the scale factor, the correlation length scale for each input, and the shape parameter of the RQ kernel, respectively; σ2 and l play similar roles to those in the SE kernel, and α adjusts the multiscale variation (i.e., the relative weighting of large-scale and small-scale variations).
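A corresponding sketch of (24), under the same illustrative conventions as the SE example above (inputs are 1-D NumPy arrays):

```python
def k_rq(x, x2, sigma2=1.0, ell=1.0, alpha=1.0):
    """Rational quadratic kernel, Eq. (24); as alpha grows large it approaches the SE kernel."""
    r = x[:, None] - x2[None, :]
    return sigma2 * (1.0 + r**2 / (2.0 * alpha * ell**2))**(-alpha)
```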

1.3 Matérn kernel (ME)

ME is a generalization of the SE kernel with an additional hyperparameter (ν) controlling the smoothness (i.e., roughness) of the function. Stein (1999) recommended the ME kernel as an alternative to the SE kernel, since the smoothness assumption of the SE kernel is unrealistic for modeling physical processes. Matérn kernels with ν = 3/2 and ν = 5/2 are the most popular for GPR, and they are described as

$$ {k}_{\mathrm{ME}3}\left(\boldsymbol{x},{\boldsymbol{x}}^{\prime };{\sigma}^2,\boldsymbol{l}\right)={\sigma}^2\left(1+\frac{\boldsymbol{r}\sqrt{3}}{\boldsymbol{l}}\right)\exp \left(-\frac{\boldsymbol{r}\sqrt{3}}{\boldsymbol{l}}\right) $$
(25)

and

$$ {k}_{\mathrm{ME}5}\left(\boldsymbol{x},{\boldsymbol{x}}^{\prime };{\sigma}^2,\boldsymbol{l}\right)={\sigma}^2\left(1+\frac{\boldsymbol{r}\sqrt{5}}{\boldsymbol{l}}+\frac{5{\boldsymbol{r}}^{\mathbf{2}}}{3{\boldsymbol{l}}^{\mathbf{2}}}\right)\exp \left(-\frac{\boldsymbol{r}\sqrt{5}}{\boldsymbol{l}}\right) $$
(26)

where r = x − x′; σ2 and l play similar roles to those in the SE kernel.
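Minimal sketches of (25) and (26), using the same illustrative conventions as above:

```python
import numpy as np

def k_me3(x, x2, sigma2=1.0, ell=1.0):
    """Matérn-3/2 kernel, Eq. (25)."""
    a = np.sqrt(3.0) * np.abs(x[:, None] - x2[None, :]) / ell
    return sigma2 * (1.0 + a) * np.exp(-a)

def k_me5(x, x2, sigma2=1.0, ell=1.0):
    """Matérn-5/2 kernel, Eq. (26)."""
    r = np.abs(x[:, None] - x2[None, :])
    a = np.sqrt(5.0) * r / ell
    return sigma2 * (1.0 + a + 5.0 * r**2 / (3.0 * ell**2)) * np.exp(-a)
```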

1.4 Constant kernel (CON)

CON can be used as part of a product-kernel to scale the magnitude of another kernel. The CON kernel is also used as part of a sum-kernel to represent the mean of GPR. It is given by

$$ {k}_{\mathrm{CON}}\left(\boldsymbol{x},{\boldsymbol{x}}^{\prime };{\sigma}^2\right)={\sigma}^2 $$
(27)

where σ2 acts either as a scale factor when used within a product-kernel or as the mean of GPR when used within a sum-kernel.
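A one-line sketch of (27), again with illustrative naming:

```python
import numpy as np

def k_con(x, x2, sigma2=1.0):
    """Constant kernel, Eq. (27): the same value for every pair of inputs."""
    return sigma2 * np.ones((len(x), len(x2)))
```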

1.5 Linear kernel (LIN)

LIN is nonstationary, which means that it depends on the absolute locations of the inputs. The nonstationarity of the LIN kernel can model polynomial trends when used as part of a product-kernel. Therefore, the LIN kernel is used to encode a trend in the function. This kernel is defined by

$$ {k}_{\mathrm{LIN}}\left(\boldsymbol{x},{\boldsymbol{x}}^{\prime };{\sigma}_b^2,{\sigma}_v^2\right)={\sigma}_b^2+{\sigma}_v^2\boldsymbol{x}\cdotp \boldsymbol{x}^{\prime } $$
(28)

where \( {\sigma}_b^2 \) and \( {\sigma}_v^2 \) denote the bias and slope coefficient for each input.
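A minimal sketch of (28) under the same conventions (1-D NumPy inputs):

```python
def k_lin(x, x2, sigma2_b=1.0, sigma2_v=1.0):
    """Linear (nonstationary) kernel, Eq. (28)."""
    return sigma2_b + sigma2_v * x[:, None] * x2[None, :]
```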

1.6 Periodic kernel (PER)

PER is a stationary periodic kernel that encodes repeating patterns in the function. This kernel is given by

$$ {k}_{\mathrm{PER}}\left(\boldsymbol{x},{\boldsymbol{x}}^{\prime };{\sigma}^2,\boldsymbol{l},p\right)={\sigma}^2\exp \left(-\frac{2\sin^2\left(\pi \left|\boldsymbol{x}-\boldsymbol{x}^{\prime}\right|/p\right)}{{\boldsymbol{l}}^{\mathbf{2}}}\right) $$
(29)

where σ2 and l play similar roles to those in the SE kernel, and p is the period, which determines the distance between repetitions of the function.
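A minimal sketch of (29) under the same illustrative conventions:

```python
import numpy as np

def k_per(x, x2, sigma2=1.0, ell=1.0, p=1.0):
    """Periodic kernel, Eq. (29)."""
    d = np.abs(x[:, None] - x2[None, :])
    return sigma2 * np.exp(-2.0 * np.sin(np.pi * d / p)**2 / ell**2)
```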

1.7 White noise kernel (WN)

WN encodes uncorrelated noise in the function. The WN kernel can be obtained as the limit of the SE kernel as its length scale goes to zero. It is widely used to model additive noise in the function. This kernel is defined by

$$ {k}_{\mathrm{WN}}\left(\boldsymbol{x},{\boldsymbol{x}}^{\prime };{\sigma}_{\mathrm{WN}}^2\right)={\sigma}_{\mathrm{WN}}^2{\delta}_{\boldsymbol{x},{\boldsymbol{x}}^{\prime}} $$
(30)

where \( {\sigma}_{\mathrm{WN}}^2 \) and δx,x′ are the noise variance and the Kronecker delta function, respectively. The WN kernel has only one hyperparameter, \( {\sigma}_{\mathrm{WN}}^2 \), which determines the noise level.
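To close this appendix, the sketch below implements (30) and illustrates how the base-kernels above can be combined into a compositional kernel by evaluating a tree of sums and products into a covariance matrix. The tuple-based tree representation and the function eval_kernel are assumptions made here for exposition only; they are not the Tree-GEP encoding of the paper, whose implementation is available in the GitHub repository listed under “Replication of results”.

```python
import numpy as np

def k_wn(x, x2, sigma2_wn=1.0):
    """White noise kernel, Eq. (30): nonzero only where the two inputs coincide."""
    return sigma2_wn * (x[:, None] == x2[None, :]).astype(float)

# Illustrative base-kernel table, reusing the sketches defined earlier in this appendix.
BASE = {'SE': k_se, 'RQ': k_rq, 'ME3': k_me3, 'ME5': k_me5,
        'CON': k_con, 'LIN': k_lin, 'PER': k_per, 'WN': k_wn}

def eval_kernel(node, x, x2):
    """Evaluate a compositional kernel tree into a covariance matrix.

    A leaf is ('NAME', {hyperparameters}); an internal node is ('+', left, right)
    or ('*', left, right), mirroring the sum/product structure of the CPK.
    """
    if node[0] == '+':
        return eval_kernel(node[1], x, x2) + eval_kernel(node[2], x, x2)
    if node[0] == '*':
        return eval_kernel(node[1], x, x2) * eval_kernel(node[2], x, x2)
    name, params = node
    return BASE[name](x, x2, **params)

# Example: (LIN * PER) + SE, i.e., a periodic pattern with linearly growing
# amplitude plus a smooth residual component.
tree = ('+', ('*', ('LIN', {}), ('PER', {'p': 0.5})), ('SE', {'ell': 2.0}))
x = np.linspace(0.0, 1.0, 20)
K = eval_kernel(tree, x, x)          # 20 x 20 covariance matrix
```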

Appendix 2. Analytical test functions

1.1 Branin function (Forrester et al. 2008) (d = 2)

$$ f(X)={\left({x}_2-\frac{5.1}{4{\pi}^2}{x}_1^2+\frac{5}{\pi }{x}_1-6\right)}^2+10\left(1-\frac{1}{8\pi}\right)\cos \left({x}_1\right)+10 $$
(31)

where x1 ∈ [−5, 10] and x2 ∈ [0, 15].
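As a simple illustration of how these analytical test functions are evaluated, a NumPy sketch of the Branin function (31) is shown below; the function name is illustrative. The remaining test functions in this appendix can be coded in the same way.

```python
import numpy as np

def branin(x1, x2):
    """Branin test function, Eq. (31); x1 in [-5, 10], x2 in [0, 15]."""
    return ((x2 - 5.1 / (4.0 * np.pi**2) * x1**2 + 5.0 / np.pi * x1 - 6.0)**2
            + 10.0 * (1.0 - 1.0 / (8.0 * np.pi)) * np.cos(x1) + 10.0)
```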

1.2 Friedman function (Friedman 1991) (d = 5)

$$ f(X)=10\sin \left(\pi {x}_1{x}_2\right)+20{\left({x}_3-0.5\right)}^2+10{x}_{\mathbf{4}}+5{x}_5 $$
(32)

where xi ∈ [0, 1] for i = 1, …, 5.
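A corresponding sketch of (32), under the same illustrative conventions:

```python
import numpy as np

def friedman(x):
    """Friedman test function, Eq. (32); x is a length-5 vector with entries in [0, 1]."""
    x1, x2, x3, x4, x5 = x
    return 10.0 * np.sin(np.pi * x1 * x2) + 20.0 * (x3 - 0.5)**2 + 10.0 * x4 + 5.0 * x5
```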

1.3 Dette and Pepelyshev function (Dette and Pepelyshev 2010) (d = 8)

$$ f\left(\boldsymbol{X}\right)=4{\left({x}_1-2+8{x}_2-8{x}_2^2\right)}^2+{\left(3-4{x}_2\right)}^2+16\sqrt{x_3+1}{\left(2{x}_3-1\right)}^2+\sum \limits_{i=4}^8i\ln \left(1+\sum \limits_{j=3}^i{x}_j\right) $$
(33)

where xi ∈ [0, 1] for i = 1, …, 8.

1.4 Welch et al. function (Welch et al. 1992) (d = 20)

$$ f(X)=\frac{5{x}_{12}}{1+{x}_1}+5{\left({x}_4-{x}_{20}\right)}^2+{x}_5+40{x}_{19}^3-5{x}_{19}+0.05{x}_2+0.08{x}_3-0.03{x}_6+0.03{x}_7-0.09{x}_9-0.01{x}_{10}-0.07{x}_{11}+0.25{x}_{13}^3-0.04{x}_{14}+0.06{x}_{15}-0.01{x}_{17}-0.03{x}_{18} $$
(34)

where xi ∈ [−0.5, 0.5] for i = 1, …, 20.

1.5 Output-transformerless (OTL) circuit model (d = 6)

The OTL circuit model has six inputs for estimating the mid-point voltage of an output-transformerless push-pull circuit. This model has been used as a test function in surrogate modeling (Ben-Ari and Steinberg 2007). It provides an analytical expression for the mid-point voltage, defined as

$$ f(X)=\frac{\left(\frac{12{x}_2}{x_1+{x}_2}+0.74\right){x}_6\left({x}_5+9\right)}{x_6\left({x}_5+9\right)+{x}_3}+\frac{11.35{x}_3}{x_6\left({x}_5+9\right)+{x}_3}+\frac{0.74{x}_3{x}_6\left({x}_5+9\right)}{\left({x}_6\left({x}_5+9\right)+{x}_3\right){x}_4} $$
(35)

where x1 ∈ [50, 150], x2 ∈ [25, 70], x3 ∈ [0.5, 3], x4 ∈ [1.2, 2.5], x5 ∈ [0.25, 1.2], and x6 ∈ [50, 300].

1.6 Borehole model (d = 8)

The borehole model has eight inputs for estimating the water flow through a borehole drilled from the ground surface through two aquifers. This model has been used as a test function in surrogate modeling (Kersaudy et al. 2015; Morris et al. 1993) and sensitivity analysis (Harper and Gupta 1983). The water-flow rate can be computed by (36) from the properties of the aquifers and the borehole:

$$ f\left(\boldsymbol{X}\right)=\frac{2\pi {x}_3\left({x}_4-{x}_6\right)}{\ln \left({x}_2/{x}_1\right)\left(1+\frac{2{x}_7{x}_3}{\ln \left({x}_2/{x}_1\right){x}_1^2{x}_8}+\frac{x_3}{x_5}\right)} $$
(36)

where x1 ∈ [0.05, 0.15], x2 ∈ [100, 50,000], x3 ∈ [63,070, 115,600], x4 ∈ [990, 1110], x5 ∈ [63.1, 116], x6 ∈ [700, 820], x7 ∈ [1120, 1680], and x8 ∈ [9855, 12,045].
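A minimal sketch of the borehole model (36), with the inputs ordered as in the text; the function name is illustrative.

```python
import numpy as np

def borehole(x):
    """Borehole water-flow model, Eq. (36); x is a length-8 vector ordered as in the text."""
    x1, x2, x3, x4, x5, x6, x7, x8 = x
    log_ratio = np.log(x2 / x1)
    return (2.0 * np.pi * x3 * (x4 - x6)
            / (log_ratio * (1.0 + 2.0 * x7 * x3 / (log_ratio * x1**2 * x8) + x3 / x5)))
```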

1.7 Wing weight model (d = 10)

The wing weight model has 10 inputs for estimating the weight of a light aircraft wing. This model has been used as a test function for input variable screening (Forrester et al. 2008). The wing weight is computed by (37)

$$ f\left(\boldsymbol{X}\right)=0.036{x}_1^{0.758}{x}_2^{0.0035}{\left(\frac{x_3}{\cos^2\left({x}_4\right)}\right)}^{0.6}{x}_5^{0.006}{x}_6^{0.04}{\left(\frac{100{x}_7}{\cos \left({x}_4\right)}\right)}^{-0.3}{\left({x}_8{x}_9\right)}^{0.49}+{x}_1{x}_{10} $$
(37)

where x1 ∈ [150, 200], x2 ∈ [220,  300], x3 ∈ [6, 10], x4 ∈ [−10, 10], x5 ∈ [16,  45], x6 ∈ [0.5, 1], x7 ∈ [0.08, 0.18], x8 ∈ [2.5, 6], x9 ∈ [1700, 2500], and x10 ∈ [0.025, 0.08].

Appendix 3. Predictive performance for global surrogate modeling

1.1 Numerical verification #1: interpolation for noiseless data (β = 0)

Table 4 Summary statistics of RMSE for mathematical test functions (β = 0)
Table 5 Summary statistics of MAE for mathematical test functions (β = 0)
Table 6 Summary statistics of RMSE for physical models (β = 0)
Table 7 Summary statistics of MAE for physical models (β = 0)

1.2 Numerical verification #2: regression for noisy data (β = 0.005)

Table 8 Summary statistics of RMSE for mathematical test functions (β = 0.005)
Table 9 Summary statistics of MAE for mathematical test functions (β = 0.005)
Table 10 Summary statistics of RMSE for physical models (β = 0.005)
Table 11 Summary statistics of MAE for physical models (β = 0.005)

1.3 Numerical verification #3: regression for noisy data (β = 0.05)

Table 12 Summary statistics of RMSE for mathematical test functions (β = 0.05)
Table 13 Summary statistics of MAE for mathematical test functions (β = 0.05)
Table 14 Summary statistics of RMSE for physical models (β = 0.05)
Table 15 Summary statistics of MAE for physical models (β = 0.05)

1.4 Computational experiment #1: system reliability in a parallel system

Table 16 Summary statistics of RMSE for multimodal system
Table 17 Summary statistics of MAE for multimodal system

1.5 Computational experiment #2: system reliability in a series system

Table 18 Summary statistics of RMSE and MAE for liquid hydrogen tank

Appendix 4. Computational cost with different sample sizes and dimensionalities

Fig. 25

Computational costs for different dimensionalities and numbers of training samples for the Dixon and Price function


About this article


Cite this article

Jin, SS. Compositional kernel learning using tree-based genetic programming for Gaussian process regression. Struct Multidisc Optim 62, 1313–1351 (2020). https://doi.org/10.1007/s00158-020-02559-7

