1 Introduction

Choroidal neovascularization (CNV), which causes deterioration of vision, is characterized by the growth of abnormal blood vessels in the choroidal layer [2]. One of the medical treatments for CNV is anti-vascular endothelial growth factor (anti-VEGF) injection, which may inhibit the growth of CNV. However, the treatment is very expensive and may be ineffective for some patients. Hence, accurate detection and quantification of CNV would be extremely useful to clinicians in diagnosis and in evaluating the therapeutic effect of the treatment [5].

As a non-invasive, non-contact imaging modality, Optical Coherence Tomography (OCT) [3] has become a powerful tool to monitor the change of CNV. As the number of acquired OCT volumes increases, estimating the area of CNV automatically is becoming increasingly relevant. Traditionally, segmentation-based methods are popular for this task [1, 5, 12]. However, segmentation-based methods are highly dependent on intermediate steps, which inevitably introduce cumulative errors.

Recently, some researchers have proposed direct methods that avoid segmentation for automated medical image analysis, and their effectiveness has been demonstrated in many works [6, 7, 14]. In addition, with the revival of convolutional neural networks (CNNs) [4], direct methods have become more convenient, since CNNs can be trained directly on raw data and generally achieve significant improvements over traditional approaches.

In this paper, we train a CNN on raw images to estimate the area of CNV in retinal OCT images directly. The effectiveness of this simple approach is verified by the experimental results. However, CNNs are notoriously opaque: it is difficult to derive the function a CNN approximates by studying its structure. To deal with this problem, we try to find a surrogate model with an explicit mathematical expression that fits the function of each layer in the CNN. To this end, genetic programming (GP) [9], which can automatically evolve both the structure and the parameters of a mathematical model from data, is employed to derive the model. The contributions of this paper are summarized as follows.

  • We demonstrate that using CNNs to estimate the area of CNV in retinal OCT images without segmentation is very promising.

  • We propose using GP to derive mathematical models that fit the function being approximated by CNNs, which is a potential way to explain how CNNs work.

The rest of this paper is organized as follows: In Sect. 2, the details of CNNs and GP are described. Section 3 presents the experimental results, including the performance of the CNN on direct area estimation and the functions obtained by GP. The discussion and conclusion are given in Sect. 4.

2 Methods

2.1 Convolutional Neural Networks

Convolutional neural networks can be employed for different kinds of problems by constructing a proper network structure \(C_{\theta }\) and cost function \(\phi \) for a given training set \(\{(p_s, q_s)\}_{s=1}^S\), where \(p_s\) is the observed input and \(q_s\) is the desired output. This process can be formulated as

$$\begin{aligned} C_{learn} = \min _{C_\theta ,\,\theta \in \varTheta }\left( \sum _{s=1}^S \phi (q_s, C_{\theta }(p_s))\right) \end{aligned}$$
(1)

where \(\varTheta \) is the set of all possible parameters. Once the learning step is complete, \(C_{learn}\) can be applied to the specific problem. In this work, we employ a CNN to estimate the area of CNV in retinal OCT images directly. Since the area of CNV is a continuous variable, using a CNN to estimate the area directly is a regression problem, and the least-squares error is used as the loss function. Thus, the problem is formulated as

$$\begin{aligned} R_{learn} = \min _{R_\theta ,\theta \in \varTheta }(\sum _{s=1}^S||y_s-R_\theta (x_s)||^2) \end{aligned}$$
(2)

where \(x_s\) is an OCT image and \(y_s\) is the corresponding area value.

Particularly, the structure of a typical CNN includes several convolutional layers and pooling layers, optionally followed by at least one fully connected layer. The structure of the CNN used in this work is shown in Table 1. The convolution kernels in all convolutional layers have rectangular size \(7\times 3\). The number of filters in the \(l^{th}\) convolutional layer is \(32\times 2^{l-1}\), where \(l=1,2,3,4\). The receptive field of each pooling layer is \(3\times 3\) with stride 2 for downsampling. At the final layer, a \(1\times 1\) convolution is used to map each 256-component feature vector to the output. In total, there are four convolutional layers, three pooling layers and one fully connected layer in the CNN. Note that each convolutional layer is followed by a rectified linear unit (ReLU) layer, which is not displayed in Table 1.

Table 1. The structure of the CNN used in this work.
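For concreteness, a minimal PyTorch sketch of such a network is given below. Since the full contents of Table 1 are not reproduced in this text, the placement of the pooling layers, the padding scheme and the use of global average pooling before the final layer are assumptions; only the quantities stated above (the \(7\times 3\) kernels, the \(32\times 2^{l-1}\) filters, the \(3\times 3\) stride-2 pooling and the final \(1\times 1\) convolution) come from the paper, and all names are illustrative.

```python
# Minimal sketch of the CNN described in Sect. 2.1 (names are illustrative;
# layer ordering, padding and global average pooling are assumptions).
import torch
import torch.nn as nn

class CNVAreaCNN(nn.Module):
    def __init__(self):
        super().__init__()
        layers = []
        in_ch = 1                        # a grayscale OCT B-scan
        for l in range(1, 5):            # four convolutional layers
            out_ch = 32 * 2 ** (l - 1)   # 32, 64, 128, 256 filters
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=(7, 3), padding=(3, 1)),
                       nn.ReLU(inplace=True)]
            if l <= 3:                   # three 3x3, stride-2 pooling layers
                layers += [nn.MaxPool2d(kernel_size=3, stride=2, padding=1)]
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        # Collapse the spatial dimensions so each image yields one
        # 256-component feature vector (assumption: global average pooling).
        self.gap = nn.AdaptiveAvgPool2d(1)
        # "fc1": a 1x1 convolution mapping the 256-vector to the scalar area.
        self.fc1 = nn.Conv2d(256, 1, kernel_size=1)

    def forward(self, x):
        x = self.features(x)             # (N, 256, H', W')
        x = self.gap(x)                  # (N, 256, 1, 1)
        return self.fc1(x).view(-1)      # (N,) predicted normalized area

if __name__ == "__main__":
    net = CNVAreaCNN()
    dummy = torch.randn(2, 1, 512, 256)  # two cropped B-scans (see Sect. 3.1)
    print(net(dummy).shape)              # torch.Size([2])
```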

2.2 Genetic Programming

Genetic programming (GP) is a biologically inspired machine learning method that evolves computer programs to perform a task. It does this by randomly generating a population of computer programs and then breeding the best-performing individuals together to create a new population. Usually, an individual in the population is represented by a tree structure built from a set of functions, e.g. minus and square, as shown in Fig. 1(a). One popular application of GP is symbolic data mining, which is the process of building an empirical mathematical model of data.

The general process of building an empirical mathematical model of data using GP is shown in Fig. 1(b). A population is first initialized, and the fitness of each individual is then evaluated. If the evolution has completed, the individual with the highest fitness value in the population is selected. Otherwise, genetic operations are employed to breed new individuals for the next generation. Crossover, mutation and duplication are the commonly used genetic operations. Crossover exchanges the subtrees at random positions in a pair of selected individuals, as shown in Fig. 1(c). Mutation replaces a subtree in an individual with a randomly generated tree, as shown in the upper part of Fig. 1(d). Duplication copies an individual unchanged into the next generation, as shown in the lower part of Fig. 1(d); a minimal sketch of this loop is given after Fig. 1. We refer the reader to [9] for a comprehensive treatment of GP.

Fig. 1. (a) Example of a tree (gene) representing the model term \((\cos (x_2)+\log (x_3))-\sin (x_1)\). (b) The general process of GP. (c) Crossover. (d) Mutation and duplication.
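The following is a minimal, self-contained Python sketch of the GP loop of Fig. 1(b): expression trees built from a small function set, fitness measured by squared error on toy data, subtree crossover, subtree mutation and duplication of the best individuals. It illustrates the general procedure only and is not the GPTIPS2 implementation used later in this paper.

```python
# Toy genetic-programming loop for symbolic regression (illustration only).
import math
import random

FUNCS = {"add": (2, lambda a, b: a + b),
         "sub": (2, lambda a, b: a - b),
         "mul": (2, lambda a, b: a * b),
         "sin": (1, math.sin)}
N_VARS = 2

def random_tree(depth=3):
    """Grow a random expression tree: internal nodes are functions, leaves are variables."""
    if depth == 0 or random.random() < 0.3:
        return ("x", random.randrange(N_VARS))
    name = random.choice(list(FUNCS))
    arity = FUNCS[name][0]
    return (name,) + tuple(random_tree(depth - 1) for _ in range(arity))

def evaluate(tree, x):
    if tree[0] == "x":
        return x[tree[1]]
    return FUNCS[tree[0]][1](*(evaluate(child, x) for child in tree[1:]))

def nodes(tree, path=()):
    """Enumerate (path, subtree) pairs; used to pick crossover/mutation points."""
    yield path, tree
    if tree[0] != "x":
        for i, child in enumerate(tree[1:], start=1):
            yield from nodes(child, path + (i,))

def replace(tree, path, new):
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace(tree[i], path[1:], new),) + tree[i + 1:]

def crossover(a, b):
    """Swap randomly chosen subtrees of two parents (Fig. 1(c))."""
    pa, sa = random.choice(list(nodes(a)))
    pb, sb = random.choice(list(nodes(b)))
    return replace(a, pa, sb), replace(b, pb, sa)

def mutate(tree):
    """Replace a randomly chosen subtree by a new random tree (Fig. 1(d))."""
    p, _ = random.choice(list(nodes(tree)))
    return replace(tree, p, random_tree(depth=2))

def fitness(tree, data):
    return -sum((evaluate(tree, x) - y) ** 2 for x, y in data)

# Toy regression target: y = x0 * x1 + sin(x0).
data = [((0.1 * i, 0.2 * j), 0.1 * i * 0.2 * j + math.sin(0.1 * i))
        for i in range(10) for j in range(5)]
pop = [random_tree() for _ in range(200)]
for _ in range(30):
    pop.sort(key=lambda t: fitness(t, data), reverse=True)
    nxt = pop[:20]                                   # duplication of the best individuals
    while len(nxt) < len(pop):
        a, b = random.sample(pop[:50], 2)
        for child in crossover(a, b):
            if random.random() < 0.3:
                child = mutate(child)
            if sum(1 for _ in nodes(child)) <= 50:   # simple bloat control
                nxt.append(child)
    pop = nxt[:200]
best = max(pop, key=lambda t: fitness(t, data))
print("best fitness:", fitness(best, data), "best tree:", best)
```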

3 Results

3.1 Data Collection and Augmentation

The OCT images used for the experiments were acquired from 7 patients. These images, with \(1024\times 512 \times 128\) voxels (11.72 \(\upmu \)m \(\times \) 5.86 \(\upmu \)m \(\times \) 15.6 \(\upmu \)m) covering a volume of 6 mm \(\times \) 6 mm \(\times \) 2 mm, were obtained with a ZEISS scanner. These 3D OCT volumes contain 4060 B-scan images with CNV, of which 60% are randomly selected as the training set, 20% as the validation set and 20% as the test set. All CNV regions in the B-scan images are delineated by a clinician. The ground truth used to train the CNN is calculated from the delineated results and normalized to the range [0, 1] by the following equation:

$$\begin{aligned} \widehat{y} = \frac{y-y_{min}}{y_{max}-y_{min}} \end{aligned}$$
(3)

where \(y\) and \(\widehat{y}\) are the area values before and after normalization, respectively, and \(y_{max}\) and \(y_{min}\) are the maximal and minimal area values in the training set.

Each image is cropped to a size of \(512\times 256\) to save memory. In addition, the cropping allows easy data augmentation: a point is randomly selected from the CNV region, and a sub-image of size \(512\times 256\) centered on this point is then cropped. In this work, this cropping process is repeated ten times for each original image.
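A minimal numpy sketch of this augmentation and of the normalization of Eq. (3) is given below. Function and variable names are illustrative, and the use of the CNV pixel count as the area value is an assumption made only for the demonstration.

```python
# Sketch of the CNV-centered random cropping and the min-max normalization.
import numpy as np

CROP_H, CROP_W = 512, 256

def random_cnv_crop(image, cnv_mask, rng):
    """Crop a CROP_H x CROP_W window centered on a randomly chosen CNV pixel."""
    ys, xs = np.nonzero(cnv_mask)                    # pixels delineated as CNV
    k = rng.integers(len(ys))
    cy, cx = ys[k], xs[k]
    # Clamp the window so it stays inside the original image.
    top = np.clip(cy - CROP_H // 2, 0, image.shape[0] - CROP_H)
    left = np.clip(cx - CROP_W // 2, 0, image.shape[1] - CROP_W)
    crop = image[top:top + CROP_H, left:left + CROP_W]
    area = cnv_mask[top:top + CROP_H, left:left + CROP_W].sum()  # CNV pixel count
    return crop, area

def normalize_areas(areas, y_min, y_max):
    """Eq. (3): min-max normalization using the training-set extrema."""
    return (np.asarray(areas, dtype=float) - y_min) / (y_max - y_min)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((1024, 512))                  # a dummy B-scan
    mask = np.zeros((1024, 512), dtype=bool)
    mask[400:450, 200:300] = True                    # a dummy CNV delineation
    crops = [random_cnv_crop(image, mask, rng) for _ in range(10)]
    areas = [a for _, a in crops]
    # Stand-in extrema for the demo (in practice: min/max over the training set).
    print(normalize_areas(areas, 0.0, float(mask.sum())))
```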

3.2 Training Details

The experiments are performed in MATLAB on a PC equipped with an Intel(R) Core(TM) i7-4790M CPU at 3.60 GHz and 8 GB of RAM. Stochastic Gradient Descent (SGD) with momentum 0.9 is employed to train the CNN, with the learning rate set to 0.002. Once training is done, the CNN takes about 0.0494 s to make a prediction.
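A minimal training-loop sketch with these settings (SGD with momentum 0.9, learning rate 0.002, least-squares loss of Eq. (2)) is shown below. It assumes the model sketch from Sect. 2.1; the batch size and epoch count are not stated in the paper and are placeholders.

```python
# Training-loop sketch: SGD (momentum 0.9, lr 0.002) with least-squares loss.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

def train(model: nn.Module, dataset: Dataset, epochs: int = 50, batch_size: int = 16):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.002, momentum=0.9)
    loss_fn = nn.MSELoss()                       # least-squares loss of Eq. (2)
    for _ in range(epochs):
        for images, areas in loader:             # normalized areas from Eq. (3)
            optimizer.zero_grad()
            loss = loss_fn(model(images), areas)
            loss.backward()
            optimizer.step()
    return model
```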

3.3 Performance of the CNN on Area Estimation

Several metrics are employed to evaluate the performance of the CNN on area estimation: mean square error (MSE), mean absolute error (MAE) and the Pearson correlation coefficient (Pcc) [8, 13]. MSE and MAE measure how close the predictions are to the actual outcomes; a smaller MSE (MAE) indicates better performance. Pcc measures the linear correlation between two variables (here, the predicted results and the ground truth) and takes values between +1 and −1, where +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. A higher correlation coefficient indicates better performance.
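For completeness, with \(y_s\) the ground-truth area, \(\widehat{y}_s\) the predicted area and \(S\) the number of test samples, these standard metrics are defined as

$$\begin{aligned} \mathrm {MSE} = \frac{1}{S}\sum _{s=1}^S (y_s-\widehat{y}_s)^2, \quad \mathrm {MAE} = \frac{1}{S}\sum _{s=1}^S |y_s-\widehat{y}_s|, \quad \mathrm {Pcc} = \frac{\sum _{s=1}^S (y_s-\bar{y})(\widehat{y}_s-\bar{\widehat{y}})}{\sqrt{\sum _{s=1}^S (y_s-\bar{y})^2}\,\sqrt{\sum _{s=1}^S (\widehat{y}_s-\bar{\widehat{y}})^2}} \end{aligned}$$

where \(\bar{y}\) and \(\bar{\widehat{y}}\) are the means of the ground-truth and predicted areas over the test set.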

Table 2 summarizes the average results obtained by the trained CNN when estimating the area of CNV directly. It is observed that the values of MSE, MAE and Pcc obtained by the direct method are 0.0022, 0.0328 and 0.9763, respectively. To further evaluate the performance of the direct method, we also compare it with a segmentation-based method. In this work, we select u-net [10], a popular network structure in the biomedical image segmentation domain, for comparison. Figure 2 shows estimation results obtained by the two methods on several images. The red curves in each sub-figure of Fig. 2 are the CNV boundaries delineated by hand and the white curves are the boundaries obtained by u-net. Table 2 also summarizes the average results obtained by the segmentation-based method. It is observed that the average results obtained by the direct method are better than those of the segmentation-based method.

Fig. 2. Estimation results obtained by the different methods on different images. Note that the red curves are the ground truth and the white curves are the results obtained by u-net. (a) gt = 0.2633, dir = 0.2653, seg = 0.2504; (b) gt = 0.0365, dir = 0.0337, seg = 0.0415; (c) gt = 0.1603, dir = 0.2503, seg = 0.2281; (d) gt = 0.0068, dir = 0.0453, seg = 0.0194. In each subfigure title, gt is the ground truth, dir is the value obtained by the direct method and seg is the value obtained by the segmentation-based method. (Color figure online)

Table 2. The results obtained by the direct method and the segmentation-based method, respectively.
Table 3. The mathematical models obtained by GP.
Table 4. The performance of each mathematical model.

3.4 Deriving the Function Approximated by CNN

As shown by the results above, the performance of the CNN used to estimate the area of CNV directly is very competitive. However, the entire process behaves like a black box, and it is difficult to derive the function being approximated by the CNN by studying its structure. To explain how the CNN works, we try to find the function it approximates. To this end, we employ GP, which can automatically evolve an empirical mathematical model from data, to derive the mathematical model of the last layer of the CNN, namely to find a mathematical model that fits the input and output of the fc1 layer. The GPTIPS2 toolbox [11] with default settings is employed to build the mathematical models automatically. Table 3 lists several models obtained by GP. Note that \(x_i\) is the \(i^{th}\) variable in the input of the fc1 layer, where \(i=1, 2, \ldots , 256\), since the input of the fc1 layer is a \(256\times 1\) vector.
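The fitting step can be illustrated with a short sketch. The paper uses the MATLAB toolbox GPTIPS2 [11]; the sketch below substitutes the open-source Python package gplearn purely as a stand-in (an assumption, not the authors' tool chain) and assumes the CNN sketch from Sect. 2.1: the 256-dimensional inputs of the fc1 layer and the scalar CNN outputs are collected with a forward hook and then passed to a symbolic regressor. All names are illustrative.

```python
# Sketch of fitting a GP surrogate to the fc1 layer (gplearn used as a stand-in).
import torch
from gplearn.genetic import SymbolicRegressor   # assumed available

def collect_fc1_data(model, images):
    """Record (fc1 input, CNN output) pairs via a forward hook on model.fc1."""
    feats = []
    handle = model.fc1.register_forward_hook(
        lambda module, inputs, output: feats.append(inputs[0].flatten(1).detach()))
    with torch.no_grad():
        preds = model(images)
    handle.remove()
    X = torch.cat(feats).numpy()                 # shape (N, 256): fc1 inputs
    y = preds.numpy()                            # shape (N,): CNN outputs
    return X, y

def fit_surrogate(X, y):
    """Run symbolic regression on the (fc1 input, output) pairs."""
    est = SymbolicRegressor(population_size=1000, generations=20,
                            function_set=('add', 'sub', 'mul', 'div', 'sin', 'cos'),
                            random_state=0)
    est.fit(X, y)
    return est   # str(est._program) gives the evolved expression in x0..x255
```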

Table 4 summarizes the performance of each model in fitting the data of the fc1 layer. It is observed that the derived mathematical models only contain n (\(n<256\)) variables, which means that the function of the fc1 layer can be approximated by an \(n\)-dimensional curve. It is also observed that the derived mathematical model can differ between GP runs, and so can its fitting performance. The best model in the list is model 5, which achieves a Pcc of 0.9426. Although the results are not outstanding, they illustrate that using GP to derive, for each layer in the CNN, a mathematical model with the same input and output as the corresponding layer could be feasible, since a different (and possibly better-fitting) model can be obtained from each GP run.

4 Discussion and Conclusion

In this paper, we employ CNNs to estimate the area of CNV in retinal OCT images directly. The results show that the performance of a CNN trained on raw images is very competitive. To explain how the trained CNN works, we try to find the function it approximates. To this end, we propose using GP to derive a surrogate model for each layer in the CNN that represents the function being approximated by the CNN. In this work, we derive such a model for the fc1 layer. The experimental results show that the function of the fc1 layer can be approximated by a high-dimensional curve. However, so far, the GP used here is only suitable for the case where the input consists of multiple variables and the output is a single variable. If we searched for such a surrogate model for every single output variable in the CNN, the amount of computation would be huge. Could we derive a mathematical model that fits a layer whose input and output are both multi-variable? This is the issue we will explore in future work.