1 Introduction

Digital images uploaded to social media platforms such as Facebook and Instagram are essential to daily communication and entertainment. In the imaging pipeline, an image is captured by a camera at the front end and viewed by a human at the back end. In between, the image is digitized and compressed, transmitted through a communication channel such as the Internet or a wireless network, and then decompressed and reconstructed before viewing. Each of these steps (capturing, coding, storing, transmitting, decoding and displaying) introduces distortions that degrade the visual quality, so the user at the end of the pipeline sees a distorted version of the original image. Visual quality can be quantified by subjective quality assessment based on human judgements. However, subjective assessment is time-consuming and impractical for online applications. Image quality evaluation (IQE) models can instead be used to evaluate image quality automatically [1, 2]. IQE models [3,4,5,6] have been developed to support many image processing tasks, and their predictions can be used to tune the parameters of the pipeline in order to improve image quality.

IQE models can be classified into three categories, namely full-reference, reduced-reference and no-reference models. Full-reference models use the original undistorted image as a reference to determine the quality of a distorted image, so they apply only when the original image is available. Reduced-reference models extract features from the original image and compare them with the features of the distorted image in order to predict image quality [7, 8]. However, in many applications the original image is not available for comparison. For example, when an image is transmitted from Smartphone A to Smartphone B, the original image on Smartphone A is unknown to Smartphone B. No-reference models are therefore used in many applications, although IQE is more challenging when no information about the original image is available. They automatically predict image quality with respect to human perception when only the distorted image is given.

No-reference models can be categorized into two types, namely image feature models and machine learning models, which are explicit and implicit models, respectively. An image feature model captures a feature from the distorted image that is correlated with the subjective image quality evaluated by people. The feature indicates the amount of a particular distortion type, such as blur, blocking or ringing artifacts, and the image quality is estimated from this amount [9]. However, image feature models are not general-purpose; they can be used only when the distortion type is known in advance and is covered by the model. The other type of no-reference model, the machine learning model, maps a set of image features to subjective image quality [3]. Such models are developed from a dataset consisting of image features and subjective quality scores given by human judges. The image features correlated with perceived image quality can be extracted using statistical analysis [3]. Machine learning models have been developed based on probabilistic prediction models [10, 11], support vector machines [12], fuzzy methods [13, 14] and simple neural networks with shallow architectures [15, 16]. They do not rely on an explicit relationship between image features and subjective image quality. Motivated by the effectiveness of deep neural networks (DNNs) such as convolutional neural networks (CNNs) for object classification and detection, CNNs have been developed to perform IQE [5]. More recently, Bosses et al. [17] developed a CNN for predicting image quality, which achieved better results than many recently developed no-reference IQE models [4, 5, 11, 18, 19] and other DNNs for IQE [20, 21].

So far, only standard CNNs with cascaded convolutional and pooling operations have been implemented for IQE [17, 20, 21]. Such a CNN consists of two main stages, namely a feature extraction stage and a classification stage. In the feature extraction stage, the CNN captures image features from the distorted image using a cascade of convolutional and pooling operations. In the classification stage, those features are fed into a fully connected neural network in order to quantify the image quality. A CNN generally has millions of weights, which are initialized randomly and then updated iteratively by minimizing a loss function that quantifies the difference between the real image quality scores and the predictions. Because the features are learned in this way, some of them are significant to the classification while others are not; moreover, some image features that are strongly correlated with image quality are not captured by the cascaded convolutional and pooling operations. When those missing features are supplied as additional inputs to the classification stage, the performance of image quality evaluation can be improved.

In this paper, we propose a hybrid DNN that improves both the feature extraction stage and the classification stage of the CNNs currently used for IQE. In the feature extraction stage, the hybrid DNN integrates significant image features generated by image feature models and feeds them to the classification stage. This ensures that significant image features are included when evaluating image quality. Since these features are generally correlated with image quality, they are likely to improve IQE when they are integrated with the CNN. In the classification stage, we propose another machine learning approach, namely geometric semantic genetic programming, which uses the DNN predictions and the image features to perform IQE. In place of the fully connected neural network commonly used in the classification stage of DNNs, geometric semantic genetic programming represents the model as a tree consisting of branches and nodes. Small trees are initialized and grown iteratively to achieve better classification. The model size is therefore flexible, unlike a fully connected neural network, which has a fixed architecture with many links, some of which are not significant for classification. Unnecessary branches are not included in the tree, so the resulting model is simpler than a cumbersome fully connected neural network.

The performance of the proposed hybrid DNN is evaluated using an image quality database which contains 3000 distorted images [22]. The backbone of the hybrid DNN is the recently developed CNN [17], which is integrated with several commonly used simple image feature models, namely blocking [23], blur [24] and two ringing artifact models [25]. We deliberately select these four image feature models because they are computationally simpler than modern IQE models. If these four simple models can help the hybrid DNN to perform better IQE, better performance may be achieved when more accurate and more computationally complex models are integrated. Experimental results show that the proposed hybrid DNN performs more accurate IQE than the four image feature models and the recently developed CNN [17], which itself outperformed many IQE models [4, 5, 11, 18, 19] and neural network models [20, 21]. The rest of the paper is organized as follows: Sect. 2 discusses commonly used IQE models and the metric for evaluating them. Section 3 presents the mechanisms of the proposed hybrid DNN and the motivation for proposing it. Section 4 describes how the proposed hybrid DNN is implemented and how its performance is evaluated, together with experimental results and analysis. A conclusion is drawn in Sect. 5. In summary, the novel hybrid DNN addresses a limitation of currently used CNNs, namely that some significant image features are not included in image quality evaluation. This limitation is overcome by combining typical image features generated by classical models with the CNN, which further improves the prediction capability of the CNN for IQE.

2 Image quality evaluation

When an image, \(\bar{I} \in R^{n \times m}\), with the dimension \(n \times m\), is given, an IQE model, \(F_\mathrm{IQE}\), can be used to estimate the image quality, \(q\), of \(\bar{I}\):

$$\begin{aligned} q=F_\mathrm{IQE}(\bar{I}) \end{aligned}$$
(1)

To evaluate the performance of \(F_\mathrm{IQE}\), we can use a set of \(N_{\bar{I}}\) images, \(\bar{I}_{i}\) with \(i=1,2,...,N_{\bar{I}}\), which are contaminated with different types of noise and different levels of distortion. The quality of each image is scored by many people, and \(MOS_{i}\) is the corresponding mean opinion score (MOS) for \(\bar{I}_{i}\). Based on \(MOS_{i}\) and \(\bar{I}_{i}\), the Pearson linear correlation, r in (2), is widely used to evaluate the performance of IQE models such as \(F_\mathrm{IQE}\) [2]:

$$\begin{aligned} r=\frac{ \sum _{i=1}^{ N_{\bar{I}}} (F_\mathrm{IQE}(\bar{I}_{i})-\bar{q})(MOS_{i}-\overline{MOS}) }{ \sqrt{ \sum _{i=1}^{ N_{\bar{I}}} (F_\mathrm{IQE}(\bar{I}_{i})-\bar{q})^2 \sum _{i=1}^{ N_{\bar{I}}} (MOS_{i}-\overline{MOS})^2} } \end{aligned}$$
(2)

where \(F_\mathrm{IQE}(\bar{I}_{i})\) is the predicted image quality for the image \(\bar{I}_{i}\); \(\bar{q}\) and \(\overline{MOS}\) are the means of all \(F_\mathrm{IQE}(\bar{I}_{i})\) and all \(MOS_{i}\), respectively. In other words, r is the covariance between the predictions of \(F_\mathrm{IQE}\) and the actual mean opinion scores, divided by the product of their standard deviations. The performance of \(F_\mathrm{IQE}\) is good when the correlation (2) between the predictions \(F_\mathrm{IQE}(\bar{I}_{i})\) and the corresponding \(MOS_{i}\) is high. The correlation is strong when \( 0.5 < r \le 1\) or \( -1 \le r < -0.5\); moderate when \( 0.3 < r \le 0.5 \) or \( -0.5 \le r < -0.3\); weak when \( 0.1 < r \le 0.3 \) or \( -0.3 \le r < -0.1\); and very weak when \( -0.1 \le r \le 0.1\).
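
As a minimal sketch of how (2) can be computed, the following Python function evaluates the Pearson correlation between a list of model predictions and the corresponding MOS values. The function names `pearson_r` and `strength` are illustrative and do not come from the paper; assigning the exact boundary values 0.1, 0.3 and 0.5 to the lower band is an assumption.

```python
import math

def pearson_r(predictions, mos):
    """Pearson linear correlation of Eq. (2) between predictions and MOS values."""
    n = len(predictions)
    if n != len(mos) or n < 2:
        raise ValueError("need two equally long sequences with at least two items")
    q_bar = sum(predictions) / n
    mos_bar = sum(mos) / n
    num = sum((q - q_bar) * (m - mos_bar) for q, m in zip(predictions, mos))
    den = math.sqrt(sum((q - q_bar) ** 2 for q in predictions)
                    * sum((m - mos_bar) ** 2 for m in mos))
    if den == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return num / den

def strength(r):
    """Label |r| with the bands used in the text (boundary handling is an assumption)."""
    a = abs(r)
    if a > 0.5:
        return "strong"
    if a > 0.3:
        return "moderate"
    if a > 0.1:
        return "weak"
    return "very weak"
```

Because r depends only on deviations from the means, it is unchanged when the predictions are shifted or rescaled by a positive factor.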

2.1 Image feature models for IQE

Some distortion types, such as blurring, blocking and ringing, are correlated with image quality. If the type of distortion affecting an image is known, one can quantify the distortion level using an image feature model developed specifically for that distortion type, and then estimate the image quality from the quantified distortion. In this way, \(F_\mathrm{IQE}\) can be developed as an image feature model that quantifies the quality of a distorted image whose distortion type is known. Three image features, namely blur, blocking and ringing, are commonly used to quantify image quality [2].

  1.

    Blur artifacts are caused by camera movement, long exposure times, object movement or improper camera focus. Blur can be observed as a loss of semantic information such as object shapes. The amount of blur is correlated with the loss of spatial detail in an image; hence, blur can be quantified in the spatial-frequency domain [26]. Blur also affects object edges and fine details in an image, so it can be measured by edge spreads [27], edge gradients [28] and edge widths [29], which are correlated with the magnitude of blur.

  2.

    Blocking artifacts are caused by block-based image coding at low bit rates, packet loss in image transmission or block-based image compression. They can be observed as artificial horizontal and vertical contours along block edges. Blocking artifacts can also be caused by the quantization of pixels within image blocks, which produces discontinuities at block boundaries. To quantify the magnitude of the blocking artifact, one can quantify the edge strength at block boundaries. Blocking magnitude can be quantified by measuring the energy level of the blocky signal [23], by detecting step edges with low amplitudes [30] and by a discrete Fourier transform (DFT)-based measure [31].

  3.

    Ringing artifacts are caused by the coarse quantization of discrete wavelet transform coefficients or improper truncation of high-frequency components. They can also arise from high-frequency irregularities during reconstruction. Ringing distortion can be observed at high-contrast edges in smooth texture regions. It can be quantified by edge-detection techniques which measure the overall magnitude of edge spread in an image. Those techniques include principal components analysis [32], quantifying pixel and edge distortion [33], and changes in the statistical regularities of DWT/DCT coefficients [34].
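
To make the idea of "edge strength at block boundaries" concrete, the sketch below compares the mean absolute luminance step across block boundaries with the mean step inside blocks. It is only an illustration under stated assumptions: it is not the blocking model of [23], the 8-pixel block size and the ratio form are assumed, and the function name `blockiness` is hypothetical.

```python
import numpy as np

def blockiness(image, block=8):
    """Ratio of mean absolute steps at block boundaries to mean steps elsewhere."""
    img = np.asarray(image, dtype=np.float64)
    if img.ndim != 2:
        raise ValueError("expected a 2-D greyscale image")
    if min(img.shape) <= block:
        raise ValueError("image must be larger than one block in both directions")
    dh = np.abs(np.diff(img, axis=1))  # dh[:, c] = |img[:, c+1] - img[:, c]|
    dv = np.abs(np.diff(img, axis=0))  # dv[r, :] = |img[r+1, :] - img[r, :]|
    # A step crosses a block boundary when it leaves the last pixel of a block.
    col_edge = (np.arange(dh.shape[1]) % block) == block - 1
    row_edge = (np.arange(dv.shape[0]) % block) == block - 1
    boundary = np.concatenate([dh[:, col_edge].ravel(), dv[row_edge, :].ravel()])
    interior = np.concatenate([dh[:, ~col_edge].ravel(), dv[~row_edge, :].ravel()])
    eps = 1e-12  # avoids division by zero for perfectly flat block interiors
    return float(boundary.mean() / (interior.mean() + eps))
```

A value close to 1 suggests that boundary steps are no stronger than interior steps, whereas a much larger value suggests visible block edges. A feature of this kind responds to blocking only, which is the limitation discussed next.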

However, implementing \(F_\mathrm{IQE}\) based on the amount of a single distortion type is ineffective, since the distortion type is not known in most practical applications. When only a single distortion type is used to develop \(F_\mathrm{IQE}\), other distortion types cannot be quantified: \(F_\mathrm{IQE}\) is sensitive only to that particular distortion type, although an image can be contaminated by many distortion types. Therefore, developing \(F_\mathrm{IQE}\) based on a single distortion type is impractical.

2.2 Machine learning for IQE

To overcome the limitation of image feature models, \(F_\mathrm{IQE}\) can be developed by machine learning, in which it is trained automatically using distorted images and their MOSs. Machine learning involves two main steps [35]. First, image features corresponding to different distortion types are captured from the distorted images; second, \(F_\mathrm{IQE}\) is developed by learning the relationship between the captured image features and the MOSs. \(F_\mathrm{IQE}\) is capable of predicting visual image quality across different distortion types and image contents when images contaminated with more than one distortion type are used to develop it [4, 5, 11, 36].

Recent machine learning approaches have been developed to evaluate image quality when training data with images as inputs and MOSs as labels are available. Artusi et al. [37] proposed a DNN to evaluate image quality. However, this approach is not a no-reference image quality metric, since the original undistorted image is used as another input to the DNN and the quality of the distorted image is compared with that of the original. The approach is not practical for many applications such as image transmission or compression, since the original undistorted images are generally not available. Apart from this full-reference metric, no-reference image quality metrics based on DNNs have been developed to evaluate image quality for particular image types such as magnetic resonance images [38], sonar images [39] and images for liquid crystal displays [40]. An approach based on genetic programming was also developed to evaluate the quality of fish images [41]. Those approaches attempt to evaluate image quality before further image processing or object recognition is performed. However, they focus on a particular object type and capture only the image features of those objects. They were not developed to evaluate general images captured by cameras, which are contaminated by distortions from common sources such as image digitization and compression, transmission over the Internet or wireless networks, and image decompression and reconstruction. Although Bi et al. [42] have developed a genetic programming-based model to evaluate the quality of general images, the model is only effective for images distorted by blurring, contrast reduction or Gaussian noise. The model was not developed for multiple image distortion types.

A more robust metric based on a convolutional neural network (CNN) has been used to develop \(F_\mathrm{IQE}\), where the CNN is trained with a set of distorted images and their corresponding MOSs [17]. The performance of the CNN has been evaluated on the TID database with 24 image distortion types, and the CNN achieved better performance than many no-reference IQE models [4, 5, 11, 18, 19] and other deep neural networks [20, 21]. The CNN was developed because CNNs are effective in object classification and detection applications involving images. A distorted image is the input of the CNN, and the corresponding MOS is the label. The fundamental building blocks of a CNN are illustrated in Fig. 1. The topology of the CNN consists of many convolution and pooling layers. An image patch is the input to the first layer. The CNN uses multiple layers with pooling and convolution kernels to generate a set of features \(\overline{f_\mathrm{CNN}}=\{f^\mathrm{CNN}_1,f^\mathrm{CNN}_2,...,f^\mathrm{CNN}_{NF}\}\). The final layer uses \(\overline{f_\mathrm{CNN}}\) to predict the image quality, \(q_\mathrm{CNN}\); a fully connected neural network is generally used as the final layer.

Fig. 1 Convolutional neural network (CNN)
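
The building blocks described above can be sketched in a few lines of NumPy. This is a toy forward pass, not the architecture of [17]: the single-channel input, the ReLU activation, the 2×2 max pooling and the linear output layer are all assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def conv2d_valid(x, kernel):
    """Single-channel 'valid' cross-correlation, as used in CNN convolution layers."""
    kh, kw = kernel.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2(x):
    """Non-overlapping 2x2 max pooling; odd trailing rows/columns are dropped."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def cnn_quality(patch, kernels, fc_weights, fc_bias):
    """Feature extraction (conv + ReLU + pool) followed by a linear output layer."""
    feats = []
    for k in kernels:
        fmap = np.maximum(conv2d_valid(patch, k), 0.0)  # ReLU (assumed activation)
        feats.append(max_pool2(fmap).ravel())
    f_cnn = np.concatenate(feats)                       # feature vector f_CNN
    q_cnn = float(f_cnn @ fc_weights + fc_bias)         # predicted quality q_CNN
    return q_cnn, f_cnn
```

In a trained network the kernels and output weights would be learned from distorted images and their MOSs rather than supplied by hand.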

The CNN weights of the pooling and convolution kernels and those of the fully connected neural network are generally determined by the back-propagation algorithm. First, the CNN weights are randomly initialized. The back-propagation algorithm uses the loss function to determine the prediction error between the actual MOSs and the estimated image qualities. In each iteration, the CNN weights are updated based on the prediction error, so that the generalization capability of the CNN can improve over the iterations. When the CNN weights are properly fine-tuned, a lower prediction error is obtained. After the back-propagation algorithm has run for a certain number of iterations, the CNN can achieve satisfactory predictions.

The CNN weights for the pooling and convolution kernels are optimized with respect to the loss function and the training dataset. The pooling and convolution kernels in a CNN generate only a limited set of features, \(\overline{f_\mathrm{CNN}}\), which cover only the distortion types in the training dataset. In contrast, when the image quality is evaluated with an image feature model, the corresponding distortion type is guaranteed to be quantified. For example, when a model is developed specifically to quantify image blur, the blur distortion in the image can be fully quantified. Image feature models therefore do not share this limitation of the CNN, which relies only on the training dataset and cannot fully quantify some image features.

3 Proposed hybrid deep neural network

In this paper, we propose a hybrid deep neural network (hybrid DNN) which integrates CNN evaluations and image features captured by distortion metrics in order to predict image quality. The hybrid DNN attempts to overcome the limitations of both the CNN approach and the single-distortion metrics. The hybrid DNN, illustrated in Fig. 2, uses geometric semantic genetic programming (GSGP) [43, 44] to predict image quality based on CNN predictions and image features captured by image feature models. The GSGP is chosen over neural networks and regression models for two reasons. (1) The GSGP is a heuristic algorithm which can explore better solutions as it keeps running, whereas the coefficients of a regression model and the weights of a neural network are determined by the least squares method and the backpropagation method, respectively, which only reach local optima. (2) In a regression model, the model structure, including the interaction terms and orders, has to be predefined; likewise, the configuration of a neural network has to be predefined based on the user’s experience. The models generated by the GSGP have more variants, and their structures are optimized automatically as the algorithm keeps running. The models generated by the GSGP are also more transparent than neural networks.

Fig. 2 Hybrid DNN integrating CNN and distortion metrics

The hybrid DNN model, \(F_{DNNGP}\) in (3), integrates the image features and CNN prediction by using the GSGP, in order to determine the image quality:

$$\begin{aligned} q_{DNNGP}=F_\mathrm{DNNGP}(\overline{f_\mathrm{Dis}},q_\mathrm{CNN}) \end{aligned}$$
(3)

where \(\overline{f_\mathrm{Dis}}=\{f_1^\mathrm{Dis},f_2^\mathrm{Dis},...,f_{N_d}^\mathrm{Dis}\}\) is a set of \(N_d\) image features, and \(q_\mathrm{CNN}\) is the CNN prediction. \(F_\mathrm{DNNGP}\) attempts to include the image features which are generated by the image feature models but are not captured by the CNN.

3.1 Algorithmic flow

The DNNGP-Algorithm in Algorithm 1 is proposed to generate \(F_\mathrm{DNNGP}\). The flow of the DNNGP-Algorithm is illustrated in Fig. 3. To develop \(F_\mathrm{DNNGP}\), a set of \(N_D\) IQE samples, \(\{ \overline{f_\mathrm{Dis}}^j,q_\mathrm{CNN}^j \varvec{|} MOS^j \}\) with \(j=1,2,...,N_D\), is collected, where \(q_\mathrm{CNN}^j\) and \(\overline{f_\mathrm{Dis}}^j\) are the inputs of the \(j^{th}\) sample, namely the CNN prediction and the image features of the \(j^{th}\) image, and \(MOS^j\) is its label, namely the mean opinion score of the \(j^{th}\) image. \(F_\mathrm{DNNGP}\) attempts to correlate the inputs with the labels. In the DNNGP-Algorithm, the predefined number of generations is denoted as \(\Gamma _\mathrm{max}\). A population of \(N_\mathrm{POP}\) models, namely \(\overline{F_\mathrm{DNNGP}}^i=\{F_{\mathrm{DNNGP},1}^i,F_{\mathrm{DNNGP},2}^i,...,F_{\mathrm{DNNGP},N_\mathrm{POP}}^i\}\), is initialized randomly, where i is the generation number of the genetic process; \(i=0\) denotes the first generation. Each \(F_{\mathrm{DNNGP},k}^i\) with \(k=1,2,...,N_\mathrm{POP}\) is represented as a geometric semantic tree, where the arithmetic operations { +, −, ×, / } are used as the tree nodes, and the image features, \(f_1^\mathrm{Dis},f_2^\mathrm{Dis},...,f_{N_d}^\mathrm{Dis}\), and the CNN prediction, \(q_\mathrm{CNN}\), are used as the tree terminals. Nonlinear functions, such as exponential or sinusoidal functions, could also be used in the nodes; the proposed DNNGP-Algorithm uses only arithmetic operations since their execution time is shorter than that of nonlinear functions.

Fig. 3 Flow of DNNGP-algorithm

The proposed DNNGP-Algorithm reproduces new models using two geometric semantic operators, namely geometric semantic crossover and geometric semantic mutation [45], which are discussed in Sect. 3.4. An initially empty set, \(\overline{S}\), is created to store the generalization capabilities of all models in the current generation \({\overline{F_\mathrm{DNNGP}}}^i\). The generalization capabilities of the models are evaluated by the proposed fitness function, \({\mathfrak {FIT}}\), which is discussed in Sect. 3.3. \({\mathfrak {FIT}}\) evaluates the correlation between the predictions of \({F_\mathrm{DNNGP}}\) and the actual MOSs of the images. The commonly used tournament selection method is used to select good models from the current \(i^{th}\) generation for the \((i+1)^{th}\) generation.

Algorithm 1 DNNGP-Algorithm
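
The generational flow described in this section can be sketched as follows. This is a simplified illustration and not a reproduction of Algorithm 1: each model is represented only by its semantics (its vector of outputs on the training samples), random models combine two terminals with +, − or ×, and the elitism step, the crossover probability `p_cross`, the tournament size, the mutation constant `c_m` and the function name `gsgp_sketch` are all assumptions. The columns of `X` are taken to hold the image features and the CNN prediction.

```python
import numpy as np

OPS = (np.add, np.subtract, np.multiply)  # division omitted to avoid divide-by-zero

def fitness(pred, mos):
    """Pearson correlation with the MOS labels; degenerate predictions score -1."""
    if np.std(pred) == 0.0:
        return -1.0
    r = float(np.corrcoef(pred, mos)[0, 1])
    return r if np.isfinite(r) else -1.0

def random_semantics(X, rng):
    """Outputs of a random model (x_a op x_b) on every training sample."""
    a, b = rng.integers(0, X.shape[1], size=2)
    return OPS[rng.integers(len(OPS))](X[:, a], X[:, b])

def squash(r):
    """R' = 1 / (1 + e^R); the clipping only guards against overflow in exp."""
    return 1.0 / (1.0 + np.exp(np.clip(r, -50.0, 50.0)))

def tournament(pop, fits, rng, size=3):
    """Return the fittest of `size` randomly drawn models."""
    idx = rng.choice(len(pop), size=size, replace=False)
    return pop[max(idx, key=lambda i: fits[i])]

def gsgp_sketch(X, mos, n_pop=30, n_gen=20, p_cross=0.7, c_m=0.1, seed=0):
    """Run the loop and return the best fitness seen in each generation."""
    rng = np.random.default_rng(seed)
    pop = [random_semantics(X, rng) for _ in range(n_pop)]
    history = []
    for _ in range(n_gen):
        fits = [fitness(p, mos) for p in pop]
        history.append(max(fits))
        new_pop = [pop[int(np.argmax(fits))]]          # keep the best model (elitism)
        while len(new_pop) < n_pop:
            parent = tournament(pop, fits, rng)
            if rng.random() < p_cross:                 # geometric semantic crossover
                other = tournament(pop, fits, rng)
                r = squash(random_semantics(X, rng))
                child = parent * r + (1.0 - r) * other
            else:                                      # geometric semantic mutation
                r1 = squash(random_semantics(X, rng))
                r2 = squash(random_semantics(X, rng))
                child = parent + c_m * (r1 - r2)
            new_pop.append(child)
        pop = new_pop
    history.append(max(fitness(p, mos) for p in pop))
    return history
```

Because the best model is copied unchanged into the next generation in this sketch, the recorded best fitness cannot decrease from one generation to the next.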

3.2 Model representation for image quality predictions

The model generated by the DNNGP-Algorithm predicts the image quality, \(y_\mathrm{DNNGP}\), when the image features, \(f^\mathrm{Dis}_1, f^\mathrm{Dis}_2,...,f^\mathrm{Dis}_{N_d} \), and the CNN prediction, \(q_\mathrm{CNN}\), are given. In the algorithm, the model, \(F_{\mathrm{DNNGP},k}^i\), is represented as the following regular expression:

$$\begin{aligned} y_\mathrm{DNNGP} &= F_{\mathrm{DNNGP},k}^i\left( f^\mathrm{Dis}_1,f^\mathrm{Dis}_2,...,f^\mathrm{Dis}_{N_d},q_\mathrm{CNN}\right) \\ &= \left\{ (A \vee string) \cdot a \right\} ^{*} \cdot (A \vee string) \end{aligned}$$
(4)

where \(A=(string \cdot a \cdot string)\); \(a\in \{+,-,\times ,\div \} \); \(string\in \{f^\mathrm{Dis}_1, f^\mathrm{Dis}_2,...,f^\mathrm{Dis}_{N_d},q_\mathrm{CNN}\}\); ‘\(\cdot \)’ denotes concatenation; ‘\(\vee \)’ is the OR operation; and ‘\(*\)’ is the Kleene star, which is a unary operation. The expression in (4) combines the model arguments, \(f^\mathrm{Dis}_1, f^\mathrm{Dis}_2,...,f^\mathrm{Dis}_{N_d}\) and \(q_\mathrm{CNN}\), in order to predict \(y_\mathrm{DNNGP}\). The arguments are combined by the arithmetic operations ‘\(+\)’, ‘−’, ‘\(\times \)’ and ‘\(\div \)’. The model includes both the CNN prediction and the image features which are correlated with the image distortions. The model thereby overcomes the limitation of the CNN [17], which may not capture those image features.

An example of such an expression is shown in (5) and Fig. 4. In this example, the final prediction is correlated with the CNN prediction, \(q_\mathrm{CNN}\), and also with the image features, \(f_2^\mathrm{Dis}\), \(f_8^\mathrm{Dis}\) and \(f_{10}^\mathrm{Dis}\). Such predictions are more robust for estimating the quality of images contaminated with different distortion types.

$$\begin{aligned} y_\mathrm{DNNGP}&= \left( f_2^\mathrm{Dis} \times f_{10}^\mathrm{Dis}\right) \div (q_\mathrm{CNN}) + \left( f_{10}^\mathrm{Dis}-f_8^\mathrm{Dis}\right) \\ &= \frac{f_2^\mathrm{Dis} \times f_{10}^\mathrm{Dis}}{q_\mathrm{CNN}}+\left( f_{10}^\mathrm{Dis}-f_8^\mathrm{Dis}\right) \end{aligned}$$
(5)

Fig. 4 An example of image quality evaluation model in geometric semantic tree
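
As an illustration of how such a tree yields a prediction, the sketch below encodes the model in (5) as nested tuples and evaluates it recursively. The tuple encoding, the terminal names (`f2`, `f8`, `f10`, `q_cnn`) and the function `evaluate` are assumptions made for this example and are not taken from the paper; plain division is used, so a zero-valued \(q_\mathrm{CNN}\) would raise an error.

```python
import operator

# Arithmetic node types of the tree; "/" is ordinary (unprotected) division.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def evaluate(tree, inputs):
    """Evaluate a tree that is either a terminal name or a tuple (op, left, right)."""
    if isinstance(tree, str):
        return inputs[tree]
    op, left, right = tree
    return OPS[op](evaluate(left, inputs), evaluate(right, inputs))

# Tree for Eq. (5): (f2 * f10) / q_cnn + (f10 - f8)
EQ5 = ("+", ("/", ("*", "f2", "f10"), "q_cnn"), ("-", "f10", "f8"))
```

For instance, with the purely illustrative inputs `{"f2": 2.0, "f10": 3.0, "f8": 1.0, "q_cnn": 4.0}`, the tree evaluates to 6/4 + 2 = 3.5.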

3.3 Fitness evaluations

The correlation function in (2) is reformulated as the fitness function of the proposed DNNGP-Algorithm, \(\mathfrak {FIT}\), in order to evaluate the performance of the image quality prediction model, \(F_{\mathrm{DNNGP},k}^i\). \(\mathfrak {FIT}\) is defined in (6) in terms of the real \(MOS^j\) and the image quality predictions obtained by \(F_{\mathrm{DNNGP},k}^i\) for the \(j^{th}\) IQE image sample, \(\{ \overline{f_\mathrm{Dis}^j},q_\mathrm{CNN}^j \varvec{|} MOS^j \}\) with \(j=1,2,...,N_D\).

$$\begin{aligned} \mathfrak {FIT}(F_{\mathrm{DNNGP},k}^i)=\frac{ \sum _{j=1}^{ N_{D}} (F_{\mathrm{DNNGP},k}^i(\overline{f_\mathrm{Dis}^j},q_\mathrm{CNN}^j)-\overline{F_{\mathrm{DNNGP},k}^i(\overline{f_\mathrm{Dis}^j},q_\mathrm{CNN}^j)})(MOS^{j}-\overline{MOS}) }{ \sqrt{ \sum _{j=1}^{ N_{D}} (F_{\mathrm{DNNGP},k}^i(\overline{f_\mathrm{Dis}^j},q_\mathrm{CNN}^j)-\overline{F_{\mathrm{DNNGP},k}^i(\overline{f_\mathrm{Dis}^j},q_\mathrm{CNN}^j)})^2 \sum _{j=1}^{ N_{D}} (MOS^{j}-\overline{MOS})^2} } \end{aligned}$$
(6)

where \(\overline{F_{\mathrm{DNNGP},k}^i(\overline{f_\mathrm{Dis}^j},q_\mathrm{CNN}^j)}=\frac{\sum _{j=1}^{ N_{D}} F_{\mathrm{DNNGP},k}^i(\overline{f_\mathrm{Dis}^j},q_\mathrm{CNN}^j)}{N_{D}}\) is the mean of the image quality predictions obtained by \(F_{\mathrm{DNNGP},k}^i\); \(\overline{MOS}=\frac{\sum _{j=1}^{N_D}MOS^j}{N_D}\) is the average of the mean opinion scores of the IQE image samples. The denominator is proportional to the product of the standard deviations of the image quality predictions and the real MOSs. The correlation indicates whether the real MOSs can be explained by the image quality predictions. The performance of \(F_{\mathrm{DNNGP},k}^i\) is good when the correlation between the image quality predictions and the real MOSs is high. In machine learning applications, the mean square error and the mean absolute error are commonly used to indicate whether the machine predictions are close to the real observations. However, image quality is mostly evaluated by human judgement, which is subjective and perceptual, so the evaluations are less precise than machine judgements. Therefore, the mean square error and mean absolute error are not the most suitable metrics for the fitness function. Instead, the correlation between the subjective MOSs and the image quality predictions of \(F_{\mathrm{DNNGP},k}^i\) in (6) is used as the fitness function. The correlation indicates the consistency between the machine predictions and the human visual judgements.
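
One practical difference between the two kinds of metric can be illustrated numerically: the correlation is unchanged when the predictions are expressed on a different scale, whereas the mean square error is not. The sketch below uses made-up toy values purely for illustration; they are not results from the paper, and the function name `compare_metrics` is hypothetical.

```python
import numpy as np

def compare_metrics(pred, mos):
    """Return (Pearson correlation, mean square error) between predictions and MOS."""
    pred = np.asarray(pred, dtype=np.float64)
    mos = np.asarray(mos, dtype=np.float64)
    r = float(np.corrcoef(pred, mos)[0, 1])
    mse = float(np.mean((pred - mos) ** 2))
    return r, mse

# Toy values for illustration only (not data from the paper).
mos = np.array([1.0, 2.0, 3.0, 4.0, 5.0])              # scores on a 1-5 scale
pred_same_scale = np.array([1.2, 1.9, 3.1, 3.8, 5.2])  # predictions on the same scale
pred_rescaled = 20.0 * pred_same_scale                  # same ordering, larger scale

r1, mse1 = compare_metrics(pred_same_scale, mos)
r2, mse2 = compare_metrics(pred_rescaled, mos)
```

Here `r1` and `r2` agree up to rounding, while `mse2` is far larger than `mse1`, although both prediction vectors rank the images identically.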

3.4 Reproductions of image prediction models

After the fitness of each model \(F_{\mathrm{DNNGP},k}^i\) is evaluated using (6), some models in the current generation are selected for reproduction in order to form the new generation. The new models are candidates which have the potential to predict image quality better. A new model is reproduced by combining an existing model with a random model in the form of (4), so that some of its components come from the random model. In the DNNGP-Algorithm, the following two operators, namely geometric semantic crossover and geometric semantic mutation [45], are used to reproduce new models.

  a.

    Geometric semantic crossover, namely CRO, combines the components of two current models, \(F_{\mathrm{DNNGP},k_1}^i\) and \(F_{\mathrm{DNNGP},k_2}^i\), in order to reproduce a new model. Compared to \(F_{\mathrm{DNNGP},k_1}^i\) and \(F_{\mathrm{DNNGP},k_2}^i\), the new model has the potential to generate better predictions of image quality. CRO first generates a random model, R. CRO then performs a mapping from the 2D-dimension \(n \times n\) to the 1D-dimension n. The new model, newmodel, is reproduced as:

    $$\begin{aligned} newmodel&= {CRO}(F_{\mathrm{DNNGP},k_1}^i,F_{\mathrm{DNNGP},k_2}^i,R) \\&= (F_{\mathrm{DNNGP},k_1}^i \cdot R') + ((1-R') \cdot F_{\mathrm{DNNGP},k_2}^i) \end{aligned}$$
    (7)

    where \(R'\) is given by \(R' = \frac{1}{1+e^{R}}\).

  b.

    Geometric semantic mutation, namely MUT, creates a new model based on a single model, \(F_{\mathrm{DNNGP},k}^i\). The new model involves new components which have the potential to generate better image quality predictions. First, MUT creates two random models, \(R_{1}\) and \(R_{2}\). MUT then performs a mapping in the 1D-dimension n. A new model, namely newmodel, is created as:

    $$\begin{aligned} newmodel&= {MUT}(F_{\mathrm{DNNGP},k}^i,R_{1},R_{2}) \\&= F_{\mathrm{DNNGP},k}^i + c_m \cdot (R'_{1} - R'_{2}) \end{aligned}$$
    (8)

    where \(c_m\) is a constant; \(R'_{1}=\frac{1}{1+e^{R_{1}}}\) and \(R'_{2}=\frac{1}{1+e^{R_2}}\) are the functional values of \(R_{1}\) and \(R_{2}\), respectively.
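
A short worked note may help to show what the two operators do to the predictions. It assumes only the definitions in (7) and (8) and that the random models return finite values, so that \(0< R', R'_{1}, R'_{2} < 1\) for every input. Writing \(F_1\) and \(F_2\) for the two parent models and \(F\) for the mutated model (shorthand introduced here), the following holds for each input:

$$\begin{aligned} \min (F_1,F_2)&\le R' \cdot F_1 + (1-R') \cdot F_2 \le \max (F_1,F_2) \\ \left| \left( F + c_m \cdot (R'_{1} - R'_{2})\right) - F \right|&= |c_m| \cdot |R'_{1} - R'_{2}| < |c_m| \end{aligned}$$

In words, the prediction of a crossover child always lies between the predictions of its two parents, and a mutation changes each prediction by less than \(|c_m|\), so \(c_m\) acts as a step size.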

4 Experimental results and analysis

In this section, we describe how the proposed hybrid DNN is implemented and evaluate its performance. Section 4.1 discusses the database which we use to evaluate the algorithmic performance. Section 4.2 discusses how the proposed hybrid DNN is implemented. Section 4.3 presents the experimental results and comparisons with other methods, including the CNN and the commonly used image feature models for IQE.

4.1 Image quality assessment database

The prediction capability of the proposed hybrid DNN is evaluated by predicting the image qualities of a commonly used image quality assessment database, namely TID2013 [22], which was developed for academic research. The TID2013 database is an extended version of the older TID2008 database [46]. TID2013 contains 3000 distorted images which were generated from 25 reference images. The distorted images are contaminated with 24 distortion types at 5 levels (25 × 24 × 5 = 3000). Each distorted image is created by contaminating a reference image with one distortion type at a single distortion level. Those distortions are caused by camera operations, image transmission, image compression and conventional image processing. The TID2013 database also contains images with exotic distortions, which do not occur in general image processing applications but are challenging for image quality prediction algorithms. The image qualities of the distorted images in TID2013 were evaluated in 985 subjective experiments, which involved human observers from 5 countries: Finland, France, Italy, Ukraine and the USA. MOSs from 0 to 100 were scored and rescaled to the range from 1 to 5. The images contaminated with the 24 distortion types are shown in Fig. 5. These 24 distortion types cover many image processing applications. TID2013 is commonly used to assess the performance of image quality prediction models [22]. We evaluate the prediction capability of the proposed hybrid DNN based on (2), which is the correlation between the algorithmic predictions and the true visual qualities of the images.

Fig. 5 The 24 distortion types of the TID2013 database

Besides TID2013, another image quality assessment database, the LIVE database (Footnote 2), is used to develop the hybrid DNN, which estimates image quality when a distorted image is given [47, 48]. The distorted images in the LIVE database are created from 29 undistorted reference images. These 29 images were contaminated with 5 distortion types, namely additive white Gaussian noise, Gaussian blur, a simulated fast fading Rayleigh channel, JP2K compression and JPEG compression, each at many distortion levels. The qualities of the distorted images were evaluated by more than 25,000 human image quality judgements. Quality difference scores of the distorted images were obtained by comparing the reference images with the distorted images. The quality difference scores range from 0 to 100, and a lower score indicates better image quality. Hence, the quality difference scores differ from the MOS: a higher MOS is better, whereas a lower quality difference score is better.

To evaluate the performance of the proposed DNNGP-Algorithm, the LIVE database is used to train the hybrid DNN, and the TID2013 database is used to validate the generalization capability of the trained hybrid DNN. TID2013 has 24 distortion types, whereas LIVE has only 5, so some image distortions included in TID2013 are not covered when the LIVE database is used to train the hybrid DNN. These experiments evaluate whether the performance of the proposed hybrid DNN is better than that of the CNN [17] and the other compared algorithms [23,24,25]. The proposed hybrid DNN uses classical image feature models to capture distortion types which are correlated with image quality. This approach overcomes the limitation of the CNN, whose training relies only on the training database. If some image distortion types are not included in the training database, the CNN is unlikely to estimate the qualities of images with those distortions accurately, since the CNN relies fully on the training samples. In addition, each of the four image feature models is developed for only one particular distortion type. We verify whether the performance difference between the proposed hybrid DNN and the other tested methods is significant.

4.2 Algorithmic implementation of proposed hybrid DNN

The implementation of the proposed hybrid DNN in Fig. 2 consists of three main components, the GSGP-Algorithm, the CNN and the image feature models, which are illustrated in Fig. 6. The GSGP-Algorithm [44] is used because it is computationally simpler than classical genetic programming models and is more capable of generating nonlinear models than statistical regression. The CNN is implemented with a convolutional neural network, namely Boses-CNN [17], which was recently developed for predicting image quality; experimental results showed that the Boses-CNN is significantly better than the commonly used image quality metrics. In the proposed hybrid DNN, the image feature models are selected as Blocking artifact, Blur artifact, Ringing artifact (edge magnitude) and Ringing artifact (edge gradient) [49], which are correlated with image quality and have small computational costs. We intentionally do not implement the most modern and computationally complex models; instead, we select these four models, which are simple and not the most modern. If these four computationally simple models already improve the image quality predictions, even better performance can be expected from the proposed hybrid DNN when more modern and computationally complex models are integrated.

The implementation details of the GSGP-Algorithm are given in Sect. 4.2.1. The implementations of CNN and distortion metrics are described in Sects. 4.2.2 and 4.2.3, respectively.

Fig. 6 Implementation of the proposed hybrid DNN

4.2.1 GSGP-Algorithm

The GSGP-Algorithm integrates the four image feature models and the Boses-CNN in order to develop the hybrid DNN for IQE. The following algorithmic parameters are used in the GSGP-Algorithm: population size = 200; maximum number of generations = 50; generation gap = 0.9; crossover probability = 0.5; mutation probability = 0.5; tournament selection with a size of 4. These parameters were determined experimentally such that the GSGP-Algorithm generates a hybrid DNN with convincing prediction capability.
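The parameter settings above can be collected in a small configuration object, as in the following minimal sketch. It also illustrates tournament selection and the standard geometric semantic crossover and mutation operators acting directly on output (semantic) vectors. This is not the C++ implementation used in the experiments: the names GSGPConfig, tournament_select, semantic_crossover and semantic_mutation are hypothetical, the random trees of GSGP are replaced by random numbers in [0, 1) drawn per sample, and fitness is assumed to be a value where higher is better.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GSGPConfig:
    """Parameter values as listed in the text; field names are illustrative."""
    population_size: int = 200
    max_generations: int = 50
    generation_gap: float = 0.9
    crossover_probability: float = 0.5
    mutation_probability: float = 0.5
    tournament_size: int = 4


def tournament_select(fitnesses, size, rng):
    """Return the index of the fittest of `size` randomly drawn individuals."""
    contestants = rng.sample(range(len(fitnesses)), size)
    return max(contestants, key=lambda i: fitnesses[i])


def semantic_crossover(sem_a, sem_b, rng):
    """Geometric semantic crossover: r * a + (1 - r) * b, with r in [0, 1)."""
    child = []
    for a, b in zip(sem_a, sem_b):
        r = rng.random()  # stands in for the output of a random tree
        child.append(r * a + (1.0 - r) * b)
    return child


def semantic_mutation(sem, step, rng):
    """Geometric semantic mutation: s + step * (r1 - r2), with r1, r2 in [0, 1)."""
    return [s + step * (rng.random() - rng.random()) for s in sem]


config = GSGPConfig()
rng = random.Random(0)
parent_a = [1.0, 2.0, 3.0]
parent_b = [3.0, 2.0, 1.0]
child = semantic_crossover(parent_a, parent_b, rng)
mutant = semantic_mutation(parent_a, 0.1, rng)
winner = tournament_select([0.1, 0.9, 0.5, 0.3], config.tournament_size, rng)
```

Each crossover output lies between the two parent outputs for the same sample, which is the property that makes the operator geometric in semantic space.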

The image quality samples in the TID database were used to generate the proposed hybrid DNN. Since the TID database is created from 25 reference images, 25-fold cross-validation was used to evaluate the prediction capability of the proposed hybrid DNN. The image quality samples were divided into 25 folds, where each fold contains the distorted images derived from one of the reference images. In each validation, 24 of the 25 folds were used to develop the model, and the remaining fold was used to validate the prediction capability of the model. The performance of the models was evaluated based on the Pearson linear correlation in (2), which indicates the correlation between the actual MOSs and the model predictions. The proposed GSGP-Algorithm was coded based on the publicly available C++ framework for GSGP (Footnote 3). The GSGP-Algorithm was run on a P510 Xeon E5-1630 v4 machine with 32 GB memory and two GTX1080 GPU cards.
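The fold construction described above, in which all distorted versions of one reference image are held out together, can be sketched as follows. The function name leave_one_reference_out is hypothetical, and the assumption of 120 distorted images per reference image follows from 3000 images divided over 25 reference images.

```python
from collections import defaultdict


def leave_one_reference_out(reference_ids):
    """Yield (held_out_reference, train_indices, test_indices) per reference image."""
    groups = defaultdict(list)
    for index, ref in enumerate(reference_ids):
        groups[ref].append(index)
    for ref in sorted(groups):
        test = groups[ref]
        train = [i for other in sorted(groups) if other != ref for i in groups[other]]
        yield ref, train, test


# One reference id per distorted image: 25 references x 120 images = 3000 samples.
reference_ids = [ref for ref in range(1, 26) for _ in range(120)]
folds = list(leave_one_reference_out(reference_ids))
```

Grouping by reference image keeps every distorted version of the held-out image out of the training data, so no image content is shared between the training and validation folds.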

4.2.2 CNN

Boses-CNN [17] was implemented in the proposed hybrid DNN. The backbone of Boses-CNN is based on the VGGnet architecture, which uses cascaded convolution kernels with the size of 3\(\times \)3 [50]. The VGGnet consists of eight convolution layers and four maxpool layers. Since the VGGnet was developed only for images with the size of 224\(\times \)224 pixels, extra image resizing is performed between the images and the input of the VGGnet by two convolution layers and a single maxpool layer. The output of the VGGnet is a set of image features which are fed into two cascaded fully connected neural networks and a single pooling layer. To avoid overfitting the Boses-CNN, dropout regularization is used when determining the network parameters. Based on [17], the total number of parameters in Boses-CNN is about 5.2 million. A detailed description of the Boses-CNN can be found in [17]. Implementations of the Boses-CNN are also publicly available; they have been trained on two image quality databases, namely TID and LIVE, and can be downloaded (Footnote 4).

The proposed hybrid model integrates the Boses-CNN which is trained on the LIVE database. In [17], experiments were conducted to evaluate the performance of the Boses-CNNs in predicting image quality. Their prediction performance was evaluated on the TID2013 database, so that the training images do not overlap with the test images. Experimental results showed that the Boses-CNNs achieve more accurate image quality predictions than other recently developed no-reference image quality metrics, namely Saad-BLIINDS-II [11], Mittal-DIIVINE [4], Mittal-BRISQUE [5], Ghadiyaram-NIQE [18], Ye-CORN-A [19] and Zhang-SOM [51], and than other deep neural networks, namely Kang-CNN [20] and Kim-BIECON [21].

4.2.3 Image feature models

The following image features (Fig. 7; Footnote 5), namely blocking, blur and ringing artifacts, which commonly exist in images, are integrated into the proposed hybrid DNN. The Boses-CNN in the hybrid DNN is trained on the LIVE database, which contains the distortion types of blur and image compression. Although those distortion types are similar to those addressed by the following image features, the Boses-CNN is trained on only a limited number of images contaminated by those distortion types, and may not cover all the levels of those distortions. The proposed hybrid DNN integrates the CNN predictions and the four image feature models in order to compensate for those uncovered levels and thereby improve the image quality predictions.

  • Blocking artifacts: The metric of blocking artifacts was developed by Wang et al. [23]. Blocking artifacts are usually caused by image compression such as JPEG and JPEG2000. The artifacts appear along block boundaries and are caused by the independent quantization of individual blocks. Three parameters indicate the blocking measure. The first parameter quantifies the blockiness by measuring the overall differences across block boundaries. The other two parameters quantify image blur in the horizontal and vertical directions based on the differences between blocks and a zero-crossing rate, respectively. A high-order polynomial function is used to combine the three parameters in order to quantify the blocking artifacts.

  • Blur artifacts: The metric of blur artifacts was developed by Marziliano et al. [24]. Blur reduces spatial detail and degrades object shapes in an image. It appears as an overall increase in edge smoothness, or equivalently a reduction in edge sharpness. The metric quantifies the blur by measuring the widths of vertical edges in the image. Only vertical edges are detected since this requires less computation than including both vertical and horizontal edges. The metric first uses the Sobel filter, which is commonly used for edge detection [52], to detect vertical edges. A blur measure is computed at each detected edge, and the overall blur is quantified as the average of all those blur measures.

  • Ringing (edge magnitude and gradient): The metric of ringing artifacts was developed by Saha et al. [25]. Ringing artifacts appear near high-contrast edges which lie in smooth textures. They are caused by image processing or transmission when high-frequency components exist in the images. In the metric of ringing artifacts, two parameters are quantified. First, Sobel edge detection [52] is used to generate the edge image from the original one. Based on the edge image, the first parameter is measured as the overall edge magnitude. The second parameter indicates the overall edge gradients in both the vertical and horizontal directions. These two parameters are used to quantify the image activity which is correlated with ringing artifacts (Fig. 7).
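Both the blur metric and the ringing metric above start from Sobel edge detection. The following minimal sketch shows only that shared step: it applies the two 3\(\times \)3 Sobel kernels to a grey-level image stored as a list of rows and reports the mean edge magnitude and the mean absolute horizontal and vertical gradients. It is not an implementation of the metrics in [24] or [25]; the function names and the two tiny test images are hypothetical.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # responds to vertical edges
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # responds to horizontal edges


def filter3x3(image, kernel):
    """Correlate a 2-D list `image` with a 3x3 kernel over the valid interior."""
    rows, cols = len(image), len(image[0])
    out = []
    for y in range(1, rows - 1):
        row = []
        for x in range(1, cols - 1):
            acc = 0.0
            for dy in range(3):
                for dx in range(3):
                    acc += kernel[dy][dx] * image[y + dy - 1][x + dx - 1]
            row.append(acc)
        out.append(row)
    return out


def sobel_features(image):
    """Return (mean edge magnitude, mean |gx|, mean |gy|) of the image."""
    gx = filter3x3(image, SOBEL_X)
    gy = filter3x3(image, SOBEL_Y)
    count = len(gx) * len(gx[0])
    magnitude = sum(math.hypot(a, b)
                    for row_x, row_y in zip(gx, gy)
                    for a, b in zip(row_x, row_y)) / count
    mean_abs_gx = sum(abs(a) for row in gx for a in row) / count
    mean_abs_gy = sum(abs(b) for row in gy for b in row) / count
    return magnitude, mean_abs_gx, mean_abs_gy


# A sharp vertical step edge and a softened (ramp) version of the same edge.
sharp = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
soft = [[0, 51, 102, 153, 204, 255] for _ in range(6)]
sharp_features = sobel_features(sharp)
soft_features = sobel_features(soft)
```

In this toy case the softened edge yields a lower mean edge magnitude than the sharp edge, which is the kind of change that edge-based quality features respond to.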

Fig. 7 Four image features

4.3 Experimental results and statistical tests

Since the GSGP-Algorithm is a heuristic algorithm, it involves random operations such as population initialization, mutation and crossover. Different hybrid DNNs are therefore generated in different runs, even when the same parameters are used in the GSGP-Algorithm. For this reason, the GSGP-Algorithm was run 30 times. 25-fold cross-validations were conducted, in which each trial corresponds to one reference image. The averages over the 30 runs for each reference image were recorded and are shown in Table 1. Section 4.3.1 shows the correlation performance of the proposed hybrid DNN and the other tested methods. Section 4.3.2 presents the statistical tests, including the t-test, F-test and Tukey’s range test, applied to the results of the proposed hybrid DNN.

4.3.1 Correlation performance

Table 1 also shows the cross-validation results for the Boses-CNN and the proposed hybrid DNNs. For Image 1, the models were developed based on the distorted images and the MOSs of Images 2–25. After the models were developed, they were used to predict the MOSs of the distorted images which were not used for training. The Pearson linear correlations between the predicted image qualities and the actual qualities are calculated in order to evaluate whether the predictions of the models are linearly correlated with the actual qualities. If the correlation is high, the model is capable of explaining the linear relationship between the actual MOSs and the predictions. The table also shows the correlations for the four commonly used image quality metrics for blocking artifacts, blur artifacts, and ringing artifacts (edge magnitude and gradient). In the first row, for Image 1, the means of the Pearson linear correlations obtained by the proposed hybrid DNNs are generally larger than those obtained by the Boses-CNN, which are in turn larger than those of the four commonly used image quality metrics. Similar results can be found in the subsequent rows for Image 2 to Image 25. Therefore, the proposed hybrid DNNs are generally better than the Boses-CNN models, which in turn generate more accurate image quality predictions than the four commonly used image quality metrics.

All four commonly used image quality metrics achieved poorer results than the proposed hybrid DNNs and the Boses-CNN, because each of the four metrics addresses only one of the four artifacts, namely blocking artifacts, blur artifacts and the two ringing artifacts (edge magnitude and gradient). If such a metric is used to measure images with other distortion types, those distortions cannot be quantified. For example, the metric for quantifying blur artifacts is not effective for images with blocking artifacts; if it is used to measure images contaminated with blocking artifacts, the predictions of image quality are likely to be poor. Therefore, quantifying a single distortion type is not enough to quantify the overall image distortion, since the image database contains distorted images which are contaminated by other distortion types. Since both the Boses-CNN and the proposed hybrid DNNs are developed with images contaminated by many distortion types, their overall image quality predictions are generally better than those of the four image quality metrics.

When the performance of the Boses-CNN and the proposed hybrid DNNs is compared, the correlations in Table 1 show that the proposed hybrid DNN outperforms the Boses-CNN. The image quality predictions of the proposed hybrid DNN integrate both the predictions from the Boses-CNN and the image quality metrics. The exact quantity of a distortion type can be evaluated by the corresponding distortion metric. The image quality metrics do not share the limitation of the Boses-CNN, which is trained on a limited number of features from a database. When images contaminated with some distortion types are not included for training, the quality predictions for those images can be poorer, and hence the overall predictions of the Boses-CNN are also likely to be poorer. For this reason, the proposed hybrid DNN is generally better than the Boses-CNN.

Table 1 Correlation to MOS using different methods

4.3.2 Statistical tests

To further compare the prediction performance of the four commonly used image quality metrics, the Boses-CNN and the proposed hybrid DNN, the t-test [53] was used. The t-test evaluates the significance of the hypothesis that the Pearson linear correlations achieved by the proposed hybrid DNN are larger than those of the four commonly used image quality metrics and the Boses-CNN. According to the t-distribution table, the hypothesis holds with a 99.9% confidence level when the t-value is higher than 3.99. Table 2 shows the t-values for the proposed hybrid DNN compared with the four commonly used image quality metrics and the Boses-CNN; the second row shows the t-values for the Pearson linear correlations. Since all the t-values exceed this threshold, the Pearson linear correlations achieved by the proposed hybrid DNN are significantly larger than those of the other five tested methods with a very high confidence level of 99.9%. Therefore, the hybrid DNN obtains better Pearson linear correlations in terms of image quality predictions than the other five tested methods.
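A t-value of this kind can be computed as in the following minimal sketch. The text does not state which variant of the t-test was used, so the sketch assumes a paired test over per-image correlations of two methods; the function name paired_t_statistic and the correlation values are hypothetical.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the paired differences scores_a - scores_b."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.fmean(differences)
    sd_diff = statistics.stdev(differences)  # sample standard deviation
    if sd_diff == 0:
        raise ValueError("t statistic is undefined when all differences are equal")
    return mean_diff / (sd_diff / math.sqrt(len(differences)))


# Hypothetical per-image correlations of two methods, for illustration only.
method_a = [0.60, 0.55, 0.62, 0.58]
method_b = [0.50, 0.50, 0.50, 0.50]
t_value = paired_t_statistic(method_a, method_b)
```

The resulting value would then be compared with the critical value from a t-distribution table for the chosen confidence level and degrees of freedom.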

We have also conducted the F-test on the Pearson linear correlations achieved by the proposed hybrid DNN, which was developed by the proposed DNNGA-Algorithm, compared with the other five methods, which were all developed to predict the qualities of distorted images. Table 2 shows that the P-values for the hybrid DNN compared with the four image quality metrics are all reported as 0.0000, and that the P-value for the hybrid DNN compared with the Boses-CNN is 0.0002. Therefore, we can claim with significant confidence that the hybrid DNN generated by the DNNGA-Algorithm produces better image quality predictions than the other five tested methods.
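For reference, an F statistic comparing the mean correlations of several methods can be computed as in the following minimal sketch. The text does not specify the form of the F-test, so the sketch assumes a one-way analysis of variance; the function name one_way_f_statistic and the data are hypothetical.

```python
def one_way_f_statistic(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more samples than groups")
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        raise ValueError("F statistic is undefined when within-group variance is zero")
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical samples for three methods, for illustration only.
example_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value = one_way_f_statistic(example_groups)
```

A P-value is obtained by comparing the statistic with the F distribution with k - 1 and N - k degrees of freedom, where k is the number of groups and N is the total number of samples.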

Since the p-values are all zero or close to zero, a post-hoc analysis, namely Tukey’s range test, is used to further validate whether the performance of the proposed hybrid DNN generated by the proposed DNNGA-Algorithm differs significantly from that of the other five tested methods. The results in Fig. 8a–e show that the mean Pearson linear correlation achieved by the proposed hybrid DNNs is better than those achieved by the four image quality metrics and the Boses-CNN, respectively. These results further validate the performance of the proposed hybrid DNN.
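The statistic underlying Tukey’s range test can be sketched as follows. The sketch assumes equally sized groups and computes only the studentized range statistic q for each pair of groups; the critical value must still be looked up for the chosen confidence level. The function name tukey_q_statistics and the data are hypothetical.

```python
import math


def tukey_q_statistics(groups):
    """Return {(i, j): q} for every pair of equally sized groups."""
    k = len(groups)
    n = len(groups[0])
    if k < 2 or n < 2 or any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes >= 2 equally sized groups of size >= 2")
    means = [sum(g) / n for g in groups]
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_within = ss_within / (k * n - k)       # pooled within-group variance
    if ms_within == 0:
        raise ValueError("q is undefined when within-group variance is zero")
    standard_error = math.sqrt(ms_within / n)
    return {(i, j): abs(means[i] - means[j]) / standard_error
            for i in range(k) for j in range(i + 1, k)}


# Hypothetical samples for three methods, for illustration only.
sample_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
q_values = tukey_q_statistics(sample_groups)
```

A pair of methods is declared significantly different when its q value exceeds the critical value of the studentized range distribution for k groups and k(n - 1) degrees of freedom.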

These validations further demonstrate that the proposed hybrid DNN, which is a novel version of the CNN, is better than the recently developed Boses-CNN. The prediction of the hybrid DNN integrates the Boses-CNN prediction and the four image quality metrics. The integration is performed by the proposed DNNGA-Algorithm, which uses the five predictions in order to achieve better predictions with higher Pearson linear correlations. Therefore, better results can be achieved by the proposed hybrid DNN.

These experimental results demonstrate that the proposed hybrid DNN achieves higher correlations with the real MOSs in the TID database than the five tested methods, namely the blocking artifact metric, the blur artifact metric, the two ringing artifact metrics (edge magnitude and edge gradient) and the Boses-CNN. The results are also validated by the t-test, F-test and Tukey’s range test, which show that the proposed approach achieves significantly more accurate quality predictions than the five tested methods. The proposed hybrid DNN outperforms the Boses-CNN, for which some distortion types are not included in training. Better results are achieved because the hybrid DNN integrates the image features generated by the four image feature models, so that some distortion types which are not covered by the Boses-CNN are included in the hybrid DNN. The hybrid DNN is also better than each individual image feature model, which is robust on only a single distortion type.

Table 2 Correlation to MOS using different methods

5 Conclusion

In this paper, a novel hybrid DNN was developed to perform automatic IQE. The proposed hybrid DNN consists of two stages, namely a feature extraction stage and a classification stage. In the feature extraction stage, the proposed approach integrates image features captured by IQE models in order to predict image quality. This ensures that significant features correlated with image quality are provided by the image feature models. It overcomes the limitation of the recently developed CNN, in which image features are captured only randomly and significant ones cannot be guaranteed to be included. In the classification stage, a tree-based model, namely geometric semantic genetic programming, integrates the image features to perform the final image quality predictions. The approach is simpler than cumbersome fully connected neural networks.

The performance of the proposed approach was evaluated on the TID image quality database, which is commonly used to evaluate the performance of image quality metrics. The database consists of 3000 distorted images which were contaminated with 24 distortion types at 5 levels. We have compared the proposed approach with four state-of-the-art IQE metrics and a powerful CNN which outperformed many IQE models. The mean correlation achieved by the proposed hybrid DNN is 0.57, which is higher than those of the tested methods, including the state-of-the-art IQE metrics and the powerful CNN for IQE. The t-test, F-test and Tukey’s range test on the correlation results showed that the proposed approach achieves significantly more accurate quality predictions than the tested methods, with a 99.9% confidence level.

In the future, we will incorporate more modern IQE models into the proposed hybrid DNN. It is expected that further improvement can be achieved, but a longer execution time is required when more computationally complex models are integrated. We will therefore seek a tradeoff between prediction accuracy and computational time.

Fig. 8 Tukey’s range test