M3GPSpectra: A novel approach integrating variable selection/construction and MLR modeling for quantitative spectral analysis

https://doi.org/10.1016/j.aca.2021.338453

Highlights

  • M3GPSpectra based on genetic programming is developed for quantitative spectral analysis.

  • M3GPSpectra is used to construct models that predict the content of seven substances.

  • M3GPSpectra outperforms seven popular spectral analysis methods (including a CNN-based method).

  • The M3GPSpectra model achieves excellent accuracy with a small number of training samples.

Abstract

Quantitative analysis of the physical or chemical properties of materials by spectral analysis combined with chemometrics has become an important method in analytical chemistry. This method aims to build a relationship (called the prediction model) between the feature variables acquired by spectral sensors and the components to be measured. Because the original spectral feature variables contain redundant information and substantial noise, feature selection or transformation should be conducted to reduce the interference of irrelevant information with the prediction model. Most existing feature selection and transformation methods are single linear or nonlinear operations, which easily lose feature information and reduce the accuracy of the subsequent prediction model. This research proposes a novel strategy, named M3GPSpectra, for constructing quantitative analysis models from spectroscopic data. The tool uses a genetic programming algorithm to select and reconstruct the original feature variables, evaluates the selected and reconstructed variables with a multiple linear regression (MLR) model, and obtains the best feature combination and the final MLR parameters through iterative learning. M3GPSpectra integrates feature selection, linear/nonlinear feature transformation, and subsequent model construction into a unified framework, which enables end-to-end parameter learning and significantly improves the accuracy of the prediction model. When applied to six types of datasets, M3GPSpectra obtains 19 prediction models, which are compared with those obtained by seven popular linear or nonlinear methods. Experimental results show that M3GPSpectra achieves the best performance among the eight methods tested. Further investigation verifies that the proposed method is not sensitive to the size of the training set. Hence, M3GPSpectra is a promising tool for quantitative spectral analysis.

Introduction

With the rapid development of chemometrics and spectrometry, spectrum-based methods have become a powerful tool for analyzing objects in agriculture, food, environmental monitoring, and other areas [1–5]. Spectral analysis quantifies the properties of an object at a lower cost than traditional methods [6,7]. Spectral analytical approaches aim to establish a linear or nonlinear quantitative relationship between the spectra and the substance content (SC), so that the quality of a given object can be evaluated indirectly from its spectral information.

Considering that raw data collected by spectral sensors usually contain severely collinear and redundant information [8], most existing studies focus on integrated calibration approaches. In such an approach, a suitable feature selection or transformation algorithm is first used to reduce the dimension of the raw data; the selected or transformed data are then used as model input to construct a prediction model [9,10]. Mathematically, the construction of this prediction model can be summarized as Equation (1):

$$\hat{y} = \mathrm{Model}(\mathrm{DR}(x)) \tag{1}$$

where $x$ and $\hat{y}$ represent the raw data and the predicted value of SC, respectively; Model( ) denotes the modeling method, and DR( ) represents the dimensionality reduction operation. From Equation (1), whether the prediction model is linear or nonlinear depends on Model( ) and DR( ). Therefore, existing integrated methods can be divided into four cases.
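The composition in Equation (1) can be illustrated with a minimal Python sketch. The function names (`make_predictor`, `select_bands`, `linear_model`), the band indices, and the coefficients are hypothetical and chosen only for illustration; they do not come from the study.

```python
from typing import Callable, List, Sequence

Spectrum = Sequence[float]


def make_predictor(
    dr: Callable[[Spectrum], List[float]],
    model: Callable[[List[float]], float],
) -> Callable[[Spectrum], float]:
    """Compose DR( ) and Model( ) as in Equation (1): y_hat = Model(DR(x))."""
    def predict(x: Spectrum) -> float:
        return model(dr(x))
    return predict


def select_bands(x: Spectrum) -> List[float]:
    """Hypothetical linear DR( ): keep bands 0 and 2 (feature selection)."""
    return [x[0], x[2]]


def linear_model(z: List[float]) -> float:
    """Hypothetical linear Model( ) with made-up coefficients."""
    return 0.5 + 2.0 * z[0] - 1.0 * z[1]


predict = make_predictor(select_bands, linear_model)
y_hat = predict([0.1, 0.9, 0.3, 0.7])  # a toy four-band spectrum
print(round(y_hat, 6))
```

Swapping `select_bands` or `linear_model` for a nonlinear function moves the same composition into one of the other three cases.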

Case 1

Model( ) and DR( ) are linear operations. In general, linear DR( ) involves feature selection and linear projection. In quantitative spectral analysis, raw data collected by the spectral sensor contain a large amount of redundant information. Hence, feature selection methods (such as the successive projections algorithm (SPA) [11], competitive adaptive reweighted sampling (CARS) [12], and the genetic algorithm (GA) [13–17]) are employed to reduce the dimensionality of the raw data, thereby improving the accuracy of the prediction model and reducing the cost of modeling. In contrast to feature selection-based methods, linear projection-based methods (including principal component regression (PCR) [18] and partial least squares regression (PLS) [19]) extract the principal components of the raw features through linear transformations for subsequent modeling. In addition, methods that combine feature selection and linear projection as DR( ) were reported in Refs. [20–22]. These methods first select information-rich features and then transform them into new features that are fed to the model. In this case, linear DR( ) operations consider only whether a linear combination of features benefits modeling and ignore the positive effect of nonlinear combinations of features, which are more diverse and can carry more information for modeling than linear combinations. Consequently, the methods in this case cannot construct well-performing nonlinear prediction models.

Case 2

DR( ) and Model( ) are nonlinear and linear operations, respectively. For example, the methods reported in Refs. [23,24] map the raw data to new coordinates with a nonlinear kernel function to increase the distinguishability of the data and then input the new data into PLS and support vector regression (SVR), respectively, to construct a prediction model. References [25,26] reported a calibration method that combines feature selection and kernel PLS. These methods first convert information-rich features in the raw data into new features by a nonlinear method and then send the new features into linear models. In addition, several methods based on convolutional neural networks (CNNs) perform the nonlinear transformation of features through multiple CNN layers with nonlinear activation functions; an output layer with a linear activation function then predicts SC from the transformed data [27,28]. Compared with Case 1, nonlinear projections support the construction of a complex nonlinear prediction model, which can generally improve model accuracy. However, these nonlinear projections may alter the patterns of the raw data and lose information-rich raw features.
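As a rough illustration of Case 2, the sketch below applies a nonlinear DR( ) (a single radial basis function feature measured against a landmark spectrum) followed by a linear Model( ) (ordinary least squares on one feature). The toy spectra, the landmark, and the function names are assumptions made for this example; this is not the kernel PLS or CNN procedure of the cited works.

```python
import math
from typing import List, Sequence, Tuple


def rbf_feature(x: Sequence[float], landmark: Sequence[float], gamma: float = 1.0) -> float:
    """Nonlinear DR( ): RBF similarity between a spectrum and a landmark spectrum."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-gamma * d2)


def fit_line(z: List[float], y: List[float]) -> Tuple[float, float]:
    """Linear Model( ): closed-form least squares for y = intercept + slope * z."""
    n = len(z)
    mz, my = sum(z) / n, sum(y) / n
    szz = sum((v - mz) ** 2 for v in z)
    slope = sum((v - mz) * (t - my) for v, t in zip(z, y)) / szz
    return my - slope * mz, slope


def rmse(z: List[float], y: List[float], intercept: float, slope: float) -> float:
    """Root-mean-square error of the fitted line on (z, y)."""
    return math.sqrt(sum((intercept + slope * v - t) ** 2 for v, t in zip(z, y)) / len(y))


# Toy two-band spectra; the target depends nonlinearly on the bands (assumed data).
X = [[0.0, 0.2], [0.5, 0.1], [1.0, 0.0], [1.5, 0.3], [2.0, 0.1], [0.8, 0.5]]
landmark = [1.0, 0.0]
y = [1.0 + 3.0 * rbf_feature(x, landmark) for x in X]

# Case 1 style: linear DR( ) (select band 0) + linear model.
z_lin = [x[0] for x in X]
rmse_linear = rmse(z_lin, y, *fit_line(z_lin, y))

# Case 2 style: nonlinear DR( ) (RBF feature) + linear model.
z_rbf = [rbf_feature(x, landmark) for x in X]
rmse_nonlinear = rmse(z_rbf, y, *fit_line(z_rbf, y))

print(rmse_linear, rmse_nonlinear)
```

Because the toy target was generated from the same RBF feature, the nonlinear variant fits it almost exactly; real spectra would not be this favorable.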

Case 3

DR( ) and Model( ) are linear and nonlinear operations, respectively. Methods based on the extreme learning machine (ELM) are typical representatives of Case 3 [29,30]. Essentially, ELM is a single-hidden-layer feedforward neural network. The input layer of the network is fully connected to the hidden layer, each node of which receives a linearly weighted sum of all the feature variables, which completes the linear transformation of the original data. The connection from the hidden layer to the output layer (with a nonlinear activation function) establishes a nonlinear relationship between the transformed data and SC.

Case 4

DR( ) and Model( ) are nonlinear operations. Recent CNN-based works [31,32], which are classified as Case 4, add nonlinear activation functions after both the CNN layers and the output layer to construct a nonlinear prediction model. Similar to Case 2, the methods of Cases 3 and 4 can construct excellent nonlinear models, but they require a large number of samples to determine the parameters of the modeling method. When the samples are too few, over-fitting may occur even when a variety of countermeasures are applied [33].

An integrated calibration method with the following characteristics may construct an accurate prediction model: (1) DR( ) can not only select information-rich features but also construct linear and nonlinear combinations of multiple features; (2) Model( ) is linear, so "big data" is not necessary for constructing a high-accuracy prediction model. The calibration method based on the multidimensional multiclass genetic programming with multidimensional populations (M3GP) algorithm integrates feature selection/projection and multiple linear regression (MLR) and thus satisfies both conditions simultaneously. To the best of the authors' knowledge, an M3GP-based method has not previously been used to establish the relationship between the spectrum and SC. This research therefore aims to explore whether an M3GP-based method is suitable for quantitative spectral analysis. Hereinafter, the M3GP-based integrated method is referred to as M3GPSpectra, and the prediction model obtained by M3GPSpectra is called the M3GPSpectra model. This study compared M3GPSpectra with seven popular calibration approaches on datasets obtained by a long-wave NIR spectrometer, a short-wave NIR spectrometer, a hyperspectral imaging (HSI) system, a Raman spectrometer, a nuclear magnetic resonance (NMR) spectrometer, and a fluorometer. Experimental results show that 18 M3GPSpectra models achieved the best accuracy and one M3GPSpectra model obtained the second-best accuracy. Furthermore, M3GPSpectra was applied to short-wave NIR and fluorescence datasets of different sizes to test the adaptability of the method. Overall, M3GPSpectra is not sensitive to the size of the training set.
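The two characteristics above can be sketched as follows: a candidate solution is a list of expression trees, each tree decodes into one selected or constructed feature, and an MLR fitted on those features provides the fitness. This is a simplified illustration under stated assumptions, not the authors' implementation: the tree encoding, the operator set, the protected division, the normal-equation solver, and the toy data are all hypothetical, and the evolutionary search of M3GP (mutation, crossover, adding or removing trees) is omitted.

```python
import math
from typing import List, Sequence, Tuple, Union

# A leaf is a band index (feature selection); a node is (op, left, right).
Tree = Union[int, Tuple[str, "Tree", "Tree"]]

OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / b if abs(b) > 1e-12 else 1.0,  # protected division
}


def decode(tree: Tree, x: Sequence[float]) -> float:
    """Evaluate one tree on one spectrum to obtain one constructed feature."""
    if isinstance(tree, int):
        return x[tree]
    op, left, right = tree
    return OPS[op](decode(left, x), decode(right, x))


def fit_mlr(Z: List[List[float]], y: List[float]) -> List[float]:
    """MLR with intercept via normal equations and Gauss-Jordan elimination.
    Assumes the constructed features are not collinear."""
    A = [[1.0] + row for row in Z]
    n = len(A[0])
    M = [
        [sum(a[i] * a[j] for a in A) for j in range(n)]
        + [sum(a[i] * t for a, t in zip(A, y))]
        for i in range(n)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


def fitness(individual: List[Tree], X: List[List[float]], y: List[float]):
    """Training RMSE of the MLR fitted on the features decoded from the trees."""
    Z = [[decode(t, x) for t in individual] for x in X]
    beta = fit_mlr(Z, y)
    preds = [beta[0] + sum(b * z for b, z in zip(beta[1:], row)) for row in Z]
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y))
    return rmse, beta


# Toy three-band spectra with a target containing a nonlinear band interaction.
X = [[1.0, 2.0, 3.0], [2.0, 1.0, 0.0], [0.0, 3.0, 1.0], [3.0, 0.0, 2.0],
     [1.0, 1.0, 1.0], [2.0, 3.0, 4.0], [4.0, 2.0, 1.0], [0.0, 1.0, 3.0]]
y = [1.0 + 3.0 * x[1] + 2.0 * x[0] * x[2] for x in X]

linear_only = [0, 1, 2]               # selection only: three raw bands
constructed = [1, ("mul", 0, 2)]      # one raw band plus one nonlinear feature

rmse_linear, _ = fitness(linear_only, X, y)
rmse_constructed, beta = fitness(constructed, X, y)
print(rmse_linear, rmse_constructed, beta)
```

In this toy setting the individual that contains the constructed product feature fits the target far better than the selection-only individual while the final model remains linear in its inputs, which is the property the two conditions describe.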

Section snippets

Description of problem

Assume that m samples with q features (spectral variables) form a training set $\{(x_i, y_i),\ i = 1, \ldots, m\}$, where $x_i \in \mathbb{R}^q$ represents the i-th training sample and $y_i$ is the target variable of the i-th training sample. A prediction model is established with M3GPSpectra to bridge the spectrum and SC based on the training samples. M3GPSpectra encodes the problem into a tree structure because DR( ) realized by decoding a tree includes feature selection/linear projection/non-linear projection, thereby

Results and discussion

This section has four subsections. Section 3.1 presents the performance of 152 prediction models established by eight types of quantitative spectral analysis methods for six types of spectral data. In Section 3.2, a comprehensive comparison of M3GPSpectra and its seven competitors is presented. Section 3.3 explores the effect of the difference between sample size and feature size on the analysis method. Finally, the relationship between the performance of M3GPSpectra and the size of training

Conclusion and future work

This study introduces a novel machine learning approach, named M3GPSpectra, for quantitative spectral analysis. The method combines feature selection/construction with MLR and enables the exchange of information between the two parts. First, M3GPSpectra selects features and constructs new linear and nonlinear features as input to the subsequent MLR. This step can reduce the redundant information in the raw spectral data and refine the constructed features. Second, MLR is applied to features

CRediT authorship contribution statement

Yu Yang: Conceptualization, Methodology, Software, Formal analysis, Investigation, Writing – original draft. Xin Wang: Investigation, Visualization, Writing – original draft. Xin Zhao: Software, Validation. Min Huang: Writing – review & editing, Formal analysis, Supervision, Visualization. Qibing Zhu: Writing – review & editing, Formal analysis, Supervision, Visualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

Dr. Qibing Zhu and Dr. Min Huang gratefully acknowledge the financial support from the National Natural Science Foundation of China (Grant nos. 61772240 and 61775086) and the 111 Project (B12018).

References (47)

  • M.B. El-Zeiny et al. An evaluation of different bio-inspired feature selection techniques on multivariate calibration models in spectroscopy. Spectroc. Acta Pt. A-Molec. Biomolec. Spectr. (2021)

  • G.X. Ren et al. Multi-variable selection strategy based on near-infrared spectra for the rapid description of dianhong black tea quality. Spectroc. Acta Pt. A-Molec. Biomolec. Spectr. (2021)

  • J.H. Yin et al. Concentration profiles of collagen and proteoglycan in articular cartilage by Fourier transform infrared imaging and principal component regression. Spectroc. Acta Pt. A-Molec. Biomolec. Spectr. (2012)

  • A. Moghimi et al. Vis/NIR spectroscopy and chemometrics for the prediction of soluble solids content and acidity (pH) of kiwifruit. Biosyst. Eng. (2010)

  • Z.L. Wang et al. Evaluating photosynthetic pigment contents of maize using UVE-PLS based on continuous wavelet transform. Comput. Electron. Agric. (2020)

  • B.M. Nicolaï et al. Kernel PLS regression on wavelet transformed NIR spectra for prediction of sugar content of apple. Chemometr. Intell. Lab. Syst. (2007)

  • Q.S. Chen et al. Comparisons of different regressions tools in measurement of antioxidant activity in green tea using near infrared spectroscopy. J. Pharmaceut. Biomed. Anal. (2012)

  • X. Huang et al. Improved kernel PLS combined with wavelength variable importance for near infrared spectral analysis. Chemometr. Intell. Lab. Syst. (2017)

  • D. Shah et al. A feature-based soft sensor for spectroscopic data analysis. J. Process Contr. (2019)

  • W. Ng et al. Convolutional neural network for simultaneous prediction of several soil properties using visible/near-infrared, mid-infrared, and their combined spectra. Geoderma (2019)

  • X.L. Zhang et al. DeepSpectra: an end-to-end deep learning approach for quantitative spectral analysis. Anal. Chim. Acta (2019)

  • C. Cui et al. Modern practical convolutional neural networks for multivariate regression: applications to NIR calibration. Chemometr. Intell. Lab. Syst. (2018)

  • Y. Yang et al. Multispectral image based germination detection of potato by using supervised multiple threshold segmentation model and Canny edge detector. Comput. Electron. Agric. (2021)