M3GPSpectra: A novel approach integrating variable selection/construction and MLR modeling for quantitative spectral analysis
Introduction
With the rapid development of chemometrics and spectrometry, spectrum-based methods have become a powerful tool for analyzing objects in areas such as agriculture, food, and environmental monitoring [1-5]. Spectral analysis quantifies an object at a lower cost than traditional methods [6,7]. It aims to establish a linear or nonlinear quantitative relationship between the spectra and the substance content (SC), so that the quality of a given object can be evaluated indirectly from its spectral information.
Because raw data collected by spectral sensors usually contain severely collinear and redundant information [8], most existing studies focus on integrated calibration approaches. In such an approach, suitable feature selection or transformation algorithms are first used to reduce the dimension of the raw data; the selected or transformed data are then used as model input to construct a prediction model [9,10]. Mathematically, the construction of this prediction model can be summarized as Equation (1):
$$\hat{y} = \mathrm{Model}(\mathrm{DR}(x)) \qquad (1)$$
where $x$ and $\hat{y}$ represent the raw data and the predicted value of SC, respectively; Model( ) denotes the modeling method, and DR( ) denotes the dimensionality reduction operation. From Equation (1), whether the prediction model is linear or nonlinear depends on Model( ) and DR( ). Existing integrated methods can therefore be divided into four cases.
Case 1: Model( ) and DR( ) are both linear operations. In general, linear DR( ) involves feature selection and linear projection. In quantitative spectral analysis, raw data collected by the spectral sensor contain a large amount of redundant information. Hence, feature selection methods, such as the successive projections algorithm (SPA) [11], competitive adaptive reweighted sampling (CARS) [12], and the genetic algorithm (GA) [13-17], are employed to reduce the dimensionality of the raw data, thereby improving the accuracy of the prediction model and reducing the cost of modeling. In contrast to feature selection-based methods, linear projection-based methods, including principal component regression (PCR) [18] and partial least squares regression (PLS) [19], extract the principal components of the raw features through linear transformations for subsequent modeling. In addition, methods that combine feature selection and linear projection as DR( ) were reported in Refs. [20-22]. These methods first select information-rich features and then transform them into new features that are fed to the model.
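As a rough illustration of Equation (1) for Case 1, the following minimal sketch composes a linear DR( ) (plain feature selection) with a linear Model( ) (ordinary least squares). The data are synthetic, and the selected indices and the helper names `dr_select`, `fit_mlr`, and `predict_mlr` are assumptions introduced for illustration only; they do not come from the paper or from any of the cited methods.

```python
import numpy as np

def dr_select(X, idx):
    """Linear DR( ): keep only the columns (spectral variables) listed in idx."""
    return X[:, idx]

def fit_mlr(Z, y):
    """Linear Model( ): fit y ~ b0 + Z @ b by ordinary least squares."""
    A = np.column_stack([np.ones(len(Z)), Z])  # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(Z, coef):
    """Apply the fitted intercept and slopes to reduced data Z."""
    return coef[0] + Z @ coef[1:]

# Synthetic stand-in for spectra: 40 samples, 100 variables (not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 100))
idx = [10, 45, 80]  # hypothetical "selected" variables
y = 2.0 + X[:, 10] - 0.5 * X[:, 45] + 3.0 * X[:, 80]  # noise-free target

coef = fit_mlr(dr_select(X, idx), y)
y_hat = predict_mlr(dr_select(X, idx), coef)  # y_hat = Model(DR(x))
```

Because both steps are linear, the fitted model is linear in the raw variables. Replacing either `dr_select` or the least-squares step with a nonlinear operation would correspond to one of the other cases.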
In this case, the linear DR( ) operations consider only whether linear combinations of features benefit modeling and ignore the positive effect of nonlinear combinations of features, which are more diverse and carry more information for modeling than linear combinations. Consequently, the methods in this case cannot construct well-performing nonlinear prediction models.
Case 2: DR( ) and Model( ) are nonlinear and linear operations, respectively. For example, the methods reported in Refs. [23,24] map the raw data to new coordinates with a nonlinear kernel function to increase the separability of the data and then feed the transformed data into PLS and support vector regression (SVR), respectively, to construct a prediction model. References [25,26] reported a calibration method that combines feature selection and kernel PLS. These methods first convert some information-rich features in the raw data into new features by a nonlinear method and then send the new features into linear models. In addition, several methods based on convolutional neural networks (CNNs) complete the nonlinear transformation of features through multiple CNN layers with nonlinear activation functions; an output layer with a linear activation function is then employed to predict SC from the transformed data [27,28]. Compared with Case 1, nonlinear projections support the construction of a complex nonlinear prediction model, which can generally improve model accuracy. However, these nonlinear projections may alter the patterns of the raw data and lose information-rich raw features.
Case 3: DR( ) and Model( ) are linear and nonlinear operations, respectively. Methods based on the extreme learning machine (ELM) are typical representatives of Case 3 [29,30]. Essentially, ELM is a single-hidden-layer feedforward neural network.
The input layer of the network is fully connected to the hidden layer, each node of which receives the linearly weighted output of all the feature variables, completing the linear transformation of the original data. The connection from the hidden layer to the output layer (with a nonlinear activation function) establishes a nonlinear relationship between the transformed data and SC.
Case 4: DR( ) and Model( ) are both nonlinear operations. Recent CNN-based works [31,32], which are classified as Case 4, add nonlinear activation functions after the CNN layers and the output layer to construct a nonlinear prediction model. Similar to Case 2, the methods of Cases 3 and 4 can also construct excellent nonlinear models, but they require a large number of samples to determine the parameters of the modeling method. When the samples are too few, over-fitting may occur even when a variety of tricks are applied [33].
An integrated calibration method with the following characteristics may construct an accurate prediction model: (1) DR( ) can not only pick information-rich features but also construct linear and nonlinear combinations of multiple features; (2) Model( ) is linear, so "big data" is not necessary for constructing a high-accuracy prediction model. The calibration method based on the multidimensional multiclass genetic programming with multidimensional populations (M3GP) algorithm integrates feature selection/projection and multiple linear regression (MLR) and thus satisfies the two conditions simultaneously. To the best of the authors' knowledge, an M3GP-based method has not been used to establish the relationship between the spectrum and SC. This research therefore aims to explore whether an M3GP-based method is suitable for quantitative spectral analysis. Hereinafter, the M3GP-based integrated method is referred to as M3GPSpectra, and the prediction model obtained by M3GPSpectra is called the M3GPSpectra model.
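The two characteristics listed above can be sketched as follows: a DR( ) step that mixes selected, linearly combined, and nonlinearly combined variables, followed by a linear MLR step. This is only an illustrative sketch on synthetic data. The constructed features are fixed by hand here, whereas an M3GP-based method would search for them with genetic programming, which is not shown. The names `constructed`, `transform`, `fit_mlr`, and `rmse` are assumptions made for this example.

```python
import numpy as np

# Hand-written stand-ins for constructed features (hypothetical choices).
constructed = [
    lambda X: X[:, 3],               # plain feature selection
    lambda X: X[:, 7] + X[:, 12],    # linear combination of two variables
    lambda X: X[:, 20] * X[:, 31],   # nonlinear combination of two variables
]

def transform(X, features):
    """DR( ): evaluate each constructed feature and stack the results as columns."""
    return np.column_stack([f(X) for f in features])

def fit_mlr(Z, y):
    """Model( ): ordinary least squares with an intercept term."""
    A = np.column_stack([np.ones(len(Z)), Z])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def rmse(Z, y, coef):
    """Root-mean-square error of the fitted linear model on (Z, y)."""
    return float(np.sqrt(np.mean((coef[0] + Z @ coef[1:] - y) ** 2)))

# Synthetic data: 60 samples, 50 variables; the target depends on a product term.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 50))
y = 1.0 + 0.5 * X[:, 3] + 2.0 * (X[:, 7] + X[:, 12]) - 1.5 * X[:, 20] * X[:, 31]

# MLR on constructed features (linear in parameters, nonlinear in raw variables).
Z_new = transform(X, constructed)
coef_new = fit_mlr(Z_new, y)
rmse_new = rmse(Z_new, y, coef_new)

# Baseline: MLR on the same raw variables without the nonlinear combination.
Z_raw = X[:, [3, 7, 12, 20, 31]]
coef_raw = fit_mlr(Z_raw, y)
rmse_raw = rmse(Z_raw, y, coef_raw)
```

In this synthetic setting the model stays linear in its coefficients, so only four parameters are estimated, yet it captures the product term that a purely linear fit on the raw variables cannot represent.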
This study compared M3GPSpectra with six popular calibration approaches on datasets obtained with a long-wave NIR spectrometer, a short-wave NIR spectrometer, a hyperspectral imaging (HSI) system, a Raman spectrometer, a nuclear magnetic resonance (NMR) spectrometer, and a fluorometer. Experimental results show that 18 M3GPSpectra models achieved the best accuracy and one M3GPSpectra model achieved suboptimal accuracy. Furthermore, M3GPSpectra was applied to short-wave NIR and fluorescence datasets of different sizes to test the adaptability of the method. Overall, M3GPSpectra is not sensitive to the number of training samples used.
Section snippets
Description of problem
Assume that m samples with q features (spectral variables) form a training set $\{(x_i, y_i)\}_{i=1}^{m}$, where $x_i \in \mathbb{R}^{q}$ represents the i-th training sample and $y_i$ is defined as the target variable of the i-th training sample. A prediction model is established with M3GPSpectra to bridge the spectrum and SC based on the training samples. M3GPSpectra encodes the problem into a tree structure because DR( ) realized by decoding a tree includes feature selection/linear projection/non-linear projection, thereby
Results and discussion
This section has four subsections. Section 3.1 presents the performance of 152 prediction models established by eight types of quantitative spectral analysis methods for six types of spectral data. In Section 3.2, a comprehensive comparison of M3GPSpectra and its seven competitors is presented. Section 3.3 explores the effect of the difference between sample size and feature size on the analysis method. Finally, the relationship between the performance of M3GPSpectra and the size of training
Conclusion and future work
This study introduces a novel machine learning approach, named M3GPSpectra, for quantitative spectral analysis. The method combines feature selection/construction with MLR and realizes the exchange of information between the two parts. First, M3GPSpectra selects features and constructs novel linear and nonlinear features as input to the subsequent MLR. This step can reduce the redundant information in the raw spectral data and refine the constructed features. Second, MLR is applied to features
CRediT authorship contribution statement
Yu Yang: Conceptualization, Methodology, Software, Formal analysis, Investigation, Writing – original draft. Xin Wang: Investigation, Visualization, Writing – original draft. Xin Zhao: Software, Validation. Min Huang: Writing – review & editing, Formal analysis, Supervision, Visualization. Qibing Zhu: Writing – review & editing, Formal analysis, Supervision, Visualization.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
Dr. Qibing Zhu and Dr. Min Huang gratefully acknowledge the financial support from the National Natural Science Foundation of China (Grant nos. 61772240 and 61775086) and the 111 Project (B12018).
References (47)
- et al. Quantitative analysis modeling of infrared spectroscopy based on ensemble convolutional neural networks. Chemometr. Intell. Lab. Syst. (2018)
- et al. Interval combination iterative optimization approach coupled with SIMPLS (ICIOA-SIMPLS) for quantitative analysis of surface-enhanced Raman scattering (SERS) spectra. Anal. Chim. Acta (2020)
- et al. Machine learning based on holographic scattering spectrum for mixed pollutants analysis. Anal. Chim. Acta (2021)
- et al. Raman spectral analysis for non-invasive detection of external and internal parameters of fake eggs. Sensor. Actuator. B Chem. (2020)
- et al. A review of near infrared spectroscopy and chemometrics in pharmaceutical technologies. J. Pharmaceut. Biomed. Anal. (2007)
- et al. Non-linear calibration models for near infrared spectroscopy. Anal. Chim. Acta (2014)
- et al. A deep learning based feature extraction method on hyperspectral images for nondestructive prediction of TVB-N content in Pacific white shrimp (Litopenaeus vannamei). Biosyst. Eng. (2019)
- et al. A consensus successive projections algorithm – multiple linear regression method for analyzing near infrared spectra. Anal. Chim. Acta (2015)
- et al. Determination of SSC in pears by establishing the multi-cultivar models based on visible-NIR spectroscopy. Infrared Phys. Technol. (2019)
- et al. Comparison of several variable selection methods for quantitative analysis and monitoring of the Yangxinshi tablet process using near-infrared spectroscopy. Infrared Phys. Technol. (2020)
- An evaluation of different bio-inspired feature selection techniques on multivariate calibration models in spectroscopy. Spectroc. Acta Pt. A-Molec. Biomolec. Spectr.
- Multi-variable selection strategy based on near-infrared spectra for the rapid description of dianhong black tea quality. Spectroc. Acta Pt. A-Molec. Biomolec. Spectr.
- Concentration profiles of collagen and proteoglycan in articular cartilage by Fourier transform infrared imaging and principal component regression. Spectroc. Acta Pt. A-Molec. Biomolec. Spectr.
- Vis/NIR spectroscopy and chemometrics for the prediction of soluble solids content and acidity (pH) of kiwifruit. Biosyst. Eng.
- Evaluating photosynthetic pigment contents of maize using UVE-PLS based on continuous wavelet transform. Comput. Electron. Agric.
- Kernel PLS regression on wavelet transformed NIR spectra for prediction of sugar content of apple. Chemometr. Intell. Lab. Syst.
- Comparisons of different regressions tools in measurement of antioxidant activity in green tea using near infrared spectroscopy. J. Pharmaceut. Biomed. Anal.
- Improved kernel PLS combined with wavelength variable importance for near infrared spectral analysis. Chemometr. Intell. Lab. Syst.
- A feature-based soft sensor for spectroscopic data analysis. J. Process Contr.
- Convolutional neural network for simultaneous prediction of several soil properties using visible/near-infrared, mid-infrared, and their combined spectra. Geoderma
- DeepSpectra: an end-to-end deep learning approach for quantitative spectral analysis. Anal. Chim. Acta
- Modern practical convolutional neural networks for multivariate regression: applications to NIR calibration. Chemometr. Intell. Lab. Syst.
- Multispectral image based germination detection of potato by using supervised multiple threshold segmentation model and Canny edge detector. Comput. Electron. Agric.
Cited by (8)
A novel hybrid variable selection strategy with application to molecular spectroscopic analysis
2023, Chemometrics and Intelligent Laboratory Systems
EBM3GP: A novel evolutionary bi-objective genetic programming for dimensionality reduction in classification of hyperspectral data
2023, Infrared Physics and Technology
Predictions of multiple food quality parameters using near-infrared spectroscopy with a novel multi-task genetic programming approach
2023, Food Control
Citation Excerpt: After evolution, the offspring population of the same size as the parent population is obtained. Five STL methods, named PLSR (Pan, Lu, et al., 2015), LS-SVR (Lee, Shim, Kim, Lee, & Lim, 2022), M3GPSpectra (Yang et al., 2021), Deepspectra (Zhang et al., 2019), M3GP-LSSVR, and four MTL methods, named MTLSSVR (Lin et al., 2014), PCA-MTLSSVR, MTR-GL (Wu et al., 2020), and MTCNN (Padarian, Minasny, & McBratney, 2019), are applied to compare with EM4GPO on the predictions of the FPQs for apple and sugar beet. These methods need to learn the prediction model in the data-driven training mode.