Abstract

Soybean saponin is a natural antioxidant with anti-inflammatory activity. In this paper, hyperspectral analysis was applied to detect soybean saponin content rapidly and nondestructively. Firstly, spectral preprocessing methods were compared, and standard normal variate (SNV) was selected to remove noise. Secondly, a two-step hybrid variable selection approach based on synergy interval partial least squares (SiPLS) and iteratively retaining informative variables (IRIV) was proposed to extract characteristic variables. Then, an ensemble learning model was constructed from a back propagation neural network (BPNN), deep forest (DF), partial least squares regression (PLSR), and extreme gradient boosting (EXG). Finally, image information was combined with spectral data to improve model accuracy. The prediction coefficient of determination ($R^2$) of the final model reached 0.9216. The method provides rapid, nondestructive, and accurate detection of soybean saponin content, and the combination of spectral and image information offers a new approach for applications of hyperspectral imaging.

1. Introduction

Soybean is an economically vital food crop. Because it contains protein, fat, saponin, and amino acids, soybean is mainly used for oil production, edible proteins, building materials, and cosmetics [1, 2]. Soybean saponins are metabolic products formed during soybean growth, and their content ranges from 0.6% to 6.2% [3, 4]. Soybean saponin has multifaceted physiological functions, such as anticancer, antiaging, antiallergic, and antiviral effects [5–8]. Prolonged consumption of soybean saponin can effectively mitigate diseases such as hypertension, hyperlipidaemia, and obesity [9–11]. Detection of soybean saponin content is therefore important for quality testing in soybean breeding.

Traditional methods for detecting soybean saponin content rely on wet chemistry, such as high-performance liquid chromatography [12], colorimetry [13], and liquid chromatography-mass spectrometry [14]. However, these methods suffer from cumbersome procedures, high cost, or subjective results [15]. Therefore, it is crucial to develop a rapid, accurate, and low-cost method for detecting soybean saponin content.

In recent years, spectral analysis techniques have been widely used to detect crop nutrient content because they are fast, easy to operate, and do not damage the sample. Compared with near-infrared technology, hyperspectral technology offers a wider wavelength range and higher information localization accuracy. Because it also acquires the spatial distribution of spectral data, hyperspectral technology can collect pixel-level spectral data of crops, which gives it significant advantages in crop analysis. Guo et al. [16] detected the moisture content of individual soybeans with a visible-near-infrared hyperspectral imaging device, using the interval variable iterative space shrinkage approach and the successive projection algorithm. Song et al. [17] conducted nondestructive detection of moisture and fatty acid content in rice using a near-infrared spectroscopic imaging device combined with PLSR. Zhang et al. [18] detected 25 different nutrients in soybeans, including soybean saponin, from near-infrared reflectance spectra, with an $R^2$ of 0.35 for saponin. Berhow et al. [19] developed a multiple linear regression model for detecting the content of soy isoflavones and saponins based on 3200 soybean samples. The model detected isoflavones effectively, but the $R^2$ of the soybean saponin model was only 0.6.

Although combining spectral and image information increases the amount of data, it provides higher spatial resolution and captures both the light absorption and the surface grayscale variations of the measured substance. Zheng et al. [20] detected soil total nitrogen content by combining infrared spectra with image information, obtaining an $R^2$ of 0.815 and a root mean square error (RMSE) of 0.153. Wang et al. [21] identified soybean damage from high-spatial-resolution hyperspectral images obtained by combining hyperspectral imaging with RGB images, with a model accuracy of 98.36%.

Compared with single-spectrum data, the combination of spectral and image information can reduce the impact of two well-known ambiguities: different spectra from the same substance and the same spectrum from different substances. Additionally, visual characteristics are better preserved when spectral and image information are combined, because of the higher spatial resolution.

Gao and Xu [22] compared single spectral information with combined spectral and texture colour information for detecting soluble solid content in red earth grapes and showed that the combination of spectral and image information effectively improved the model's detection capability. Xu et al. [23] applied the spectral and image information combination method to detect nitrogen content in rice leaves at different growth stages. The results showed that combining spectral and image information reduced interference from soil and water; $R^2$ increased by 0.05 to 0.09, while RMSE decreased by 0.011 to 1 for various models.

The accuracy of soybean saponin content detection was not high in previous studies, because the saponin content of soybean is small and single spectral information is insufficient to express it. In this paper, spectral and image information in hyperspectral data was combined to improve the accuracy of the soybean saponin content detection model. Spectral preprocessing, spectral feature band selection, and an ensemble learning model with skip connections and multihead self-attention mechanisms were studied to further improve model accuracy.

2. Materials and Methods

2.1. Test Materials and Data Acquisition
2.1.1. Soybean Sample

Soybean samples are provided by the Agriculture College of Northeast Agricultural University. Ten soybean varieties are selected: Beidou 5, Sui 03-3952, Hongfeng 3, Chundou 1, Dongnong 60, Beidou 14, Huajiang 1, L-58Keburi, Zhongpin 03-5373, and Dongnong 50. Thirty samples without defects are collected for each variety, giving a total of 300 soybean samples. All samples are stored in a cool and well-ventilated place. The spectral-physicochemical value cooccurrence distance (SPXY) algorithm is used to divide the samples into a training set and a test set in a ratio of 7:3.

2.1.2. Spectral and Image Data Acquisition

Hyperspectral data of soybean samples are collected with a Hyperspec III hyperspectral imager from HEADWALL Company. Samples are placed on the tray of a moving platform with a moving speed of 3.5 mm·s⁻¹, and the camera exposure time is 38.84 ms. Reflection values of soybean samples are used as spectral information, with 495 bands from 463 nm to 957 nm. The RGB image of the soybean is output at the same time.

After collection of spectral data and image data, spectral reflectance values of each sample are extracted from regions of interest (ROI) using ENVI 5.3. Although reflectance may vary across different positions on a soybean, the overall trend remains consistent, so subsequent modelling results are not affected [24]. In this study, the ROI is a rectangular area of 10 × 10 pixels at the soybean centre. The obtained spectral reflection value is adjusted using the black and white correction method to obtain accurate spectral reflectance. The formula for black and white correction is as follows:

$$R = \frac{I - B}{W - B} \tag{1}$$

where $R$ represents the soybean spectral reflectance, $I$ represents the soybean spectral reflection value, $W$ represents the spectral reflection value of the whiteboard, and $B$ represents the spectral reflection value of the blackboard.
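The black and white correction can be sketched in a few lines. This is a minimal illustration, assuming one intensity value per band for the sample, the whiteboard, and the blackboard; the function name `black_white_correct` and the numbers are hypothetical and are not taken from the study.

```python
def black_white_correct(raw, white, dark):
    """Convert reflection values to reflectance, band by band.

    raw, white, dark: equal-length sequences holding the sample,
    whiteboard, and blackboard reflection values for each band.
    """
    if not (len(raw) == len(white) == len(dark)):
        raise ValueError("all three spectra must have the same number of bands")
    reflectance = []
    for i_raw, i_white, i_dark in zip(raw, white, dark):
        span = i_white - i_dark
        if span == 0:
            raise ValueError("whiteboard and blackboard values coincide in a band")
        # (sample - blackboard) / (whiteboard - blackboard)
        reflectance.append((i_raw - i_dark) / span)
    return reflectance


# Made-up values for three bands, for illustration only.
sample = [1200.0, 1500.0, 1800.0]
white_ref = [4000.0, 4100.0, 4200.0]
dark_ref = [200.0, 100.0, 200.0]
corrected = black_white_correct(sample, white_ref, dark_ref)
print(corrected)
```

In practice the same operation would be applied to every pixel of the ROI before averaging, which is presumably what the ENVI workflow does internally.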

2.1.3. Determination of Soybean Saponin Content

In this study, soybean saponin content is measured using liquid chromatography-mass spectrometry (LC-MS). A 100 mg sample is placed in a 1.5 mL centrifuge tube and mixed with 300 μL of 75% methanol/water mixed solvent (containing 0.1% formic acid, v/v). The mixture is vortexed for 30 seconds and subjected to ultrasound treatment in a water bath at 20°C for 15 minutes. After vortexing for an additional 2 minutes, the sample is centrifuged at 12,000 rpm at 4°C for 20 minutes. The supernatant is then collected and analysed by LC-MS. The acquired raw mass spectrometry data are processed using Agilent Profinder software. Data processing steps include retention time correction, peak identification, peak extraction, peak integration, and peak alignment. Agilent Massive Parallel Processor software is used for statistical processing and, combined with the KEGG database, for substance identification to determine saponin content.

2.2. Image Processing and Feature Extraction

A flowchart of the image processing procedure is shown in Figure 1. Firstly, the acquired image is converted into a grayscale image. Subsequently, the nonlocal means denoising algorithm and a Gaussian filter are applied to remove noise from the grayscale image and so facilitate subsequent edge detection. Finally, the soybean contour is extracted using the adaptive threshold algorithm [25].

After edge detection, feature information of the soybean samples is extracted, including area, perimeter, major axis, minor axis, roundness, eccentricity, aspect ratio, rectangle-area ratio, circle-area ratio, equal-area-circle diameter, and edge variation coefficient. The meaning of each feature is described in Table 1.

2.3. Spectral Data Preprocessing

Spectral data preprocessing is essential for minimizing errors during model building. Spectral data contain noise from sources such as the sample background, stray light, and signal noise. To reduce the impact of these unrelated factors on the detection model and to enhance spectral characteristics, the noise must be removed by spectral preprocessing. Common preprocessing methods include Savitzky–Golay smoothing (S-G), SNV, de-trending (DT), multiplicative scatter correction (MSC), baseline correction (BL), first derivative (FD), and second derivative (SD). These methods eliminate noise from different perspectives and highlight the characteristics of spectral data [26–29]. In this study, these methods are applied to soybean saponin detection and the most suitable algorithm is selected.
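As an illustration of one of these methods, the sketch below applies SNV to a single spectrum: each spectrum is centred on its own mean and divided by its own standard deviation. It assumes the sample standard deviation (n − 1 denominator), which is the common convention but is not stated in the text; the function name `snv` and the example values are hypothetical.

```python
import math


def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own mean and SD."""
    n = len(spectrum)
    if n < 2:
        raise ValueError("a spectrum needs at least two bands")
    mean = sum(spectrum) / n
    # Sample standard deviation (n - 1 denominator), an assumption of this sketch.
    sd = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (n - 1))
    if sd == 0:
        raise ValueError("a flat spectrum cannot be scaled")
    return [(v - mean) / sd for v in spectrum]


# Two made-up spectra that differ only by a multiplicative factor and an offset,
# which is the kind of scatter effect SNV is meant to remove.
base = [0.20, 0.35, 0.50, 0.40]
shifted = [2.0 * v + 0.1 for v in base]
snv_base = snv(base)
snv_shifted = snv(shifted)
print(snv_base)
print(snv_shifted)
```

Both printed spectra coincide up to floating-point rounding, which shows why SNV suppresses additive and multiplicative scatter differences between samples.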

2.4. Dimensionality Reduction of Spectral Features

Full-band spectra contain a lot of redundant information, which makes the detection model complex and inaccurate. To reduce the data dimension and obtain the main characteristic bands of the spectrum, feature bands should be selected before model building. Because the SiPLS band selection algorithm selects several spectral intervals related to the tested substance, models based on it usually have high accuracy [30]. The extracted spectral features are stable and continuous, but some redundant spectral information remains. IRIV reduces spectral data dimensionality iteratively by identifying informative variables and keeping the spectral feature wavelengths with high weights in the feature subsets. After the irrelevant and interfering variables are removed, the last group of variables is processed by backward elimination to obtain a more compact set of characteristic wavelengths [31]. However, because of its multiple iterations, applying IRIV to the full-band spectral set is computationally expensive. In this study, a method combining SiPLS and IRIV is proposed to select the spectral feature bands. Selecting valid intervals with SiPLS before selecting variables with IRIV can improve the fitting ability of the model.

2.5. Data Combination

Because spectral and image information of soybeans have distinct feature attributes, normalization is necessary before they are used as combined input for the detection model. Spectral and image information is scaled so that both datasets are mapped to the [0, 1] interval [32]. Min-max normalization is used for data combination processing, and its formula is as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{2}$$

where $x'$ represents the transformed data, $x$ represents the original feature data, $x_{\max}$ represents the maximum of the feature data, and $x_{\min}$ represents the minimum of the feature data.
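The combination step can be sketched as follows: each feature column is min-max scaled, and the scaled spectral and image features of a sample are then concatenated into one input vector. This is a minimal sketch under the assumption that scaling is done per feature column; the helper name `min_max_columns` and all numbers are hypothetical.

```python
def min_max_columns(rows):
    """Scale every column of a row-major table to the [0, 1] interval."""
    n_cols = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(n_cols)]
    highs = [max(r[j] for r in rows) for j in range(n_cols)]
    scaled = []
    for r in rows:
        scaled.append([
            # A constant column carries no information; map it to 0.0.
            (r[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
            for j in range(n_cols)
        ])
    return scaled


# Made-up data: three samples, two spectral features and two image features each.
spectral = [[0.31, 0.52], [0.35, 0.48], [0.39, 0.50]]
image = [[812.0, 0.91], [790.0, 0.95], [845.0, 0.93]]

# Concatenate the scaled spectral and image features sample by sample.
combined = [s + g for s, g in zip(min_max_columns(spectral), min_max_columns(image))]
print(combined)
```

When a separate test set is used, a common practice is to take the minima and maxima from the training set only and reuse them for the test set; the text does not say which variant was applied.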

2.6. Model Construction and Evaluation

A two-layer stacking ensemble learning framework is built to detect soybean saponin content. The framework consists of three base learners: DF, PLSR, and EXG. The meta-learner is PBT-BPNN, a BPNN whose learning rate is optimized by the population-based training (PBT) method. The effects of spectral information alone and of combined information are compared in this study. The construction process of the ensemble learning model is illustrated in Figure 2.

Skip connections [33] and a multihead self-attention mechanism [34] are introduced into the BPNN in this paper to enhance its generalization and expressive ability, and to prevent overfitting and gradient explosion in deep networks. A BPNN with the optimized hidden layer can improve the detection capability of the network because of its deeper architecture. The optimized structure of the hidden layer is shown in Figure 3. The input sequence of the layer is first passed through the conventional BPNN nonlinear computation to obtain a new sequence, and the relationships between the elements of this sequence are captured by the multihead self-attention mechanism. This helps the network retain information after the computations of multiple hidden layers, preventing data loss and a decline in detection ability. The original sequence is then added back through the skip connection to strengthen the network's memory of the original input. This avoids the down-weighting of information after multiple nonlinear transformations and mitigates gradient explosion and performance degradation in deep neural networks.

The structure of the multihead self-attention mechanism is shown in Figure 4. The input sequence is multiplied by three trainable parameter matrices to yield three vectors: query, key, and value. Self-attention is then computed several times in parallel, each time with its own parameters, so that different features of the sequence are emphasized. Within each computation, the three vectors are first linearly transformed, and the transformed query and key vectors are multiplied to compute attention scores. The scores are scaled, and a mask is applied to each element to prevent the network from attending excessively to certain elements during training, which also enhances the generalization capability of the model. The processed scores are passed through SoftMax to obtain attention weights, which are multiplied by the corresponding value vectors and summed to give the output. Finally, the results of the parallel computations are concatenated and linearly transformed.
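The mechanism described above can be illustrated with a small standard-library sketch of scaled dot-product self-attention with several heads, followed by a skip connection that adds the input back. It is not the network used in this study: the mask is omitted, the two linear stages are merged into one projection per head, the weights are fixed rather than trained, and all names (`attention_head`, `multi_head_with_skip`) and values are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply an m x k matrix by a k x n matrix (lists of rows)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def softmax(row):
    """Numerically stable softmax of one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x, w_q, w_k, w_v):
    """One head: project to query/key/value, scale the scores, softmax, weight the values."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_with_skip(x, heads, w_o):
    """Concatenate the head outputs, apply the output projection, add the input back."""
    outputs = [attention_head(x, *h)[0] for h in heads]
    concat = [sum((out[i] for out in outputs), []) for i in range(len(x))]
    projected = matmul(concat, w_o)
    return [[a + b for a, b in zip(x_row, p_row)] for x_row, p_row in zip(x, projected)]


# Made-up sequence of three elements with two features each.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Two heads, each with its own (query, key, value) projection of shape 2 x 1.
head_a = ([[1.0], [0.0]], [[1.0], [0.0]], [[1.0], [0.0]])
head_b = ([[0.0], [1.0]], [[0.0], [1.0]], [[0.0], [1.0]])
w_o = [[1.0, 0.0], [0.0, 1.0]]  # identity output projection

y = multi_head_with_skip(x, [head_a, head_b], w_o)
_, weights_a = attention_head(x, *head_a)
print(y)
```

Each row of `weights_a` sums to 1, so every output element is a weighted average of the value vectors; the final addition of `x` is the skip connection that keeps the original input visible to later layers.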

Model performance is evaluated on the test set using $R^2$, RMSE, and the ratio of prediction to deviation (RPD). $R^2$ represents the goodness of fit of the model; the closer its value is to 1, the better the independent variables explain the variation of the dependent variable. RMSE measures the model error; the closer its value is to 0, the higher the model accuracy. RPD quantifies the predictive accuracy and reliability of a model. When the RPD is less than 2.0, the model is generally considered incapable of reliable quantitative prediction; when the RPD ranges between 2.0 and 2.5, the model can make rough quantitative predictions; and when the RPD exceeds 2.5, the model is deemed to have good predictive accuracy. Each reported result is the average of three tests. $R^2$, RMSE, and RPD are given by equations (3)–(5):

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{3}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{4}$$

$$\mathrm{RPD} = \frac{1}{\sqrt{1 - R^2}} \tag{5}$$

where $n$ represents the sample size, $y_i$ represents the actual value, $\hat{y}_i$ represents the model calculated value, and $\bar{y}$ represents the average of the actual values.
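The three metrics can be computed as in the sketch below. It assumes that RPD is the standard deviation of the measured values divided by the RMSE, with the population standard deviation (n denominator); under that assumption RPD equals 1/√(1 − R²) exactly. The function name `regression_metrics` and the data are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (R^2, RMSE, RPD) for measured and predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    sd = math.sqrt(ss_tot / n)  # population SD, an assumption of this sketch
    rpd = sd / rmse if rmse > 0 else math.inf
    return r2, rmse, rpd


# Made-up measured and predicted saponin contents (%).
measured = [2.0, 3.0, 4.0, 5.0]
predicted = [2.1, 2.9, 4.2, 4.8]
r2, rmse, rpd = regression_metrics(measured, predicted)
print(round(r2, 4), round(rmse, 4), round(rpd, 4))
```

With the sample standard deviation (n − 1 denominator) instead, RPD would be larger by a factor of √(n / (n − 1)), which matters little for a test set of several dozen samples.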

3. Results and Discussion

3.1. Dataset Partitioning Results

SPXY is a dataset partitioning algorithm based on the Kennard-Stone (KS) algorithm; it improves on KS by including both the spectral data (X) and the chemical values (Y) in the calculation of distances between samples. Because it covers the multidimensional space effectively, SPXY enhances model robustness and reduces regression errors. It also mitigates the impact of imperfect dataset partitioning on the final results [35]. Dataset partitioning results obtained with SPXY are shown in Table 2 and Figure 5. Table 2 and Figure 5 show that the soybean saponin content of the test set fell within the range of the training set. This indicated that the samples were representative and that SPXY was a reasonable method for dataset partitioning.
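A compact sketch of this kind of partitioning is given below. It follows the commonly described form of SPXY: the X and Y distances between each pair of samples are each divided by their maximum and summed, and training samples are then picked by the Kennard-Stone max-min rule. This is an illustration under those assumptions, not the code used in the study; `spxy_split` and the toy data are hypothetical.

```python
import math


def _pairwise(rows):
    """Euclidean distance matrix between all rows."""
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = math.dist(rows[i], rows[j])
    return d


def spxy_split(x, y, n_train):
    """Return (train_indices, test_indices) chosen on a joint X-Y distance."""
    n = len(x)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of samples")
    dx = _pairwise(x)
    dy = _pairwise([[v] for v in y])
    max_dx = max(max(row) for row in dx)
    max_dy = max(max(row) for row in dy)
    if max_dx == 0 or max_dy == 0:
        raise ValueError("X or Y has no variation")
    # Joint distance: each part is normalized by its own maximum.
    d = [[dx[i][j] / max_dx + dy[i][j] / max_dy for j in range(n)] for i in range(n)]
    # Start with the two samples that are farthest apart.
    first, second = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                        key=lambda p: d[p[0]][p[1]])
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    # Repeatedly add the sample whose nearest selected neighbour is farthest away.
    while len(selected) < n_train:
        best = max(remaining, key=lambda i: min(d[i][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected, remaining


# Toy data: 10 samples with two spectral features and one chemical value each.
x = [[float(i), float(i % 3)] for i in range(10)]
y = [0.5 * i for i in range(10)]
train_idx, test_idx = spxy_split(x, y, 7)  # 7:3 split
print(sorted(train_idx), sorted(test_idx))
```

Because the most distant samples are selected first, the extremes tend to end up in the training set, which is consistent with the observation that the test-set contents fall within the training-set range.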

3.2. Image Information Correlation Analysis

The results of the correlation analysis between soybean image features and soybean saponin content are presented in Table 3. The correlation with saponin content differed from feature to feature. Among the eleven image features, the largest absolute correlation coefficient, 0.289, was observed for the rectangle-area ratio. Although this correlation was highly significant, the Pearson correlation coefficient itself was relatively small. A possible reason is that the Pearson correlation coefficient measures only the linear relationship between two variables, with a larger coefficient indicating a stronger linear correlation. If the relationship between two variables is influenced by other variables, the relationship between them may be nonlinear. Image features such as rectangularity could therefore have a joint nonlinear relationship with soybean saponin content, which would lead to a low Pearson correlation coefficient despite a highly significant correlation.

The results in Table 3 showed that all image features except area, perimeter, minor axis, and equal-area-circle diameter were highly significantly related to soybean saponin content. Specifically, major axis and rectangle-area ratio showed a highly significant positive correlation with soybean saponin content, while roundness, eccentricity, aspect ratio, circle-area ratio, and edge variation coefficient showed a highly significant negative correlation. High significance indicates that the observed association between the independent and dependent variables is extremely unlikely to be explained by random error. Therefore, seven image features (major axis, roundness, eccentricity, aspect ratio, rectangle-area ratio, circle-area ratio, and edge variation coefficient) are closely correlated with the saponin content of soybeans and were chosen for combination with spectral information.
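The point that a small Pearson coefficient does not rule out a strong nonlinear dependence can be shown with a short example. The sketch below computes the Pearson coefficient from its definition and applies it to a perfectly linear and a perfectly quadratic relationship; the function name `pearson_r` and the data are made up for illustration and are unrelated to Table 3.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear = [2.0 * v + 1.0 for v in x]   # exact linear dependence on x
quadratic = [v * v for v in x]        # exact but symmetric nonlinear dependence on x

r_linear = pearson_r(x, linear)
r_quadratic = pearson_r(x, quadratic)
print(r_linear, r_quadratic)
```

The quadratic series is fully determined by `x`, yet its Pearson coefficient is zero, which is why weakly correlated image features can still be useful inputs to a nonlinear model.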

3.3. Spectral Reflectance Extraction Results

Figure 6 shows the original spectral reflection intensities of the 300 soybean samples in the 463–957 nm wavelength range. Figure 7 presents the spectral reflectance values of the soybean samples after black and white correction. In both figures, each coloured line represents one soybean sample. The spectral data of all samples showed an increasing trend in the 463–760 nm range. Absorption valleys in the 760–820 nm range were observed for three soybean varieties: Chundou 1, L-58Keburi, and Zhongpin 03-5373. The spectral reflectance of the soybean samples gradually became stable in the 850–957 nm range.

3.4. Result of Spectral Preprocessing

To reduce the influence of noise, stray light, and other irrelevant factors on the spectral data and to improve the signal-to-noise ratio, various methods were applied to preprocess the soybean spectral data in this study. Because PLSR combines characteristics of principal component analysis, canonical correlation analysis, and linear regression analysis, it can perform regression modelling reliably even when the independent variables are weakly correlated with the response or are multicollinear. PLSR has been widely used in many studies to compare the effectiveness of different spectral preprocessing methods. Therefore, S-G, SNV, DT, MSC, BL, FD, and SD were compared for soybean saponin detection based on the PLSR model. The $R^2$ and RMSE of the model on the test set were used to evaluate their effects. The results are shown in Table 4.

According to Table 4, the PLSR models based on S-G, BL, and SD performed worse than the model based on the original data, with decreased $R^2$ and increased RMSE. These preprocessing methods may have removed important information from the spectra.

The detection model based on the MSC method had a higher $R^2$ but also a higher RMSE than the detection model based on the original data. This indicated that the model explained the target variable better after spectral preprocessing, but its error increased, resulting in lower accuracy. This could be attributed to the loss of some important features in the spectral data after preprocessing.

Models based on SNV, DT, and FD had higher $R^2$ and lower RMSE than the model based on the original data, so these three preprocessing methods improved both the fitting capability and the accuracy of the models. The model based on SNV had the highest $R^2$ and the lowest RMSE. SNV was therefore chosen to preprocess the spectral data in this paper, and the subsequent variable selection and modelling use spectral data preprocessed by SNV.

The comparison above uses the PLSR model to evaluate the spectral preprocessing algorithms. However, the performance of these algorithms in the ensemble learning and optimized ensemble learning models may not match their performance in the PLSR model. To check the validity of the PLSR-based comparison, the preprocessing algorithms were reevaluated with the ensemble learning model and the optimized ensemble learning model. If the resulting trends agree with those observed for the PLSR model, PLSR is a valid tool for assessing the effectiveness of preprocessing algorithms. The results are shown in Table 5.

The results for the ensemble learning model and the optimized ensemble learning model showed the same trends as those for the PLSR model. Models based on S-G, BL, and SD performed poorly. Models based on MSC improved $R^2$ without reducing RMSE. In contrast, models based on SNV, DT, and FD improved model performance comprehensively, and models based on SNV achieved the highest $R^2$ and the lowest RMSE. This validation experiment indicated that the choice of modelling method did not change the ranking of the preprocessing algorithms, because the purpose of spectral preprocessing is to remove noise. The PLSR model can therefore be used to select the better spectral preprocessing method.

3.5. Dimensionality Reduction of Spectral Features

SiPLS selects the spectral wavelength intervals that contain spectral features. Figure 8 shows the RMSE of the models built on each combination of wavelength intervals when SiPLS was run with the spectrum divided into 50 intervals and 3 intervals combined at a time. The lowest RMSE, 3.2647 × 10⁻³, was obtained for the 132nd interval combination. The spectral wavelength intervals in this combination were 561–658 nm, 708–757 nm, and 858–907 nm, with a total of 149 spectral wavelengths, accounting for 30.1% of all wavelengths.

The 149 spectral wavelengths selected by SiPLS were used as input variables for the IRIV algorithm. As shown in Figure 9, irrelevant and interfering variables were eliminated from the variable combinations after 4 iterations, and a further 11 variables were removed by backward elimination. Finally, 17 spectral feature variables related to soybean saponin content were selected: 562 nm, 575 nm, 579 nm, 597 nm, 606 nm, 710 nm, 737 nm, 739 nm, 743 nm, 836 nm, 842 nm, 847 nm, 851 nm, 852 nm, 853 nm, 854 nm, and 855 nm. The remaining variables were only 3.43% of the total wavelengths. The distribution of the spectral feature wavelengths selected by SiPLS-IRIV is shown in Figure 10.
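The proportions quoted in this section follow from the band counts given in the text, as the short calculation below shows. It also counts how many three-interval combinations exist when the spectrum is divided into 50 intervals, assuming every combination of three distinct intervals is a candidate; the variable names are hypothetical.

```python
import math

total_bands = 495   # bands from 463 nm to 957 nm
after_sipls = 149   # wavelengths kept by SiPLS
after_iriv = 17     # wavelengths kept by SiPLS-IRIV

share_sipls = 100.0 * after_sipls / total_bands
share_iriv = 100.0 * after_iriv / total_bands

# Number of ways to choose 3 intervals out of 50 (assumed search space of SiPLS).
n_combinations = math.comb(50, 3)

print(f"SiPLS keeps {share_sipls:.1f}% of the bands")
print(f"SiPLS-IRIV keeps {share_iriv:.2f}% of the bands")
print(f"{n_combinations} candidate interval combinations")
```

Running IRIV on 149 variables rather than 495 is what makes the two-step scheme cheaper than applying IRIV to the full spectrum.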

3.6. Result of Ensemble Learning Modelling

Ensemble learning can reduce the risk of model overfitting and improve robustness, reliability, and accuracy of the model. Results of the detection model are shown in Table 6.

According to Table 6, all six models improved markedly on previous studies. The PLSR model with single spectral information had the lowest $R^2$ of 0.7195, the highest RMSE of 3.0639 × 10⁻³, and the lowest RPD of 1.8881. Because the RPD of this model was less than 2.0, the model was not capable of reliable quantitative prediction. The residual attention ensemble learning model using combined spectral and image information had the highest $R^2$ of 0.9216, the lowest RMSE of 1.7071 × 10⁻³, and the highest RPD of 3.5714. Because the RPD of this model was higher than 2.5, the model had good predictive accuracy and performed stably and accurately on the test dataset. These results indicated that single spectral information had limitations in describing soybean saponin content and could not detect it effectively. Supplementing image feature information enriched the input with dimensions not captured by the spectral data, enabling a more comprehensive description of saponin content in the model. The ensemble learning model and the optimized ensemble learning model have more complex structures than PLSR, which allow them to perform more accurate nonlinear transformations of the input information and so improve saponin detection performance. Skip connection and multihead self-attention modules were added to optimize the ensemble learning model. The multihead self-attention module better captured the relationships between the input sequences in each hidden layer and gave more weight to elements that were highly correlated with saponin content. The skip connection ensured that elements with smaller weights were not neglected during the nonlinear transformations in the hidden layers. It preserved the integrity of element information as the number of hidden layers increased, enhanced the generalization ability of the model, and improved the accuracy of the soybean saponin content detection model.
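The reported RPD values can be related to the reported coefficients of determination and to the RPD thresholds used for interpretation. The sketch below assumes the relation RPD = 1/√(1 − R²), which reproduces the two reported RPD values to four decimal places; the function names and the category wording are hypothetical.

```python
import math


def rpd_from_r2(r2):
    """RPD implied by a coefficient of determination, assuming RPD = 1 / sqrt(1 - R^2)."""
    return 1.0 / math.sqrt(1.0 - r2)


def rpd_category(rpd):
    """Interpretation of RPD using the 2.0 and 2.5 thresholds."""
    if rpd < 2.0:
        return "not reliable for quantitative prediction"
    if rpd <= 2.5:
        return "rough quantitative prediction"
    return "good predictive accuracy"


# R^2 values reported for the weakest and the strongest model.
for label, r2 in [("PLSR, spectral only", 0.7195),
                  ("optimized ensemble, combined", 0.9216)]:
    rpd = rpd_from_r2(r2)
    print(f"{label}: RPD = {rpd:.4f} ({rpd_category(rpd)})")
```

Under this relation the two metrics carry the same information, so RPD mainly serves as a convenient scale for the qualitative thresholds.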

A scatter plot of predicted versus measured values for the test set is shown in Figure 11, with predicted values on the vertical axis and measured values on the horizontal axis. The line drawn in each panel is y = x, with a slope of 1 and an intercept of 0. The closer a point lies to this line, the smaller the error between the predicted and measured values; the farther a point lies from it, the greater the error. The plots show intuitively that the optimized ensemble learning model built on combined spectral and image information performed best.

The first three panels show the models established with single spectral information. The PLSR model (Figure 11(a)) fitted the soybean samples poorly. Notably, for a sample with a measured saponin content of 2.1%, the model predicted 2.6%, a significant deviation, and the points as a whole were widely dispersed. The ensemble learning model (Figure 11(b)) produced more concentrated points than the PLSR model but tended to overestimate saponin content, with most predicted values exceeding the measured values. The ensemble model optimized with the residual attention module (Figure 11(c)) performed well in the saponin content range of 2.5% to 3.0%, where predicted values closely matched measured values, but its performance deteriorated at higher saponin contents, particularly above 4%.

The last three panels show the models established with combined information. Compared with the models using single spectral information, these models fitted the data significantly better. The points of the PLSR model (Figure 11(d)) were relatively concentrated around the line y = x, but the fit was still unsatisfactory. The scatter plot of the ensemble learning model (Figure 11(e)) showed only a few poorly predicted points, with the majority of points scattered around the line y = x. The scatter plot of the ensemble model optimized with the residual attention module (Figure 11(f)) closely followed the diagonal, indicating minimal errors. In the saponin content range of 3.5% to 4.5% in particular, the other five models fitted poorly and deviated significantly from the measured values, whereas the optimized ensemble learning model showed minimal errors. Overall, the residual attention ensemble learning model with combined spectral and image information can accurately estimate soybean saponin content.

A comparison of the results of this experiment with those of other research methods shows that the method used here has advantages on several levels. Unlike previous methods based on near-infrared spectroscopy, this study uses the combined spectral and imaging capability of hyperspectral technology. Spectral and image data can be captured simultaneously for multiple soybeans, enabling concurrent detection of multiple samples. Furthermore, integrating hyperspectral data with the optimized stacking ensemble learning model yields more accurate detection of soybean saponin content than the previously reported spectral models. Compared with traditional wet chemical methods, the approach not only provides high-precision detection results but is also cost-effective, significantly reduces detection time, and does not require the soybean samples to be destroyed. It also removes the influence of operator proficiency on test results. This research therefore offers a novel approach for accurate and efficient detection of soybean saponin content.

4. Conclusions

A soybean saponin content detection model based on the combination of spectral and image information was developed in this paper. SNV was selected as the spectral preprocessing method, and SiPLS-IRIV was used for dimensionality reduction. An ensemble learning model with skip connection and multihead self-attention modules was built to detect soybean saponin content. The $R^2$ and RMSE values of the model were 0.9216 and 1.7071 × 10⁻³, respectively. The detection method based on hyperspectral technology reduced sample processing time, improved detection efficiency, avoided sample damage, and minimized experimental errors caused by human operators. This study provides a new method for quality testing in soybean breeding, making the process more efficient and convenient.

Data Availability

The raw/processed data required to reproduce these findings cannot be shared at this time as the data also form part of an ongoing study.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank the Key Laboratory of Northeast Smart Agricultural Technology, Ministry of Agriculture and Rural Affairs.