Abstract

Numerous deep learning methods have recently been applied to hyperspectral image (HSI) classification. The Vision Transformer (ViT) excels at modeling the overall structure of images and has been introduced to the HSI classification task. However, the fixed patch division in ViT may lead to insufficient feature extraction; in particular, features at the edges between patches are ignored. To address this problem, we devise a workflow for HSI classification based on the Nested Transformers (NesT). The NesT employs a block aggregation module to extract edge information between patches, which realizes cross-block communication of nonlocal information and optimizes global information extraction. In this paper, the NesT is used for HSI classification for the first time. The experiments are carried out on four widely used hyperspectral datasets: Indian Pines, Salinas, Tea Farm, and Xiongan New Area (Matiwan Village). The obtained results reveal that the NesT provides competitive results compared to conventional machine learning and deep learning methods and achieves the highest accuracy on all four datasets, which proves the superiority of the NesT in HSI classification with limited training samples.

1. Introduction

Hyperspectral imaging technology uses a sensor to generate multiple spatially aligned images of ground objects while recording hundreds of contiguous narrow spectral bands in the spectral domain; the hyperspectral image (HSI) is therefore a data cube. Unlike conventional images, each pixel in the data cube is a 1D spectral representation of the response at a given spatial location, and each 2D layer is a spatial image of the response at a specific wavelength band, so the information available from HSI is much richer, containing both spectral and spatial information [1–4].
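
For illustration, this cube structure can be sketched in a few lines of NumPy; the dimensions below follow the Indian Pines scene, and the array is random stand-in data rather than a real image:

```python
# A minimal sketch of how an HSI data cube is indexed; the array here
# is random stand-in data with illustrative dimensions.
import numpy as np

H, W, B = 145, 145, 200          # rows, columns, spectral bands
cube = np.random.rand(H, W, B)   # the HSI data cube

spectrum = cube[72, 72, :]       # 1D spectral signature of one pixel
band_img = cube[:, :, 50]        # 2D spatial image at one wavelength band
print(spectrum.shape, band_img.shape)  # (200,) (145, 145)
```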

Most research on HSI classification can be divided into spectral classifiers and spectral-spatial classifiers. The spectral feature is the original feature of HSI, while the spatial feature refers to the relationship between the center pixel and its neighborhoods. In the early stages, researchers mainly focused on pure spectral feature-based methods, simply applying classifiers to pixels, such as the support vector machine (SVM) [5] and logistic regression (LR) [6]. However, spectral classifiers cannot achieve good performance because different objects may share the same spectrum and the same object may exhibit different spectral features.

Therefore, spectral-spatial classifiers emerged to improve accuracy by fusing spectral and spatial features. Spatial correlation analysis [7], Markov random fields [8], Gabor filters [9], and morphological features [10] are widely used for spatial feature extraction. However, the strong correlation between pixels results in information redundancy after the feature extraction stage. Extracting discriminative features from HSI thus remains a primary challenge that requires adequate approaches.

Deep learning methods have become a powerful and popular tool in the remote sensing community due to their excellent feature extraction and generalization abilities. Convolution-based models are among the most commonly used deep learning methods for HSI classification; they use convolutional kernels to obtain the image response to specific features at different positions as a basis for classification [11–13]. Lee et al. [14] proposed a novel deep Convolutional Neural Network (CNN) to optimally explore local contextual interactions by jointly exploiting the local spectral-spatial relationships of neighboring pixels. Feng et al. [15] presented a novel CNN method that considered multilayer spectral-spatial feature fusion and sample augmentation with local and nonlocal constraints. In [16], a CNN-based spatial feature fusion algorithm was employed to integrate spatial information with spectral features. In [17], a multidimensional hybrid convolutional architecture was used to extract spectral-spatial features in two steps, which optimized the description of spatial details. This literature shows that convolution-based models have successfully learned feature representations from raw data, forming more complex and abstract concepts by combining simpler ones, which was a focus of previous research.

However, the convolutional structure extracts features from HSI via a local receptive field, and so lacks a description of the global features in HSI. The emerging Vision Transformer (ViT) [18] can retain the global structure and extract high-level visual semantic information for HSI classification [19, 20]. Although the ViT excels at capturing the information contained in spectral-spatial signatures, its fixed patch division prevents adequate use of spatial information. To address this issue, we develop a novel framework using the well-designed Nested Transformers (NesT) [21] to classify HSI, which realizes nonlocal feature extraction between pixels and achieves a more refined pixel-wise feature representation.

In this paper, the NesT is used for HSI classification for the first time. With the help of the block aggregation module in the NesT, the insufficient feature extraction between patches in ViT is mitigated. The experimental results obtained on four hyperspectral datasets show that the NesT is superior to other classical HSI classifiers in terms of accuracy, especially with limited training samples. This demonstrates that the NesT has the potential to implement pixel-wise feature extraction.

The remainder of this paper is organized as follows. The related work is presented in Section 2. Section 3 presents the methodology of the NesT. The details of experimental results are reported in Section 4. In Section 5, relevant conclusions and discussions are presented.

2. Related Work

The Transformer [22] was proposed for machine translation tasks and has become a standard for natural language processing (NLP) tasks. To locate important information in the input sequence and capture global information, the Transformer architecture transforms one sequence into another through an encoder and a decoder based on the self-attention mechanism. The elements of these sequences are referred to as "tokens": feature vectors to which the original data are mapped in the model. The Transformer extracts features of sequence data and assigns attention weights to the raw data mainly through Multilayer Perceptron (MLP) layers used to build multi-head attention structures. However, the mapping in the Transformer considers global information when applied to image data, resulting in a sharp increase in model scale that imposes a heavy computational and training burden, so there are limitations in image feature extraction [23, 24].

Numerous studies of Transformer variants in the image processing field have been carried out recently. Bello et al. proposed a local self-attention to replace the convolution layer [25], Wu et al. proposed a multiscale structure to approximate the global self-attention in a sparse Transformer [26], and Ho et al. introduced self-attention over different sizes of image blocks to extract spatial information [27]. However, the results of the above methods were not satisfactory. It was not until 2021 that, inspired by the great success of the Transformer in the NLP community, ViT was proposed and applied directly to Computer Vision (CV) with the fewest possible modifications. When pretrained on the ImageNet-21k dataset and the in-house JFT-300M dataset, ViT approached or outperformed the state-of-the-art models on multiple image recognition benchmarks [28].

In previous studies, the tokens in ViT were all of fixed scale, which may be unsuitable for vision applications. Hence, the Swin Transformer (Swin) [29] established a hierarchical ViT structure that, like a CNN, uses multiple feature extractors to expand the receptive field by gradually merging image patches as the network deepens. In addition, Swin uses variable-sized attention windows to reduce the computation of a single self-attention operation. However, although the outputs of each layer are concatenated and then redivided, the Swin Transformer does not fuse the edge features between patches.

Therefore, the NesT was proposed to fuse the spatial features before they are redivided. The NesT incorporates Swin Transformer features to build a hierarchical ViT model that shares similarities with several pyramid designs. To solve the problem of fixed patch division in ViT, the model applies simple spatial operations, such as convolution and pooling, to the outputs of each layer to fuse the edge information between patches. The interpatch nonlocal features are extracted and fused after patch segmentation by the block aggregation module, effectively improving the spatial feature representation and the prediction accuracy. In addition, this feature fusion makes the model more efficient at extracting data features, which allows the network width to be reduced in deeper layers. Therefore, aiming at sufficient feature extraction between patches in the pixel-wise classification task, we introduce a novel workflow for HSI classification based on the NesT.

3. Methodology

3.1. ViT

ViT is a pioneering work in applying Transformer architectures to image classification and, with its impressive speed and accuracy, has had a profound impact on the design of subsequent visual backbones. As shown in Figure 1, the input data is first split into non-overlapping patches with position embeddings, and each patch is treated as a token. These tokens then constitute a learnable sequence that is input into a hierarchical Transformer.

The Transformer in vision backbones has a standard encoder architecture, but its decoder consists of an MLP head and a classifier. The encoders of the Transformer mine the dependencies between tokens to capture the structural information in the image, and the decoders are employed to classify the integrated tokens. In the multi-head self-attention layer of the encoders, a token is mapped to groups of vectors, and each group includes a query ($Q$), a key ($K$), and a value ($V$). The dot product of $Q$ and $K$ is then used to measure the dependencies between tokens. Let $X \in \mathbb{R}^{n \times d}$ denote the input tokens and $Z \in \mathbb{R}^{n \times d}$ the output tokens, where $n$ is the length of the token sequence and $d_h$ is the length of the token in each head. The multi-head self-attention layer mines the various dependencies among tokens; since each token corresponds to a local patch, these dependencies effectively express the structural information of the image. The multi-head self-attention mechanism is calculated by

$$Z = \mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_h}}\right) V_i,$$

where $Q_i = X W_i^Q$, $K_i = X W_i^K$, $V_i = X W_i^V$, $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d \times d_h}$, $W^O \in \mathbb{R}^{d \times d}$, and $h$ is the number of heads. The output tokens are processed by residual connections and layer normalization and passed into the feedforward layer, which contains two fully connected operations. Hyperspectral image feature extraction relies on pixels and their neighborhoods, and the window region input to the model includes a variety of objects. On this basis, the tokens further correspond to object-level features, which can express effective structural and discriminative information.
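
As a concrete reference, the following is a minimal PyTorch sketch of the multi-head self-attention described above; the class name, layer sizes, and the fused QKV projection are illustrative choices, not the authors' implementation:

```python
# A minimal sketch of multi-head self-attention; sizes are illustrative.
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=64, num_heads=4):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # maps tokens to Q, K, V
        self.proj = nn.Linear(d_model, d_model)      # output projection W^O

    def forward(self, x):                            # x: (batch, n, d_model)
        B, n, d = x.shape
        qkv = self.qkv(x).reshape(B, n, 3, self.h, self.d_head)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)         # each: (B, h, n, d_head)
        attn = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5
        attn = attn.softmax(dim=-1)                  # token-to-token weights
        out = (attn @ v).transpose(1, 2).reshape(B, n, d)
        return self.proj(out)

tokens = torch.randn(2, 16, 64)                      # 16 tokens per sample
print(MultiHeadSelfAttention()(tokens).shape)        # torch.Size([2, 16, 64])
```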

3.2. NesT

The NesT is a more effective variant of ViT for integrating global information in CV. The patch split in ViT directly separates neighboring pixels of HSIs, destroying the correlation between partitioned pixels. The NesT carries out a secondary feature extraction covering all patches after the Transformer to obtain edge information between patches. We introduce the NesT for the HSI classification task, as shown in Figure 2; it consists of patch embedding, Transformer encoders, and the block aggregation structure. First, a raw image is split into non-overlapping patches, patch and position embeddings are added, and the result is fed into standard Transformer encoders to deeply mine the relationships between pixels. Then, the features between different patches are extracted through block aggregation. Finally, the label distribution is produced by the softmax layer.
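
To make the workflow concrete, the following is a schematic PyTorch sketch of the pipeline in Figure 2 (patch embedding, Transformer encoder, block aggregation, classification head). All module choices and sizes are our own illustrative assumptions, position embeddings are omitted for brevity, and the block aggregation is reduced to a simple convolution-plus-pooling stand-in:

```python
# A schematic sketch of the classification workflow; not the authors' code.
import torch
import torch.nn as nn

class NesTClassifier(nn.Module):
    def __init__(self, in_bands=8, d_model=64, num_classes=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(in_bands, d_model, kernel_size=2, stride=2)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=1)
        self.aggregate = nn.Sequential(                 # block aggregation stand-in:
            nn.Conv2d(d_model, d_model, 3, padding=1),  # conv mixes edge information
            nn.MaxPool2d(2))                            # pooling merges blocks
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):                    # x: (B, bands, H, W) pixel window
        t = self.patch_embed(x)              # (B, d, H/2, W/2)
        B, d, h, w = t.shape
        t = self.encoder(t.flatten(2).transpose(1, 2))   # token self-attention
        t = self.aggregate(t.transpose(1, 2).reshape(B, d, h, w))
        return self.head(t.mean(dim=(2, 3)))  # logits; softmax applied at inference

logits = NesTClassifier()(torch.randn(2, 8, 8, 8))
print(logits.shape)                           # torch.Size([2, 16])
```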

3.3. Hierarchical Structure

Each patch in NesT is processed independently by standard Transformer layers. First, a hierarchical structure modeled on the Swin Transformer is built, as shown in Figure 3. Second, global information is communicated and mixed during the block aggregation step via simple spatial operations. Performing block aggregation in the image plane promotes information exchange between nearby patches. Concretely, the output at hierarchy $l$ is unblocked to the full image plane, several spatial operations (e.g., convolution and pooling) are applied to downsample the feature maps, and the feature maps are then blocked back into patches for hierarchy $l+1$. The sequence length always remains the same, while the total number of blocks is reduced by a factor of 9 until it reaches 1 at the top hierarchy. In this way, the receptive field expands gradually through the hierarchically nested structure.
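
Under the assumptions above, one unblock/aggregate/block cycle can be sketched as follows; the 3x3 block grid, 4x4 token blocks, and convolution-plus-pooling aggregation are illustrative, chosen so that one step reduces 9 blocks to 1 as stated in the text:

```python
# A sketch of the block aggregation step: blocked tokens are unblocked
# to the image plane, downsampled, and blocked back for the next hierarchy.
import torch
import torch.nn as nn

def unblock(x, grid):                 # x: (B, T, n, d) -> (B, d, H', W')
    B, T, n, d = x.shape
    g, p = grid, int(n ** 0.5)        # g*g blocks, each with p*p tokens
    x = x.reshape(B, g, g, p, p, d).permute(0, 5, 1, 3, 2, 4)
    return x.reshape(B, d, g * p, g * p)

def block(x, p):                      # (B, d, H', W') -> (B, T, n, d)
    B, d, H, W = x.shape
    g = H // p
    x = x.reshape(B, d, g, p, g, p).permute(0, 2, 4, 3, 5, 1)
    return x.reshape(B, g * g, p * p, d)

agg = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1),  # mix edge information
                    nn.MaxPool2d(3))                   # downsample by 3

tokens = torch.randn(2, 9, 16, 64)    # 9 blocks of 4x4 tokens, d = 64
plane = unblock(tokens, grid=3)       # (2, 64, 12, 12): full image plane
merged = block(agg(plane), p=4)       # (2, 1, 16, 64): 9 blocks -> 1
print(merged.shape)
```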

The nested hierarchy with independent block processing in NesT resembles a tree structure in which each block is encouraged to learn non-overlapping features that are then selected by the block aggregation [21]. This unique feature-driven property enables a new way of interpreting the model, named the gradient-based class-aware tree-traversal (GradCAT) method. The main idea is to utilize the tree structure to find the most valuable path from a child node to the root node. As shown in Figure 4, the NesT can be visualized for HSI using GradCAT, which shows how the network selectively utilizes the underlying local visual information to make inferences about the overall category of the image.
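
The traversal idea can be sketched schematically as follows; this is our reading of GradCAT, not the reference implementation, and the gradient-weighted per-block score and the branching-factor indexing are assumptions:

```python
# A schematic sketch of a GradCAT-style traversal: starting at the root,
# repeatedly descend into the child block with the largest
# gradient-weighted activation, recording the path.
import torch

def gradcat_path(level_feats, level_grads, branch=9):
    """level_feats / level_grads: one tensor per hierarchy below the root,
    each of shape (num_blocks, tokens, dim), ordered top to bottom."""
    path, parent = [], 0
    for feats, grads in zip(level_feats, level_grads):
        lo = parent * branch                        # this parent's children
        scores = (feats * grads).sum(dim=(1, 2))    # gradient-weighted value
        parent = lo + int(torch.argmax(scores[lo:lo + branch]))
        path.append(parent)
    return path                                     # most valuable path
```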

4. Experimental Results

4.1. Datasets

The NesT can efficiently extract features from complex HSI. In this paper, four public HSI datasets, namely Salinas, Indian Pines, Tea Farm, and Xiongan New Area in Matiwan Village (Xiongan), are used to verify the performance of the NesT in comparison with other machine learning and deep learning methods. Detailed information on the four datasets is shown in Table 1.

The Salinas dataset is collected by the airborne visible/infrared imaging spectrometer (AVIRIS) over the Salinas Valley, California, USA (Figure 5(a)). The spectral range is 0.4~2.5 μm, with 204 effective bands after removing bands with severe water absorption, and the spatial resolution is 3.7 m. This dataset includes 16 regular fields of different crops.

The Indian Pines dataset is also gathered by AVIRIS, imaging a pine forest in Indiana, USA (Figure 5(b)). The spectral range is 0.4~2.5 μm, with 200 effective bands retained after removing bands with severe water absorption. It has a spatial resolution of 20 m and includes 16 feature categories.

The Tea Farm dataset is captured over a tea planting base in Jiangsu province, China, by a push-broom hyperspectral imager (PHI). The spectral range is 0.417~0.855 μm with 80 bands. As shown in Figure 5(c), there are 10 labeled categories, and the spatial resolution is 2.25 m.

As shown in Figure 5(d), the Xiongan dataset is an aerial hyperspectral remote sensing image of Matiwan Village in the Xiongan New Area of China, acquired with the visible and near-infrared imaging spectrometer designed by the Shanghai Institute of Technical Physics, Chinese Academy of Sciences. The spectral range is 0.4~1 μm with 256 bands, and the spatial resolution is 0.5 m. Twenty land cover types, mainly cash crops, are labeled.

4.2. Experimental Setup

In this paper, experiments are conducted to evaluate the performance of the NesT against three machine learning classifiers (LR, SVM, and Random Forest (RF) [31]) and three deep learning classifiers (CNN, ViT, and Swin Transformer). The typical convolutional structure in the CNN classifier extracts spatial features through pixel-by-pixel translation and can stack convolution kernels to extract nonlinear features. The ViT classifier shows the performance of combining patch splitting with the Transformer on HSI. The Swin Transformer is used to analyze the impact of different attention window sizes on classification. Block aggregation is crucial for feature extraction and classification in the NesT, which uses a nested aggregation approach to fuse features between patches, adds GradCAT, and significantly improves the classification accuracy compared with the above three models. Furthermore, owing to the large number of spectral bands and the resulting information redundancy, principal component analysis (PCA) is performed to reduce the dimensionality of the four datasets, and eight principal components are selected to replace the original data according to their variance contribution. For each dataset, the training set is constructed by extracting the pixels in the neighborhood around the sampling points.
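
A sketch of this preprocessing follows, assuming scikit-learn PCA, an 8x8 neighborhood window, and that label 0 marks unlabeled pixels (all assumptions on our part):

```python
# A sketch of the preprocessing: PCA keeps eight principal components,
# and training samples are the neighborhood windows around labeled pixels.
import numpy as np
from sklearn.decomposition import PCA

def preprocess(cube, labels, n_components=8, half=4):
    H, W, B = cube.shape
    reduced = PCA(n_components=n_components).fit_transform(cube.reshape(-1, B))
    reduced = reduced.reshape(H, W, n_components)
    pad = np.pad(reduced, ((half, half), (half, half), (0, 0)), mode="reflect")
    X, y = [], []
    for r, c in zip(*np.nonzero(labels)):                 # labeled pixels only
        X.append(pad[r:r + 2 * half, c:c + 2 * half, :])  # window around pixel
        y.append(labels[r, c])
    return np.stack(X), np.array(y)
```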

The NesT is applied to the classification of HSI using a single layer of the original architecture and a small patch size, owing to the high correlation between pixels and the small scale of features in HSI. After the interpatch features are fused, a fully connected layer maps the features extracted by block aggregation to the category dimension. The detailed parameters of the NesT and the comparison models are shown in Table 2. Finally, we use Stochastic Gradient Descent (SGD) as the optimizer and train for 350 iterations with a learning rate of 0.01 and a batch size of 128.
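
A minimal training sketch matching the stated setup follows; whether "350 iterations" means epochs or gradient steps is ambiguous in the text, so epochs are assumed here, and `model`, `X`, and `y` are assumed to come from the earlier sketches:

```python
# A minimal training loop for the stated setup: SGD, learning rate 0.01,
# batch size 128, 350 passes (assumed to be epochs).
import torch
from torch.utils.data import DataLoader, TensorDataset

def train(model, X, y, epochs=350, lr=0.01, batch_size=128):
    loader = DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()    # applies log-softmax internally
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            opt.step()
    return model
```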

Considering that the Transformer easily overfits on limited data, we sampled different percentages of the labeled data as the training set to explore the impact on classification and to determine the sampling rate for the subsequent experiments. Sampling rates of 1%, 3%, and 5% of the labeled data are used for model training, and the resulting accuracies are shown in Table 3. The accuracies of all methods improve significantly as the sampling rate increases, with the largest gain from 1% to 3% and the best accuracy at 5%. In the subsequent experiments, we set the sampling rate to 5% to avoid the underfitting caused by insufficient random sampling.
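
A sketch of such sampling follows, assuming it is performed per class (stratification is our assumption; the text only states a percentage of the labeled data):

```python
# A sketch of per-class random sampling at a given rate (e.g., 0.01-0.05).
import numpy as np

def sample_indices(labels, rate, seed=0):
    rng = np.random.default_rng(seed)
    train_idx = []
    for c in np.unique(labels[labels > 0]):          # skip unlabeled (0)
        idx = np.flatnonzero(labels == c)
        k = max(1, int(round(rate * idx.size)))      # at least one per class
        train_idx.extend(rng.choice(idx, size=k, replace=False))
    return np.array(train_idx)
```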

4.3. Investigation of the Model Feature Extraction Performances

The main goal of t-SNE (t-distributed stochastic neighbor embedding) [30] is to transform high-dimensional data into a low-dimensional representation for visualization. The t-SNE calculates the similarity among all samples according to the t-distribution and then reduces the dimensionality by projecting the data into 2D space.
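
A sketch of this visualization with scikit-learn and matplotlib follows; `features` may be either raw pixel vectors or model-extracted features, and the plotting parameters are illustrative:

```python
# A sketch of 2D t-SNE visualization of high-dimensional HSI features.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels):
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=2, cmap="tab20")
    plt.show()
```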

The complexity of the raw data and of the features extracted by the models is analyzed before evaluating the classification performance. The t-SNE is used to visualize the high-dimensional data, as shown in Figures 6–9. The following observations are drawn from the projected data distributions:

(1) The results on the Salinas dataset indicate that most ground objects are linearly separable, since some categories can easily be distinguished from others, while the misjudged categories need to be separated by nonlinear mapping via deep learning models.

(2) The Indian Pines dataset has the lowest spatial resolution (20 m) of the experimental datasets and suffers the most severe mixed-pixel phenomenon, which introduces the most nonlinear features, so the original data are difficult to separate.

(3) The projection of the Tea Farm dataset, whose spatial resolution is similar to that of Salinas, shows that the primary categories can be distinguished linearly and only a small number of categories are hard to distinguish, requiring feature extraction and classification by deep learning models.

(4) The Xiongan dataset has the highest spatial resolution (0.5 m), so each pixel contains fewer ground-object features, and the phenomena of "same material, different spectra" and "different material, same spectra" are more severe, making linear classifiers difficult to apply.

The feature extraction performance of the deep learning models on the different datasets is also visualized using t-SNE (see Figures 6–9). The CNN has a weaker feature extraction capability: features of various categories appear in a cross-aggregated state that is hard to classify. The ViT correctly classifies some categories misjudged by the CNN by extracting separable features through patch embedding and Transformer operations, showing better generalizability across HSI datasets than the CNN. The features extracted by Swin, which uses different attention window sizes similarly to a convolutional structure, also show an aggregated distribution. The NesT realizes the communication of feature information between patches through its block aggregation module, so it has a stronger ability to distinguish easily misjudged categories, consistent with the results in Figures 6–9.

4.4. Results

Tables 4–7 show the classification accuracy of each model. The results indicate that the classification accuracy of the machine learning methods is lower than that of the deep learning methods, mainly because the machine learning methods cannot exploit the spatial features in HSI data, resulting in low overall recognition accuracy.

The NesT exhibits the best accuracy: 99.61% on the Salinas dataset, 90.37% on the Indian Pines dataset, 99.25% on the Tea Farm dataset, and 99.86% on the Xiongan dataset. In particular, on the Indian Pines dataset, the NesT is significantly better, with improvements of 3.62%, 0.17%, and 14.66% over the CNN, ViT, and Swin, respectively. The experimental results show the effectiveness and generalization of the NesT for hyperspectral classification.

Furthermore, Figures 10–13 depict the visualization results and illustrate that the machine learning methods are unable to utilize spatial information, resulting in a severe salt-and-pepper phenomenon in the prediction results. The deep learning methods, in contrast, use spectral-spatial feature fusion and achieve good regional prediction on HSI.

From the classification results on the Salinas dataset (Figure 10 and Table 4), the pixels misjudged by the CNN show continuity as regionally distributed blocks and are relatively scattered across specific categories. As mentioned, the multiple stacked filters used in the CNN to extract features cause the misjudged regions to be fragmented, indicating that the CNN extracts only fragmentary information without forming an overall cognition. In contrast, the Transformers in ViT, Swin, and NesT help extract deep-level visual semantic information by focusing on the correlation between features and pixel spatial positions, which results in a "salt and pepper" phenomenon of discretely distributed misjudged points. The same signatures can also be observed in the classification results in Figures 11–13.

In the process of mining the internal information between pixels, the ViT extracts features by splitting an image into patches, which are easily affected by noise in HSI. Its MLP structure is relatively weak at extracting spatial features, which shows up as the "salt and pepper" phenomenon in the prediction results. Moreover, there is no information interaction between patches in ViT, so information cannot be integrated across the image. As a result, the model performs well on the Indian Pines dataset, which has more spectral features and fewer spatial features, while its classification accuracy is not satisfactory on the Salinas and Tea Farm datasets, which have more complex spatial structures.

The Swin realizes feature extraction over all patches by transforming the patch window size and applying the shifted-window operation, improving the accuracy on HSI with high spatial resolution. However, its poor performance on the Indian Pines dataset shows that this optimization provides relatively little help against the phenomena of "same material, different spectra" and "different material, same spectra."

The NesT achieves the top accuracy on all four experimental HSI datasets, which sufficiently demonstrates its effectiveness and generalization in pixel-wise classification tasks, in line with the preceding discussion. As can be seen from the NesT classification results in Figures 10–13 and Tables 4–7, its salt-and-pepper phenomenon is the lightest. The block aggregation structure in the NesT effectively reduces the number of pixels misjudged due to interference from strongly correlated adjacent pixels, making the model suitable for extracting pixel-wise features.

Deep learning models can compute, via the softmax function in the classifier, the probability of each category for a given sample, which traditional machine learning models cannot. In the experiments, the softmax outputs are used to calculate the entropy of the prediction results for each model. For positive samples (i.e., correctly predicted samples), smaller entropy indicates a better predictive effect, with other, erroneous categories appearing less frequently in the predictions. Conversely, for negative samples (i.e., misjudged samples), smaller entropy indicates that the model has learned more features of the misjudged categories and tends to overfit. The entropies of positive and negative samples thus provide an auxiliary evaluation of model performance. The entropies in Table 8 show that, compared with the NesT, the negative-sample entropies of the CNN are too small on the Salinas and Tea Farm datasets, indicating that the CNN does not correctly learn the features of misjudged samples and can be considered overfitted. The classification performance of ViT and Swin is lower than that of the NesT. Therefore, in most cases, the positive- and negative-sample entropies of the NesT also reflect its excellent ability to learn distinguishable features. In more detail, the Xiongan dataset has the highest resolution and denser features among pixels, so finer features must be extracted; the NesT obtains the lowest positive-sample entropy, indicating that it learns more inherent features of ground objects and is more decisive in classifying the category of a pixel. The NesT has the highest negative-sample entropies on the Salinas and Tea Farm datasets, showing a more hesitant attitude toward misjudged pixels, which suggests that with appropriate fine-tuning these pixels are more likely to be correctly classified.
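
The entropy computation described here can be sketched as follows; the epsilon term and the averaging over positive/negative subsets are our assumptions about the exact procedure:

```python
# A sketch of the entropy analysis: per-sample entropy of the softmax
# output, averaged separately over correctly predicted (positive) and
# misjudged (negative) samples.
import torch
import torch.nn.functional as F

def prediction_entropies(logits, targets):
    probs = F.softmax(logits, dim=1)
    ent = -(probs * torch.log(probs + 1e-12)).sum(dim=1)  # per-sample entropy
    correct = probs.argmax(dim=1) == targets
    return ent[correct].mean(), ent[~correct].mean()      # positive, negative
```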

5. Conclusion

In this paper, we propose a novel workflow based on the NesT for HSI classification. With the help of the block aggregation module in the NesT, the problem of fixed patch division in ViT is mitigated, enabling sufficient feature extraction between patches. The applicability and generalization of the NesT for HSI classification are explored for the first time. We conduct experiments on four public HSI datasets to investigate the performance of the NesT compared with other classical hyperspectral classification methods. The experimental results show that the deep learning methods can extract HSI spectral-spatial features and achieve higher classification accuracy than the machine learning methods. Among the deep learning models, the NesT achieves the best feature extraction and classification performance across all scenarios. The NesT has great potential as a candidate for hyperspectral classification, together with better model visualization. Our work enriches the usage scenarios of the NesT and opens a new window for HSI classification. However, the NesT also has limitations, such as high computing costs and the need for many training instances. In the future, we will conduct extensive tests analyzing the performance-complexity tradeoff of the NesT and try to apply it to the semantic segmentation and reconstruction of HSI.

Data Availability

The data supporting the reported results are available from the following sources: Graña, M., & Duro, R. J. (Eds.). (2008). Computational intelligence for remote sensing (Vol. 133). Springer (http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes); Zhang, X., Zhang, B., Zhang, L. F., & Sun, Y. L. (2017). Hyperspectral remote sensing dataset for tea farm. Global Change Data Repository (http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes); Yi, C., Lifu, Z., Xia, Z., Yueming, W., Wenchao, Q., Senlin, T., & Peng, Z. (2020). Aerial hyperspectral remote sensing classification dataset of Xiongan New Area (Matiwan Village). Journal of Remote Sensing, 24(11), 1299-1306 (http://www.hrs-cas.com/a/share/shujuchanpin/2019/0501/1049.html).

Conflicts of Interest

The authors declare no conflict of interest.

Authors’ Contributions

Zitong Zhang, Na Gong, and Qiaoyu Ma were responsible for conceptualization; Zitong Zhang was responsible for formal analysis; Zitong Zhang, Heng Zhou, and Na Gong were responsible for investigation; Zitong Zhang and Qiaoyu Ma were responsible for methodology; Qiaoyu Ma was responsible for software; Zitong Zhang was responsible for supervision; Zitong Zhang and Heng Zhou were responsible for validation; Qiaoyu Ma was responsible for visualization; Zitong Zhang and Na Gong were responsible for writing—original draft; Zitong Zhang, Qiaoyu Ma, and Heng Zhou were responsible for writing—review and editing; Zitong Zhang was responsible for writing—revised draft. All authors have read and agreed to the published version of the manuscript.