Abstract
Due to the imaging mechanism of hyperspectral images, the spatial resolution of the resulting images is low. An effective way to address this problem is to fuse the low-resolution hyperspectral image (LR-HSI) with the high-resolution multispectral image (HR-MSI) to generate a high-resolution hyperspectral image (HR-HSI). Currently, the state-of-the-art fusion approaches are based on convolutional neural networks (CNN), and few have attempted to use the Transformer, which shows impressive performance on high-level vision tasks. In this paper, a simple and efficient Transformer-based hybrid network is proposed to solve the hyperspectral image fusion super-resolution problem. We combine convolution and the Transformer as the backbone network to fully extract spatial-spectral information, taking advantage of the local focus of the former and the global focus of the latter. In order to pay more attention to informative features, such as the high-frequency information conducive to HR-HSI reconstruction, and to explore the correlation between spectra, a convolutional attention mechanism is used to further refine the extracted features in the spatial and spectral dimensions, respectively. In addition, considering that HSI resolution is usually high, we use the feature split module (FSM) to replace the self-attention computation of the native Transformer, which reduces the computational complexity and storage scale of the model and greatly improves the efficiency of model training. Extensive experiments show that the proposed network architecture achieves the best qualitative and quantitative performance compared with recent HSI super-resolution methods.
1. Introduction
Hyperspectral imaging can capture images of the same scene at different wavelengths simultaneously. Its rich spectral features are of great importance in the field of remote sensing [1]. HSI has been applied to tracking [2], classification [3–5], segmentation [6], and clustering [7], with significantly improved results. Compared with conventional images (e.g., colour or grayscale images), HSI contains richer spectral information of real scenes. However, images with high spatial resolution and high spectral resolution cannot be obtained simultaneously due to the limitations of existing imaging sensors. In contrast, it is easier for conventional cameras to obtain RGB (red, green, blue) or panchromatic images with higher spatial resolution but lower spectral resolution [8]. Therefore, fusing a low-resolution hyperspectral image (LR-HSI) with a high-resolution multispectral image (HR-MSI) is an effective way to solve the hyperspectral super-resolution problem. This process is often referred to as HSI super-resolution or HSI fusion.
Currently, the fusion problem is formulated as an image restoration problem [8–18]. The method follows a physical degradation model in which the input LR-HSI and HR-MSI are considered as spatially degraded and spectrally degraded observations of the latent HR-HSI, respectively. The observation model is expressed as
$$Y = XBD, \qquad Z = RX, \tag{1}$$
where $Y \in \mathbb{R}^{S\times wh}$, $Z \in \mathbb{R}^{s\times WH}$, and $X \in \mathbb{R}^{S\times WH}$ represent the two-dimensional matrices of LR-HSI, HR-MSI, and HR-HSI after unfolding along the third (spectral) dimension, respectively; $w$ and $h$ refer to the width and height at low resolution, $H$ and $W$ refer to the height and width at high resolution, and $s$ and $S$ represent the number of spectral bands of the multispectral and hyperspectral images, respectively. In addition, $B$, $D$, and $R$ are the blur matrix, subsampling matrix, and spectral response matrix, respectively. Based on the observation model (1), many methods have been proposed and achieved good performance.
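As a concrete illustration, the following is a minimal NumPy/SciPy sketch of the degradation model (1), assuming a Gaussian blur kernel, uniform subsampling by a factor `scale`, and a known spectral response matrix `srf`; these are placeholder choices, not the exact operators used in the experiments.

```python
# Sketch of the observation model (1): spatial degradation (blur + subsampling)
# produces the LR-HSI; spectral degradation through the SRF produces the HR-MSI.
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(X_hr, srf, scale=4, sigma=2.0):
    """X_hr: HR-HSI of shape (S, H, W); srf: (s, S) spectral response matrix."""
    S, H, W = X_hr.shape
    # Spatial degradation: blur each band, then subsample -> LR-HSI (S, H/scale, W/scale)
    blurred = np.stack([gaussian_filter(X_hr[b], sigma) for b in range(S)])
    Y_lr = blurred[:, ::scale, ::scale]
    # Spectral degradation: project the bands through the SRF -> HR-MSI (s, H, W)
    Z_msi = np.tensordot(srf, X_hr, axes=([1], [0]))
    return Y_lr, Z_msi
```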
HSI super-resolution has a large scale factor in both the spatial and spectral domains and is a highly ill-posed problem. Therefore, it is vital to incorporate prior information to constrain the solution space. References [19–21] exploit the prior knowledge that the spatial information of HR-HSI can be sparsely represented under a dictionary trained from HR-HSI [22]. A local spatial smoothness prior for HR-HSI is also assumed and encoded into the optimization model via total variation regularization. The low-rank characteristics of the spectrum are exploited to reduce spectral distortion [23]. Although valid for some applications, the rationality of these methods depends on subjective prior assumptions about the unknown HR-HSI. However, HR-HSI collected in real scenes is highly diverse both spatially and spectrally, and such traditional learning methods cannot adapt to different HSI structures.
In recent years, deep learning has outperformed traditional methods in many computer vision tasks [24], and satisfactory performance has also been achieved in the HSI/MSI fusion field. Enhanced deep learning models have been proposed by combining the observation model with the low-rank prior of the HSI spectrum. For example, Dian et al. [25] designed a deep model that learns the prior information inside the image through a deep CNN; this learned prior can be combined with traditional regularization models to obtain better image features than a single regularization model. In [26], a multiscale fusion model is proposed that adaptively fuses features with an attention model, showing the ability to retain spectral information and spatial details and thus achieving state-of-the-art HSI super-resolution results. Compared with traditional methods, CNN-based approaches are significantly better, but they still have drawbacks. For example, CNN-based methods rely on elaborate architecture design, and the models are usually complex. Moreover, CNNs focus mainly on local features and model long-range dependencies and global features poorly.
Recently, the Transformer [27] and its variants have achieved remarkable results in natural language processing and high-level computer vision tasks. Transformers have also been introduced into low-level vision tasks due to their excellent performance. For example, Chen et al. [28] proposed IPT, a multitask large-scale model based on the original Transformer, for image super-resolution. Liang et al. [29] proposed the SwinIR model based on the Swin Transformer to solve the image restoration problem. Both methods are aimed at natural image restoration and do not account for the properties of HSI. Subsequently, Hu et al. [30] proposed a pixel-level Transformer model, FuseFormer, to solve the HSI/MSI fusion problem. However, the network's ability to extract local features is insufficient, and it mainly focuses on restoring spatial details without considering spectral features.
Based on the above considerations, we propose a simple and efficient Transformer-based hybrid network model, HMFT, to solve the hyperspectral image fusion problem. Specifically, HMFT is mainly composed of spectral information extraction and spatial information extraction. (1) For spectral information, the LR-HSI is upsampled to the high-resolution scale and transmitted directly to the end of the network through a long connection, so as to retain its spectral information to the maximum extent; (2) for spatial information, a careful design fully combines the advantages of the CNN, which pays more attention to local features, and the Transformer, which pays more attention to global features, to extract the spatial details and the remaining spectral information. In addition, the feature split module (FSM [31]) is added to reduce the time and space complexity of the network, and the convolutional block attention module (CBAM [32]) is added to explore the correlation between spectra and to promote spatial enhancement and spectral consistency. Finally, the extracted spectral and spatial information are fused to generate the HR-HSI. In summary, the main contributions of this article are as follows:
(i) A novel Transformer-based model, HMFT, is proposed to solve the super-resolution problem of hyperspectral images. The self-attention mechanism of the Transformer can capture the global interaction between contexts and compensates for the fact that CNNs only focus on local features. We combine the advantages of both to extract rich feature information.
(ii) Considering the huge amount of hyperspectral data, the native Transformer requires a lot of memory and computation, and the model is difficult to train. Therefore, the feature split module (FSM) is introduced to replace the native Transformer self-attention computation and reduce the space and time complexity of the model.
(iii) Experimental results on three different datasets demonstrate that our proposed network model HMFT is effective and generalizes well compared with previous state-of-the-art methods.
2. Related Work
2.1. Deep CNN
Deep CNN-based learning methods [33–45] have achieved good performance in the field of image SR. Yang et al. [39] proposed the PanNet network model, which uses ResNet as the feature extraction backbone. In particular, the network is trained on the high-frequency information of the images, which reduces the training burden and retains more high-frequency details while enhancing the generalization ability of the model. MHFnet [34] formulates the fusion task as an optimization problem, combining the degradation model of the hyperspectral image with the spectral low-rank prior of HSI to construct the algorithm. Different from traditional optimization solvers, the authors unfold the proximal iterative algorithm into a CNN to learn the proximal operator and the model parameters. Hu et al. [26] designed a multiscale fusion model, HSRnet, to extract spatial information at different scales and introduced an attention mechanism that makes the network focus on the important components of the image and suppress noise. Although all these methods achieve good results, the fact that CNNs are limited by the size of their convolutional kernels cannot be ignored.
2.2. Vision Transformer
In recent years, the natural language processing model Transformer [27], with its excellent performance, has gradually been applied to image super-resolution. Chen et al. [28] proposed the multitask model IPT, which extracts global image information by stacking multiple native Transformer modules to solve low-level vision tasks. Liang et al. [29] proposed the SwinIR model, which uses residual Swin Transformer blocks (RSTB) as the basic unit of a deep feature extraction network to solve the single-image SR problem. Hu et al. [30] proposed the FuseFormer fusion model, which uses each pixel of the hyperspectral image as the input of the Transformer module to construct a pixel-level end-to-end mapping network. The Transformer module can significantly improve network performance thanks to its ability to build long-range dependencies across the image. However, due to its huge parameter scale and high GPU memory consumption, it is rarely used in the field of hyperspectral image fusion.
3. Methodology
This section describes HMFT in detail. The purpose is to learn an end-to-end mapping function $f_\theta(\cdot)$ with parameters $\theta$ by fully mining the spatial and spectral information among the low-resolution hyperspectral image LR-HSI, the high-resolution multispectral image HR-MSI, and the ground-truth HR-HSI. Finally, an image with high spatial resolution and hyperspectral characteristics is reconstructed through
$$\hat{X} = f_\theta(Z, Y),$$
where $Z$ and $Y$ are the HR-MSI and the LR-HSI, respectively, $\hat{X}$ represents the fusion result, and $\theta$ represents the entire set of network parameters, which can also be regarded as implicit prior knowledge.
3.1. Network Architecture
Figure 1 presents the overall schematic diagram of the HS/MS fusion-based super-resolution model HMFT. The network takes LR-HSI and HR-MSI as input and finally outputs an HR-HSI. The network is divided into upper and lower parts: the upper part consists of an upsampling module and a long residual connection, which preserve the spectral information in the LR-HSI to the greatest extent, and the lower part consists of convolutional layers, efficient Transformer layers, and the spatial-spectral attention module, which extract the spatial information and the remaining spectral information.

3.2. Input of Transformer
As Figure 1 shows, the network takes the LR-HSI $Y$ and the HR-MSI $Z$ as input. $Y$ is first upsampled to the same spatial scale as $Z$ using bicubic interpolation to obtain $Y_{\mathrm{up}}$, which is then concatenated with the HR-MSI along the spectral dimension and immediately passed through a 3 × 3 convolutional layer, which works well for early visual processing and is more stable and effective for extracting shallow spatial-spectral features [46].
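The following is a small PyTorch sketch of this input stage, assuming CAVE-like dimensions (31 hyperspectral bands, 3 multispectral bands) and 64 output channels; these numbers are illustrative only.

```python
# Illustrative sketch of the input stage: bicubic upsampling of the LR-HSI,
# concatenation with the HR-MSI along the spectral dimension, and a 3x3
# convolution for shallow spatial-spectral feature extraction.
import torch
import torch.nn as nn
import torch.nn.functional as F

def shallow_features(lr_hsi, hr_msi, conv3x3):
    # lr_hsi: (B, S, h, w); hr_msi: (B, s, H, W)
    up = F.interpolate(lr_hsi, size=hr_msi.shape[-2:], mode="bicubic", align_corners=False)
    x = torch.cat([up, hr_msi], dim=1)          # (B, S + s, H, W)
    return up, conv3x3(x)                       # keep the upsampled HSI for the long residual

conv3x3 = nn.Conv2d(31 + 3, 64, kernel_size=3, padding=1)   # e.g., CAVE: S = 31, s = 3
```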
The native Transformer [47] divides the original image into nonoverlapping blocks, stretches them into one-dimensional vectors, and adds positional embeddings to represent the positional relationships between patches. Since our test data come from images of various sizes in different scenes, this causes a parameter mismatch. For this reason, the positional embedding is removed and the unfolding technique is used to manipulate the feature map $F \in \mathbb{R}^{C\times H\times W}$. The partitions of the "Unfold" operation are sequential, so it automatically reflects the location of each patch [31]. In detail, the feature map $F$ is unfolded (with kernel = stride = $k$) into a patch sequence, i.e., $F_p \in \mathbb{R}^{N\times Ck^2}$, where $N = HW/k^2$ is the total number of 1-D features. After that, these tokens are sent to the Transformer module for further processing.
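A minimal PyTorch sketch of this unfolding step is shown below; the patch size `k` and the channel count are illustrative assumptions.

```python
# Sketch of the "Unfold" patch embedding: the feature map is split into
# non-overlapping k x k patches (kernel = stride = k), and each patch is
# flattened into a 1-D token.
import torch
import torch.nn as nn

feat = torch.randn(1, 64, 64, 64)                  # (B, C, H, W) example feature map
k = 4
tokens = nn.Unfold(kernel_size=k, stride=k)(feat)  # (B, C*k*k, N), N = (H/k)*(W/k)
tokens = tokens.transpose(1, 2)                    # (B, N, C*k*k) patch sequence
# After the Transformer, nn.Fold(output_size=(64, 64), kernel_size=k, stride=k)
# restores the spatial layout.
```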
3.3. Efficient Transformer Blocks
This part consists of several Transformer encoder modules. As illustrated in Figure 2, a single encoder block mainly consists of an efficient multihead self-attention module (EMHA [31]) and a multilayer perceptron (MLP), with layer normalization (Norm [48]) and residual connections interspersed.

Let us suppose that the input feature is $X \in \mathbb{R}^{N\times C}$. Due to the large number of hyperspectral image bands, the dimension of the input features after patch division is too high, which leads to too many training parameters; the model easily overfits, making the network difficult to train. Therefore, we add a reduction layer, consisting of a fully connected operation, to reduce the dimension of the input features by a factor of $n$. Then, for the input features, the query, key, and value matrices $Q$, $K$, and $V$ are calculated as
$$Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V,$$
where $W_Q$, $W_K$, and $W_V$ are the projection matrices. Generally, we have $Q, K, V \in \mathbb{R}^{N\times d}$, and $h$ is the number of heads. The original MHA directly uses $Q$, $K$, and $V$ for large-scale matrix multiplications, so a large amount of GPU memory and computational resources is occupied; i.e., when computing directly with $Q$ and $K$, the shape of the self-attention matrix is $N \times N$, which is then multiplied with $V$. However, hyperspectral images usually have high resolution, so $N$ after dividing into patches is very large. Obviously, direct computation is not suitable for hyperspectral data.
In SR tasks, the predicted pixels of a super-resolution image usually depend only on a local neighbourhood in the LR image. However, this local neighbourhood is still much larger than the receptive field of a CNN. The spatial and temporal complexity of the model can be reduced by dividing the features into blocks. Hence, we use the feature split module (FSM [31]) to divide $Q$, $K$, and $V$ into $s$ segments, where the $i$th segment is denoted by the triplet $(Q_i, K_i, V_i)$. The size of the local neighbourhood is controlled by $s$. Each segment performs a self-attention calculation separately and obtains the intra-segment self-attention matrix $O_i$, which is computed by the self-attention mechanism within a segment as
$$O_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{T}}{\sqrt{d}}\right) V_i.$$
We perform the attention function $h$ times in parallel and concatenate the results for multihead self-attention (MHA). Then, the results of all segments are combined to generate the complete attention matrix $O$. GPU memory usage is further reduced significantly by using this segmented self-attention computation.
Next, a multilayer perceptron (MLP) with two fully connected layers and a GELU nonlinearity between them is used for further feature transformation. Layer normalization (LN) layers are placed before the MHA and the MLP, and both modules use residual connections. Finally, to be consistent with the dimensions of the original input features, we restore them to their original dimension with an expansion layer. The whole process is formulated as
$$X' = X + \mathrm{EMHA}(\mathrm{LN}(X)), \qquad X_{\mathrm{out}} = X' + \mathrm{MLP}(\mathrm{LN}(X')),$$
where $X$ denotes the input to the efficient Transformer block.
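The following PyTorch sketch puts these pieces together in one efficient Transformer encoder block; the reduction factor, number of heads, number of segments, and MLP ratio are illustrative assumptions rather than the exact settings of HMFT.

```python
# Hedged sketch of an efficient encoder block with the feature split module:
# tokens are projected to Q/K/V at reduced dimension, split into `splits`
# segments along the token axis, and attention is computed inside each segment.
import torch
import torch.nn as nn

class EfficientBlock(nn.Module):
    def __init__(self, dim, heads=4, reduction=4, splits=4, mlp_ratio=2):
        super().__init__()
        self.inner = dim // reduction                 # reduction layer output size
        self.heads, self.splits = heads, splits
        self.reduce = nn.Linear(dim, self.inner)
        self.qkv = nn.Linear(self.inner, self.inner * 3)
        self.proj = nn.Linear(self.inner, self.inner)
        self.expand = nn.Linear(self.inner, dim)      # expansion layer back to dim
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def emha(self, x):                                # x: (B, N, dim)
        b = x.shape[0]
        h, d = self.heads, self.inner // self.heads
        q, k, v = self.qkv(self.reduce(x)).chunk(3, dim=-1)
        outs = []
        for qs, ks, vs in zip(q.chunk(self.splits, 1), k.chunk(self.splits, 1), v.chunk(self.splits, 1)):
            m = qs.shape[1]                           # tokens in this segment
            qs, ks, vs = (t.view(b, m, h, d).transpose(1, 2) for t in (qs, ks, vs))
            attn = (qs @ ks.transpose(-2, -1)) * d ** -0.5      # (B, h, m, m), not (B, h, N, N)
            outs.append((attn.softmax(-1) @ vs).transpose(1, 2).reshape(b, m, -1))
        return self.expand(self.proj(torch.cat(outs, dim=1)))   # back to (B, N, dim)

    def forward(self, x):                             # token sequence from the unfold step
        x = x + self.emha(self.norm1(x))
        return x + self.mlp(self.norm2(x))
```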
3.4. Spatial-Spectral Attention Module
Spectral characteristics are another important feature of HSI. Conventional convolution operations usually act on the entire set of bands, which leads to spectral disorder and distortion. In addition, the Transformer prefers to capture low-frequency information and lacks local high-frequency information. To address these problems, we use CBAM [32] (see Figure 3) as a spatial-spectral attention module to refine and correct the features. In detail, a weight descriptor for each channel is calculated along the spatial dimensions and then multiplied with the corresponding channel of the feature map $F$ to make it consistent with the GT spectral features, which plays an important role in correcting the spectrum. Next, a spatial weight descriptor is computed along the spectral dimension and multiplied with the pixels at each corresponding location to enhance important regions of the image, such as edges. The specific calculation steps are as follows:
$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big),$$
$$M_s(F) = \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big),$$
where $M_c(F)$ and $M_s(F)$ represent compression along the spatial dimensions (channel attention) and compression along the channel dimension (spatial attention), respectively, AvgPool and MaxPool represent the average and maximum pooling operations, respectively, and $\sigma$ is the sigmoid operation.
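A compact PyTorch sketch of this CBAM-style refinement is shown below; the reduction ratio and the 7 × 7 spatial kernel follow the original CBAM design and are assumptions here.

```python
# Sketch of the spatial-spectral attention: channel (spectral) attention from
# pooled spatial statistics, then spatial attention from pooled channel statistics.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(),
                                 nn.Linear(channels // reduction, channels))
        self.conv = nn.Conv2d(2, 1, kernel, padding=kernel // 2)

    def forward(self, f):                              # f: (B, C, H, W)
        # Channel (spectral) attention: compress along the spatial dimensions
        avg = self.mlp(f.mean(dim=(2, 3)))
        mx = self.mlp(f.amax(dim=(2, 3)))
        f = f * torch.sigmoid(avg + mx)[:, :, None, None]
        # Spatial attention: compress along the channel (spectral) dimension
        s = torch.cat([f.mean(1, keepdim=True), f.amax(1, keepdim=True)], dim=1)
        return f * torch.sigmoid(self.conv(s))
```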

3.5. Long Residual Connection
As shown in Figure 4, the upsampled LR-HSI and the GT HR-HSI have the same number of bands, and most of the spectral information of HR-HSI is contained in the upsampled LR-HSI. We plot the spectral vectors of the GT and $Y_{\mathrm{up}}$ at one location in Figure 4 to confirm this. Therefore, in order to maximize the retention of spectral information, $Y_{\mathrm{up}}$ is passed to the end of the network through a long residual connection and summed directly with the output features of the other part of the network, followed by a 3 × 3 convolution for adaptive spatial-spectral feature fusion, finally generating the high-resolution hyperspectral image $\hat{X}$. The remaining spectral information is acquired by the other part of the network. In addition, $Y_{\mathrm{up}}$ contains mostly low-frequency information, and transmitting it directly to the end makes the network focus on learning high-frequency information, reducing the pressure on the network to reconstruct the whole HR-HSI.
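The following sketch illustrates how the long residual connection fits into the overall forward pass; `backbone` is a placeholder for the convolution/Transformer/CBAM branch, and the channel counts are example values.

```python
# High-level sketch of the fusion pipeline with the long residual connection:
# the upsampled LR-HSI carries most of the spectral (low-frequency) content and
# is added to the backbone output before a final 3x3 fusion convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    def __init__(self, bands=31, msi_bands=3, width=64, backbone=None):
        super().__init__()
        self.head = nn.Conv2d(bands + msi_bands, width, 3, padding=1)
        self.backbone = backbone or nn.Identity()        # stands in for the lower branch
        self.tail = nn.Conv2d(width, bands, 3, padding=1)  # map features back to S bands
        self.fuse = nn.Conv2d(bands, bands, 3, padding=1)  # final spatial-spectral fusion

    def forward(self, lr_hsi, hr_msi):
        up = F.interpolate(lr_hsi, size=hr_msi.shape[-2:], mode="bicubic", align_corners=False)
        feats = self.tail(self.backbone(self.head(torch.cat([up, hr_msi], dim=1))))
        return self.fuse(up + feats)                     # long residual with the upsampled LR-HSI
```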

3.6. Loss Function
In order to measure the super-resolution performance, several cost functions have been studied to make the super-resolution result close to the real high-resolution ground truth. In the current literature, L2 and L1 are the most widely used and relatively reliable loss functions. Compared with the L1 loss, the MSE loss function is beneficial to the peak signal-to-noise ratio (PSNR), but it has several limitations, such as slow convergence and over-smoothing. Therefore, we use the L1 loss to measure the accuracy of network reconstruction, where the L1 loss is defined as the mean absolute error (MAE) between all reconstructed images and the ground truth:
$$\mathcal{L}_1 = \frac{1}{N}\sum_{i=1}^{N}\big\|\hat{X}^{(i)} - X^{(i)}\big\|_1,$$
where the superscript $i$ denotes the $i$th out of the $N$ total training images.
The previous loss can well preserve the spatial information of the super-resolution results. However, the correlation between spectral features is ignored, and the reconstructed spectral information may be distorted. In order to also ensure the spectral consistency of the reconstruction results, we apply a spectral angle loss between the reconstructed image and the ground truth:
$$\mathcal{L}_{\mathrm{SAM}} = \frac{1}{HW}\sum_{j=1}^{HW}\arccos\!\left(\frac{\sum_{s=1}^{S}\hat{X}_j^{(s)} X_j^{(s)}}{\big\|\hat{X}_j\big\|_2\,\big\|X_j\big\|_2}\right),$$
where the superscript $s$ denotes the $s$th out of the $S$ total spectral bands and $j$ indexes the pixels.
In summary, the final objective loss function used to optimize the model is the weighted sum of the previous two losses:
$$\mathcal{L} = \mathcal{L}_1 + \lambda\,\mathcal{L}_{\mathrm{SAM}},$$
where $\lambda$ is used to balance the contributions of the two losses. In our experiments, we set it as a constant.
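For concreteness, the following is a minimal PyTorch sketch of this objective, assuming batched tensors of shape (B, S, H, W); the weight `lam` is a placeholder value, not the constant used in the paper.

```python
# Sketch of the training objective: L1 reconstruction loss plus a per-pixel
# spectral angle (SAM) loss, combined with a balancing weight `lam`.
import torch
import torch.nn.functional as F

def sam_loss(pred, gt, eps=1e-8):
    # pred, gt: (B, S, H, W); angle between the spectral vectors at each pixel
    dot = (pred * gt).sum(dim=1)
    denom = pred.norm(dim=1) * gt.norm(dim=1) + eps
    return torch.acos((dot / denom).clamp(-1 + 1e-7, 1 - 1e-7)).mean()

def total_loss(pred, gt, lam=0.1):                  # lam is an assumed placeholder value
    return F.l1_loss(pred, gt) + lam * sam_loss(pred, gt)
```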
4. General Information
4.1. Data Sets
We conduct a detailed analysis and evaluation of our proposed method on three public hyperspectral image datasets: two natural hyperspectral image datasets, the CAVE dataset [49] and the Harvard dataset, and one remote-sensing hyperspectral image dataset, the Chikusei dataset [50].
The above three datasets serve as the ground-truth high-spatial-resolution HR-HSI, and the corresponding HR-MSI is generated using the corresponding camera spectral response function (SRF). The CAVE and Harvard datasets (see Figure 5) use the Nikon D700 SRF, and the Chikusei dataset uses the Canon EOS 5D Mark II SRF.


4.2. Comparison Methods
Five state-of-the-art hyperspectral super-resolution methods are selected as baselines for comparison with our proposed method. Three of them are traditional fusion methods, namely, CSTF [16], FUSE [51], and GLP-HS [52], and the other two are deep-learning fusion methods, MHFnet [34] and HSRnet [26]. For a fair comparison, all comparison methods use the publicly available code. In addition, the training data of HSRnet and MHFnet are consistent with those used in this paper.
4.3. Evaluation Measures
Six quantitative image quality metrics widely used in the imaging domain are used to comprehensively evaluate the performance of our proposed method, namely, Cross Correlation (CC), Spectral Angle Mapper (SAM) [53], Root Mean Square Error (RMSE), Relative Dimensionless Global Error in Synthesis (ERGAS) [54], Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity (SSIM) [55]. PSNR and SSIM evaluate the spatial reconstruction quality of each band of the image, CC and SAM evaluate the spectral reconstruction quality, and RMSE and ERGAS evaluate the overall reconstruction error.
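For reference, the sketch below shows illustrative NumPy implementations of two of these metrics (PSNR and SAM in degrees), following their standard definitions; the remaining metrics are computed analogously.

```python
# Illustrative implementations of PSNR and SAM for a single HSI.
import numpy as np

def psnr(pred, gt, data_range=1.0):
    mse = np.mean((pred - gt) ** 2)
    return 10 * np.log10(data_range ** 2 / mse)

def sam_degrees(pred, gt, eps=1e-8):
    # pred, gt: (S, H, W); mean spectral angle over all pixels, in degrees
    dot = (pred * gt).sum(axis=0)
    denom = np.linalg.norm(pred, axis=0) * np.linalg.norm(gt, axis=0) + eps
    ang = np.arccos(np.clip(dot / denom, -1.0, 1.0))
    return np.degrees(ang.mean())
```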
5. Results and Discussion
5.1. Performance on CAVE Dataset
The CAVE dataset contains 32 indoor HSIs captured under controlled lighting conditions, with an image size of 512 × 512 and 31 spectral bands covering 400 to 700 nm in 10 nm steps.
For the CAVE dataset, 20 images are randomly selected as the training set and the rest are used for testing. First, we normalize the data to the range [0, 1] and randomly crop each image to extract 3920 patches of size 64 × 64 × 31 as HR-HSI. Then, bicubic downsampling (the OpenCV-Python resize function) with a factor of 4 is applied to the HR-HSI to generate the corresponding LR-HSI patches. The HR-MSI patches are generated with the Nikon D700 spectral response function, consistent with most existing experiments; 80% of the training set is used for training and 20% for validation.
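The following sketch mirrors this simulation step, assuming HR-HSI patches stored as (H, W, S) arrays in [0, 1] and a camera SRF matrix; the function and variable names are placeholders.

```python
# Sketch of the LR-HSI / HR-MSI simulation: bicubic downsampling of each HR-HSI
# patch with OpenCV's resize (factor 4) and spectral projection through the SRF.
import cv2
import numpy as np

def make_pair(hr_patch, srf, scale=4):
    # hr_patch: (H, W, S) in [0, 1]; srf: (s, S) camera spectral response matrix
    h, w, _ = hr_patch.shape
    lr = cv2.resize(hr_patch, (w // scale, h // scale), interpolation=cv2.INTER_CUBIC)
    msi = hr_patch @ srf.T                      # (H, W, s) high-resolution MSI
    return lr, msi
```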
We test the trained model directly on the 11 test images. Table 1 gives the average metric results of the different methods on the 11 test images. To give the reader an intuitive sense, we select one test image (flower) and present the pseudo-colour images of the fusion results and the corresponding error maps for the different methods. Table 2 gives the metrics of the different methods on this image. It is obvious that our method outperforms the other comparison methods. As can be seen from the error maps in Figure 6, the error of our proposed method is smaller than that of the comparison methods, which shows that HMFT is more effective in recovering fine-grained texture and coarse-grained structure. In contrast, the fusion result of the HSRnet method shows some obvious light spots, while the image outline is still clearly visible in the MHFnet error map and its error is large. In addition, we plot the spectral vectors of the selected image to observe the spectral fidelity (see Figure 7). The spectral vectors of the fusion result of our method are most similar to those of the GT.


5.2. Performance on Harvard Dataset
The Harvard dataset contains 50 indoor and 22 outdoor HSIs captured under daylight illumination. The spatial size is 1392 × 1040 with 31 bands in 10 nm steps covering the visible spectrum from 420 to 720 nm.
We select the upper left corner (1000 × 1000) of each image and then randomly select 10 images for testing. As in the previous settings, the raw data are regarded as HR-HSI, and the LR-HSI and HR-MSI are obtained in the same way as in Section 5.1. Following the HSRnet approach, we test the models directly on the Harvard dataset without any retraining or fine-tuning, so the performance on the Harvard dataset directly reflects the generalization ability of the models.
The Harvard dataset has the same bands as CAVE, so no additional training is performed, and 10 images are randomly selected for direct testing. Table 3 gives the average metric results of the different methods. Likewise, we select a test image (computer) and plot the pseudo-colour images of the fusion results of the different methods, the corresponding error maps, and the spectral vectors. Table 4 gives detailed metrics of the different methods on this image. From Figure 8, there is a significant colour difference in the fusion results of MHFnet and HSRnet. In addition, the ERGAS and SAM values of CSTF and MHFnet fluctuate significantly, indicating that these models are sensitive to the parameters of different images and have weak generalization ability. In contrast, our proposed method is stable on all metrics, indicating that its generalization ability is better than that of the other methods. Figure 9 also shows that our spectral fidelity is better than that of the other methods.


5.3. Performance on Chikusei Dataset
In order to demonstrate the performance of our proposed method on hyperspectral remote-sensing images, we conduct experiments on the Chikusei dataset. The Chikusei dataset contains an airborne HSI taken by a visible and near-infrared (VNIR) imaging sensor over agricultural and urban areas in Chikusei, Ibaraki, Japan. The hyperspectral data have 128 bands in the spectral range from 363 nm to 1018 nm, and the scene consists of 2517 × 2335 pixels.
Likewise, the original data are treated as HR-HSI, and the LR-HSI is simulated following the previous experimental procedure. The HR-MSI is generated with the Canon EOS 5D Mark II spectral response function. After that, we select the 1024 × 2048 region in the upper left corner as training data and randomly crop 3920 overlapping patches of size 64 × 64. Eight nonoverlapping 512 × 512 patches are cropped from the remainder as test data.
Table 5 gives the average metrics of the different comparison methods on the test data. Clearly, our method outperforms the other comparison methods on every metric. Likewise, we select one image and display its pseudo-colour image and error map for visual comparison; Table 6 gives the corresponding metrics. It can be clearly seen from Figure 10 that the fusion results of FUSE, GLP-HS, and CSTF are blurry and contain obvious spectral distortion. In addition, we also plot spectral vectors to observe the spectral fidelity (see Figure 11). Both visually and numerically, our method still performs well on hyperspectral remote-sensing images.


5.4. Ablation Study
(1) Convolution Layer Analysis. The Transformer focuses more on global features, and its self-attention mechanism captures the global interactions between contexts well, while convolution focuses more on local features and captures rich local details. We believe that an effective combination of the two can better learn the spatial-spectral information representation. To verify the validity of the convolution, we compare our model with a variant that has no convolutional layers. Tables 7 and 8 show the average metrics of the two networks on the CAVE dataset and the Harvard dataset. All metrics improve with the convolutional layers, so the network with convolution performs better.
(2) Feature Split Module Analysis. Hyperspectral images usually have high resolution, and direct computation with the native Transformer leads to huge computation and storage costs and may even cause memory overflow. In the SR task, we consider that the predicted pixels of the super-resolution image usually depend only on a local neighbourhood in the LR image. Therefore, the same effect can be achieved by using the feature split module (FSM) to divide the features into blocks and then compute the attention. In order to demonstrate the effectiveness of the FSM, we conduct detailed comparative experiments on it. Table 9 shows the average quality metrics of the two models on the CAVE test images. Obviously, the network with the FSM module performs better; in particular, the test time differs by roughly a factor of 10 and the memory usage by roughly a factor of 4, where memory usage is measured as the difference in memory before and after the tested module runs.
(3) Convolutional Attention Mechanism Analysis. In order to pay more attention to informative features, such as the high-frequency information conducive to HR-HSI reconstruction, and to explore the correlation between spectra, we add a convolutional attention mechanism to further refine the extracted features in the spatial and spectral dimensions, respectively. To demonstrate its effectiveness, we compare the networks with and without the convolutional attention mechanism. Table 10 shows the comparison results on the 11 test images of the CAVE dataset. The network with the convolutional attention mechanism performs better.
6. Conclusions
In this paper, we propose a simple and efficient hyperspectral and multispectral image fusion method. The network first uses convolution to extract shallow features and then uses the Transformer to extract the spatial and spectral information of the LR-HSI and HR-MSI. We add the FSM module to the MHA module of the Transformer to reduce the computational and memory costs. In addition, the CBAM module is added so that the network emphasizes the more important channels and regions. Finally, the joint L1 and spectral angle loss is used to train the whole network.
In future work, the HSI and MSI fusion method proposed in this paper will be extended in two directions. On the one hand, multiscale techniques will be considered to further improve the feature extraction capability of the network; on the other hand, we will try to further reduce the computational and memory cost of the model.
Data Availability
The data that support the findings of this study are available on these websites (http://www.cs.columbia.edu/CAVE/databases/multispectral/, http://vision.seas.harvard.edu/hyperspec/download.html, http://naotoyokoya.com/Download.html).
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This project was supported in part by the National Key Research and Development Program of China (2019YFE0126600), the Major Project of Science and Technology of Henan Province (201400210300), the Key Science and Technology Research Projects of Henan Province (212102210393, 202102110121, 202102210352, 202102210368), and the Science and Technology Project of Kaifeng (2002001).