Abstract
MRI acquisition is often influenced by many factors, and single image super-resolution (SISR) based on a neural network is an effective and cost-effective alternative technique for the high-resolution restoration of low-resolution images. However, deep neural networks can easily overfit, which degrades test results, whereas shallow networks are difficult to fit quickly and cannot completely learn the training samples. To solve these problems, a new end-to-end super-resolution (SR) method is proposed for magnetic resonance (MR) images. Firstly, in order to better fuse features, a parameter-free chunking fusion block (PCFB) is proposed, which divides the feature map into branches by splitting channels and applies parameter-free attention to each branch. Secondly, the proposed training strategy, which combines perceptual loss, gradient loss, and L1 loss, improves the accuracy of model fitting and prediction. Finally, the proposed model and training strategy are evaluated on the IXISR super-resolution dataset (PD, T1, and T2) and compared with existing methods. Experiments show that the proposed method performs better than the compared methods on highly reliable metrics.
1. Introduction
MRI is a noninvasive in vivo imaging technology that uses the phenomenon of magnetic resonance to obtain molecular structure and thus information about the internal structure of the human body. MRI not only provides more information than many other medical imaging techniques, but it can also directly produce cross-sectional, sagittal, coronal, and various oblique images of the body. It does not produce the artifacts seen in CT, does not require contrast injection, does not involve ionizing radiation, and has fewer adverse effects on the body. MRI is very effective in detecting intracerebral hematomas, extracerebral hematomas, brain tumors, and other diseases. Of course, MRI has its shortcomings [1]: it is relatively slow, has lower spatial resolution than CT, and suffers from motion artifacts. Therefore, obtaining high-resolution MRI images has become a direction of current research.
High-resolution MRI can not only clearly show the relationship between tumor and surrounding tissues but also the anatomical structure of the brain. It has high application value in the early and middle stages of diagnosis [2].
However, the generation of high-resolution MRI images is often influenced by many factors, such as hardware equipment, imaging time, the motion of the human body, and environmental noise. Therefore, image super-resolution is an effective and cost-effective technique for restoring high resolution from the low-resolution images obtained by MRI and thus improving the spatial resolution of MR images. This technique makes it feasible to reconstruct low-resolution MRI images with a high signal-to-noise ratio and high resolution [3].
The traditional SR algorithms include interpolation-based and reconstruction-based methods, which generally struggle to reconstruct the high-frequency details of the image, are more complicated to compute, and take longer to reconstruct [4]. To solve these problems, scholars have applied deep learning to SR reconstruction in recent years and achieved many breakthroughs, and deep learning-based SR algorithms now occupy the mainstream position in SR research. In the field of medical images, deep learning-based SR algorithms can obtain prior knowledge from medical image training data and, based on this information, use neural networks to reconstruct low-resolution images into high-resolution images.
In recent years, with the continuous development of deep learning [5–8], many advanced deep learning-based SR methods have emerged in the field of image SR [9, 10], continuously enhancing its performance and efficiency. The super-resolution convolutional neural network [11] and the fast super-resolution convolutional neural network [12] were pioneering works of deep learning in the field of super-resolution reconstruction; they were the first to use a convolutional neural network (CNN) for super-resolution image reconstruction. Subsequently, on the basis of this pioneering work, researchers proposed many new super-resolution networks to further improve model performance, such as the deeply recursive convolutional network [13] and the deep recursive residual network [14], both based on recursive structures, and super-resolution using very deep convolutional networks [15]. FFTI [16] is a fine inpainting method for incomplete images based on feature fusion and two-step inpainting. However, most of these methods were aimed at natural images and are not suitable for medical images.
Recently, many studies in the field of medical images have also proposed SR methods for medical images, such as [17–21]. However, unlike ordinary images, high-quality medical image datasets are relatively scarce, most of the images are gray-scale, and the image content is relatively uniform. Training a deep network on such a dataset easily leads to overfitting and worse test results, whereas a shallow network is difficult to fit quickly and cannot learn the training samples completely. Therefore, SR models for medical images trained with a conventional network cannot meet the requirements of SR tasks.
Considering the above problems, in order to make an SR model more suitable for medical image tasks, in this paper we introduce residual learning and a parameter-free chunking fusion method. In the feature extraction stage, residual learning is designed similarly to the residual network [22] to acquire features, and it borrows LayerNorm [23] from the transformer. LayerNorm is used in residual learning to make the training smoother and avoid the impact of variance differences between different batches. Subsequently, a parameter-free chunking fusion block is used to better fuse features and perform effective feature enhancement. In this module, the feature map is divided into branches for different information transmission, SimAM [24] is then applied on each branch to enhance its features, and finally the semantic information of the different branches is integrated. Moreover, SimAM has no parameters to learn and can improve the model performance without adding trainable parameters. In addition, in order to further accelerate model fitting and improve prediction accuracy, this paper proposes a composite loss that combines perceptual loss, gradient loss, and L1 loss.
To address the above problems, this work makes three main contributions:
(1) A parameter-free chunking fusion block (PCFB) is proposed, which divides the feature map into branches for parameter-free attention and then integrates the feature information of the different branches. This better fuses and enhances features and improves the expressive ability of the feature map without adding parameters, thereby improving accuracy.
(2) A composite loss for our SR method is proposed, which combines perceptual loss, gradient loss, and L1 loss. The loss makes the model attend to errors in different dimensions, thus enhancing the model's expressiveness.
(3) A new end-to-end SR method for MR images is proposed, which contains PCFB and the composite loss and can improve SR performance more effectively and avoid overfitting.
The rest of this paper is organized as follows: Section 2 introduces related work. The proposed methods and experimental results are described in detail in Sections 3 and 4, respectively. We conclude the paper in Section 5.
2. Related Work
2.1. Super-Resolution in Deep Learning
With the development of deep convolutional neural networks (DCNN), research on super-resolution has made progress recently. For deep learning methods for SISR, response speed and reconstruction quality are important criteria for measuring super-resolution methods. The super-resolution convolutional neural network (SRCNN) [11] and the fast super-resolution convolutional neural network (FSRCNN) [12] were pioneering works of deep learning in the field of super-resolution reconstruction. The two neural networks first used bicubic interpolation to reduce and enlarge low-resolution images to obtain comparable super-resolution images. Then a convolutional neural network was introduced for the first time to achieve image reconstruction. In addition, the traditional SR method based on sparse coding can also be regarded as a deep convolutional network from the perspective of the two networks, and compared with the traditional method, all sublayers in the two networks were optimized to give full play to the performance of each component. The deeply recursive convolutional network (DRCN) [13] has a very deep recursion layer (up to 16 recursions), and recursive supervision and skip connections were further proposed to counter vanishing and exploding gradients. For deep models, the residual structure exhibits excellent performance. Therefore, the residual structure is introduced into the super-resolution method to make up for the shortcomings caused by vanishing and exploding gradients. The enhanced deep super-resolution network (EDSR) [25] was inspired by the residual structure. Compared with the traditional residual structure, the residual blocks of EDSR discard unnecessary modules, and on this basis a multiscale deep super-resolution system (MDSR) was constructed, which can reconstruct high-resolution images with different magnification factors in a single model. In addition, the SR robustness of images in complex scenes should also be considered. A heterogeneous group SR CNN [9] contains multiple heterogeneous group blocks. These blocks increase the internal and external relations of different channels in a parallel way to cope with SR in complex scenarios. An enhanced super-resolution group CNN (ESRGCNN) [26] can fully fuse the correlation between wide channel features and retain the long-distance context dependence in the upsampling operation to obtain more accurate low-frequency information. Further, in order to solve common problems in image super-resolution algorithms, such as image edge blurring caused by redundant network structure, inflexible selection of convolution kernel size, and slow convergence of the training process, MFFN [27] used a lightweight multilevel fusion method to achieve SISR.
2.2. Super-Resolution in Medical Imaging
The problem of super-resolution has been widely discussed in medical imaging. Due to limitations such as image acquisition time, low radiation dose, or hardware constraints, the spatial resolution of medical images is often insufficient [28]. To solve this problem, Zhu et al. [29] proposed a method for arbitrary-scale super-resolution of medical images (MIASSR), which combines meta-learning with a GAN and can be used for super-resolution at any magnification.
To get as many useful image details as possible, Bing et al. [20] proposed a SR method in medical imaging based on an improved generative adversarial network. This method can not only avoid the interference of high-frequency false information but also integrate the low-level feature constraints to train the model. Zhang et al. [21] proposed a fast medical image super-resolution method, in which subpixel convolution layer addition and mini-network replacement in the hidden layer were crucial to improving the speed of image reconstruction. Inspired by the super-resolution convolutional neural network method based on three hidden layers, Deeba et al. [18] proposed a wavelet-based microgrid network super-resolution method for medical images, where image restoration was speeded up by adding a subpixel layer to replace the small grid network on the hidden layer.
2.3. Attention Mechanism for Vision Tasks
Attention has arguably become one of the most important concepts in the field of deep learning. It was inspired by human biological systems, which tend to focus on distinctive parts when processing large amounts of information [30]. Liu et al. [31] proposed a multiattention domain module to weigh and reorganize the features; the channel and spatial domain information in the super-resolution method is effectively fused, and the quality of the super-resolution image is effectively improved. Wang et al. [32] proposed two new attention mechanisms: context-weighted channel attention and persistent spatial attention. The proposed attention modulates rich features by suppressing useless features and enhancing features of interest in a channel and spatial manner. Liu and Chen [33] made the following improvements on the basis of the super-resolution generative adversarial network (SRGAN). Firstly, they added the channel attention (CA) module to the SRGAN network and increased network depth to better express high-frequency features. Secondly, the batch normalization layer was removed to improve network performance. Finally, the loss function was modified to reduce the influence of noise on the image.
3. Methods
3.1. Overview
In the image super-resolution task, our goal is to take the low-resolution (LR) image $I_{LR}$ as the input of the super-resolution model and generate the super-resolution (SR) image $I_{SR}$. The low-resolution image is generally obtained by downsampling the ground-truth high-resolution (HR) image $I_{HR}$. We denote the super-resolution model as $F$ and its parameters as $\theta$. The super-resolution task can be expressed as the following formula:
$$I_{SR} = F(I_{LR}; \theta). \tag{1}$$
In order to make $I_{SR}$ as similar to $I_{HR}$ as possible, it is necessary to optimize the model with the loss function $\mathcal{L}$, and finally the optimal parameter $\hat{\theta}$ is obtained. The objective formula is as follows:
$$\hat{\theta} = \arg\min_{\theta} \mathcal{L}\big(F(I_{LR}; \theta), I_{HR}\big). \tag{2}$$
The proposed architecture of super-resolution is shown in Figure 1. Then, the details are given about the feature extraction block, parameter-free chunking fusion block (PCFB), and image reconstruction block. Finally, the composite loss and the training strategy are introduced to enhance the model’s expressiveness.

3.2. Network Architecture
3.2.1. Feature Extraction
The feature extraction part is composed of convolution, activation function, and residual block.
First, if the normal ReLU activation function is used, any feature value less than 0 is suppressed to 0, and the feature information is lost. Therefore, we use PReLU [34] (parametric rectified linear unit) to replace ReLU. PReLU adds a learnable parameter on the basis of ReLU, which can adjust the activation function according to different experimental conditions. The formula is as follows:
$$\mathrm{PReLU}(x) = \max(0, x) + a \min(0, x), \tag{3}$$
where $x$ represents the feature map and $a$ is a learnable parameter.
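The behaviour of PReLU on negative inputs can be illustrated with a minimal NumPy sketch. The function name `prelu` and the default slope `a=0.25` are illustrative assumptions; in the network the slope is learned during training rather than fixed.

```python
import numpy as np

def prelu(x, a=0.25):
    """PReLU: identity for positive inputs, slope `a` for negative inputs.

    `a` is a fixed illustrative value here; in training it is learnable.
    """
    x = np.asarray(x, dtype=np.float64)
    return np.maximum(0.0, x) + a * np.minimum(0.0, x)
```

Unlike ReLU, a negative input such as -2 is scaled to -0.5 (for `a=0.25`) instead of being set to 0, so the sign and relative magnitude of the feature survive.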
Second, if batch normalization (BN) is used, the differences in the mean and variance of the data across mini-batches may produce unstable statistics [35], and instance normalization [36] can avoid these small-batch problems. However, the work reported in [37] shows that adding instance normalization does not always bring performance improvement, and manual adjustment is required. Therefore, we introduce layer normalization (LN), which was used early on by the transformer [23]. Many recent SOTA methods [38–40] also use this normalization. LayerNorm is independent of the batch size, so it is not affected by the above problems, and unlike instance normalization it has no parameters that need to be manually adjusted. Therefore, LN is introduced to stabilize the training and improve the performance. The normalization formula is as follows:
$$y = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta, \tag{4}$$
where $x$ represents the feature map, $\epsilon$ is a small constant, $\mu$ is the mean, $\sigma^2$ is the variance, and $\gamma$ and $\beta$ are the scale and shift. The normalization has the same form as BN, but the difference is that LN computes the statistics for each single sample rather than over all samples of a batch together like BN.
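A short NumPy sketch makes the batch independence of LN concrete. It assumes a feature map of shape (N, C, H, W) and statistics taken over the C, H, and W axes of each sample; the paper does not state the normalized axes, so this choice, the function name `layer_norm`, and the scalar `gamma` and `beta` are assumptions for illustration.

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each sample of a (N, C, H, W) batch with its own statistics.

    Assumption: mean and variance are taken over (C, H, W) per sample.
    """
    x = np.asarray(x, dtype=np.float64)
    mu = x.mean(axis=(1, 2, 3), keepdims=True)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```

Because the statistics never mix samples, normalizing one sample alone gives the same result as normalizing it inside a larger batch, which is the property that avoids the mini-batch instability described above.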
3.2.2. Parameter-Free Chunking Fusion Block (PCFB)
In order to improve the propagation of feature information, Zhao et al. [41] designed the CSB module to help neural networks deal with hierarchical features with different attributes. Because CSB contains a large number of parameters that need to be learned and its fitting speed is slow, we propose PCFB, which maintains image quality without needing to learn a large number of parameters. In PCFB, chunking and fusing are represented as channel splitting and channel merging, respectively.
The difference from CSB is that the size of the chunking is determined by the parameter $n$: each input feature of size $C \times H \times W$ is divided into $n$ chunks, and each chunk has the size $\frac{C}{n} \times H \times W$. Subsequently, in order to carry out targeted feature enhancement for each chunk, SimAM is used to process the features of the different chunks. SimAM does not need any additional parameters to be learned, so the number of model parameters is not increased.
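(1) Chunking and Fusing. The input feature $X \in \mathbb{R}^{C \times H \times W}$ can be divided into $n$ chunks along the channel direction, and the dimension of each chunk is $\frac{C}{n} \times H \times W$. It can be formally expressed as follows:
$$[X_1, X_2, \ldots, X_n] = f_{chunk}(X), \tag{5}$$
$$X = f_{fuse}(X_1, X_2, \ldots, X_n), \tag{6}$$
where $f_{chunk}$ is the chunking function, which splits the feature map $X$ into $n$ chunks $X_1, \ldots, X_n$. In contrast, $f_{fuse}$ is the fusing function, which merges the chunks back to the original dimension using the concat function.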
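Chunking and fusing are inverse operations, which a few lines of NumPy can show. The names `chunk` and `fuse` are illustrative, and the sketch assumes a (N, C, H, W) layout with C divisible by the number of chunks.

```python
import numpy as np

def chunk(x, n):
    """Split a (N, C, H, W) feature map into n equal chunks along channels."""
    if x.shape[1] % n != 0:
        raise ValueError("channel count must be divisible by n")
    return np.split(x, n, axis=1)

def fuse(chunks):
    """Concatenate chunks along channels, restoring the original shape."""
    return np.concatenate(chunks, axis=1)
```

Applying `fuse(chunk(x, n))` returns the input unchanged, so any change in the fused output comes only from what is done to the individual chunks in between.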
(2) Parameter-Free Attention. Normally, spatial attention is used for spatial information, while channel attention is used for channel information. However, in human vision, spatial attention and channel attention coexist and jointly promote information selection in visual processing. Therefore, we need a three-dimensional attention that focuses on the features at each channel and spatial position, so the parameter-free 3D attention SimAM is used to enhance the features of the different chunks in this paper. The structure of the proposed method is shown in Figure 2.

SimAM evaluates the importance of each neuron by constructing an energy function $e_t^{*}$. The lower the energy, the greater the difference between the neuron and the surrounding neurons, and the higher the importance of the feature. The energy function is as follows:
$$e_t^{*} = \frac{4(\hat{\sigma}^2 + \lambda)}{(t - \hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda}, \tag{7}$$
where $t$ is a neuron, that is, a pixel of the feature map $X$; $\hat{\mu}$ and $\hat{\sigma}$ represent the mean and standard deviation of the feature map, respectively; and $\lambda$ is a hyperparameter.
Therefore, the importance of neurons can be obtained by $1/e_t^{*}$. In addition, the attention mechanism can be realized by weighting the feature map through the sigmoid function. The formula is as follows:
$$\tilde{X} = \mathrm{sigmoid}\left(\frac{1}{E}\right) \odot X, \tag{8}$$
where $\odot$ means element-wise multiplication, and $E$ is the energy matrix containing all $e_t^{*}$. This module does not introduce any additional trainable parameters, so it improves performance without increasing the number of network parameters.
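As a worked step, and assuming the minimal energy takes the closed form given in the SimAM paper [24], the inverse energy that enters the sigmoid simplifies to a squared, variance-normalized deviation plus a constant offset:

```latex
\frac{1}{e_t^{*}}
  = \frac{(t - \hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda}{4(\hat{\sigma}^2 + \lambda)}
  = \frac{(t - \hat{\mu})^2}{4(\hat{\sigma}^2 + \lambda)} + \frac{1}{2}
```

Under this assumption the inverse energy is never below 1/2, so every attention weight lies between sigmoid(1/2) and 1. A neuron equal to the mean receives the smallest weight, and the weight grows with the neuron's deviation from the mean.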
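(3) Parameter-Free Chunking Fusion Block. In order to better learn and enhance the features, we use equation (5) to obtain chunks and then let each chunk pass through equation (8) alone for 3D weighted attention. Equation (6) is then used to fuse the weighted chunks back into the original size, as in equation (9):
$$Y = f_{fuse}\big(\mathrm{SimAM}(X_1), \mathrm{SimAM}(X_2), \ldots, \mathrm{SimAM}(X_n)\big), \tag{9}$$
where $X_1, \ldots, X_n$ are the chunks of the input feature, $\mathrm{SimAM}(\cdot)$ denotes the attention weighting of equation (8), and $Y$ is the output of the block. The process is shown in Figure 1.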
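The whole block can be sketched in NumPy as chunk, attend, fuse. This is a sketch under stated assumptions, not the authors' implementation: the names `simam` and `pcfb`, the default `lam=1e-4`, and the (N, C, H, W) layout are illustrative. It also assumes the mean and variance are pooled over all positions of a chunk, which is one reading of "the mean and standard deviation of the feature map"; the SimAM reference implementation pools per channel instead.

```python
import numpy as np

def simam(x, lam=1e-4):
    """Parameter-free attention on one chunk of shape (N, C, H, W).

    Assumption: statistics are pooled over (C, H, W) of each sample.
    """
    n = x.shape[1] * x.shape[2] * x.shape[3] - 1
    d = (x - x.mean(axis=(1, 2, 3), keepdims=True)) ** 2
    var = d.sum(axis=(1, 2, 3), keepdims=True) / n
    inv_energy = d / (4.0 * (var + lam)) + 0.5   # 1 / e_t*
    return x * (1.0 / (1.0 + np.exp(-inv_energy)))  # sigmoid weighting

def pcfb(x, n_chunks=2):
    """Chunk along channels, weight each chunk with SimAM, then fuse."""
    x = np.asarray(x, dtype=np.float64)
    chunks = np.split(x, n_chunks, axis=1)
    return np.concatenate([simam(c) for c in chunks], axis=1)
```

The block has no trainable weights: its output depends only on the input and on the two fixed settings `n_chunks` and `lam`. Each chunk is weighted using only its own statistics, so changing the channels of one chunk leaves the output of the other chunks untouched.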
3.2.3. Image Reconstruction Block
In order to change the image to the super-resolution size, the upsampling operation is required, and we build the image reconstruction part to realize it. As shown in Figure 1, image reconstruction includes convolution, convolution, PReLU, and PixelShuffle [42] layers.
The main function of PixelShuffle is to obtain high-resolution feature maps by multichannel recombination of low-resolution feature maps. As shown in Figure 3, the feature maps of $r^2$ channels are recombined into an upsampled result of size $rH \times rW$ in a single channel, where $r$ is the upscaling factor and $H \times W$ is the size of each low-resolution feature map. PixelShuffle thus transforms the feature map from low-resolution space to high-resolution space.
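The channel-to-space rearrangement can be written with a reshape and a transpose. The sketch below assumes a (N, C·r², H, W) input and the channel ordering commonly used by deep learning frameworks, where input channel c·r² + i·r + j lands at sub-pixel offset (i, j); the function name `pixel_shuffle` is illustrative.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (N, C*r*r, H, W) into (N, C, H*r, W*r)."""
    n, c_in, h, w = x.shape
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c = c_in // (r * r)
    x = x.reshape(n, c, r, r, h, w)        # split channels into (c, i, j)
    x = x.transpose(0, 1, 4, 2, 5, 3)      # reorder to (n, c, h, i, w, j)
    return x.reshape(n, c, h * r, w * r)   # merge (h, i) and (w, j)
```

For r = 2, four 2 × 2 feature maps become one 4 × 4 map, and each 2 × 2 output block collects the values that the four input channels hold at the same spatial position.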

3.3. Our Composite Loss for Super-Resolution
3.3.1. Conventional Loss
Most super-resolution methods use pixel loss to optimize the network. Pixel loss measures the pixel-wise difference between the SR image and the HR image and includes L1 loss and L2 loss. Compared with L1 loss, L2 loss penalizes large errors more heavily but has a higher tolerance for small errors. In actual training, L1 loss [25, 43] shows better convergence than L2 loss and ultimately yields a higher peak signal-to-noise ratio (PSNR), so it is the most widely used loss function in the super-resolution field. The formula is as follows:
$$\mathcal{L}_{1} = \left\| I_{SR} - I_{HR} \right\|_1, \tag{10}$$
where $I_{SR}$ and $I_{HR}$ denote the SR image and the HR image, respectively.
However, such pixel loss does not consider aspects of image quality such as edges, textures, and high-frequency details, so the results may be too smooth to maintain sharp edges and good visual effects.
3.3.2. Perceptual Loss
In order to incorporate high-level feature loss on the basis of pixel loss, perceptual loss [44] is introduced. The perceptual loss uses the pretrained VGG [45] network to extract the high-level features of the image and is constructed from the Euclidean distance between the HR image features and the SR image features to restore the perceptual quality of the image. The formula of perceptual loss is as follows:
$$\mathcal{L}_{per} = \left\| \phi_l(I_{SR}) - \phi_l(I_{HR}) \right\|_2, \tag{11}$$
where $\phi_l$ denotes the $l$-th layer output of the VGG model, and $I_{SR}$ and $I_{HR}$ denote the SR image and the HR image, respectively.
3.3.3. Edge-Aware Loss
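In order to combine the loss of image edge information on the basis of pixel loss, we further introduce edge-aware loss [46]. In edge-aware loss, the edges of the SR image and the HR image are extracted with an edge extraction operator, and then the difference between the output edges and the label edges is calculated. In this paper, the Laplacian operator is used to extract edge features. The formula of edge-aware loss is as follows:
$$\mathcal{L}_{edge} = \left\| \mathrm{Lap}(I_{SR}) - \mathrm{Lap}(I_{HR}) \right\|, \tag{12}$$
where $\mathrm{Lap}(\cdot)$ denotes an edge extraction method based on the Laplacian operator, $\|\cdot\|$ measures the difference between the two edge maps, and $I_{SR}$ and $I_{HR}$ denote the SR image and the HR image, respectively.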
3.3.4. Our Composite Loss
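Our loss function uses L1 loss as the basic loss function, adds perceptual loss to avoid the loss of high-level features, and adds edge-aware loss to further monitor the integrity of image edge information. The formula is as follows:
$$\mathcal{L} = \mathcal{L}_{1} + \alpha \mathcal{L}_{per} + \beta \mathcal{L}_{edge}, \tag{13}$$
where $\mathcal{L}_{1}$, $\mathcal{L}_{per}$, and $\mathcal{L}_{edge}$ are the L1, perceptual, and edge-aware losses, and $\alpha$ and $\beta$ are hyper-parameters.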
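The weighted sum of the three terms can be sketched in NumPy for a single 2D image pair. Several choices here are assumptions made for illustration only: `features` is a hypothetical stand-in for the pretrained VGG feature extractor, the edge term uses a 4-neighbour Laplacian with a mean absolute difference, and the defaults `alpha=0.3` and `beta=0.1` pair the two weight values reported in Section 4.2 with the perceptual and edge terms in that order, which the text does not confirm.

```python
import numpy as np

def laplacian(img):
    """4-neighbour Laplacian of a 2D image with edge-replicated borders."""
    p = np.pad(img, 1, mode="edge")
    return (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
            - 4.0 * img)

def composite_loss(sr, hr, features, alpha=0.3, beta=0.1):
    """L1 + alpha * perceptual + beta * edge for one 2D image pair.

    `features` is a hypothetical callable standing in for VGG features.
    """
    sr = np.asarray(sr, dtype=np.float64)
    hr = np.asarray(hr, dtype=np.float64)
    l1 = np.abs(sr - hr).mean()
    per = np.sqrt(((features(sr) - features(hr)) ** 2).sum())  # Euclidean
    edge = np.abs(laplacian(sr) - laplacian(hr)).mean()
    return float(l1 + alpha * per + beta * edge)
```

A constant intensity offset between the SR and HR images is penalized by the L1 term but not by the edge term, since the Laplacian of a constant is zero; the edge term reacts only when the two images differ in local structure.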
We use our composite loss to optimize the proposed model, and the algorithm for training the model is shown in Algorithm 1.
4. Experiments
4.1. Dataset
The IXISR dataset was constructed by Zhao et al. [41] through further processing of the IXI dataset, which contains three types of MR images: 581 T1 volumes, 578 T2 volumes, and 578 PD volumes. In this work, we take the intersection of these three types of MR images to obtain 576 3D volumes of each type. These 3D volumes are then trimmed to 240 × 240 × 96 to fit the three scaling factors. For SISR, each 3D MR volume is divided into 96 gray-scale slices. LR images are generated by bicubic downsampling and k-space truncation. For truncation degradation, HR images are first converted to k-space by the discrete Fourier transform (DFT) and then truncated along the height and width directions.
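Truncation degradation can be sketched with NumPy's FFT routines. The details below are assumptions, since the text only states that k-space is truncated along height and width: the sketch keeps the central low-frequency block of the centred spectrum, takes the magnitude of the inverse transform, and rescales by the square of the scale factor so that intensities stay comparable. The function name `kspace_truncate` is illustrative, and even image sizes divisible by `scale` are assumed.

```python
import numpy as np

def kspace_truncate(hr, scale):
    """Simulate an LR slice by keeping only the central part of k-space.

    Assumes a 2D image whose even height and width are divisible by `scale`.
    """
    h, w = hr.shape
    lh, lw = h // scale, w // scale
    k = np.fft.fftshift(np.fft.fft2(hr))         # zero frequency at centre
    top, left = (h - lh) // 2, (w - lw) // 2
    k_lr = k[top:top + lh, left:left + lw]       # drop high frequencies
    lr = np.fft.ifft2(np.fft.ifftshift(k_lr))
    return np.abs(lr) / (scale * scale)          # keep intensity scale
```

For a 240 × 240 slice and a scale factor of 2, the result is a 120 × 120 image that contains only the lower half of the spatial frequencies in each direction.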
4.2. Implementation Details
Our method is implemented with the PaddlePaddle framework. Similar to the previous work, in the IXISR [41] dataset, we use 70% of the images as the training dataset, 10% as the validation dataset, and 20% as the test dataset. The mini-batch size is set to 16, and the parameter in the loss function is set to 0.3, the parameter is set to 0.1, and the parameter is set to 2. We use a size of randomly extracted from LR slices and the corresponding HR area. Data augmentation is simply achieved by random horizontal flipping and 90 degree rotation [25]. Training is conducted on an NVIDIA GeForce RTX 3090 GPU. We use Xavier initialization [47] for all model parameters and the Adam optimizer with an initial learning rate of 0.001 for iterative optimization. Through the optimization of Algorithm 1, a single iteration of the proposed model including all modules takes about one minute. The space complexity depends on the number of parameters involved in the calculation. Specifically, the number of parameters is reported in Table 1.
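The patch extraction and augmentation step can be sketched as follows. The function name `random_pair` and its parameters are hypothetical; in particular, `patch` is a placeholder argument because the patch size is not recoverable from the text, and drawing the rotation from multiples of 90 degrees is an assumption. The key point is that the same crop, flip, and rotation are applied to the LR patch and to the matching HR area.

```python
import numpy as np

def random_pair(lr, hr, patch, scale, rng):
    """Crop an aligned LR/HR patch pair and apply the same flip/rotation."""
    h, w = lr.shape
    top = int(rng.integers(0, h - patch + 1))
    left = int(rng.integers(0, w - patch + 1))
    lr_p = lr[top:top + patch, left:left + patch]
    hr_p = hr[top * scale:(top + patch) * scale,
              left * scale:(left + patch) * scale]
    if rng.random() < 0.5:                       # random horizontal flip
        lr_p, hr_p = lr_p[:, ::-1], hr_p[:, ::-1]
    k = int(rng.integers(0, 4))                  # rotation by k * 90 degrees
    return np.rot90(lr_p, k).copy(), np.rot90(hr_p, k).copy()
```

A typical call passes a generator such as `np.random.default_rng(0)` so that the sampled crops and transforms are reproducible across runs.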
4.3. Evaluation Metrics
For quantitative comparison, highly reliable metrics are introduced, namely root mean square error (RMSE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM). The metric scores are derived by comparing the result $I_{SR}$ obtained by the super-resolution method with the high-resolution image $I_{HR}$.
4.3.1. Root Mean Square Error (RMSE)
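$$\mathrm{RMSE} = \sqrt{\frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \big( I_{SR}(i, j) - I_{HR}(i, j) \big)^2}, \tag{14}$$
where $i$ and $j$ together represent the position of the pixel in the super-resolution result $I_{SR}$ and the high-resolution image $I_{HR}$, and $H$ and $W$ are the image height and width.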
4.3.2. Peak Signal-to-Noise Ratio (PSNR)
$$\mathrm{PSNR} = 20 \log_{10} \left( \frac{2^{b} - 1}{\mathrm{RMSE}} \right), \tag{15}$$
where $b$ is the number of bits per pixel value, which generally takes 8.
4.3.3. Structural Similarity Index (SSIM)
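$$\mathrm{SSIM}(x, y) = \frac{(2 \mu_x \mu_y + c_1)(2 \sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}, \tag{16}$$
where $x$ is obtained by the super-resolution method and $y$ is the high-resolution image; $\mu_x$ and $\mu_y$ are the averages; $\sigma_x$ and $\sigma_y$ are the standard deviations; $\sigma_{xy}$ is the covariance of $x$ and $y$; and $c_1$ and $c_2$ are small constants.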
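The three metrics can be computed directly from their definitions, as in the NumPy sketch below. The function names are illustrative. The SSIM here uses whole-image statistics rather than the sliding window that many toolkits apply, and the constants follow the conventional choice c1 = (0.01 · peak)² and c2 = (0.03 · peak)²; both points are assumptions, since the text only calls the constants small.

```python
import numpy as np

def rmse(sr, hr):
    """Root mean square error over all pixels."""
    sr = np.asarray(sr, dtype=np.float64)
    hr = np.asarray(hr, dtype=np.float64)
    return float(np.sqrt(np.mean((sr - hr) ** 2)))

def psnr(sr, hr, bits=8):
    """PSNR in dB for images whose peak value is 2**bits - 1."""
    err = rmse(sr, hr)
    if err == 0.0:
        return float("inf")  # identical images
    return float(20.0 * np.log10((2 ** bits - 1) / err))

def ssim_global(sr, hr, bits=8, k1=0.01, k2=0.03):
    """SSIM from whole-image statistics (no sliding window)."""
    x = np.asarray(sr, dtype=np.float64)
    y = np.asarray(hr, dtype=np.float64)
    peak = 2 ** bits - 1
    c1, c2 = (k1 * peak) ** 2, (k2 * peak) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (x.var() + y.var() + c2)
    return float(num / den)
```

Lower RMSE and higher PSNR and SSIM indicate a reconstruction closer to the HR image; an exact match gives an RMSE of 0, an infinite PSNR, and an SSIM of 1.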
4.4. Experimental Results
In this paper, the expressiveness of different models is compared in the case of the IXISR dataset (PD, T1, and T2) of super-resolution. PSNR, SSIM, and RMSE are used to evaluate the expressiveness of the model. Subdatasets are used under two different sampling (bicubic degradation and truncation degradation) in the dataset. Bicubic downsampling is widely used by LR image generation simulation in SR images, where bicubic downsampling is used to downsample HR images and generate LR images. Truncation degradation is a process that simulates the real image acquisition process. The LR image is obtained by k-space truncation, which means that the HR image is intercepted in frequency space for sampling.
Tables 1 and 2, respectively, show the evaluation results of different models of PD, T1, and T2 datasets under the bicubic downsampling and truncation degradation methods. From Figures 4 and 5, we can see that our model has higher expression ability than other models. Compared with the two residual-based networks SRResNet and EDSR, our module adds PCFB, which helps to improve the performance of the model.


4.5. Ablation Studies
The proposed method is an improvement of SRResNet, so the ablation experiment is also carried out around SRResNet. In Tables 3 and 4, we compare the number of parameters and the performance in PSNR, SSIM, and RMSE for all methods. Note that all results are the average values of PSNR, SSIM, and RMSE calculated from MR images on the same dataset. The experimental results show that the proposed method improves the PSNR, SSIM, and RMSE of LR images obtained from BD and TD by and , respectively, compared with SRResNet, although the amount of parameters is only lower. This shows that PCFB is more effective.
In order to evaluate the effectiveness of the composite loss we constructed, we performed ablation experiments with different loss functions on the PD data in the dataset, as shown in Table 5. Compared with L1 and L2 loss functions, the PSNR performance of our composite loss function is and higher than that of only using L1 and L2 loss, respectively, which is a very significant increase. However, there is no significant decrease in RMSE, which only decreases by 0.0013 and 0.0014. In conclusion, the above results show that the loss function designed by us can retain more effective features and provide more reference value for medical applications.
4.6. Model Visualization
In order to understand the ability of the proposed model, the model trained in the comparative experiment is used to visually predict the test data. Our method is compared with Bicubic, ESPCNN, VRCNN, SRResNet, and EDSR on the datasets obtained by the two down sampling methods. The visual results are shown in Figures 4 and 5. It can be seen from the enlarged detail feature map that the image reconstructed by Bicubic, ESPCNN, VRCNN, SRResNet, and EDSR methods still has fuzzy distortion to varying degrees, and the visual perception effect is inferior to our method.
5. Conclusion and Future Work
High-resolution MR images have smaller voxel sizes, providing clinical physicians with more accurate structural and textural details. However, generating high-resolution MR images usually incurs enormous costs. Image super-resolution is an effective and cost-efficient alternative technique for high-resolution restoration of low-resolution images. In this work, we propose a novel end-to-end MR image super-resolution method. First, we introduced a parameter-free chunking fusion block (PCFB) that splits the feature map into n branches to better fuse features without additional parameters. Second, a training strategy combining perceptual loss, gradient loss, and L1 loss played an important role in accelerating model fitting and improving prediction accuracy. Finally, the proposed method is effective in the super-resolution task of MR images, improving model accuracy. Our future work will focus on making the model more lightweight, reducing its parameters while maintaining the accuracy reported in this paper.
Data Availability
The IXISR dataset used to support the findings of this study is included within the article [41].
Disclosure
Mingyang Hou and Hongyi Wang should be considered as co-correspondents.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Authors’ Contributions
Mingyang Hou and Hongyi Wang have made equal contributions to this work.
Acknowledgments
This work was supported in part by the West Light Foundation of the Chinese Academy of Science, the Research Foundation of the Natural Foundation of Chongqing City (cstc2021jcyjmsxmX0146), the Scientific and Technological Research Program of Chongqing Municipal Education Commission (KJZDK201901504, KJQN201901537), the Bingtuan Science and Technology Program in China (Grant No. 2021AB026), the Scientific Research Foundation of Chongqing University of Science and Technology (Grant no. ckrc2020027), and the Chongqing Science and Technology Military-Civilian Integration Innovation Project (2022).