Abstract

U-Net++ is, after U-Net, one of the most prominent deep convolutional neural networks for medical image segmentation. However, the semantic gaps between its encoder and decoder subnets are still large, which leads to blurred feature maps and even blurred target regions in the segmentation. To solve this problem, we propose SCU-Net++, an improved semantic segmentation model that uses a channel attention mechanism and a Laplacian sharpening filter: the dense skip connections are redesigned with sharpening filters to narrow the semantic gaps, and channel attention modules make the model pay more attention to the feature maps that are useful for the pixel-level classification task. Compared with U-Net++, the proposed model achieves more competitive performance on the Pancreas Segmentation dataset and the Liver Tumor Segmentation dataset while adding only a very small number of learnable parameters, and thus incurs almost no additional training or inference cost. The proposed method is trained in deep supervision mode, which alleviates the vanishing gradient problem, and a pruning mechanism can be activated to accelerate inference.

1. Introduction

Since the advent of digital medical imaging equipment, the application of image processing technologies to medical image analysis has attracted extensive attention. Semantic segmentation is an important area of automatic medical image processing and promotes the development of automatic computer-aided diagnosis systems in the medical field.

Semantic segmentation of medical images refers to the process of classifying every pixel in a medical image into one of a set of predefined categories. For example, in the liver tumor segmentation task, a patient's CT case is taken as input to predict the category (such as background, liver, or lesion) to which every pixel in the CT image belongs. The goal is to understand each pixel in the image semantically, so as to distinguish regions of interest (ROIs) in the image, such as the pancreas, liver, and tumor, which helps doctors focus on the important parts of a medical image that are difficult to interpret during diagnosis. Semantic segmentation has many medical applications, such as brain tumor image segmentation [1] and lung nodule segmentation [2]. All of these achievements build on the proposal and development of convolutional neural networks [3]. In recent years, convolutional neural networks have permeated all fields of computer vision, from image classification to object detection and semantic segmentation.

Semantic segmentation of medical images has entered the era of deep learning thanks to the potential of convolutional neural networks. U-Net [4] is one of the landmark achievements. U-Net constructs an elegant architecture, namely, a “fully convolutional neural network”, which allows the network to propagate context information to higher-resolution layers and completely abandons the fully connected structure. The U-Net framework can produce more accurate segmentation from fewer training samples and has thus become a baseline model for semantic segmentation of medical images.

Many improved models based on U-Net have followed, such as U-Net++ [5], U-Net3+ [6], Attention U-Net++ [7], and Multi-Res U-Net [8]. Among the many improved models in the U-Net family, Attention U-Net++ is one of the best semantic segmentation models in terms of performance, while Sharp U-Net [9] stands out for being lightweight; the two represent different directions of improvement.

The authors of Sharp U-Net [9] consider it inappropriate for U-Net to directly concatenate and convolve the feature maps of the encoder and decoder to obtain results, because these feature maps are separated by semantic gaps. Encoder features are low-order and fine-grained because they are computed in the shallow layers of the network, while decoder features are high-order, coarse-grained, and semantic because they are computed in the deeper layers. Therefore, directly fusing features of different semantic levels from the encoder and decoder subnetworks blurs the feature map and thus degrades the segmentation of the region of interest. The authors propose applying a sharpening filter in deep medical image segmentation networks. Before being fused with decoder features, encoder features are convolved with a Laplacian sharpening filter to smooth artifacts and mitigate semantic mismatches between features. Because the filter has fixed parameters and does not need to be trained, the structure is very lightweight and does not increase the number of parameters of the U-Net model. However, its disadvantage is that the encoder and decoder features are not fully integrated, so the semantic gap remains large.

Attention U-Net [10] proposes a “gated attention” mechanism. Before the encoder and decoder features are fused, both the higher-order decoder features and the encoder features are fed into the attention block. The encoder feature map is fused with the higher-order decoder features through a grid-based gating mechanism, and the output of the attention block is then concatenated and convolved with the higher-order decoder features to obtain the output. This approach reduces the semantic gap between encoder and decoder and improves segmentation, but its disadvantage is the large number of parameters: each attention block contains three different convolution blocks. Attention U-Net++ is based on U-Net++; to improve the performance of U-Net++, it applies this kind of gated attention block to each skip connection of the U-Net++ architecture, which doubles the number of parameters. Such an improvement strategy is inadvisable, because the cost of training and inference is unacceptable.

Having weighed the shortcomings and advantages of a series of variants in the U-Net family, we propose SCU-Net++ (Sharp and Channel Attention U-Net++), a lightweight semantic segmentation model for medical images based on the U-Net++ architecture. Our model achieves more competitive performance than U-Net++ on the pancreas segmentation task and the liver tumor segmentation task at little additional cost. The proposed method alleviates the problem that the encoder features of Sharp U-Net are not semantically rich enough and makes up for the insufficient attention that U-Net++ pays to edge features. The contributions of this paper are summarized as follows:
(i) We redesign the skip connections between the encoder and decoder subnetworks with a Laplacian sharpening filter to reduce bad artifacts in training and make the skip pathways perform better.
(ii) A lightweight channel attention block is inserted before the convolution-sigmoid block that produces the segmentation results, which helps the model focus on the features beneficial to the segmentation task while adding almost no learnable parameters.
(iii) We demonstrate that our architecture performs more competitively than U-Net++ on pancreas and liver tumor segmentation through repeated experiments on two public medical datasets covering binary and multiclass segmentation.

2.1. Medical Semantic Segmentation Models

Before deep learning became mainstream in medical image segmentation, most processing algorithms were model-driven: they solved problems mainly by mathematical modeling according to the characteristics of the task. Representative model-driven algorithms include the threshold method, the region growing method, and the level set method [11]. Organs and lesions in CT images vary widely in morphology, which is difficult to express with simple mathematical models. Therefore, such algorithms generally run fast but perform modestly. With the growth of computing power, medical image segmentation gradually moved toward machine learning. Segmentation algorithms based on machine learning have been proposed continually, such as the liver tumor segmentation method based on a support vector machine (SVM) proposed by Vorontsov et al. [12].

Later, the excellent performance of convolutional neural networks (CNNs) in image classification set off a wave of deep learning in computer vision, and medical image segmentation was no exception. Because convolutional neural networks are relatively insensitive to image contrast and noise, they perform well on medical image segmentation tasks. Many classic CNN models were proposed for semantic segmentation, such as FCN [13], PSPNet [14], EncNet [15], and U-Net.

PSPNet proposed the spatial pyramid pooling module, a clever design for obtaining global context priors. It alleviates the segmentation errors of FCN by introducing more context information and multiscale information. The purpose of EncNet is similar to that of PSPNet. It introduces a Context Encoding Module to add contextual semantic information to a pretrained ResNet backbone, which is essentially implemented by dilated convolution. Many excellent semantic segmentation models have emerged, but the U-Net framework stands out, because medical image segmentation tasks usually involve fewer training samples and more noise than general segmentation tasks, which suits the elegant and concise framework of U-Net. The U-Net series achieves excellent segmentation results on different medical datasets and has become a new benchmark in medical image segmentation. Its encoder-decoder structure has been widely applied in different segmentation scenarios, such as HookNet [16] in histopathology and an embryo health prediction model [17]. The classical encoder-decoder structure proposed by U-Net has achieved great success, but the potential of this architecture has not been fully explored, so there is still considerable room for improvement.

2.2. U-Net Variants

The classic encoder-decoder structure proposed by U-Net is very concise, which reduces the impact of noise to a certain extent. It is very lightweight and performs excellently. Many variants are based on this structure, such as Multi-Res U-Net [8], which builds its encoders from convolutional kernels of different sizes, similar to the Inception [18] network, to enhance the feature extraction capability of the model. Attention U-Net improves the architecture with gated attention blocks and is as heavy as Multi-Res U-Net. U-Net++ redesigns the architecture with dense skip connections, which make the high-resolution feature maps from the encoder network semantically richer before they are fused with the feature maps from the decoder network and capture the fine-grained details of foreground objects more effectively. We find that the architecture of U-Net++ still has great potential, but it already has a large number of parameters and is therefore not suited to further improvement with blocks like those of Multi-Res U-Net or Attention U-Net.

2.3. Laplacian Sharpening Filter

Spatial filtering is a neighborhood-based image processing technique that operates on the neighborhood of each pixel in an image to achieve image enhancement (such as sharpening). The Laplacian sharpening convolution kernel (also known as a filter) is a discrete approximation of the Laplacian operator, a second-derivative operator, in image processing. The Laplacian of an image highlights the areas where the intensity changes rapidly, so it is often used for edge enhancement. The authors of Sharp U-Net [9] first introduced the sharpening filter, which sharpens image edges, into U-Net: they extended the U-Net architecture to sharpen encoder features before fusing them with decoder features and achieved better results than U-Net on several public datasets. We use this structure to improve the skip connections in U-Net++. The sharpening operation helps to balance the semantic gaps between the decoder and encoder subnets, sharpens feature details, and reduces the influence of artifacts on the feature maps during training. We use a Laplacian kernel commonly used for image sharpening. The initial parameters of the kernel are given by Equation (1).

Convolving an image with a Laplacian kernel increases the brightness of the center pixel relative to its neighbors, because the kernel consists of a positive center value and negative off-center values. In addition, the elements of the filter kernel always sum to zero. The role and usage of the filter in our network differ slightly from those in classical image processing. In a convolutional neural network, the filter alleviates artifacts during training and reduces the semantic gaps between encoders and decoders, whereas in image preprocessing it highlights image edges. In the network, the filter is implemented as a depth-wise convolution [19] whose parameters are fixed after initialization and receive no gradient updates, whereas in image preprocessing no gradient update strategy is involved at all.
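The fixed depth-wise filtering described above can be illustrated with a minimal pure-Python sketch. The kernel values are an assumed 3 × 3 example chosen only to match the stated properties (positive center, negative neighbors, zero sum); they are not taken from Equation (1). The function names and the zero padding are likewise hypothetical choices, and only the filtering step is shown, not how its result is fused with decoder features.

```python
# Assumed 3x3 Laplacian-style kernel (hypothetical values, not Equation (1)):
# positive center, negative neighbors, entries summing to zero.
LAPLACIAN_KERNEL = [
    [-1.0, -1.0, -1.0],
    [-1.0,  8.0, -1.0],
    [-1.0, -1.0, -1.0],
]


def sharpen_channel(channel, kernel=LAPLACIAN_KERNEL):
    """Filter one 2D channel with a fixed 3x3 kernel using zero padding.

    The kernel is symmetric, so correlation and convolution coincide here.
    """
    height, width = len(channel), len(channel[0])
    output = [[0.0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            total = 0.0
            for d_row in (-1, 0, 1):
                for d_col in (-1, 0, 1):
                    r, c = row + d_row, col + d_col
                    if 0 <= r < height and 0 <= c < width:
                        total += kernel[d_row + 1][d_col + 1] * channel[r][c]
            output[row][col] = total
    return output


def sharpen_features(feature_maps, kernel=LAPLACIAN_KERNEL):
    """Depth-wise filtering: each channel is filtered independently with the
    same fixed kernel, and no kernel weight is ever updated by training."""
    return [sharpen_channel(channel, kernel) for channel in feature_maps]


flat = [[2.0] * 5 for _ in range(5)]      # constant channel
spike = [[0.0] * 5 for _ in range(5)]
spike[2][2] = 1.0                         # single bright pixel
sharpened = sharpen_features([flat, spike])
```

On the constant channel the interior response is zero because the kernel sums to zero, while the isolated bright pixel is amplified relative to its neighbors, which is the behavior the paragraph above describes.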

2.4. Squeeze and Excitation Attention Module

The squeeze and excitation (SE) attention module is an extremely lightweight channel attention module first proposed in SE-Net [20]. The attention block consists of squeeze and excitation operations. The squeeze operation corresponds to “global average pooling,” and the excitation operation consists of two fully connected layers and a sigmoid function. We use this lightweight attention mechanism to make the model pay more attention to the abstract features that are more helpful for classification. The module architecture is shown in Figure 1.

3. Proposed Architecture: SCU-Net++

3.1. Motivations

We observed that the structure of U-Net++ significantly reduces the semantic gap between the encoder and decoder subnets, but the semantic information can be enriched further. However, the number of parameters of a five-layer U-Net++ has reached 20 M, which makes it unsuitable for further heavyweight improvement. For example, Attention U-Net++ [7], which adds a heavy attention block to every skip connection, is not a desirable design. In addition, different groups of convolutional kernels have different parameters, so the features they extract differ in importance. Therefore, channel attention is needed to make the model pay more attention to the feature maps that strongly affect segmentation.

After weighing the importance of high accuracy against that of a lightweight design, we propose a novel architecture, SCU-Net++ (Sharp and Channel Attention U-Net++), a semantic segmentation model for medical images that performs better than U-Net++ with almost no increase in the number of learnable parameters. The hypothesis underlying our architecture is that the Laplacian sharpening filter helps reduce bad artifacts during training and makes the skip pathways of U-Net++ perform better. The channel attention module also slightly improves performance and adds almost no training or inference cost, which fits our idea of lightweight design. Our experiments indicate that these improvements are effective.

Sharp U-Net first proposed applying the Laplacian filter to U-Net; we migrate this idea to U-Net++ and run experiments to demonstrate its effectiveness. The source code can be found on GitHub (https://github.com/cuihu1998/SCU-NetPP/).

3.2. Our Architecture

The structure is shown in Figure 1, and the depth is set to 4. Each node of the network is identified by two indices: the first indexes the number of layers downsampled by the encoder, and the second indexes the number of layers along the skip connection. The feature maps output by each node are computed from four operators. The first is a convolution followed by a normalization operator and an activation function. The normalization operator is instance normalization [21], and the activation function is Leaky ReLU [22], which can fix the problem of “neuron death” and thus makes the model easier to train. The second is concatenation along channels. The third is upsampling, implemented by transposed convolution. The fourth is the Laplacian sharpening filter, which corresponds to the “Sharp Conv Block” in Figure 2. It is implemented by convolution operators that apply the Laplacian kernel: the input feature map is convolved with the Laplacian kernel defined in Equation (1).

Another difference from U-Net++ is that we add channel attention blocks. In U-Net++, to introduce deep supervision into the training process, the outputs of four nodes are each converted by a convolution operator into as many channels as there are classification categories in the segmentation task and are then activated by a sigmoid function into the category prediction probability of each pixel. In our model, after these four outputs are converted into that number of channels by the convolution, each passes through a channel attention block followed by a sigmoid function. The segmentation outputs of all levels of the model are obtained by Equation (4).

Equation (4) uses the squeeze and excitation module described in Figure 2. The module in Figure 2 maps the input feature maps to the output using a global average pooling operator, two fully connected layers, and a Leaky ReLU function.
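As an illustration of the squeeze and excitation module, the following pure-Python sketch chains global average pooling, two fully connected layers, and a sigmoid gate. Placing the Leaky ReLU between the two fully connected layers is an assumption, as are the function names, the slope of 0.01, and the toy weights; none of them come from the paper's equations or released code.

```python
import math


def leaky_relu(value, slope=0.01):
    # Slope 0.01 is an assumed default, not a value given in the text.
    return value if value > 0 else slope * value


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def fully_connected(inputs, weights, biases):
    """Dense layer; `weights` holds one row of coefficients per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def squeeze_excite(feature_maps, w1, b1, w2, b2):
    # Squeeze: global average pooling yields one number per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_maps]
    # Excitation: FC -> Leaky ReLU -> FC -> sigmoid gives one gate per channel.
    hidden = [leaky_relu(v) for v in fully_connected(squeezed, w1, b1)]
    gates = [sigmoid(v) for v in fully_connected(hidden, w2, b2)]
    # Rescale every channel by its gate.
    scaled = [[[gate * v for v in row] for row in ch]
              for ch, gate in zip(feature_maps, gates)]
    return scaled, gates


# Toy input: two 2x2 channels, one hidden unit (all values hypothetical).
features = [[[4.0, 4.0], [4.0, 4.0]],
            [[2.0, 2.0], [2.0, 2.0]]]
scaled, gates = squeeze_excite(
    features,
    w1=[[0.5, -0.5]], b1=[0.0],
    w2=[[1.0], [-1.0]], b2=[0.0, 0.0],
)
```

Because the gates lie strictly between 0 and 1, the module can only attenuate channels relative to one another; the learnable parameters are confined to the two small fully connected layers.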

3.3. Hybrid Loss Function

We designed a hybrid loss function, which combines Dice loss [23] and crossentropy loss, to suit the deep supervision mode. It is computed from the segmentation result given by Equation (4) and the ground truth annotated by specialists. The formula contains the Dice coefficient loss, the crossentropy loss, a constant set to 1e-8, and a weight on the crossentropy loss, which we set to a fixed value in our experiments.
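A hybrid loss of this kind can be sketched for the binary case as follows. This is an illustration under assumptions rather than the paper's Equation (6): the soft Dice form, the places where the constant 1e-8 enters, the default crossentropy weight of 1.0, and the averaging over deep supervision outputs are all hypothetical choices.

```python
import math

EPSILON = 1e-8  # small constant; where it enters the formula is an assumption


def dice_loss(probabilities, targets, epsilon=EPSILON):
    """Soft Dice loss over flattened foreground probabilities."""
    intersection = sum(p * t for p, t in zip(probabilities, targets))
    denominator = sum(probabilities) + sum(targets)
    return 1.0 - (2.0 * intersection + epsilon) / (denominator + epsilon)


def cross_entropy_loss(probabilities, targets, epsilon=EPSILON):
    """Mean binary crossentropy, with probabilities clamped away from 0 and 1."""
    total = 0.0
    for p, t in zip(probabilities, targets):
        p = min(max(p, epsilon), 1.0 - epsilon)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(targets)


def hybrid_loss(probabilities, targets, ce_weight=1.0):
    """Dice loss plus weighted crossentropy (ce_weight is a placeholder value)."""
    return (dice_loss(probabilities, targets)
            + ce_weight * cross_entropy_loss(probabilities, targets))


def deep_supervision_loss(outputs, targets, ce_weight=1.0):
    """Average the hybrid loss over all supervised outputs (assumed scheme)."""
    losses = [hybrid_loss(output, targets, ce_weight) for output in outputs]
    return sum(losses) / len(losses)


targets = [1, 0, 1, 0]
perfect = hybrid_loss([1.0, 0.0, 1.0, 0.0], targets)
confident = hybrid_loss([0.9, 0.1, 0.9, 0.1], targets)
uncertain = hybrid_loss([0.5, 0.5, 0.5, 0.5], targets)
supervised = deep_supervision_loss(
    [[0.9, 0.1, 0.9, 0.1], [0.9, 0.1, 0.9, 0.1]], targets)
```

The Dice term responds to region overlap and the crossentropy term to per-voxel confidence, so the combined value decreases as predictions approach the annotation on both criteria.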

3.4. Model Pruning

We trained our network in deep supervision mode. The deep supervision mechanism trains multiple layers at different depths of the network simultaneously, so the shallow layers can output segmentation results as well as the last layer. If needed, the model can be pruned by dropping some layers on the right side of the network, which sacrifices a little segmentation accuracy to improve inference speed. For example, Figure 3 shows our model with its last layer dropped. In addition, training the network with deep supervision alleviates the vanishing gradient problem and improves the training process.

4. Experimental Details and Results

To evaluate the performance of the proposed method, we ran repeated experiments on two public datasets: NIH Pancreas-CT [24] and the MICCAI 2017 Liver Tumor Segmentation challenge (LiTS).

4.1. Datasets

4.1.1. NIH Pancreas-CT Dataset

A total of 82 abdominal contrast-enhanced 3D CT scans were acquired from 53 male and 27 female subjects at the National Institutes of Health Clinical Center. Seventeen subjects were healthy kidney donors scanned prior to nephrectomy. The remaining 65 patients were selected by radiologists from patients who had neither major abdominal lesions nor pancreatic cancer. The CT scan resolution was 512 × 512 pixels, and the slice thickness was between 1.5 and 2.5 mm. All data were manually segmented by medical students and validated or modified slice by slice by experienced radiologists.

4.1.2. Liver Tumor CT Dataset

The liver is a common site of tumor development. Automatic segmentation of tumor lesions is challenging due to the heterogeneity and diffuse nature of their shapes. MICCAI hosted the Liver Segmentation Challenge in 2017, which collected liver data from clinical centers around the world. The training dataset consisted of 130 CT scans and the test dataset of 70 CT scans, with the training dataset finely annotated by experts. The dataset annotation contains three labels: background, liver, and tumor.

4.2. Data Preprocessing

In the first stage, we crop the training cases to their nonzero region, which reduces the image sizes in the datasets and improves computational efficiency. Then, the following resampling and data augmentation methods are applied.

Resampling: in medical imaging, the voxel spacing (the physical space that each voxel represents) is heterogeneous across cases. To deal with this problem, we resampled each case to the median voxel spacing of the dataset, obtained by statistics, using third-order spline interpolation.
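The bookkeeping behind this resampling step can be sketched as follows: compute the per-axis median spacing over the dataset, then derive the voxel grid size each case needs at that spacing. The spacings, shapes, and function names below are hypothetical, and the interpolation itself is omitted; a third-order spline resize (for example `scipy.ndimage.zoom` with `order=3`) would be applied afterwards to reach the computed shape.

```python
from statistics import median


def target_spacing(spacings):
    """Per-axis median of the voxel spacings found in the dataset."""
    return tuple(median(axis_values) for axis_values in zip(*spacings))


def resampled_shape(shape, spacing, new_spacing):
    """Voxel grid size that preserves the physical extent at the new spacing."""
    return tuple(max(1, round(n * s / t))
                 for n, s, t in zip(shape, spacing, new_spacing))


# Hypothetical (z, y, x) voxel spacings in mm for three cases.
case_spacings = [(2.5, 0.8, 0.8), (1.5, 0.7, 0.7), (2.0, 0.9, 0.9)]
spacing = target_spacing(case_spacings)

# A case with 100 slices of 512 x 512 voxels at 2.5 x 0.8 x 0.8 mm.
new_shape = resampled_shape((100, 512, 512), (2.5, 0.8, 0.8), spacing)
```

In this toy example the case keeps its in-plane size but gains slices, since its slice spacing is coarser than the dataset median.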

Data augmentation: to improve the generalization performance of the model, we applied random rotation during training with a probability of 0.3. Then, zero-mean normalization was applied.

4.3. Training Details

The loss function is described in Equation (6) and combines Dice loss and crossentropy loss. We used the SGD optimizer with a momentum of 0.99 and an initial learning rate of 1e-2. If the average loss on the training set or validation set decreases by less than 5e-3 within 30 epochs, the learning rate is reduced by a factor of 5. When the learning rate falls below 1e-6, training stops early. The maximum number of training epochs was 1000. The model is trained on patches with a batch size of 2. Our model is based on fully convolutional neural networks, and all layers are initialized with Kaiming uniform initialization.
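The learning-rate rule above can be read as a plateau schedule, sketched here in plain Python. The class name and the decision to track a single average loss are assumptions (the text mentions both training and validation loss), and "decreases by less than 5e-3 within 30 epochs" is interpreted as 30 consecutive epochs without an improvement of at least 5e-3 over the best loss seen so far.

```python
class PlateauSchedule:
    """Hypothetical plateau-based learning-rate schedule with early stopping."""

    def __init__(self, lr=1e-2, patience=30, min_improvement=5e-3,
                 factor=5.0, min_lr=1e-6):
        self.lr = lr
        self.patience = patience
        self.min_improvement = min_improvement
        self.factor = factor
        self.min_lr = min_lr
        self.best_loss = float("inf")
        self.stale_epochs = 0

    def step(self, average_loss):
        """Update the learning rate after one epoch; return False to stop."""
        if self.best_loss - average_loss >= self.min_improvement:
            self.best_loss = average_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr /= self.factor      # reduce by a factor of 5
                self.stale_epochs = 0
        return self.lr >= self.min_lr


schedule = PlateauSchedule()
epochs_run = 0
for epoch in range(1000):                   # at most 1000 epochs
    epochs_run += 1
    # A loss that never improves, to exercise the stopping rule.
    if not schedule.step(average_loss=1.0):
        break
```

With a loss that never improves, the rate is divided by 5 every 30 epochs until it drops below 1e-6, so training ends long before the 1000-epoch cap.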

4.4. Evaluation Metrics

We used the Dice similarity coefficient, precision, and recall as indicators to evaluate the performance of the models. The three indicators are defined as:

$$\mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$

where TP is the number of foreground voxels correctly predicted as foreground, FP is the number of background voxels wrongly predicted as foreground, and FN is the number of foreground voxels wrongly predicted as background. Our indicators do not include IoU, because the Dice similarity coefficient and IoU carry the same meaning.
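The three indicators can be computed from voxel counts as in the sketch below, which uses the usual confusion-count definitions (a false negative being a foreground voxel predicted as background). The function names, the toy masks, and the convention of returning 1.0 for an empty denominator are assumptions. The last helper shows why IoU adds no information: it is a monotonic function of Dice.

```python
def confusion_counts(prediction, target):
    """Count TP, FP, and FN over flattened binary masks (1 = foreground)."""
    tp = sum(1 for p, t in zip(prediction, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(prediction, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(prediction, target) if p == 0 and t == 1)
    return tp, fp, fn


def segmentation_metrics(prediction, target):
    tp, fp, fn = confusion_counts(prediction, target)
    # Empty denominators are mapped to 1.0 here (an assumed convention).
    dice = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return {"dice": dice, "precision": precision, "recall": recall}


def iou_from_dice(dice):
    """IoU expressed through Dice: IoU = Dice / (2 - Dice)."""
    return dice / (2.0 - dice)


prediction = [1, 1, 1, 0, 0, 0, 1, 0]
target = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = segmentation_metrics(prediction, target)
iou = iou_from_dice(metrics["dice"])
```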

4.5. Experiment Results and Analysis

To ensure the fairness of the experiments, in every experiment all annotated data were randomly divided into three subsets: 70% as the training set, 10% as the validation set, and 20% as the test set. All evaluation results are based on the test set. Training and inference were carried out in the same configuration for all models involved. We repeated the experiments five times and recorded the average indicators of each model, which are shown in Table 1 and Table 2. Table 1 shows the results of the experiments on the pancreas dataset. We compared our proposed model with the outstanding and most widely used variants of U-Net, including Sharp U-Net, Attention U-Net, and U-Net++. SD represents the standard deviation of the model's Dice similarity coefficient over the five experiments. The repeated experiments on the Liver Tumor Segmentation dataset, shown in Table 2, compared our model only with the target model U-Net++ because this dataset is large.
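A random 70/10/20 split repeated over five runs can be sketched as below. The case identifiers, the use of the run index as the random seed, and the rounding rule (truncate the training and validation counts and give the remainder to the test set) are assumptions made for illustration.

```python
import random


def split_cases(case_ids, seed, train_fraction=0.7, val_fraction=0.1):
    """Shuffle case IDs with a fixed seed and cut them into three subsets."""
    shuffled = list(case_ids)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    n_val = int(len(shuffled) * val_fraction)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]        # remainder, roughly 20%
    return train, val, test


# Hypothetical identifiers for an 82-case dataset, split once per run.
case_ids = [f"case_{i:03d}" for i in range(82)]
splits = [split_cases(case_ids, seed=run) for run in range(5)]
```

Splitting at the case level rather than the slice level keeps every slice of a patient in a single subset, which avoids leakage between training and test data.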

Our proposed model has advantages in all three evaluation indicators and, most importantly, exceeds our baseline U-Net++ with only a small increase in the number of parameters (2 K). The time for model inference and training hardly changes. For instance, the time of each training step (two 3D CT samples) on the LiTS dataset increases from 1052 ms to 1083 ms on average, and the inference time per patch increases from 109 ms to 115 ms. The Params and FLOPs are calculated with the tool OpCounter (https://github.com/Lyken17/pytorch-OpCounter/) and are shown in the table. All experiments were carried out on a single RTX 3090 card with 24 GB of memory, and the measured speeds may vary slightly.

We also visualized the segmentation results, as shown in Figure 4. The visual comparisons correspond to the results in Table 1 and Table 2. In the top half of Figure 4, the red zone represents the pancreas. In the bottom half, the red zone denotes the liver, and the green zone represents the tumor.

5. Conclusions

In this paper, we proposed SCU-Net++, a semantic segmentation model based on channel attention and a Laplacian sharpening filter. We showed through repeated experiments that our model performs more competitively than U-Net++ on pancreas and liver tumor segmentation. Our method improves the performance of U-Net++ while adding only a small number of learnable parameters (2 K) and a small amount of additional computation, as stated in the table. We use the Laplacian sharpening filter (previously applied to image enhancement) in the model to make the skip pathways of U-Net++ perform better, and we implement this operation with a fixed-parameter depth-wise convolution. Each experiment was performed five times to demonstrate that the model is practical for semantic segmentation of medical images. However, our improvement method also has limitations. It targets our baseline model U-Net++, and extending it to other models may not necessarily improve their performance. Further research will be carried out to extend it to other models or to explore other forms of organization. For detailed questions, please contact cuihu@hrbeu.edu.cn.

Data Availability

The public datasets used to support the findings of this study, including NIH-Pancreas and LiTS, can be accessed by clicking the following hyperlink: NIH-Pancreas, LiTS (https://academictorrents.com/details/80ecfefcabede760cdbdf63e38986501f7becd49/).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (grant no. 62072135), Fundamental Research Funds for the Central Universities (grant no. 3072020CF0602), and Natural Science Foundation of Ningxia Hui Autonomous Region (grant no. 2022AAC03346).