Abstract

Recent years have witnessed the success of encoder-decoder structure-based approaches in lung region segmentation of chest X-ray (CXR) images. However, accurate lung region segmentation is still challenging due to the following three issues: (1) inaccurate lung region segmentation boundaries, (2) existence of lesion-related artifacts (e.g., opacity and pneumonia), and (3) lack of the ability to utilize multiscale information. To address these issues, we propose an edge-assisted computing and mask attention based network (called EAM-Net), which consists of an encoder-decoder network, an edge-assisted computing module, and multiple mask attention modules. Based on the encoder-decoder structure, an edge-assisted computing module is first proposed, which integrates the feature maps of the shallow encoding layers for edge prediction, and uses the edge evidence map as a strong cue to guide the lung region segmentation, thereby refining the lung region segmentation boundaries. We further design a mask attention module after each decoding layer, which employs a mask attention operation to make the model focus on lung regions while suppressing the lesion-related artifacts. Besides, a multiscale aggregation loss is proposed to optimize EAM-Net. Extensive experiments on the JSRT, Shenzhen, and Montgomery datasets demonstrate that EAM-Net outperforms existing state-of-the-art lung region segmentation methods.

1. Introduction

The lung is one of the most important organs in the human body. There are many types of lung diseases with high incidence [1]. Among them, lung cancer is the most common and deadliest tumor in China, with a 5-year survival rate of 16.4% [2]. Chest X-ray (CXR) is a widely used technique to identify lung diseases. However, manual interpretation of CXR images is usually time-consuming, laborious, and subjective. To address this, many efficient image processing methods have been proposed for the computer-aided diagnosis of lung diseases, in which lung region segmentation is a critical task. It provides basic information on lung shape and size measurements, which can be used for pathological analysis [3]. Therefore, automatic lung region segmentation of CXR images deserves an in-depth study.

Traditional image processing methods use edge detection, thresholding, and clustering to segment the target area of an image [4]. These methods are simple to implement but suffer from poor generalization performance. With the development of deep learning technology, the application of convolutional neural networks (CNNs) for image segmentation is increasing. Compared with traditional image processing methods, CNNs can automatically learn meaningful features from data, thus achieving improved segmentation performance.

Currently, the encoder-decoder structure is one of the most effective CNN models in the field of image segmentation because it preserves the detailed features of images [5]. For example, Long et al. [6] proposed a fully convolutional network, which can take an image of any size as input and produce a pixel-wise segmentation prediction map. Ronneberger et al. [7] proposed a U-shaped network (called UNet), which has been widely used in medical image analysis. UNet consists of three parts: a shrinking path, an expanding path, and skip connections. The input image is downsampled four times on the shrinking path for feature extraction, and upsampled four times on the expanding path to restore to its original input size. Between the shrinking and expanding paths, feature maps at the same level are connected by skip connections to supplement the loss of image details during successive downsampling.

Although encoder-decoder structure-based approaches have achieved remarkable results in the field of image segmentation, precise lung region segmentation is still challenging due to the following three issues. First, the calculation of various pathological indicators relies on accurate lung region segmentation boundaries. However, the edge of the lung region often has some problems such as oscillation, deformation, and noise. As a result, the edge cannot be preserved well. Second, in more complex cases, patients may suffer from some diseases that affect the lung region segmentation, such as opacity, consolidation, tuberculosis, pulmonary nodules, and pneumonia, as shown in Figure 1. From Figure 1, the high-intensity abnormal pixels overlap with the real lung region, thus reducing the contrast of the lung region boundaries. During segmentation, lung regions that overlap with opacity may be incorrectly predicted as lung boundaries. Third, due to the loss of information caused by successive downsampling operations, the model cannot make full use of the effective information of features at different scales during training, including semantic features for lung region identification and detailed features for edge localization.

To address the abovementioned issues, this paper proposes an edge-assisted computing and mask attention based network (called EAM-Net) for lung region segmentation of CXR images. EAM-Net follows the classic encoder-decoder structure, where the encoder extracts the feature map of an input image, and the decoder restores the feature map to the original input size. In order to learn robust lung region segmentation boundaries, an edge-assisted computing module is designed, which fuses the feature maps of the first three encoding layers to generate an edge evidence map. Note that this edge evidence map is used as a strong cue to guide the lung region segmentation, thus refining the edge parts of the lung region segmentation result. We further design a mask attention module after each decoding layer, which employs a spatial attention mechanism to learn important regions on the segmentation features while ignoring irrelevant artifacts. EAM-Net is optimized by a multiscale aggregation loss, which consists of a multiscale lung region segmentation loss and an edge prediction loss. Overall, EAM-Net has the capability to deal with the abovementioned three issues in lung region segmentation of CXR images. 
The main contributions of this paper are summarized as follows:
(i) We propose the edge-assisted computing module, which integrates the feature maps of the shallow encoding layers for edge prediction and transfers meaningful edge information to guide the lung region segmentation.
(ii) We design a mask attention module after each decoding layer, which utilizes the feature map output from the decoding layer for segmentation prediction, and enhances the lung regions on the segmentation features through a mask attention operation.
(iii) A multiscale aggregation loss is proposed to jointly supervise the lung region segmentation task and edge prediction task in EAM-Net, which is beneficial to explore the correlation between these two tasks and can make full use of the multiscale information through a deep supervision method.
(iv) Extensive experiments are conducted on the JSRT, Shenzhen, and Montgomery datasets to verify the effectiveness of EAM-Net. The results show that EAM-Net can provide more accurate lung region segmentation results than all its competitors.

The rest of this paper is organized as follows: Section 2 introduces the related work. The proposed approach, i.e., EAM-Net, is elaborated in Section 3. Section 4 and Section 5 present the experimental setup and the experimental results, respectively. Finally, Section 6 concludes this paper.

2. Related Work

Since our method is based on the encoder-decoder structure and also makes use of the attention mechanism to enhance features, the related work on both topics for lung region segmentation is briefly introduced in this section.

2.1. Encoder-Decoder Structure

In the field of medical image segmentation based on deep learning [8–10], the encoder-decoder structure is one of the most commonly used network structures [11–13]. UNet [7] and SegNet [14] are two representatives of the encoder-decoder structure-based methods. In the following, we introduce the related work on these two methods and their variants for lung region segmentation.

UNet is a segmentation network proposed for medical images. The core of UNet is the skip connection, which combines high-level semantic information with low-level image details. Novikov et al. [15] proposed an improved UNet called InvertedNet to segment the clavicle, lung, and heart. They added a Gaussian decay after each convolutional layer, and used the exponential linear unit [16] instead of the rectified linear unit (ReLU) to speed up model training. Yahyatabar et al. [17] proposed a lightweight UNet model called Dense-Unet for lung region segmentation, which utilizes dense connections among layers for feature reuse. Rahman et al. [18] proposed a two-stage framework for lung region segmentation. In the first stage, a UNet model is trained to generate initial segmentation results for lung regions. In the second stage, morphological processing is performed on the initial segmentation results to obtain the final refined segmentation results. Abas Hasan and Mohsin Abdulazeez [19] proposed an improved UNet with fewer parameters and a shorter network training time. Compared with the original UNet, it reduces the number of convolution kernels in each layer and the network training time by 75% and 70%, respectively. Zhang et al. [20] proposed an edge attention guided network called ET-Net for lung region segmentation. ET-Net employs an edge-guided module to learn edge representations, and a weighted aggregation module to integrate the edge representations. Solovyev et al. [21] proposed a Bayesian-based deep learning framework for lung region segmentation using a standard ResNet50 as the encoder and a Bayesian feature pyramid network as the decoder. Kholiavchenko et al. [22] proposed a contour-aware segmentation framework for lung region segmentation, which combines the advantages of UNet, LinkNet, and Tiramisu architectures. Arsalan et al. [23] proposed an improved UNet called X-ray-Net for multiclass segmentation of the lung, heart, and clavicle bones. X-ray-Net uses a residual mesh in its network to preserve the spatial edge information. Milletari et al. [24] proposed a U-shaped approach called CFCM for lung region segmentation. The core idea of CFCM is to use long short-term memory networks to fuse features from different encoding layers.

Different from UNet, SegNet utilizes the maximum pooling index for upsampling, with the aim of better preserving the boundary feature information. In SegNet, the index of the maximum value in the pooling layer is recorded, which is directly used for the unpooling operation during upsampling. Kalinovsky and Kovalev [25] combined SegNet with histogram equalization and local contrast normalization (LCN) for lung region segmentation. The histogram equalization is employed on the input image in the preprocessing stage, and the LCN operations are implemented in each encoding layer. In [25], the segmentation accuracy is 96.2% on 354 CXR images. Saidy and Lee [26] employed a modified SegNet for lung region segmentation, which performs batch normalization and ReLU activation functions after each convolutional layer, and adopts pooling indices to restore the feature maps to the original input size. They achieved the Dice coefficient of 96% on a test set of 35 unseen images. Mittal et al. [27] proposed LF-SegNet for lung region segmentation, which upsamples feature maps by simple replication and uses dropout layers for regularization and avoiding overfitting. On 199 images from the JSRT and Montgomery datasets, the average accuracy and Jaccard of LF-SegNet are 98.73% and 95.10%, respectively.
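The index-based unpooling described above can be illustrated with a minimal sketch in plain Python. It is not the SegNet implementation: the function names, the 2 x 2 window, and the toy 4 x 4 feature map are assumptions made for illustration, and the input height and width are assumed to be divisible by the window size.

```python
def max_pool_with_indices(x, k=2):
    """k x k max pooling (stride k) on a 2D list; also records argmax positions."""
    h, w = len(x), len(x[0])
    pooled, indices = [], []
    for r in range(0, h, k):
        pooled_row, index_row = [], []
        for c in range(0, w, k):
            # Position of the maximum inside the current window.
            best = max(
                ((i, j) for i in range(r, r + k) for j in range(c, c + k)),
                key=lambda p: x[p[0]][p[1]],
            )
            pooled_row.append(x[best[0]][best[1]])
            index_row.append(best)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool(pooled, indices, out_h, out_w):
    """Put each pooled value back at its recorded position; other entries stay zero."""
    out = [[0] * out_w for _ in range(out_h)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (i, j) in zip(pooled_row, index_row):
            out[i][j] = value
    return out


# Hypothetical 4 x 4 feature map.
feature = [[1, 3, 2, 0],
           [4, 2, 1, 5],
           [0, 1, 7, 2],
           [2, 6, 3, 1]]
pooled, idx = max_pool_with_indices(feature)
restored = max_unpool(pooled, idx, 4, 4)
```

In this toy case the restored map is sparse: each maximum returns to its original location, which is how boundary positions survive the downsampling and upsampling round trip.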

2.2. Attention Mechanism

The attention mechanism in deep learning is similar to human visual attention, which can focus on information that is more important to the current task. Many attention-based methods have been proposed to enhance the feature representation capabilities of CNNs [28, 29]. For example, Hu et al. [28] proposed SENet that models the dependencies among channels to improve the performance of the network. Fu et al. [29] proposed a dual attention network called DANet for scene segmentation. DANet jointly employs a position attention module and a channel attention module to learn spatial and channel dependencies.

There are also some attention-based approaches for lung region segmentation. Tang et al. [30] proposed a criss-cross attention [31] based network for lung region segmentation, in which the criss-cross attention is used to capture global contextual information and enhance pixel-level representations in both horizontal and vertical directions, thus boosting the lung segmentation performance. Kim and Lee [32] proposed a UNet with a self-attention module for lung region segmentation. They utilized the self-attention mechanism for feature optimization and achieved improved performance on several medical image segmentation datasets. Cao and Zhao [33] proposed a lung region segmentation method to address the issue of opacity regions in patient CXR. They designed a three-terminal attention mechanism, which integrates both the channel and spatial attention mechanisms to make the model focus on target regions. Li et al. [34] introduced the SE module [28] into a fully convolutional neural network for lung region segmentation. Given an input feature map, the SE module can automatically learn and generate the channel attention weight. Then, the input feature map can be adaptively calibrated and optimized along the channel dimension by using this weight.

From the above introduction, we can find that existing methods usually make use of the original encoder-decoder architecture or its attention-based variants for segmentation. Accurate lung region segmentation is still challenging due to inaccurate lung region segmentation boundaries, existence of lesion-related artifacts, and lack of the ability to utilize multiscale information. Therefore, to further improve the performance of lung region segmentation, it is necessary to propose a unified framework that can address the abovementioned three issues.

3. Proposed Approach

3.1. Overview

Based on the analysis in Section 2, we propose EAM-Net for lung region segmentation of CXR images, which combines the advantages of the encoder-decoder structure and attention mechanism to achieve accurate segmentation of lung regions. Figure 2 shows the architecture of EAM-Net, which consists of an encoder-decoder network, an edge-assisted computing module, and five mask attention modules. The edge-assisted computing module is placed on the side of the encoder, which utilizes the features of shallow encoding layers for edge prediction and uses the edge information to achieve better segmentation. The mask attention module is added after each decoding layer, which employs a mask attention operation to highlight important regions on segmentation features and ignore irrelevant artifacts on them.

3.2. Encoder-Decoder Network

The encoder-decoder network is constructed based on the classic UNet and consists of an encoder and a decoder. In order to extract more representative features, we adopt ResNet [35] as the backbone of the encoder. As shown in Figure 2, the encoder consists of five encoding layers: conv1, conv2_x, conv3_x, conv4_x, and conv5_x. The encoding layers at different stages are used to process feature maps with different resolutions. In the experiments of this paper, we used ResNet18, ResNet34, and ResNet50 as the backbone of the encoder, respectively, to study the effect of different network depths on the segmentation performance. Table 1 describes the structures of ResNet18, ResNet34, and ResNet50. Note that ResNet18 and ResNet34 use the basic block as the residual module, while ResNet50 adopts the bottleneck block instead. Figure 3 presents the specific structures of these two blocks. From Figure 3, the basic block consists of two $3 \times 3$ convolutions and a residual connection, while the bottleneck block consists of two $1 \times 1$ convolutions, a $3 \times 3$ convolution, and a residual connection. Assuming that $x$ is the input of a residual module, the output of the residual module (denoted as $y$) is calculated as follows:

$$y = F(x, \{W_i\}) + x,$$

where $F(x, \{W_i\})$ represents the residual mapping of the stacked convolutional layers in the residual module, and $W_i$ represents the parameter weights of these stacked layers. In EAM-Net, the use of the residual module allows the network to be stacked deeper, which increases the representation ability of the network and alleviates the vanishing gradient problem.
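As a minimal illustration of the residual computation (the output is the residual mapping plus the identity shortcut), the following sketch operates on a plain list of numbers. The names `residual_module` and `toy_branch` are hypothetical, and `toy_branch` is only a stand-in for the stacked convolutional layers.

```python
def residual_module(x, residual_fn):
    """Return F(x) + x, where residual_fn plays the role of the stacked layers."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must keep the shape of x for the identity shortcut")
    return [f + xi for f, xi in zip(fx, x)]


def toy_branch(x):
    # Hypothetical stand-in for the convolutions: scale by 0.5, then ReLU.
    return [max(0.0, 0.5 * v) for v in x]


y = residual_module([1.0, -2.0, 4.0], toy_branch)
# When the branch outputs zeros, the module reduces to the identity mapping.
identity = residual_module([3.0], lambda x: [0.0 for _ in x])
```

The second call shows why such modules are easy to stack: a branch that contributes nothing still passes its input through unchanged.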

The decoder consists of five decoding layers that gradually restore the feature map output by the encoder to the original input size. Note that between two adjacent decoding layers, there is a mask attention module, which is proposed to further optimize the decoding features. Except for the first decoding layer (i.e., the decoding layer connected to conv5_x), each decoding layer takes two inputs, i.e., the feature map output by the previous mask attention module and the feature map output by the encoding layer at the corresponding level. The former is first upsampled by the bilinear interpolation operation, and then concatenated with the latter, followed by a convolution and a ReLU activation function for feature fusion.

3.3. Edge-Assisted Computing Module

In existing encoder-decoder based segmentation methods, the edge and structure information of an image may be lost due to successive downsampling operations, thus leading to inaccurate segmentation edges. Generally, the detailed edge and structure information are included in the feature maps of the early stages of the network. Based on this, we design an edge-assisted computing module, which integrates the feature maps of the shallow encoding layers for edge prediction and transfers meaningful edge information to help better segmentation.

The structure of the edge-assisted computing module is shown in Figure 4. Considering the trade-off between model performance and computation complexity, we adopt the feature maps of the first three encoding layers to compute the edge evidence map. We denote these three feature maps as $F_1 \in \mathbb{R}^{C_1 \times \frac{H}{2} \times \frac{W}{2}}$, $F_2 \in \mathbb{R}^{C_2 \times \frac{H}{4} \times \frac{W}{4}}$, and $F_3 \in \mathbb{R}^{C_3 \times \frac{H}{8} \times \frac{W}{8}}$, respectively, where $C_1$, $C_2$, and $C_3$ are their corresponding numbers of channels, and $H$ and $W$ represent the height and width of the original input image, respectively. From Figure 4, $F_1$, $F_2$, and $F_3$ are processed by three parallel branches to generate $F_1'$, $F_2'$, and $F_3'$. Each branch contains a $1 \times 1$ convolution and a $3 \times 3$ convolution. The $1 \times 1$ convolution is used to adjust the numbers of channels of $F_1$, $F_2$, and $F_3$ to 32, and the $3 \times 3$ convolution is used to perform feature extraction. Afterward, $F_1'$, $F_2'$, and $F_3'$ are upsampled to the resolution of the input image and concatenated together to generate a fused feature map, denoted as $E$:

$$E = \mathrm{Concat}\left(U_{2}(F_1'),\; U_{4}(F_2'),\; U_{8}(F_3')\right),$$

where $U_{2}$, $U_{4}$, and $U_{8}$ represent $2\times$, $4\times$, and $8\times$ bilinear interpolation operations, respectively, and $\mathrm{Concat}(\cdot)$ denotes the concatenation operation. Because $E$ is further used for edge prediction, this fused feature map is also called the edge evidence map. Subsequently, on the one hand, $E$ is processed by a convolution to reduce the number of channels to 1 for edge prediction, which generates the edge prediction map, denoted as $P_e$. On the other hand, as shown in Figure 2, $E$ is concatenated with the output feature map of the last mask attention module for lung region segmentation. The purpose of this feature concatenation operation is to connect the edge prediction task with the lung region segmentation task, and to introduce the meaningful edge information included in $E$ to enhance the lung region segmentation performance.
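To make the data flow of this module easier to follow, the sketch below only traces tensor shapes through the three branches and does no actual convolution. Several values are assumptions rather than details confirmed by the text: the strides of 2, 4, and 8 are those of a standard ResNet for conv1, conv2_x, and conv3_x; the channel counts (64, 256, 512) are those of a standard ResNet50; and the 512 x 512 input size is purely illustrative. The function name `edge_module_shapes` is hypothetical.

```python
def edge_module_shapes(height, width, encoder_channels, branch_channels=32):
    """Trace (channels, height, width) through the three edge branches."""
    strides = (2, 4, 8)  # assumed: standard ResNet strides of the first three stages
    trace = []
    for c_in, s in zip(encoder_channels, strides):
        trace.append({
            "input": (c_in, height // s, width // s),
            # Each branch reduces the channel count to 32 and keeps the resolution.
            "after_branch": (branch_channels, height // s, width // s),
            "upsample_factor": s,
            # Bilinear upsampling back to the input resolution.
            "after_upsample": (branch_channels, height, width),
        })
    # Concatenation along the channel axis gives the edge evidence map.
    fused = (branch_channels * len(trace), height, width)
    # A final convolution maps the evidence map to a 1-channel edge prediction.
    edge_prediction = (1, height, width)
    return trace, fused, edge_prediction


# Hypothetical setting: 512 x 512 input, standard ResNet50 channel counts.
trace, fused, edge = edge_module_shapes(512, 512, (64, 256, 512))
```

Under these assumptions the edge evidence map has 3 x 32 = 96 channels at full input resolution, which is the tensor later concatenated with the decoder output.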

3.4. Mask Attention Module

Existing attention-based methods have achieved promising results in various segmentation tasks. However, they usually rely on global average pooling operation or global max pooling operation to learn spatial attention weights [36, 37]. These global pooling operations roughly merge the features of each channel, which may lead to the loss of some key information. To this end, we design a mask attention module, which employs a supervised learning branch to learn a mask feature map representing lung regions, and adopts this mask feature map as an attention weight to enhance the lung regions on segmentation features.

The structure of the mask attention module is shown in Figure 5. The input to this module is the output feature map of the decoder at the $i$th stage, denoted as $D_i$. For $D_i$, we first construct a supervised learning branch consisting of a $3 \times 3$ convolution and a $1 \times 1$ convolution to obtain a mask feature map $M_i$:

$$M_i = f_{1 \times 1}\left(f_{3 \times 3}(D_i; \theta_1); \theta_2\right),$$

where $f_{3 \times 3}$ and $f_{1 \times 1}$ indicate the $3 \times 3$ and $1 \times 1$ convolutions, respectively, and $\theta_1$ and $\theta_2$ are their corresponding parameters. Herein, the $3 \times 3$ convolution is employed to extract image features and the $1 \times 1$ convolution is used to adjust the number of channels to 1. As shown in Figure 5, on the one hand, $M_i$ is used for lung region segmentation, so it is directly optimized with lung region information during training. On the other hand, $M_i$ is processed by the Sigmoid activation function and used as the attention weight to multiply $D_i$. This multiplication operation is called the mask attention operation. Since $M_i$ is optimized to characterize lung regions, the mask attention operation can enhance the lung region parts of $D_i$ while suppressing irrelevant lesion-related distractions. To ensure the convergence of the model, a residual connection from $D_i$ to the output is added, which generates the final refined feature map, denoted as $R_i$. The abovementioned mask attention operation can be formulated as follows:

$$R_i = D_i + \gamma \cdot \left(\sigma(M_i) \otimes D_i\right),$$

where $\sigma$ represents the Sigmoid activation function, $\otimes$ denotes the element-wise product operation, and $\gamma$ is a learnable weighting factor to adjust the contribution of the attention part in the final refined feature map. $\gamma$ is initialized to 0 and its value can be dynamically optimized during network training.
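The mask attention operation can be sketched in plain Python on nested lists. This is an illustration under stated assumptions, not the released code: it assumes the refined feature is the decoder feature plus a gamma-weighted, Sigmoid-gated copy of itself, with the single-channel mask broadcast over all feature channels. The names `mask_attention` and `gamma`, and the toy values, are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def mask_attention(features, mask_logits, gamma):
    """features: C x H x W nested lists; mask_logits: H x W; returns C x H x W."""
    # Attention weights in (0, 1), shared by every channel.
    weights = [[sigmoid(v) for v in row] for row in mask_logits]
    return [
        [[d + gamma * w * d for d, w in zip(d_row, w_row)]
         for d_row, w_row in zip(channel, weights)]
        for channel in features
    ]


# Hypothetical 1-channel, 2 x 2 decoder feature and mask logits.
features = [[[1.0, 2.0], [3.0, 4.0]]]
mask_logits = [[10.0, -10.0], [0.0, 10.0]]

unchanged = mask_attention(features, mask_logits, gamma=0.0)
refined = mask_attention(features, mask_logits, gamma=1.0)
```

With `gamma=0.0` the module is an identity mapping, which matches the stated initialization: training starts from the plain decoder features and the attention contribution grows only as the weighting factor is learned. With `gamma=1.0`, positions with large mask logits are roughly doubled while positions with strongly negative logits are left almost unchanged.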

3.5. Multiscale Aggregation Loss Function

Existing encoder-decoder based segmentation methods usually design a network to directly learn the mapping from an input image to the segmentation result, and only supervise the prediction at the last stage of the network [38]. Therefore, they cannot make full use of the effective information at the early and intermediate stages of the network. Unlike these methods, EAM-Net adds an edge prediction task as an auxiliary task to improve the segmentation boundaries, and adopts a deep supervision method to supervise the segmentation prediction at each stage of the network. In order to train EAM-Net, we propose a multiscale aggregation loss function, which is defined as follows:

$$L_{total} = L_{edge} + L_{seg},$$

where $L_{edge}$ and $L_{seg}$ denote the edge prediction loss and multiscale segmentation loss, respectively. To deal with the class imbalance problem in medical image segmentation tasks, we jointly use the binary cross-entropy loss, Dice loss, and Jaccard loss. Specifically, $L_{edge}$ and $L_{seg}$ are calculated as follows:

$$L_{edge} = L_{bce}(P_e, G_e) + L_{dice}(P_e, G_e) + L_{jac}(P_e, G_e),$$

$$L_{seg} = \sum_{i=1}^{5} \left[ L_{bce}(P_s^i, G_s^i) + L_{dice}(P_s^i, G_s^i) + L_{jac}(P_s^i, G_s^i) \right],$$

where $L_{bce}$, $L_{dice}$, and $L_{jac}$ represent the binary cross-entropy, Dice, and Jaccard losses, respectively, $P_e$ and $G_e$ are the edge prediction result and edge ground-truth label, respectively, and $P_s^i$ and $G_s^i$ are the segmentation result and segmentation ground-truth label at the $i$th stage of the decoder, respectively. As shown in Figure 2, the feature maps output from the first to fifth stages of the decoder have different resolutions, which are determined by the resolution of the input image.
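A minimal sketch of such a loss in plain Python is given below, operating on flattened lists of predicted probabilities and binary labels. Several points are assumptions rather than details confirmed by the text: the three terms are summed without weights, the same combination is used for the edge task and for every decoder stage, and `EPS` is a hypothetical smoothing constant. All function names are illustrative.

```python
import math

EPS = 1e-7  # hypothetical smoothing constant to avoid log(0) and division by zero


def bce_loss(pred, target):
    """Mean binary cross-entropy between probabilities and 0/1 labels."""
    total = 0.0
    for p, g in zip(pred, target):
        p = min(max(p, EPS), 1.0 - EPS)
        total -= g * math.log(p) + (1 - g) * math.log(1.0 - p)
    return total / len(pred)


def dice_loss(pred, target):
    inter = sum(p * g for p, g in zip(pred, target))
    return 1.0 - (2.0 * inter + EPS) / (sum(pred) + sum(target) + EPS)


def jaccard_loss(pred, target):
    inter = sum(p * g for p, g in zip(pred, target))
    union = sum(pred) + sum(target) - inter
    return 1.0 - (inter + EPS) / (union + EPS)


def combined_loss(pred, target):
    # Assumed unweighted sum of the three terms.
    return bce_loss(pred, target) + dice_loss(pred, target) + jaccard_loss(pred, target)


def multiscale_aggregation_loss(edge_pred, edge_gt, seg_preds, seg_gts):
    """Edge loss plus the segmentation loss accumulated over all decoder stages."""
    edge_term = combined_loss(edge_pred, edge_gt)
    seg_term = sum(combined_loss(p, g) for p, g in zip(seg_preds, seg_gts))
    return edge_term + seg_term


# Toy example with two decoder stages of different (flattened) sizes.
good = multiscale_aggregation_loss(
    [1.0, 0.0], [1, 0],
    [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]],
    [[1, 0, 1], [1, 1, 0, 0]],
)
bad = combined_loss([0.0, 1.0, 0.0], [1, 0, 1])
```

Because each stage is compared with a label at its own resolution, the per-stage lists may have different lengths, as in the toy call above.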

4. Experimental Setup

4.1. Datasets

The performance of EAM-Net was examined on three public datasets, i.e., the JSRT, Shenzhen, and Montgomery datasets. The JSRT dataset [39] is commonly used for lung nodule detection and lung region segmentation, containing 247 CXR images from 14 institutions in Japan and the USA. These images are 12-bit images with a resolution of 2048 × 2048 and include the labels of nodule location (on 154 images) and diagnosis (malignant or benign). Note that the lung region segmentation labels for these images were obtained from [40]. The Shenzhen dataset [41] contains 662 CXR images, of which 326 images are normal cases and 336 images show pulmonary tuberculosis. These images were collected at the Third People’s Hospital of Shenzhen, Guangdong Province, China. They are 8-bit images with tuberculosis annotations and lung region segmentation labels. The Montgomery dataset [41] contains 138 CXR images, including 80 normal and 58 pulmonary tuberculosis images. These images were collected by the Tuberculosis Control Program at the Montgomery County Department of Health and Human Services, Maryland, USA, and annotated with nodule location and lung region segmentation labels.

From the above description, all three datasets contain images affected by lung diseases, which makes lung region segmentation more difficult. Many researchers have tested their methods on these three datasets, so it is convenient to compare EAM-Net with other methods on them. However, because there is no official training and test data separation for these three datasets, existing methods employ different data separation methods to evaluate their performance. In the following experiments, we adopted a ratio of 7:3 to randomly split the training and test sets, and conducted 10 independent experiments on the proposed method. This separation ratio is commonly used in the machine learning community and can reflect the general performance of the model on a dataset. In addition, we averaged the results of the 10 independent experiments as the model performance, which reduces the accidental error that a single random data separation would introduce. The detailed data distributions of the JSRT, Shenzhen, and Montgomery datasets are shown in Table 2.
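The evaluation protocol (repeated random 7:3 splits with averaged results) can be sketched as follows. The seeds, the rounding of the training set size, and the function names are assumptions made for illustration; the exact per-dataset counts are those of Table 2, not of this sketch.

```python
import random


def repeated_splits(n_images, train_ratio=0.7, n_runs=10, base_seed=0):
    """Return n_runs independent (train_indices, test_indices) pairs."""
    splits = []
    n_train = round(n_images * train_ratio)  # assumed rounding rule
    for run in range(n_runs):
        indices = list(range(n_images))
        # A separate seeded generator per run keeps every split reproducible.
        random.Random(base_seed + run).shuffle(indices)
        splits.append((indices[:n_train], indices[n_train:]))
    return splits


def mean(values):
    return sum(values) / len(values)


# Example with the size of the Montgomery dataset (138 images).
splits = repeated_splits(138)
# Placeholder per-run scores; in practice each run would train and test a model.
average_score = mean([0.5 for _ in splits])
```

Each run would train on its own training indices and report PA, Dice, and JA on its own test indices; the reported figure is then the mean over the 10 runs.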

4.2. Algorithms for Comparison

The compared methods of EAM-Net on the JSRT, Shenzhen, and Montgomery datasets can be simply divided into two categories: the baseline method and the state-of-the-art lung region segmentation methods. Res-UNet is chosen as the baseline method, which is a U-shaped network with ResNet as the encoder backbone. EAM-Net also adopts ResNet as the encoder backbone, and additionally proposes two new modules, i.e., the edge-assisted computing module and the mask attention module. The state-of-the-art lung region segmentation methods refer to those that have reported excellent performance on the three selected datasets over the past five years. The details of these methods have been introduced in Section 2. Note that for a fair comparison with these state-of-the-art methods, we reproduced the results of these methods using the same training and test data separation as the proposed EAM-Net. For [9, 20, 21, 24], we directly used their source codes to reproduce the results under our data separation. For those methods without source codes available (i.e., [8, 10, 15, 17, 18, 22, 23, 32]), we also reproduced their results under our data separation according to the technical details of the corresponding papers.

4.3. Implementation Details

During the training phase, all training images of EAM-Net were prepared in the form of edge images and multiscale segmentation images, which have been introduced in Section 3.5. We first resized the original input images to the size of . Note that we obtained the edge mask (i.e., the ground truth of the edge) by applying a contour detection algorithm to the segmentation mask, where the thickness of the edge contour was set to 8. Then, we performed the online data augmentation, including random horizontal flip and random vertical flip, with the aim of preventing the model from overfitting. The Adaptive Moment Estimation (i.e., Adam) algorithm was used as the optimizer. The initial learning rate was set to 0.001, and we reduced the learning rate by 1/10 when the loss does not decrease for three consecutive epochs. The batch size of the training set was set to 8. We loaded the pretrained weights of residual blocks on ImageNet for the encoder of EAM-Net. The maximum number of training epochs was set to 50. In the test phase, for an input CXR image, EAM-Net could generate multiscale segmentation results, and the segmentation result with the size of was used as the final prediction result to evaluate the performance of the model. The source codes of the proposed EAM-Net can be downloaded from the publication page (https://intleo.csu.edu.cn/publication.html) of our research group for reproducing the experimental results of this paper.
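The learning rate rule described above can be sketched without any deep learning library. This sketch assumes that "reduced by 1/10" means multiplying the learning rate by 0.1, and that "does not decrease" is measured against the best loss seen so far; both are interpretations, and the function name `plateau_schedule` and the toy loss values are hypothetical.

```python
def plateau_schedule(losses, initial_lr=0.001, factor=0.1, patience=3):
    """Return the learning rate in effect after each epoch's loss is observed."""
    lr = initial_lr
    best = float("inf")
    stale_epochs = 0
    history = []
    for loss in losses:
        if loss < best:
            best = loss
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                lr *= factor  # assumed: scale the learning rate to 1/10
                stale_epochs = 0
        history.append(lr)
    return history


# Toy loss curve: the last three epochs bring no improvement over 0.79.
history = plateau_schedule([1.0, 0.8, 0.8, 0.81, 0.79, 0.9, 0.9, 0.9])
```

In this toy curve the rate stays at 0.001 for the first seven epochs and drops to 0.0001 after the third consecutive epoch without improvement.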

4.4. Evaluation Metrics

Three commonly used evaluation metrics were adopted to measure the performance of lung region segmentation: pixel accuracy (PA), Dice, and Jaccard (JA). They are defined as follows:

$$PA = \frac{TP + TN}{TP + TN + FP + FN},$$

$$Dice = \frac{2\,|S \cap G|}{|S| + |G|},$$

$$JA = \frac{|S \cap G|}{|S \cup G|},$$

where $TP$, $TN$, $FP$, and $FN$ are the true positives, true negatives, false positives, and false negatives, respectively, and $S$ and $G$ represent a segmentation map and a ground-truth map, respectively. PA is used to calculate the percentage of correctly classified pixels in an image, while Dice and Jaccard are used to measure the similarity between the segmentation map and the ground-truth map.
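For binary masks, the three metrics can be computed directly from the confusion counts, since the overlap of the segmentation and ground-truth maps equals the number of true positives. The sketch below uses flattened 0/1 lists; the function names are illustrative, and returning 1.0 when both masks are empty is an assumed convention rather than something stated in the text.

```python
def confusion_counts(pred, truth):
    """Count TP, TN, FP, FN for two flattened binary masks."""
    tp = sum(1 for p, g in zip(pred, truth) if p == 1 and g == 1)
    tn = sum(1 for p, g in zip(pred, truth) if p == 0 and g == 0)
    fp = sum(1 for p, g in zip(pred, truth) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, truth) if p == 0 and g == 1)
    return tp, tn, fp, fn


def pixel_accuracy(pred, truth):
    tp, tn, fp, fn = confusion_counts(pred, truth)
    return (tp + tn) / (tp + tn + fp + fn)


def dice(pred, truth):
    tp, _, fp, fn = confusion_counts(pred, truth)
    denom = 2 * tp + fp + fn  # |S| + |G|
    return 1.0 if denom == 0 else 2 * tp / denom  # assumed value for empty masks


def jaccard(pred, truth):
    tp, _, fp, fn = confusion_counts(pred, truth)
    denom = tp + fp + fn  # |S union G|
    return 1.0 if denom == 0 else tp / denom  # assumed value for empty masks


# Toy masks with 2 true positives, 2 true negatives, 1 false positive, 1 false negative.
pred = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 0, 1, 1]
scores = (pixel_accuracy(pred, truth), dice(pred, truth), jaccard(pred, truth))
```

For these toy masks, PA and Dice are both 4/6 and JA is 2/4, which also illustrates that JA is never larger than Dice for the same prediction.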

5. Results and Discussion

5.1. Comparison with the Baseline Method

First, we compared EAM-Net with the baseline method (i.e., Res-UNet) on the JSRT, Shenzhen, and Montgomery datasets. Table 3 summarizes the results of EAM-Net and Res-UNet using different encoder backbones (i.e., ResNet18, ResNet34, and ResNet50). From Table 3, no matter which encoder backbone is used, EAM-Net achieves better segmentation performance than Res-UNet in terms of all the three evaluation metrics on the three datasets. Compared with Res-UNet18, EAM-Net18 improves JA by 1.0%, 0.9%, and 1.27% on the JSRT, Shenzhen, and Montgomery datasets, respectively. Compared with Res-UNet50, EAM-Net50 improves Dice by 0.42%, 0.6%, and 0.82%, respectively, and JA by 0.66%, 0.87%, and 0.92%, respectively, on these three datasets. Since EAM-Net is the network that adds the edge-assisted computing module and mask attention module to Res-UNet, the abovementioned results demonstrate the effectiveness of these two modules proposed in this paper.

5.2. Comparison with the State-of-the-Art Lung Region Segmentation Methods

We also compared the performance of EAM-Net with that of the state-of-the-art lung region segmentation methods. Herein, ResNet50 was used as the backbone in EAM-Net. The results of these methods on the JSRT, Shenzhen, and Montgomery datasets are reported in Tables 4–6, respectively.

The results in Table 4 suggest that on the JSRT dataset, EAM-Net achieves the highest overall performance among all the compared methods. Compared with the Bayesian feature pyramid network [21], EAM-Net improves the performance by a large margin, with a 4.98% improvement in terms of Dice and a 9.01% improvement in terms of JA. Compared with the second-best model (i.e., SED [9]), EAM-Net improves Dice by 0.60% and JA by 1.01%.

As can be seen from Table 5, among all the compared methods, EAM-Net obtains the highest performance in terms of all the three evaluation metrics on the Shenzhen dataset. Specifically, in terms of Dice and JA, EAM-Net outperforms the second best method (i.e., Kim and Lee [32]) by 0.68% and 1.01%, respectively. Compared with X-ray-Net [23], EAM-Net improves Dice and JA by 1.21% and 1.94%, respectively.

From Table 6, on the Montgomery dataset, EAM-Net achieves the highest PA of 99.16%, highest Dice of 98.23%, and highest JA of 96.52% among all the compared methods. Compared with X-ray-Net, which performs the second best, EAM-Net improves PA, Dice, and JA by 0.59%, 0.83%, and 1.59%, respectively. Compared with ET-Net [20] that also uses the edge attention mechanism, EAM-Net obtains a 0.65% improvement in terms of PA and a 2.20% improvement in terms of JA.
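As a sanity check on how the reported Dice and JA values relate, the two metrics are linked for a single prediction by JA = Dice / (2 - Dice). The short sketch below applies this identity to the Montgomery figures quoted above. Note that the identity holds per image; averages over a test set need not satisfy it exactly, so this is only an approximate consistency check. The function names are illustrative.

```python
def jaccard_from_dice(dice):
    """Convert a Dice score in [0, 1] to the equivalent Jaccard score."""
    return dice / (2.0 - dice)


def dice_from_jaccard(jaccard):
    """Inverse conversion, from Jaccard to Dice."""
    return 2.0 * jaccard / (1.0 + jaccard)


# Dice of 98.23% reported for EAM-Net on the Montgomery dataset.
implied_ja = jaccard_from_dice(0.9823)
```

The implied JA is about 0.9652, in line with the reported 96.52%, which suggests that the two averaged values are mutually consistent.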

The abovementioned results demonstrate the superiority of EAM-Net over the state-of-the-art lung region segmentation methods.

5.3. Visualization of Segmentation Results

In order to show the segmentation performance more intuitively, we visualized some segmentation results of EAM-Net (using ResNet50 as the backbone) in Figure 6. For each CXR image in Figure 6, we used red and green curves to denote the segmentation prediction and ground truth, respectively, and the corresponding Dice and JA values were also provided.

Figure 6(a) depicts the segmentation results of five CXR images on the JSRT dataset, in which the first four images represent lungs with lung nodules and the last image represents a normal lung. It can be observed that even in the presence of lung nodules, EAM-Net can still obtain accurate lung region segmentation boundaries. Figures 6(b) and 6(c) show the segmentation results on the Shenzhen and Montgomery datasets, respectively. The images in the first four columns represent lungs with tuberculosis, and the images in the last column represent normal lungs. From Figures 6(b) and 6(c), the presence of tuberculosis causes a slight segmentation performance drop, but EAM-Net still achieves promising Dice and JA values. Overall, the abovementioned visualization results suggest that EAM-Net can achieve precise lung region segmentation despite the existence of lesion-related artifacts.

5.4. Ablation Study

In EAM-Net, we proposed two kinds of modules, i.e., the edge-assisted computing module and the mask attention modules. Furthermore, we designed a multiscale aggregation loss to optimize EAM-Net, which consists of an edge prediction loss and a multiscale segmentation loss. To investigate their effectiveness, we conducted ablation experiments on each of them. Table 7 gives the comparison results on the Montgomery dataset. Note that all variants used ResNet50 as the backbone of the encoder for a fair comparison.

5.4.1. Effectiveness of the Edge-Assisted Computing Module

To validate the effectiveness of the edge-assisted computing module, we compared EAM-Net with and without the edge-assisted computing module. For EAM-Net without the edge-assisted computing module, we removed the edge-assisted computing module from the model and directly used the output feature of the last mask attention module to generate the final segmentation result. Meanwhile, the multiscale aggregation loss was degraded to a multiscale segmentation loss. From Table 7, the edge-assisted computing module improves Dice and JA by 0.44% and 0.47%, respectively. The performance gain of EAM-Net can be attributed to the fact that the edge-assisted computing module could transfer meaningful edge information from the edge prediction task to the segmentation task, thereby improving the lung region segmentation performance.
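Training an edge prediction branch requires an edge label for each CXR image. The sketch below shows one common way to derive such a label from a binary lung mask, by keeping the foreground pixels that touch the background. This is an illustrative assumption: the function name `edge_map_from_mask`, the 4-connected neighbourhood, and the one-pixel edge width are hypothetical choices and may differ from how the edge ground truth was actually produced.

```python
import numpy as np

def edge_map_from_mask(mask):
    """Mark foreground pixels that have at least one 4-connected background neighbour."""
    mask = np.asarray(mask, dtype=bool)
    # Pad with background so that foreground pixels on the image border count as edge.
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    # A pixel is interior when its up, down, left, and right neighbours are all foreground.
    interior = (
        padded[:-2, 1:-1]
        & padded[2:, 1:-1]
        & padded[1:-1, :-2]
        & padded[1:-1, 2:]
    )
    return mask & ~interior

# Toy example: a 3x3 square inside a 5x5 image.
mask = np.zeros((5, 5), dtype=bool)
mask[1:4, 1:4] = True
edge = edge_map_from_mask(mask)
print(edge.astype(int))
```

In this toy example, only the centre pixel of the square is interior, so the edge map is the one-pixel ring around it.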

5.4.2. Effectiveness of the Mask Attention Modules

We also compared EAM-Net with and without the mask attention modules. For EAM-Net without the mask attention modules, the five mask attention modules were removed from the model, i.e., every two adjacent decoding layers in the decoder were directly connected. In addition, we only used the decoding feature from the last stage for segmentation; thus, the loss of the model only included the edge prediction loss and the final segmentation loss. The results in Table 7 show that removing the mask attention modules from EAM-Net reduces PA, Dice, and JA by 0.32%, 0.45%, and 0.54%, respectively. These results suggest that the mask attention modules can highlight the lung regions on the segmentation features while suppressing irrelevant lesion-related artifacts, which is beneficial for the lung region segmentation task.
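The intuition behind a mask attention operation can be illustrated with a minimal sketch in which a decoding feature map is reweighted by a soft mask obtained from an intermediate segmentation prediction. The specific form used here (a sigmoid of the mask logits multiplied element-wise onto every channel) is an assumption chosen for illustration, and the names `mask_attention`, `features`, and `mask_logits` are hypothetical; the actual module in EAM-Net may be defined differently.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mask_attention(features, mask_logits):
    """Reweight a (C, H, W) feature map by a soft (H, W) lung mask.

    Assumed form: attention = sigmoid(mask_logits), broadcast over channels.
    """
    features = np.asarray(features, dtype=float)
    attention = sigmoid(np.asarray(mask_logits, dtype=float))
    return features * attention[None, :, :]

# Toy example: the left column is confidently lung, the right column is confidently background.
features = np.ones((2, 2, 2))
mask_logits = np.array([[10.0, -10.0],
                        [10.0, -10.0]])
attended = mask_attention(features, mask_logits)
print(attended[0])
```

Responses at positions predicted as lung are kept almost unchanged, while responses elsewhere (where lesion-related artifacts would otherwise leak into later layers) are driven toward zero.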

5.4.3. Effectiveness of the Multiscale Aggregation Loss

To investigate the effectiveness of the multiscale aggregation loss, we compared it with the single-scale loss, which includes an edge prediction loss and a final segmentation loss. Note that we still kept the mask attention operations in the decoder, but only performed the segmentation prediction at the last stage of the network instead of multiscale segmentation prediction. From Table 7, replacing the multiscale aggregation loss with the single-scale loss reduces Dice and JA by 0.38% and 0.45%, respectively. These results demonstrate the effectiveness of the multiscale aggregation loss. In principle, the proposed multiscale segmentation loss is a form of deep supervision: it adds extra supervision signals (on the intermediate segmentation predictions) to the model, which helps the model learn discriminative features and thereby obtain performance gains.
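The difference between the two loss settings can be sketched as follows. The example assumes binary cross-entropy for every term, equal weights across scales, and ground-truth masks downsampled by simple striding; these choices, along with the names `bce`, `downsample`, and `multiscale_aggregation_loss`, are illustrative assumptions rather than the exact formulation used to train EAM-Net.

```python
import numpy as np

def bce(prob, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and a binary target."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob)))

def downsample(mask, factor):
    """Nearest-neighbour downsampling by striding (assumes divisible sizes)."""
    return mask[::factor, ::factor]

def multiscale_aggregation_loss(edge_prob, edge_gt, seg_probs, seg_gt, edge_weight=1.0):
    """Edge prediction loss plus one segmentation loss per decoder scale."""
    loss = edge_weight * bce(edge_prob, edge_gt)
    for prob in seg_probs:
        factor = seg_gt.shape[0] // prob.shape[0]
        loss += bce(prob, downsample(seg_gt, factor))
    return loss

# Toy example: uninformative predictions (0.5 everywhere) at two scales.
seg_gt = np.zeros((4, 4))
seg_gt[:2, :2] = 1.0
edge_gt = np.zeros((4, 4))
edge_prob = np.full((4, 4), 0.5)
seg_probs = [np.full((4, 4), 0.5), np.full((2, 2), 0.5)]

multi = multiscale_aggregation_loss(edge_prob, edge_gt, seg_probs, seg_gt)
# The single-scale setting keeps only the final (full-resolution) prediction.
single = multiscale_aggregation_loss(edge_prob, edge_gt, seg_probs[:1], seg_gt)
print(multi, single)
```

Under these assumptions, the single-scale loss is recovered by passing only the final prediction, so the ablation amounts to dropping the intermediate terms from the sum.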

6. Conclusion

Lung region segmentation is a fundamental and critical medical image analysis task. Existing lung region segmentation methods still suffer from inaccurate lung region segmentation boundaries, lesion-related artifacts, and limited ability to exploit multiscale information. In this paper, we proposed EAM-Net for lung region segmentation, which included an encoder-decoder structure and two new modules to address the abovementioned issues. First, an edge-assisted computing module was proposed, which integrated the shallow feature maps of the encoder for edge prediction, and transferred the edge prediction information to the segmentation task to optimize the edge parts of the segmentation results. We further designed a mask attention module after each decoding layer, which used the mask attention operation to enhance the lung regions while suppressing lesion-related artifacts. In addition, a multiscale aggregation loss was proposed to exploit multiscale segmentation information and jointly optimize the edge prediction and lung region segmentation tasks.

The results on the JSRT, Shenzhen, and Montgomery datasets confirmed that EAM-Net achieved better lung region segmentation performance than the baseline and state-of-the-art lung region segmentation methods. Our future work includes the following two aspects. On the one hand, we intend to extend EAM-Net to other medical image segmentation tasks. On the other hand, we plan to explore an end-to-end deep learning model for joint lung region segmentation and lung disease classification.

Abbreviations

CXR: Chest X-ray
CNNs: Convolutional neural networks
ReLU: Rectified linear unit
PA: Pixel accuracy

Data Availability

The data that support the findings of this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Key R&D Program of China under Grant 2022YFC2010200, in part by the National Natural Science Foundation of China under Grant 61976225, and in part by the Science and Technology Innovation Program of Hunan Province under Grant 2022RC3013.