Abstract
Semantic segmentation is widely used in automatic driving systems. To quickly and accurately classify objects in emergency situations, a large number of images need to be processed per second. To make a semantic segmentation model run on hardware with low memory and limited computing capacity, this paper proposes a real-time semantic segmentation network called MRFDCNet. This architecture is based on our proposed multireceptive field dense connection (MRFDC) module. The module uses one depthwise separable convolution branch and two depthwise dilated separable convolution branches with a proposed symmetric sequence of dilation rates to obtain local and contextual information under multiple receptive fields. In addition, we utilize a dense connection to allow local and contextual information to complement each other. We design a guided attention (GA) module to effectively utilize deep and shallow features. The GA module uses high-level semantic context to guide low-level spatial details and fuse both types of feature representations. MRFDCNet has only 1.07 M parameters, and it can achieve 72.8% mIoU on the Cityscapes test set with 74 FPS on one NVIDIA GeForce GTX 1080 Ti GPU. Experiments on the Cityscapes and CamVid test sets show that MRFDCNet achieves a balance between accuracy and inference speed. Code is available at https://github.com/Wsky1836/MRFDCNet.
1. Introduction
Semantic segmentation is one of the fundamental problems in computer vision. The goal of semantic segmentation is to classify each pixel in an image according to its real object label. This technique is widely used in computer vision and facilitates some valuable applications, such as autonomous driving [1–3], intelligent medicine [4, 5], and video surveillance [6–8]. Autonomous driving systems use video sensors to collect image information around a given vehicle, such as roads, buildings, pedestrians, and cars. Such a system divides an input image into a series of disjoint regions to analyze the surroundings of the vehicle.
In recent years, convolutional neural networks have become the preferred solution for image processing tasks. Some convolutional neural networks [9–11] have hundreds or more convolutional layers, which effectively improve the performance of these networks but simultaneously sacrifice their inference speeds. In some convolutional neural network applications, a large number of images need to be processed per second, which requires the network to have a high inference speed. Some lightweight networks [12–14] were designed to achieve reduced computational complexity by decreasing the numbers of layers and channels in the network. Although the inference speed improved, the feature recognition ability of the networks was weakened, leading to performance degradation.
With the rapid growth of real-time interaction, semantic segmentation has begun to shift to real-time semantic segmentation. If an autonomous driving system cannot quickly and accurately classify objects in an emergency, the consequences can be serious. In practical applications, due to costs and other factors, semantic segmentation models need to run on devices with low memory and low power consumption. To meet real-time requirements, SegNet [15] limits the input image resolution, since a low-resolution input image results in lower computational complexity and memory consumption. ENet [16] designs a lightweight network, but its segmentation accuracy is 10% lower than that of other high-performance semantic segmentation models. Although both of the above models achieve increased inference speed and reduced memory usage, they both significantly decrease accuracy. Therefore, achieving both high inference speed and high accuracy remains a challenging task for semantic segmentation models.
In this work, we introduce a novel lightweight network architecture for real-time semantic segmentation. We design a multireceptive field dense connection (MRFDC) module to extract features under different receptive fields. This module contains one depthwise separable convolution and two depthwise dilated separable convolution branches with different dilation rates. The depthwise dilated separable convolution branches with different dilation rates allow the network to learn feature representations with different scales without reducing the size of the feature map. Compared with the increasing sequence of dilation rates proposed in the previous work [17, 18], we propose a symmetric sequence of dilation rates. To make full use of local and contextual information, the module utilizes dense connection to allow different scale features to complement each other. To describe the boundaries of objects, we also design a guided attention (GA) module that uses high-level semantic context to guide low-level spatial details.
The main contributions of this paper are summarized as follows:
(i) We propose a multireceptive field dense connection module to extract local information and contextual information and utilize a dense connection to allow local and contextual information to complement each other.
(ii) To enhance the boundaries of the objects to be described, we design a guided attention module that aggregates low-level spatial details with high-level semantic context.
(iii) We design a new lightweight network named MRFDCNet. In this structure, there are 10 multireceptive field dense connection modules with a symmetric sequence of dilation rates.
2. Related Work
2.1. Semantic Segmentation
To enable semantic segmentation models to run on devices with limited computing capacity and memory, it is necessary to achieve high accuracy with few parameters and to balance accuracy against inference speed. SegNet employs a symmetrical encoder-decoder network, which gradually reduces the size of the feature map in the encoder stage and restores the size of the feature map by using unpooling in the decoder stage. ESPNet [19] utilizes dilated convolution with different dilation rates to capture features under different receptive fields and fuses the features through hierarchical feature fusion operations. ERFNet [20] uses 1D factorized convolution to reduce the number of parameters and utilizes a stepwise upsampling decoder to improve its segmentation performance. CENet [21] constructs an encoder-decoder network, which passes multiscale context information from the encoder to the decoder through skip connections to improve segmentation accuracy. CGNet [22] designs context guided blocks to learn local and contextual information and achieves improved segmentation accuracy by integrating the information of each stage. MDCCNet [23] designs a multiscale deep context convolutional network that combines multiscale features and restores object boundaries through a densely connected CRF. Curv-Net [24] proposes a new U-shaped network, which is composed of an SK module and a multi-Bi-ConvLSTM. The SK module is used to extract multiscale features, and the multi-Bi-ConvLSTM is used to fuse feature information of deep and shallow stages. DABNet [17] uses a bottleneck structure to reduce the number of parameters. BiSeNetV2 [25] constructs a bilateral segmentation network to process spatial details and semantic context and designs a guided aggregation layer to fuse the two types of features.
2.2. Dilated Convolution
The receptive field determines the ability of the network to extract targets of different sizes. The dilated convolution aims to solve the problem of information loss during the process of downsampling feature maps. It can enlarge the receptive field with the same number of parameters as that used by the standard convolution. Atrous spatial pyramid pooling (ASPP) [26] uses dilated convolutions with different dilation rates to collect multiscale information. However, because a dilated convolution samples its input sparsely over a large receptive field, it cannot effectively extract adjacent features, so local dependence is lost. An unreasonable dilation rate also significantly reduces the segmentation accuracy.
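The trade-off described above can be made concrete with a little arithmetic. The following is a minimal sketch, not code from the paper: the helper names `effective_kernel_size` and `weights_per_filter` are hypothetical, and it uses the standard relation that a k × k kernel with dilation rate d spans k + (k − 1)(d − 1) input positions per side while still holding k × k weights.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Side length of the input window covered by a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def weights_per_filter(kernel_size: int) -> int:
    """Weights in one k x k filter; independent of the dilation rate."""
    return kernel_size * kernel_size


if __name__ == "__main__":
    for rate in (1, 2, 4, 8, 16):
        span = effective_kernel_size(3, rate)
        print(f"dilation {rate:2d}: covers {span} x {span}, "
              f"weights = {weights_per_filter(3)}")
```

With a 3 × 3 kernel, the covered window grows with the dilation rate while the weight count stays at 9, which is also why neighbouring pixels are skipped at large rates.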
2.3. Attention Mechanism
Motivated by applications in speech recognition [27] and machine translation [28], attention mechanisms have been widely used in computer vision. The convolutional block attention module [29] is a classic module that combines channel attention and spatial attention. The channel attention mechanisms squeeze and excite channels to give greater weight to important information. The spatial attention module focuses on the spatial locations that have more important information. ACFFNet [30] assigns weights to different channels by supervising the channels of the highest-level features from different receptive fields. BiSeNet [31] uses a global average pooling branch to capture more information and enhance the semantic consistency of pixel-level classification. SA-UNet [32] proposes an attention module along the spatial dimension, which can adaptively enhance important features. AGLNet [18] designs a global attention upsampling module in which, during upsampling at the end of the network, low-level spatial details are used to guide high-level semantic context to enhance the object boundary prediction. DANet [33] utilizes dual attention networks to capture remote dependencies by using nonlocal operations. The remote dependencies can be spatial, channel, or spatiotemporal. To further improve segmentation accuracy, it is very important to combine low-level spatial details with high-level semantic context.
3. Methods
In this section, we start by introducing the multireceptive field dense connection module, which is the crucial component of our model. The MRFDC module contains multiple receptive field branches and allows local information and contextual information to flow between different branches through dense connection. Thereafter, we explain the proposed guided attention module. The GA module utilizes both low-level spatial details and high-level semantic context to enhance the segmentation accuracy of object boundaries. Finally, we present the overall structure of our model. The architecture of MRFDCNet is shown in Figure 1.

3.1. Multireceptive Field Dense Connection Module
We now introduce more details about the multireceptive field dense connection module, as shown in Figure 2. The module uses a bottleneck structure to reduce the number of parameters. The bottleneck structure of the MRFDC module is similar to that of ResNet. This structure reduces the number of channels by convolution and finally restores the number of channels by convolution. The asymmetric structure can effectively reduce the number of parameters. In each MRFDC module, we use a 3 × 3 convolution to reduce the number of channels by half. The 3 × 3 convolution has a larger receptive field, so it can capture complex features. At the end of the MRFDC module, a 1 × 1 convolution recovers the number of channels.

In the semantic segmentation task, the fusion of multireceptive field features is crucial for achieving high segmentation accuracy. Effectively extracting and fusing features from multiple receptive fields remains a research focus. In the DAB module, convolution is used to extract local information, while dilated convolution is used to extract contextual information. The DAB module solves the problem of omitting contextual information in feature extraction. However, sufficient contextual information cannot be extracted with only one dilated convolution branch. Therefore, we design the MRFDC module, which is a novel three-branch structure in which each branch extracts features under a different receptive field. Each branch extracts features by using a set of factorized convolutions (3 × 1 and 1 × 3). A set of factorized convolutions reduces the computational complexity from O(N²) to O(N) relative to a standard N × N convolution. Replacing the standard convolution with depthwise separable convolution can significantly reduce the number of parameters.
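To illustrate the parameter savings mentioned above, the following sketch counts weights for three layer types. It is an illustration under stated assumptions rather than the authors' implementation: biases and normalization layers are ignored, the channel width of 64 is an arbitrary example value, and the function names are hypothetical.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """k x k depthwise convolution followed by a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out


def factorized_depthwise_params(k: int, channels: int) -> int:
    """A k x 1 depthwise convolution followed by a 1 x k depthwise convolution."""
    return 2 * k * channels


if __name__ == "__main__":
    channels = 64  # example width, not taken from the paper
    print("standard 3x3:           ", standard_conv_params(3, channels, channels))
    print("depthwise separable 3x3:", depthwise_separable_params(3, channels, channels))
    print("factorized depthwise:   ", factorized_depthwise_params(3, channels))
```

Per channel, the factorized pair needs 2N weights instead of N², which is the O(N²) to O(N) reduction; for N = 3 this is 6 weights instead of 9.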
The first branch is responsible for extracting local information. The second and third branches aim to extract contextual information under larger receptive fields. Therefore, we set different dilation rates for the dilated convolution branches to extract more contextual information. We use the dense connection to enhance the feature flow between parallel branches. The dense connection combines several parallel branches together to increase the depth. In this design, the output of one branch is added to the inputs of all subsequent branches. The dense connection lets local information flow into the dilated convolution branches, and the contextual information of the two dilated branches can also complement each other, which makes the extracted features more comprehensive.
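The data flow of the dense connection can be sketched in a few lines. This is a toy illustration only: feature maps are plain lists of floats, each branch is an arbitrary callable standing in for a convolution branch, and the function name `dense_parallel_branches` is hypothetical. How the branch outputs are finally fused is not specified in this passage, so the sketch simply returns them.

```python
from typing import Callable, List

Feature = List[float]


def dense_parallel_branches(
    x: Feature, branches: List[Callable[[Feature], Feature]]
) -> List[Feature]:
    """Run branches in order; each branch receives the module input plus
    the outputs of all earlier branches, added element-wise."""
    outputs: List[Feature] = []
    for branch in branches:
        branch_input = list(x)
        for previous in outputs:
            branch_input = [a + b for a, b in zip(branch_input, previous)]
        outputs.append(branch(branch_input))
    return outputs


if __name__ == "__main__":
    # Stand-ins for the local branch and the two dilated branches.
    toy_branches = [
        lambda v: [2.0 * a for a in v],
        lambda v: [a + 1.0 for a in v],
        lambda v: [-a for a in v],
    ]
    print(dense_parallel_branches([1.0], toy_branches))
```

In the toy run, the third branch sees the input plus both earlier outputs, which mirrors how local features reach the dilated branches.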
3.2. Guided Attention Module
The semantic segmentation network is usually U shaped, and the feature map is recovered to the input size in the final stage. Some methods [4, 20] use bilinear interpolation or transposed convolution to upsample feature maps. In a semantic segmentation network, shallow layers are full of spatial details, while deep layers contain more semantic context, and the existing models ignore shallow spatial details. Because of the different sizes and numbers of channels of the feature map, it is difficult to effectively aggregate them. Inspired by BiSeNetV2, we design a guided attention module to guide spatial detail by using semantic context. Under the guidance of semantic context, the network captures features that are more comprehensive and enhances the diversity of the two types of information.
To fuse feature maps with different sizes and numbers of channels, the 1/8 feature map needs to be upsampled to the size of the 1/4 feature map. As shown in Figure 3, the GA module consists of four branches: the 1/4 feature map and the 1/8 feature map each have two branches. In branch 1 of the 1/4 feature map, a 1 × 1 convolution layer is first used to increase the number of channels to that of the 1/8 feature map, and then 3 × 1 and 1 × 3 depthwise separable convolutions are used to extract local information. Branch 2 of the 1/8 feature map uses a 3 × 3 depthwise separable convolution and a 1 × 1 convolution to reduce the number of channels to that of the 1/4 feature map. We implement a skip connection to combine the input with branch 2 of the 1/8 feature map to supplement the initial spatial details in the deep network. The guided 1/4 feature maps are fused together through a concatenation operation.

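In the GA module, branch 1 of the 1/8 feature map, $H_1$, is first used to guide branch 1 of the 1/4 feature map, $L_1$. The feature map after the guide can be defined as follows:
$$G_1 = L_1 \otimes \sigma\left(\mathrm{Up}(H_1)\right),$$
where $\sigma$ is the sigmoid function, $\mathrm{Up}$ is the upsampling function, and $\otimes$ denotes element-wise multiplication. Then, we utilize branch 2 of the 1/8 feature map, $H_2$, to guide branch 2 of the 1/4 feature map, $L_2$:
$$G_2 = L_2 \otimes \sigma\left(\mathrm{Up}(H_2)\right),$$
where $G_2$ is the feature map after the guide. Finally, the 1 × 1 convolution is used to convert the number of channels to N:
$$F = \mathrm{Conv}_{1 \times 1}\left(\mathrm{Concat}(G_1, G_2)\right),$$
where $\mathrm{Conv}_{1 \times 1}$ is a 1 × 1 convolution, $\mathrm{Concat}$ is a concatenation operation, and $F$ is the final feature map of the network.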
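The guiding step can be sketched on a one-dimensional toy example. Several details here are assumptions and not confirmed by the text: the gate is applied by element-wise multiplication, upsampling is nearest-neighbour with factor 2, each feature map is a list of channels holding a 1-D list of floats, and all function names are hypothetical.

```python
import math
from typing import List

Channels = List[List[float]]  # [channel][position]


def sigmoid(value: float) -> float:
    return 1.0 / (1.0 + math.exp(-value))


def upsample2x(channel: List[float]) -> List[float]:
    """Nearest-neighbour upsampling by a factor of 2 (assumed)."""
    return [v for v in channel for _ in range(2)]


def guide(low: Channels, high: Channels) -> Channels:
    """Gate the 1/4-scale features with the upsampled 1/8-scale features.
    Element-wise multiplication is an assumption of this sketch."""
    return [
        [l * sigmoid(h) for l, h in zip(low_ch, upsample2x(high_ch))]
        for low_ch, high_ch in zip(low, high)
    ]


def conv1x1(channels: Channels, weights: List[List[float]]) -> Channels:
    """1 x 1 convolution: a per-position linear mix of channels."""
    length = len(channels[0])
    return [
        [sum(w * ch[p] for w, ch in zip(row, channels)) for p in range(length)]
        for row in weights
    ]


def guided_attention(low1, high1, low2, high2, weights) -> Channels:
    """Guide both branch pairs, concatenate along channels, then mix."""
    return conv1x1(guide(low1, high1) + guide(low2, high2), weights)


if __name__ == "__main__":
    out = guided_attention(
        [[2.0, 4.0, 6.0, 8.0]], [[0.0, 0.0]],
        [[1.0, 1.0, 1.0, 1.0]], [[0.0, 0.0]],
        weights=[[1.0, 2.0]],
    )
    print(out)
```

With zero-valued high-level features, the sigmoid gate is 0.5 everywhere, so the low-level features are simply halved before the 1 × 1 mixing step.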
3.3. Network Architecture Design
The MRFDCNet structure is shown in Table 1. The model is divided into three stages, and each downsampling is defined as one stage. In the first stage, 3 × 3 convolutions are used to extract the initial features of the input image, where the stride of the first convolution is set as 2 to reduce the size of the feature map. The second downsampling consists of a concatenation of a 3 × 3 convolution with stride 2 and a 2 × 2 max pooling. The third downsampling is a 3 × 3 convolution with stride 2. In addition, inspired by ESPNetv2 [34], we implement skip connections to connect the input with the deeper layer, which can compensate for the loss incurred during the process of spatial detail transmission.
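As a quick check on the three downsampling steps, the sketch below tracks the feature-map resolution through the stages. It assumes that every downsampling exactly halves each spatial dimension (that is, suitable padding for the stride-2 convolutions) and uses the 1024 × 512 Cityscapes input resolution from the experiments; the helper name is hypothetical.

```python
from typing import List, Tuple


def stage_resolutions(
    width: int, height: int, num_downsamplings: int = 3, stride: int = 2
) -> List[Tuple[int, int]]:
    """Resolution after each downsampling, assuming exact halving."""
    sizes = []
    for _ in range(num_downsamplings):
        width, height = width // stride, height // stride
        sizes.append((width, height))
    return sizes


if __name__ == "__main__":
    for stage, size in enumerate(stage_resolutions(1024, 512), start=1):
        print(f"after downsampling {stage}: {size[0]} x {size[1]}")
```

The last two entries correspond to the 1/4 and 1/8 feature maps that the GA module fuses.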
There are two types of MRFDC blocks in the network structure: MRFDC block 1 and MRFDC block 2. Each MRFDC module contains two dilated convolution branches with symmetric dilation rates, and the dilation rate is gradually increased to increase the receptive field of the network. The dilation rates in MRFDC block 1 are 2 and 4, and they use small dilation rates to extract features. We deploy the MRFDC block 2 to extract features under large symmetric dilation rates. The dilation rates in MRFDC block 2 are 2, 4, 8, and 16.
4. Experiments
4.1. Implementation Details
We train our model on the Cityscapes [35] and CamVid [36] datasets. The Cityscapes dataset is an urban scene dataset that is focused on streetscape segmentation. It contains 2975 training images, 500 validation images, and 1525 testing images. These images have 2048 × 1024 resolution and 30 semantic categories. CamVid is a smaller road scene dataset containing 367 training images, 101 validation images, and 233 test images. All of its images have a resolution of 960 × 720 and 32 categories. For a fair comparison, we downsample the image resolution of the Cityscapes and CamVid datasets to 1024 × 512 and 480 × 360, respectively, and use 19 and 11 of the categories in these datasets, respectively, to train the model. To make the most of the datasets, we adopt data augmentation strategies, including horizontal flipping and random scaling. The random scaling factors are 0.75, 1, 1.25, 1.5, 1.75, and 2.
All experiments for our model are performed on an RTX 2080 Ti GPU under Ubuntu 16.04. On the Cityscapes dataset, we adopt the SGD optimizer with batch size 8, momentum 0.9, weight decay 1e-4, and initial learning rate 4.5e-2. On the CamVid dataset, we use the Adam [12] optimizer with batch size 8, weight decay 1e-4, and initial learning rate 1e-3. For all datasets, we train MRFDCNet using the cross-entropy loss function and set the maximum number of epochs to 1000. We adopt the “poly” learning rate policy with a power of 0.9. The “poly” learning rate can be defined as follows:
$$lr = lr_{\mathrm{init}} \times \left(1 - \frac{iter}{max\_iter}\right)^{power},$$
where $lr_{\mathrm{init}}$ is the initial learning rate, $iter$ is the current iteration, and $max\_iter$ is the total number of iterations.
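A minimal sketch of the "poly" schedule follows, using the commonly used form in which the initial learning rate is scaled by (1 − iter / max_iter) raised to the power. The function name and the 1000-iteration horizon in the demo are illustrative assumptions; the initial rate 4.5e-2 and the power 0.9 are the Cityscapes settings given above.

```python
def poly_lr(base_lr: float, cur_iter: int, max_iter: int, power: float = 0.9) -> float:
    """Poly decay: base_lr * (1 - cur_iter / max_iter) ** power."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power


if __name__ == "__main__":
    max_iter = 1000  # illustrative horizon, not the paper's iteration count
    for step in (0, 250, 500, 750, 1000):
        print(f"iter {step:4d}: lr = {poly_lr(4.5e-2, step, max_iter):.6f}")
```

The rate starts at the initial value and decays smoothly to zero at the final iteration.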
4.2. Evaluation Results on Cityscapes
We compare the inference speed and segmentation accuracy of the proposed MRFDCNet model with those of several advanced real-time semantic segmentation methods with a single NVIDIA GTX 1080 Ti GPU, as shown in Table 2. FPENet achieves the fastest inference speed and has the lowest segmentation accuracy of all methods. ESPNetv2, BiSeNet, and ICNet use pretraining strategies to improve their mIoU. The segmentation accuracy can be effectively improved through a pretrained backbone. MRFDCNet has 0.18 M fewer parameters than ESPNetv2, but its performance is better than that of ESPNetv2. Our method only has 1.07 M parameters and achieves 72.8% mIoU and 74 FPS inference speed. Overall, MRFDCNet achieves a balance between the number of parameters, segmentation accuracy, and inference speed.
Table 3 summarizes the individual category results on the Cityscapes test set. Our model obtains the best IoU for 18 out of the 19 object categories. Because the network can make full use of the features of multiple receptive fields, it effectively improves segmentation accuracy for the large-size categories truck, bus, and train (Tru, Bus, and Tra). By combining high-level semantic context with low-level spatial details, the IoU of the small-size categories fence, pole, traffic light, and traffic sign (Fen, Pol, TLi, and TSi) is improved by approximately 5%. Figure 4 shows the qualitative results on the Cityscapes validation set.
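For reference, the per-category IoU and the mIoU reported in these tables can be computed from a confusion matrix as sketched below. This uses the standard definition IoU = TP / (TP + FP + FN) averaged over classes; the function names and the 2 × 2 example matrix are made up for illustration and are not the paper's data.

```python
from typing import List


def per_class_iou(confusion: List[List[int]]) -> List[float]:
    """confusion[t][p] counts pixels of true class t predicted as class p."""
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        true_pos = confusion[c][c]
        false_neg = sum(confusion[c]) - true_pos
        false_pos = sum(confusion[r][c] for r in range(num_classes)) - true_pos
        denom = true_pos + false_pos + false_neg
        ious.append(true_pos / denom if denom else float("nan"))
    return ious


def mean_iou(confusion: List[List[int]]) -> float:
    """Average IoU over classes that appear in the ground truth or prediction."""
    valid = [v for v in per_class_iou(confusion) if v == v]  # drop NaN entries
    return sum(valid) / len(valid)


if __name__ == "__main__":
    toy_confusion = [[3, 1], [1, 5]]  # illustrative counts only
    print(per_class_iou(toy_confusion), mean_iou(toy_confusion))
```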

4.3. Evaluation Results on CamVid
We also compare our model with other advanced models on the CamVid test set. Table 4 shows the individual category results of each model on the CamVid test set. Our model achieves the best IoU in 7 of the 11 categories and a higher overall accuracy on the CamVid test set than the other models. The visualization results are shown in Figure 5.

4.4. Ablation Study
We conduct ablation experiments on the Cityscapes test set at 512 × 1024 resolution. The ablation experiments cover the MRFDC module and the GA module. We first study the influence of the dense connection in the MRFDC module. As shown in Table 5, the use of dense connection improves the mIoU by 1.3%, which shows that the dense connection can effectively improve the segmentation accuracy of the network.
In MRFDCNet, the dilation rates of the dilated convolution also significantly affect the network performance. To investigate the effectiveness of the sequence of dilation rates utilized in the network, we compare several sequences of MRFDC block 2 dilation rates. In Table 6, the sequence of small dilation rates {1, 2, 4, 8, 1, 2, 4, 8} obtains 71.2% mIoU on the Cityscapes test set. Sequences with larger receptive fields can effectively improve accuracy: the dilation rate sequence {4, 4, 8, 8, 16, 16, 32, 32} improves the mIoU by 0.6% over the small-rate sequence. The sequence of symmetric dilation rates that we propose achieves the best mIoU. The experiment shows that the network can obtain the best performance by setting reasonable dilation rates.
We test four variants of the GA module. The GA module aims to use high-level semantic context to guide low-level spatial details. As shown in Table 7, MRFDCNet can achieve 71.8% mIoU without using the GA module. The use of skip connections to combine the input with the GA module can recover the initial spatial details. The GA module without combining the initial input obtains 72.2% mIoU. The position of the combination of the input and GA module also affects the accuracy, as shown in Figure 6. Combining the input with the initial position of the 1/8 feature map significantly degrades the network performance. The best performance is achieved when the input is combined with 1/8 feature map branch 2. The experimental results show that the GA module can further improve the segmentation accuracy of the network.

5. Conclusions
In this paper, we propose the MRFDC module to extract features and enhance feature flow through dense connection. The network can achieve the best performance by properly setting the dilation rates of the depthwise dilated separable convolution in the MRFDC module. The GA module improves the network performance by aggregating low-level spatial details and high-level semantic context. The proposed model achieves significant improvements over the current advanced methods on the Cityscapes and CamVid datasets, achieving a balance between segmentation accuracy, the number of parameters, and inference speed. Through ablation studies on the Cityscapes test set, we demonstrate the influences of different components on the segmentation accuracy. We hope that our method will promote further research in the semantic segmentation field. In the future, we would like to further reduce the number of model parameters while achieving the best possible performance.
Data Availability
We used the Cityscapes dataset and CamVid dataset.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
The research was supported by the Natural Science Foundation of China under grant no. 61703046 and the Fundamental Research Funds for the Central Universities under grant no. (2015ZCQ-XX).