Abstract

Salient Object Detection (SOD) simulates human visual perception to locate the most attractive objects in an image. Existing methods based on convolutional neural networks have proven highly effective for SOD. However, in some cases, these methods cannot both accurately detect intact objects and preserve their boundary details. In this paper, we present a Multiresolution Boundary Enhancement Network (MRBENet) that exploits edge features to improve the localization and boundary fineness of salient objects. We incorporate a deeper convolutional layer into the backbone network to extract high-level semantic features and indicate the location of salient objects. Edge features of different resolutions are extracted by a U-shaped network. We design a Feature Fusion Module (FFM) to fuse edge features and salient features, and a Feature Aggregation Module (FAM) based on spatial attention that performs multiscale convolutions to enhance salient features. Together, the FFM and FAM allow the model to accurately locate salient objects and enhance boundary fineness. Extensive experiments on six benchmark datasets demonstrate that the proposed method is highly effective and improves the accuracy of salient object detection compared with state-of-the-art methods.

1. Introduction

The goal of salient object detection (SOD) is to find the most distinct and salient objects in an image. As an important preprocessing step in computer vision, SOD has been widely applied in many fields, such as semantic segmentation [1, 2], video segmentation [3], object recognition [4, 5], and image cropping [6].

Inspired by cognitive studies of visual attention, most early works were based on handcrafted features, such as contrast [7, 8], center prior [9, 10], and so on [11-13]. With the improvement of GPU computing power, deep convolutional neural networks (CNNs) [14] have successfully broken through the limits of traditional methods. These CNN-based SOD methods have achieved great success on widely used benchmarks.

Inspired by the excellent performance of FCN-based models [15] in semantic segmentation, FCNs have also been widely applied to SOD, including several end-to-end deep network structures [16-18]. These end-to-end networks output saliency maps whose basic units are the individual pixels of the image, which allows salient information to be highlighted. As the depth of the convolutional layers increases, the localization of salient objects becomes more accurate. However, the detailed boundaries of salient objects are lost due to pooling operations; see Figure 1.

Boundary information is critical for SOD. Therefore, many SOD works also try to enhance boundary details by different means. Some salient object detection models [19-21] refine high-level features with local information by combining U-Net with a bidirectional or recursive approach to obtain accurate boundary details. Some methods use preprocessing (superpixels) [22] and postprocessing (CRF) [17, 21] to preserve object boundary information. Loss functions have also been used to obtain high-quality salient objects; for example, BASNet [23] uses a hybrid loss to improve boundary accuracy. Other methods [24-26] use edges as supervision for training SOD models, which significantly improves the accuracy of the saliency map.

We explicitly model edge features and use an attention mechanism to fuse edge features and salient features, obtaining salient objects with high-quality edges.

Our contributions can be summarized as follows:
(A) We propose the MRBENet network, which utilizes the FFM to fuse salient edge features, enhancing both the boundaries and the semantic features of salient objects. From the top layer to the bottom layer, the edge details of salient features are optimized sequentially. When extracting edge features, the guidance of high-level semantic features effectively avoids the influence of shallow noise; experimental results show that this guidance filters out noise.
(B) The edge features are first supervised by the salient edge ground truth and then fused with the salient object features through a feature fusion module. The feature aggregation module extends the receptive field through multiscale convolution, which not only effectively aggregates features but also promotes feature fusion and enhances the edge details of salient features.

2. Related Work

Traditional nondeep learning methods predict salient objects mainly from low-level features, such as pixel contrast [12], average image color difference [27], and the phase spectrum of the Fourier transform [28].

Compared with traditional methods, convolutional neural networks (CNNs) perform extraordinarily well. In [15], Long et al. first proposed the fully convolutional network (FCN) for per-pixel prediction. The FCN replaces the last fully connected layers of a convolutional network with convolutional layers. At the end of the network, the feature map is upsampled and then classified pixel by pixel, yielding an end-to-end, image-to-image prediction.

In recent years, most neural network models for salient object detection have extended or improved fully convolutional networks. HED [29] adds a series of side-output layers after the last convolutional layer of each stage of VGGNet [30] and fuses the feature maps output by each layer to obtain the final result map. In DSS [17], Hou et al. added several short connections from deeper side outputs to shallower side outputs, so that higher-level features can help lower-level features locate salient regions, and lower-level features can enrich the details of higher-level features. This combination of higher- and lower-level features makes it possible to detect salient objects accurately. PoolNet [26] makes full use of pooling operations and incorporates three residual blocks. Wu et al. [31] embedded a mutual learning module and an edge module in their model; each module is separately supervised by the salient object and its edge and is trained in an intertwined bottom-up and top-down manner. Wang et al. [32] designed a pyramid attention module for salient object detection and proposed an edge detection module: the former extends the receptive field and provides multiscale cues, while the latter uses explicit edge information to locate objects and enhance the saliency of their edges. Wang et al. [33] proposed an iterative collaborative top-down and bottom-up reasoning network for salient object detection, in which the two processes are executed alternately to complement and enhance fine-grained and high-level saliency estimation. Noori et al. [34] proposed a multiscale attention guidance module and an attention-based multilevel integrator module, which not only extract multiscale features but also assign different weights to multilevel feature maps.

Given the huge body of work in this field, the latest research progress in SOD can be quickly grasped through relevant surveys. In [35], Borji et al. comprehensively reviewed the works and development trends in salient object detection before 2019 and discussed the impact of evaluation metrics and dataset bias on model performance. More recently, Wang et al. conducted a comprehensive survey [36] covering all aspects of SOD: they summarized the existing SOD evaluation datasets and metrics, constructed a new SOD dataset with rich attribute annotations, and, for the first time in this field, analyzed the robustness and portability of deep SOD models.

Recently, RGB-D/RGB-T SOD has become a growing trend, since the accuracy of saliency detection can be improved by jointly learning multimodal information. For example, Ji et al. [37] proposed a depth calibration framework (DCF) learning strategy. DCF generates quality weights for depth images by classifying positive and negative samples, and the depth images are then calibrated based on these weights. Through this strategy, depth information improves the accuracy of saliency estimation, and the influence of unreliable depth information on the saliency results is mitigated.

3. Proposed Model

3.1. Feature Extraction

We use VGG16 as the backbone network. As shown in Figure 2, we remove the fully connected layers of VGG16 and add a set of deeper successive convolutional layers. The deeper convolution block and a CP module form the GGM module, which extracts and enhances high-level semantic features. These high-level features help locate salient objects and their edges. From the backbone network, we obtain six side features, Conv1-2, Conv2-2, Conv3-3, Conv4-3, Conv5-3, and Conv6-3, which can be denoted by the backbone feature set $F = \{f_1, f_2, f_3, f_4, f_5, f_6\}$.
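
A minimal PyTorch sketch of this feature-extraction step is given below, assuming torchvision's VGG16; the layer indices follow torchvision's implementation, and the depth and channel width of the added Conv6 block are assumptions, since the paper does not specify them.

```python
import torch.nn as nn
from torchvision.models import vgg16

class VGGBackbone(nn.Module):
    """Sketch of the backbone: VGG16 without fully connected layers plus an
    added deeper convolution block that produces Conv6-3 (details assumed)."""
    def __init__(self):
        super().__init__()
        feats = vgg16(weights=None).features  # convolutional part only
        # Slices ending at relu1_2, relu2_2, relu3_3, relu4_3, relu5_3.
        self.stages = nn.ModuleList([
            feats[:4], feats[4:9], feats[9:16], feats[16:23], feats[23:30]
        ])
        # Added deeper block; three 3x3 layers with 512 channels are assumptions.
        self.conv6 = nn.Sequential(
            nn.MaxPool2d(2, 2),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        sides = []
        for stage in self.stages:
            x = stage(x)
            sides.append(x)          # Conv1-2, ..., Conv5-3
        sides.append(self.conv6(x))  # Conv6-3
        return sides                 # [f1, ..., f6]
```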

Side path1 is discarded because it is too close to the input image and its receptive field is very small. In addition, encoding shallow features would significantly increase the computational cost [38], while side path1 has little effect on the final result.

Since the side-output features have different resolutions and numbers of channels, we first use a set of CP modules to compress the numbers of side-output feature channels to an identical, smaller number, denoted as k. This reduces subsequent computation and enables elementwise operations later on. The compressed side features can be expressed as follows:

$f_i^{c} = \delta\big(\mathrm{Conv}(f_i; \theta_i)\big), \quad i \in \{2, 3, 4, 5, 6\},$

where $\mathrm{Conv}(\cdot\,; \theta)$ represents a convolutional layer with parameters $\theta$ (it can change the number of channels of the feature map), and $\delta(\cdot)$ represents the ReLU activation function.
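
As an illustration, a CP module as described above could be implemented as a single channel-compressing convolution followed by ReLU; the 3 × 3 kernel size and the default k = 128 are assumptions.

```python
import torch.nn as nn

class CP(nn.Module):
    """Compresses a side-output feature to k channels (Conv + ReLU), matching the
    equation above; the 3x3 kernel size and the default k = 128 are assumptions."""
    def __init__(self, in_channels, k=128):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, k, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.conv(x))
```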

Low-level features contain rich information, but some of it interferes with the SOD task, so it is necessary to highlight the salient information in low-level features. The added GGM has the largest receptive field; therefore, we predict a coarse saliency map $S_c$ from this layer to guide the network to extract useful details from low-level features. The coarse prediction map can roughly locate salient object regions, which receive larger saliency values (weights) than background regions. We upsample the coarse saliency map so that its resolution matches that of each low-level feature layer. To recover the correct details of salient objects from low-level features, we combine the low-level features with the coarse prediction map to enhance the useful details of salient objects. The data flow indicated by the purple dotted arrows in Figure 2 represents the guidance of the coarse saliency map to the low-level features. The guided feature $f_i^{g}$ can be expressed as follows:

$f_i^{g} = f_i^{c} \odot \mathrm{Up}(S_c), \quad i \in \{2, 3, 4, 5\},$

where $\mathrm{Up}(\cdot)$ denotes the bilinear interpolation operation and $\odot$ denotes elementwise multiplication.
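
The guidance step could then be sketched as below, with the caveat that the elementwise multiplication follows the reconstruction above rather than a formula confirmed by the paper.

```python
import torch.nn.functional as F

def guide_with_coarse_map(side_feat, coarse_map):
    """Upsample the coarse saliency map to the side feature's resolution and use it as
    a spatial weight; the elementwise multiplication is an assumption."""
    up = F.interpolate(coarse_map, size=side_feat.shape[2:],
                       mode='bilinear', align_corners=False)
    return side_feat * up
```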

We explicitly model the edge features on side path2. We utilize a U-shaped network to extract edge features at four different resolutions. This network consists of a CP module and six convolution blocks. The compressed feature of side path2 and the upsampled high-level semantic feature are elementwise summed and fed into the network. The CP module compresses the input features into k channels, and each convolution block consists of two convolutional layers that enhance and extract edge features. We also add four edge supervisions to this network. As shown in Figure 2, we obtain the edge features $\{e_i\}$ at four resolutions.

3.2. Feature Fusion Module (FFM)

As shown in Figure 3, the Spatial Attention (SA) module performs max pooling and average pooling on the input features along the channel dimension. These two pooling operations aggregate the channel information of the features and produce two single-channel maps. The two maps are concatenated and passed through a standard convolutional layer to obtain a spatial attention map. Spatial attention weights each spatial location of the feature map; it can therefore find the most important part (the salient object) in the feature map, which makes it well suited to SOD tasks.
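
A minimal sketch of this spatial attention module is shown below; the 7 × 7 kernel and the final sigmoid that maps the output to [0, 1] weights are assumptions.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Channel-wise max pooling and average pooling, concatenation, and a standard
    convolution, as described above; kernel size and sigmoid are assumptions."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        max_map, _ = x.max(dim=1, keepdim=True)  # channel-wise max pooling
        avg_map = x.mean(dim=1, keepdim=True)    # channel-wise average pooling
        attn = self.conv(torch.cat([max_map, avg_map], dim=1))
        return torch.sigmoid(attn)               # single-channel attention map
```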

The corresponding salient feature $s_i$ and edge feature $e_i$ are input into the FFM for fusion. We first enhance the edge information within the salient feature by elementwise multiplication and then use a 3 × 3 convolutional layer to derive the preliminarily fused feature $F_1$, which can be expressed as follows:

$F_1 = \mathrm{Conv}_{3\times3}(s_i \odot e_i).$

Meanwhile, we apply the spatial attention module to the salient feature to obtain an attention map and multiply it with the edge feature to obtain the feature $F_2$, which can be expressed as follows:

$F_2 = \mathrm{SA}(s_i) \odot e_i.$

The final fused feature can be expressed as follows:

$F_{f} = F_1 + F_2.$
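
Combining the three steps, the FFM could be sketched as follows; it reuses the SpatialAttention sketch above, and the final elementwise addition follows the reconstruction rather than a confirmed design choice.

```python
import torch.nn as nn

class FFM(nn.Module):
    """Feature Fusion Module sketch: F1 = Conv3x3(s * e), F2 = SA(s) * e,
    output = F1 + F2. SpatialAttention is the module sketched above."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.sa = SpatialAttention()

    def forward(self, salient_feat, edge_feat):
        f1 = self.conv(salient_feat * edge_feat)  # multiply, then 3x3 convolution
        f2 = self.sa(salient_feat) * edge_feat    # attention-weighted edge feature
        return f1 + f2
```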

3.3. Decoder and Feature Aggregation Module (FAM)

As shown in Figure 4, the FAM utilizes multiscale learning (dilated convolutions with different dilation rates) to expand the receptive field, enhance the boundary details of salient objects, and promote the fusion of salient features and edge features. The input feature is denoted as $X$. We first expand the number of channels by a factor of M with a 1 × 1 convolution and then apply depthwise separable convolutions with different dilation rates for multiscale learning. This process can be expressed as follows:

$X_j = \mathrm{BN}\big(\mathrm{DSConv}_{d_j}(\mathrm{Conv}_{1\times1}(X))\big), \quad d_j \in \{1, 2, 3, 4\},$

where $d_j$ are the dilation rates, taken as 1, 2, 3, and 4 here, $\mathrm{DSConv}$ denotes a depthwise separable convolution, and BN is the abbreviation of batch normalization. We then set up a residual connection, which can be expressed as follows:

$X_r = X + \mathrm{Conv}_{1\times1}\big(\mathrm{Concat}(X_1, X_2, X_3, X_4)\big).$

Then, we apply spatial attention to $X_r$ and obtain the attention map $A$. The final output feature can be expressed as follows:

$X_{out} = X_r \odot A.$
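
Under the reconstructions above, the FAM could be sketched as follows; the expansion factor M = 4 and the concatenation plus 1 × 1 reduction before the residual connection are assumptions.

```python
import torch
import torch.nn as nn

class FAM(nn.Module):
    """Feature Aggregation Module sketch: 1x1 channel expansion, depthwise separable
    convolutions with dilation rates 1-4 plus BN, a residual connection, and spatial
    attention (SpatialAttention from the earlier sketch)."""
    def __init__(self, channels, M=4, rates=(1, 2, 3, 4)):
        super().__init__()
        mid = channels * M
        self.expand = nn.Conv2d(channels, mid, 1)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(mid, mid, 3, padding=r, dilation=r, groups=mid),  # depthwise
                nn.Conv2d(mid, mid, 1),                                     # pointwise
                nn.BatchNorm2d(mid),
            ) for r in rates
        ])
        self.reduce = nn.Conv2d(mid * len(rates), channels, 1)
        self.sa = SpatialAttention()

    def forward(self, x):
        expanded = self.expand(x)
        multi = torch.cat([branch(expanded) for branch in self.branches], dim=1)
        res = x + self.reduce(multi)  # residual connection
        return res * self.sa(res)     # attention-weighted output
```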

We obtain the final saliency prediction map from top to bottom through the FAMs. We add deep supervision (purple arrows in Figure 2) after the four FAM modules to refine the saliency maps by learning the error between each saliency map and the ground truth.

3.4. Loss Function

The total loss function of our network consists of an edge loss and a saliency loss. Let $G^{s}$ and $G^{e}$ denote the supervision from the saliency ground truth and the edge ground truth, and let $P^{s}$ and $P^{e}$ denote the saliency prediction map and the edge prediction map, respectively. The total loss function can be expressed as follows:

$L = \sum_{n} L_{ce}\big(P^{s}_{n}, G^{s}\big) + \sum_{m} L_{ce}\big(P^{e}_{m}, G^{e}\big),$

where the sums run over the deeply supervised saliency and edge predictions, and $L_{ce}$ is the widely used cross-entropy loss:

$L_{ce}(P, G) = -\sum_{i}\big[G_i \log P_i + (1 - G_i)\log(1 - P_i)\big],$

where $i$ represents the pixel index.
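
A minimal sketch of the deeply supervised loss is shown below, assuming the network outputs logits and that the saliency and edge terms are weighted equally.

```python
import torch.nn.functional as F

def total_loss(sal_preds, edge_preds, sal_gt, edge_gt):
    """Sum binary cross-entropy over all deeply supervised saliency and edge
    predictions; equal weighting of the two terms is an assumption."""
    loss = 0.0
    for p in sal_preds:
        p = F.interpolate(p, size=sal_gt.shape[2:], mode='bilinear', align_corners=False)
        loss = loss + F.binary_cross_entropy_with_logits(p, sal_gt)
    for p in edge_preds:
        p = F.interpolate(p, size=edge_gt.shape[2:], mode='bilinear', align_corners=False)
        loss = loss + F.binary_cross_entropy_with_logits(p, edge_gt)
    return loss
```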

4. Experiment

4.1. Datasets

We train our network on the DUTS-TR subset of the DUTS dataset [39] and evaluate it on six standard benchmark datasets: DUT-OMRON [40], DUTS [39], ECSSD [41], PASCAL-S [42], HKU-IS [43], and SOD [44, 45]. DUT-OMRON contains 5168 high-quality images, each with one or more salient objects against a relatively complex background. DUTS is so far the largest available salient object detection dataset; it contains a training subset DUTS-TR with 10553 images and a test subset DUTS-TE with 5019 images. ECSSD contains 1000 semantically meaningful and complex images with various complex scenes. PASCAL-S has 850 images with cluttered backgrounds and complex foregrounds. HKU-IS has 4447 images with high-quality annotations, most of which contain multiple connected or disconnected salient objects. SOD contains 300 high-quality images with complex backgrounds; it was originally designed for image segmentation [44], and pixel-level annotations of salient objects were generated in [45] for salient object detection. Although the SOD dataset has fewer images, it is currently one of the most challenging salient object detection datasets, since most of its images contain multiple salient objects, and some salient objects overlap with the image boundary or have low contrast.

4.2. Experimental Details

We train our network on the DUTS-TR dataset and use VGG16 as the backbone network. All weights of the newly added convolutional layers are randomly initialized from a truncated normal distribution ($\sigma$ = 0.01), and the biases are initialized to 0. The hyperparameters of our network are set as follows: learning rate = 2e-5, weight decay = 0.0005, momentum = 0.9, and batch size = 8. Backpropagation is processed over groups of 50 images. We do not use a validation set during training. The model is trained for 30 epochs, and the learning rate is divided by 10 after 15 epochs. We implement our model in the publicly available PyTorch framework and train and test it on an RTX 2080 Ti GPU (11 GB of memory).
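
For reference, the schedule above could be configured as follows; SGD with momentum is an assumption, since the paper lists momentum and weight decay but does not name the optimizer.

```python
import torch

def build_optimizer(model):
    """Optimizer and learning-rate schedule sketch using the listed hyperparameters."""
    optimizer = torch.optim.SGD(model.parameters(), lr=2e-5,
                                momentum=0.9, weight_decay=0.0005)
    # Divide the learning rate by 10 after 15 of the 30 training epochs.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[15], gamma=0.1)
    return optimizer, scheduler
```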

4.3. Evaluation Metrics

We use several widely adopted metrics, including the F-measure, Mean Absolute Error (MAE) [7], S-measure [46], and the PR curve, to evaluate our model and other advanced models. The PR curve is a standard way to evaluate the predicted saliency probability map; it is obtained from two variables, precision and recall, with recall on the abscissa and precision on the ordinate.

The F-measure is an overall performance measure computed as the weighted harmonic mean of precision and recall. It is expressed as follows:

$F_{\beta} = \dfrac{(1 + \beta^{2}) \times \mathrm{Precision} \times \mathrm{Recall}}{\beta^{2} \times \mathrm{Precision} + \mathrm{Recall}}.$

$\beta^{2}$ is set to 0.3 to weight precision more than recall.

The MAE value represents the average absolute per-pixel difference between the saliency map $S$ and the ground truth map $G$. It is expressed as follows:

$\mathrm{MAE} = \dfrac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \big| S(x, y) - G(x, y) \big|,$

where $W$ and $H$ represent the width and height of the saliency map, respectively.
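
The MAE and the max F-measure defined above can be computed as sketched below; the 255-step threshold sweep used to obtain the maximum F-measure is an assumption about the evaluation protocol.

```python
import numpy as np

def mae(sal_map, gt):
    """Mean absolute per-pixel difference; both maps are expected in [0, 1]."""
    return np.abs(sal_map - gt).mean()

def max_f_measure(sal_map, gt, beta2=0.3, steps=255):
    """Sweep binarization thresholds and return the best F-beta score (beta^2 = 0.3)."""
    gt = gt > 0.5
    best = 0.0
    for t in np.linspace(0, 1, steps):
        pred = sal_map >= t
        tp = np.logical_and(pred, gt).sum()
        precision = tp / (pred.sum() + 1e-8)
        recall = tp / (gt.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, f)
    return best
```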

The S-measure evaluates the structural information of saliency maps and is closer to the human visual system than the F-measure. It is computed as follows:

$S = \alpha \times S_{o} + (1 - \alpha) \times S_{r},$

where $S_{r}$ and $S_{o}$ denote the region-aware and object-aware structural similarity, respectively, and $\alpha$ is set to 0.5 by default.

4.4. Ablation Experiment and Analysis

In this section, we use DUTS-TR as the training set to verify the effectiveness of the key components in the proposed network. We also discuss the effects of different components in the proposed network on different datasets.

The Baseline model is an encoder-decoder structure that integrates multiscale features; it is trained with saliency supervision and the cross-entropy loss. As shown in Table 1, this U-shaped network built on VGG16 already achieves strong performance.

The Base + E model adds edge supervision to side path2 of the Baseline model, and the saliency prediction map is obtained by fusing salient edge features and salient features. As shown in Figure 5(f), the resulting edge features contain a lot of redundant information. Nevertheless, as Table 1 shows, incorporating edge information into the Baseline model improves the evaluation metrics.

The Base + U-E model uses a U-shaped structure to extract edge features and fuses salient features and edge features by elementwise addition. Figure 5(e) shows the edge prediction map obtained at the largest of the four resolutions. Compared with Figure 5(f), Figure 5(e) has clearer object boundaries and less redundant information. Although the edge map obtained by the Base + U-E model is finer than that of the Base + E model, the evaluation metrics in Table 1 show that their SOD performance does not differ significantly. Therefore, the network has to be optimized further.

The Base + U-E-G model adds the GGM module to the Base + U-E model. Although the saliency map produced by the GGM has blurred boundaries (see the coarse prediction map in Figure 2), its spatial location information is the most abundant. The predicted coarse saliency map serves as guidance to enhance the saliency information of the side-output features. By fusing the top-level semantic features, the edge features extracted by the U-shaped network become even finer, as shown in Figure 5(d), and the evaluation metrics also improve significantly.

Our final model adds the FFM and FAM modules to the Base + U-E-G model. As the data in Table 1 show, with the optimization of the FFM and FAM, our model achieves the best performance, which verifies that the proposed FFM and FAM modules promote the fusion of edge features and salient features more effectively and thereby improve performance.

As shown in Table 2, we conduct experiments with different feature fusion methods on the SOD, HKU-IS, and PASCAL-S datasets. Method (a) uses elementwise addition instead of the FFM to fuse salient features and edge features. Method (b) uses elementwise multiplication instead of the FFM. Method (c) concatenates the two feature maps and applies a convolution to fuse them. Method (d) applies a spatial attention module after elementwise addition. Compared with method (a), method (d) improves performance by adding spatial attention. In our model (e), the FFM combines elementwise addition, elementwise multiplication, convolution, and spatial attention. Comparing the metrics across these datasets, our FFM performs best.

4.5. Comparison with State-of-the-Arts

In this section, we compare our proposed MRBENet with 16 state-of-the-art methods, including AFNet [47], BMPM [20], BASNet [23], EGNet [24], PoolNet [26], PiCANet [21], RAS [48], CPD [38], ASNet [49], GateNet [50], ICON [51], CII [52], Auto-MSFNet [53], MINet [54], U2Net [55], and DNA [56]. For a fair comparison, we either use the saliency maps provided by the authors or run their source code to obtain them.

Quantitative evaluation. We compare our MRBENet with other advanced models on six datasets. Table 3 reports the MAE, max F-measure, and S-measure values of the different methods on each dataset, and Figure 6 shows their PR curves. Taken together, the tables and curves show that our method outperforms most competing methods. Our VGG16-based model performs better than some ResNet-based models such as CPD and BASNet, and after replacing the backbone with ResNet-50, the performance of our model improves further.

Visual comparison. Figure 7 shows the visualization results of the different methods. Our method performs well on images with low contrast (rows 1 to 3), complex backgrounds (rows 4 to 6), blurred boundaries (rows 1 to 5), and multiple objects (rows 7 and 8). By making full use of high-level semantic information and edge information, our method can still recognize salient objects in complex scenes.

5. Conclusion

In this paper, we propose MRBENet, a network that enhances the fineness of salient objects through the multiscale fusion of salient edge features. The GGM incorporated into the backbone network extracts high-level semantic features, which help the shallow features locate object boundaries accurately. The FFM fuses edge features and salient object features to enhance the edges of salient objects. Our model performs favorably against state-of-the-art methods on six datasets, and the experimental results show that it improves salient object localization and edge fineness even when images have complex backgrounds and low contrast. In the future, we will continue to explore how to use edge information to improve saliency detection performance.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the SDUST Young Teachers Teaching Talent Training Plan (Grant no. BJRC20180501) and the National Natural Science Foundation of China (Grant no. 61976125). A preprint [57] was previously published only on SSRN. On the basis of that preprint, the authors reorganized and upgraded the model, extended the related work to cover more research, and performed new comparative experiments with some of the latest models; the experimental results demonstrate that the performance of the model is improved.