Abstract
When a mobile robot operates in an indoor environment, the images it collects are likely to contain many similar scenes, which can cause false-positive judgments in loop closure detection for simultaneous localization and mapping (SLAM). To solve this problem, a loop closure detection algorithm for visual SLAM based on image semantic segmentation is proposed in this paper. Specifically, the current frame is semantically segmented by an optimized DeepLabv3+ model to obtain the semantic labels in the image. The 3D semantic node coordinates corresponding to each semantic label are then extracted by combining mask centroids with image depth information. According to the distribution of the semantic nodes, the DBSCAN density clustering algorithm is adopted to cluster densely distributed semantic nodes, avoiding mismatching caused by the proximity of semantic nodes in the subsequent matching process. Finally, a coarse-to-fine multidimensional similarity comparison is adopted to screen loop closure candidate frames from the key frames and then confirm the real loop closures, completing accurate loop closure detection. Experimental results on public datasets and a self-filmed dataset show that the proposed algorithm adapts well to illumination change, viewpoint deviation, and object movement or absence and can effectively improve the accuracy of loop closure detection in indoor environments.
1. Introduction
Simultaneous localization and mapping (SLAM) is one of the research hotspots in robotics. In visual SLAM, a mobile robot is equipped with a vision sensor that collects information about the surrounding environment for simultaneous localization and mapping. Owing to the light weight, low price, and small size of vision sensors and the intuitive nature of the information they collect, visual SLAM has attracted the attention of a large number of researchers [1] and has become an active research field in recent years. In visual SLAM, errors inherent to the vision sensor cause mistakes in the robot's motion estimation. As the robot's working time and moving distance increase, these accumulated errors may degrade the accuracy of localization and mapping and even cause mapping to fail. Therefore, it is necessary to introduce loop closure detection during robot movement to eliminate the accumulated errors of visual odometry in SLAM [2]. Essentially involving scene identification [3] and image matching [4], loop closure detection compares the current scene with historical scenes to judge whether they are the same. Loop closure detection facilitates the optimization of the robot pose and the global map and improves the accuracy and robustness of SLAM.
At present, loop closure detection methods fall into two main categories: traditional methods and deep-learning-based methods. Traditional methods usually abstract the image, extract descriptors from it, and complete loop closure detection with those descriptors. When the scene contains a large number of repeated objects or illumination changes, traditional methods often suffer from serious descriptor mismatching, which reduces the accuracy of loop closure detection. Deep-learning-based methods extract various image features with a deep neural network and then calculate the similarity between features to judge whether there is a loop closure. Deep learning processes images through neural networks and extracts image features layer by layer, obtaining more useful image information and coping better with environmental changes [5]. Extracting image features with convolutional neural networks has attracted extensive attention and outperforms traditional loop closure detection methods on multiple tasks. As a representative application of deep learning, semantic segmentation has achieved favorable results in autonomous driving, remote sensing, medical diagnosis, and other fields. Semantic segmentation has also been applied to loop closure detection because it is less sensitive to illumination change and object shape and can accurately segment the current environment to obtain precise image information. However, semantic segmentation also imposes demands of its own, such as higher requirements on the segmentation precision of the model and on the similarity comparison method for semantic nodes.
The rest of the paper is organized as follows. Section 2 describes and analyzes traditional and deep-learning-based loop closure detection. Section 3 presents the proposed loop closure detection method based on image semantic segmentation for indoor environments: Section 3.1 describes how 3D semantic nodes are obtained, Section 3.2 describes the similarity comparison of semantic nodes, and Section 3.3 presents the overall algorithm flow. Section 4 details the experimental results: Section 4.1 introduces the datasets used in the experiments, Section 4.2 covers the pretraining of the semantic segmentation model, Section 4.3 reports loop closure judgment in similar environments, and Section 4.4 analyzes the precision-recall curves. Finally, the paper is concluded in Section 5.
2. Related Works
Similar to image retrieval, traditional loop closure detection generally abstracts image information and extracts artificially designed image features [6–8] as image descriptors to represent the image. The commonly used algorithms can be divided into two categories: bag-of-words (BoW) [9] and global descriptor algorithms. BoW extracts feature points from images to produce descriptors, such as SURF [10], SIFT [11], and BRIEF [12], and then generates words and dictionaries (bags of words) through clustering, transforming loop closure detection into a similarity comparison between word vectors. However, in real environments, many identical items may appear across different images. As BoW merely considers whether a certain word vector is present in the image and ignores location, geometry, and other factors, it easily causes mismatching of word vectors [13–15]. Global descriptor algorithms directly compress the image into a descriptor (such as GIST [16] or HOG [17]) from the raw pixels and then match images through the descriptor to complete loop closure detection. Global descriptors perform well when the environment varies but are not satisfactory when occlusion or viewpoint deviation occurs.
Since 2015, scholars have begun to extract image features using deep learning and to apply these features to loop closure detection, with many results achieved. For example, in reference [18], Gao and Zhang proposed an unsupervised loop closure detection algorithm based on a deep neural network, in which a stacked denoising autoencoder (SDA) extracts features from images and a similarity matrix is then used for loop closure detection; the results are better than those of both traditional categories of loop closure detection algorithms. However, as the training set used for the neural network is the same as the test set, its generalization ability is weak and its practicability is limited. In reference [19], Hou et al. proposed a loop closure detection method based on a place convolutional neural network (PlaceCNN), in which environment images are input into the PlaceCNN, the features of each layer are extracted and their effects compared, and the best-performing layer is selected as the final output to complete loop closure detection. This method greatly reduces the influence of illumination on loop closure detection, but its computation time is long and the requirement of timeliness cannot be met. In reference [20], Xia et al. used image features extracted by a principal component analysis network (PCANet) for loop closure detection, and their experimental results are better than those of traditional algorithms. However, since the algorithm imposes strict requirements on the size of the input image, its applicability is low. In reference [21], Liu and Duan proposed a loop closure detection algorithm combining a convolutional neural network and bag-of-words: an AlexNet network extracts high-dimensional features from images, and each channel of the CNN is clustered to generate words, transforming the original global description into a local description and thereby improving noise robustness.
With the development of deep learning, semantic segmentation, image classification, object detection, and other visual fields have also begun to apply deep learning methods, which provides new ideas for loop closure detection. For example, in reference [22], Zhang et al. combined object detection with topological relations: after semantic nodes are selected, the relative spatial position relationships between nodes are used to establish a semantic topology, and the similarity of the semantic topologies of the current frame and a key frame is calculated to determine whether a loop closure is formed. In reference [23], Cao et al. used the center points of object detection boxes and depth information to establish nodes and then connected the nodes to build subgraphs. The node similarity matrix and edge similarity matrix are obtained by comparing the image subgraphs, a global similarity score is calculated by combining the two matrices, and the loop closure detection task is then completed; this method can deal with the influence of illumination change on loop closure detection. Both of the above references adopt object detection to obtain feature points, but a detection box is only a rough outline of an object, and feature points derived from such boxes cannot represent the object accurately. Semantic segmentation, in contrast, can segment the object boundary, and feature points obtained from the boundary represent the object more accurately, thus improving the accuracy of loop closure detection. Common semantic segmentation algorithms include U-Net [24], PSPNet [25], and the DeepLab series [26–28]. Among them, the DeepLab series offers higher segmentation accuracy and better segmentation results and has therefore attracted wide attention. The DeepLab network was developed from the fully convolutional network (FCN) [29]; by introducing a conditional random field (CRF), it makes segmentation boundaries more accurate. In 2017, Chen et al. introduced atrous spatial pyramid pooling (ASPP) into the DeepLab network and proposed the DeepLabv2 network, achieving accurate segmentation of small objects by expanding receptive fields and integrating multiscale context information. Subsequently, Chen et al. replaced the backbone of DeepLabv2 with a residual network and proposed the DeepLabv3 network, which was further updated to the DeepLabv3+ architecture in 2018; its network structure is shown in Figure 1. DeepLabv3+ uses an encoder-decoder structure to extract rich semantic information (encoder) and restore fine object edges (decoder), and within this structure, precision and computation time can be balanced by dilated convolution [30]. In reference [31], Li et al. used semantic segmentation to remove dynamic targets, retained the background part of the image, and then combined CNN features of different dimensions for loop closure detection; this method uses semantic information to eliminate interference and avoids the influence of dynamic targets on loop closure detection. In reference [32], Wu et al. used semantic segmentation to obtain semantic features of images and a CNN to obtain convolutional features; after the convolutional features are calibrated by the semantic features, descriptors (TNNLoST) are constructed using the TNNVLAD method, and loop closure detection is completed according to TNNLoST. In this method, semantic segmentation and TNNVLAD are combined to construct a new descriptor, TNNLoST, which achieves a good loop closure detection effect.

In this paper, the semantic segmentation method is used to construct image semantic nodes with high precision and strong adaptability, and a loop closure detection method suitable for indoor environments is then proposed. The main contributions of the paper are as follows:
(i) A method for constructing 3D semantic nodes of images is proposed. The structure of the DeepLabv3+ network is optimized, the image is semantically segmented using the optimized DeepLabv3+, and the segmentation results are grouped by semantic label to obtain the 2D semantic nodes of each class. Then, 3D semantic nodes are constructed by combining image depth information, and the DBSCAN algorithm [33] is used to cluster densely distributed semantic nodes to avoid mismatching caused by the proximity of nodes in the subsequent node matching process, effectively improving the accuracy of node matching.
(ii) A multidimensional comparison of image similarity is adopted to improve the accuracy of loop closure detection. Firstly, images with high similarity are selected as loop closure candidate frames through a comparison of semantic label similarity. Then, the cosine similarity and Euclidean distance between semantic nodes are compared to obtain point pairs with high similarity, and the real loop closures are screened from the candidate frames, which not only reduces the computational complexity but also effectively improves the accuracy of loop closure detection.
3. The Proposed Algorithm
3.1. 3D Semantic Nodes Acquisition
3.1.1. Semantic Node Coordinates Construction
DeepLabv3+ uses Xception [34] as its backbone and replaces the max pooling layers in Xception with depthwise separable convolutions. Although its segmentation results are more accurate than those of previous semantic segmentation models, the large number of parameters seriously affects the running speed of the algorithm. MobileNetv2 [35] is a lightweight deep neural network proposed by Google in 2018 for embedded mobile devices such as mobile phones; its structure is shown in Figure 2. Both MobileNetv2 and Xception take depthwise separable convolution as their core idea. In this paper, based on DeepLabv3+, MobileNetv2 replaces Xception for image feature extraction, which reduces the number of parameters, improves the network's computing speed, and lowers the computing power required of the device, satisfying the limited computing power and high real-time requirements of mobile robots.
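The paper does not give the optimized network in code form, so the following is only a minimal PyTorch sketch of the backbone swap: torchvision's MobileNetV2 feature extractor feeds a simplified ASPP head, and the decoder is reduced to bilinear upsampling. The ASPP rates, the decoder simplification, and the 40-class output (matching the NYUv2-40 labels used in Section 4.2) are assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v2

class MobileNetV2DeepLab(nn.Module):
    """Sketch: DeepLabv3+-style segmenter with a MobileNetV2 encoder."""
    def __init__(self, num_classes=40):
        super().__init__()
        # MobileNetV2 feature extractor (output stride 32, 1280 channels).
        self.encoder = mobilenet_v2(weights=None).features
        # Simplified ASPP: parallel atrous convolutions at several rates.
        self.aspp = nn.ModuleList([
            nn.Conv2d(1280, 256, 3, padding=r, dilation=r) for r in (1, 6, 12)
        ])
        self.project = nn.Conv2d(256 * 3, 256, 1)
        self.classifier = nn.Conv2d(256, num_classes, 1)

    def forward(self, x):
        size = x.shape[-2:]
        feats = self.encoder(x)
        aspp_out = torch.cat([F.relu(b(feats)) for b in self.aspp], dim=1)
        out = self.classifier(F.relu(self.project(aspp_out)))
        # Decoder simplified to bilinear upsampling back to input size.
        return F.interpolate(out, size=size, mode="bilinear",
                             align_corners=False)

# Usage: per-pixel semantic labels for one RGB frame.
model = MobileNetV2DeepLab(num_classes=40)
logits = model(torch.randn(1, 3, 480, 640))  # -> (1, 40, 480, 640)
labels = logits.argmax(dim=1)                # -> (1, 480, 640)
```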

After the mobile robot collects RGB and depth images of the surrounding environment with a depth camera, semantic segmentation of the environment image is carried out by the optimized DeepLabv3+. The high repeatability of consecutive images can easily lead to misjudgment and redundant computation in loop closure detection, so this paper adopts a time interval constraint, extracting a key frame for loop closure detection at fixed intervals.
It is assumed that the image collection acquired by the mobile robot is $I = \{I_1, I_2, \ldots, I_n\}$ and that for any $I_i \in I$, the semantic segmentation result of image $I_i$ is $S_i$. In the semantic segmentation result $S_i$, every pixel in the image is classified to determine the semantic label it belongs to, and every semantic label is represented by a mask of a different color; that is, in the segmentation result, each item type is represented by a certain color. Then, according to the result of semantic segmentation, the masks are output label by label. As shown in Figure 3, the first picture in the upper left corner is the original image, the second is the result obtained after semantic segmentation, and the rest are mask images output one by one according to semantic labels. Next, the centroid of each mask block is extracted block by block. The centroid coordinates $(x, y)$ are taken as the 2D coordinates of the semantic node corresponding to the mask block, as shown in Figure 4, which presents the 2D semantic node coordinates extracted from the CHAIR label in Figure 3. In the depth image corresponding to the original image, the depth value $z$ of the pixel at the centroid is read and taken as the depth of the current mask block. Combined with the centroid coordinates $(x, y)$, the 3D coordinates $(x, y, z)$ of the semantic node corresponding to the current mask block are obtained. Finally, the semantic nodes corresponding to each semantic label are output in turn and saved by class. By virtue of the above method, accurate 3D coordinates of each item in the semantic segmentation result can be obtained.
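To make the node construction concrete, here is a minimal sketch assuming a per-pixel label map from the segmentation model and a depth image aligned with the RGB frame. `extract_semantic_nodes` is a hypothetical helper; OpenCV's `connectedComponentsWithStats` conveniently yields one centroid per connected mask block.

```python
import cv2
import numpy as np

def extract_semantic_nodes(seg_map, depth_map):
    """Sketch: build {label: [(x, y, z), ...]} from a per-pixel label map
    and an aligned depth image, one node per connected mask block."""
    nodes = {}
    for label in np.unique(seg_map):
        mask = (seg_map == label).astype(np.uint8)
        # One mask block per connected component; centroids come for free.
        n, _, _, centroids = cv2.connectedComponentsWithStats(mask)
        for cx, cy in centroids[1:]:  # component 0 is the background
            x, y = int(round(cx)), int(round(cy))
            z = float(depth_map[y, x])  # depth at the centroid pixel
            nodes.setdefault(int(label), []).append((x, y, z))
    return nodes
```

Note that the centroid of a strongly non-convex mask can fall outside the mask itself; a production implementation would need to handle that case and invalid (zero) depth readings.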


3.1.2. Clustering of Semantic Nodes
In the result of semantic segmentation, there may be some densely distributed mask blocks, as shown in the white circles in Figure 5(a). These dense mask blocks generate correspondingly dense semantic nodes, as shown in Figure 5(b), which easily cause mismatching in the subsequent similarity comparison, as shown in Figure 5(c), thereby affecting the accuracy of the comparison. To solve this problem, this paper uses the DBSCAN clustering algorithm to cluster semantic nodes of the same label whose coordinates are close and whose number is greater than 2; the output cluster center represents the multiple semantic nodes it replaces, avoiding node mismatching caused by node proximity and improving matching accuracy. The semantic node coordinates output after clustering are the final 3D semantic node coordinates.
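A minimal sketch of this step with scikit-learn's DBSCAN follows; the `eps` radius is an illustrative assumption rather than the paper's tuned setting, while `min_samples=3` reflects "number greater than 2".

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_nodes(points, eps=20.0, min_samples=3):
    """Sketch: merge densely packed semantic nodes of one label.
    Each dense cluster is replaced by its center; sparse (noise)
    points are kept unchanged."""
    pts = np.asarray(points, dtype=float)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(pts)
    merged = [pts[labels == k].mean(axis=0) for k in set(labels) if k != -1]
    noise = [p for p, l in zip(pts, labels) if l == -1]
    return np.array(merged + noise)
```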
3.2. Similarity Comparison of Semantic Nodes
The whole process of semantic node similarity comparison is essentially one of layer-by-layer screening. Firstly, semantic label similarity is compared; when it exceeds the threshold, cosine similarity and Euclidean distance are then compared to judge whether the current frame and the key frame form a loop closure.
3.2.1. Similarity Comparison of Semantic Labels
In the semantic segmentation results of Section 3.1, the images carry various semantic labels. If two frames are similar, they must contain numerous identical semantic labels. Therefore, key frames with large differences can be roughly excluded through a similarity comparison of semantic labels. Firstly, all semantic labels contained in the current frame and the key frame are extracted, represented by $L_c$ and $L_k$, respectively, and the number of identical semantic labels in the two frames is counted, represented by $N$, as shown in Figure 6. Then, a coefficient $\alpha$ is set; if formula (1) is satisfied, the similarity of the two frames is judged to be high, and the cosine similarity and Euclidean distance are then compared. Otherwise, the key frame is abandoned, and semantic label comparison continues with the next frame until all loop closure candidate frames are screened out:

$$\frac{N}{\max(|L_c|,\, |L_k|)} \geq \alpha. \tag{1}$$
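As a concrete illustration of this rough screening, a minimal sketch follows, using formula (1) as reconstructed above (taking the larger label set as the denominator is an assumption). `is_candidate` is a hypothetical helper, and `alpha = 0.8` follows the thresholds reported in Section 4.3.

```python
def is_candidate(labels_current, labels_key, alpha=0.8):
    """Sketch of the coarse check in formula (1): the fraction of shared
    semantic labels must reach the coefficient alpha (0.8 in Section 4.3)."""
    lc, lk = set(labels_current), set(labels_key)
    if not lc or not lk:
        return False
    return len(lc & lk) / max(len(lc), len(lk)) >= alpha
```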
3.2.2. Cosine Similarity and Euclidean Distance Comparison
After the comparison of semantic labels, the candidate frame contains many semantic labels identical to those of the current frame. The cosine similarity and Euclidean distance between the two frames are then compared. First, the cosine similarity of each pair of semantic nodes in the current frame and the candidate frame is compared. All semantic nodes in the current frame and the candidate frame are extracted according to their semantic label. It is assumed that the semantic node set of semantic label A in the current frame is $P = \{p_1, p_2, \ldots, p_m\}$ and the semantic node set of semantic label A in the candidate frame is $Q = \{q_1, q_2, \ldots, q_n\}$. A 3D coordinate system is established taking the upper left corner of the image as the origin, the upper edge of the image as the X-axis, the left edge as the Y-axis, and the image depth as the Z-axis. Assume $p_i = (x_1, y_1, z_1)$ and $q_j = (x_2, y_2, z_2)$; then, the cosine is calculated as

$$\cos\theta = \frac{x_1 x_2 + y_1 y_2 + z_1 z_2}{\sqrt{x_1^2 + y_1^2 + z_1^2}\,\sqrt{x_2^2 + y_2^2 + z_2^2}}, \tag{2}$$

where $\cos\theta$ represents the cosine value between the two semantic nodes $p_i$ and $q_j$, through which the cosine similarity of the two nodes can be determined.
As shown in Figure 7, the green points are semantic nodes in the current frame, and the yellow points are semantic nodes in the candidate frame. In Figure 7(a), the first semantic node of the current frame and all nodes in $Q$ are placed in the same 3D coordinate system, and their cosine similarities are compared one by one. After this comparison is completed, the cosine similarity of the next node in $P$ is compared with each node in $Q$ one by one, and the rest are handled in the same manner. Then, the number of point pairs whose cosine similarity is greater than the threshold $\beta$ is counted, and the coordinates of the successfully matched point pairs are recorded. The point pair set corresponding to semantic label A is denoted $C_A$, each element $(p_i, q_j)$ of which records the semantic node coordinates of $p_i$ and $q_j$ matched as a point pair. As shown in Figure 7(b), the pair whose cosine similarity is greater than the threshold $\beta$ is matched as a point pair, while the pair whose cosine similarity is less than $\beta$ is not.
Mismatches may occur during semantic node matching, as shown in Figure 8. In Figure 8(a), it is assumed that P1 and P2 are two semantic nodes in the current frame whose cosine similarity is greater than $\beta$; in Figure 8(b), P3 and P4 are two semantic nodes in the candidate frame whose cosine similarity is greater than $\beta$. P1-P3 and P2-P4 are two point pairs. Compared with Figure 8(a), the viewpoint in Figure 8(b) is slightly deviated to the right. Under cosine similarity comparison alone, as shown in Figure 8(c), P1 and P4 may be mismatched as a point pair: the angle formed by P1 and P4 is significantly smaller than the angle formed by P1 and P2, that is, the cosine similarity of P1 and P4 is greater than $\beta$. Yet, P1 and P4 are far away from each other, so they should not form a similarity point pair. Therefore, relying solely on cosine similarity easily mismatches semantic nodes. To avoid the situation in which the cosine similarity of two semantic nodes is high but their distance is large, this paper introduces the Euclidean distance. The Euclidean distance of the point pairs in set $C_A$ is calculated one by one, and the similar point pairs whose Euclidean distance is less than the threshold $d$ (selected adaptively according to the image resolution) are screened out and counted. Through the Euclidean distance comparison, it can be seen that the mismatched point pair in Figure 7(c) cannot constitute a similarity point pair.
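A minimal sketch of this fine comparison follows, assuming each node is its 3D coordinate vector from the image origin; `match_nodes` is a hypothetical helper, with `beta = 0.999` and a default `d` of 16 pixels following the thresholds in Section 4.3 (2.5% of the larger image dimension for 640-pixel-wide TUM images).

```python
import numpy as np

def match_nodes(P, Q, beta=0.999, d=16.0):
    """Sketch of the fine comparison: two nodes match only if their cosine
    similarity exceeds beta AND their Euclidean distance stays below d."""
    pairs = []
    for p in np.asarray(P, dtype=float):
        for q in np.asarray(Q, dtype=float):
            cos = p @ q / (np.linalg.norm(p) * np.linalg.norm(q))
            if cos > beta and np.linalg.norm(p - q) < d:
                pairs.append((tuple(p), tuple(q)))
    return pairs
```

The distance check is what rejects pairs like P1-P4 above: nearly collinear with the origin (high cosine) but far apart in the image.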
In accordance with the above process, the similarity of the semantic nodes corresponding to each semantic label is compared, and the number of similar point pairs over all semantic labels is counted. Let $M_c$ and $M_k$, respectively, represent the number of all point pairs contained in the current frame and the candidate frame, let $M_s$ denote the number of similar point pairs, and set a coefficient $\gamma$. If formula (3) is satisfied, the two images are judged to constitute a loop closure; otherwise, no loop closure is formed, and the next frames are compared:

$$\frac{2M_s}{M_c + M_k} \geq \gamma. \tag{3}$$
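The following sketch illustrates this final decision with the formula (3) ratio as reconstructed above. The prose leaves $M_c$ and $M_k$ slightly ambiguous; here they are taken as the total node counts of each frame, which is an assumption. `is_loop_closure` is a hypothetical helper building on `match_nodes` from the previous sketch.

```python
def is_loop_closure(nodes_c, nodes_k, gamma=0.75, beta=0.999, d=16.0):
    """Sketch of formula (3): declare a loop closure when similar point
    pairs account for a large enough share of both frames' nodes.
    M_c and M_k are taken as the frames' node counts (an assumption)."""
    m_c = sum(len(v) for v in nodes_c.values())
    m_k = sum(len(v) for v in nodes_k.values())
    m_s = sum(len(match_nodes(nodes_c[l], nodes_k[l], beta, d))
              for l in set(nodes_c) & set(nodes_k))
    return (m_c + m_k) > 0 and 2 * m_s / (m_c + m_k) >= gamma
```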
3.3. Flow of the Proposed Algorithm
The flow chart of the proposed algorithm is shown in Figure 9. Firstly, the optimized DeepLabv3+ model performs semantic segmentation on the input RGB images, and the segmentation results are output one by one according to the semantic labels. Then, the centroid of each mask block in the segmentation results is obtained, and 3D semantic nodes are constructed by combining the depth information from the depth image. Next, the DBSCAN algorithm clusters densely distributed semantic nodes to avoid mismatching during node matching. After the 3D semantic nodes are obtained, the similarity of the semantic labels in the current frame and each key frame is compared. If the semantic labels of the two frames differ too much, their similarity is low and no more detailed comparison is needed; this is the rough comparison stage, which screens out the loop closure candidate frames, that is, images with high similarity to the current frame. The real loop closures are then further screened from the candidate frames by cosine similarity and Euclidean distance comparison; this is the fine comparison stage. Finally, the loop closure detection results are output.
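Tying the previous sketches together, a hypothetical end-to-end check of the current frame against one key frame might look as follows; the helper names refer to the sketches above, and the segmentation model and aligned depth inputs are assumptions.

```python
import torch

def detect_loop(rgb_c, depth_c, rgb_k, depth_k, model):
    """Sketch of the Figure 9 flow: segment -> 3D nodes -> DBSCAN ->
    rough label check -> fine cosine/Euclidean check."""
    with torch.no_grad():
        seg_c = model(rgb_c).argmax(dim=1)[0].cpu().numpy()
        seg_k = model(rgb_k).argmax(dim=1)[0].cpu().numpy()
    nodes_c = {l: cluster_nodes(v)
               for l, v in extract_semantic_nodes(seg_c, depth_c).items()}
    nodes_k = {l: cluster_nodes(v)
               for l, v in extract_semantic_nodes(seg_k, depth_k).items()}
    if not is_candidate(nodes_c.keys(), nodes_k.keys()):  # rough comparison
        return False
    return is_loop_closure(nodes_c, nodes_k)              # fine comparison
```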

The loop closure detection proposed in this paper uses the lightweight backbone MobileNetv2 to replace the Xception network adopted by DeepLabv3+, which effectively shortens the time of semantic segmentation. Furthermore, the coarse-to-fine multidimensional similarity comparison improves the accuracy of loop closure detection. In addition, semantic segmentation classifies every pixel in the image and assigns it a semantic label, accurately separating objects from the image; owing to its low sensitivity to image texture and illumination changes, it adapts better to the environment and applies to a wider range of scenarios.
4. Experimental Results and Analysis
In order to verify the effectiveness of the proposed loop closure detection, the NYUv2 dataset is used to pretrain the optimized DeepLabv3+ semantic segmentation model, and on the public TUM RGB-D and SUN RGB-D datasets and a self-filmed dataset, the proposed algorithm is compared in detail with SDA [18], CNN-W [21], DBoW [36], and Conv3 [37].
4.1. Experimental Datasets
The TUM RGB-D dataset [38] is a public dataset published by the Computer Vision Group of the Technical University of Munich, filmed with Microsoft Kinect cameras in different scenarios. The dataset contains 39 indoor sequences, which can be used to evaluate algorithm performance under different texture, illumination, and structural conditions. Three image sequences of the TUM RGB-D dataset, fr1-room, fr2-large-with-loop, and fr3-long-office-household, are selected for the experiment. The image resolution is 640 × 480, and the bit depth is 16. Figures 10(a)–10(c) are sample images of the three sequences. Among them, the objects in fr1-room are cluttered; the space in fr2-large-with-loop is relatively empty, and the light varies greatly; the viewpoint deviation range and image difference in fr3-long-office-household are both slight.
The SUN RGB-D dataset [39] is a public dataset published by the Vision & Robotics Group of Princeton University. It provides a total of 10,335 RGB-D images captured in environments such as universities, houses, and furniture stores in North America and Asia and can be used for tasks ranging from semantic segmentation to object detection to scene recognition. The images in this dataset have a resolution of 640 × 480, and the bit depth is 16. Figure 10(d) shows sample images of this dataset.
The self-filmed dataset was captured by our research group with a Microsoft Kinect sensor in a laboratory environment, mainly covering indoor illumination changes, object changes, and other situations, for a total of 433 images. The image resolution is 960 × 540, and the bit depth is 16. The shooting environment is an 8 m × 9 m laboratory containing various objects such as tables, chairs, bookshelves, and people, making the environment relatively complicated. Figure 10(e) shows sample images of this dataset.
4.2. Pretraining of Semantic Segmentation Model
Before performing the algorithm comparison experiments, it is necessary to train and test the optimized DeepLabv3+ semantic segmentation model; in this study, this is done on the public RGB-D dataset NYUv2. The original dataset contains 894 label types, which is too many to be meaningful for segmentation and classification tasks. Therefore, this paper adopts the semantic label division and recognition method provided in reference [40] to consolidate the NYUv2 dataset into 40 labels for evaluating classification accuracy. The computer hardware and software configurations used for semantic segmentation are shown in Table 1.
After the optimized DeepLabv3+ semantic segmentation model is trained, the images to be tested in the experiments are input into the model for semantic segmentation, and the segmentation results are used for loop closure judgment.
4.3. Loop Closure Judgment under Similar Environment
In this part, images are selected from fr1-room in the TUM RGB-D dataset, the SUN RGB-D dataset, and the self-filmed dataset to verify the proposed algorithm.
Since traditional loop closure detection algorithms have low accuracy under viewpoint deviation, illumination change, and object change in environmental images, this study groups the tested images into three categories: viewpoint deviation (Figure 11(a)), illumination change (Figure 11(b)), and object movement or absence (Figure 11(c)). For each of the three cases, 50 groups of experimental images are selected, each algorithm judges whether there is a loop closure, and the number of correctly judged groups is counted. In total, 54 groups of images are taken from the fr1-room sequence, 51 groups from the SUN RGB-D dataset, and 45 groups from the self-filmed dataset.
Through analysis of substantial experimental data, the thresholds involved in the algorithm are set as follows: the threshold $\alpha$ of the identical label proportion in the semantic label similarity comparison is set to 0.8, the cosine similarity threshold $\beta$ to 0.999, and the similar point pair coefficient $\gamma$ in the cosine similarity and Euclidean distance comparison to 0.75. The optimal Euclidean distance threshold $d$ is 2.5% of the larger of the image width and height; therefore, in this experiment, $d$ is set to 640 × 2.5% = 16 for the TUM dataset and 960 × 2.5% = 24 for the self-filmed dataset. The proposed algorithm and the SDA, DBoW, CNN-W, and Conv3 algorithms are used to test the three groups of images, and the test results are shown in Table 2.
As can be observed from Table 2, in the case of illumination change, the loop closure detection accuracy of the proposed method is higher than that of SDA and CNN-W by 6%, than DBoW by 10%, and than Conv3 by 14%. Viewpoint deviation changes the angle and distance of the objects in the image, which changes the semantic node coordinates; to address this, the proposed algorithm introduces cosine similarity and Euclidean distance to match semantic nodes, increasing the detection accuracy by 24% over SDA, 36% over DBoW, 30% over CNN-W, and 32% over Conv3. In the case of object movement or absence, the accuracy of the proposed method is 16% higher than SDA, 18% higher than DBoW, 10% higher than CNN-W, and 20% higher than Conv3. These results show that, in similar environments, the detection accuracy of the proposed algorithm is superior to that of the compared loop closure detection algorithms.
4.4. Analysis of Precision-Recall
In this part, fr1-room, fr2-large-with-loop, and fr3-long-office-household from the TUM dataset and the self-filmed dataset are selected for testing. A set of consecutive images is selected from each dataset, and loop closures are set manually. The proposed algorithm is then used to detect the number of loop closures contained in each set of images, and the precision-recall curve is drawn according to the experimental results. Figure 12 shows examples of the images contained in the four datasets. According to the definition of loop closure, loop closures are set for the four datasets, respectively: in each dataset, the two images similar to serial number 1 and serial number 2 are set as true loop closures, and the other two images cannot constitute loop closures.
In comparing the predicted values of the algorithm with the real values, there are four types of results: TP (true positive), FP (false positive), TN (true negative), and FN (false negative), as shown in Table 3.
In loop closure detection, the precision-recall curve is frequently employed as an important indicator of algorithm effectiveness. The precision rate represents the proportion of true loop closures among all predicted loop closures, and the recall rate represents the proportion of correctly predicted loop closures among all true loop closures. Low precision entails considerable misjudgment of loop closures, and low recall entails missing certain loop closures, which may lead to a great difference between the built map and the real environment and ultimately to failure of map construction [41].
Precision rate $P$ and recall rate $R$ are calculated according to formulas (4) and (5); different prediction results are obtained by changing the threshold, and the P-R curve is then drawn:

$$P = \frac{TP}{TP + FP}, \tag{4}$$

$$R = \frac{TP}{TP + FN}. \tag{5}$$
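A minimal sketch of tracing a P-R curve by sweeping the decision threshold follows; `precision_recall` is a hypothetical helper, where each score is the algorithm's similarity for one image pair and the ground truth flags real loop closures.

```python
def precision_recall(scores, truth, thresholds):
    """Sketch: sweep a similarity threshold to trace the P-R curve.
    scores[i] is the similarity for pair i; truth[i] says whether
    pair i is a real loop closure."""
    curve = []
    for t in thresholds:
        tp = sum(s >= t and g for s, g in zip(scores, truth))
        fp = sum(s >= t and not g for s, g in zip(scores, truth))
        fn = sum(s < t and g for s, g in zip(scores, truth))
        p = tp / (tp + fp) if tp + fp else 1.0  # formula (4)
        r = tp / (tp + fn) if tp + fn else 0.0  # formula (5)
        curve.append((p, r))
    return curve
```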
The fr1-room, fr2-large-with-loop, fr3-long-office-household, and self-filmed datasets are used to test the algorithms and draw the P-R curves (Figure 13). The green curve represents the algorithm proposed in this paper, the red curve the DBoW algorithm, the blue curve the SDA algorithm, the black curve the CNN-W algorithm, and the yellow curve the Conv3 algorithm. As can be seen from Figure 13, as the threshold decreases, the number of loop closures detected by the algorithms increases and the recall rate keeps rising, but precision decreases accordingly. The CNN-W algorithm is generally superior to the SDA, DBoW, and Conv3 algorithms; the Conv3 algorithm performs poorly; and the SDA algorithm is superior to the DBoW algorithm in most cases, although, as shown in Figure 13(c), DBoW outperforms SDA when the recall rate is between 0.53 and 0.94. Compared with the SDA, DBoW, CNN-W, and Conv3 algorithms, however, the precision and recall of the proposed algorithm are both significantly higher. Therefore, under different test environments, the proposed algorithm not only ensures the accuracy of loop closure detection but also effectively improves the recall rate, detecting more real loop closures than the other methods.
5. Conclusions
As a key step in visual SLAM, loop closure detection can effectively eliminate the errors accumulated by mobile robots during motion and improve the accuracy of autonomous localization and map construction. In this paper, a loop closure detection method for visual SLAM based on semantic segmentation in indoor environments is proposed. The optimized DeepLabv3+ network performs semantic segmentation on images, 3D semantic nodes are constructed from the segmentation results, and loop closure detection is completed through multidimensional similarity comparison. Experimental results show that by exploiting the rich information in semantic segmentation results, the proposed algorithm effectively improves the accuracy and adaptability of loop closure detection in different indoor environments. However, the proposed algorithm only uses image semantic information to construct 3D semantic nodes, without considering other image information. In future work, we will study how to fuse semantic information with other image information, establish descriptors that represent the image content more accurately, and thereby build a comprehensive image similarity comparison model that further improves the accuracy of similarity comparison and reduces the interference of illumination and object changes on loop closure detection. We will also optimize the algorithm to improve its performance and ensure real-time operation on mobile robots with limited computing resources.
Data Availability
The datasets used or analyzed during the current study are available from the corresponding author on reasonable request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Authors’ Contributions
Jinming Li contributed significantly to the analysis and wrote the manuscript; Peng Wang contributed to the conception of the study; Cui Ni contributed to the data analyses and manuscript preparation; Wen Rong performed the experiments.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant no. 61502277), China Postdoctoral Science Foundation (Grant no. 2021M702030), and Science and Technology Project of Shandong Provincial Department of Transportation (Grant no. 2021B120).