Abstract
When a mobile robot operates in an indoor environment, the images it collects are likely to contain many similar scenes, which can cause false-positive judgments in loop closure detection for simultaneous localization and mapping (SLAM). To solve this problem, a loop closure detection algorithm for visual SLAM based on image semantic segmentation is proposed in this paper. Specifically, the current frame is semantically segmented by an optimized DeepLabv3+ model to obtain the semantic labels in the image. The 3D semantic node coordinates corresponding to each semantic label are then extracted by combining mask centroids with image depth information. According to the distribution of the semantic nodes, the DBSCAN density clustering algorithm is adopted to cluster densely distributed semantic nodes, avoiding mismatching caused by the proximity of semantic nodes in the subsequent matching process. Finally, a coarse-to-fine multidimensional similarity comparison is adopted to screen loop closure candidate frames from the key frames and then confirm the real loop closures, completing accurate loop closure detection. Experimental results on public datasets and a self-filmed dataset show that the proposed algorithm adapts well to illumination change, viewpoint deviation, and object movement or absence and can effectively improve the accuracy of loop closure detection in indoor environments.
1. Introduction
Simultaneous localization and mapping (SLAM) is one of the research hotspots in robotics. In visual SLAM, a mobile robot is equipped with a vision sensor that collects information about the surrounding environment for simultaneous localization and mapping. Owing to the light weight, low price, and small size of vision sensors and the intuitive nature of the information they collect, visual SLAM has attracted the attention of a large number of researchers [1] and has become an active research field in recent years. In visual SLAM, errors inherent to the vision sensor cause mistakes in the robot's motion estimation. As the robot's working time and moving distance increase, these accumulated errors may degrade the accuracy of localization and mapping and even cause mapping to fail. Therefore, it is necessary to introduce loop closure detection during robot movement to eliminate the accumulated errors of visual odometry in SLAM [2]. Essentially involving scene identification [3] and image matching [4], loop closure detection compares the current scene with historical scenes to judge whether they are the same. Loop closure detection facilitates the optimization of the robot pose and the global map and improves the accuracy and robustness of SLAM.
At present, loop closure detection methods fall into two main categories: traditional methods and deep-learning-based methods. Traditional methods usually abstract the image, extract descriptors from it, and complete loop closure detection with those descriptors. When the scene contains a large number of repeated objects or illumination changes, traditional methods often suffer from serious descriptor mismatching, which reduces the accuracy of loop closure detection. Deep-learning-based methods extract various image features with a deep neural network and then calculate the similarity between features to judge whether there is a loop closure. Deep learning processes images through neural networks and extracts image features layer by layer, obtaining more useful image information and coping better with environmental changes [5]. Extracting image features with convolutional neural networks has attracted extensive attention and outperforms traditional loop closure detection methods on multiple tasks. As a representative application of deep learning, semantic segmentation has achieved favorable results in autonomous driving, remote sensing, medical diagnosis, and other fields. Semantic segmentation has also been applied to loop closure detection because it is less sensitive to illumination change and object shape and can accurately segment the current environment to obtain precise image information. However, semantic segmentation also imposes demands of its own, such as higher requirements on the segmentation precision of the model and on the similarity comparison method for semantic nodes.
The rest of the paper is organized as follows. Section 2 describes and analyzes traditional and deep-learning-based loop closure detection. Section 3 presents the proposed loop closure detection method based on image semantic segmentation for indoor environments: Section 3.1 describes how 3D semantic nodes are obtained, Section 3.2 describes the similarity comparison of semantic nodes, and Section 3.3 presents the overall algorithm flow. Section 4 details the experimental results: Section 4.1 introduces the datasets used in the experiments, Section 4.2 covers the pretraining of the semantic segmentation model, Section 4.3 reports loop closure judgment in similar environments, and Section 4.4 analyzes the precision-recall curves. Finally, the paper is concluded in Section 5.
2. Related Works
Similar to image retrieval, traditional loop closure detection generally abstracts image information and extracts artificially designed image features [6–8] as image descriptors to represent the image. The commonly used algorithms can be divided into two categories: bag-of-words (BoW) [9] and global descriptor algorithms. BoW extracts feature points from images to produce descriptors, such as SURF [10], SIFT [11], and BRIEF [12], and then generates words and dictionaries (bags of words) through clustering, transforming loop closure detection into a similarity comparison between word vectors. However, in real environments, many identical items may appear across different images. As BoW merely considers whether a certain word vector is present in the image and ignores location, geometry, and other factors, it easily causes mismatching of word vectors [13–15]. Global descriptor algorithms directly compress the image into a descriptor (such as GIST [16] or HOG [17]) from the raw pixels and then match images through the descriptor to complete loop closure detection. Global descriptors perform well when the environment varies but are not satisfactory when occlusion or viewpoint deviation occurs.
Since 2015, scholars have begun to extract image features using deep learning and to apply these features to loop closure detection, with many results achieved. For example, in reference [18], Gao and Zhang proposed an unsupervised loop closure detection algorithm based on a deep neural network, in which a stacked denoising autoencoder (SDA) extracts features from images and a similarity matrix is then used for loop closure detection; the results are better than those of both traditional categories of loop closure detection algorithms. However, as the training set used for the neural network is the same as the test set, its generalization ability is weak and its practicability is limited. In reference [19], Hou et al. proposed a loop closure detection method based on a place convolutional neural network (PlaceCNN), in which environment images are input into the PlaceCNN, the features of each layer are extracted and their effects compared, and the best-performing layer is selected as the final output to complete loop closure detection. This method greatly reduces the influence of illumination on loop closure detection, but its computation time is long and the requirement of timeliness cannot be met. In reference [20], Xia et al. used image features extracted by a principal component analysis network (PCANet) for loop closure detection, and their experimental results are better than those of traditional algorithms. However, since the algorithm imposes strict requirements on the size of the input image, its applicability is low. In reference [21], Liu and Duan proposed a loop closure detection algorithm combining a convolutional neural network and bag-of-words: an AlexNet network extracts high-dimensional features from images, and each channel of the CNN is clustered to generate words, transforming the original global description into a local description and thereby improving noise robustness.
With the development of deep learning, semantic segmentation, image classification, object detection, and other visual fields have also begun to apply deep learning methods, which provides new ideas for loop closure detection. For example, in reference [22], Zhang et al. combined object detection with topological relations: after semantic nodes are selected, the relative spatial position relationships between nodes are used to establish a semantic topology, and the similarity of the semantic topologies of the current frame and a key frame is calculated to determine whether a loop closure is formed. In reference [23], Cao et al. used the center points of object detection boxes and depth information to establish nodes and then connected the nodes to build subgraphs. The node similarity matrix and edge similarity matrix are obtained by comparing the image subgraphs, a global similarity score is calculated by combining the two matrices, and the loop closure detection task is then completed; this method can deal with the influence of illumination change on loop closure detection. Both of the above references adopt object detection to obtain feature points, but a detection box is only a rough outline of an object, and feature points derived from such boxes cannot represent the object accurately. Semantic segmentation, in contrast, can segment the object boundary, and feature points obtained from the boundary represent the object more accurately, thus improving the accuracy of loop closure detection. Common semantic segmentation algorithms include U-Net [24], PSPNet [25], and the DeepLab series [26–28]. Among them, the DeepLab series offers higher segmentation accuracy and better segmentation results and has therefore attracted wide attention. The DeepLab network was developed from the fully convolutional network (FCN) [29]; by introducing a conditional random field (CRF), it makes segmentation boundaries more accurate. In 2017, Chen et al. introduced atrous spatial pyramid pooling (ASPP) into the DeepLab network and proposed the DeepLabv2 network, achieving accurate segmentation of small objects by expanding receptive fields and integrating multiscale context information. Subsequently, Chen et al. replaced the backbone of DeepLabv2 with a residual network and proposed the DeepLabv3 network, which was further updated to the DeepLabv3+ architecture in 2018; its network structure is shown in Figure 1. DeepLabv3+ uses an encoder-decoder structure to extract rich semantic information (encoder) and restore fine object edges (decoder), and within this structure, precision and computation time can be balanced by dilated convolution [30]. In reference [31], Li et al. used semantic segmentation to remove dynamic targets, retained the background part of the image, and then combined CNN features of different dimensions for loop closure detection; this method uses semantic information to eliminate interference and avoids the influence of dynamic targets on loop closure detection. In reference [32], Wu et al. used semantic segmentation to obtain semantic features of images and a CNN to obtain convolutional features; after the convolutional features are calibrated by the semantic features, descriptors (TNNLoST) are constructed using the TNNVLAD method, and loop closure detection is completed according to TNNLoST. In this method, semantic segmentation and TNNVLAD are combined to construct a new descriptor, TNNLoST, which achieves a good loop closure detection effect.

In this paper, the semantic segmentation method is used to construct image semantic nodes with high precision and strong adaptability, and a loop closure detection method suitable for indoor environments is then proposed. The main contributions of the paper are as follows:
(i) A method for constructing 3D semantic nodes of images is proposed. The structure of the DeepLabv3+ network is optimized, the image is semantically segmented using the optimized DeepLabv3+, and the segmentation results are grouped by semantic label to obtain the 2D semantic nodes of each class. Then, 3D semantic nodes are constructed by combining image depth information, and the DBSCAN algorithm [33] is used to cluster densely distributed semantic nodes to avoid mismatching caused by the proximity of nodes in the subsequent node matching process, effectively improving the accuracy of node matching.
(ii) A multidimensional comparison of image similarity is adopted to improve the accuracy of loop closure detection. Firstly, images with high similarity are selected as loop closure candidate frames through a comparison of semantic label similarity. Then, the cosine similarity and Euclidean distance between semantic nodes are compared to obtain point pairs with high similarity, and the real loop closures are screened from the candidate frames, which not only reduces the computational complexity but also effectively improves the accuracy of loop closure detection.
3. The Proposed Algorithm
3.1. 3D Semantic Nodes Acquisition
3.1.1. Semantic Node Coordinates Construction
DeepLabv3+ uses Xception [34] as its backbone and replaces the max pooling layers in Xception with depthwise separable convolutions. Although its segmentation results are more accurate than those of previous semantic segmentation models, the large number of parameters seriously affects the running speed of the algorithm. MobileNetv2 [35] is a lightweight deep neural network proposed by Google in 2018 for embedded mobile devices such as mobile phones; its structure is shown in Figure 2. Both MobileNetv2 and Xception take depthwise separable convolution as their core idea. In this paper, based on DeepLabv3+, MobileNetv2 replaces Xception for image feature extraction, which reduces the number of parameters, improves the network's computing speed, and lowers the computing power required of the device, satisfying the limited computing power and high real-time requirements of mobile robots.
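The paper does not give the optimized network in code form, so the following is only a minimal PyTorch sketch of the backbone swap: torchvision's MobileNetV2 feature extractor feeds a simplified ASPP head, and the decoder is reduced to bilinear upsampling. The ASPP rates, the decoder simplification, and the 40-class output (matching the NYUv2-40 labels used in Section 4.2) are assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v2

class MobileNetV2DeepLab(nn.Module):
    """Sketch: DeepLabv3+-style segmenter with a MobileNetV2 encoder."""
    def __init__(self, num_classes=40):
        super().__init__()
        # MobileNetV2 feature extractor (output stride 32, 1280 channels).
        self.encoder = mobilenet_v2(weights=None).features
        # Simplified ASPP: parallel atrous convolutions at several rates.
        self.aspp = nn.ModuleList([
            nn.Conv2d(1280, 256, 3, padding=r, dilation=r) for r in (1, 6, 12)
        ])
        self.project = nn.Conv2d(256 * 3, 256, 1)
        self.classifier = nn.Conv2d(256, num_classes, 1)

    def forward(self, x):
        size = x.shape[-2:]
        feats = self.encoder(x)
        aspp_out = torch.cat([F.relu(b(feats)) for b in self.aspp], dim=1)
        out = self.classifier(F.relu(self.project(aspp_out)))
        # Decoder simplified to bilinear upsampling back to input size.
        return F.interpolate(out, size=size, mode="bilinear",
                             align_corners=False)

# Usage: per-pixel semantic labels for one RGB frame.
model = MobileNetV2DeepLab(num_classes=40)
logits = model(torch.randn(1, 3, 480, 640))  # -> (1, 40, 480, 640)
labels = logits.argmax(dim=1)                # -> (1, 480, 640)
```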

After the mobile robot collects RGB and depth images of the surrounding environment with a depth camera, semantic segmentation of the environment image is carried out by the optimized DeepLabv3+. The high repeatability of consecutive images can easily lead to misjudgment and redundant computation in loop closure detection, so this paper adopts a time interval constraint, extracting a key frame for loop closure detection at fixed intervals.
It is assumed that the image collection acquired by the mobile robot is $I = \{I_1, I_2, \ldots, I_n\}$ and that for any $I_i \in I$, the semantic segmentation result of image $I_i$ is $S_i$. In the semantic segmentation result $S_i$, every pixel in the image is classified to determine the semantic label it belongs to, and every semantic label is represented by a mask of a different color; that is, in the segmentation result, each item type is represented by a certain color. Then, according to the result of semantic segmentation, the masks are output label by label. As shown in Figure 3, the first picture in the upper left corner is the original image, the second is the result obtained after semantic segmentation, and the rest are mask images output one by one according to semantic labels. Next, the centroid of each mask block is extracted block by block. The centroid coordinates $(x, y)$ are taken as the 2D coordinates of the semantic node corresponding to the mask block, as shown in Figure 4, which presents the 2D semantic node coordinates extracted from the CHAIR label in Figure 3. In the depth image corresponding to the original image, the depth value $z$ of the pixel at the centroid is read and taken as the depth of the current mask block. Combined with the centroid coordinates $(x, y)$, the 3D coordinates $(x, y, z)$ of the semantic node corresponding to the current mask block are obtained. Finally, the semantic nodes corresponding to each semantic label are output in turn and saved by class. By virtue of the above method, accurate 3D coordinates of each item in the semantic segmentation result can be obtained.
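To make the node construction concrete, here is a minimal sketch assuming a per-pixel label map from the segmentation model and a depth image aligned with the RGB frame. `extract_semantic_nodes` is a hypothetical helper; OpenCV's `connectedComponentsWithStats` conveniently yields one centroid per connected mask block.

```python
import cv2
import numpy as np

def extract_semantic_nodes(seg_map, depth_map):
    """Sketch: build {label: [(x, y, z), ...]} from a per-pixel label map
    and an aligned depth image, one node per connected mask block."""
    nodes = {}
    for label in np.unique(seg_map):
        mask = (seg_map == label).astype(np.uint8)
        # One mask block per connected component; centroids come for free.
        n, _, _, centroids = cv2.connectedComponentsWithStats(mask)
        for cx, cy in centroids[1:]:  # component 0 is the background
            x, y = int(round(cx)), int(round(cy))
            z = float(depth_map[y, x])  # depth at the centroid pixel
            nodes.setdefault(int(label), []).append((x, y, z))
    return nodes
```

Note that the centroid of a strongly non-convex mask can fall outside the mask itself; a production implementation would need to handle that case and invalid (zero) depth readings.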


3.1.2. Clustering of Semantic Nodes
In the result of semantic segmentation, there may be some densely distributed mask blocks, as shown in the white circles in Figure 5(a). These dense mask blocks generate correspondingly dense semantic nodes, as shown in Figure 5(b), which easily cause mismatching in the subsequent similarity comparison, as shown in Figure 5(c), thereby affecting the accuracy of the comparison. To solve this problem, this paper uses the DBSCAN clustering algorithm to cluster semantic nodes of the same label whose coordinates are close and whose number is greater than 2; the output cluster center represents the multiple semantic nodes it replaces, avoiding node mismatching caused by node proximity and improving matching accuracy. The semantic node coordinates output after clustering are the final 3D semantic node coordinates.
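A minimal sketch of this step with scikit-learn's DBSCAN follows; the `eps` radius is an illustrative assumption rather than the paper's tuned setting, while `min_samples=3` reflects "number greater than 2".

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_nodes(points, eps=20.0, min_samples=3):
    """Sketch: merge densely packed semantic nodes of one label.
    Each dense cluster is replaced by its center; sparse (noise)
    points are kept unchanged."""
    pts = np.asarray(points, dtype=float)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(pts)
    merged = [pts[labels == k].mean(axis=0) for k in set(labels) if k != -1]
    noise = [p for p, l in zip(pts, labels) if l == -1]
    return np.array(merged + noise)
```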
3.2. Similarity Comparison of Semantic Nodes
The whole process of semantic node similarity comparison is essentially one of layer-by-layer screening. Firstly, semantic label similarity is compared; when it exceeds the threshold, cosine similarity and Euclidean distance are then compared to judge whether the current frame and the key frame form a loop closure.
3.2.1. Similarity Comparison of Semantic Labels
In the semantic segmentation results of Section 3.1, the images carry various semantic labels. If two frames are similar, they must contain numerous identical semantic labels. Therefore, key frames with large differences can be roughly excluded through a similarity comparison of semantic labels. Firstly, all semantic labels contained in the current frame and the key frame are extracted, represented by $L_c$ and $L_k$, respectively, and the number of identical semantic labels in the two frames is counted, represented by $N$, as shown in Figure 6. Then, a coefficient $\alpha$ is set; if formula (1) is satisfied, the similarity of the two frames is judged to be high, and the cosine similarity and Euclidean distance are then compared. Otherwise, the key frame is abandoned, and semantic label comparison continues with the next frame until all loop closure candidate frames are screened out:

$$\frac{N}{\max(|L_c|,\, |L_k|)} \geq \alpha. \tag{1}$$
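As a concrete illustration of this rough screening, a minimal sketch follows, using formula (1) as reconstructed above (taking the larger label set as the denominator is an assumption). `is_candidate` is a hypothetical helper, and `alpha = 0.8` follows the thresholds reported in Section 4.3.

```python
def is_candidate(labels_current, labels_key, alpha=0.8):
    """Sketch of the coarse check in formula (1): the fraction of shared
    semantic labels must reach the coefficient alpha (0.8 in Section 4.3)."""
    lc, lk = set(labels_current), set(labels_key)
    if not lc or not lk:
        return False
    return len(lc & lk) / max(len(lc), len(lk)) >= alpha
```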
3.2.2. Cosine Similarity and Euclidean Distance Comparison
After the comparison of semantic labels, the candidate frame contains many semantic labels identical to those of the current frame. The cosine similarity and Euclidean distance between the two frames are then compared. First, the cosine similarity of each pair of semantic nodes in the current frame and the candidate frame is compared. All semantic nodes in the current frame and the candidate frame are extracted according to their semantic label. It is assumed that the semantic node set of semantic label A in the current frame is $P = \{p_1, p_2, \ldots, p_m\}$ and the semantic node set of semantic label A in the candidate frame is $Q = \{q_1, q_2, \ldots, q_n\}$. A 3D coordinate system is established taking the upper left corner of the image as the origin, the upper edge of the image as the X-axis, the left edge as the Y-axis, and the image depth as the Z-axis. Assume $p_i = (x_1, y_1, z_1)$ and $q_j = (x_2, y_2, z_2)$; then, the cosine is calculated as

$$\cos\theta = \frac{x_1 x_2 + y_1 y_2 + z_1 z_2}{\sqrt{x_1^2 + y_1^2 + z_1^2}\,\sqrt{x_2^2 + y_2^2 + z_2^2}}, \tag{2}$$

where $\cos\theta$ represents the cosine value between the two semantic nodes $p_i$ and $q_j$, through which the cosine similarity of the two nodes can be determined.
As shown in Figure 7, the green points are semantic nodes in the current frame, and the yellow points are semantic nodes in the candidate frame. In Figure 7(a), the first semantic node of the current frame and all nodes in $Q$ are placed in the same 3D coordinate system, and their cosine similarities are compared one by one. After this comparison is completed, the cosine similarity of the next node in $P$ is compared with each node in $Q$ one by one, and the rest are handled in the same manner. Then, the number of point pairs whose cosine similarity is greater than the threshold $\beta$ is counted, and the coordinates of the successfully matched point pairs are recorded. The point pair set corresponding to semantic label A is denoted $C_A$, each element $(p_i, q_j)$ of which records the semantic node coordinates of $p_i$ and $q_j$ matched as a point pair. As shown in Figure 7(b), the pair whose cosine similarity is greater than the threshold $\beta$ is matched as a point pair, while the pair whose cosine similarity is less than $\beta$ is not.
Mismatches may occur during semantic node matching, as shown in Figure 8. In Figure 8(a), it is assumed that P1 and P2 are two semantic nodes in the current frame whose cosine similarity is greater than $\beta$; in Figure 8(b), P3 and P4 are two semantic nodes in the candidate frame whose cosine similarity is greater than $\beta$. P1-P3 and P2-P4 are two point pairs. Compared with Figure 8(a), the viewpoint in Figure 8(b) is slightly deviated to the right. Under cosine similarity comparison alone, as shown in Figure 8(c), P1 and P4 may be mismatched as a point pair: the angle formed by P1 and P4 is significantly smaller than the angle formed by P1 and P2, that is, the cosine similarity of P1 and P4 is greater than $\beta$. Yet, P1 and P4 are far away from each other, so they should not form a similarity point pair. Therefore, relying solely on cosine similarity easily mismatches semantic nodes. To avoid the situation in which the cosine similarity of two semantic nodes is high but their distance is large, this paper introduces the Euclidean distance. The Euclidean distance of the point pairs in set $C_A$ is calculated one by one, and the similar point pairs whose Euclidean distance is less than the threshold $d$ (selected adaptively according to the image resolution) are screened out and counted. Through the Euclidean distance comparison, it can be seen that the mismatched point pair in Figure 7(c) cannot constitute a similarity point pair.
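A minimal sketch of this fine comparison follows, assuming each node is its 3D coordinate vector from the image origin; `match_nodes` is a hypothetical helper, with `beta = 0.999` and a default `d` of 16 pixels following the thresholds in Section 4.3 (2.5% of the larger image dimension for 640-pixel-wide TUM images).

```python
import numpy as np

def match_nodes(P, Q, beta=0.999, d=16.0):
    """Sketch of the fine comparison: two nodes match only if their cosine
    similarity exceeds beta AND their Euclidean distance stays below d."""
    pairs = []
    for p in np.asarray(P, dtype=float):
        for q in np.asarray(Q, dtype=float):
            cos = p @ q / (np.linalg.norm(p) * np.linalg.norm(q))
            if cos > beta and np.linalg.norm(p - q) < d:
                pairs.append((tuple(p), tuple(q)))
    return pairs
```

The distance check is what rejects pairs like P1-P4 above: nearly collinear with the origin (high cosine) but far apart in the image.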
In accordance with the above process, the similarity of the semantic nodes corresponding to each semantic label is compared, and the number of similar point pairs over all semantic labels is counted. Let $M_c$ and $M_k$, respectively, represent the number of all point pairs contained in the current frame and the candidate frame, let $M_s$ denote the number of similar point pairs, and set a coefficient $\gamma$. If formula (3) is satisfied, the two images are judged to constitute a loop closure; otherwise, no loop closure is formed, and the next frames are compared:

$$\frac{2M_s}{M_c + M_k} \geq \gamma. \tag{3}$$
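The following sketch illustrates this final decision with the formula (3) ratio as reconstructed above. The prose leaves $M_c$ and $M_k$ slightly ambiguous; here they are taken as the total node counts of each frame, which is an assumption. `is_loop_closure` is a hypothetical helper building on `match_nodes` from the previous sketch.

```python
def is_loop_closure(nodes_c, nodes_k, gamma=0.75, beta=0.999, d=16.0):
    """Sketch of formula (3): declare a loop closure when similar point
    pairs account for a large enough share of both frames' nodes.
    M_c and M_k are taken as the frames' node counts (an assumption)."""
    m_c = sum(len(v) for v in nodes_c.values())
    m_k = sum(len(v) for v in nodes_k.values())
    m_s = sum(len(match_nodes(nodes_c[l], nodes_k[l], beta, d))
              for l in set(nodes_c) & set(nodes_k))
    return (m_c + m_k) > 0 and 2 * m_s / (m_c + m_k) >= gamma
```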
3.3. Flow of the Proposed Algorithm
The flow chart of the proposed algorithm is shown in Figure 9. Firstly, the optimized DeepLabv3+ model performs semantic segmentation on the input RGB images, and the segmentation results are output one by one according to the semantic labels. Then, the centroid of each mask block in the segmentation results is obtained, and 3D semantic nodes are constructed by combining the depth information from the depth image. Next, the DBSCAN algorithm clusters densely distributed semantic nodes to avoid mismatching during node matching. After the 3D semantic nodes are obtained, the similarity of the semantic labels in the current frame and each key frame is compared. If the semantic labels of the two frames differ too much, their similarity is low and no more detailed comparison is needed; this is the rough comparison stage, which screens out the loop closure candidate frames, that is, images with high similarity to the current frame. The real loop closures are then further screened from the candidate frames by cosine similarity and Euclidean distance comparison; this is the fine comparison stage. Finally, the loop closure detection results are output.
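Tying the previous sketches together, a hypothetical end-to-end check of the current frame against one key frame might look as follows; the helper names refer to the sketches above, and the segmentation model and aligned depth inputs are assumptions.

```python
import torch

def detect_loop(rgb_c, depth_c, rgb_k, depth_k, model):
    """Sketch of the Figure 9 flow: segment -> 3D nodes -> DBSCAN ->
    rough label check -> fine cosine/Euclidean check."""
    with torch.no_grad():
        seg_c = model(rgb_c).argmax(dim=1)[0].cpu().numpy()
        seg_k = model(rgb_k).argmax(dim=1)[0].cpu().numpy()
    nodes_c = {l: cluster_nodes(v)
               for l, v in extract_semantic_nodes(seg_c, depth_c).items()}
    nodes_k = {l: cluster_nodes(v)
               for l, v in extract_semantic_nodes(seg_k, depth_k).items()}
    if not is_candidate(nodes_c.keys(), nodes_k.keys()):  # rough comparison
        return False
    return is_loop_closure(nodes_c, nodes_k)              # fine comparison
```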

The loop closure detection proposed in this paper uses the lightweight backbone MobileNetv2 to replace the Xception network adopted by DeepLabv3+, which effectively shortens the time of semantic segmentation. Furthermore, the coarse-to-fine multidimensional similarity comparison improves the accuracy of loop closure detection. In addition, semantic segmentation classifies every pixel in the image and assigns it a semantic label, accurately separating objects from the image; owing to its low sensitivity to image texture and illumination changes, it adapts better to the environment and applies to a wider range of scenarios.
4. Experimental Results and Analysis
In order to verify the effectiveness of the proposed loop closure detection, the NYUv2 dataset is used to pretrain the optimized DeepLabv3+ semantic segmentation model, and on the public TUM RGB-D and SUN RGB-D datasets and a self-filmed dataset, the proposed algorithm is compared in detail with SDA [18], CNN-W [21], DBoW [36], and Conv3 [37].
4.1. Experimental Datasets
The TUM RGB-D dataset [38] is a public dataset published by the Computer Vision Group of the Technical University of Munich, filmed with Microsoft Kinect cameras in different scenarios. The dataset contains 39 indoor sequences, which can be used to evaluate algorithm performance under different texture, illumination, and structural conditions. Three image sequences of the TUM RGB-D dataset, fr1-room, fr2-large-with-loop, and fr3-long-office-household, are selected for the experiment. The image resolution is 640 × 480, and the bit depth is 16. Figures 10(a)–10(c) are sample images of the three sequences. Among them, the objects in fr1-room are cluttered; the space in fr2-large-with-loop is relatively empty, and the light varies greatly; the viewpoint deviation range and image difference in fr3-long-office-household are both slight.
The SUN RGB-D dataset [39] is a public dataset published by the Vision & Robotics Group of Princeton University. It provides a total of 10,335 RGB-D images captured in environments such as universities, houses, and furniture stores in North America and Asia and can be used for tasks ranging from semantic segmentation to object detection to scene recognition. The images in this dataset have a resolution of 640 × 480, and the bit depth is 16. Figure 10(d) shows sample images of this dataset.
The self-filmed dataset was captured by our research group with a Microsoft Kinect sensor in a laboratory environment, mainly covering indoor illumination changes, object changes, and other situations, for a total of 433 images. The image resolution is 960 × 540, and the bit depth is 16. The shooting environment is an 8 m × 9 m laboratory containing various objects such as tables, chairs, bookshelves, and people, making the environment relatively complicated. Figure 10(e) shows sample images of this dataset.
4.2. Pretraining of Semantic Segmentation Model
Before performing the algorithm comparison experiments, it is necessary to train and test the optimized DeepLabv3+ semantic segmentation model; in this study, this is done on the public RGB-D dataset NYUv2. The original dataset contains 894 label types, which is too many to be meaningful for segmentation and classification tasks. Therefore, this paper adopts the semantic label division and recognition method provided in reference [40] to consolidate the NYUv2 dataset into 40 labels for evaluating classification accuracy. The computer hardware and software configurations used for semantic segmentation are shown in Table 1.
After the optimized DeepLabv3+ semantic segmentation model is trained, the images to be tested in the experiments are input into the model for semantic segmentation, and the segmentation results are used for loop closure judgment.
4.3. Loop Closure Judgment under Similar Environment
In this part, images are selected from fr1-room in the TUM RGB-D dataset, the SUN RGB-D dataset, and the self-filmed dataset to verify the proposed algorithm.
Since traditional loop closure detection algorithms have low accuracy under viewpoint deviation, illumination change, and object change in environmental images, this study groups the tested images into three categories: viewpoint deviation (Figure 11(a)), illumination change (Figure 11(b)), and object movement or absence (Figure 11(c)). For each of the three cases, 50 groups of experimental images are selected, each algorithm judges whether there is a loop closure, and the number of correctly judged groups is counted. In total, 54 groups of images are taken from the fr1-room sequence, 51 groups from the SUN RGB-D dataset, and 45 groups from the self-filmed dataset.
Through analysis of substantial experimental data, the thresholds involved in the algorithm are set as follows: the threshold $\alpha$ of the identical label proportion in the semantic label similarity comparison is set to 0.8, the cosine similarity threshold $\beta$ to 0.999, and the similar point pair coefficient $\gamma$ in the cosine similarity and Euclidean distance comparison to 0.75. The optimal Euclidean distance threshold $d$ is 2.5% of the larger of the image width and height; therefore, in this experiment, $d$ is set to 640 × 2.5% = 16 for the TUM dataset and 960 × 2.5% = 24 for the self-filmed dataset. The proposed algorithm and the SDA, DBoW, CNN-W, and Conv3 algorithms are used to test the three groups of images, and the test results are shown in Table 2.
As can be observed from Table 2, in the case of illumination change, the loop closure detection accuracy of the proposed method is higher than that of SDA and CNN-W by 6%, than DBoW by 10%, and than Conv3 by 14%. Viewpoint deviation changes the angle and distance of the objects in the image, which changes the semantic node coordinates; to address this, the proposed algorithm introduces cosine similarity and Euclidean distance to match semantic nodes, increasing the detection accuracy by 24% over SDA, 36% over DBoW, 30% over CNN-W, and 32% over Conv3. In the case of object movement or absence, the accuracy of the proposed method is 16% higher than SDA, 18% higher than DBoW, 10% higher than CNN-W, and 20% higher than Conv3. These results show that, in similar environments, the detection accuracy of the proposed algorithm is superior to that of the compared loop closure detection algorithms.
4.4. Analysis of Precision-Recall
In this part, fr1-room, fr2-large-with-loop, and fr3-long-office-household from the TUM dataset and the self-filmed dataset are selected for testing. A set of consecutive images is selected from each dataset, and loop closures are set manually. The proposed algorithm is then used to detect the number of loop closures contained in each set of images, and the precision-recall curve is drawn according to the experimental results. Figure 12 shows examples of the images contained in the four datasets. According to the definition of loop closure, loop closures are set for the four datasets, respectively: in each dataset, the two images similar to serial number 1 and serial number 2 are set as true loop closures, and the other two images cannot constitute loop closures.
In comparing the predicted values of the algorithm with the real values, there are four types of results: TP (true positive), FP (false positive), TN (true negative), and FN (false negative), as shown in Table 3.
In loop closure detection, the precision-recall curve is frequently employed as an important indicator of algorithm effectiveness. The precision rate represents the proportion of true loop closures among all predicted loop closures, and the recall rate represents the proportion of correctly predicted loop closures among all true loop closures. Low precision entails considerable misjudgment of loop closures, and low recall entails missing certain loop closures, which may lead to a great difference between the built map and the real environment and ultimately to failure of map construction [41].
Precision rate $P$ and recall rate $R$ are calculated according to formulas (4) and (5); different prediction results are obtained by changing the threshold, and the P-R curve is then drawn:

$$P = \frac{TP}{TP + FP}, \tag{4}$$

$$R = \frac{TP}{TP + FN}. \tag{5}$$
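A minimal sketch of tracing a P-R curve by sweeping the decision threshold follows; `precision_recall` is a hypothetical helper, where each score is the algorithm's similarity for one image pair and the ground truth flags real loop closures.

```python
def precision_recall(scores, truth, thresholds):
    """Sketch: sweep a similarity threshold to trace the P-R curve.
    scores[i] is the similarity for pair i; truth[i] says whether
    pair i is a real loop closure."""
    curve = []
    for t in thresholds:
        tp = sum(s >= t and g for s, g in zip(scores, truth))
        fp = sum(s >= t and not g for s, g in zip(scores, truth))
        fn = sum(s < t and g for s, g in zip(scores, truth))
        p = tp / (tp + fp) if tp + fp else 1.0  # formula (4)
        r = tp / (tp + fn) if tp + fn else 0.0  # formula (5)
        curve.append((p, r))
    return curve
```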
The fr1-room, fr2-large-with-loop, fr3-long-office-household, and self-filmed datasets are used to test the algorithms and draw the P-R curves (Figure 13). The green curve represents the algorithm proposed in this paper, the red curve the DBoW algorithm, the blue curve the SDA algorithm, the black curve the CNN-W algorithm, and the yellow curve the Conv3 algorithm. As can be seen from Figure 13, as the threshold decreases, the number of loop closures detected by the algorithms increases and the recall rate keeps rising, but precision decreases accordingly. The CNN-W algorithm is generally superior to the SDA, DBoW, and Conv3 algorithms; the Conv3 algorithm performs poorly; and the SDA algorithm is superior to the DBoW algorithm in most cases, although, as shown in Figure 13(c), DBoW outperforms SDA when the recall rate is between 0.53 and 0.94. Compared with the SDA, DBoW, CNN-W, and Conv3 algorithms, however, the precision and recall of the proposed algorithm are both significantly higher. Therefore, under different test environments, the proposed algorithm not only ensures the accuracy of loop closure detection but also effectively improves the recall rate, detecting more real loop closures than the other methods.
5. Conclusions
As a key step in visual SLAM, loop closure detection can effectively eliminate the errors accumulated by mobile robots during motion and improve the accuracy of autonomous localization and map construction. In this paper, a loop closure detection method for visual SLAM based on semantic segmentation in indoor environments is proposed. The optimized DeepLabv3+ network performs semantic segmentation on images, 3D semantic nodes are constructed from the segmentation results, and loop closure detection is completed through multidimensional similarity comparison. Experimental results show that by exploiting the rich information in semantic segmentation results, the proposed algorithm effectively improves the accuracy and adaptability of loop closure detection in different indoor environments. However, the proposed algorithm only uses image semantic information to construct 3D semantic nodes, without considering other image information. In future work, we will study how to fuse semantic information with other image information, establish descriptors that represent the image content more accurately, and thereby build a comprehensive image similarity comparison model that further improves the accuracy of similarity comparison and reduces the interference of illumination and object changes on loop closure detection. We will also optimize the algorithm to improve its performance and ensure real-time operation on mobile robots with limited computing resources.
Data Availability
The datasets used or analyzed during the current study are available from the corresponding author on reasonable request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Authors’ Contributions
Jinming Li contributed significantly to the analysis and wrote the manuscript; Peng Wang contributed to the conception of the study; Cui Ni contributed to the data analyses and manuscript preparation; Wen Rong performed the experiments.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant no. 61502277), China Postdoctoral Science Foundation (Grant no. 2021M702030), and Science and Technology Project of Shandong Provincial Department of Transportation (Grant no. 2021B120).