Abstract

With the popularity of smartphones, the demand for pedestrian location-based services has increased significantly. Pedestrian dead reckoning (PDR) can provide pedestrians with real-time, continuous localization in outdoor environments. However, because the inertial sensors built into mobile phones have limited accuracy, PDR suffers from heading drift, which leads to localization error. Localization based on visual place recognition (VPR) can provide an accurate position at a single point, and a VPR system built on a lightweight model requires little computation. We propose an outdoor localization scheme that fuses PDR and VPR to provide continuous localization for pedestrians. The experimental results show that, compared with the PDR algorithm, the ARMSE of the fusion scheme is reduced by 40% to 60%, and the final start-to-end error is reduced by 31.9 m. The fusion scheme can therefore provide an accurate and continuous localization service.

1. Introduction

With the development of mobile and wireless communication technology, the physical world and spatial information have become more closely linked. By acquiring and processing information, localization services can provide people with effective, real-time, and accurate location information. With the spread of smartphones, smartphone-based localization has become part of daily life and has been applied to pedestrian localization, vehicle guidance, and other fields. It has become one of the main ways of obtaining location information in daily life [1, 2]. With the increasing maturity of global navigation satellite system (GNSS) technology, the accuracy of outdoor localization has greatly improved, especially for the navigation and localization of pedestrians and vehicles in cities, but obvious shortcomings remain. In complex, partially sheltered environments such as urban canyons and tree-lined boulevards, the satellite signal is partially blocked, which can cause the localization service to fail [3, 4]. Inside buildings or warehouses, where blockage is more severe, the satellite signal may be lost completely, making localization impossible. Relying on GNSS alone therefore makes it difficult to meet the requirements of high precision, continuity, and high reliability. Consequently, localization algorithms that do not rely on satellites or base stations and use only the device's own sensors to predict the position are very important [5].

With more and more sensors in smartphones, PDR based on the inertial measurement unit (IMU) can predict the current position of a pedestrian by acquiring pedestrian motion data and can send the position information through wireless technology [6, 7]. PDR requires only the low-cost sensors of a mobile phone and does not need the assistance of external satellite signals or base stations [8, 9]. However, because low-cost smartphone sensors collect erroneous data and the accuracy of PDR decreases over time, a PDR algorithm alone can hardly meet the needs of long-term localization and navigation [10].

With the rapid development of computer vision and image recognition technology, extracting information from images has become a focus of current research. Because images are simple to sample, convenient to transmit, and rich in information, they are widely used in recognition, classification, and segmentation after processing [11, 12]. VPR processes visual information, extracts the corresponding geographic location information, integrates it, judges whether the place belongs to the set of known places, and ranks the candidates by similarity. In recent years, with the development of deep learning, VPR has made great progress in the fields of location recognition, mobile robots, virtual reality, and augmented reality [13-15]. However, VPR is affected by environmental factors such as illumination, and it performs much better at localizing a single point than at continuous long-distance localization. It is therefore more suitable as a method for recognizing and localizing individual places.

Our main contributions can be summarized as follows:
(i) We propose a fusion localization scheme that fuses the PDR localization algorithm and VPR. The fusion scheme improves the localization accuracy, providing an accurate and continuous outdoor localization service for pedestrians.
(ii) We conduct experiments on the threshold in the fusion scheme and select an appropriate threshold, which improves the localization accuracy and stability of the fusion scheme.

The rest of the paper is organized as follows. Section 2 discusses related work, Section 3 presents the fusion scheme, and Section 4 reports the experimental results. Section 5 concludes the paper.

2.1. Visual-Based Localization

Visual-based localization (VBL) uses the visual input as a query against a known visual database to obtain the corresponding position and heading. VBL methods can be divided into two categories. The indirect method [16] converts the localization task into an image retrieval problem and outputs the position or rough pose of the query image. The direct method directly regresses the 6-degree-of-freedom (DOF) pose of the visual system. Compared with the direct method, the indirect method has a lower computational cost and is less prone to losing track of the location and to drift. Among the indirect methods, visual place recognition (VPR) is a popular research topic because of its high speed and high precision. The first step is to extract the image feature information. Torralba [17] proposed the earliest global feature, GIST, for describing an image in 2001. The more commonly used local feature descriptors are SIFT [18], SURF [19], BRIEF [20], and ORB [21]. If an image is represented by local feature descriptors, a very large number of descriptors must be stored, and the similarity comparison must process a very large amount of data. Because this makes retrieval difficult, the current mainstream approach is to represent images with global feature descriptors, which are usually obtained by clustering local feature descriptors. Commonly used methods include bag-of-visual-words (BoVW) [22], Fisher vector (FV) [23], and the VLAD vector [24]. The image can then be described by a single global vector, which reduces the time and computation required for similarity comparison.

2.2. Fusion Localization Algorithm

Smartphone-based location services have become a core part of terminal services today. In complex environments, however, a single localization method fails easily, and fusion localization is an effective way to address this. In indoor localization, the commonly used methods include WiFi, Bluetooth, geomagnetic, and PDR localization. In open outdoor environments, GNSS-based satellite localization is the preferred method, but in narrow or sheltered environments GNSS localization may fail [25]. Because the GNSS receivers in smartphones are low-cost, their localization performance degrades even further in such cases. To solve similar problems, Zhang et al. [26] proposed a fusion localization method integrating GPS, UWB, and MARG, which realizes outdoor localization in complex scenes. GPS is a satellite navigation system operated by the United States that provides users with positioning, navigation, and timing information. Ultrawideband (UWB) is a radio technology that uses a very low energy level for short-range, high-bandwidth communication over a large portion of the radio spectrum. A MARG (magnetic, angular rate, and gravity) sensor array combines a magnetometer, a gyroscope, and an accelerometer to estimate orientation. The average error of the system is reduced from 8.9 m to 3.2 m. However, because additional signal base stations must be installed, the cost of this fusion method is relatively high. Tadic et al. [27] further improved the fusion localization method of GPS and UWB. The experimental results show that the error can reach 0.34 m while the cost is reduced. Lee [28] combined the camera with the compass to develop a fusion indoor localization algorithm with submeter-level error. Ruotsalainen et al. [29] used the camera for pedestrian step and heading estimation indoors. The method mainly uses the camera image to assist in judging the heading, and its accuracy is better than that of the compass built into the phone. Richter and Toledano-Ayala [30] proposed a method of fusing GNSS and WLAN with an error of 5 m. In recent years, with the development of indoor pedestrian localization and navigation, fusion algorithms for PDR localization have developed rapidly. Li et al. [31] proposed a method of integrating geomagnetic and WiFi localization, which mitigated the problem of false matches during geomagnetic matching.

3. Fusion Scheme

3.1. VPR

We use a lightweight VPR network, Ghost-NetVLAD, to realize visual localization based on image retrieval. Ghost-NetVLAD is an extension of the NetVLAD algorithm that was primarily developed for face recognition. The algorithm adds ghost clusters alongside the NetVLAD clusters. The ghost clusters absorb noise and unwanted content and are excluded during the feature aggregation stage. Thus, during feature aggregation, noisy and irrelevant features contribute little weight to the normal VLAD clusters because the ghost clusters take the majority of their weight [32, 33]. VPR is the process of identifying a previously visited location from visual information, even under varying appearance conditions, changes in viewpoint, and computational constraints. VPR uses methods related to localization, loop closure, and image retrieval and is widely used in autonomous vehicles, ranging from drones to frameworks involving computer vision systems [34]. Ghost-NetVLAD extracts a global feature from every image in the dataset to build a feature dataset. When a query image is input to Ghost-NetVLAD, its global feature is extracted and compared for similarity with all global features in the feature dataset, and the position is output from the image retrieval result. The framework of our proposed Ghost-NetVLAD is shown in Figure 1. Ghost-NetVLAD contains two parts: the lightweight feature extraction architecture (GhostCNN), described in Section 3.1.1, and the NetVLAD layer, described in Section 3.1.2.

The feature extraction part is a lightweight five-stage CNN architecture. At every stage, the spatial size of the feature map is halved. The NetVLAD layer aggregates the local descriptors into a global descriptor.
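The retrieval step described above can be sketched as a nearest-neighbor search over global descriptors. The following is a minimal Python illustration, not the authors' implementation: the function names, the use of cosine similarity, and the toy two-dimensional descriptors are assumptions made for the example. Only the idea of ranking database images by similarity and returning the top results comes from the text.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query, database, k=25):
    """Return the position labels of the k database images most similar to the query.

    database is a list of (global_descriptor, position_label) pairs.
    """
    ranked = sorted(database,
                    key=lambda item: cosine_similarity(query, item[0]),
                    reverse=True)
    return [label for _, label in ranked[:k]]


# Toy example with hypothetical 2-D descriptors and position labels.
database = [([1.0, 0.0], "P1"), ([0.9, 0.1], "P1"), ([0.0, 1.0], "P2")]
top2 = retrieve_top_k([1.0, 0.05], database, k=2)
```

In a real system the descriptors would be the high-dimensional outputs of the network, and an approximate nearest-neighbor index could replace the full sort.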

3.1.1. GhostCNN

In traditional CNNs, redundancy in the feature maps helps guarantee a comprehensive understanding of the input data. For example, many of the feature maps produced by the convolution layers of ResNet [35] are similar to one another, like ghosts of each other. This redundancy is an important characteristic of successful deep neural networks, but it results in massive computational costs. Inspired by the basic Ghost modules in GhostNet [36], we design a lightweight neural network named GhostCNN for front-end feature extraction. Ghost modules use cheap linear transformations to generate ghost feature maps, which reduces the computational cost significantly while maintaining satisfactory accuracy.
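To make the saving concrete, the sketch below compares the multiply-add count of an ordinary convolution with that of a Ghost module, following the cost analysis in the GhostNet paper [36]. The function name and the layer sizes are illustrative assumptions and are not taken from GhostCNN. Here `s` denotes the ratio between the number of output maps and the number of intrinsic maps, and `d` the kernel size of the cheap linear operation.

```python
def ghost_speedup(c_in, n_out, k, d, s):
    """Theoretical speed-up of a Ghost module over an ordinary convolution.

    Counts multiply-adds per output spatial location:
    - ordinary convolution: n_out * c_in * k * k
    - Ghost module: n_out / s intrinsic maps from a k x k convolution, plus
      (s - 1) cheap d x d linear operations per intrinsic map.
    """
    ordinary = n_out * c_in * k * k
    intrinsic = n_out // s
    ghost = intrinsic * c_in * k * k + (s - 1) * intrinsic * d * d
    return ordinary / ghost


# Hypothetical layer: 64 input channels, 128 output maps, 3x3 kernels, s = 2.
ratio = ghost_speedup(c_in=64, n_out=128, k=3, d=3, s=2)
```

With these sizes the ratio is close to `s`, which illustrates why generating part of the feature maps with cheap operations roughly divides the cost of the layer by `s`.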

3.1.2. NetVLAD Layer

The vector of locally aggregated descriptors (VLAD) [37] stores, for each visual word, the sum of residuals (the difference vectors between each descriptor and its corresponding cluster center). NetVLAD [38] uses a CNN architecture to capture information about the statistics of local descriptors aggregated over the image. Given the input image $I$, the local descriptors can be obtained by

\[ \{x_i\}_{i=1}^{N} = f_{\mathrm{GhostCNN}}(I). \tag{1} \]

In other words, the output of the last convolution layer of GhostCNN is an $H \times W \times D$ feature map, which can be considered as a set of $D$-dimensional descriptors extracted at $H \times W$ spatial locations. Equivalently, the feature map can be regarded as $N = H \times W$ feature descriptors of dimension $D$, each representing the local features at a specific position of the input image.

Formally, given $N$ local image descriptors $\{x_i\}$ as input and $K$ cluster centers (visual words) $\{c_k\}$ as VLAD parameters, the output VLAD image representation $V$ is $K \times D$-dimensional. The $(j, k)$ element of $V$ can be expressed by

\[ V(j, k) = \sum_{i=1}^{N} a_k(x_i) \left( x_i(j) - c_k(j) \right), \tag{2} \]

where $x_i(j)$ is the $j$-th dimension of the $i$-th descriptor and $c_k(j)$ is the $j$-th dimension of the $k$-th cluster center; $a_k(x_i)$ is the weight between the descriptor $x_i$ and the $k$-th cluster center. The weight ranges from 0 to 1, with the highest weight assigned to the closest cluster center. In the original VLAD the assignment is hard: $a_k(x_i)$ is 1 if cluster $c_k$ is the closest to the descriptor $x_i$ and 0 otherwise. In NetVLAD, the weight is trainable via back-propagation and can be described as follows:

\[ a_k(x_i) = \frac{e^{w_k^{T} x_i + b_k}}{\sum_{k'=1}^{K} e^{w_{k'}^{T} x_i + b_{k'}}}, \tag{3} \]

which assigns the weight of descriptor $x_i$ to cluster $c_k$ according to their proximity. Here $w_k$ and $b_k$ are trainable parameters of the $k$-th cluster.
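The aggregation performed by the NetVLAD layer can be illustrated with a small pure-Python sketch. It is a simplified illustration under stated assumptions rather than the authors' code: it uses the standard NetVLAD soft assignment (a softmax over linear scores with hypothetical parameters `weights` and `biases`), sums the weighted residuals to the cluster centers, and applies a single final L2 normalization, omitting the per-cluster normalization that NetVLAD also uses. All names and the toy values are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def netvlad_aggregate(descriptors, centers, weights, biases):
    """Aggregate N local D-dim descriptors into one K*D global vector.

    descriptors: N lists of D floats (local features from the CNN)
    centers, weights: K lists of D floats; biases: K floats
    """
    K, D = len(centers), len(centers[0])
    V = [[0.0] * D for _ in range(K)]
    for x in descriptors:
        # Soft assignment of descriptor x to each of the K clusters.
        scores = [sum(wj * xj for wj, xj in zip(w, x)) + b
                  for w, b in zip(weights, biases)]
        a = softmax(scores)
        # Accumulate the weighted residual to every cluster center.
        for k in range(K):
            for j in range(D):
                V[k][j] += a[k] * (x[j] - centers[k][j])
    flat = [v for row in V for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


# Toy example: N = 2 descriptors, D = 2 dimensions, K = 2 clusters.
descriptors = [[1.0, 0.0], [0.0, 1.0]]
centers = [[0.0, 0.0], [1.0, 1.0]]
global_vec = netvlad_aggregate(descriptors, centers,
                               weights=centers, biases=[0.0, 0.0])
```

The output has $K \times D$ entries, so the whole image is described by a single fixed-length vector regardless of how many local descriptors the feature map contains.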

3.2. PDR Localization Algorithm

The pedestrian dead reckoning (PDR) algorithm is a pedestrian localization algorithm that computes the position step by step from the inertial measurement unit (IMU) data of smartphone sensors. The PDR algorithm uses accelerometers and gyroscopes to calculate the step, stride, and heading. Compared with traditional localization techniques based on wireless signals and vision sensors, PDR calculates positions in a shorter time, updates the pedestrian's location faster, and has the additional advantage of low power consumption [39]. The algorithm mainly relies on data collected by the low-precision sensors of a phone worn on the chest. The step-based position calculation can be divided into two parts, step detection and position update, which are described below.

3.2.1. Step Detection

When the phone is placed on the chest while walking, the acceleration data rises to peaks and falls to troughs with the rise and fall of each pace. Step detection is based on peak detection: each time the acceleration data reaches a peak, a step is counted. Because the low-cost acceleration sensor is noisy, mean filtering is applied to the acceleration data before step detection, which reduces the step misdetections caused by noise. Figure 2 shows the mean-filtered acceleration data. The filtered data is smoother, and each peak can be identified as a detected step.
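The two stages above, mean filtering followed by peak detection, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the window length, the minimum peak height `min_peak` (in m/s², chosen slightly above gravity), and the minimum spacing between peaks `min_gap` are assumed values, and the input is a synthetic signal.

```python
import math


def moving_average(samples, window=5):
    """Mean filter with a centered window (truncated at the signal edges)."""
    half = window // 2
    out = []
    for i in range(len(samples)):
        lo, hi = max(0, i - half), min(len(samples), i + half + 1)
        out.append(sum(samples[lo:hi]) / (hi - lo))
    return out


def detect_steps(acc, min_peak=10.5, min_gap=5, window=5):
    """Return sample indices of detected steps (peaks of the filtered signal)."""
    smooth = moving_average(acc, window)
    steps = []
    for i in range(1, len(smooth) - 1):
        is_peak = smooth[i] > smooth[i - 1] and smooth[i] >= smooth[i + 1]
        far_enough = not steps or i - steps[-1] >= min_gap
        if is_peak and smooth[i] >= min_peak and far_enough:
            steps.append(i)
    return steps


# Synthetic acceleration magnitude: gravity plus one oscillation per 20 samples.
acc = [9.8 + 2.0 * math.sin(2 * math.pi * t / 20) for t in range(100)]
step_indices = detect_steps(acc)
```

The height and spacing thresholds keep small noise-induced ripples from being counted as steps; in practice they would be tuned to the sampling rate and to how the phone is worn.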

3.2.2. Location Update

By calculating the step size and heading from the sensor data, the position at the next peak time can be obtained from the position at the current peak time through the following formula:

\[ x_{k+1} = x_k + L \sin\theta_k, \qquad y_{k+1} = y_k + L \cos\theta_k. \tag{4} \]

In formula (4), $(x_k, y_k)$ is the position at the $k$-th peak time, with $x$ pointing east and $y$ pointing north, $L$ represents the step size, and $\theta_k$ represents the heading, measured clockwise from north. In the PDR we adopted, the step size is fixed.

To calculate the heading $\theta_k$, the MARG heading prediction algorithm proposed by Singh [40] is adopted. In addition to acceleration and angular velocity data, the MARG heading algorithm also collects magnetic field data through the magnetometer and uses it for nonlinear compensation to obtain the heading at the current peak time and update the position.
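The position update can be sketched in a few lines. The example assumes a local east/north coordinate frame in meters and a heading measured clockwise from north, so that a heading of 0 moves the pedestrian north; this axis convention, the function names, and the 0.7 m step length are assumptions made for illustration and are not values given in the paper.

```python
import math


def pdr_update(position, step_length, heading_rad):
    """Advance one step from (east, north) along the given heading."""
    east, north = position
    return (east + step_length * math.sin(heading_rad),
            north + step_length * math.cos(heading_rad))


def pdr_track(start, headings, step_length=0.7):
    """Positions after each detected step, using a fixed step length."""
    track = [start]
    for heading in headings:
        track.append(pdr_update(track[-1], step_length, heading))
    return track


# Four steps heading north, east, south, and west return to the start.
track = pdr_track((0.0, 0.0), [0.0, math.pi / 2, math.pi, 3 * math.pi / 2])
```

Because each position is built on the previous one, any heading error is carried into all later positions, which is the drift that the fusion scheme aims to correct.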

3.2.3. Disadvantages of PDR

Figure 3 shows that the PDR results based on the data collected by the inertial sensors in the phone differ considerably from the actual track. The built-in inertial sensors of the phone have low accuracy, so the acceleration and gyroscope data are not accurate enough, which in turn lowers the accuracy of the PDR algorithm. This is an unavoidable defect of low-cost inertial sensors. In addition, the PDR algorithm has no ability to correct itself. Once the heading is misjudged, all subsequent positions are computed on the basis of the wrong heading, resulting in localization error.

Therefore, we propose to use visual place recognition to assist PDR, reduce its localization error, and achieve higher localization precision.

3.3. Fusion Localization Scheme of Ghost-NetVLAD and PDR

In the actual experiment, Ghost-NetVLAD provides the correction points. The phone camera records video during the whole walk, and the recording is split at a fixed time interval to obtain the query images for Ghost-NetVLAD. After retrieval, Ghost-NetVLAD outputs the coordinates of the corresponding correction point image. PDR serves as the main frame of the localization method and provides the main position coordinates. Figure 4 shows the flow chart of the fusion scheme.

The fusion scheme of Ghost-NetVLAD and PDR corrects the PDR position coordinate $P_{\mathrm{PDR}}$ with the correction point position coordinate $P_{\mathrm{VPR}}$ output by Ghost-NetVLAD. That is, when the correction conditions are met, we set $P_{\mathrm{PDR}} = P_{\mathrm{VPR}}$ to achieve higher-precision localization. The fusion scheme mainly includes three parts: time alignment, correction point identification, and false correction point determination.

Time alignment refers to aligning the starting times of Ghost-NetVLAD and PDR. If the times are not aligned, a correction point is applied to the wrong PDR position, which increases the position error instead of reducing it. To align Ghost-NetVLAD and PDR at the starting time, we manually align the first second of the PDR data with the first second of motion captured by the phone camera.
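Once the two clocks share a starting time, each query image has to be matched to a PDR position. One way to do this is sketched below: each query time is mapped to the detected step with the nearest timestamp. The function names, the `video_offset` parameter (the manually determined start of the video on the PDR clock), and the synthetic timestamps are assumptions for illustration; the paper itself only describes the manual alignment of the first second.

```python
from bisect import bisect_left


def nearest_step_index(step_times, t):
    """Index of the PDR step whose timestamp is closest to time t.

    step_times must be sorted in ascending order (seconds).
    """
    i = bisect_left(step_times, t)
    if i == 0:
        return 0
    if i == len(step_times):
        return len(step_times) - 1
    # Pick the closer of the two neighboring steps.
    return i if step_times[i] - t < t - step_times[i - 1] else i - 1


def align_queries(step_times, n_queries, interval=10.0, video_offset=0.0):
    """Map query frame j (taken at video_offset + j * interval) to a PDR step."""
    return [nearest_step_index(step_times, video_offset + j * interval)
            for j in range(1, n_queries + 1)]


# Hypothetical PDR steps every 0.5 s; query frames at 10 s, 20 s, and 30 s.
step_times = [0.5 * k for k in range(100)]
indices = align_queries(step_times, n_queries=3)
```

An error in `video_offset` shifts every query to the wrong step, which is why a misaligned start degrades all later corrections.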

Correction point identification works as follows. The frames extracted from the phone camera video are sent to Ghost-NetVLAD as query images, and for each query the 25 retrieval results with the highest similarity are obtained. When the number of images with the same coordinate label among the 25 results is greater than a set threshold, the retrieval result is recognized as a correction point, and the GPS coordinates corresponding to that correction point are used to correct the PDR position. During processing, several consecutive images may identify the same correction point, and applying several consecutive corrections can introduce errors in the corrected position. When three or more consecutive images identify the same correction point, we keep only the correction point identified by the middle image and set the results of the other images to 0 (not a correction point).
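The voting rule and the handling of consecutive identifications can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names and labels are hypothetical, `None` stands for "not a correction point" (the value 0 in the text), and for runs of even length the earlier of the two middle images is kept, a choice the paper does not specify. The default threshold of 13 is the value selected in Section 4.4.2.

```python
from collections import Counter


def identify_correction_point(retrieved_labels, threshold=13):
    """Return the majority label if it occurs more than `threshold` times."""
    label, count = Counter(retrieved_labels).most_common(1)[0]
    return label if count > threshold else None


def keep_middle_of_runs(labels, min_run=3):
    """Within runs of >= min_run identical labels, keep only the middle one."""
    result = list(labels)
    i = 0
    while i < len(labels):
        j = i
        while j + 1 < len(labels) and labels[j + 1] == labels[i]:
            j += 1
        run_len = j - i + 1
        if labels[i] is not None and run_len >= min_run:
            middle = i + (run_len - 1) // 2
            for k in range(i, j + 1):
                if k != middle:
                    result[k] = None
        i = j + 1
    return result


# 14 of the 25 results agree on "P3": accepted. 13 of 25: rejected.
accepted = identify_correction_point(["P3"] * 14 + ["P7"] * 11)
rejected = identify_correction_point(["P3"] * 13 + ["P7"] * 12)
filtered = keep_middle_of_runs([None, "A", "A", "A", None, "B", "B"])
```

Raising the threshold makes identification stricter and reduces false correction points at the cost of missing some true ones, which is the trade-off examined in Section 4.4.2.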

False correction point determination handles the case in which an identified correction point is not the best result or is simply wrong. If a wrong correction point is used to correct the PDR result, the localization becomes worse rather than better. Therefore, we apply a threshold test to each correction point result. When the distance between the GPS coordinate of the correction point and the PDR position is greater than the threshold, the result is treated as a false correction point, and the PDR position is not corrected.
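The distance test can be sketched as a simple gate in front of the correction. The example assumes that both the PDR position and the correction point are available as latitude/longitude pairs and uses the haversine formula for the distance; both are assumptions, since the paper does not state the coordinate frame used for the comparison. The 100 m default is the threshold reported in Section 4.4.3, and the coordinates are made up.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))


def fuse_position(pdr_latlon, vpr_latlon, max_dist_m=100.0):
    """Return the VPR position if it passes the distance gate, else the PDR one."""
    if vpr_latlon is None:
        return pdr_latlon  # no correction point identified
    if haversine_m(*pdr_latlon, *vpr_latlon) > max_dist_m:
        return pdr_latlon  # false correction point: keep the PDR position
    return vpr_latlon


pdr = (39.1070, 117.1700)
near = fuse_position(pdr, (39.1073, 117.1700))  # about 33 m away: corrected
far = fuse_position(pdr, (39.1090, 117.1700))   # about 222 m away: rejected
```

The gate reflects the fact that PDR is the main source of position: a correction point that disagrees with it by more than the threshold is more likely a retrieval error than a genuine correction.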

4. Experiments

This section describes the design of the experiments (Section 4.1), the VPR dataset (Section 4.2), and the evaluation metrics (Section 4.3). Finally, the experiment analysis validates our proposed scheme (Section 4.4).

4.1. Design of Experiments

To verify the effectiveness of the proposed fusion algorithm, three groups of experiments are carried out on the campus of Tianjin University. One approximately straight route and two closed routes are selected, and the experiments are run at a time of day with as little pedestrian interference as possible. The ground truth of each route is shown in Figure 5.

The green track in Figure 5 is the track recorded by a hand-held mushroom-head GPS antenna, which supports four-constellation, multi-frequency, centimeter-level localization. This track is taken as the ground truth in the experiment. There are 10 correction points on each of the three selected paths (marked by the blue points in the figure), evenly distributed along the path. The experiments take place at about 14:00 Beijing time, when the light is good and there are few pedestrians, vehicles, and other sources of interference. On reaching a correction point, the pedestrian stands still for 3-4 seconds so that an image can be captured for correction point identification. In addition, to record the path length conveniently, a hand-held measuring wheel is used to measure the distance of each route.

To record the PDR data and video data, we use an iPhone X. Its built-in IMU records acceleration, angular velocity, and magnetic data, and its front camera records video during the walk. The phone is fixed on the chest. All captured video is split into frames, and one image is extracted every 10 seconds as a query image.

Because of the limitations of the algorithm and the low accuracy of the IMU we use, the position obtained by PDR inevitably contains errors, some of them large. The ideal outcome of the experiment is that all correction points are identified accurately and the PDR position is corrected at each of them, which improves the accuracy of PDR localization.

4.2. VPR Dataset

The TJU-Location dataset contains multiple street-level images taken at different times. We selected 50 positions with distinctive building characteristics on the Weijin campus of Tianjin University. All points are geographically isolated from one another. We recorded images at these locations with two smartphones (an iPhone XR and a Huawei Mate 20). At each place, images were taken every 45 degrees in the horizontal plane, and in each horizontal direction two images with different pitch angles were captured. We collected images at each location at 9:00 and at 14:00. Therefore, 64 images were recorded at each location (8 directions × 2 pitch angles × 2 times × 2 phones). Finally, a database containing 3.2 k images with 2.3 k queries was established. We split all database images into three roughly equal parts for training, validation, and testing. These subsets are geographically independent of each other. To obtain accurate positions for the collected images, we used a portable localization module with submeter accuracy, the Qianxun magic cube MC120M from Qianxun spatial intelligence company.

4.3. Evaluation Metrics

The fusion localization method is evaluated with two indices: the average root mean square error (ARMSE) and the start-to-end error. ARMSE summarizes the typical magnitude of the position error and is widely used in scientific and engineering applications.

ARMSE refers to the average root mean square error between the real value of the actual position and the estimated position, and its formula is shown as follows:

\[ \mathrm{ARMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left\lVert p_i - \hat{p}_i \right\rVert^{2}}. \tag{5} \]

In formula (5), $p_i$ represents the real position of the $i$-th point on the actual track, $\hat{p}_i$ represents the position of the $i$-th point on the path calculated by the localization scheme, and $n$ represents the number of test points. In the experiments in this paper, the points used for ARMSE evaluation are the correction points of each experiment.

For closed-loop paths, the start-to-end error is used as the evaluation index. The start-to-end error is the distance between the estimated end position and the starting position when the track is a closed curve.
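Both metrics are straightforward to compute from lists of planar positions. The sketch below assumes that ARMSE is the square root of the mean squared Euclidean distance between true and estimated test points, and that positions are expressed in meters in a local plane; the function names and sample values are illustrative.

```python
import math


def armse(truth, estimate):
    """Root of the mean squared Euclidean distance over paired test points."""
    if len(truth) != len(estimate) or not truth:
        raise ValueError("need two non-empty lists of equal length")
    squared = sum((tx - ex) ** 2 + (ty - ey) ** 2
                  for (tx, ty), (ex, ey) in zip(truth, estimate))
    return math.sqrt(squared / len(truth))


def start_to_end_error(track):
    """Distance between the first and last estimated positions of a closed track."""
    (x0, y0), (x1, y1) = track[0], track[-1]
    return math.hypot(x1 - x0, y1 - y0)


# Two test points, one of them off by 5 m: ARMSE = sqrt(25 / 2).
error = armse([(0.0, 0.0), (0.0, 0.0)], [(3.0, 4.0), (0.0, 0.0)])
# A closed walk whose estimate ends 5 m away from where it started.
closure = start_to_end_error([(0.0, 0.0), (5.0, 5.0), (3.0, 4.0)])
```

ARMSE reflects the error along the whole track, while the start-to-end error only measures how well a closed track returns to its origin, so the two indices complement each other.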

4.4. Experiment Analysis

In order to verify the effectiveness of the proposed fusion scheme, the accuracy of three groups of experiments is evaluated. At the same time, in order to verify the advantages of the scheme, experiments are carried out on the correction point identification and false correction point determination, respectively.

4.4.1. Scheme Accuracy

For experiment path a, the path is an approximately straight-line path from north to south. The whole length of the track is 879.1 m. There are 10 correction points in the whole process. The track obtained by PDR is shown in the red track in Figure 6.

Figure 6 shows that 4 of the 10 correction points actually completed a correction. Compared with PDR alone, the fusion scheme based on visual place recognition greatly improves the localization accuracy.

For experiment path b, the path is a closed track with a total length of 839.4 m. The track obtained by PDR is shown as the red track in Figure 7. PDR maintains good accuracy at the beginning of the walk. After a period of time, because of external interference such as magnetic fields, large deviations begin to appear in the subsequent localization, and the distance is also miscalculated. PDR estimates the total length of the track as 917.4 m, which differs considerably from the true length.

The track after introducing the fusion scheme is shown in yellow. Five correction points were applied during the walk, and the corrected track coincides more closely with the ground truth track.

The experiment path c is also a closed track with a total length of 862.8 m. The track obtained by PDR is shown in the red track in Figure 8. Compared with the PDR track of experiments a and b, the track of experiment c is closer to the ground truth. However, due to the disadvantage of distance calculation deviation in the fixed step algorithm, there is still an error between the distance and the ground truth.

The track after introducing the fusion scheme is shown in yellow. In experiment c, the fusion scheme applies two correction points correctly. The overall correction effect on the PDR is therefore relatively limited, but the scheme still corrects the track well at some key points.

To calculate the ARMSE of the three groups of experiments, we take the 10 correction points on each track as the test points, so $n = 10$. To further reflect the advantages of the fusion scheme, the start-to-end error is used as a second index of localization accuracy. To fairly reflect the improvement brought by the fusion scheme, we also report the start-to-end error obtained when the correction of the last correction point is not applied to the PDR, which we call the true start-to-end error. Because the last correction point in experiment c is not identified successfully, experiment c does not have a true start-to-end error. The final ARMSE and start-to-end error of the three experiments are shown in Table 1.

Table 1 shows that the ARMSE of the fusion scheme is smaller than that of the PDR in all cases, and for experiment a the ARMSE of the fusion scheme is only about one-third of that of the PDR algorithm. In terms of the start-to-end error, the three groups of experimental results show that the fusion scheme is significantly better than the PDR algorithm. To highlight the advantages of the fusion scheme in localization accuracy, we also analyze the true start-to-end error. After the last correction is removed, the true start-to-end error of the fusion scheme in experiment a changes from 0 to 16.05 m, and that in experiment b changes from 0 to 47.7 m. The true start-to-end error of the fusion scheme is greater than that of the PDR algorithm. It can be concluded that without the correction of the last correction point, the true start-to-end error becomes larger, resulting in greater path deviation. Therefore, the fusion scheme improves the localization accuracy.

4.4.2. Correction Point Identification

For the Ghost-NetVLAD VPR network, the 25 most similar images in the database are retrieved at each correction point. However, different thresholds lead to large differences in the accuracy of the fusion scheme. This section discusses the impact of the threshold on accuracy.

When the threshold is set to 13, that is, when the number of images with the same point in the 25 search result images is greater than 13, the ARMSE of the fusion scheme of experiment a is 36.232 m. In order to carry out the comparative experiment, take experiment a as an example and adjust the threshold appropriately. When the threshold is taken as 12, 13, 14, 15, and 16, respectively, the number of identified correction points (ICDP) and the number of correctly identified correction points (CICDP) are shown in Table 2.

Table 2 shows that when the threshold is 12, the number of identified correction points increases, but the number of wrongly identified correction points increases as well, which raises the possibility of a wrong path correction. When the threshold is 16, the number of points is similar to that of threshold 15; that is, when the threshold is greater than 16, its influence on the identification of correction points becomes very small. When the thresholds are 14 and 15, the number of correctly identified correction points is relatively small. We therefore plot and analyze the fusion scheme tracks for thresholds 13, 14, and 15. The tracks are shown in Figure 9.

It can be seen that when the threshold value is 13, due to one more correctly identified correction point, the track of the fusion scheme is closer to the ground truth than that of the thresholds 14 and 15. The ARMSE of thresholds 14 and 15 is 38.597 m and 44.698 m, respectively, which are greater than that of the threshold 13. Therefore, it can be seen that the threshold 13 is the best result.

4.4.3. False Correction Point Determination

Table 2 shows that when the threshold is slightly too small, correction points are misidentified, so it is necessary to add false correction point determination to the fusion scheme. The fusion scheme is a multielement fusion localization scheme based mainly on PDR. Therefore, correction points that deviate too far from the PDR position should be discarded; otherwise, the correction is counterproductive. Figures 10 and 11 show the fusion scheme tracks of experiment a and experiment c, respectively, with and without the distance threshold.

The red correction points in Figures 10 and 11 are misidentified correction points. Without false correction point determination, the false correction points still correct the position, resulting in a large deviation of the track. With the threshold in place, when the distance between the position of the correction point and the position obtained by the PDR is greater than 100 m, we treat the correction point as false, discard it, and keep the PDR position. The false correction point determination thus prevents large deviations of the fusion scheme track.

5. Conclusion

In this paper, we propose a fusion localization scheme based on PDR and Ghost-NetVLAD to improve the accuracy of outdoor pedestrian localization. The PDR localization algorithm serves as the main localization basis, and VPR identifies correction points along the pedestrian track. When a correction point meets the threshold conditions, the PDR position is corrected with the VPR position. The experiments show that the proposed fusion scheme corrects the position effectively. Compared with PDR alone, the final pedestrian track shows an average reduction of 20.234 m in the ARMSE and 31.9 m in the start-to-end error.

However, the fusion scheme still has some disadvantages. (1) The lightweight VPR model is suitable for embedded devices only in theory and has not been deployed in practice. (2) In this paper, the phone is placed on the chest. In actual use, the phone may be carried elsewhere on the body, which affects the quality of the sensor data. (3) There are errors in the time alignment of the two algorithms, which may lead to inaccurate correction positions. These issues need to be addressed in future work.

Data Availability

The data used to support the findings of this study are available from the author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding this work.

Acknowledgments

This research is supported by the Yunnan Provincial Major Science and Technology Special Plan Projects: digitization research and application demonstration of Yunnan characteristic industry, under Grant 202002AD080001, and by the National Natural Science Foundation of China (61771338). We thank Liqiang Zhang for designing the fusion scheme of the PDR and VPR systems and for implementing the PDR algorithm.