Abstract
Efficient localization plays a significant role in the navigation systems of mobile autonomous robots. Traditional visual simultaneous localization systems based on point-feature matching suffer from two shortcomings. First, feature tracking is not robust in environments with frequent changes in brightness. Second, the large number of consecutive visual keyframes can consume substantial computational and storage resources in complex environments. To address these robustness and efficiency challenges, an end-to-end real-time six-degrees-of-freedom object pose estimation algorithm based on a deep learning model is proposed. First, preprocessing operations such as cropping, mean normalization, and timestamp alignment are performed on the datasets to reduce computational cost and time. Second, the processed dataset is fed into our neural network model to extract the features most useful for matching. Finally, the robot's current 3D translation and 4D orientation are predicted and output, yielding an end-to-end localization system. A broad range of experiments is performed on both indoor and outdoor datasets. The experimental results demonstrate that translation and orientation accuracy in outdoor scenes improve by 32.9% and 31.4%, respectively. The average improvement in localization accuracy in indoor scenes is 38.4%, and the improvement in orientation is 13.1%. Moreover, the effectiveness of the algorithm in predicting global motion trajectories from sequential images has been verified, and it is superior to other convolutional neural network methods.
1. Introduction
Real-time localization plays a significant role in determining the relative or absolute location in many autonomous systems, such as augmented reality [1], mobile robots [2], and so forth. Mobile robots obtain their positions with different types of sensors, such as wheeled odometers, LiDARs, and low-cost RGB-D cameras [3–6]. However, traditional localization methods based on wheeled odometers drift on complex and smooth terrain, leading to accumulated error due to biased measurements. Localization methods based on the global positioning system are unsuitable for indoor environments because of poor or weak signals. Some methods improve acquisition efficiency and accuracy by using high-resolution laser radar, but at extra cost. In recent years, vision-based simultaneous localization and mapping (SLAM) algorithms [7–11] and deep learning localization algorithms [12–14] have been widely applied because of their low-cost sensors and convenient fusion with other sensors [15, 16].
Localization algorithms for monocular cameras can be divided into two groups: explicit matching methods based on feature points [17–19] and scene-matching methods based on implicit representations [20, 21]. The explicit feature-point matching methods face two challenges: selecting features that remain reliable under frequently changing brightness, and designing a robust method to improve matching efficiency. In the implicit methods, the inability to remember the relationships among global feature points and the tendency of the network to overfit, both of which reduce localization accuracy, remain to be solved.
Most image-based localization methods use the explicit feature-point matching technique. Such approaches typically use structure from motion (SFM) [22] or SLAM techniques [23, 24] to represent the scene with a 3D model. The robot position of a query image is then estimated from a series of 2D–3D correspondences. Each of these 2D–3D correspondences is obtained by matching feature points between consecutive frames against the descriptors associated with the 3D points using matching algorithms such as SIFT [25], SURF [26], ORB [27–29], and so forth. The computational complexity of this approach increases with the scale of the 3D model when the scene is inherently complex, such as a large-scale urban environment where the 3D model becomes excessively large. In this case, the proportion of outliers during matching tends to rise, which decreases localization accuracy. As a consequence, Sattler et al. [30] used visual vocabulary information among 3D feature points to filter out erroneous matches prior to position estimation. In this way, the matching error can be reduced to a certain degree and localization accuracy improved. Still, the accuracy of the mobile robot position in image-based localization algorithms depends directly on the efficiency of feature matching. It is hard to obtain exact matching points on the TUM Handheld SLAM indoor datasets in certain complex scenes, such as repetitive structures and large texture-free areas, which severely affects the accuracy of camera pose computation. These are the essential factors that constrain geometry-based image localization methods, and they further drive research into alternative methods based on deep learning.
Implicitly expressed scene-matching methods commonly use machine learning techniques or deep learning networks for camera pose prediction. Inspired by machine learning, Gronat et al. [31] used the standard bag-of-words vector as a feature to learn a linear classifier, partitioning different bag-of-words databases and training classifiers for classification. Donoser and Schmalstieg [32] treat feature matching as a classification problem by forming a class of descriptors associated with each 3D model point and use a set of Random Ferns to compute matches efficiently. Aubry et al. [33] learn feature descriptors for 3D scene models to reconstruct and localize within the environment model. In the context of RGB-D image relocalization using machine learning, Shotton et al. [34] propose feeding RGB-D depth information from the localized image into a random forest and using RANSAC-based pose computation to determine the correct camera position. Brachmann et al. [35] adapted a random-forest-based method to become independent of depth measurements during testing; however, the technique still requires depth data to predict the 3D coordinates of each pixel during the training phase.
Inspired by research on convolutional neural networks (CNNs) [36], Kendall et al. [37] proposed the first CNN-based camera position prediction method, named PoseNet. PoseNet uses a pretrained GoogLeNet network model [38], removing the three softmax layers and the final fully connected layers and replacing them with regressors that, in the training stage, regress the current six-degrees-of-freedom (6-DoF) position of the robot from RGB images. Subsequently, Kendall and Cipolla [39] proposed Bayesian PoseNet, which uses a Bayesian CNN to estimate localization uncertainty and to decompose the global estimation and prediction task into local ones, thus reducing the estimation difficulty and increasing the localization efficiency of the network architecture. For actual camera relocalization, however, PoseNet cannot meet the demand for high-precision localization. For example, its average position error on the Microsoft 7-Scenes dataset reached 0.48 m, yet the model space was only 2.5 m³. This is because the network design does not attempt to model the global map, so a deeper and larger network model was needed to sample the datasets for training. Clark et al. [40] proposed a CNN-based short-video-clip prediction model named VidLoc. Similar to PoseNet, VidLoc employs a modified pretrained GoogLeNet model for feature selection over short videos and implements a regression model for 6-DoF localization of video clips. VidLoc's pose estimation is smooth on short video sequences (20 frames), and the localization errors are significantly decreased; still, the errors can show large deviations for longer sequences. Compared with algorithms based on matching point correspondences, these methods eliminate the need for visual feature point extraction, description, and matching. In addition, Yang et al. [41] proposed an N-step priority double DQN based on reinforcement learning and an obstacle-avoidance experience screening mechanism for path planning. This algorithm provides a new solution for path planning of underwater robots: it adapts well, infers quickly, and suits dynamic environments. Furthermore, Xiao et al. [42] proposed a new method for stereo video quality assessment based on depth perception in the frequency domain, which exploits the frequency domain to explore the effect of distortion on stereo video quality and achieves a deeper exploration of frequency-domain features without changing the original frame size of the stereo video. However, these methods perform poorly for continuous pose optimization and cannot accurately match feature points in environments with little texture (e.g., corridors) or with an obscured view (e.g., a partially occluded camera lens), resulting in reduced localization accuracy. In addition, the high-dimensional fully connected layer output can cause overfitting on the training data, which also reduces localization accuracy. To deal with high-dimensional signals of this kind, Lyu et al. [43] proposed a three-finger FBG haptic sensing system based on wavelength-scanning optical coherence tomography. The high-dimensional tactile signals are demodulated into time-dependent sequences using a long short-term memory (LSTM) model, a variant of the recurrent neural network (RNN) [44, 45]; the tactile signal is then fed into a squeeze-and-excitation LSTM model for object classification.
As a result, the model is able to classify six objects of similar size and shape with high accuracy.
A real-time 6-DoF object pose estimation algorithm for dynamic environments is proposed to address the problems above. A neural network structure combining a CNN with an RNN is applied to the 6-DoF camera pose relocalization problem. The issue is formulated as a regression problem, and the camera pose is regressed directly from the position information contained in the input image. On the one hand, a CNN is used to automatically select the image features appropriate for localization. On the other hand, an RNN is used to overcome PoseNet's inability to track global motion trajectories while better processing image sequence data. The combined CNN + RNN approach provides better robustness to blur and illumination changes caused by camera motion. The results of PoseNet show that using a dropout strategy to handle the regressed position after the fully connected layer is not the best choice, so a modified RNN model is used instead to reduce the dimensionality of the high-dimensional fully connected layer. In summary, our algorithm does not rely on the motion relationship between consecutive frames and only requires RGB images to regress the camera pose. Furthermore, the CNN + RNN model exploits the temporal smoothness of image sequences to improve the accuracy of 6-DoF camera relocalization and the localization of global motion trajectories.
We summarize the contributions as follows:
(1) We propose a new network structure consisting of a CNN and an RNN for mobile robot localization in indoor and outdoor scenes. The data overfitting problem is alleviated, and the most valuable features are selected for localization estimation. This approach significantly improves localization accuracy compared to localization algorithms using only CNNs.
(2) In contrast to non-CNN-based methods, the localization algorithm in this paper does not depend on the motion relationship between preceding and following frames. Our algorithm predicts the robot's spatial position by regressing the mobile robot pose directly from RGB images; it also resists the interference of moving objects within the field of view and is more robust to illumination changes and motion.
(3) We verify the effectiveness of our algorithm in predicting the global motion trajectory by comparing it with a traditional visual SLAM algorithm. The results show that the global motion trajectory can be predicted and that our method outperforms advanced visual SLAM localization algorithms.
The organization of the rest of this paper is as follows: Section 2 mainly describes the framework and details of the localization model. Section 3 shows the experimental results and compares them with the baseline method through tables and figures. Section 4 summarizes the work of the paper and provides an outlook for future work.
2. Materials and Methods
In this paper, a new end-to-end monocular camera localization neural network is proposed; its architecture is shown in Figure 1. The images in the dataset are resized to 224 × 224 pixels and input into the ResNet-50 network for feature extraction. On this basis, the LSTM network is introduced to structurally reduce the dimensionality of the fully connected layer and extract the features most valuable for the camera position estimation task. Finally, the 7D pose is output through two fully connected layers for the position task.

The algorithm flow of this paper is shown in Figure 2. The dataset is first preprocessed and input to the network model described above, and then the model is trained to output the predicted image poses, completing robot localization. Before image preprocessing, the poses must be aligned with the timestamps. First, the timestamps of the ground-truth trajectory and the RGB images in the dataset are read and stored in separate lists. Second, the maximum time error is set as the search radius for candidate generation, and the time offset between the two lists is calculated. Next, the closest match within the search radius is found between the two lists. Finally, the best-matching ground-truth data are attached to the RGB entries, generating a file with the RGB name, RGB timestamp, ground-truth pose, and pose timestamp; only the required RGB information and the ground-truth pose are extracted.
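The timestamp association step can be summarized with the following sketch. It is an illustrative assumption rather than the authors' implementation; the function name, variable names, and default search radius are placeholders, and the nearest-neighbor matching within a maximum time error mirrors the description above.

```python
# Illustrative sketch of the timestamp alignment step (not the authors' code).
# Ground-truth poses and RGB frames are matched by nearest timestamp within
# a maximum allowed time error (the "search radius").

def associate(rgb_stamps, pose_stamps, max_error=0.02, offset=0.0):
    """Return (rgb_t, pose_t) pairs whose timestamps differ by less than max_error seconds."""
    candidates = [
        (abs(rgb_t - (pose_t + offset)), rgb_t, pose_t)
        for rgb_t in rgb_stamps
        for pose_t in pose_stamps
        if abs(rgb_t - (pose_t + offset)) < max_error
    ]
    candidates.sort()  # best (smallest time difference) first

    matches, used_rgb, used_pose = [], set(), set()
    for _, rgb_t, pose_t in candidates:
        if rgb_t not in used_rgb and pose_t not in used_pose:
            matches.append((rgb_t, pose_t))
            used_rgb.add(rgb_t)
            used_pose.add(pose_t)
    return sorted(matches)
```

Each matched pair would then be written out as (RGB name, RGB timestamp, ground-truth pose, pose timestamp), keeping only the RGB image and its associated pose for training.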

Specifically, the environmental image dataset consists of images recorded throughout the environment together with 7D pose labels. The picture dataset refers to the individual images within the environmental image dataset.
2.1. Image Preprocessing
In order to train on datasets with different resolutions, each image is rescaled before training while maintaining the aspect ratio, so that the shorter side of the scaled image is 256 pixels. The scaled image is then randomly cropped to 224 × 224 pixels as the input to the pose prediction network for training. On the test set, the image is cropped to 224 × 224 pixels with a center crop. In addition, the mean and standard deviation of the pixel intensities are computed separately for each dataset and are used to standardize the intensity value of each pixel in the image, which also saves preprocessing time.
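A minimal torchvision sketch of this preprocessing is shown below, assuming the per-dataset mean and standard deviation have already been computed; the statistic values used here are placeholders, not the ones computed in the paper.

```python
import torchvision.transforms as T

# Placeholder per-dataset statistics; in practice they are computed
# separately for each dataset, as described above.
DATASET_MEAN = [0.485, 0.456, 0.406]
DATASET_STD = [0.229, 0.224, 0.225]

# Training: rescale so the shorter side is 256 px, then random-crop 224 x 224.
train_transform = T.Compose([
    T.Resize(256),
    T.RandomCrop(224),
    T.ToTensor(),
    T.Normalize(mean=DATASET_MEAN, std=DATASET_STD),
])

# Testing: same rescaling, but a deterministic center crop.
test_transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=DATASET_MEAN, std=DATASET_STD),
])
```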
2.2. CNN Network Architecture
The primary objective of the proposed method is to fulfill the localization task by using a CNN to predict pose, so the pretrained ResNet-50 architecture, which performs well on other image classification tasks, is used as our base network. The initial 7 × 7 convolutional layer and the 3 × 3 max pooling layer from the original ResNet-50 [46] model are retained, while the last 1 × 1 average pooling layer, the fully connected layer, and the softmax layer are removed. The architecture consists of four residual blocks with multiple residual units, each with a bottleneck architecture. Each residual unit contains three convolutional layers in the order 1 × 1 convolution, 3 × 3 convolution, and 1 × 1 convolution. After each convolution, batch normalization and a rectified linear unit (ReLU) activation are applied. To modify the standard residual block structure, the exponential linear unit (ELU) replaces ReLU [47, 48]. The ELU reduces neuron bias shift while avoiding vanishing gradients and yields faster convergence. The equation of the ELU activation function is as follows:

$$f(x) = \begin{cases} x, & x > 0 \\ \alpha\left(e^{x} - 1\right), & x \le 0 \end{cases} \tag{1}$$
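As an illustration of this activation swap, the sketch below replaces every ReLU in a pretrained torchvision ResNet-50 with ELU. It is an assumed implementation, not the authors' code, and the helper name is illustrative.

```python
import torch.nn as nn
import torchvision.models as models

def replace_relu_with_elu(module, alpha=1.0):
    """Recursively replace ReLU activations with ELU in a pretrained model."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, nn.ELU(alpha=alpha, inplace=True))
        else:
            replace_relu_with_elu(child, alpha)

# Newer torchvision versions use the `weights=` argument instead of `pretrained=True`.
backbone = models.resnet50(pretrained=True)
replace_relu_with_elu(backbone)
```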
The details of the modified ResNet-50 are shown in Figures 3 and 4, which present the specific structure of the bottleneck mechanism and the specific structure of each residual layer of ResNet-50, respectively.


In addition, the last average pooling layer is replaced with a global average pooling layer, and a fully connected layer (Dense1) with a dimension of 1,024 is added. The high dimensionality of Dense1 renders the model susceptible to overfitting the dataset, a problem PoseNet addresses with a dropout strategy. To solve this problem here, four LSTM layers with dimension 128 are inserted after Dense1. Two fully connected layers are then added separately to regress the 3D translation x and the 4D orientation q. The details of the network configuration are shown in Table 1.
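A PyTorch sketch of the overall network is given below. The class name is illustrative, and the way the 1,024-dimensional Dense1 vector is reshaped into a sequence for the LSTM (here, 8 steps of 128 features) is an assumption made for illustration; the paper's exact arrangement of the four LSTM layers may differ.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class PoseRegressionNet(nn.Module):
    """Illustrative sketch: modified ResNet-50 backbone + LSTM + two FC heads."""

    def __init__(self, lstm_hidden=128, lstm_layers=4):
        super().__init__()
        backbone = models.resnet50(pretrained=True)
        # (The ReLU -> ELU swap from the earlier sketch would be applied here.)
        # Keep everything up to and including the last residual block,
        # drop the original average pooling and classification layers.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.gap = nn.AdaptiveAvgPool2d(1)                  # global average pooling
        self.dense1 = nn.Linear(2048, 1024)                 # Dense1
        self.lstm = nn.LSTM(input_size=128, hidden_size=lstm_hidden,
                            num_layers=lstm_layers, batch_first=True)
        self.fc_trans = nn.Linear(lstm_hidden, 3)           # 3D translation x
        self.fc_rot = nn.Linear(lstm_hidden, 4)             # 4D quaternion q

    def forward(self, img):
        f = self.gap(self.features(img)).flatten(1)         # (B, 2048)
        f = self.dense1(f)                                  # (B, 1024)
        # Assumed reshape: treat the 1,024-d vector as 8 steps of 128 features.
        f = f.view(f.size(0), 8, 128)
        out, _ = self.lstm(f)
        h = out[:, -1]                                      # last time step
        return self.fc_trans(h), self.fc_rot(h)
```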
2.3. Memory Network
In PoseNet, the pose estimate is generated completely independently for each labeled image; PoseNet therefore cannot remember or produce complete trajectories. However, when the dataset consists of temporally ordered image sequences, a large amount of pose information is available, and the motion trajectory of the whole image stream can be predicted by exploiting temporal patterns. Within an intact image stream, images with neighboring timestamps share a portion of their image features, which improves the localization confidence at a particular position.
The network uses the LSTM model to capture these shared features and perform the dimensionality reduction work. The LSTM network is an extension of the RNN network, and its overall structure is shown in Figure 5.

The LSTM achieves long-term temporal dependencies through gate mechanisms, including an input gate, a forget gate, and an output gate, as well as a memory cell that stores historical information. The input sequence of the LSTM in this paper is the 7D pose sequence $(x_1, x_2, \dots, x_T)$, and the corresponding output sequence is $(y_1, y_2, \dots, y_T)$. The main computation of the LSTM can be divided into three steps: forget, input, and output. Assume the current time step is $t$, the previous hidden state is $h_{t-1}$, the state of the memory cell is $c_{t-1}$, and the input is $x_t$. In each step, the memory cell and hidden state are updated. The three steps are derived as follows.
The main function of the forget gate is to determine which information should be retained and which should be forgotten. The output of the forget gate is a value between 0 and 1, which controls whether the information at the corresponding position in the memory cell is forgotten. Assuming the inputs of the forget gate are $h_{t-1}$ and $x_t$ and the output is $f_t$, the forget gate is computed as follows:

$$f_t = \sigma\left(W_f \cdot \left[h_{t-1}, x_t\right] + b_f\right) \tag{2}$$

where $W_f$ and $b_f$ are the weight and bias terms of the forget gate, $\left[h_{t-1}, x_t\right]$ represents the concatenation of $h_{t-1}$ and $x_t$ along the columns, and $\sigma$ denotes the sigmoid function.
The role of the input gate $i_t$ is to control which information needs to be updated in the memory cell. The candidate memory cell state $\tilde{c}_t$ represents the information from the current input that can be added to the memory cell state. The current memory cell state $c_t$ is determined by the previous memory cell state $c_{t-1}$, the forget gate $f_t$, and the input gate $i_t$. The output gate $o_t$ determines how much information is passed to the next time step through the current output $h_t$. The overall workflow of the LSTM can be represented as follows:

$$\begin{aligned} i_t &= \sigma\left(W_i \cdot \left[h_{t-1}, x_t\right] + b_i\right) \\ \tilde{c}_t &= \tanh\left(W_c \cdot \left[h_{t-1}, x_t\right] + b_c\right) \\ c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\ o_t &= \sigma\left(W_o \cdot \left[h_{t-1}, x_t\right] + b_o\right) \\ h_t &= o_t \odot \tanh\left(c_t\right) \end{aligned} \tag{3}$$
In the above formulas, $W_i$, $W_c$, and $W_o$ represent the weight matrices of the input gate, cell state, and output gate, respectively, and $b_i$, $b_c$, and $b_o$ represent the corresponding bias vectors. The $\sigma$ represents the sigmoid function, which maps inputs to values between 0 and 1, and $\tanh$ represents the hyperbolic tangent function, which maps inputs to values between −1 and 1.
The input to the LSTM is composed of a series of feature vectors output from the previous fully connected layer. In order to convert the output of each time step into a 7D vector representing the 7D pose, a fully connected layer is added. The specific formula is as follows:

$$y_t = g\left(W_y h_t + b_y\right) \tag{4}$$

where $W_y$ is a weight matrix of size $7 \times d_h$ (with $d_h$ the LSTM hidden dimension), $b_y$ is a bias vector, and $g(\cdot)$ denotes the pose regression mapping. The final output of the whole LSTM model is the sequence $(y_1, y_2, \dots, y_T)$.
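For reference, the gate equations above can be written out directly. The following NumPy sketch implements one LSTM time step following Equations (2) and (3); all names are illustrative, and the weight and bias containers are assumptions for exposition.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following Equations (2) and (3).

    W and b are dicts holding the weight matrices and bias vectors
    for the forget (f), input (i), cell (c), and output (o) gates.
    """
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])         # forget gate, Eq. (2)
    i_t = sigmoid(W["i"] @ z + b["i"])         # input gate
    c_hat = np.tanh(W["c"] @ z + b["c"])       # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat           # cell state update
    o_t = sigmoid(W["o"] @ z + b["o"])         # output gate
    h_t = o_t * np.tanh(c_t)                   # hidden state
    return h_t, c_t
```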
2.4. Loss Function of the Network
The prediction network takes images with pose labels as input and outputs estimates of the translation $\hat{x}$ and orientation $\hat{q}$; the label values are obtained from camera motion capture or other means. Here $x = \left[x_1, x_2, x_3\right]$ denotes the 3D camera translation, with $\hat{x}$ being the translation predicted by the network, and the quaternion $q = \left[q_1, q_2, q_3, q_4\right]$ denotes the 3D camera orientation, with $\hat{q}$ the predicted orientation.
The camera position can be estimated by forming a continuous and injective loss in Euclidean space. The translation error can be expressed in terms of the L2 norm:

$$L_x = \left\| x - \hat{x} \right\|_2 \tag{5}$$
Since quaternions are convenient to interpolate and have no singularities, they are chosen instead of Euler angles to record the robot's orientation in frameworks such as ROS and in open SLAM datasets. In this paper, the 4D quaternion is normalized to unit length to better express the rotation angle. The orientation loss function can be expressed as:

$$L_q = \left\| q - \frac{\hat{q}}{\left\| \hat{q} \right\|} \right\|_2 \tag{6}$$
The final weighted linear loss function for position and orientation is as follows:

$$L = L_x + \beta L_q \tag{7}$$
This loss function learns the camera's translation and orientation simultaneously, where β is a hyperparameter that balances the expected magnitudes of the translation and orientation errors. Using a single neural network to regress translation and orientation separately was found to give poor results, because each individual network lacked the information tying the translation and orientation regressors together; Equation (7) is therefore used to couple the orientation and position regression. In our tests, the value of β is larger for outdoor scenes because their position errors are relatively large, and smaller for indoor scenes. The specific β value can be tuned by grid search.
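A minimal PyTorch sketch of the weighted loss in Equation (7) is given below; the class name and the default β value are illustrative assumptions, with β tuned per scene as described above.

```python
import torch
import torch.nn as nn

class PoseLoss(nn.Module):
    """Weighted translation + orientation loss, cf. Equations (5)-(7)."""

    def __init__(self, beta=500.0):   # placeholder beta; tuned per dataset (larger outdoors)
        super().__init__()
        self.beta = beta

    def forward(self, t_pred, q_pred, t_gt, q_gt):
        loss_t = torch.norm(t_gt - t_pred, dim=1).mean()                 # Eq. (5)
        q_pred_unit = q_pred / q_pred.norm(dim=1, keepdim=True)          # unit quaternion
        loss_q = torch.norm(q_gt - q_pred_unit, dim=1).mean()            # Eq. (6)
        return loss_t + self.beta * loss_q                               # Eq. (7)
```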
2.5. Evaluation Dataset
Three datasets were selected to evaluate our method: the Cambridge Landmarks outdoor dataset, the Microsoft 7-Scenes indoor dataset, and the TUM Handheld SLAM dataset with real trajectories. Two sets of experiments were conducted on these datasets. Each sample consists of an RGB image and its 7D pose label, whose format is shown in Equation (8):

$$p = \left[x_1, x_2, x_3, q_1, q_2, q_3, q_4\right] \tag{8}$$
2.5.1. Cambridge Landmarks Dataset
As shown in Figure 6, the Cambridge Landmarks dataset consists of images of outdoor scenes around the University of Cambridge taken with a smartphone at 1,920 × 1,080 pixels. Several photographers walked different paths at the same time, and the SFM algorithm was used to compute the ground-truth position labels. In this dataset, the complex environment causes perceptual confusion between images, making monocular camera localization challenging.

2.5.2. Microsoft 7-Scenes Dataset
The Microsoft 7-Scenes dataset comprises images captured at seven locations in the same office building with a handheld Kinect RGB-D camera. All images in the dataset are 640 × 480 pixels, and each subdataset is a sequence of consecutive images in a separate office. Each subset is recorded with the same camera at different motion speeds. These scenes contain motion blur, perceptual confusion, and few texture features, which significantly interfere with the localization task. The Microsoft 7-Scenes dataset is shown in Figure 7.

2.5.3. TUM Handheld SLAM
Figure 8 shows the TUM Handheld SLAM series of datasets, published by TUM's Computer Vision Lab. The datasets contain 8-bit three-channel color images and 16-bit single-channel depth images with a scale factor of 5,000. The captured images exhibit motion blur, perceptual overlap, and an obscured field of view (e.g., the camera lens is partially covered), all of which greatly affect monocular camera localization. Furthermore, the ground truth was captured by an external motion capture system and is used as the reference for the error calculation in experiment II.

The first two well-known datasets are mainly used to compare localization accuracy with other CNN-based localization methods in experiment I, and the last dataset is used to compare trajectories in experiment II.
3. Results and Discussion
To further support the above analysis, experiments were conducted on indoor and outdoor scenes, respectively. The proposed method achieves good performance on the Cambridge Landmarks dataset in outdoor scenes and improves localization accuracy on the Microsoft 7-Scenes dataset in complex indoor scenes. In addition, the trajectory error of the proposed algorithm is reduced compared with ORB-SLAM2 on the TUM Handheld SLAM series. ORB-SLAM2 often loses keyframes, resulting in missing trajectories or large trajectory errors, owing to texture-free surfaces in the scene and occasional occlusion of the camera by the hand.
3.1. Training Parameter Setting
The Adam optimizer was used to iteratively update the weights of the training network on the large-scale scene datasets, configured with fixed decay parameters $\beta_1$ and $\beta_2$ and stability constant $\epsilon$. The initial learning rate of the training network is $\lambda = 10^{-4}$, and the batch size is 32. These parameter values are kept fixed throughout the experiments. The prediction network was implemented in the PyTorch framework, and all models were trained on NVIDIA 2060Ti GPUs for up to 20,000 iterations within approximately 6 hr. All datasets were partitioned into a 60% training set and a 40% validation set, with the validation set used exclusively for testing.
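The training configuration can be summarized with the sketch below. The Adam decay parameters are left at the library defaults since their exact values are not restated here, and the `dataset` object (yielding image, translation, quaternion samples) as well as the model and loss classes from the earlier sketches are assumptions.

```python
import itertools
import torch
from torch.utils.data import DataLoader, random_split

# Assumed: PoseRegressionNet and PoseLoss from the earlier sketches, and a
# `dataset` object yielding (image, translation, quaternion) samples.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = PoseRegressionNet().to(device)
criterion = PoseLoss(beta=500.0)                           # beta tuned per scene
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # library-default betas/eps

n_train = int(0.6 * len(dataset))                          # 60/40 train/validation split
train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])
loader = DataLoader(train_set, batch_size=32, shuffle=True)

# Train for a fixed number of iterations (20,000 in the paper).
for step, (img, t_gt, q_gt) in zip(range(20000), itertools.cycle(loader)):
    img, t_gt, q_gt = img.to(device), t_gt.to(device), q_gt.to(device)
    t_pred, q_pred = model(img)
    loss = criterion(t_pred, q_pred, t_gt, q_gt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```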
3.2. Experiments on the Indoor and Outdoor Datasets
In this section, the experiment performed on the Cambridge Landmarks dataset and the Microsoft 7-Scenes dataset will be introduced. The evaluation scheme proposed in PoseNet was adopted to ensure fairness in the evaluation results. Table 2 compares the localization accuracy of our proposed method with current advanced CNN localization methods, including PoseNet, Bayesian PoseNet, and VidLoc. As VidLoc only provides experimental data for the Microsoft 7-Scenes dataset, it is only included in the experimental comparisons for that dataset.
As shown in Table 2, the localization accuracy of the proposed method is improved compared with the other algorithms thanks to the use of ResNet and LSTM. On the Cambridge Landmarks dataset (outdoor scenes), the proposed method outperforms PoseNet by almost 50% in translation error and 15.6% in orientation error. In addition, compared with Bayesian PoseNet, the average translation accuracy is improved by 30.2% and the orientation accuracy by 28.5%.
As shown in Table 3, on the Microsoft 7-Scenes dataset (indoor scenes), the proposed algorithm outperforms PoseNet by nearly 50% in translation error and 15.6% in orientation error. In comparison with Bayesian PoseNet, the average translation accuracy is improved by 53.2% and the orientation accuracy by 10.6%. A reduction of up to 12% in positional error is obtained compared to VidLoc.
3.3. Experiments on the TUM Handheld SLAM Dataset
In these experiments, the localization accuracy of different algorithms is compared, and the effectiveness of our algorithm in predicting global motion trajectories from sequential images is verified. Unlike PoseNet, which generates each pose estimate independently for different labeled images, our algorithm avoids trajectory misalignment when training on complete trajectories of sequential images. To evaluate our algorithm, it was compared with the state-of-the-art V-SLAM algorithm ORB-SLAM2 with regard to global motion trajectory and motion accuracy.
In the test, five subdatasets were selected from the TUM Handheld SLAM public dataset, covering different numbers of image frames and both fast and slow camera movements during shooting. The dataset has diverse structures and textures, and the camera is occluded by the hand at some moments, all of which present significant challenges for camera relocalization. Furthermore, this dataset provides real trajectory data, which are used as the benchmark in the experiment to compare ORB-SLAM2 and our algorithm and analyze the trajectory error.
The global motion trajectory evaluation was performed on the TUM public dataset freiburg3_long_office_household, whose sequence length is 5,990. Figure 9 shows the global motion trajectories of our algorithm, the ORB-SLAM2 algorithm, and the ground truth. Figure 9(a) shows the motion trajectories of ORB-SLAM2 and our algorithm, and Figure 9(b) shows the 6-DoF pose estimates for the corresponding x, y, z, roll, pitch, and yaw. Figure 9(a) shows that our algorithm tracks the ground-truth camera trajectory more closely in terms of 6-DoF pose estimation, and the 6-DoF estimates in Figure 9(b) further demonstrate that our localization results are smooth and accurate. From the experimental results, it can be observed that the true camera trajectory is tracked more effectively by our algorithm, which is essential for relocalization and loop closure detection. The experimental results show that our algorithm can both predict the pose of a single image and learn to predict the global trajectory of camera motion more accurately than the traditional V-SLAM method.

To investigate whether CNNs alone can capture global trajectory information in trajectory prediction, ablation experiments were conducted. In these experiments, the ResNet network is isolated from the proposed trajectory prediction algorithm and compared with the full algorithm and with PoseNet. The purpose of the ablation experiment is to evaluate the contribution of the ResNet network to trajectory prediction accuracy and to verify whether CNNs can capture global trajectory information.
Through the ablation experiments, the trajectories of these three algorithms on freiburg3_long_office_household were obtained, as shown in Figure 10. It can be seen from Figure 10 that the trajectories predicted using the ResNet and PoseNet networks alone are worse than that of the proposed algorithm. This indicates that CNN networks alone are unable to capture global trajectory information in trajectory prediction. By contrast, the algorithm proposed in this paper captures global trajectory information better while retaining local trajectory information.

Furthermore, global motion localization accuracy was compared on the TUM public dataset freiburg1_desk, with a sequence length of 2,337. The absolute positional error curves of the motion trajectories obtained by our algorithm and the ORB-SLAM2 algorithm are given in Figure 11. The experimental results show that the ORB-SLAM2 algorithm is vulnerable to noise and other uncontrollable factors and gives poor pose estimates when using only a single image. This experiment verifies the effectiveness and necessity of our algorithm for global relocalization on sequential images, reducing the perceptual confusion problem and improving localization accuracy.
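The per-frame absolute positional error used for such curves can be computed as the Euclidean distance between estimated and ground-truth positions after timestamp association. A simplified NumPy sketch is shown below; it omits the trajectory alignment step that full absolute-trajectory-error evaluation tools perform, and the function name is illustrative.

```python
import numpy as np

def absolute_position_error(est_xyz, gt_xyz):
    """Per-frame Euclidean distance between estimated and ground-truth positions.

    est_xyz, gt_xyz: (N, 3) arrays of timestamp-associated positions.
    Note: a full ATE evaluation would first align the two trajectories
    (e.g., with a rigid-body or Sim(3) fit); that step is omitted here.
    """
    err = np.linalg.norm(est_xyz - gt_xyz, axis=1)
    return err, np.sqrt(np.mean(err ** 2))   # per-frame errors and RMSE
```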

Figure 12 shows line graphs of the relocalization errors (translation and orientation errors, respectively) for PoseNet and the algorithm in this paper. Figure 12(a) is the training error distribution on the freiburg3_long_office_household dataset, and Figure 12(b) is the training error distribution on the freiburg2_large_no_loop dataset. Statistically, the position error of the proposed algorithm lies mainly within 0–1.3 m, whereas less than 25% of PoseNet's errors fall within this range. The orientation error of the proposed algorithm lies primarily within 0°–25°, whereas this percentage is less than 15% for PoseNet. In addition, PoseNet produces some large errors, such as position errors of more than 5 m and angular errors of more than 100°, which suggest that it suffers from perceptual confusion during pose estimation.

4. Conclusions
This paper proposes a new neural network architecture that addresses the robustness and efficiency problems of mobile robot relocalization in indoor and outdoor environments. In our pipeline, the images are uniformly cropped, and the mean and standard deviation of the pixel intensities in each dataset are calculated to reduce preprocessing time. A fine-tuned ResNet-50 is then used for parameter optimization in order to obtain more appropriate weights. Finally, the LSTM structure is introduced to store the global motion trajectory and select the most valuable features for real-time mobile robot position prediction. We demonstrate that our method outperforms other monocular RGB localization methods and is robust to disturbances such as illumination changes and camera occlusions. In addition, it is able to store global motion trajectories and is superior to conventional V-SLAM localization algorithms.
In future work, we will focus on using deep learning methods to predict camera pose and pose errors from other sensors by sharing network layers. We also intend to apply the model in DRL robot navigation algorithms to improve perception and the success rate of robot navigation; in essence, it would assist robot reinforcement learning decision algorithms such as DQN, PPO, and so forth.
Data Availability
The datasets used or analyzed during the current study are available from the corresponding author on reasonable request.
Disclosure
An earlier version of this work was presented as a conference paper titled "An End-to-End Robotic Visual Localization Algorithm Based on Deep Learning" at the 2023 International Joint Conference on Neural Networks (IJCNN).
Conflicts of Interest
The authors declare no conflict of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (NSFC, No. 61702320).