Abstract

Six-degree-of-freedom (6D) pose estimation of objects is important for robot manipulation but challenging when dealing with occluded and textureless objects. To address this challenge, this paper presents a robust end-to-end network for real-time 6D pose estimation of rigid objects from a single RGB image. A fully convolutional network with a feature pyramid is developed that effectively boosts the accuracy of the pixelwise labels and the unit direction vector field that take part in the voting process for object keypoint estimation. The network further takes into account the distance between each pixel and each keypoint, which helps select accurate hypotheses in the RANSAC process and avoids hypothesis deviations caused by direction-vector errors at pixels far from the keypoints. A vectorial distance regularization loss function is used so that Perspective-n-Point can establish 2D-3D correspondences between the 3D object keypoints and their estimated 2D counterparts. Experiments are performed on the widely used LINEMOD and occlusion LINEMOD datasets with the ADD (-S) and 2D projection evaluation metrics. The results show that our method improves pose estimation performance compared to the state of the art while still achieving real-time efficiency.

1. Introduction

6D object pose estimation is challenging due to occlusion and the textureless surfaces of objects, and it becomes even more challenging when estimating poses from a single RGB image rather than from RGB-D or stereo images. The 6D pose, consisting of a 3D rotation and a 3D translation, is the rigid transformation from the coordinate frame of the rigid object to the coordinate frame of the camera. This transformation can be written as a rigid transformation matrix $[R \mid \mathbf{t}]$, where $R \in SO(3)$ and $\mathbf{t} \in \mathbb{R}^{3}$. With the advancement of robot manipulation, navigation, self-driving cars, and augmented reality, 6D object pose estimation has attracted extensive research interest. In the literature, some single-shot approaches regress the 6D pose directly from image coordinates [1], but they are not effective in occluded environments. Recently, two-stage methods, which detect keypoints and then apply Perspective-n-Point (PnP), have shown progress in this field of research.
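As a concrete illustration (standard homogeneous-coordinate notation, not an equation taken from this paper), the pose can be written as an element of $SE(3)$ and applied to a model point as follows:

```latex
T =
\begin{bmatrix}
R & \mathbf{t} \\
\mathbf{0}^{\top} & 1
\end{bmatrix} \in SE(3),
\qquad
\begin{bmatrix} \mathbf{x}_{\mathrm{cam}} \\ 1 \end{bmatrix}
= T \begin{bmatrix} \mathbf{x}_{\mathrm{obj}} \\ 1 \end{bmatrix}.
```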

Some of these two-stage approaches [2–4] detect keypoints first by regressing their image coordinates and then calculate the 6D poses, but these keypoints are sparse, so the networks remain sensitive to occlusion. Approaches such as [5, 6] apply postrefinement to the 6DoF poses, for example, the iterative closest point (ICP) algorithm [7], after calculating initial 6DoF object poses using deep learning. In recent years, vector field-based keypoint voting methods [5, 8] have tackled occlusion effectively even without postrefinement. We use the vector field-based keypoint voting approach. These methods introduce pixelwise voting: a vector field in which each pixel votes for keypoints detected using RANSAC [9], followed by 6D pose estimation using Perspective-n-Point (PnP) [10]. Keypoint localization is achieved through the hypotheses with the highest voting score in the unit vector field [8]; however, these methods do not take into account the distances between the object's pixels and the object's keypoints, so small errors in the direction vectors of pixels far from a keypoint can cause hypotheses to deviate. Handling errors in the direction vectors that arise from the distances between pixels and keypoints is useful and is considered by [7, 11, 12]. These studies take different approaches to the problem, all based on PVNet's RANSAC-based voting for keypoint estimation and the PnP solution. Reference [7] proposes atrous spatial pyramid pooling for capturing global context and distance-filtered pixel voting (ASPP-DF-PVNet) to account for the distances between pixels and keypoints. The attention voting network of [12] incorporates a channel-level attention module for adaptive feature fusion, called the adaptive fusion attention module (AFAM), into U-Net and accounts for the distances between pixels and keypoints using a prior distance augmented loss (PDAL).

Following these recent approaches, an end-to-end convolutional neural network (CNN) is used that takes a 2D RGB image in which occlusions occur among objects and estimates the 3D translation and 3D rotation, that is, the 6DoF pose of the objects. The CNN, based on fully convolutional networks (FCN) [13] with a feature pyramid resembling the pyramid scene parsing network (PSPNet) [14], is developed for pixelwise labeling and unit vector field generation, which brings robustness to occlusion, as most errors occur due to incorrect pixel labels and direction vectors. The unit vectors vote to localize object keypoints in a RANSAC hypothesis space, and a vector-based distance voting regularization loss function is incorporated, which helps select accurate hypotheses during voting. Finally, PnP calculates the 6D poses of the objects using 2D-3D correspondences between the 3D model keypoints and their estimated 2D counterparts in the RGB image.

The loss function considers the distances between pixels and keypoints to reduce the hypothesis deviations that occur due to inaccurate direction vectors. Because the RANSAC-like voting process involves sampling and inlier search, it is not differentiable, which makes it difficult to train the network end-to-end with the voted keypoints and their ground truths. For that reason, a proxy hypothesis is employed, closely related to [11] but with a different computation technique, the vectorial distance voting loss, which calculates and approximates for each pixel the distance deviation between the voted keypoint and its ground truth so that pixels produce hypotheses close to their ground-truth keypoints. The proposed robust end-to-end network produces better results under heavy occlusion. An illustration of the system is given in Figure 1. Our contributions are as follows:
(i) an end-to-end CNN for 6D object pose estimation with robust pixelwise labeling that produces an accurate vector field for voting for object keypoints;
(ii) accounting for the distances between pixels and keypoints to avoid errors caused by inaccurate direction vectors;
(iii) a vectorial distance loss for the pixel-to-keypoint distance, which is generalizable to any number of dimensions.

Experiments performed on the LINEMOD and occlusion LINEMOD datasets, which are widely used in this area of research, show strong performance. These datasets are specifically designed for 6D object pose estimation from RGB images. Our end-to-end network runs in real time when estimating 6D object poses and achieves high accuracy in cluttered scenes compared to the state of the art. Our method does not perform any postrefinement of the 6DoF object pose.

2. Related Work

This section presents previous related work on 6DoF pose estimation based on a single RGB image. 6D object pose estimation has been achieved using different approaches over the years.

2.1. Template-Based Methods

In this approach, a rigid template is used to scan the image, a similarity score is calculated at each location, and these scores are compared to obtain the best match. References [15–21] are conventional methods based on template matching. In 6D pose estimation, a template is usually obtained by rendering the corresponding 3D model. Some deep learning-based object detection approaches, originally designed for 2D object detection, have also been employed for template matching and extended to 6D pose estimation [13, 22]. This approach works well for detecting textureless objects but does not work well in cluttered environments where some objects are occluded. However, [16] attempts to detect 3D objects under occlusion as well, through multiple modalities, using a dense depth map together with the input image.

2.2. Feature/Keypoint-Based Methods

This approach extracts points of interest or keypoints from images as features to detect the object and then establishes 2D-3D correspondences between the object and its 3D model to obtain the 6D pose. References [23, 24] are traditional feature-based approaches that rely on feature engineering and are sensitive to translation, scale, and other variations in the scene. Feature-based methods are good at handling occlusions but need textured objects for feature extraction. Several deep learning methods [4, 25–27] have been used to learn detection features for both textured and textureless objects. A few conventional approaches directly regress pixels to 3D object coordinates for 2D-3D correspondences [28, 29]. Similarly, [30] is a deep learning method for the same task, but these approaches require RGB-D data to regress the 3D coordinates and to sidestep the symmetry problem. Since local features can be extracted either from keypoints or from object pixels in the image, some methods do not regress the pose directly from images; instead, they define sets of semantic keypoints and use deep neural networks to detect them. This approach is a two-stage process that performs semantic segmentation and then predicts 2D keypoints on the object surface, from which it estimates the 6D pose via 2D-3D correspondences using Perspective-n-Point. BB8 [29] generates pixelwise labels for objects and regresses keypoints from each object to predict 3D bounding boxes. References [31–33] regress the 3D coordinates of objects directly from images and then use PnP on the 2D-3D correspondences between the objects and their respective models for the final poses. Reference [3] predicts the 2D projections of the corners of the 3D bounding box around the objects; its feature maps are of fixed size, so it cannot handle occlusions well. A few methods address the occlusion problem by producing pixelwise heatmaps of keypoints [4, 34].

2.3. Voting-Based Methods

In these methods, pixelwise labeling and pixelwise voting are performed together for 2D object detection and for finding the key features needed for the 2D-3D correspondences that yield the final 6D poses. References [35–37] use the Hough voting scheme, and [28, 38] use random forests to predict the pixels' 3D object coordinates. PoseCNN [5] uses semantic segmentation to localize objects in the RGB image, finds the object center by estimating a vector field pointing towards it, employs Hough voting for center prediction, and then predicts the depth to obtain object poses. Similar to PoseCNN, [6] employs semantic segmentation and an object center point but uses a dense approach for the final rotation quaternions; the 6D poses of the objects are regressed by a subnetwork. PoseCNN also uses depth information and ICP [6] to refine the estimated poses. DOPE [39] does not apply postalignment; it uses a simple deep network architecture to infer the 2D image coordinates of the projected 3D bounding boxes and then applies Perspective-n-Point (PnP) [10]. DOPE recovers the final 3D translation and 3D rotation, that is, the 6D pose of the object with respect to the camera, from the detected projected vertices of the bounding box rather than from only an estimated centroid. The system is trained entirely on simulated data to avoid the generalization problem in PoseCNN, which arises from high correlation in the real data. PVNet [8], a two-stage deep learning network, votes for several keypoints of interest on each object. Using pixelwise labels and unit vectors from each object pixel, RANSAC-based voting hypotheses [9] are employed to find keypoints, and PnP is then applied for the final pose estimation. A total of 8 keypoints are selected for each object using the farthest point sampling (FPS) algorithm on the object's 3D model. DPVL [11] uses a similar approach to PVNet but considers the distance between object pixels and object keypoints. As the RANSAC-like voting process is difficult to differentiate, it uses a proxy hypothesis to calculate and approximate for each pixel the distance deviation between the voted keypoints and their respective ground truths. ASPP-DF-PVNet [7] considers global context using atrous spatial pyramid pooling as well as the distances between pixels and keypoints. He et al. [12] incorporate a channel-level attention module for adaptive feature fusion into U-Net and account for the distances between pixels and keypoints using a prior distance augmented loss. Another related architecture, based on a channel spatial attention network (CSA6D), is proposed by Chen and Gu [40] to estimate the 6D object pose from RGB-D images.

3. Proposed Method

In this paper, an end-to-end network for 6DoF object pose estimation is proposed, which is effective in cluttered environments. The purpose of this work is to handle occlusion, texturelessness, and symmetry so that the method can be used for robot manipulation. A voting-based approach is adopted because it is robust to occlusions and view changes. 6DoF object pose estimation consists of detecting objects in an RGB image and estimating their 3D translation and orientation. A CNN based on FCN [13] with a feature pyramid is used for pixel labeling and vector field prediction, and a voting loss based on a vectorial distance is incorporated to select accurate hypotheses in the voting process. The RGB input image is passed to the network, which detects the objects and accurately calculates the 6D pose, that is, the 3D rotation and 3D translation, without any postrefinement. Here, we assume that the objects are rigid and that their 3D models are available. Our method first performs pixelwise classification and vector field prediction, then votes for 2D keypoints on the object body from the vector fields, and finally estimates the 6DoF pose by solving a PnP problem. Because a smooth $\ell_1$ loss is used to learn the unit vectors and this loss does not consider the distances between pixels and keypoints, small errors in the vectors can later lead to large hypothesis deviations. That is why we take the distance between a pixel and a keypoint into account, in order to avoid large hypothesis deviation errors.

We adopt a two-stage pipeline similar to [3, 5, 8, 39], that is, semantic segmentation followed by estimation of the 3D orientation and 3D translation, which together complete the 6D object pose estimation process.

3.1. Semantic Segmentation and Unit Vectors

Inspired by FCN [13], our proposed multiclass semantic segmentation architecture follows a similar approach, exploiting ResNet-50 v2 [41] as the backbone and using multiple scales of feature maps that then generate the pixelwise classification and the vector fields. This pixel labeling and pixel voting network takes an $H \times W \times 3$ RGB image as input and outputs a tensor of the same spatial dimensions whose last dimension equals the number of classes ($m$ for $m$ classes), together with an $H \times W \times 2K$ tensor (where $K$ is the number of keypoints) for the unit vectors. To avoid problems caused by small receptive fields in the early stages, our network leverages a large receptive field, as all of its layers are convolutional. The pixelwise classification and unit vector field prediction network is shown in Figure 2.

Taking the $H \times W \times 3$ RGB image as input, ResNet-50 v2 performs max-pooling twice to obtain feature maps at 1/4 of the original input resolution. Additional sets of feature maps are generated by successive layers, resulting in feature maps at progressively smaller fractions of the original input resolution. We further improve the semantic segmentation network by processing each feature map with another convolutional layer, an approach similar to PSPNet [14]. The feature pyramid is generated from the output of the ResNet. To match the size of the first set of feature maps, the feature pyramid is upsampled and concatenated, after which two transposed convolutions of 256 filters each are applied to recover the original image size. Finally, a transposed convolution with as many filters as object classes, followed by a softmax, generates the pixelwise prediction. To obtain the unit vectors along with the class probabilities, we apply a convolution to the final feature map.
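A minimal sketch of how such a segmentation-plus-vector-field head could be assembled is shown below. It is an assumption-laden illustration, not the authors' exact design: torchvision's ResNet-50 (v1) stands in for ResNet-50 v2, and the pyramid pool sizes, channel counts, and upsampling schedule are illustrative choices.

```python
# Hypothetical sketch of the pixelwise-labeling / vector-field network described above.
# Pool sizes, channel counts, and the upsampling schedule are assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class SegVectorFieldNet(nn.Module):
    def __init__(self, num_classes: int, num_keypoints: int):
        super().__init__()
        backbone = resnet50()                      # ResNet-50 backbone (stand-in for ResNet-50 v2)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                  backbone.maxpool, backbone.layer1)   # 1/4 resolution
        self.layer2 = backbone.layer2              # 1/8
        self.layer3 = backbone.layer3              # 1/16
        self.layer4 = backbone.layer4              # 1/32
        # PSPNet-style pyramid pooling on the deepest features
        self.pyramid = nn.ModuleList([nn.Sequential(nn.AdaptiveAvgPool2d(s),
                                                    nn.Conv2d(2048, 256, 1)) for s in (1, 2, 3, 6)])
        self.fuse = nn.Conv2d(2048 + 4 * 256, 256, 3, padding=1)
        # Two transposed convolutions followed by interpolation back to the input size
        self.up1 = nn.ConvTranspose2d(256, 256, 4, stride=2, padding=1)
        self.up2 = nn.ConvTranspose2d(256, 256, 4, stride=2, padding=1)
        # Heads: per-pixel class scores and 2D unit vectors towards each keypoint
        self.seg_head = nn.Conv2d(256, num_classes, 1)
        self.vec_head = nn.Conv2d(256, 2 * num_keypoints, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        f = self.layer4(self.layer3(self.layer2(self.stem(x))))        # deepest feature map
        pooled = [F.interpolate(p(f), size=f.shape[2:], mode="bilinear",
                                align_corners=False) for p in self.pyramid]
        f = F.relu(self.fuse(torch.cat([f] + pooled, dim=1)))
        f = F.relu(self.up1(f))
        f = F.relu(self.up2(f))
        f = F.interpolate(f, size=(h, w), mode="bilinear", align_corners=False)
        return self.seg_head(f), self.vec_head(f)   # class logits, unit-vector field

if __name__ == "__main__":
    net = SegVectorFieldNet(num_classes=14, num_keypoints=8)   # e.g., 13 objects + background, 8 keypoints
    seg, vec = net(torch.randn(1, 3, 480, 640))
    print(seg.shape, vec.shape)                                 # (1, 14, 480, 640), (1, 16, 480, 640)
```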

Using a simpler, more basic CNN architecture for semantic pixel labeling would improve the speed of the system by some margin but would decrease its accuracy accordingly. The proposed architecture for semantic segmentation and vector field generation is robust to occlusion and is inspired by [13, 14] and [8].

3.2. Object Detection and Pose Estimation

After processing the image to obtain the pixelwise classification and unit vectors, our network predicts the 2D locations of the 3D keypoints using RANSAC, from which the pose can be obtained using the EPnP algorithm. PoseCNN [5] uses Hough voting to find the object center, while PVNet [8] and DPVL [11] use RANSAC-based voting for keypoint localization. This is a two-stage process that is robust to occlusion, symmetry, and textureless objects. The first stage locates the 2D projections of predefined 3D keypoints associated with the 3D object models, where keypoint localization is implemented through RANSAC based on the pixel labels and the vector field representation. In the second stage, the 6D object pose is estimated using PnP. Figure 2 shows the complete proposed 6DoF object pose estimation method.
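For readers following the pipeline, the final PnP step can be performed with an off-the-shelf solver. The snippet below is an illustrative sketch using OpenCV's EPnP; the keypoints, pose, and camera intrinsics are synthetic placeholders, not values from the paper or the datasets.

```python
# Illustrative final PnP step: recover R, t from 2D-3D keypoint correspondences with EPnP.
# Keypoints, pose, and intrinsics below are synthetic placeholders, not values from the paper.
import numpy as np
import cv2

cam_K = np.array([[600.0, 0.0, 320.0],
                  [0.0, 600.0, 240.0],
                  [0.0, 0.0, 1.0]])
object_points = np.random.rand(8, 3)                      # 8 model keypoints (placeholder)
R_true, _ = cv2.Rodrigues(np.array([0.1, -0.2, 0.3]))     # a synthetic ground-truth pose
t_true = np.array([0.05, -0.02, 1.0])
proj = (object_points @ R_true.T + t_true) @ cam_K.T
image_points = proj[:, :2] / proj[:, 2:3]                 # their (noise-free) 2D projections

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, cam_K,
                              distCoeffs=None, flags=cv2.SOLVEPNP_EPNP)
if ok:
    R_est, _ = cv2.Rodrigues(rvec)                        # 3x3 rotation matrix
    print("Estimated R:\n", R_est, "\nEstimated t:", tvec.ravel())
```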

3.2.1. Keypoint Localization

Here, we first present the vector field representation, which consists of unit vectors pointing from each pixel towards each keypoint, and then the keypoint localization. To handle varying object sizes during detection, the vector field of direction vectors is estimated with a large receptive field that covers large parts of the objects. Because of this, even invisible keypoints can be inferred by the network from the visible parts. Here, $v_k(p) = (g_k - p) / \lVert g_k - p \rVert_2$ is the unit direction vector from pixel $p$ to a specific keypoint $g_k$.

Keypoint hypotheses can be generated from the semantic labels and unit vectors in a RANSAC-based voting scheme [8]. Given a keypoint $g_k$ and its corresponding direction vectors, we generate hypotheses for keypoint $g_k$, that is, $\{h_{k,i}\}$. We initially consider all intersections of any two direction unit vectors as candidate points for the final keypoint selection. The hypothesis deviation caused by an error in a predicted direction vector depends on both the angle and the distance between a pixel and a keypoint: if a pixel is far from a keypoint, even a small angular error can generate a large hypothesis deviation. Finally, all direction unit vectors in the generated vector field vote for a keypoint hypothesis whenever the angle between the predicted direction and the direction from the pixel to the hypothesis is below a certain threshold. The candidate points receiving the most votes are taken as keypoint hypotheses; in this way, voting directions deviating by a large angle from a hypothesis are removed. Following PVNet [8], the voting score is given as
$$w_{k,i} = \sum_{p \in O} \mathbb{1}\!\left( \frac{(h_{k,i} - p)^{T}}{\lVert h_{k,i} - p \rVert_{2}}\, v_k(p) \geq \theta \right),$$
where $w_{k,i}$ is the voting score for the hypothesis $h_{k,i}$ of the 2D keypoint $g_k$, $O$ is the mask of the object, $v_k(p)$ is the predicted unit direction vector at pixel $p$, $\mathbb{1}(\cdot)$ is an indicator function that indicates whether pixel $p$ votes for a keypoint hypothesis or not, and $\theta$ is the threshold.
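A compact NumPy sketch of the voting score above is given below; hypothesis generation by intersecting pixel pairs is omitted, and all names and the threshold value are illustrative assumptions.

```python
# Illustrative computation of the voting score for one keypoint hypothesis.
# Arrays and the threshold value are assumptions for demonstration only.
import numpy as np

def voting_score(hypothesis, pixels, unit_vectors, theta=0.99):
    """Count mask pixels whose predicted direction agrees with the hypothesis.

    hypothesis   : (2,)   candidate 2D keypoint location
    pixels       : (N, 2) coordinates of pixels belonging to the object mask
    unit_vectors : (N, 2) predicted unit direction vectors for those pixels
    theta        : cosine-similarity threshold for accepting a vote
    """
    to_hyp = hypothesis[None, :] - pixels                         # vectors pixel -> hypothesis
    to_hyp /= (np.linalg.norm(to_hyp, axis=1, keepdims=True) + 1e-8)
    cos_sim = np.sum(to_hyp * unit_vectors, axis=1)               # agreement with predicted directions
    return int(np.sum(cos_sim >= theta))                          # number of inlier votes

# Toy usage: three pixels voting for a keypoint at (50, 50) with perfect direction vectors
pix = np.array([[10.0, 10.0], [90.0, 20.0], [30.0, 80.0]])
vec = np.array([50.0, 50.0]) - pix
vec /= np.linalg.norm(vec, axis=1, keepdims=True)
print(voting_score(np.array([50.0, 50.0]), pix, vec))             # -> 3
```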

3.2.2. Loss Function

Assume an image $I$ and the keypoint locations $\{g_k\}_{k=1}^{K}$, where $K$ is the number of selected keypoints on the surface of the object. The smooth $\ell_1$ loss [42] between the predicted and ground-truth direction vectors is used to regress the direction vectors as
$$\mathcal{L}_{vf} = \sum_{k=1}^{K} \sum_{p \in O} \ell_{1}^{\text{smooth}}\!\left( \tilde{v}_k(p) - v_k(p) \right),$$
where $\mathcal{L}_{vf}$ is the loss of the vector field, $\tilde{v}_k(p)$ is the predicted direction vector, and $O$ is the mask of the object. Our network estimates vector fields in a similar way. The pixelwise segmentation labeling is achieved through a softmax cross-entropy loss function as
$$\mathcal{L}_{seg} = -\sum_{p} \sum_{c=1}^{m} y_{c}(p) \log \hat{y}_{c}(p),$$
where $y_c(p)$ is the ground-truth label and $\hat{y}_c(p)$ is the predicted class probability for pixel $p$.
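The two training losses above can be sketched in PyTorch as follows. The tensor layouts, the masking scheme, and the normalization are assumptions for illustration, not the exact implementation.

```python
# Hedged sketch of the segmentation and vector-field losses described above.
# Tensor layouts (B, 2K, H, W) for vectors and (B, C, H, W) for logits are assumptions.
import torch
import torch.nn.functional as F

def vector_field_loss(pred_vec, gt_vec, mask):
    """Smooth-L1 loss on direction vectors, restricted to object (mask) pixels."""
    weight = mask.unsqueeze(1).float()                 # (B, 1, H, W) -> broadcast over 2K channels
    loss = F.smooth_l1_loss(pred_vec * weight, gt_vec * weight, reduction="sum")
    return loss / (weight.sum() + 1e-6)

def segmentation_loss(logits, labels):
    """Softmax cross-entropy over per-pixel class labels."""
    return F.cross_entropy(logits, labels)

# Toy example with random tensors
B, K, C, H, W = 2, 8, 14, 32, 32
pred_vec = torch.randn(B, 2 * K, H, W)
gt_vec = torch.randn(B, 2 * K, H, W)
mask = torch.rand(B, H, W) > 0.5
logits = torch.randn(B, C, H, W)
labels = torch.randint(0, C, (B, H, W))
print(vector_field_loss(pred_vec, gt_vec, mask).item(),
      segmentation_loss(logits, labels).item())
```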

Even small errors in the estimated direction vectors can cause large hypothesis deviations, which degrade pose estimation performance. We therefore constrain the hypothesis distribution, enforcing all hypotheses to lie closer to the actual keypoints and thus produce fewer errors. To learn the distance between a keypoint $g_k$ and the foot of the perpendicular from $g_k$ onto the ray cast from pixel $p$ along its direction vector, we use a proxy hypothesis that is differentiable. It is given as
$$f_k(p) = p + \left( (g_k - p)^{T} \hat{v}_k(p) \right) \hat{v}_k(p).$$

Here, $f_k(p)$ is the foot of the perpendicular. The distances between all feet of perpendiculars $f_k(p)$ and the keypoints $g_k$ need to be minimized to achieve accurate hypotheses for keypoint voting, which is accomplished by minimizing the loss function as follows:
$$\mathcal{L}_{dr} = \frac{1}{K\,\lvert O \rvert} \sum_{k=1}^{K} \sum_{p \in O} \ell_{1}^{\text{smooth}}\!\left( \lVert g_k - f_k(p) \rVert_{2};\, \beta \right),$$
where $\hat{v}_k(p)$ is the estimated unit vector, $p$ is the pixel with its coordinates, $g_k$ is the keypoint with its coordinates, and $\beta$ is a parameter. The vector-based formulation of the distance is generalizable to any number of dimensions.
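A hedged PyTorch sketch of this vectorial distance regularization is shown below: for each object pixel it penalizes the perpendicular distance from the ground-truth keypoint to the ray cast along the predicted direction. Function names, shapes, and the smooth-L1 choice are illustrative assumptions.

```python
# Hedged sketch of the vectorial distance regularization: penalize, per object pixel,
# the perpendicular distance from the ground-truth keypoint to the predicted voting ray.
import torch
import torch.nn.functional as F

def vectorial_distance_loss(pred_vec, pixels, keypoints, beta=1.0):
    """
    pred_vec  : (N, K, 2) predicted direction vectors for N object pixels and K keypoints
    pixels    : (N, 2)    pixel coordinates
    keypoints : (N, K, 2) ground-truth keypoint coordinates (broadcast per pixel)
    """
    v = F.normalize(pred_vec, dim=-1)                        # network output may not be unit length
    rel = keypoints - pixels.unsqueeze(1)                    # vectors pixel -> keypoint, (N, K, 2)
    proj = (rel * v).sum(dim=-1, keepdim=True) * v           # projection onto the predicted direction
    foot = pixels.unsqueeze(1) + proj                        # foot of the perpendicular f_k(p)
    dist = torch.norm(keypoints - foot, dim=-1)              # deviation of the voted ray from g_k
    return F.smooth_l1_loss(dist, torch.zeros_like(dist), beta=beta)

# Toy usage with 3 pixels and 2 keypoints
pix = torch.rand(3, 2) * 100
kps = torch.rand(3, 2, 2) * 100
vec = torch.randn(3, 2, 2)
print(vectorial_distance_loss(vec, pix, kps).item())
```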

Here, $\tilde{v}_k(p)$ is the raw output of the unit vector estimation by our network; since this output may not be a unit vector, a normalization operation is involved, $\hat{v}_k(p) = \tilde{v}_k(p) / \lVert \tilde{v}_k(p) \rVert_2$. Owing to $\mathcal{L}_{dr}$, the distance regularization voting loss, the network points correctly to the keypoints because of its sensitivity to pixel locations. The final objective is calculated as follows:
$$\mathcal{L} = \mathcal{L}_{seg} + \lambda_{1} \mathcal{L}_{vf} + \lambda_{2} \mathcal{L}_{dr},$$
where $\mathcal{L}$ is the total loss and $\lambda_{1}$ and $\lambda_{2}$ are hyperparameters that manage the trade-off between pixel labeling and vector field estimation.

Our method starts with the pixelwise labeling and pixelwise unit vectors discussed in Section 3.1, where the unit vectors, trained with the vector field loss, take part in the voting process for keypoint localization using RANSAC; the loss for pixel-to-keypoint distances described in Section 3.2.2 is then applied. Finally, PnP is used on the correspondences between the 2D object keypoints and the 3D object model keypoints to calculate the final 6D object poses, that is, the 3D translation and 3D rotation of the rigid objects. Section 3.2.3 gives further details of the implementation.

3.2.3. Implementation Details

Following [8, 11], 8 keypoints are selected for each object using farthest point sampling on its 3D model, with the keypoint set initialized at the object center. We apply data augmentation following [8] to avoid overfitting; the random cropping introduces slight truncation in some images. Other processing includes color jittering, rotation, width shift, height shift, shear, zoom, channel shift, and horizontal flip. In training, $\lambda_1$ and $\lambda_2$ are set to similar values. During the experiments, $\beta$ is increased gradually at first and then by a factor of 1.1 each epoch. The initial learning rate that provides the best results according to DPVL is adopted and decayed gradually by a factor of 0.75, with a total of 100 training epochs. The Adam optimizer is employed. We train our method on the LINEMOD dataset and do not perform any postrefinement operations. Our method runs in real time on an RTX 2080 Ti GPU at the given input image resolution.
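Farthest point sampling on the model vertices, as used to pick the 8 keypoints, can be sketched as follows. Seeding the sampling from the object center follows the text; the rest is a generic FPS implementation with illustrative names.

```python
# Generic farthest point sampling sketch for selecting K keypoints on a 3D model.
# Initializing the set with the object center follows the description above; details are illustrative.
import numpy as np

def farthest_point_sampling(vertices, k=8):
    """vertices: (N, 3) model vertices; returns (k, 3) well-spread surface keypoints."""
    center = vertices.mean(axis=0)
    dists = np.linalg.norm(vertices - center, axis=1)    # initialize the set with the object center
    selected = []
    for _ in range(k):
        idx = np.argmax(dists)                           # farthest vertex from the current set
        selected.append(vertices[idx])
        dists = np.minimum(dists, np.linalg.norm(vertices - vertices[idx], axis=1))
    return np.stack(selected)

print(farthest_point_sampling(np.random.rand(1000, 3)).shape)    # -> (8, 3)
```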

4. Results and Discussion

This section presents the experiments, results, comparisons, and discussion of our method against other related 6D object pose estimation methods using state-of-the-art datasets and evaluation metrics.

4.1. Datasets

Popular 6D pose datasets have been used to conduct the experiments for this work. The proposed method has been trained on LINEMOD [18] and evaluated on both the LINEMOD and the occlusion LINEMOD [28] datasets. The LINEMOD dataset consists of 15,783 images of 13 objects, with about 1,200 instances of each object together with a mask; each object is provided with its respective 3D model. The occlusion LINEMOD dataset consists of 8 objects and 1,214 images with occlusions, which makes it more challenging. These datasets have been reported extensively in research articles for comparative analysis of 6D object pose estimation.

4.2. Evaluation Metrics

Two evaluation metrics, the 2D projection metric [29] and the ADD score metric [18], have been used to evaluate our method. The 2D projection metric uses the estimated and ground-truth poses to calculate the average 2D distance between the projections of the model points; an estimated pose is considered correct if this distance is less than 5 pixels:
$$e_{\text{proj}} = \frac{1}{N} \sum_{\mathbf{x} \in \mathcal{M}} \left\lVert \pi\!\left( K (R\mathbf{x} + \mathbf{t}) \right) - \pi\!\left( K (\tilde{R}\mathbf{x} + \tilde{\mathbf{t}}) \right) \right\rVert_{2},$$
where $N$ is the total number of points on the 3D object model $\mathcal{M}$, $\mathbf{x}$ is a point on the surface of the 3D object model, $R$ and $\mathbf{t}$ are the ground-truth rotation and translation, $\tilde{R}$ and $\tilde{\mathbf{t}}$ are the estimated rotation and translation that transform the point, $K$ is the camera's intrinsic parameter matrix, and $\pi(\cdot)$ denotes perspective projection (division by depth).

The ADD score calculates the average 3D distance between the 3D model points transformed by the estimated pose and those transformed by the ground-truth pose. An estimated pose is considered correct if this distance is less than 10 percent of the 3D model's diameter.

ADD (-S) is employed for symmetric objects; the 3D distance is calculated using the closest point distance.
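The three evaluation criteria can be computed as in the following sketch; these are the standard definitions, with thresholds as stated above, and the array names and toy inputs are illustrative.

```python
# Hedged sketch of the evaluation metrics described above (standard definitions).
# Poses are a rotation R (3x3) and translation t (3,); cam_K is the 3x3 intrinsic matrix.
import numpy as np

def add_metric(model_pts, R_gt, t_gt, R_est, t_est):
    """ADD: mean 3D distance between model points under the two poses
    (a pose counts as correct if this is below 10% of the model diameter)."""
    p_gt = model_pts @ R_gt.T + t_gt
    p_est = model_pts @ R_est.T + t_est
    return np.mean(np.linalg.norm(p_gt - p_est, axis=1))

def adds_metric(model_pts, R_gt, t_gt, R_est, t_est):
    """ADD-S for symmetric objects: mean closest-point distance between the transformed sets."""
    p_gt = model_pts @ R_gt.T + t_gt
    p_est = model_pts @ R_est.T + t_est
    pairwise = np.linalg.norm(p_gt[:, None, :] - p_est[None, :, :], axis=-1)
    return np.mean(pairwise.min(axis=1))

def projection_2d_metric(model_pts, R_gt, t_gt, R_est, t_est, cam_K):
    """Mean pixel distance between projected model points (correct if below 5 px)."""
    def project(R, t):
        p = (model_pts @ R.T + t) @ cam_K.T
        return p[:, :2] / p[:, 2:3]
    return np.mean(np.linalg.norm(project(R_gt, t_gt) - project(R_est, t_est), axis=1))

# Toy check: identical poses give zero error for all three metrics.
pts = np.random.rand(300, 3)
R, t = np.eye(3), np.array([0.0, 0.0, 1.0])
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
print(add_metric(pts, R, t, R, t), adds_metric(pts, R, t, R, t),
      projection_2d_metric(pts, R, t, R, t, K))
```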

4.3. Comparisons with State of the Art

The state-of-the-art PoseCNN addressed the shortcomings of template-based and feature-based approaches to 6D pose estimation; however, it requires postrefinement of the final poses for better accuracy. Voting-based keypoint prediction approaches are robust in this regard and can estimate accurate initial 6D poses, so they do not require any postrefinement. Our method follows a similar approach and focuses on providing robust semantic segmentation and vector field prediction, from which it predicts the object keypoints and the distances and angles between each pixel and each keypoint. The robust semantic segmentation provides robustness to occlusions and therefore better accuracy for the final pose estimation, so our method does not need any pose refinement to improve performance. Here, we compare our results with state-of-the-art 6D pose estimation approaches that use a single RGB image. Comparisons are carried out against PVNet [8], DPVL [11], ASPP-DF-PVNet [7], and the PDAL-AFAM approach of He et al. [12], as well as earlier approaches such as PoseCNN [5], SSD6D [1], YOLO6D [3], BB8 [29], CDPN [32], DPOD [31], Pix2Pose [33], and CSA6D [40]. The results are evaluated using the ADD (-S) and 2D projection metrics on the LINEMOD and occlusion LINEMOD datasets.

4.3.1. Comparisons Using ADD (-S) Metric

Table 1 compares our method with the methods mentioned above for pose estimation on the LINEMOD dataset in terms of the ADD (-S) metric. It shows that our method outperforms the state-of-the-art methods and, in particular, our baseline methods PVNet [8] and DPVL [11]; performance improves by a margin of 2.66% over [11] using the ADD (-S) metric. Occluded, textureless, and symmetric objects are the main difficulties for pose estimation systems, and our method's accuracy improves significantly for all of them: for "ape," accuracy improves by 7.22%, and for "duck," by 7.95% in terms of the ADD (-S) score; both "ape" and "duck" are textureless objects. The accuracy for "glue," which is a symmetric object, improves by 1.71%.

Table 2 compares our method with the state-of-the-art approaches on the occlusion LINEMOD dataset in terms of ADD (-S) scores, where our method achieves better overall performance. Our method improves performance on occluded objects by 3.88%; in particular, the accuracy for "glue" under occlusion improves significantly. The overall results show that our proposed method gives the best performance compared to the state-of-the-art approaches. Figure 3 visualizes our method's qualitative results on the occlusion LINEMOD dataset. Our method outperforms PVNet, DPVL, and the variants of DPVL, namely ASPP-DF-PVNet and PDAL-AFAM. The robust semantic segmentation and vector field prediction lead the network to better pose estimation under heavy occlusion.

4.3.2. Comparisons Using 2D Projection Metric

For comparison we include only the results reported by other methods; since some methods do not report 2D projection-based results, they are omitted from Tables 3 and 4. CSA6D [40] reports only 2D projection-based results on the LINEMOD dataset, so only those are included. Table 3 compares our method with a number of other methods for pose estimation on the LINEMOD dataset with respect to the 2D projection metric. Our method provides a 0.28% improvement, showing that it outperforms the state-of-the-art methods as well as our baseline methods PVNet and DPVL. Table 4 shows the results on the occlusion LINEMOD dataset using the 2D projection metric. DPVL does not report these results, but compared to the state-of-the-art ASPP-DF-PVNet [7], our network shows a 1.81% improvement. Table 5 shows the number of wins achieved by our method across all datasets and evaluation metrics, which demonstrates its robustness: the number of wins counts how many times our method achieves the best score, and it beats all the previous methods in the table.

4.4. Ablation Study

The two-stage processes show better results. The results presented in Section 4 show that the pixelwise voting processes [8] are more robust to occlusion than processes that directly regress keypoint coordinates with convolutional neural networks [3]. The proposed method further improves the results by combining a robust pixelwise labeling and vector field prediction network with a distance regularization whose hypotheses account for the vectorial distance error between keypoints and pixels, thereby decreasing the distance error between keypoints and hypotheses. The errors mainly increase due to incorrect pixel labels and direction vectors. Segmenting occluded objects can easily fail if the segmentation network is not robust, especially when the object appears thin from a specific view in the image; an example is the object "glue" when it is partially occluded. The proposed semantic segmentation network is robust to occlusions and can be further optimized by changing the number of filters in the feature pyramid to increase or decrease the number of parameters. The number of levels in the feature pyramid can also be increased or decreased to test its speed and accuracy, and changing the size of the ResNet will affect efficiency and accuracy. The results of our network improve significantly, as shown in Tables 1–4, using the LINEMOD and occlusion LINEMOD datasets with the 2D projection and ADD (-S) metrics for evaluation. Table 5 shows the comparison in terms of the number of wins across both datasets and evaluation metrics against all the state-of-the-art methods. Qualitative visualized results can be seen in Figure 3. Our method also converges faster, requiring only 100 training epochs compared to the 200 epochs PVNet needs for proper convergence, and it shows robustness in reaching a consensus during keypoint voting. Further experiments are needed to achieve an even faster, more accurate, and more scalable network for 6D object pose estimation.

5. Conclusions

A network is presented that performs robust pixelwise classification, votes for keypoints, and finds 2D-3D correspondences for the final object pose estimation. The proposed pixelwise labeling improves the accuracy of the vector field and of the system as a whole. To achieve more accurate keypoint voting, the proposed system considers the distances between pixels and keypoints, which leads to better pose estimation. For this, a vectorial distance-based differentiable loss function is used to solve the problem of hypotheses deviated by pixels that are far from the keypoints. An advantage of the vectorial distance function is that it generalizes to any number of dimensions. The proposed approach also speeds up the convergence of the network during training. The results in Tables 1–4 show the robustness of our model compared to the latest preexisting approaches, and Table 5, in terms of the number of wins, also shows the system's robustness. In future work, we will consider incorporating the suggestions presented in the "Limitations and Future Work" section.

6. Limitations and Future Work

As our work focuses on an efficient, robust system for object pose estimation for robot manipulation, we adopt a robust semantic segmentation network and a vectorized distance function. Recent works such as [8, 11] use ResNet-18 as the backbone network, which provides weaker segmentation; hence, failures occur in the segmentation masks. Our ResNet-50 v2 backbone, combined with the FCN and PSPNet design, produces comparatively more accurate segmentation masks. Due to this more sophisticated semantic segmentation architecture, the real-time speed of the complete pose estimation system is slightly affected, and managing the speed-versus-accuracy trade-off is the key problem that needs to be solved. Any segmentation network, from PSPNet to FC-HarDNet-L2, can be chosen, but FASSD-Net-L1 and FC-HarDNet-L2 are probably better options for managing the trade-off between speed and accuracy; a thorough review of semantic segmentation networks has been presented by Rosas-arias and Benitez-Garcia [43]. One possible solution is to use more powerful GPUs or to perform more experiments to find suitable settings for another model. Training the same network on more data, such as occlusion LINEMOD and new datasets, would also improve accuracy. Using direction vectors to extract pairwise features together with triplet regularization is another avenue worth exploring. Alternative loss functions may affect performance and should be tested, and further postrefinement techniques could also improve the accuracy of the system.

Data Availability

All the data are included within the manuscript.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Science and Technology Planning Project of Guangdong Province of China under Grant 2019A050520001 and the Princess Nourah Bint Abdulrahman University Research Supporting Project (no. PNURSP2022R54).