Abstract

Partial point cloud registration is an important step in generating a full 3D model. Many deep learning-based methods perform well on the registration of complete point clouds but cannot deal effectively with the registration of partial point clouds. Recent methods that seek correspondences over downsampled superpoints show great potential in partial point cloud registration. Building on this idea, this paper proposes a partial-to-partial point cloud registration network based on geometric attention (GAP-Net), which mainly includes a backbone network optimized by a spatial attention module and an overlapping attention module guided by geometric information. The former aggregates the feature information of superpoints, and the latter focuses on superpoint matching in overlapping regions. The experimental results show that the method achieves better registration performance on ModelNet and on ModelLoNet, which has lower overlap. The rotation error is reduced by 14.49% and 17.12%, respectively, indicating that the method is robust to the overlap rate.

1. Introduction

As a key technology in computer vision and robotics, point cloud registration is a fundamental guarantee for accomplishing various downstream tasks, such as 3D reconstruction and simultaneous localization and mapping. Due to the rapid development of LiDAR, sensor technology, and stereo cameras, point clouds have become a very important data format and are widely used in many fields, such as autonomous driving, robotics, medical care, and cultural relics protection [1, 2]. However, the point cloud data of the actual 3D model need to be collected from different perspectives, which are incomplete. When registering, they only have partial correspondences. Therefore, the registration of real-world point clouds is still a challenge.

Since the point cloud registration problem was first posed, it has been studied extensively. Iterative closest point (ICP) [3] is the most classic algorithm. It searches for the closest point between the two point clouds to establish point-pair matches and uses the Euclidean distance between the matched point pairs as the objective function, iterating until the accuracy meets the requirements or the procedure converges. Subsequently, handcrafted descriptors such as point pair feature (PPF) [4], signature of histograms of orientations (SHOT) [5], and rotational projection statistics [6] were designed to find locally invariant features of point clouds to aid transformation estimation for point cloud registration. However, point cloud registration based on traditional methods does not generalize to large amounts of multiclass data.

In recent years, point cloud registration methods based on deep learning have attracted the attention of many scholars. The early deep learning-based point cloud registration network PointNetLK [7] utilizes the PointNet framework to extract the global features of the point cloud, and the Lucas and Kanade (LK) algorithm minimizes the distance between the point cloud features. Deep closest point (DCP) [8] adopts a graph-based approach to learn pointwise features of structures, establishing soft correspondences between point clouds by using rigid-invariant features extracted by an attention mechanism network. PointNetLK and DCP perform well when the input is a full point cloud but cannot handle part-to-part registration scenarios. Subsequently, RPM-Net [9] predicts the correspondence of partial point clouds through the Sinkhorn layer, which can process point clouds with partial visibility. PRNet [10] proposes a partial point cloud registration network for feature-based L2-norm keypoint detection to find common points in input point clouds. At the same time, the method based on feature learning focuses on extracting useful information such as the geometry of point clouds to form discriminative features, which has become the focus of recent scholars. PPFNet [11] introduces PPF features for feature encoding. Fully convolutional geometric features (FCGF) [12] utilizes a 3D fully convolutional network to expand the receptive field and extract geometric features. D3Feat [13] utilizes KPConv [14] to build a fully convolutional encoder–decoder architecture for joint dense detection and description. Based on D3Feat, PREDATOR [15] introduces a module for extracting key points in overlapping regions to establish correspondences. 
The above algorithms show that finding correspondences on downsampled superpoints has great potential for partial point cloud registration and can already achieve good performance. However, extracting the common key points of two partially overlapping point clouds remains a challenging task, as shown in Figure 1. Therefore, this paper combines geometric features with the Transformer, which offers a new approach to finding correspondences on downsampled superpoints.

The accuracy of point cloud registration networks based on key point extraction is highly dependent on the accuracy of superpoint matching, so the extracted superpoints need to capture more global context features. Based on this, this paper proposes an improved partial point cloud registration network (GAP-Net) for accurate point cloud registration. Inspired by Transformer [16], our method employs a Transformer to encode contextual information before the skip connection blocks of the KPConv backbone network. In the overlap geometric attention (OGA) module, the Transformer layer is guided by coordinate and normal information to further aggregate the geometric features of the point cloud and to exchange information between the two point clouds. The main contributions of this paper can be summarized as follows:
(1) A novel framework for partial-to-partial point cloud registration is proposed, which uses a spatial self-attention mechanism to optimize the KPConv backbone to capture the extracted features of each point cloud.
(2) This paper adopts a random expansion strategy for the extracted superpoints to prevent the K-nearest neighbor (KNN) layer from breaking when there are too few superpoints. At the same time, the strategy expands the receptive field to facilitate the extraction of geometric features.
(3) A new geometric self-attention (GSA) module is proposed that uses coordinate and normal feature information to guide the attention mechanism to integrate more global context with the learned geometric features. It allows for information exchange between the two point clouds, so that the subsequent steps can focus on overlapping regions for robust superpoint matching.

2. Related Work
2.1. Traditional Registration Methods

Traditional point cloud registration methods generally include two stages: coarse registration and fine registration. Coarse registration provides a good initial position for fine registration, avoids falling into a local optimal solution during fine registration, and improves the accuracy of fine registration. The fine matching criterion is based on the coarse registration, which minimizes the differences between point clouds, such as spatial position differences, so as to obtain a more accurate rotation and translation matrix. For the commonly used algorithm sample consensus initial alignment [17] in the coarse registration stage, the FPFH feature is used to search for point correspondences, which makes the algorithm insensitive to the initial position of the point cloud. 4PCS [18] randomly selects four coplanar points in the target point cloud as the basic point pair for feature matching and uses the largest common pointset strategy to find the optimal matching point pair in the source point cloud. Super4PCS [19] adopts an intelligent indexing strategy, which reduces the computational complexity of 4PCS. Among the precise registration methods, the most classic ICP algorithm can obtain high-precision registration results and is widely used. However, ICP is very sensitive to initial values and outliers, and it is easy to fall into the local optimal solution. So, a series of variant algorithms, such as GO-ICP [20], are derived. In addition, there are some methods that use probability for registration. The normal distributions transform [21] algorithm determines the optimal transformation relationship between the point clouds to be registered based on optimization theory by discretizing the transformation space and combining the objective function to measure the registration error. 
The coherent point drift [22] algorithm transforms point cloud registration into a probability density estimation problem and uses a Gaussian mixture model and an EM algorithm to complete the registration. However, traditional registration methods have less research on overlapping regions, and they reduce the influence of outliers by dividing corresponding points into inliers and outliers after feature matching. Algorithms to find the correct inliers are RANdom SAmple Consensus (RANSAC) [23], 3DHV [24], etc. But their effect is limited when the proportion of overlapping regions is reduced. Therefore, extracting the points of overlapping regions accurately in the partial point clouds can ensure the registration performance of algorithms such as RANSAC and 3DHV. This paper chooses RANSAC for registration because it is easy to implement.

2.2. Learning-Based Registration Methods

Currently, learning-based methods are popular for registration tasks. DGCNN [25] extracts the feature information of the point cloud through EdgeConv [26]. The EdgeConv proposed by it can extract the local aggregation information of the point cloud under the premise of ensuring that the permutation is unchanged. SiamesePointNet [27] extracts pointwise descriptors directly for registration by introducing the Siamese Point Network, which contains a global shape constraint module and a feature transformation operator. However, some of the initially studied networks assumed that all points in the two point clouds were completely overlapping. Therefore, they are mostly unable to complete the task of partial point cloud registration. OPRNet [28] utilizes the Sinkhorn algorithm for partial registration. OMNet [29] learns masks in a coarse-to-fine manner to reject nonoverlapping regions, which converts the partial-to-partial registration to the registration of the same shapes. ROPNet [30] proposes a context-guided module to extract global features to predict point overlap scores, which are then registered using representative overlapping points with discriminative features. SCANet [31] effectively utilizes global information at different levels by introducing a spatial self-attention aggregation module in the feature extraction part and a channel cross-attention regression module in the pose estimation part for information interaction between the global features of the two point clouds to complete partial point cloud registration. SANet [32] proposes a subtract attention module to aggregate the pointwise features and then obtain the local correspondence between each point to complete the partial point cloud registration. MaskNet++ [33] utilizes spatial self-attention and channel cross-attention mechanisms to extract pointwise features and exchange information, respectively. 
STORM [34] employs EdgeConv and Transformer [16] to map the input points to a feature space, then performs overlap prediction to identify common points, and Transformer to refine the features, finally completing registration. G3DOA [35] proposes an overlap attention that extracts cocontextual information between the feature encodings of two point clouds to construct a feature descriptor suitable for partial point cloud registration. Inspired by PREDATOR [15], CoFiNet [36], which extracts hierarchical correspondences from coarse to fine, and GeoTransformer [37], which learns geometric features by using the designed attention module, both achieve robust matching on downsampled superpoints.

In summary, previous work has validated the potential of matching on downsampled superpoints for partial registration. Local features and information interaction through the attention mechanism can improve registration performance even further. Based on these observations, this paper proposes an optimized partial-to-partial point cloud registration framework to better handle the registration of partial point clouds.

3. Method

3.1. Problem Statement

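Given two point clouds $P = \{p_i \in \mathbb{R}^3 \mid i = 1, \dots, N\}$ and $Q = \{q_j \in \mathbb{R}^3 \mid j = 1, \dots, M\}$, the purpose of point cloud registration is to estimate a rigid transformation $T = \{R, t\}$ to align the two point clouds, where $R \in SO(3)$ is a rotation matrix and $t \in \mathbb{R}^3$ is a translation vector. The rigid transformation can be obtained by solving

$$\min_{R,\,t} \sum_{(p_i,\, q_j) \in C^*} \lVert R\,p_i + t - q_j \rVert_2^2, \tag{1}$$

where $C^*$ is the set of ground truth corresponding point pairs between the point clouds $P$ and $Q$.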
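As a minimal sketch of this objective, the following Python snippet applies a candidate rigid transformation to the source points and sums the squared residuals over a given set of correspondences. The function names (`apply_rigid`, `registration_error`, `rot_z`) and the toy data are illustrative assumptions and are not part of GAP-Net.

```python
import math

def apply_rigid(R, t, p):
    """Return R @ p + t for a 3x3 rotation R (nested lists) and a 3-vector t."""
    return [sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3)]

def registration_error(R, t, src, tgt, pairs):
    """Sum of squared distances between transformed source points and their matches."""
    total = 0.0
    for i, j in pairs:
        q_hat = apply_rigid(R, t, src[i])
        total += sum((a - b) ** 2 for a, b in zip(q_hat, tgt[j]))
    return total

def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# Toy example: the target is the source moved by a known transformation.
R_gt, t_gt = rot_z(math.radians(30.0)), [0.1, -0.2, 0.3]
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
tgt = [apply_rigid(R_gt, t_gt, p) for p in src]
pairs = [(i, i) for i in range(len(src))]

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
err_gt = registration_error(R_gt, t_gt, src, tgt, pairs)
err_id = registration_error(identity, [0.0, 0.0, 0.0], src, tgt, pairs)
```

The ground truth transformation drives the error to (numerically) zero, while the identity transformation leaves a positive residual.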

However, in reality, point clouds are often collected from different perspectives, and they are incomplete. In order to form a complete point cloud of an object or scene, it is necessary to perform part-to-part registration on the two point clouds, which must rely on the information in their overlapping regions. Registration typically follows a two-step path: first establishing the point correspondences between the two point clouds and then estimating the transformation matrix. This paper mainly focuses on the former and establishes the point correspondences in the overlapping regions. To this end, this paper proposes GAP-Net, which takes two point clouds as input, outputs point correspondences, and then uses RANSAC [23] to estimate the rigid transformation.

3.2. Network Architecture

GAP-Net is an encoder–decoder network, as shown in Figure 2. The encoder adopts the KPConv-SSA backbone network proposed in this paper to simultaneously downsample the input point clouds and extract multilevel features. The basic convolution block is composed of a ResNet-like KPConv/strided KPConv layer, an instance norm layer, and a LeakyReLU layer. At the same time, a spatial self-attention block is added before the strided KPConv block for pointwise feature encoding, which can utilize pointwise and global information at different levels. The spatial self-attention block is shown in Figure 3. The spatial self-attention mechanism in this paper consists of three operations: query (Q), key (K), and value (V). Specifically, given a source feature map , the self-attention map is obtained via the softmax function by multiplying the query (Q) in row , the key (K) in column , and the value (V) in column . Second, the attention-based feature map is obtained by concatenating the query (Q) and the attention map, respectively. Finally, update feature is shown in Equation 2. It is worth noting that, for simplicity, the operations of query and key share weights. The decoder consists of upsampling blocks and linear blocks. The upsampling block uses the nearest search for feature interpolation, and the linear block consists of a linear (MLP) layer, an instance norm layer, and a LeakyReLU layer.
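The attention step described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the query and key share one projection matrix (`W_qk`, a hypothetical name), the attention map is a row-wise softmax of their product, and the attended features are the attention map applied to the values. The feature update of Equation 2 is not reproduced here.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(A):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in A:
        m = max(row)
        e = [math.exp(x - m) for x in row]
        s = sum(e)
        out.append([x / s for x in e])
    return out

def spatial_self_attention(F, W_qk, W_v):
    """F is an N x C feature map; returns the N x N attention map and attended features."""
    Q = matmul(F, W_qk)
    K = Q                      # query and key share weights (assumption from the text)
    V = matmul(F, W_v)
    A = softmax_rows(matmul(Q, transpose(K)))
    return A, matmul(A, V)

# Toy example with three points and two feature channels.
F = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_qk = [[0.5, 0.0], [0.0, 0.5]]
W_v = [[1.0, 0.0], [0.0, 1.0]]
A, attended = spatial_self_attention(F, W_qk, W_v)
```

Each row of `A` sums to one, so every output feature is a convex combination of the value vectors of all points.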

GAP-Net takes as input the source point cloud, the target point cloud, and their feature descriptors, which are matrices of corresponding size initialized to 1. First, the encoder performs downsampling and extracts features to obtain superpoints and their corresponding features. Information interaction is then performed in the OGA, guided by geometric information, and the features and scores corresponding to the superpoints are output. Then, the feature and score of each point are obtained through the decoder. Finally, under the guidance of the scores, enough key points are extracted to complete the registration task with RANSAC.

3.3. Overlap Geometric Attention Module

Exploiting attention mechanisms to capture global contextual information has played an important role in many computer vision tasks. At present, some methods use attention to extract features from global context information for point cloud registration. However, these methods usually only exploit the high-level point cloud features provided by attention and neglect to encode the geometric information of the point cloud with attention. Therefore, this paper proposes OGA, an overlapping attention module guided by point cloud geometric information, to capture the geometric structure of point clouds and encode superpoint features. The OGA module is a bridge between the encoder and the decoder, and it mainly consists of a geometric information-guided self-attention module and a feature-based cross-attention module, as shown in Figure 4(a).

3.3.1. Random Dilation Cluster

Inspired by RSKDD [38], before using the attention mechanism to encode the point cloud features, this paper randomly expands the superpoints input to the OGA module. This addresses the problem that the number of superpoints extracted from sparse point clouds during registration may not be enough to support the subsequent KNN operation, which breaks the feature layer and causes the registration to fail. For the superpoints extracted by the encoder, a KNN search needs to be performed for each point in geometric coding. To handle the case where the number of superpoints is too small, this paper adopts the random dilation cluster strategy to generate clusters, as shown in Figure 5. Assume that K nearest neighbors are to be selected for a single cluster with a given expansion rate. This paper first searches for the nearest neighbors of the center point, with the number of candidates enlarged by the expansion rate, and then randomly samples K points from them. Although this strategy is simple, it can effectively avoid the feature layer breakage of superpoints extracted from sparse point clouds when performing geometric encoding.
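The random dilation cluster strategy can be sketched as below. This is a minimal illustration, assuming that the candidate set holds `k * dilation` nearest neighbors (the center itself included) and that sampling falls back to drawing with replacement when fewer than `k` candidates exist. The fallback and the names `random_dilation_cluster`, `k`, and `dilation` are assumptions made for illustration.

```python
import random

def random_dilation_cluster(points, center_idx, k, dilation, rng=random):
    """Return k neighbor indices sampled from the k * dilation nearest points."""
    center = points[center_idx]

    def sq_dist(p):
        return sum((a - b) ** 2 for a, b in zip(p, center))

    # Rank all points by distance to the center (brute force for clarity).
    order = sorted(range(len(points)), key=lambda i: sq_dist(points[i]))
    candidates = order[:min(k * dilation, len(points))]
    if len(candidates) >= k:
        return rng.sample(candidates, k)          # sample without replacement
    # Assumption: with too few superpoints, repeat indices so the cluster
    # still has exactly k members.
    return [rng.choice(candidates) for _ in range(k)]

rng = random.Random(0)
line = [[float(i), 0.0, 0.0] for i in range(20)]
cluster = random_dilation_cluster(line, center_idx=0, k=4, dilation=2, rng=rng)

sparse = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
small_cluster = random_dilation_cluster(sparse, center_idx=1, k=4, dilation=2, rng=rng)
```

Sampling from an enlarged candidate set widens the receptive field of each cluster, and the fixed cluster size keeps downstream layers well defined even for sparse inputs.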

3.3.2. Geometric Self-Attention

As shown in Figure 4(b), the geometry-guided encoding module (GSA) takes superpoints and corresponding latent features as input and outputs geometrically enhanced features. Inspired by RPM-Net [9], the geometric feature of the superpoint is constructed with PPF [4], which can be formulated given as follows:where is the normal vector of a point in the superpoint set obtained via the encoder, it is calculated by averaging the normal calculated by Open3D over its surrounding points in . and is the radius of ’s neighborhood. represents the angle between two vectors. is implemented by PointNet. and is the radius of ’s neighborhood. represents channelwise maximum pooling. Inspired by PREDATOR and CoFiNet, a self-attention mechanism is introduced in the GSA module to further aggregate and enhance their contextual relations and obtain the semantic features output by the encoder as . Then, this paper fuses geometric features and semantic features to generate GSA features:

For computational efficiency, this paper adopts the same architecture as the cross-attention module but acquires features from the same point cloud to implement the self-attention mechanism, i.e., from to .
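To make the geometric feature concrete, the snippet below computes a point pair feature for a center point and one neighbor. It is a sketch that assumes the four-dimensional PPF form used by RPM-Net [9]: three angles (between each normal and the offset vector, and between the two normals) plus the offset length. The PointNet encoding and channelwise max pooling are not included, and all names are illustrative.

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def norm(a):
    return math.sqrt(dot(a, a))

def angle(u, v):
    """Angle in [0, pi] between u and v, using the stable atan2 form."""
    return math.atan2(norm(cross(u, v)), dot(u, v))

def ppf(x_c, n_c, x_i, n_i):
    """Point pair feature of center (x_c, n_c) and neighbor (x_i, n_i)."""
    d = sub(x_i, x_c)
    return (angle(n_c, d), angle(n_i, d), angle(n_c, n_i), norm(d))

feature = ppf([0.0, 0.0, 0.0], [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
```

Because the feature depends only on relative angles and distances, it is unchanged when both points and normals undergo the same rigid motion, which is what makes it suitable for registration.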

3.3.3. Information Interaction

The information interaction module in this paper consists of a cross-attention mechanism for information interaction and another GSA module for explicitly updating the local context. The cross-attention module adopts multihead attention, as shown in Figure 6. For the fusion feature obtained from the previous GSA block, the information of the potential overlap regions is obtained by mixing the feature information of the two point clouds through cross-attention and updating and enhancing the contextual information with the GSA block to complete the information interaction. The features of information interaction are calculated given as follows:where , and is the dimension of the parameter . , and are learnable weight matrices. Therefore, updating the information of a superpoint requires combining the query of that point with the keys and values of all superpoints . is the number of heads, and refers to the subsection “Geometric Self-Attention.”
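The cross-attention step can be illustrated with the following sketch, in which every superpoint feature of one point cloud attends to all superpoint features of the other. It assumes standard scaled dot-product attention with the channels split evenly across heads. The learnable projection matrices are omitted (treated as identity) to keep the example short, so this is not the exact layer used in GAP-Net.

```python
import math

def softmax(scores):
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    total = sum(e)
    return [x / total for x in e]

def cross_attention(F_src, F_tgt, num_heads):
    """Update each source feature with a per-head weighted mix of target features."""
    channels = len(F_src[0])
    assert channels % num_heads == 0, "channels must divide evenly across heads"
    d = channels // num_heads
    out = []
    for f in F_src:
        mixed = []
        for h in range(num_heads):
            q = f[h * d:(h + 1) * d]                      # query slice for this head
            keys = [g[h * d:(h + 1) * d] for g in F_tgt]  # keys double as values here
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
            w = softmax(scores)
            mixed.extend(sum(w[j] * keys[j][i] for j in range(len(keys)))
                         for i in range(d))
        out.append(mixed)
    return out

F_src = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
F_tgt = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
mixed_features = cross_attention(F_src, F_tgt, num_heads=2)
```

Each output channel is a convex combination of the corresponding target channels, so the source superpoints receive information from the other point cloud weighted by feature similarity.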

3.4. Loss Function
3.4.1. Feature Loss

Circle losses for feature descriptors and are computed from the randomly sampled correspondences ( and ) from and :where is the number of the sampled correspondences, denotes distance in feature space, and are positive and negative margins, respectively. The weights and are determined individually for each positive and negative points and is a scale factor. Then, the loss is defined in the same way and the total circle loss for feature descriptors is .
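As a hedged illustration of this loss, the sketch below follows the circle loss form used by PREDATOR [15]: for each sampled anchor, distances to its positives are pushed below a positive margin and distances to its negatives above a negative margin, with per-pair weights proportional to the margin violation. The clamping of weights at zero, the margin values, and the scale factor are assumptions chosen for illustration, not the settings of GAP-Net.

```python
import math

def circle_loss(pos_dists, neg_dists, pos_margin=0.1, neg_margin=1.4, gamma=10.0):
    """Average circle loss over anchors.

    pos_dists[i] and neg_dists[i] hold the feature-space distances from
    anchor i to its positive and negative candidates, respectively.
    """
    total = 0.0
    for d_pos, d_neg in zip(pos_dists, neg_dists):
        # Weight of each pair grows with its margin violation (clamped at zero).
        pos_sum = sum(math.exp(gamma * max(0.0, d - pos_margin) * (d - pos_margin))
                      for d in d_pos)
        neg_sum = sum(math.exp(gamma * max(0.0, neg_margin - d) * (neg_margin - d))
                      for d in d_neg)
        total += math.log(1.0 + pos_sum * neg_sum)
    return total / len(pos_dists)

well_separated = circle_loss([[0.05]], [[1.5]])
poorly_separated = circle_loss([[1.0]], [[0.2]])
```

When every positive is inside its margin and every negative outside, all weights vanish and the loss reaches its floor. Violations increase it rapidly.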

3.4.2. Overlap and Saliency Loss

To supervise key points in the overlap regions, we follow PREDATOR [15] and use the overlap loss and matchability loss. Binary cross-entropy loss is used for overlap loss and saliency loss , i.e.,where and are the ground truth labels of point . Then, and are defined in the same way. The total overlap loss and the total saliency loss are , respectively.
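The binary cross-entropy used for both the overlap and saliency scores can be sketched as follows. The function name, the averaging over points, and the clipping constant `eps` are illustrative assumptions.

```python
import math

def binary_cross_entropy(scores, labels, eps=1e-12):
    """Mean binary cross-entropy between predicted scores in (0, 1) and 0/1 labels."""
    total = 0.0
    for p, y in zip(scores, labels):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(scores)

# Hypothetical overlap scores for four points, two of which lie in the overlap.
labels = [1, 1, 0, 0]
confident = binary_cross_entropy([0.9, 0.9, 0.1, 0.1], labels)
mistaken = binary_cross_entropy([0.1, 0.1, 0.9, 0.9], labels)
```

The same function would be applied to the scores of both point clouds, once with overlap labels and once with saliency labels.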

3.4.3. Combined Loss

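The complete loss function of GAP-Net is given as follows:

$$L = \lambda_c L_c + \lambda_o L_o + \lambda_s L_s,$$

where $L_c$, $L_o$, and $L_s$ are the total circle, overlap, and saliency losses defined above, and $\lambda_c$, $\lambda_o$, and $\lambda_s$ are weighting factors for sample balance.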

4. Experiments

This paper compares GAP-Net with registration methods on the synthetic, object-centric ModelNet40 and ModelLoNet datasets (Section 4.1) and tests it on the Stanford 3D scanning dataset (Section 4.2). These experiments show that the proposed method can be used for partial registration of point clouds. Furthermore, this paper compares GAP-Net with registration methods on indoor scene point clouds, 3DMatch and 3DLoMatch (Section 4.3), showing that our method is not limited to simple geometric objects but can also be used for large-scale scene point cloud registration.

4.1. ModelNet40 and ModelLoNet
4.1.1. Dataset

ModelNet40 is a widely used point cloud registration dataset consisting of 9,843 CAD models of 40 different object categories for training and 2,468 models for testing. This paper uses 5,112 models for training, 1,202 models for validation, and 1,266 models for testing according to RPM-Net [9]. For a given point cloud, first copy the point cloud and randomly generate a rotation within (0°, 45°) and a translation within (−0.5, 0.5). Then, in order to generate partially overlapping point clouds, we randomly crop along one direction, retaining about 70% of the points. A further 50% downsampling was performed to retain 717 points. In addition to generating a ModelNet with an average pairwise overlap of 73.5%, this paper also generates a ModelLoNet with a lower (53.6%) average overlap according to PREDATOR [15] by retaining about 50% of the points when cropping and then randomly sampling the 717 points that remain in the end. The network was trained by the SGD optimizer, and the network parameters were updated on Intel(R) Xeon(R) CPU E3-1230 V2 3.3 GHz and NVIDIA GeForce GTX 1080 Ti GPU.
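The cropping and downsampling steps can be sketched as follows. The example assumes an input of 2,048 points, which is not stated above but is consistent with the figures given: retaining 70% leaves about 1,434 points, and halving that leaves 717. The half-space cropping rule (keep the points with the largest projection onto a random direction) and the function names are likewise assumptions.

```python
import math
import random

def random_unit_vector(rng):
    """Draw a direction uniformly on the sphere by normalizing a Gaussian sample."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        n = math.sqrt(sum(x * x for x in v))
        if n > 1e-9:
            return [x / n for x in v]

def crop_along_direction(points, keep_ratio, rng):
    """Keep the keep_ratio fraction of points lying furthest along a random direction."""
    d = random_unit_vector(rng)
    ranked = sorted(points, key=lambda p: sum(a * b for a, b in zip(p, d)), reverse=True)
    return ranked[:int(round(keep_ratio * len(points)))]

def subsample(points, n, rng):
    """Randomly keep n points without replacement."""
    return rng.sample(points, n)

rng = random.Random(0)
cloud = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(2048)]
cropped = crop_along_direction(cloud, 0.7, rng)   # about 70% of the points
partial = subsample(cropped, 717, rng)            # final 717-point partial cloud
```

Applying the same procedure independently to the source and the transformed copy yields a pair of clouds that overlap only partially.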

4.1.2. Metrics

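This paper evaluates the registration using the relative rotation error (RRE) and relative translation error (RTE) proposed in RPM-Net, together with the improved chamfer distance (CD):

$$\mathrm{RRE} = \arccos\left(\frac{\operatorname{tr}\left(R_{gt}^{\top} R\right) - 1}{2}\right), \qquad \mathrm{RTE} = \lVert t_{gt} - t \rVert_2,$$

where $\{R, t\}$ and $\{R_{gt}, t_{gt}\}$ represent the predicted and ground truth transformations, respectively, and $\operatorname{tr}(\cdot)$ represents the trace of the matrix.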
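A minimal sketch of the two error metrics is given below, assuming the usual trace-based rotation angle and the Euclidean distance between translations. The chamfer distance is omitted, and the function names are illustrative.

```python
import math

def relative_rotation_error(R_pred, R_gt):
    """Rotation angle (degrees) of R_gt^T R_pred, computed from its trace."""
    trace = sum(R_gt[k][i] * R_pred[k][i] for i in range(3) for k in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # guard against rounding
    return math.degrees(math.acos(cos_angle))

def relative_translation_error(t_pred, t_gt):
    """Euclidean distance between predicted and ground truth translations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_pred, t_gt)))

def rot_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rre = relative_rotation_error(rot_z(math.radians(30.0)), identity)
rte = relative_translation_error([0.0, 0.0, 0.0], [3.0, 4.0, 0.0])
```

In this toy case the prediction is off by a 30° rotation about the z axis and by a translation of length 5, which the two functions recover.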

4.1.3. Comparisons

This paper compares GAP-Net with DCP [8], RPM-Net, and PREDATOR, and the experimental results are shown in Table 1. Obviously, GAP-Net outperforms existing methods on ModelNet. GAP-Net’s RRE is reduced by 13.14% when compared to the next-best-performing RPM-Net on ModelNet. Furthermore, on the low-overlap ModelLoNet dataset, it not only outperforms RPM-Net, a method specially tuned for ModelNet, in terms of rotation–translation error by a large margin, but also outperforms PREDATOR, a method specially tuned for low-overlap point cloud registration. GAP-Net’s RRE is reduced by 17.12% when compared to the next-best-performing PREDATOR on ModelLoNet. This shows that GAP-Net is state-of-the-art in partial registration, especially robust in low-overlap states. Example results of our method on partially visible data are shown in Figure 7.

4.1.4. Relative Overlap Rate

In order to test the registration performance of GAP-Net under different overlap rates, this paper conducts a set of experiments with different cropping rates on the complete ModelNet40 point cloud dataset. The cropping retention rate ranges from 70% to 40%, for a total of seven settings. There are 1,266 test pairs in each group, and the test results are shown in Figure 8. The results show that the registration performance of all three networks is at a high level when the crop retention rate is reduced from 70% to 60%. When the crop retention rate is reduced to 50%, the registration performance of RPM-Net is already significantly lower than that of GAP-Net and PREDATOR. When the crop retention rate is reduced from 50% to 40%, the RPM-Net error increases sharply. This is because, when 50% or more of each point cloud is cropped away, the proportion of overlapping regions decreases sharply, and some pairs have no overlapping regions at all. In this case, the extracted features are poorly discriminative, which confuses the model. The relative rotation errors of GAP-Net and PREDATOR remain within 10°, with GAP-Net performing slightly better. In conclusion, GAP-Net outperforms RPM-Net and PREDATOR in partial registration on ModelNet40 and is robust to changes in crop retention, whereas the partial registration performance of RPM-Net decreases rapidly as the crop retention ratio, and hence the relative overlap rate, is reduced.

4.1.5. Ablations Study

To better understand the importance of the SSA component and the proposed OGA module, this paper conducts ablation experiments on these two modules on the ModelNet and ModelLoNet datasets. The experimental results are shown in Table 2. GAP-Net is first compared with a baseline model in which the SSA component and the proposed OGA module are completely removed. The baseline model achieves an RRE of 1.91° and 5.405° on ModelNet and ModelLoNet, respectively. By adding the SSA component, the RRE is reduced by 0.116° and 0.552° on ModelNet and ModelLoNet, corresponding to relative reductions of 6.07% and 10.21%, respectively. This indicates that GAP-Net benefits from the spatial self-attention aggregation (SSA) module, which effectively utilizes the internal and global information of each point cloud at different levels, so that all three metrics improve on both datasets. Taking this as a new baseline model, three different combinations of the GSA and CA components in the OGA module were added. The combination of GSA and CA achieved RRE, RTE, and CD errors on the ModelNet dataset that were second only to those of GAP-Net. Its gap to the better-performing combinations on the ModelLoNet dataset is also small, suggesting that the GSA component used to update the local context before upsampling further improves performance. In addition, compared with the new baseline, adding only the GSA component in the OGA module reduces the RRE by 0.158° and 0.056° on ModelNet and ModelLoNet, respectively, and adding only the CA component reduces the RRE by 0.056° and 0.1° on ModelNet and ModelLoNet, respectively. This suggests that both components improve network performance to some extent: the GSA component, a self-attention mechanism guided by geometric encoding, fuses the extracted features to further enhance their contextual relationship, and the CA component mixes the feature information of the two point clouds to obtain information on potential overlapping regions. Therefore, GAP-Net, which combines these four parts, achieves the best overall performance.
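The relative reductions quoted for the SSA component follow from the absolute RRE values of the baseline, as this short check shows (the left-hand ratio is for ModelNet and the right-hand ratio for ModelLoNet):

```latex
\frac{0.116^\circ}{1.91^\circ} \approx 6.07\%,
\qquad
\frac{0.552^\circ}{5.405^\circ} \approx 10.21\%
```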

4.2. Stanford 3D Scanning
4.2.1. Dataset

This paper uses the Stanford 3D scanning dataset to test the generalization of GAP-Net. Compared to the synthetic ModelNet40 dataset, it is a real-world dataset. For partial registration, the partially overlapping point clouds are generated by randomly cropping about 30% of the points in different directions from two identical point clouds, and then the source and target point clouds are generated by rotation and translation for testing. The model trained on the ModelNet40 dataset is directly used here.

4.2.2. Metrics and Experiments

This paper uses RRE and RTE to evaluate the registration quality. The registration results are shown in Figure 9. As the visualizations show, although these object categories did not appear during training, GAP-Net still performs very well on objects in the Stanford dataset. In terms of both RRE and RTE, the registration errors on objects in the Stanford dataset are within the test error range on ModelNet and ModelLoNet. This shows that our method generalizes well.

4.3. 3DMatch and 3DLoMatch
4.3.1. Implementation Details

Due to the large scale of the 3DMatch indoor scene point cloud, a group of basic convolution blocks are added at the front end of the network, and corresponding upsampling layers are added to increase the number of network layers to extract features. The experiment was performed on a computer with Intel(R) Core(TM) i9-10980XE CPU @3.00 GHz and NVIDIA GeForce RTX 3090 GPU.

4.3.2. Dataset

3DMatch contains 62 scenes, of which 46 are for training, eight for validation, and eight for testing. This paper conducts experiments using 3DMatch and 3DLoMatch as preprocessed in PREDATOR [15], which contain scene pairs with >30% and 10%–30% overlap, respectively. This paper adopts registration recall (RR) as the main metric, since RR corresponds to the actual goal of point cloud registration. RR is the fraction of point cloud pairs for which the root mean square error of the estimated transformation compared to the ground truth is less than 0.2 m.
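Registration recall reduces to a simple thresholded fraction, sketched below. The per-pair error values are made up for illustration, and the function name is an assumption.

```python
def registration_recall(rmse_per_pair, threshold=0.2):
    """Fraction of point cloud pairs whose RMSE is strictly below the threshold."""
    if not rmse_per_pair:
        raise ValueError("at least one point cloud pair is required")
    hits = sum(1 for e in rmse_per_pair if e < threshold)
    return hits / len(rmse_per_pair)

# Hypothetical RMSE values for four scene pairs.
recall = registration_recall([0.05, 0.19, 0.2, 0.5])
```

Here two of the four pairs fall strictly below the threshold, so the recall is 0.5.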

4.3.3. Comparisons

This paper compares GAP-Net with other feature-based registration methods: 3DSN [39], FCGF [12], D3Feat [13], and PREDATOR, as shown in Table 3. From the results, GAP-Net performs only slightly worse than PREDATOR on 3DMatch, with registration recall at most 1.3% lower. On 3DLoMatch it is noticeably worse than PREDATOR, but the gap in registration recall remains within about 5%. GAP-Net still performs better than the other feature-based registration methods. An example result of our method on partially visible data is shown in Figure 10; the registration quality remains good in the partial registration of numerous scenes.

5. Conclusion

This paper proposes GAP-Net, a partial point cloud registration network. A backbone network optimized using a spatial attention module is proposed to efficiently utilize the internal and global information of each point cloud at different levels. This paper also proposes an overlapping attention module based on geometric information for inferring points in overlapping regions. Experiments on the ModelNet and ModelLoNet point cloud data show that our model has higher registration accuracy than state-of-the-art methods. In addition, the experiments on 3DMatch and 3DLoMatch scene point cloud data show that our method is also applicable to partial registration of large-scale scene point clouds. In future work, we will study how to adaptively select geometric information for different types of point cloud data, so that the method performs better on scene point cloud data.

Data Availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

We gratefully acknowledge that this work was supported in part by the National Natural Science Foundation of China under grant no. 61960206010 and by the Science and Technology Support Program of Sichuan Province under grant no. 2021YJ0080.