Abstract

Space motion control is an important issue for space robots, rendezvous and docking, small-satellite formation, and other on-orbit services. Motion control requires robust object detection and high-precision object localization. Among the many sensing systems, such as laser radar, inertial sensors, and GPS navigation, vision-based navigation is better suited to noncontact applications at close range and in highly dynamic environments. In this work, a vision-based system serving a free-floating robot inside a spacecraft is introduced, and a method to measure the 6-DOF position and attitude of a space body is presented. First, a deep-learning method is applied for robust object detection against a complex background; after the robot has navigated to close range, a reference marker is used for more precise matching and edge detection. Once accurate coordinates are obtained from the image sequence, the object's spatial position and attitude are calculated by a geometric method and used for fine control. The experimental results show that combining deep-learning-based recognition at a distance with marker matching at close range effectively eliminates false target recognition and at the same time improves the positioning precision. Testing shows a recognition accuracy of 99.8% and a localization error well below 1% within 1.5 meters. A high-speed camera and a GPU-driven embedded electronic platform accelerate the image processing, so the system runs at up to 70 frames per second. The contribution of this work is to introduce the deep-learning method into precision motion control while ensuring both the robustness and the real-time performance of the system, with the aim of making such vision-based systems more practicable in real space applications.

1. Introduction

Space programs on space robots, debris removal, rendezvous and docking, satellite formation, and other on-orbit service applications all involve the technology of moving-body control [1–4]. The precondition for moving-body control is first to acquire the body's movement information, such as inertia, position, attitude, and velocity. Figure 1 shows several examples of on-orbit service applications with vision systems.

Generally, there are many techniques for measuring the relative position and attitude between two objects. Sensors such as GPS, gyroscopes, accelerometers, and star sensors are commonly used for self-navigation, and the position information is exchanged between the objects by wireless communication. Optical-electronic sensors such as laser radar and vision-based systems may be more suitable for measuring relative position and attitude when the two objects are at close range, especially for autonomous vehicles or aircraft [5–7]. In addition, vision-based systems built on computer vision are also widely used for object localization in industrial manufacturing lines, medical instruments, and other intelligent applications. Since vision-based measurement is low cost and flexible to set up, vision systems are increasingly applied to space body control.

Vision systems also come in many schemes, such as monocular vision [8], stereo vision [9], and active vision with structured light [10]. Besides, active cameras such as Flash LIDARs can be used to detect unknown objects [11]. For noncooperative system localization, stereo vision and monocular vision [12] can both recognize unknown objects by edge detection and feature matching. Active vision with structured light obtains 3D information as the structured light scans the object surface, so it can not only help to recognize the object but also reconstruct the 3D shape of an unknown object. Although monocular vision has more difficulty handling unknown objects, its precision and speed are no worse than those of stereo vision or active vision in applications with a known target and environment.

This work is part of an on-orbit service application. A free-floating robot can move inside the spacecraft, and its functions include routine inspection, astronaut assistance, and autonomous docking and charging. Many similar programs have been carried out on satellites or the Space Station, such as SPHERES [13], SCAMP [14], Mini AERCam [15], and Astrobee [16], which are shown in Figure 2. This kind of robot does not perform space orbit control but only serves relative movement inside the spacecraft; however, if the environmental suitability permits, it can also work outside the spacecraft. In this work [17], a vision navigation camera is mounted on the robot, and another camera is fixed at the docking place to recognize the robot. These two kinds of vision system have the same function of measuring relative position and attitude.

The main problem for the positioning system is that the complex background and lighting environment may disturb image recognition. Another problem is the requirement for real-time processing speed and high precision in the control system. To resolve these problems, firstly, a deep-learning method is introduced for robust object detection and localization in the image sequence. Secondly, after the object is detected, the geometry of the object position and attitude calculation is solved by the P4P (perspective-4-point) method, and an explicit solution is obtained. An embedded electronic platform driven by a GPU is applied to accelerate the image processing.

A ground test platform was established, and its results indicate that these measures greatly improve the recognition rate, that the object localization error stays within 1%, and that the embedded platform can process the image sequence at up to 70 frames per second. This work aims at making such vision-based systems more practicable in a real dynamic environment.

2. System Design and Working Mode

The vision-based positioning system works as shown in Figure 2. The system consists of a light source, a camera and lens, and an embedded computer based on a GPU and ARM. The system is tested in scenes with both simple and complex backgrounds that simulate the real space environment. Reference markers with robust Hamming-code patterns are set up both at the fixed place and on the robot, as shown in Figure 2.

The system works in two modes as follows:

(1) Long-Distance Navigation and Control. When the object is at a distance, the system recognizes the target in the image sequence and gets its coarse distance and position. This is realized by identifying where the object is located in the image and its image size d; the coarse distance of the object can then be derived from its known real size (see the sketch after this list). The robot is controlled to approach the object with the help of the navigation system. In this working mode, we only estimate the position and distance of the object, without calculating its accurate attitude.

(2) Close Docking and Precise Control. When the two objects are less than 1.5 meters apart, the camera seeks the known marker on the target and gets its accurate information to implement the precise control.
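The coarse range estimate in mode (1) follows directly from the pinhole model: the distance is roughly the focal length (expressed in pixels) times the real size divided by the imaged size. The following is a minimal sketch under that assumption; the function name and the example numbers are illustrative and not taken from the paper.

def coarse_range_estimate(bbox_width_px, real_width_m, fx_px):
    """Coarse distance from apparent size under the pinhole model.

    bbox_width_px: width of the detected bounding box in pixels
    real_width_m:  known physical width of the target in meters
    fx_px:         calibrated focal length expressed in pixels
    """
    if bbox_width_px <= 0:
        raise ValueError("bounding box width must be positive")
    # Similar triangles: Z ~= f * W / w
    return fx_px * real_width_m / bbox_width_px

# Example: a 0.30 m wide target imaged 120 px wide with fx = 800 px -> ~2.0 m
distance = coarse_range_estimate(120, 0.30, 800.0)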

3. Object Detection

According to the system working modes, the object detection process includes two parts: one is coarse recognition to find the object at a distance in the complex scenes, and the other is fine localization for calculating the accurate spatial position at close range.

3.1. Object Detection by Deep Learning

The traditional way to localize the target in the image is to directly apply precise image matching, for example with SIFT or ORB features [18, 19], but this does not distinguish the target well in a complex background and may lead to mismatches and a loss of positioning accuracy. With the development of machine learning, deep learning has shown higher robustness and effectiveness in solving target detection in complex scenes [20, 21]. In this work, we use a target detection method based on a convolutional neural network (CNN).

The target detection process is as follows: firstly, prepare the training set, label the positive and negative samples, and train the model on this dataset. Then, the trained model is deployed on the embedded computer, and the software automatically determines whether the target is present in the scene.
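As an illustration of the online decision step, the sketch below loads a trained classification model and reports whether the target appears in a frame. The paper does not state the training framework or file formats; the Caffe-style model files, the input size, and the mean value here are assumptions.

import cv2

# Hypothetical model files exported from the offline training stage.
net = cv2.dnn.readNetFromCaffe("deploy.prototxt", "target_cnn.caffemodel")

def target_in_frame(frame_bgr, thresh=0.5):
    """Return True if the CNN classifies the frame as containing the target."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # Input size and mean value are illustrative; they must match the training setup.
    blob = cv2.dnn.blobFromImage(gray, scalefactor=1.0, size=(227, 227), mean=110.0)
    net.setInput(blob)
    probs = net.forward().flatten()   # assumed output order: [p(negative), p(positive)]
    return float(probs[1]) > thresh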

3.2. Dataset Organization and Image Preprocessing

For offline training, we selected about 4000 positive samples and 9700 negative samples. It is common practice to use about 2/3 to 4/5 of the samples for training; the ratio of the training set to the test set in this paper is 7 : 3. The positive samples were captured at different angles and under different light intensities, and some are half occluded, as shown in Figure 3. Negative samples without the target were also captured in the same environment, and some images were downloaded from the internet, as shown in Figure 4. Before the data are imported into the network, they need to be preprocessed. Firstly, all images are normalized to a fixed dimension in pixels before being fed to the network. Secondly, the image type is converted to a uniform format so that the training software can use it directly. Finally, the mean image of the training set is generated and used when evaluating the training accuracy.
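A minimal sketch of this preprocessing, assuming grayscale images, an illustrative fixed input size, and a hypothetical directory layout; the paper does not specify these values.

import glob
import cv2
import numpy as np

INPUT_SIZE = (227, 227)   # assumed fixed dimension; the paper does not state the number

def preprocess(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)   # convert to a single format
    img = cv2.resize(img, INPUT_SIZE)              # normalize the pixel dimensions
    return img.astype(np.float32)

# Mean image of the training set, later subtracted from every sample.
train_paths = glob.glob("train/*.png")             # hypothetical dataset layout
mean_image = np.mean([preprocess(p) for p in train_paths], axis=0)

def to_network_input(path):
    return preprocess(path) - mean_image           # mean subtraction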

3.3. Training Configuration

In this paper, the neural network model has eight layers: five convolutional layers and three fully connected layers. Each convolutional layer includes the ReLU (rectified linear unit) activation function and LRN (local response normalization), followed by downsampling with pooling. The batch size is set to 50, and the training set has 9590 images, so it takes 192 iterations to pass over all the samples once. The test set has 4110 images with a batch size of 50, so one full test takes 83 batches. We set the snapshot interval to 1000 iterations, and the loss curve and accuracy curve are shown in Figure 5.
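The description matches an AlexNet-style topology. Below is a sketch of such an eight-layer network in PyTorch (the paper does not name its framework); the channel counts, kernel sizes, grayscale input, and two-class output are assumptions.

import torch.nn as nn

# Five convolutional layers with ReLU and local response normalization (LRN),
# followed by three fully connected layers, for a binary target/no-target output.
class TargetNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):          # x: batch of 1 x 227 x 227 grayscale images
        return self.classifier(self.features(x))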

At 768 iterations, the training accuracy is 99.976% and the test error rate is 0.157%; the system has basically reached a steady state. At 30000 iterations, the model can basically meet the needs of this experiment. The training is done offline and takes about 60 hours on a CPU or 24 hours on a GPU.

3.4. Recognition Algorithm Test

To verify the reliability of the method, we test it against the database of 4110 samples, as shown in Table 1. By varying the threshold value, 10 pairs of TPR (true positive rate) and FPR (false positive rate) are obtained. The recognition algorithm is compared with the traditional image matching method based on ORB features; the former results are FPR1 and TPR1, while the latter are FPR2 and TPR2. For image matching it is often recommended to take the threshold as 0.8. In our application, we set the threshold between 0.4 and 0.6, where the matching result was best; if the threshold is larger, the generalization performance decreases.
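For clarity, the sketch below shows how such TPR/FPR pairs can be computed by sweeping a decision threshold over the classifier scores; the score values and the ten thresholds are illustrative and not taken from Table 1.

import numpy as np

def roc_points(scores, labels, thresholds):
    """TPR/FPR at each threshold; scores are positive-class confidences,
    labels are 1 for positive samples and 0 for negative samples."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    points = []
    for t in thresholds:
        pred = scores >= t
        tpr = np.sum(pred & (labels == 1)) / max(np.sum(labels == 1), 1)
        fpr = np.sum(pred & (labels == 0)) / max(np.sum(labels == 0), 1)
        points.append((float(t), float(tpr), float(fpr)))
    return points

# Dummy example with ten evenly spaced thresholds.
example = roc_points([0.9, 0.7, 0.2, 0.4], [1, 1, 0, 0], np.linspace(0.1, 1.0, 10))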

Among the 4110 tested samples, the CNN-based method never identified a negative sample as a positive one, and only 21 positive samples were missed. The accuracy rate reached 99.847%, which is far better than the traditional object detection method. This benefit comes from the CNN and deep-learning technology.

3.5. Marker Recognition and Fine Localization

Marker recognition is performed when the object is within 1.5 meters and after it has been recognized and narrowed to a small region. Marker detection includes the following steps: firstly, the image is normalized and converted to a binary image with a certain threshold. Then, after image segmentation and contour extraction, polygons are approximated and the four vertices of each polygon are found. According to the polygon position, the polygon image is segmented and represented by 1s and 0s. This binary vector is matched against the Hamming codes, and the ID of the AprilTag and its orientation in the image are recognized, as shown in Figure 6. In addition, the marker is not the only target that can be used at this stage; we tried other applicable targets, as shown in Figure 7. However, thanks to the self-correcting property of the marker, its stability is better.
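A minimal OpenCV sketch of the quadrilateral candidate search described above; the threshold, minimum area, and approximation tolerance are illustrative values, and the Hamming decoding step is only indicated by a comment.

import cv2

def find_marker_quads(gray, min_area=400):
    """Candidate four-vertex contours: binarize, extract contours,
    approximate polygons, keep convex quadrilaterals."""
    _, binary = cv2.threshold(gray, 100, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    quads = []
    for c in contours:
        approx = cv2.approxPolyDP(c, 0.03 * cv2.arcLength(c, True), True)
        if (len(approx) == 4 and cv2.contourArea(approx) > min_area
                and cv2.isContourConvex(approx)):
            quads.append(approx.reshape(4, 2))
    # Each candidate is then rectified, sampled into a 0/1 grid, and matched
    # against the Hamming codebook to identify the tag ID and orientation.
    return quads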

After the four vertices of the marker are confirmed, they are used for further calculation. Consider that if a marker covers 50 pixels in the image, a precision of one pixel leads to a localization error of about 1%, that is, 10 mm within a range of 1 m. For more precise localization, the coordinates of the four control points are refined to subpixel accuracy, and the error can decrease to 0.1%. There are many methods for subpixel edge detection; in this work we directly use the classic OpenCV function for subpixel refinement. The subpixel edge detection is shown in Figure 8. Since subpixel detection is time-consuming, it is only used for fine operation control.
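A sketch of this refinement with OpenCV's standard corner refinement routine; the window size and termination criteria are typical values and are not taken from the paper.

import cv2
import numpy as np

def refine_corners(gray, corners_px):
    """Refine the four marker vertices to subpixel accuracy."""
    pts = np.asarray(corners_px, dtype=np.float32).reshape(-1, 1, 2)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.01)
    # Search window of 5x5 pixels around each corner; no dead zone.
    return cv2.cornerSubPix(gray, pts, (5, 5), (-1, -1), criteria)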

4. Position-Attitude Measurement

4.1. The Localization Algorithm

In the camera projective model, define the center of the marker as $(X_m, Y_m, Z_m)$ in the marker coordinate system and $(X_c, Y_c, Z_c)$ in the camera coordinate system:

$$\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = R \begin{bmatrix} X_m \\ Y_m \\ Z_m \end{bmatrix} + T. \quad (1)$$

Here, the translation vector is $T = (t_x, t_y, t_z)^T$ and the rotation matrix is

$$R = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}. \quad (2)$$

The translation vector represents the three-dimensional position of the target in the camera coordinate system, and the rotation matrix characterizes the attitude of the target in the camera coordinate system. Suppose the marker coordinate system is attached to the original position of the moving object. The position measurement problem is then to calculate the robot's position and attitude relative to the camera, which is equivalent to resolving the rotation matrix $R$ and the translation vector $T$ in equation (1).

Let $P_1$, $P_2$, $P_3$, $P_4$ be the coordinates of the four marker vertices in space and $p_1$, $p_2$, $p_3$, $p_4$ be their projection coordinates in the camera plane, as shown in Figure 9. Suppose that the side lengths of the marker and the camera intrinsic parameters are known. Resolving $R$ and $T$ is then a classical P4P problem [22].

Define the control points in the camera coordinate system as $p_i = (u_i, v_i, f)$, $i = 1, \ldots, 4$, where $f$ is the camera focal length, and define the optical center as $O_c$. The plane $\pi_1$ is formed by $O_c$, $p_3$, and $p_4$. The normal vector of $\pi_1$ is $N_1$, which can be calculated from the linear equation of the image line $p_3 p_4$ and the camera intrinsic parameters.

Let $P_1 = \lambda_1 \overrightarrow{O_c p_1}$, $P_2 = \lambda_2 \overrightarrow{O_c p_2}$.

Since vector $\overrightarrow{P_1 P_2}$ is parallel to $\overrightarrow{P_3 P_4}$ (the opposite side of the square marker, which lies in $\pi_1$), it is perpendicular to vector $N_1$, and the side length $a = |P_1 P_2|$ is known, so there are

$$N_1 \cdot \left( \lambda_2 \overrightarrow{O_c p_2} - \lambda_1 \overrightarrow{O_c p_1} \right) = 0, \quad (3)$$

$$\left\| \lambda_2 \overrightarrow{O_c p_2} - \lambda_1 \overrightarrow{O_c p_1} \right\| = a. \quad (4)$$

From equations (3)–(4), $\lambda_1$ and $\lambda_2$ can be solved, and correspondingly $P_1$ and $P_2$ are obtained as $P_1 = \lambda_1 \overrightarrow{O_c p_1}$ and $P_2 = \lambda_2 \overrightarrow{O_c p_2}$. If we take $P_1$ as the origin point, then the displacement is $T = P_1$.

The $X$ axis of the marker in the camera coordinate system is $\overrightarrow{P_1 P_2}$ and is normalized as $r_1 = \overrightarrow{P_1 P_2} / \|\overrightarrow{P_1 P_2}\|$, which is the first column of the rotation matrix $R$.

The $Y$ axis in the camera coordinate system is $\overrightarrow{P_1 P_4}$, with $P_4$ solved in the same way, and is normalized as $r_2 = \overrightarrow{P_1 P_4} / \|\overrightarrow{P_1 P_4}\|$, which is the second column of the rotation matrix $R$.

The $Z$ axis in the camera coordinate system can be calculated by the cross product as

$$r_3 = r_1 \times r_2. \quad (5)$$

It is the last column of $R$, and by now the rotation matrix has been solved.
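For reference, an equivalent pose can be obtained with OpenCV's general PnP solver from the four refined corners and the calibrated intrinsics. This is a sketch for cross-checking, not the paper's own implementation; the marker side length is an assumption, and the marker center (rather than the corner P1) is used as the marker-frame origin, which only shifts T by a constant offset.

import cv2
import numpy as np

def marker_pose(corners_px, marker_side_m, K, dist_coeffs):
    """Rotation matrix R and translation vector T of the marker in the camera
    frame from its four image corners (ordered consistently with obj_pts)."""
    s = marker_side_m / 2.0
    # Marker-frame coordinates of the four vertices (Z = 0 on the marker plane).
    obj_pts = np.array([[-s,  s, 0], [ s,  s, 0],
                        [ s, -s, 0], [-s, -s, 0]], dtype=np.float32)
    img_pts = np.asarray(corners_px, dtype=np.float32).reshape(4, 1, 2)
    ok, rvec, tvec = cv2.solvePnP(obj_pts, img_pts, K, dist_coeffs)
    if not ok:
        raise RuntimeError("PnP failed")
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> 3x3 rotation matrix
    return R, tvec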

5. Precision and Speed Testing

5.1. Experiment Platform
5.1.1. Experiment Design

A platform was set up to validate the precision of the system, as shown in Figure 10. The reference marker is fixed on a 6-DOF precision displacement table, and the camera is installed on another mechanical table. We record the displacement of the 6-DOF table as the real value and the displacement calculated by the vision system as the test value. The difference between the real value and the test value represents the precision of the system.

When the mechanical platform is moving, the computer software samples the image sequence, calculates the accurate position and attitude for each picture, and draws the moving trajectory in real time, as shown in Figure 11. The red square indicates the marker in the camera coordinate system. To reflect the orientation of the marker, its upper-left corner when upright is defined as the first corner, and the other three corners follow in counterclockwise order.

5.1.2. Camera Calibration

Before the camera is put into use, it should be calibrated offline. The camera intrinsic parameters are commonly described as

$$K = \begin{bmatrix} f_x & s & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix},$$

where $f_x$ is the normalized focal length of the $x$ axis, $f_y$ is that of the $y$ axis, $s$ is the distortion (skew) factor, and $(u_0, v_0)$ is the image coordinate of the optical center. These factors depend only on the camera itself and can be calibrated in advance.

Camera calibration is a commonly used operation. According to the theory of Zhang [23], only one 2D checkerboard marker and pictures of it at various angles and positions are needed; the intrinsic parameters can then be obtained by the least-squares method.
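A minimal OpenCV sketch of this checkerboard calibration; the board dimensions, square size, and file paths are assumptions.

import glob
import cv2
import numpy as np

PATTERN = (9, 6)          # inner corners per row/column of the checkerboard (assumed)
SQUARE = 0.025            # square size in meters (assumed)

# 3D coordinates of the board corners in the board's own plane (Z = 0).
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE

obj_points, img_points = [], []
for path in glob.glob("calib/*.png"):          # hypothetical image folder
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Least-squares estimate of the intrinsic matrix K and distortion coefficients.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)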

In this work, we use a 5 mm lens and a 1/3-inch grayscale camera with a resolution of 640 × 480; the camera parameters are calibrated as listed in Table 2.

Table 2 gives the normalized focal lengths of the $x$ and $y$ axes, the image coordinates of the optical center, the distortion factors of the $x$ and $y$ axes, and the radial distortion parameters.

5.2. Precision Test Results
5.2.1. Single-Dimension Precision Test

Move the mechanical table along each axis in turn, and test whether the calculated value is consistent with the real value. Figure 12 shows the test results, and the error is less than 1%, except for some points measured toward the side face of the marker.

5.2.2. Multidimension Decoupling Test

Since the measurement of multiple degrees of freedom may be coupled and its error model is difficult to analyze, the coupling influence must be evaluated by an actual test.

The experimental schematic of the multidimension decoupling test is shown in Figure 13, and Figure 14 is the curve obtained when the object moves along a rectangular path (dimensions in cm) within a plane. At the same time, we record the changes of the Z value and the rotation about Z; the results show that the coupling of Z with XY and of RZ with XY is less than 1 mm and 2 degrees, respectively. This error may be caused by the camera calibration, the position calculation, or other system errors.

5.3. Real-Time Speed Test Results

For real-time operation, we choose a high-speed camera with low exposure time and use a GPU-embedded platform to implement and speed up the image processing algorithms. Here, the camera is a Basler gc300 (300 fps), and the GPU platform is an NVIDIA Tegra TX1. These industrial components are used for the ground test and would need to be hardened for the space environment.

GPUs perform well on large images and can accelerate image processing and CNN applications [24]. FPGAs perform well on small images and parallel processing, and heterogeneous computing with GPU and FPGA would be even more powerful for real-time applications. The results below are for the GPU used as the sole processor.

The vision-based system works with the following steps (a sketch of this loop follows the list):

(1) Image sequence sampling
(2) Image normalization or image preprocessing
(3) Image recognition and matching
(4) Position and attitude calculation
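The sketch below shows a processing loop that follows these four steps, with a per-frame timer for measuring throughput. The camera index and the 0.10 m marker side are assumptions, and the helper functions and calibration results (find_marker_quads, refine_corners, marker_pose, K, dist) are the hypothetical ones sketched in earlier sections.

import time
import cv2

cap = cv2.VideoCapture(0)                        # camera index is illustrative
while True:
    t0 = time.perf_counter()
    ok, frame = cap.read()                       # (1) image sequence sampling
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)       # (2) preprocessing
    quads = find_marker_quads(gray)              # (3) recognition/matching (docking mode)
    if quads:
        corners = refine_corners(gray, quads[0])
        R, T = marker_pose(corners, 0.10, K, dist)       # (4) position and attitude
    fps = 1.0 / (time.perf_counter() - t0)       # per-frame rate, for profiling (Table 3)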

Among them, the image sequence sampling mainly depends on the camera performance, and this step costs the most time. For image processing, the processing time depends on the working mode. Image recognition by CNN is somewhat time-consuming, since it seeks and matches features over the whole image. When the objects are less than 1.5 meters apart, the recognition algorithm is much faster, since the reference marker is matched within a small region and the image has been converted to a binary image. The times are shown in Table 3.

In summary for this section, the ground experiment shows that the sampling and fastest processing speed of the system can reach 70 Hz in docking mode and 30 Hz in navigation mode, which satisfies the demands of most vision-based control systems. There are further strategies to improve the system speed, such as using image tracking to reduce the search region or optimizing the GPU code with CUDA; these will be discussed in the future.

6. Conclusions

A vision-based system mainly for 6-DOF position-attitude measurement and control of a space body is introduced in this paper. The configuration of the system, the image processing and object detection algorithms, and the position-attitude measurement formulas are given. Our work mainly focuses on the practical problems of such a system. Firstly, we bring in a CNN-based deep-learning method for object detection and obtain a 99.8% accuracy rate. Secondly, we solve the P4P problem for the object position and attitude and test it on the ground; the results show the error is well below 1% of the range. Thirdly, we use a high-speed camera and a GPU as the processor and accelerate the system to nearly 70 frames per second. The fast development of computing technology such as GPUs and deep learning brings great benefits to object detection applications. In the future, we will continue to optimize the algorithms, reduce the time consumption, and make this kind of vision-based system more robust and faster so that it can be used in more space applications.

Data Availability

The code and data used to support the findings of this study have been deposited in the figshare repository (DOI: 10.6084/m9.figshare.7149974).

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank Tianyu Wang, Bin Yu, Hui Zhou, Xiaopeng Su, and Ting Wang for their help on this project. This work is funded by the space utilization system of Chinese Manned Spaceflight Program.