Abstract
Moving target detection and tracking in sports video play an important role in enhancing the popularity of sports and promoting sports events. This paper applies the SIFT algorithm to the detection and tracking of moving targets in sports video, identifies sports features, and improves the sports feature detection algorithm. The point cloud data are divided into multiple cubic grids in their coordinate system; the centroid of the data points in each grid is then computed, and the coordinates of all points in the grid are replaced by the centroid coordinates. In addition, the algorithm is verified through data analysis, and a sports video moving target detection system is built. The experimental results verify that the sports video target detection and tracking technology based on the SIFT algorithm proposed in this paper achieves good results.
1. Introduction
In recent years, sports behavior recognition technology has been increasingly integrated into daily sports video analysis. More and more researchers are engaged in sports behavior recognition, and research on behavior recognition technology is in full swing. The emergence of new methods and theories, together with the introduction of algorithms from other fields, has driven great progress in behavior recognition [1]. The main pipeline of sports behavior recognition can be roughly divided into four steps: feature extraction, feature representation, behavior modeling, and behavior classification [2]. Depending on the specific research goals and needs, these steps can be adapted. For example, some algorithms merge feature extraction and representation into a single step, and some methods skip behavior modeling entirely and feed the descriptors obtained from feature extraction and representation directly into a classifier. Other methods incorporate iterative feedback processes such as deep learning. In addition, some methods further process the descriptors after feature representation (for example, by dimensionality reduction) to make the features more discriminative [3].
The models used in time series modeling can be regarded as a further form of expression after feature representation. The temporal information extracted by these models is not presented in an intuitive form but is expressed through the model parameters obtained after modeling. Existing methods for time series modeling include hidden Markov models, conditional random fields, linear dynamical systems, and the recently popular recurrent neural network models.
This article applies the SIFT algorithm to the detection and tracking of moving targets in sports video in order to identify sports features, providing a theoretical reference for the application of dynamic recognition technology in sports competitions and sports training.
2. Related Work
A visual SLAM system estimates its own position in the environment (self-localization) mainly through the visual odometry module [4]. A binocular visual odometer obtains depth information by computing the parallax between the left and right cameras, whereas a monocular visual odometer cannot recover absolute scale and must rely on other sensors or environmental information [5]. Although state-of-the-art visual odometers can run in real time with high positioning accuracy, almost all of them assume a static environment [6]. If moving objects interfere within the camera's field of view, the visual odometer will produce large estimation errors or even fail. For the problem of moving objects disturbing the visual odometer, the random sample consensus (RANSAC) algorithm is currently the most mature and effective method [7]. It fits a model and rejects data points inconsistent with the model as outliers. When moving objects occupy only a small part of the camera's field of view, RANSAC can filter the feature points, discard the feature points on moving objects as outliers, and, combined with the motion estimation in the visual odometer, yield better positioning results [8]. However, when moving objects occupy a large part of the field of view, RANSAC may treat the feature points on the moving objects as inliers, and this method alone can no longer eliminate the interference [9]. Literature [10] uses pretraining to segment the feature points in the image into dynamic and static points, but this is difficult to implement in real-time applications. Literature [11] relies on dense scene flow to segment dynamic objects, but computing the scene flow itself requires visual odometry to compensate for the pose. Literature [12] uses image segmentation to divide the image into static background and motion regions, performs motion estimation separately in each region, and then carries out a global fusion; although this algorithm is accurate, it is difficult to meet real-time requirements. Literature [13] uses IMU information as a prior to segment the dynamic feature points in the image and achieves visual positioning in dynamic environments by combining inertial navigation with a depth visual odometer.
Literature [14] uses a single-chip microcomputer as the controller to design a high-speed positioning control system for an image-monitoring dynamic bracket and dynamically calculates the rotation angle of the pan/tilt unit based on the control motor. Literature [15] analyzes sports trajectory tracking strategies, designs a dynamic fuzzy PID controller for point-line trajectory tracking, and studies the pose calculation of the visual pan-tilt unit and the control technology of the dynamic trajectory tracking system.
The vigorous development of machine vision across a wide range of compatible devices also marks the growing demand for vision capabilities. Literature [16] analyzes the coordination strategy of a multimotion sports system for capturing moving targets and uses stereo vision motion detection to estimate the motion parameters of the moving target. At present, three-dimensional object positioning based on binocular stereo vision has become one of the hot spots in vision measurement research [17]. Compared with the monocular camera motion measurement used in the visual positioning control system for large tank discharge holes, stereo vision can recover more information once the parallax is obtained. It is not only compatible with the characteristics of a monocular camera but can also be used to construct three-dimensional object models with high accuracy and applicability [18].
3. Moving Target Detection Based on SIFT Algorithm
This section analyzes the moving target detection algorithm, focusing on how the SIFT algorithm is combined to identify and track moving targets.
The essence of the Nonmaximum Suppression (NMS) algorithm is to retain, within a certain range, only the most or least salient of the key points initially extracted by the algorithm and to discard the others.
The specific steps are as follows.
For a point in the key point set, (1) the algorithm first takes its neighborhood and checks whether its saliency value is the largest or smallest in that neighborhood; if it is, the point is marked as a true key point, otherwise it is marked as a redundant key point. (2) The algorithm traverses the entire key point set and removes all points marked as redundant.
The saliency measure in the above process can be chosen according to the algorithm or the application; features such as curvature, the spread of the normal vector distribution of the neighborhood points, or the shape index value can all be used.
Following the above method, a neighborhood of size k = 3 is taken, it is determined whether the curvature of each key point is the maximum within its neighborhood, and the nonmaximum points are removed, as shown in Figure 1. This example uses curvature as the saliency measure; it not only retains the high-quality key points but also greatly reduces redundancy.
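As a concrete illustration of the NMS step described above, the following Python sketch keeps a candidate key point only if its saliency (for example, curvature) is the extremum among its k nearest candidates. The function name, the use of SciPy's k-d tree, and the k-nearest-neighbor formulation are illustrative assumptions, not the paper's MATLAB implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def nms_keypoints(points, saliency, k=3, keep_max=True):
    """Non-maximum suppression over candidate key points.

    points   : (N, 3) array of candidate key point coordinates
    saliency : (N,)   saliency value per point (e.g. curvature)
    k        : number of nearest candidates used for the comparison
    keep_max : True keeps local maxima, False keeps local minima
    """
    tree = cKDTree(points)
    keep = np.zeros(len(points), dtype=bool)
    for i, p in enumerate(points):
        _, idx = tree.query(p, k=k)              # indices of the k nearest candidates (includes i itself)
        neigh = saliency[np.atleast_1d(idx)]
        if keep_max:
            keep[i] = saliency[i] >= neigh.max()  # local maximum -> true key point
        else:
            keep[i] = saliency[i] <= neigh.min()  # local minimum -> true key point
    return points[keep], keep
```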

The Intrinsic Shape Signatures (ISS) algorithm was proposed by Zhong. It uses the difference in dispersion among the three principal directions of a point's local reference frame as the saliency measure and extracts the points whose dispersion differences exceed preset thresholds.
First, the PCA algorithm is used to compute the three eigenvalues of the covariance matrix, sorted in descending order. The three eigenvalues obtained from the eigenvalue decomposition of the covariance matrix represent the degree of dispersion along the three corresponding eigendirections. Therefore, the ratio of two eigenvalues can be used to express the difference in dispersion between the two corresponding principal axes.
Formula (1) gives the criterion for extracting key points, where the two ratio thresholds are preset in advance, and the size of the thresholds determines how many key points are extracted: the smaller the threshold value, the more key points are extracted.
For any point in the point cloud, according to formula (2), the algorithm calculates the centroid of the point's neighborhood, transforms the neighborhood into the centroid frame, computes the covariance matrix, and then performs the eigenvalue decomposition of the covariance matrix, as shown in the following formula [19]:
Then, the Hotelling transform is applied to the neighborhood points, and the neighborhood coordinates of the point are projected onto the three principal axes, as shown in the following formula:
Here the two vectors denote the coordinates of a neighborhood point before and after the transformation. The algorithm then calculates the ratio between the coordinate distribution ranges along the first and second largest principal axes, as shown in formula (4) [20].
For a symmetric surface (that is, the neighbors are distributed similarly along the largest and second largest principal axes), this ratio equals 1; for an asymmetric surface, it is greater than 1. The algorithm sets a threshold on the ratio and records the point as a key point if the threshold condition is satisfied.
The algorithm traverses every point in the point cloud and completes the preliminary screening of the key points.
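The ISS-style pre-selection described above can be sketched in Python as follows. The neighborhood radius, the minimum neighborhood size, and the two ratio thresholds (0.6) are placeholder values, and the eigenvalue-ratio test is written in the standard ISS form, which may differ in direction and notation from formula (1) of this paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def iss_preselect(points, radius, eps21=0.6, eps32=0.6):
    """Pre-select ISS candidate key points by eigenvalue-ratio tests.

    For every point, the covariance matrix of its radius neighborhood is
    eigen-decomposed; points whose dispersion clearly differs along the three
    principal axes (ratios below the preset thresholds) are kept.
    """
    tree = cKDTree(points)
    candidates = []
    for i, p in enumerate(points):
        idx = tree.query_ball_point(p, radius)
        if len(idx) < 5:                                # too few neighbors to be meaningful
            continue
        neigh = points[idx]
        centered = neigh - neigh.mean(axis=0)           # move to the centroid system
        cov = centered.T @ centered / len(idx)          # 3x3 covariance matrix
        lam = np.sort(np.linalg.eigvalsh(cov))[::-1]    # eigenvalues, descending order
        if lam[0] <= 0:
            continue
        # keep points whose dispersion differs strongly between principal axes
        if lam[1] / lam[0] < eps21 and lam[2] / lam[1] < eps32:
            candidates.append(i)
    return np.asarray(candidates)
```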
For each preliminarily selected key point, a quadric surface is fitted to its neighborhood to obtain a parametric surface. A uniform grid is then used to sample the fitted surface, the principal curvatures and Gaussian curvature at the sampling points are calculated, and a key point quality parameter is used to evaluate the key points according to formulas (5) and (6). In this paper, to simplify the calculation, the algorithm uses the neighborhood points instead of uniform sampling points to compute this quality, with the number of neighborhood points entering the formulas [21].
Finally, with this quality measure as the saliency parameter, this article applies the NMS algorithm, retaining only the maxima, to complete the screening of the key points.
The Local Surface Patch (LSP) algorithm uses the least squares method to fit the local point cloud with a parametric surface. It computes the first and second fundamental forms of the surface, constructs the shape index, and keeps as candidate key points those points whose shape index satisfies certain conditions. Finally, the NMS algorithm, with the shape index as the saliency parameter, further filters the initially selected key points to complete the final key point detection. The specific process is as follows.
For the neighborhood of any point in the point cloud, the algorithm first establishes the LRF and rotates the neighborhood into the three principal axis directions of the LRF to eliminate the influence of the initial pose on further calculations. Then a quadric surface is fitted, and the principal curvatures at the point on the surface are calculated according to the following formula [22]:
It can be seen that the shape index defined by the above formula takes values in [0, 1]. When the value is large, the corresponding local surface is convex; when the value is small, the corresponding local surface is concave.
After the computation has been completed for all points in the point cloud, the preliminary screening of the key points can be carried out according to the following formula:
Here, the first quantity is the mean shape index over the neighborhood of the point, and the two preset parameters take values between 0 and 1.
Then, the shape index is used as the saliency parameter. Using the NMS algorithm, this paper checks point by point whether the value is the maximum or minimum among the values of the points in the neighborhood; if so, the point is kept, otherwise it is deleted. Finally, the key point set is obtained.
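A minimal sketch of the LSP-style screening is given below. The [0, 1] shape index formula and the deviation-from-local-mean test are stated in one common form rather than reproducing the paper's formulas exactly; the parameters alpha and beta are placeholders for the two preset parameters mentioned above, and the curvature sign convention determines which end of the range corresponds to convex patches.

```python
import numpy as np
from scipy.spatial import cKDTree

def shape_index(k1, k2, eps=1e-12):
    """One common [0, 1] form of the shape index from the two principal curvatures."""
    k_max, k_min = np.maximum(k1, k2), np.minimum(k1, k2)
    return 0.5 - (1.0 / np.pi) * np.arctan((k_max + k_min) / (k_max - k_min + eps))

def lsp_preselect(points, si, radius, alpha=0.35, beta=0.2):
    """Keep points whose shape index deviates enough from the local mean.

    alpha, beta : preset parameters in (0, 1); the values here are only placeholders.
    si          : (N,) shape index values computed per point
    """
    tree = cKDTree(points)
    keep = []
    for i, p in enumerate(points):
        idx = tree.query_ball_point(p, radius)
        mu = si[idx].mean()                      # mean shape index over the neighborhood
        if si[i] >= (1 + alpha) * mu or si[i] <= (1 - beta) * mu:
            keep.append(i)                       # deviates enough -> candidate key point
    return np.asarray(keep)
```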
The Histogram of Normal Orientations (HoNO) algorithm first calculates, for each point in the point cloud, the angle between the normal vector of every point in its neighborhood and the normal vector of the target point and accumulates these angles into a histogram. Based on the histogram characteristics, flat regions are excluded and salient feature regions are detected. Then, by evaluating the properties of the histogram and of the neighborhood covariance matrix, key points are extracted from the salient regions.
First, for each point, this paper estimates the normal vector. Then, for every other point in its neighborhood, the angle between the two normal vectors is calculated according to formula (10), and the angles are accumulated into a histogram whose bins are 10 degrees wide. To eliminate the influence of the neighborhood point density on the algorithm, the histogram is normalized after the normal vector angles of all neighborhood points have been binned.
Obviously, the histogram of a point whose neighborhood is approximately planar has the property that the first bin has a high value and the remaining bins are approximately zero. Conversely, a region with larger curvature has a wide range of normal vector directions, so most of the bins in its histogram are nonzero. It is therefore necessary to design parameters that describe the distribution of values in the histogram.
As shown in formula (11), the kurtosis Kurt is used to express the peakedness and dispersion of the histogram distribution. If the kurtosis of a point's histogram is less than the preset threshold, the histogram has no dominant peak; that is, the values in the histogram are spread over a wide range, and the point is retained as a key point candidate; otherwise, it is removed.
Finally, after the Kurt parameter has been computed for all points and the preliminary key points have been determined, redundancy is removed from the key point set by applying NMS with the chosen saliency parameter, yielding the final key point set.
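The histogram construction and the kurtosis test can be sketched as follows. SciPy's kurtosis function stands in for the Kurt parameter of formula (11), and the bin layout (18 bins of 10 degrees) and the threshold value are illustrative assumptions of this sketch.

```python
import numpy as np
from scipy.stats import kurtosis

def normal_angle_histogram(normal_p, neighbor_normals, bin_width_deg=10.0):
    """Normalized histogram of angles between a point's normal and its neighbors' normals."""
    cos_a = np.clip(neighbor_normals @ normal_p, -1.0, 1.0)
    angles = np.degrees(np.arccos(cos_a))                     # angles in [0, 180] degrees
    bins = np.arange(0.0, 180.0 + bin_width_deg, bin_width_deg)
    hist, _ = np.histogram(angles, bins=bins)
    return hist / max(hist.sum(), 1)                          # normalize to remove density effects

def hono_is_candidate(hist, kurt_threshold=10.0):
    """Keep the point when its angle histogram has no single dominant peak,
    i.e. its kurtosis stays below a preset threshold (value here is illustrative)."""
    return kurtosis(hist, fisher=False) < kurt_threshold
```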
The Harris operator is extended to three-dimensional space, and the specific steps are as follows.
First, for a point in the point cloud, the algorithm queries its neighborhood and establishes the LRF, and the neighborhood is translated so that the point becomes the origin of the LRF coordinate system.
After establishing the LRF, the algorithm sets the parameters according to formula (13) and performs quadric surface fitting on the neighborhood to obtain the parameters of the fitted quadric surface.
For a parametric surface, adding more high-order terms means that it can adapt to more complex shapes. However, more complex surfaces may not have clearly defined derivatives at certain points in the domain. Moreover, when the neighborhood radius is not large, the surface near the target point can be approximated by a quadric. Therefore, the directional derivatives can easily be obtained from the following equations:
Considering the influence of noise, the Gaussian function originally proposed by Harris and Stephens can be applied, as shown in the following equations:
Substituting the quadric surface equation, it is simplified as shown in the following equations:
Then, the analysis matrix is constructed as follows:
By analyzing the determinant and trace of the matrix, the Harris corner response value at the point is constructed, as shown in formula (23), where the coefficient is a nonnegative preset parameter.
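A small sketch of the Harris response computation is given below: the 2x2 analysis matrix is accumulated from the (optionally Gaussian-weighted) directional derivatives of the fitted surface, and the response follows the classical form R = det(M) - k * trace(M)^2 referenced by formula (23). The value k = 0.04 is only a typical placeholder borrowed from the 2D Harris detector.

```python
import numpy as np

def harris_3d_response(fx, fy, weights=None, k=0.04):
    """Harris response for one key point candidate.

    fx, fy  : directional derivatives of the fitted quadric surface evaluated
              at the neighborhood points (1-D arrays)
    weights : optional Gaussian weights used for noise suppression
    k       : nonnegative preset parameter; 0.04 is only a placeholder value
    """
    if weights is None:
        weights = np.ones_like(fx)
    # accumulate the 2x2 analysis matrix M from the weighted derivatives
    M = np.array([
        [np.sum(weights * fx * fx), np.sum(weights * fx * fy)],
        [np.sum(weights * fx * fy), np.sum(weights * fy * fy)],
    ])
    return np.linalg.det(M) - k * np.trace(M) ** 2   # R = det(M) - k * trace(M)^2
```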
The specific steps to implement the 3D-SIFT algorithm in this paper are as follows:
(1) The algorithm constructs the point cloud scale space in a three-dimensional coordinate system. The scale space is obtained by convolving each point's coordinates and the values in its neighborhood with a three-dimensional Gaussian function whose scale is changed by varying its standard deviation, as shown in the following formula. The specific expression of the Gaussian function is shown in the next formula, and the cube grid size used during downsampling is defined in the formula after that.
(2) The algorithm builds the DoG space of the 3D point cloud. Downsampling is equivalent to local mean filtering of the point cloud: the local features inside each cube grid disappear and are replaced by the centroid coordinates, and discontinuities appear between the grids. Therefore, to ensure that the algorithm can find feature points in the point cloud stably, a DoG space must be constructed for the point cloud; the construction formula is shown in the following formula. Building the DoG space of 3D point cloud data is simply the process of convolving the Gaussian scale function with the coordinate data of the points. It is equivalent to smoothing the point cloud layer by layer, where each layer is divided into several small scales separated by a fixed step, as shown in formula (28), in which a preset parameter controls the calculation of the Gaussian scale space.
(3) The algorithm calculates the Gaussian filter response value of each sampling point in the Gaussian scale space. To improve the efficiency of data processing, the effect of distance on the features of a sampling point is taken into account (here curvature is used as the main geometric feature): the closer a neighboring point is to the sampling point, the larger its contribution to the curvature response. Each neighboring point is therefore weighted by a coefficient based on its distance to the sampling point, and the Gaussian filter response value is calculated according to the following formula. In formula (29), the curvature values of the neighboring points of the sampling point are used; the weighting coefficient is given by formula (30), in which the squared distance between the sampling point and the neighboring point appears. With the above formulas, the Gaussian filter response value of every data point in the 3D point cloud can be calculated. By taking the difference between the Gaussian filter response value of a sampling point at the current scale and its response value F_last at the previous scale, the DoG value of the sampling point at the current scale is obtained, as shown in the following formula. The algorithm repeats these steps for all points in the point cloud at each scale until every point has been traversed, yielding the DoG values of all points at all scales.
(4) The algorithm detects extreme points in the DoG space of the point cloud. In the DoG spaces at the various scales, the DoG extremum is searched for within the neighborhood of the current point.
If the DoG value at a certain scale is greater than the DoG values at the two adjacent scales, the current point at this scale is taken as a key point. The feature points found in this way are scale-invariant and are retained as feature points across different scale spaces.
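The following sketch illustrates steps (3) and (4): a curvature response weighted by a Gaussian of the squared distance is computed per scale, adjacent scales are differenced to obtain DoG values, and points whose DoG value exceeds that of the two adjacent scales are kept. The radius, the scale list, and the omission of the grid-based downsampling are simplifications of this sketch, not the paper's exact formulas (29)-(31).

```python
import numpy as np
from scipy.spatial import cKDTree

def gaussian_response(points, curvature, sigma, radius):
    """Distance-weighted Gaussian response of curvature at every point:
    closer neighbors contribute more to the response of a sampling point."""
    tree = cKDTree(points)
    response = np.zeros(len(points))
    for i, p in enumerate(points):
        idx = tree.query_ball_point(p, radius)
        d2 = np.sum((points[idx] - p) ** 2, axis=1)   # squared distances to neighbors
        w = np.exp(-d2 / (2.0 * sigma ** 2))          # Gaussian weights
        response[i] = np.sum(w * curvature[idx])      # weighted curvature sum
    return response

def dog_keypoints(points, curvature, sigmas, radius):
    """DoG values across scales and a simple scale-space extremum test."""
    F = np.stack([gaussian_response(points, curvature, s, radius) for s in sigmas])
    dog = F[1:] - F[:-1]                              # difference of adjacent scales
    keypoints = set()
    for s in range(1, len(dog) - 1):
        mask = (dog[s] > dog[s - 1]) & (dog[s] > dog[s + 1])  # larger than both adjacent scales
        keypoints.update(np.nonzero(mask)[0].tolist())
    return dog, sorted(keypoints)
```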
4. Research on Sports Video Moving Target Detection and Tracking Based on SIFT Algorithm
The data sets used in this paper mainly come from the Internet and are shown in Figure 2.

The key point detection experiments in this paper are based on the model point clouds of the above data sets. The evaluation index is computed on each data set, its average value over each data set is calculated, and the corresponding curves are drawn from these averages.
This paper chooses the relative repetition rate, the accuracy of the descriptor matching experiment, and the running efficiency as the evaluation indicators.
(1) Relative repeatability (repetition rate). The repetition rate represents the consistency between the key points detected after the point cloud is changed and the key points detected before the change. As shown in formula (32), given the set of key points detected on the original point cloud and the set of key points detected on the changed point cloud, the repetition rate measures the common part of the two sets.
The repetition rate is used to evaluate the robustness of the algorithm to factors such as spatial transformation, noise, and resolution. The noise repetition rate is the proportion of key points that are detected both before and after adding noise; the outlier repetition rate is the proportion detected both before and after adding outliers; the grid resolution repetition rate is the proportion detected both before and after downsampling the point cloud.
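One common way to compute the repetition rate is sketched below: a reference key point counts as repeated if a key point of the changed point cloud lies within a distance tolerance of it (after transforming back into the reference frame where necessary). The tolerance and the normalization by the reference set size are assumptions of this sketch; formula (32) of the paper is not reproduced exactly.

```python
import numpy as np
from scipy.spatial import cKDTree

def repetition_rate(kp_ref, kp_test, tol):
    """Fraction of reference key points that reappear in the changed point cloud.

    kp_ref  : (N, 3) key points detected on the original point cloud
    kp_test : (M, 3) key points detected after the change (noise, sampling, ...),
              already transformed back into the reference frame if necessary
    tol     : distance tolerance for counting two key points as the same
    """
    if len(kp_ref) == 0 or len(kp_test) == 0:
        return 0.0
    tree = cKDTree(kp_test)
    d, _ = tree.query(kp_ref, k=1)        # nearest changed key point for each reference key point
    return float(np.mean(d <= tol))
```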
To set the neighborhood radius quantitatively, for each data set the algorithm first calculates the diagonal length of the spatial bounding box of each model point cloud and traverses all models to obtain the maximum length. All point coordinates in the point clouds are then divided by this maximum length, and in the subsequent evaluation experiments the neighborhood radius r is uniformly set to 0.02.
The evaluation experiment includes three parts: the repetition rate experiment, the running time experiment, and the descriptor matching experiment. The algorithms and test programs are written in MATLAB.
The repetition rate experiment mainly includes two parts: the parameter change module and the repetition rate calculation module.
(1) The point cloud quality change module has three modes to choose from: adding Gaussian noise, changing the point cloud density, and adding outliers. They correspond, respectively, to the repeatability of the key point detection algorithm under noise, under grid resolution changes, and under outliers. To quantitatively generate noise amplitudes that follow a Gaussian distribution, this paper sets a noise parameter, combines it with the neighborhood radius r used by the key point detection algorithm to define the maximum Gaussian noise amplitude, and computes the noise amplitude distribution by formula (33), in which a uniform random number on [a, b] appears. After the noise amplitude has been computed, the algorithm chooses random direction angles and calculates the corresponding unit vector according to formula (34). Finally, the algorithm computes the noisy coordinate p′ = p + n·l, where p is the point coordinate before the noise is added and p′ is the coordinate afterwards. Outliers are a common type of noise in point clouds obtained by three-dimensional scanning and appear as noise points far away from the object surface. The algorithm likewise uses random sampling to select a certain proportion of points in the point cloud as outliers. For each selected point, the algorithm first calculates its normal vector and then increases the coordinate along the normal direction by an increment equal to one neighborhood radius; the outlier set is obtained as shown in formula (35). For a more intuitive view, the result is displayed in the form of a patch. In addition, in the repeatability experiments with respect to spatial transformations, the point cloud needs to be transformed in space. The spatial transformation is realized by rotation and translation, as shown in equation (36); by rotating and translating all points in the point cloud, the transformed point cloud is obtained. The rotation matrix and translation matrix are obtained by formulas (37) and (38), respectively.
(2) After detecting the key points KP of the original point cloud and the key points KP′ of the transformed point cloud, the common part of the two key point sets must be computed; this is the repetition rate calculation module.
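The two degradation modes of the quality change module can be sketched as follows: uniformly drawn amplitudes along random unit directions implement the displacement p′ = p + n·l, and a random subset of points pushed one neighborhood radius along their normals implements the outlier model. The parameter names, the amplitude distribution, and the specific random-number calls are illustrative assumptions; formulas (33)-(35) are not reproduced exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(points, r, n_par=0.1):
    """Displace every point by a random amplitude along a random unit direction,
    with the maximum amplitude tied to the neighborhood radius r."""
    n_max = n_par * r                                      # maximum noise amplitude
    amp = rng.uniform(-n_max, n_max, size=len(points))     # per-point amplitude
    theta = rng.uniform(0.0, np.pi, size=len(points))      # random direction angles
    phi = rng.uniform(0.0, 2.0 * np.pi, size=len(points))
    l = np.stack([np.sin(theta) * np.cos(phi),
                  np.sin(theta) * np.sin(phi),
                  np.cos(theta)], axis=1)                  # unit direction vectors
    return points + amp[:, None] * l                       # p' = p + n * l

def add_outliers(points, normals, r, ratio=0.01):
    """Push a random subset of points outward along their normals to create outliers."""
    n_out = max(1, int(ratio * len(points)))
    idx = rng.choice(len(points), size=n_out, replace=False)
    noisy = points.copy()
    noisy[idx] = points[idx] + r * normals[idx]            # increment of one neighborhood radius
    return noisy, idx
```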
Figure 3 shows the flow of the repetition rate experiment for one selected data set: the repetition rate of the selected key point algorithm is tested on the models of the data set as the chosen conditions change, using the noise repetition rate as an example. First, noise is added to the model point cloud; then the key point detection algorithm under test is applied to the point cloud before and after adding noise; finally, Algorithm 1 or Algorithm 2 is used to calculate the repetition rate of the two sets of key points.

Descriptor matching experiments need to be combined with the PRC (precision-recall curve) drawing process when evaluating the descriptive ability of feature descriptors. The computation of the PRC covers the complete target recognition pipeline, including feature matching, verification, and other steps. When different key point detection algorithms are combined with the same descriptor, a higher feature matching accuracy indicates a higher quality of the extracted key points. First, key points are detected on the scene point cloud and the model point cloud, respectively, and feature descriptors are built. Then the correspondences between the scene features and the model features are established and the matching accuracy is calculated. From the trend of the PRC curve, the descriptive strength, that is, the combined effect of the key point detection algorithm and the feature descriptor, can be determined.
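A simplified version of the matching step behind the PRC is sketched below: descriptors are matched by nearest neighbor with a distance-ratio test, and precision and recall are counted against a set of ground-truth correspondences. The ratio test, the ground-truth representation, and the single-threshold sweep are assumptions of this sketch, not the paper's full matching and verification pipeline.

```python
import numpy as np
from scipy.spatial import cKDTree

def match_precision_recall(desc_scene, desc_model, gt_pairs, ratio=0.8):
    """Nearest-neighbor descriptor matching with a distance-ratio test.

    desc_scene, desc_model : (Ns, D) and (Nm, D) descriptor arrays
    gt_pairs               : set of (scene_index, model_index) tuples regarded as correct
    ratio                  : ratio-test threshold; sweeping it yields one PRC point per value
    """
    tree = cKDTree(desc_model)
    d, idx = tree.query(desc_scene, k=2)                  # two closest model descriptors
    accept = d[:, 0] < ratio * d[:, 1]                    # distance-ratio test
    matches = {(int(i), int(idx[i, 0])) for i in np.nonzero(accept)[0]}
    tp = len(matches & gt_pairs)                          # correct matches
    precision = tp / max(len(matches), 1)
    recall = tp / max(len(gt_pairs), 1)
    return precision, recall
```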
5. Experimental Results and Analysis
5.1. Spatial Transformation Repetition Rate
The original point cloud is rotated about the three coordinate axes by a set of angles, the key points before and after the rotation are detected, and the repetition rate is calculated. The repetition rates calculated on the 4 data sets are shown in Table 1, and the rotation angle-repetition rate curve shown in Figure 4 is drawn from the average values over the 4 data sets. Table 1 lists the average spatial repetition rate of each key point detection algorithm on the 4 data sets.

All six tested algorithms have a repetition rate of 1 in every test; that is, they are invariant to spatial transformations. Some of the six algorithms achieve this invariance by selecting features that are independent of the choice of coordinate system for the subsequent description: for example, the ISS algorithm uses the three eigenvalues of the covariance matrix, 3D-SIFT uses layers constructed from curvature, and HoNO uses normal vector angles. Others describe features directly in the local reference frame: for example, LSP and 3D-Harris both fit the quadric surface in the local reference coordinate system.
5.2. Gaussian Noise Repetition Rate
Gaussian noise with maximum amplitudes of 0.05r, 0.08r, 0.1r, 0.15r, and 0.2r is added to the original point cloud. The key points before and after adding the noise are detected, and the repetition rate is calculated. The repetition rates on the 4 data sets are shown in Tables 2-5, and the Gaussian noise-repetition rate curves shown in Figure 5 are drawn from the average values over the 4 data sets.

From the above experimental results, it can be seen that 3D-Harris uses the parameter characteristics of local quadric surface fitting and 3D-SIFT uses the parameter characteristics of the covariance matrix; both methods suppress Gaussian noise to a certain extent. The ISS algorithm also uses the eigenvalues of the covariance matrix to determine key points, but it does not further smooth them in a scale space as 3D-SIFT does, so it is more sensitive to noise. The LSP and KPQ algorithms combine the principal curvatures and Gaussian curvature at the key points to form the shape index value Si and the key point quality Q as their rating indicators. Like the ISS algorithm, they do not consider the neighborhood distribution characteristics when selecting key points and apply no smoothing against noise, so their repetition rates are lower when the noise intensity is high.
5.3. Resolution Repetition Rate
The original point cloud is downsampled at reduction rates of 50%, 70%, 80%, 90%, and 95%. The key points before and after the reduction are detected, and the repetition rate is calculated. The repetition rates on the 4 data sets are shown in Tables 6-9, respectively, and the reduction rate-repetition rate curves shown in Figure 6 are drawn from the average values over the 4 data sets.

From the data in the table, it can be found that increasing the neighborhood radius can make the algorithm more robust to resolution changes.
5.4. Outlier Repetition Rate
Outliers are added to the original point cloud at ratios of 0.1%, 0.5%, 1%, 2%, and 5%. The key points before and after the outliers are added are detected, and the repetition rate is calculated. The repetition rates on the 4 data sets are shown in Tables 10-13, respectively, and the outlier ratio-repetition rate curves shown in Figure 7 are drawn from the average values over the 4 data sets.

In the above experiments, the repetition rates of some key point detection algorithms decrease more sharply as the proportion of outliers increases, indicating that the methods based on eigenvalue decomposition of the covariance matrix are more stable against outliers.
The above research verifies that the sports video moving target detection and tracking method based on the SIFT algorithm proposed in this paper achieves good results.
6. Conclusion
In sports behavior recognition systems, many algorithms lack temporal information. To compensate for this deficiency, many researchers build a time series model on top of the feature descriptors to further describe the behaviors and capture how different types of behaviors change over time, so that the feature representation becomes more accurate and discriminative and the classification accuracy improves. This paper uses SIFT to recognize and analyze sports video images and applies the SIFT algorithm to detect and track moving targets. Finally, the experiments verify that the sports video moving target detection and tracking method based on the SIFT algorithm proposed in this paper achieves good results.
Data Availability
The labeled dataset used to support the findings of this study is available from the corresponding author upon request.
Conflicts of Interest
The authors declare no competing interests.
Acknowledgments
This study was sponsored by Shenyang Jianzhu University.