Abstract
This paper proposes an underwater target identification method based on local features, together with a feature tracking algorithm for acoustic image sequences. Feature detectors and descriptors are key to feature tracking. Their performance in underwater scenes is evaluated under changes of multiple target parameters, giving a comprehensive quantitative investigation into the performance of feature tracking. Experimental results confirm that the proposed algorithm can accurately track potential targets and determine, according to the tracking trajectories and statistical data, whether each potential target is a static target, a dynamic target, or a false alarm.
1. Introduction
Underwater target identification has a wide range of applications in biology, geophysics, oceanography, and the military [1–3]. With underwater acoustic imaging technology, the operation is carried out in two steps [4, 5]: (a) the sonar system is used to obtain images of underwater scenes in the region of interest (ROI) and (b) the acoustic images are processed and analysed to obtain potential targets. However, the disadvantages of acoustic images, such as high speckle noise, low resolution, and intensity alterations, make it difficult to identify the target in acoustic image sequences.
According to research in bionics [6], the biological vision system divides an object into several subsystems and realizes identification through the synthesis of local information. In acoustic image sequences, local features are image patterns that differ from their immediate neighbourhood [7, 8]. Analysing the local characteristics of the ROI not only yields target-related information but also provides clues for identifying the potential target. Usually, a local feature involves a detector and a descriptor.
Feature detection finds salient image regions and should be robust to the image transformations expected in practice. The leading algorithms [9–11] are divided into three categories: corner detection, spot detection, and region detection. A corner can be defined as a point with high curvature and is generally captured by the Harris detector or the Features from Accelerated Segment Test (FAST) detector. Instead of corners, spot detection concerns the local extrema of the response of certain filters. Both the Laplacian of Gaussian (LoG) and the determinant of the Hessian matrix (DoH) indicate the local structural information of the image. The difference of Gaussians (DoG) and the Fast-Hessian detector are faster approximations of LoG and DoH, respectively. Region detection divides the image into several regions according to similar properties of pixels. Typical methods include Maximally Stable Extremal Regions (MSER) detection.
Subsequently, a descriptor representing the local neighbourhood around each feature must be created. The most direct descriptor is the image block around the point. However, it has no invariance. The Scale Invariant Feature Transform (SIFT) [12] is probably the most famous local descriptor and has inspired many subsequent works. As an alternative to SIFT, Speeded Up Robust Features (SURF) [13] was designed to speed up computation. Moreover, binary descriptors [14, 15], namely, Binary Robust Independent Elementary Features (BRIEF), Oriented FAST and Rotated BRIEF (ORB), Binary Robust Invariant Scalable Keypoints (BRISK), and Fast Retina Keypoint (FREAK), use simple pixel comparisons to produce binary strings that are usually shorter.
Feature detection and description are not isolated tasks but are the foundation of feature tracking in image sequences. Feature tracking supports tasks including visual odometry, Simultaneous Localization and Mapping (SLAM), Augmented Reality (AR), and so on [16–18]. However, few comparative evaluations of local features have been performed for target representation in acoustic images. In particular, we are not aware of any work devoted to using feature tracking to identify targets. The main contributions of this paper fall into the following two categories:
(1) An extensive evaluation of detector-descriptor-based target representation is carried out. The performances are investigated under the influence of SNR, target position, and size, and the best-performing local features are selected for the underwater task.
(2) A feature tracking algorithm for target identification in acoustic image sequences is proposed. According to tracking trajectories and statistics, it can be determined whether the potential target is a static target, a dynamic target, or a false alarm.
This paper is organized as follows. Feature detectors and descriptors are analysed in Sections 2 and 3, respectively, where we briefly review each detector and descriptor and evaluate its performance through simulation experiments. In Section 4, we propose the feature tracking algorithm, describe the experimental parameters and implementation details, and discuss the results. The conclusions are drawn in Section 5.
2. Feature Detectors
Feature detectors associate image features with potential targets in acoustic images. The following sections review the Harris corner, FAST corner, DoG spot, Hessian spot, and MSER detectors and investigate their application to acoustic image sequences.
2.1. Harris Corner Detector
Given an image $I$, the autocorrelation matrix $M$ is calculated at each pixel $(x, y)$:

$$M(x, y) = \sum_{(u, v)} w(u, v) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}$$

where $I_x$ and $I_y$ denote the derivatives of image $I$ and $w$ is the window patch at position $(x, y)$. Let $\lambda_1$ and $\lambda_2$ be the eigenvalues of $M$; the local shape of the neighbourhood can then be classified as uniform (both small), edge region (one small, one large), or corner region (both large). In order to avoid the tedious computation of eigenvalues, corner points can be distinguished by the corner response value $R$, given by

$$R = \det(M) - k \operatorname{tr}^2(M) = \lambda_1 \lambda_2 - k (\lambda_1 + \lambda_2)^2$$

where $k$ is chosen empirically and takes values around 0.04. Taking as an example an actual acoustic image containing two ROIs in the underwater scene, the result of Harris corner detection is shown in Figure 1.
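The corner response described above can be sketched with NumPy as follows. This is a minimal illustration rather than the implementation used in the experiments: the uniform box window, the central-difference derivatives, and the names `harris_response` and `half_window` are assumptions introduced here.

```python
import numpy as np

def harris_response(image, k=0.04, half_window=1):
    """Return the corner response det(M) - k * trace(M)^2 at every pixel."""
    img = np.asarray(image, dtype=float)
    # Central-difference derivatives along rows (y) and columns (x).
    iy, ix = np.gradient(img)

    def window_sum(a):
        # Assumed uniform window: sum over a (2h+1) x (2h+1) patch.
        h = half_window
        padded = np.pad(a, h, mode="constant")
        out = np.zeros_like(a)
        for dy in range(2 * h + 1):
            for dx in range(2 * h + 1):
                out += padded[dy:dy + a.shape[0], dx:dx + a.shape[1]]
        return out

    sxx = window_sum(ix * ix)
    syy = window_sum(iy * iy)
    sxy = window_sum(ix * iy)
    det = sxx * syy - sxy ** 2
    trace = sxx + syy
    return det - k * trace ** 2
```

On a synthetic bright square, the response is positive at the corners, negative along the edges, and zero in flat areas, which matches the eigenvalue-based classification.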

2.2. FAST Corner Detector
FAST employs a circle of 16 pixels around the pixel of interest, numbered from 1 to 16 clockwise. Let $I_p$ be the intensity of the pixel $p$ and set a threshold value $t$. If a set of N contiguous pixels in the circle are all brighter than $I_p + t$ or all darker than $I_p - t$, then $p$ is labelled as a candidate FAST corner. Empirically, N is chosen to be 12. The following score is computed for each candidate point:

$$V = \max\left( \sum_{x \in S^{+}} |I_x - I_p| - t,\ \sum_{x \in S^{-}} |I_p - I_x| - t \right)$$

where $S^{+}$ is the subset of pixels brighter than $I_p + t$ and $S^{-}$ is the subset of pixels darker than $I_p - t$. It can be seen from the FAST corner detection results shown in Figure 2 that detected points are also distributed on the edge of the ROI area. Only one feature is detected in ROI1, while three features are detected in ROI2. By comparison, the FAST detector determines the feature according to the intensity as well as the position.
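The segment test and score can be illustrated with the short sketch below. It assumes the usual radius-3 circle offsets and subtracts the threshold from each pixel difference inside the sums; the function names are hypothetical, and the caller must keep the pixel at least 3 pixels away from the image border.

```python
import numpy as np

# 16 (dx, dy) offsets on a radius-3 circle, clockwise from the top.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def longest_circular_run(flags):
    """Length of the longest run of True values on a circular sequence."""
    if all(flags):
        return len(flags)
    best = run = 0
    for f in list(flags) + list(flags):  # doubling handles wrap-around
        run = run + 1 if f else 0
        best = max(best, run)
    return best

def fast_score(image, x, y, t, n=12):
    """Return the FAST score at (x, y), or None if the segment test fails."""
    img = np.asarray(image, dtype=float)
    ip = img[y, x]
    ring = np.array([img[y + dy, x + dx] for dx, dy in CIRCLE])
    brighter = ring > ip + t
    darker = ring < ip - t
    if max(longest_circular_run(brighter), longest_circular_run(darker)) < n:
        return None
    # Assumption: the threshold is subtracted per pixel inside each sum.
    v_bright = np.sum(np.abs(ring[brighter] - ip) - t)
    v_dark = np.sum(np.abs(ip - ring[darker]) - t)
    return float(max(v_bright, v_dark))
```

Candidates with a higher score are typically preferred when neighbouring pixels both pass the segment test.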

2.3. DoG Spot Detector
Detecting local extrema using DoG is a part of the SIFT algorithm, and the descriptor part is described in Section 3.1. The scale space representation $L(x, y, \sigma)$ at different scales is obtained by convolution of the image $I(x, y)$ and the Gaussian kernel:

$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)$$

where $\sigma$ is the scale space factor and $G(x, y, \sigma)$ is the Gaussian kernel function, expressed as

$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right)$$

A feature pixel is identified by comparing the value of a pixel with its 8 neighbours in the same scale and the 18 pixels in the two neighbouring scales. If the pixel value at $(x, y)$ corresponds to a maximum in this neighbourhood, then it is labelled as a feature. The DoG spot detection result is shown in Figure 3. It is clear that one feature is detected in ROI1, and three features are detected around ROI2, of which one is located in the center and two are distributed at the edge. Similar to FAST, more DoG features are extracted from the ROI area.
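The 26-neighbour comparison can be sketched as follows. The sketch assumes that each DoG layer is the difference between two adjacent Gaussian-smoothed images, that the kernel is truncated at three standard deviations, and that image borders are replicated; none of these choices are stated in the text, and the function names are illustrative. Only maxima are tested, as in the description above; minima could be found by negating the image.

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    radius = int(np.ceil(3 * sigma))  # assumed truncation at 3 sigma
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(image, sigma):
    """Separable Gaussian smoothing of the image (borders replicated)."""
    k = gaussian_kernel_1d(sigma)
    r = len(k) // 2
    padded = np.pad(np.asarray(image, dtype=float), r, mode="edge")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def dog_maxima(image, sigmas):
    """Return (layer, y, x) where a DoG value exceeds all 26 neighbours."""
    blurred = [gaussian_blur(image, s) for s in sigmas]
    dog = np.stack([b - a for a, b in zip(blurred[:-1], blurred[1:])])
    found = []
    for s in range(1, dog.shape[0] - 1):
        for y in range(1, dog.shape[1] - 1):
            for x in range(1, dog.shape[2] - 1):
                cube = dog[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
                centre = dog[s, y, x]
                # Strict maximum: the centre is the unique largest value.
                if centre == cube.max() and np.sum(cube == centre) == 1:
                    found.append((s, y, x))
    return found
```

The explicit loops keep the comparison easy to follow; a practical implementation would vectorize this step and add a contrast threshold.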

2.4. Hessian Spot Detector
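Similar to DoG, the Hessian detector is the detection part of the SURF algorithm. The Hessian matrix with scale $\sigma$ at point $\mathbf{x} = (x, y)$ is defined as

$$H(\mathbf{x}, \sigma) = \begin{bmatrix} L_{xx}(\mathbf{x}, \sigma) & L_{xy}(\mathbf{x}, \sigma) \\ L_{xy}(\mathbf{x}, \sigma) & L_{yy}(\mathbf{x}, \sigma) \end{bmatrix}$$

Transforming the convolution operation into a box filtering operation, the score value of the candidate points can be calculated by

$$\det(H_{\text{approx}}) = D_{xx} D_{yy} - (0.9 D_{xy})^2$$

where $D_{xx}$, $D_{yy}$, and $D_{xy}$ are the convolution results of the box filters and the approximate coefficient of $D_{xy}$ is 0.9. The Hessian spot detection result is shown in Figure 4. Two features are detected, one in the center of each ROI. Compared with the previous algorithms, the Hessian algorithm ignores the edge points.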
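To illustrate how the score separates blob-like structure from other patterns, the sketch below evaluates the weighted determinant on second derivatives. It swaps the box filters for plain finite differences, which is a simplification and not the SURF implementation; the function names are hypothetical.

```python
import numpy as np

def second_derivatives(smoothed):
    """Finite-difference stand-ins for the box-filter responses."""
    L = np.asarray(smoothed, dtype=float)
    dy, dx = np.gradient(L)          # derivatives along rows and columns
    dyy, _ = np.gradient(dy)
    dxy, dxx = np.gradient(dx)
    return dxx, dyy, dxy

def hessian_score(dxx, dyy, dxy, weight=0.9):
    """Approximate Hessian determinant with the 0.9 weight on the mixed term."""
    return dxx * dyy - (weight * dxy) ** 2
```

A blob yields a positive score at its center, while a saddle-shaped pattern yields a negative one, so thresholding the score keeps spot-like structures.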

2.5. MSER Detector
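Let $R$ denote the boundary of a connected region, let $\partial R$ denote the pixels on the border, and let $\operatorname{int}(R)$ denote the pixels in the area contained by the border. The determination of the extreme value area can be achieved by

$$I(p) > I(q) \ \text{ for all } p \in \operatorname{int}(R),\ q \in \partial R \quad \text{or} \quad I(p) < I(q) \ \text{ for all } p \in \operatorname{int}(R),\ q \in \partial R$$

The stability of the area can be defined as

$$\Psi(R_i) = \frac{|R_{i+\Delta}| - |R_{i-\Delta}|}{|R_i|}$$

where $|R_i|$ is the area surrounded by the boundary $R_i$ at intensity threshold $i$ and $\Delta$ is the threshold step. The results are shown in Figure 5. Three MSERs are found in ROI1, and five MSERs are found in ROI2. This shows that stable extremal regions are formed around the ROI.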

2.6. Performance Evaluation of Detectors
An underwater scene simulation model is established to evaluate the performance of the detectors. The parameters are set as follows: the coverage sector is 140°, the number of beams is 256, the image size is 201 × 201, and the resolution grid is 0.01 × 0.01 m². A square target is constructed with an SNR of 15 dB, a length of 0.15 m, and a center position of (−0.07 m, 0.87 m). The background follows the Weibull distribution, characterized by a scale parameter and a shape parameter. The simulation is shown in Figure 6, in which the target is surrounded by a dashed frame. The accuracy performance can be defined as

$$P_a = \frac{N_t}{N} \times 100\%$$

where $N$ is the total number of features and $N_t$ is the number of features falling into the man-made target.
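The accuracy measure amounts to counting the detected features that fall inside the square target. A minimal sketch, assuming features are given as (x, y) coordinates in metres and the target is axis-aligned (the function name is hypothetical), returns the value as a fraction rather than a percentage:

```python
def detection_accuracy(features, centre, length):
    """Share of features (x, y) lying inside an axis-aligned square target."""
    if not features:
        return 0.0
    cx, cy = centre
    half = length / 2.0
    inside = sum(1 for x, y in features
                 if abs(x - cx) <= half and abs(y - cy) <= half)
    return inside / len(features)
```

For the reference target, the call would be `detection_accuracy(features, (-0.07, 0.87), 0.15)`.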

The influence of size, SNR, and position on detection performance is investigated based on Figure 6. The simulated target length is between 0.03 m and 0.3 m, the SNR is between −10 dB and 40 dB, and the center position is placed randomly. Each detector mentioned above is then applied while one parameter is varied, and each set of experiments is repeated 200 times. The relationships between the average detection accuracy and each parameter are shown in Figure 7. As shown in Figure 7(a), when the target size is larger than 0.06 m, Hessian and MSER fluctuate around 50% and 40%, FAST and Harris stabilize at 18% and 5%, and DoG slowly rises to 4%. In Figure 7(b), the detection rates gradually increase when the SNR is greater than 10 dB and drop below 5% when the SNR is less than 5 dB. As for the effect of distance, shown in Figure 7(c), the detection rates of Hessian, FAST, and MSER decrease significantly, while Harris and DoG always remain at a low level.

Figure 7: Average detection accuracy versus (a) target size, (b) SNR, and (c) target distance.
Table 1 shows the statistical characteristics, where is the mean square deviation of the accuracy rate and and represent the total features and the accuracy rate in average per experiment. It can be seen from Table 1 and Figure 7 that both Harris and DoG obtain a higher and , while the lower makes the curve change smoothly, MSER and Hessian behave the exact opposite to the above two detectors, and FAST is moderate in all aspects. In addition, one can learn that the detection performance is most affected by the change of the SNR, the position change is the least, and the size is in the middle.
3. Feature Descriptors
In order to measure the similarity between features in an acoustic image sequence, descriptors have to be computed for the detected features. Four state-of-the-art feature descriptors, namely SIFT, SURF, BRISK, and FREAK, are examined.
3.1. SIFT Descriptor
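The SIFT descriptor is computed using the gradient magnitudes and orientations in a 16 × 16 window around the feature. These are stacked in 8-bin histograms formed in 4 × 4 subregions and weighted by a Gaussian window, yielding a descriptor vector of length 128. The gradient magnitude $m(x, y)$ and orientation $\theta(x, y)$ of a point $(x, y)$ in the Gaussian image $L$ are calculated by

$$m(x, y) = \sqrt{\left(L(x+1, y) - L(x-1, y)\right)^2 + \left(L(x, y+1) - L(x, y-1)\right)^2}$$

$$\theta(x, y) = \arctan\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}$$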
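The pixel differences and the magnitude-weighted orientation histogram can be sketched as below. The sketch uses `arctan2` so that the orientation covers the full circle, leaves border pixels at zero, and omits the Gaussian weighting and the 4 × 4 subregion layout; the function names are illustrative only.

```python
import numpy as np

def gradient_magnitude_orientation(L):
    """Pixel-difference gradient magnitude and orientation of a smoothed image."""
    L = np.asarray(L, dtype=float)
    dx = np.zeros_like(L)
    dy = np.zeros_like(L)
    dx[:, 1:-1] = L[:, 2:] - L[:, :-2]   # L(x+1, y) - L(x-1, y)
    dy[1:-1, :] = L[2:, :] - L[:-2, :]   # L(x, y+1) - L(x, y-1)
    return np.hypot(dx, dy), np.arctan2(dy, dx)

def orientation_histogram(m, theta, bins=8):
    """Magnitude-weighted histogram of orientations over [0, 2*pi)."""
    angles = np.mod(theta, 2 * np.pi)
    hist, _ = np.histogram(angles, bins=bins, range=(0.0, 2 * np.pi), weights=m)
    return hist
```

Applying `orientation_histogram` to each 4 × 4 subregion of the window and concatenating the 16 results would give the 128 entries mentioned above.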
3.2. SURF Descriptor
The SURF descriptor uses a fan-shaped sliding window, which rotates with a step length of 0.2 radians and accumulates the sums of the Haar wavelet response values $d_x$ and $d_y$. A square region is split up into 4 × 4 subregions, and the feature vector of each subregion is

$$\mathbf{v} = \left( \sum d_x,\ \sum d_y,\ \sum |d_x|,\ \sum |d_y| \right)$$
Each subregion has a descriptor vector containing 4 entries yielding a 64-element SURF descriptor. The corresponding and are given by
3.3. BRISK Descriptor
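The BRISK descriptor is composed as a bit string of length 512 by concatenating the results of simple brightness comparison tests. It applies the sampling pattern rotated by the angle $\alpha$ around the keypoint $k$. The bit-vector descriptor $d_k$ is assembled by performing all the short-distance intensity comparisons of point pairs $(\mathbf{p}_i^{\alpha}, \mathbf{p}_j^{\alpha})$, such that each bit $b$ can be expressed by

$$b = \begin{cases} 1, & I(\mathbf{p}_j^{\alpha}, \sigma_j) > I(\mathbf{p}_i^{\alpha}, \sigma_i) \\ 0, & \text{otherwise} \end{cases}$$

where $I(\mathbf{p}, \sigma)$ is the intensity at sampling point $\mathbf{p}$ after Gaussian smoothing with standard deviation $\sigma$.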
3.4. FREAK Descriptor
Similar to the BRISK descriptor, the FREAK descriptor also employs a hand-crafted sampling pattern. A binary descriptor $F$ is constructed by thresholding the difference between pairs of receptive fields with their corresponding Gaussian kernel:

$$F = \sum_{0 \le a < N} 2^{a}\, T(P_a)$$

where $P_a$ is a pair of receptive fields, $N$ is the desired size of the descriptor, and $T(P_a)$ is judged by

$$T(P_a) = \begin{cases} 1, & I(P_a^{r_1}) - I(P_a^{r_2}) > 0 \\ 0, & \text{otherwise} \end{cases}$$

where $I(P_a^{r_1})$ is the smoothed intensity of the first receptive field of the pair $P_a$.
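Both binary descriptors reduce to a sequence of pairwise intensity tests, each contributing one bit weighted by a power of two. The following sketch packs such tests into an integer; it assumes the smoothed intensities of the sampling points or receptive fields are already available in a list, and the function name and index-pair input are hypothetical.

```python
def binary_descriptor(smoothed, pairs):
    """Pack pairwise intensity tests into an integer bit string.

    smoothed: smoothed intensities, one per sampling point or receptive field.
    pairs:    list of (first, second) index pairs; the bit for pair number a
              is set when the first intensity exceeds the second.
    """
    bits = 0
    for a, (i, j) in enumerate(pairs):
        if smoothed[i] - smoothed[j] > 0:
            bits |= 1 << a
    return bits
```

For example, intensities `[5, 3, 8]` with pairs `[(0, 1), (1, 2), (2, 0)]` give the test results 1, 0, 1 and therefore the descriptor value 5.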
3.5. Descriptor Matching
It is necessary to select an appropriate similarity measure for the tracking application. SIFT and SURF descriptors are matched using the nearest-neighbour distance ratio [7]. In contrast, binary descriptors such as BRISK and FREAK use the Hamming distance to measure similarity [15]. Since only a bitwise XOR operation and a counting operation are required, the computational complexity is greatly reduced compared with the Euclidean distance. Figure 8 shows the matching result using SURF descriptors in two consecutive acoustic frames. Although the background contains a lot of noise and the target intensity and position vary dramatically, the targets in the previous frame are located precisely in the next frame by the SURF descriptors.
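The two similarity measures can be contrasted in a few lines. In this sketch, binary descriptors are assumed to be packed into `uint8` arrays, and the ratio threshold of 0.8 is a commonly used value rather than one taken from this paper; both function names are illustrative.

```python
import numpy as np

def hamming_distance(a, b):
    """Number of differing bits between two uint8-packed binary descriptors."""
    a = np.asarray(a, dtype=np.uint8)
    b = np.asarray(b, dtype=np.uint8)
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

def ratio_test_match(query, candidates, max_ratio=0.8):
    """Index of the nearest candidate, or None if the match is ambiguous.

    Requires at least two candidates. The match is accepted only when the
    nearest Euclidean distance is clearly smaller than the second nearest.
    """
    diffs = np.asarray(candidates, dtype=float) - np.asarray(query, dtype=float)
    d = np.linalg.norm(diffs, axis=1)
    order = np.argsort(d)
    best, second = d[order[0]], d[order[1]]
    if best < max_ratio * second:
        return int(order[0])
    return None
```

The Hamming distance needs only XOR and bit counting, whereas the ratio test requires a floating-point norm for every candidate.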

3.6. Performance Evaluation of Descriptors
The same model as in Section 2.6 is used to compare the matching performance, and the acoustic image shown in Figure 6 is taken as the reference image. The reference target length is set to 0.15 m, the SNR to 15 dB, and the center position to (−0.07 m, 0.87 m). The evaluation of matching performance usually uses the nearest neighbour matching accuracy, which is defined as

$$P_m = \frac{N_m}{N_m + N_e}$$

where $N_m$ is the matching number and $N_e$ is the mismatching number.
The influences of the three target parameters on matching performance are shown in Figure 9. As observed from Figure 9(a), SURF maintains an accuracy rate above 80% except for individual breakpoints, and the other descriptors have relatively lower matching accuracy. In Figure 9(b), both SIFT and SURF have a higher matching rate over a wider range of SNR variation, whereas BRISK and FREAK obtain a lower matching probability when the SNR varies widely. In particular, the SIFT descriptor shows strong matching ability when the SNR is above 20 dB. Figure 9(c) shows that BRISK and FREAK have a high matching probability only for a slight change of distance, while SIFT and SURF still maintain a higher matching probability for a large fluctuation of distance. As the distance increases further, SURF has better matching performance than SIFT.

Figure 9: Matching accuracy versus (a) target size, (b) SNR, and (c) target distance.
Table 2 lists the corresponding statistical characteristics. is the matching probability of feature pairs in average per experiment, and is the mean square error of . It is observed that SURF is more robust with the variation of target location and size, SIFT has the best robustness when SNR changes, and binary descriptors have lower matching performance in all respects. Therefore, the most significant factor for matching is SNR.
4. Target Identification Using Feature Tracking
4.1. Feature Tracking Algorithm
Inspired by the Track Before Detect (TBD) strategy [19, 20], we propose a feature tracking approach for underwater target identification. It does not judge whether there is a target in a single frame but tracks multiple targets at the same time and then discriminates the potential targets according to their motion trajectories. In the underwater application, the features are used to characterize the potential targets, and the matching of descriptors measures the relevance of potential targets between frames. The flowchart of the proposed algorithm, shown in Figure 10, comprises five main stages:
(1) Input the first frame $I_1$, obtain the feature set $F_1$, and save it as a template $T$.
(2) Read the subsequent frame $I_t$, acquire the feature set $F_t$, and match the extracted features against $T$.
(3) Treat the matched features as potential targets and update the corresponding features in $T$ from $F_t$.
(4) Remove from $T$ the features that fail to match in $k$ consecutive frames (considering that acoustic imaging is susceptible to environmental interference, resulting in insufficient stability, $k$ is rounded to 10% of the total number of frames).
(5) After traversing the entire image sequence, determine whether the remaining features represent real targets and then obtain the feature trajectories.
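The five stages can be summarized in the following sketch. It is a simplified illustration under stated assumptions: each frame is a list of (position, descriptor) pairs, the matching rule is supplied by the caller as a `match` function, two template features are allowed to claim the same frame feature, and all names (`track_features`, `total_offset`, `match`) are hypothetical rather than taken from the original implementation.

```python
import math

def track_features(frames, match, k=None):
    """Track template features through a sequence of frames.

    frames: list of frames, each a list of (position, descriptor) pairs.
    match:  function (descriptor, list_of_descriptors) -> index or None.
    k:      consecutive misses that remove a feature from the template;
            defaults to 10% of the number of frames, rounded.
    """
    if k is None:
        k = max(1, round(0.1 * len(frames)))
    # Stage 1: the features of the first frame form the template.
    template = [{"desc": d, "track": [p], "misses": 0} for p, d in frames[0]]
    for frame in frames[1:]:
        descs = [d for _, d in frame]          # Stage 2: features of frame t
        for entry in template:
            j = match(entry["desc"], descs)
            if j is None:
                entry["misses"] += 1
            else:                              # Stage 3: update the template
                entry["misses"] = 0
                entry["desc"] = frame[j][1]
                entry["track"].append(frame[j][0])
        # Stage 4: drop features unmatched in k consecutive frames.
        template = [e for e in template if e["misses"] < k]
    return template                            # Stage 5: remaining trajectories

def total_offset(track):
    """Sum of the distances between consecutive positions of a trajectory."""
    return sum(math.dist(a, b) for a, b in zip(track[:-1], track[1:]))
```

The trajectories returned for the surviving features, together with statistics such as `total_offset`, are the inputs for the final decision between static target, dynamic target, and false alarm.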

According to the previous research, five combinations of Hessian + SURF, DoG + SIFT, MSER + SURF, FAST + BRISK, and FAST + FREAK are selected for feature tracking research, and an acoustic image sequence with 15 frames is simulated. In the initial frame, set the target length as , SNR as , and the center position at . The mobile distance of the target center x in each subsequent frame is the random number within and that of y is the random number within . The range of target SNR is , and the range of target size is . Four test scenarios are designed as shown in Table 3: Test I is only the change of target center position, Test II is the change of target center position and SNR, Test III is the change of target center position and size, and Test IV is the change of all three parameters.
The matching of local features associated with the target in two consecutive frames is regarded as successful tracking, and the measure of tracking performance is given by

$$P_t = \frac{F_s}{F}$$

where $F_s$ is the number of frames with successful target tracking and $F$ is the total number of frames. The test is repeated 200 times under the four scenarios, and the performance comparisons of feature tracking are shown in Figure 11, where $\bar{P}_t$ is the average tracking rate. It is clear that Hessian + SURF reaches the highest $\bar{P}_t$, DoG + SIFT and MSER + SURF achieve similar $\bar{P}_t$, and local features using binary descriptors perform worse; in particular, FAST + FREAK obtains the lowest $\bar{P}_t$. In addition, $\bar{P}_t$ is highest when the target position changes alone, and it decreases when the target SNR or size also fluctuates.

4.2. Experimental Results and Analysis
A typical dataset collected during a trial is used to verify the effectiveness of the proposed method. The acoustic image sequence contains 38 acoustic images with a size of 776 × 646 and a resolution grid of 0.02 × 0.02 m². It shows a 16 × 13 m² water scene parallel to the water surface, in which a circular steel tank with a diameter of 0.35 m is used as a static target and a small ball with a diameter of 0.2 m is used as a dynamic target.
From the initial frame shown in Figure 12(a), it is observed that the static target is centered at (1.1 m, 12.8 m) and the dynamic one is located at (6 m, 12.4 m); the red dotted box represents the area in which the dynamic target moves. The motion trajectory is divided into two sections. The first section, shown in Figure 12(b), runs from frame 1 to frame 20. The dynamic target approaches the static target horizontally from right to left, and the total distance travelled by the target is 2.70 m. The second section, shown in Figure 12(c), runs from frame 21 to frame 38. The dynamic target moves horizontally in the reverse direction, and the total distance travelled is 2.68 m. Statistically, the distance travelled by the target between adjacent frames ranges from 0.05 m to 0.27 m, with an average of 0.15 m and a mean square error of 0.06.

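Figure 12: (a) Initial frame; (b) first trajectory section, frames 1 to 20; (c) second trajectory section, frames 21 to 38.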
The tracking process of the features is shown in Figure 13. In the initial frame, 131 Hessian + SURF, 1220 DoG + SIFT, and 35 MSER + SURF features are acquired, of which 7, 6, and 5 features, respectively, remain in the end frame. In contrast, only 3 FAST + BRISK and 3 FAST + FREAK features are obtained in the initial frame; all FAST + FREAK features are lost by the 10th frame, and only one FAST + BRISK feature remains at the end. Comparing the feature coordinates with the motion trajectory shows that only Hessian + SURF successfully tracked the dynamic target, with 30 frames corresponding to the dynamic target. DoG + SIFT and MSER + SURF lost the dynamic target at the 18th and the 10th frame, respectively.

The tracking statistics are shown in Table 4. The total offset of Hessian + SURF, which tracks successfully, is close to the actual distance travelled by the target, while the offsets of DoG + SIFT and MSER + SURF, which fail to track, are close to the actual distance in the first section and quite different in the second section. In addition, Hessian + SURF, DoG + SIFT, MSER + SURF, and FAST + BRISK have 37, 25, 29, and 29 frames corresponding to static targets, respectively. Analysing the offsets and coordinates of the remaining features shows that the features located around the static target are related to the cable of the circular steel tank; these, together with the features without clear correspondence, are judged as false alarms.
5. Conclusion
We have introduced local features for identifying underwater targets, investigated the feature detectors and descriptors for target representation, and proposed a novel feature tracking algorithm. Hessian + SURF achieves robust tracking results for dynamic targets. By comparison, DoG + SIFT acquires more features and can better track multiple static targets. The remaining combinations have relatively poor tracking results. When several frames fail to match during the tracking process, the feature is still passed on unless the rejection condition is triggered, so the potential target is not lost. The algorithm used in this paper can be applied to linear and uncertain nonlinear systems [21–25].
Data Availability
No data were used to support this study.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This study was supported by the National Natural Science Foundation of China (no. 61903050), the Natural Science Foundation of Jiangsu Province (no. BK20181033), and the Natural Science Fundamental Research Project of Colleges and Universities in Jiangsu province (no. 18KJB120001).