Abstract
Under the interference of complex underwater terrain and light refraction, the feature information of small-scale targets is severely degraded. Moreover, the unbalanced distribution of underwater target samples affects the accuracy of spatial semantic feature extraction. To address these problems, this paper proposes a dynamic multiscale feature fusion method for underwater target recognition. Firstly, a multiscale information noise-contrastive estimation (MS-InfoNCE) loss is used to extract the significant features of the target at four scales. Secondly, the method learns the spatial semantic features of the target through a dynamic conditional probability matrix. Finally, different feature fusion mechanisms are designed for targets of different scales, dynamically fusing multiscale significant features and spatial semantic features to recognize weak underwater targets. The experimental results show that the recognition accuracy of the proposed algorithm is 1.38% higher than that of existing algorithms when recognizing underwater distorted targets.
1. Introduction
The underwater environment is poorly lit and the background is complex, so targets are easily submerged in the background. In the absence of data labels, AUVs have difficulty recognizing weak targets with very low signal-to-noise ratios. Unsupervised representation learning can extract significant features that distinguish targets, and downstream tasks such as recognition and classification can then be accomplished effectively from these features. However, targets at multiple scales may exist simultaneously in the underwater images acquired by an AUV, and small-scale targets tend to disappear during downsampling. Multiscale features can improve the recognition accuracy of small-scale targets, but they significantly increase the computational cost of the algorithm [1–4].
Small-scale targets are more susceptible to environmental disturbance. Underwater images are distorted by the refraction of light, so the AUV cannot extract the complete significant features of a target. A graph neural network can learn the spatial semantic features of targets from a correlation matrix, which compensates for the missing features of distorted targets. The traditional correlation matrix is usually derived from the label cooccurrence relationships of the training set [5–7]. However, different target images are not equally easy to acquire in the underwater environment, so the labels in the training set are unevenly distributed. Moreover, some rare cooccurrence relationships may be noisy. In this case, a correlation matrix constructed from label cooccurrence relationships has clear limitations.
As shown in Figure 1, to address the above problems, this paper proposes a dynamic multiscale feature fusion method for underwater target recognition. Firstly, the significant features of the target are extracted at multiple scales, maximizing the retention of feature information of small-scale targets. Secondly, the method learns the spatial semantic relations of the target, which are used to compensate for the missing features caused by factors such as image distortion. Finally, the proposed algorithm dynamically fuses multiscale significant features and spatial semantic features for target recognition.

The main contributions of this paper are as follows: (1) The proposed algorithm uses a multiscale InfoNCE loss to extract the significant features of the target at four scales. In the absence of data labels, the method can extract discriminative features of targets at different scales. (2) The label cooccurrence correlation matrix is improved. This paper constructs a new dynamic conditional probability correlation matrix using the data labels of the training set and of the current training batch. When labels are unevenly distributed in the dataset, the matrix can effectively model the spatial semantic relationships of targets. (3) This paper proposes a dynamic multiscale feature fusion mechanism. The method dynamically fuses multiscale significant features and spatial semantic features according to the target scale, which reduces the recognition time of the algorithm while improving its recognition accuracy.
2. Related Works
The refraction of light makes it difficult for the AUV to extract the complete significant features of the target. Spatial semantic features can compensate for the missing features. Yang and Zhou [8] proposed combining structured semantic relevance to solve the problem of missing labels in multilabel learning; this method captures the structured correlations between categories by constructing a semantic graph of the images. Yan et al. [9] proposed a feature attention network (FAN) containing a feature refinement network and a relevance learning network, which can address inconsistent object scales and unbalanced category labels. Li et al. [10] used a graph convolutional network and adaptive labeled graphs to learn label correlation; the method generates adaptive labeled graphs through two convolutional layers. Yun et al. [11] proposed a dual aggregated feature pyramid network for multilabel classification, which requires no region proposals and significantly reduces the computational burden. To address the difficulty of correctly classifying classes with complex features but few samples, Zhi [12] proposed an end-to-end convolutional neural network based on a multipath structure. Wang et al. [13] proposed a multilabel classification method that learns through privileged information, using similarity constraints to capture the relationship between available and privileged information and ranking constraints to capture the dependencies between multiple labels. To improve convergence speed, Cai et al. [14] designed an effective outer space acceleration algorithm (GAMP); experimental results show that it has higher computational efficiency. Gao and Zhou [15] designed a multicategory attention area module aimed at keeping the number of attention areas as small as possible while maintaining the diversity of these areas.
Multiscale significant features contain richer feature information, which facilitates the recognition of weak targets with very low signal-to-noise ratios. Ma et al. [16] proposed a multiscale spatial context-based semantic edge detection depth network (MSC-SED), which obtains rich multiscale features while enhancing high-level feature details. Guo et al. [17] proposed a radar target recognition method based on a feature pyramid fusion lightweight CNN, which can improve the accuracy and robustness of radar target recognition under low signal-to-noise ratio conditions. Ju et al. [18] proposed adaptive feature fusion with attention mechanism (AFFAM) for multiscale target detection, which can adaptively learn the importance and relevance of features at different scales. Jiang et al. [19] proposed a multiscale metric learning method (MSML) for small-sample learning that extracts multiscale features and learns multiscale relationships between samples. Wang et al. [20] proposed an unsupervised multiview representation learning method that eliminates the differences in multiview data caused by differing distributions. To address the inadequate performance of cross-entropy loss in the small-sample case, Lee and Yoo [21] enhanced the feature extraction network through contrastive learning. Chen et al. [22] extended existing contrastive learning algorithms by embedding an attention mechanism, which improves the learning efficiency and generalization ability of the algorithm.
To address underwater environmental interference and real-time requirements, Cai et al. [23] proposed a collaborative multi-AUV target recognition method based on transfer reinforcement learning. Sun and Cai [24] proposed a multi-AUV target recognition method based on GAN-metalearning; experimental results show that the method improves the generalization ability of the model. Cai et al. [25] proposed a maneuvering target recognition method based on multiview optical field reconstruction, which makes the recognition result insensitive to shooting angle. Chen et al. [26] proposed a new iterative visual inference framework that recognizes targets using both convolutional features and semantic features of images, effectively improving target recognition accuracy. To avoid redundant computation of data, Cai et al. [27] proposed a multiview optical field reconstruction method based on transfer reinforcement learning. Qin et al. [28] proposed a feature pyramid-based target detection algorithm that addresses the low accuracy of small-size target detection. Cai et al. [29] proposed an underwater distortion target recognition network that compensates for missing significant features with spatial semantic information, effectively improving the accuracy of recognizing underwater distorted targets.
This paper proposes a dynamic multiscale feature fusion method for underwater target recognition. Firstly, this paper constructs a multiscale significant feature extraction network that extracts the significant features of the target at four scales through a multiscale InfoNCE loss. Secondly, this paper establishes a dynamic conditional probability matrix between underwater targets to learn their spatial semantic features, compensating for the missing features of distorted targets. Finally, the proposed algorithm dynamically fuses multiscale significant features and spatial semantic features according to the target scale for target recognition.
3. Multiscale Significant Feature Extraction Model
The low light and complex background of the underwater environment cause weak target intensity in the images acquired by the AUV. Moreover, the target is distorted by the interference of light refraction. This paper enhances the original samples by color transformation, random cropping, and Gaussian blur. Different enhanced views from the same image are positive samples; those from different images are negative samples. This paper constructs a feature library to store all enhanced samples generated during training. For an image input to the feature extraction network, the feature library therefore contains positive samples from the same image and negative samples from different images. The multiscale significant feature extraction model is shown in Figure 2.
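To make the sampling scheme concrete, the following PyTorch-style sketch generates two augmented views of one image with the three augmentations named above; the specific transform parameters are illustrative assumptions, not values taken from the paper.

```python
import torch
from torchvision import transforms

# Two random views of the same image form a positive pair; views of
# different images are negatives. Parameter values here are assumptions.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),         # random cropping
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),                  # color transformation
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),   # Gaussian blur
    transforms.ToTensor(),
])

def two_views(pil_image):
    """Return a positive pair (query view, key view) for one image."""
    return augment(pil_image), augment(pil_image)
```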

This paper uses the ResNet network as the base network for the multiscale significant feature extraction model $f_q$. The network can be divided into 5 layers, denoted as $\{C_1, C_2, \ldots, C_5\}$, and the features extracted by each layer can be expressed as $\{F_1, F_2, \ldots, F_5\}$. This paper extracts the significant features of network layers 2-5, denoted as $\{F_2, F_3, F_4, F_5\}$. Each feature vector is projected nonlinearly through a fully connected layer to give the vector $z_i$. This paper trains the model using a multiscale InfoNCE loss function. The cosine similarity of the samples is

$$s(q, k) = \frac{g(q) \cdot g'(k)}{\|g(q)\|\,\|g'(k)\|},$$

where $g(q)$ denotes the nonlinear projection of the input image representation vector and $g'(k)$ denotes the nonlinear projection of the positive and negative sample representations in the feature library. The multiscale InfoNCE loss function of the model is

$$\mathcal{L}_{\mathrm{MS}} = -\sum_{i=2}^{5} w_i \log \frac{\exp\!\left(s(q_i, k_i^{+})/\tau\right)}{\exp\!\left(s(q_i, k_i^{+})/\tau\right) + \sum_{k^{-}} \exp\!\left(s(q_i, k^{-})/\tau\right)},$$

where $w_i$ denotes the weights of features at different scales, $q_i$ is the feature representation of the input image at each scale, $k_i^{+}$ denotes positive samples, $k^{-}$ are negative samples, and the temperature $\tau$ scales the similarity measure of the image representations.
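As an illustration of the loss above, the following sketch computes a multiscale InfoNCE loss over projected embeddings at scales 2-5 in the MoCo style; the per-scale weights, temperature, and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def ms_info_nce(queries, pos_keys, neg_keys, w=(0.1, 0.2, 0.3, 0.4), tau=0.07):
    """Multiscale InfoNCE loss over scales 2-5.

    queries[i]:  (B, D) projected query embeddings at scale i
    pos_keys[i]: (B, D) projected positive-sample embeddings at scale i
    neg_keys[i]: (K, D) projected negatives from the feature library
    w and tau are illustrative per-scale weights and temperature.
    """
    loss = 0.0
    for q, k_pos, k_neg, w_i in zip(queries, pos_keys, neg_keys, w):
        q = F.normalize(q, dim=1)            # cosine similarity via
        k_pos = F.normalize(k_pos, dim=1)    # normalized dot products
        k_neg = F.normalize(k_neg, dim=1)
        l_pos = (q * k_pos).sum(dim=1, keepdim=True)       # (B, 1)
        l_neg = q @ k_neg.t()                              # (B, K)
        logits = torch.cat([l_pos, l_neg], dim=1) / tau
        labels = torch.zeros(q.size(0), dtype=torch.long,
                             device=q.device)              # positive at index 0
        loss = loss + w_i * F.cross_entropy(logits, labels)
    return loss
```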
The feature library can make the number of negative samples larger and improve the training effect. However, this also increases the difficulty of updating the feature library encoder $f_k$. This paper dynamically updates the feature library encoder $f_k$ from the encoder $f_q$. Denoting the parameters of $f_k$ and $f_q$ as $\theta_k$ and $\theta_q$, respectively, $\theta_k$ is updated as

$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q,$$

where $m$ is the momentum coefficient. During training, $f_q$ updates its parameters by stochastic gradient descent; whenever $\theta_q$ is updated, $\theta_k$ is updated in the above way. After training is completed, the encoder $f_q$ can extract the multiscale significant features of an image, which can be expressed as

$$\{F_2, F_3, F_4, F_5\} = f_q(x),$$

where $x$ is the input test image and $f_q$ is the trained encoder.
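The momentum update is a simple exponential moving average of the query-encoder parameters; a minimal sketch follows, with $m = 0.999$ as an assumed value.

```python
import torch

@torch.no_grad()
def momentum_update(f_q, f_k, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q (no gradient flows to f_k)."""
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)
```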
4. Dynamic Spatial Semantic Feature Extraction Model
Due to the uneven distribution of labels in the dataset, this paper designs a dynamic conditional probability correlation matrix to represent the semantic correlation between targets. The entry $P_{ij}$ of the dynamic conditional probability matrix represents the conditional probability that label $L_j$ appears when label $L_i$ appears. Compared with the traditional correlation matrix, the dynamic conditional probability matrix is asymmetric, i.e., $P_{ij} \neq P_{ji}$. This paper also calculates local conditional probabilities over the current batch of training data, which further increases the robustness of the semantic relation model. The cooccurrences of targets are counted separately in the training set and in the current training batch to obtain the static cooccurrence matrix $M$ and the local cooccurrence matrix $\tilde{M}$. The conditional probability matrix between targets is then

$$P_{ij} = \alpha\,\frac{M_{ij}}{N_i} + \beta\,\frac{\tilde{M}_{ij}}{\tilde{N}_i},$$

where $\alpha$ and $\beta$ are the weights, $M_{ij}$ denotes the number of simultaneous occurrences of targets $i$ and $j$ in the training set, $N_i$ denotes the number of occurrences of target $i$ in the training set, $\tilde{M}_{ij}$ indicates the number of simultaneous occurrences of $i$ and $j$ in the current batch, $\tilde{N}_i$ is the number of occurrences of $i$ in the current batch, and $P_{ij}$ denotes the probability that label $L_j$ appears when label $L_i$ appears. Since some rare cooccurrence relations may be noisy, a probability threshold $\epsilon$ is set to filter the noise. The filtered matrix is

$$A_{ij} = \begin{cases} 0, & P_{ij} < \epsilon, \\ P_{ij}, & P_{ij} \geq \epsilon. \end{cases}$$
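A minimal sketch of the dynamic conditional probability matrix as reconstructed above; the weights `alpha` and `beta` and the threshold `eps` are assumed hyperparameters.

```python
import numpy as np

def dynamic_correlation(labels_train, labels_batch, alpha=0.7, beta=0.3, eps=0.1):
    """labels_*: (N, C) binary multilabel matrices.

    Returns the thresholded matrix A, where A[i, j] approximates
    P(label j | label i), mixing global and per-batch statistics.
    """
    def cond_prob(y):
        M = y.T @ y                        # M[i, j]: cooccurrences of i and j
        N = np.maximum(y.sum(axis=0), 1)   # N[i]: occurrences of label i
        return M / N[:, None]              # row-normalized: P(j | i)

    P = alpha * cond_prob(labels_train) + beta * cond_prob(labels_batch)
    A = np.where(P < eps, 0.0, P)          # filter noisy rare cooccurrences
    return A
```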
This paper constructs a spatial semantic feature extraction network, as shown in Figure 3. The network utilizes the dynamic conditional probability matrix to represent the semantic correlation between targets and updates the feature representation through information transfer between nodes. The spatial semantic feature extraction network can be represented as

$$H^{(l+1)} = \sigma\!\left(\hat{A}\,H^{(l)}\,W^{(l)}\right),$$

where $H^{(l)}$ denotes the spatial semantic features ($H^{(0)}$ in the initial state), $H^{(l+1)}$ is the updated spatial semantic feature, $\hat{A}$ is the normalized dynamic conditional probability correlation matrix, $W^{(l)}$ is the transformation matrix to be learned, and $\sigma$ is the nonlinear LeakyReLU activation function.
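The update rule corresponds to a single graph-convolution layer; a minimal PyTorch sketch follows, with embedding dimensions and the LeakyReLU slope left as assumptions.

```python
import torch
import torch.nn as nn

class SemanticGCNLayer(nn.Module):
    """One layer of H' = LeakyReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # transformation matrix
        self.act = nn.LeakyReLU(0.2)

    def forward(self, H, A_hat):
        # A_hat: (C, C) normalized dynamic conditional probability matrix
        # H:     (C, D) node (label) features
        return self.act(A_hat @ self.W(H))
```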

5. Dynamic Multiscale Feature Fusion Mechanism
The underwater images acquired by an AUV may contain targets of different scales. Multiscale features can effectively improve the recognition accuracy of small-scale targets, but they significantly increase the computational cost of the algorithm. This paper therefore proposes a dynamic multiscale feature fusion mechanism that fuses features from different scales to recognize targets at different scales.
The significant features output by the shallow layers of the network contain more detailed information. When recognizing small-scale targets, the algorithm fuses multiscale significant features with spatial semantic features, so the fused features contain both deep semantic information and shallow detail information, which improves the recognition accuracy of the algorithm. When recognizing normal-scale targets, the algorithm fuses only the significant features output by the conv5 layer with the spatial semantic features. The fused features can be expressed as

$$F = \begin{cases} \phi\!\left(\sum_{i=2}^{5} w_i\,U_{s_i}\!\left(f_q^{(i)}(x)\right),\ H\right), & r < r_0, \\ \phi\!\left(f_q^{(5)}(x),\ H\right), & r \geq r_0, \end{cases}$$

where $\phi$ denotes the feature fusion function, $x$ denotes the input test image, $f_q$ is the trained encoder, $i$ denotes the network layer, $w_i$ denotes the weights of features at different scales, $U_{s_i}$ denotes upsampling with multiplier $s_i$, $f_q^{(5)}(x)$ denotes the significant features output by the conv5 layer, $H$ are the spatial semantic features, $r$ denotes the ratio of the target size to the original image size, and $r_0$ is the scale threshold.
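A sketch of the scale-dependent fusion rule follows, assuming (as an illustration only) that the fusion function $\phi$ is channel concatenation, that all feature maps share one channel count, and that the scale threshold is 0.1.

```python
import torch
import torch.nn.functional as F

def dynamic_fusion(feats, semantic, r, r0=0.1, weights=(0.1, 0.2, 0.3, 0.4)):
    """feats: dict {2: F2, ..., 5: F5} of (B, C, H_i, W_i) feature maps.
    semantic: (B, C, H5, W5) spatial semantic features aligned with conv5.
    r: ratio of target size to image size; r0 is an assumed scale threshold.
    """
    if r < r0:
        # Small-scale target: upsample scales 2-5 to the conv2 resolution
        # and take a weighted sum before fusing with semantic features.
        size = feats[2].shape[-2:]
        sig = sum(w * F.interpolate(feats[i], size=size, mode="bilinear",
                                    align_corners=False)
                  for i, w in zip(range(2, 6), weights))
        sem = F.interpolate(semantic, size=size, mode="bilinear",
                            align_corners=False)
    else:
        # Normal-scale target: fuse only the conv5 significant features.
        sig, sem = feats[5], semantic
    return torch.cat([sig, sem], dim=1)   # assumed fusion: channel concat
```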
The above features are input to the classification layer (Cls) and the regression layer (Reg). The Reg layer outputs the vertex coordinates of the anchor boxes, and the Cls layer outputs the labels and confidence levels of the anchor boxes. For the feature mapping, the proposed method generates target anchor boxes. This paper trains the model by minimizing a loss function consisting of two parts, a classification loss and a regression loss:

$$L\!\left(\{p_i\}, \{t_i\}\right) = \frac{1}{N_{cls}} \sum_i L_{cls}\!\left(p_i, p_i^{*}\right) + \lambda\,\frac{1}{N_{reg}} \sum_i p_i^{*}\,L_{reg}\!\left(t_i, t_i^{*}\right),$$

where $i$ denotes the index of the anchor box, $p_i$ denotes the predicted probability of the target class in anchor box $i$, $p_i^{*}$ is the real label of anchor box $i$, $t_i$ denotes the vertex coordinates of the target anchor box, $t_i^{*}$ are the vertex coordinates of the ground truth, $L_{reg}$ is the smooth L1 function, $L_{cls}$ and $L_{reg}$ are given by the classification and regression layers, $N_{cls}$ and $N_{reg}$ normalize the loss terms ($N_{cls}$ is numerically equal to the minimum training batch size and $N_{reg}$ is equal to the number of target anchor boxes), $\lambda$ denotes the balance weight, and the predicted probability $p_i$ is obtained through the sigmoid function $\sigma$.
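A sketch of the two-part loss under the reconstruction above, using sigmoid cross-entropy for classification and smooth L1 for regression; the balance weight and the positive-anchor convention are assumptions.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, cls_targets, box_preds, box_targets,
                   n_cls, n_reg, lam=1.0):
    """cls_logits: (N,) raw scores; cls_targets: (N,) float 0/1 anchor labels.
    box_preds / box_targets: (N, 4) anchor vertex coordinates.
    n_cls: minibatch size; n_reg: number of target anchor boxes.
    """
    # Classification term: sigmoid cross-entropy over anchor labels.
    l_cls = F.binary_cross_entropy_with_logits(
        cls_logits, cls_targets, reduction="sum") / n_cls
    # Regression term: smooth L1, counted only for positive anchors.
    pos = cls_targets > 0
    l_reg = F.smooth_l1_loss(
        box_preds[pos], box_targets[pos], reduction="sum") / n_reg
    return l_cls + lam * l_reg
```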
6. Experimental Results and Analyses
6.1. Experimental Settings
6.1.1. Controller Hardware
In this experiment, training and testing are performed in TensorFlow. The simulations are run on a small server (RTX 3090 GPU, 64 GB of RAM, and Windows 10 64-bit operating system).
6.1.2. Experimental Dataset
Three datasets are used for training and testing: the Cognitive Autonomous Diving Buddy (CADDY) underwater dataset, the Underwater Image Enhancement Benchmark (UIEB), and the underwater target dataset (UTD). The multiscale significant feature extraction model is trained on 13,000 unlabeled images. In addition, 1278 labeled images are used to train and test the spatial semantic feature extraction model and the target recognition network. The dataset is divided into training and test sets in the ratio of 7 : 3.
6.1.3. Implementation Details
This paper uses a stochastic gradient descent (SGD) optimizer to train the model with a weight decay of 0.0005 and a momentum of 0.9. The minibatch size is 64, and the initial learning rate is 0.01. The entire training process runs for 50,000 iterations, with the learning rate decayed by a factor of 0.1 at 40,000 and 45,000 iterations.
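For reference, an equivalent optimizer and schedule in PyTorch might look like the following (the paper trains in TensorFlow, so this is illustrative only; the stand-in model is hypothetical).

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 4)  # stand-in for the recognition network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)
# Decay the learning rate by a factor of 0.1 at 40,000 and 45,000 iterations.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[40_000, 45_000], gamma=0.1)
```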
6.2. Experimental Results
In this paper, multiscale significant features and spatial semantic features are dynamically fused to recognize targets. For underwater weak targets under different interference, three sets of simulation experiments are designed in this section to verify the effectiveness of the proposed algorithm, comparing it with FISHnet [30], SiamFPN [31], SA-FPN [32], and the method in [33]. The evaluation criteria are mean average precision (mAP) and recognition time.
This paper conducts ablation experiments to verify the effects of the multiscale significant feature (MSF) extraction module and the spatial semantic feature (SSF) extraction module; the results are shown in Table 1. The multiscale significant feature extraction module improves mAP by 1.29%. Spatial semantic features yield a larger gain, improving mAP by 1.98%. These results show that both modules effectively improve the recognition accuracy of underwater targets.
The results of conventional underwater image recognition are shown in Table 2. For the torpedo and submarine classes, the algorithm in this paper has the highest recognition accuracy, 0.6715 and 0.9573, respectively. For the frogman class, the method in [33] has the highest recognition accuracy of 0.7614, slightly higher than that of the algorithm in this paper. For the AUV class, the recognition accuracy of SA-FPN is 0.93, higher than the 0.8732 of the algorithm in this paper. In terms of recognition speed, the method in [33] needs only 0.1 s, which is significantly faster than the algorithm in this paper; however, the algorithm in this paper achieves a higher mAP. This analysis shows that although the method in this paper does not perform best in some categories, it has a clear advantage in overall recognition accuracy. In addition, the algorithm in this paper is slightly faster than the multiscale target recognition algorithms SiamFPN and SA-FPN. The visualization results for conventional underwater images are shown in Figure 4.

The recognition accuracy and visualization results for underwater blurred images are shown in Figure 5 and Table 3. As Table 3 shows, the mAP of FISHnet is 0.6719, slightly higher than that of the algorithm in this paper. FISHnet also has the highest accuracy in recognizing torpedo targets, and SiamFPN has the highest recognition accuracy for submarine targets. The accuracy of the algorithm in this paper in recognizing frogman targets is 0.7759, better than that of the other comparison methods, while SA-FPN has the highest AUV recognition accuracy. These data show that the recognition accuracy of all algorithms decreases on blurred images; the proposed method improves the recognition accuracy of some targets through spatial semantic features. The mAP of the algorithm in this paper is comparable to that of FISHnet, but its recognition speed has a significant lead.

The recognition accuracy and visualization results for underwater distorted images are shown in Figure 6 and Table 4. The mAP of the algorithm in this paper is 0.7081, which is 1.38% higher than that of SiamFPN. For submarine and AUV targets, SiamFPN has the highest recognition accuracy, 0.8933 and 0.9421, respectively. For torpedo targets, the algorithm in this paper has the highest recognition accuracy, while for frogman targets, the method in [33] has the highest recognition accuracy and the fastest recognition speed. Overall, the algorithm in this paper achieves the highest mAP when recognizing underwater distorted targets.

7. Conclusion
This paper proposes a dynamic multiscale feature fusion method for underwater target recognition. Firstly, the significant features of the target are extracted at multiple scales, maximizing the retention of feature information of small-scale targets. Secondly, the method learns the spatial semantic relations of the target, which are used to compensate for the missing features caused by factors such as image distortion. Finally, the proposed algorithm dynamically fuses multiscale significant features and spatial semantic features for target recognition. The experimental results show that the multiscale significant feature extraction module improves mAP by 1.29%, while spatial semantic features yield a larger gain of 1.98%. The recognition accuracy of the proposed algorithm is 1.38% higher than that of existing algorithms when recognizing underwater distorted targets.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Acknowledgments
This paper is supported by the National Key Research and Development Project (2019YFB1311002).