Abstract
As a branch of target recognition, surface target recognition plays an irreplaceable role in both military and civilian applications. However, large variation in target size, low image resolution, and strict real-time requirements pose challenges to existing algorithms. To address these issues, we take YOLOv5 as the base framework and adopt coordinate attention and a double-layer cascade structure to improve both recognition performance and speed. Specifically, coordinate attention is introduced to guide the network to focus on discriminative features by capturing channel and location information. Meanwhile, the double-layer cascade structure is designed to finely extract and aggregate semantic and spatial features at different scales. We test the model on the COCO dataset, the VOC dataset, and a self-built surface target dataset. Experimental results show that the proposed coordinate attention module and multiscale module improve the recognition of multiscale surface targets and meet the real-time requirement.
1. Introduction
With the development of marine technology, the types and numbers of ships are increasing, and the tasks they perform are becoming more dangerous and complex. Both the handling of maritime emergencies and the construction of intelligent shipping place higher requirements on surface target recognition [1, 2]. At present, surface target identification has become a key technology for environmental sensing in unmanned surface vehicles (USVs) [3]. In the military field, surface target recognition is an important part of marine environment reconnaissance, precision targeting, and other tasks. In the civilian field, accurate recognition plays an irreplaceable role in water rescue, obstacle detection, etc.
Surface target recognition essentially belongs to target recognition. In recent years, deep convolutional neural networks (CNNs) have made great progress in the field of object detection and recognition [4] and have been successfully applied in medical diagnosis, face recognition, etc. Continuously updated networks and more sophisticated big data technology perform increasingly well on public datasets. However, it is still difficult to recognize surface targets in complex climatic situations. First, there are many types of surface targets with various postures. Their interclass differences are small and intraclass differences are large [5]. Simply extracting traditional features is no longer sufficient for practical needs. Second, because the size and sampling distance of surface targets are inconsistent, the scale of targets spans a large range, and the aspect ratio of the bounding box varies with the viewing angle. As shown in Figure 1, the three images indicated by the orange arrows are actually the same target, yet the sizes are completely different. In addition, the water surface is highly reflective. Image quality is more susceptible to weather conditions and background, resulting in low resolution, blurred edges, and easy confusion. Finally, surface target recognition technology is mainly used in environment perception tasks that have more stringent real-time requirements [6, 7].

In fact, the difficulty of surface target recognition lies mainly in the extraction of discriminative features and the recognition of multiscale targets. The key to extracting discriminative features is to focus on the detailed information that is beneficial to classification, and the main method for doing so is the attention mechanism [8]. Currently, there are two main types of methods for the multiscale problem: image pyramid and feature pyramid [9]. However, using them directly increases the network overhead and cannot meet the real-time requirement.
Therefore, YOLOv5 is used as the basic framework in this paper to ensure the speed of the algorithm. First, coordinate attention is introduced to effectively capture channel and location information. Unlike previous attention methods, it learns correlations between channels without additional computational overhead. Then, a double-layer cascade is used to stitch and enhance the feature maps at different scales through maximum pooling and parameter aggregation. To demonstrate the effectiveness of this algorithm, extensive experiments are conducted on the COCO dataset, the VOC dataset, and a self-built surface target dataset. Experimental results show that our network performs better than other methods on multiscale surface targets and meets the real-time requirements.
The contributions of this paper are summarized in three points: (1) To improve the real-time performance of the algorithm, the single-stage detection algorithm YOLOv5 is used as the basic framework. (2) Coordinate attention is introduced to capture the channel and location information of the network. (3) The double-layer cascade structure is used to improve the recognition ability for multiscale targets. The paper is organised as follows: the introduction is presented in Section 1. Related work is described in Section 2. The method is presented in Section 3. The experimental results and analysis are described in Section 4. Section 5 gives the conclusion and prospects.
2. Related Works
As the core technology of USV sensing, surface target recognition is a very challenging task. In particular, its study under complex climate conditions has considerable theoretical and practical significance [10]. The process of surface target recognition is shown in Figure 2. It needs to return the classification of the target contained in the image or video and relies on the support of big data. Unlike pure target recognition, it also requires the target's position. In this section, we give a literature review covering target detection and recognition methods based on deep learning, attention mechanisms, and multiscale methods.

2.1. Target Detection and Recognition Methods
Compared with traditional methods, target detection and recognition methods based on deep learning have significantly improved in terms of accuracy and generalizability. According to the number of stages, the deep learning-based target detection and recognition algorithms are divided into the two-stage method and the single-stage method. The former first extracts the borders of possible candidate regions and then inputs them into the region of interest (ROI) pooling layer together with the feature map, the advantage of which is high accuracy. The latter can directly regress the target category by delineating the selected frames according to the feature map, the advantage of which is fast speed.
Research began with the two-stage approach. R-CNN [11] was the first method to introduce deep learning into the field of object detection and achieve adaptive learning, and it was subsequently improved by many researchers. SPP-Net [12] introduces spatial pyramid pooling into R-CNN to reduce the impact of the input size on the network. Fast R-CNN [13] uses ROI pooling based on the SPP layer and achieves end-to-end training, which mainly improves the speed of the model. R-FCN [14] reduces the workload required for each ROI by constructing location-sensitive score maps to achieve speedup.
On the other hand, SSD [15] and YOLO [16–18] are typical one-stage approaches, also known as classification regression-based models. They are designed to directly classify and train predefined anchors without a proposal generation step. SSD draws on the anchor mechanism and regression idea of Faster R-CNN in its design, with six (or four) default boxes at each pixel point of the feature map. YOLO divides the feature map into a grid and then regresses the corresponding default boxes directly, so it is fast. However, YOLO [16] does not introduce multiscale information, so it is difficult to obtain sufficiently rich target localization information when dealing with multitarget recognition. YOLOv2 [17] and YOLOv3 [18] introduce the anchor mechanism, which improves the detection and recognition accuracy. To ensure the real-time performance of the algorithm, we take YOLOv5 as the main framework.
2.2. Attention Mechanisms
Like human vision mechanisms, attention mechanisms in deep learning tend to focus on key information and ignore irrelevant information. They have been proven to be beneficial to a range of computer vision tasks. SENet [19] and CBAM [20] are typical networks applying attention mechanisms, the structures of which are illustrated in Figure 3. SENet focuses on the channel features of targets. It compresses the feature map and learns the interrelationships between channels. CBAM uses convolution with large kernels and combines spatial and channel features. Self-attention networks, including NLNet [21], GCNet [22], and A2Net [23], are popular because they can compute in parallel and learn long-range dependencies better. Nonlocal mechanisms are also important for capturing critical information. Later works, such as GENet [24], AA [25], and TA [26], continue to progress by designing different attention modules or fusing different information.

However, SE and CBAM do not learn both the importance of positional relationships and the correlations between different channels. Self-attention is not applicable to the surface target recognition task because of its large computational cost. Therefore, we choose an attention method, called coordinate attention, that learns both channel relations and positional dependencies.
2.3. Multiscale Methods
Multiscale variation is one of the major differences between surface target recognition tasks and other vision tasks. Large-scale targets are generally easier to detect and recognize because of their large area and rich features. Small-scale targets, with fewer features and lower resolution, are more difficult to locate and recognize accurately, yet they still account for a share of the targets in images. In a practical application scenario, the scale of the target is measured by the ratio of the target size to the image size.
As a challenging problem in target detection and recognition, the variation of target scales affects the accuracy and speed of the model. There are two main types of methods to deal with the multiscale problem in vision tasks: image pyramid and feature pyramid. In image pyramid, images are scaled at different scales and then directly input to the detector. Based on the image pyramid approach, SNIP [27] selects different proposals for different resolutions to perform gradient propagation in the multiscale training process. SNIPER [28] crops images around the ground truth box on the feature map and selects the context region. However, SNIP and SNIPER still suffer from an inevitable increase in inference time during use.
The idea of the feature pyramid is to approximate the image pyramid directly at the feature level. Initially, MS-CNN [29] handles objects of different sizes directly on different downsampled layers. Subsequently, TDM [30] and FPN [9] add new top-down branches to compensate for the lack of semantic information at the bottom layers, both of which continue the feature pyramid approach. PANet [31] enhances the feature hierarchy representation with additional bottom-up paths and proposes adaptive feature pooling to aggregate features from various scales.
3. Methodology
To meet the real-time requirements of the algorithm, this paper uses YOLOv5 as the main framework. First, coordinate attention is added to focus on key information. Then, a double-layer cascade is added to solve the problem caused by multiscale and low-resolution targets. As shown in Figure 4, the structure of our model is divided into three parts, the backbone, the double-layer cascade, and the prediction.

3.1. Coordinate Attention
Many attention mechanisms are used in deep CNNs and bring great improvement to network performance, but they lag significantly when used in small networks. The reason is that the computational overhead of most attention mechanisms is not affordable for small networks [32]. The common attention mechanisms are SE, BAM, and CBAM. SE only considers the internal channel information and ignores the importance of location information. BAM and CBAM try to introduce location information on the basis of SE, but they fail to learn the correlation across channels that is critical in identification tasks [33]. Therefore, this paper introduces an efficient and lightweight attention mechanism called coordinate attention, which embeds location information into channel attention. Figure 5 illustrates the structure of coordinate attention.

Generally, the attention module can be considered as a computational unit that is used to enhance the feature representation of the network. In coordinate attention, the channel attention is split into two parallel one-dimensional feature-encoding processes along the vertical and horizontal directions, respectively, to mitigate the loss of location information caused by global pooling. The two resulting feature maps carry different orientation information and are encoded as the attention map, so the location information is stored in the attention map. We divide the process of coordinate attention encoding into two steps: coordinate information embedding and coordinate attention generation. As shown in Figure 6, we mark the dimensions of the tensor in each step.

3.1.1. Coordinate Information Embedding
To encode channel relations and position correlations, the global pooling formulated in Equation (1) is divided into two one-dimensional encoding operations:

$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j). \tag{1}$$

Given the input feature $X$, the vertical and horizontal coordinates are encoded separately for each channel $c$ by using two different pooling kernels of size $(H, 1)$ and $(1, W)$, and the outputs are denoted as

$$z_c^h(h) = \frac{1}{W} \sum_{j=1}^{W} x_c(h, j), \tag{2}$$

$$z_c^w(w) = \frac{1}{H} \sum_{i=1}^{H} x_c(i, w). \tag{3}$$

The two transformations (2) and (3) act along the two spatial directions and generate a pair of feature maps. These two feature maps allow the attention module to learn feature correlations in one spatial direction while retaining position information in the other, which helps to accurately identify and locate surface targets.
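The directional pooling described above can be illustrated for a single channel. The following is a minimal sketch in plain Python rather than the implementation used in the paper; the function names and the toy 2 × 3 feature map are hypothetical.

```python
def pool_along_width(x):
    """Average each row over the width: one value per height index."""
    return [sum(row) / len(row) for row in x]


def pool_along_height(x):
    """Average each column over the height: one value per width index."""
    height = len(x)
    return [sum(col) / height for col in zip(*x)]


# One channel of a toy feature map with H = 2 rows and W = 3 columns.
x_c = [[1.0, 2.0, 3.0],
       [4.0, 5.0, 6.0]]

z_h = pool_along_width(x_c)    # length H, keeps the vertical position
z_w = pool_along_height(x_c)   # length W, keeps the horizontal position
print(z_h)  # [2.0, 5.0]
print(z_w)  # [2.5, 3.5, 4.5]
```

Global average pooling would collapse the same channel to the single value 3.5, whereas the two directional vectors still record where along each axis the response is strong.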
3.1.2. Coordinate Attention Generation
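The $z^h$ and $z^w$ generated in the first step are concatenated and fed into a shared $1 \times 1$ convolutional transformation function $F_1$, generating

$$f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right). \tag{4}$$

Here $[\cdot, \cdot]$ denotes the concatenation operation, $\delta$ denotes the nonlinear activation function, and $f \in \mathbb{R}^{C/r \times (H+W)}$ represents the feature map containing two directions of encoded information. We use $r$ as the scaling ratio to control the network overhead. Through experiments, $r$ is set to 24, which balances accuracy and speed. After that, $f$ is decomposed into two independent tensors $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$. These two are converted to tensors ($g^h$, $g^w$) with the same channel number as $X$ by the two $1 \times 1$ convolutional functions $F_h$ and $F_w$, respectively:

$$g^h = \sigma\left(F_h\left(f^h\right)\right), \tag{5}$$

$$g^w = \sigma\left(F_w\left(f^w\right)\right). \tag{6}$$

Here, $\sigma$ is the sigmoid function. $g^h$ and $g^w$ are expanded and treated as attention weights. Finally, the output of coordinate attention is represented as

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j). \tag{7}$$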
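To make the two steps concrete, the sketch below runs the whole coordinate attention computation on a tiny feature map in plain Python. It is illustrative only: the weight matrices `w1`, `wh`, and `ww` are hypothetical values, ReLU is assumed as the nonlinear activation, normalization layers are omitted, and the toy example reduces 2 channels to 1 rather than using the scaling ratio of 24 reported in the paper.

```python
import math


def conv1x1(weights, feats):
    """1x1 convolution on a [channels][positions] map: a per-position matrix product."""
    n_pos = len(feats[0])
    return [[sum(w * feats[c][p] for c, w in enumerate(row)) for p in range(n_pos)]
            for row in weights]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def coordinate_attention(x, w1, wh, ww):
    """x: [C][H][W] map; w1: [C/r][C]; wh, ww: [C][C/r] (hypothetical weights)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    # Step 1: coordinate information embedding (directional average pooling).
    z_h = [[sum(x[c][i]) / W for i in range(H)] for c in range(C)]
    z_w = [[sum(x[c][i][j] for i in range(H)) / H for j in range(W)] for c in range(C)]
    # Step 2: concatenate along the spatial axis, shared 1x1 conv, activation.
    cat = [z_h[c] + z_w[c] for c in range(C)]                     # [C][H + W]
    f = [[max(0.0, v) for v in row] for row in conv1x1(w1, cat)]  # ReLU (assumed)
    # Split into the two directions and map back to C channels with sigmoid gates.
    f_h = [row[:H] for row in f]
    f_w = [row[H:] for row in f]
    g_h = [[sigmoid(v) for v in row] for row in conv1x1(wh, f_h)]  # [C][H]
    g_w = [[sigmoid(v) for v in row] for row in conv1x1(ww, f_w)]  # [C][W]
    # Reweight each position by the product of its row gate and column gate.
    return [[[x[c][i][j] * g_h[c][i] * g_w[c][j] for j in range(W)]
             for i in range(H)] for c in range(C)]


x = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.5, 0.5], [0.5, 0.5]]]   # C = 2, H = 2, W = 2
w1 = [[0.5, -0.5]]               # reduces 2 channels to 1
wh = [[1.0], [-1.0]]
ww = [[0.5], [0.5]]
y = coordinate_attention(x, w1, wh, ww)
```

Because both gates lie strictly between 0 and 1, every output value is a damped copy of the input at the same position, and the output keeps the input shape.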
In order to have an intuitive understanding of the coordinate attention, we visualize the greyscale maps and heat maps. As can be seen in Figure 7, the greyscale map roughly reflects the overall profile of the surface target. On the other hand, the heat map filters out a large amount of irrelevant information and focuses on useful information. The darker the color on the map, the more significant the role of the area for classification.

3.2. Double-Layer Cascade
Multiscale means sampling the target at different granularities. In general, finer and denser sampling reveals more detail, while coarser and sparser sampling reveals the overall contour and shape. Distance variation and sensor zoom are the main physical reasons for the variable scale of the target in the image domain. As the target moves from near to far, the high-frequency detail and texture information gradually declines and the region occupied by the target in the image continues to shrink. First, it is necessary to ensure that the network can extract features at multiple scales, so we improve the spatial pyramid pooling (SPP). Second, a double-layer network is used to aggregate multiscale features.
3.2.1. Improved SPP-Net
SPP-Net is a general CNN framework, which breaks the limitation that the input image must have a fixed size in traditional CNNs. In order to make the model adaptive and capable of handling images of different sizes, the SPP is introduced in this paper and placed after the backbone. SPP has the following significant features: (1) SPP generates fixed-size outputs regardless of the size of the input image, which is convenient for subsequent network processing. (2) Multilevel pooling makes SPP more adaptive to the change of the size. (3) Due to the flexibility of image input size, SPP is more effective in detecting and recognizing multiscale targets. The key point of SPP is that fixed-size feature vectors can be extracted from multiscale features. Therefore, SPP also shows great strength in target detection and recognition.
Compared with previous YOLO versions, our proposed model adds an SPP module between convolutional layers. Figure 8 illustrates the structure of the SPP block. Unlike the original SPP module, the SPP module in this paper consists of four parallel branches: three maximum pooling branches with different kernel sizes and a skip connection. The three pooling sizes are used to extract local features at different scales. The skip connection is used to preserve the original global features. Finally, the branch outputs are concatenated, which expands the channel dimension of the tensor and fuses local and global features. This enriches the expressiveness of the feature map and helps with the wide range of scales in water scene images. Compared with using maximum pooling alone, the SPP module is more effective at enlarging the receptive field of the main features and significantly separates the contextual features.
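The four-branch structure can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the paper's code: the kernel sizes 5, 9, and 13 are the defaults commonly used in YOLOv5-style SPP blocks and are assumed here, and `max_pool_same` and `spp_block` are hypothetical helper names.

```python
def max_pool_same(x, k):
    """Stride-1 max pooling whose window is clipped at the border, so H x W is kept."""
    H, W, r = len(x), len(x[0]), k // 2
    return [[max(x[a][b]
                 for a in range(max(0, i - r), min(H, i + r + 1))
                 for b in range(max(0, j - r), min(W, j + r + 1)))
             for j in range(W)] for i in range(H)]


def spp_block(channels, kernel_sizes=(5, 9, 13)):
    """Concatenate the untouched input with one pooled copy per kernel size."""
    out = list(channels)                       # skip connection branch
    for k in kernel_sizes:                     # three max pooling branches
        out += [max_pool_same(ch, k) for ch in channels]
    return out


# A single 4 x 4 channel with values 0..15.
feature = [[[i * 4 + j for j in range(4)] for i in range(4)]]
fused = spp_block(feature)
print(len(fused))   # 4 channels: the input plus three pooled copies
```

Clipping the window at the border gives the same result as padding with negative infinity, so each branch keeps the spatial size and the outputs can be concatenated along the channel axis.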

3.2.2. Double-Layer Network
Surface target recognition needs not only local features with a small receptive field to capture detail information but also global features with a large receptive field to capture coarse-grained information such as shape and contour. As the CNN deepens, the network keeps downsampling. The semantic information becomes richer, and the spatial information becomes sparser. The last layer may even have a downsampling rate of 16 or 32. As a result, small targets in the original image retain little effective information on the feature map, and recognition performance decreases sharply. In surface target recognition, small-scale targets correspond to few pixels in the original image, so it is even more difficult to find the corresponding information after downsampling.
The improved SPP-Net ensures that the network can receive information at multiple scales; the key to the next step lies in how to extract multiscale features. Experiments show that neurons at higher levels respond strongly to global features, while other neurons are more sensitive to local textures and contours. That means that shallow layers are more related to detailed information and deeper layers are more related to semantic information. The feature pyramid network (FPN) was created from this starting point and has greatly promoted subsequent work on object detection and recognition. FPN mainly consists of four operations: bottom-up path, top-down path, lateral connection, and convolutional fusion, through which models obtain strong semantic features. However, focusing only on the semantic features from deep levels is not sufficient, because this approach tends to ignore the detailed information contained in the shallow features. The path aggregation network (PAN) is introduced to propagate the detail information in the shallow features from the bottom up.
Instead of using the FPN layers directly, our network adds two bottom-up feature pyramids (PAN) after the FPN layer, as shown in Figure 9. The FPN layer passes high-level semantic features from the top down. Although it enhances the whole pyramid, it only enhances the semantic information rather than the detail information. In this paper, we address this point by adding PAN, which conveys detailed localization features from the bottom up. $P_i$ is used to represent the feature maps generated by FPN. $N_i$ is used to represent the newly generated feature maps of the augmented path corresponding to $P_i$. $N_{i+1}$ is generated from the higher-resolution $N_i$ and the laterally connected $P_{i+1}$. The spatial size of $N_i$ is first reduced by convolution. Then, $P_{i+1}$ and the downsampled feature map are summed through the lateral connection. The obtained feature maps repeat the above steps until the iteration terminates.
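A short calculation shows why small targets fade at high downsampling rates. The snippet below is only an arithmetic illustration; the 24-pixel target width is a hypothetical value, and the strides 8, 16, and 32 are typical detector strides assumed for the example.

```python
def cells_covered(target_px, stride):
    """Side length of a target on the feature map, in cells, after downsampling."""
    return target_px / stride


# A hypothetical target that is 24 pixels wide in the input image.
footprints = {stride: cells_covered(24, stride) for stride in (8, 16, 32)}
for stride, cells in footprints.items():
    print(f"stride {stride}: {cells} cells")
```

At a stride of 32 the target covers less than one feature-map cell, so its response is mixed with the surrounding background.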

The purpose of this module is to transmit the semantic features from the deep layer to the shallow layer through FPN and the localization information from the shallow layer to the deep layer through PAN. The above two are combined to aggregate parameters at different scales to further improve the feature extraction capability of the model for multiscale targets.
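The bottom-up aggregation can be sketched on toy feature maps. This is a simplified illustration, not the paper's network: strided slicing stands in for the learned stride-2 convolution, the finest output level is assumed to equal the finest FPN level, only a single bottom-up pass is shown, and all function names are hypothetical.

```python
def downsample2(fmap):
    """Keep every second row and column (stand-in for a stride-2 convolution)."""
    return [row[::2] for row in fmap[::2]]


def add_maps(a, b):
    """Element-wise sum of two maps with the same spatial size."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def bottom_up_path(p_levels):
    """p_levels: FPN maps ordered from the highest resolution to the lowest."""
    n_levels = [p_levels[0]]                  # assumed: finest N equals finest P
    for p_next in p_levels[1:]:
        n_levels.append(add_maps(downsample2(n_levels[-1]), p_next))
    return n_levels


# Three FPN levels of sizes 4 x 4, 2 x 2, and 1 x 1, all filled with ones.
p = [[[1.0] * s for _ in range(s)] for s in (4, 2, 1)]
n = bottom_up_path(p)
print(n[2])  # [[3.0]]
```

The coarsest output accumulates contributions from every finer level, which is how localization detail from the shallow layers reaches the deep layers.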
4. Experiments
In this section, experiments are conducted on the COCO dataset, the VOC dataset, and our self-built surface target dataset. We describe the setup of our experiments and the way that the dataset is processed in Section 4.1. Then, a series of ablation experiments are performed to demonstrate the contribution of each proposed component to the performance of the model in Section 4.2. Finally, we compare our approach with state-of-the-art approaches on object detection and recognition.
4.1. Experimental Setup and Data Processing
The experiments in this paper are conducted on the Ubuntu 16.04 LTS operating system with the PyTorch 1.2.0 deep learning framework. The model is trained and debugged on an NVIDIA RTX graphics card. On the public datasets, COCO and VOC, the input images are cropped to a uniform size and the number of epochs is set to 300. A batch of 16 images is processed per iteration. The initial learning rate is $10^{-2}$; it is reduced to $10^{-3}$ after 100 epochs and to $10^{-4}$ after 200 epochs, where it stays until the end of training. SGD is used as the optimizer.
On the self-built dataset, the sizes of the bounding boxes are quite different from those in the COCO dataset, so this paper uses the $k$-means algorithm to recalculate the anchors. The input images use a fixed size, and the number of epochs is set to 600. A batch of 8 images is processed per iteration. The initial learning rate is $10^{-1}$; it is reduced to $10^{-2}$ after 100 epochs and then to $10^{-3}$, where it stays until the end of training. SGD is used as the optimizer.
To verify the recognition effect of the model on surface targets, a dataset is established by visible light sensor acquisition and manual annotation [5]. As can be seen in Figure 10, the dataset contains different scenes and multiple classes of surface targets such as fishing boats, yachts, buildings, bridge piers, and water drums. First, the bounding boxes and categories of surface targets are labeled using a tool called “LabelImg.” Then, they are converted into YOLO-style “txt” files for processing, which include five types of information: the category of the target, the $x$ and $y$ coordinates of the box center, and the width and height of the bounding box.
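The step schedule for the public datasets can be written as a small function. This sketch assumes one reading of the schedule, in which the rate drops by a factor of 10 at epochs 100 and 200; the function name and its parameters are hypothetical.

```python
def learning_rate(epoch, initial=1e-2, milestones=(100, 200), factor=0.1):
    """Step schedule: multiply the rate by `factor` at each milestone already reached."""
    lr = initial
    for milestone in milestones:
        if epoch >= milestone:
            lr *= factor
    return lr


schedule = [learning_rate(e) for e in (0, 99, 100, 199, 200, 299)]
```

In a PyTorch training loop the same behaviour is usually obtained with a multi-step scheduler attached to the SGD optimizer.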
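Anchor recalculation with k-means can be sketched as below. The paper does not state the distance measure, so this example assumes the IoU-based assignment that is common for YOLO-style anchors; the box list, the deterministic initialization, and the function names are all hypothetical.

```python
def iou_wh(box, anchor):
    """IoU of two boxes given only (width, height), both placed at the origin."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    return inter / (box[0] * box[1] + anchor[0] * anchor[1] - inter)


def kmeans_anchors(boxes, k, iters=50):
    # Deterministic start: pick anchors spread over the boxes sorted by area.
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    anchors = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            # Assign each box to the anchor it overlaps most (highest IoU).
            best = max(range(k), key=lambda i: iou_wh(b, anchors[i]))
            clusters[best].append(b)
        # Move each anchor to the mean width and height of its cluster.
        updated = [
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c)) if c else anchors[i]
            for i, c in enumerate(clusters)
        ]
        if updated == anchors:
            break
        anchors = updated
    return sorted(anchors, key=lambda a: a[0] * a[1])


# Hypothetical (width, height) pairs of labeled boxes, in pixels.
boxes = [(10, 12), (12, 10), (11, 11), (50, 60), (55, 58),
         (60, 50), (200, 180), (190, 210), (210, 200)]
anchors = kmeans_anchors(boxes, k=3)
```

With these toy boxes the three anchors settle on the small, medium, and large groups, which is the effect sought when the default COCO anchors do not match the dataset.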
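The conversion from a pixel-coordinate annotation to one line of a YOLO-style label file can be sketched as follows. It assumes the usual YOLO convention in which the box center and the box size are normalized by the image width and height; the function name and the example values are hypothetical.

```python
def to_yolo_line(class_id, xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert a corner-format pixel box into a normalized YOLO label line."""
    x_center = (xmin + xmax) / 2 / img_w
    y_center = (ymin + ymax) / 2 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# A hypothetical box of class 2 in a 1000 x 800 image.
line = to_yolo_line(2, 100, 200, 300, 400, 1000, 800)
print(line)  # 2 0.200000 0.375000 0.200000 0.250000
```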

The self-built surface target dataset consists of a total of 2229 images with 5731 labeled targets, and all images share one resolution. Similar to COCO, we divide targets into small, medium, and large, as seen in Table 1. The training results of a neural network are influenced by the richness of the data. To enrich the samples at different scales, bring the images closer to real scenes, and strengthen the learning ability of the model for targets of various scales, this paper adopts a mosaic enhancement method in which mirroring, brightness enhancement, contrast enhancement, and linear blurring are utilized. The data enhancement method used in this paper is shown in Figure 11.
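The small, medium, and large split can be illustrated with the area thresholds of the COCO evaluation protocol (32 × 32 and 96 × 96 pixels). Whether the self-built dataset uses exactly these thresholds is an assumption of this sketch, and the function name is hypothetical.

```python
def scale_bucket(box_w, box_h, small_max=32 ** 2, medium_max=96 ** 2):
    """Classify a box by pixel area using COCO-style thresholds (assumed here)."""
    area = box_w * box_h
    if area < small_max:
        return "small"
    if area < medium_max:
        return "medium"
    return "large"


buckets = [scale_bucket(w, h) for w, h in ((20, 20), (50, 60), (200, 150))]
print(buckets)  # ['small', 'medium', 'large']
```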

4.2. Ablation Study
To demonstrate the performance of the proposed coordinate attention and the double-layer cascade module, a series of ablation experiments are conducted on the surface target dataset. The corresponding results are all listed in Table 2. We compare the baseline with the model containing SE attention (SE), CBAM attention (CBAM), the horizontal attention (HA), the vertical attention (VA), coordinate attention (CA), the double-layer cascade module (DC), and the combination of the multiscale module and coordinate attention (CA+DC). The scaling ratio $r$ is set to 24 when an attention module is used. The average precision of small targets (APS), medium targets (APM), and large targets (APL), the mean average precision (mAP), and the frames per second (FPS) are recorded. A prediction is counted as correct when the intersection over union (IoU) between the predicted box and the ground-truth box is greater than 0.5.
As can be seen in Table 2, the recognition of small and large targets improves significantly with the addition of coordinate attention and the double-layer cascade module. The mAP is also highest in this case. The results show that adding SE or CBAM reduces the speed of the model, which makes them unsuitable for the tasks in this paper. In comparison, adding coordinate attention and the double-layer cascade module improves the accuracy of surface target recognition while preserving the real-time performance of the algorithm as far as possible. We also record the curves of the various parameters during the training process in Figure 12. The results show that the parameters are fitted quickly and the model performs well. The model basically converges by epoch 50.
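The correctness criterion can be sketched with a small overlap function. This is an illustrative sketch: boxes are assumed to be in corner format (xmin, ymin, xmax, ymax), the function names are hypothetical, and a full evaluation would additionally require the predicted class to match.

```python
def iou(a, b):
    """Intersection over union of two boxes in (xmin, ymin, xmax, ymax) format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred, truth, threshold=0.5):
    """A prediction counts as correct when its IoU with the truth exceeds the threshold."""
    return iou(pred, truth) > threshold


close_match = is_correct((0, 0, 10, 10), (1, 1, 11, 11))   # IoU = 81 / 119
half_overlap = is_correct((0, 0, 10, 10), (0, 0, 10, 20))  # IoU = exactly 0.5
print(close_match, half_overlap)  # True False
```

The second case shows that an IoU of exactly 0.5 is rejected under the strict "greater than 0.5" rule.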

4.3. Comparison with Other Methods
In this section, we evaluate the proposed model on the COCO dataset and the VOC dataset and compare it with other state-of-the-art methods. On the COCO dataset, we compare AP, APS, APM, and APL with those of previous algorithms to verify the recognition performance on targets of various scales. The FPS is not compared because the input sizes differ. The results are shown in Table 3.
From the above results, it can be concluded that our model performs best on small-scale targets and is only slightly behind SNIPER on medium- and large-scale targets. We also compare our model with other advanced methods on the VOC2012 dataset in Table 4, aiming to verify the recognition accuracy on specific targets. We select target types such as aero, boat, and bus because they are similar to surface targets in appearance.
Our model achieves the highest recognition precision on three types of objects, aero, boat, bus, and performs worse on two types of objects, bike and car. The mean average precision is the highest among the listed methods. In addition, the model is tested on our self-built dataset and other public images. As can be seen in Figure 13, the classification and location of the target can be precisely returned regardless of the change in background and scale.

5. Summary and Prospect
To address the problems of diverse target types, easy confusion, large-scale span, and high requirements for the real time, this paper proposes a surface target recognition algorithm based on coordinate attention and double-layer cascade. The coordinate attention, which aggregates features along two spatial directions by feature encodings, retains the location information while learning spatial dependencies without additional network overhead. The double-layer cascade module aggregates parameters at different scales to further improve the feature extraction capability of the model for multiscale targets. Experimental results on the COCO dataset, the VOC dataset, and our self-built surface target dataset show that the proposed method is suitable for surface targets and has outstanding performance under evaluation metrics such as AP and FPS.
The surface target recognition algorithm differs from common target detection and recognition algorithms. It is more like a combination of both, since it needs to give the bounding box of the target and also accurately identify its class. In addition, surface target recognition is usually applied on mobile platforms such as USVs, which places higher requirements on the real-time performance of the algorithm. This paper is still limited to the recognition of images in fixed scenes. In the future, more attention should be paid to the continuous tracking of targets in moving scenes and to collaborative multiplatform, multiperspective perception under weak observation conditions.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
We declare that there is no conflict of interest regarding the publication of this paper.
Acknowledgments
This work was supported by the Natural Science Foundation of Hunan Province Youth Fund (Project No. 2020JJ5672).