Abstract
CycleGAN has been shown to outperform recent approaches for semisupervised semantic segmentation on public segmentation benchmarks. In contrast to analog images, however, acoustic images are unbalanced and often exhibit speckle noise. As a consequence, CycleGAN is prone to mode collapse and cannot retain target details when applied directly to a sonar image dataset. To address this problem, a spectral normalized CycleGAN network is presented, which applies spectral normalization to both generators and discriminators to stabilize GAN training. The experimental results demonstrate that this simple yet effective method achieves reasonably accurate sonar target segmentation without using a pretrained model.
1. Introduction
Compared with real aperture side-scan sonar, synthetic aperture sonar (SAS) has a higher hydrographic surveying and charting speed and can produce higher resolution images [1–3]. The accurate detection and identification of underwater targets in synthetic aperture sonar remain a significant issue [4–6]. However, according to the principle of imaging by backprojection [2], images of the same underwater target obtained from different views differ and are complex in shape and contour [7], which makes them hard to label for a supervised detection method [8]. By contrast, semantic segmentation labels mark the outline of the target on the image and exclude the background area. Thus, accurate semantic segmentation is significant for identifying underwater targets in synthetic aperture sonar and estimating their type, location, scale, direction, and so on [9, 10].
In recent years, deep convolutional neural networks (DCNNs) have been widely used in acoustic image processing. Supervised learning is a generally accepted method for acoustic image semantic segmentation [11]. Combined with specific modules based on the characteristics of sonar images, the improved network can achieve better segmentation results than the original structure. For instance, recurrent residual convolutional neural network is combined with the self-guidance module to help discriminate whether the input image is the segmentation result or the ground truth label [12]. FCN is combined with dilated convolution, dense module, and inception [13]. This method can decrease the parameters of the network and speed up segmentation. Receptive field block and attention search function were integrated into the residual convolutional neural network. This model can help enhance the contrast between the underwater target and the background [14].
However, the main factor limiting its application to sonar image semantic segmentation is that supervised learning requires a large amount of pixel-level labeled data, which are time-consuming to annotate. Besides, large amounts of training data are often unavailable because actual experiments are very expensive and often limited in scale. Thus, semisupervised learning is an important area of research to overcome the problem of limited labeled data. To our knowledge, little work has been done to investigate semisupervised learning for sonar image semantic segmentation.
Since the generative adversarial network (GAN) was proposed [15], it has been widely used in the field of semisupervised learning. For example, SGAN (semisupervised GAN) [16] is used for multiobjective classification, CC-GAN (context-conditional GAN) [17] can generate oil painting images, and BUS-GAN [18] is applied to improve the segmentation quality of breast lesions from ultrasound images. Recently, the CycleGAN model [19] has become the mainstream choice for image style conversion between domains because it removes the requirement for paired images in the training process. It can be applied to semisupervised semantic segmentation by learning a bidirectional mapping from unlabeled real images to available ground truth labels. Jiang et al. first exploited this capability to transfer CT to MRI for lung cancer segmentation [20]. Mondal et al. leveraged the cycle-consistency loss, which preserves critical attributes between the input and the transformed image, to add an unsupervised regularization effect that boosts segmentation performance when labeled data are limited [21]. Their experiments were conducted on three public semantic segmentation benchmarks, PASCAL VOC 2012 [22], Cityscapes [23], and the Automated Cardiac Diagnosis Challenge (ACDC) [24], and showed better accuracy than the traditional adversarial learning method.
However, we find that CycleGAN tends to generate the same type of segmentation result (mode collapse) and fails to preserve target details when sonar target samples are scarce and imbalanced. Previous research has demonstrated that constraining the Lipschitz constant of the discriminator mapping function (the L-constraint) can stabilize the training of GANs. The first method to satisfy the L-constraint was proposed for WGAN: a gradient penalty term is added to the discriminator loss function. The disadvantage of this method is that it satisfies the L-constraint only approximately, and only when the number of categories in the training samples is small. Spectral normalization was proposed to limit the Lipschitz constant of the discriminator by limiting the spectral norm of each neural network layer. It satisfies the L-constraint accurately and needs no additional hyperparameter tuning. Moreover, compared with other normalization techniques, its computational cost is relatively small. Therefore, it is reasonable to apply spectral normalization to both generators and discriminators.
The main contributions of this paper are as follows:
(1) We refine Mondal’s [21] semisupervised model and validate its efficiency for two acoustic image datasets. To our knowledge, this is the first investigation applying semisupervised learning for acoustic image segmentation.
(2) The spectral normalization is applied to both generator and discriminator to improve the training stability of the CycleGAN.
(3) We make two sonar image datasets, SCTDI and SCTDII. SCTDI contains 300 images of three types of targets (shipwreck, aircraft wreckage, and victims). SCTDII contains 800 images of the tiny targets. All images have a fixed resolution of 320 × 320 and 9.6 bits per pixel.
2. Related Work
In this section, we review work related to semisupervised segmentation from three aspects: the application of CycleGAN, techniques to stabilize the training of GANs, and recent work in semisupervised semantic segmentation.
2.1. The Application of CycleGAN
One of the applications of CycleGAN is image synthesis, which is widely used in medical imaging data augmentation.
Image synthesis refers to the mapping between different domains. The image domain in the medical field includes CT images and MRI images. Hiasa et al. [25] proposed a CT to MRI synthesis method using CycleGAN. They extended the CycleGAN approach by adding the gradient consistency loss to improve the accuracy at the boundaries. Huo et al. [26] proposed a novel end-to-end synthesis and segmentation network (EssNet). It can achieve the unpaired MRI to CT image synthesis and CT splenomegaly segmentation simultaneously. Without using manual labels on CT, it can alleviate the manual efforts.
Apart from data augmentation for medical images, CycleGAN can also be employed as a semantic segmentation network and detection framework in remote sensing images.
Dong et al. [27] estimated both segmentation results and monocular depth of three-dimensional (3D) images using CycleGAN, which is meaningful for the study of augmented reality (AR) and autonomous driving applications. Mondal et al. [21] proposed a strategy that enforces cycle consistency to learn a bidirectional mapping between unlabeled real images and real labels. Experiments on three different public segmentation benchmarks (PASCAL VOC 2012, Cityscapes, and ACDC) demonstrate the effectiveness of the proposed method, which outperforms recent approaches based on adversarial learning for semisupervised segmentation.
In the field of remote sensing image detection, CycleGAN has proven to be a generally accepted method for domain adaptation. For instance, CycleGAN is used to mitigate multisensor differences in a CNN-based unsupervised multiple-change detection approach proposed by Saha et al. [28]. Soto Vega et al. [29] applied the domain adaptation ability of CycleGAN to change detection tasks. This framework can employ previously trained classifiers for new data without a significant drop in classification accuracy. Yang et al. [30] proposed a change detection framework based on selective adversarial adaptation. Adversarial learning further reduces the distribution discrepancy between the target and selected source samples. They show that not only is positive transfer enhanced but negative transfer is also alleviated.
In summary, this review shows the wide use of CycleGAN’s domain adaptation ability, which we apply to the semisupervised sonar image semantic segmentation task in our work.
2.2. Techniques to Stabilize Training of GAN
Goodfellow et al. [15] showed that if both the generator and the discriminator are powerful enough to approximate any real-valued function, the generator distribution can converge to the data distribution at the equilibrium of the minimax game. However, GANs can be hard to train, and in practice, it is often observed that gradient descent-based GAN optimization leads to divergence and mode collapse. A possible explanation for this might be that the network does not satisfy the L-constraint.
Researchers have tried to address this instability and improve generators through several techniques. Energy-based GAN [31] and Wasserstein GAN [32] attempt to modify the objective function to improve the quality of gradients. Neyshabur et al. [33] proposed stabilizing GAN training with multiple random projections, namely, training a single generator simultaneously against an array of discriminators, each of which sees only a low-dimensional projection of the data. Salimans et al. [34] proposed virtual batch-normalization and semisupervised learning to provide additional supervision to the generator.
In this paper, we use spectral normalization [34] to stabilize the training of CycleGAN for semisupervised learning.
2.3. Recent Work in Semisupervised Semantic Segmentation
Compared with supervised learning, semisupervised learning can achieve satisfying performance with a small set of labeled data. Research in this area is generally associated with consistency regularization, which has yielded groundbreaking results in semisupervised classification problems.
French et al. [35] investigated the conditions that can allow consistency regularization to operate in semisupervised semantic segmentation. Lai et al. [36] presented the context-aware consistency to address the problem that semisupervised models overly rely on the contexts available in the training data. Gurubisic et al. [37] presented a method with one-way consistency for practical real-time applications.
In this paper, we enforce cycle-consistency to achieve satisfying segmentation results.
3. Methodology
In this section, domain mapping is first introduced to illustrate the unpaired domain adaptation ability of CycleGAN. The loss functions are then presented to describe the optimization goals of CycleGAN in the semantic segmentation task. Last, we explain why applying spectral normalization to both the generator and the discriminator of CycleGAN can stabilize training.
3.1. Domain Mapping
The domain adaptation ability has been explained clearly in [19]. In our work, the source domain refers to sonar images, including real images and generated images. Correspondingly, the target domain refers to labels, including ground truth labels and generated labels. Figure 1 shows examples of real images (first column), ground truth labels (second column), generated images (third column), and generated labels (fourth column) obtained for the three targets used in our experiments. Different palettes are used to distinguish them for convenience.

The types of domain mappings can be divided into unidirectional mappings (Figure 2) and circular mappings (Figure 3). The first unidirectional mappings from sonar images to generated labels refer to sonar image semantic segmentation. The circular mappings enforce cycle consistency as the regularization to enhance the semisupervised semantic segmentation performance.
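As a rough illustration of the circular mapping, the sketch below replaces the two generators with toy functions and measures how far an image is from its reconstruction after a round trip through the label domain. The names `image_to_label`, `label_to_image`, and `cycle_l1`, as well as the thresholding rule and the fixed intensities, are hypothetical stand-ins chosen for illustration; they are not the networks used in this work.

```python
from typing import Callable, List

Image = List[float]  # flattened pixel intensities in [0, 1]
Label = List[int]    # flattened per-pixel class ids


def image_to_label(image: Image) -> Label:
    """Toy stand-in for the image-to-label generator: threshold at 0.5."""
    return [1 if p >= 0.5 else 0 for p in image]


def label_to_image(label: Label) -> Image:
    """Toy stand-in for the label-to-image generator: one intensity per class."""
    return [0.8 if c == 1 else 0.2 for c in label]


def cycle_l1(image: Image,
             forward: Callable[[Image], Label],
             backward: Callable[[Label], Image]) -> float:
    """Mean absolute error between an image and its round-trip reconstruction."""
    reconstructed = backward(forward(image))
    return sum(abs(a - b) for a, b in zip(image, reconstructed)) / len(image)


unlabeled = [0.1, 0.9, 0.7, 0.3]
loss = cycle_l1(unlabeled, image_to_label, label_to_image)  # approximately 0.1
```

The point of the regularizer is that this quantity can be computed for unlabeled images, since it needs no ground truth label.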


3.2. Loss Functions
The semisupervised dataset contains three types of data: labeled images, unlabeled images, and the ground truth labels corresponding to the labeled images.
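In this work, the total loss function follows the definition in [21]:
$$\mathcal{L}_{total} = \mathcal{L}^{S}_{gen}(G_{IS}) + \lambda_1 \mathcal{L}^{I}_{gen}(G_{SI}) + \lambda_2 \mathcal{L}^{S}_{cycle}(G_{IS}, G_{SI}) + \lambda_3 \mathcal{L}^{I}_{cycle}(G_{IS}, G_{SI}) - \lambda_4 \mathcal{L}^{S}_{disc}(G_{IS}, D_S) - \lambda_5 \mathcal{L}^{I}_{disc}(G_{SI}, D_I) \tag{1}$$
Here, $G_{IS}$ and $G_{SI}$ are the image-to-label and label-to-image generators, $D_S$ and $D_I$ are the label and image discriminators, and $\lambda_1, \ldots, \lambda_5$ are all constants, which can be set to the values suggested in [21]. The objective function that boosts our network to achieve reasonably accurate sonar target segmentation from limited labeled data is as follows:
$$\arg\min_{G_{IS}, G_{SI}} \arg\max_{D_S, D_I} \mathcal{L}_{total} \tag{2}$$
Expression (1) consists of six training loss functions defined by Mondal et al. [21], which can be classified into three types: generator loss (orange part), discriminator/adversarial loss (green part), and cycle-consistency loss (red part), as shown in the following expression and Figure 4:
$$\mathcal{L}_{total} = \underbrace{\mathcal{L}^{S}_{gen} + \lambda_1 \mathcal{L}^{I}_{gen}}_{\text{generator loss}} + \underbrace{\lambda_2 \mathcal{L}^{S}_{cycle} + \lambda_3 \mathcal{L}^{I}_{cycle}}_{\text{cycle-consistency loss}} - \underbrace{\left(\lambda_4 \mathcal{L}^{S}_{disc} + \lambda_5 \mathcal{L}^{I}_{disc}\right)}_{\text{discriminator loss}} \tag{3}$$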
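The grouping into three loss types can be sketched as a small function that combines six scalar loss values. This is only an illustration under stated assumptions: the argument names are hypothetical, the default weights of 1.0 are placeholders rather than the values suggested in [21], and subtracting the discriminator terms assumes the min-max formulation in which the generators minimize and the discriminators maximize the total loss.

```python
def total_loss(gen_label: float, gen_image: float,
               cycle_label: float, cycle_image: float,
               disc_label: float, disc_image: float,
               weights=(1.0, 1.0, 1.0, 1.0, 1.0)) -> float:
    """Combine six loss terms into one scalar.

    weights = (l1, l2, l3, l4, l5); the defaults are placeholders,
    not tuned or published values.
    """
    l1, l2, l3, l4, l5 = weights
    generator = gen_label + l1 * gen_image               # supervised generator losses
    cycle = l2 * cycle_label + l3 * cycle_image          # cycle-consistency losses
    adversarial = l4 * disc_label + l5 * disc_image      # discriminator losses
    # Assumed sign convention: the adversarial part enters with a minus sign.
    return generator + cycle - adversarial


example = total_loss(0.5, 0.2, 0.1, 0.3, 0.4, 0.6)  # 0.7 + 0.4 - 1.0
```

Keeping the three groups as separate intermediate values makes it easy to log them individually during training.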

3.3. Spectral Normalization
Previous research has proved that constraining the Lipschitz constant of the discriminator’s mapping function can stabilize GAN training [3237]. The reason why applying the spectral normalization to the CycleGAN network can satisfy the L-constraint is given as follows.
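For one layer of a fully connected neural network, the L-constraint is defined as follows:
$$\| f(W x_1 + b) - f(W x_2 + b) \| \le C(W, b)\, \| x_1 - x_2 \| \tag{4}$$
Here, $f$ is the activation function, $W$ is the network parameter matrix, $b$ is the bias, $C(W, b)$ is a constant that depends on $W$ and $b$, and $x_1$ and $x_2$ are the inputs.
When $x_1$ and $x_2$ are close enough, the left-hand side of equation (4) can be approximated by its first-order term as follows:
$$\left\| \frac{\partial f}{\partial y}\, W (x_1 - x_2) \right\| \le C(W, b)\, \| x_1 - x_2 \|, \quad y = W x + b \tag{5}$$
Because we use ReLU as the activation function, $\| \partial f / \partial y \| \le 1$; equation (5) is therefore satisfied if
$$\| W (x_1 - x_2) \| \le C\, \| x_1 - x_2 \| \tag{6}$$
Here, the smallest such $C$ is equal to the spectral norm of the network parameter matrix $W$, and the definition is as follows:
$$\| W \|_2 = \max_{x \ne 0} \frac{\| W x \|}{\| x \|} = \sigma_{\max}(W) \tag{7}$$
Here, $\sigma_{\max}(W)$ is the maximum singular value of $W$.
According to [38], the relation between the output and the input of the whole network can be written as follows:
$$f(x) = a_N\left(W_N\, a_{N-1}\left(W_{N-1} \cdots a_1(W_1 x) \cdots\right)\right) \tag{8}$$
Here, $a_i$ and $W_i$ are the activation function and the parameter matrix of layer $i$, $N$ is the number of layers of the network, $x$ is the input data, and $i$ is any single layer among the $N$ layers.
Taking the gradient on both sides of equation (8),
$$\| \nabla_x f \|_2 \le \prod_{i=1}^{N} \| \nabla a_i \|_2\, \| W_i \|_2 \tag{9}$$
Here, $\nabla$ is the gradient operator and $\| W_i \|_2$ is the spectral norm of the network parameter matrix. $\| \nabla a_i \|_2 \le 1$ because the activation function used in the CycleGAN is ReLU. Therefore, equation (9) can be written as follows:
$$\| \nabla_x f \|_2 \le \prod_{i=1}^{N} \| W_i \|_2 \tag{10}$$
Finally, each $W_i$ is divided by $\| W_i \|_2$, namely, spectral normalization, which divides the right-hand side of equation (10) by $\prod_{i=1}^{N} \| W_i \|_2$ and makes the network $\bar{f}$ with normalized weights satisfy the L-constraint:
$$\| \nabla_x \bar{f} \|_2 \le \prod_{i=1}^{N} \left\| \frac{W_i}{\| W_i \|_2} \right\|_2 = 1 \tag{11}$$
It means that spectral normalization of each layer helps the whole network satisfy the L-constraint.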
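To make the normalization step concrete, the following sketch estimates the spectral norm of a small weight matrix by power iteration and divides the matrix by it. It is a minimal pure-Python illustration, assuming a nonzero matrix stored as a list of rows; the function names and the iteration count are hypothetical, and practical implementations typically run only a few power iterations per training step rather than iterating to convergence.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def transpose(W):
    return [list(col) for col in zip(*W)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def spectral_norm(W, n_iter=200, seed=0):
    """Estimate the largest singular value of W by power iteration on W^T W."""
    rng = random.Random(seed)
    Wt = transpose(W)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(W[0]))]
    for _ in range(n_iter):
        v = matvec(Wt, matvec(W, v))   # one step of W^T W
        n = norm(v)                    # assumes W is nonzero, so n > 0
        v = [x / n for x in v]
    return norm(matvec(W, v))


def spectral_normalize(W, **kwargs):
    """Return W divided by its estimated spectral norm."""
    s = spectral_norm(W, **kwargs)
    return [[w / s for w in row] for row in W]


W = [[3.0, 0.0], [0.0, 1.0]]
sigma = spectral_norm(W)        # close to 3 for this diagonal matrix
W_sn = spectral_normalize(W)    # spectral norm of W_sn is close to 1
```

After normalization, the layer can stretch no input vector by more than a factor of about one, which is the per-layer property the product bound relies on.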
The network of the generator and the discriminator which apply spectral normalization is shown in Figures 5 and 6. The architecture is based on ResNet [39], which has four layers. CSN is Conv spectral normalization, which applies spectral normalization to the conv block, BN is batch-normalization, and ReLU is the activation function. Classifier changes the input features into generated labels or images.


4. Experiments and Analysis
4.1. Sonar Image Datasets
We built the SCTDI dataset on the basis of the SCTD dataset [8] by adding segmentation labels and dropping images that were too similar. It is composed of 300 images in three categories (aircraft wrecks, shipwrecks, and victims), which are randomly divided into training (270 images) and validation (30 images) subsets.
The SCTDII dataset was acquired with a Klein series 5000 side-scan sonar (https://www.kleinsonar.com). It is composed of 800 images of tiny targets, which are randomly divided into training (720 images) and validation (80 images) subsets.
All images have a fixed resolution of 320 × 320 and 9.6 bits per pixel. To reduce memory requirements, the images of both datasets are resized so that their short edge is 200 pixels before being fed into the network.
4.2. Evaluation Protocol
The mean intersection over union (mIoU) metric [40] is used to evaluate the segmentation results of all the models (supervised model, AdvSemSeg, MT-CutMix, CycleGAN, and ours), which is defined as follows:
$$\mathrm{mIoU} = \frac{1}{K} \sum_{c=1}^{K} \frac{TP_c}{TP_c + FP_c + FN_c} \tag{12}$$
where $K$ is the number of classes and $TP_c$, $FP_c$, and $FN_c$ are the true positive, false positive, and false negative pixels of class $c$, respectively, determined over the whole validation set. The larger the value of mIoU, the better the result of semantic segmentation.
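A minimal sketch of the metric is given below, assuming predictions and ground truth are available as flat sequences of integer class ids accumulated over the validation set. The function name `mean_iou` is hypothetical, and skipping classes that appear in neither the prediction nor the ground truth is one common convention, adopted here as an assumption.

```python
def mean_iou(pred, truth, num_classes):
    """Mean intersection over union from flat sequences of class ids."""
    ious = []
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, truth) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, truth) if p != c and t == c)
        denom = tp + fp + fn
        if denom:  # assumption: ignore classes absent from both pred and truth
            ious.append(tp / denom)
    return sum(ious) / len(ious)


truth = [0, 0, 1, 1]
pred = [0, 1, 1, 1]
# Class 0: TP=1, FP=0, FN=1 gives 1/2; class 1: TP=2, FP=1, FN=0 gives 2/3.
score = mean_iou(pred, truth, num_classes=2)  # (1/2 + 2/3) / 2 = 7/12
```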
4.3. Results
In this section, the training effects of our method on the sonar image datasets SCTDI and SCTDII are firstly described. We also compare the performance of spectral normalization applied to the ResNet with other network structures. Besides, we show the comparison between spectral normalization and the other stabilization methods.
The supervised training results, obtained using all the labeled images, serve as a benchmark. Three state-of-the-art semisupervised methods and our model are trained on the same training subsets, which retain 10%, 20%, 30%, 40%, and 50% of the labeled images. All methods are trained without a pretrained model to allow an unbiased comparison.
Tables 1 and 2 compare the semisupervised semantic segmentation accuracy (mIoU/%) of our model with other state-of-the-art methods on the SCTDI and SCTDII datasets. The results suggest that the proposed model outperforms the other methods in all cases when training with a reduced set of labeled images. Furthermore, this difference is particularly significant when pixel-level annotations are scarce (i.e., 10% and 20% of the whole training set), where the proposed model achieves an improvement of 7%–11%.
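The labeled-fraction protocol can be sketched as follows. This is an illustration only: the function name `sample_labeled`, the fixed seed, and the choice to treat the remaining training images as unlabeled are assumptions, and the 270 ids stand for the SCTDI training subset.

```python
import random


def sample_labeled(train_ids, fraction, seed=0):
    """Split training ids into a labeled part and an unlabeled remainder."""
    rng = random.Random(seed)      # fixed seed so every method sees the same subset
    ids = list(train_ids)
    rng.shuffle(ids)
    n_labeled = round(len(ids) * fraction)
    return ids[:n_labeled], ids[n_labeled:]


splits = {}
for fraction in (0.1, 0.2, 0.3, 0.4, 0.5):
    labeled, unlabeled = sample_labeled(range(270), fraction)
    splits[fraction] = (labeled, unlabeled)
```

With the same seed, the labeled subset at a smaller fraction is a prefix of the one at a larger fraction, so the settings are nested and comparable.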
The visual comparison of segmentation results is shown in Figure 7. It shows that the proposed method predicts a segmentation closer to the ground truth than other state-of-the-art methods where labeled images are limited. In addition, our model seems to capture better details like legs of persons, wings of planes, and so on. More segmentation results of different shapes and numbers are shown in Figures 8 and 9. Therefore, our approach seems robust when applied to semisupervised segmentation of acoustic images.



Table 3 shows a comparison of the semisupervised semantic segmentation accuracy (mIoU/%) when different network structures are chosen for both the generator and the discriminator. These results would seem to suggest that the spectral normalization does not rely on the different chosen networks and the ResNet has the best performance.
Table 4 shows a comparison of the semisupervised semantic segmentation accuracy (mIoU/%) between spectral normalization and the other stabilization methods. The results show that spectral normalization has the best performance and could be a reasonable approach when only limited labeled data are available for the segmentation task.
4.4. Ablation Study
To further analyze the effect of the different components of the proposed model, we conduct an ablation study. The results of our ablation study are summarized in Table 5, and the visual comparisons of different methods are in Figure 10. The previous model is CycleGAN. Method 1 refers to applying transfer learning to the CycleGAN. Method 2 refers to applying spectral normalization only to the discriminator. Method 3 refers to applying spectral normalization to both the discriminator and the generator. Method 4 refers to adding a pretrained model to method 3.

The proposed model, which uses spectral normalization without transfer learning, reaches an mIoU value of 0.6437. If we remove the spectral normalization on the generator, this value is reduced to 0.4517. Removing the spectral normalization from the CycleGAN entirely leads to an even lower accuracy of 0.4138, suggesting that the spectral normalization on segmentation masks has a more substantial impact on the model. Besides, we observe that adding a pretrained model to the proposed model and to CycleGAN improves the accuracy of the segmentation results only slightly, from 0.6437 to 0.6471 and from 0.4138 to 0.4229, respectively.
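The size of each effect can be read off by differencing the mIoU values quoted in the paragraph above. The short sketch below does only that arithmetic; the dictionary keys are hypothetical labels for the configurations, and the numbers are the ones reported in the text.

```python
# mIoU values quoted in the ablation discussion
miou = {
    "cyclegan": 0.4138,
    "cyclegan_pretrained": 0.4229,
    "sn_discriminator_only": 0.4517,
    "sn_both": 0.6437,
    "sn_both_pretrained": 0.6471,
}


def gain(after: str, before: str) -> float:
    """Absolute mIoU difference between two configurations."""
    return miou[after] - miou[before]


gains = {
    "sn_on_discriminator": gain("sn_discriminator_only", "cyclegan"),
    "sn_on_generator": gain("sn_both", "sn_discriminator_only"),
    "pretraining_on_proposed": gain("sn_both_pretrained", "sn_both"),
    "pretraining_on_cyclegan": gain("cyclegan_pretrained", "cyclegan"),
}
```

On these numbers, adding spectral normalization to the generator accounts for a gain of about 0.19, whereas pretraining changes the result by less than 0.01 in both cases.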
5. Conclusion
This paper presented an improved semisupervised semantic segmentation method for sonar images based on the CycleGAN network combined with spectral normalization. The spectral normalization is applied to both generator and discriminator to solve the problem that the generator tends to generate the same segmentation results when labeled data are limited. The experimental results show that this strategy can improve the performance of semisupervised segmentation, especially when labeled data are scarce. The segmentation results are robust for underwater objects with different shapes and numbers without transfer learning.
Data Availability
The data are available at https://github.com/freepoet/SCTD.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This research was supported by the National Natural Science Foundation of China (Grant Nos. 42176187 and 41906162).