Abstract
Thyroid nodules are a common clinical finding, with a large number of cases detected worldwide every year. Early determination of whether a thyroid nodule is benign or malignant from ultrasound imaging is of great importance in the diagnosis of thyroid cancer. Although B-mode ultrasound can reveal the presence of a nodule in the thyroid, there is no existing method for the accurate and automatic diagnosis of the ultrasound image. The present study therefore developed an ultrasound diagnosis method for the accurate and efficient identification of thyroid nodules, based on transfer learning and a deep convolutional neural network. Initially, a Total Variation- (TV-) based self-adaptive image restoration method was adopted to preprocess the thyroid ultrasound images and remove borders and manual marks. After data augmentation of the training set, transfer learning with a pretrained GoogLeNet convolutional neural network was performed to extract image features. Finally, joint training and secondary transfer learning were performed to improve the classification accuracy, using thyroid images from open-source data sets together with thyroid images collected from local hospitals. The GoogLeNet model was established for experiments on thyroid ultrasound image data sets. Compared with networks built on LeNet5, VGG16, and the original GoogLeNet, the results showed that the GoogLeNet (Improved) model enhanced the accuracy of nodule classification, and the joint training of different data sets and secondary transfer learning improved it further. In experiments on medical image data sets of various diseased and normal thyroids, the classification and diagnosis accuracy of this method reached 96.04%, indicating significant clinical application value.
1. Introduction
In recent years, the incidence of thyroid cancer has continued to rise. As a malignant tumor of the head and neck, it remains a threat to people's health [1]. In the United States, thyroid carcinoma is expected to become the third most common cancer threat among women, with approximately 37 cases per 100,000 people [2]. A thyroid nodule is a symptom of thyroid-related disease and may be caused by the growth of thyroid cells or by a thyroid cyst. The thyroid tissue surrounding a scattered nodular lesion can be clearly distinguished in images [3, 4]. If benign and malignant nodules can be distinguished early, even malignant nodules can be cured, and accurate discrimination provides an effective basis for proper subsequent clinical treatment. In addition, an accurate early diagnosis reduces the medical risk to patients and the large health care costs caused by needle-biopsy examination.
Currently, there are two major methods for examining the nature of thyroid nodules: ultrasound image analysis and computed tomography (CT) imaging analysis. Of the two, ultrasound imaging is cheaper and more widely available in hospitals, which is why ultrasound image analysis is more common. However, in ultrasound images, malignant thyroid nodules with prominent histopathological components and blurred boundaries usually adhere to other tissues, making their morphology difficult to distinguish. This calls for an efficient image classification method to improve accuracy and reduce the misdiagnosis rate. In past studies, radiologists summarized the ultrasonographic features of thyroid nodules that function as signs of cancer. However, diagnosing thyroid nodules from these characteristics is time-consuming and poorly robust. To date, computer-aided diagnosis systems based on ultrasound images still require a doctor's final judgment. A fully automatic computer-aided diagnosis system consists of image preprocessing (such as denoising), ROI extraction, and classification; at present, most research focuses on image denoising and ROI extraction. It remains difficult to make a judgment from ultrasound images alone, and the low quality and noise pollution of ultrasound images make classification extremely challenging. Tsantis et al. [5] proposed an SVM classifier to divide thyroid nodules into high-risk and low-risk malignant tumors. Ma et al. [6] presented a noninvasive and automatic approach for differentiating benign and malignant thyroid nodules based on support vector machines (SVM). Acharya et al. [7] proposed a wavelet-transform filter for classification. Shukla et al. [8] utilized an artificial neural network to deal with thyroid disease. Prochazka et al. [9] proposed a dual-threshold binary decomposition method for classification.
Deep learning has also brought rapid progress to the automatic classification of medical image data. Thandiackal et al. [10] identified skin lesions with several pretrained classical classification networks. Convolutional Neural Network (CNN) models are a type of deep learning architecture introduced to achieve the correct classification of breast cancer [11]. Deep models have been proposed that use limited chest CT data to distinguish malignant nodules from benign ones [12–15], and a classification algorithm for thyroid nodule ultrasound images based on a DCNN has been proposed [16]. Nevertheless, these methods are currently limited in the following respects:
Needless to say, transfer learning has played an important role in the ultrasound imaging diagnosis of thyroid cancer. However, few-shot learning, i.e., making predictions from a limited number of samples, remains challenging [17–20]. Data labelling is a task that requires a great deal of manual work [21–23]. Finally, an inappropriate model and imbalanced training data make it difficult to obtain good classification accuracy [24–30].
Therefore, in view of the above problems, this paper carries out the following work:
In response to the few-shot learning problem mentioned above, a TV model is introduced in this paper for the automated preprocessing of the original data collected by various institutions, and the image marks made by doctors are removed. The original images are then expanded by data augmentation to supplement the inadequate training samples. In response to the need to select a suitable transfer learning model, the GoogLeNet model was established for experiments on thyroid ultrasound image data sets; the results showed that the model enhanced the accuracy of nodule classification. Finally, in response to the imbalanced training data problem, this paper puts forward secondary transfer learning conducted on a public thyroid database and the actual data sets collected by hospitals, which improves the classification accuracy.
The structure of this paper is as follows. In Section 1, the motivation is given and the relevant literature is reviewed. Section 2 presents the traditional CNN structure and describes the TV-based image restoration. Section 3 describes the network structure of the proposed method based on GoogLeNet. Section 4 presents the experimental results, including the application of the proposed method to the diagnosis of thyroid cancer. The research results are summarized in Section 5.
2. Related Work
2.1. Convolutional Neural Networks
A CNN is a feed-forward neural network characterized by local connections, weight sharing, and other related properties [31]. The extracted features are input into a fully connected network, whose parameters are then optimized. In their research, Moon et al. used ultrasound images for cancer diagnosis; unlike previous methods, they used a variety of data sets and combined different CNN algorithms for fusion diagnosis, reaching accuracy rates of 91.1% and 94.62% on the different data sets [32]. Kim et al. used deep learning methods for the intelligent diagnosis of breast ultrasound images; across different performance measures, the AUC value was 89% [33]. Some methods use a three-dimensional convolutional neural network structure; experiments with different performance measures report accuracy rates of up to 96.7%.
The convolutional layer is an effective means of extracting image features. The image is input to the convolutional layer, which performs the feature extraction task: each feature map is produced by convolving the input with a learned kernel, and the weights are updated by continuous backward propagation during training. The computation of the convolutional layer is

$$y = f(W * x + b),$$

where $y$ is the output of the neuron, $x$ is the input signal of each network cell, $f$ is the activation function, $W$ is the convolution kernel, and $b$ is the offset.
After the features are extracted by the convolutional layer, the output feature map reaches the pooling layer for feature selection and information screening. The pooling operation is given by

$$y = f(\beta \,\mathrm{down}(x) + b),$$

where $\beta$ is the weight coefficient and $\mathrm{down}(\cdot)$ is the sampling function [34–36].
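The two operations above can be sketched in NumPy. The kernel, bias, and input values below are arbitrary toy examples, not the network's actual parameters:

```python
import numpy as np

def conv2d(x, w, b):
    """Valid 2-D convolution of input x with kernel w plus bias b, ReLU activation."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    y = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            y[i, j] = np.sum(x[i:i + kh, j:j + kw] * w) + b
    return np.maximum(y, 0.0)  # f(.) = ReLU

def mean_pool(x, s=2):
    """s x s mean pooling: the down-sampling function of the pooling formula."""
    h, w = x.shape[0] // s, x.shape[1] // s
    return x[:h * s, :w * s].reshape(h, s, w, s).mean(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 input feature map
w = np.array([[-1.0, 0.0], [0.0, 1.0]])        # toy 2x2 kernel
feat = conv2d(x, w, b=0.0)                     # 3x3 feature map, every entry 5
pooled = mean_pool(feat)                       # 1x1 map after 2x2 mean pooling
```

In a real CNN the kernel weights are learned, and many kernels run in parallel to produce a stack of feature maps.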
After pooling, the data is input to the fully connected layer that is equivalent to the traditional forward neural propagation. The connected end of convolutional neural only transmits signals to other fully connected layers. The traditional CNN Structure is shown in Figure 1.

In the traditional CNN structure, forward propagation builds the network output, and backward propagation trains the network parameters. The loss function, learning rate, and moving average are used for network optimization; regularization and cross entropy form the loss function in the CNN. The cross entropy is

$$H(y, \hat{y}) = -\sum_i y_i \log \hat{y}_i,$$

where $y$ is the standard answer and $\hat{y}$ is the predicted value. An exponentially decaying learning rate sets the magnitude of each parameter update, which is given by

$$\theta \leftarrow \theta - \eta \nabla_\theta L,$$

where $\eta$ is the learning rate and $\nabla_\theta L$ is the gradient of the loss function.
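As a minimal illustration of these two update rules, the following sketch uses made-up numbers rather than the paper's actual training configuration:

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """H(y, y_hat) = -sum_i y_i * log(y_hat_i); eps avoids log(0)."""
    return -np.sum(y_true * np.log(y_pred + eps))

def sgd_step(theta, grad, lr):
    """One parameter update: theta <- theta - lr * grad."""
    return theta - lr * grad

def decayed_lr(base_lr, decay_rate, step, decay_steps):
    """Exponential decay schedule that shrinks the update magnitude over training."""
    return base_lr * decay_rate ** (step / decay_steps)

y = np.array([0.0, 1.0])      # one-hot "standard answer"
p = np.array([0.25, 0.75])    # predicted distribution
loss = cross_entropy(y, p)    # -log(0.75)
```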
2.2. TV-Based Image Restoration
The data sets collected for this experiment were small and needed to be augmented. In the present study, the data set was augmented only by rotation and translation.
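A minimal sketch of this rotation-and-translation augmentation, assuming grayscale images stored as 2-D NumPy arrays (the rotation angles and shift amounts below are arbitrary examples, not the settings used in the study):

```python
import numpy as np

def translate(img, dy, dx, fill=0.0):
    """Shift an image by (dy, dx) pixels, padding vacated pixels with a constant."""
    out = np.full_like(img, fill)
    h, w = img.shape
    ys, xs = max(dy, 0), max(dx, 0)
    ye, xe = h + min(dy, 0), w + min(dx, 0)
    out[ys:ye, xs:xe] = img[ys - dy:ye - dy, xs - dx:xe - dx]
    return out

def augment(img):
    """Yield rotated and translated copies of one ultrasound image."""
    for k in range(4):                               # 0/90/180/270 degree rotations
        yield np.rot90(img, k)
    for dy, dx in [(2, 0), (-2, 0), (0, 2), (0, -2)]:  # small shifts
        yield translate(img, dy, dx)
```

Each source image thus yields several training samples; in practice the shifts should stay small enough that the nodule remains inside the frame.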
The collected data contained manual marks, as shown in Figures 2(a) and 2(c). Manual marks are annotations made by professionals on the lesion area of the ultrasound image; they destroy part of the texture and compromise the accuracy and integrity of the region to be analyzed, which also impacts the subsequent training. Restoring the image was therefore essential. In 2002, Shen et al. [37] extended the TV model to image inpainting and proposed an image inpainting method based on it. The Total Variation- (TV-) based self-adaptive restoration estimates the value of a pixel after restoration as

$$u_O = \sum_{P \in \Lambda} h_P u_P,$$

where $u_O$ is the pixel value of the current point $O$ to be restored, $u_P$ are the pixel values of the neighboring points of $O$ in the four directions $\Lambda = \{N, S, E, W\}$, and $h_P$ is the weight coefficient, mainly determined by the local gradient. Discretizing the divergence term $\nabla \cdot (\nabla u / |\nabla u|)$ of the TV model gives

$$h_P = \frac{w_P}{\sum_{Q \in \Lambda} w_Q}, \qquad w_P = \frac{1}{|\nabla u_P|},$$

where $|\nabla u_P|$ is the gradient magnitude at the neighboring point $P$, estimated by finite differences; for the left neighbor, for example, it is computed from the pixel values of the left, upper-left, and lower-left neighbors of the current point.

Finally, as shown in Figure 2, the image was restored to the extent that its texture resembled the surrounding texture. The same method was applied to the pixels of Figures 2(a) and 2(c), from which Figures 2(b) and 2(d) were obtained.
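The restoration can be sketched as a simple iterative filter. The following is a simplified discretization of the TV inpainting idea, gradient-weighted averaging of the four neighbours, and not the exact update scheme used in the paper:

```python
import numpy as np

def tv_inpaint(img, mask, n_iter=100, eps=1e-6):
    """Fill masked pixels with the gradient-weighted average of their 4 neighbours.

    img  : 2-D float array.
    mask : boolean array, True where a manual mark must be restored.
    Small gradients around a neighbour give it a large weight, so smooth
    regions dominate the fill (the TV-style behaviour)."""
    u = img.copy()
    ys, xs = np.nonzero(mask)
    h, w = u.shape
    for _ in range(n_iter):
        for y, x in zip(ys, xs):
            num = den = 0.0
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    gy = (u[min(ny + 1, h - 1), nx] - u[max(ny - 1, 0), nx]) / 2
                    gx = (u[ny, min(nx + 1, w - 1)] - u[ny, max(nx - 1, 0)]) / 2
                    wt = 1.0 / (np.hypot(gy, gx) + eps)   # w_P = 1 / |grad u_P|
                    num += wt * u[ny, nx]
                    den += wt
            u[y, x] = num / den
    return u

# Toy example: a flat 5x5 patch with one "marked" pixel to restore.
patch = np.full((5, 5), 3.0)
patch[2, 2] = 10.0                      # simulated manual mark
mark = np.zeros((5, 5), dtype=bool)
mark[2, 2] = True
restored = tv_inpaint(patch, mark, n_iter=10)
```

On the flat toy patch the marked pixel converges back to the surrounding value; on a real ultrasound image the iteration propagates the neighbouring texture into the marked region.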
3. Proposed Methods
3.1. Proposed Network Structure
The CNN model based on GoogLeNet was established to realize thyroid classification and diagnosis; the process is shown in Figure 3. Initially, TV-based preprocessing was performed on the thyroid nodule images. Subsequently, the CNN model was trained to extract features from images of various sizes. Thereafter, transfer learning was implemented based on the open-source database and the database actually collected. The features were integrated, and dual-softmax-assisted forward propagation was conducted. Finally, a softmax classifier was adopted to classify the features, completing the thyroid classification and diagnosis.

3.2. GoogLeNet CNN Structure
GoogLeNet adopts the Inception structure proposed in Going Deeper with Convolutions [38]. A conventional CNN structure simply enlarges the network, with two disadvantages: overfitting and an increase in the amount of computation. Network depth and width can instead be increased while reducing the number of parameters, which turns the full connection into a sparse connection. Under a dense-matrix optimization mode, the computation amount does not improve qualitatively with this change alone; the Inception structure, however, combines a sparse structure with high computing performance. The Inception structure is shown in Figure 4.

Using convolution kernels of various scales yields receptive fields of various sizes, and the final stitching integrates these scales. Kernel sizes such as 1 × 1, 3 × 3, and 5 × 5 were set with stride = 1 and pad = 0, 1, and 2, respectively, so that the branch outputs stay aligned and can be directly stitched together. However, as the 5 × 5 convolution kernel still generated a large amount of computation, a 1 × 1 convolution kernel was utilized to reduce the dimension. The specifically improved Inception structure is shown in Figure 5.
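The spatial-size bookkeeping behind this stitching can be checked with a few lines of Python. The channel counts in the cost comparison (192 in, 16 reduced, 32 out) are illustrative values in the style of the original Inception design, not figures from this study:

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution: floor((size + 2*pad - kernel)/stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

# With stride 1 and pads 0, 1, 2, all three branches keep a 28x28 input size,
# so their feature maps can be stitched (concatenated) along the channel axis.
branches = [(1, 0), (3, 1), (5, 2)]                 # (kernel, pad) per branch
sizes = [conv_out(28, k, 1, p) for k, p in branches]

# A 1x1 convolution before the 5x5 branch only shrinks the channel count, but
# that is where the savings come from (multiply counts, 28x28 feature maps):
full = 5 * 5 * 192 * 32 * 28 * 28                           # direct 5x5, 192 -> 32
reduced = (1 * 1 * 192 * 16 + 5 * 5 * 16 * 32) * 28 * 28    # 1x1 to 16, then 5x5
```

The reduced path costs roughly a tenth of the direct one, which is why the improved Inception module inserts 1 × 1 convolutions before the larger kernels.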

3.3. Improved GoogLeNet Structure
The GoogLeNet network model is stacked from Inception modules. In such a relatively deep network, effectively propagating the gradient backward through all layers is a problem. The performance of shallower networks on this task suggests that the features generated by the intermediate layers of the network should already be highly discriminative, so auxiliary classifiers can be added at the lower stages. This is considered a method of overcoming the vanishing-gradient problem. The auxiliary classifiers take the form of small CNNs placed on top of the outputs of intermediate Inception modules, and these auxiliary networks are discarded at inference time. Subsequent control experiments showed that the auxiliary networks have almost the same influence, and one of them is adequate to achieve the same effect.
Dropout determines what percentage of fully connected nodes is shut off during a training cycle; it improves model generalizability by preventing nodes from overlearning the training data. Average pooling was finally adopted in the network to replace the fully connected layer. Furthermore, to prevent the gradient from vanishing, the network was provided with two additional softmax outputs for the forward propagation of the gradient. The structure of the Inception module is shown in Figures 4 and 5. Computation is performed after the number of channels is reduced through a 1 × 1 convolution that aggregates the information, effectively making use of computing power. Integrating multidimensional features by combining convolutions and pooling at different scales also contributes to better recognition and classification. By growing the network in width rather than only in depth, it avoids the dispersion of the training gradient, and the global average pooling adopted by GoogLeNet solves the typical problem of the numerous and weakly generalized parameters of a traditional fully connected CNN.
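Both dropout and global average pooling are easy to state in NumPy. The dropout rate and feature-map sizes below are placeholders, not the values used in this paper:

```python
import numpy as np

def dropout(x, rate, rng, train=True):
    """Inverted dropout: zero a fraction `rate` of activations during training
    and rescale the survivors so the expected activation is unchanged."""
    if not train or rate == 0.0:
        return x
    keep = (rng.random(x.shape) >= rate).astype(x.dtype)
    return x * keep / (1.0 - rate)

def global_avg_pool(feat):
    """Replace the fully connected layer: average each channel's feature map
    down to a single value (feat has shape [channels, H, W])."""
    return feat.mean(axis=(1, 2))

rng = np.random.default_rng(0)
feat = np.ones((4, 7, 7))            # 4 channels of 7x7 feature maps
vec = global_avg_pool(feat)          # -> 4-dimensional descriptor
```

At inference time dropout is a no-op (`train=False`), which mirrors how the auxiliary machinery of the network is used only during training.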
4. Experiments
4.1. Selected Data Sets and Evaluation Indicators
We verified through a large number of experiments that the accuracy of predicting the classification of candidate nodules depended on the following characteristics of the training sample: (i) origin of the classifications: the primary classification or the mode of two classifiers, as shown in Table 1; (ii) the size of the images used for training; (iii) the split: a random selection of 80 percent of the samples described in Table 1 was used separately for training, and the remaining 20 percent was reserved for validation.
The thyroid nodule ultrasound image data were obtained from hospitals. After data augmentation, there were 2,763 images of malignant cases and 541 images of benign cases, a total of 3,304 images, all cropped to a uniform size. The images were extracted from thyroid ultrasound video sequences acquired by an ultrasonic apparatus at a frequency of 12 MHz, and the TI-RADS score was given by a professional physician after image diagnosis. 2,763 images were used for training the improved models; the remaining 541 images, as a test data set, were randomly divided into 5 groups to test the models above. The benign and malignant samples were each divided into training, validation, and test sets; the specific classification scheme is shown in Table 1.
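A per-class 80/20 split of this kind can be sketched as follows; the file names are hypothetical placeholders for the real images:

```python
import random

def stratified_split(benign, malignant, train_frac=0.8, seed=42):
    """Split each class separately so the 80/20 ratio holds per class."""
    rng = random.Random(seed)
    splits = {"train": [], "test": []}
    for label, items in (("benign", benign), ("malignant", malignant)):
        items = items[:]                     # leave the caller's list untouched
        rng.shuffle(items)
        cut = int(len(items) * train_frac)
        splits["train"] += [(x, label) for x in items[:cut]]
        splits["test"] += [(x, label) for x in items[cut:]]
    return splits

# Hypothetical image identifiers standing in for the real files.
benign = [f"b{i:03d}.png" for i in range(541)]
malignant = [f"m{i:04d}.png" for i in range(2763)]
splits = stratified_split(benign, malignant)
```

Splitting per class keeps the benign/malignant imbalance identical in the training and held-out sets, so test accuracy is not distorted by an accidental class skew.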
4.2. Comparative Analysis of Experimental Results
Comparing the accuracy of the different models is important. Table 2 shows that our Improved Inception model achieved higher accuracy than the common GoogLeNet model and exhibited the highest accuracy in determining whether a thyroid nodule had changed pathologically.
The confusion matrix and performance standards obtained in LeNet5, VGG16, GoogLeNet, and GoogLeNet (improved) models are shown in Figure 6.

The LeNet5 architecture correctly predicted 860 of 1,000 images and incorrectly predicted 140. The VGG16 architecture correctly predicted 920 of 1,000 images and incorrectly predicted 80. The GoogLeNet architecture correctly predicted 960 of the 1,000 images but incorrectly predicted 40 of them. The most successful class for the GoogLeNet (improved) architecture was the normal class; it correctly predicted 970 of 1,000 images and incorrectly predicted 30.
The True Positive Rate (TPR) is shown on the vertical axis of Figure 6 and the False Positive Rate (FPR) on the horizontal axis; the resulting graph is the ROC curve. Figure 6 shows classification accuracy percentages of 96.65%, 97.81%, 97.32%, and 95.97%, with an AUC of 0.97 for our proposed algorithm.
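The two rates can be computed directly from confusion-matrix counts. The per-class breakdown below is an assumed example consistent with GoogLeNet's 960/1000 correct predictions, since the paper reports only the totals:

```python
def roc_point(tp, fn, fp, tn):
    """One operating point of the ROC curve from confusion-matrix counts:
    TPR = TP / (TP + FN)  (vertical axis),
    FPR = FP / (FP + TN)  (horizontal axis)."""
    return tp / (tp + fn), fp / (fp + tn)

def accuracy(tp, fn, fp, tn):
    """Overall fraction of correct predictions."""
    return (tp + tn) / (tp + fn + fp + tn)

# Illustrative counts only (the full confusion matrix is in Figure 6):
tpr, fpr = roc_point(tp=700, fn=20, fp=20, tn=260)
acc = accuracy(tp=700, fn=20, fp=20, tn=260)   # (700 + 260) / 1000
```

Sweeping the classifier's decision threshold traces out many such (FPR, TPR) points, and the area under that curve is the reported AUC.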
The loss values of the different CNN models are shown in Table 3, with that of the GoogLeNet (Improved Inception) model the smallest. The change in trend over continuous iterations is shown in Figure 7. Table 4 shows the time consumed by the different models to diagnose the same test image.

As shown in Table 4, the LeNet5 model took the shortest time to diagnose a thyroid ultrasound image, and the GoogLeNet model took the second shortest.
4.3. Joint Training and Secondary Transfer Learning
In GoogLeNet, the transfer is from the MNIST data set to the thyroid images. It is generally believed that transfer between two very different data sets works worse than transfer between two similar ones, and MNIST, a natural image data set, differs greatly from medical images. Therefore, joint data training was conducted here, based on the public database and the database provided by the cooperative organization of this paper. Because of the lack of samples, the joint database was treated as a whole during training, which further expanded the overall database.
In transfer learning, the small-sample database was used as the target domain, and a large quantity of labelled data was used as the source domain. In the previous experiments, 2,210 images of malignant cases and 553 images of benign cases, a total of 2,763 images, were collected from hospitals.
4.4. Analysis of Experimental Results
Table 4 shows the difference in performance between secondary transfer learning and primary transfer, data joint training, and the VGG16-based system. The results showed that, for small medical data sets, secondary transfer significantly improved system performance. Figure 8 compares transfer and non-transfer learning based on LeNet5, VGG16, and Inception V3.

With α = 0.05, the p value of the VGG16 model is less than 0.001, with t = −28.71. Because the p value is less than α, there is enough evidence to reject the null hypothesis. The p value of the LeNet model is 0.05, with t = −1.66, so there is not enough evidence to reject the null hypothesis at α = 0.05. Between LeNet and VGG16, VGG16 has the higher average value, and the averages of the other two groups, GoogLeNet and GoogLeNet (improved), are higher than those of LeNet and VGG16.
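A t statistic of this kind can be computed from per-group accuracy samples. The sketch below uses Welch's two-sample t statistic with hypothetical per-fold accuracies, not the paper's actual measurements:

```python
from math import sqrt
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's t statistic for two independent samples of model accuracies.

    Negative t means sample `a` has the lower mean; the corresponding p value
    would come from the t distribution with the Welch degrees of freedom."""
    va, vb = stdev(a) ** 2 / len(a), stdev(b) ** 2 / len(b)
    return (mean(a) - mean(b)) / sqrt(va + vb)

# Hypothetical per-fold accuracy samples for two models:
lenet = [0.85, 0.86, 0.87, 0.85, 0.86]
vgg16 = [0.91, 0.92, 0.93, 0.92, 0.91]
t = welch_t(lenet, vgg16)   # strongly negative: VGG16's mean is higher
```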
For data sets of similar categories, data joint training also produced results close to those of secondary transfer. Both data joint training and secondary transfer were effective in further improving system performance once transfer learning was introduced, providing a reference for the classification of small data sets and medical image data sets.
5. Conclusions
In the present study, the thyroid ultrasound images were preprocessed by the TV-based self-adaptive image restoration method. Subsequently, the CNN model was established, with the corresponding loss function, learning rate, moving average, and optimization algorithm set for optimization. Three models, namely, the LeNet5, VGG16, and GoogLeNet models, were trained to diagnose benign and malignant thyroid nodules, and the accuracy of each model's diagnosis was then obtained through the tests.
Although all three trained models completed the recognition task, a large amount of image data was collected for training and testing in order to verify the best CNN model for diagnosing such ultrasound images. In the comparison studies, the GoogLeNet (Improved) model exhibited the highest accuracy in determining whether a thyroid nodule had changed pathologically. The average accuracy of the GoogLeNet model was up to 96.04%; furthermore, GoogLeNet (Improved) achieved a classification accuracy of 97%, with a loss value of 0.3844. This indicates that the GoogLeNet model can diagnose whether a patient's thyroid is diseased or normal. Finally, data joint training and secondary transfer learning were performed on the open-source data sets and the thyroid ultrasound image data collected from hospitals, which further improved the classification accuracy.
In this paper, deep learning was applied to auxiliary medical diagnosis. Our next step is to gradually optimize and improve the model so as to ensure a high accuracy rate. The image classification and diagnosis method based on deep learning will provide a reference for doctors diagnosing such diseases, help them improve diagnosis efficiency and accuracy, greatly save manpower, and provide new concepts for the ultrasound diagnosis of thyroid nodules in the future.
Data Availability
The data used to support the findings of this study are available from Weibin Chen via e-mail: sun@wzu.edu.cn.
Conflicts of Interest
The authors declare no conflicts of interest.
Authors’ Contributions
WB.CHEN conceptualized the study; WB.CHEN and ZY.GU developed the methodology; WB.CHEN and ZY.GU worked on software; ZM.LIU, YY.FU, and ZP.YE validated the study; XIN.ZHANG carried out formal analysis; L.XIAO investigated the study; L.XIAO helped with the resources; WB.CHEN and L.XIAO wrote, reviewed, and edited the manuscript; XIN.ZHANG visualized the study. All authors have read and agreed to the published version of the manuscript.
Acknowledgments
This work was financially supported by Zhejiang Provincial Natural Science Foundation of China under Grant nos. LY21F020001 and LY19F030006, Wenzhou Science and Technology Bureau of China (Wenzhou major scientific and technological innovation project, under Grant nos. ZG2020026 and ZY2019019), and Science and Technology of Wenzhou (Y20180232).