Abstract
Background and Objective. The use of Chinese herbal medicines (CHMs) for treatment plays an important role in traditional Chinese medicine (TCM). However, some herbs are easily confused with others because their shapes and textures look similar, even though their uses can be entirely different. Recently, deep learning has attracted great attention for image recognition and could be useful for TCM herb identification. Methods. To recognize easily confused TCM herbs on a smartphone, we propose a CHM recognition system using hierarchical clustering convolutional neural networks (HCNNs) based on the affinity propagation clustering method. Results. We implement our system on the smartphone and show recognition accuracy close to 98%, based on a dataset of 65 kinds of herbs (including 12 pairs of easily confused herbs). We also investigate the effect of different parameters (e.g., selection of clustering algorithms for HCNNs, types of smartphone, and number of layers in the neural network) on the system performance. Conclusions. In this work, we proposed a hierarchical clustering convolutional neural network (HCNN) method to distinguish similar TCM herbs with high accuracy. We also showed the usefulness of applying data augmentation techniques when implementing the proposed system for a variety of smartphones.
1. Introduction
Chinese herbal medicines (CHMs) play an important role in TCM. Chinese herbs primarily come from different parts of plants, including leaves, roots, stems, flowers, and seeds. The core idea of CHM is to restore the balance of the human body to achieve a state of health. While CHM has become an increasingly popular method of treatment globally, most people find it difficult to recognize different Chinese herbs and to know the properties of each kind of herb. Moreover, some herbs are easily confused with others because their shapes and textures look similar, even though their uses can be entirely different. For example, Astragali Mongolici Radix [1] is commonly used in CHM treatment because of its efficacy in strengthening the immune system. However, some sellers offer Hedysari Radix [2] as a replacement for Astragali Mongolici Radix because it tastes better and is cheaper. Hedysari Radix has a shape and texture similar to those of Astragali Mongolici Radix but is less effective at boosting the immune system. Other examples of easily confused herbs are Dioscoreae Rhizoma [3] and Manihot Esculenta [4]. The former is commonly used to maintain the function of the lung and kidney, while the latter could be poisonous if not properly used. Table 1 lists some commonly used Chinese herbs [5].
Chinese herbs are commonly used for food preparation and play a vital role in Chinese medicine. For reasons of safety and efficacy, it is important to recognize these herbs correctly. However, given that some herbs have similar shapes and textures, most people find it difficult to recognize them without extensive experience or expert knowledge. Therefore, a system that helps people recognize the herbs and understand their properties would be valuable.
Although many illustrated handbooks of Chinese herbs are available, it is time-consuming and inefficient to use these books to distinguish easily confused herbs. On the other hand, given the popularity of the smartphone, it can serve as a convenient vision-based-measurement (VBM) [6] instrument for recognizing the herbs. While a few prior studies attempted to use computer vision techniques for herb recognition [7–10], their results are generally limited in the following aspects: (1) reliance on hand-crafted features, (2) small datasets (e.g., only 18 herbs in [7, 8]), (3) no focus on easily confused herbs, and (4) low recognition accuracy. In this work, we aim to build a smartphone-based system that uses a convolutional neural network (CNN) to recognize easily confused TCM herbs. More specifically, we propose a hierarchical CNN method that first clusters similar herbs into groups (using the affinity propagation algorithm [11]) and trains a CNN model to classify an image into one of these groups. We then train a CNN model for each group to classify the herbs within that group.
The contributions of this paper are twofold. First, we set out to develop a system for automatic recognition of easily confused CHMs on the smartphone. Users just need to take pictures of CHMs and the system will show information about the herbs on the phone. The proposed system could potentially be used for the following applications: (1) knowing whether the herb is genuine or not and (2) understanding the properties of the herbs. As far as we know, this is the first TCM herb recognition App implemented on a phone. Second, we propose a hierarchical CNN (HCNN) method for recognizing 24 easily confused herbs. Our initial results show classification accuracy close to 98% (a 5% improvement in comparison to the naive CNN). Note that, while our proposed HCNN architecture is not entirely new, we investigate the effect of different parameters (e.g., selection of clustering algorithms for HCNN, types of smartphone, and different CNN models) on the system performance. We believe that these insights could be of interest to readers of this journal.
2. Related Work
2.1. Herb Recognition Based on Its Smell and Taste
Luo et al. [12] developed an electronic nose that simulates biological olfactory organs to achieve the physiological function of the nose through machine learning. Their work can identify 6 types of pungent CHMs. In addition, they proposed a method using an electronic tongue to identify taste information of five different CHMs [13]. However, these prior works share the same problems. First, such instruments are not easy to build or obtain. In addition, time-consuming preprocessing needs to be performed before one can employ such approaches to identify different herbs (e.g., herbs need to be ground into powder and heated for 30 minutes).
2.2. Traditional Vision Techniques for Herb Recognition
Herb recognition using computer vision techniques is generally more cost-effective than methods based on taste or smell. Some of these methods are based on hand-crafted features. Tao et al. [7, 8] utilized texture to classify 18 different CHMs. Cai et al. [9] and Liu et al. [10] used color, texture, and shape feature descriptors to identify 3 and 8 different CHMs, respectively. Finally, there is also some prior work on leaf and flower recognition [14–17], using techniques such as local binary pattern (LBP) [18], histogram of oriented gradients (HOG) [19], and scale-invariant feature transform (SIFT) [20].
2.3. CNN for Herb Recognition
The problem in using traditional hand-crafted feature descriptors is that one needs to know what kinds of features are appropriate for classification. However, it might be difficult to find representative features that capture the differences between a set of easily confused herbs. Deep learning has recently become increasingly popular, and many studies have shown that it can outperform many traditional machine learning methods for various image recognition tasks [21]. In particular, CNN has attracted strong interest from both academia and industry since the ImageNet dataset became available. Sun and Qian [22] used CNN for CHM recognition by collecting a total of 5,523 images from 95 categories. The average accuracy of their results is about 71%. However, they did not specifically consider the use of CNN for recognizing easily confused herbs (e.g., as shown in Table 2, there are only two pairs of easily confused herbs in their dataset). In our work, we propose a method based on HCNN to distinguish 12 pairs of easily confused herbs.
2.4. Hierarchical CNN
The concept of hierarchical CNN was introduced in some prior work. Yan et al. [23] implemented HD-CNN (hierarchical deep CNN) that breaks down an image recognition task into two levels. To separate simple classes from each other, an HD-CNN first uses a CNN classifier to classify the image data into K coarse categories. More complicated classes are redirected downstream to fine classifiers with divisions that concentrate on confusing classes. This work showed an improvement of 2.28% on the accuracy rate based on CIFAR100 and ImageNet datasets. They used spectral clustering to cluster their data into K coarse categories.
Mao et al. [24] evaluated their HCNN approach on the German traffic sign recognition benchmark (GTSRB). They proposed a CNN-oriented family clustering (CFC) algorithm to partition the traffic signs into K families. In these studies, the number of clusters (i.e., K) needs to be predefined, which is more suitable for a fixed dataset like ImageNet. In our work, we employ affinity propagation (AP) to cluster easily confused herbs. AP does not require the number of clusters to be determined in advance. Given that no large herb image database is currently available (the image data used in this study were all created by ourselves), AP is more suitable for us since we can expand our database over time without having to change our algorithm.
3. System Framework
We first started our experiments using a naive CNN to recognize some easily confused herbs. But then, we realized that some of these herbs look very similar and we were unable to obtain good results for these herbs. Therefore, in this work, we implement a hierarchical CNN method for these easily confused herbs.
Figure 1 shows our hierarchical clustering CNN architecture. In the training phase, we first cluster similar-looking herbs into the same group. Next, we create a two-layer CNN. The first layer is to create a model for cluster classification while the second layer is to classify herbs in the same cluster. In the testing phase, the system will first decide which cluster the input image belongs to, and then use the trained model in the second layer to recognize the herb within the identified cluster.

In this paper, we apply the AP algorithm [11] to cluster similar herbs into a group. AP is based on the concept of message passing between data points: each data point determines which other points are most suitable as its exemplar (i.e., cluster center or cluster head) and how suitable they are. Unlike traditional clustering algorithms such as k-means, AP does not require the number of clusters to be determined in advance. More specifically, we proceed as follows:
(1) For our training data, we randomly sample some images from each kind of herb, and then, we extract their features to calculate the similarity matrix [11] as the input of the AP algorithm.
(2) After performing the AP algorithm, each data point (i.e., herb image) decides its exemplar. It is possible that images of the same kind of herb choose different exemplars, so we adopt a majority-vote mechanism to decide the final exemplar for each kind of herb.
(3) If the final exemplars of two kinds of herb are the same, we cluster them into the same group.
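Steps (2) and (3) can be illustrated with a minimal sketch. It assumes that AP has already been run and has returned, for every sampled image, the index of its exemplar image; the function name `group_herbs_by_exemplar` and the toy labels are hypothetical and are not taken from our implementation.

```python
from collections import Counter, defaultdict


def group_herbs_by_exemplar(image_herbs, image_exemplars):
    """Group herb classes by the exemplar most of their images chose.

    image_herbs[i]     : herb label of sampled image i
    image_exemplars[i] : index of the exemplar image that AP assigned to image i
    """
    # Step (2): count, for each herb, how often each exemplar was chosen.
    votes = defaultdict(Counter)
    for herb, exemplar in zip(image_herbs, image_exemplars):
        votes[herb][exemplar] += 1

    # Majority vote; ties go to the exemplar that was encountered first.
    final_exemplar = {
        herb: counts.most_common(1)[0][0] for herb, counts in votes.items()
    }

    # Step (3): herbs sharing the same final exemplar form one group.
    groups = defaultdict(list)
    for herb, exemplar in final_exemplar.items():
        groups[exemplar].append(herb)
    return sorted(sorted(members) for members in groups.values())


# Toy example: three images each of A1, A2, and B1.
toy_herbs = ["A1", "A1", "A1", "A2", "A2", "A2", "B1", "B1", "B1"]
toy_exemplars = [0, 0, 3, 0, 0, 3, 6, 6, 6]
toy_groups = group_herbs_by_exemplar(toy_herbs, toy_exemplars)
```

In this toy input, most images of A1 and A2 choose image 0 as their exemplar, so the two herbs fall into one group, while B1 forms a group of its own.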
In the first layer of our CNN-based clustering model, we utilize an open-source deep learning framework named Caffe [25], and we pretrain our CHMs model over 1 million ImageNet images of 1,000 categories. The architecture of CaffeNet is shown in Figure 2. It consists of eight layers, of which the first five layers are convolutional layers. Three max-pooling layers follow the first, second, and fifth convolutional layers, respectively. The last three layers are fully-connected layers. The number of neurons in the last fully-connected layer of our clustering model is set to the number of herb groups. The function of the first layer of our CNN-based clustering model is to decide which group the input herb belongs to.

The function of the second layer of our HCNN model is to recognize the target herb within an herb group. In the second layer, for each herb group, we train a CNN model similar to that of the first layer. In other words, there are multiple second-layer models, each of which corresponds to one herb group. Each herb group contains at least one herb based on the clustering results.
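The two-stage testing phase can be sketched as follows. This is only an illustration under simplifying assumptions: the classifiers are stand-in Python callables rather than trained CaffeNet models, and the names `hierarchical_predict`, `toy_cluster_model`, and `toy_group_models` are hypothetical.

```python
from typing import Callable, Dict, Tuple

Classifier = Callable[[dict], str]


def hierarchical_predict(
    image: dict,
    cluster_model: Classifier,
    group_models: Dict[str, Classifier],
) -> Tuple[str, str]:
    """Return (group, herb) using a first-layer and a second-layer classifier."""
    # First layer: decide which herb group the image belongs to.
    group_id = cluster_model(image)
    # Second layer: use the model trained for that group only.
    herb = group_models[group_id](image)
    return group_id, herb


# Stand-in classifiers operating on made-up scalar features.
def toy_cluster_model(image: dict) -> str:
    return "G1" if image["texture"] < 0.5 else "G2"


toy_group_models = {
    "G1": lambda image: "A1" if image["color"] < 0.5 else "A2",
    "G2": lambda image: "B1",  # a group that contains a single herb
}

result = hierarchical_predict(
    {"texture": 0.2, "color": 0.9}, toy_cluster_model, toy_group_models
)
```

A group with a single herb needs no real second-layer decision, which the constant classifier for `G2` mimics.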
4. Results
4.1. Experimental Environments
We evaluated our proposed hierarchical CNN model using CaffeNet [25] based on the AlexNet model [21], and all experiments were run on 64-bit Ubuntu 14.04 with an Intel i7-4790 CPU, a GeForce GTX 1060 GPU, and 16 GB RAM. In particular, BVLC CaffeNet [25] is used for training the CNN model. The trained model is later ported to the phone for the testing phase.
4.2. CHM Dataset
In this work, we select 12 pairs of easily confused CHMs from a book named “Illustrations of Commonly Misused Chinese Crude Drug Species in Taiwan” [26] for our experiments, as shown in Figure 3. The image data used for our experiments were taken with an iPhone6. One hundred images were taken for each herb, so our dataset contains 2,400 images in total. Of these, 1,440 images are used for training and the rest are used for testing. The names of the herbs are shown in Table 3. There are a total of twelve pairs of easily confused herbs in this dataset (e.g., A1/A2).

4.3. The Benefit of Using Transfer Learning on the Accuracy of Our CNN Model
BVLC CaffeNet provides an option called fine-tune which allows one to copy the model parameters from a pretrained CNN model (otherwise, all the parameters in the CNN model are initialized with random values). Given that our dataset is small, we expect it to benefit from using pretrained parameters to initialize our model. This is known as the transfer learning method. We utilize the pretrained parameters from the ImageNet work [21] to initialize our CHMs model. Figure 4 shows the average accuracy with and without the use of transfer learning. It clearly shows that the model converges quickly and with much higher accuracy when the fine-tune option is enabled.

4.4. Comparison of the Hand-Crafted Method with CNN for Herb Recognition
Some prior studies employed hand-crafted features for herb recognition. In this study, we compare three different feature extraction methods, including HOG [19], LBP [18], and BOW SIFT [20], with CNN. For the HOG and LBP implementations, the cell size is set to 32. Because the number of SIFT feature points in each image is not fixed, we first extract the SIFT feature points of all the training data and run them through K-means clustering (with the number of centers set to 200) so that all images can be represented by vectors of the same dimension. Finally, a pretrained SVM [27] model is used for herb classification.
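To illustrate how a variable number of local descriptors is turned into a fixed-length vector, the sketch below assigns each descriptor to its nearest cluster center and counts the assignments. It is a simplified illustration: the descriptors and centers are toy 2-D points rather than SIFT descriptors and 200 K-means centers, the normalization of the histogram is an assumption, and the name `bow_histogram` is hypothetical.

```python
def bow_histogram(descriptors, centers):
    """Map a variable-length list of descriptors to a fixed-length histogram."""
    hist = [0] * len(centers)
    for d in descriptors:
        # Index of the nearest center under squared Euclidean distance.
        nearest = min(
            range(len(centers)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(d, centers[k])),
        )
        hist[nearest] += 1
    total = sum(hist)
    # Normalize so that images with different numbers of keypoints are comparable
    # (an assumption made for this sketch).
    return [h / total for h in hist] if total else [0.0] * len(centers)


toy_centers = [[0.0, 0.0], [10.0, 10.0]]
toy_descriptors = [[1.0, 0.0], [9.0, 9.0], [0.0, 1.0], [1.0, 1.0]]
toy_vector = bow_histogram(toy_descriptors, toy_centers)
```

Whatever the number of descriptors in an image, the output length equals the number of centers, which is what allows a single classifier to be used on all images.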
For the CNN experiment, we first rescale the images to 256 × 256 and then randomly crop 224 × 224 patches from these images to increase the amount of training data and reduce overfitting. We enable the fine-tune option and set the number of neurons in the last layer to 24 to match the number of easily confused herbs in our data. Our model is trained using stochastic gradient descent with a batch size of 60 samples (we set the momentum to 0.9, weight decay to 0.0005, and gamma to 0.1). An equal learning rate is used for all layers, and the initial learning rate is set to 0.0001.
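The random-cropping step can be sketched as below. This is a minimal illustration that represents an image as a list of pixel rows; in practice the cropping is handled by the training framework, and the function name `random_crop` is hypothetical.

```python
import random


def random_crop(image, crop_size=224, rng=random):
    """Return a crop_size x crop_size patch at a random position."""
    height, width = len(image), len(image[0])
    if height < crop_size or width < crop_size:
        raise ValueError("image is smaller than the requested crop")
    # randint is inclusive at both ends, so every valid offset can be drawn.
    top = rng.randint(0, height - crop_size)
    left = rng.randint(0, width - crop_size)
    return [row[left:left + crop_size] for row in image[top:top + crop_size]]


# A 256 x 256 toy image whose pixel value encodes its own position.
toy_image = [[r * 256 + c for c in range(256)] for r in range(256)]
toy_patch = random_crop(toy_image, rng=random.Random(0))
```

With a 256 × 256 input and a 224 × 224 crop, there are 33 possible offsets along each axis, so each source image can yield many slightly different training patches.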
Table 4 shows the results using hand-crafted methods and the CNN method. We employ five-fold cross-validation to calculate the accuracy. Among the traditional hand-crafted methods, LBP achieves the highest accuracy at 86.85%. This is not surprising since texture is an important feature for CHMs and LBP is powerful for texture classification. The accuracy of using CNN is 95.69%, which is much better than that of all the hand-crafted feature descriptors methods.
In addition, we compare two different CNN models, CaffeNet [25] and VGG16 [28]. The latter is used by a prior study for CHMs recognition [28]. VGG16 uses a deeper structure than CaffeNet so that it takes more time for training and testing, as shown in Table 5 (the same parameters are used for both the models). The execution time of VGG16 is about three times longer but the accuracy of both models is similar. Therefore, we decide to use a simple model like CaffeNet in this work because it runs faster with acceptable accuracy, as shown in Table 4.
Figure 5 shows the accuracy of detection for all 24 different herbs using a confusion matrix. We find that some herbs are more easily mistaken for another herb, such as B1 and B2 as well as C1 and C2. These recognition errors are reasonable though since they are easily confused even for human eyes. In the next section, we show that these classification errors can be improved with the proposed hierarchical clustering CNN method.

4.5. Performance of Hierarchical Clustering CNN Method
In this work, we employ the use of HCNN to reduce recognition errors. We propose the use of the AP algorithm [11] to automatically cluster similar herbs into a group. In this section, we compare the performance of AP clustering and manual clustering which is based on an illustrated handbook of easily confused herbs [26]. As shown in Figure 6, twenty-four herbs are divided into 14 groups according to the illustrated handbook.

For the implementation of the AP algorithm, we calculate the similarity matrix based on LBP features because LBP has the best classification accuracy among all the hand-crafted features we tried. We randomly sample 20 images for each herb to run AP. The results of AP clustering are shown in Figure 7 for comparison with the result of manual clustering in Figure 6. Figure 8 shows the confusion matrix of the cluster classification using CNN based on the AP algorithm (i.e., whether an herb is classified into the correct cluster), and Figure 9 shows the confusion matrix for classifying the herb within a cluster based on the classification results shown in Figure 8.
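A minimal sketch of how such a similarity matrix might be built is given below. It makes several assumptions that are not details of our implementation: a basic 3 × 3 LBP operator with a single 256-bin histogram per image (no cells), and the negative squared Euclidean distance as the similarity measure, which is a common choice for AP. The function names are hypothetical.

```python
def lbp_histogram(gray):
    """Normalized 256-bin histogram of basic 3x3 LBP codes (interior pixels)."""
    # Neighbors visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    for r in range(1, len(gray) - 1):
        for c in range(1, len(gray[0]) - 1):
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # Set the bit when the neighbor is at least as bright as the center.
                if gray[r + dr][c + dc] >= gray[r][c]:
                    code |= 1 << bit
            hist[code] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else [0.0] * 256


def similarity_matrix(features):
    """Pairwise similarity as the negative squared Euclidean distance."""
    n = len(features)
    return [
        [-sum((a - b) ** 2 for a, b in zip(features[i], features[j])) for j in range(n)]
        for i in range(n)
    ]


flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
cross = [[0, 9, 0], [9, 5, 9], [0, 9, 0]]
toy_features = [lbp_histogram(flat), lbp_histogram(cross)]
toy_similarity = similarity_matrix(toy_features)
```

More similar textures give values closer to zero, so images of look-alike herbs tend to select the same exemplar when this matrix is passed to AP.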



Furthermore, we compare the performance of the AP algorithm with other clustering algorithms. Table 6 shows the herb classification accuracy from ten experiments using other automatic clustering algorithms, namely k-means [29] and spectral clustering [30]. Both of them require the number of clusters to be predetermined before running the algorithm. Here, we set the number of clusters (i.e., the K value) to 14, which matches the manual clustering using the illustrated handbook. Table 6 shows that the AP algorithm has a more stable and higher accuracy. For spectral clustering, some of the results are even worse than those of the naive CNN (i.e., CNN without hierarchical clustering, as shown in Table 4). This is because spectral clustering first uses a Laplacian matrix to reduce the dimension and then employs the k-means algorithm to do the clustering. Therefore, it might lose some information during the dimension reduction. In addition, its results are sensitive to the choice of the initial K value. A bad choice of K might lead it to a locally optimal solution.
Table 7 shows a detailed comparison of the recognition accuracy for each herb between the CNN method and the proposed HCNN method. We find a significant improvement for some herbs such as B1 and H2 in addition to a general improvement of average accuracy (about 2%) when the HCNN method is employed. The results from AP clustering are very similar to those of manual clustering based on the illustrated handbook (which is considered the ground truth for herb clustering in this study).
4.6. The Effect of the Number of CNN Layers
The above results are based on a CNN architecture of 8 layers. A recent trend is to perform model compression (e.g., by reducing the number of layers of a deep neural network) for resource-limited devices like smartphones [31]. We next explore the use of a smaller number of layers for CNN training. Specifically, we consider the following configurations:
(1) Eight layers: 5 convolutional layers and 3 fully-connected layers (the original)
(2) Six layers: 5 convolutional layers and 1 fully-connected layer
(3) Four layers: 3 convolutional layers and 1 fully-connected layer
We find that the recognition accuracy drops as we reduce the number of CNN layers. The average accuracy is about 94% for 6-layer CNN and 90% for 4-layer CNN. Nevertheless, these results are still better than traditional methods using hand-crafted features.
4.7. The Effect of Different Brands of Smartphones
The camera parameters (e.g., resolution, image size, and color) of different smartphones can be quite different. Figure 10 shows the herb images taken by 4 different smartphones, including iPhone6, Samsung S7, Xiaomi, and Asus PadPhone. Figure 11 shows the color distributions from these phones. We can see that the images taken by iPhone are more similar to those from Samsung but quite different from images taken by Xiaomi and Asus phones. Therefore, if the training data are taken by iPhone and tested on other brands of phones, the recognition results could be poor, as shown in Table 8 (in this experiment, the training data and testing data were collected from different phones).


Data augmentation (DA) is a common way to improve the results of CNN by artificially creating more training data from the original dataset through various transformations of the original images. In this study, we implement four simple data transformations on the original iPhone dataset: rotation, resizing, brightness change, and histogram equalization, as shown in Figure 12.
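Two of these transformations, brightness change and histogram equalization, can be sketched for an 8-bit grayscale image stored as a list of pixel rows. This is a simplified illustration, not our implementation: color channels, rotation, and resizing are omitted, the image is assumed to be non-empty, and the function names are hypothetical.

```python
def adjust_brightness(gray, delta):
    """Shift every pixel by delta and clip the result to the range [0, 255]."""
    return [[min(255, max(0, p + delta)) for p in row] for row in gray]


def equalize_histogram(gray):
    """Spread the intensity distribution over [0, 255] using the image CDF."""
    hist = [0] * 256
    for row in gray:
        for p in row:
            hist[p] += 1

    # Cumulative distribution of pixel intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    total = cdf[-1]
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        # A constant image has nothing to equalize.
        return [row[:] for row in gray]
    scale = 255 / (total - cdf_min)
    return [[round((cdf[p] - cdf_min) * scale) for p in row] for row in gray]


toy_gray = [[10, 10, 20], [30, 30, 40]]
toy_brighter = adjust_brightness(toy_gray, 220)
toy_equalized = equalize_histogram(toy_gray)
```

Each transformed copy is added to the training set alongside the original image, so the model sees the same herb under several simulated camera conditions.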

Tables 9 and 10 show the performance of using a single data augmentation (DA) method and of combining multiple DA methods, respectively. By comparing Table 9 with Table 10, we can see that data augmentation is generally helpful in improving the recognition accuracy (by up to about 9% for the Asus phone) if we only have training data from a single type of phone (i.e., iPhone in our case). However, different types of transformations might have different effects on different phones. As shown in Table 9, providing additional data can improve the results for Xiaomi and Asus phones but does not help much for iPhone and Samsung S7. In particular, data augmentation through histogram equalization might reduce the recognition accuracy for iPhone and Samsung S7.
Finally, we consider adding training data from other smartphones. Specifically, 1,440 images were taken with each smartphone (60 for each herb) for training the CNN model. As shown in Table 11, introducing additional training data from all the smartphones yields a larger improvement than data augmentation alone. However, given that it might not be feasible to collect training data from all the smartphones in the world, data augmentation is still a good way to improve the performance of CNN. In addition, we are currently exploring the use of generative adversarial networks (GANs) [32] to generate synthetic herb images for data augmentation.
4.8. Visualization
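In order to understand which features in the herb images our proposed hierarchical CNN model considers important for recognizing an herb, we employ the layer-wise relevance propagation (LRP) algorithm [33] to visualize which pixels in the input images contribute most strongly to the classification. LRP decomposes the output of the network into the sum of the relevance of the pixels in the input image. The output relevance $R_f$ is calculated by forward propagation; the relevance score $R_i^{(l)}$ is then computed layer by layer, down to the pixel level, following the basic propagation rule of [33]:
$$R_i^{(l)} = \sum_j \frac{z_{ij}}{\sum_{i'} z_{i'j}} R_j^{(l+1)}, \qquad z_{ij} = x_i^{(l)} w_{ij}^{(l,l+1)},$$
where $i$ is a neuron at layer $l$, $x_i^{(l)}$ is its activation, $w_{ij}^{(l,l+1)}$ is the weight connecting it to neuron $j$ at layer $l+1$, and $\sum_j$ runs over all upper-layer neurons.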
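As a small worked illustration, the sketch below applies the basic LRP redistribution rule of [33] to a single fully-connected layer: the relevance of each upper-layer neuron $j$ is shared among the lower-layer neurons $i$ in proportion to $z_{ij} = x_i w_{ij}$. The function name `lrp_layer` and the toy numbers are hypothetical, and biases as well as the stabilized variants of the rule are left out.

```python
def lrp_layer(x, weights, relevance_upper, eps=1e-9):
    """Propagate relevance one layer down.

    x               : activations of the lower layer (length n)
    weights[i][j]   : weight from lower neuron i to upper neuron j
    relevance_upper : relevance of the upper-layer neurons (length m)
    """
    n, m = len(x), len(relevance_upper)
    relevance_lower = [0.0] * n
    for j in range(m):
        # Contribution of each lower neuron to upper neuron j.
        z = [x[i] * weights[i][j] for i in range(n)]
        z_sum = sum(z)
        if abs(z_sum) < eps:
            continue  # skip neurons whose total input is (numerically) zero
        for i in range(n):
            relevance_lower[i] += z[i] / z_sum * relevance_upper[j]
    return relevance_lower


toy_x = [1.0, 2.0]
toy_weights = [[1.0, 0.5], [1.0, 0.25]]
toy_upper = [3.0, 1.0]
toy_lower = lrp_layer(toy_x, toy_weights, toy_upper)
```

In this toy case the lower-layer relevances sum to the same total as the upper-layer relevances, which is the conservation property that lets the heatmap values be read as shares of the network output.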
The results are shown as heatmaps in Figure 13. Pixels in yellow or red have higher LRP values (with red higher than yellow) and are considered to have a strong influence on the classification results. Generally speaking, there are more circular layers (in yellow) in C2 than in C1 when either CNN or hierarchical CNN is used. However, these circular layers are slightly more observable when the proposed hierarchical CNN is used.

5. Application
Based on the proposed hierarchical CNN, we implemented a smartphone App (currently supporting only Chinese) that can automatically recognize CHM herbs, as shown in Figure 14.

To further validate the performance of our proposed system, in addition to the 24 easily confused herbs, we also collect another dataset which includes 41 herbs, as shown in Figure 15 and Table 12, and integrate these data into this herb recognition App system. We obtain similar results (around 98%) on the recognition accuracy for these additional herb data (the details are not discussed here due to the space limitation).

6. Discussion
In this work, we propose a system that can recognize easily confused TCM (traditional Chinese medicine) herbs on a smartphone with high accuracy. As far as we know, this is the first smartphone-based system that considers recognition of easily confused TCM herbs using deep learning techniques. Generally speaking, we observed that a deeper neural network performs better for herb recognition. In addition, we provide an explainable model to show which features in the herb image contribute most strongly to the final classification results. We found that the recognition accuracy could be affected by the camera parameters (e.g., color histogram) of different brands of smartphones. Different data augmentation techniques were implemented to improve the system accuracy. Finally, we showed that the use of transfer learning is very beneficial when collecting a large amount of herb data for training is difficult.
7. Conclusions and Future Work
In this work, we focus on the recognition of easily confused herbs by proposing a hierarchical clustering CNN method that uses affinity propagation to cluster similar herbs into groups. In each group, CNN is then used to extract representative features to distinguish similar herbs. As compared to CNN, our proposed method can improve the detection accuracy by almost 5%. In addition, we study the impact of different brands of smartphones on CHMs recognition accuracy. When the data augmentation is used with more data from different smartphones, we can improve the recognition accuracy from 86.82% to 95.76%. We are currently enriching our herb database so that our system can recognize more CHMs. In addition, we are exploring the use of generative adversarial networks (GANs) [32] to generate synthetic herb images for the data augmentation. Finally, in the future we plan to study the quality of CHMs by extending the App system we developed.
Abbreviations
CHMs: Chinese herbal medicines
TCM: Traditional Chinese medicine
HCNNs: Hierarchical clustering convolutional neural networks
HCNN: Hierarchical clustering convolutional neural network
VBM: Vision-based-measurement
CNN: Convolutional neural network
LBP: Local binary pattern
HOG: Histogram of oriented gradients
SIFT: Scale-invariant feature transform
HD-CNN: Hierarchical deep CNN
GTSRB: German traffic sign recognition benchmark
CFC: CNN-oriented family clustering algorithm
AP: Affinity propagation
DA: Data augmentation
GANs: Generative adversarial networks
LRP: Layer-wise relevance propagation algorithm
Data Availability
The data used to support the findings of this study are available upon request from the corresponding author.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Authors’ Contributions
Kun-chan Lan conceived and designed the experiments; Juei-Chun Weng, Jun-Xiang Zhang, and Tzu-Hao Tsai performed the experiments and analyzed the data; Min-Chun Hu helped with the interpretation of data; Yuan-Shiun Chang provided the herb data; both Kun-chan Lan and Tzu-Hao Tsai wrote the final version of this paper. All authors read and approved the final manuscript.
Acknowledgments
This research has received funding from the Ministry of Science and Technology (MOST), Taiwan, under the grant number 111-2221-E-006-120-.