Abstract
Kuzushiji, a set of characters written in a cursive style, was continuously used for both publishing and handwriting in Japan until the end of the 19th century. However, well into the twentieth century, most modern Japanese lost the ability to read Kuzushiji due to changes in the writing system. Therefore, developing advanced machine learning algorithms to identify Japanese historical characters is of great significance for preserving cultural knowledge and promoting sustainable development in urban living. Unlike traditional digit image recognition, Kuzushiji image recognition is a more challenging task due to the complex cursive handwriting, the multiple variations between characters, and the imbalanced data. To address these issues, a novel deep learning-based character recognition method called residual shrinkage balanced network (RSBN) is developed for classifying Japanese historical characters. A novel residual shrinkage structure with a soft threshold function is constructed to reduce redundant features using specially designed skip-connected subnetworks. Multiple residual shrinkage modules are then stacked together to learn hierarchical features from the raw input. Furthermore, a novel class rebalancing strategy is designed to improve learning performance on imbalanced data. Experiments are conducted on Japanese classical literature datasets. The results demonstrate that the proposed method obtains 89.01% balanced accuracy on the imbalanced Kuzushiji-MNIST dataset, outperforming existing deep learning-based character image recognition methods. Moreover, without additional data augmentation techniques, the proposed approach also achieves 97.16% balanced accuracy on the Kuzushiji-49 dataset, the best result among existing published results. In terms of computational cost, the number of trainable parameters of the proposed method is comparable to that of the typical ResNet-18.
1. Introduction
Kuzushiji, a set of characters written in a cursive style, was continuously used in Japan to preserve Japanese cultural knowledge for over a thousand years, beginning in the 8th century. In 1868, the Meiji Restoration introduced a modern education system, which led to drastic changes in the Japanese language, writing, and printing systems. Following the change of the writing system, Kuzushiji was gradually dropped from the curriculum of Japanese compulsory education. Well into the twentieth century, most Japanese natives are unable to read and understand 150-year-old cursive Kuzushiji-style handwritten documents. Mitsutoshi Nakano, a scholar of Japanese classical literature, estimated in his book that even in a humanities faculty, fewer than ten percent of the faculty members can read Kuzushiji [1]. However, there are nearly 1.7 million Japanese books written in Kuzushiji, and only a small portion has been transcribed into modern Japanese writing. This limits both scholars, who have mostly relied on transcribed or translated works for their research, and general readers interested in Japanese classical writings and culture [2]. Therefore, developing advanced machine learning algorithms to help identify and classify Japanese historical characters has attracted great attention from the academic community. The construction of intelligent recognition systems will promote the sustainable development of Japanese culture in urban living.
In recent years, the rapid development of cloud computing, big data, and artificial intelligence has contributed to the explosive growth of data. With these enlarged datasets, a great deal of progress has been made in image classification, speech recognition, and other areas [3, 4]. Deep convolutional neural networks (CNNs) are widely known for achieving state-of-the-art performance in various fields with different model architectures, such as LeNet-5 [5], AlexNet [6], ZFNet [7], VGGNet [8], ResNet [9], and DenseNet [10].
In the digit recognition area, the freely available MNIST (Modified National Institute of Standards and Technology) database of handwritten digits has become a standard for fast-testing advanced intelligent algorithms [11]. It contains 70,000 grayscale images, each of 28 × 28 pixels. Altogether, there are 10 classes representing the digits 0 to 9. Deep learning, especially CNNs as a new research direction in the field of machine learning, has been introduced into handwritten digit recognition and has achieved state-of-the-art results. LeNet-5, one of the most typical CNNs, proposed by LeCun et al., has been adopted to identify the traditional MNIST [5], achieving over 98% testing accuracy after model training. Seng et al. [12] observed and compared the performance of different networks. Five different deep models were adopted and discussed in terms of classification accuracy and time, i.e., GoogleNet [13], MobileNet-v2 [14], ResNet-50 [9], ResNeXt-50 [15], and Wide ResNet-50 [16]. In the implementation, PyTorch's default pretrained models were adopted and further trained on the MNIST dataset with a one-cycle policy for 10 epochs per architecture. The testing accuracies were GoogleNet 99.47%, MobileNet-v2 99.42%, ResNet-50 99.38%, ResNeXt-50 99.42%, and Wide ResNet-50 99.47%. For the computation time, MobileNet-v2 was the fastest at 498 s, followed by ResNet-50 at 500 s, GoogleNet at 512 s, Wide ResNet-50 at 540 s, and ResNeXt-50 at 594 s. This indicates that different deep learning model structures lead to different performance and computation times. Therefore, it is very important to balance computational intricacy and recognition performance in real applications.
Apart from character recognition of Latin script in the form of the English language, other works have addressed handwritten character recognition in various scenarios, such as Bangla, Hindi, Chinese scripts, and many other languages. Prashanth et al. [17] proposed modified LeNet and AlexNet models for handwritten Devanagari character recognition, which obtained good performance. Chakraborty et al. [18] developed a deep convolutional neural network (DCNN) with an optimized structure for Bangla handwritten character recognition, which showed notable accuracy. Indian et al. [19] employed a typical convolutional neural network (CNN) for the recognition of Hindi script characters, where the constructed offline handwritten character recognition system achieved an acceptable accuracy level. Ahmed et al. [20] developed a deep convolutional neural network model to recognize handwritten characters of the Kurdish alphabet, covering 34 characters and more than 40 thousand images; the reported training and testing accuracies were 96% and 83%, respectively. Shi et al. [21] designed a lightweight CNN model for offline handwriting, where an attention mechanism was embedded into the pooling layer together with an L2 norm-constrained Softmax classification function. The constructed model reached 96.12% testing accuracy on the ICDAR-2013 competition test set with a model size of 48.34 M. Fateh et al. [22] developed a novel learning system that can recognize multilingual handwritten scripts, including language and digit recognition. In the system, a language-independent model based on a robust CNN was first constructed. Then, a transfer learning technique equipped with an autoencoder was adopted to improve the quality of low-resolution images before feeding them into the recognition stage. Finally, the proposed model was evaluated on six handwritten datasets and obtained higher recognition accuracy than other single-language models.
Unlike traditional handwritten digit recognition, Kuzushiji image recognition is a more challenging task. On the one hand, Kuzushiji historical characters were written in a cursive style and can be written in many different ways. On the other hand, some characters in Kuzushiji look very similar, and it is hard to tell which character it is without considering the surrounding characters as context. To handle this issue, some advanced deep learning models have been proposed for Kuzushiji-MNIST recognition in recent years. Ghosh et al. [3] proposed a simple CNN to classify the Japanese Kuzushiji-MNIST dataset, achieving a high accuracy of 96.13%. Shah and Manjula [23] made a comprehensive comparison of different methods on the Kuzushiji-MNIST dataset. The results showed that a CNN with 11 layers outperformed the traditional SVM and MLP under the condition of large training data sizes. Saini and Verma [24] proposed a model developed with a deep CNN and DropBlock regularization for Japanese Kuzushiji historical character recognition. The modified model achieved the highest accuracy on Kuzushiji-MNIST (97.66%) and Kuzushiji-49 (95.67%) compared with various models such as a 4-nearest neighbor baseline, Naïve-Bayes, AlexNet, a simple CNN, transfer learning with a CNN, LeNet-5, and MobileNet. Chopra [25] presented an improved concept of a deep neural network (DNN) with a progressive computational network, which has a gradient highway between the input and output layers to reduce the diminishing gradient. The approach improved upon the SpinalNet architecture on the KMNIST dataset. Kabir et al. [26] developed a deep neural network, called SpinalNet, used as the classification layers of the VGG-5 network, which achieved state-of-the-art (SOTA) performance on the QMNIST, Kuzushiji-MNIST, and EMNIST datasets. Ueki and Kojima [27] provided a survey on Kuzushiji recognition using deep learning, introducing typical techniques such as CNN, the bidirectional gated recurrent unit (GRU), SE-ResNeXt, DenseNet, and Inception-v4. Clanuwat et al. [28] presented baseline classification results on Kuzushiji-MNIST and Kuzushiji-49. In their experiments, five different models were implemented, including a simple 4-nearest neighbor algorithm, a CNN benchmark [29], PreActResNet-18 [9], PreActResNet-18 with input mixup [30], and PreActResNet-18 with manifold mixup. Finally, PreActResNet-18 with manifold mixup [31], which incorporates a manifold mixup regularizer, achieved the best results on the two datasets.
From the above, it can be observed that many advanced deep neural networks have been constructed for handling the recognition tasks of historical characters and have obtained high recognition accuracy. However, there are still some problems to be addressed: (1) It is commonly believed that models with more complex designs and deeper architectures contribute to stronger learning ability and improved outcomes. However, they usually require heavy computational resources and lead to higher computational costs. In some character image recognition systems, a satisfactory model is required to run on edge devices with limited resources and computational capabilities while maintaining comparable performance. Thus, the tradeoff between recognition accuracy and computational time should be comprehensively considered. (2) In most Japanese historical documents, Kuzushiji characters were written in cursive script. Unlike traditional digit image recognition, Kuzushiji image recognition is a more challenging task due to the complex cursive handwriting, multiple variations between characters, and imbalanced data. It is necessary to develop advanced machine learning algorithms to effectively extract task-specific features from complex and noisy character images. (3) Currently, most deep learning-based image recognition approaches obtain high and satisfactory performance only under the condition of sufficient and balanced data. However, in real applications, a large number of available samples is difficult to obtain. Some characters in the dataset may be limited and highly unbalanced, so that the frequency distribution across the classes of a dataset becomes long-tailed. In that case, the recognition performance of existing techniques drops heavily.
To deal with the above challenges, this paper proposes a novel deep learning-based character recognition method named residual shrinkage balanced network (RSBN) for classifying Japanese historical characters. A residual shrinkage module with a soft thresholding function is constructed to improve the feature learning ability and recognition performance. A class rebalancing strategy is further adopted to reduce the bias induced by the majority and minority classes, with the ultimate goal of yielding high identification accuracy. The main contributions are summarized as follows. (1) A robust and effective deep learning model built by stacking multiple residual shrinkage modules is developed to make a tradeoff between recognition accuracy and computation cost. Accordingly, a novel soft thresholding function is designed and inserted into the deep architecture of the constructed module, which is adaptively optimized to eliminate noise-related features and improve the final classification performance. (2) A novel class rebalancing strategy is constructed, where a class-balanced reweighting term is introduced into the loss function of the deep neural network to quantify the effective number of samples between the majority classes and the minority classes. As such, the model bias induced by the imbalanced data can be effectively diminished. (3) Multiple imbalanced recognition tasks based on the Japanese classical literature datasets Kuzushiji-MNIST and Kuzushiji-49 are constructed for algorithm verification. The number of model parameters and the balanced accuracy, which is the arithmetic mean of sensitivity and specificity, are adopted for performance evaluation. The results demonstrate the superiority of the proposed method in comparison with state-of-the-art methods.
The remainder of this paper is organized as follows. Section 2 introduces the proposed deep learning-based Kuzushiji historical character recognition scheme. Experimental comparisons are given in Section 3. Finally, the conclusions are given in Section 4.
2. The Proposed Residual Shrinkage Balanced Network for Kuzushiji Image Recognition
Deep residual networks (ResNets), a variant of traditional CNNs, employ identity shortcuts to ease parameter optimization and have been widely used for various image tasks. However, they have difficulty learning robust feature representations from heavily noisy and imbalanced datasets. As a novel architecture, the deep residual shrinkage network (DRSN), taking ResNet as the backbone, was first proposed by Zhao et al. to improve the feature learning ability on signals with a low signal-to-noise ratio (SNR) [32]. Related techniques have since been successfully used in signal noise reduction and infrared image recognition [32–34].
Since the DRSN model can effectively mine hidden feature representations without prior knowledge, it is further introduced in this paper to handle complex handwritten character recognition. The proposed residual shrinkage balanced network-driven Kuzushiji historical character recognition on imbalanced data is introduced in detail below. The architecture of the proposed model includes an input layer, several residual shrinkage modules, a batch normalization (BN) layer, a ReLU activation, a global average pooling (GAP) layer, a fully connected layer, and a Softmax output layer. In addition, a class rebalancing strategy is embedded into the loss function to handle imbalanced classification tasks. The whole flowchart of the proposed approach is shown in Figure 1.

2.1. Construction of Residual Shrinkage Balanced Network (RSBN)
2.1.1. Residual Shrinkage Modules with a Soft Threshold Function
ResNet is designed by stacking multiple residual building units composed of convolutional layers (Convs), batch normalization layers (BNs), and rectified linear activation layers (ReLUs). It uses identity shortcut connections between layers to mitigate vanishing gradients, achieving superior performance to traditional CNN structures.
Similar to ResNet, the RSBN is composed of multiple residual shrinkage modules (RSMs), as shown in Figure 2. The basic components of an RSM with channel-wise thresholds include four Convs, two BNs, three ReLUs, a GAP, a fully connected layer (FC), and a Sigmoid activation layer. In the RSM, in addition to the traditional learning units, a GAP is applied to the absolute values of the feature map in the previous layer to obtain a 1D vector, which is further fed into the FC and the Sigmoid layer to obtain the output features. Furthermore, a soft thresholding function is inserted as a nonlinear transformation layer into the RSM to adaptively select task-specific discriminative features, since different channels of the feature map contain different amounts of redundant and noise-related information. In the constructed module, $C$, $W$, and $H$ represent the number of channels, the width, and the height of the previous feature map, respectively; $K$ denotes the number of convolutional kernels in the convolutional layer; $M$ is the number of nodes in the FC layers; and Mul indicates the multiplication operation.

In the image denoising field, a common assumption is that the noise in an image is additive white Gaussian noise. Different brightness and exposure settings of an imaging device lead to different noise levels. From the perspective of image recognition, additive noise in an image introduces strong uncertainty, which has a large effect on the final classification performance. Therefore, it is necessary to explore a learning strategy that reduces various noises according to the noise level of the input image. The soft thresholding function has been widely used as a key step in many signal-denoising methods [35]. It can be represented as follows:

$$y = \begin{cases} x - \tau, & x > \tau \\ 0, & -\tau \le x \le \tau \\ x + \tau, & x < -\tau \end{cases}$$

where $x$ and $y$ correspond to the input feature and output feature, respectively, and $\tau$ is the threshold, a positive parameter. It only sets the near-zero features to zeros, which effectively preserves the negative features.
Accordingly, the partial derivative of the soft thresholding function can be expressed as [35]

$$\frac{\partial y}{\partial x} = \begin{cases} 1, & x > \tau \\ 0, & -\tau \le x \le \tau \\ 1, & x < -\tau \end{cases}$$
The soft thresholding function is then inserted into the residual shrinkage module to remove noise-related features. As shown in the red dashed box in Figure 2, in the constructed RSM, the feature maps of the previous layer are first processed by two convolutional layers to obtain the output feature map. Then, an absolute-value operation and a GAP layer are adopted to reduce the size of the feature map and obtain 1D output features. Next, as in the subnetwork structure of SENet [36], these features are fed into two FC layers to obtain the scaling feature vector $\mathbf{z}$. Finally, a Sigmoid layer is adopted as the last layer, which scales the parameters to the range (0, 1). The output can be expressed as follows:

$$\alpha_c = \frac{1}{1 + e^{-z_c}}$$

where $z_c$ is the output of the $c$-th neuron of the final FC layer and $\alpha_c$ is the corresponding scaled output.
Accordingly, the threshold used in the RSM can be represented as [35]

$$\tau_c = \alpha_c \cdot \operatorname*{average}_{i,j} \left| x_{i,j,c} \right|$$

where $\tau_c$ is the threshold of the $c$-th channel of the feature map $x$, and $i$, $j$, and $c$ are the indexes of the width, height, and channel of the feature map, respectively.
From the above, it can be seen that the threshold is always positive and is adaptively determined and optimized within the RSM, thereby helping reduce the noise-related features.
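To make the structure concrete, the following is a minimal PyTorch sketch of one residual shrinkage module with channel-wise soft thresholding. The layer widths, kernel sizes, and the identity (rather than projection) shortcut are illustrative assumptions; the paper's exact configuration follows Figure 2 with a ResNet-18-style backbone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualShrinkageModule(nn.Module):
    """Sketch of an RSM: two convolutions, a GAP-based subnetwork that
    predicts the scaling factor alpha_c, and channel-wise soft thresholding."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        )
        # Skip-connected subnetwork (two FC layers + Sigmoid), as in SENet [36]
        self.fc = nn.Sequential(
            nn.Linear(channels, channels), nn.ReLU(inplace=True),
            nn.Linear(channels, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        residual = self.body(x)                     # (N, C, H, W)
        abs_mean = residual.abs().mean(dim=(2, 3))  # GAP of |features| -> (N, C)
        alpha = self.fc(abs_mean)                   # alpha_c scaled to (0, 1)
        tau = (alpha * abs_mean).unsqueeze(-1).unsqueeze(-1)  # channel thresholds
        # Soft thresholding: sign(x) * max(|x| - tau, 0)
        shrunk = torch.sign(residual) * F.relu(residual.abs() - tau)
        return x + shrunk                           # identity shortcut
```

Note how the threshold $\tau_c$ is the product of the Sigmoid output and the average absolute feature value, so it stays positive and adapts to the noise level of each channel.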
2.1.2. Network with Class Rebalancing Strategy
In general, a model trained with label supervision usually achieves excellent and reliable classification performance under the condition of balanced data. However, in the real world, it is difficult to gather datasets with a natural equilibrium of labels. Therefore, a good learning algorithm should also take the class distribution into consideration. Currently, two main strategies, resampling (i.e., oversampling and undersampling) and cost-sensitive reweighting, have received wide attention for alleviating imbalance effects in various tasks [37]. In this part, a novel class rebalancing strategy is introduced into the supervised loss function by adding a weighting factor that is inversely proportional to the effective number of samples [38].
Assume an input sample $x$ with label $y \in \{1, 2, \ldots, C\}$, where $C$ is the number of classes. Accordingly, $n_y$ is the number of samples in the $y$-th class. Then, the constructed class-balanced term for each class can be expressed as [39]

$$\frac{1}{E_{n_y}} = \frac{1 - \beta}{1 - \beta^{n_y}}$$

where $\beta \in [0, 1)$ is a hyperparameter and $E_{n_y}$ denotes the effective number of samples of class $y$.
Then, the class-balanced (CB) loss function is defined as

$$\mathrm{CB}(\mathbf{p}, y) = \frac{1}{E_{n_y}} \mathcal{L}(\mathbf{p}, y) = \frac{1 - \beta}{1 - \beta^{n_y}} \mathcal{L}(\mathbf{p}, y)$$

where $\mathbf{p}$ denotes the predicted class probabilities of the training data, $\mathcal{L}(\mathbf{p}, y)$ is the loss function of the model, and $n_y$ is the number of samples in class $y$.
In the final classification layer, the Softmax classifier is adopted to output the probability of each class for the handwritten character recognition of Japanese historical documents. Thus, given the input sample $x$, the predicted output of the last layer can be represented as $\mathbf{z} = [z_1, z_2, \ldots, z_C]^{\mathsf{T}}$. Then, the crossentropy loss can be written as

$$\mathcal{L}_{\mathrm{CE}}(\mathbf{z}, y) = -\log\left(\frac{\exp(z_y)}{\sum_{j=1}^{C} \exp(z_j)}\right).$$
If class $y$ has $n_y$ training samples, the class-balanced (CB) Softmax crossentropy loss is [39]

$$\mathrm{CB}_{\mathrm{softmax}}(\mathbf{z}, y) = -\frac{1 - \beta}{1 - \beta^{n_y}} \log\left(\frac{\exp(z_y)}{\sum_{j=1}^{C} \exp(z_j)}\right).$$
From the class-balanced loss function, it can be seen that the effective number of samples is controlled by the hyperparameter $\beta$. When $\beta = 0$, the loss corresponds to no reweighting, while $\beta \to 1$ corresponds to reweighting by inverse class frequency. As such, the bias induced by the majority and minority classes can be effectively eliminated by adjusting the class-balanced term between no reweighting and reweighting by inverse class frequency.
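As a concrete illustration, the following is a minimal sketch of the class-balanced Softmax crossentropy loss in PyTorch. Rescaling the per-class weights to sum to the number of classes follows the convention of Cui et al. [39]; the example class counts in the usage comment are hypothetical.

```python
import torch
import torch.nn.functional as F

def class_balanced_ce_loss(logits, targets, samples_per_class, beta=0.99):
    """Class-balanced Softmax crossentropy: each class y is reweighted by
    (1 - beta) / (1 - beta^n_y), the inverse of its effective sample number."""
    n = torch.as_tensor(samples_per_class, dtype=torch.float32,
                        device=logits.device)
    weights = (1.0 - beta) / (1.0 - torch.pow(beta, n))  # CB term per class
    weights = weights / weights.sum() * len(n)           # normalize to sum to C
    return F.cross_entropy(logits, targets, weight=weights)

# Usage with hypothetical per-class counts of an imbalanced 10-class task:
# loss = class_balanced_ce_loss(model(x), y, samples_per_class=[900, 80, ...])
```

With `beta=0.0` this reduces to the plain crossentropy loss, and as `beta` approaches 1 the weights approach inverse class frequency, matching the discussion above; `beta=0.99` mirrors the setting used in Section 3.1.3.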
2.2. General Procedure of the Kuzushiji Historical Character Recognition
With the given input samples and label information, the deep residual shrinkage network is first constructed to extract hierarchical discriminative features. In the final optimization, a weighted supervised classification loss is further designed to handle the imbalanced data. As shown in Figure 3, the general procedure of the Kuzushiji historical character recognition can be summarized as follows.
Step 1. The Kuzushiji dataset is acquired, and imbalanced data with different class ratios are constructed. Accordingly, the training data and testing data are divided and preprocessed.
Step 2. To balance the computational time and classification accuracy, the typical ResNet-18 is adopted as the backbone of the feature extractor. Furthermore, the deep residual shrinkage network is constructed on this base model by introducing the soft threshold learning units.
Step 3. To handle the learning bias of the model induced by the imbalanced training data, the class-balanced loss function is designed. It is further applied to the Softmax loss function to rebalance the loss, thereby yielding performance gains on imbalanced data.
Step 4. The imbalanced training data are fed into the constructed residual shrinkage balanced network (RSBN) for model training, where the Adam algorithm is employed for the model supervised training with class information.
Step 5. Validate the performance of the constructed model using the testing samples.
Step 6. Report the Kuzushiji historical character recognition results.
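A compact sketch of Steps 2–6 in PyTorch is given below, reusing the `ResidualShrinkageModule` and `class_balanced_ce_loss` sketched earlier. The model constructor, data loaders, and per-class counts are placeholders; the optimizer (Adam), learning rate (0.01), epochs (100), and β = 0.99 follow the settings reported in Section 3.1.3.

```python
import torch

def train_rsbn(model, train_loader, test_loader, samples_per_class,
               epochs=100, lr=0.01, beta=0.99):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    # Step 4: Adam optimizer for supervised training with class information
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            # Step 3: class-balanced Softmax crossentropy on imbalanced data
            loss = class_balanced_ce_loss(model(x), y, samples_per_class, beta)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # Steps 5-6: validate on the testing samples and report accuracy
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in test_loader:
            pred = model(x.to(device)).argmax(dim=1).cpu()
            correct += (pred == y).sum().item()
            total += y.numel()
    return correct / total
```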

3. Result Analysis and Discussions
3.1. Case 1: The Experimental Results on Constructed Kuzushiji-MNIST Dataset
3.1.1. Dataset Descriptions
In this part, the Kuzushiji dataset is adopted for algorithm verification. The Kuzushiji dataset was created by the Center for Open Data in the Humanities (CODH) [28]. It includes characters in both Kanji and Hiragana based on preprocessed images of characters from 35 books from the 18th century. It comprises three datasets, i.e., the Kuzushiji-MNIST, Kuzushiji-49, and Kuzushiji-Kanji datasets. In the experiments, the Kuzushiji-MNIST and Kuzushiji-49 datasets are adopted for evaluation and are introduced below.
Kuzushiji-MNIST is written in the Japanese cursive writing style. It is similar in format to the original MNIST but poses a harder classification task. Since MNIST is restricted to 10 classes, Kuzushiji-MNIST was also created by choosing one character to represent each of the 10 rows of Hiragana. The Kuzushiji-MNIST dataset has 70,000 images, including 60,000 training samples and 10,000 testing images. Each image is a 28 × 28 grayscale image. Since the training samples are perfectly balanced in the original dataset, only 3000 training samples are randomly selected from the original training data to create the imbalanced training data, which is used for training the constructed model and evaluating the other algorithms.
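For reproducibility, a hedged sketch of this subsampling step is shown below. The paper fixes the total at 3000 training samples but does not spell out the per-class counts, so `class_sizes` is purely illustrative (majority classes such as o and ya large, minority classes such as ki and ha small, consistent with Figure 5).

```python
import numpy as np

def make_imbalanced_subset(images, labels, class_sizes, seed=0):
    """Draw a fixed number of samples per class from a balanced training set."""
    rng = np.random.default_rng(seed)
    keep = []
    for cls, size in enumerate(class_sizes):
        idx = np.flatnonzero(labels == cls)
        keep.append(rng.choice(idx, size=size, replace=False))
    keep = np.concatenate(keep)
    return images[keep], labels[keep]

# Hypothetical 10-class split summing to 3000 samples:
# sizes = [900, 80, 250, 300, 280, 90, 350, 700, 30, 20]
# x_imb, y_imb = make_imbalanced_subset(x_train, y_train, sizes)
```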
One characteristic that distinguishes classical Japanese from its modern counterpart is that classical Japanese contains Hentaigana, Hiragana characters with more than one form of writing. As shown in Figure 4, some samples written in cursive handwritten Japanese characters are presented. There are ten Hiragana classes, named o, ki, su, tsu, na, ha, ma, ya, re, and wo, which correspond to the labels 0 to 9. Each class shows three characters, which still present different shapes in writing style. Thus, it can be seen that Kuzushiji-MNIST is more difficult to classify than the traditional MNIST handwritten digit dataset.

In a classification problem, one of the biggest challenges is how to tackle unbalanced data. As shown in Figure 5, one or more classes in the constructed unbalanced Kuzushiji-MNIST dataset have an extremely low number of samples compared with the other classes. Specifically, class 0 (o) and class 7 (ya) have larger sample sizes than class 1 (ki) and class 5 (ha). The unbalanced classes make it difficult for the model to learn the potentially discriminative features of the minority classes.

3.1.2. Evaluation Indexes
When working on problems with imbalanced data, balanced accuracy rather than classification accuracy is usually selected as the metric to evaluate the performance of a classification task. Balanced accuracy is the arithmetic mean of sensitivity and specificity and can be used in both binary and multiclass classification [40]. The balanced accuracy metric gives half its weight to the proportion of correctly labeled positives and half to the proportion of correctly labeled negatives, which balances the majority and minority classes.
Suppose TP denotes the true positives, i.e., the samples correctly predicted as the positive class; TN the true negatives, the samples correctly predicted as the negative class; FP the false positives, the samples incorrectly predicted as the positive class; and FN the false negatives, the samples incorrectly predicted as the negative class.
Then, the sensitivity, also known as the true positive rate, is expressed as

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}.$$

Specificity, also known as the true negative rate, is represented as

$$\mathrm{Specificity} = \frac{TN}{TN + FP}.$$

Finally, the balanced accuracy is obtained by [40]

$$\mathrm{Balanced\ accuracy} = \frac{\mathrm{Sensitivity} + \mathrm{Specificity}}{2}.$$
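In the multiclass setting, balanced accuracy reduces to the macro-average of per-class recall (sensitivity). A quick sanity check with scikit-learn (an assumption about tooling, not something the paper specifies) is shown below.

```python
from sklearn.metrics import balanced_accuracy_score, recall_score

# Tiny 3-class example: class 0 recall = 1/2, classes 1 and 2 recall = 1
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 1, 2]
print(balanced_accuracy_score(y_true, y_pred))        # (0.5 + 1 + 1) / 3 = 0.8333
print(recall_score(y_true, y_pred, average="macro"))  # same value
```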
3.1.3. Comparison Methods
To verify the effectiveness and superiority of the constructed RSBN model in addressing Kuzushiji historical character recognition under imbalanced data, several state-of-the-art approaches are implemented for performance comparison. Eleven different deep learning models are considered: (1) the proposed method without the class rebalancing strategy (proposed-basic); (2) the 18-layer deep residual network (ResNet-18) [9], taken as the baseline; (3) the 18-layer wide residual network (Wide ResNet-18) [16], a ResNet variant obtained by decreasing the depth and increasing the width of the residual networks; (4) the 18-layer stochastic depth network (stochasticdepth-18) [41]; (5) MobileNet [42], which uses depth-wise separable convolutions to reduce the number of parameters; (6) MobileNet-v2 [14], which introduces an improved module with an inverted residual structure for feature extraction; (7) ShuffleNet [43]; and (8) ShuffleNet-v2 [44].
In addition to the above typical deep learning models, three modified CNN models that obtained excellent results in character image classification tasks are taken for comparison. (9) ACNN [17] is a modified version of AlexNet. Its first convolutional layer has 28 kernels regularized with the L2 regularization technique, and the numbers of kernels in the remaining four convolutional blocks are 56, 112, 224, and 448. Dropout with a ratio of 50% is embedded into the first two fully connected layers. (10) DCNN [18] contains two convolutional layers with 32 filters each, where each convolutional layer is followed by a max pooling layer. Finally, there are three fully connected layers with 1024 nodes in the first two layers and 512 nodes in the third layer; the first two layers have a dropout value of 0.5. (11) The constructed CNN [20] has three convolutional blocks. The first convolutional block has 16 filters, the second has 32 filters, and the third has 64 filters. All max pooling layers have a pool size of (2, 2). After that, two fully connected layers with 512 nodes are used, and a dropout layer with a value of 0.5 is added to the fully connected layer.
To make a fair comparison and evaluate the classification performance of the various models, several measures are adopted. First, for each approach, the same training data are used for training, and the testing data are also kept consistent. Second, all the models use the same learning rate of 0.01, the Adam optimization algorithm, and 100 epochs. For the proposed method, the hyperparameter β is set to 0.99. Finally, all the deep models are repetitively trained over five trials, and the average balanced accuracy is taken for the recognition performance comparison.
3.1.4. Comparison Results on the Kuzushiji-MNIST Dataset
In this part, the recognition results, including the five repetitive experiments, the average balanced accuracy, and the number of learnable parameters, are used for algorithm evaluation. As shown in Figure 6, each model obtains more than 80% testing accuracy in each experiment, which demonstrates the effectiveness and strong learning performance of deep neural networks in handling complex tasks. In addition, regarding classification performance, the average balanced accuracy over the five repetitive experiments is calculated for a comprehensive evaluation. The balanced accuracy and standard deviation of the twelve different learning approaches are presented in Table 1. The balanced accuracies of the proposed method, proposed-basic, ResNet-18, Wide ResNet-18, stochasticdepth-18, MobileNet, MobileNet-v2, ShuffleNet, ShuffleNet-v2, ACNN, DCNN, and CNN are 89.01%, 88.00%, 87.33%, 88.34%, 82.81%, 81.30%, 83.91%, 87.14%, 83.70%, 85.73%, 84.63%, and 84.26%, respectively.


As the computational cost (number of trainable parameters) increases, deeper neural networks usually obtain better recognition accuracy than lightweight models such as MobileNet and ShuffleNet, owing to their stronger ability to learn deep discriminative features. However, this does not always hold, as the classification performance of stochasticdepth-18 is poor. The major reason may be that the stochasticdepth-18 model randomly drops entire residual blocks during training and bypasses their transformations through skip connections, which weakens its generalization ability on unbalanced training tasks. The lightweight models, including MobileNet, MobileNet-v2, ShuffleNet, and ShuffleNet-v2, have fewer learnable parameters than the other models, but their best classification accuracy is the 87.14% achieved by ShuffleNet, which does not match the proposed approach. Furthermore, the proposed method without the class-balanced loss obtains slightly better results than ResNet-18, which demonstrates the advantage of the constructed deep residual learning model. By further enforcing the class-balanced term, the proposed approach achieves the best results because the loss term reduces the bias induced by the imbalanced classes. The obtained testing accuracy is therefore promising, and the standard deviation is also competitive among all the compared methods. This further demonstrates the strong stability and robustness of the proposed method on this challenging task, which highlights its advantages in practical applications.
3.1.5. Confusion Matrix
In this part, the confusion matrix is utilized to evaluate the performance of the proposed approach after classification. The rows of the confusion matrix represent the predicted classes, while the columns represent the true classes. The value in each cell represents the testing accuracy from the actual label to the predicted label. Besides the accuracies, detailed misclassification information for each class is available in the confusion matrix. The results are presented in Figure 7.

Figure 7: confusion matrices of (a) the first experiment and (b) the second experiment.
It can be seen that the proposed method obtains higher classification accuracies on classes 0 (o), 3 (tsu), 6 (ma), 7 (ya), 8 (re), and 9 (wo) than on the other classes. This is attributed to the fact that these classes have larger numbers of samples involved in model training, which favors the majority classes, while the lowest recognition accuracies of 79.50% and 76.90% occur in classes 1 (ki) and 5 (ha). This is consistent with the data distribution, where classes 1 (ki) and 5 (ha) contain minority samples that are more difficult to recognize during the testing stage. Class 1 (ki) contains fewer than 100 samples, yet close to 80% prediction performance is still obtained, while classes 2 (su), 3 (tsu), and 4 (na), with relatively small imbalance ratios, obtain over 82%, 95%, and 85% testing accuracy, respectively. Through this experimental investigation, it can be observed that the effect of the degree of data imbalance on the proposed approach is not significant. The results further validate the effectiveness of the proposed method on imbalanced data.
3.1.6. Feature Visualizations
To further illustrate what the deep feature representations have learned in the proposed method, t-distributed stochastic neighbor embedding (t-SNE) is adopted to visualize the high-dimensional data extracted by the last convolutional layer, mapping each data point from the high-dimensional space into a low-dimensional map space [45]. The 2-D visualizations of the raw input data and of the extracted high-level features obtained by the t-SNE technique are shown in Figure 8.

Figure 8: t-SNE visualizations of (a) the raw data and (b) the extracted hierarchical features.
It can be seen that significant distribution discrepancies exist among the different character image classes, and the data from different classes are projected into different places. For the raw input data, most of the samples from different classes are heavily stacked together, demonstrating that the raw features are not well discriminative; it is difficult to conduct character image classification directly on the raw data. In contrast, for the high-level features extracted by the constructed deep model, the samples from the same class are compactly gathered together with relatively large interclass distances, and the samples from different classes are well located in different regions. The feature discriminability after learning with the deep model is thus clearly enhanced compared with the raw input data, which further manifests the outstanding performance of the proposed method in deep feature learning and classification tasks.
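A minimal sketch of this visualization step is given below, assuming `features` holds the flattened activations exported from the last convolutional layer (for instance via a PyTorch forward hook) and `labels` the corresponding class indices; the function and variable names are illustrative.

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features, labels, title):
    """Project high-dimensional features to 2-D with t-SNE and scatter-plot
    them colored by class, as in Figure 8."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=5)
    plt.title(title)
    plt.show()

# plot_tsne(raw_images.reshape(len(raw_images), -1), labels, "Raw data")
# plot_tsne(last_conv_features, labels, "Extracted hierarchical features")
```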
3.2. Case 2: The Experimental Results on Kuzushiji-49
3.2.1. Dataset Descriptions of the Kuzushiji-49 Dataset
To further demonstrate the advantage of the proposed method in tackling imbalanced data, the Kuzushiji-49 dataset is further employed. As shown in Figure 9, the Kuzushiji-49 dataset is an extension of the Kuzushiji-MNIST dataset; it has 49 classes and contains 270,912 images, covering 232,365 training samples and 38,547 testing samples. Each image is also a 28 × 28 grayscale image. It is a much larger but highly imbalanced dataset containing 48 Hiragana characters and one Hiragana iteration mark. Therefore, it is well suited to algorithm verification under imbalanced data. As shown in Figure 10, the data distribution differs across classes. Some of the classes have more than 6000 samples, e.g., class 0 (a), class 1 (i), and class 2 (u), while others, such as classes 44 (i) and 45 (e), only have 417 and 392 samples. Thus, a large data imbalance exists in the Kuzushiji-49 dataset, which is a major challenge for the classification task.


3.2.2. Comparison Results on Constructed Kuzushiji-49 Dataset
In this part, the recognition results of different methods are compared. Since balanced accuracy has been utilized in previous research for evaluating various techniques on the Kuzushiji-49 dataset, we compare against the existing published results. Seven different techniques are considered: the traditional machine learning 4-nearest neighbor baseline and the deep learning methods Keras Simple CNN benchmark, PreActResNet-18, PreActResNet-18 with input mixup, and PreActResNet-18 with manifold mixup, whose results are taken from reference [28]. In addition, the 18-layer residual network (ResNet-18) and the proposed method without class-balanced loss are also implemented for algorithm comparison.
According to Table 2, the 4-nearest neighbor baseline achieves a base recognition accuracy of 86.01% on Kuzushiji-49, which is lower than the deep learning models. ResNet-18 and the proposed method without the class-balanced loss (proposed-basic) achieve similarly high recognition accuracies of 96.84% and 96.84%, which are better than the CNN model. This demonstrates the advantage of the residual architecture in learning robust features from imbalanced data. In addition, PreActResNet-18 with input mixup, PreActResNet-18 with manifold mixup, and the proposed method all obtain more than 97% accuracy. The recognition performance of PreActResNet-18 with manifold mixup is even slightly better than that of the proposed method. But it should be noticed that both mixup variants involve additional data augmentation, in which weighted combinations of random input sample pairs from the training data are constructed. This data augmentation technique requires expert knowledge to select suitable parameters, which is time-consuming and labor-intensive. Without the mixup data augmentation, the proposed method outperforms the other existing deep learning models, which further confirms its feasibility and superiority.
3.3. Advantage Discussion
From the results obtained in these two cases, it can be seen that the proposed method presents obviously better performance in Kuzushiji character image recognition tasks. The main advantages of the proposed approach with respect to other existing works can be summarized as follows. (1) Strong feature learning and model generalization performance. The proposed approach constructs a novel deep learning model by stacking multiple residual shrinkage units. The soft thresholding, which can be adaptively updated, is inserted into the deep architecture to effectively drive unimportant features toward zero. As such, the learned high-level features are more discriminative, which helps obtain better learning ability than existing deep learning-based recognition models with traditional convolution structures. (2) Promising computational cost. Convolutional neural networks have achieved great success in image recognition, and as performance requirements grow, the number of network layers continues to increase. Thus, it is necessary to search for a tradeoff between classification performance and computational complexity. Compared with existing deep learning models, the proposed method obtains better recognition accuracy while keeping a number of trainable parameters comparable to the baseline ResNet-18. (3) Handling imbalanced classification problems. In real applications, data in the vision domain often exhibit a highly skewed class distribution, known as the class imbalance problem. This issue has received little attention in the field of character image classification. The proposed method therefore designs a class rebalancing strategy, where a class-balanced crossentropy loss term is developed to adjust the bias induced by the majority and minority classes. The constructed model can thus deal well with imbalanced classification problems, overcoming the limitations of existing deep learning models with a traditional crossentropy loss function.
4. Conclusions
Kuzushiji, an old Japanese writing style, has been disappearing into the river of history since Japan stepped into the twentieth century. Therefore, it is significant to adopt advanced artificial intelligence techniques to bridge classical Japanese literature and the modern writing system. To promote the sustainable development of Japanese cultural knowledge, a novel approach called the residual shrinkage balanced network is presented to effectively recognize Kuzushiji characters written in cursive style. First, a deep residual network is taken as the backbone network to realize feature extraction from the imbalanced training data. Then, an effective residual shrinkage structure with a soft threshold function is embedded into the deep residual network to enhance the feature representation ability of the network. Finally, a class-balanced loss is integrated into the network to improve the classification accuracy. Two Kuzushiji character-based imbalanced classification cases are used, and the results confirm the advantage of the proposed approach even though the number of samples varies significantly across classes. With the rapid development of hardware technology, deep learning models have shown great potential and have significantly enhanced the effectiveness of various classification tasks, which is very helpful for accelerating the real applications of intelligent character recognition systems. A limitation of the proposed method lies in the assumption that the training and testing data are subject to the same distribution. However, in the real world, the training data may come from different situations or institutions, which presents large distribution discrepancies due to variations in the environment, known as the domain shift problem. In recent years, transfer learning techniques, which aim to transfer knowledge learned in one domain to another, have provided an effective solution to such issues. How to combine transfer learning and deep learning techniques to improve the learning performance of the constructed system in complex environments will be investigated in the future. In addition, it should be noted that an independent character image recognition task is implemented in this manuscript. In real applications, however, the character images are embedded in documents, and segmenting the content of documents into independent character images is the first and most challenging task. It belongs to the semantic image segmentation problem of dividing an unknown image into different parts and objects, which deserves further study in the future.
Data Availability
Previously reported data were used to support this study and are available at https://arxiv.org/abs/1812.01718. These prior studies (and datasets) are cited at relevant places within the text as reference [28].
Conflicts of Interest
The authors declare no conflict of interest.
Authors’ Contributions
H. Q. and J. D. were responsible for the conceptualization, methodology, validation, writing, and supervision. All authors have read and agreed to the published version of the manuscript.
Acknowledgments
This research was funded in part by the Project of Art Science in Shandong Province (ZD201906444), "Study on the Construction of Shandong Image in Marine Culture and Language Landscape in Shandong Province."