Abstract

Digital handwriting recognition is an emerging field in optical character recognition (OCR). A digital writing pad replaces manual writing on paper, but in digital writing the alphabet varies in font and shape: the writer's pen pressure and pen position on the pad introduce errors when the captured text is converted into a text file. These shape changes in the alphabet lead to errors during OCR-to-text conversion. The problem is pronounced in languages such as Tamil, Chinese, Arabic, and Telugu, whose alphabets contain bends, curves, and rings. OCR-to-text conversion for the Tamil language suffers particularly high word error rates because the angles and curves in the alphabet must be converted accurately. This paper proposes a ResNet two-stage bottleneck architecture (RTSBA) for recognizing Tamil text written on a digital writing pad. In the proposed RTSBA, two separate neural network stages reduce the complexity of the Tamil alphabet recognition problem: the initial stage reduces the number of inputs and variables, and the final stage reduces time and computational complexity. The proposed algorithm is compared with traditional algorithms such as long short-term memory, Inception-v3, recurrent neural networks, convolutional neural networks, and a two-channel and two-stream transformer. On the digital writing pad-handwritten dataset and the HP Labs dataset, the proposed RTSBA achieves accuracies of 98.7% and 97.1%, respectively.

1. Introduction

Tamil is spoken in southern India and across the world [1]. It is spoken and written in Tamil Nadu, India, and in countries such as Singapore, Sri Lanka, and Malaysia [2]. Tamil is an ancient language whose literature extends from about 300 BC to the present. From 700 to 1600 AD, the language was known as Middle Tamil; from about 1600 onward, it is identified as Modern Tamil. Old Tamil writings are carved in stone, Middle Tamil was written on palm leaves, and Modern Tamil appears in textbooks and is written with pen and paper. Recently, Tamil has also been written on digital writing pads, and Tamil documentation is stored in digital formats. A digital writing pad saves paper, and the resulting documents are easily accessible and cost-effective.

Handwriting recognition for the Tamil language is an ongoing challenge. Researchers have developed algorithms for Tamil handwriting recognition using convolutional neural networks (CNNs), but recognition accuracy still needs to improve. Tamil is a script-based language written in many different styles, so recognizing handwritten Tamil text accurately is difficult. Researchers improve handwriting recognition through methods such as font normalization, character segmentation, and model ensembling.

Tamil handwriting recognition is a challenging task due to the presence of complex characters and patterns. Studies have shown that deep learning methods such as CNNs and recurrent neural networks (RNNs) can achieve high accuracy for Tamil handwriting recognition. Moreover, feature extraction and image segmentation isolate characters from handwritten text images and improve recognition accuracy [3]. Other methods, such as bidirectional long short-term memory (BiLSTM) encoding, are also used for Tamil handwriting with high recognition accuracy.

Automatic character recognition and conversion of online handwritten text with optical character recognition (OCR) are challenging for the Tamil alphabet. Writing on a tablet with a fingertip or a stylus pen has increased with the growth of internet technology, so a novel method is required for Tamil handwritten text segmentation and classification. The major problems with Tamil handwriting are size variations, dimensional changes, irregular stylus points, discontinuity of structures, superfluous overloops, shape variation, and distinctive curves [4]. Figure 1 shows handwritten Tamil captured on a digital writing pad for different age groups and the optical digital conversion system; a document written with a digital pen and pad exhibits incorrectly recognized words. To overcome this problem, a ResNet two-stage bottleneck architecture (RTSBA) is proposed. The two-stage bottleneck architecture of ResNet is designed to filter noise and enhance the image for better network performance.

In addition, nonlocal (NL) and attention-gate (AG) blocks enhance performance by increasing the network's ability to capture long-range dependencies. The bottleneck structure of ResNet reduces the number of parameters in the network and increases accuracy, while the NL and AG blocks reduce computational complexity and memory use. A two-stage bottleneck architecture has fewer parameters, which leads to faster computation, and the reduced parameter count also mitigates overfitting.

The contributions of this work are: (1) to recognize handwritten text from a digital pad and pen using RTSBA, which is based on segmentation; (2) to recognize handwritten text written at different pen pressures using the proposed RTSBA algorithm, which segments each character into simple curves and closed simple curves for recognition; (3) to recognize digital Tamil writing based on demography, such as Madurai, Tirunelveli, Coimbatore, and Thanjavur in Tamil Nadu, where the written alphabet varies in its non-simple curves; (4) to recognize digitized Tamil writing through syllable-based alphabet detection and classification with RTSBA; and (5) to classify and recognize the Tamil alphabet for different age groups of writers, namely (i) 15–25, (ii) 26–35, (iii) 36–45, (iv) 46–55, and (v) 56–65, and to compare the Accuracy, Precision, Recall, and F1 score of the proposed method with traditional algorithms.

2. Related Work

A novel deep learning-based approach for multilingual handwritten numeral recognition has been developed [5]. The process involves a pretrained CNN model and transfer learning. The method was tested on numeral images collected from eight different languages: Arabic, English, Persian, Hebrew, Urdu, Mongolian, Kalmyk, and Spanish, and it provides near-perfect results for specific languages such as Arabic and Persian. However, that study did not analyze handwriting variability or evaluate the network's capacity to recognize digits written by people from different countries or with different writing styles. A fully convolutional recurrent network (FCRN) [6] recognizes online handwritten Chinese text by incorporating spatial-semantic context. FCRN is trained end-to-end with multistage training, using deep learning techniques to extract features and map the data to a higher semantic space; however, the model is prone to vanishing gradients, which limits performance on more complex tasks. An offline handwritten Chinese text recognition technique based on a fully convolutional network has also been suggested [7]. This technique classifies handwritten Chinese text in input images using a model that consists of several convolution layers, residual blocks, and attention mechanisms. The convolution layers extract features from the images, the residual blocks increase the information capacity of the model, and the attention mechanism allows the model to focus on the most critical information in the image; however, the model recognizes only individual characters, not the sequence of characters in a sentence.

BiLSTM with data augmentation, including rotation, shifting, and stretching, improves text recognition accuracy, but the model focuses on a specific type of handwriting [8]. Deep feature learning on wearable sensors improves handwritten character recognition [9]. That model relies on a deep learning architecture to extract high-level feature representations, which are used in a supervised learning approach to predict the handwritten character. The model needs better balance, as the timing and sequencing information is complex and difficult to map accurately to handwriting features.

The bottleneck transformer (BNT) [10] is a novel network architecture developed for visual recognition tasks. It uses a simplified transformer-based encoder-decoder network and compresses image features at multiple scales before feeding them into a cross-scale self-attention module for efficient inference. BNT uses a block-sparsity regularization technique to reduce the complexity of the network and speed up inference. The compression enables BNT to classify unseen images accurately with far fewer parameters than a standard transformer, but it does not consider relative positional information, which limits performance on tasks with highly structured data. A combination of CNNs and RNNs [11] provides the best accuracy in classifying and recognizing handwritten mathematical symbols and expressions, but it is limited to a few symbols and expressions and does not handle mathematical operations. An RNN-based deep learning model accurately classifies text and non-text strokes in online handwritten Devanagari documents [12]. The model is trained using multiple two-dimensional feature vectors extracted from the stroke images and evaluated for average accuracy with its best-performing parameters; it classifies strokes only as text or non-text.

A novel approach recognizes offline handwritten Tamil characters using a combination of a conditional generative adversarial network (cGAN) and a CNN [13]. The preprocessing step identifies characters in the handwritten dataset using edge detection methods. The cGAN augments the dataset and generates more primitive features that are used to train the CNN model, which is then tested on the augmented images and evaluated for performance and accuracy. The method is unable to support complex writing.

To improve the efficiency of online handwriting recognition (OHR), Carbune et al. [14] proposed fast multilanguage LSTM-based online handwriting recognition. This approach utilizes an optimized version of the LSTM network that runs on multiple platforms, and the OHR system processes data more rapidly than traditional LSTM networks. The model performs better on datasets with high-quality images; noise in the images degrades performance and accuracy. In [15], a segmentation method based on an attention-embedded lightweight network is developed. The network combines CNNs with attention modules, captures more informative features from images, and improves segmentation accuracy; however, segmentation accuracy varies with the types of features used to train the model. Table 1 summarizes the different methods and languages.

3. RTSBA Method for Non-Simple and Simple Curve Text Classification

The proposed RTSBA model, shown in Figure 2, is used to recognize Tamil text. Text images are collected from different sources, such as offline note-taking with pen and paper and online sources. RTSBA segmentation consists of two separate stages, which reduce the complexity of Tamil handwriting recognition. The two-stage bottleneck architecture segments text more accurately because the network focuses on important features and ignores irrelevant objects, yielding more accurate and detailed segmentation results than single-stage architectures. RTSBA reduces complexity by using a different type of neural network at each stage and predicts curves in the alphabet more accurately. The model consists of an encoder phase and a decoder phase with an attention block between them; the attention block integrates the NL and AG blocks, which overcome the bottleneck problems in CNNs.
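For reference, the sketch below shows a standard ResNet bottleneck residual block of the kind RTSBA builds on (a 1×1, 3×3, 1×1 convolution stack with a skip connection). The channel widths and normalization choices are illustrative assumptions, since the paper does not list an exact layer configuration.

```python
# Sketch of a standard ResNet bottleneck residual block, assumed as the
# building unit of the two-stage architecture. Channel sizes are examples.
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """1x1 -> 3x3 -> 1x1 convolutions with a residual (skip) connection."""
    def __init__(self, in_ch: int, mid_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False),  # reduce channels
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, in_ch, kernel_size=1, bias=False),  # restore channels
            nn.BatchNorm2d(in_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # residual addition keeps gradients flowing through the skip path
        return self.relu(self.body(x) + x)
```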

The encoder phase consists of convolution layers and max-pooling layers, which capture the content of the images. The convolution layers extract image features, followed by ReLU activations and max pooling to dilute the feature parameters. Squeeze-and-excitation blocks are introduced in the encoder phase to overcome feature loss and workload problems. After the image enters the network through two convolution layers, two-channel separation is applied. A downsampling process brings the image to the same size as the input image; the model learns the detailed image properties, and the image size is then increased using the upsampling process.
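Below is a minimal sketch of the squeeze-and-excitation block mentioned above, assuming the standard design (global average pooling followed by a small gating network); the reduction ratio r = 16 is an assumption.

```python
# Squeeze-and-excitation (SE) block sketch: channels are reweighted by a
# learned gate computed from globally pooled spatial context.
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global spatial context
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),                    # excitation: per-channel weights
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.gate(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                         # reweight feature channels
```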

The feature map is adjusted to a size similar to the decoder-phase input image. The decoder phase consists of four upsampling blocks, each containing two convolution layers and one ReLU layer. The size of the input information decreases in the encoder phase and increases in the decoder phase. When the input is compressed and a bottleneck occurs, some attributes fail to be transmitted. To overcome this bottleneck problem, a two-stage block is developed to minimize the loss of input information.

In Figure 3, the NL block feature map is represented; ⊗ and ⊕ depict multiplication and addition, respectively. For each row, the softmax operation is performed block by block, as described in Equation (1).

Here, Wz represents the initial weight values, ai the residual information, bi the similar-size information, and ci the block value. The bottleneck problem is solved using the NL and AG blocks.
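Since Equation (1) is not reproduced here, the sketch below shows a conventional non-local block with an embedded-Gaussian pairwise term, which matches the row-wise softmax described above; the `w_z` projection corresponds to the Wz weights in the text, and the channel sizes are assumptions.

```python
# Non-local (NL) block sketch: self-attention over all spatial positions,
# with softmax applied row by row over the attention matrix.
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        inter = channels // 2
        self.theta = nn.Conv2d(channels, inter, kernel_size=1)  # query
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)    # key
        self.g = nn.Conv2d(channels, inter, kernel_size=1)      # value
        self.w_z = nn.Conv2d(inter, channels, kernel_size=1)    # Wz projection

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)  # (b, hw, c')
        k = self.phi(x).flatten(2)                    # (b, c', hw)
        v = self.g(x).flatten(2).transpose(1, 2)      # (b, hw, c')
        attn = torch.softmax(q @ k, dim=-1)           # row-wise softmax
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.w_z(y)                        # residual addition
```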

The AG block uses a 1×1×1 convolution; the convolved signal is passed through a ReLU activation and resampled through a sigmoid function. NL blocks help the network capture long-range dependencies in the data, that is, relationships between data points that are far apart. These relationships play a vital role in image classification but are not exploited by traditional convolutional networks; NL blocks capture them through self-attention.
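A hedged sketch of the attention gate follows, using the widely adopted additive design (1×1 convolutions, ReLU, then a sigmoid gating map resampled to the feature size). The text mentions a 1×1×1 convolution, i.e., a 3D kernel; this 2D version is a simplifying assumption, as the paper's exact wiring is not given.

```python
# Attention-gate (AG) block sketch: a gating signal suppresses irrelevant
# regions of the feature map and keeps the salient ones.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    def __init__(self, feat_ch: int, gate_ch: int, inter_ch: int):
        super().__init__()
        self.w_x = nn.Conv2d(feat_ch, inter_ch, kernel_size=1)
        self.w_g = nn.Conv2d(gate_ch, inter_ch, kernel_size=1)
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)

    def forward(self, x, g):
        # additive attention: sigmoid(psi(ReLU(Wx*x + Wg*g)))
        g = F.interpolate(self.w_g(g), size=x.shape[2:],
                          mode="bilinear", align_corners=False)
        alpha = torch.sigmoid(self.psi(F.relu(self.w_x(x) + g)))
        return x * alpha  # gate the encoder features
```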

NL blocks also improve computational efficiency: since self-attention modules do not rely on convolutional kernels or local processing, they perform well with fewer operations. NL blocks are more powerful and accurate than traditional convolutional layers. The self-attention module allows global context-aware learning and enables the network to capture relationships between different locations in the image, leading to more accurate predictions and better performance. The AG block helps the network focus on the input's relevant features and remove irrelevant distractions, improving the model's performance. AGs reduce computational cost through faster convergence, and the attention maps provide a visual representation of the features, making the model easier to interpret and debug.

4. Experiments and Results

The digital writing pad-handwritten (DWP-H) dataset was created using a Wacom CTL-672/K0-CX graphic tablet, which supports both online and offline capture. Text samples were acquired under ambient lighting with pressure-based stylus writing on the digital pad. Samples from 251 writers were obtained in .tiff image format. Text images were collected from different age groups: children aged 10-18, adults aged 19-59, and older males and females aged 60-75. Most writers contributed five samples per class, a few contributed as many as ten, and each class contains 550 samples with a character size of 92 × 133 pixels. The images were resized to about 50 × 50 pixels and normalized by transforming each grayscale pixel value from the [0, 1] range to the [−1, 1] range.

The experiments used the Adam optimizer with a learning rate of 0.0001, a batch size of 64, and 50 epochs. The proposed RTSBA is analyzed using Accuracy, Precision, Recall, and F1 score for Tamil handwritten words. The Tamil language is composed of many similar characters and strokes, such as the right curve, left curve, circle, up curve, down curve, dot, question mark, slanting line, standing line, sleeping line, sleeping-and-standing line, springs, and down curve with circle. Among these strokes, the recognition of curves is complex due to the rings in the alphabet. A CNN approach is used as a baseline for Tamil handwritten word recognition. Table 2 compares the proposed RTSBA results with the modified multi-scale segmentation network (MMU-SNet) and CNN. In the first word, the second character is incorrectly predicted as "மு" instead of "ழு," which also changes the word's meaning. In the second word, the second character is mispredicted as "ந" instead of "ற." For the third word, the first character is incorrectly predicted as "ஆ" instead of "சூ."
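As a concrete illustration, the preprocessing and training configuration reported above can be expressed as the following sketch; `RTSBANet` is not defined here, and `model` and `train_set` are placeholders for the proposed network and the DWP-H dataset, which are not public APIs.

```python
# Preprocessing and training setup matching the reported configuration:
# 50x50 grayscale input rescaled from [0, 1] to [-1, 1], Adam with
# lr = 0.0001, batch size 64, 50 epochs.
import torch
from torch import nn, optim
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Grayscale(),
    transforms.Resize((50, 50)),
    transforms.ToTensor(),                        # pixel values in [0, 1]
    transforms.Normalize(mean=[0.5], std=[0.5]),  # maps [0, 1] -> [-1, 1]
])

def train(model: nn.Module, train_set, epochs: int = 50):
    loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
    optimizer = optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```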

The performance measures are summarized for open-curve and closed-curve alphabets in the Tamil language.

5. Evaluation of the Alphabet in the Tamil Language

The Accuracy (A), Precision (P), Recall (R), and F1 score (F1) are four statistical measures used to assess the performance of the RTSBA classifier. The metrics are calculated as in Equations (2)–(5). The performance evaluation of the classifiers is shown in Table 3.

The Accuracy metric reflects how the model performs across all classes; it is calculated by dividing the number of correct predictions by the total number of predictions.

P is obtained by dividing the number of correctly classified positive samples by the total number of samples classified as positive.

R is calculated by dividing the number of positive samples correctly classified as positive by the total number of positive samples. R assesses the model's ability to locate positive samples; higher R values indicate that more positive samples are discovered.

The F1 score is the harmonic mean of P and R, and it therefore accounts for both false positives and false negatives.
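Since Equations (2)–(5) are not reproduced in the text, the standard definitions matching the descriptions above are:

```latex
\begin{align}
A   &= \frac{TP + TN}{TP + TN + FP + FN} \\
P   &= \frac{TP}{TP + FP} \\
R   &= \frac{TP}{TP + FN} \\
F1  &= \frac{2 \cdot P \cdot R}{P + R}
\end{align}
```

Here TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.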

These statistical measures are used to assess the performance of the classifiers, as summarized in Table 3.

Figures 4–7 show the statistical measures of the proposed RTSBA method compared with traditional classifiers: CNN, CNN with the self-adaptive lion algorithm (CNN + SALA), VGG19Net, and AlexNet. The CNN + SALA models are computationally expensive and require more training time and resources, and their performance suffers on unbalanced or noisy data. AlexNet has a large number of parameters and performs worse on smaller datasets; it is not suited to feature extraction for open-curve alphabet segments. VGG19Net tends to overfit on the closed-curve alphabet and requires a large dataset to train.

5.1. Different Dialect Person Handwritten Tamil Alphabet Recognition

Tamil dialects are spoken in different parts of Tamil Nadu, India, and around the world. Dialects of Tamil include Madras Bashai and the Kongu, Kannada, and Malayalam dialects, and writing varies from one person to another based on dialect. Each dialect has a unique vocabulary, grammar, and writing style. The Tamil diaspora speaks several Tamil dialects in other parts of the world, such as Singapore, Malaysia, and South Africa. Within Tamil Nadu, dialects are associated with Madurai, Coimbatore, and Thanjavur, and they vary in phonemic modifications and sound changes from Old Tamil to Modern Tamil. For example, the word for "here" is anku (அங்கு) in the Coimbatore dialect and anga (அங்க) in the Thanjavur dialect; Old Tamil ankaṇa (அங்கன) gives rise to anganakula (அங்கனகுள்ள) in the Tirunelveli dialect, and Old Tamil ankittu is the source of ankittu (அங்கிட்டு) in the Madurai dialect. Table 4 shows the performance analysis of the classifiers based on the statistical measures for the dialect-based alphabet.

Figures 8–11 show the statistical measures for alphabets written by speakers of different dialects: Madurai Tamil, Tirunelveli Tamil, Coimbatore Tamil, and Thanjavur Tamil.

5.2. Syllable-Based Alphabet Recognition

Tamil syllables are the basic units of Tamil pronunciation. They are composed of a consonant and a vowel sound, or sometimes just a vowel sound. Tamil syllables are written using combinations of twelve vowels and eighteen consonants, and the language uses various vowel-consonant combinations to create unique characters and individual syllables. Tamil words are classified here as having one, two, or three syllables. Table 5 shows the performance evaluation of classifiers with Accuracy, Precision, Recall, and F1 score based on syllables.

Figures 12–15 show the statistical measures of the proposed RTSBA method against classifiers such as CNN + SALA, AlexNet, and VGG19Net. The proposed RTSBA achieves high Accuracy, Precision, Recall, and F1 score based on syllables.

The performance evaluation of different classifiers is shown in Table 6, along with the accuracy.

Figure 16 shows the accuracy of the proposed RTSBA on the DWP-H and HP Labs datasets with different classifiers. The Tamil handwritten dataset from HP Laboratories has 1,000 images in 169 folders; 550 samples are considered for each class, across about 156 classes. Two-stage bottleneck architectures outperform traditional architectures on tasks such as image classification and object detection: the multiple layers provide better generalization, and the smaller filters in each layer reduce the total number of parameters, minimize overfitting, and make the model more efficient, reducing the overall complexity of the architecture.

5.3. Different Age-Based Handwritten Tamil Alphabet Recognition

The writers are classified into the five age groups mentioned earlier, each with different educational qualifications. The performance of the classifiers on handwritten alphabet recognition is assessed statistically across these age groups, as shown in Table 7.

Figures 17–20 show the statistical measures of the proposed RTSBA for handwritten Tamil alphabet recognition across different age groups. The proposed RTSBA achieves high Accuracy, Precision, Recall, and F1 score for all age groups.

Different OCR systems convert handwritten Tamil text into digital text, as shown in Table 8; these include Google Docs, i2 OCR, Easy Screen OCR, Unicode Tamil OCR, and SUBASA Tamil OCR. Figure 21 depicts OCR accuracies of 95%, 45%, 95%, 35%, and 40%, respectively, for the handwritten Tamil words. OCRs such as Google Docs and Easy Screen OCR can predict simple and complex curves, similar shapes, and discontinued curves, whereas i2 OCR, Unicode Tamil OCR, and SUBASA Tamil OCR handle only simple curves and find it challenging to predict complex curves, similar shapes, and discontinued curves.

6. Discussion

As described, the proposed RTSBA achieves good results in predicting curves. Existing networks such as LSTMs are computationally intensive and require substantial memory to store data in long-term memory; they are more complex than other networks, harder to tune and optimize, and suffer from vanishing and exploding gradients due to the long-term dependencies across multiple update gates [24]. Inception-v3 [25] is a deep learning architecture with multiple layers added to Inception-v2; it is computationally expensive to train, and as a black-box model its internal layer parameters are difficult to interpret.

Gradients also vanish or explode in long-horizon RNNs [26]. While gradient descent works for small RNNs, propagating gradients through larger RNNs is difficult, resulting in the vanishing or exploding gradient problem; this is a significant issue in RNNs. Since RNNs have many parameters, they are prone to overfitting, which affects the network's performance when predicting new data points. Context also varies significantly over time, so RNNs fail to remember the exact past words or contexts, which degrades performance. RNNs exhibit unpredictable behavior due to their complicated internal connections, and it is challenging to determine the cause of this behavior; the unpredictability produces performance issues and transient errors that propagate across time steps, leading to network instability and crashes. Because RNNs use data from previous time steps, they cannot exploit multi-threaded processors, as the computation does not parallelize, so computation time is higher.

CNNs are prone to overfitting on the training set, which can lead to poor generalization and inaccurate results on test data [27]. Compared with a simpler signature recognition system, the two-channel and two-stream transformer (2C2S) framework [28] requires several additional processing layers, which increases the processing power needed for accurate classification. Large amounts of data must be processed to create an accurate identification system, and signature data must be kept for a long time to ensure accuracy, leading to high maintenance and storage costs.

7. Conclusion and Future Work

The RTSBA is proposed for digital writing pad-based handwritten alphabet detection, classification, and word recognition. Handwritten character changes in Tamil words due to pen pressure, pen position, variations in writing style, differently sized spaces between characters, unnecessary curves in characters, simple curves, non-simple curves, straight lines, open curves, and closed curves are detected by the proposed RTSBA method. Compared with neural network architectures such as LSTM, Inception-v3, RNN, CNN, 2C2S, and MMU-SNet, the proposed RTSBA achieves a text prediction accuracy of about 98.7%. This model can be applied to other Indian languages, such as Malayalam and Telugu, for text recognition from a digital writing pad.

Data Availability

Data will be made available after a reasonable request from the author.

Conflicts of Interest

The authors declare that they have no conflicts of interest.