Abstract

In the area of machine learning, different techniques are used to train machines for tasks such as computer vision, data analysis, natural language processing, and speech recognition. Computer vision is one of the main branches in which machine learning and deep learning techniques are applied. Optical character recognition (OCR) is the ability of a machine to recognize the characters of a language. Pashto, spoken in Afghanistan and Pakistan, is one of the most ancient and historical languages of the world. OCR applications have been developed for various cursive languages such as Urdu, Chinese, and Japanese, but very little work has been done on the recognition of the Pashto language. Handwritten character recognition is more difficult still, because every handwritten character's shape is influenced by the writer's hand motion dynamics. A key reason for the lack of research on Pashto handwritten character data, compared with other languages, is that no benchmark dataset has been available for experimental purposes. This study focuses on the creation of such a dataset and, for evaluation purposes, on training a machine to correctly recognize unseen Pashto handwritten characters. To achieve this objective, a dataset of 43000 images was created. Three Feed Forward Neural Network models trained with the backpropagation algorithm and different Rectified Linear Unit (ReLU) layer configurations (Model 1 with one ReLU layer, Model 2 with two ReLU layers, and Model 3 with three ReLU layers) were trained and tested on this dataset. The simulations show that Model 1 achieved an accuracy of up to 87.6% on unseen data, while Model 2 and Model 3 achieved accuracies of 81.60% and 3%, respectively. Similarly, loss (cross-entropy) was lowest for Model 1, at 0.15 for training and 3.17 for testing, followed by Model 2 with 0.7 and 4.2 for training and testing, while Model 3 was last with loss values of 6.4 and 3.69.
The precision, recall, and f-measure values of Model 1 were better than those of both Model 2 and Model 3. Based on these results, Model 1 (with one ReLU activation layer) is found to be the most efficient of the three models in terms of accuracy for recognizing Pashto handwritten characters.

1. Introduction

A human can read and understand text documents after learning the basics of reading and the semantics of a specific language/script. Optical character recognition (OCR) systems were developed to do the same by converting images of characters, whether handwritten or printed text, into machine-coded text [1]. The main idea of OCR was first developed in 1933 and was patented in Germany and later in the USA [1]. OCR is a major enabler of the digitization of text documents. The advantages of digitization include reduced storage, the ability to search within content, and compatibility with other digital applications. OCR lies in the domain of Document Image Analysis (DIA), one of the major branches of Pattern Recognition. A DIA system has different phases such as (i) image acquisition, (ii) image classification, (iii) data preprocessing, (iv) text recognition, and (v) postprocessing. In the past decade, there has been significant research in the field of OCR [2].

Various studies have been conducted in different languages, such as English [3], Japanese [4], German [5], Chinese [6], French [7], and Indic scripts [8]. However, cursive scripts such as Persian [9], Arabic [10], Sindhi [11], Urdu [12], and Pashto [13] still require further research. In this modern digital era, the demand for understanding and translating different regional languages has increased. Pashto has a rich literary heritage and is not only spoken but also written by 50 million people across the world [14]. However, not much work has yet been done to recognize and digitize Pashto text. Pashto has its own specificities, both in its linguistic composition and in its character shapes, much like Arabic. These specificities need dedicated research to lay the conceptual foundation for the challenges the Pashto language poses in the field of OCR.

Although a very limited amount of research has been done on Pashto OCR, none addresses handwritten character recognition, and for this reason no Pashto handwritten dataset is available. The authors in [13] produced a "printed sentence dataset" for their study, but no handwritten dataset. Printed characters are comparatively easy to train on and recognize because their shapes do not change much once trained, whereas handwritten data varies constantly across handwritings; this variation is easy for humans to handle but very difficult for a machine when multiple shapes that look different represent the same character. One of the challenges in working with the Pashto language is the lack of a sample dataset [13]. Thus, a proper dataset is required to train any machine for a language, which is one of the motivations behind this research; such a dataset is necessary as a benchmark for research on the Pashto language in the area of optical character recognition (OCR). Accordingly, this study proposes a deep learning (DL) based neural network (NN) trained with backpropagation (BP) and the ReLU activation function for Pashto handwritten character recognition.

The main contributions of this paper are the following:
(1) To develop a benchmark Pashto handwritten character dataset
(2) To build a deep neural network (DNN) trained with the backpropagation algorithm and the ReLU activation function for the classification of Pashto handwritten characters
(3) To check the performance of the proposed DNN variants that use the ReLU function at different numbers of layers

The remainder of the paper is organized as follows. Section 2 highlights related work. Section 3 explains the proposed methodology used in this study. Section 4 presents the simulation results. Finally, Section 5 concludes the paper.

2. Related Work

In this section, we review the work done and the techniques applied by the research community in the field of OCR for different cursive languages, in both printed and handwritten form. Handwritten recognition in cursive languages such as Pashto, Urdu [15], Arabic, Latin, Chinese, and Persian remains a major challenge [10]. As far as Pashto is concerned, research on handwritten Pashto character recognition is quite limited; only two reputable works have been done in this area [13].

An OCR system based on BBN Byblos (Bolt, Beranek, and Newman technologies) was implemented using Hidden Markov Models (HMMs) and trained to recognize printed Pashto documents. The simulation results showed a character error rate of 1.6% on synthetic images, 2.1% on scanned pages, and 3.1% on faxed pages [16].

Similarly, Scale Invariant Feature Transform (SIFT) and Principal Component Analysis were used to uncover the challenges faced when working with cursive scripts such as Arabic, Urdu, and Pashto. Newer approaches such as long short-term memory and recurrent networks proved effective, achieving 89% to 94% accuracy in recognizing text [17].

Furthermore, another study was conducted on real-world Pashto/Urdu text, which is not always laid out in a straight line and can be rotated in any direction. Different machine learning techniques were used to detect rotated text, including Scale Invariant Feature Transform (SIFT), Long Short-Term Memory models (LSTMs), and Hidden Markov Models (HMMs). In this study, LSTM achieved 98.9% accuracy, while HMM-based methods achieved 89.9% and SIFT 94.3% [18].

A similar study applied recurrent neural networks to cursive and noncursive scripts by employing bidirectional long short-term memory (BLSTM) networks, a variant of the recurrent neural network, together with a special output layer called connectionist temporal classification (CTC). The results showed that BLSTM achieved 98.75% accuracy in recognizing the scripts [19].

Another study contributed a developmental dataset of 1000 unique Pashto ligatures [20], which was extended in 2018 for the creation of OCR systems [13]. Furthermore, another study proposed an Urdu OCR system aimed at ligature-level classification of Urdu text. The proposed algorithm tried to overcome the character-level segmentation problems associated with cursive scripts by using four machine learning techniques (decision tree, linear discriminant analysis, naïve Bayes, and k-nearest neighbor). The accuracies of the decision tree, discriminant analysis, naïve Bayes, and k-NN were claimed to be 62%, 61%, 73%, and 100%, respectively [21].

Similarly, one study aimed to create an optical character recognition system covering Urdu, Pashto, and Arabic, so that properties shared by one language could be exploited in another, testing on printed text. The investigation observed the effects on recognition accuracy when different languages were combined, using publicly available synthetic datasets for the Arabic and Pashto languages. The study also provided statistical analysis as clues for transfer learning in OCR systems for Arabic, Urdu, and Pashto. After training an MDLSTM on the Arabic KHAT dataset and testing on unseen data, accuracies of 25% on the Pashto KPTI dataset and 38% on the Urdu UPTI dataset were reported [22].

An optical character recognition system for printed/scanned continuous Pashto text was developed using Feed Forward Neural Networks (FFNNs) with 315 neurons in the input layer (for 21 × 15 pixel symbols), 2000 neurons in the hidden layer, and a 6-node output layer, recognizing joinable printed Pashto characters at different positions in text with 78% accuracy [23].

An end-to-end OCR system for printed cursive Pashto script was produced at the University of Kaiserslautern, Germany, in 2018 and used to recognize printed Pashto text from different Pashto books using various deep learning techniques [13]. The MDLSTM model achieved a 9.22% character error rate, while the BLSTM model achieved 16.16%.

Another study used the zoning technique with k-NN and then an ANN on a custom dataset of 4488 images (102 images of each character across 44 classes), achieving accuracies of 70.07% and 72% with k-NN and ANN, respectively [24].

The literature review shows progress in OCR for various languages such as English, Latin, Persian, Arabic, Urdu, and Pashto, but most of the work targets printed scripts, and very little has been done on handwritten text because of the complexity that arises when samples differ across handwritings. Printed text, by contrast, has fewer changes in dynamics and is less complex. No dataset is available, nor is proper work found, for Pashto handwritten text [13, 24]; therefore, this research focuses on Pashto handwritten character recognition using deep neural network techniques.

3. Materials and Methods

This section discusses the proposed research process; the overall methodology is shown in Figure 1. The most important step of this research is data collection. The next section elaborates on how the data was collected in this study.

3.1. Data Collection

The first objective of this research was to prepare a Pashto handwritten dataset, which was not available before. The data collection process is shown in the flowchart in Figure 2.

3.2. Crowd Sourcing

In the absence of a proper Pashto handwritten dataset, a new dataset was created. For this purpose, 350 university students (aged 19–24), all native Pashto speakers who had studied Pashto in primary school, were provided with A4-sized paper and instructed to write the forty-three Pashto characters from the chart (skipping the Pashto numerals) shown in Figure 3; Figure 4 shows an original sample image taken from one of the students. The handwriting samples on the A4 sheets were scanned one by one using a high-speed scanner at 200 dpi. Each form was then split into a grid and separated into sub-images of the 43 Pashto characters using "GIMP 2," a free GNU Image Manipulation Program. Splitting the 350 scanned forms into 43 smaller images each yielded a total of 15050 images. Figure 5 illustrates how the images relate to the neural network.

3.3. Literature Study

The literature review made clear that most past work was based on printed text, very little on handwritten text, and, to the best of our knowledge, none on Pashto handwritten data; consequently, no proper Pashto handwritten dataset was available. Therefore, we developed our own dataset for Pashto handwritten character recognition. The Pashto character set is given in Figures 3 and 4.

3.4. Data Preprocessing

After the collection of data, the next step was to preprocess it. Data preprocessing is a technique used to filter the data, remove noise, and select the best samples. MATLAB and Python with the OpenCV and Pillow libraries were used for data preprocessing. The data points (images) were preprocessed in the following ways.

3.4.1. Image Selection

Among the 15050 images, many were not suitable for the machine learning algorithms and were discarded. Figure 6 shows some images discarded due to their ambiguous structure. Using such images could make the learning procedure unstable and make it harder for the machine to recognize unseen data after training.

3.4.2. Dimension Reduction

Each of the sample images had a variable size. The images must retain enough size not to lose their details, yet all must be of one fixed size. As the image size increases, the input vector for the neural network grows, resulting in a more complex architecture and processing overhead, which dimensionality reduction minimizes. In this study, the MNIST dataset standard was followed: each image has 28 × 28 rows and columns, making a standard 784-pixel input.
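
As an illustration, this fixed-size step can be sketched with Pillow (the study used MATLAB and Python tooling; the specific resampling filter here is our assumption, not stated in the text):

```python
from PIL import Image

def to_mnist_size(img: Image.Image, size=(28, 28)) -> Image.Image:
    # Resize a character image to the fixed 28 x 28 MNIST-style grid,
    # which flattens to a 784-value input vector for the network.
    return img.resize(size, Image.LANCZOS)

# Stand-in for a scanned character image of arbitrary size.
sample = Image.new("L", (120, 97), color=255)
resized = to_mnist_size(sample)
print(resized.size)  # (28, 28)
```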

3.4.3. Channel Reduction

The scanned images all contained three RGB channels. The channels were reduced to a single channel using a MATLAB function. Each image was originally composed of three channels that combine to create a color image; a 28 × 28 pixel RGB image is actually a 28 × 28 × 3 array, as shown in Figure 7, while the conversion can be seen in Figure 8.
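
The same channel reduction can be sketched in Python with Pillow (the study used a MATLAB function; this is an equivalent, not the authors' exact code):

```python
from PIL import Image
import numpy as np

# Hypothetical 28 x 28 scanned patch with three RGB channels.
rgb = Image.new("RGB", (28, 28), color=(200, 120, 40))

# convert("L") collapses the three channels to one greyscale channel
# using the standard ITU-R 601 luma weights.
gray = rgb.convert("L")

print(np.array(rgb).shape, np.array(gray).shape)  # (28, 28, 3) (28, 28)
```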

3.4.4. Binary Conversion

The greyscale images were converted to binary images. A greyscale image has shades between white and black, including shades of grey, while a binary image has only black and white pixels. Converting an image to binary causes all pixels above a certain threshold value to become white and the rest black. The MATLAB imbinarize() method was used to convert the images from greyscale to binary. Figure 9 shows each conversion stage. A threshold value of 120 was chosen after trying values of 90, 100, 120, and 140, as 120 gave the best results.
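
A minimal fixed-threshold binarization equivalent in Python, assuming 8-bit pixel values and the common convention that above-threshold pixels map to white (1):

```python
import numpy as np

def binarize(gray: np.ndarray, threshold: int = 120) -> np.ndarray:
    # Pixels at or above the threshold become 1, the rest 0,
    # mirroring the fixed-threshold binarization described above.
    return (gray >= threshold).astype(np.uint8)

gray = np.array([[0, 119], [120, 255]], dtype=np.uint8)
print(binarize(gray))  # [[0 0]
                       #  [1 1]]
```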

3.4.5. Integer to Double

The image matrix values, which lie between 0 and 255, were converted to double-precision values, as neural networks perform best when the input values are between 0 and 1. The actual image is a matrix of numbers, as shown in Figure 10.

This was achieved using the MATLAB function im2double, which converts each image to double precision with values scaled to the range [0, 1].
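
In Python, the im2double step amounts to a cast and a division by 255 (a NumPy equivalent of the MATLAB function, not the authors' code):

```python
import numpy as np

img_u8 = np.array([[0, 128, 255]], dtype=np.uint8)   # 8-bit pixel values
img_d = img_u8.astype(np.float64) / 255.0            # im2double equivalent
print(img_d)  # values now lie in [0, 1]
```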

3.4.6. Normalization

The images were normalized using the min-max normalization method. After obtaining the minimum and maximum pixel values of each image, the image is looped over pixel by pixel and each pixel x is normalized using the formula

x_norm = (x − Min) / (Max − Min),

where Min is the minimum pixel value and Max is the maximum pixel value. The pixel positions are not disturbed, but their values are rescaled to the range 0–1.
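
The min-max formula above can be sketched as follows (vectorized over the whole image rather than the pixel-by-pixel loop the text describes):

```python
import numpy as np

def min_max_normalize(img: np.ndarray) -> np.ndarray:
    # x_norm = (x - Min) / (Max - Min), per image, so the smallest
    # pixel maps to 0 and the largest to 1.
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo)

img = np.array([[30.0, 120.0], [210.0, 255.0]])
norm = min_max_normalize(img)
print(norm.min(), norm.max())  # 0.0 1.0
```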

3.4.7. Image Augmentation

A handwritten character can be written at any angle, unlike computer-typed characters, which have standard shape, size, and orientation. Another reason for applying data augmentation is to obtain new image samples with more variation, making the dataset larger, which prevents overtraining [25]. The OpenCV and Pillow libraries were used for data augmentation; by rotating randomly selected images by up to 10° in random directions, the final dataset reached 43000 handwritten character images, with 1000 images in each class.
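
A rotation-based augmentation step of this kind could look like the following Pillow sketch (the fill color and resampling filter are our assumptions):

```python
import random
from PIL import Image

def augment(img: Image.Image, max_angle: float = 10.0) -> Image.Image:
    # Rotate by a random angle of up to 10 degrees in either direction,
    # filling the exposed corners with white so the background stays clean.
    angle = random.uniform(-max_angle, max_angle)
    return img.rotate(angle, resample=Image.BILINEAR, fillcolor=255)

src = Image.new("L", (28, 28), color=255)  # stand-in character image
aug = augment(src)
print(aug.size)  # (28, 28) -- rotation without expand keeps the size
```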

3.4.8. Vector Conversion

Each image was converted into a single row of a matrix, and all the rows were combined to represent the complete dataset as a two-dimensional matrix. Each picture is converted to a row and input to the neural network as shown in Figure 11.
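
Flattening each 28 × 28 image into a 784-value row reduces to a single reshape:

```python
import numpy as np

# Five hypothetical 28 x 28 images stacked along the first axis.
images = np.zeros((5, 28, 28), dtype=np.float64)

# Each image becomes one row of 784 values; the stack becomes
# the two-dimensional dataset matrix described in the text.
X = images.reshape(len(images), -1)
print(X.shape)  # (5, 784)
```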

3.4.9. Vector Creation

After converting the images to rows of a dataset X, another matrix Y of size 43000 × 43 was created, populated with labels representing the class of each row.
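
Assuming the labels are one-hot encoded (one column per class, which matches the 43000 × 43 shape), the Y matrix can be built like this:

```python
import numpy as np

num_classes = 43
labels = np.array([0, 7, 42])        # hypothetical class indices for 3 images

# Indexing the identity matrix gives one row per image with a single 1
# in the column of that image's class.
Y = np.eye(num_classes)[labels]
print(Y.shape, Y[1].argmax())  # (3, 43) 7
```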

3.4.10. Selecting Variable(s)

A dataset of 43000 images was thus produced as a two-dimensional matrix X of size 43000 × 784, along with Y of size 43000 × 43 holding the labels of X. This completed research objective 1, i.e., to produce a Pashto handwritten dataset, which did not exist before this study.

3.4.11. Data Partitioning

The dataset as a whole cannot be used for training. The data must be split into training (seen) and testing (unseen) sets; the model is trained on the training data and its predictions are then evaluated on the test data. In this study, 90% of the rows, chosen at random, were taken as training data, while the remaining 10% were kept as testing data.
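
A random 90/10 split of this kind can be sketched as follows (the random seed and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((43000, 784))                     # stand-in dataset
Y = np.eye(43)[rng.integers(0, 43, size=43000)]  # stand-in one-hot labels

idx = rng.permutation(len(X))    # shuffle row indices so the split is random
cut = int(0.9 * len(X))          # 90% training, 10% testing
X_train, X_test = X[idx[:cut]], X[idx[cut:]]
Y_train, Y_test = Y[idx[:cut]], Y[idx[cut:]]
print(X_train.shape, X_test.shape)  # (38700, 784) (4300, 784)
```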

3.4.12. IDX File Format

Finally, the training and testing data were converted into the IDX file format, which stores the information as numbers and is the standard format used for the MNIST data.
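
A minimal IDX writer, following the header layout documented for the MNIST files (two zero bytes, a type code, the dimension count, then each dimension as a big-endian 32-bit integer, then the raw data); this is a sketch, not the authors' conversion code:

```python
import struct
import numpy as np

def write_idx(path: str, array: np.ndarray) -> None:
    # 0x08 is the IDX type code for unsigned bytes, as used by MNIST.
    data = array.astype(np.uint8)
    with open(path, "wb") as f:
        f.write(struct.pack(">BBBB", 0, 0, 0x08, data.ndim))
        for dim in data.shape:
            f.write(struct.pack(">I", dim))   # big-endian uint32 per dimension
        f.write(data.tobytes())

# Write 10 blank 28 x 28 images as a 3-dimensional IDX file.
write_idx("train-images.idx", np.zeros((10, 28, 28), dtype=np.uint8))
```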

3.5. DNN Architecture Design for Experiments

A four-layer model is proposed for this study; details can be seen in Table 1. The architectures of the proposed DNN models are given in Figures 12–14.

Each model was fed 784 inputs and random weights at the start of training. The models were created in TensorFlow, a deep learning API in Python. The objective of the research was to create models using backpropagation and the ReLU activation function. Finally, to check the performance of the proposed models, they were trained on the training dataset and then evaluated on the testing data. The accuracy was recorded after every 1000 epochs for both training and testing data.
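
To make the Model 1 architecture concrete, here is a NumPy sketch of its forward pass: 784 inputs, one ReLU hidden layer, and a 43-way softmax output. The hidden-layer width of 512 is our assumption for illustration; the paper's exact sizes are in Table 1, and training itself was done in TensorFlow.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    # ReLU activation applied to the hidden layer.
    return np.maximum(0.0, z)

def softmax(z):
    # Softmax output for 43-class prediction; subtracting the row max
    # keeps the exponentials numerically stable.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Random initial weights, as in the paper's training setup.
W1, b1 = rng.normal(0, 0.01, (784, 512)), np.zeros(512)
W2, b2 = rng.normal(0, 0.01, (512, 43)), np.zeros(43)

x = rng.random((4, 784))  # a mini-batch of 4 flattened images
probs = softmax(relu(x @ W1 + b1) @ W2 + b2)
print(probs.shape)  # (4, 43); each row sums to 1
```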

3.6. Performance Parameter(s) for Experiments

Different performance parameters were used to evaluate the proposed models in this research. The accuracy of the proposed models was checked using the accuracy parameter, given in equation (2); accuracy refers to the closeness of a predicted value to the known or actual value:

Accuracy = (TP + TN) / (TP + TN + FP + FN),    (2)

where TP is True Positive, FP is False Positive, TN is True Negative, and FN is False Negative.
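
Since the paper also reports precision, recall, and f-measure from the same confusion-matrix counts, all four metrics can be computed together (standard definitions; the counts below are made up for illustration):

```python
def metrics(tp: int, fp: int, tn: int, fn: int):
    # Accuracy = (TP + TN) / (TP + TN + FP + FN), per equation (2);
    # precision, recall, and f-measure use the usual definitions.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

acc, prec, rec, f1 = metrics(80, 10, 90, 20)
print(acc)  # 0.85
```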

4. Results and Discussion

This section further tests and validates the performance and accuracy of the proposed algorithm. The performance was analyzed by implementing the ReLU function in different layers of our models using the same dataset, and the models were evaluated using accuracy on the testing and training data. This study considers several performance parameters for the proposed models: accuracy, cross-entropy, precision, recall, and f-measure. The benchmark dataset was our custom-made dataset, divided into 90% training data and 10% testing data, to check the accuracy of the proposed models on unseen data. The workstation used for the experiments had the configuration given in Table 2, with TensorFlow used for training and testing the Pashto character classification/recognition. The algorithms were investigated based on a five-layer deep ANN model, with one input layer, three hidden layers, and one output layer.

The learning rate in all experiments was held constant at 0.0001, which was found to perform best during the pre-experimental phase. All algorithms were trained and tested starting from random weights and biases. Each experiment ran for 10000 epochs, with progress tracked after every 1000 epochs. The final layer in each model used the Softmax activation function for multiclass classification, as given in Table 3.

Different experiments were performed on the three models. The simulation results after every 1000 epochs were recorded for training and testing and are shown in Table 4 and Figures 15–17, respectively.

From Table 4, it can be seen that over 10000 epochs the accuracy of Model 1 on the training data increased from 2% at the start of training to 99.3%. The accuracy on the testing data likewise started at 2% and reached 87% by the end of training, as seen in Figure 15.

Similarly, the training accuracy of Model 2 started at 25% and increased to 94.3%, while its testing accuracy gradually improved from 24% to 81.6%, as shown in Figure 16. The accuracy of Model 3, in contrast, started at 2% on the training data and reached only 3% on the testing data, as shown in Figure 17; its improvement was almost negligible compared with the training improvements of Model 1 and Model 2.

The cross-entropy comparison for the models is given in Table 4. For the training dataset, the cross-entropy of Model 1 starts at 628.3774 in the first epoch and decreases to 0.151699 by epoch 10000; for the testing dataset, it starts at 257.8005 and falls to 3.7494256. Similarly, the training cross-entropy of Model 2 starts at 257.8005 and goes down to 0.7376458, while its testing cross-entropy starts at 394.31918 and goes down to 4.209053. The training cross-entropy of Model 3 starts at 620.0325 and reaches 6.4359164, while its testing cross-entropy starts at 258.61313 and decreases to 3.6924555, leaving Model 3 behind Model 1 and Model 2. Overall, Model 1 gives lower cross-entropy than Model 2 and Model 3 on both training and testing datasets.

The precision comparison for the models is given in Table 4. For the training dataset, the precision of Model 1 starts at 0.499291122 in the first epoch and increases to 0.506046524 by epoch 10000; for the testing dataset, it starts at 0.50019 and reaches 0.505583. The training precision of Model 2 starts at 0.499862418 and rises to 0.505972948, while on testing data it starts at 0.500557414 and ends at 0.500664364. For Model 3, training precision starts at 0.500080295 and reaches 0.500664364, while testing precision starts at 0.50066353 and ends at 0.501312649. Overall, Model 1 gives higher precision than Model 2 and Model 3 on both training and testing datasets.

For the training dataset, the recall of Model 1 starts at 0.491390 in the first epoch and increases to 0.5060465; for the testing dataset, it starts at 0.4935145 and increases to 0.50594737 by epoch 10000. The training recall of Model 2 starts at 0.50381340 and ends at 0.50594584, while on the testing dataset it starts at 0.50204994 and rises to 0.50486486. For Model 3, training recall starts at 0.49739111 and reaches 0.49931754, while testing recall starts at 0.49658968 and reaches 0.49934625. Overall, Model 1 gives higher recall than Model 2 and Model 3 on both training and testing datasets.

Finally, the f-measure of Model 1 on the training data starts at 0.49530948 in the first epoch and increases to 0.50604652; on the testing data, it starts at 0.496829682 and increases to 0.505765073 by epoch 10000. The training f-measure of Model 2 starts at 0.501830136 and rises to 0.505959396, while on testing data it starts at 0.501302568 and rises to 0.50507721. For Model 3, the training f-measure starts at 0.498618287 and reaches 0.500327517 by epoch 10000, while on testing data it starts at 0.498732082 and reaches 0.499990048. Overall, Model 1 gives a higher f-measure than Model 2 and Model 3 on both training and testing datasets.

The above discussion makes clear that the more ReLU activation layers were used, the lower the accuracy achieved. The experiments show that placing the ReLU layer only on the first hidden layer gives the best accuracy; two ReLU layers gave the second-best results, while the model with three ReLU layers performed very poorly. This shows that, in this setting, increasing the number of ReLU activation layers reduced the overall accuracy of the model. Table 4 shows the overall average accuracy comparison for training and testing.

Looking at the loss on the training and testing data of the three models, given in Figures 18 and 19, respectively, Model 1 ranks first, followed by Model 2 and then Model 3 in last place. Similarly, the precision curves in Figures 20 and 21 show that Model 1 outperforms Model 2 and Model 3 on both training and testing data.

5. Conclusions

This study proposed several variants of a DNN trained with the BP algorithm for Pashto handwritten character recognition. In the proposed architecture of one input layer, three hidden layers, and one output layer, the ReLU activation was applied to Hidden Layer 1 in Model 1; in Model 2, it was applied to Hidden Layers 1 and 2; and in Model 3, it was applied to Hidden Layers 1, 2, and 3. Accuracy was analyzed to identify the best model for the Pashto handwritten character classification/recognition problem. A dataset was created as per objective 1 of this research, as no Pashto handwritten character dataset was publicly available. Models were designed, trained, and tested to achieve objective 2, and the accuracy-based evaluation to find the best model satisfied objective 3. The three DNN models were evaluated with different configurations of activation functions in the hidden layers. The model with only one ReLU layer (Model 1) achieved the best testing accuracy, 87.67%; Model 2 reached 81.60%, while Model 3 reached only 3.02%, so accuracy dropped as ReLU layers were added. Similarly, loss (cross-entropy) was lowest for Model 1, at 0.15 for training and 3.17 for testing, followed by Model 2 with 0.7 and 4.2, while Model 3 was last with loss values of 6.4 and 3.69. The precision, recall, and f-measure values of Model 1 were also better than those of Model 2 and Model 3. The research showed that a single ReLU activation layer can give good results in this configuration, and that increasing the number of ReLU layers lowered the accuracy and raised the error rate.
Both Model 1 and Model 2 achieved higher results than a similar study on a different network [24], which reported 72% and 70% accuracy. In the future, convolutional neural networks can be used to classify the dataset created in this study, which may improve the results.

Data Availability

The Pashto_handwritten_dataset used to support the findings of this study has been deposited in a GitHub repository and is publicly available at https://github.com/imrandin1976/pashto_handwritten_dataset.

Conflicts of Interest

The authors declare that they have no conflicts of interest to report regarding the present study.

Acknowledgments

The authors would like to thank the University of Agriculture, Peshawar, Pakistan, and Universiti Sains Malaysia for this research work. The research work was partially sponsored and supported by Universiti Sains Malaysia under the Research University Grant (1001.PELECT.8014057).