Abstract

Handwritten text recognition is considered one of the most challenging tasks for the research community because characters in handwritten documents often differ only slightly in shape. The lack of a standard dataset makes the problem harder still for researchers. To address these problems, this paper presents an optical character recognition system for offline Pashto characters. The lack of a standard handwritten Pashto character database is addressed by developing a medium-sized database of offline Pashto characters. This database consists of 11352 character images (258 samples for each of the 44 characters in the Pashto script). Two feature extraction techniques, histogram of oriented gradients and zoning-based density features, are applied to the extracted Pashto characters. K-nearest neighbors is used as the classifier on the resulting feature sets. Using 10-fold cross validation, an accuracy of 80.34% is achieved with histogram of oriented gradients features and 76.42% with zoning-based density features.

1. Introduction

In this digital age of ever-growing computer technology, machine learning algorithms play a key role in many fields, especially text recognition [1], network security [2, 3], privacy [4], traffic flow prediction [5], object detection [6], and many others. One of the major applications of machine learning is the development of Optical Character Recognition (OCR) systems. An OCR system reads the text from an image and converts it into a computer-readable form. Several research works have addressed the automatic recognition of languages such as Arabic, English, Persian, Chinese, and Urdu [7, 8]. The main problems associated with these languages are cursive writing styles, writers’ handwriting habits, and secondary components (diacritics). The Pashto language has incorporated most of the Arabic, Urdu, and Persian letters with some minor modifications and, as a result, Pashto script is cursive in nature. Pashto also has a larger character set (44 characters) than Urdu (38 characters), Arabic (28 characters), or Persian (32 characters). This large character set and the minor differences between character shapes make the recognition process more complex for Pashto script.

Pashto is the native language of a large community in the northern areas of Pakistan and an official language of Afghanistan. Ahmad et al. [9] used k-nearest neighbors (k-NN) as a classification tool for printed Pashto character recognition with high-level feature extraction techniques. Boulid et al. [7] suggested the use of a neural network with spatial distribution of pixels (SDPs) and local binary patterns (LBPs) for the recognition of handwritten Arabic characters. Boufenar et al. [10, 11] presented an artificial immune recognition (AIR) system based on both statistical and structural features for handwritten Arabic letter recognition. Askari et al. [12] introduced the derivative projection profile (DPP) as a feature extraction technique and a neural network as a classification tool for isolated Arabic character recognition. El-Sawy presented an Arabic OCR system using convolutional neural networks [13].

Boufenar and Batouche proposed a deep convolutional neural network (DCNN) for Arabic letter recognition [14]. The performance of the suggested system depends on hyperparameter tuning and the size of the dataset in use. Naz et al. [8] suggested an approach combining a convolutional neural network and a recursive neural network for Urdu Nasta’liq text recognition. They tested the system on the well-known Urdu printed text-line image (UPTI) dataset. Sarvaramini et al. [15] suggested the use of a convolutional neural network (CNN) for offline Persian character recognition.

This paper presents an OCR system for offline Pashto characters. A medium-sized database of handwritten Pashto characters is developed for the proposed research work. Two feature extraction techniques, Histogram of Oriented Gradients (HOG) and zoning-based density features, are used for feature set generation. A k-nearest neighbors (k-NN) classifier is applied to the resulting feature sets, and 10-fold cross validation is used to evaluate system performance. The objectives (significance) of the proposed work are listed below:
(i) To develop a medium-sized database of 11352 handwritten Pashto character images (258 samples for each character)
(ii) To provide a benchmark for the identification of handwritten Pashto characters using histogram of oriented gradients and zoning features with the k-NN classifier
(iii) To evaluate system performance and provide results that help open the way for further work on handwritten Pashto character recognition

This paper is organized in sections as follows: Section 2 gives a brief detail of the related research work. Section 3 explains the proposed methodology for image extraction, feature set generation, and classification. The results are discussed in Section 4 followed by conclusions in Section 5.

2. Literature Review

Recent research shows prominent improvements in OCR for many languages, especially those with cursive scripts such as Arabic, Urdu, and Persian. The recognition rate for cursive scripts is low compared to noncursive scripts such as English because of the ambiguity in writing styles. Recent work on OCR for cursive scripts has nevertheless achieved significant results, which are discussed below.

Khan et al. [16] provided a baseline study for handwritten Pashto character recognition using zoning features and three different classifier models. The proposed model showed an accuracy of 56% for a support vector machine, 78% for an artificial neural network, and 80.7% for a convolutional neural network. A dataset of 4488 characters was used in their experiments. Bhuiyan and Alsaade [17] suggested a hybrid neural network model for Arabic character recognition that combines a bidirectional associative memory (BAM) and a multilayer perceptron (MLP). Tavoli et al. [18] proposed a new feature extractor for the recognition of Arabic and Persian words, namely, the statistical geometric components of straight lines (SGCSLs) technique. Oujaoura et al. [19] suggested a method for offline Arabic letter identification using three feature extraction techniques, including Zernike moments, in conjunction with neural networks. Zernike moments surpassed the other two techniques in recognition rate.

Boufenar et al. [10] have conducted a study for handwritten Arabic character recognition on the famous Offline Isolated Handwritten Arabic Character (OIHACDB) and Arabic Handwritten Character Database (AHCD) datasets using Deep Convolutional Neural Networks (DCNN) and showed state-of-the-art accuracy using this method. Younis presented a DCNN for handwritten Arabic character recognition [20]. He also performed batch normalization to prevent overfitting. The model was tested on AIA9K and AHCD datasets.

Jebril et al. [21] used histogram of oriented gradients (HOG) features with a support vector machine classifier on a self-made database. Althobaiti and Lu suggested a novel approach to feature extraction using an encoded Freeman chain code and change of tangent for isolated handwritten Arabic character recognition [22, 23]. Jehangir et al. [24] proposed Zernike moments for feature extraction and linear discriminant analysis for the automatic recognition of handwritten Pashto text.

Naz et al. [25] presented the use of 2-dimensional long short term memory (2DMLSTM) networks for Urdu script recognition based on zoning features. The referenced model is tested on the Urdu Printed Text line Images (UPTI) dataset. Ahmed et al. [26] presented an algorithm for Urdu character recognition using bidirectional long short-term memory (BLSTM) on the Urdu nasta’liq handwritten dataset (UNHD). Jameel and Kumar proposed basis spline (B-spline) curves for Urdu character recognition [27]. Nawaz et al. [28] compared siamese and triplet networks and showed performance improvement when combined with a CNN for handwritten Urdu character recognition.

This work addresses offline Pashto character recognition using k-nearest neighbors (k-NN) as the classification tool, with histogram of oriented gradients (HOG) and zoning-based density features as the feature extraction techniques. It contributes a character database of 11352 character images (258 samples for each of the 44 characters in the Pashto language). The performance of the proposed OCR system is tested using 10-fold cross validation.

3. Proposed Methodology

The proposed handwritten Pashto character recognition system consists of four main phases, as depicted in Figure 1: the data collection and accumulation phase, the data processing and character database development phase, the feature extraction and feature map development phase, and, finally, the recognition and identification phase. The data collection and accumulation phase is completed by collecting handwritten Pashto samples from different people, while the preprocessing steps include scanning and correction. Preprocessing prepares the data for feature extraction, since well-formed character images yield more accurate feature values and ultimately higher recognition rates. For feature extraction, we use the HOG and zoning techniques, which capture discriminative numerical descriptions of the characters. The classification and recognition phase is completed using a k-NN classifier on the feature maps produced by the HOG and zoning techniques.

3.1. Data Collection

Since there is no standard database available for handwritten Pashto characters, a medium-sized database is developed by collecting data from different people. Most of the samples are collected from the Department of Pashto, University of Swabi, KP (Khyber Pakhtunkhwa), Pakistan, from multiple students and teachers varying in age, gender, and educational backgrounds.

Individual characters are extracted from the scanned images to build the database. The resulting database contains 258 samples for each of the 44 characters, for a total of 11,352 character images (258 × 44 = 11352). The extracted character images have nonuniform dimensions and off-center characters (appearing at the top, bottom, right, or left of the image), as shown in Figure 2. These sliced images were preprocessed to form normalized and centralized character images.

3.2. Preprocessing

Preprocessing is a preliminary step that is necessary to achieve good classification accuracy. Here, it is used to form normalized and centralized character images. We applied the following preprocessing steps.

3.2.1. Size Normalization

To achieve the best classification results, the sliced images must be normalized and centralized. Normalization scales each image to a fixed size. All images here are normalized to 64 × 64 pixels and converted to grayscale. Figure 3 shows the size normalization to the dimensions of 64 × 64.
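As a rough illustration of this step, the sketch below converts an RGB image to grayscale and rescales it to 64 × 64 pixels. The paper does not state which interpolation method or grayscale weighting was used, so nearest-neighbor sampling and the ITU-R BT.601 luma weights are assumptions, and the function names `to_grayscale` and `resize_nearest` are hypothetical.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples to grayscale integers.

    Assumption: ITU-R BT.601 luma weights; the paper does not name a formula.
    """
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]


def resize_nearest(image, out_h=64, out_w=64):
    """Rescale a 2D list of pixel values to out_h x out_w.

    Assumption: nearest-neighbor sampling, chosen only for simplicity.
    """
    in_h, in_w = len(image), len(image[0])
    return [
        [image[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


# Usage: a tiny 2 x 2 RGB checkerboard scaled up to 64 x 64.
tiny = [[(255, 255, 255), (0, 0, 0)], [(0, 0, 0), (255, 255, 255)]]
normalized = resize_nearest(to_grayscale(tiny))
```

Nearest-neighbor sampling keeps the sketch dependency-free; a smoother interpolation would likely preserve stroke shapes better when characters are scaled down.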

3.2.2. Centralization

Some of the images contained characters at different positions (top, bottom, right, and left). To place every character at the central point, so that accurate features can be calculated for each handwritten Pashto character, the centroids of the character and of the image are first calculated separately. In our case, each normalized image is 64 × 64 pixels (Section 3.2.1), so the central point is the center of this 64 × 64 grid. The character centroid is then shifted to the centroid of the image to produce a centralized image. Figure 4 shows the centralization of the characters “alif” and “twe.”
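The centroid shift can be sketched as follows. This is a minimal illustration, assuming that character pixels are brighter than the background (values above a threshold) and that the shift is rounded to whole pixels; neither detail is specified in the paper, and the names `centroid` and `centralize` are hypothetical.

```python
def centroid(image, threshold=0):
    """Return the mean (row, col) of foreground pixels, or None if there are none.

    Assumption: foreground pixels are those with value > threshold.
    """
    count = row_sum = col_sum = 0
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            if value > threshold:
                count += 1
                row_sum += r
                col_sum += c
    if count == 0:
        return None
    return row_sum / count, col_sum / count


def centralize(image, background=0, threshold=0):
    """Shift the character so that its centroid lands on the image center."""
    h, w = len(image), len(image[0])
    cen = centroid(image, threshold)
    if cen is None:
        return [row[:] for row in image]
    # Whole-pixel offset from the character centroid to the image center.
    dr = round((h - 1) / 2 - cen[0])
    dc = round((w - 1) / 2 - cen[1])
    out = [[background] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w:  # pixels shifted outside are dropped
                out[nr][nc] = image[r][c]
    return out
```

For example, a single bright pixel in the top-left corner of a 5 × 5 image is moved to position (2, 2) by `centralize`.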

3.3. Feature Extraction

Feature extraction is a pivotal stage of an OCR system. Features describe the image in terms of numerical values. There are two types: statistical features, which are calculated via mathematical computations, and structural features, which are derived from the structure of the image. A good feature extractor should discriminate between different characters while producing similar values for similar character images. We applied two techniques, namely, histogram of oriented gradients (HOG) and zoning-based density features, to our database and compared their results.

3.3.1. Zoning Features

Zoning-based features are efficient at capturing image patterns, and the technique is frequently used in text recognition problems. It divides the image into 8 × 8 zones and calculates the pixel density in each zone; these densities form a feature vector of 64 features. Figure 5 shows the character “bhe” divided into 64 zones.
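A minimal sketch of zoning-based density features for a 64 × 64 image is given below. It assumes that density means the fraction of foreground pixels in a zone and that foreground pixels are those above a threshold; the paper does not define either precisely, and the function name `zoning_density_features` is hypothetical.

```python
def zoning_density_features(image, zones_per_side=8, threshold=0):
    """Split the image into zones_per_side x zones_per_side zones and return
    the fraction of foreground pixels in each zone, in row-major zone order.

    Assumption: a pixel is foreground when its value is > threshold.
    """
    h, w = len(image), len(image[0])
    zone_h, zone_w = h // zones_per_side, w // zones_per_side
    features = []
    for zr in range(zones_per_side):
        for zc in range(zones_per_side):
            foreground = sum(
                1
                for r in range(zr * zone_h, (zr + 1) * zone_h)
                for c in range(zc * zone_w, (zc + 1) * zone_w)
                if image[r][c] > threshold
            )
            features.append(foreground / (zone_h * zone_w))
    return features
```

With a 64 × 64 input and 8 zones per side, each zone covers 8 × 8 pixels and the result has 64 entries, matching the feature vector length stated above.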

3.3.2. Histograms of Oriented Gradients

The histogram of oriented gradients (HOG) was first introduced by Dalal and Triggs [29], primarily for human detection. Nowadays, this technique is used for character recognition [21, 30, 31], pedestrian detection [32], face recognition [33], and many other problems of interest. We generated HOG features using a cell size of 16 × 16 pixels, a block size of 2 × 2 cells, and 9 orientation bins. A HOG visualization of the Pashto character “ye” is shown in Figures 6(a) and 6(b).
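To make these parameters concrete, the following simplified sketch computes a HOG-style descriptor with 16 × 16 pixel cells, 2 × 2 cell blocks, and 9 bins. It is not the authors' implementation: the central-difference gradients, unsigned orientations in [0, 180), hard assignment of each pixel to a single bin, a block stride of one cell, and L2 block normalization are all assumptions made for illustration, and `hog_features` is a hypothetical name.

```python
import math


def hog_features(image, cell=16, block=2, bins=9):
    """Simplified HOG descriptor for a 2D list of grayscale values."""
    h, w = len(image), len(image[0])
    cells_y, cells_x = h // cell, w // cell
    bin_width = 180.0 / bins
    # One orientation histogram per cell, weighted by gradient magnitude.
    hist = [[[0.0] * bins for _ in range(cells_x)] for _ in range(cells_y)]
    for r in range(cells_y * cell):
        for c in range(cells_x * cell):
            # Central differences, clamped at the image borders.
            gx = image[r][min(c + 1, w - 1)] - image[r][max(c - 1, 0)]
            gy = image[min(r + 1, h - 1)][c] - image[max(r - 1, 0)][c]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned
            b = min(int(angle // bin_width), bins - 1)
            hist[r // cell][c // cell][b] += magnitude
    # Slide a block over the cell grid one cell at a time and L2-normalize.
    features = []
    for by in range(cells_y - block + 1):
        for bx in range(cells_x - block + 1):
            vec = [
                v
                for dy in range(block)
                for dx in range(block)
                for v in hist[by + dy][bx + dx]
            ]
            norm = math.sqrt(sum(v * v for v in vec) + 1e-12)
            features.extend(v / norm for v in vec)
    return features
```

Under these assumptions, a 64 × 64 image yields 4 × 4 cells and 3 × 3 overlapping blocks, each contributing 2 × 2 × 9 = 36 values, for a descriptor of 324 values. The paper does not report its descriptor length, so this figure applies only to the sketch.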

3.4. Classification

Classification is the core of any recognition problem: it is the stage that assigns unseen input data to one of the given classes. This paper uses k-nearest neighbor (k-NN) as the classification model for offline Pashto character recognition. k-NN is a supervised learning model based on the nearest-neighbor rule: it measures the distance between the input instance and the training data and assigns the unseen instance the class of its nearest instances in the training data.

There are four distance functions commonly used with k-NN: Euclidean, Manhattan, Chebyshev, and Minkowski. Here, we used the Euclidean distance function and evaluated its performance with respect to different values of k. For two feature vectors x = (x_1, ..., x_n) and y = (y_1, ..., y_n), the Euclidean distance takes the standard form:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
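The nearest-neighbor rule with the Euclidean distance can be sketched in a few lines. This is an illustrative brute-force version with majority voting, not the authors' code; the function names `euclidean` and `knn_predict` are hypothetical, and the paper does not describe how voting ties are resolved for k > 1.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=1):
    """Predict the label of `query` by majority vote among its k nearest
    training samples (brute-force search over the whole training set)."""
    neighbors = sorted(
        zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Usage with toy 2D feature vectors and two classes.
X = [[0, 0], [1, 1], [5, 5], [6, 6]]
y = ["a", "a", "b", "b"]
label = knn_predict(X, y, [0.2, 0.1], k=1)
```

With k = 1, the prediction is simply the label of the single closest training sample, which is the setting reported as best in this paper.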

4. Results

Results are calculated for the proposed system using the zoning-based density feature set and the Histogram of Oriented Gradients (HOG) feature set. For each feature extraction technique, the results are obtained with the k-nearest neighbors (k-NN) classification model using 10-fold cross validation.
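The evaluation protocol can be sketched as below. The paper does not say whether the folds were shuffled or stratified by class, or whether accuracy was pooled over all held-out samples or averaged per fold; this sketch assumes shuffled, non-stratified folds and pooled accuracy. All function names are hypothetical.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle sample indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def nearest_neighbor(train_X, train_y, query):
    """1-NN prediction using squared Euclidean distance."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], query)),
    )
    return train_y[best]


def cross_validate(X, y, predict=nearest_neighbor, n_folds=10, seed=0):
    """Return accuracy pooled over all held-out samples."""
    correct = 0
    for held_out in k_fold_indices(len(X), n_folds, seed):
        held = set(held_out)
        train_X = [X[i] for i in range(len(X)) if i not in held]
        train_y = [y[i] for i in range(len(X)) if i not in held]
        correct += sum(predict(train_X, train_y, X[i]) == y[i] for i in held_out)
    return correct / len(X)
```

Each sample is held out exactly once, so every prediction is made by a classifier that never saw that sample. For 11352 samples, this split gives two folds of 1136 samples and eight folds of 1135.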

Ten-fold cross validation using the 1-NN classifier with Histogram of Oriented Gradients (HOG) features achieved an accuracy of 80.34%, while zoning features achieved a relatively lower accuracy of 76.42%. The results are compared in the graph in Figure 7. Accuracy tends to increase as the amount of training data increases, because the classifier has more reference samples to compare against.

The k-NN parameter k was tuned, and the best score was obtained at k = 1 using the Euclidean distance, as depicted in Figure 8.

Figure 8 shows the k value vs. accuracy.

The accuracy vs. k value table is generated as shown in Table 1. Different accuracies were calculated for different values of k.

Figures 9 and 10 show the plot of training data vs. time vs. accuracy for HOGs and zoning-based density features. From Figures 9 and 10, it is evident that when the training set increases, the overall recognition rate of the classifier increases, but ultimately, the time consumption also increases.

The applicability of the system is also validated using other performance metrics, namely precision, false-positive rate, false-negative rate, true-positive rate, true-negative rate, F1 score, and accuracy, for both the HOG-based and zoning-based feature maps. Experimental results based on these performance metrics are depicted in Figure 11.
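For reference, these metrics follow from the standard confusion-matrix definitions, sketched below. The paper does not state how the metrics were aggregated across the 44 classes, so the one-vs-rest counting shown here is an assumption, and the function names are hypothetical.

```python
def one_vs_rest_counts(y_true, y_pred, label):
    """Count (tp, fp, fn, tn) for one class, treating all others as negative."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == label and truth == label:
            tp += 1
        elif pred == label:
            fp += 1
        elif truth == label:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def binary_metrics(tp, fp, fn, tn):
    """Standard metrics from confusion counts; empty denominators give 0.0."""
    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    tpr = ratio(tp, tp + fn)  # true-positive rate (recall)
    return {
        "precision": precision,
        "true_positive_rate": tpr,
        "false_negative_rate": ratio(fn, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
        "true_negative_rate": ratio(tn, fp + tn),
        "f1": ratio(2 * precision * tpr, precision + tpr),
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }
```

Because each class is rare in a 44-class problem, the true-negative rate and the one-vs-rest accuracy tend to be high even for a weak classifier, so precision, true-positive rate, and F1 are usually the more informative per-class figures.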

5. Conclusions

Handwritten text recognition is regarded as one of the most daunting tasks in text recognition research. During the last two decades, cursive text recognition has attracted significant interest in the research community. However, the lack of a standard database makes it more challenging. To address these problems, an OCR system for handwritten Pashto character recognition is presented in this paper. A medium-sized database containing 11352 character samples (44 characters × 258 samples) was developed for the analysis and experimental work. Histogram of oriented gradients and zoning techniques are used for feature extraction. The resulting feature maps are used for the recognition of handwritten Pashto characters with the k-NN classifier. Histogram of oriented gradients features give an accuracy of 80.34%, while zoning-based density features give an accuracy of 76.42%. Ten-fold cross validation was applied to evaluate the system.

Data Availability

No data were used in this work.

Conflicts of Interest

The authors declare that they have no conflicts of interest.