Abstract
The ongoing growth in the vast amount of digital documents and other data in the Arabic language available online has increased the need for classification methods that can deal with the complex nature of such data. The classification of Arabic text plays a large and important role in many modern applications and intersects with other fields, ranging from search engines to the Internet of Things. However, existing approaches remain largely insufficient for classifying the huge quantities of Arabic documents with high performance; while some work has addressed the classification of Arabic text, most of the research has focused on English text. The methods proposed for English are not suitable for Arabic, as the morphology of the two languages differs substantially, and this morphology makes the preprocessing of Arabic text a particularly challenging task. In this study, three commonly used classification algorithms, namely, the K-nearest neighbor, Naïve Bayes, and decision tree, were implemented for Arabic text in order to assess their effectiveness with and without the use of a light stemmer in the preprocessing phase. In the experiment, a dataset from the Agence France-Presse (AFP) Arabic Newswire 2001 corpus, consisting of four categories and 800 files, was classified using the three classifiers. The results showed that the decision tree with the light stemmer achieved the best classification accuracy, at 93%.
1. Introduction
Machine learning (ML) is a branch of Artificial Intelligence (AI) research [1] that aims to develop practically relevant, multipurpose algorithms that learn from limited amounts of data. The difference between ML approaches and general AI lies in the patterns discovered in the data and the way in which the data is used. Applications of ML include fraud detection, weather forecasting, and patient diagnosis. The two major forms of ML are supervised and unsupervised learning. Here we consider the former, which involves learning a mapping from labeled training data to an output of predictions or classes. This process can be described as classification and is the core aspect of supervised ML.
Classification involves the determination of output values, known as classes or labels, from input objects. This mapping is known as a model or classifier. The input objects, also referred to as examples, instances, or tuples, are related to the categorized objects. According to [2], ML classification involves collecting several instances together with their known labels by manually tagging a group of instances. The group of labeled instances is known as a training set. The labeled instances (i.e., the training set) are used by the classifier to generate the model that maps each instance to its label. The trained model can then be used to label or classify new, unknown instances. In the current study, which focuses on the classification of Arabic text, the instances are carefully chosen from a prelabeled pool of instances by employing enhanced Arabic classifiers.
There are many situations in which unlabeled documents are both plentiful and cheap, whereas labeling them is costly and time-consuming. For example, it is easy to obtain a huge number of documents at essentially no cost; in contrast, a great deal of money must be paid to human annotators to label these documents with their subject categories, whether they are in Arabic or another language. Similarly, video data is easy to collect, but it is very difficult to obtain good semantic content labels for that data. Likewise, it is easy to obtain a wide range of compounds that may be useful for treating a disease, but it is very expensive to run the biochemical tests needed to see which ones really work. These three examples are essentially classification problems.
Several algorithms have been implemented to solve the text classification (TC) problem. Most work in this field has focused on English text; in contrast, little research has been done on Arabic text. English differs from Arabic in its morphological structure, which makes the preprocessing of Arabic text more challenging for a number of reasons. The aim of this study is to evaluate the performance of an Arabic text classification system using three distinct categorization methods, namely, the decision tree (DT), Naïve Bayes (NB, parametric-based), and K-nearest neighbor (KNN, example-based) classifiers. In order to find the best combination of weighting scheme and technique, various weighting schemes were adopted for the first two methods.
In the following, Section 2 discusses text classification. Then we present the motivation and objective of this work in Section 3. An overview of the related works and the three classifiers considered in this study are provided in Section 4. Section 5 introduces the framework of the proposed Arabic text classifier. Section 6 describes the experiment and Section 7 presents the document representation. Section 8 presents results and Section 9 contains the conclusion and details of future work.
2. Text Classification
Text classification is a supervised machine learning task that requires prelabeled documents to learn from. It aims to classify new documents on the basis of the learned criteria [3]. Applications of text-based knowledge and the TC capability are particularly important in natural language processing (NLP), not least because of the recent increase in the volume of available text data. One example of an area in which TC and NLP are needed is filtering [4], a process that attempts to filter a user's inbound documents to identify those that are unwanted or unsolicited. Another is sentiment analysis [5], which seeks to identify the general feelings expressed in a document in order to measure, for example, customer satisfaction.
The problems encountered in TC can be addressed by applying supervised learning algorithms to train classification models on a set of examples of the problem in question. These models can then be used to identify the class of unlabeled documents [2, 6–10].
There are two phases in the TC approach: training and testing. The training phase involves building a classifier using a group of the collected documents (called the training set), allocating a subset of the training set to each category, and then processing the documents via several NLP techniques. The aim of this processing is to extract from the training set the set of features that will be used to represent each category. The remainder of the collected documents forms the so-called test set, which is used in the testing stage to evaluate the performance of the classifier in terms of its ability to assign documents it has not seen before to the correct categories; performance is assessed by comparing the categories selected by the classifier with the predefined categories of the documents [3].
A TC system generally consists of the following parts:
(i) Text preprocessing, which converts the text into a group of dimensions that can be processed by classifiers.
(ii) Dimensionality reduction, which decreases the number of features to enhance the efficiency of classification algorithms. This can be done using methods such as feature selection and dimension reduction [8, 9, 11, 12].
(iii) Classifier training, which is the process of building an autonomous classifier using supervised learning frameworks [2].
(iv) Prediction, which is the process of using a trained classifier to generate labels for new documents [2].
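As an illustration of how these components could fit together, the following is a minimal pipeline sketch using scikit-learn; the toy documents, the feature count, and the choice of classifier are placeholder assumptions, not the configuration used in this study.

```python
# Minimal text-classification pipeline sketch (illustrative only; the corpus,
# feature limit, and classifier are placeholders, not this study's setup).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.tree import DecisionTreeClassifier

train_docs = ["فوز الفريق في المباراة", "تعادل الفريقين في الدوري",
              "هبوط اسعار الاسهم", "ارتفاع اسعار النفط"]   # placeholder documents
train_labels = ["sports", "sports", "economy", "economy"]   # placeholder labels

pipeline = Pipeline([
    ("preprocess", TfidfVectorizer()),        # text preprocessing + representation
    ("reduce", SelectKBest(chi2, k=4)),       # dimensionality reduction (feature selection)
    ("train", DecisionTreeClassifier()),      # classifier training
])
pipeline.fit(train_docs, train_labels)        # training phase
print(pipeline.predict(["نتيجة المباراة"]))   # prediction on an unseen document
```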
It has been indicated in [13] that texts can be symbolically represented as a set of features by employing two representation methods, namely, the n-gram and the bag-of-words (BOW) models. The latter uses individual words or phrases as features, while the former employs sequences of words or characters of length n. Past studies [14, 15] have pointed out that the creation of an accurate TC system requires the effective handling of a large number of features (which may number in the tens of thousands). Hence, information retrieval (IR) techniques such as stemming and stop-word elimination have been used to decrease the dimensionality of the feature space.
3. Motivation and Objectives
The importance of classification technologies has increased owing to the need to automatically classify the huge amounts of diverse text-based information that can be found on the Internet and in electronic/digital format in many languages, including Arabic. Hence, several studies initially focused on addressing the challenges associated with standard Arabic document classifiers [6, 7, 9, 16], which then encouraged further studies that concentrated on enhancing the performance of Arabic document classifiers. This research continues because most Arabic classifiers are characterized by their inability to deal accurately with the vast quantities of documents written in Arabic. As such, this is considered the major problem in the classification of Arabic texts.
One of the main obstacles facing researchers working in the field of text classification for documents in Arabic is the failure of the available classifiers to deal with stemming, a factor that might affect other processes in a document classification system. To address this issue, an algorithm is employed to define the stemming rule; this rule depends on the processing of the grammatical components of an utterance in order to handle the complexity of Arabic morphology and syntax.
The major TC problem is related to the enormous number of features extracted from the text, which can reach hundreds or thousands. Consequently, the time required to substitute a term with its possible concepts may increase, and the high dimensionality of the feature space may reduce classifier performance. The number of features, or feature size, can be reduced by extracting the essential semantics from texts [17, 18].
Therefore, in order to reduce the feature size of Arabic text, this study evaluates three classifiers with and without stemming [19]. It is hoped that the outcome of this research will contribute to the improved tracking and detection of new documents and their categorization into the relevant categories and, consequently, to the improved performance of Arabic classifiers. In sum, this study attempts to answer the following research question: what is the effect of classification techniques on Arabic documents with or without the use of a stemmer?
4. Related Works
Text classification refers to the assignment of predefined categories to text depending on the content of the documents. Text classification is important for natural language processing and other applications of textual knowledge, and its importance has grown with the recent increase in the volume of available text data. The problems of text classification can be overcome by applying supervised learning algorithms to train classification models on a group of examples of the problem in question that exemplify the correct classification (labels). These models can then be used to predict the labels of unlabeled documents [12, 20–23].
In the case of supervised algorithms, the structure of the categories is assumed to be known in advance, and these algorithms require a group of tagged documents in order to map the documents to prespecified classes. However, as mentioned above, for a huge dataset it is difficult to determine the true label and class of each document in the training set. Hence, the review in this section focuses on the most commonly used classification algorithms, namely, KNN, NB, and DT.
4.1. K-Nearest Neighbor Algorithm (KNN) Classifier
KNN is a popular example-based (instance-based) classifier that has proven effective in several text categorization tasks. The algorithm consists of two basic steps: first, the k nearest neighbors of the test document are found within the given training documents [24]; second, the category of the test document is determined from the category labels of these neighbors. The conventional approach assigns to the test document the most common category label among the established k nearest neighbors.
The conventional KNN is the basis of the extended weighted KNN, in which the contribution of each neighbor is weighted according to its proximity to the test document. The similarities of the neighboring documents in each class are then summed to obtain the class score; i.e., the score of class $c_j$ for a test document $x$ is given by

$$\mathrm{score}(x, c_j) = \sum_{d_i \in \mathrm{KNN}(x)} \mathrm{sim}(x, d_i)\,\delta(d_i, c_j),$$

where $\mathrm{KNN}(x)$ is the set of the $k$ training documents nearest to $x$, $\mathrm{sim}(x, d_i)$ is the cosine similarity between $x$ and $d_i$, and $\delta(d_i, c_j)$ is a function whose value is 1 if $d_i$ belongs to class $c_j$ and 0 otherwise. The test document $x$ is assigned to the class with the highest score.
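The following is a minimal sketch of this weighted KNN scoring over toy document vectors; the vectors, labels, and value of k are illustrative assumptions, not the study's configuration.

```python
# Weighted k-NN class scoring sketch: score(x, c) = sum of cosine similarities
# of x to its k nearest training documents that belong to class c.
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def knn_classify(x, train_vectors, train_labels, k=3):
    sims = [cosine_sim(x, d) for d in train_vectors]
    # indices of the k most similar training documents
    nearest = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    scores = {}
    for i in nearest:
        scores[train_labels[i]] = scores.get(train_labels[i], 0.0) + sims[i]
    return max(scores, key=scores.get)   # class with the highest accumulated score

# Toy example (placeholder vectors and labels)
train_vectors = [np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0])]
train_labels = ["sports", "sports", "economy"]
print(knn_classify(np.array([0.8, 0.2]), train_vectors, train_labels, k=2))
```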
4.2. Naïve Bayes (NB) Classifier
The NB classifier is a simple probabilistic classifier based on Bayes' theorem, which estimates the likelihood of the classes assigned to a test document using the joint probabilities of the terms and classes of that document. The naïve aspect of the classifier originates from its assumption that, within each category, all terms are conditionally independent of one another. Under this independence assumption, the parameters of each term can be learned separately, which makes the computation simpler than for non-NB classifiers. In effect, an NB classifier assumes that the presence or absence of a particular feature of a category is unrelated to any other feature. This assumption is expressed through Bayes' rule as

$$P(C_i \mid d) = \frac{P(d \mid C_i)\,P(C_i)}{P(d)},$$

where $P(C_i \mid d)$ is the posterior probability of class $C_i$ given a new instance $d$, $P(d \mid C_i)$ is the likelihood of the instance $d$ being generated by class $C_i$, $P(d)$ is the probability of the instance $d$, and $P(C_i)$ is the prior probability of class $C_i$, which can be computed as

$$P(C_i) = \frac{N_i}{N},$$

where $N_i$ is the number of training samples associated with class $C_i$ and $N$ is the total number of training samples.
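Below is a minimal multinomial NB sketch over toy tokenized documents. The use of term counts and Laplace smoothing is an assumption for illustration; the text above does not specify these choices.

```python
# Naïve Bayes sketch over term counts: P(c|d) ∝ P(c) * Π_t P(t|c)^count(t,d),
# with Laplace smoothing. Toy tokenized documents; not the study's corpus.
import math
from collections import Counter, defaultdict

train = [(["مباراة", "هدف"], "sports"), (["سوق", "اسهم"], "economy")]

class_counts = Counter(label for _, label in train)
term_counts = defaultdict(Counter)
for tokens, label in train:
    term_counts[label].update(tokens)
vocab = {t for tokens, _ in train for t in tokens}

def predict(tokens):
    best, best_score = None, float("-inf")
    for c, n_c in class_counts.items():
        score = math.log(n_c / len(train))                         # log prior P(c)
        total = sum(term_counts[c].values())
        for t in tokens:
            # Laplace-smoothed conditional probability P(t|c)
            score += math.log((term_counts[c][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

print(predict(["هدف", "مباراة"]))   # expected: "sports"
```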
4.3. Decision Tree (DT) Classifier
The DT is a commonly used inductive learning method characterized by its robustness to noisy data and its ability to learn disjunctive expressions, which makes it suitable for document classification [25]. This algorithm employs a "divide and conquer" approach, dividing complex decisions into several simpler ones.
In the learning stage, the DT is constructed from a group of labeled training examples, each represented as a record of feature values together with a class label. Most decision tree learning algorithms follow a top-down, greedy, recursive search: they start with an empty tree and the entire training set, choose the feature that provides the most information about the class (the best partition) as the splitting feature for the root, and then divide the training data into disjoint subgroups according to the values of the splitting feature. For every subgroup, the procedure is applied recursively until the examples in each subgroup belong to the same class [3].
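The following is a minimal sketch of decision-tree induction over TF-IDF features using scikit-learn; the entropy criterion and the toy documents are illustrative assumptions rather than the study's exact setup.

```python
# Decision-tree induction sketch: the tree is grown top-down, greedily picking
# the most informative feature at each split. Toy documents; illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

docs = ["فوز الفريق في المباراة", "هبوط اسعار الاسهم", "تعادل الفريقين في الدوري"]
labels = ["sports", "economy", "sports"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# criterion="entropy" selects splits by information gain, as described above
tree = DecisionTreeClassifier(criterion="entropy")
tree.fit(X, labels)
print(tree.predict(vectorizer.transform(["نتيجة المباراة"])))
```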
5. Framework of Arabic Text Classifier
In answering a user's request, a TC system should do the following: classify the intended document, classify it swiftly, meet the user's requirements, and achieve optimum classification efficacy [26, 27]. Thus, the objective of the Arabic TC (ATC) framework presented in this study is to raise the efficiency of the ATC system by taking into account the semantic relationships and the complexity of Arabic terms.
The ATC framework depends on the following stages: preprocessing, extraction, representation, application of classifiers, and evaluation. The ATC framework takes into account these important issues (Figure 1).

The first step of the ATC system is the preprocessing phase, which is an important step for document representation. It involves the initial processing of the text to choose the appropriate terms to be indexed. In the preprocessing phase, several operations are performed, such as stemming, stop-word elimination, tokenization, and normalization.
The main contribution of this study is to build an automatic Arabic text classifier that classifies documents based on morphological knowledge representation by utilizing a light stemmer. The general procedures performed in this method are as follows (Figure 2).

Figure 3 shows the different stages of the ATC framework which will be discussed in detail in Section 6, “Experiment.”

6. Experiment
Arabic-language classification is a supervised learning process; three supervised ML algorithms were used in this experiment, namely, the KNN, NB, and DT classifiers [28]. In order to enhance the accuracy of the Arabic classifier, the Arabic Light10 stemmer was also employed and tested. In this section, the steps shown earlier in the Arabic text classifier framework are presented and tested.
6.1. Dataset
We used a dataset that consisted of 800 documents classified into four classes. These documents were extracted from the relevant documents for four queries (i.e., each query represents a class) from the Arabic Newswire dataset used in recent TREC experiments [29]. Figure 4 shows a sample document from the dataset.

6.2. Preprocessing
The aim of the preprocessing phase is to filter out nonsignificant data, such as tags (i.e., <DOC>, <DOCNO>, <DOCTYPE>, <DATE_TIME>, <BODY>, <TEXT>, <END_TIME>), from a document. In this step, the document must be converted into a format suitable for the representation process so that the learning algorithms can be applied. Unnecessary characters such as punctuation marks and special markers are then removed. Thus, three commonly identified tasks need to be carried out: tokenization and normalization, stop-word removal (in order to reduce the dimension of the feature space), and stemming (or lemmatization). Based on the review of these tasks in previous studies, the following subsections provide a brief description of each.
6.3. Tokenization and Normalization of Data
According to [31], text documents are usually converted into a form that is appropriate for analysis by a machine learning algorithm. The text is divided into separate units using either spaces or special symbols, so that every word in the text is represented as a single unit. This procedure is called tokenization. For instance, the sentence (خير جليس في الزمان كتاب) can be tokenized on white space into the list of tokens (words) (خير، جليس، في، الزمان، كتاب). The related task known as normalization is applied before stemming, particularly for Arabic script, because normalization of Arabic text reduces the various shapes of certain characters to a single uniform shape. This is illustrated by the following substitutions:
(i) Substitute آ, إ, and أ with ا
(ii) Substitute the final ة with ه
(iii) Substitute the final ى with ي
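A minimal sketch of whitespace tokenization and the normalization rules listed above follows; the function names are illustrative, and only the three substitutions mentioned in the text are applied.

```python
# Arabic tokenization and normalization sketch following the substitutions above
# (unify alef variants, final taa marbuta -> haa, final alef maqsura -> yaa).
import re

def normalize(word):
    word = re.sub("[آأإ]", "ا", word)   # unify alef forms
    word = re.sub("ة$", "ه", word)       # final taa marbuta -> haa
    word = re.sub("ى$", "ي", word)       # final alef maqsura -> yaa
    return word

def tokenize(text):
    return text.split()                  # whitespace tokenization

text = "خير جليس في الزمان كتاب"
print([normalize(t) for t in tokenize(text)])
```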
6.4. Elimination of Stop-Words
Stop-words are words that occur frequently in a document but give no hint about the content of the document in which they appear. Stop-word removal is performed before the text is processed by an ATC system in order to reduce both time and cost. Hence, a list of stop-words is created and applied so that these words are eliminated from the indexed terms. However, there is no standard stop-words list for ATC systems. Consequently, in this experiment the same stop-words list used in [32] was adopted. Table 1 provides some examples of Arabic stop-words.
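A minimal sketch of this filtering step is shown below; the short stop-word set is a placeholder, not the actual list from [32].

```python
# Stop-word removal sketch: filter indexed tokens against a stop-word list.
stop_words = {"في", "من", "على", "الى", "عن"}   # placeholder list, not the list in [32]

tokens = ["خير", "جليس", "في", "الزمان", "كتاب"]
filtered = [t for t in tokens if t not in stop_words]
print(filtered)   # ['خير', 'جليس', 'الزمان', 'كتاب']
```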
6.5. Stemming Text
The text stemming process reduces the various inflectional and derivational word forms to a uniform form called the stem [32]. For instance, the terms "work," "works," "working," "worked," and "worker" are derived from the stem "work." Table 2 shows an example of different Arabic words derived from the same root. The root of a word is obtained by eliminating some or all of the affixes attached to it. In an ATC system, terms that share the same stem or root are grouped together, which effectively raises the number of documents matched to a user query. Furthermore, there is an overall improvement in ATC performance due to the reduction in dictionary size that results from the stemming process [33].
In this paper, for stemming we followed the same steps as in [33], using the Light10 stemmer, as follows (a simplified code sketch of this procedure is given below):
(1) Remove "و" ("and") for Light2, Light3, Light8, and Light10 if the remainder of the word is three or more characters long.
(2) Remove the definite articles if the remaining word is two or more letters long.
(3) Remove the suffixes that appear in the list, one at a time, in order from right to left, keeping only words that remain two or more letters long after suffix removal.
Table 3 shows the list of strings to be removed. Note that the conjunction and the definite articles are the prefixes shown in the table; no elimination is performed for strings that are deemed an actual Arabic prefix in the Light10 stemmer.
Table 4 shows an example of affixes in an Arabic word.
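The sketch below illustrates the light-stemming procedure in the spirit of the Light10 steps listed above; the prefix and suffix lists are abbreviated placeholders, not the exact Light10 tables from [33], and the minimum-length checks follow the rules described in the steps.

```python
# Simplified light-stemming sketch: strip leading "و", definite-article
# prefixes, then suffixes right to left, keeping a minimum word length.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل"]   # abbreviated placeholder list
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]  # placeholder list

def light_stem(word):
    if word.startswith("و") and len(word) >= 4:        # step 1: leading waw
        word = word[1:]
    for p in PREFIXES:                                  # step 2: definite articles
        if word.startswith(p) and len(word) - len(p) >= 2:
            word = word[len(p):]
            break
    for s in SUFFIXES:                                  # step 3: suffixes, right to left
        if word.endswith(s) and len(word) - len(s) >= 2:
            word = word[: -len(s)]
    return word

print(light_stem("والمدرسة"))   # expected stem close to "مدرس"
```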
7. Document Representation
Each document in the study dataset was represented by a vector with each term as an attribute and its TF-IDF weight [34] as the attribute value; TF-IDF is a statistical way of determining the relevance of a word to a document in a corpus. TF-IDF is the most commonly used term-weighting method. Under this scheme, the weight of a term $i$ in a document $d$ is proportional to the number of times the term appears in the document, the term frequency (TF), and inversely related to the number of documents in the corpus in which the term appears, via the inverse document frequency (IDF).
Thus, the TF-IDF method assigns a low weight to a term that occurs in most of the documents, since such a term is assumed to possess little discriminating power:

$$w_{i,d} = tf_{i,d} \times \log\left(\frac{N}{df_i}\right),$$

where $tf_{i,d}$ is the number of occurrences of term $i$ in document $d$, $N$ is the total number of documents in the corpus, and $df_i$ is the number of documents that contain term $i$.
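The following is a minimal sketch of computing TF-IDF weights according to the formula above, over toy tokenized documents; normalization or smoothing variants used in practice are omitted.

```python
# TF-IDF weighting sketch: w(i, d) = tf(i, d) * log(N / df(i)).
import math
from collections import Counter

docs = [["سوق", "اسهم", "سوق"], ["مباراة", "هدف"], ["سوق", "نفط"]]   # toy documents
N = len(docs)
df = Counter(term for doc in docs for term in set(doc))   # document frequency per term

def tfidf(doc):
    tf = Counter(doc)                                     # term frequency in this document
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tfidf(docs[0]))   # "سوق" gets a low per-occurrence weight: it appears in most documents
```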
7.1. Construction of the Three Classifiers
In this experiment, the documents of the Arabic dataset were categorized using the KNN, NB, and DT classifiers in two forms: the full word (without stemming) and the stemmed word (the full word stemmed by the Light10 stemmer).
7.2. Evaluation and Comparison of Classification Quality
Two measures are mainly used to evaluate the quality of a classifier's output, namely, the F-measure and accuracy [35]. In classification problems, the evaluation is generally represented in the form of a confusion matrix, which contains the number of instances that are correctly and wrongly classified for each class.
In practice, the most widely used evaluation metric is the accuracy (ACC) rate. It represents the efficiency of the classifier as the proportion of instances that it predicted correctly. The classifier accuracy is calculated as

$$\mathrm{ACC} = \frac{\text{number of correctly classified instances}}{\text{total number of instances}}.$$
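A minimal sketch of computing accuracy from a confusion matrix built over toy labels follows; the label values are placeholders.

```python
# Accuracy sketch: ACC = correctly classified instances / total instances,
# taken from the diagonal of a confusion matrix. Toy labels; illustrative only.
from collections import Counter

true_labels      = ["sports", "economy", "sports", "culture"]
predicted_labels = ["sports", "economy", "economy", "culture"]

confusion = Counter(zip(true_labels, predicted_labels))      # (true, predicted) counts
correct = sum(n for (t, p), n in confusion.items() if t == p)
accuracy = correct / len(true_labels)
print(f"ACC = {accuracy:.2%}")   # 75.00%
```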
8. Results
A comparison of the three classifiers was conducted in respect of accuracy and the number of features selected with and without the use of stemming in the preprocessing phase. Tables 5 and 6 show the results for the three classifiers with and without stemmer, respectively.
The tables show that, without a stemmer, DT outperformed NB and KNN, achieving 90% accuracy as compared to 33.83% and 26.11%, respectively. When a stemmer was included in the preprocessing phase, all three classifiers improved their performance, and again DT produced the best result with 93%, compared with 35% for NB and 26.36% for KNN. Thus, the use of a stemmer improved the accuracy of all three classifiers. Furthermore, the tables show that the use of a stemmer also reduced the number of features used by the classifiers by around 50%. Figure 5 provides a graphical illustration of the results, from which we can conclude that the number of features affects the performance of NB and KNN. KNN achieved an accuracy of 26.12% when using all features, and its performance remained unsatisfactory with the stemmer, at 26.36%. For NB, the stemmer improved accuracy by around 1.8%, but its performance was still well below that of DT. We can conclude that DT handles a large number of features better than NB and KNN do.

These results show that the decision tree with the light stemmer achieved the best classification accuracy, at 93%.
9. Conclusion and Future Work
In this paper, prior to developing our proposed method, we reviewed several previous studies that contributed to improving our understanding of the study problem, namely, the classification of Arabic text, and its potential solutions. Given the vast amount of information in Arabic that is available online, and which continues to grow, the main aim of this study was to save users and developers effort and cost in searching for and using such data. In this work, we addressed the weaknesses of classifiers previously used for TC, namely, KNN, NB, and DT; their main weakness is poor performance when handling a huge number of features. Based on our experimental outcomes, we found that DT with a stemmer improves efficiency and outperforms the other classifiers considered in this work. However, the high dimensionality of the terms without light stemming remains the primary weakness of the preprocessing phase, and feature selection is needed to cope with the huge number of terms; this is left as future work. We also plan to improve the text classifier by combining deep reinforcement Q-learning with our proposals, and we recommend the use of other classification criteria not employed in this work.
Data Availability
The data are available at https://catalog.ldc.upenn.edu/LDC2001T55 and are not free to access.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this study.