Abstract

High dimensionality of the feature space is one of the main problems in the field of text classification. Identifying an optimal subset of features can improve the text classification process in terms of processing time and performance. In this paper, we propose a novel Relevant-Based Feature Ranking (RBFR) algorithm which identifies and selects smaller subsets of more relevant features in the feature space. We compared the performance of RBFR against existing feature selection methods such as the balanced accuracy measure, information gain, Gini index, and odds ratio on three datasets, namely, the 20 Newsgroups, Reuters, and WAP datasets. We used five machine learning models (SVM, NB, kNN, RF, and LR) to test and evaluate the proposed feature selection method. We found that the proposed feature selection method is 25.4305% more effective than the existing feature selection methods in terms of accuracy.

1. Introduction

A massive amount of information is generated and pushed into the digital world every second through various sources such as web pages, blog content, eBooks, social media content, and review documents. As the content grows day by day, it becomes difficult to convert it into an organized form, which causes problems such as difficulty in searching and lack of summarization. Automatic text classification is one of the ways to organize documents efficiently. Supervised machine learning models such as support vector machines (SVM) [1], Naïve Bayes (NB) [2], k nearest neighbor (kNN) [3], random forest (RF) [4], and logistic regression (LR) [5] are very effective at organizing content into one or more topics (or classes). Machine learning has wide applications in the field of text classification, such as spam detection [6], sentiment analysis [7], and topic classification [8].

There are three stages in text classification: preprocessing, feature selection, and final classification. The preprocessing stage is responsible for formatting the text and removing useless words. Stop word removal, stemming, and text representation are a few of the tasks performed in this stage. Stop word removal eliminates uninformative tokens such as “is,” “was,” “that,” and punctuation marks. Stemming converts derived words into their root form (e.g., “running” is converted to “run,” and “walked” is converted to “walk”). Text representation formats the document into a usable form, and features are identified in this stage. Common text representations include Bag-of-Words (BoW) [9] and n-grams [10].
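For concreteness, a minimal Python sketch of such a preprocessing pipeline is given below; the choice of NLTK for stop words and stemming and scikit-learn for the Bag-of-Words representation is an illustrative assumption, not necessarily the toolchain used in this work.

# Minimal preprocessing sketch: stop word removal, Porter stemming, and a
# Bag-of-Words representation (illustrative libraries, not necessarily those
# used in this study). Requires nltk.download("stopwords") once.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(document):
    # keep alphabetic tokens, drop stop words, reduce words to their root form
    tokens = [t for t in document.lower().split() if t.isalpha()]
    return " ".join(stemmer.stem(t) for t in tokens if t not in stop_words)

docs = ["Running on the planet is healthy", "The marriage ceremony was lovely"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform([preprocess(d) for d in docs])   # document-term matrix
print(vectorizer.get_feature_names_out())                     # the extracted features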

A feature is the indivisible atomic unit in a text document. A text corpus may contain many documents $D = \{d_1, d_2, \ldots, d_m\}$. Each document contains a number of unique features, and the entire text corpus contains $n$ unique features $F = \{f_1, f_2, \ldots, f_n\}$. As the number of documents increases, the corresponding feature size also increases, which increases the classification complexity, increases the processing time, and decreases the accuracy. Hence, an optimal subset of $F$ should be found to represent the documents better and increase the classification performance. The total number of possible subsets is $2^n - 1$ (excluding the null set), so it is not practically possible to brute force all the combinations; thus, various feature selection algorithms aim to find an optimal combination in a much easier way.

There are three types of feature selection methods: filter based, wrapper based, and embedded [11]. Filter-based methods are model independent and pick features based on statistical measures such as correlation and chi-square. Filter-based methods are faster than the other two types, but they cannot identify dependencies between features. Wrapper-based methods are model dependent, meaning that a separate set of features is selected for each model; they use an evaluation strategy to pick the optimal subset. Embedded methods combine the filter-based and wrapper-based approaches and inherit both the positives and negatives of the two.

In this paper, we propose a filter-based feature selection method called the Relevant-Based Feature Ranking (RBFR) algorithm, which identifies the most important features and removes irrelevant features from the feature space. The proposed method first ranks all the features according to two metrics, the true positive rate (TPR) and the false positive rate (FPR). The features with the top TPR−FPR scores are picked; within the chosen list, the features with high FPR are removed. The list is then appended with the common features selected by the information gain (IG), chi-square (CHI), and Pearson correlation feature selection methods. We have compared the proposed method with well-known standard feature selection methods such as balanced accuracy, odds ratio (OR), IG, and Pearson correlation. The main contributions are listed as follows:
(i) To develop a filter-based feature selection method which is able to pick the most important features that describe the target class better
(ii) To identify and eliminate overlapping or weak features that poorly represent the target class
(iii) To utilize the merits of other filter-based methods to pick correct features

The above-mentioned contributions are aimed at picking rich features that represent the target class better than the other features; additionally, errors in the selected features should be identified and removed to increase the performance. Moreover, the top features selected by other filter-based methods are also utilized in the feature selection process.

The rest of the paper is organized as follows. Section 2 reviews the literature related to feature selection. Section 3 describes the working of the proposed algorithm. Section 4 presents the experimental results and the comparison with existing machine learning models and other existing works. Finally, the conclusion is presented in Section 5.

2. Related Work

In this section, we brief the recent works in the field of feature selection for text classification and list their comparisons, merits, and limitations.

A research work by [12] proposes a feature selection method that uses the correlation between each feature and the class. They strengthened the positive features and weakened the negative features, and a margin-based feature selection is implemented to increase the classification performance. They evaluated their proposed filter-based method on thirteen datasets and showed its superiority over existing feature selection methods.

Feature selection can also be done in multiple stages. A work by [13] proposes a three-stage feature selection. In the first stage, they incorporated particle swarm optimization to search for optimal features in the feature space. In the second stage, redundant features are found and removed from the selected features. The last stage measures the significance of each feature; if the measure is too low, the feature is deleted from the feature space. Thus, one stage selects the features and two stages remove irrelevant features.

The feature selection proposed by [14] focuses on selecting features at two decision levels. In the first level, they used learners to find the relevant features; the learners are then filtered to keep the highly confident ones. The elected learners are allowed to vote in the second level to pick the most relevant features in the feature space.

Clustering is used for grouping features and picking the relevant ones in a work proposed by [15]. The redundancy and relevancy problems are solved by the clustering algorithm. A sorting algorithm arranges all the features in the clustering space, with correlation as the main metric used to rank them.

An embedded feature selection was proposed by [16] for the classification of Twitter reviews. As it combines both filter and wrapper methods, it eliminates the semantic problem. Transfer learning is used along with filter-based methods such as information gain and Pearson correlation and wrapper-based methods such as expectation maximization. A weight-based deep learning model is implemented to test the performance of the proposed method.

The irrelevant and redundant features present in a text corpus create a negative impact on text classification. A hybrid filter-based feature selection introduced by [17] combines principal component analysis and information gain. In their experiment, they found that their proposed feature selection method reduces the dimension of the data significantly by picking the correct feature subset, thus reducing the training time.

A comparison of feature selection methods was done by [18]; they used seven filter-based methods, two wrapper-based methods, and one embedded method to test the significance for classification. Three models, artificial neural network, support vector machine, and random forest, were used in their experiment. Several combinations of feature selection methods and classifiers were made, and the most appropriate subset was found based on the training performance.

Instance selection is the method of selecting or removing instances; reducing the number of instances is another way to increase classification performance. Ensemble methods are also popular in feature selection, such as in [19], where the authors used both feature selection and instance selection. Three feature selection algorithms along with instance selection are used in their experiment, together with two ensemble-based techniques.

Redundancy and dependency identification is generally good in filter-based methods [20]; a work [21] shows that mutual information feature selection is effective in finding the correlation between the features and the target class. In a fuzzy-based environment, however, mutual information, like other filter-based methods, is weak at calculating correlations and dependencies. They adopted an independent fuzzy classification on a fuzzy-based data space and then, based on the proportion of classification error, adjusted the fuzzy-based feature selection.

Feature selection is optimized using genetic programming as mentioned in [22]. A hybrid feature selection is done by merging multiple filter-based feature selection methods, and a feature construction algorithm is utilized to optimize the selected features. Nine datasets were used in their experiment, and the comparison shows that the feature construction algorithm is effective (Table 1).

From the above-mentioned literature, feature selection still needs considerable improvement, especially with respect to relevancy. Thus, we propose a feature selection method that extracts the relevant features and improves the efficiency of text classification.

2.1. A Few Existing Feature Selection Methods

This section presents an overview of three popular filter-based feature selection methods.

2.1.1. Information Gain

Information gain [28] is a supervised feature selection method which ranks each feature according to the word’s contribution based on its presence or absence in a particular set of text inputs [29]. IG is calculated as

$$IG(w) = -\sum_{i=1}^{C} P(c_i)\log P(c_i) + P(w)\sum_{i=1}^{C} P(c_i \mid w)\log P(c_i \mid w) + P(\bar{w})\sum_{i=1}^{C} P(c_i \mid \bar{w})\log P(c_i \mid \bar{w}),$$

where $C$ represents the total number of target classes (if binary classification is used, then $C = 2$), $P(c_i)$ denotes the probability of class $c_i$, $P(w)$ is the probability that the word $w$ is present in a document, and similarly, $P(\bar{w})$ represents the probability that the word $w$ is absent in a document. $P(c_i \mid w)$ and $P(c_i \mid \bar{w})$ are the conditional probabilities.
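As a reference, the following Python sketch computes this score from binary presence/absence information; the variable names and the small smoothing constant eps are our own assumptions for numerical stability.

import numpy as np

def information_gain(presence, labels, eps=1e-12):
    # presence: boolean array, True where the word w occurs in a document
    # labels: class id of each document
    presence = np.asarray(presence, dtype=bool)
    labels = np.asarray(labels)
    p_w = presence.mean()                 # P(w): word present
    p_nw = 1.0 - p_w                      # P(w_bar): word absent
    ig = 0.0
    for c in np.unique(labels):
        p_c = np.mean(labels == c)                                      # P(c_i)
        p_c_w = np.mean(labels[presence] == c) if p_w > 0 else 0.0      # P(c_i | w)
        p_c_nw = np.mean(labels[~presence] == c) if p_nw > 0 else 0.0   # P(c_i | w_bar)
        ig += -p_c * np.log2(p_c + eps)
        ig += p_w * p_c_w * np.log2(p_c_w + eps)
        ig += p_nw * p_c_nw * np.log2(p_c_nw + eps)
    return ig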

2.1.2. Chi-Square

Chi-square [30] tests the independence of a feature with respect to the target class. It is used to measure how much a term diverges from its dependent class [31]. CHI is calculated using the following formula:

$$\chi^2(w) = \frac{N\,\big[P(w, c_{pos})\,P(\bar{w}, c_{neg}) - P(w, c_{neg})\,P(\bar{w}, c_{pos})\big]^2}{P(w)\,P(\bar{w})\,P(c_{pos})\,P(c_{neg})},$$

where $N$ is the total number of documents and the symbols $P(c_{pos})$ and $P(c_{neg})$ represent the probability of the positive class and the negative class, respectively.
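A corresponding Python sketch of this probability form is shown below; an equivalent count-based formulation is also common, and eps is an assumed smoothing constant.

import numpy as np

def chi_square(presence, positive, eps=1e-12):
    # presence: True where the term occurs in a document
    # positive: True where the document belongs to the positive class
    presence = np.asarray(presence, dtype=bool)
    positive = np.asarray(positive, dtype=bool)
    n = len(presence)
    p_w, p_nw = presence.mean(), 1.0 - presence.mean()
    p_pos, p_neg = positive.mean(), 1.0 - positive.mean()
    p_w_pos = np.mean(presence & positive)      # P(w, c_pos)
    p_w_neg = np.mean(presence & ~positive)     # P(w, c_neg)
    p_nw_pos = np.mean(~presence & positive)    # P(w_bar, c_pos)
    p_nw_neg = np.mean(~presence & ~positive)   # P(w_bar, c_neg)
    num = n * (p_w_pos * p_nw_neg - p_w_neg * p_nw_pos) ** 2
    return num / (p_w * p_nw * p_pos * p_neg + eps)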

2.1.3. Pearson Correlation

Pearson correlation is a good statistical measure to test the dependence of a feature on the target class [32]. It is unaffected by overfitting [33]. It is calculated by the following formula:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}},$$

where $x_i$ is the value of the feature in the $i$th document, $y_i$ is the corresponding class label, and $\bar{x}$ and $\bar{y}$ are their respective means.
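A direct Python sketch of this formula is given below; binary presence values or term frequencies can be used as the feature values, and numeric class labels (e.g., 1/0) as the second variable.

import numpy as np

def pearson_correlation(feature_values, labels):
    # feature_values: term frequency (or presence) of the feature per document
    # labels: numeric class labels, e.g., 1 for the positive class and 0 otherwise
    x = np.asarray(feature_values, dtype=float)
    y = np.asarray(labels, dtype=float)
    x_c, y_c = x - x.mean(), y - y.mean()
    denom = np.sqrt((x_c ** 2).sum()) * np.sqrt((y_c ** 2).sum())
    return (x_c * y_c).sum() / denom if denom else 0.0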

The existing feature selection methods have several problems, such as a lack of representation of class-unique features, difficulty in removing useless and common features, and inability to perform a negativity test.

2.2. Overall Drawbacks in Existing Feature Selection Methods

Feature selection is done to reduce the dimensionality of the features in the dataset. Good features need to be identified to separate the classes. As the number of features increases, the complexity of the classifier also increases; this creates a need for better feature selection methods [34].

Most existing feature selection methods use a weighting scheme based on frequency or distribution; such methods fail to pick class-unique features. When a feature is very specific to one class or a few classes, that feature is very important, because the classifier can use it to identify the class easily.

Another problem is that many feature selection methods rely only on a positivity test; that is, if a feature is present, then the appropriate class can be identified. However, the negativity test is also a powerful method to eliminate weak candidates in classification, and only a limited number of methods support it.

Combining two or more feature selection methods lets the classifier enjoy the advantages of multiple feature selection methods. Existing methods rarely focus on ensembling; hence, by using an ensemble technique, the performance of feature selection can be improved.

3. Relevant-Based Feature Ranking Algorithms

Feature selection is one of the important steps in text classification. An existing problem in ranking features is the lack of identification of dependence. A good feature is identified by the following characteristics:
(i) A feature present in only one class is unique and helps to identify that class correctly
(ii) A feature present in all the classes is not a good indicator for identifying a class
(iii) A feature absent in one or more classes is also distinctive and helps in the negativity test

Consider a sample dataset as described in Table 2. There are two classes: one class represents the topic astronomy, and the other represents the topic society. Take the feature “planet,” which is unique to topic 1; similarly, the feature “marriage” is unique to topic 2. The words “people” and “life” are present in both topics. The ACC2 ratings are displayed in the last column; note that the unique feature “planet” and the nonsignificant feature “life” receive the same rating, which is not a good sign for classification. Hence, the rating methodology should be optimized to select the rich features.

The proposed feature selection algorithm takes this ranking problem into consideration and aims to assign a rank based on a feature’s relevance towards the target class. If the feature fully represents the class, a high weight is given; similarly, when the feature is present in almost all the classes, it is less likely that the proposed algorithm will pick that feature. The RBFR algorithm works in the following steps:
(1) Rank the features based on the TPR−FPR score
(2) Within the list, remove the features with high FPR
(3) Merge the common features selected by three filter-based FS algorithms
(4) Rank the features based on class-unique weights

The feature ranks are given based on four metrics, true positive (TP), true negative (TN), false positive (FP), and false negative (FN), which are defined for each feature as follows:
(i) TP: the feature is present in a document of the positive class
(ii) FN: the feature is absent in a document of the positive class
(iii) FP: the feature is present in a document of the negative class
(iv) TN: the feature is absent in a document of the negative class

Input: F = set of features in the text corpus
Output: S = top N rich features
Begin:
1  For each f in F
2    TPR(f) = TP(f) / (TP(f) + FN(f))
3    FPR(f) = FP(f) / (FP(f) + TN(f))
4  L = {top k1 features with the highest TPR(f) − FPR(f) score}
5  For each f in L
6    If FPR(f) > TH then
7      Remove f from L
8  F1 = top N features selected by IG
9  F2 = top N features selected by CHI
10 F3 = top N features selected by Pearson correlation
11 CommonFeatures = (F1 ∩ F2) ∪ (F1 ∩ F3) ∪ (F2 ∩ F3)
12 S = L ∪ CommonFeatures
Return S

The rich features for each class are determined by the ACC2 (TPR−FPR) score [35], but there is a high chance that negative features are selected along with the rich features. Hence, a second-level filtration on the basis of FPR removes these weakly representative features.
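To make steps 1 and 2 concrete, the following Python sketch ranks features by the TPR−FPR score and then drops those whose FPR exceeds the threshold; the parameters k1 and th are assumed tunable values, and the counts follow the presence/absence definitions given above.

import numpy as np

def rbfr_rank_and_filter(X, positive, k1=1000, th=0.2):
    # X: binary document-term matrix (documents x features), presence/absence
    # positive: boolean array, True for documents of the target (positive) class
    X = np.asarray(X, dtype=bool)
    positive = np.asarray(positive, dtype=bool)
    pos, neg = positive, ~positive

    tp = X[pos].sum(axis=0)        # present in positive-class documents
    fn = (~X[pos]).sum(axis=0)     # absent in positive-class documents
    fp = X[neg].sum(axis=0)        # present in negative-class documents
    tn = (~X[neg]).sum(axis=0)     # absent in negative-class documents

    tpr = tp / np.maximum(tp + fn, 1)
    fpr = fp / np.maximum(fp + tn, 1)

    # Step 1: keep the k1 features with the highest TPR - FPR (ACC2-style) score.
    ranked = np.argsort(tpr - fpr)[::-1][:k1]
    # Step 2: remove features that also occur frequently in the negative class.
    return [int(f) for f in ranked if fpr[f] <= th]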

3.1. Feature Selection Methods

To increase the rate of representation, three popular feature selection methods, namely, information gain, chi-square, and Pearson correlation, are used to extract features. If a feature is selected by at least two of these methods, then that feature is also selected for classification as per equation (4):

$$F_{common} = (F_1 \cap F_2) \cup (F_1 \cap F_3) \cup (F_2 \cap F_3). \quad (4)$$

$F_1$, $F_2$, and $F_3$ in equation (4) represent the features selected by information gain, chi-square, and Pearson correlation, respectively. The details of these feature selection methods were briefed in Section 2.1.
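In code, equation (4) reduces to simple set operations; the sketch below is a minimal illustration, with f1, f2, and f3 standing for the feature lists returned by the three methods.

def common_features(f1, f2, f3):
    # features selected by at least two of the three filter methods (equation (4))
    f1, f2, f3 = set(f1), set(f2), set(f3)
    return (f1 & f2) | (f1 & f3) | (f2 & f3)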

3.2. Class Unique Features

A feature is important based on how well it represents a class. If a feature is present in only one class, then it is very important because it is unique to that class. Similarly, if a feature is present across many classes, then it is much less important. After the second level of filtration, a unique weight is calculated for each feature based on its occurrence across the classes. Consider Table 3, which displays the feature-wise and class-wise frequencies, where $f_{ij}$ represents the frequency of feature $i$ in class $j$. The first step is to remove the less frequent class-wise entries as per the condition in equation (5):

$$f_{ij} < f_{avg}, \quad f_{avg} = \frac{1}{m \times n}\sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}, \quad (5)$$

where $m$ is the number of features and $n$ is the number of classes. That is, the average of all frequency counts is calculated, and all entries whose frequency is less than this average are removed. Then, an inverse class frequency is calculated to find out whether a feature is common or rare, and the very important terms are filtered using a threshold value as described in equation (6):

$$ICF(f_i) = \log\left(\frac{C}{C_{f_i}}\right) > TH, \quad (6)$$

where $C$ is the total number of classes in the classification and $C_{f_i}$ represents the number of classes in which the feature $f_i$ appears.
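A Python sketch of this class-unique weighting, as reconstructed from the description of equations (5) and (6), is given below; the threshold th is an assumed tunable value.

import numpy as np

def class_unique_weights(class_freq, th=0.5):
    # class_freq: matrix of shape (n_features, n_classes); entry (i, j) is the
    # frequency f_ij of feature i in class j (the layout of Table 3)
    f = np.asarray(class_freq, dtype=float)
    # Equation (5): discard entries below the average of all frequency counts.
    f = np.where(f < f.mean(), 0.0, f)
    n_classes = f.shape[1]
    classes_per_feature = np.maximum((f > 0).sum(axis=1), 1)
    # Equation (6): inverse class frequency; high for class-unique features,
    # zero for features spread over every class.
    icf = np.log(n_classes / classes_per_feature)
    return np.where(icf > th, icf, 0.0)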

3.3. Machine Learning Models

The proposed feature selection algorithm is tested using five machine learning models which are briefed in the following subsections.

3.3.1. k Nearest Neighbor

kNN is a machine learning model that computes distances between instances. When a new sample needs to be classified, kNN finds the k closest neighbors of that instance, and the target class is determined by majority voting. Some statistical methods are used to fix the value of k before classification starts, and it is better to fix k as an odd number. kNN is called a lazy classifier because it does nothing in the training phase; the distance calculation and the majority voting are done only in the classification phase.

3.3.2. Naïve Bayes

One of the most widely used classifiers in the field of text classification is Naïve Bayes. This model works on the probability concepts of Bayes’ theorem. NB groups the instances based on similarity and determines the class of a new sample based on how strongly it is related to each class.

3.3.3. Support Vector Machines

Support vector machines are among the most used classifiers in the text classification domain. SVM can classify both linear and nonlinear data. Support vectors are the boundary points of each class. The SVM model fixes a linearly separable margin between the classes; this margin is used to classify the instances.

3.3.4. Random Forest

RF is an ensemble-based classifier that uses multiple decision trees (DTs). The number of DTs is fixed before classification starts. Each decision tree receives a unique subset of the input and is trained separately. Then, the output of each DT is used in majority voting to determine the final class.

3.3.5. Logistic Regression

LR is a classifier that is used to classify linearly separable data. LR constructs a decision boundary which separates the classes, and new instances are assigned a class based on where they reside with respect to this boundary.
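For reference, the five classifiers can be instantiated with scikit-learn as sketched below; the hyperparameters shown are illustrative defaults and not necessarily the values used in this study.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

models = {
    "kNN": KNeighborsClassifier(n_neighbors=75),      # distance-based, majority voting
    "NB": MultinomialNB(),                            # Bayes-theorem-based text classifier
    "SVM": LinearSVC(),                               # margin-based linear separator
    "RF": RandomForestClassifier(n_estimators=100),   # ensemble of decision trees
    "LR": LogisticRegression(max_iter=1000),          # linear decision boundary
}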

4. Results and Discussion

We have used three benchmark datasets for evaluating our proposed feature selection algorithm. Table 4 contains the descriptions of all the datasets.

4.1. Dataset Description

The three datasets differ in the number of instances, classes, and features, as shown in Table 4. We have randomly taken 2500 features from each dataset for our experiment.

4.2. Performance Evaluation

In order to test the performance of our proposed feature selection algorithm, we have used four standard metrics: accuracy, precision, recall, and F1-score. The formulas for calculating all the metrics are shown as follows:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}, \quad Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN}, \quad F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}.$$

All the documents are preprocessed: stemming and stop word removal are performed before classification, and a second level of filtering is done by identifying the frequent words, that is, features that are present in almost every document. The datasets are divided as per 10-fold cross-validation.
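A minimal evaluation sketch under these settings is shown below; it assumes the models dictionary from the sketch in Section 3.3, a prepared matrix X_selected of the chosen features, class labels y, and macro averaging for the multi-class datasets.

from sklearn.model_selection import cross_validate

scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]
for name, model in models.items():
    scores = cross_validate(model, X_selected, y, cv=10, scoring=scoring)
    print(name,
          scores["test_accuracy"].mean(),
          scores["test_precision_macro"].mean(),
          scores["test_recall_macro"].mean(),
          scores["test_f1_macro"].mean())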

The characteristics of the features selected by a feature selection algorithm can be analyzed to test its effectiveness. If unique features are selected and given high ranks, then the classification performance is likely to be good. Similarly, if irrelevant features are assigned higher ranks, classification performance will be very poor. The proposed feature selection method removes features with high false positive rates and thus provides a way to rank good features; this is one of the reasons for the good performance of each classifier. Along with the ranking, RBFR also considers the top features selected by three well-known filter-based methods, and the features common to them are selected. The precision, recall, and F1 comparisons are shown in Tables 5–7 for the Reuters, WAP, and 20 newsgroups datasets, respectively.

From the performance comparison tables, it is clear that the RBFR method identifies the rich features present in the corpus and ranks them higher than the irrelevant features. Precision is a good measure to judge a classification, as it indicates the quality of positive predictions. RBFR has higher precision in the majority of cases when compared with the other feature selection methods.

The ensemble of three filter-based feature selection methods increases the chance of selecting rich features. As the selected features contain high-level features, the classification performance with the RBFR method is much higher than that obtained with other feature selection algorithms. Figures 1–3 display the accuracy of kNN on the three datasets. We have also compared our proposed feature selection algorithm with two other works, and Table 8 shows the comparison.

The number of features participating in the classification process is important, as it is responsible not only for increasing the efficiency of classification but also for reducing information redundancy. To avoid degrading classification performance, the number of features should be selected optimally: if the feature size is very high, the training time increases rapidly, and if it is too small, the accuracy becomes very low. Hence, the optimal number of features is determined by linearly increasing the number of features and stopping when performance degradation is observed.
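This search can be sketched in Python as follows; the step size, the upper bound, and the use of cross-validated accuracy as the stopping criterion are assumptions for illustration.

from sklearn.model_selection import cross_val_score

def optimal_feature_count(ranked_features, X, y, model, step=200, max_n=2500):
    # ranked_features: feature indices ordered by the RBFR ranking
    # linearly increase the number of top-ranked features and stop once
    # cross-validated accuracy starts to degrade
    best_acc, best_n = 0.0, step
    for n in range(step, max_n + 1, step):
        cols = ranked_features[:n]
        acc = cross_val_score(model, X[:, cols], y, cv=10, scoring="accuracy").mean()
        if acc < best_acc:      # performance degradation observed: stop
            break
        best_acc, best_n = acc, n
    return best_n, best_acc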

In our experiment, we noticed that the optimal feature size is 600; beyond that, the accuracy of the classifiers tends to decrease. Among the classifiers, random forest maintains increased accuracy even beyond 600 features because it can reduce the dimensionality by branching over the data; up to 1400 features, the random forest classifier produces acceptable accuracy.

From Figure 4, it can be seen that among the existing feature selection methods, our proposed method delivers better performance in terms of accuracy, and the SVM classifier produces the best accuracy when the number of features is 600. From the analysis, it can be seen that as the number of features increases, there is a positive trend in classification performance, because more knowledge can be derived in the training stage to improve accuracy. However, information duplication may arise when the number of features is increased too much; hence, an optimal count is preferred.

The number of neighbors plays a critical role in kNN classification. From Figure 5, it can be observed that as the number of neighbors increases, the performance also increases, but after 75 neighbors, the classifier stabilizes. The proposed feature selection produces better results than the other feature selection methods because of the removal of noisy and redundant features.

5. Conclusions

Feature selection is one of the important stages in improving the performance of text classification. The existing feature selection methods can identify rich features present in the text corpus, but many irrelevant features are also selected, which degrades the performance of text classification. In this work, we propose a ranking-based feature selection model which can identify and eliminate the irrelevant features from the selection set. We have implemented the proposed feature selection model on three datasets and compared it with five existing filter-based feature selection methods, namely, ACC2, NDM, CHI, GI, and IG. The machine learning models used for classification were kNN, SVM, NB, LR, and RF. The experimental results show that NB performs best on the classification task with 93.96% accuracy. In future work, we aim to rank the features based on their semantics and to implement deep learning-based classification.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this article.