Abstract
The number of scientific publications is growing exponentially. Research articles cite other work for various reasons, and citations have therefore been studied extensively to relate documents to one another. It has been argued that not all references carry the same level of importance, so it is essential to understand the reason for a citation, called the citation intent or function. Textual information can contribute substantially when modern natural language processing techniques are applied to capture its context. In this paper, we use contextualized word embeddings to obtain numerical representations of text features and investigate the performance of various machine-learning techniques on these representations. Each classifier was evaluated on two state-of-the-art datasets containing the text features. On the unbalanced dataset, we observed that the linear Support Vector Machine (SVM) achieved 86% accuracy for the “background” class, for which the training data were plentiful. For the remaining classes, including “motivation,” “extension,” and “future,” the models were trained on fewer than 100 records each, and accuracy was therefore only 57 to 64%. On the balanced dataset, the classes achieved similar accuracy because they were trained on training sets of the same size. Overall, SVM performed best on both datasets, followed by the stochastic gradient descent classifier; SVM can therefore produce good results for text classification on top of contextual word embeddings.
1. Introduction
The growth of scientific publishing has made it difficult for researchers to find important, relevant research. Citations have long been studied to identify influential studies [1]. However, not all citations within a research article play the same role. Articles may be cited for different reasons, and the intensity of relatedness therefore varies. Moravcsik and Murugesan [2] argue that most references within articles serve to explain the work and provide background knowledge about the research problem. Teufel et al. [3] categorized citations into three classes with a positive, weak, or neutral relationship to the citing paper. Jurgens et al. [1] claimed that a citation may be made for six different reasons and that the strength of relevance differs across these categories.
Various attempts have been made to understand the reason and intent behind a citation. The most recent techniques have used deep networks to read the citation context of a citation [4–7]. They set a window for extracting the citation context; the window boundaries typically contain the paragraph in which the citation is made and may also include the sentences before and after it. An example of a citation context is given in Figure 1, in which the work is cited to compare the proposed methodology with the cited work.

Other approaches have utilized the bibliographic information of research articles, which creates a network of citation nodes whose edges represent citing relationships [9]. These approaches find relationships among cited papers reasonably well but usually fail to provide the reason for a citation, as they give the same weight to every reference. Metainformation has also been used extensively for citation intent extraction; however, studies based on such text features are limited to the statistical similarity of the articles and normally do not examine the internal context of those features [10]. New advances in natural language processing, especially word embeddings, have made it possible to capture the context of the text and label it with an intent class [11].
This paper evaluates a number of classification methods after converting the text information into a numerical representation. We used the Association for Computational Linguistics Anthology Reference Corpus (ACL-ARC) and the Science Citation (SciCite) datasets, discussed in the next section, to extract text features related to citation records. The ultimate goal of classification was to find the citation intent based on our selected list of text features. The experiments show that the linear support vector machine (Linear-SVM) classifier performed well on both datasets. We also evaluated the classifiers on the prediction of each individual citation intent class. The results show that the algorithms performed particularly well for the classes whose training sets were large; for example, with Linear-SVM, the “background” class has an F1 score of 86%, while other classes, including “future” and “extension,” have 65% and 61%, respectively. The objectives of this study are as follows:
(1) Understanding the impact of text features on citation intent classification when using contextual encoding
(2) Evaluating the results and comparing the classification models for citation intent labelling
(3) Understanding the impact of training set size on the classifiers’ bias towards individual citation classes
(4) Exploiting authorship and title information for citation intent classification
The rest of the paper is organized as follows: in Section 2, we introduce existing citation intent classification methods and the number of labelled classes. Section 3 discusses the proposed study framework. The details of each of the steps are further discussed in the subsections of Section 3. Section 4 evaluates the classification models and compares the results. Finally, we conclude our study in Section 5.
2. Related Work
The citation intent, also called the reason for a citation or the citation function, has long been studied to analyze relationships among research articles. Each article has, on average, 40 references, and the number of referenced articles within a research paper is growing over time [12]; it is therefore essential to understand why a paper has been cited. This section discusses various attempts to identify the reason for a citation.
Roman et al. [4] used contextual embeddings to capture the context of a citation. They used an automated method for annotating an unannotated dataset with citation intent and achieved good precision, recall, and F1 scores. They also developed a vast dataset containing one million labelled citation contexts, named C2D-I, and presented it as a new state-of-the-art dataset for designing citation intent approaches. C2D-I annotates the intent in three classes: background, method, and result. Although they successfully developed the large labelled dataset required for deep learning, they did not develop a recommender system to identify the citation reason; their method was solely for dataset annotation, not for citation reason identification.
Hassan et al. [13] proposed a deep-learning-based approach for classifying the importance of a citation within a list of referenced papers. They argued that not all references have the same degree of relevancy. They used a Long Short-Term Memory- (LSTM-) based [14] deep-learning model to distinguish between important and unimportant citations. They also presented a machine-learning classification model that selects the best-performing features using a Random Forest (RF) classifier [15]. The authors listed 14 features of a citation context that describe the reason for citation, beyond labelling a citation as important or unimportant.
Cohan et al. [16] criticized predefined hand-engineered features, such as linguistic patterns extracted from paper content, and borrowed the idea of scaffolding from Swayamdipta et al. [17]. They assumed that better representations could be learned directly from the data and proposed a multitask framework to incorporate knowledge from the structure of a paper. Their framework incorporates two tasks as structural scaffolds: (1) predicting the section title and (2) predicting whether a citation is needed. The framework also predicts the citation intent of a citation as the background, method, or result class. They also created the SciCite dataset of 11,020 instances from 6,627 papers by crowdsourcing. The authors compared their model with the previous state-of-the-art method of Jurgens et al. [1] for citation intent classification and achieved better results in terms of precision, recall, and F1 measures. They used pattern-based features, including sequences of phrases, parts of speech, lexical categories depicting positive or negative sentiment, and specific cue phrases such as “we extended” and “compared with the previous state-of-the-art method.” They borrowed the list of patterns from Teufel [18] and extended it with newly identified patterns and categories. They further explored topic-based features, arguing that the thematic framing of a topic can point to the citation function; for example, a citation context describing a methodology is more likely related to the “uses” function, whereas a citation context providing a definition belongs to the “background” class.
They also explored prototypical argument features, investigating a list of arguments that reflect a particular class of citations. To construct these features, they identified arguments that frequently occur in particular syntactic positions; for example, the words “follow,” “unfold,” and “extend” frequently occur for the “extend” class of citations. A vector representing the occurrences of an argument is created, and the average of those occurrences determines the similarity of a citation to a citation class. This study used natural language processing features in detail to measure the citation reason and importance and has proved to be state-of-the-art research in this area. It also demonstrated that authors are sensitive to discourse structure and publication venue when citing a research paper.
Table 1 provides the list of citation intent classes. The table also lists the datasets in which each class is used, together with example citation contexts taken from these datasets that belong to the corresponding intent classes.
3. Proposed Study Framework
In this section, we discuss the various steps of the proposed study, as depicted in Figure 2. The flow of the proposed study starts with the data processing and cleaning step, followed by the conversion of text data into a numeric representation. After the conversion, we apply different classification algorithms by feeding this data to the input layer of the classifiers. Finally, we gather the results and compare various evaluation measures to assess the performance of the classification algorithms. The data preparation and preprocessing step is discussed in detail next.

3.1. Data Preparation
The data preparation step starts with the extraction of data for our study. We used two state-of-the-art datasets, ACL-ARC and SciCite, which are publicly available and widely used for citation intent classification. The ACL Anthology Reference Corpus (ACL-ARC) is an Artificial Neural Network- (ANN-) based citation intent classification dataset [1, 19]. The dataset has around 2,000 records and a number of features, including the citation context in which the in-text citation is placed; the citing and cited paper_id, which can be used to access the paper details through a web service; the publication years, paper titles, and author ids; an extended context with more information around the in-text citation; the section number and section title; the citation marker offsets; the sentence before the citation context; and, most importantly, the citation intent specifying the reason for the reference. The citation intent in the ACL-ARC dataset has six classes, described in Table 2.
The second dataset that we used is the SciCite dataset [3]. This dataset achieved a 13 percent increase in F1 score in comparison to ACL-ARC. Along with some other less important features, the dataset includes the name of the section in which the in-text citation is placed, the citing and cited paper ids, the citation context, the citation intent class, and the confidence level of the annotated citation intent class. The features included in the dataset are minimal, and only a few match the features listed in ACL-ARC. This second state-of-the-art dataset annotates citation intent in only three classes: background, method, and result. It is about five times larger than the ACL-ARC dataset, with 9,159 instances and the citation intent distribution listed in Table 2.
To keep the datasets consistent and to compare and evaluate the results on both of them, we created a balanced version of SciCite that includes the missing features required for our study. As the name implies, the balanced version of SciCite contains an equal number of instances in each class. We used the Semantic Scholar API (https://api.semanticscholar.org/), passing the citing and cited paper IDs, to extract the missing feature information.
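As an illustration, the missing metadata can be retrieved along the following lines. This is a minimal sketch that assumes the citing and cited IDs in the dataset are valid Semantic Scholar paper identifiers and that the public Graph API endpoint and field names shown here are appropriate; the example paper ID is purely illustrative.

```python
# Minimal sketch: retrieving missing paper metadata (title, year, authors) from the
# Semantic Scholar Graph API. The endpoint, field names, and example paper ID are
# illustrative; a valid Semantic Scholar paper identifier from the dataset is assumed.
import requests

API_URL = "https://api.semanticscholar.org/graph/v1/paper/{}"

def fetch_paper_metadata(paper_id, fields="title,year,authors"):
    """Return the requested metadata fields for one paper ID, or None on failure."""
    response = requests.get(API_URL.format(paper_id), params={"fields": fields}, timeout=10)
    if response.status_code != 200:
        return None
    return response.json()

# Example usage with a placeholder paper ID taken from a citing/cited column.
meta = fetch_paper_metadata("649def34f8be52c8b66281af98ae884c09aef38b")
if meta:
    print(meta["title"], meta["year"], [a["name"] for a in meta.get("authors", [])])
```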
3.2. Preparation of Textual Information
This study is based on the features selected from both of the datasets discussed in the previous section. Table 3 provides the list of features selected from both of the datasets for our study. The table also provides the reason for choosing these particular features as input for machine-learning classifiers.
The features contain information in text form and, therefore, require natural language preprocessing before they can be used as inputs. The following operations are performed as data preparation steps.
3.2.1. Tokenization
This task breaks paragraphs or sentences into words, using whitespace or special characters as token separators.
3.2.2. Stop Word Removal
Stop words are words that occur frequently in text but have no significant impact on the topic under discussion; they typically include function words such as articles, prepositions, and pronouns. The Natural Language Toolkit (NLTK) [27] defines an extensive list of stop words in sixteen different languages.
3.2.3. Removing Punctuation and White Spaces
We extended the NLTK stop word list in Python by adding numbers and special characters to it, so that they are removed along with the stop words.
3.2.4. Case Conversion
Regardless of the position of words in a sentence, we converted the text to lowercase so that letter case does not affect the representation of a word.
3.2.5. Stemming
Kantrowitz et al. [28] studied the effects of stemming on word embedding using TF-IDF and showed that it yields remarkable results. Stemming is a language-specific task that converts words from their derived forms to their root forms. We used the NLTK package to stem the terms of our text data.
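A minimal sketch of these preprocessing steps using NLTK is given below; the Porter stemmer and the exact stop-word extensions shown are illustrative choices.

```python
# Minimal preprocessing sketch with NLTK: tokenization, stop-word and punctuation
# removal, case conversion, and stemming. The Porter stemmer and the extended
# stop-word list are illustrative choices.
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

# Extend the NLTK stop-word list with punctuation marks (Section 3.2.3).
stop_words = set(stopwords.words("english")) | set(string.punctuation)
stemmer = PorterStemmer()

def preprocess(text):
    """Return a cleaned, stemmed list of tokens for one citation context."""
    tokens = word_tokenize(text.lower())                                    # tokenization + case conversion
    tokens = [t for t in tokens if t not in stop_words and not t.isdigit()] # stop words, punctuation, numbers
    return [stemmer.stem(t) for t in tokens]                                # stemming to root forms

print(preprocess("We extended the approach proposed in previous studies of citation intent."))
```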
Once the text data are in a cleaner form, we need to convert this cleaned text into a numerical form, because machine-learning algorithms require a numerical representation of information for processing; this is discussed in the next section.
3.3. Numerical Representation of Text Data
The raw text data are converted into a numerical representation such that similar words lie close to each other in the vector space. We used word embeddings for this numeric representation. Table 4 discusses various types of word embeddings along with their strengths and weaknesses. We selected BERT word embeddings because BERT is good at capturing contextual information from text and has been used by Roman et al. [4] for a similar task. BERT is built on the transformer model [35, 36], which uses an encoder-decoder architecture to produce the vector representation. We used the Transformers library [37] for the BERT implementation in Python on the Kaggle platform (https://www.kaggle.com/).
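A minimal sketch of obtaining a fixed-length vector for a citation context with the Transformers library is shown below; the bert-base-uncased checkpoint and mean pooling over the token vectors are illustrative choices.

```python
# Minimal sketch: contextual BERT embeddings with the Hugging Face Transformers
# library. The bert-base-uncased checkpoint and mean pooling over the last hidden
# states are illustrative choices.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(text):
    """Return a 768-dimensional vector for one citation context."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token vectors of the last layer into a single sentence vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0).numpy()

vector = embed("This method extends the model proposed in earlier work on citation analysis.")
print(vector.shape)  # (768,)
```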
3.4. Classification Models
Once the data have been converted into a numeric representation, with similar words lying close together in the vector space, we are ready to feed this information to the citation classification models and evaluate the results to determine the best classification algorithm for citation intent class prediction. Classification methods assign predefined classes to the feature data. To define our problem, we consider a training dataset $D = \{(x_1, c_1), (x_2, c_2), \ldots, (x_n, c_n)\}$ of $n$ records, where each record $x_i$ is assigned a citation intent class $c_i$ from the set of classes $C$.
The task is to find the best classification method $f$, where $f$ can assign an accurate citation intent $f(x) \in C$ to a new instance $x$. To study the accuracy of the classifiers, we selected a number of classification algorithms that have proved effective for natural language processing tasks, listed in Table 5. The steps performed in this stage are listed below and depicted in Figure 3; a minimal code sketch of these steps follows the list.
(1) The classification models were provided with the input parameters, listed in Table 5, from the ACL-ARC and SciCite datasets; 80% of the records were provided as training data.
(2) We trained a model based on the input parameters, adjusting the input weights for the target class of citation intent.
(3) The trained model was then used to predict the remaining 20% of the records.
(4) The predicted citation class was compared with the actual class of each input.
(5) To guard against drawing conclusions from a single split, we calculated the average accuracy by repeating the experiments multiple times.
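The sketch below illustrates these five steps with scikit-learn; the LinearSVC estimator and five repetitions are illustrative choices, not reported settings.

```python
# Minimal sketch of steps (1)-(5): repeated 80/20 splits, training a linear SVM on
# the embedded text features, predicting the held-out citation intents, and averaging
# the accuracy. LinearSVC and five repetitions are illustrative choices.
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def average_accuracy(X, y, runs=5):
    """X: embedding matrix of shape (n_samples, 768); y: citation intent labels."""
    scores = []
    for seed in range(runs):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=seed, stratify=y)  # step (1)
        clf = LinearSVC().fit(X_train, y_train)                   # step (2)
        y_pred = clf.predict(X_test)                               # step (3)
        scores.append(accuracy_score(y_test, y_pred))              # step (4)
    return float(np.mean(scores))                                  # step (5)

# Example usage, where X and y come from the embedding step above:
# print(average_accuracy(X, y))
```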

After setting these general guidelines and executing the steps discussed above, we performed the experiments and compared the selected machine-learning algorithms, as discussed in the next section.
4. Result Analysis and Comparisons
After training the models listed in Table 5, we performed experiments on the testing part of the datasets. In this section, we discuss the results of each model using the precision, recall, and F1 measures. Precision is the fraction of predicted instances of a class that are actually correct, while recall is the fraction of actual instances of a class that are correctly identified. Increasing one typically decreases the other, and therefore the harmonic mean of these two values, the F1 measure, is also calculated. By evaluating the results against these measures, we want to determine which model performed best compared to the others.
A multiclass confusion matrix, created using the sklearn [44], NumPy, and seaborn libraries, is shown in Figures 4 and 5 for the ACL-ARC and SciCite datasets, respectively. The confusion matrix describes the number of true positive, false positive, true negative, and false negative predictions for each class in the respective dataset. The calculation of precision is based on the true positive and false positive counts: the number of true positives is divided by the sum of true positives and false positives.
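A minimal sketch of producing such a confusion matrix and the per-class precision, recall, and F1 scores with these libraries is given below; the label arrays are illustrative placeholders, and in the experiments they come from the trained classifiers of the previous step.

```python
# Minimal sketch: multiclass confusion matrix and per-class precision, recall, and F1
# with scikit-learn and seaborn. The label arrays below are illustrative placeholders.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix

y_test = ["background", "method", "result", "background", "method", "background"]
y_pred = ["background", "method", "background", "background", "result", "background"]
labels = ["background", "method", "result"]

cm = confusion_matrix(y_test, y_pred, labels=labels)
sns.heatmap(cm, annot=True, fmt="d", xticklabels=labels, yticklabels=labels)
plt.xlabel("Predicted intent")
plt.ylabel("True intent")
plt.show()

# Per-class precision, recall, and F1 score, plus macro averages.
print(classification_report(y_test, y_pred, labels=labels, digits=2))
```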
Figure 4: Confusion matrices (panels (a)–(f)) of the classifiers on the ACL-ARC dataset.
Figure 5: Confusion matrices (panels (a)–(f)) of the classifiers on the SciCite dataset.
A multiclass confusion matrix for the linear regression classifier on the ACL-ARC dataset is given in Table 6. We use this table to present a sample calculation of precision, recall, and F1 score. For a class $c$, precision is defined as $\mathrm{Precision}_c = TP_c / (TP_c + FP_c)$, and the precision of a model is the average of the precisions of its classes. Calculating the per-class precisions from the entries of Table 6 and averaging them gives an average precision of 73% for the linear regression classifier. The precisions of the other classifiers are calculated in the same way and are listed in Tables 7 and 8 for the ACL-ARC and SciCite datasets, respectively. The second evaluation measure is recall, which is the proportion of actual positives correctly identified: $\mathrm{Recall}_c = TP_c / (TP_c + FN_c)$, that is, the true positives divided by the sum of true positives and false negatives.
Calculating the per-class recalls of linear regression from Table 6 and averaging them in the same way gives an average recall of 66% on the ACL-ARC dataset.
Precision and recall are always in tension, and increasing one results in decreasing the other. Therefore, a third measure called the F1 score is used, which is the harmonic mean of the two previously calculated measures, given by $F_1 = 2 \cdot \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$.
A sample calculation of the F1 score for linear regression on the ACL-ARC dataset gives an average F1 score of only 63%. Although the F1 score of some of the citation intent classes is very high (83% in the case of the background class), the overall F1 score of this classifier is considerably lower. This is because of the unbalanced nature of the ACL-ARC dataset: some of the other classes have very few records in the dataset, and their training could therefore not be performed well.
Tables 7 and 8 provide a complete list of precision, recall, and F1 scores for each classifier, and the overall accuracy of the classifiers is shown in Figures 6 and 7. Linear-SVM has the highest accuracy on both datasets, at 78.49% and 77.8%, respectively. The measures of the background class on the ACL-ARC dataset are much higher than those of the other classes because ACL-ARC is not a balanced dataset, and the classifiers are therefore biased towards the classes with more training records. The motivation, extension, and future classes have the lowest F1 scores because of their small training data size, with fewer than 100 records in each case. To further validate this conclusion, we observed that, on our balanced SciCite dataset, the F1 scores of the classes are very close to each other, with the result class having the highest. The SGD classifier has the second-highest accuracy, with little difference from the linear regression classifier.


5. Conclusions
Understanding the reason for a citation in a research article is crucial for identifying the most relevant related documents. Machine learning performs well in classifying numeric metadata, and advances in natural language processing have made it possible to convert text data into vector representations. These vectors can then be passed to classification algorithms to annotate the records in a scientific dataset. We used BERT, a contextualized word representation, to convert the text data into vectors. The classifiers were then evaluated on two state-of-the-art datasets, ACL-ARC and SciCite. The trained models performed well, especially on our balanced version of SciCite. Linear SVM achieved an 86% F1 score on the “background” class, for which there were more than 1,000 training records, whereas for the citation intent classes with fewer than 100 training records it achieved only a 57 to 64% F1 score. On the balanced dataset, the accuracies of SVM and the other algorithms differed little. This study utilized only the text features of the datasets. In future work, the metadata features and the NLP features consisting of text information could be combined to classify the citation intent class.
Data Availability
The data are available upon request from the corresponding author.
Conflicts of Interest
The authors declare no conflicts of interest in this research study.
Acknowledgments
This work was supported in part by the Counterpart Service for the Construction of Xiangyang Science and Technology Innovation China Innovative Pilot City.