Abstract
Context-aware citation recommendation aims to automatically predict suitable citations for a given citation context, which is highly helpful for researchers writing scientific papers. In existing neural network-based approaches, overcorrelation in the weight matrix distorts semantic similarity measurements, and this problem is difficult to solve. In this paper, we propose a novel context-aware citation recommendation approach that substantially improves the orthogonality of the weight matrix and explores more accurate citation patterns. We quantitatively show that the various reference patterns in a paper have interactional features that can significantly affect link prediction. We conduct experiments on the CiteSeer dataset. The results show that our model is superior to baseline models in all metrics.
1. Introduction
Citation recommendation, which helps researchers quickly find appropriate relevant literature, is a rapidly developing research area [1]. Within this area, context-aware citation recommendation is a particular type that predicts citations for a given citation context [2]. The citation context is usually a few sentences before and after the citation placeholder, such as “[]”. The key problem for context-aware citation recommendation is how to measure the similarity between the citation context and a specific scientific paper.
Similar to other NLP tasks (e.g., information retrieval (IR) and text mining), the simplest solution for context-aware citation recommendation calculates a relevance score between a citation context and candidate papers via Euclidean distance [3] and then selects the most salient citations. However, simple text similarity is obviously too coarse to be a good measurement. In recent years, neural network models have been widely used to recommend documents due to their efficiency and effectiveness [4–7]. Neural network models can be regarded as better solutions than traditional machine learning methods because they simplify feature engineering and can handle large-scale data. However, the weight vectors in existing neural network-based models are usually strongly correlated. In fact, a critical assumption behind similarity measurements such as Euclidean distance or cosine distance is that the entries in the feature vectors should be as independent as possible [8]. When the weight vectors are overcorrelated, some entries of the descriptor dominate the measurement and cause poor ranking results. This problem seriously affects the performance of citation recommendation because citing activity itself exhibits strong orthogonality. Assume there are three types of citations in a paper: “field-references” (red in Figure 1), “method-references” (purple), and “math-references” (blue). A “field-reference” usually appears in the introduction and cites scientific articles that use the same techniques in other research fields. A “method-reference” usually appears in the related work and cites scientific articles solving the same task. A “math-reference” usually appears in the main part of the paper that describes the method in detail, and its citations are more related to mathematical theorems. These three types of citations obviously have strong orthogonality. In a neural network model, the three citation types are mapped into a weight matrix whose vectors can be seen as basis vectors for the inputs. As shown in Figure 1, the basis vectors learned by traditional neural network models are not orthogonal. When a sample is mapped by such correlated basis vectors, the correlated directions dominate the output and consequently yield low discriminative ability. A more satisfactory set of basis vectors (yellow in Figure 1) imposes orthogonality.
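To make the effect concrete, the following NumPy sketch (with made-up basis vectors and inputs, not taken from the paper) compares cosine similarity computed under a correlated basis with the same similarity under an orthogonalized basis; the correlated directions inflate the score and blur the distinction between two clearly different inputs.

```python
import numpy as np

# Columns are weight vectors: two correlated directions ("field"/"method")
# plus one independent direction ("math"); all values are illustrative.
W_corr = np.array([[1.0, 0.9, 0.0],
                   [0.0, 0.1, 0.0],
                   [0.0, 0.0, 1.0]])
# An orthogonal alternative spanning a comparable space.
W_orth, _ = np.linalg.qr(W_corr)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Two inputs that differ mainly along the correlated directions.
x1 = np.array([1.0, 0.0, 0.2])
x2 = np.array([0.0, 1.0, 0.2])

print("correlated basis:", cosine(W_corr.T @ x1, W_corr.T @ x2))  # inflated (~0.43)
print("orthogonal basis:", cosine(W_orth.T @ x1, W_orth.T @ x2))  # faithful (~0.04)
```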

To address the aforementioned problems, we propose a neural network model with orthogonal regularization for context-aware citation recommendation. Our model uses a CNN to extract semantic features from the citation context and candidate papers. We then add an SVD-based orthogonal constraint to weaken the correlation of the weight vectors in the fully connected (FC) layer, which helps the model learn interpretable features for citation contexts and papers. To the best of our knowledge, this is the first work that addresses context-aware citation recommendation with a combined CNN and orthogonal constraint framework. Experimental results show that our model significantly outperforms the baseline methods.
2. Related Work
2.1. Citation Recommendation
A variety of citation recommendation approaches have been proposed in the literature, including text similarity-based [9, 10], topic model-based [11, 12], probabilistic model-based [13], translation model-based [7], and collaborative filtering-based [14] methods. Sun et al. [15] proposed a method for recommending appropriate papers to academic reviewers using a similarity-based algorithm: their method builds a preference vector for each reviewer based on publication history and calculates the similarity between the preference vector and each candidate document vector; literature with high similarity is recommended to the corresponding reviewer. Shaparenko and Joachims [16] considered the relevance between the citation context and the paper content and applied a language model to the recommendation task. Strohman et al. [17] showed that using text similarity alone is not ideal for recommending citations, because scholars tend to coin new terms to describe their own achievements, while two scholars who study the same topic may use different expressions for the same concept or method. To address this problem, Strohman et al. [17] regarded each document as a node in a directed graph to perform citation recommendation; they believed that a similarity measurement incorporating reference information can reflect the citation situation of a node more faithfully. Livne et al. [18] proposed a citation recommendation method by coupling the enriched citation context of the literature and adopted various techniques, including machine learning, when making recommendations. Some works addressed the language gap between cited papers and citation contexts and attempted to use translation models or distributed semantic representations. Lu et al. [19] assumed that the languages used in the citation contexts and in the cited papers are different and used a translation model to bridge this gap. He et al. [3] combined a language model, a topic model, and a feature model to find the appropriate citation context. Huang et al. [20] treated the appearance of cited papers as a particular language and represented the cited papers by unique IDs regarded as new “words”; the probability of citing a paper given a citation context is then directly estimated by a translation model. Tang et al. [21] proposed a joint embedding model to learn a low-dimensional embedding space for both contexts and citations.
In recent years, neural networks have shown better performance in many fields, and some researchers have attempted to recommend citations with them. Huang et al. [4] learned a distributed word representation for the citation context and an associated document embedding via a feedforward neural network and then estimated the probability of citing a paper given a citation context. Tan et al. [5] proposed a neural network method based on LSTM to solve quote recommendation tasks. They focused on the characteristics of quotes and trained neural networks to bridge the language gap; their model learned semantic representations of arbitrary-length texts from a large corpus.
2.2. Orthogonal Constraint in Deep Learning
One of the greatest advantages of orthogonal matrices is that the norm of a vector is unchanged when it is multiplied by an orthogonal matrix. This property is useful in gradient backpropagation, especially for dealing with gradient explosion and gradient vanishing problems. Orthogonal regularization is widely used in many fields. Brock et al. [22] used orthogonal regularization to improve the generalization performance of generative image editing tasks based on generative adversarial networks (GANs) [23]. They further expanded their work into BigGAN [24]. Their results showed that, with orthogonal regularization applied to the generator, truncating the latent space allows fine-tuning the tradeoff between the fidelity and diversity of samples, which enables the model to achieve the best performance in class-conditional image synthesis. Another advantage of orthogonal matrices is that they benefit deep representation learning. If the weight vectors of the fully connected layer in a convolutional neural network are highly correlated, the entries of each fully connected descriptor will also be highly correlated, which greatly reduces retrieval performance. Sun et al. [25] proposed SVDNet, showing that enforcing orthogonality on the weight vectors of the FC layer improves accuracy. Zheng et al. [26] reported that regularization is an efficient way to improve the generalization ability of deep CNNs because it makes it possible to train more complex models while maintaining lower overfitting; they proposed a method for optimizing the feature boundary of a deep CNN through a two-stage training procedure to reduce overfitting. However, the mixed features learned by a CNN potentially reduce the robustness of network models for identification or classification. To address this problem, Wang et al. [27] decomposed deep face features into two orthogonal components representing age-related and identity-related information to learn age-invariant deep face features; with this decomposition, age-invariant deep features can be effectively obtained to improve age-invariant face recognition (AIFR) performance. Chen et al. [28] proposed a group orthogonal convolutional neural network (GoCNN) based on the idea of learning different groups of convolutional functions that are “orthogonal” to those in other groups, i.e., with no significant correlation among the produced features. Optimizing orthogonality among convolutional functions reduces redundancy and increases diversity within the architecture. Moreover, a single CNN model with sufficient inherent diversity can be obtained, such that the model learns more diverse representations and has stronger generalization ability than vanilla CNNs.
3. Proposed Method
3.1. Problem Formulation
The context-aware citation recommendation task is defined as a matching task between a citation context and candidate papers. The main architecture of our model is shown in Figure 2. Our model is a convolutional neural network with two inputs and orthogonal constraints. It consists of the following main steps (a minimal sketch of the pipeline is given after the list):
(1) We adopt word2vec to obtain the raw input vectors and then use CNNs to extract multiple-granularity semantic features.
(2) The multiple-granularity semantic features are then made orthogonal by an SVD-FC layer.
(3) We use fully connected layers to obtain the final vector representation. A logistic function or an SVM is used to obtain the recommendation result.
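The following PyTorch sketch illustrates the overall pipeline under assumed shapes and layer sizes; the class name `SVDCNNSketch`, the kernel count, and the hidden width are placeholders, and the SVD constraint itself is applied during training (Section 3.3) rather than shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SVDCNNSketch(nn.Module):
    """Two-input CNN sketch: context and candidate paper share one
    convolution/pooling stack; their pooled vectors are compared by
    cosine similarity and classified by a logistic output layer."""

    def __init__(self, emb_dim=300, n_kernels=50, window=3):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, n_kernels, kernel_size=window, padding=window - 1)
        self.fc = nn.Linear(n_kernels * 2 + 1, 64)   # the SVD constraint targets this layer
        self.out = nn.Linear(64, 1)

    def encode(self, x):                              # x: (batch, seq_len, emb_dim) word2vec inputs
        h = torch.tanh(self.conv(x.transpose(1, 2)))  # wide convolution over word windows
        return h.mean(dim=2)                          # "all-ap": column-wise average pooling

    def forward(self, context, paper):
        c, p = self.encode(context), self.encode(paper)
        sim = F.cosine_similarity(c, p, dim=1, eps=1e-8).unsqueeze(1)
        feats = torch.cat([c, p, sim], dim=1)
        return torch.sigmoid(self.out(torch.tanh(self.fc(feats))))

model = SVDCNNSketch()
ctx = torch.randn(4, 100, 300)   # 4 contexts, 100 words, 300-d word2vec vectors
doc = torch.randn(4, 200, 300)
print(model(ctx, doc).shape)     # torch.Size([4, 1]) citation probabilities
```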

3.2. Network Structure
3.2.1. Input Layer
Word2vec [29] is used to embed the input of our model. Each word is represented as a precomputed $d$-dimensional vector; in our experiments, $d = 300$. As a result, each sentence of $s$ words is represented as a feature matrix of dimension $s \times d$. Through this layer, we obtain the raw representations of the citation context $c$ and the candidate document $p$.
We also calculate a weight for common words according to the inputs. We then obtain the basic input features for our model as the product of the term frequency $\mathrm{tf}(w, p)$ and the inverse document frequency $\mathrm{idf}(w)$, which reflects how important a word $w$ in the citation context $c$ is for a candidate document $p$ in the corpus [30]. These two quantities are calculated as follows:

$$\mathrm{tf}(w, p) = \frac{n_{w,p}}{\max_{k} n_{k,p}}, \qquad \mathrm{idf}(w) = \log \frac{N}{|\{p \in D : w \in p\}|},$$

where $n_{w,p}$ is the number of times word $w$ appears in document $p$, $\max_{k} n_{k,p}$ is the occurrence count of the word that appears most frequently in the candidate document $p$, $|\{p \in D : w \in p\}|$ is the number of documents containing the word $w$ among all candidate citations $D$, and $N$ is the total number of candidate citations.
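A minimal Python sketch of this common-word weighting, assuming the standard tf-idf reading of the definitions above (function and variable names are illustrative):

```python
import math
from collections import Counter

def tf_idf_weight(word, document, corpus):
    """tf: the word's count in the candidate document normalized by the most
    frequent word's count; idf: log(N / n_w) over all candidate documents."""
    counts = Counter(document)
    tf = counts[word] / max(counts.values())
    n_w = sum(1 for doc in corpus if word in doc)
    idf = math.log(len(corpus) / n_w) if n_w else 0.0
    return tf * idf

corpus = [["deep", "citation", "network"], ["topic", "model", "citation"]]
print(tf_idf_weight("deep", corpus[0], corpus))   # ~0.693
```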
3.2.2. Convolution Layer
The inputs of the convolution layer are the feature matrices of the citation context $c$ and the document $p$. The process of this layer is shown in Figure 3. We first pad the two inputs with zero vectors so that they have the same length. For each input, let $x_1, x_2, \dots, x_s$ be the word vectors in a sentence. We define $z_i$, $1 \le i \le s + w - 1$, as the concatenation of $w$ consecutive word vectors around position $i$ (with zero padding outside the sentence). This layer then generates features for the phrases as follows:

$$p_i = \tanh(W_c z_i + b),$$

where $W_c$ is a convolution kernel and $b$ is the bias.
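A NumPy sketch of this wide convolution over concatenated word windows; the kernel shape, window size $w = 3$, and random inputs are assumptions for illustration:

```python
import numpy as np

def convolve_phrases(X, W, b, w=3):
    """p_i = tanh(W z_i + b), where z_i concatenates w consecutive
    (zero-padded) word vectors, as in the equation above."""
    s, d = X.shape                              # X: (s words, d dims)
    X_pad = np.vstack([np.zeros((w - 1, d)), X, np.zeros((w - 1, d))])
    feats = []
    for i in range(s + w - 1):
        z_i = X_pad[i:i + w].reshape(-1)        # concatenation of w word vectors
        feats.append(np.tanh(W @ z_i + b))
    return np.stack(feats, axis=1)              # (n_kernels, s + w - 1) feature map

X = np.random.randn(10, 300)                    # a 10-word sentence
W = np.random.randn(50, 3 * 300) * 0.01         # 50 kernels over windows of 3 words
print(convolve_phrases(X, W, np.zeros(50)).shape)   # (50, 12)
```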

3.2.3. Average Pooling Layer
The pooling layer is usually used for feature compression. In our model, we choose average pooling because whole sentences or paragraphs express more meaningful semantics. As shown in Figure 4, we design two pooling layers. The first one is “w-ap,” which takes the column-wise average over a window of $w$ consecutive columns. After the (wide) convolution layer, an $s$-column feature map is converted into an $(s + w - 1)$-column feature map; by using “w-ap,” the feature map is recovered to $s$ columns. This architecture facilitates the extraction of more useful abstract features.

The second one is “all-ap,” which averages over all columns. As shown in Figure 5, “all-ap” generates one representation vector for each feature map. The generated feature combines the information of the whole citation context or cited document.
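A NumPy sketch of the two pooling operations, assuming the wide-convolution output shape from the previous sketch (window size and kernel count are illustrative):

```python
import numpy as np

def w_ap(feature_map, w=3):
    """Window average pooling: averages every w consecutive columns so an
    (n, s + w - 1) feature map is recovered to (n, s) columns."""
    n, cols = feature_map.shape
    return np.stack([feature_map[:, i:i + w].mean(axis=1)
                     for i in range(cols - w + 1)], axis=1)

def all_ap(feature_map):
    """All-column average pooling: one vector summarizing the whole text."""
    return feature_map.mean(axis=1)

fmap = np.random.randn(50, 12)                 # output of the convolution sketch above
print(w_ap(fmap).shape, all_ap(fmap).shape)    # (50, 10) (50,)
```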

Now, we can obtain features of the citation context and independent features of the cited document. The next step is to obtain the semantic relationships between the citation context and the candidate paper. We use cosine similarity to measure the semantic relations:

$$\mathrm{sim}(v_c, v_p) = \frac{v_c \cdot v_p}{\|v_c\|\,\|v_p\|},$$

where $v_c$ and $v_p$ are the distributed representations of the citation context and the candidate document after the “all-ap” layer, respectively. A total of ten “all-ap” layers are used in our model; therefore, the resulting similarity feature belongs to $\mathbb{R}^{10}$. The benefit is that we can obtain the semantic relation between the citation context and the cited document at multiple granularities. As shown in Figure 6, the final output feature consists of all similarity values and the basic features. It is then fed into the SVD-FC layer.
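A small sketch of how the multi-granularity similarity feature could be assembled, assuming one “all-ap” vector per conv-pooling block for each input (the vectors here are random placeholders):

```python
import numpy as np

def cosine(u, v, eps=1e-8):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

# One all-ap vector per conv-pooling block for the context and the paper;
# ten blocks give a ten-dimensional similarity feature.
context_vecs = [np.random.randn(50) for _ in range(10)]
paper_vecs = [np.random.randn(50) for _ in range(10)]
sim_features = np.array([cosine(c, p) for c, p in zip(context_vecs, paper_vecs)])
print(sim_features.shape)   # (10,) multi-granularity semantic relations
```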

In most cases, we find that using the outputs of all pooling layers as the input of the SVD-FC layer improves performance. The reason is that features from different layers represent different levels of semantics; neglecting any layer obviously causes information loss.
Next, we use the SVD-FC layer to learn nonlinear combination features of citation relationships. This layer forces the vectors in the feature map to be independent and orthogonal to each other. The added SVD-FC layer also reduces the negative impact of having excessive parameters.
3.2.4. SVD-FC Layer
In this layer, we use SVD to factorize the weight matrix $W = USV^{T}$ and replace $W$ with $US$. Our experimental results show that this replacement operation has no negative impact on the sample space.
The Euclidean distance between samples can be used to measure whether their feature expression changes in a sample space. Denoting $\vec{f}_1$ and $\vec{f}_2$ as the feature maps of two different samples, we obtain two different outputs of the fully connected operation by using the weight matrix $US$ or $W$ as follows:

$$\vec{h}_i = (US)^{T}\vec{f}_i, \quad i = 1, 2, \tag{4}$$

$$\vec{g}_i = W^{T}\vec{f}_i = (USV^{T})^{T}\vec{f}_i = V(US)^{T}\vec{f}_i, \quad i = 1, 2. \tag{5}$$

As seen in the above equations, $\vec{h}_i$ is the orthogonalized output, while $\vec{g}_i$ is the unorthogonalized one. Then, we can obtain the following theorem.

Theorem 1. $\vec{h}$ and $\vec{g}$ in equations (4) and (5) generate the same Euclidean distance for samples $\vec{f}_1$ and $\vec{f}_2$.

Proof. The Euclidean distance between $\vec{g}_1$ and $\vec{g}_2$ is calculated as follows:

$$\|\vec{g}_1 - \vec{g}_2\|_2 = \|V(US)^{T}(\vec{f}_1 - \vec{f}_2)\|_2. \tag{6}$$

Since $V$ is an orthogonal matrix, equation (6) is equivalent to

$$\|\vec{g}_1 - \vec{g}_2\|_2 = \|(US)^{T}(\vec{f}_1 - \vec{f}_2)\|_2 = \|\vec{h}_1 - \vec{h}_2\|_2. \tag{7}$$

It can be seen that $\|\vec{g}_1 - \vec{g}_2\|_2 = \|\vec{h}_1 - \vec{h}_2\|_2$.
It should be noted that there is no negative impact and no change in discrimination ability for the entire sample space when replacing the weight matrix. As shown in Figure 7, we use the SVD of the weight matrix $W$ to map the feature map into an orthogonal linear space.
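The following NumPy sketch verifies Theorem 1 numerically for an arbitrary weight matrix (the matrix shape is an assumption): replacing $W$ with $US$ leaves the pairwise Euclidean distance between FC outputs unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 64))              # FC weight matrix (assumed shape)
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_orth = U * s                                  # replace W with US (orthogonal columns up to scale)

f1, f2 = rng.standard_normal(128), rng.standard_normal(128)
g1, g2 = W.T @ f1, W.T @ f2                     # original FC outputs
h1, h2 = W_orth.T @ f1, W_orth.T @ f2           # outputs after the SVD replacement

# Theorem 1: the pairwise Euclidean distance is unchanged because V is orthogonal.
print(np.allclose(np.linalg.norm(g1 - g2), np.linalg.norm(h1 - h2)))   # True
```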

3.2.5. Output Layer
The citation recommendation problem is regarded as a binary classification task in our model. In this layer, either logistic regression or an SVM is used to predict the final citation relationship.
3.3. Training Details
3.3.1. Embeddings
In our model, words are initialized with 300-dimensional word2vec embeddings and are not updated during training. A single randomly initialized embedding, sampled uniformly from a fixed interval, is created for all unknown words. We employ AdaGrad [31] and L2 regularization. We also introduce adversarial training [32] on the embeddings to make the model more robust. This is achieved by replacing each word vector $v$ produced by the word2vec embedding with a perturbed vector:

$$v' = v + r_{\mathrm{adv}},$$

where $r_{\mathrm{adv}}$ is the worst-case perturbation of the word vector. Goodfellow et al. [33] approximated this value by linearizing the loss function around $v$, where $\hat{\theta}$ is a constant set to the current parameters of our model; $\hat{\theta}$ only participates in the calculation of $r_{\mathrm{adv}}$ and receives no gradient through backpropagation. With the linear approximation and an $L_2$ norm constraint $\|r\|_2 \le \epsilon$, the adversarial perturbation is

$$r_{\mathrm{adv}} = -\epsilon \frac{g}{\|g\|_2}, \qquad g = \nabla_{v} \log p(y \mid v; \hat{\theta}).$$
This perturbation can be easily computed by using backpropagation in neural networks.
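A PyTorch sketch of this perturbation, assuming a loss that is differentiable with respect to the embedded inputs (the `epsilon` value and the dummy loss are placeholders):

```python
import torch

def adversarial_perturbation(loss, embedded, epsilon=0.02):
    """Fast-gradient-style perturbation on word embeddings (sketch):
    linearize the loss around the current embeddings and move epsilon
    along the normalized gradient direction (here the loss is a negative
    log-likelihood, so the sign is positive)."""
    grad, = torch.autograd.grad(loss, embedded, retain_graph=True)
    return epsilon * grad / (grad.norm() + 1e-12)

# Toy usage: a dummy loss over "embeddings" requiring gradients.
v = torch.randn(4, 10, 300, requires_grad=True)
loss = (v ** 2).mean()
r_adv = adversarial_perturbation(loss, v)
v_adv = v + r_adv                 # perturbed embeddings fed back into the model
print(r_adv.norm())               # ≈ epsilon
```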
3.3.2. Layerwise Training
In our training procedure, we define a conv-pooling block $B_i$, which consists of a convolution layer and a pooling layer. Our network is then assembled from an initialization block $B_0$, which initializes the inputs using word2vec, and a stack of conv-pooling blocks $B_1, \dots, B_n$.
First, we train the conv-pooling block $B_1$ after $B_0$ is trained. On this basis, the next conv-pooling block is created and trained while keeping the previous blocks fixed. We repeat this procedure until all conv-pooling blocks are trained.
Second, the following semiorthogonal training procedure is used to train the whole network.
Semiorthogonal training (SOT) is crucial for training SVD-CNN and consists of the following three steps:
Step 1. Decompose the weight matrix by SVD, i.e., $W = USV^{T}$, where $W$ is the weight matrix of the linear layer, $U$ is the left-unitary matrix, $S$ is the singular value matrix, and $V$ is the right-unitary matrix. After that, we replace $W$ with $US$; that is, we take the eigenvectors of $WW^{T}$, scaled by the corresponding singular values, as the weight vectors.
Step 2. The backbone model is fine-tuned with the SVD-FC layer fixed.
Step 3. The model keeps fine-tuning with the SVD-FC layer unfixed.
Step 1 generates orthogonal weights, but the prediction performance cannot be guaranteed. The reason is that excessive orthogonality will overly penalize synonymous sentences, which is clearly inappropriate. Therefore, we introduce Steps 2 and 3 to solve this problem.
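A PyTorch sketch of one possible SOT loop, assuming the model exposes its SVD-FC layer as `model.svd_fc`, that a caller-supplied `train_fn` performs ordinary fine-tuning, and that the layer has no more output units than input units so the $US$ replacement keeps the weight shape:

```python
import torch

def semi_orthogonal_training(model, train_fn, iterations=5):
    """Sketch of SOT: (1) replace W with US, (2) fine-tune with the layer
    frozen, (3) fine-tune with the layer released; repeat."""
    fc = model.svd_fc                                    # hypothetical attribute
    for _ in range(iterations):
        with torch.no_grad():                            # Step 1: decorrelate
            W = fc.weight.t()                            # columns of W are the FC weight vectors
            U, S, Vh = torch.linalg.svd(W, full_matrices=False)
            fc.weight.copy_((U * S).t())                 # W -> US, assuming out_features <= in_features
        fc.weight.requires_grad_(False)                  # Step 2: restraint (layer fixed)
        train_fn(model)
        fc.weight.requires_grad_(True)                   # Step 3: relaxation (layer unfixed)
        train_fn(model)
```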
The inputs of the SVD-FC layer are defined as $x = (x_1, \dots, x_n)$, the outputs as $y = (y_1, \dots, y_m)$, the weight matrix as $W = [w_{ij}]$, and the expected outputs as $t = (t_1, \dots, t_m)$. The error function is defined as

$$E = \frac{1}{2}\sum_{i=1}^{m}(t_i - y_i)^2, \tag{10}$$

where $y_i = f(\mathrm{net}_i)$ and $\mathrm{net}_i = \sum_{j=1}^{n} w_{ij}x_j$. Then, the derivative of $E$ with respect to $w_{ij}$ is

$$\frac{\partial E}{\partial w_{ij}} = -(t_i - y_i)\,f'(\mathrm{net}_i)\,x_j. \tag{11}$$
We utilize the gradient descent strategy to find the gradient of the error with respect to the weights. The iterative update of the weights is as follows:

$$w_{ij} \leftarrow w_{ij} - \eta \frac{\partial E}{\partial w_{ij}}. \tag{12}$$
We define an error signal $\delta_i$. Equation (12) is then equivalent to

$$w_{ij} \leftarrow w_{ij} + \eta\,\delta_i\,x_j. \tag{13}$$
According to equation (11), $\delta_i$ is equivalent to

$$\delta_i = (t_i - y_i)\,f'(\mathrm{net}_i). \tag{14}$$
We use the sigmoid as the nonlinear function, so equation (13) is equivalent to

$$w_{ij} \leftarrow w_{ij} + \eta\,(t_i - y_i)\,y_i(1 - y_i)\,x_j. \tag{15}$$
In Step 1, the weight matrix is decomposed by SVD and replaced, i.e., $W = USV^{T}$ and $W$ is replaced with $US$. Since $S$ is given, we define $w_{ij} = s_i u_{ij}$. As a result, equation (15) is equivalent to

$$u_{ij} \leftarrow u_{ij} + \frac{\eta}{s_i}\,(t_i - y_i)\,y_i(1 - y_i)\,x_j. \tag{16}$$
The updated entries $u_{ij}$ lie in the left-unitary matrix $U$, so the model operation is not affected by nonorthogonal eigenvectors. This is the reason that Step 1 alone excessively penalizes synonymous sentences. However, orthogonality has a positive effect on the fine-tuning in Step 2.
The purpose of SVD is to maintain the orthogonality of the weight vectors in geometric space. When the weight vectors are conditioned by orthogonal regularization, the relevance between weight vectors decreases. We use the following quantities in Step 3 to measure this relevance:

$$g_{ij} = w_i \cdot w_j,$$

where $W$ is a weight matrix that contains $k$ weight vectors, $W = [w_1, w_2, \dots, w_k]$, and $g_{ij}$ is the dot product of $w_i$ and $w_j$. Let us define $S(W)$ as the correlation measurement over all column vectors in $W$:

$$S(W) = \frac{\sum_{i=1}^{k} g_{ii}}{\sum_{i=1}^{k}\sum_{j=1}^{k} |g_{ij}|}.$$
When $W$ is an orthogonal matrix, the value of $S(W)$ is 1. When all weight vectors are collinear, $S(W)$ obtains its minimum value $1/k$. Therefore, the value of $S(W)$ falls into $[1/k, 1]$. As a result, when $S(W)$ is close to $1/k$, the weight matrix has high relevance, i.e., strong correlation.
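A NumPy sketch of this correlation measurement, checked on a fully collinear and a fully orthogonal weight matrix (the shapes are illustrative):

```python
import numpy as np

def correlation_measure(W):
    """S(W) from the definition above: ratio of the gram matrix's diagonal
    mass to its total absolute mass; 1 means fully orthogonal columns."""
    G = np.abs(W.T @ W)                     # |g_ij| = |w_i . w_j|
    return np.trace(G) / G.sum()

rng = np.random.default_rng(0)
W_corr = rng.standard_normal((128, 1)) @ np.ones((1, 8))   # collinear columns
W_orth = np.linalg.qr(rng.standard_normal((128, 8)))[0]    # orthonormal columns
print(correlation_measure(W_corr))   # ~1/8, highly correlated
print(correlation_measure(W_orth))   # ~1.0, decorrelated
```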
3.4. Complexity Analysis
Assume that the training sample size is $m$, the average number of words in each citation context is $s$, $n_l$ is the number of kernels in the $l$-th layer, and $w$ is the size of the sliding window. For one convolution layer, the training complexity is $O(m \cdot s \cdot w \cdot n_{l-1} \cdot n_l)$. The training complexity of one w-ap layer is $O(m \cdot s \cdot w \cdot n_l)$, and that of one all-ap layer is $O(m \cdot s \cdot n_l)$. As shown by C. F. Van Loan [12], computing the eigenvalues for the SVD of an $n \times n$ matrix takes $O(n^3)$ time with the Jacobi method. Assume that the size of the weight matrix in the SVD-FC layer is $n \times n$ and that the input matrix has $c$ channels. The computational cost of the SVD-FC layer is then $O(c \cdot n^3)$.
4. Experiment
4.1. Dataset
We use the CiteSeer dataset [34] to evaluate the performance of our model. The dataset was published by Huang et al. [4]. In this dataset, citation relationships are extracted as pairs of citation contexts and the abstracts of cited papers. A citation context includes the sentence in which the citation placeholder appears and the sentences before and after it. Within each paper in the corpus, the 50 words before and 50 words after each citation reference are treated as the corresponding citation context (a discussion on the number of words can be found in [7]). Before word embedding, we also remove stop words from the contexts. To preserve the time-sensitive past/present/future tenses of verbs and the singular/plural forms of named entities, no stemming is performed, but all words are converted to lowercase. The training set contains 3,989,547 pairs of citation contexts and citations, and the test set contains 1,021,685 citation relations.
Following common practice in information retrieval (IR), we employ the following four evaluation metrics to evaluate recommendation results: recall, mean reciprocal rank (MRR), mean average precision (MAP), and normalized discounted cumulative gain (nDCG).
4.2. Evaluation Metric
For each query in the test set, we use the original set of references as the ground truth $G$. Assume that the set of recommended citations is $R$ and that the correct recommendations are $T = R \cap G$. Recall is defined as

$$\mathrm{Recall@}k = \frac{|T|}{|G|} = \frac{|R \cap G|}{|G|}.$$
In our experiments, the number of recommended citations ranges from 1 to 10. Recall evaluation does not reveal the order of recommended references. To address this problem, we select the following two additional metrics.
For a query $q$, let $\mathrm{rank}_q$ be the rank of the first correct recommendation within the list. MRR [35] is defined as

$$\mathrm{MRR} = \frac{1}{|Q|}\sum_{q \in Q}\frac{1}{\mathrm{rank}_q},$$

where $Q$ is the testing set. MRR reveals the average ranking of the first correct recommendation.
For each citation placeholder, we search for the papers that may be referenced at this position. Each retrieval model returns a ranked list of papers. Since there may be one or more references for one citation context, we use mean average precision (MAP) as the evaluation metric:

$$\mathrm{MAP} = \frac{1}{|Q|}\sum_{q \in Q}\frac{\sum_{k} P_q(k)\,\mathrm{rel}_q(k)}{\sum_{k}\mathrm{rel}_q(k)},$$

where $\mathrm{rel}_q(k)$ is a binary function indicating whether the document at rank $k$ is relevant, and $P_q(k)$ is the precision of the top $k$ results for query $q$. For our problem, the papers cited at the citation placeholder are considered the relevant documents.
We use normalized discounted cumulative gain (NDCG) to measure the quality of the ranked recommendation list. The NDCG value of a ranking list at position $n$ is calculated as

$$\mathrm{NDCG@}n = Z_n\sum_{i=1}^{n}\frac{2^{r_i} - 1}{\log_2(i + 1)},$$

where $r_i$ is the 4-scale relevance of the document at position $i$ in the ranked list and $Z_n$ normalizes the perfect ranking to an NDCG value of 1. We use the average co-cited probability [2] between a recommended paper and an original citation of the query to weigh the citation relevance score. We report the average NDCG score over all testing documents.
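For reference, minimal Python implementations of the four metrics under the definitions above (the list-based inputs and the toy example are assumptions):

```python
import math

def recall_at_k(recommended, ground_truth, k=10):
    hits = len(set(recommended[:k]) & set(ground_truth))
    return hits / len(ground_truth)

def mrr(ranked_lists, ground_truths):
    total = 0.0
    for ranked, truth in zip(ranked_lists, ground_truths):
        rank = next((i + 1 for i, d in enumerate(ranked) if d in truth), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(ranked_lists)

def average_precision(ranked, truth, k=10):
    hits, score = 0, 0.0
    for i, d in enumerate(ranked[:k]):
        if d in truth:
            hits += 1
            score += hits / (i + 1)
    return score / min(len(truth), k)

def ndcg_at_k(relevances, k=10):
    dcg = sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)[:k]
    idcg = sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

print(recall_at_k(["p3", "p1"], ["p1", "p2"], k=2))   # 0.5
print(mrr([["p3", "p1"]], [["p1"]]))                  # 0.5
```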
4.3. Baseline Comparison
We choose the following methods for comparison.
Cite-PLSA-LDA (CP-LDA) [36]: we use the original implementation provided by the authors. The number of topics is set to 60.
(i) Restricted Boltzmann Machine (RBM-CS) [37]: we train two layers of RBM-CS according to the authors' suggestion and set the hidden layer size to 600.
(ii) Word2vec Model (W2V) [29]: we use the word2vec model to learn word and document representations. The cited document is treated as a “word” (a document uses a unique marker when it is cited by different papers). The dimensions of the word and document vectors are set to the same dimension $d$ as in our model.
(iii) Neural Probabilistic Model (NPM) [4]: we follow the original implementation. The dimensions of the word and document representation vectors are likewise set to $d$. For negative sampling, we set the number of negative samples to $k$, where $k$ is the number of noise words in the citation context. For noise contrastive estimation, the number of noise samples is set in the same way.
(iv) Neural Citation Network (NCN) [7]: in NCN, the gradient clipping is 5, the dropout probability is 0.2, and the number of recurrent layers is 2. The region sizes for the encoder are set to 4, 4, and 5, and the region sizes for the author network are set to 1 and 2.
Figures 8 and 9 show the performance of each method on the CiteSeer dataset. It is obvious that SVD-CNN leads the performance in most cases. More detailed analyses are given below.


First, we perform a comparison among CP-LDA, RBM-CS, W2V, and SVD-CNN. Our SVD-CNN significantly exceeds the other models in all metrics. The success of our model is ascribed to its content modeling and the decorrelation in our network. Due to the lack of citation context information, W2V is obviously worse than the other methods in all metrics. CP-LDA works much better than W2V, which indicates that link information is very important for finding relevant papers. RBM-CS shows a clear performance gain over W2V because RBM-CS automatically discovers topical aspects of each paper based on the citation context. However, the vector representations of the citation context in RBM-CS are extracted by traditional word vector representations, which fully neglect the semantic relations between the cited document and the citation context and may thus be limited by vocabulary.
Second, we compare the performance among NPM, NCN, and SVD-CNN. It is not surprising that NPM and NCN achieve worse performance than SVD-CNN, since their distributed representations of words and documents rely solely on deep learning without any constraint. NPM recommends citations based on trained distributed representations. NCN further enhances the performance by considering author information and using a more sophisticated neural network architecture. However, the CNN in NCN has no orthogonal constraint, which makes it difficult to capture different types of citing activities. In addition, NCN only utilizes the title of the cited paper in its decoder, which is apparently not sufficient for learning good embeddings.
4.4. The Influence of Reference Pattern Interactional Features on Link Prediction
According to the chapter position of the citation context in the article, we divide the training set into three parts: the introduction part contains 1,307,885 pairs of citation contexts and citations, the related work part contains 1,599,897 pairs, and the main part contains 1,024,783 pairs. These parts are further combined to form three mixed datasets. In this part of the experiment, we use the CNN model without SVD as the baseline. Each dataset is split into training and test sets at a ratio of 3 : 1. Tables 1 and 2 show the results on the abovementioned datasets.
From the results, we obtain the following observations:
First, both CNN and SVD-CNN perform better on the unmixed datasets than on the mixed datasets across the different evaluation metrics, which shows that the diversity of reference patterns increases the difficulty of the citation recommendation task.
Second, in Tables 1 and 2, we observe that our model is particularly good at resolving the difficulties in mixed datasets, which come from the diversity of reference patterns.
To better explore why the mixed datasets are more difficult than the unmixed datasets, Figure 10 shows the change in $S(W)$ during the training process of SVD-CNN on the various datasets.

As shown in Figure 10, the increase in $S(W)$ on the mixed datasets indicates that SVD-CNN is good at decorrelation. We can also see in Tables 1 and 2 that the CNN model performs well on the unmixed datasets while achieving poor performance on the mixed datasets. However, SVD-CNN achieves almost the same performance on both types of datasets. This proves that the correlation arising from various reference patterns can significantly affect link prediction.
The reason why the change in $S(W)$ is not large on the unmixed datasets is that the reference patterns within an unmixed dataset have similar features, which belong to the same category. As a result, the orthogonality of the weight matrix is hard to improve on the unmixed datasets. Nevertheless, a citation recommendation algorithm still performs well on the unmixed datasets because of their lower complexity.
Although mixed datasets are more complicated than unmixed datasets, SVD-CNN still performs well in mixed datasets. This indicates that SVD-CNN reduces the negative impact of the correlation of reference patterns, and our approach is more suitable for complex scenarios.
4.5. Comparison with Other Types of Decorrelation
In addition to SVD, there are other methods for decorrelating the weight matrix. However, these methods cannot maintain the discriminating ability of the CNN model. To illustrate this, we compare SVD with the following variants:
(1) Using the originally learned $W$
(2) Replacing $W$ with $US$
(3) Replacing $W$ with $U$
(4) Replacing $W$ with $UV^{T}$
(5) Replacing $W$ with $QD$, where $D$ is the diagonal matrix extracted from the upper triangular matrix $R$ in the Q-R decomposition $W = QR$
(6) Replacing $W$ with a PCA-based variant, in which the diagonal matrix is extracted from the weight matrix after dimension reduction by PCA
After training converges, the different orthogonalized matrices are used to replace the weight matrix $W$. We define T-cost as the time cost of replacing the weight, i.e., the proportion of the added time to the original training time. As shown in Table 3, the other types of decorrelation degrade the performance, with the exception of $US$ and $QD$. However, the time cost of $QD$ is higher than that of $US$.
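A NumPy sketch comparing several of the replacement matrices listed above on the correlation measurement $S(W)$; the decompositions shown (SVD and QR) are standard, the matrix shape is arbitrary, and the PCA variant is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 64))
U, s, Vt = np.linalg.svd(W, full_matrices=False)
Q, R = np.linalg.qr(W)

variants = {
    "W (original)": W,
    "US": U * s,
    "U": U,
    "UV^T": U @ Vt,
    "QD": Q @ np.diag(np.diag(R)),   # D: diagonal of the upper triangular R
}

def correlation_measure(M):
    G = np.abs(M.T @ M)
    return np.trace(G) / G.sum()

for name, M in variants.items():
    print(f"{name:14s} S = {correlation_measure(M):.3f}")   # orthogonalized variants give S = 1
```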
4.6. Ablation Study
In our method, there are two essential parameters: $T$, the number of SOT iterations, and the representation dimension $d$. In this section, we conduct an ablation study on these parameters.
We first evaluate the effect of $T$ by empirically fixing $d$. Since $T$ defines the number of orthogonal-constraint training loops, it should be set to a nonnegative value. Figure 11 illustrates the MRR with $T$ ranging from 0 to 10 on the CiteSeer dataset. We can see that the performance improves as the value of $T$ increases. When $T = 0$, the model has no decorrelation and achieves the worst performance. In this situation, the weight matrix in the FC layer is highly correlated, and $S(W)$ has its lowest value. The recommendation performance then increases as $T$ grows, which indicates that reducing the degree of correlation of the weight matrix in the FC layer is critical for improving performance. The model achieves its best performance once $T$ is sufficiently large (see Figure 11).

In our model, $d$ is the dimension of the citation context and cited document representations. Figure 12 shows how the performance of SVD-CNN varies with $d$ under the same $T$. When $d$ is small, the information content of the citation context representation is very limited, which produces worse performance. The recommendation performance increases up to a maximum as $d$ reaches 300. It should be noted that although a larger $d$ is better, a larger $d$ also significantly increases the training time. Therefore, we choose $d = 300$.

5. Conclusion and Future Works
We propose a convolutional neural network model with orthogonal regularization to solve the context-aware citation recommendation task. In our model, orthogonal regularization is achieved by using SVD to factorize the weight of the FC layer, which can essentially make each vector in the feature map more independent. The orthogonal regularization also enhances the feature extraction ability of CNN. The experimental results show that SVD-CNN outperforms the other compared methods on CiteSeer. Our model only takes the abstract as the content of the cited paper. In the future, we will explore the performance of our model by using the full text of papers.
Data Availability
Previously reported CiteSeer data were used to support this study and are available at [https://psu.app.box.com/v/refseer]. These prior datasets are cited at relevant places within the text as references [4].
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China (project no. 61373046) and the National Key Research and Development Programs of China (project nos. 2018AAA0101100 and 2019YFB2102500).