Abstract

Sentiments among social individuals are complex and diverse, and the relationships between them can be friendly or hostile. Positive (“friendly”, “like”, or “trust”) and negative (“hostile”, “dislike”, or “distrust”) sentiments in these relations can be modeled as signed connections, or links. Inferring the missing relations or sentiments between individuals is therefore a worthwhile task. Sign prediction on links has significant applications in a variety of online settings, such as online recommendation systems and abnormal user detection. This paper presents a novel sign prediction method, the SPR-model, which is built on two indexes: similarity and preference-reputation (PR). The similarity of a pair of nodes is defined by the statistical properties of local structures. This definition agrees with the theory of social balance because existing connections reflect the tendency of new links to emerge between individuals. The PR value measures the positive or negative tendency of edges without signs. Experiments on large real social datasets demonstrate the feasibility and efficiency of the model: compared with several popular prediction methods, the model shows lower complexity and higher accuracy. The experimental results also show that the model provides insight into the mechanism driving the sign formation of links.

1. Introduction

In social networks, relations among members exhibit not only friendship and cooperation but also hostility and competition. Positive and negative links are used to describe cooperative (friendly/trustful) and competitive (hostile/distrustful) relationships, respectively. Assigning signs to links is a significant way of including more information in networks than traditional binary or weighted approaches do [13]. One of the challenges in signed networks is inferring the signs of unknown relations, often referred to as sign prediction [4], which reveals the underlying relationships between social members. It can therefore be used in many applications, such as recommendation systems and abnormal user detection [5].

Sign prediction is the problem of inferring hidden signs using the information provided by the rest of the network. It is similar to link prediction, a well-studied problem in traditional unsigned social network analysis [6]. Compared with link prediction, however, sign prediction is still at an early stage because of the following difficulties. On the one hand, the effects of negative and positive signs are unbalanced in signed social networks [7, 8]. Positive signs can be propagated between members of social networks, while negative signs cannot. For example, if A trusts B and B trusts C, then A will trust C to some extent; but if A distrusts B and B distrusts C, it is hard to judge the relationship between A and C directly [9]. Accordingly, in the propagation model of reference [10], the distrust relationship propagates only once among the trust relationships. On the other hand, the formation mechanism of negative links differs from that of positive links. Few datasets with negative signs are available for study [11], because members of social networks rarely express their antipathy toward others for fear of retaliation [12]. Negative sign prediction has thus become a difficult problem in the field. Therefore, in-depth study of the formation mechanism of social networks is the key to improving prediction accuracy.

Sign prediction was first introduced and investigated by Guha et al. [10] and was later developed through matrix computation, machine learning, and collaborative filtering. Guha et al. [10] used matrix powers to calculate the propagation of trust and distrust. Building on the matrix formulation, a variety of prediction techniques have been discussed. Leading eigenvectors with fitness functions were used to fine-tune clusters [13]. Random walks guided by the similarity between node pairs were used to study the inconsistency of distrust in propagation [14]. Minimizing the rank of the adjacency matrix approximately yields the most balanced structure [15]. To quickly obtain the maximal balanced matrix, Cai et al. [16] proposed a singular value projection algorithm, in which the product of the top-k singular vectors and singular values approximately replaces the original matrix. Agrawal et al. [17] and Hsieh et al. [18] approximated the original matrix by matrix decomposition, in which the original matrix is decomposed into the product of two matrices and the entries of the product matrix are used as the predicted values. To date, the machine learning methods used include logistic regression [4, 9, 19, 20], support vector machines [21], decision trees [22], and naive Bayes [23]; the features used for learning include nodal degrees [4, 9], types [23], similarity [9, 20], trustworthiness [24], preference [25, 26], triangle structures [4], quadrilateral structures [19], and user reviews [22, 27]. Collaborative filtering focuses on similarity: similar individuals are more likely to behave similarly, which is the basic idea of sign prediction by collaborative filtering. Javari and Jalili [28] argued that computing the similarity between nodes is affected by the sparsity of social networks. They therefore clustered the network and calculated the similarity between clusters in place of the similarity between individuals.
Individual behaviors in signed networks are believed to be hidden in “group intelligence”, which is embodied by the community structure [5]. However, detecting the community structure embedded in a social network is intractable even in complete networks [29].

Inspired by these references and their methods, this paper presents a new sign prediction method based on two indexes: similarity and the preference-reputation (PR) value. It is called the SPR-model for short. The statistics of local structures are analyzed to explore the formation mechanism of signed social networks, from which the similarity of a pair of nodes is defined. The meaning of similarity agrees with the theory of social balance, because existing connections reflect the tendency of new links to emerge between individuals. The PR value, which coincides with the preferential attachment mechanism [2], measures the positive or negative tendency of edges without signs. Experiments on real data demonstrate the feasibility and efficiency of the model. Compared with popular prediction methods, the model shows lower complexity and higher accuracy. The experimental results also show that the model provides insight into the mechanism driving the sign formation of links.

The rest of this paper is organized as follows. Section 1 gives the introduction and motivation. In Section 2, the similarity and the PR value are defined, and the predictive method, namely the SPR-model, is then presented based on these indexes. In Section 3, the experimental results and comparisons on three real signed social networks, Epinions, Slashdot, and Wikipedia, are shown. Finally, the discussion and conclusion of this work are presented in Section 4.

2. The Method and Model

A signed graph consists of a node set, a link set, and a weight set on the links, such that the link from one node to another is set to +1, −1, or 0 if the first node shows a positive, a negative, or no attitude toward the second. Whether positive or negative, such sentiments are clear and distinct. The absence of an attitude, however, is ambiguous and unsettled, and the precise attitude remains to be determined. A natural question is then to predict the sign of such a link based on the information of the other links and their signs [4]. The sign prediction problem is also interpreted as to “what extent the evolution of a network can be predicted using its structural information” [26].

In this section, the indexes of similarity, dissimilarity, preference, and reputation are presented, and the link sign prediction model is constructed.

2.1. Similarity and Dissimilarity

To predict the edge sign from node to node , , a targeted analysis of the prediction task is necessary. Consider the local structures shown in Figure 1: in panel (a), since is the node into and , the higher the common attributes between and , the higher the probability of ; in panel (b), since is the node out of , the higher the common attributes between and , the higher the probability of . Thereby, predicting can be done via the common attributes between and and the common attributes between and . Analyzing Figure 1, since are the source nodes and are the target nodes in the quadrilateral structure, the common attributes between and are equal to the common attributes between and . Thus, it can yield twice the results with half the effort. Generally, the more common neighbors (with consistent polarity) two nodes have, the higher their common attributes will be. Then the similarity between and can be defined as in Equation (1), where and are the neighborhoods reached by links going out of the node with positive and negative signs, respectively, and is the neighborhood with links coming into the node irrespective of their signs. Further, is refined by the signs of the node and its neighbors. This gives Equation (2), where and are the cases of and for Equation (1), respectively, and and are the neighborhoods with links coming into the node with positive and negative signs, respectively. and are called the positive similarity and negative similarity, respectively.
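The remark that two nodes share more common attributes when they have more common neighbors with consistent polarity can be illustrated with a small Python sketch. It is not Equation (1) itself, only the counting step that such a similarity would rest on. The function names and the toy edge list are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def signed_out_neighbors(edges):
    """Map each node to the sets of nodes it links to positively and negatively."""
    pos_out = defaultdict(set)
    neg_out = defaultdict(set)
    for source, target, sign in edges:
        if sign > 0:
            pos_out[source].add(target)
        elif sign < 0:
            neg_out[source].add(target)
    return pos_out, neg_out


def consistent_common_out_neighbors(u, v, pos_out, neg_out):
    """Count nodes that u and v both link to with the same sign."""
    both_positive = pos_out.get(u, set()) & pos_out.get(v, set())
    both_negative = neg_out.get(u, set()) & neg_out.get(v, set())
    return len(both_positive) + len(both_negative)


def inconsistent_common_out_neighbors(u, v, pos_out, neg_out):
    """Count nodes that u and v both link to, but with opposite signs."""
    u_pos_v_neg = pos_out.get(u, set()) & neg_out.get(v, set())
    u_neg_v_pos = neg_out.get(u, set()) & pos_out.get(v, set())
    return len(u_pos_v_neg) + len(u_neg_v_pos)


# Toy data (hypothetical): a and b agree on x and y, and disagree on z.
toy_edges = [
    ("a", "x", 1), ("b", "x", 1),
    ("a", "y", -1), ("b", "y", -1),
    ("a", "z", 1), ("b", "z", -1),
]
pos_out, neg_out = signed_out_neighbors(toy_edges)
agree = consistent_common_out_neighbors("a", "b", pos_out, neg_out)
disagree = inconsistent_common_out_neighbors("a", "b", pos_out, neg_out)
```

In this toy network the two source nodes agree on two targets and disagree on one, which is the kind of evidence that similarity and dissimilarity indexes weigh against each other.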

Figure 2 shows all the cases: panels (a)–(d) correspond to the positive similarity, whereas panels (e)–(h) correspond to the negative similarity. By Equation (1), panels (a) and (b) conform to the positive similarity, while panels (c) and (d) go against it; panels (e) and (f) conform to the negative similarity, while panels (g) and (h) go against it. To capture this opposite property, the dissimilarity is also introduced.

In Figure 2, the more structures of types (a) and (b) there are, the larger the positive similarity, and the more structures of types (c) and (d), the smaller the positive similarity. Likewise, the more structures of types (e) and (f) there are, the larger the negative similarity, and the more structures of types (g) and (h), the smaller the negative similarity.

Analogously to the similarity, the dissimilarity between two nodes is defined by Equation (3).

Refining Equation (3) by the signs of the links, as was done for Equation (1), gives Equation (4), whose two cases are the positive dissimilarity and the negative dissimilarity, respectively.

By Equations (1)–(4), it is found that the following two facts hold if ,

otherwise, when , the other two facts hold,

Normally, the similarity represents the degree of consistency between two nodes, while the dissimilarity represents the degree of inconsistency between them. In real social networks, positively similar nodes tend to have positive relationships, while nodes with large differences between them may have negative relationships.

2.2. Preference and Reputation

In social networks, the preference and reputation of individuals influence the decision to form a connection [25]. Preference, known as optimism or bias in previous studies [26], is a property of edge-generating nodes. Some nodes are more optimistic than others, meaning that their attitudes are more likely to be positive. The preference of a node, given by Equation (7), is the proportion of positive edges among all edges generated by that node.

The preference measures the general attitude of a node toward other nodes. The greater the preference, the higher the probability that the node generates another positive edge.

Reputation, also known as prestige or deserve in previous studies [26], is a property of edge-receiving nodes. Reputation reflects the popularity of a node in the network. A node with a high reputation tends to receive more positive edges. The reputation of a node, given by Equation (8), is the proportion of positive edges among all edges received by that node.

The reputation measures the general attitude of other nodes toward a node. The greater the reputation, the higher the probability that the node receives another positive edge.

Combining the preference of the source node and the reputation of the target node enhances the prediction for the node pair. Therefore, we calculate the PR value as the weighted sum of the two, as given by Equation (9).

The coefficients of the preference and the reputation in Equation (9) sum to 1, which means that the equation takes into account not only the preference of the source node and the reputation of the target node, but also the preferential attachment mechanism [2].
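The quantities described for Equations (7) to (9) can be sketched in Python as follows. The preference and reputation follow the verbal definitions above (proportion of positive edges generated or received). The exact coefficients of Equation (9) are not reproduced here, so the `weight` parameter is a hypothetical stand-in that only respects the stated constraint that the two coefficients sum to 1.

```python
from collections import Counter


def preference_and_reputation(edges):
    """Return (preference, reputation) dicts from (source, target, sign) edges.

    preference[u]: fraction of positive edges among the edges u generates.
    reputation[v]: fraction of positive edges among the edges v receives.
    Nodes without outgoing (incoming) edges have no preference (reputation).
    """
    out_total, out_pos = Counter(), Counter()
    in_total, in_pos = Counter(), Counter()
    for source, target, sign in edges:
        out_total[source] += 1
        in_total[target] += 1
        if sign > 0:
            out_pos[source] += 1
            in_pos[target] += 1
    preference = {n: out_pos[n] / out_total[n] for n in out_total}
    reputation = {n: in_pos[n] / in_total[n] for n in in_total}
    return preference, reputation


def pr_value(pref_source, rep_target, weight):
    """Weighted sum whose coefficients sum to 1.

    `weight` is an assumed free parameter in [0, 1], not the paper's coefficient.
    """
    return weight * pref_source + (1.0 - weight) * rep_target


# Toy data (hypothetical).
toy_edges = [("a", "b", 1), ("a", "c", -1), ("d", "b", 1), ("c", "b", -1)]
preference, reputation = preference_and_reputation(toy_edges)
example_pr = pr_value(preference["a"], reputation["b"], weight=0.5)
```

Here node `a` generates one positive and one negative edge, so its preference is 0.5, while node `b` receives two positive edges out of three, so its reputation is 2/3.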

2.3. The Prediction: SPR-Model

This section predicts signs using the similarity-dissimilarity index and the PR value. Similarity-dissimilarity is a local environmental feature that reflects the interaction structure in which the target edge actually participates, while the PR value is a feature of the nodes themselves that reflects empirical estimates based on past behavior. The prediction method takes similarity-dissimilarity as the decisive factor and the PR value as the auxiliary factor.

The SPR-model is defined as follows:

The model yields a positive index and a negative index. A given positive real number is used as a threshold on the difference between the positive index and the negative index. The positive index reflects the positive tendency between nodes, while the negative index reflects the negative tendency. When the gap between the two indexes is large enough, the tendency is regarded as obvious: a clearly larger positive index is taken to indicate a positive sign, and a clearly larger negative index a negative sign. Hence, the sign of the link between two nodes is assigned according to the following two cases:

Case 1. The gap between the positive index and the negative index reaches the threshold. In this case, the sign tendency is obvious, so the similarity-dissimilarity feature alone is sufficient for the prediction task, and the sign of the link is assigned according to the index that dominates.

Case 2. The gap between the positive index and the negative index stays below the threshold. In this case, the sentiment tendency is ambiguous, so the similarity-dissimilarity feature loses its efficacy for prediction and the PR value is used instead. Consider the proportion of positive links in the network. If the PR value is greater than this proportion, the combined preference and reputation indicate a stronger positive tendency than the network as a whole, so the link is predicted to be positive; otherwise, it is predicted to be negative. In the extreme cases, a PR value of 1 means that the links generated by the source node and the links received by the target node are all positive, whereas a PR value of 0 means that they are all negative.
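The two-case rule can be written as a short Python sketch. This is an illustration under stated assumptions rather than the paper's exact formulas: the argument names are hypothetical, the positive and negative indexes are taken as already computed, and the boundary handling (`>=` in Case 1, strict `>` in Case 2) is assumed because the original inequalities are not reproduced here.

```python
def predict_sign(positive_index, negative_index, pr, gap_threshold, positive_ratio):
    """Predict +1 or -1 for one edge.

    positive_index, negative_index: similarity-dissimilarity indexes of the edge.
    pr: PR value of the edge (preference of source, reputation of target).
    gap_threshold: assumed positive threshold on the gap between the indexes.
    positive_ratio: proportion of positive links in the network.
    """
    gap = positive_index - negative_index
    # Case 1: the tendency is obvious, so the local feature decides.
    if abs(gap) >= gap_threshold:
        return 1 if gap > 0 else -1
    # Case 2: the tendency is ambiguous, so fall back on the PR value.
    return 1 if pr > positive_ratio else -1


# Hypothetical values chosen only to exercise both cases.
clear_positive = predict_sign(0.9, 0.1, pr=0.2, gap_threshold=0.3, positive_ratio=0.8)
clear_negative = predict_sign(0.1, 0.9, pr=0.95, gap_threshold=0.3, positive_ratio=0.8)
ambiguous_high_pr = predict_sign(0.5, 0.45, pr=0.9, gap_threshold=0.3, positive_ratio=0.8)
ambiguous_low_pr = predict_sign(0.5, 0.45, pr=0.6, gap_threshold=0.3, positive_ratio=0.8)
```

Note that in the first two calls the PR value points the other way but is ignored, which reflects the decisive role of the local feature and the auxiliary role of the PR value.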

2.4. The Pseudo-Code for Computing the SPR-Model

The pseudo-code for calculating the SPR-model is shown in Table 1.

The computational complexity of the SPR-model algorithm in Table 1, including time and space complexity, is analyzed as follows. Step 1 computes the neighbor sets of the nodes by traversing all edges once, so its time complexity is , where is the size of the edge set . In Step 2, for each edge , the neighbors and of and are matched, so the time complexity of Step 2 is , where is the average degree of nodes. In Step 3, computing the similarity and dissimilarity of each pair of nodes takes . In Step 4, computing the PR value of each pair of nodes takes . Finally, in Step 5, predicting the sign of each edge also takes . Therefore, the total time complexity of predicting the signs of the edges in is .

In the experimental analysis, the input social network data is an edge list stored as a matrix with rows and 3 columns. Each row is an edge: the first and second columns are the source and target nodes, respectively, and the third column is the observed sign from the source to the target node. To calculate the SPR-model, a -dimensional matrix is defined. As described above, its first three columns are still the network link data. The 4th to 11th columns are the numbers of the eight special quadrangles of each edge, respectively. The 12th to 15th columns store the values of , , and of the edge, respectively. The 16th to 18th columns are the values of , and of each edge, respectively. The 19th and 20th columns are the values of and of each edge, respectively. The 21st column is the predicted value for each edge. Hence, the space complexity is . In addition, the space complexity of calculating the neighbor set of each node is , where is the size of the node set of the network. Summarizing the above analysis, the total space complexity is .
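The three-column input and the single pass over the edges in Step 1 can be sketched in Python as follows. The file layout is an assumption: whitespace-separated integer columns with optional comment lines starting with `#`. The function names are hypothetical, and an in-memory string stands in for a data file.

```python
import io
from collections import defaultdict


def read_signed_edges(handle):
    """Parse lines of 'source target sign' into a list of integer triples."""
    edges = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and assumed comment lines
        source, target, sign = line.split()[:3]
        edges.append((int(source), int(target), int(sign)))
    return edges


def build_neighbor_sets(edges):
    """Traverse all edges once and record signed out- and in-neighbors."""
    out_neighbors = defaultdict(dict)  # node -> {target: sign}
    in_neighbors = defaultdict(dict)   # node -> {source: sign}
    for source, target, sign in edges:
        out_neighbors[source][target] = sign
        in_neighbors[target][source] = sign
    return out_neighbors, in_neighbors


# Toy input (hypothetical) in place of a real dataset file.
sample = io.StringIO("# source target sign\n0 1 1\n0 2 -1\n2 1 1\n")
edges = read_signed_edges(sample)
out_neighbors, in_neighbors = build_neighbor_sets(edges)
```

Storing the sign alongside each neighbor keeps the later matching of neighbors for each edge to dictionary lookups, at the cost of memory proportional to the number of edges.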

3. Experiments

To verify the efficiency and soundness of the link sign prediction model, experiments are conducted on three real signed social networks: Epinions, Slashdot, and Wikipedia [4]. Epinions is a consumer review site where users can read, comment on, and rate a variety of goods and services. Users can also evaluate the comments made by other users, that is, mark other users as trusted or distrusted. The Epinions dataset consists of 131828 nodes and 841372 edges, 86.0% of which are positive. Slashdot is a blog site that allows users to say whether they like or dislike other users’ comments. The Slashdot dataset consists of 82144 nodes and 549202 edges, 77.4% of which are positive. Wikipedia is an online voting network in which users vote for or against a candidate administrator. The Wikipedia dataset consists of 7118 nodes and 104359 edges, 78.4% of which are positive. The details of these three networks are shown in Table 2.

3.1. Evaluating Metrics

Experimental results are reported using three metrics: accuracy, average accuracy, and F1-score. The accuracy (acc) is defined as:

where TP, TN, FP, and FN are defined in Table 3, TPR is the true positive rate, TNR is the true negative rate, P is the number of positive edges, and N is the number of negative edges. Equation (12) shows that when positive edges greatly outnumber negative edges, the contribution of negative edge prediction is almost ignored and the result is almost completely determined by the positive edges. Therefore, the average accuracy is defined as:

Thus, predictors with a higher average accuracy achieve higher rates on both signs, even in skewed datasets, regardless of the class bias [30]. In addition, since sign prediction is a binary classification task, the F1-score is used to measure the predictive precision and recall, and it is calculated as:

where precision is TP/(TP + FP) and recall is TP/(TP + FN). The F1-score is the harmonic mean of precision and recall and serves as a trade-off between them.
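The three metrics can be computed with a few lines of Python. This sketch assumes signs are encoded as +1 and −1 with +1 as the positive class, that the average accuracy is the unweighted mean of TPR and TNR, and that both classes occur in the data and at least one edge is predicted positive (otherwise a division by zero occurs). The function names and the toy labels are hypothetical.

```python
def confusion_counts(true_signs, predicted_signs):
    """Count TP, TN, FP, FN, treating +1 as the positive class."""
    tp = tn = fp = fn = 0
    for truth, guess in zip(true_signs, predicted_signs):
        if truth > 0 and guess > 0:
            tp += 1
        elif truth < 0 and guess < 0:
            tn += 1
        elif truth < 0 and guess > 0:
            fp += 1
        elif truth > 0 and guess < 0:
            fn += 1
    return tp, tn, fp, fn


def evaluate(true_signs, predicted_signs):
    """Return accuracy, average accuracy, and F1-score as a dict."""
    tp, tn, fp, fn = confusion_counts(true_signs, predicted_signs)
    tpr = tp / (tp + fn)          # true positive rate, also the recall
    tnr = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "average_accuracy": (tpr + tnr) / 2,  # assumed unweighted mean
        "f1": 2 * precision * tpr / (precision + tpr),
    }


# Toy labels (hypothetical): four positive and two negative edges.
true_signs = [1, 1, 1, 1, -1, -1]
predicted_signs = [1, 1, 1, -1, -1, 1]
scores = evaluate(true_signs, predicted_signs)
```

On this toy example the accuracy (4/6) is higher than the average accuracy (0.625), because the majority positive class is predicted better than the minority negative class.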

3.2. Generalization across Datasets

To test the performance of the predictive model, experiments are conducted on the Epinions, Slashdot, and Wikipedia datasets. A portion of each dataset in Table 2 is extracted for testing. Table 4 shows the three resulting sub-datasets, whose edges are contained in at least one panel of Figure 2.

The performance of the predictive model is displayed in Figures 3(a), 3(c), and 3(e), which demonstrate the following. (1) When predicting based only on the PR value, the accuracies on the three datasets are 85.51%, 78.66%, and 75.34%, respectively, whereas when predicting based only on similarity-dissimilarity, the accuracies are 97.57%, 95.31%, and 90.20%, improvements of 12.06, 16.65, and 14.86 percentage points, respectively. (2) When similarity-dissimilarity is used as the decisive factor and the PR value as the auxiliary factor, the accuracies on the three datasets are all improved, which supports the soundness of the predictive model.

Since similarity-dissimilarity is computed from the numbers of quadrilaterals displayed in Figure 2, each dataset is divided into four sub-datasets according to the number of quadrilaterals in order to test its performance, as shown in Figures 3(b), 3(d), and 3(f). For Epinions, the predictive effect does not differ significantly over the four sub-datasets, and the predictive accuracy is always high, which indicates that the feature is highly robust. For Slashdot and Wikipedia, the predictive accuracy is clearly lower when the number of quadrilaterals is small than when it is larger. This suggests that these two networks offer less data from which to extract features, which is the main reason why the accuracy on these two datasets is not as good as on Epinions. Three conclusions follow. First, the Epinions network is more mature than the Slashdot and Wikipedia networks. Second, the predictive accuracy on Slashdot and Wikipedia increases as more network data become available. Third, predicting with similarity-dissimilarity is sound.

3.3. Comparison of Results

To further test the predictive performance of the SPR-model, it is compared with existing approaches: the logistic regression (LR) proposed by Leskovec et al. [4], the logistic regression based on three attributes (LR-3A) proposed by Yuan et al. [9], the supervised learning based on higher order cycles (HOC) proposed by Chiang et al. [19], the logistic regression based on Bayesian node properties (LR-BNP) proposed by Song et al. [23], the troll-trust model based on ranking proposed by Wu et al. [24], the logistic regression based on reputation and optimism (LR-RO) proposed by Shahriari et al. [26], the measures of imbalance (MOI) and the matrix factorization (MF) studied by Chiang et al. [15], the collaborative filtering (CF) introduced by Javari and Jalili [28], and the closed triple micro structure (CTMS) proposed by Khodadadi and Jalili [30]. The comparison results are shown in Table 5. To compare the approaches fairly, the experimental data in Table 5 are quoted from the previous studies. Note that in the predictive model .

Table 5 shows that the values of the SPR-model on Epinions, Slashdot, and Wikipedia are all larger than those of the other 10 approaches. This supports the feasibility and validity of its mechanism of predicting from nodal features. By comparing the results of the 10 approaches, the following conclusions can be drawn. (1) Social balance theory cannot fully explain the formation mechanism of signed social networks: although MOI-10 measures the balance of long cycles, its prediction results are still inferior to those of other algorithms. In addition, the low score of CF also illustrates that the prediction of edge signs should take full account of other features of the network rather than relying solely on structural balance. (2) Local structure carries more sign information than macro structure. In other words, nodes usually generate signed edges based on their local connections; for example, although HOC-5 learns the features of longer cycles, its predictive results are still inferior to those of other machine learning algorithms. (3) Machine learning cannot effectively capture the key signed structural features when there are too many features to learn; for example, of the nine results of the three algorithms LR, HOC-5, and LR-BNP, eight are inferior to those of LR-RO. The main reason is that LR-RO learns only two features (reputation and optimism), while the other three algorithms learn many features. (4) The main factor affecting the sign of an edge is the features of its two endpoints, followed by its local features, and finally its global features. Among these 11 algorithms, only Troll-Trust and LR-RO are comparable to the SPR-model in terms of accuracy and robustness. What these three algorithms have in common is that they predict the sign of an edge from the features of its two endpoints. The above comparative analysis indicates that the SPR-model avoids the shortcomings of the other algorithms and captures the key signed structural features.

Because actual datasets are skewed, accuracy is basically determined by the positive edges. Therefore, the average accuracy of the model is compared with that of the existing algorithms in Table 6. To compare the approaches fairly, the experimental data in Table 6 are quoted from previous studies. Since some previous studies did not report these results, fewer algorithms are compared in Table 6 than in Table 5. The SPR-model significantly outperforms the others, which supports the soundness and validity of its predictive mechanism. Among the five algorithms in Table 6, LR-RO is still the most competitive, which is consistent with the conclusion from Table 5. However, the average accuracy of the other algorithms is greatly reduced, which shows that most of the algorithms have defects in predicting negative edges. In addition, the F1-score of the SPR-model is compared with those of the LR-3A and Troll-Trust algorithms in Table 7; the experimental data show that the predictive model has high precision and recall. The comparison with state-of-the-art methods indicates that the SPR-model outperforms the others in predicting both positive and negative edges.

3.4. Analysis of Results

Figure 4 shows the experimental results plotted as a function of the threshold. As the threshold changes, the trends of accuracy and F1-score are basically synchronized, which also shows that these two metrics are mainly determined by the positive edges; moreover, both reach their optimum when the threshold is very small. The trend of the average accuracy, however, is quite different. As the threshold changes, the average accuracy first increases and then decreases, and its optimum clearly lags behind that of accuracy or F1-score. The reason is as follows: when the threshold is very small, the edge signs are mainly determined by the similarity-dissimilarity feature; as the threshold increases, a considerable part of the edges are determined by the PR value, which suggests that the PR value is superior to similarity-dissimilarity in predicting negative edges. Yet, because positive edges overwhelmingly outnumber negative ones, all three evaluation metrics decrease when the threshold is too large.

4. Discussion and Conclusion

In this paper, the SPR-model is proposed to predict edge signs in large online social networks where interactions can be both positive and negative. The model is easy to understand because it uses only two indexes to measure the interactions between nodes and their local environments.

The similarity-dissimilarity index captures the similarity and dissimilarity between a pair of nodes and can be refined into positive and negative similarity-dissimilarity. Experimental results on Epinions, Slashdot, and Wikipedia support its soundness and validity in predicting edge signs. Its main advantages in precisely predicting edge signs are as follows. The first advantage is that the index measures the common attributes of node pairs, since it is calculated from a highly symmetrical quadrilateral. The signs of bidirectional edges basically coincide, which is strongly supported by the evidence in Table 8, so it is natural to conjecture that the links in the network should be symmetrical in direction. In fact, the proportions of bidirectional links in Epinions, Slashdot and Wikipedia are , , and , respectively. The weaker prediction performance on Wikipedia might be related to the bidirectionality of its links. The second advantage is that the index keeps both social balance theory and status theory valid, or at least skillfully avoids the conflicts between them. For example, in Figure 2(a), the quadrilateral is structurally balanced when is 1, and should be the same as when nodes and have similar status.

The third, though perhaps not the last, advantage is that the prediction model makes the best possible use of the existing data to predict the missing signs of links. Previous methods are mostly based on the triangle structure, and actual data contain fewer triangles. As shown in Table 2, in Epinions, Slashdot, and Wikipedia the triangle-based data are 11.5%, 39%, and 5.8% smaller, respectively, than the data on which the model is based.

The PR value displays the sign tendency of an edge and is a weighted sum of the preference of its source node and the reputation of its target node. Nodal preference and reputation are derived from the preferential attachment mechanism, which can be described in signed social networks as follows: nodes with a larger positive/negative outdegree (or indegree) generate (or receive) a positive/negative edge with larger probability, and nodes with a smaller positive/negative outdegree (or indegree) do so with smaller probability, as shown in Figure 5. Experimental results demonstrate that negative edges have obvious features when they are generated. Therefore, it may be more effective to predict edge signs by distinguishing the features of node pairs.

In this paper, the underlying mechanism that determines the signs of links in large social networks is explored, and the conclusion is that edge signs are mainly determined by nodal or local features rather than global ones. The experimental analysis verifies the soundness and validity of the predictive model. In addition, because the features measured by the model are extracted from nodal or local structures, the model is well suited to large-scale datasets.

Data Availability

The three .txt files, Epinions.txt, Slashdot.txt, and Wikipedia.txt, are the datasets used to support the findings of this study and have been deposited in the Stanford repository at https://snap.stanford.edu/data/#signnets. The datasets are in the form of an adjacency list with three columns: the first is the source node, the second is the target node, and the third is the edge weight or sign. Epinions is a consumer review site, and its dataset includes 131828 nodes and 841372 links. Users can read or comment on a variety of goods and services and can also evaluate the comments made by other users, that is, mark other users as trusted or distrusted. Slashdot is a blog site that allows users to say whether they like or dislike other users’ comments, and its dataset contains 82144 nodes and 549202 links. Wikipedia is an online voting network in which users vote for or against a candidate administrator, and its dataset contains 7118 nodes and 104359 links.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

We would like to thank the anonymous reviewers for their constructive comments and suggestions, which undoubtedly improved the presentation of this paper. We greatly appreciate all the authors who collected and shared the Epinions, Slashdot, and Wikipedia data as benchmark networks. Finally, we would like to thank the National Science Foundation of China (No. 71471106), which supported this research.