Abstract
With the rapid development of the mobile Internet and smart devices, more and more online content providers collect the preferences of their customers through various apps on mobile devices. These preferences are largely reflected by explicit rating scores on online items. Both positive and negative ratings help recommender systems provide relevant items to a target user. Based on an empirical analysis of three real-world movie-rating data sets, we observe that users' rating criteria change over time, and that past positive and negative ratings have different influences on users' future preferences. Given this, we propose a recommendation model on a session-based temporal graph that accounts for the difference between long- and short-term preferences and for the different temporal effects of positive and negative ratings. Extensive experiments validate the significant accuracy improvement of our proposed model over state-of-the-art methods.
1. Introduction
Nowadays, a huge ecosystem of independent content providers (such as Facebook, Netflix, Google Maps, and Snapchat) and consumers (web users) is emerging on the mobile Internet. Confronted with the problem of finding a needle in a haystack, many web users resort to information filtering technology to find relevant content. Recommender systems have been deployed on the websites of many industries [1] to make web services more relevant and engaging to their users and to promote the scale and profitability of such businesses [2]. In recent decades, recommender systems have received considerable research attention, and many effective recommendation approaches have been proposed, such as social network-based models [3], graph-based models [4, 5], and context-aware models [6, 7]; an up-to-date review can be found in the work of Lu et al. [8].
Many of these works focus on movie recommendation or are based on movie-rating data sets [9, 10]. Typically, on online video-watching websites with recommender systems, users are asked to rate movies with discrete scores to express their individual opinions, where a high score usually indicates that the user likes the movie. Take https://www.netflix.com as an example: users are invited to rate movies and TV shows (items in general) on a scale from 1 star to 5 stars, where one star means "Hate It" and five stars mean "Love It." This kind of explicit feedback largely reflects user preferences. Even if a user dislikes a movie after watching it, he must have been attracted by its title, cast, director, genres, or other features; otherwise he would never have watched it. Hence, negative ratings carry much useful information and should not be neglected or treated as purely negative signals. Many works have shown that both positive and negative opinions contribute to effective recommendations.
First, given a rating scale where the highest score denotes the most positive opinion and the lowest score the most negative one, users' rating scores do not distribute evenly along the whole scale [11]. Second, different users have different rating criteria: some good-tempered users willingly give high scores, whereas other, more critical people seldom give full marks to any item they have watched [12]. Last but not least, negative ratings indicate dislike and, simultaneously, relevance; they may play either a negative or a positive role depending on the sparsity of the training set and the popularity of the corresponding items [13].
As mobile platforms become more user friendly, computationally powerful, and readily available, online content providers have begun to develop mobile apps that offer more personalized content. People can watch their favorite movies and TV shows wherever and whenever they have a break. This mobile feature poses a new challenge to recommender systems. Most previous works do not consider temporal differences in users' rating criteria. According to the memory effect of movie-watching behavior [14] and the anchoring bias phenomenon in movie-rating behavior [15], a user's current rating is influenced by his previous watching and rating history. Therefore, individual rating criteria may vary across periods, depending on the items he has previously watched. Besides, a user's negative ratings may have a temporal influence on his future preference different from that of his positive ratings.
In this paper, we empirically analyze three typical data sets created by popular online video services (MovieLens, Netflix, and MovieTweetings), with a focus on the temporal effects of each individual user's rating behavior. We concentrate on the time-varying rating criterion and on the different temporal effects of positive and negative ratings on future behavior, and we propose a session-based recommendation model that takes these temporal characteristics of user ratings into account. Compared with five state-of-the-art methods on the aforementioned movie-rating data sets, our proposed model is validated to give more accurate predictions of user preference.
2. Empirical Analysis
In this section, we empirically analyze the temporal variation of users' rating criteria and the temporal effects of positive and negative ratings, with the aim of understanding the temporal characteristics of users' rating behaviors and verifying the following two assertions.
Assertion I. The rating criterion of a user varies over time.
Assertion II. The positive and negative ratings of a user have different temporal influences on his future preference.
For the convenience of readers, we list all the notations used in this paper in “Notations.”
2.1. Data Sets
Three real-world data sets, MovieLens, Netflix, and MovieTweetings [18], are employed in this paper. The MovieLens data were collected by the GroupLens Research Project at the University of Minnesota through the MovieLens website (https://movielens.umn.edu) during the seven-month period from September 19, 1997, to April 22, 1998. It consists of 943 users, 1682 movies, and discrete rating records from 1 star to 5 stars. The Netflix data are a random sample from the original Netflix data set provided by http://Netflixprize.com, comprising 97367 rating scores of 3000 users on 3000 movies from January 2005 to May 2005. MovieTweetings is a data set of movie ratings contained in well-structured tweets on Twitter. After removing users who rate fewer than 10 movies from the 100 K-record snapshot, the MovieTweetings data set consists of 67040 ratings of 2583 users on 9111 movies from February 28, 2013, to September 2, 2013. The basic statistics of these data sets are presented in Table 1.
2.2. The Rating Criterion
In this paper, we investigate users' rating criteria in two respects: average rating score and rating scale. Specifically, we treat the monthly average rating score and rating scale of each user as two independent random variables and estimate their standard deviations across months. To obtain a reliable estimation, we consider only users who are active in more than 2 months of the whole period. Figures 1(a), 1(b), and 1(c) show the distributions of the standard deviation of average rating scores for the MovieLens, Netflix, and MovieTweetings data sets, respectively. The mean deviations for the three data sets (0.36427, 0.56429, and 0.79849) are all significantly greater than 0 (one-sided t-test). Similarly, Figures 1(d), 1(e), and 1(f) show the distributions of the standard deviation of users' rating scales; the mean values for MovieLens, Netflix, and MovieTweetings are 0.93737, 0.93495, and 1.05155, respectively (also significant under a one-sided t-test). These observations indicate that users' rating criteria change significantly over time, providing empirical evidence for Assertion I.

Figure 1: Distributions of the standard deviation of monthly average rating scores (panels (a)-(c)) and monthly rating scales (panels (d)-(f)) for MovieLens, Netflix, and MovieTweetings.
2.3. The Positive and Negative Ratings
Since the rating criterion varies from person to person, we take the median score of each individual user, instead of the median of the systematic rating scale, to distinguish his own positive ratings (scores no less than his median) from his negative ratings (scores less than his median).
We use a session to represent a continuous period of user activity; thus the records of a user can be divided into several sequential sessions. In this paper, sessions are divided by month; that is, two ratings of the same user are in the same session if and only if they occur in the same month. For a user, the items he rates positively in a session constitute his positive item set for that session, and the negative item set is defined similarly. For a target user, we take his latest positive item set as his future preference, and all earlier positive and negative item sets are treated as his previous interests.
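As an illustration, the monthly session split and the per-user median threshold can be sketched as follows; the tuple format of the rating log is a simplified assumption for this sketch, not the paper's actual data schema:

```python
from collections import defaultdict
from statistics import median

def split_sessions(ratings):
    """Group one user's (month, item, score) records into monthly sessions,
    splitting each session into positive/negative item sets by the user's
    own median score (scores no less than the median count as positive)."""
    med = median(score for _, _, score in ratings)  # user-specific threshold
    sessions = defaultdict(lambda: {"pos": set(), "neg": set()})
    for month, item, score in ratings:
        key = "pos" if score >= med else "neg"
        sessions[month][key].add(item)
    return dict(sessions)
```

The latest session's `"pos"` set then plays the role of the user's future preference, and all earlier sets are his previous interests.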
The correlation between two item sets is defined as the cosine similarity averaged over all item pairs with one item from each set. Figure 2 plots the correlation between the future preference and the previous positive item sets (the black line) and negative item sets (the red line) against the time gap, averaged over all users, for the MovieLens, Netflix, and MovieTweetings data sets. We can see that the future preference of a user is clearly more influenced by his past positive ratings than by his past negative ratings. From the temporal point of view, the bigger the time gap, the less the future preference is influenced by the previous positive/negative ratings. However, the decay rates of the influences of positive and negative opinions vary across data sets. For MovieLens, the influence of positive ratings on future preference is more stable than that of negative ratings, while for Netflix the decay rates of the two influences are very similar to each other.

Figure 2: Correlation of previous positive (black) and negative (red) item sets with the future preference against the time gap, for (a) MovieLens, (b) Netflix, and (c) MovieTweetings.
Since the first and last sessions of the MovieTweetings data set contain only 1 day and 2 days of data, respectively, we ignore the last points of the curves, with a time gap of 6 months. In contrast to the above observations, we find that the influence of negative ratings is more stable than that of positive ratings in MovieTweetings. Therefore, a user's positive and negative ratings have different temporal influences on his future preference, providing empirical evidence for Assertion II.
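The averaged cosine-similarity correlation between two item sets can be sketched as below, assuming each item is represented by a sparse (user -> rating) vector; the dict-based vector format is an assumption of this sketch:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse item vectors (dicts user -> rating)."""
    num = sum(u[k] * v[k] for k in u.keys() & v.keys())
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

def set_correlation(set_a, set_b, vectors):
    """Cosine similarity averaged over all pairs with one item from each set."""
    pairs = [(a, b) for a in set_a for b in set_b]
    if not pairs:
        return 0.0
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)
```

Averaging `set_correlation` between each past session's positive (or negative) set and the latest positive set, over all users, yields one point of the curves in Figure 2.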
3. Recommendation Model
Based on the session-based temporal graph (STG) introduced by Xiang et al. [4], we propose a session-based recommendation model with the temporal effect of user preferences (STeuP), an enhanced version of the Injected Preference Fusion (IPF) model associated with the STG. Users and items are represented by user nodes and movie nodes, respectively. To represent a user's ratings at different periods, we associate a session node with the movies rated by that user in each session. These three types of nodes are connected by weighted directed edges between user and movie nodes and between session and movie nodes, in both directions. The edges attached to session nodes reflect the short-term rating criteria of users, while the edges attached to user nodes reflect users' long-term preferences. Figure 3 gives an example of a session-based temporal graph.
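A minimal sketch of the graph construction, using a plain adjacency-dict representation with tagged node tuples; the representation and the uniform initial weight of 1 are assumptions of this sketch (the model re-weights edges by normalized scores and temporal decay afterwards):

```python
from collections import defaultdict

def build_stg(records):
    """Build a session-based temporal graph as {node: {neighbor: weight}}.
    Nodes are tagged tuples: ("u", user), ("i", item), ("s", user, session).
    `records` holds (user, item, session) triples; every edge between a
    user/session node and a movie node is added in both directions."""
    g = defaultdict(dict)
    for user, item, session in records:
        u, i, s = ("u", user), ("i", item), ("s", user, session)
        for a, b in [(u, i), (i, u), (s, i), (i, s)]:
            g[a][b] = 1.0  # placeholder weight, re-set by the model
    return g
```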

To eliminate the effect of different rating criteria of different individuals, the rating score of a user is normalized according to his own rating scale, r'(u, i) = (r(u, i) - r_min(u)) / (r_max(u) - r_min(u)), where r_max(u) and r_min(u) are the highest and lowest scores made by user u; this reflects the user's long-term rating criterion. In this way, the rating scores of all users are mapped onto [0, 1], where the maximum rating score of each user is set to 1 and the minimum rating score is fixed at 0.
Since the short-term rating criterion of a user varies across periods, we also normalize his rating score within a particular session in the same min-max manner, using the highest and lowest scores in that session.
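Both normalizations are plain min-max rescalings; a sketch follows, where returning 1.0 for the degenerate case of a user (or session) with a single distinct score is our own convention:

```python
def normalize_long_term(r, r_min, r_max):
    """Map a raw score onto [0, 1] using the user's overall rating scale
    (long-term criterion): his maximum score -> 1, his minimum -> 0."""
    if r_max == r_min:      # degenerate scale: user always gives one score
        return 1.0
    return (r - r_min) / (r_max - r_min)

def normalize_session(r, s_min, s_max):
    """Same min-max mapping, but over the highest and lowest scores
    within one session (short-term criterion)."""
    if s_max == s_min:
        return 1.0
    return (r - s_min) / (s_max - s_min)
```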
Recall that our recommendation task is to recommend movies for a target user to watch in the future. Naturally, a rating whose occurrence time is closer to the target time is more useful to this task. Since the temporal influences of positive and negative ratings may differ, following previous works [19, 20] we use two exponential decay functions, exp(-α(t0 - t)) for positive ratings and exp(-β(t0 - t)) for negative ones, to model the relevance of a rating made at time t to the user's preference at the target time t0; the edge weights between user and movie nodes are the normalized scores multiplied by these decay factors. Similarly, within a session, a rating closer to the target time is more important. We use the same exponential functions to model the temporal influences of positive and negative ratings within a session, taking the median rating value in that session to distinguish positive from negative ratings; the edge weights between session and movie nodes are computed accordingly.
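The decayed weighting can be sketched as follows; using 0.5 on the normalized [0, 1] scale as the positive/negative boundary is a simplifying assumption of this sketch (the model uses the user's, or session's, median score):

```python
import math

def temporal_weight(norm_score, t, t0, alpha, beta):
    """Weight a normalized rating by exponential decay toward the target
    time t0. Positive ratings decay at rate alpha, negative ratings at
    rate beta; larger rates mean faster forgetting."""
    gap = t0 - t
    rate = alpha if norm_score >= 0.5 else beta   # simplified pos/neg split
    return norm_score * math.exp(-rate * gap)
```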
After setting the initial edge weights of the STG, we normalize them. A larger value of the balance parameter indicates that users' long-term preferences play a more important role in preference propagation.
Given a target user, the basic idea of preference propagation is to first inject initial preference into both the user node and his latest session node, and then propagate this preference to candidate movie nodes through various paths in the graph. As defined in [4], the preference propagated along each path is the product of the initial preference assigned to the source node (the target user node or his latest session node) and the weights of all edges on the path; an injected value of 0 on the user node means no preference enters from the user side, and likewise for the session node. Similar to the previous work [4], we consider only the shortest paths (distance = 3) from the source nodes to unknown movie nodes, which can be obtained efficiently by Breadth-First-Search. The estimated preference of the user on an unknown movie is then the sum, over all such shortest paths, of the injected preference multiplied by the path weight defined as (6). The top-ranked movies sorted by preference value are then recommended.
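A brute-force sketch of the length-3 propagation over the adjacency-dict graph; for clarity it enumerates all length-3 paths rather than running an explicit Breadth-First-Search, and (unlike the model) it does not filter out movies the user has already rated:

```python
def propagate(graph, user_node, session_node, eta):
    """Score movie nodes by summing weight products over all length-3 paths
    from the two source nodes, with preference eta injected at the user node
    and (1 - eta) at the latest session node."""
    scores = {}
    for src, injected in [(user_node, eta), (session_node, 1.0 - eta)]:
        if injected <= 0.0:
            continue  # no preference enters from this source
        for n1, w1 in graph.get(src, {}).items():
            for n2, w2 in graph.get(n1, {}).items():
                for n3, w3 in graph.get(n2, {}).items():
                    if n3[0] == "i":   # only movie nodes accumulate preference
                        scores[n3] = scores.get(n3, 0.0) + injected * w1 * w2 * w3
    return scores
```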
4. Experiment Results
4.1. Evaluation Metrics
In order to predict users' future preferences based on past interaction records, all records are listed in ascending order of rating time. For all data sets, we take the records that occurred in the latest 30 days as the probe set and the remaining records as the training set. The training set is treated as known information, while no information from the probe set is allowed to be used for recommendation. Moreover, we denote the latest time in the training set as the target time. In this paper, four typical metrics are employed to evaluate the accuracy, diversity, novelty, and coverage of recommendation results.
4.1.1. Accuracy
Accuracy is one of the most important evaluation metrics of a recommender system. Both Precision and Recall measure the accuracy of recommendation: Precision is the fraction of recommended items that are relevant, while Recall is the ratio of the number of relevant items in the recommendation list to the number of preferred items in the probe set. However, Precision and Recall behave like two ends of a seesaw: given a fixed length of the recommendation list, when one rises, the other falls. The F measure is used to find a suitable trade-off between them, F = 2 · Precision · Recall / (Precision + Recall), where Precision and Recall are averaged over all users; for each user they are computed from the number of relevant items in his recommendation list, the list length, and the number of his preferred items in the probe set. Generally speaking, for a given length of the recommendation list, the method with the higher F value is the better one.
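A sketch of the F measure under the per-user definitions above; averaging precision and recall over users before combining them (macro averaging) is our reading of the setup:

```python
def f_measure(hits, probe_counts, L):
    """F = 2PR/(P+R), with P and R averaged over users. `hits[u]` is the
    number of relevant items in user u's top-L list, `probe_counts[u]`
    the number of u's preferred items in the probe set."""
    n = len(hits)
    precision = sum(hits[u] / L for u in hits) / n
    recall = sum(hits[u] / probe_counts[u] for u in hits) / n
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0
```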
4.1.2. Diversity
Diversity measures the difference between the recommendation lists of different users. An excellent algorithm should recommend items as widely distributed as possible, because people appreciate personalized suggestions. We use the Hamming distance to measure the diversity of recommendation lists: for two users, it equals 1 minus the ratio of the number of items common to their two lists to the list length, so it is 0 if the two users receive identical lists. Diversity is defined as the mean Hamming distance over all user pairs.
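A sketch of the mean Hamming distance over all user pairs, under the definition above:

```python
def diversity(rec_lists, L):
    """Mean Hamming distance H = 1 - C/L over all user pairs, where C is
    the number of items shared by the two users' top-L lists."""
    users = list(rec_lists)
    pairs = [(a, b) for idx, a in enumerate(users) for b in users[idx + 1:]]
    if not pairs:
        return 0.0
    total = sum(1 - len(set(rec_lists[a]) & set(rec_lists[b])) / L
                for a, b in pairs)
    return total / len(pairs)
```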
4.1.3. Novelty
Novelty quantifies the capacity of a method to generate novel and unexpected recommendations, to which less popular items (i.e., items of low degree) that are unlikely to be known previously contribute greatly. It can be simply measured as the average degree of the recommended items: for a target user, novelty is defined as the mean degree of the items in his top-L recommendation list [21]. Averaging the novelty over all users, we obtain the novelty of the system.
4.1.4. Coverage
Coverage measures the percentage of items that an algorithm is able to recommend to users. It is calculated as the ratio of the number of distinct items appearing in users' recommendation lists to the total number of items in the system; an item counts if and only if it is recommended to at least one user. Undoubtedly, recommending mostly popular items results in lower coverage.
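Both novelty and coverage reduce to simple degree and set computations; a sketch, where `item_degree` (the number of users who rated each item) is assumed to be precomputed from the training set:

```python
def novelty(rec_lists, item_degree, L):
    """Per-user novelty = mean degree of his L recommended items;
    system novelty = mean over users (a lower value is more novel)."""
    per_user = [sum(item_degree[i] for i in lst) / L
                for lst in rec_lists.values()]
    return sum(per_user) / len(per_user)

def coverage(rec_lists, n_items):
    """Fraction of all items recommended to at least one user."""
    distinct = set().union(*rec_lists.values())
    return len(distinct) / n_items
```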
4.2. Parameter Adjustment
Before comparing the proposed model with the baseline methods, we investigate the impacts of its four parameters on the performance of the STeuP model. As shown in Section 2.3, the temporal effects of positive and negative opinions may differ across online websites. Thus, we first examine the effect of the decay parameters α and β, which govern the decay rates of the temporal influence of positive and negative opinions, respectively, on users' future preference. The bigger α and β are, the less future behavior is affected by users' past positive and negative opinions. Without loss of generality, we fix the other two parameters when tuning α and β.
Figures 4(a), 4(b), and 4(c) plot heat maps of the F value against the parameters α and β (both on logarithmic axes) for MovieLens, Netflix, and MovieTweetings, respectively, with the F values indicated by color. Firstly, we observe that the F value is more sensitive to α than to β; that is, given a fixed value of β, the F value varies over a much wider range when traversing α. Secondly, the results on all data sets show an obvious "ridge" along the β-axis, where the optimal F value lies. Hence, we can first fix β to a small value and tune α to find a locally optimal value, and then fix α and adjust β to reach the globally optimal accuracy.

Figure 4: Heat maps of the F value against the two decay parameters for (a) MovieLens, (b) Netflix, and (c) MovieTweetings.
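The coarse-to-fine search suggested by the ridge in the heat maps can be sketched as follows; `evaluate` is a hypothetical hook that returns the F value of a full train/test run for one pair of decay parameters:

```python
def tune_alpha_beta(evaluate, alphas, betas, beta_init=1e-4):
    """Two-stage grid search: fix the negative-opinion decay rate at a
    small initial value and pick the positive-opinion rate maximizing F,
    then sweep the negative-opinion rate with the first one fixed."""
    best_alpha = max(alphas, key=lambda a: evaluate(a, beta_init))
    best_beta = max(betas, key=lambda b: evaluate(best_alpha, b))
    return best_alpha, best_beta
```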
By setting both decay parameters α and β to 0, we obtain the recommendation results without temporal influence, presented in Table 2. We can see that weighting users' positive and negative opinions with different temporal decay rates leads to performance improvements. From the values of α and β at the optimal F value, we find that the temporal decay rate of positive opinions is much smaller than that of negative opinions on MovieLens, but bigger on MovieTweetings; for Netflix, the two decay rates are almost the same. This again validates our inference in Section 2.3 on the different temporal influences of positive and negative opinions in the three data sets.
In our STeuP model, a parameter controls the ratio of the preference injected into the user node against the session node. If it equals 0, no preference is injected into the user node; if it equals 1, no preference is injected into the session node. Thus, this parameter balances the effect of long-term and short-term interests in the initial phase: the larger it is, the stronger the influence of long-term preferences. Figure 5 shows how accuracy changes with this parameter for the three data sets. Firstly, the results show that ignoring long-term preferences (setting the parameter to 0) cannot generate good results. Secondly, the sparser the data set, the bigger the optimal value of this parameter. Generally speaking, optimal results are obtained by combining long-term and short-term interests. In the following discussion, we fix this parameter to 0.5, 0.9, and 1.0 for the MovieLens, Netflix, and MovieTweetings data sets, respectively.

Figure 5: Accuracy against the parameter balancing long-term and short-term preference injection, for (a) MovieLens, (b) Netflix, and (c) MovieTweetings.
Another parameter balances the influence of long-term and short-term preferences during preference propagation: at one extreme, movie nodes propagate preference only to user nodes, so item-item similarity depends only on users' long-term preferences, and vice versa at the other extreme. Figures 6(a), 6(b), and 6(c) plot the change of the F value against this parameter (on a logarithmic axis) for the three data sets. For the MovieLens and Netflix data sets, a value close to 1 corresponds to the optimal F value, while the optimal value for MovieTweetings deviates from 1. This observation verifies that both long-term and short-term opinions are important for measuring item similarity. Furthermore, users' long-term opinions are more important than short-term opinions on sparse data sets.

Figure 6: F value against the parameter balancing long-term and short-term preferences in propagation, for (a) MovieLens, (b) Netflix, and (c) MovieTweetings.
4.3. Comparison of Methods
We compare our proposed model with the five other models listed in Table 3; the marks in the table indicate whether a model distinguishes user ratings into positive and negative opinions and/or considers temporal influence. UCF is a classical collaborative filtering method that calculates the similarity between users based on rating information. NBI is a network-based inference algorithm that provides more efficient and effective recommendation than collaborative filtering. Neither of these two methods distinguishes positive from negative opinions or considers temporal influence. UOS is an improved user-based collaborative filtering model that estimates inter-user similarity via the spreading of users' positive and negative opinions on the user-item bipartite network. SNBI is an enhanced version of NBI that assigns different weights to positive and negative opinions. UOS and SNBI distinguish users' ratings into positive and negative opinions but do not exploit temporal influence. IPF is the recommendation model proposed together with the session-based temporal graph; it is based on binary data and distinguishes users' long-term and short-term preferences. STeuP is our proposed model based on the two assertions in Section 2, which takes into account both the temporal variation of users' rating criteria and the different temporal effects of users' positive and negative ratings.
Moreover, we also checked the performance of a well-known matrix factorization recommendation algorithm [22], which is very successful in rating prediction with the help of time information. However, its F values on the aforementioned three data sets are all below 0.001, perhaps because it is ill-suited to binary preference prediction. Hence, we do not include the matrix factorization method in the comparison.
Given the recommendation list length, the recommendation performance of these six methods for MovieLens, Netflix, and MovieTweetings is reported in Tables 4, 5, and 6, respectively.
(i) Collaborative filtering methods (UCF and UOS) provide higher diversity, novelty, and coverage, but lower accuracy, than graph-based methods (NBI and SNBI).
(ii) Distinguishing users' ratings into positive and negative opinions helps both the memory-based CF method and the graph-based method provide more accurate recommendations: the F values of UOS and SNBI are better than those of UCF and NBI, especially for the graph-based method, and the diversity, novelty, and coverage of SNBI all improve over NBI. However, on the sparse MovieTweetings data set, UOS and SNBI do not work as well as expected: the results of SNBI are almost the same as those of NBI, while the F value of UOS is even worse than that of UCF.
(iii) The IPF model, which considers users' long-term and short-term temporal influence, provides quite good results on all data sets, except for its accuracy on Netflix.
(iv) The proposed STeuP model performs best in accuracy on all data sets. Besides, it provides better diversity, novelty, and coverage than the other graph-based methods, especially on the sparse data sets (Netflix and MovieTweetings).
5. Conclusions
In this paper, we analyzed users' dynamic rating criteria and discovered the different temporal effects of users' positive and negative opinions on future preference. We proposed a session-based movie recommendation model that exploits this dynamic rating information, simultaneously considering the temporal change of users' rating criteria and the different temporal effects of positive and negative ratings. Extensive experiments on three real-world movie data sets show that the accuracy of our proposed model is significantly improved compared with state-of-the-art methods.
Notations
| : | A user, user set |
| : | A movie, movie set |
| : | The session set of user |
| : | The -th session of user |
| : | The rating of user on movie |
| : | The highest and lowest rating scores made by user |
| : | The middle value of user ’s ratings |
| : | The positive item set in |
| : | The negative item set in |
| : | The highest and lowest scores in |
| : | The edge sets from node set to node set |
| : | The number of relevant items (namely, the items collected by the user in the probe set) in the recommendation list of |
| : | The number of selected items in user ’s probe set |
| : | The time stamp when user rates movie |
| : | The decay factors controlling the extent of temporal influences of positive and negative ratings |
| : | The neighbor set, neighbor session set, and neighbor user set of item |
| : | The parameter used to adjust the preference propagation of an item node to its user neighbors or session neighbors |
| : | A node and a path on the session-based graph |
| : | The weight of edge in path |
| : | The value of injected preferences on the source node |
| : | The parameter used to tune the ratio of injected preferences on the user node against the session node. |
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work is supported by National Natural Science Foundation of China (Grant no. 61673085) and Science & Technology Department of Sichuan Province (Grant no. 2016GZ0081).