Abstract
In recent years, streaming music platforms have become very popular mainly due to the huge number of songs these systems make available to users. This enormous availability means that recommendation mechanisms that help users to select the music they like need to be incorporated. However, developing reliable recommender systems in the music field involves dealing with many problems, some of which are generic and widely studied in the literature while others are specific to this application domain and are therefore less well-known. This work is focused on two important issues that have not received much attention: managing gray-sheep users and obtaining implicit ratings. The first one is usually addressed by resorting to content information that is often difficult to obtain. The other drawback is related to the sparsity problem that arises when there are obstacles to gather explicit ratings. In this work, the referred shortcomings are addressed by means of a recommendation approach based on the users’ streaming sessions. The method is aimed at managing the well-known power-law probability distribution representing the listening behavior of users. This proposal improves the recommendation reliability of collaborative filtering methods while reducing the complexity of the procedures used so far to deal with the gray-sheep problem.
1. Introduction
In the digital era, where e-commerce and digital content distribution are so extended, recommender systems have become indispensable tools to help users to find the information, products, or services they are interested in. These systems are especially useful in the area of music streaming services, given the large volume of content they make available to listeners. Most streaming platforms have advanced filtering mechanisms and even music recommender systems. However, user satisfaction data indicate that their reliability is not very high [1, 2]. This may be due to numerous problems with the recommendation methods that occur irrespective of the application domain, as well as those specific to the field of music.
Collaborative filtering (CF) is the most extended recommendation approach and one of the most reliable. Its main characteristic is the use of ratings given by users to items to be recommended. The ratings are stored in a U × I matrix, where U is the number of users and I the number of items in the system. The GroupLens research system for Usenet news [3] was the first recommender system using CF, and Ringo [4] was one of the first and most popular music recommender systems based on CF.
There are two categories of CF methods: user-based and item-based. In the first, the active user receives recommendations of items that have been positively rated by other users with similar tastes, that is, his/her nearest neighbors. These users have rated items in common with the active user in a similar way. Neighborhood can be computed by means of different similarity metrics; the most widely used are the Pearson correlation coefficient and cosine similarity [5]. Since the nearest neighbors are searched at recommendation time, user-based CF methods are also called memory-based. One of their main problems is scalability: the user response time grows sharply as the number of users and the number of products in the system increase. In order to avoid this problem, item-based CF was proposed [4]. In this approach, rating-based similarities between items are computed before recommendation time, and then the active user receives suggestions of items similar to those he/she previously rated positively. This can be done since new ratings given to items in large databases are not expected to change the similarity between them significantly, especially for frequently rated items. Methods of this type are also called model-based since they make use of a model induced before the active user accesses the system. However, recommendations provided by item-based methods are usually of lower quality than those provided by user-based approaches. Therefore, they are best suited to large-scale systems where scalability is a serious problem. That is the case of Amazon, a very popular system where item-based models have been used [6].
The need for the explicit expression of the user’s personal preferences for items in the form of ratings is the cause of the other major drawback of CF: the sparsity problem, which arises when the rating matrix contains a large number of null elements. This means that the number of ratings obtained from the users is fewer than the number of ratings needed for prediction [7]. Matrix factorization approaches can be used to deal with this problem, but these methods have some disadvantages, such as the cost of building the models and the loss of information resulting from the dimensionality reduction, and these are not always compensated with a significant improvement of results [8]. Thus, in many cases, it is more effective to resort to implicit ratings that can be obtained from the time that users spend examining the items or from other data stored in log files; although in this case, it must be assumed that preferences derived from this information are usually not as reliable as the explicit ones.
Content-based methods are alternative approaches to CF that base recommendations on the similarity between items, as item-based techniques. Nevertheless, they do not need rating data since they make use of other features of the items for computing the similarity. These methods can be applied to address two well-known shortcomings of both user-based and item-based CF: early-rater (first-rater) and cold-start. The first drawback is observed when new products are introduced into the system. These items have never been rated; therefore, they cannot be recommended. The cold-start problem affects new users, who cannot receive recommendations because they have no or few evaluations about products. In these circumstances, item content is used to make recommendations of items similar to those that the user likes. Content-based methods are also used to address the gray-sheep problem suffered by users with unusual tastes, for whom it is very difficult to find neighbors [9].
Other proposals to deal with the abovementioned shortcomings of recommendation methods seek to build on the strengths of every category and avoid their weaknesses by means of hybrid approaches. This class of methods, which is currently the most extended, involves the combination of either different types of CF or CF with content-based techniques, among others [10].
Music recommender systems have other additional limitations that are specific to this application domain. On the one hand, explicit ratings are usually not available in the streaming platforms, so ratings are obtained from implicit feedback. This is the main difference between music and other domains: while items such as books and other products can be evaluated from their purchase records, musical items in the streaming platforms cannot be evaluated in that way because they are not purchased individually. Another difference is the way these items are consumed. While a book or a movie is generally read or watched by a given user once, a song is usually listened to many times. This quality can be used to derive ratings from the number of times that users play a given song or artist, but this is not a trivial task, given its characteristic frequency distribution. The frequency of plays of musical items (artists or songs) adopts a power-law distribution since high frequencies of plays are concentrated in very few items, while the remaining ones are part of the long tail of the curve [11]. Simple frequency functions usually used to transform plays into ratings are not suitable in this case.
The complexity of recommender systems has been increasing in recent years as they have evolved from initial CF or content-based systems to current systems that mostly use hybrid methods. In the latter, the level of complexity of the procedures and the information to be processed is much higher. While the basic collaborative filtering methods use simple algorithms such as k-nearest neighbors or matrix factorization, the hybrid methods combine these techniques with more complex machine learning algorithms. In addition, current systems do not only use explicit preference information but are able to infer that knowledge from user behavior and manage other user and item attributes. Unlike traditional systems, this usually requires collecting and processing not only static but also dynamic information, which entails greater difficulty.
The work presented in this paper deals with that complexity while addressing two important drawbacks of recommender systems: sparsity and the gray-sheep problem. Both are considered based on an analysis of the power-law distribution. Data sparsity is avoided by means of inducing implicit ratings, which is affected by that type of distribution. The gray-sheep problem, which has received little attention in the area of music recommendation, is closely related to the power-law distribution since gray-sheep users are those that listen to music that is mainly placed in the tail of the play frequency curve.
Taking into account the stated objectives, the main contributions of the work are the following:
(i) A procedure for inferring user-song ratings from implicit feedback in which user sessions are considered
(ii) A method to tackle the gray-sheep problem that involves the characterization of each user according to the play frequency of the songs he/she listens to
An important advantage of our proposal is the fact that it requires only information about the plays of songs by users, without the need for content information. This in turn leads to a reduction in the complexity of the methods used so far to make recommendations to users with unusual tastes.
The rest of the paper is organized as follows. Section 2 contains a brief description of related works. The approach proposed to improve CF-based recommendations is detailed in Section 3. Results and discussion are given in Section 4, and the last section of the paper is devoted to the conclusions.
2. Related Work
The drawbacks of the recommendation methods have been the focus of many works in literature [12]. Gray-sheep, cold-start, and early-rater problems have been addressed mainly by making use of content-based approaches. The content information about musical items can be extracted from their metadata, such as title, artist, year, genre, or lyrics for songs and style, country, and other demographic data for artists. Recently, social tags given to items by users are also taken as content attributes of the items, and even biographies of the artists have been used to obtain content data [13]. Besides those high-level features, low-level audio features are also exploited by means of content-based methods in many works. Spectrum, rhythm, and harmony-conforming chord structure are used in [14] to determine music similarity. In [15], music is classified in melody styles as a preliminary step to learn user music preferences by mining the melody patterns from the music access behavior of the users. Pitch, tempo, loudness, and entropy features are taken in [16] to classify musical items. Metadata, including title, artist, genre, and the lyrics of a musical piece, are used as content information by the recommendation process. A clustering technique is proposed in [17] to group similar songs from audio features. The aim is to provide users with recommendations from the appropriate clusters according to their listening behavior. The cold-start problems are addressed in [13] by means of deep network architectures used to combine user feedback data with artist and track embeddings. These are learned from biographies and audio signals, respectively. In [18], tempo, timbre, and rhythm features, jointly with tags provided by users, are used in a method for recommending appropriate music for videos. Each video or music is represented as a linear combination of latent factors of their associated features, and this model is used to calculate similarities on new feature spaces. 
Low-level description of the music is also used in [19] for emotion recognition and genre classification. These two features are learned by means of a recurrent neural network and later used as input of a support vector machine (SVM) in order to improve its results against the use of the music original features as input.
Since content-based methods usually produce worse results than CF, hybrid approaches have been proposed for improving recommendation reliability while addressing some of the problems mentioned above [20–22]. They have also been used in the music application domain for the same purpose. In [23], unobservable user preferences are represented as a set of latent variables associated with ratings and content data, which are statistically estimated and introduced in a Bayesian network called a three-way aspect model. The hybrid proposal presented in [18] combines a content-based model for recommending unrated music, a collaboration algorithm for recommendations based on other users’ suggestions, and an emotion-based recommendation procedure that determines interesting music for users by computing the differences between user interests and musical emotions. A weighting system based on user listening behavior is used to combine the three methods. A questionnaire that users must fill out is necessary to discover their interests, which can be a drawback since users are not always willing to do so.
In recent years, social information has been incorporated into recommender models, either as additional attributes in CF or in hybrid recommender systems [24, 25]. In [26], topics associated with songs are induced from social tagging. Social tags assigned to songs are used in [27] to establish the similarity between them as well as to capture user preferences. Another more innovative use of social tagging is the inference of user expertise in order to find more trusted neighbors for CF [28]. Friendship relations between users of streaming platforms are a different type of social information that can be treated jointly with user preferences in order to improve music recommendations [29].
Fewer works in the literature focus specifically on improving recommendations for gray-sheep users, even though the rest of the users are also affected by this problem: it has been proven that the existence of a large number of individuals with unusual preferences might have an important impact on the recommendation quality of the entire community [30]. Content-based and hybrid methods, as previously described, can produce some improvement, but it is not usually very significant. Moreover, they require additional information that may not be available. Semantic web mining can also be used to solve gray-sheep and other typical problems of recommender systems. Semantic information is added to the existing data in order to formalize and classify product and user features. In this way, content-based models at different abstraction levels can be generated to provide recommendations based on those taxonomies. They can be combined with other approaches in order to improve recommendations [12, 31]. In [32], the authors make use of domain ontologies to classify users and items in a multilayered community of interests prior to the similarity computation. The main drawback of this type of method is that it is not easily extendible, since every application domain would involve the time-consuming task of defining a specific ontology. A different approach in this line is presented in [33], where a framework for semantic-aware recommendations is proposed. In this work, concepts are automatically extracted from heterogeneous information sources, and relations between concepts are established on the basis of temporal-spatial information. The procedures involved in the framework are complex and are defined for a specific application scenario. In general terms, the methods reported in the literature to address the gray-sheep problem are very complex.
Clustering is an alternative and simpler procedure to treat users with few neighbors [34]. In [30], an extensive review of recommender systems based on diverse clustering techniques is reported. The work also includes a new proposal involving the application of the k-means algorithm to generate clusters in order to detect the gray-sheep users and a recommendation procedure for them based on their profiles. In addition, a clustering-based collaborative filtering algorithm is used to give recommendations to the remaining users. The authors also analyze the effect of different distance metrics on the quality of the recommendations. In some works, the clustering technique is used to address the sparsity and gray-sheep problems at the same time since some authors consider that both problems are related. In [20, 35], fuzzy class association rules are induced from previously clustered data in order to assign more than one cluster to each user with different degrees of belonging. A simulated scenario for gray-sheep users proved the effectiveness of the method. The process, implemented in a tourist system, is not simple and requires user and item features. The Last.fm dataset is used in [36] to validate a hierarchical agglomerative clustering method for recommending resources in folksonomies, which considers the users’ current navigation context in cluster selection. As far as we know, most of the methods proposed for dealing with the gray-sheep problem make use of user and/or item attributes.
The sparsity problem, caused by the insufficient number of ratings, has been widely studied in the literature. Apart from content-based methods, there are two main approaches to deal with this drawback: matrix factorization and the use of implicit user feedback to derive ratings. Matrix factorization methods have the peculiarity that they can be applied with both explicit and implicit ratings. They are procedures for dimensionality reduction that generate latent factors for each user and each item. The most widespread technique in the area of recommender systems for factorizing the rating matrix is singular value decomposition (SVD) [37]. In some application domains, SVD yields more reliable recommendations than standard CF algorithms [38]. However, it has a high computational cost in large-scale systems; thus, less expensive SVD-based approaches, such as incremental SVD, have been proposed [39]. Sparsity has also been addressed by means of other SVD-based techniques for dimensionality reduction, such as latent semantic indexing (LSI) [40] and principal component analysis (PCA) [41].
In the music recommender area, there are some works in which matrix factorization-based procedures have been proposed. The proposal of [42] using weighted matrix factorization (WMF) with implicit ratings in recommender systems has been taken in [43] as a basis of their method for song recommendation where latent factors for a given song are predicted from its audio signal. In [44], WMF is also used with the same purpose but using the number of song plays as implicit feedback.
A way to address the sparsity problem when using implicit feedback is presented by Yu et al. [45]. A model that combines the Poisson factor model and the Bayesian personalized ranking is proposed to learn user preferences and item characteristics from the frequency of interactions between users and items. Implicit ratings are also usually obtained from purchase records. In [46], log files of a mobile web application are used to identify actions, such as purchases, prelistening, and clicks, in user sessions. This information regarding the purchasing behavior of users is aimed at obtaining implicit ratings. As stated previously, those data are not available when consuming music through streaming services; thus, the usual way of obtaining implicit feedback in that context is making use of the frequency of plays. This information is provided in the Last.fm database and used in some research works where different functions for transforming it into ratings are proposed [47, 48]. However, other kinds of information can be used, such as the access history of users, which is taken in [16] to obtain user interests in a music recommendation system based on music and user groupings. In [49], a session-based collaborative filtering recommendation method is proposed, which can be used to recommend the next song the user should listen to, even when no previous user rating data are available. This method uses the items selected in the active user session to find the most similar sessions and generate the recommendation from them.
3. Improving CF Approaches for Song Recommendation
The main advantage of the proposal for recommending songs presented here is the fact that only data about the plays of the songs by each user are required. Since this information is collected by the streaming systems in an easy and regular way, some drawbacks regarding the need to acquire additional data, as explicit ratings, music metadata, or audio features, are avoided. The work is the continuation of a previous proposal for artist recommendation [50] and another preliminary study [51], which has been extended and adapted for recommending songs. The improvement of results compared to the main CF methods is achieved by focusing on two major aspects: a new way of obtaining implicit ratings from user sessions and the characterization of users according to the place of the songs played by them in the power-law distribution of play frequency. These approaches are ways of dealing with sparsity and gray-sheep problems, respectively. The first objective is achieved by significantly increasing the number of ratings about songs since every song played by the user will have an associated implicit evaluation. The second is addressed by characterizing each user according to their gray-sheep level.
The procedure for computing implicit ratings differs from other approaches based on frequency functions since not only the count of plays is used but also the position of the song in the user sessions. Concerning the gray-sheep treatment, there is no need for content information or the creation of clusters for different types of users, as most of the proposals in the literature do. The recommendation method is applied in the same way to all users in the system, taking into account an additional attribute that characterizes them according to the degree to which their tastes are unusual.
3.1. Computing Implicit Ratings from User Sessions
Obtaining either the implicit or the explicit ratings required by collaborative filtering methods always requires some type of user interaction. In the case of explicit ratings, users assign a value to items that indicates the degree to which they have liked that product, while implicit ratings are usually obtained from other kinds of interactions with items, such as the purchase of a product or the time spent viewing information about the item. Therefore, in both cases, the only available information on preferences concerns those items that have been the object of user interaction. The aim of the recommendations is to help users discover products or services that they do not know and that they might like. Thus, only items that the user has not previously interacted with are recommended.
Traditional ways of obtaining implicit ratings for items from purchase records, clicks, or timestamp information are not possible in the context of our study since the interaction mode of users with songs in music streaming platforms is quite different from interaction with other items in other kinds of systems. Usually, binary values or simple frequency functions of plays are used to derive preferences from user implicit feedback. However, in this work, we propose a more complex model to infer users’ interests from their behavior in a more reliable way.
This approach takes into account the sessions in which users play songs through the streaming services as well as a play frequency percentile function in the calculation of the ratings. Although all the songs played by the user have been chosen by him/her, the method is based on the fact that the first song in a user session is important since it has a higher probability of being a direct choice of the user at this time than the songs in other positions.
A user session is considered a period in which the user is listening to songs without interruption. It consists of songs that are played in a particular order; thus, it can be characterized as a Markov chain where initial probabilities are proportional to the number of times a given state was visited. In our case, the problem is simplified since only the start and nonstart of a session is considered for the songs belonging to a session. Therefore, we use the number of times for each user that each song was at the start of the session and the number of times that each song was not at the start of a session to induce the ratings.
Let us consider a set of users $U$ and a set of songs $G$, where $u_i \in U$ and $g_j \in G$ represent a user and a song, respectively. In our method, the frequency function $f_{ij}$ for a user $u_i$ and a song $g_j$ is computed as follows:

$$f_{ij} = \alpha \, s_{ij} + (1 - \alpha) \, n_{ij} \tag{1}$$

where $s_{ij}$ is the number of times the song $g_j$ was the start of a session for the user $u_i$ and $n_{ij}$ is the number of times it was played in other positions of the sessions. The parameter $\alpha$ is used to adjust the importance of each term of the equation.
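The session-based frequency can be illustrated with a short Python sketch. It assumes that the two counts are combined as a convex combination weighted by α, which is one reading of the description above; the function name `session_frequencies` and the default `alpha=0.5` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def session_frequencies(sessions, alpha=0.5):
    """Session-based frequency of every song played by one user.

    sessions: list of sessions, each a list of song ids in play order.
    alpha:    weight of session starts versus other positions
              (ASSUMPTION: convex combination of the two counts).
    """
    starts = Counter()   # times each song opened a session
    others = Counter()   # times each song was played in any other position
    for session in sessions:
        if not session:
            continue
        starts[session[0]] += 1
        others.update(session[1:])
    songs = set(starts) | set(others)
    # Counter returns 0 for songs missing from one of the two counts.
    return {g: alpha * starts[g] + (1 - alpha) * others[g] for g in songs}


if __name__ == "__main__":
    user_sessions = [["a", "b", "a"], ["a", "c"], ["b"]]
    print(session_frequencies(user_sessions, alpha=0.75))
```

With a value of α above 0.5, a song that often opens a session receives a higher frequency than a song with the same number of plays in later positions.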
Once the session-based frequency is computed, Pacula’s procedure [52] is applied to obtain the ratings. This method has proven to be more suitable in the context of artist recommendation when play frequencies have a clear power-law distribution since there are few highly played artists, and most of them have few plays. The same distribution is presented when songs are the target of the recommendations, so it is also indicated in this case.
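The method is based on the assumption that a user likes a song that he/she listens to more often better than one that he/she listens to less often. Therefore, rating values are given in comparative terms for each user. The implicit rating $r_{ij}$ for the user $u_i$ and the song $g_j$ is calculated from $f_{ij}$ as follows.
Let us consider that the songs played by the user $u_i$ are ordered by their frequency values for this user, and $Freq_{k'}(i)$ denotes the normalized frequency of the song with rank $k'$, with $k' = 1$ for the song having the highest frequency:

$$Freq_{k'}(i) = \frac{f_{k'}(i)}{\sum_{k''} f_{k''}(i)} \tag{2}$$

where $f_{k'}(i)$ is the session-based frequency of the song with rank $k'$ for the user $u_i$.
Then, the rating for a song with rank $k$ is computed as a linear function of the frequency percentile:

$$r_{ij} = 4 \left( 1 - \sum_{k'=1}^{k-1} Freq_{k'}(i) \right) \tag{3}$$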
The values of the ratings are real numbers in the interval (0, 4]. Unlike other item interaction-based approaches, where binary ratings are obtained based on whether or not interaction has occurred, this approach more closely resembles explicit ratings that are usually within a range of values, which can be integer or real.
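As an illustration, the following Python sketch maps the frequencies of one user to ratings in (0, 4] through the cumulative share of the songs ranked above each one. It assumes that frequencies are normalized by the user's total and that ties are broken arbitrarily by sort order; the function name `implicit_ratings` is hypothetical.

```python
def implicit_ratings(freqs, max_rating=4.0):
    """Map one user's song frequencies to ratings in (0, max_rating].

    freqs: dict {song_id: frequency} with positive values.
    The top-ranked song gets max_rating; every other song gets
    max_rating * (1 - cumulative normalized frequency of the songs above it).
    """
    total = sum(freqs.values())
    if total <= 0:
        return {}
    # Rank songs from most to least frequent (ties keep insertion order).
    ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    ratings = {}
    cumulative = 0.0  # share of plays held by better-ranked songs
    for song, freq in ranked:
        ratings[song] = max_rating * (1.0 - cumulative)
        cumulative += freq / total
    return ratings


if __name__ == "__main__":
    print(implicit_ratings({"a": 5, "b": 3, "c": 2}))
```

Because the rating depends on the share of plays concentrated in better-ranked songs, a user whose listening is dominated by a few songs gives low ratings to everything else, which matches the long-tail shape of the play distribution.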
The ratings calculated in that way are used in the collaborative filtering approach proposed in this work, but they also can be used in any CF method following the same procedure used when ratings are explicit.
3.2. User Characterization-Based CF Approach
In order to deal with the gray-sheep problem suffered by users with uncommon tastes, we propose a procedure for characterizing users according to the play frequency of the songs they listen to. As indicated, the frequency of plays of the songs follows a power-law distribution, also called “long tail” in the context of music recommender systems. Thus, gray-sheep users are those who listen to rarely played songs, which are placed at the end of the long tail. However, in our proposal, it is not necessary to identify those special users; instead, all users in the system are associated with a gray-sheep degree along the power-law distribution curve, depending on the position on the curve where the songs they play are located.
The first step of the procedure for user characterization is to determine a coefficient for the songs that reflects their popularity. This is the listening coefficient, which is computed for each song from both the number of users who play it and the number of plays it has. It is important to take into account both aspects since this coefficient will be used to characterize users, and gray-sheep ones are distinguished by having few neighbors.
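For the set $U$ of $m$ users and the set $G$ of $n$ songs, the number of times that user $u_i$ plays a song $g_j$ is denoted as $p_{ij}$. This information for all users and songs is represented by the matrix of plays $P$, where $P = (p_{ij})$:

$$P = \begin{pmatrix} p_{11} & \cdots & p_{1n} \\ \vdots & \ddots & \vdots \\ p_{m1} & \cdots & p_{mn} \end{pmatrix} \tag{4}$$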
The listening coefficient of a song $g_j$, denoted $LC_j$, is computed from the following quantities: $n_j$, the number of users who play the song $g_j$; $\bar{n}$, the average number of users per song; and $\bar{p}_i$, the average number of plays per song of user $u_i$.
The coefficient captures the playing behavior of the users with respect to each song: first, in the form of proportion of users who have listened to the song and second, in the form of number of plays of the song by a given user with respect to the average number of plays of this user. A normalized listening coefficient can be obtained by means of the following equation:
In the next step, the user playing coefficient (UPC) that characterizes users is computed from the listening coefficients of the songs they listen to:

$$UPC_i = \frac{1}{N_i} \sum_{j=1}^{n} \delta_{ij} \, LC_j \tag{7}$$

where $LC_j$ is the listening coefficient of song $g_j$, $\delta_{ij}$ is a parameter that takes the value 1 if the song $g_j$ has been played by the user $u_i$ and the value 0 otherwise, and $N_i$ is the total number of songs played by user $u_i$. Users with high values of UPC have preferences in common with many others, while those with low values would be gray-sheep users.
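The characterization step can be sketched in Python as follows. The sketch assumes that the UPC of a user is the mean listening coefficient of the distinct songs he/she has played, and it takes the listening coefficients as given input rather than computing them; the function name `user_playing_coefficient` and the sample values are hypothetical.

```python
def user_playing_coefficient(played_songs, listening_coeff):
    """Mean listening coefficient over the distinct songs a user played.

    played_songs:    iterable of song ids played by the user.
    listening_coeff: dict {song_id: listening coefficient}, assumed to be
                     precomputed for every song in the system.
    ASSUMPTION: the UPC is the plain average over played songs.
    """
    songs = set(played_songs)  # each song counts once, however often played
    if not songs:
        return 0.0
    return sum(listening_coeff[g] for g in songs) / len(songs)


if __name__ == "__main__":
    coeff = {"hit": 0.9, "known": 0.5, "rare": 0.1}
    print(user_playing_coefficient(["hit", "known", "hit"], coeff))  # mainstream user
    print(user_playing_coefficient(["rare"], coeff))                 # gray-sheep-like user
```

A user who mostly plays songs from the head of the distribution obtains a high value, whereas a user whose songs lie in the tail obtains a low one, so the coefficient acts as a continuous gray-sheep degree.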
Both user-playing coefficients and session-based implicit rating are needed for the next step that involves the CF method proposed in this work and described in the next subsection. Algorithm 1 describes the complete sequence of steps required for their calculation.
[Algorithm 1: calculation of the session-based implicit ratings and the user playing coefficients.]
3.3. Incorporating UPC to User-Based CF
In user-based collaborative filtering, active users receive recommendations of items liked by their nearest neighbors. Two users are defined as neighbors if they have some items in common that they have rated with close scores. In the context of our work, users who like the same songs would have similar ratings and would, therefore, be neighbors.
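For the set $U$ of $m$ users and the set $G$ of $n$ songs, there is a list of ratings that each user $u_i$ has given to a subset of songs $G_{u_i}$, where $G_{u_i} \subseteq G$. Ratings are stored in a matrix $R$ called the rating matrix, where each element $r_{ij}$ is the rating that a user $u_i$ gives to a song $g_j$:

$$R = \begin{pmatrix} r_{11} & \cdots & r_{1n} \\ \vdots & \ddots & \vdots \\ r_{m1} & \cdots & r_{mn} \end{pmatrix} \tag{8}$$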
When explicit ratings are used, this matrix usually has many null elements because users have rated only a small subset of songs: the fewer the rated items, the sparser the matrix. As stated, this is an important problem inherent to CF methods that can be minimized by making use of implicit feedback. In this work, the rating matrix contains the session-based implicit ratings computed by means of equations (1)–(3), as described in Algorithm 1. Our proposal also requires the user playing coefficient of every user $u_i$, whose computing procedure is included in the same algorithm.
In order to make recommendations to the active user ua, it is necessary to find that user's neighbors. Among the metrics that can be applied to compute user similarity, the Pearson correlation coefficient and cosine similarity are the most frequently used in the field of recommender systems.
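The Pearson correlation coefficient evaluates the linear relationship between two variables and is obtained from their covariance. This coefficient for the active user $u_a$ and another user $u_i$ is computed as follows:

$$PC(u_a, u_i) = \frac{\sum_{j} (r_{aj} - \bar{r}_a)(r_{ij} - \bar{r}_i)}{\sqrt{\sum_{j} (r_{aj} - \bar{r}_a)^2} \, \sqrt{\sum_{j} (r_{ij} - \bar{r}_i)^2}} \tag{9}$$

where the sums run over the songs $g_j$ rated by both users, $r_{aj}$ and $r_{ij}$ are the ratings of user $u_a$ and user $u_i$ for song $g_j$, respectively, and $\bar{r}_a$ and $\bar{r}_i$ are the average ratings of user $u_a$ and user $u_i$, respectively. The Pearson coefficient can represent inverse and direct correlation with its values in the interval [−1, 1], where the value 0 corresponds to the absence of correlation.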
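A minimal Python sketch of the Pearson coefficient between two users is given next. It assumes that each user's mean is taken over all of his/her ratings while the sums are restricted to the songs rated by both, and that a degenerate case (fewer than two common songs or zero variance) returns 0; the function name `pearson` is hypothetical.

```python
from math import sqrt


def pearson(ratings_a, ratings_i):
    """Pearson correlation between two users.

    ratings_a, ratings_i: dicts {song_id: rating}.
    Sums run over songs rated by both users; means use all ratings of each
    user (ASSUMPTION). Returns 0.0 when the coefficient is undefined.
    """
    common = set(ratings_a) & set(ratings_i)
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a.values()) / len(ratings_a)
    mean_i = sum(ratings_i.values()) / len(ratings_i)
    num = sum((ratings_a[g] - mean_a) * (ratings_i[g] - mean_i) for g in common)
    den_a = sqrt(sum((ratings_a[g] - mean_a) ** 2 for g in common))
    den_i = sqrt(sum((ratings_i[g] - mean_i) ** 2 for g in common))
    if den_a == 0 or den_i == 0:
        return 0.0
    return num / (den_a * den_i)


if __name__ == "__main__":
    print(pearson({"x": 1, "y": 2, "z": 3}, {"x": 2, "y": 4, "z": 6}))
```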
Another commonly used similarity metric is cosine similarity, which is given by the normalized dot product of the vectors representing the preferences of two given users, ua and ui, in Euclidean space. The cosine similarity (CS) between those users is computed according to equation (10), where $\vec{r}_a$ and $\vec{r}_i$ are the vectors containing the implicit ratings for songs corresponding to users ua and ui, respectively:

$$ CS(u_a, u_i) = \frac{\vec{r}_a \cdot \vec{r}_i}{\lVert \vec{r}_a \rVert \, \lVert \vec{r}_i \rVert} \tag{10} $$
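The two similarity measures described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: ratings are held in a hypothetical dictionary per user (song → rating), user means are taken over all songs each user has rated, and the sums run over songs rated by both users. The function names are not taken from the paper.

```python
from math import sqrt


def pearson_similarity(ra, ri):
    """Pearson correlation between two users' ratings (dict: song -> rating).

    Assumption: means are taken over all of each user's ratings and the
    sums run over the songs that both users have rated.
    """
    common = set(ra) & set(ri)
    if len(common) < 2:
        return 0.0
    mean_a = sum(ra.values()) / len(ra)
    mean_i = sum(ri.values()) / len(ri)
    num = sum((ra[g] - mean_a) * (ri[g] - mean_i) for g in common)
    den = sqrt(sum((ra[g] - mean_a) ** 2 for g in common)) * sqrt(
        sum((ri[g] - mean_i) ** 2 for g in common)
    )
    return num / den if den else 0.0


def cosine_similarity(ra, ri):
    """Cosine of the angle between two rating vectors; unrated songs count as 0."""
    common = set(ra) & set(ri)
    num = sum(ra[g] * ri[g] for g in common)
    den = sqrt(sum(v * v for v in ra.values())) * sqrt(
        sum(v * v for v in ri.values())
    )
    return num / den if den else 0.0
```

Note that cosine similarity stays in [0, 1] for non-negative ratings, whereas the Pearson coefficient can be negative when two users' preferences are inversely related.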
This is the metric used in our approach since it can be used to compute similarity from other user attributes in addition to ratings. The additional attribute incorporated at this point is the user playing coefficient, which influences the result of the search for the k-nearest neighbors. To do this, an attribute-aware weighted user-based K-NN approach was applied [53]. Specifically, the implementation is provided by the recommender extension of RapidMiner [54, 55]. The resulting weighted similarity between a user and the active user, along with their ratings, is used to predict the rating that a given user would assign to a song that he/she has not played yet, by means of equation (11) [56]. Only the k-nearest neighbors, that is, those with the highest similarity values, are taken into account to make the predictions.
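As an illustration of the neighbor-based prediction step, the sketch below uses a similarity-weighted average of the ratings given by the k most similar users. This is one common form of user-based K-NN prediction and is an assumption here; it is not necessarily the exact expression of equation (11) or the behavior of the RapidMiner implementation. The function name, the `similarity` callable, and the data layout are hypothetical.

```python
def predict_rating(active, song, ratings, similarity, k=5):
    """Predict the rating of `active` for `song` from the k nearest neighbors.

    ratings:    dict user -> dict song -> rating (hypothetical layout)
    similarity: callable (user_a, user_b) -> float, e.g. a weighted cosine
    Returns None when no neighbor has rated the song.
    """
    # Candidate neighbors are the other users who have rated the song.
    candidates = [
        (similarity(active, user), user_ratings[song])
        for user, user_ratings in ratings.items()
        if user != active and song in user_ratings
    ]
    # Keep only the k candidates with the highest similarity.
    top = sorted(candidates, key=lambda pair: pair[0], reverse=True)[:k]
    weight_sum = sum(abs(sim) for sim, _ in top)
    if weight_sum == 0:
        return None
    # Similarity-weighted average of the neighbors' ratings (assumed form).
    return sum(sim * rating for sim, rating in top) / weight_sum
```

With this form, a neighbor's influence on the prediction grows with its similarity to the active user, which is why any attribute added to the similarity computation changes both who the neighbors are and how much each one counts.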
The results of applying this proposal have been compared to those provided by other CF methods. They are analyzed in the next section.
4. Comparative Evaluation of the Proposal versus Other CF Methods
In order to validate the recommendation approach, a comparative study was conducted in which this proposal and other widely used CF methods were applied to a dataset containing real data collected by Oscar Celma (https://www.upf.edu/web/mtg/lastfm360k) from the Last.fm streaming platform. Only information concerning the play of songs by users was used in the study. Specifically, 420,209 records corresponding to 86,000 songs played by 53 users over two years were processed. Each record consisted of the user ID, the song, and the timestamp at which the song was played.
The first preprocessing step was to establish user sessions and place the songs played in each session in the order in which they were listened to. We considered a user inactivity period longer than 15 minutes as the mark of the end of a session, so the first play after that period indicated the start of a new session. After determining the sessions, the second step was to compute implicit ratings from the count of plays in each session and the number of times that each song was the start of a session for each user.
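The session-splitting step can be sketched as follows. This is a minimal sketch that assumes inactivity is measured as the gap between consecutive play timestamps of one user (song durations are not available in the records); the function names and the `(timestamp, song)` tuple layout are hypothetical. It only produces the counts that feed the implicit ratings; the rating formulas of equations (1)–(3) are not reproduced here.

```python
from collections import Counter
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=15)  # inactivity threshold stated in the text


def split_sessions(plays, gap=SESSION_GAP):
    """Group one user's plays into ordered sessions.

    plays: iterable of (timestamp, song) tuples for a single user.
    A gap strictly longer than `gap` between consecutive plays starts a new session.
    """
    sessions = []
    last_ts = None
    for ts, song in sorted(plays, key=lambda play: play[0]):
        if last_ts is None or ts - last_ts > gap:
            sessions.append([])
        sessions[-1].append(song)
        last_ts = ts
    return sessions


def session_counts(sessions):
    """Return (play count per song, number of sessions each song started)."""
    play_count = Counter(song for session in sessions for song in session)
    start_count = Counter(session[0] for session in sessions)
    return play_count, start_count
```

For example, plays at 10:00, 10:10, 10:40, and 10:45 yield two sessions, because only the 30-minute gap between the second and third play exceeds the threshold.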
In order to compare our session-based proposal with the classical implicit rating calculation based on the frequency of plays, we applied both methods. For the first, we report the results obtained for three values of the α parameter. For the second, Pacula's method was used to calculate the ratings from the simple play count, without considering sessions. Basic and matrix factorization CF methods were tested to check whether the new rating computation procedure succeeded in increasing the reliability of the recommendations. One of these methods was K-nearest neighbors (K-NN), which is extensively used in the implementation of recommender systems. We tested user-based K-NN using both cosine and Pearson similarity measures to determine the neighborhood of K users whose preferences are most similar to those of the active user. The number of neighbors K was set to 5 since it provided the best results in the experiments. Although it may seem insufficient to make the predictions, this number of neighbors has been successfully used in other work in the same field of application [8]. In addition, two matrix factorization methods were applied: the basic technique and a variant called biased matrix factorization that incorporates user and item regularization parameters. Ten-fold cross-validation was performed to evaluate the results, and the metrics used were RMSE (root-mean-square error), MAE (mean absolute error), and NMAE (normalized mean absolute error).
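For reference, the three error metrics can be computed as in the sketch below. It assumes the usual definition of NMAE, that is, MAE divided by the width of the rating scale; the paper does not state the normalization explicitly, so `r_min` and `r_max` are hypothetical parameters.

```python
from math import sqrt


def mae(actual, predicted):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root-mean-square error: squaring penalizes large deviations more."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def nmae(actual, predicted, r_min, r_max):
    """MAE normalized by the rating range (assumed normalization)."""
    return mae(actual, predicted) / (r_max - r_min)
```

With actual ratings `[1, 2, 3, 4]` and predictions `[1, 2, 3, 8]`, a single large error gives an MAE of 1.0 but an RMSE of 2.0, which illustrates why RMSE is more sensitive to outliers.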
Figure 1 shows the error rates of the K-NN output for cosine similarity and the Pearson coefficient. The results of the matrix factorization methods are shown in Figure 2. We can see that the error rates decrease for all the methods when the session-based approach is used, especially with α = 0.7. This difference is significantly greater in the case of matrix factorization methods, which yielded worse results than K-NN. Specifically, the NMAE reduction achieved with session-based ratings (α = 0.7) versus play count-based ratings is 17.12% and 16.08% for K-NN with cosine similarity and Pearson coefficient, respectively, whereas for matrix factorization (MF) and biased matrix factorization (BMF) it is 37.56% and 24.43%, respectively.


Once the method for obtaining the ratings was validated, we checked the impact of introducing user characterization in CF. User playing coefficients, which characterize the degree to which a specific user is a gray-sheep user, were computed according to the procedure described in Section 3.2. To improve efficiency, these coefficients were discretized before applying the proposed CF method. The results for different numbers of bins were analyzed to obtain the optimal partition. As expected, the errors decreased as the number of intervals increased, tending to stabilize at 300 bins. Thus, this was the number of bins chosen to conduct the experiments. The discretized coefficients obtained for every user ui were used in the user-based K-NN algorithm as an additional attribute to compute user similarity, making use of the cosine measure, as described in Section 3.3. This approach, which we call “user attribute K-NN UPC”, was also tested with ratings based on play count and with ratings based on sessions for three different values of α (equations (1)–(3)), and its results were compared to those provided by the CF methods tested above. Table 1 shows the detailed results of all these methods.
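The discretization step can be illustrated with the sketch below. The text does not say which binning strategy was used, so equal-width bins over the observed range of coefficients are an assumption here, and the function name is hypothetical.

```python
def discretize(values, n_bins=300):
    """Map each coefficient to an equal-width bin index in [0, n_bins - 1].

    Assumption: bins are of equal width over [min(values), max(values)].
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All coefficients are identical, so every user falls in the same bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land on index n_bins, so it is clamped to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

For instance, `discretize([0.0, 0.5, 1.0], n_bins=4)` assigns the three coefficients to bins 0, 2, and 3. A larger number of bins preserves finer differences between users at the cost of more distinct attribute values.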
One of the main conclusions obtained from the table is the confirmation that the session-based rating with α = 0.7 provides the best results. Another observation regarding the metric values in the table is that the error reduction is more significant for MAE than for RMSE. It is known that MAE is a linear score that is not as sensitive to outliers as RMSE, which penalizes large errors more heavily. Therefore, the smaller decrease in RMSE values may be due to some predictions whose deviation from the actual value is significantly higher than that of the majority, both when using our proposal and the other methods. This can mask the improvement of the rest of the predictions.
From the analysis of the table, we can also derive that the lowest error rates occur with the new UPC-based method, regardless of the type of rating used. Figures 3–5, representing RMSE, MAE, and NMAE, respectively, allow us to visualize both facts jointly. It can be seen that the line representing the user attribute K-NN UPC method is in the lowest position, and the lowest points of all the lines representing the methods are those corresponding to the session-based ratings with α = 0.7.



In order to get an idea of the improvement achieved with the proposed approaches, we can compare the NMAE result of user attribute K-NN UPC with session-based ratings (α = 0.7) to both the best and the worst NMAE results of the other CF methods using classical play count-based ratings. NMAE was reduced by 20.28% with respect to the best, user K-NN with Pearson coefficient, and by 42.64% with respect to the worst, matrix factorization. Differences are also notable when comparing the UPC-based method to the rest of the methods with session-based ratings (α = 0.7) used in all of them. Figure 6 shows these differences. In this case, the improvements provided by user attribute K-NN UPC vary between 5.00% and 14.29%.

5. Conclusions
Sparsity and gray-sheep problems are two of the main reasons CF methods do not provide the reliability required in some recommendation systems. Both have been addressed in many works in the literature, although in the field of music, they have been less studied, especially the second. This is a major drawback because some of the proposed solutions are difficult to implement in this application domain. On the one hand, the way of obtaining implicit feedback in streaming services is totally different from other web applications due, among other reasons, to the fact that there are no individualized purchase records of songs, and the mode of consuming music is different from the consumption of other items. On the other hand, most of the proposals to deal with the problems of sparsity and gray-sheep, particularly with the latter, make use of content information that is difficult to obtain and that, in many cases, does not lead to the expected results.
In this work, an approach to improve recommendation reliability in the context of music streaming services is presented. Its main value is that it addresses implicit rating computation and user characterization using only the play timestamps of the songs, information that is regularly collected by streaming platforms. The procedure proposed in this work for obtaining the ratings differs from most methods, which generally use play counts, because it is based on user sessions. Furthermore, a new way of managing gray-sheep users based on the long-tail distribution is presented. The results show a significant improvement in recommendation reliability over traditional CF and matrix factorization methods.
Data Availability
The dataset used in the study is publicly available for noncommercial use at https://www.upf.edu/web/mtg/lastfm360k.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This research has been supported by the Department of Education of the Junta de Castilla y León, Spain (ORDEN EDU/667/2019 ‐ Grant ID: SA064G19).