Abstract
To address the diversity of user preferences and the dynamic changes of interests in personalized recommendation scenarios, a personalized recommendation model based on an improved gated recurrent unit (GRU) network in a big data environment is proposed. First, to deal with outliers in sequence recommendation, context-aware sequence recommendation is introduced, and the dynamic changes of users’ interests are modeled by redefining the update gate and the reset gate of the GRU. Then, the duration information about how long users browse each item is processed and transformed to obtain the duration attention factor of each recommended item, and the duration attention factors and the item information are used together as the input of the proposed model for training and prediction. Finally, an auxiliary loss function is introduced to make up for the shortcomings of the traditional negative logarithmic likelihood function, and a hyper-parameter is applied to combine the auxiliary loss function with the negative logarithmic likelihood function so as to enhance the relationship between the interest representation and the accuracy of recommendation. Experiments show that the root mean square error (RMSE) of the proposed method on the Criteo and MovieLens-1M datasets is 0.7257 and 0.7869, respectively, and the mean absolute error (MAE) is 0.5147 and 0.5893, respectively, both of which are lower than those of the comparison methods. The proposed method therefore outperforms the comparison methods in the accuracy of personalized recommendation.
1. Introduction
The recommendation system is a kind of information filtering tool that deals with the problem of information overload, and it can provide users with content that may be of interest to them in a personalized way [1, 2]. In the field of recommendation systems, the products or services recommended to users are collectively referred to as items. Recommendation systems predict how interested a user is in a particular target item based on basic characteristics of users, behavioral feedback, and information derived from the attributes of the target item [3]. Unlike search engines that require users to enter keywords, recommendation systems infer users’ interests from their implicit feedback and then tailor personalized recommendation services to improve user satisfaction with online service websites [4, 5]. With the rapid development of information technology, the application of recommendation systems in industry has also gained much importance [6]. Nowadays, the study of recommendation systems has been extended to several aspects, including what to recommend (based on the characteristics of users/items), when to recommend (based on time-aware recommendations), where to recommend (based on geolocation recommendations), who to recommend (based on social network recommendations), and why to recommend (explainable recommendations) [7, 8]. Showing why an item is recommended not only helps users understand the rationale for the recommendation, but also helps improve the efficiency, transparency, and credibility of the system [9, 10].
Deep learning is a powerful learning framework that can efficiently process time sequence and high-dimensional sparse data, providing great impetus to recommendation strategies [11–13]. In the traditional research on recommendation algorithms, interpretability and personalization are often opposed to each other [14, 15]. A simple and easy-to-understand recommendation strategy would be more interpretable, while at the expense of the accuracy of predictions [16, 17]. By contrast, complex models can improve the accuracy of predictions, but the recommendation system cannot provide a reasonable explanation to the user [18]. The application of techniques related to deep learning offers a new solution to this problem. Based on deep learning, recommendation algorithms can be explored with both efficiency and interpretability [19, 20].
To address these problems, a personalized recommendation method based on improved GRU network in a big data environment is proposed. The main contributions of the proposed method are as follows:
(1) The context is classified into four categories: input context, relevance context, static interest context, and transfer context. And the dynamic changes in user interest are modeled by redefining the update gate and the reset gate of the gated recurrent unit (GRU).
(2) The duration information about how long each item is browsed by users is analyzed and transformed to obtain the duration attention factor of each recommended item. Meanwhile, the duration attention factor and item information are jointly used as the input for the improved GRU model, which improves the accuracy of prediction.
(3) The relationship between interest representation and the accuracy of recommendations is enhanced by combining an auxiliary loss function with a negative logarithmic likelihood function using a hyper-parameter.
2. Related Studies
To cope with the large scale, fast updates, and noise of YouTube data, Covington et al. (2016) proposed building a deep learning network with an embedding layer to learn the characteristics of videos and users, which are finally used for personalized recommendations [21]. To deal with sparsity in recommendation algorithms and overgeneralization in deep learning, Cheng et al. (2016) proposed a recommendation algorithm that combined the generalized linear model and the deep learning model [22]. In Ref. [23], a tag recommendation method based on a convolutional neural network was investigated, which extracted and fused the features of images and user interaction information through the convolutional neural network to obtain personalized tag recommendations in order. Guo et al. (2017) proposed a model based on deep matrix decomposition to learn complex and high-dimensional interaction information of users, so as to realize the prediction of click-through rates [24]. He et al. (2017) studied a collaborative filtering recommendation algorithm based on a multilayer perceptron [25]. The algorithm successfully migrated deep learning to collaborative filtering recommendation systems and designed a general model for neural collaborative filtering algorithms. Feng et al. (2019) focused on capturing the dynamic interests behind users’ behavior and proposed a personalized recommendation method based on the deep interest network to formulate the evolution of users’ interests [26]. Liu et al. (2016) argued that the effective use of temporal and spatial contextual information can help predict the trend in access preferences of users’ interest points and can effectively improve the capability of predicting the next interest point [27]. By extending the recurrent neural network, they proposed a novel recommendation model called spatial-temporal recurrent neural networks (ST-RNN).
To address the issue of mining the actual preferences of users in personalized recommendation, Yin et al. (2019) proposed an attention-based deep learning POI recommendation (ADPR) framework [28]. However, the feature extraction of this recommendation method is slow and increases the number of parameters to be trained in each round, so it takes a long time on a large dataset. Zhang et al. (2020) investigated a deep neural model based on transfer learning that incorporated cross-domain knowledge to achieve more accurate personalized recommendations [29]. The above methods can only mine shallow information during feature extraction, while a large amount of deeper information about features is hard to extract, resulting in low accuracy.
3. Personalized Recommendation Model Based on Improved GRU
3.1. Definition of Attentional GRU (aGRU)
With increasing interactions between the user and items, the user’s interests are constantly changing, and the user’s preferences are influenced by various factors such as sequential information, time, and location in addition to the user-item interaction matrix. It is often assumed that there is a relationship between changes in a user’s interests and their behavior: the early actions of the user have a low impact on the interest preferences, while the recent actions are more indicative of changes in interests. Therefore, when formulating the user’s preferences, the recent behavior is given a higher weight, and items with high similarity to the recently interacted items are recommended with priority.
Additionally, items that are highly popular do not necessarily represent a user’s real interests in recommendation systems. It is inappropriate to assign a high weight to the behavior just because the item with which the user has recently interacted is one of the popular items. In the history list of behaviors, those that are not very distinguishable from the user's personalized interests or not very predictive of the recommendation list are referred to as outliers. The influence of these outliers should be reduced in sequence recommendations.
3.1.1. Traditional GRU
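Recurrent neural networks (RNNs) can capture dynamic information in sequential data through recurrent connections between nodes in the hidden layer, and they can store, learn, and represent relevant information in contextual windows of arbitrary length. Given the current input $x_t$ and the previous state $h_{t-1}$, the probability distribution of the next element in the output sequence of the RNN can be written as

$$h_t = f(x_t, h_{t-1}), \qquad p(x_{t+1} \mid x_1, \ldots, x_t) = g(h_t) \tag{1}$$

where $f$ is the state transition function and $g$ is the output function.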
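The context of the aGRU includes the input context, the transfer context, the relevance context, and the static interest context. The impact of these contexts on the recommendations needs to be considered simultaneously in the state transitions of the RNN. Thus, the RNN defined in equation (1) is not sufficient to describe the problem addressed in this article. The GRU is a more complex RNN unit whose new hidden state $h_t$ is a linear interpolation between the previous hidden state $h_{t-1}$ and the current candidate knowledge $\tilde{h}_t$:

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{2}$$

The candidate knowledge can be formulated as

$$\tilde{h}_t = \tanh\left(W_h x_t + U_h (r_t \odot h_{t-1})\right) \tag{3}$$

The vectors of the update gate $z_t$ and the reset gate $r_t$ determine which information can be the output of the GRU. The update gate and the reset gate can be defined respectively as

$$z_t = \sigma(W_z x_t + U_z h_{t-1}) \tag{4}$$

$$r_t = \sigma(W_r x_t + U_r h_{t-1}) \tag{5}$$

where $x_t$ is the current input, $W_h$, $W_z$, $W_r$, $U_h$, $U_z$, and $U_r$ are weight matrices, $\sigma$ is the sigmoid function, and $\odot$ denotes element-wise multiplication.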
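As a rough illustration of the standard GRU update described above, the following sketch performs single GRU steps in plain Python. The function name `gru_step`, the parameter dictionary keys, the random initialization, and the toy dimensions are all assumptions made for illustration. Biases are omitted, and the update gate is taken to weight the candidate state, which is one common convention.

```python
import math
import random


def sigmoid(v):
    """Element-wise logistic function on a list of floats."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in m]


def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


def gru_step(x, h_prev, p):
    """One GRU step; p holds W_* (n x m) and U_* (n x n) weight matrices."""
    z = sigmoid(add(matvec(p["W_z"], x), matvec(p["U_z"], h_prev)))  # update gate
    r = sigmoid(add(matvec(p["W_r"], x), matvec(p["U_r"], h_prev)))  # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(a)
            for a in add(matvec(p["W_h"], x), matvec(p["U_h"], gated))]
    # Linear interpolation between the previous state and the candidate.
    return [(1.0 - zi) * hp + zi * c for zi, hp, c in zip(z, h_prev, cand)]


def random_matrix(rows, cols, rng):
    """Small random weights, purely for demonstration."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
m, n = 4, 3  # toy input size and hidden size (hypothetical values)
params = {k: random_matrix(n, m, rng) for k in ("W_z", "W_r", "W_h")}
params.update({k: random_matrix(n, n, rng) for k in ("U_z", "U_r", "U_h")})

h = [0.0] * n
for x in ([1, 0, 0, 0], [0, 0, 1, 0]):  # two one-hot inputs
    h = gru_step(x, h, params)
print(h)
```

Because each new state is a convex combination of the previous state and a `tanh` output, every component of `h` stays strictly between -1 and 1 when the initial state is zero.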
3.1.2. aGRU
The GRU described by equations (2)–(5) is not fully suitable for the model formulated in this article, so the aGRU is proposed; its structure is shown in Figure 1.

Figure 1: Structure of the aGRU. (a) Candidate knowledge. (b) Update gate. (c) Hidden state.
The components of aGRU are defined as follows:
(1) Candidate Knowledge. The structure of the candidate knowledge is shown in Figure 1(a). And the state determined by the current input and the previous hidden state is called candidate knowledge , which can be expressed as
where is a dimensional embedding vector associated with the item ; is the various input contexts at the moment ; is the embedding matrix associated with ; is the hidden state output at the previous moment; is the various transfer contexts at the moment ; is the embedding matrix associated with ; and is the nonlinear activation function. In the proposed aGRU, the candidate knowledge can be calculated by equation (6), and the update gate determines the proportion of candidate knowledge that can be output as the hidden state at the moment .
(2) Update Gate, Reset Gate, and Hidden State. The output structures of the update gate and the hidden state in aGRU are shown in Figure 1(b) and Figure 1(c). The update gate of aGRU is determined by the embedding expression of the item , the hidden state at the previous moment, and the relevance context , and it can be defined as
where , , and are the corresponding weight matrices. is the relevance context calculated by the attention mechanism at the moment .
Define the reset gate as: .
Then the hidden vector of the user at the moment can be defined as
where represents the element-wise multiplication of the vector. Unlike the common definition of the GRU, the update gate of aGRU depends not only on the current input and the hidden state at the previous moment, but also on the relevance context vector . When the relevance between the candidate knowledge and the user's static interest context is low, will mainly originate from . Otherwise, more candidate knowledge will be retained in .
3.2. Duration Attention Factor
In this article, the sigmoid function is used to normalize the standardized z-score values. The advantages of using the sigmoid function for normalization are as follows:
(1) The sigmoid function can transform the input in the range of $(-\infty, +\infty)$ into the output between $(0, 1)$, meeting the basic requirements of normalization.
(2) The sigmoid function has a larger slope at positions closer to 0, so it is more sensitive to data closer to the average, while it is relatively insensitive to data far from the average level. This characteristic has obvious advantages in calculating the duration attention factor.
Thus, after sigmoid normalization, the final expression of the duration attention factor $a_{u,i}$ of the user $u$ for the $i$th item in the browsing sequence can be written as

$$a_{u,i} = \frac{1}{1 + e^{-z_{u,i}}}$$

where $z_{u,i}$ represents the new browsing duration for the $i$th item in the browsing sequence of the user $u$ after the z-score transformation.
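The two-stage transformation (z-score standardization followed by sigmoid normalization) can be sketched as follows. The function name `duration_attention_factors` and the sample durations are hypothetical, and the sketch assumes that the z-score is computed over a single user's browsing sequence with the population standard deviation, which the text does not specify.

```python
import math
import statistics


def duration_attention_factors(durations):
    """Map raw browsing durations to (0, 1) via z-score, then sigmoid."""
    mean = statistics.fmean(durations)
    std = statistics.pstdev(durations)  # assumption: population std
    if std == 0:
        # All durations are equal, so every z-score is 0 and sigmoid(0) = 0.5.
        return [0.5] * len(durations)
    return [1.0 / (1.0 + math.exp(-(d - mean) / std)) for d in durations]


# Hypothetical browsing durations (in seconds) for one user's sequence.
factors = duration_attention_factors([5.0, 30.0, 120.0, 45.0])
print([round(f, 3) for f in factors])
```

Items browsed for about the average time receive factors near 0.5, while very long or very short durations are pushed toward 1 or 0 without ever reaching them.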
3.3. The Proposed Recommendation Model
In this section, the duration attention factor is combined with the aGRU to improve the network structure. The structure of the proposed network is shown in Figure 2.

The input data at each moment of the network are the browsing behavior of a certain user at a certain moment. The item information and the browsing duration information about how long a user browses the item can be obtained from the input data at each moment, and the item information is encoded using one-hot encoding method. As the number of items in the dataset in this article is within the computing capacity of the GPU, and the computational efficiency can meet the requirements of personalized recommendation systems, one-hot coding is used. If there are too many items in the application, embedding operations can be performed in advance to reduce the computation.
After encoding the item information, the browsing duration information in the input data at that moment is processed to obtain the duration attention factor corresponding to the user’s behavior, and then the duration attention factor is added to each input in the deep learning model as shown in Figure 2. The duration attention factor corresponding to the input at the moment $t$ is $a_t$. To maintain consistency with the parameters in the original GRU network, the duration attention factor corresponding to the input data of the network at the moment $t$ is represented by $a_t$ throughout the expressions of the update gate $z_t$, the reset gate $r_t$, and the candidate hidden state $\tilde{h}_t$. In the practical training process of the model in this article, a bias is introduced for better performance.
With the addition of the duration attention factor, the update gate $z_t$, the reset gate $r_t$, and the candidate hidden state $\tilde{h}_t$ of the network at the moment $t$ can be written as

$$z_t = \sigma(W_z a_t x_t + U_z h_{t-1})$$

$$r_t = \sigma(W_r a_t x_t + U_r h_{t-1})$$

$$\tilde{h}_t = \tanh\left(W_h a_t x_t + U_h (r_t \odot h_{t-1})\right)$$

where $x_t$ is the input data of the network at the moment $t$, $\sigma$ is the sigmoid function, and $\odot$ denotes element-wise multiplication. $W_z$, $W_r$, and $W_h$ are the $n \times m$ dimensional weight matrices of the model. $U_z$, $U_r$, and $U_h$ are the weight matrices of the $n \times n$ dimension. The values of each matrix are updated gradually during the training process, where $n$ denotes the number of neural units in the neural network and $m$ is the number of types of items. By adding the duration attention factor to the expressions of the update gate function, the reset gate function, and the candidate hidden state, the duration attention factor is involved in each step of the training process for the neural network, making full use of the duration information it represents.
The hidden state $h_t$ of the network at the moment $t$ can be calculated from the hidden state $h_{t-1}$ of the previous moment, the current candidate hidden state $\tilde{h}_t$, and the current update gate $z_t$. Thus, $h_t$ can be calculated as

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
The hidden state $h_t$ of the network is calculated in the same way as the hidden state of the GRU network, and the update gate $z_t$ controls the extent to which the information from $h_{t-1}$ is brought into the hidden state at the current moment. The output of the network at each moment can be calculated from the hidden state $h_t$ and the corresponding weight matrix $W_o$. In the training process, the expected output of the network at the moment $t$ is the vector of items in the training set that the user actually browses at the next moment $t+1$. The actual output $\hat{y}_t$ of the network at the moment $t$ is an $m$-dimensional vector, where $m$ is the number of types of items, and its $j$th value represents the prediction score of the $j$th item, where a larger score indicates a higher probability of the item appearing in the user’s browsing data at the next moment. Here $W_o$ is the weight matrix used to obtain the predicted output $\hat{y}_t$ from the hidden state $h_t$, and its values are continuously updated during the training process. The goal of training the network is to make the error between the predicted output and the expected output of the neural network as small as possible, and the model is often optimized by the gradient descent algorithm in neural networks. The duration information about how long each item is browsed by users is analyzed and transformed to obtain the duration attention factor of each recommended item. Meanwhile, the duration attention factor and item information are jointly used as the input for the improved GRU model, which improves the accuracy of prediction.
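The input and output ends of the network described in this section can be illustrated with a small sketch: a one-hot item vector is combined with its duration attention factor, and a hidden state is turned into one score per item. All names (`weighted_input`, `item_scores`, `top_k`) and numeric values are hypothetical. In particular, the sketch assumes that the factor scales the one-hot input multiplicatively and that scores are a plain linear map of the hidden state, neither of which is confirmed by the text.

```python
def one_hot(item_index, num_items):
    """Encode an item as a vector with a single 1.0 at its index."""
    v = [0.0] * num_items
    v[item_index] = 1.0
    return v


def weighted_input(item_index, num_items, factor):
    """Assumption: the duration attention factor scales the one-hot input."""
    return [factor * a for a in one_hot(item_index, num_items)]


def item_scores(h, W_o):
    """One prediction score per item: row j of W_o dotted with h."""
    return [sum(w * a for w, a in zip(row, h)) for row in W_o]


def top_k(scores, k):
    """Indices of the k highest-scoring items, best first."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# Toy example: 5 items, hidden size 3 (all values are made up).
x_t = weighted_input(item_index=2, num_items=5, factor=0.8)
h_t = [0.2, -0.1, 0.4]
W_o = [
    [0.1, 0.0, 0.3],
    [0.5, -0.2, 0.1],
    [-0.3, 0.4, 0.2],
    [0.0, 0.1, 0.9],
    [0.2, 0.2, 0.2],
]
scores = item_scores(h_t, W_o)
print(x_t)               # [0.0, 0.0, 0.8, 0.0, 0.0]
print(top_k(scores, 2))  # [3, 1]
```

In a real model `h_t` would come from the recurrent update rather than being fixed by hand, and `W_o` would be learned by gradient descent.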
3.4. Objective Function and Optimization
A GRU can capture the hidden interest in the user sequence, but it is not known whether the obtained hidden state is an accurate representation of the user’s interest. Because whether a user is interested in a new item is triggered by the final interest, the loss function of the deep interest network supervises only the final prediction. The historical hidden states should also give feedback to the model during training.
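The auxiliary loss function is proposed to compensate for the shortcomings of the traditional negative logarithmic likelihood function. It takes the positive sample, which is the user’s real behavior at the next moment, and the negative sample of the user obtained by negative sampling as the input to the auxiliary network to obtain the auxiliary prediction results, and then a logarithmic loss function is applied to obtain the final auxiliary loss. Assume that there are $N$ pairs of user embedding sequences $\{e_b^i, \hat{e}_b^i\}$, $i = 1, \ldots, N$, where $e_b^i \in \mathbb{R}^{T \times n_E}$ represents the user's true click sequence and $\hat{e}_b^i \in \mathbb{R}^{T \times n_E}$ represents the user's negative sample sequence from negative sampling; $T$ is the length of the user's interaction sequence; $n_E$ is the size of the embedding layer; $e_b^i[t]$ denotes the $t$th item embedding vector clicked by the user $i$; and $h_t^i$ denotes the output of the $t$th hidden layer of the user $i$ in the GRU network. The auxiliary loss function can then be formulated as

$$L_{aux} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t} \left[ \log \sigma\left(h_t^i, e_b^i[t+1]\right) + \log\left(1 - \sigma\left(h_t^i, \hat{e}_b^i[t+1]\right)\right) \right]$$

where $\sigma(\cdot, \cdot)$ denotes the auxiliary prediction, that is, the sigmoid output of the auxiliary network for a hidden state and an item embedding.
In order to combine the auxiliary loss function with the general negative logarithmic likelihood function, a new hyper-parameter $\alpha$ is added to the training to balance the interest representation with the accuracy of the recommendations:

$$L = L_{target} + \alpha L_{aux}$$

where $L_{target}$ is the negative logarithmic likelihood loss on the final prediction.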
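The auxiliary loss and its combination with the main loss can be sketched for a single user as follows. This is only an illustration under stated assumptions: a plain dot product followed by a sigmoid stands in for the auxiliary network, and the names `auxiliary_loss`, `total_loss`, and `alpha`, as well as all numeric values, are hypothetical.

```python
import math


def sigmoid(a):
    """Logistic function for a scalar."""
    return 1.0 / (1.0 + math.exp(-a))


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def auxiliary_loss(hidden, positives, negatives):
    """Average log loss over time steps for one user.

    hidden[t] is scored against the item actually clicked at step t + 1
    (positive) and against a sampled item at step t + 1 (negative).
    Assumption: a dot product replaces the auxiliary network.
    """
    steps = len(hidden) - 1
    total = 0.0
    for t in range(steps):
        p_pos = sigmoid(dot(hidden[t], positives[t + 1]))
        p_neg = sigmoid(dot(hidden[t], negatives[t + 1]))
        total -= math.log(p_pos) + math.log(1.0 - p_neg)
    return total / steps


def total_loss(target_loss, aux_loss, alpha):
    """Weighted sum of the main loss and the auxiliary loss."""
    return target_loss + alpha * aux_loss


# Toy sequences: hidden states, clicked embeddings, sampled negatives.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
positives = [[1.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
negatives = [[0.0, 1.0], [-2.0, 0.0], [0.0, -2.0]]

aux = auxiliary_loss(hidden, positives, negatives)
print(aux, total_loss(0.7, aux, alpha=0.5))
```

Swapping the positive and negative sequences in this toy example increases the auxiliary loss, which is the supervision signal that pushes each hidden state toward the next clicked item.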
With the auxiliary loss function, the output of the hidden layer at each step of the GRU network can have an impact on the final prediction result, so it can more accurately represent the interest state of the user after the corresponding behavior, and all interest points jointly form the interest evolution sequence of the first layer of the GRU network. Meanwhile, the introduction of the auxiliary loss function reduces the difficulty of back propagation in the GRU and provides more semantic information for the learning of feature embeddings. The structure of the auxiliary loss function is shown in Figure 3. In addition to choosing a suitable objective function, the dropout method is adopted here to avoid overfitting in the model. In this method, a certain number of neural network units are randomly hidden during the training of the model and only the remaining units are activated for training, which effectively improves the robustness of the model.
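The dropout idea mentioned above can be illustrated with a minimal sketch. It uses the common "inverted dropout" variant, in which surviving activations are rescaled so that their expected value is unchanged. That variant, the function name `dropout`, and the rate of 0.5 are assumptions for illustration, not details given in the text.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of units during training.

    Surviving units are divided by the keep probability (inverted
    dropout); at inference time the input is returned unchanged.
    Requires 0 <= rate < 1.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
units = [1.0] * 1000
dropped = dropout(units, rate=0.5, rng=rng)
print(dropped.count(0.0), "of", len(units), "units hidden")
```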

4. Experiments and Analysis
4.1. Experimental Environment
Data analysis is performed using tool libraries such as Pandas, NumPy, and Matplotlib, and the deep learning model is constructed using TensorFlow, Keras, and scikit-learn. The hardware configuration and software environment are shown in Table 1. In this article, the training of the model and the analysis of data are mainly implemented in Python. Python is an easy-to-read, cross-platform, and open-source language with rich library resources that can be called directly. It is widely used in deep learning, web crawling, data analysis, and other fields, which is why it is chosen to implement the model.
4.2. Dataset and Evaluation Metric
Criteo is a benchmark dataset for CTR prediction, shared by Criteo Labs. It includes users’ access data in the advertising display system over one week. Criteo provides both a training set and a test set, where the training set consists of partial traffic logs from the advertising site, with each row corresponding to one displayed advertisement. The test set has the same distribution as the training set, but it only contains data from the day after the date of the training set. There are click records of 45 million users in the advertising display system. It contains 26 categorical feature fields and 13 numeric feature fields. Label denotes whether the target advertisement item has been clicked or not and it is represented by 0 or 1. I1–I13 are all integers and represent counting features. C1–C26 are categorical features, and the values of these features are hashed to 32 bits for anonymization.
MovieLens-1M is the rating data of movies from users in movie websites, including basic information about users, attributes of movies, and ratings. As the feature fields of this dataset have semantic meaning, it is used in the experiment for interpretability analysis. The specific numerical statistics of the rating dataset are shown in Table 2.
There are various kinds of evaluation metrics to measure the performance of the recommendation algorithms, including accuracy, metrics based on ranking and weighting, coverage, diversity, and novelty. Among them, the prediction accuracy metrics represented by RMSE and MAE are widely used in recommendation algorithms, while the famous Netflix competition and Ali in China also often use them to evaluate the algorithms’ accuracy.
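The RMSE is calculated as

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(r_i - \hat{r}_i\right)^2}$$

The MAE is calculated as

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| r_i - \hat{r}_i \right|$$

where $N$ is the size of the test set, $r_i$ is the rating from users, and $\hat{r}_i$ is the predicted rating.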
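Both metrics are straightforward to compute. The sketch below uses only the Python standard library. The function names and the sample ratings are hypothetical and are not taken from the experiments.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length rating lists."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean absolute error between two equal-length rating lists."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n


# Made-up ratings and predictions for four test samples.
ratings = [4.0, 3.0, 5.0, 2.0]
predictions = [3.5, 3.0, 4.0, 3.0]
print(rmse(ratings, predictions))  # 0.75
print(mae(ratings, predictions))   # 0.625
```

RMSE penalizes large errors more heavily than MAE because the errors are squared before averaging, which is why the RMSE here (0.75) exceeds the MAE (0.625).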
4.3. Parameter Selection
To obtain the best results for the model, parameters are selected using the Criteo dataset and the MovieLens-1M dataset. In order to determine the appropriate dimension of the hidden state in the aGRU of the model, experiments on the number of iterations and the learning rate are performed under the same conditions, and the results are shown in Figures 4 and 5. In the training of the deep learning model, the number of iterations per batch and the learning rate directly affect the degree of convergence of the model. A series of parameter tuning runs is therefore performed, and the change of RMSE with the number of iterations is illustrated in Figure 4. It can be found that the model converges when the number of iterations reaches 240.


In Figure 5, different learning rates are used to verify the effect of the number of iterations on the RMSE. As can be seen from Figure 5, for both Criteo and MovieLens-1M datasets, convergence is obtained when the learning rate is 0.0001, and the value of error is minimized. Thus, the model reaches the optimal condition.
4.4. Comparison with Other Methods
To demonstrate the advantages of the proposed personalized recommendation method, it is compared with the methods in Ref. [28] and Ref. [29] under the same experimental conditions. RMSE and MAE are used to measure the effectiveness in recommendation of each model, and the results are shown in Tables 3 and 4. As shown in Table 3, the RMSE of the proposed method is 0.7257 and the MAE is 0.5147 under the Criteo dataset, both of which are lower than those of the comparison methods. For the MovieLens-1M dataset, which has a sparse data volume, the proposed method also achieves better recommendation performance. As depicted in Table 4, the RMSE of the proposed method in the MovieLens-1M dataset is 0.7869 and the MAE is 0.5893, both of which are lower than those of the compared personalized recommendation methods. This is because the proposed method analyses the duration information about how long users browse each item, transforms it to obtain the duration attention factor corresponding to each recommended item, and uses the duration attention factor and the item information as the input of the proposed model, so as to improve the accuracy of prediction. Although the comparison methods take into account the influence of both contextual and auxiliary information, they do not consider the effect of the duration information of each item, and therefore the recommendation performance is less effective.
The time consumed to run each model is also analyzed in this article, as shown in Figure 6. For the Criteo dataset, the feature extraction of the methods proposed in Ref. [28] and Ref. [29] takes much time. They increase the number of parameters that need to be trained in each round, so they take a longer time on large datasets. The proposed model can extract the data features faster and takes less time. Even on the MovieLens-1M dataset, which contains less data, the proposed model still takes less time than the methods proposed in Ref. [28] and Ref. [29]. This indicates that the proposed method improves the efficiency of personalized recommendation in a big data environment.

Figure 6: Time consumed to run each model, shown in panels (a) and (b).
5. Conclusion
A personalized recommendation model based on an improved GRU network is proposed to address the problem of diverse user preferences and dynamic changes of interests in the personalized recommendation scenario in the big data environment. The duration information about how long users browse each item is analyzed and transformed to obtain the duration attention factor corresponding to each item, which is used together with the item information as the input of the proposed model for training and prediction of the model. A hyper-parameter is introduced to combine the auxiliary loss function with the negative logarithmic likelihood function to enhance the relationship between interest representation and recommendation accuracy. The experimental results show that the proposed method improves the capability of the system in personalized recommendation.
In the future, the research will consider how to integrate heterogeneous information sources into interpretable recommender systems. In addition, the research in this article is based on existing static datasets, whereas recommendation systems are engineered with dynamic user data, and how to capture these features will be the focus of the future research.
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.
Acknowledgments
This work was supported by the fund project of Shanxi Provincial Education Department (no. J2020392).