Abstract

Sequential recommendation algorithms can predict the next action of a user by modeling the user's interaction sequence with items. However, most sequential recommendation models only consider the absolute positions of items in the sequence, ignoring the time interval information between items, and therefore cannot effectively mine changes in user preferences. In addition, existing models perform poorly on sparse data sets, which leads to poor predictions for short sequences. To address the above problems, an improved sequential recommendation algorithm based on short-sequence enhancement and a temporal self-attention mechanism is proposed in this paper. In the proposed algorithm, a reverse prediction model is trained first, to predict the prior items in a user sequence. This reverse prediction model is then used to generate a batch of pseudo-historical items before the initial item of each short sequence, so as to enhance the short sequence. Finally, the absolute position information and time interval information of the user sequence are modeled, and a time-aware self-attention model is adopted to predict the user's next action and generate a recommendation list. Various experiments are conducted on two public data sets. The experimental results show that the method proposed in this paper performs well on both dense and sparse data sets and outperforms the state of the art.

1. Introduction

With the development of Internet technology, recommender systems have become one of the indispensable tools in people's daily life [1–4]. Compared with traditional methods, sequential recommendation models perform well on the Top-N recommendation problem [5]. In recent years, with the development of deep learning technology, sequential recommendation models based on deep learning have been widely applied in areas such as e-commerce platforms, medical and health services [6, 7], and audiovisual platforms [8]. A user's interaction behavior with items on such platforms can be regarded as a sequence of behaviors in chronological order. Based on this, researchers have proposed various sequential recommendation models to mine and analyze user-item interaction information. The purpose of these models is to provide users with a personalized recommendation list of items that helps them filter out valuable information.

The recommendation model based on the Markov chain (MC) method [9] is one of the early approaches to sequential recommendation; it assumes that the user's next action is determined by the user's historical behavior and transforms the recommendation problem into a sequence prediction problem. In recent years, with the continuous breakthroughs of deep neural networks (DNNs) in the field of artificial intelligence [10–12], researchers have introduced a series of deep neural network models into the field of recommendation and achieved a series of results [13–15]. For example, Huang et al. [16] combined the traditional MC method and the recurrent neural network (RNN) to optimize the recommendation model and improve recommendation accuracy. Based on the long short-term memory (LSTM) network, Xu et al. [17] combined a self-attention network to capture users' complex and dynamic behavioral preferences. Inspired by semantic understanding models, Sun et al. [18] applied a bidirectional attention model to sequential recommendation, combining user context information to make recommendations. However, existing sequential recommendation models tend to perform poorly when there are a large number of short-sequence users in the data set [19]. In addition, most existing sequential recommendation models only consider the absolute position information of the user sequence and assume that adjacent items have the same time interval, ignoring the impact of the time intervals between items on the recommendation results, and thus cannot capture user preferences effectively [20].

To address the issues introduced above, some improved sequential recommendation models have been proposed. For example, Zhao et al. [21] employed a deep bidirectional long short-term memory network and an attention mechanism to capture changes in user preferences. Liu et al. [22] first used reverse training on short sequences to expand them and fine-tuned the model with the enhanced short sequences, which achieves certain gains on sparse data sets. Ahmadian et al. [23] adopted a deep learning based trust- and tag-aware recommender system that extracts latent features through a sparse autoencoder, which can effectively alleviate the problem of data sparsity. Li et al. [24] adopted a time-aware self-attention mechanism to explore the effect of different time intervals on the prediction results. These methods lay a good foundation for research on sequential recommendation models, but some problems remain unsolved. For example, the pseudo-historical items generated by direct reverse training are not accurate enough, and the time interval information is not sufficiently mined to capture user preferences well.

Based on previous research, we propose a sequential recommendation model based on short-sequence enhancement and an improved time-aware self-attention mechanism to address the above-mentioned problems. In the proposed model, the data set is first preprocessed to divide users into a long-sequence set and a short-sequence set. Then, by reverse training on the long-sequence set, a reverse prediction model is generated. Finally, the model is transferred to short-sequence users, and a batch of pseudo-historical items is generated before the initial item of each short sequence, to enhance the short sequence and alleviate the problem of data sparsity. At the same time, the model adopts an improved time-interval self-attention mechanism, which considers not only the influence of absolute position information on the recommendation effect but also the influence of the time interval between any two items on the recommendation result.

The proposed model can fully reflect the changes of user preferences over time and improve the accuracy of the recommender system. In summary, the main contributions of this paper are as follows: (1) We pretrain a reverse prediction model, use transfer learning to reversely predict short sequences, and generate a batch of pseudo-historical items before the initial items of each short sequence, so as to enhance short sequences. (2) Combining the absolute position information and time interval information of items, we use an improved time-aware self-attention mechanism to assign different items absolute position weights and time interval weights, fully exploit the changes in user behavior preferences, predict the user's next action, and generate a recommendation list. (3) We conduct extensive experiments on two real data sets. The results demonstrate the effectiveness of the proposed model, which outperforms existing methods on two different metrics. In addition, the influence of each key component of the proposed model on the recommendation results is examined through multiple experiments.

This paper is organized as follows. Section 2 introduces the related works. Section 3 details the proposed model. Section 4 provides the experiments and analysis of results. Section 5 discusses the parameters and important components of the proposed algorithm. Section 6 provides the conclusions.

2. Related Works

2.1. Sequential Recommendation Model

The earliest sequential recommendation models are mainly based on the MC method [25]. These MC-based models significantly improve over other types of recommendation algorithms in short-term prediction. However, this type of model cannot capture long-term behavioral features in the user sequence and suffers from low accuracy and high computational complexity in long-term prediction.

As deep learning technology has excelled in the fields of machine vision and natural language processing [26, 27], introducing deep learning into recommender systems has also become a focus for researchers. For example, Zhang et al. [28] designed a new session-based recommendation method based on a recurrent neural network, which fuses the user's general preference information and dynamic preference information. Sun et al. [29] proposed a method based on temporal context awareness and RNNs, which can effectively capture the correlation between items. In addition, the long short-term memory (LSTM) network and the gated recurrent unit (GRU), two popular variants of RNNs, have also achieved results in the field of recommendation. For example, Yuan et al. [30] computed the global state transitions of user sequences to model changes in user interest preferences, based on an improved GRU model. Zhao et al. [31] proposed a content-aware movie recommendation model based on LSTM, which effectively utilizes the long-term and short-term information of the sequence for content perception and movie recommendation. However, most models treat user behavior sequences as simple time-ordered sequences, without considering the time interval information between items. At the same time, existing models perform poorly on sparse data sets and short-sequence users.

2.2. Transformer-Based Model

Attention mechanisms have achieved great results in a large number of works, such as image processing [32, 33] and natural language processing [34]. The essence of the attention mechanism can be understood as selecting the important information from a large amount of information and assigning it weights, where the size of a weight represents the importance of the information. In recent years, the transformer, a neural network architecture based on a pure attention mechanism, has achieved excellent performance in the field of machine translation [35]. Inspired by this, researchers introduced the transformer model into recommender systems [36] and achieved good results. The transformer-based model uses scaled dot-product attention, which is presented as follows:

$$\mathrm{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d}}\right)\mathbf{V},$$

where $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are three matrices representing the queries, keys, and values, respectively; $\sqrt{d}$ is the scaling factor, which is used to avoid the inner product value being too large; and $\mathrm{softmax}(\cdot)$ is the normalization function [33].

The attention function can be described as mapping a query and a set of key-value pairs to an output, where the queries, keys, values, and output are all vectors. The output is computed as a weighted sum of the values. The transformer model adopts the multihead attention mechanism, which executes the attention function in parallel, concatenates the outputs of the heads, and projects them linearly once more to obtain the final result. The multihead attention mechanism enables the model to jointly attend to information from different representation subspaces at different positions. The architecture of the transformer-based model is shown in Figure 1.
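For concreteness, the following is a minimal NumPy sketch of scaled dot-product self-attention; the function and variable names are ours, for illustration only, and multihead projections are omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d) arrays. Returns a (seq_len, d) output."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # pairwise query-key similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of the values

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))              # a toy sequence of 5 items, d = 8
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V
print(out.shape)                         # (5, 8)
```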

3. Proposed Sequential Recommendation Model

In this paper, a sequential recommendation model (SeTsRec) based on short-sequence enhancement and a temporal self-attention mechanism is proposed, as shown in Figure 2. First, the original data are preprocessed, and users are divided into a long-sequence user set and a short-sequence user set; the long sequences are then input in reverse order into the transformer network to train a reverse prediction model. Subsequently, inspired by transfer learning [37], this reverse prediction model is transferred to short-sequence users to generate a batch of pseudo-historical items before the initial item of each short-sequence user's behavior list. By combining the pseudo-historical items and the short-sequence behavior lists, augmented short sequences are generated. Finally, the long sequences and the enhanced short sequences are used as input to train a time-aware self-attention recommendation model and predict the user's next action. The proposed model is described in detail below.

3.1. Problem Description

In the sequential recommendation problem, we assume that $U = \{u_1, u_2, \ldots, u_{|U|}\}$ is the user set of the system, where $|U|$ is the number of users in the data set, and $I = \{i_1, i_2, \ldots, i_{|I|}\}$ is the item set of the system, where $|I|$ is the number of items in the data set. For a certain user $u \in U$, $S^u = (s_1^u, s_2^u, \ldots, s_n^u)$ and $T^u = (t_1^u, t_2^u, \ldots, t_n^u)$ are the user's behavior sequence and time series, respectively, where $n$ is the length of the behavior sequence of the user. Each item in the behavior sequence is arranged in the chronological order of the user's interaction with it, and each element $t_j^u$ in the time series represents the actual interaction time between the user $u$ and the item $s_j^u$. At a certain moment, given the user's behavior sequence $S^u$ and time series $T^u$, the goal of the model is to predict the next item that the user is most likely to interact with, which is expressed as

$$p\left(s_{n+1}^u = i \mid S^u, T^u\right) = f\left(S^u, T^u\right),$$

where $p$ is the output probability of a certain item $i$ and $f(\cdot)$ is the nonlinear function that needs to be learned.

Recommender systems usually provide users with multiple recommendation results and finally generate a recommendation list containing $N$ items. Let $\hat{y}$ be the output probabilities of all the candidate items; according to these probabilities, the top $N$ items are selected for recommendation, which is the famous Top-N recommendation problem in recommender systems.
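As an illustration of Top-N selection from the candidate probabilities, a short NumPy sketch (the function name is ours):

```python
import numpy as np

def top_n(scores, n=10):
    """Return the ids of the n highest-scoring candidates, best first.
    `scores` is a 1-D array with one entry per item in the catalog."""
    idx = np.argpartition(-scores, n)[:n]   # unordered top n in O(|I|)
    return idx[np.argsort(-scores[idx])]    # order just those n items

scores = np.array([0.1, 0.9, 0.3, 0.7, 0.2])
print(top_n(scores, n=3))  # [1 3 2]
```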

3.2. Short-Sequence Enhancement

The sequential recommendation algorithm predicts the user's next action by mining the information contained in the user's behavior sequence. Therefore, the validity of the user behavior sequence information is crucial. Existing sequential recommendation methods have achieved good results. However, most of them do not solve the short-sequence prediction problem well and often perform poorly on sparse data sets. To deal with this limitation of sequential recommendation models on sparse data sets, the method proposed in this paper utilizes transfer learning to enhance short sequences, building on existing research, as detailed below.

3.2.1. Reverse Prediction Model

Ideally, in the field of machine learning, the data sets used for model training are expected to be dense and effective. However, in practice, data sets often exhibit severe sparsity. In sparse data sets, there are often a large number of missing or zero entries, which makes the data availability very poor and brings many difficulties to building recommendation models.

In this paper, the user set is first divided into a long-sequence user set $U_{long}$ and a short-sequence user set $U_{short}$ according to the length of the user sequence. The long-sequence user set is a dense data set, and the short-sequence user set is a sparse data set. For a long-sequence user $u \in U_{long}$, the behavior sequence $S^u = (s_1^u, s_2^u, \ldots, s_n^u)$ can be obtained. The long sequence is first reversed to obtain the reverse sequence $\tilde{S}^u = (s_n^u, \ldots, s_2^u, s_1^u)$, and then the reverse sequence is input into the transformer layer for training, to obtain a reverse prediction model. For the user $u$, the purpose of this reverse prediction model is to predict the previous item of the sequence $S^u$:

$$p\left(s_0^u \mid \tilde{S}^u\right) = f\left(\tilde{S}^u\right),$$

where $s_0^u$ represents the item preceding the item $s_1^u$ in the sequence $S^u$. Although the model is reverse-trained, the transformer network is still able to mine interitem correlations, which has been demonstrated in previous work [22].
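A minimal sketch of how the reverse training data can be prepared; the helper name and data layout are our assumptions, not the paper's interface:

```python
def reverse_sequences(user_sequences):
    """Reverse each chronologically ordered item sequence so that a
    standard left-to-right transformer trained on it learns to predict
    the item that came *before* the ones it has already seen."""
    return {user: seq[::-1] for user, seq in user_sequences.items()}

# toy usage: user 42 interacted with items 3 -> 7 -> 11 -> 5 in that order
long_sequences = {42: [3, 7, 11, 5]}
print(reverse_sequences(long_sequences)[42])  # [5, 11, 7, 3]
```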

3.2.2. Pseudo-Historical Item Generation

Existing methods often treat the data set as a whole for recommendation tasks, ignoring the different data quality of different users within the same data set. Specifically, some users have interacted with many items, and the data of these users are relatively rich and reliable, while other users have little interaction data, so their data are sparse and of poor usability. To solve this problem, this paper uses transfer learning to transfer the reverse prediction model obtained above from long-sequence users to short-sequence users [38]. The long-sequence reverse prediction task is taken as the source task, and the short-sequence reverse prediction task is taken as the target task. By fully mining the rich data provided by long-sequence users, the data sparsity problem of short-sequence users is alleviated, which improves the overall recommendation quality. Taking the short-sequence set as input, the reverse prediction model is used to generate pseudo-historical items for short-sequence users, namely,

$$p\left(s_0^u \mid \tilde{S}^u\right) = f\left(\tilde{S}^u\right), \quad u \in U_{short},$$

where $S^u$ represents the behavior sequence of a user in the short-sequence set and $s_0^u$ represents the item preceding the item $s_1^u$ in the sequence $S^u$.

For a data set, we define a length threshold $L_{short}$ for the short-sequence user set. Namely, if the length of a sequence (denoted by $n$) is less than $L_{short}$, the sequence is regarded as a short sequence; otherwise, it is regarded as a long sequence.

In this paper, we denote the generated set of pseudo-historical items as $(s_{-M+1}^u, \ldots, s_{-1}^u, s_0^u)$ and place this set before the initial item of the original short sequence to form an augmented short sequence, where $M$ is the total number of pseudo-historical items generated for the short sequence. Figure 3 shows the enhanced short-sequence set, in which the yellow part represents the generated pseudo-historical items and the green part represents the original short sequence; in Figure 3, it is assumed that $M$ pseudo-historical items have been generated. The generated enhanced short sequence is denoted by

$$S_{aug}^u = \left(s_{-M+1}^u, \ldots, s_0^u, s_1^u, \ldots, s_n^u\right).$$
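The iterative generation of the pseudo-history can be sketched as follows, assuming a `reverse_model` callable that returns the predicted preceding item (a placeholder of ours, not the paper's actual interface):

```python
def augment_short_sequence(seq, reverse_model, num_pseudo):
    """Prepend pseudo-historical items to a short sequence.
    `reverse_model` is assumed to take a sequence in reverse order and
    return the most likely item preceding its current first item."""
    augmented = list(seq)
    for _ in range(num_pseudo):
        pseudo = reverse_model(augmented[::-1])  # predict the item before the head
        augmented.insert(0, pseudo)              # prepend and repeat
    return augmented

# toy usage with a dummy "model" that always predicts item 99
print(augment_short_sequence([4, 8], lambda rev: 99, num_pseudo=2))
# [99, 99, 4, 8]
```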

3.3. Time-Aware Self-Attention Model

Existing sequential recommendation models simply treat the user's behavior list as an ordered sequence according to the interaction time between the user and the items. In addition, the items are regarded as having the same time interval. Specifically, as shown in Figure 4(a), if user A and user B have been exposed to the same items, the traditional method regards the time interval between the items in the two sequences as a fixed number of days, which leads to the same result for the two different users. However, such a result is unreasonable, because the actual times at which user A and user B accessed these items are different.

In actual application scenarios, the time intervals between items differ even if the users' behavior lists are exactly the same, because the actual interaction times between users and items differ. As shown in Figure 4(b), although user A and user B have been exposed to the same items, the time intervals between the items are different. In this case, if the model can incorporate the different time interval information between items, it can make more accurate recommendations. To solve the above problems, this paper adopts an improved time-aware self-attention model. The overall framework of the proposed model is shown in Figure 5 and is introduced in detail as follows.

3.3.1. Time Interval Matrix

After the augmented sequence of user behavior $S_{aug}^u$ is obtained, it is used as model input together with the long sequence $S^u$. First, the two types of behavior sequences are converted into a fixed-length sequence $s = (s_1, s_2, \ldots, s_L)$, where $L$ represents the maximum sequence length of the input model. If the length of the sequence $S_{aug}^u$ or $S^u$ is greater than $L$, only the latest $L$ items are considered; otherwise, padding items are added to the left of the sequence until its length reaches $L$.
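A minimal sketch of this truncate-or-left-pad step, assuming item id 0 is reserved for padding (an assumption of ours):

```python
def to_fixed_length(seq, max_len, pad_item=0):
    """Truncate to the latest max_len items, or left-pad with a reserved
    padding id (0 is assumed to be reserved for padding here)."""
    if len(seq) >= max_len:
        return seq[-max_len:]                    # keep the most recent items
    return [pad_item] * (max_len - len(seq)) + seq

print(to_fixed_length([3, 7, 11, 5], max_len=6))  # [0, 0, 3, 7, 11, 5]
print(to_fixed_length([3, 7, 11, 5], max_len=3))  # [7, 11, 5]
```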

Similarly, the time series $T^u$ of the long sequences and the time series of the augmented short sequences are converted into a fixed-length sequence $t = (t_1, t_2, \ldots, t_L)$.

If the length of the time series is greater than $L$, only the latest $L$ timestamps are considered; otherwise, the time corresponding to the first item is repeated on the left side of the sequence as padding until its length reaches $L$. In this study, the times of the pseudo-historical items generated in Section 3.2.2 are defined in turn using the average time interval, which is calculated as follows:

$$\bar{t}^u = \frac{1}{n-1}\sum_{j=1}^{n-1}\left(t_{j+1}^u - t_j^u\right), \qquad t_{-k}^u = t_{-k+1}^u - \bar{t}^u, \quad k = 0, 1, \ldots, M-1.$$

After obtaining the user’s fixed time series , define the time interval between any items as . Due to the different frequency of interaction between different users and items, this paper adopts the relative length of the time interval between items, which is defined as follows [24]:

Finally, the time interval matrix $M^u \in \mathbb{R}^{L \times L}$ of the user can be obtained:

$$M^u = \begin{pmatrix} r_{11}^u & r_{12}^u & \cdots & r_{1L}^u \\ r_{21}^u & r_{22}^u & \cdots & r_{2L}^u \\ \vdots & \vdots & \ddots & \vdots \\ r_{L1}^u & r_{L2}^u & \cdots & r_{LL}^u \end{pmatrix}.$$
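A NumPy sketch of the interval matrix computation under our reading of the equations above (the function name and the clipping threshold `k` are ours):

```python
import numpy as np

def time_interval_matrix(timestamps, k):
    """Pairwise relative time intervals: scale every gap |t_i - t_j| by
    the user's minimum nonzero gap, then clip at threshold k."""
    t = np.asarray(timestamps, dtype=np.int64)
    gaps = np.abs(t[:, None] - t[None, :])   # (L, L) matrix of |t_i - t_j|
    nonzero = gaps[gaps > 0]
    t_min = nonzero.min() if nonzero.size else 1
    return np.minimum(gaps // t_min, k)      # integer relation ids in [0, k]

print(time_interval_matrix([10, 12, 20], k=4))
# [[0 1 4]
#  [1 0 4]
#  [4 4 0]]
```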

3.3.2. Time-Aware Self-Attention Module

(1) Time-Aware Self-Attention Layer. For each input sequence, an embedding layer is applied to convert the user behavior sequence into an item embedding matrix $E^S \in \mathbb{R}^{L \times d}$, the absolute position information into a position embedding matrix $E^P \in \mathbb{R}^{L \times d}$, and the time interval information into a time interval embedding matrix $E^R \in \mathbb{R}^{L \times L \times d}$ ($d$ is the latent dimension).

Then a new sequence $Z = (z_1, z_2, \ldots, z_L)$ is calculated:

$$z_i = \sum_{j=1}^{L} \alpha_{ij}\left(e_{s_j} W^V + e_{r_{ij}}^V + e_{p_j}^V\right),$$

where the attention weight $\alpha_{ij}$ is obtained from the input item embedding, absolute position embedding, and time interval embedding, namely,

$$\alpha_{ij} = \frac{\exp\left(e_{ij}\right)}{\sum_{l=1}^{L}\exp\left(e_{il}\right)}, \qquad e_{ij} = \frac{e_{s_i} W^Q \left(e_{s_j} W^K + e_{r_{ij}}^K + e_{p_j}^K\right)^{T}}{\sqrt{d}} + b,$$

where $W^V$, $W^Q$, and $W^K \in \mathbb{R}^{d \times d}$, respectively, represent the projection matrices of the value, query, and key; $\sqrt{d}$ is the scale factor, which is used to prevent the inner product from being too large; and $b$ is the bias term.
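The following PyTorch sketch shows one possible single-head implementation of this layer, following the TiSASRec-style formulation [24]; all class and parameter names are ours, and details such as causal masking and multiple heads are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeAwareSelfAttention(nn.Module):
    """Single-head time-aware self-attention: position and interval
    embeddings are added to the keys and values before attention."""
    def __init__(self, num_items, max_len, max_interval, d):
        super().__init__()
        self.item_emb = nn.Embedding(num_items + 1, d, padding_idx=0)
        self.pos_k = nn.Embedding(max_len, d)            # positions for keys
        self.pos_v = nn.Embedding(max_len, d)            # positions for values
        self.rel_k = nn.Embedding(max_interval + 1, d)   # intervals for keys
        self.rel_v = nn.Embedding(max_interval + 1, d)   # intervals for values
        self.wq, self.wk, self.wv = (nn.Linear(d, d) for _ in range(3))

    def forward(self, items, intervals):
        # items: (B, L) item ids; intervals: (B, L, L) clipped interval ids
        B, L = items.shape
        d = self.item_emb.embedding_dim
        pos = torch.arange(L, device=items.device)
        x = self.item_emb(items)                  # (B, L, d)
        q = self.wq(x)
        k = self.wk(x) + self.pos_k(pos)          # broadcast (L, d) over batch
        v = self.wv(x) + self.pos_v(pos)
        rk, rv = self.rel_k(intervals), self.rel_v(intervals)  # (B, L, L, d)
        # e_ij = q_i . (k_j + r_ij) / sqrt(d)
        scores = (torch.einsum('bid,bjd->bij', q, k)
                  + torch.einsum('bid,bijd->bij', q, rk)) / d ** 0.5
        attn = F.softmax(scores, dim=-1)          # (B, L, L)
        # z_i = sum_j a_ij (v_j + r_ij)
        return (torch.einsum('bij,bjd->bid', attn, v)
                + torch.einsum('bij,bijd->bid', attn, rv))
```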

(2) Point-Wise Feed-Forward Network. The self-attention layer of the model combines the absolute position information and the relative time interval information of items mainly through linear combination. In order to give the model nonlinear characteristics and account for the interactions between different latent dimensions, we apply a point-wise feed-forward network to each output $z_i$ of the self-attention layer:

$$F_i = \mathrm{ELU}\left(z_i W_1 + b_1\right) W_2 + b_2,$$

where $\mathrm{ELU}(\cdot)$ is the activation function used in this paper. The main reason for using the ELU function is that it solves the dead ReLU problem while reducing the effect of bias shift, and it converges faster. $W_1$ and $W_2$ represent the weight matrices, and $b_1$ and $b_2$ represent the bias terms.
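A corresponding PyTorch sketch of the point-wise feed-forward block (keeping both layers at width $d$ is our assumption):

```python
import torch.nn as nn
import torch.nn.functional as F

class PointWiseFeedForward(nn.Module):
    """Two linear maps with an ELU in between, applied identically at
    every sequence position."""
    def __init__(self, d):
        super().__init__()
        self.fc1 = nn.Linear(d, d)
        self.fc2 = nn.Linear(d, d)

    def forward(self, z):
        # z: (batch, seq_len, d); nn.Linear acts on the last dimension
        return self.fc2(F.elu(self.fc1(z)))
```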

(3) Stacking Self-Attention Blocks. As self-attention layers and point-wise feed-forward networks are stacked, problems such as overfitting and long training time arise. To solve these problems, this paper adopts residual connections, dropout, and layer normalization [36]:

$$H = X + \mathrm{Dropout}\left(f\left(\mathrm{LayerNorm}(X)\right)\right), \qquad \mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta,$$

where $f(\cdot)$ denotes the self-attention layer or the point-wise feed-forward network; $\odot$ is the element-level product; $\mu$ and $\sigma^2$ represent the mean and variance of $x$; and $\gamma$ and $\beta$ represent the learned scale factor and bias term.

The specific workflow is as follows: for each self-attention block, layer normalization is first applied to the input $X$, which helps stabilize and speed up the training of the neural network. Then, the output of the self-attention layer is fed into the point-wise feed-forward network to give the model nonlinear features. Finally, dropout regularization is applied to the output of the point-wise feed-forward network to alleviate the overfitting problem common in deep neural networks. The main reason for using dropout is that it controls overfitting by artificially corrupting the data, which has been proven effective in various neural network architectures [39, 40].
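This workflow can be sketched as a reusable residual wrapper; the class name and the dropout rate shown are ours:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """LayerNorm -> sublayer -> Dropout -> residual add, matching the
    workflow described above. `sublayer` is the self-attention layer or
    the point-wise feed-forward network."""
    def __init__(self, d, sublayer, p_drop=0.2):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.sublayer = sublayer
        self.drop = nn.Dropout(p_drop)

    def forward(self, x, *args):
        return x + self.drop(self.sublayer(self.norm(x), *args))
```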

3.3.3. Prediction Layer

After the prediction layer obtains the final representation of the item absolute positions and time intervals, in order to predict the user's possible next action, we use a softmax function to calculate the user's interaction probability with each candidate item $i$, namely,

$$p\left(s_{t+1} = i \mid s_1, s_2, \ldots, s_t\right) = \mathrm{softmax}\left(h_t\, e_i^{T}\right),$$

where $e_i$ represents the embedding vector of item $i$ and $h_t$ represents the representation of the first $t$ given items, together with the time intervals between them.

3.3.4. Model Inference

This paper uses implicit user interaction data. In implicit feedback, the interaction between a user and an item can be regarded as a binary classification problem, where 1 means that the user likes an item and 0 means that the user does not like, or has not touched, the item. Therefore, the items in the user behavior sequence can be regarded as positive samples. At the same time, the items that the user has never touched are regarded as negative feedback, from which negative samples are drawn. In this paper, sampling is carried out according to a fixed ratio of positive to negative samples. When negative sampling is performed for each user, items with higher popularity are preferred, since they are more representative. The loss function is as follows [30]:

$$\mathcal{L} = -\sum_{u \in U}\sum_{t=1}^{L}\left[\log \sigma\left(p_t^{+}\right) + \log\left(1 - \sigma\left(p_t^{-}\right)\right)\right] + \lambda \lVert \Theta \rVert_F^2,$$

where $p_t^{+}$ represents the predicted score of the positive candidate item; $p_t^{-}$ represents the predicted score of the negative sample; $\Theta$ is the set of embedding matrices; and $\lambda$ is the regularization parameter, which is used to prevent the model from overfitting. In the training process, the Adam optimizer is used to optimize the model, which is a variant of the stochastic gradient descent (SGD) algorithm [41]. As an adaptive learning rate optimization algorithm, the Adam optimizer is often used for tasks in sparse data scenarios, and its convergence speed is fast [42].
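A PyTorch sketch of this loss under our reading of the equation above; the function name and the regularization weight are ours:

```python
import torch
import torch.nn.functional as F

def sequence_loss(pos_scores, neg_scores, embeddings, reg=1e-5):
    """Binary cross-entropy over positive items and sampled negatives,
    plus an L2 penalty on the embedding matrices."""
    loss = F.binary_cross_entropy_with_logits(
        pos_scores, torch.ones_like(pos_scores))
    loss = loss + F.binary_cross_entropy_with_logits(
        neg_scores, torch.zeros_like(neg_scores))
    loss = loss + reg * sum(p.pow(2).sum() for p in embeddings)
    return loss
```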

In summary, the process of the proposed algorithm is shown in Algorithm 1.

Input: The behavior sequence $S^u$ and time series $T^u$ of each user $u \in U$
Output: The recommendation list of user $u$, denoted as $R_u$
(1) for $u$ in $U$ do
(2)  if $|S^u| \geq L_{short}$ then
(3)   add $u$ to the long-sequence set $U_{long}$
(4)  else
(5)   add $u$ to the short-sequence set $U_{short}$ % Data preprocessing
(6)  end if
(7) end for
(8) for $u$ in $U_{long}$ do
(9)  train the transformer network on the reversed sequence $\tilde{S}^u$ % Reverse prediction model training
(10) end for
(11) for $u$ in $U_{short}$ do
(12)  generate pseudo-historical items with the reverse prediction model and prepend them to $S^u$ % Short-sequence enhancement
(13) end for
(14) for $u$ in $U$ do
(15)  Generate the time interval matrix $M^u$;
(16)  Calculate the time-aware self-attention model;
(17)  Apply the point-wise feed-forward network and further processing;
(18)  Calculate the prediction and the loss;
(19) end for
(20) return $R_u$;

4. Experiments

4.1. Setting of Experiments
4.1.1. Data Sets

To verify the effectiveness of the proposed algorithm, experiments were carried out on two public data sets, namely the Movielens-1M data set (denoted by ML-1M, see https://files.grouplens.org/datasets/movielens/ml-1m.zip) and the Amazon Beauty data set (denoted by AM-BE, see https://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Beauty_5.json.gz). The ML-1M data set is dense, while the AM-BE data set is sparse. The dense data set ML-1M is used to evaluate the effectiveness of the time self-attention improvement in the proposed model, while the sparse data set AM-BE is used to evaluate the effectiveness of the improved short-sequence enhancement method. The statistics of the two data sets, including information such as users, items, and timestamps, are listed in Table 1.

Before the experiments, the two data sets are preprocessed [17]. For all data sets, we treat rating behaviors as implicit feedback, where "1" means that there is an interaction between the user and the item and "0" means that there is no interaction. Then, the behaviors are sorted according to the chronological order of the actual interactions between users and items to generate the users' historical behavior sequences.

In this paper, the leave-one-out method is used to train and test the model [43]. Namely, the item of each user's last behavior is taken as the ground truth and used as the test set, the item of the second-to-last behavior is taken as the validation set, and all the remaining items are used as the training set. The advantage of the leave-one-out method is that it is not affected by random sample division and can use as large a sample as possible for training, making it suitable for sparse data sets.

4.1.2. Evaluation Metrics

This paper adopts two metrics commonly used in the Top-N recommendation problem, namely the hit rate (HR) and the normalized discounted cumulative gain (NDCG) [29], to evaluate the recommendation performance of the model.

The hit rate (HR) is a common indicator for measuring recall, which intuitively measures whether the ground-truth item appears in the first $N$ items of the recommendation list. The larger the HR, the more accurate the recommendation. HR is calculated as follows:

$$\mathrm{HR@}N = \frac{\mathrm{Hits@}N}{|GT|},$$

where $|GT|$ represents the number of all items in the test set and $\mathrm{Hits@}N$ represents the number of items in the top $N$ of the users' recommendation lists that belong to the test set.

NDCG is often used to evaluate the ranking accuracy of recommendation results [44]. NDCG introduces a position influence factor to discount lower-ranked recommendations. NDCG is calculated as follows:

$$\mathrm{NDCG@}N = \frac{1}{Z}\sum_{j=1}^{N}\frac{2^{rel_j} - 1}{\log_2(j+1)},$$

where $Z$ is the normalization factor (the ideal DCG), which is used to keep the value of NDCG between 0 and 1, and $rel_j$ represents the predicted relevance of the item at position $j$ in the list: $rel_j = 1$ if the item is in the test set; otherwise, $rel_j = 0$.

The above two indicators reflect the quality of the recommendation list well. This paper truncates the recommendation list at the top 10 items, namely $N = 10$, and uses HR@10 and NDCG@10 to evaluate the performance of the recommendation model.
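For reference, a small sketch of how HR@10 and NDCG@10 can be computed under the leave-one-out protocol described in Section 4.1.1 (function and variable names are ours):

```python
import numpy as np

def hr_ndcg_at_k(ranked_lists, ground_truth, k=10):
    """HR@k and NDCG@k with exactly one held-out ground-truth item per
    user, as in the leave-one-out evaluation used here."""
    hits, ndcg = 0.0, 0.0
    for user, ranked in ranked_lists.items():
        topk = list(ranked[:k])
        if ground_truth[user] in topk:
            hits += 1.0
            rank = topk.index(ground_truth[user])  # 0-based position
            ndcg += 1.0 / np.log2(rank + 2)        # ideal DCG is 1 here
    n = len(ranked_lists)
    return hits / n, ndcg / n

# toy usage: two users, one hit at rank 0 and one miss
print(hr_ndcg_at_k({1: [5, 2, 9], 2: [7, 3, 1]}, {1: 5, 2: 8}, k=3))
# (0.5, 0.5)
```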

Remark 1. The method proposed in this paper is based on deep neural networks and transfer learning, so it requires more time for model training. However, the proposed model is trained offline, and prediction is very fast. Thus, computational complexity is not used as an evaluation metric in this study, which is common practice in the literature on recommendation problems [19, 44].

4.1.3. Comparison Methods

To show the efficiency of the proposed method (denoted by SeTsRec), various methods are used for comparison in this paper, including nonsequential recommendation methods, classic sequential recommendation methods, and the latest sequential recommendation methods. In the experiments, these comparison methods are configured with the optimal parameters declared in their respective papers. The comparison methods are listed as follows:

(a) POP [28]: a simple baseline method that generates recommendation lists based on item popularity rankings, namely, more popular items rank higher.

(b) BPR [45]: the Bayesian personalized ranking method, a classic nonsequential recommendation method using matrix factorization.

(c) FPMC [46]: a sequential recommendation method that combines matrix factorization and the Markov chain method.

(d) GRU4Rec+ [47]: an RNN-based deep sequential recommendation model for user sessions.

(e) Caser [48]: a CNN-based sequential recommendation method that captures high-order Markov chains by applying convolution operations to the embedding matrix of the most recent items.

(f) SASRec [36]: one of the state-of-the-art sequential recommendation methods and the first to use a self-attention-based sequential recommendation model.

(g) TiSASRec [24]: a state-of-the-art sequential recommendation model that incorporates time interval information into the self-attention mechanism to capture the relatedness between items.

4.1.4. Other Settings

The experiments are conducted on a computer running Windows 10, and the programming language used is Python 3.6. In this study, the model uses two self-attention blocks. Because different data sets have different sparsity, some parameters differ between them, such as the maximum sequence length $L$ and the latent dimension $d$. The settings of these parameters are discussed in detail in Section 5. The parameters of the proposed deep network and the experimental environment are listed in Table 2.

4.2. Experimental Results and Analysis

Table 3 shows the experimental results of the proposed algorithm and all baseline methods on the two data sets under different indicators. The results in Table 3 show that the best recommendation effect is achieved by the proposed method, which proves the superiority of our model. The results are analyzed in detail as follows:

(1) In most cases, the sequential recommendation methods FPMC, GRU4Rec+, and Caser outperform the nonsequential recommendation methods POP and BPR. This indicates the necessity of considering the order of user behavior lists in recommender systems. The order information of a user's behavior sequence can characterize changes in user preference to a certain extent and can effectively improve the performance of the recommender system.

(2) Compared with the three classical sequential recommendation methods, the latest attention-based SASRec and TiSASRec methods outperform all other baseline methods on the two different types of data sets, which indicates that the attention mechanism can effectively improve the performance of recommender systems.

(3) The proposed algorithm SeTsRec improves on the existing algorithms. Through short-sequence enhancement and an improved time-aware self-attention mechanism, it not only works well on the dense data set but also achieves the best results on the sparse data set. On the dense data set ML-1M, the HR@10 and NDCG@10 of the proposed method are improved by 1.1% and 1.7%, respectively, compared with the second-best method TiSASRec. On the sparse data set AM-BE, the performance of the proposed method is improved by 9.4% and 7.7%, respectively, compared with TiSASRec. Compared with the baseline methods, our model adopts an improved time-aware self-attention mechanism, which can adaptively assign different weights to the item absolute position information and time interval information on the two different types of data sets.

5. Discussions

The results of the experiments in Section 4 show that the proposed model performs better than the state of the art. The influence of the key parameters is discussed in this section. In addition, ablation experiments are conducted to further examine the effectiveness of the improvements in the proposed model.

5.1. About the Latent Dimension

First, the influence of the latent dimension $d$ on the recommendation performance of our model is discussed through a set of experiments, in which the other hyperparameters are kept unchanged while the latent dimension is varied. The experimental results are shown in Figures 6 and 7.

It can be observed from Figure 6 (on the dense data set ML-1M) that the overall performance of the model improves as the latent dimension increases and gradually tends to converge. However, on the sparse data set AM-BE, larger latent dimensions do not lead to better performance. The reason is that too many latent dimensions lead to overfitting and thus degrade the model performance on a sparse data set. On the ML-1M data set, the algorithm tends to converge once the latent dimension is sufficiently large. Considering both the performance and the time cost of the model, this paper sets a larger latent dimension on the ML-1M data set and a smaller one on the AM-BE data set.

5.2. About the Maximum Sequence Length

Another important parameter of the proposed model is the maximum sequence length $L$. To examine the influence of the maximum sequence length of the input model on recommendation performance, experiments are conducted in which the maximum sequence length is varied in the range [10, 80] while the other hyperparameters are kept unchanged. The experimental results are shown in Figures 8 and 9.

It can be observed from Figure 8 (on the dense data set ML-1M) that the model achieves satisfactory performance once the sequence length is sufficiently large. Therefore, to balance model performance and time cost, a relatively large maximum sequence length is used on the ML-1M data set. On the sparse data set AM-BE, it can be observed that the model performance does not change much as $L$ changes. This is because the average sequence length of the AM-BE data set is 4.4 (see Table 1); even after a certain degree of short-sequence enhancement, a longer sequence input does not provide more useful information but increases the time cost of the model. Therefore, a small maximum sequence length is set on the AM-BE data set.

5.3. Ablation Experiments

This section discusses the impact of the two major improvements in the proposed model, namely the short-sequence enhancement and the time-aware self-attention mechanism. In these ablation experiments, the variant based only on short-sequence enhancement is denoted SeTsRec-Se, and the variant based only on time-aware self-attention is denoted SeTsRec-Ts; they are compared with the existing SASRec method and our full algorithm SeTsRec. The experimental results are shown in Table 4 and Figures 10 and 11 and are analyzed in detail as follows:

(1) On the dense data set ML-1M, the SeTsRec-Ts method outperforms the SeTsRec-Se method and the SASRec method. The experimental results show that the improvement brought by the short-sequence enhancement method is limited in this case. The SeTsRec-Ts method achieves good results, improving HR@10 and NDCG@10 by 1.4% and 3.3%, respectively, compared with SASRec, which is close to the full SeTsRec algorithm.

(2) On the sparse data set AM-BE, the SeTsRec-Se method is better than the SeTsRec-Ts method and the SASRec method, improving HR@10 and NDCG@10 by 3.6% and 3.2%, respectively, compared with SeTsRec-Ts. The experimental results show that enhancing short sequences is necessary on sparse data sets. In addition, the model effect does not improve much if only the time-aware self-attention mechanism is used. This is due to the large proportion of short-sequence users in the sparse data set, which limits the overall recommendation effect of the model.

(3) In summary, the proposed algorithm SeTsRec not only uses short-sequence enhancement to alleviate the problem of data sparsity but also combines the time-aware self-attention mechanism to fully consider the change of user preferences over time. Thus, it outperforms existing methods on both dense and sparse data sets.

6. Conclusions

This paper proposes a sequential recommendation algorithm based on improved short-sequence enhancement and a temporal self-attention mechanism. The proposed algorithm first trains a reverse prediction model on the long-sequence users in the data set, to predict the prior items in a user sequence. Then, the model is transferred to short-sequence users, and pseudo-historical items are generated to enhance the short sequences. After enhancing the short sequences, an improved time-aware self-attention model is adopted, which adaptively assigns different weights by combining the time interval information and absolute position information of items. It can deeply mine the changes of user preferences over time. Experimental results show that our method outperforms existing sequential recommendation methods on different data sets. In the future, more accurate pseudo-historical items could be generated by improving the reverse prediction model, to further improve the recommendation effect.

Data Availability

Publicly available data sets were analyzed in this study. These data can be found at https://files.grouplens.org/datasets/movielens/ml-1m.zip and https://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Beauty_5.json.gz.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding this work.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61873086) and the Science and Technology Support Program of Changzhou (CE20215022).