Research Article

Word Sequential Using Deep LSTM and Matrix Factorization to Handle Rating Sparse Data for E-Commerce Recommender System

Table 3

Preprocessing the text document.

Method | Step description

Set the maximum words | Product reviews can contain long text, so this experiment limits the number of words per review to a maximum of 300. Most movie reviews consist of fewer than 300 words, and, following previous works, this limit is sufficient to capture the representation of the user’s expression.
Remove stop words | Any category of words can be selected as stop words for a given purpose. In search engine applications, these are typically common, short function words, for example, “there are,” “in,” “where,” “also,” “on,” and “and,” particularly when they occur in phrases such as “the on,” “the also,” “there are,” or “in where.” Other search engines also remove some of the most common words from a query, including lexical words such as “need,” to improve performance.
Remove frequently occurring words | This step removes corpus-specific stop words, that is, words whose document frequency is higher than 0.5. It is essential for preventing words that appear too often from dominating the document representation.
Remove non-English vocabulary | This step removes all non-English vocabulary words from the catalog documents. As a result, the average number of words per document is 97.09 on the MovieLens 1 million (ML-1M) dataset and 92.05 on the Amazon Instant Video (AIV) dataset. Items without a description document are removed from every dataset and, specifically in the Amazon dataset, users with fewer than three ratings are also removed. As a result, the dataset statistics show that the three datasets have distinct characteristics. Even though many users were removed in preprocessing, the Amazon dataset remains very sparse compared with the other data.
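
The four steps in Table 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`tokenize`, `preprocess`, `filter_users`), the tiny stop-word list, and the English vocabulary passed in by the caller are all hypothetical, and the steps are simply applied in the order in which the table lists them. The user filter assumes that the rating rule means dropping users with fewer than three ratings.

```python
import re
from collections import Counter

MAX_WORDS = 300      # word cap stated in Table 3
MAX_DOC_FREQ = 0.5   # document-frequency threshold stated in Table 3

# Hypothetical, deliberately tiny stop-word list; a real run would load a full one.
STOP_WORDS = {"the", "a", "an", "in", "on", "and", "also", "where",
              "there", "are", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(documents, english_vocab, stop_words=STOP_WORDS,
               max_words=MAX_WORDS, max_doc_freq=MAX_DOC_FREQ):
    """Apply the Table 3 steps to a dict {item_id: raw_text}."""
    # Drop items that have no description document.
    docs = {i: tokenize(t) for i, t in documents.items() if t and t.strip()}

    # Step 1: keep at most max_words tokens per document.
    docs = {i: toks[:max_words] for i, toks in docs.items()}

    # Step 2: remove general stop words.
    docs = {i: [w for w in toks if w not in stop_words]
            for i, toks in docs.items()}

    # Step 3: remove corpus-specific stop words, i.e. words that occur
    # in more than max_doc_freq of the documents.
    doc_freq = Counter(w for toks in docs.values() for w in set(toks))
    n_docs = len(docs)
    too_frequent = {w for w, c in doc_freq.items() if c / n_docs > max_doc_freq}
    docs = {i: [w for w in toks if w not in too_frequent]
            for i, toks in docs.items()}

    # Step 4: keep only words found in the supplied English vocabulary.
    return {i: [w for w in toks if w in english_vocab]
            for i, toks in docs.items()}


def filter_users(ratings, min_ratings=3):
    """Keep (user, item, rating) triples of users with at least min_ratings
    ratings. The threshold semantics are an assumed reading of Table 3."""
    counts = Counter(user for user, _item, _rating in ratings)
    return [r for r in ratings if counts[r[0]] >= min_ratings]


# Toy catalog: "m3" has no description and is dropped before step 1.
catalog = {
    "m1": "The movie is great and the acting is superb",
    "m2": "Great soundtrack, also muy bueno",
    "m3": "",
    "m4": "A slow plot in a dull movie",
}
vocab = {"movie", "great", "acting", "superb", "soundtrack",
         "slow", "plot", "dull"}
cleaned = preprocess(catalog, vocab)
print(cleaned)
```

In this toy catalog, "movie" and "great" each appear in two of the three remaining documents, so step 3 removes them as corpus-specific stop words even though they are valid English words. In practice, the stop-word list and the English vocabulary would come from an established resource rather than from hand-written sets.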