| Method | Step description |
| --- | --- |
| Set the maximum words | Product reviews can be long. In this experiment, each document is limited to a maximum of 300 words, since most movie reviews contain fewer than 300 words. Following previous work, this length is sufficient to capture a representation of the user’s expression. |
| Remove stop words | Any category of words can be chosen as stop words for a given purpose. In search engine applications, these are typically short, common function words such as “there,” “are,” “in,” “where,” “also,” “on,” and “and,” especially in phrases like “the on,” “the also,” “there are,” or “in where.” Other search engines also remove some of the most common lexical words, such as “need,” from a query to improve performance. |
| Remove frequently occurring words | Corpus-specific stop words, that is, words whose document frequency exceeds 0.5, are removed from the corpus. This step prevents words that appear too often from dominating. |
| Remove non-English vocabulary | All non-English vocabulary words are removed from the catalog documents. After this step, the average number of words per document is 97.09 on MovieLens 1 Million (ML-1M) and 92.05 on Amazon Instant Video (AIV). Items without a description document are removed from every dataset catalog, in particular from the Amazon dataset. Users with fewer than three ratings are also removed. As a result, the data table shows three datasets with distinct specifications. Even though many users were removed in preprocessing, the Amazon dataset remained sparse relative to the other data. |
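The four steps in the table can be sketched in plain Python. This is only a minimal illustration, not the authors' implementation: the function names, the tiny stop-word list, the `english_vocab` set, and the order in which the steps are applied are all assumptions; only the 300-word limit and the 0.5 document-frequency threshold come from the table.

```python
import re
from collections import Counter

MAX_WORDS = 300      # maximum words kept per document (from the table)
MAX_DOC_FREQ = 0.5   # document-frequency threshold (from the table)

# Assumption: a tiny illustrative stop-word list, not the one used in the paper.
STOP_WORDS = frozenset(
    {"the", "a", "an", "and", "in", "on", "also", "where", "there", "are", "is"}
)


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def preprocess(documents, english_vocab, stop_words=STOP_WORDS,
               max_words=MAX_WORDS, max_doc_freq=MAX_DOC_FREQ):
    """Apply the four table steps to a list of raw document strings."""
    # Step 1: keep at most max_words tokens per document.
    docs = [tokenize(text)[:max_words] for text in documents]

    # Step 2: remove generic stop words.
    docs = [[w for w in doc if w not in stop_words] for doc in docs]

    # Step 3: remove words that occur in more than max_doc_freq of the documents.
    doc_freq = Counter(w for doc in docs for w in set(doc))
    too_common = {w for w, n in doc_freq.items() if n / len(docs) > max_doc_freq}

    # Step 4: keep only words found in the (hypothetical) English vocabulary.
    return [[w for w in doc if w not in too_common and w in english_vocab]
            for doc in docs]


if __name__ == "__main__":
    sample = ["The film is great and the plot is great", "The film was boring"]
    print(preprocess(sample, english_vocab={"film", "great", "plot", "boring"}))
```

In practice the English vocabulary would come from a dictionary or word list of the reader's choosing, and the stop-word list from a standard NLP toolkit; both are left as parameters here for that reason.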