Abstract
This paper first reviews the research status of online review text mining and identifies the problems in the mining and application of tourism texts. To address these problems, it proposes a text mining method for tourism online reviews based on natural language processing and text classification technology. The first step is to analyze the validity of the online review texts in order to remove invalid text and improve mining efficiency. The second step is to conduct a comprehensive evaluation of scenic spots and hotels based on text classification technology and sentiment analysis; comprehensive evaluation indicators are established for the five core service contents, and high-quality scenic spots and hotels are selected according to the ranking of the comprehensive evaluation. The third step is to mine tourism hot words, based on natural language processing, from the reviews of the selected high-quality tourist locations; the obtained hot words intuitively show tourists' impressions of a scenic spot. The fourth step is to use mutual information combined with left and right entropy to discover new words and to mine the service characteristics of high-quality scenic spots and hotels from these new words. Finally, the proposed methods are tested on crawled tourism online review texts. The experimental results show that the novel comprehensive evaluation method proposed in this paper can truly and objectively select high-quality scenic spots and hotels and provide an important basis for tourism management decision-making. On this basis, hot words and new words can be effectively mined from the relevant online review texts, and travel impressions can be fed back from multiple aspects and angles.
1. Introduction
With the booming development of tourism, tourists pay increasing attention to the practical experience of travel, and relevant government departments and tourism enterprises are focusing on improving the service quality of the tourism industry. Major travel websites, such as Trip, TravelGo, and eLong, all provide rich tourist comment functions, and tourists can comment on scenic spots or hotels in tourist destinations from various aspects. These online review texts generally reflect the visitor's experience. They can not only serve as a reference for other tourists but also provide suggestions for improving and enhancing tourism services to scenic spot operators, local cultural tourism administrations, and tourism enterprises [1].
Online review texts are often created by thousands or even tens of thousands of tourists and are a type of user-generated content (UGC). In recent years, with the help of modern computers and artificial intelligence, especially natural language processing technology, these text data can be automatically processed and analyzed to obtain indicators reflecting the impression of tourists, thus providing a decision-making reference for the development of tourism [2]. At present, many scholars have carried out research on improving tourism services by analyzing online review texts.
Reference [3] used Sina blog posts about Zhujiajiao to characterize the tourism perception image of Zhujiajiao through text mining; Cai et al. [4] studied the audience perception of the urban tourism image of Guiyang City using the ROST text mining software. Reference [5] applied text mining to conduct word frequency, sentiment, and semantic network analysis of online travel notes about Gansu tourist spots drawn from four typical travel websites: Trip, Mafengwo, Lvmama, and Tuniu. Li et al. [6] used text mining to study the image perception in tourists' comments on typical urban tourism communities in Beijing, such as Baidu Travel and https://Trip.com/. Reference [7] conducted a comparative study of tourists' opinions and suggestions in Guangxi Qinbeifang based on text mining, formulated tourism policies for the government and related tourism management departments, and provided an important direction, of real practical significance, for the development of tourism in Guangxi Qinbeifang.
After analyzing the current research status, it is found that the above studies do not apply online review texts systematically, lack a comprehensive evaluation mechanism for scenic spots and hotels, and cannot reflect tourists' travel impressions from multiple angles and levels, so they cannot effectively provide feedback on the services and characteristics of hotels and scenic spots. This paper therefore proposes analyzing and processing online review texts based on natural language processing and text classification technology. The first step is the validity analysis of online review texts, whose purpose is to remove invalid text. The second step is to use text classification technology combined with sentiment analysis to carry out a comprehensive evaluation of scenic spots and hotels, comprehensively scoring the five services that tourists are most concerned about and selecting high-quality tourist spots and hotels. The third step is to use the named entity recognition method from natural language processing to mine hot words for the top-ranked scenic spots and hotels; the obtained hot words effectively reflect tourists' intuitive impressions of scenic spots and hotels. The fourth step is to analyze the characteristic services of scenic spots and hotels based on new word discovery; the service characteristics of high-quality scenic spots and hotels are mined from the new words. Together, these methods reflect tourists' travel impressions from multiple perspectives.
2. Related Technical Analysis
2.1. Text Classification Techniques
2.1.1. Bayesian Methods and Naive Bayes
Bayesian methods and theories were first proposed by British mathematician Thomas Bayes. In recent years, with the development of artificial intelligence, especially the rise of machine learning, data mining, and other technologies, Bayesian theory has a broader development and application space.
The Bayesian classifier is a general term for a class of classification algorithms, all based on Bayes' theorem. Their applications in text mining mainly involve naive Bayes classifiers and Bayesian network classifiers. Naive Bayes is the simplest and most common type of Bayesian classifier; the algorithm assumes that the probabilities of the words appearing in a text are mutually independent [8].
Assume that $C = \{c_1, c_2, \ldots, c_m\}$ is the set of text categories. Determining whether a text $d$ belongs to a category $c_i$ can be done by calculating the probability $P(c_i \mid d)$, i.e., given a text $d$, calculating the probability that it belongs to the text category $c_i$. The discriminant rule of naive Bayes is to assign $d$ to the category for which $P(c_i \mid d)$ reaches the maximum probability, i.e., to solve $\hat{c} = \arg\max_{c_i \in C} P(c_i \mid d)$.
The algorithm further assumes that each attribute takes its values independently of the values of the other attributes [9]. The core idea is that, for a given item to be classified, the conditional probability of each category given this item is computed, and the category with the largest probability is taken as the category of the item.
Let $X$ be an unlabeled data sample, and let $H$ be the hypothesis that the data sample $X$ belongs to a particular class $C$. In the classification problem, we want to obtain $P(H \mid X)$, i.e., the probability that the hypothesis $H$ holds given the observed data sample $X$, and the classification is completed by comparing the maximum probability [10]. Bayes' theorem is formulated as follows:

$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)},$$

where $P(H \mid X)$ is the posterior probability, which denotes the probability that $H$ holds given that $X$ is observed; $P(H)$ is the prior probability of hypothesis $H$; $P(X)$ is the prior probability of $X$ and is independent of $H$; and $P(X \mid H)$ is the probability of observing $X$ given that $H$ holds.
The process of Naive Bayes is as follows.
Each data sample is represented by an $n$-dimensional feature vector $X = (x_1, x_2, \ldots, x_n)$, which describes the sample's values on the $n$ attributes $A_1, A_2, \ldots, A_n$. Assume there are $m$ classes $C_1, C_2, \ldots, C_m$. Given an unknown data sample $X$, the classification method predicts that $X$ belongs to the class with the highest posterior probability; that is, the unknown data sample $X$ is assigned to class $C_i$ if and only if $P(C_i \mid X) > P(C_j \mid X)$ for all $1 \le j \le m$, $j \neq i$.
The class $C_i$ for which $P(C_i \mid X)$ is the largest is called the maximum a posteriori hypothesis, and, according to Bayes' theorem,

$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}.$$

Since $P(X)$ is constant for all classes, it is only necessary to maximize $P(X \mid C_i)\,P(C_i)$. After calculating $P(X \mid C_i)\,P(C_i)$ for each class, the sample $X$ is assigned to the class for which this product is the largest.
For the calculation of the probability estimates, the Laplace smoothing estimate can be used to improve accuracy, with the following equation:

$$\hat{P}(c_j) = \frac{1 + \sum_{i=1}^{|D|} y_{ij}}{|C| + |D|},$$

where $D$ is the training text set and $y_{ij} \in \{0, 1\}$ indicates whether the training text $d_i$ belongs to class $c_j$: "1" means it belongs, and "0" means it does not.
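A minimal sketch of naive Bayes text classification with Laplace smoothing is given below, assuming scikit-learn; the toy documents and labels are illustrative, not the paper's data. In scikit-learn, `alpha=1.0` applies add-one (Laplace) smoothing to the word-likelihood estimates, in the same spirit as the smoothed estimate above.

```python
# Sketch: naive Bayes text classification with Laplace (add-one) smoothing.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["room clean staff friendly", "dirty room rude staff",
        "great location friendly staff", "noisy dirty location"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (hypothetical labels)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # bag-of-words counts

# alpha=1.0 adds one pseudo-count to every word/class pair before
# estimating the likelihoods, i.e., Laplace smoothing.
clf = MultinomialNB(alpha=1.0)
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["friendly staff clean room"])))
```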
2.1.2. Classifier of Linear Support Vector Machine
The support vector machine (SVM) algorithm is considered one of the more effective methods for text classification; it is a machine learning method based on statistical learning theory [11]. Rather than requiring a very large number of samples, it needs only a certain amount of text abstracted into vectorized training data, which improves classification accuracy, and it is widely used in statistical classification and regression analysis. SVM maps the vectors into a higher-dimensional space in which a maximum-margin hyperplane is established [12]. Two hyperplanes parallel to each other are built on either side of the hyperplane separating the data, and the separating hyperplane maximizes the distance between the two parallel hyperplanes. The assumption is that the greater the distance or gap between the parallel hyperplanes, the smaller the total error of the classifier.
Definition 1. For a given linearly separable data set, a separating hyperplane $w^{\top}x + b = 0$ and the corresponding classification decision function $f(x) = \operatorname{sign}(w^{\top}x + b)$ are learned by maximizing the margin, or equivalently by solving the corresponding convex quadratic programming problem. This is called the linearly separable SVM.
Definition 2 (geometric margin). For a given training sample data set $T$ and hyperplane $(w, b)$, the geometric margin of the hyperplane with respect to a sample point $(x_i, y_i)$ is $\gamma_i = y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{b}{\|w\|} \right)$. The geometric margin of the hyperplane with respect to the data set $T$ is the minimum of the geometric margins over all sample points: $\gamma = \min_{i} \gamma_i$.
SVM has few parameters, the most important being the kernel function. When there are relatively many text features, a linear kernel function is sufficient. An SVM using the linear kernel function is called a linear SVM; this is also the kernel function used in this paper.

SVM is a classification algorithm grounded in statistical learning theory. The algorithm uses sample features to find a balance between model complexity and learning ability and is particularly effective for small-sample, nonlinear, and high-dimensional problems.
An optimal classification plane exists for a sample set $\{(x_i, y_i)\}_{i=1}^{n}$ with $y_i \in \{-1, +1\}$. The original parameter space is transformed to a high-dimensional space using a nonlinear function, and the separating hyperplane is established as

$$w^{\top}x + b = 0,$$

where $w$ denotes the normal vector of the plane and $b$ denotes the intercept.

The discriminant function is $f(x) = \operatorname{sign}(w^{\top}x + b)$. After normalization, all samples satisfy $y_i(w^{\top}x_i + b) \ge 1$. With this simplification, the classification margin equals $2/\|w\|$, so maximizing the margin is equivalent to minimizing $\|w\|^2/2$.
To make the classification model work correctly for all samples, the following constraints must be satisfied [13]:

$$y_i(w^{\top}x_i + b) \ge 1, \quad i = 1, \ldots, n.$$
For each inequality constraint, introduce a Lagrange multiplier $\alpha_i \ge 0$ and construct the Lagrange function:

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i(w^{\top}x_i + b) - 1 \right].$$
By eliminating $w$ and $b$, the original constrained optimization problem can be equivalently converted into a minimax dual problem.
When the data are linearly inseparable, they need to be transformed to a higher-dimensional space to become linearly separable. In this case, the slack variable method is used: a slack variable $\xi_i \ge 0$ is introduced for each sample so that the constraint becomes

$$y_i(w^{\top}x_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, n.$$

The objective function then becomes

$$\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i.$$
The larger the penalty factor $C$, the fewer the misclassified points, but $C$ should not be too large, to avoid overfitting. To avoid the curse of dimensionality, a kernel function is introduced, which implements the mapping from the low-dimensional space to the high-dimensional space.
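A minimal soft-margin linear SVM sketch follows, assuming scikit-learn; the toy reviews and labels are illustrative. The parameter `C` is the penalty factor discussed above: a larger `C` tolerates fewer misclassified points but risks overfitting.

```python
# Sketch: soft-margin linear SVM for text classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["breakfast was excellent", "terrible service at check-in",
        "excellent pool and friendly service", "terrible breakfast and rude staff"]
labels = [1, 0, 1, 0]  # hypothetical positive/negative labels

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

svm = LinearSVC(C=1.0)   # linear kernel; suitable when features are many
svm.fit(X, labels)
print(svm.predict(vec.transform(["excellent breakfast"])))
```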
2.2. Mutual Information and Left-Right Entropy Theory
2.2.1. Information Entropy
Information entropy was originally proposed by Shannon and can be used to describe the uncertainty of an event. Usually, an event with a lower probability of occurrence carries more information [14]. For example, "Bill Gates goes bankrupt" has a much lower probability of occurring than "Bill Gates becomes the richest man," so the former carries more information [15]. The formula of information entropy can be expressed as follows:

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i),$$

where $p(x_i)$ is the probability of the $i$-th outcome of the random variable $X$.
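A small helper implementing the entropy formula above (base-2 logarithm); the two example distributions are illustrative.

```python
# Sketch: Shannon entropy of a discrete probability distribution.
import math

def entropy(probs):
    """H(X) = -sum(p * log2(p)) over a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit: a maximally uncertain binary event
print(entropy([0.99, 0.01]))  # ~0.08 bits: a nearly certain event carries little uncertainty
```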
2.2.2. Information Gain
Information gain (IG) (Mitchell, 1997) represents the amount of information a feature provides about the document class, defined as the difference in information entropy before and after the feature appears in the text [16]. Assume that $C = \{c_1, \ldots, c_m\}$ is the set of text classes, $c$ is the text class variable, $d$ is the text, and $t$ is the feature. For feature $t$, its information gain is denoted $IG(t)$; the frequencies of documents in which $t$ does and does not occur are examined. It is used to measure the information gain of word $t$ for the categories. The calculation formula is as follows:

$$IG(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t}).$$
2.2.3. Mutual Information
Mutual information refers to the amount of information that one random variable contains about another random variable; it is also a measure of the association between two variables [17].
Mutual information (MI) is a commonly used information metric in information theory and is widely used in statistical language models. It measures the importance of a word $t$ to a category $c$ based on the occurrence of the word. The calculation formula is as follows:

$$MI(t, c) = \log \frac{P(t, c)}{P(t)\,P(c)} = \log \frac{P(t \mid c)}{P(t)},$$

where $P(t \mid c)$ denotes the conditional probability that feature $t$ appears in category $c$, $P(c)$ denotes the probability that a text of category $c$ appears in the document collection, and $P(t)$ denotes the probability that feature $t$ appears in the text collection.
From the formula, it can be seen that feature selection using mutual information prefers low-frequency feature words: if two feature words have the same conditional probability $P(t \mid c)$, the word with fewer total occurrences receives a higher mutual information value than the word with more occurrences.
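A sketch of the MI score estimated from document counts follows; the function and the count values are illustrative, not the paper's implementation. It also demonstrates the low-frequency bias just described.

```python
# Sketch: MI(t, c) = log(P(t|c) / P(t)), estimated from document counts.
import math

def mutual_information(n_t_in_c, n_c, n_t, n_total):
    """n_t_in_c: docs of class c containing t; n_c: docs of class c;
    n_t: docs containing t; n_total: all docs."""
    p_t_given_c = n_t_in_c / n_c
    p_t = n_t / n_total
    return math.log(p_t_given_c / p_t) if p_t_given_c > 0 else float("-inf")

# A rare word concentrated in one class scores higher than a frequent one,
# even with a much larger P(t|c) for the frequent word.
print(mutual_information(5, 100, 5, 1000))     # rare, class-specific: ~2.30
print(mutual_information(90, 100, 500, 1000))  # frequent, less specific: ~0.59
```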
3. Validity Analysis of Online Review Texts
Online reviews sometimes contain irrelevant content or text that is simply copied and slightly modified, which prevents tourists from obtaining valuable information from online reviews and also poses challenges to the operation of online platforms [18, 19]. Therefore, it is very important to analyze the validity of online review texts. Validity analysis can be regarded as a binary text classification problem, with two categories: valid and invalid. That is, sentences that are overly subjective, random, irrelevant, or suspected of copy-and-paste are classified into the "invalid" category, and the remaining review text belongs to the "valid" category [20]. The sample tested in this paper was obtained with web crawlers written in Python 3.6, which crawled the review texts of scenic spots and hotels on China's major travel websites; the online travel giant Trip was chosen as the main source of sample data.
The main steps about the validity analysis of online review texts are described below.
3.1. Data Preprocessing
To analyze the validity of the review content, it is necessary to preprocess the review content of scenic spots and hotels. The following two steps are included:
Step 1. Filter punctuation marks and stop words. Text information contains many useless symbols, and many words carry no practical meaning, such as certain function words and adverbs. Therefore, the review data must be filtered first. The processing steps include filtering punctuation marks and special symbols and removing stop words, useless adverbs, etc. Here, we use a custom stop-word dictionary to remove stop words more accurately and efficiently and to filter out, by word filtering, the large number of words such as adverbs that have no practical meaning.
Step 2. Word segmentation. Chinese word segmentation is not as simple as English tokenization: there are no explicit delimiters between words, and semantic and logical relationships often have to be taken into account. The quality of word segmentation directly affects information analysis and experimental results. Common word segmentation tools include Jieba, SnowNLP, and HanLP.
Jieba is the most widely used Chinese word segmentation tool. It has the following features: the precise mode segments sentences most accurately and is suitable for text analysis; the full mode scans all character sequences in the sentence that can form words and is very fast. SnowNLP is a Python-based library; its functions are relatively simple, and it is easy to use. Jieba is more suitable for text analysis than SnowNLP. Therefore, this paper uses Jieba for Chinese word segmentation and part-of-speech tagging, as sketched below.
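A minimal preprocessing sketch with Jieba follows; the stop-word file path and the sample review are assumptions for illustration.

```python
# Sketch: filter symbols and stop words, then segment with Jieba (precise mode).
import re
import jieba

with open("stopwords.txt", encoding="utf-8") as f:   # hypothetical custom stop-word file
    stopwords = set(line.strip() for line in f)

def preprocess(review):
    review = re.sub(r"[^\w\u4e00-\u9fa5]+", " ", review)  # drop punctuation/special symbols
    tokens = jieba.lcut(review)                            # precise-mode segmentation
    return [t for t in tokens if t.strip() and t not in stopwords]

print(preprocess("酒店位置很好，服务也不错！"))
```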
3.2. Manual Annotation Validity Information
To build a machine learning model to analyze the validity of reviews, the model must first be trained. We therefore manually annotated 1,000 reviews of scenic spots and hotels in advance. If a review contains useful information about a hotel or scenic spot, it is marked as valid ("1"); irrelevant or useless reviews are marked as invalid ("0"). The models are trained and tested on these manually annotated data.
Nearly 60,000 reviews of scenic spots and more than 25,000 reviews of hotels were collected through crawler technology. Taking the scenic spots as an example, we extract 1,000 reviews for manual annotation; that is, a column named "valid" is added to the original data file, set to 1 when the review is valid and 0 when invalid.
3.3. Extract Text Features
For the segmented review content, text features are computed numerically. This paper uses document frequency to compute the text feature values. Document frequency (DF) is an efficient feature selection method that counts how many texts in the entire dataset contain a given word. The document frequency of each feature in the training text set is counted, and features with particularly low or particularly high document frequency are removed according to preset thresholds. DF is the simplest feature selection method; it has low computational complexity, can handle large-scale classification tasks, and is a common method for feature dimensionality reduction.
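A sketch of DF thresholding follows, assuming scikit-learn; the toy pre-segmented reviews and the threshold values are illustrative. `min_df` and `max_df` drop features whose document frequency is particularly low or particularly high, as described above.

```python
# Sketch: document-frequency feature selection on pre-segmented (space-joined) reviews.
from sklearn.feature_extraction.text import CountVectorizer

segmented_reviews = ["早餐 丰富 服务 周到", "早餐 一般 位置 偏僻",
                     "位置 方便 服务 周到", "房间 干净 服务 周到"]  # toy segmented texts

vectorizer = CountVectorizer(min_df=2,    # keep words appearing in >= 2 documents
                             max_df=0.9)  # drop words appearing in > 90% of documents
X = vectorizer.fit_transform(segmented_reviews)
print(vectorizer.get_feature_names_out())  # surviving features
```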
3.4. Construction of a Classification Model for the Validity of Online Review Texts Based on Naive Bayes Classifier
Classification based on the naive Bayes algorithm proceeds in two stages. The first is the training stage, in which a classifier is built from a set of known instances, called the training instance set; each instance in it serves as a training instance. Since the class labels of the training instances are known, constructing the classifier is a supervised learning process.
The second stage is the test stage, in which the constructed classifier is used to classify unknown instances. The classifier generally needs to be evaluated before being used for prediction; only classifiers reaching the required classification accuracy can be used to classify test instances.
The classifier used here is the naive Bayes classifier. Its main characteristics include incremental learning: prior knowledge, together with the observed instances, determines the final probability of a hypothesis, and hypotheses are allowed to make uncertain predictions; the classification of a new instance can be predicted by multiple hypotheses jointly, weighted by their probabilities. Based on these characteristics, this paper constructs a classification model for the validity of online review texts using the naive Bayes classifier. Specifically, the naive Bayes method is used for supervised learning, trained on the 1,000 manually annotated scenic spot reviews; the trained model is then used to annotate all the online review texts. For the annotated review table of hotels or scenic spots, a column "validity" is added on the right and marked as "valid" or "invalid" according to the output of the trained model. Finally, the validity analysis table of tourism online review texts is obtained, as shown in Table 1. A minimal sketch of this workflow is given below.
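The following end-to-end sketch is hedged: the file names and the column names ("comment", "valid") are assumptions, and the comments are assumed to be pre-segmented with Jieba (space-separated) so the vectorizer can tokenize them.

```python
# Sketch: train naive Bayes on the 1,000 annotated reviews, then label the rest.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv("scenic_reviews.csv")     # hypothetical crawled data
labeled = df[df["valid"].notna()]          # the 1,000 manually annotated rows

vec = CountVectorizer()                    # comments assumed pre-segmented, space-joined
X_train = vec.fit_transform(labeled["comment"])
clf = MultinomialNB().fit(X_train, labeled["valid"].astype(int))

pred = clf.predict(vec.transform(df["comment"]))
df["validity"] = ["valid" if y == 1 else "invalid" for y in pred]
df.to_csv("scenic_reviews_validity.csv", index=False)  # the Table-1-style output
```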
4. Comprehensive Evaluation of Scenic Spots and Hotels Based on Text Classification Technology and Sentiment Analysis
After removing the invalid text, the effective online review texts of scenic spots and hotels are obtained. We use text classification techniques to classify the online review texts into appropriate categories. A comprehensive evaluation is carried out on the five aspects of scenic spots and hotels that tourists focus on, service, location, facilities, hygiene, and cost, and an evaluation model is constructed based on the mean squared error (MSE) combined with sentiment analysis. The mean squared error measures the degree of difference between an estimator and the estimated quantity. Let $\hat{\theta}$ be an estimator of the population parameter $\theta$ determined from a subsample. The mathematical expectation $E[(\hat{\theta} - \theta)^2]$ is called the mean squared error of the estimator $\hat{\theta}$:

$$\mathrm{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = D(\hat{\theta}) + [B(\hat{\theta})]^2,$$

where $D(\hat{\theta})$ and $B(\hat{\theta})$ are the variance and bias of $\hat{\theta}$, respectively.
Therefore, this paper uses the mean squared error to build the evaluation model according to the above data analysis. The specific steps are as follows.

(1) This paper uses web crawler technology to crawl corpora on the five aspects of service, location, facilities, hygiene, and cost from major tourism websites and performs partial manual annotation and model training.
First, we collected a text classification corpus for training, consisting of text files containing about 5,000 hotel online reviews. These were manually annotated one by one according to the five aspects of service, location, facilities, hygiene, and cost; more than 1,500 online comment texts were labeled and assigned to the above five categories.

(2) The review texts of hotels or scenic spots are split into single sentences, and the texts are then classified according to service, location, facilities, hygiene, and cost.
According to the data characteristics of online review texts in the tourism field, two suitable text classification methods are selected for comparative testing. The best text classification method is selected according to the processing speed and classification evaluation indicators.
Two text classification methods are used here: one is the naive Bayes method, and the other is the linear SVM. The sample data are tested with each, and the experimental results show that the linear SVM method is faster in terms of data processing speed, as shown in Figure 1.

In this paper, the text classification model is evaluated with standard indicators. The evaluation of a classification algorithm is often based on the confusion matrix, as shown in Table 2. The four counts in the confusion matrix are extended by calculation to obtain four secondary metrics: accuracy, precision, recall, and F1-score, which are the core metrics for evaluating a classification model [21].
The accuracy of a classification model represents the ratio of samples correctly predicted by the model to all samples; in general, the higher the accuracy, the better the classifier. This is shown in the following equation:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
The precision of a classification model is defined as the proportion of true positive samples among all samples predicted to be positive:

$$\text{Precision} = \frac{TP}{TP + FP}.$$
The recall of a classification model is defined as the proportion of true positive samples that are correctly predicted:

$$\text{Recall} = \frac{TP}{TP + FN}.$$
The F1-score is the harmonic mean of precision and recall:

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$

It combines the results of precision and recall and is closer to the smaller of the two, so for a given sum of precision and recall, F1 is largest when they are equal. A higher F1 value indicates better model prediction. A small sketch computing these four metrics is given below.
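The following sketch computes the four metrics directly from confusion-matrix counts, matching the equations above; the counts themselves are illustrative.

```python
# Sketch: the four evaluation metrics from confusion-matrix counts.
TP, FP, FN, TN = 85, 10, 15, 90   # illustrative counts

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} F1={f1:.3f}")
```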
Here, the test is carried out on the sample data. First, the Naive Bayes method is used. According to the above four classification indicators, the test results in five aspects of service, location, facilities, hygiene, and cost are shown in Figure 2.

To compare classification performance, this paper also tests the linear support vector machine on the five aspects (service, location, facilities, hygiene, cost) using the four classification indicators. The test results are shown in Figure 3. The experimental results show that, for the online review texts tested in this paper and across the five aspects of tourists' concerns, the linear support vector machine outperforms the naive Bayes method on all four classification indicators. The linear support vector machine is therefore selected as the classification model for online review texts.

(3) A single evaluation is made on the five aspects of each hotel or scenic spot (service, location, facilities, hygiene, cost); combined with text sentiment analysis, this paper adopts a 5-point system for comprehensive scoring.

On the basis of text classification, all online review texts of a hotel can be classified into the above five categories (service, location, facilities, hygiene, cost). After classification, the resulting confusion matrix with specific numerical values is shown in Figure 4.

For example, in the "service" category, each online review under this category is scored according to objective criteria, using sentiment analysis from natural language processing. Sentiment analysis is a classification technique based on natural language processing whose main purpose is to determine whether a review is positive or negative. This paper therefore uses sentiment scores to quantify online review texts. Sentiment analysis generally maps the sentiment of a text to a value in (0, 1): 0.5 represents neutrality, a score closer to 0 represents negative emotion, and a score closer to 1 represents positive emotion. Since scoring is on a 5-point system, the sentiment value is multiplied by 5 to obtain a score in (0, 5). After all of a hotel's online review texts under the "service" category are scored on this 5-point system and averaged, the hotel's "service" score is obtained; a minimal sketch follows.
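The sketch below uses SnowNLP's `sentiments` score, which lies in (0, 1); the sample "service" reviews are illustrative.

```python
# Sketch: 5-point sentiment scoring of one category ("service") for one hotel.
from snownlp import SnowNLP

service_reviews = ["服务态度非常好", "服务太差了，再也不来了"]  # toy "service" texts

scores = [SnowNLP(r).sentiments * 5 for r in service_reviews]  # (0,1) -> (0,5)
service_score = sum(scores) / len(scores)   # average = the hotel's "service" score
print(round(service_score, 1))
```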
In the same way, the other four aspects of the hotel can be scored, and the same method can be used to score the five aspects of a scenic spot.

(4) The evaluations of all hotels or scenic spots are normalized and comprehensively evaluated, and a score between 1 and 5 is obtained. The experimental results are shown in Figure 5.

The experimental results in Figure 5 show that Hotel01's scores on the five aspects are 4.5, 4.6, 4.3, 4.5, and 4.7, with a comprehensive score of 4.5; Hotel03's scores are 4.3, 4.6, 4.1, 4.5, and 4.4, with a comprehensive score of 4.4; and Hotel05's scores are 4.4, 4.6, 3.9, 4.5, and 4.5, with a comprehensive score of 4.4. Hotel02 and Hotel04 share the same comprehensive score of 4.2. Using the sentiment analysis method of natural language processing to comprehensively score the five services of each hotel, Hotel01 is selected as the hotel with the highest score; this can provide a decision-making basis for hotels and scenic spots to improve service quality.
5. Mining and Analysis of Tourism Hot Words Based on Natural Language Processing
Hot words are the most direct way to reflect tourists' impressions of scenic spots and hotels. Generally speaking, whether positive or negative, tourists' comments on scenic spots and hotels always contain some representative words. These are often also high-frequency words, obtained by statistical methods after excluding irrelevant stop words. Hot words need to be obtained through natural language processing; this paper obtains the hot words in online review texts based on the named entity recognition method.
Named entity recognition is a common basic task in natural language processing, and it is also an important component of tasks such as information extraction, question answering, syntactic analysis, and machine translation. Common types of entities are person names, place names, institution names, times, dates, money, and so on. Different tools have some subtle differences in the distinction between entity types. This article is concerned with a few types of place names, such as hotels or scenic spots [22].
The idea for mining hot words from online review texts is as follows. First, according to the method above, the scenic spots and hotels are comprehensively evaluated and scored, thresholds are set to divide them into three levels (high, medium, and low), and the online comment texts of the top 10 of each level are taken for hot word mining. Named entity recognition is then applied to these comment texts, and the obtained named entities (nouns) are stored in a word segmentation dictionary. The comment texts are then segmented using the named entities as the word segmentation dictionary; after filtering by part of speech, high-frequency words are counted, and the hot words are finally obtained. Based on this idea, the flow chart of hot word acquisition is formed, as shown in Figure 6.

(1) Corpus Preprocessing. Here, the Python module pandas is used to read the corpus by scenic spot, and the natural language processing tool SnowNLP is used to segment overlong text into short sentences. Since the NLP tool used in the following steps (HanLP 2.x) cannot handle overlong text, these overlong sentences need to be segmented into short sentences during preprocessing to lay the foundation for subsequent processing.

(2) Named Entity Recognition. Using the NER function of the NLP tool, words that are named entities, such as scenic spots and facilities, are extracted from each sentence and written to the word segmentation dictionary.

After comprehensive testing and comparison, the NER function provided in HanLP 2.x is adopted here, because this version uses deep learning models pretrained on large-scale Chinese corpora and provides three mature models: ner/msra, ner/pku, and ner/ontonotes. The test results here are relatively good, and all the scenic spots can be found. The named entities obtained by the NER models are merged and deduplicated to form the actual NER results.

(3) Word Segmentation. Using the NER results as a word segmentation dictionary, Jieba is used to segment all the review corpus of a scenic spot or hotel. To ensure the effect, only nouns are retained in the segmentation results.

(4) Statistics of Word Frequency. For the list of segmentation results, the frequency of every word is counted; the 20 words with the highest frequency are found and written to a file in the required format. A minimal sketch of steps (3) and (4) is given below.
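The sketch below covers steps (3) and (4): segmenting reviews with Jieba using the NER-derived user dictionary, keeping nouns only, and counting the top-20 words. The file name "ner_dict.txt" (one entity per line, assumed output of step (2)) and the toy review sentences are assumptions.

```python
# Sketch: noun-only segmentation with an NER-derived user dictionary + top-20 counts.
from collections import Counter
import jieba
import jieba.posseg as pseg

jieba.load_userdict("ner_dict.txt")   # named entities from step (2), one per line

reviews = ["古镇的夜景很美", "夜景和小吃都值得一看"]   # toy review sentences

nouns = [word for review in reviews
         for word, flag in pseg.cut(review)
         if flag.startswith("n") and len(word) > 1]   # retain multi-character nouns only

for word, freq in Counter(nouns).most_common(20):     # 20 highest-frequency words
    print(word, freq)
```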
The hot words generated for the 16 scenic spots A01-A16 are listed in Table 3. The experimental results in Table 3 show that hot words can effectively display tourists' intuitive impressions of the scenic spots.
6. Analysis of Characteristic Services of Scenic Spots and Hotels Based on New Word Discovery
In order to attract tourists and enhance their competitive advantage, scenic spots need to identify their own characteristics. This paper mines their respective characteristics and highlights from the online review texts of scenic spots and hotels. Combined with the comprehensive evaluation results, after an appropriate threshold is determined, the scenic spots and hotels are divided into three levels: high, medium, and low. For the top three of each level, the new word discovery method is applied to find the characteristics of the scenic spots and hotels.
New word discovery is also called “new word recognition” or “unregistered word recognition.” The new words here are words that are newly generated with the development of the times, for example, “cloud era,” “big data,” letter words “yyds,” and “xdm.” Alternatively, words that do not exist in the dictionary may also be called new words or unregistered words [23].
Whether a word is "new" can be judged from three aspects:

(1) Degree of solidification refers to how tightly the characters within a candidate word bind together. For example, words such as "research" and "water cup" have a relatively high degree of solidification, while words such as "Haier" and "Gree" have a relatively low degree of solidification.

(2) Degree of freedom refers to how freely a candidate word can be used in different contexts. For example, "Jujue" and "Jejue" have the same degree of solidification, but the degree of freedom of "Jejue" is far less than that of "Jujue."

(3) The IDF (inverse document frequency) of new words. If a word appears many times in one article, it is likely a new word; but if the word also appears many times across the entire text corpus, it may be a common word rather than a new word.
There are generally two approaches to new word discovery: rule-based and statistics-based [24]. Rule-based new word discovery matches candidates against constructed templates, and the results have a relatively high accuracy rate. Statistics-based new word discovery identifies new words by counting word frequencies in the corpus; this method is more portable and flexible but requires a model to be trained. In this paper, an algorithm based on mutual information and left-right entropy is used to discover new words. The specific steps are as follows.

(1) Stop word processing.
The text to be processed often contains many meaningless words; these are replaced or removed as stop words to obtain cleaner text.

(2) Use three thresholds to judge new words:

(i) Minimum Mutual Information. The greater the mutual information, the higher the correlation between the characters of a candidate. The text is segmented into n-grams, and the mutual information of each candidate is calculated; if it is below the threshold, the candidate cannot form a word.

(ii) Minimum Entropy. The larger the entropy, the more varied the neighboring words. The minimum of the left entropy and the right entropy is calculated; if this minimum is below the threshold, the candidate cannot form a word.

(iii) Minimum Number of Occurrences. If a candidate occurs fewer times than the set minimum, it is filtered out.

(3) Mining of characteristics in scenic spots and hotels.
In this paper, the algorithm based on mutual information and left-right entropy is used to discover new words. The algorithm also integrates the Python module SmoothNLP, whose implementation differs from the mutual information and left-right entropy approach: for the same text, the new words discovered by the two methods are not identical, so both methods are run and the vocabulary common to both is taken. In addition, the tooling provides an option "whether to extract only words that are not in the dictionary," which expands the search scope to find more unregistered new words; this mode is also incorporated into the new word discovery method. These three methods are combined, with tuned parameters, to discover new words in the online review texts. This new word discovery method considers the breadth and novelty of new words as well as their acceptability and can reflect the characteristics of scenic spots and hotels (a self-contained sketch of the core statistics is given below). Finally, the method is tested on the online review texts with high comprehensive evaluations. Table 4 shows the characteristics of 13 scenic spots obtained by the new word discovery algorithm, and Table 5 shows the characteristics of 13 hotels.
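The following is a self-contained sketch of the mutual-information and left-right-entropy statistics, not the paper's exact implementation: it restricts candidates to two-character grams for brevity, and the three threshold defaults are illustrative.

```python
# Sketch: new word candidates via pointwise mutual information + left-right entropy.
import math
from collections import Counter, defaultdict

def entropy(counter):
    """Entropy of a neighbor-character distribution; 0.0 if no neighbors were seen."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counter.values())

def discover_new_words(text, min_pmi=1.0, min_entropy=0.5, min_count=2):
    """Return two-character candidates passing all three thresholds."""
    unigrams = Counter(text)
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    left, right = defaultdict(Counter), defaultdict(Counter)
    for i in range(1, len(text) - 2):
        w = text[i:i + 2]
        left[w][text[i - 1]] += 1    # character to the left of the candidate
        right[w][text[i + 2]] += 1   # character to the right of the candidate
    n = len(text)
    results = []
    for w, c in bigrams.items():
        if c < min_count:
            continue                 # threshold (iii): minimum number of occurrences
        pmi = math.log((c / n) / ((unigrams[w[0]] / n) * (unigrams[w[1]] / n)))
        if pmi < min_pmi:
            continue                 # threshold (i): minimum mutual information
        if min(entropy(left[w]), entropy(right[w])) < min_entropy:
            continue                 # threshold (ii): minimum left-right entropy
        results.append(w)
    return results
```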
7. Conclusion
At present, providing directions for the development of tourism by mining online review texts is a research hotspot in the field of tourism management. This paper first analyzes the research status and finds that the current application of online review texts is not systematic and that a comprehensive scoring mechanism for the service quality of scenic spots and hotels is lacking. This paper then proposes a novel approach to mining and applying tourism online review texts based on natural language processing and text classification technology, with a series of new methods.
The first step is to remove the invalid online review texts and keep the valid ones. The research focus of this paper is a comprehensive evaluation of scenic spots and hotels based on text classification technology and sentiment analysis; this evaluation system establishes indicators for comprehensive scoring and selects the top-ranked scenic spots and hotels. For the online review texts of high-quality scenic spots, natural language processing is then used to mine hot words, which intuitively reflect tourists' impressions of the scenic spots. New words are then discovered using mutual information and left-right entropy, and the service characteristics of high-quality scenic spots and hotels are mined from these new words. Finally, the above methods are tested on massive tourism online review texts. The experimental results show that the new comprehensive evaluation method proposed in this paper can effectively select high-quality scenic spots and hotels, and hot words and new words can be efficiently mined from the relevant online review texts. These methods can provide feedback on tourism impressions from various aspects and levels and provide an important basis for the development of tourism. Due to the limitations of the selected online review texts, the comprehensive evaluation method in this paper has certain regional characteristics and is not suitable for all tourist locations. Future work has two aspects. The first is to expand the crawling scope of online review texts as much as possible and establish more comprehensive evaluation indicators. The second is to consider how to use the discovered hot words and new words to provide personalized intelligent services for tourists and tourism practitioners.
Data Availability
No data were used to support this study.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.
Acknowledgments
This work was supported by the National Natural Science Funds of China (61272015) and the 2022 Henan Province Science and Technology Research Project: “Construction and Application of Intelligent Ontology in Internet of Things Based on Semantic Concept Model” (222102210316).