Abstract
This study examines the management of public big data through text analysis, using online travel notes about Mount Wutai as a case. The performance of a naive Bayesian classification model and a C4.5 decision tree classification model in classifying travel notes as positive, neutral, or negative is compared, and the relationship between tourism resources, tourism facilities, tourism services, the tourism environment, and tourists’ sentiment toward Mount Wutai is analyzed. For the naive Bayesian classifier, the true positive rate, true negative rate, and F-measure are 86.64%, 81.27%, and 84.62% for positive text; 82.05%, 78.89%, and 77.11% for neutral text; and 83.67%, 98.29%, and 82.83% for negative text. For the C4.5 decision tree classifier, the corresponding values are 91.44%, 86.57%, and 89.45% for positive text; 90.17%, 83.28%, and 84.06% for neutral text; and 91.84%, 99.05%, and 90.91% for negative text. By F-measure, both classifiers therefore perform better on positive and negative texts than on neutral texts. The ROC curves show that both classifiers perform better on positive text than on neutral and negative text and that the C4.5 decision tree classifier outperforms the Bayesian classifier.
In positive online travel notes, the lift (promotion degree) of tourism resources and tourism facilities is markedly higher, indicating a strong association between these factors and positive travel notes. In negative online travel notes, the lift of tourism services and the tourism environment is high, indicating a strong association between these factors and negative travel notes. In summary, improving the quality of tourism services and the tourism environment at the Mount Wutai scenic area can raise tourists’ recognition of and satisfaction with Mount Wutai tourism.
1. Introduction
With the development of Internet technology, network public text data play an important role in people’s search for and mastery of information. The management and classification integration of public text data can effectively improve data utilization [1, 2]. Tourism plays an important role in the development of a city. Understanding the status quo and problems of regional tourism is of positive value to guiding the development of the local economy [3, 4]. Tourists’ emotional attitudes and recommendation indexes have an important influence on the development of tourist attractions [5, 6]. Exploring and analyzing tourists’ recognition of scenic spots, which is of reference value to guide the improvement and perfection of tourist attractions, is a research hotspot [7, 8].
Tourists express their evaluations of and emotional attitudes toward scenic spots both online and offline, and online channels are convenient and widely accessible [9]. The online travel notes posted on travel apps are an important means of understanding tourists’ emotional attitudes toward scenic spots. Widely used travel apps currently include Ctrip, Qunar, Mafengwo, Tuniu, and Tongcheng [10]. Mount Wutai, located in Xinzhou City, is a national 5A-level scenic spot and a Buddhist holy land with beautiful natural scenery, numerous temples, and a rich cultural landscape. It is cool and comfortable in summer, which makes it a good place for a summer vacation [11]. The Mount Wutai scenic spot receives many tourists every year, which greatly promotes local economic development and has great development potential. Understanding how tourists perceive and rate Mount Wutai is therefore of great value to its economic development. Assessing emotional attitudes toward scenic spots from online travel notes requires text processing, classification, data organization and management, and correlation analysis. Commonly used machine learning classification methods include naive Bayesian classification models, decision tree classification models, and support vector machine classification models [12–14]. Naive Bayesian classifiers have high classification accuracy and run quickly [15, 16]. C4.5 decision tree classification can effectively process data with missing values [17, 18].
In this study, 575 online travel journal texts from the Ctrip, Qunar, Mafengwo, Tuniu, and Tongcheng websites are selected to compare the classification and evaluation effects of the naive Bayesian classification model and decision tree classification model on Mount Wutai tourism sentiment and attitude. The relationship between tourism resources, tourism facilities, tourism services, the tourism environment, and Mount Wutai’s tourism sentiment and attitude is also analyzed to provide guidance for tourists and a reference for the development and improvement of Mount Wutai scenic spots.
2. Methods
2.1. Data Acquisition
The public data used in this study come mainly from online travel notes published on travel websites. Travel notes published on the Ctrip, Qunar, Mafengwo, Tuniu, and Tongcheng websites are screened. On each website, travel notes are searched with the keyword “Mount Wutai,” and the Octopus data collection software is used to crawl the Mount Wutai text data from each site in turn. After screening, a total of 575 Mount Wutai travel notes published from January 2016 to December 2018 are selected, with a total of 412,528 words. The number of texts from each website is shown in Table 1.
2.2. Data Feature Extraction and Typing
Following similar literature and consultation with relevant analysts, the public text data on Mount Wutai tourism are divided into three types: positive, neutral, and negative. Online travel notes contain characteristic descriptive words and short phrases about the Mount Wutai travel experience. For example, positive texts include keywords expressing a positive evaluation, such as “satisfactory,” “beautiful scenery,” and “good value for money.” Negative texts include words expressing a negative evaluation, such as “inconvenient transportation,” “expensive tickets,” and “poor accommodation.” Neutral texts mainly include words such as “check in” and “handle.” These words are set as keywords and extracted from the input dataset for text classification (Table 2).
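The keyword extraction step can be illustrated with a minimal sketch. The lexicon below holds only a subset of the example keywords quoted above; the full keyword list is in Table 2, and the names `KEYWORDS` and `keyword_features` are hypothetical. This is not the study's actual extraction procedure, only one simple way to turn a travel note into keyword-presence features.

```python
# Hypothetical lexicon: a subset of the example keywords quoted in the text.
KEYWORDS = {
    "positive": ["satisfactory", "beautiful scenery"],
    "negative": ["inconvenient transportation", "expensive tickets", "poor accommodation"],
    "neutral": ["check in"],
}

def keyword_features(text):
    """Map each keyword to 1 if it occurs in the text (case-insensitive), else 0."""
    lowered = text.lower()
    return {kw: int(kw in lowered) for kws in KEYWORDS.values() for kw in kws}

# Made-up travel note used only to exercise the function.
note = "Beautiful scenery, but expensive tickets and inconvenient transportation."
features = keyword_features(note)
```

The resulting 0/1 vector is the kind of term-level attribute set that a classifier can take as input.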
2.3. Data Classification
2.3.1. Naive Bayesian Classification Algorithm
The naive Bayesian classifier supports incremental learning and has high classification accuracy. It assumes that terms are independent of each other and that each sample X in the text set is represented by a set of attribute values (a1, a2, a3, …, an), where ak is the value of term Ak. W is the classification variable, and w is the value of W. Suppose that there are two classes, namely, + (positive class) and − (negative class). According to the Bayesian rule, the probability that sample X belongs to class w, p(w|X), is given by (1). X is classified as W = + if and only if fb(X) = p(W = +|X)/p(W = −|X) ≥ 1, where fb(X) is called the Bayesian classifier, as shown in (2).
If the terms are independent of each other given the value of the class variable, the probability p(X|w) can be written as the product of the per-term probabilities, as shown in (3). The naive Bayesian classifier, fnb(X), is then obtained, as shown in (4). The probabilities p(a1|w), p(a2|w), p(a3|w), …, and p(an|w) can be estimated from the training samples. The posterior probability of each class is calculated separately, and the class with the highest posterior probability is the predicted class.
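The four expressions referred to in this subsection follow the standard naive Bayes derivation. A reconstruction in the notation used here is sketched below; the typography is an assumption, since only the surrounding definitions are available.

```latex
\begin{align}
p(w \mid X) &= \frac{p(X \mid w)\, p(w)}{p(X)} \tag{1} \\
f_b(X) &= \frac{p(W = + \mid X)}{p(W = - \mid X)} \geq 1 \tag{2} \\
p(X \mid w) &= p(a_1, a_2, \ldots, a_n \mid w) = \prod_{k=1}^{n} p(a_k \mid w) \tag{3} \\
f_{nb}(X) &= \frac{p(W = +)}{p(W = -)} \prod_{k=1}^{n} \frac{p(a_k \mid W = +)}{p(a_k \mid W = -)} \tag{4}
\end{align}
```

With three sentiment classes, the same factorization in (3) is applied to each class and the class with the largest posterior is chosen.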
2.3.2. Decision Tree Classification Algorithm
C4.5 decision tree classification is a common inductive reasoning method that handles both continuous and discrete attributes well. The uncertainty of feature vector A is measured by its entropy, H(A), as shown in (5). The conditional entropy of feature vector A given class B, H(A|B), as shown in (6), represents the uncertainty that remains in A once B is known. Here pk is the probability that A takes the value ak, that is, pk = P(A = ak) = p(ak) with k = 1, 2, …, n, and p(ai|bj) is the conditional probability of ai given bj. The difference between H(A) and H(A|B) is the information gain; the larger the information gain, the stronger the classification ability of the feature.
In C4.5, the information gain rate (gain ratio) is used to select feature test points. It relates the information gain between feature A and class B to the entropy of the feature, as shown in (7).
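Equations (5) and (6) and the gain-rate equation can be reconstructed from the definitions in this subsection. The forms below are the standard ones written in the notation used here; numbering the last equation (7) and normalizing by H(A), as in the usual C4.5 gain ratio, are assumptions.

```latex
\begin{align}
H(A) &= -\sum_{k=1}^{n} p_k \log_2 p_k \tag{5} \\
H(A \mid B) &= -\sum_{j} p(b_j) \sum_{i} p(a_i \mid b_j) \log_2 p(a_i \mid b_j) \tag{6} \\
\mathrm{GainRate}(A, B) &= \frac{H(A) - H(A \mid B)}{H(A)} \tag{7}
\end{align}
```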
The C4.5 decision tree algorithm uses the information gain rate as the feature selection criterion and sets a threshold αe as the stop condition. The feature with the maximum information gain rate is placed at the root node. If the information gain rate of the feature is less than the threshold, the node becomes a single-node tree (a leaf), and its class is the one with the most samples among those that satisfy the path condition from the root node to the current node. If the information gain rate of the feature is greater than the threshold, a branch is generated for each value of the feature, and the training samples are assigned to the corresponding branches. A branch ends if it contains no samples or if all of its samples belong to the same category; otherwise, the procedure is repeated at the branch node until all features have been traversed.
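The feature selection criterion can be sketched in a few lines. The code below uses the standard C4.5 gain ratio (information gain divided by the entropy of the feature's value distribution), which may differ in detail from the equation used in the study. The gain is computed as H(labels) minus H(labels given feature), which equals H(A) minus H(A|B) because mutual information is symmetric. Function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def conditional_entropy(labels, feature):
    """Entropy of the labels that remains after splitting on the feature."""
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    n = len(labels)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def gain_ratio(labels, feature):
    """Information gain divided by the split information of the feature."""
    gain = entropy(labels) - conditional_entropy(labels, feature)
    split_info = entropy(feature)
    return gain / split_info if split_info > 0 else 0.0

# Toy data: does the note contain a given keyword, and what is its sentiment?
labels = ["positive", "positive", "negative", "negative"]
informative = ["yes", "yes", "no", "no"]      # separates the classes perfectly
uninformative = ["yes", "no", "yes", "no"]    # tells nothing about the class

best = gain_ratio(labels, informative)
worst = gain_ratio(labels, uninformative)
```

A tree builder would compute this ratio for every candidate feature, split on the largest one, and stop when the best ratio falls below the threshold αe.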
2.4. Data Organization and Management
Organizing and managing data makes the data easier to use and search. The collected online travel note texts are organized by time information, sentiment and attitude information, and thematic information. Time information includes the year, month, and day; sentiment and attitude information distinguishes positive, neutral, and negative text. According to the factors that affect tourism evaluation, thematic information is divided into four categories: tourism resources, tourism facilities, tourism services, and the tourism environment. The data organization and management framework is shown in Figure 1.
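One way to picture this organization is as a record with a date, a sentiment label, and a set of themes. The sketch below is illustrative only: the class name `TravelNote`, its fields, and the helper `count_by` are hypothetical, and the sample records are made up rather than taken from the dataset.

```python
from dataclasses import dataclass, field
from datetime import date

SENTIMENTS = ("positive", "neutral", "negative")
THEMES = ("tourism resources", "tourism facilities",
          "tourism services", "tourism environment")

@dataclass
class TravelNote:
    published: date                           # time information (year, month, day)
    sentiment: str                            # one of SENTIMENTS
    themes: set = field(default_factory=set)  # subset of THEMES

    def __post_init__(self):
        if self.sentiment not in SENTIMENTS:
            raise ValueError(f"unknown sentiment: {self.sentiment}")
        unknown = set(self.themes) - set(THEMES)
        if unknown:
            raise ValueError(f"unknown themes: {unknown}")

def count_by(notes, key):
    """Count notes per value of key(note), e.g. per year or per sentiment."""
    counts = {}
    for note in notes:
        k = key(note)
        counts[k] = counts.get(k, 0) + 1
    return counts

# Made-up records for illustration.
notes = [
    TravelNote(date(2016, 7, 3), "positive", {"tourism resources", "tourism facilities"}),
    TravelNote(date(2017, 8, 15), "negative", {"tourism services"}),
    TravelNote(date(2017, 10, 2), "positive", {"tourism resources"}),
]
by_year = count_by(notes, lambda n: n.published.year)
by_sentiment = count_by(notes, lambda n: n.sentiment)
```

Validating the labels at construction time keeps later counts by year, sentiment, or theme consistent with the three-part framework.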

2.5. Data Correlation Analysis
Correlation analysis is conducted on the factors that affect the sentiment category of online travel notes. The Apriori algorithm is used to calculate the support, confidence, lift (promotion degree), and conviction of each factor with respect to positive and negative online travel notes.
2.6. Observation Indexes
The included online travel notes are statistically classified. The number of Mount Wutai online travel notes on different websites in different years is counted, the number of Mount Wutai online travel notes with different emotional attitudes is counted, and the number of online travel notes with revisit intentions and recommendation intentions is counted.
The classification results of the naive Bayesian classifier and the C4.5 decision tree classifier for the emotional attitude of online travel note texts are evaluated for positive, neutral, and negative text. The evaluation indexes are the true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), false negative rate (FNR), precision, recall, and comprehensive evaluation index (F-measure). They are calculated as shown in equations (8)–(14), where TP, FN, TN, and FP denote the numbers of true positives, false negatives, true negatives, and false positives, respectively.
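Equations (8)–(14) are the usual one-vs-rest confusion-matrix ratios. A minimal sketch follows; the function name `binary_metrics` is hypothetical. The example counts are not given in the text: they are back-calculated, as an assumption, from the rates reported in Section 3.2 for positive text and the class sizes in Section 3.1 (292 positive notes out of 575).

```python
def binary_metrics(tp, fn, fp, tn):
    """One-vs-rest evaluation indexes computed from confusion-matrix counts."""
    tpr = tp / (tp + fn)            # true positive rate (equals recall)
    tnr = tn / (tn + fp)            # true negative rate
    fpr = fp / (fp + tn)            # false positive rate = 1 - TNR
    fnr = fn / (fn + tp)            # false negative rate = 1 - TPR
    precision = tp / (tp + fp)
    recall = tpr
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "TPR": tpr, "TNR": tnr, "FPR": fpr, "FNR": fnr,
        "precision": precision, "recall": recall, "F-measure": f_measure,
    }

# Assumed counts for "positive vs. rest" (back-calculated, not from the source).
m = binary_metrics(tp=253, fn=39, fp=53, tn=230)
percent = {name: round(100 * value, 2) for name, value in m.items()}
```

Under these assumed counts the function yields the rates reported for positive text with the Bayesian classifier, for example a TPR of 86.64% and an F-measure of 84.62%.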
Receiver operating characteristic (ROC) curves are drawn for the C4.5 decision tree and the Bayesian classifier on online travel notes with different emotional attitudes.
The organization and management results for the public data of travel notes on Mount Wutai tourism are collected, mainly the numbers of travel notes related to tourism resources, tourism facilities, tourism services, and the tourism environment.
The support, confidence, lift (promotion degree), and conviction between each factor and the positive and negative online travel note texts are calculated. Support is the proportion of all events that contain both X and Y, as shown in (15). Confidence is the proportion of events containing X that also contain Y, as shown in (16). Lift is the ratio of the confidence of the rule to the proportion of all events that contain Y, as shown in (17); a lift above 1 indicates that X and Y occur together more often than would be expected if they were independent. The calculation of conviction is shown in (18).
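The four rule measures have standard definitions in association-rule mining, where the promotion degree is usually called lift and the fourth measure conviction. The sketch below computes them for a rule X → Y over a list of transactions; the function name `rule_metrics` and the toy transactions are hypothetical and are not the study's data.

```python
def rule_metrics(transactions, x, y):
    """Support, confidence, lift, and conviction of the rule x -> y."""
    n = len(transactions)
    n_x = sum(1 for t in transactions if x in t)
    n_y = sum(1 for t in transactions if y in t)
    n_xy = sum(1 for t in transactions if x in t and y in t)
    support = n_xy / n                 # P(X and Y)
    confidence = n_xy / n_x            # P(Y | X)
    lift = confidence / (n_y / n)      # P(Y | X) / P(Y)
    # Conviction is infinite when the rule never fails.
    conviction = float("inf") if confidence == 1 else (1 - n_y / n) / (1 - confidence)
    return {"support": support, "confidence": confidence,
            "lift": lift, "conviction": conviction}

# Toy transactions: each travel note is a set of theme tags plus a sentiment tag.
transactions = [
    {"tourism resources", "positive"},
    {"tourism resources", "tourism facilities", "positive"},
    {"tourism services", "negative"},
    {"tourism resources", "negative"},
    {"tourism environment", "negative"},
]
r = rule_metrics(transactions, "tourism resources", "positive")
```

In this toy example the lift is above 1, so notes that mention tourism resources are positive more often than notes overall.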
2.7. Statistical Methods
SPSS 20.0 is used for statistical analysis of the data, and a t-test is used. Count data are expressed as rates (%). ROC curves of the C4.5 decision tree and the Bayesian classifier are plotted for online travel notes with different emotional attitudes. A P value below the significance threshold is considered statistically significant.
3. Results
3.1. Public Data Result Statistics Based on Text Analysis
Figure 2 shows the number of Mount Wutai travel notes in each sentiment category. Ctrip has 169 positive, 143 neutral, and 16 negative travel notes. Qunar has 32 positive, 28 neutral, and 12 negative travel notes. Mafengwo has 15 positive, 7 neutral, and 5 negative travel notes. Tuniu has 24 positive, 15 neutral, and 6 negative travel notes. Tongcheng has 52 positive, 41 neutral, and 10 negative travel notes. On every website, positive travel notes about Mount Wutai are the most numerous, followed by neutral and then negative ones. Figure 3 shows the number of Mount Wutai travel notes by year. Ctrip has 102 travel notes in 2016, 104 in 2017, and 122 in 2018. Qunar has 25 travel notes in 2016, 28 in 2017, and 19 in 2018. Mafengwo has 7 travel notes in 2016, 6 in 2017, and 14 in 2018. Tuniu has 17 travel notes in 2016, 12 in 2017, and 16 in 2018. Tongcheng has 38 travel notes in 2016, 42 in 2017, and 23 in 2018. Table 3 summarizes the emotional attitudes expressed in the online travel notes. The number of tourists who intend to revisit and to recommend Mount Wutai is significantly higher than the number who do not.
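The counts in this paragraph can be cross-checked with a short tally. The sketch below uses only the numbers stated above; the variable names are hypothetical.

```python
# (positive, neutral, negative) travel notes per website, as stated in the text.
# Mafengwo is the site also translated as Hornet's Nest.
by_sentiment = {
    "Ctrip": (169, 143, 16),
    "Qunar": (32, 28, 12),
    "Mafengwo": (15, 7, 5),
    "Tuniu": (24, 15, 6),
    "Tongcheng": (52, 41, 10),
}
# (2016, 2017, 2018) travel notes per website, as stated in the text.
by_year = {
    "Ctrip": (102, 104, 122),
    "Qunar": (25, 28, 19),
    "Mafengwo": (7, 6, 14),
    "Tuniu": (17, 12, 16),
    "Tongcheng": (38, 42, 23),
}

site_totals = {site: sum(counts) for site, counts in by_sentiment.items()}
class_totals = tuple(sum(col) for col in zip(*by_sentiment.values()))
year_totals = tuple(sum(col) for col in zip(*by_year.values()))
grand_total = sum(site_totals.values())
```

Both breakdowns agree site by site and sum to the 575 travel notes selected in Section 2.1, with 292 positive, 234 neutral, and 49 negative notes overall.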


3.2. Naive Bayesian Classification and Evaluation Results of Public Data Based on Text Analysis
Figure 4 shows the classification and evaluation results of the naive Bayesian classifier for public data based on text analysis. For positive text, the true positive rate is 86.64%, the true negative rate is 81.27%, the false positive rate is 18.73%, and the false negative rate is 13.36%; the precision is 82.68%, the recall is 86.64%, and the F-measure is 84.62%. For neutral text, the true positive rate is 82.05% and the true negative rate is 78.89%; the false positive rate, false negative rate, precision, recall, and F-measure are 21.11%, 17.95%, 72.73%, 82.05%, and 77.11%, respectively. For negative text, the true positive rate is 83.67%, the true negative rate is 98.29%, the false positive rate is 1.71%, and the false negative rate is 16.33%; the precision is 82.00%, the recall is 83.67%, and the F-measure is 82.83%. The F-measure for positive and negative texts is higher than that for neutral texts. Hence, the naive Bayesian classifier performs better on positive and negative texts than on neutral texts.

3.3. Public Data C4.5 Decision Tree Classification and Evaluation Results Based on Text Analysis
Figure 5 shows the C4.5 decision tree classification evaluation results of public data based on text analysis. The true positive rate of the C4.5 decision tree classifier for positive text is 91.44%, the true negative rate is 86.57%, the false positive rate is 13.43%, and the false negative rate is 8.56%. The precision, recall, and F-measure are 87.54%, 91.44%, and 89.45%, respectively. The true positive rate of the C4.5 decision tree classifier for neutral text is 90.17%, the true negative rate is 83.28%, the false positive rate is 16.72%, and the false negative rate is 9.83%. The precision, recall, and F-measure are 78.73%, 90.17%, and 84.06%, respectively. The true positive rate of the C4.5 decision tree classifier for negative text is 91.84%, the true negative rate is 99.05%, the false positive rate is 0.95%, and the false negative rate is 8.16%. The precision is 90.00%, recall is 91.84%, and F-measure is 90.91%. The F-measure of the C4.5 decision tree classifier for positive and negative texts is higher than that for neutral texts. Hence, the evaluation effect of the C4.5 decision tree classifier for positive and negative texts is better than that for neutral texts.

3.4. Evaluation Comparison between the C4.5 Decision Tree and Bayesian Classifier
Figure 6 compares the C4.5 decision tree and the Bayesian classifier by their ROC curves, where A is the positive text, B is the neutral text, and C is the negative text. Both classifiers perform significantly better on positive text than on neutral and negative text, and the C4.5 decision tree classifier performs better than the Bayesian classifier.

3.5. Public Data Organization and Management Results
Figure 7 shows the organization and management results for the public data of online travel notes on Mount Wutai tourism. A total of 513 online travel notes relate to tourism resources, 279 to tourism facilities, 193 to tourism services, and 163 to the tourism environment. Tourism resources and tourism facilities are thus the most frequently mentioned themes.

3.6. Positive Public Data Association Analysis
Figure 8 shows the association analysis of positive public data, where A is the support, B is the confidence, C is the lift (promotion degree), and D is the conviction. In positive online travel notes, the lift of tourism resources and tourism facilities is significantly higher, indicating that tourism resources and facilities are strongly associated with positive online travel notes.

3.7. Negative Public Data Association Analysis
Figure 9 shows the association analysis of negative public data, where A is the support, B is the confidence, C is the lift (promotion degree), and D is the conviction. In negative online travel notes, the lift of tourism services and the tourism environment is higher, indicating that tourism services and the tourism environment are strongly associated with negative online travel notes.

4. Discussion
The development of big data makes it easier for people to collect and use information. With the spread of information technology, the Internet has become the main channel through which people publish and search for information [19, 20]. New media forms of tourism have been widely accepted. An increasing number of tourists share their travel experiences and express their feelings and attitudes online in the form of travel notes and comments [21, 22]. These online texts are authentic and extensive and play an important role in shaping the image of tourist destinations and in providing a reference for other tourists [23]. Based on the online travel notes of tourism websites, this study analyzes tourists’ emotional attitudes toward Mount Wutai and the factors associated with them. Text classification plays an important role in text analysis and management. McCallum and Nigam [24] compared the naive Bayesian text classification model with the unigram language model with integer word counts and found that the naive Bayesian text classification model performed well and had certain advantages. Tong and Koller [25] applied support vector machine active learning to text classification and achieved good results. Fesseha et al. [26] carried out text classification of Tigrinya based on a convolutional neural network and found that the convolutional neural network classified more accurately than traditional machine learning models. Hutama et al. [27] used naive Bayes and a decision tree to build classification models for text data on work culture. They reported accuracies of 33%, 66%, and 80% for the three work culture constructs with the decision tree and 83%, 50%, and 60% with naive Bayes.
This study uses a naive Bayesian classification model and a decision tree classification model for text classification [28–37]. For the naive Bayesian classifier, the true positive rate, true negative rate, and F-measure are 86.64%, 81.27%, and 84.62% for positive text; 82.05%, 78.89%, and 77.11% for neutral text; and 83.67%, 98.29%, and 82.83% for negative text. The naive Bayesian classifier therefore performs better on positive and negative text than on neutral text. For the C4.5 decision tree classifier, the true positive rate, true negative rate, and F-measure are 91.44%, 86.57%, and 89.45% for positive text; 90.17%, 83.28%, and 84.06% for neutral text; and 91.84%, 99.05%, and 90.91% for negative text. The C4.5 decision tree classifier likewise performs better on positive and negative texts than on neutral texts. The ROC curves of the C4.5 decision tree and the Bayesian classifier show that both classifiers perform significantly better on positive text than on neutral and negative text and that the C4.5 decision tree classifier performs better than the Bayesian classifier.
This study also analyzes the correlation between the factors that affect the categories of online travel notes and the emotional attitudes expressed in them. Using the Apriori algorithm, the support, confidence, lift (promotion degree), and conviction between tourism resources, tourism facilities, tourism services, and the tourism environment and online travel notes with different emotional attitudes are calculated. The results show that the lift of tourism resources and tourism facilities is significantly higher in positive online travel notes, indicating that tourism resources and facilities are strongly associated with positive online travel notes. The lift of tourism services and the tourism environment is higher in negative online travel notes, indicating that tourism services and the tourism environment are strongly associated with negative online travel notes. The service quality and environment of the Mount Wutai scenic area should therefore be improved to address these shortcomings and promote the development of Mount Wutai tourism.
5. Conclusion
In this study, a naive Bayesian classification model and a decision tree classification model are used to classify the emotional attitude of online travel notes about Mount Wutai, and the decision tree classification model is found to classify better. The relationship between tourism resources, tourism facilities, tourism services, the tourism environment, and the emotional attitude of online travel notes is also examined: tourism resources and tourism facilities are strongly associated with positive text, while tourism services and the tourism environment are strongly associated with negative text.
Data Availability
The dataset can be accessed upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.