Abstract

Travel route preferences interact strongly with the events that occur during networked travel, and this coevolving phenomenon is essential in providing theoretical foundations for travel route recommendation and for predicting collective behaviour in social systems. While most literature focuses on route recommendation for individual scenic spots rather than city travel, we propose a novel approach named City Travel Route Recommendation based on Sequential Events Similarity (CTRR-SES), which applies the coevolving spreading dynamics of city tour networks and mines the spatiotemporal travel patterns in those networks. First, we present the Event Sequence Similarity Measurement Method based on modelling tourists’ travel sequences. The method measures similarities among city travel routes that combine different scenic types, time slots, and relative locations. Second, by applying a user preference learning method based on scenic type, we learn from the user’s historical city travel data and compute the personalized travel preference. Finally, we verify our algorithm on the historical spatiotemporal routes of 54 city travellers in the ten most popular cities, collected from Mafengwo.com. CTRR-SES shows better performance in predicting a new city travel sequence that fits the user’s individual preference.

1. Introduction

City tours have become popular in recent years, as tourists can experience varied food, culture, customs, and city views while making use of commercial services such as comfortable accommodation and inner-city transportation [1]. Unlike traditional scenic spots, which are geographically isolated, a city tour combines civil resources, various facilities, and landscapes, and these form a city tour network with spatiotemporal multiplexity. Factors such as the urban economy, society, and culture affect the touring experience. They coevolve and are highly interrelated, so a city scenic spot has compound attributes with multiple labels. Furthermore, many modes of transport connect these city spots, which are geographically centered around the urban area. Thus, a city travel plan is personalized, flexible, and evolving [2, 3], and the coevolving spreading dynamics of this multiscale network are a promising direction to explore for city tour recommendation systems. So far, the travel recommendations given by apps and online travel agencies (OTAs) are classical routes with scenic spots ranked by the number of visitors or by the preferences of most travellers. Such recommendations do not suit every visitor because they lack personalization [4]. When modelling and solving the tour route planning problem, most papers investigate user preference and recommend travel routes with fixed start and end points [5], without taking the spatiotemporal travel sequence, length of stay, and modes of transport into consideration [6, 7].

To address the problems mentioned above, this paper first defines the user travel sequence model and the elements involved in city tour planning; based on this model, we present the travel sequence similarity measurement method. The method measures the similarities among city travel sequences that combine different scenic types, time slots, and relative locations. Second, clustering analysis is conducted on the historical travel database. Using the travel sequence similarity measurement method, we compute the baseline model of an individual visitor’s personalized travel preference. Finally, we propose a novel approach named City Travel Route Recommendation based on Sequential Events Similarity (CTRR-SES). CTRR-SES recommends personal travel routes in a new destination for users. The recommended routes fit the user’s preference better because they are calculated from the travel preference baseline model and from the user’s historical travel sequence data.

2. Research Background

A travel route recommendation system gives the user city travel routes that match the user’s preferences, satisfying the user’s real needs and expectations. Exploiting users’ historical data to make future predictions lies at the heart of building effective recommender systems [8]. For e-commerce, personalized recommendation strategies can be designed to promote the diffusion of products [9]. City tourism, however, is a new product of modern social arrangements, as tourists spend time in cities in pursuit of recreation, relaxation, and pleasure. City tourism is characterized by social media posts and tags of popular city attractions. Many studies investigate tour preference by analyzing travellers’ social media posts and tag data. Based on geo-tagged photos, some studies examine the correlation between the number of geo-tagged images and the actual number of visitors [10], some the spatiotemporal behaviour of travellers [11], some travel route recommendation algorithms [12], and some city impressions and big events and their combined impact on travel decisions [13]. However, those papers do not fully consider the features of new city tourists and their touring preference sequences. Hence, they are unable to explore the unique traits of city travellers.

Big data about travel knowledge is generated each day on the Internet and on various platforms. Large amounts of structured or semistructured data are produced by visitors who share their travel experiences, skills, or feedback through communication technology and mobile devices. Among studies of travel route recommendation algorithms, Sun et al. use a knowledge graph to build a travel database by extracting travel information from user-submitted content, in order to represent personalized touring routes [14]. Li et al. present a new approach for designing routes for tourists visiting Gulangyu island by applying the Stated Preference method [15].

However, those papers only study the recommendation of scenic spots and do not analyze the sequential order of spots in visitors’ historical touring routes. We believe the sequential order plays an important role in measuring tourist preference. For example, Sequence 1 represents user A, who visits urban Chongqing in the order Ciqikou-Hongyadong-Jiefangbei-Sichuan Fine Art Institute-Eling Park. The sequence of user B is Sichuan Fine Art Institute-Eling Park-Hongyadong-Jiefangbei-Ciqikou. If the set of scenic spots were the only factor determining a visitor’s preference, A and B would receive the same recommended route sequence. However, users A and B visit those scenic spots in a different order, which indicates that user A prefers to spend the daytime in spots tagged as shopping or fine food and the night on city sights, whereas user B prefers to visit city sights in the daytime and shop at night. We believe that travel route recommendations should include not only the user’s preference for scenic type but also the visiting sequence and time slot. The recommendation system can then give users personal travel routes that match their individual preferences.

This paper constructs a city tourist attraction preference model based on the city attraction knowledge base and the user’s historical touring sequences. A data mining algorithm is proposed to discover the city attraction label set. The traveller’s historical touring events are analyzed to find the clusters that best reflect the traveller’s preferences. By comparing the similarities of various travel event sequences, we aim to provide highly personalized travel recommendations that satisfy the traveller’s real needs.

3. Preliminaries

Before the problem statement, we give the definitions of these concepts as follows.

Definition 1. (Attraction $p$). An attraction $p$ represents a city place that visitors have previously visited or are interested in visiting; it can be a natural landscape, folk culture site, historical landscape, civic landscape, or consumption place.

Definition 2. (Attraction labels $p.type$). Given $p$ as a city attraction, we define the labels of $p$ as a sequence of category labels, denoted $p.type$.

Definition 3. (Travel history $r$). Given $p$ as a city attraction, the travel history at $p$ is given as a travel item $r = (p, t)$, in which the time information $t$ records the arrival time and the leaving time.

Definition 4. (Touring sequence $L$). We define the touring sequence as a time-ordered sequence of a user visiting multiple city attractions, given as $L = \langle r_1, r_2, \ldots, r_n \rangle$, where n represents the total number of attractions that have been visited. The time interval between visiting two adjacent attractions is no longer than a threshold value. Considering the characteristics of city travel, we set a reasonable time interval threshold of 1 hour.

Based on the definitions above, we define our city travel route recommendation problem as follows. Given the set of all users’ historical touring sequences, input the historical touring sequence set of user A, in which An denotes his/her total number of touring sequences. Then input city B. Our goal is to determine the best personalized city travel route recommendation for user A from the travel sequence set of city B. The strategies are given as follows:
(1) Learning from the user’s historical city tour sequences, identify the user’s city travel preference model
(2) Based on all travel sequences in a given city and the user’s city travel preference model, determine the best personalized city tour route recommendation for the user
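
To make Definitions 3 and 4 concrete, the following is a minimal Python sketch of how time-ordered visits might be grouped into touring sequences with the 1-hour threshold. The `Visit` class, its field names, and the `split_into_sequences` helper are hypothetical names introduced only for illustration, and the sketch assumes that the "time interval" is measured from leaving one attraction to arriving at the next.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import FrozenSet, List


@dataclass(frozen=True)
class Visit:
    """One travel-history item: an attraction, its labels, and the stay."""
    attraction: str
    labels: FrozenSet[str]
    arrive: datetime
    leave: datetime


def split_into_sequences(
    visits: List[Visit],
    max_gap: timedelta = timedelta(hours=1),  # threshold from Definition 4
) -> List[List[Visit]]:
    """Group visits into touring sequences.

    Assumption: a new sequence starts whenever the gap between leaving
    one attraction and arriving at the next exceeds max_gap.
    """
    ordered = sorted(visits, key=lambda v: v.arrive)
    sequences: List[List[Visit]] = []
    for visit in ordered:
        if sequences and visit.arrive - sequences[-1][-1].leave <= max_gap:
            sequences[-1].append(visit)
        else:
            sequences.append([visit])
    return sequences
```

With this reading, a visitor who leaves Ciqikou at 11:00, arrives at Hongyadong at 11:40, and only reaches Jiefangbei two hours after leaving Hongyadong would produce two touring sequences rather than one.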

4. Recommendation Algorithm

Recommending a city tour that satisfies the visitor’s preference and real needs is challenging when a tourist visits a new city. To meet this challenge, we propose a novel approach named City Travel Route Recommendation based on Sequential Events Similarity (CTRR-SES), which measures the similarities of various city travel routes in a given city and learns from the user’s historical city touring sequences.

4.1. Travel Route Recommendation Framework

There are three building blocks in our CTRR-SES, as indicated in Figure 1, which are Travel History/Sequences Construction, Scenic Type-based User Preferences Baseline Modelling, and Route Recommendation System. Travel History/Sequences Construction and User Preferences Baseline Model Learning are processed offline. By analyzing the user’s open travel posts, we can obtain the user’s historical city touring sequences. Then we may compute the baseline model from the travel history using clustering analysis. Route Recommendation is processed online. Firstly, CTRR-SES computes the feature vectors which represent user’s travel characteristics from the city travel historical sequences and the preferences baseline model. Then it recommends the most similar touring sequence, which matches the user’s personal preference from existing travel sequences.

4.2. Travel Knowledge Base and Touring Sequence Construction

Using data mining technology, we construct the travel knowledge base by obtaining big data from platforms like Baidu, Mafengwo, TripAdvisor, and Booking. Attraction information is comprised of attributes of Name, Geographic Position, Type and Rating, etc. Each attraction is also labelled with a category of one or many of the following, i.e., city park, garden, arboretum, natural landscape, architecture, church, temple, museum, college campus, historical sites, food and beverage, shopping site, amusement, art performance, etc. City tour transportation modes include Taxi, Bus, Subway, and Walk. Learning from the user’s past space-time trajectory, travel sequences are generated by consecutively extracting data of geographic position, attraction label, visit duration, transportation mode, and time spent in transportation.

4.3. User Travel Preference Baseline Model Learning
4.3.1. Travel Sequence Similarity Measure

The travel sequence similarity measure quantifies how alike two or more travel sequences are, so that similar sequences can then be grouped into clusters. In this paper, we present a travel sequence similarity measurement method based on the Needleman–Wunsch (NW) algorithm. Moreover, we improve the traditional NW algorithm by integrating time information into the score function.

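Definition 5. (Attraction touring history similarity $W$). Given $p$ as a city attraction, let $r_1$ and $r_2$ be two travel items in the travel sequences. Then the similarity between $r_1$ and $r_2$ is given as follows:

$$W(r_1, r_2) = \begin{cases} u \, S_{poi}(r_1.p, r_2.p) + (1 - u) \, S_{time}(r_1.t, r_2.t), & \text{if the type labels of } r_1.p \text{ and } r_2.p \text{ overlap}, \\ d_1, & \text{otherwise}. \end{cases} \tag{1}$$

$S_{poi}$ indicates the similarity between two POIs, and $S_{time}$ indicates the similarity between the times of visiting the two spots. $u$ and $1 - u$ are the weights put on $S_{poi}$ and $S_{time}$, which adjust the sensitivity to each of them. $d_1$ and $d_2$ are customized scores: $d_1$ is used when the labels do not overlap, and $d_2$ is the gap score used in equation (2).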

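Definition 6. (Travel sequence similarity). Given two travel sequences $L_1$ and $L_2$, the similarity score of $L_1$ and $L_2$ is computed as follows:

$$M[i][j] = \max \begin{cases} M[i-1][j-1] + W(L_1[i], L_2[j]), \\ M[i-1][j] + d_2, \\ M[i][j-1] + d_2. \end{cases} \tag{2}$$

With equation (2) we can calculate the travel sequence similarity score matrix M. Normalization of the data in the last row and column of the matrix generates the similarity score of the two travel sequences $L_1$ and $L_2$. The pseudocode of the travel sequence similarity algorithm (TSSA) is given in Algorithm 1.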
TSSA iteratively calculates the similarity of two sequences L1 and L2 by comparing each item in the sequences. The similarity values are stored in the two-dimensional matrix M. If the lengths of the two sequences are not equal, gaps are added to make them equal. Lines 1 to 6 of TSSA initialize the similarity matrix, and lines 7 to 18 calculate the similarities and fill in the matrix. M[i][j] represents the similarity accumulated up to the ith item of L1 and the jth item of L2, and its value is determined by the values of M[i − 1][j], M[i][j − 1], and M[i − 1][j − 1]. Equation (2) is an iterative formula that gives three paths for calculating the value of M[i][j], among which the maximum is chosen:
(1) From above, along the vertical line of M[i][j]. Suppose the aligned sequences L1′ and L2′ are generated while L1 and L2 are compared. Reaching the cell M[i][j] from above is equivalent to adding the corresponding item of L1 to L1′ and adding a gap to L2′. Therefore, the value of M[i][j] is M[i − 1][j] + d2.
(2) From the left, along the horizontal line of M[i][j]. As in (1), reaching the cell M[i][j] from the left is equivalent to adding the corresponding item of L2 to L2′ and adding a gap to L1′. Thus, the value of M[i][j] is M[i][j − 1] + d2.
(3) From the diagonal of M[i][j]. By adding the corresponding item of L1 to L1′ and the corresponding item of L2 to L2′, we obtain the similarity value M[i − 1][j − 1] + W(L1[i], L2[j]). When the labels of the ith POI in L1 overlap with those of the jth POI in L2, W(L1[i], L2[j]) is calculated in lines 10 to 12 of TSSA; otherwise, the value of W(L1[i], L2[j]) equals d1.
The pseudocode of PSA (point similarity algorithm), called in line 10, and that of TSA (time similarity algorithm), called in line 11, are given in Algorithm 2 and Algorithm 3, respectively.
The first line of Algorithm 2 counts the scenic labels shared by the two POIs. The second line divides this count by the number of distinct labels of the two POIs taken together and uses the ratio as the similarity value of the two POIs.
In the first line of Algorithm 3, we take half an hour as a single unit and build the time axis on it, so that the time range spent at each of the two POIs is represented by a numeric sequence. In the second line, the Longest Common Subsequence (LCS) algorithm is applied to find the longest subsequence present in both numeric sequences. LCS can be solved by dynamic programming, which divides the original problem into subproblems. In the third line, the time similarity is computed as the ratio of the length of the longest common subsequence to the number of distinct time units covered by the two sequences, |l|/(|l1| + |l2| − |l|).

Algorithm 1: Travel sequence similarity algorithm (TSSA).
Input: Travel sequences L1 and L2
Output: Similarity of L1 and L2
Initialization: Set score matrix M to 0
(1)for i ⟵ 0 to |L1| do
(2) M[i][0] ⟵ i · d2;
(3)end for
(4)for j ⟵ 0 to |L2| do
(5) M[0][j] ⟵ j · d2;
(6)end for
(7)for i ⟵ 1 to |L1| do
(8) for j ⟵ 1 to |L2| do
(9)  if Overlap (L1[i].p.type, L2[j].p.type) then //Overlap (a, b) means the attraction type labels of POI a and POI b overlap
(10)   Spoi ⟵ PSA (L1[i].p, L2[j].p);
(11)   Stime ⟵ TSA (L1[i].t, L2[j].t);
(12)   sim ⟵ u · Spoi + (1 − u) · Stime;
(13)   M[i][j] ⟵ max(M[i − 1][j − 1] + sim, M[i − 1][j] + d2, M[i][j − 1] + d2);
(14)  else
(15)   M[i][j] ⟵ max(M[i − 1][j − 1] + d1, M[i − 1][j] + d2, M[i][j − 1] + d2);
(16)  end if
(17) end for
(18)end for
(19)return M[|L1|][|L2|];
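
The matrix fill of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `sequence_similarity`, the `score` callback (which stands in for W, including the d1 case), and the default gap value of −0.5 for d2 are all assumptions, and the final normalization step mentioned in Definition 6 is omitted.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def sequence_similarity(
    seq1: Sequence[T],
    seq2: Sequence[T],
    score: Callable[[T, T], float],  # plays the role of W(L1[i], L2[j])
    gap: float = -0.5,               # plays the role of d2 (assumed value)
) -> float:
    """Needleman-Wunsch style global alignment score of two sequences."""
    rows, cols = len(seq1), len(seq2)
    M: List[List[float]] = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    # First column and first row: aligning a prefix against gaps only.
    for i in range(1, rows + 1):
        M[i][0] = i * gap
    for j in range(1, cols + 1):
        M[0][j] = j * gap
    # Each cell takes the best of the diagonal, vertical, and horizontal paths.
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            M[i][j] = max(
                M[i - 1][j - 1] + score(seq1[i - 1], seq2[j - 1]),
                M[i - 1][j] + gap,
                M[i][j - 1] + gap,
            )
    return M[rows][cols]
```

Python lists are 0-indexed, so the item compared in cell M[i][j] is `seq1[i - 1]` and `seq2[j - 1]`; the matrix itself has one extra row and column for the empty prefixes, which is why the result is read from the bottom-right cell.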
Algorithm 2: Point similarity algorithm (PSA).
Input: POI information p1 and p2 of travel items r1 and r2
Output: POI similarity Spoi of r1 and r2
(1)count ⟵ Intersection (p1.type, p2.type); //Intersection (a, b) means the number of labels shared by label set a and label set b
(2)Spoi ⟵ count/(p1.type.size + p2.type.size − count);
(3)return Spoi
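
The ratio in Algorithm 2 is the Jaccard index of the two label sets. A minimal Python sketch, assuming labels are stored as sets of strings (the function name `poi_similarity` and the choice to return 0 when both label sets are empty are illustrative assumptions):

```python
from typing import AbstractSet


def poi_similarity(labels1: AbstractSet[str], labels2: AbstractSet[str]) -> float:
    """Shared labels divided by distinct labels of both POIs (Jaccard index)."""
    union = labels1 | labels2
    if not union:
        return 0.0  # assumption: two unlabeled POIs are treated as dissimilar
    return len(labels1 & labels2) / len(union)
```

For example, a POI labelled {shopping, food and beverage} and one labelled {food and beverage, architecture, museum} share one of four distinct labels, giving a similarity of 0.25.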
Algorithm 3: Time similarity algorithm (TSA).
Input: Time information t1 and t2 of travel items r1 and r2
Output: Time similarity Stime of r1 and r2
(1)Divide the time axis into half-hour units numbered from 1, so that t1 and t2 can be represented by the numeric sequences l1 and l2
(2)l ⟵ LCS (l1, l2) //Calculate the longest common subsequence of sequences l1 and l2
(3)Stime ⟵ |l|/(|l1| + |l2| − |l|)
(4)return Stime
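
Algorithm 3 can be sketched in Python as below. The sketch assumes that visit times are given as minutes since midnight and that a half-hour unit belongs to a visit if the stay overlaps it at all; the helper names `to_slots`, `lcs_length`, and `time_similarity` are hypothetical.

```python
from typing import List, Sequence


def to_slots(start_minute: int, end_minute: int, slot: int = 30) -> List[int]:
    """Number the half-hour units of the day from 1 and list those a stay covers."""
    first = start_minute // slot + 1
    last = (end_minute - 1) // slot + 1
    return list(range(first, last + 1))


def lcs_length(a: Sequence[int], b: Sequence[int]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[len(b)]


def time_similarity(l1: Sequence[int], l2: Sequence[int]) -> float:
    """Stime = |l| / (|l1| + |l2| - |l|), with l the LCS of l1 and l2."""
    common = lcs_length(l1, l2)
    denom = len(l1) + len(l2) - common
    return common / denom if denom else 0.0
```

A stay from 9:00 to 11:00 maps to units 19 to 22 and a stay from 10:00 to 12:00 to units 21 to 24; they share two units out of six distinct ones, so the time similarity is 1/3.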
4.3.2. Travel Sequences Clustering

The K-means algorithm is one of the most popular and widely used clustering methods because of its simplicity, robustness, and speed. It is an iterative algorithm: the same steps are repeated, and each pass improves the partition. The K-means algorithm can deal with data sets of different types and discover clusters regardless of the input order of the data. Because a mean of travel sequences is not defined, the center of each cluster is taken to be an actual sequence in the cluster (a medoid), which also makes the result less sensitive to noise and isolated points. Thus, this paper adopts a K-means-style algorithm for travel sequence clustering analysis.

(1) Clustering Algorithm Description. The K-means algorithm partitions a dataset of n data points into K clusters. Cluster centers are positioned as points, every data point is associated with the nearest center, the centers are recomputed and adjusted, and the process starts over using the new centers until the desired result is reached. The Travel Sequence Clustering Algorithm (TSCA) is given in Algorithm 4.

Algorithm 4: Travel sequence clustering algorithm (TSCA).
Input: travel sequence set TS = {L1, L2, …, Ln} and the number of clusters k
Output: travel sequence cluster set TC = {TC1, TC2, …, TCk} and the set of k center sequences newMedoids = {L1, L2, …, Lk}
Initialization: oldMedoids ⟵ null, newMedoids ⟵ null;
(1)Select k sequences L1, L2, …, Lk from TS randomly as the initial center sequences newMedoids;
(2)TCi ⟵ Li //Each center sequence corresponds to a cluster
(3)while (!isEqual (oldMedoids, newMedoids))
(4) Calculate the similarity of each sample sequence from TS to each center sequence from newMedoids and place the sample sequence in the cluster whose center sequence it is most similar to;
(5) oldMedoids ⟵ newMedoids;
(6) Recalculate the center sequence of each cluster TCi, i.e., the sequence with the highest total similarity to the other sample sequences in the cluster, and store the centers as newMedoids;
(7)end while
(8)return TC and newMedoids;
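
The loop of Algorithm 4 can be sketched in Python as follows. This is a hedged illustration under stated assumptions: `cluster_sequences`, its `seed` and `max_iter` parameters, and the rule that a center always stays in its own cluster are choices made here for a runnable example, and any similarity function (such as the travel sequence similarity) can be passed as `sim`.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple, TypeVar

S = TypeVar("S")


def cluster_sequences(
    sequences: Sequence[S],
    k: int,
    sim: Callable[[S, S], float],
    seed: int = 0,
    max_iter: int = 100,  # assumed safety cap; must be at least 1
) -> Tuple[List[int], Dict[int, List[int]]]:
    """Return medoid indices and a mapping medoid index -> member indices."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(sequences)), k)
    for _ in range(max_iter):
        # Assignment step: each sequence joins its most similar center.
        clusters: Dict[int, List[int]] = {m: [] for m in medoids}
        for idx, seq in enumerate(sequences):
            if idx in clusters:  # a center always stays in its own cluster
                best = idx
            else:
                best = max(medoids, key=lambda m: sim(seq, sequences[m]))
            clusters[best].append(idx)
        # Update step: the new center has the highest total similarity
        # to the members of its cluster.
        new_medoids = [
            max(members,
                key=lambda c: sum(sim(sequences[c], sequences[o]) for o in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return sorted(clusters), clusters
```

Because centers are restricted to actual sequences, only pairwise similarities are needed, which suits sequence data for which no average is defined.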

Algorithm 4 yields K clusters and a set containing the K cluster center sequences. Each travel sequence reflects a traveller’s preference, but using all sequences together as the reference would make the base too large. Because the cluster centers have strong explanatory value, we take the set of K center sequences as the travel preference baseline model.

(2) Performance Evaluation of Sequence Clustering. An updated Sum of Squared Error (SSE) and the Silhouette Coefficient (SC) are used in this paper to evaluate the performance of clustering.

Metric 1: SSE. SSE is the sum of the squared errors between sample points and their centroids. Theoretically, the lower the SSE, the better the clustering. This paper uses the travel sequence similarity measure instead of a distance measure as the foundation of clustering. Therefore, the updated SSE is defined as the sum of the similarities between sample points and their centroids. Hence, the higher the updated SSE, theoretically, the better the clustering.

Metric 2: SC. The Silhouette Coefficient is calculated using the mean intracluster distance $a$ and the mean nearest-cluster distance $b$ for each sample in D. To clarify, $b$ is the distance between a sample and the nearest cluster that the sample is not part of. The calculation equation is given below:

$$s = \frac{b - a}{\max(a, b)} \tag{3}$$

The SC value ranges from −1 to 1, and for a distance measure a value of 1 means the clusters are well apart from each other and clearly distinguished. Because this paper substitutes similarity for distance, the interpretation of the updated SC is reversed: the closer the updated SC value is to −1, the better the clustering.
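
Both metrics can be sketched in Python from a precomputed similarity matrix. The sketch below assumes similarities are normalized to [0, 1], interprets the "nearest" cluster as the other cluster with the highest mean similarity, and assigns a score of 0 to samples in singleton clusters; the function names and the `medoids` mapping (cluster label to index of its center) are hypothetical.

```python
from statistics import fmean
from typing import Dict, List, Sequence


def updated_sse(sim: Sequence[Sequence[float]], labels: Sequence[int],
                medoids: Dict[int, int]) -> float:
    """Sum of the similarity between each sample and its cluster center."""
    return sum(sim[i][medoids[labels[i]]] for i in range(len(labels)))


def updated_sc(sim: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean of (b - a) / max(a, b) with similarity in place of distance."""
    n = len(labels)
    scores: List[float] = []
    for i in range(n):
        own = [sim[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        others = [
            [sim[i][j] for j in range(n) if labels[j] == c]
            for c in set(labels) if c != labels[i]
        ]
        if not own or not others:
            scores.append(0.0)  # assumption: undefined cases score 0
            continue
        a = fmean(own)                      # mean intracluster similarity
        b = max(fmean(o) for o in others)   # most similar other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return fmean(scores)
```

With within-cluster similarity 0.9 and cross-cluster similarity 0.1, every sample scores (0.1 − 0.9)/0.9, so the updated SC is close to −1, which matches the reversed interpretation described above.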

4.4. Travel Route Recommendation

Travel route recommendation requires the user to input his or her historical city travel sequences and a new destination city B. The user’s travel preference is measured according to the relative distance between the historical sequences and the preference baseline sequences. We calculate the similarity between the historical city travel sequences and the preference baseline model, and finally compute the K-dimensional feature vector that represents the user’s travel preference, in which K is the number of clusters. Therefore, we define user travel preference as follows.

Definition 7. (User travel preference $UP$). Given a user’s travel sequences $L_1, L_2, \ldots, L_n$ (n is the number of travel sequences) and the preference reference sequences $C_1, C_2, \ldots, C_K$ (the K cluster center sequences), the travel preference is indicated by a K-dimensional vector as follows:

$$UP = (up_1, up_2, \ldots, up_K), \quad up_i = \frac{1}{n} \sum_{j=1}^{n} S(C_i, L_j), \tag{4}$$

where $S(\cdot, \cdot)$ denotes the travel sequence similarity of Definition 6.

In the same way, every travel sequence in city B can be represented as a K-dimensional feature vector, among which we can find the vector that matches user A’s travel preference with the highest degree of similarity. The travel sequence represented by that feature vector is the travel recommendation presented to user A, because it best satisfies the user’s travel preference. As Cosine Similarity (equation (5)) is a commonly used approach, we use this metric to measure the similarity of two feature vectors $X$ and $Y$:

$$\mathrm{CosSim}(X, Y) = \frac{\sum_{i=1}^{K} x_i y_i}{\sqrt{\sum_{i=1}^{K} x_i^2} \, \sqrt{\sum_{i=1}^{K} y_i^2}} \tag{5}$$

The travel sequence recommendation algorithm is given in Algorithm 5.
In the first line of the algorithm, we use the TSCA for travel history clustering analysis of all users. The array newMedoids stores the K cluster center sequences (K is the number of clusters). In lines 2 to 7, we calculate the user’s travel preference, and the K-dimensional feature vector is stored in the one-dimensional array userpre. In lines 8 to 12, every travel sequence in the user’s destination city is represented as a K-dimensional feature vector, stored in the two-dimensional array cityseq of size m × K (m is the total number of historical sequences in the destination city). In lines 14 to 19, we use the Cosine Similarity function CosSim to find the feature vector in cityseq that matches the user’s travel preference vector with the highest degree of similarity. The corresponding sequence is the travel recommendation presented to the user.

Algorithm 5: Travel sequence recommendation algorithm.
Input: Historical travel sequence set HS = {L11, L12, …, L1n} of user A, city a, historical travel sequences HSA = {L21, L22, …, L2m} of city a, historical travel sequences HSAU of all users
Output: Travel recommendation sequence of city a for user A
(1)newMedoids ⟵ TSCA (HSAU) //Cluster historical travel sequences of all users
(2)for i ⟵ 1 to newMedoids.size do
(3) for j ⟵ 1 to n do
(4)  userpre[i] ⟵ userpre[i] + TSSA (newMedoids[i], L1j);
(5) end for
(6) userpre[i] ⟵ userpre[i]/n;
(7)end for
(8)for t ⟵ 1 to m do
(9) for r ⟵ 1 to newMedoids.size do
(10)  cityseq[t][r] ⟵ TSSA (newMedoids[r], L2t);
(11) end for
(12)end for
(13)sim ⟵ 0;
(14)for t ⟵ 1 to m do
(15) if sim < CosSim (userpre, cityseq[t]) then
(16)  sim ⟵ CosSim (userpre, cityseq[t]);
(17)  outputseq ⟵ L2t;
(18) end if
(19)end for
(20)return outputseq;
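
The online part of Algorithm 5 (lines 2 to 20) can be sketched in Python as follows, assuming the cluster center sequences have already been computed offline. The function names are hypothetical, and `sim` stands for any travel sequence similarity function. One deliberate difference from the pseudocode is labeled here: the sketch returns the candidate with the highest cosine similarity even when no similarity is positive, whereas the pseudocode starts from sim = 0 and would return nothing in that case.

```python
import math
from typing import Callable, List, Sequence, TypeVar

S = TypeVar("S")


def user_preference(history: Sequence[S], medoids: Sequence[S],
                    sim: Callable[[S, S], float]) -> List[float]:
    """K-dimensional vector: mean similarity of the history to each center."""
    return [sum(sim(m, seq) for seq in history) / len(history) for m in medoids]


def feature_vector(seq: S, medoids: Sequence[S],
                   sim: Callable[[S, S], float]) -> List[float]:
    """K-dimensional vector: similarity of one sequence to each center."""
    return [sim(m, seq) for m in medoids]


def cosine(x: Sequence[float], y: Sequence[float]) -> float:
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return sum(a * b for a, b in zip(x, y)) / norm if norm else 0.0


def recommend(history: Sequence[S], city_sequences: Sequence[S],
              medoids: Sequence[S], sim: Callable[[S, S], float]) -> S:
    """Pick the city sequence whose feature vector best matches the user."""
    pref = user_preference(history, medoids, sim)
    return max(
        city_sequences,
        key=lambda s: cosine(pref, feature_vector(s, medoids, sim)),
    )
```

Because cosine similarity compares only the direction of the vectors, a candidate route is preferred when its similarity profile across the K centers has the same shape as the user's, regardless of overall magnitude.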

5. Experiment and Evaluation

Recommender systems based on social network data can be viewed in various ways, depending on the recommendation algorithms they use. In our experiment, the dataset is generated in six steps, as indicated in Figure 2. A web crawler collects spatiotemporal travel data from social media, travel agent websites, and navigation apps. We select 10 cities (Chongqing, Chengdu, Beijing, Shanghai, Xi’an, Hangzhou, Nanjing, Tianjin, Guangzhou, and Wuhan) and scenic spots in these cities to analyze the sample travellers’ touring history sequences, as indicated in Figure 3. We further compare the scenic labels with those in the Tourist Attraction Knowledge Base (denoted as TAKB) using Natural Semantic Matching technology and manual filtering. In every city, 20 attractions are selected to form the city travel knowledge base. Finally, we split the travel sequence dataset into 70% for training and 30% for testing the CTRR-SES algorithm. In the following experimental evaluation, we randomly select different users for testing.

To validate the CTRR-SES, the experiment was designed based on the collected touring data.

5.1. Accuracy and Validation of Travel Route Recommendation Algorithm
5.1.1. Impact of the Value of K on the Travel Preference Baseline Model

The value of K to perform the K-means clustering algorithm has a great impact on the experimental results. Thus, we run the fixed K value multiple times and use the updated SSE and the mean of SC to determine the optimal value. As indicated in Figure 4, when the K value is greater than 4, then the growth rate of SSE decreases. The increase of value K leads to the increase of the value of . Next, we set the degree of similarity as the recommendation accuracy rate. Feature vectors of the recommended route and that of the corresponding route in the testing dataset are computed using the similarity function when K = 2, 3, 4, 5, 6 (experimental results are shown in Figure 4). As the bars show, the recommendation accuracy rate is the highest when K = 4. Hence, in the following experiments, we set the value of K = 4 in this paper.

5.1.2. Length Comparison of Recommendation Sequence and Original Sequence

The sequence length of the original route in the testing dataset and that of the recommended route are counted and compared, as shown in Figure 5. The error relative to the sequence length of the real route is small, which supports the accuracy of our algorithm.

5.1.3. Hit Rate

The formula of hit rate is given:

In equation (6), is the set of attractions in the recommended route, and is the set of attractions in the user’s travel historical sequence. The higher Hit Rate indicates better performance of recommendation by our algorithm. Then we calculate the accuracy of the route recommendation. The experimental hit rate result is 0.70, which further validates the CTRR-SES, proving that this algorithm will provide city travel route recommendation that effectively matches the user’s preference.

5.2. Robustness of Travel Route Recommendation Algorithm

To test the robustness of the CTRR-SES, we design the experiments shown in Table 1. We randomly change one or more sequences in the user’s historical city touring sequences, and the experimental results remain close to the original results, as detailed in Figure 6. Thus our algorithm performs well in terms of robustness and stability.

6. Conclusion

Existing travel recommendation studies seldom analyze user behavior at different granularities to calculate spatiotemporal sequence similarity. Lacking a full understanding of behavior events from multiple granularities and perspectives, those studies are not suited to the growing need for in-depth city travel route recommendations. We apply coevolving spreading dynamics to the relation between travel preferences and events in city tour networks and explore its application to city tour recommendation systems. After defining the user’s touring sequence model, this paper first presents the Event Sequence Similarity Measurement Method, which calculates the weighted mean of time, space, and activity similarity at a certain granularity to measure spatiotemporal sequence similarity. Next, we design the CTRR-SES by applying the User Travel Preference Baseline Learning Model to study the user’s historical city travel data and compute personalized travel preferences. Finally, the effectiveness and feasibility of our algorithm are validated by a series of experiments, and CTRR-SES shows better performance in predicting a new city travel sequence that fits the user’s individual preference. Our work provides reference and guidance for research on the multigranularity spatiotemporal sequence similarity problem in city travel route recommendation. However, only 54 real cases are used to evaluate the performance of the CTRR-SES algorithm, and we will include more experiments and datasets to validate the work in future research.

Data Availability

The travel historical data used to support the findings of this study have been deposited in the Mafengwo.com repository.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under grant no. 61672115 and Project No. 2020CDCGJSJ040 supported by the Fundamental Research Funds for the Central Universities.