Abstract
Advances in emerging technologies and growing transport demand are accelerating the evolution of the autonomous transportation system (ATS). The framework and architecture of ATS have become a research hotspot; however, few studies so far address the intergeneration division of transportation systems. Previous work indicates that key components are a critical representation for distinguishing long-term eras. In addition, massive text material has accumulated as research proceeds, and natural language processing techniques keep developing, which makes quantitative research on key components in intergeneration division possible. In this work, a method based on massive text analysis is proposed. First, LDA2vec is used to obtain the relationship between components and the other elements. Then, a component word set is produced by the component word set extraction module based on component items. Finally, the component word set is clustered to obtain the ATS generations and to identify the key components. Based on an analysis of large-scale, important traffic texts, our method divides the Chinese traffic system from 2010 to 2022 into three generations. The key components given by our method are consistent with human cognition of ATS. The successful application indicates that this work can be extended to intergeneration division in other fields.
1. Introduction
An autonomous transportation system (ATS) is a complex traffic system that represents the future development direction of the intelligent traffic system (ITS) framework. According to ATS theory, the traffic system consists of five basic elements, namely, technology, need, service, function, and component [1, 2]. As a stable framework, ATS unifies the interior and exterior of the transportation system. The external effect is driven by technologies and needs, the internal effect is activated by services and functions, and components carry the whole ATS as physical entities [3, 4]. The ATS framework passes through distinct periods during its development. Owing to the external driving effect, the internal activation effect, and the carrying effect, the composition and performance of the five elements change between periods. When the change is large enough to revolutionize the understanding of ATS, ATS transitions to a new generation. Intergeneration division is proposed to divide the framework features into intergenerational groups: it clusters the features along the timeline to aggregate different stages. Such a division provides a clearer understanding of the framework and promotes research on it. Furthermore, the transportation system has been rapidly affected by high technology in recent years. Artificial intelligence and connected vehicles are continuously being integrated into the transportation field, making modern transportation more intelligent than traditional transportation [5, 6]. The way industries produce and the way people travel are changing at an accelerating rate. If we are aware of the changes in each period, we can learn how the ATS elements evolve. This helps set development goals for the system framework, adjust the industrial structure of traffic, and generate substantial economic benefits.
Since the 1990s, in theoretical and technological research on ITS, scholars have not clearly defined the intergeneration division of ITS. Instead, they have advanced ITS theory and technology through continuous design. Across the research of different scholars, intergeneration division is treated as work based on experience. Such work is relatively subjective and vague; it is adequate when the transportation system is simple and technology has not advanced much. With the explosive emergence of modern transportation technologies, the transportation system has become more difficult to classify subjectively. A method of intergeneration division that divides the development process of the transportation system is becoming more and more necessary. Existing intergeneration division methods for systems have difficulty dividing ATS systematically. The specific manifestations are as follows: (1) The element relationships in ATS differ from those in other systems, so other intergeneration division methods cannot be adapted to ATS. (2) There is no established standard for intergeneration division methods, and the ways of performing intergeneration division are ambiguous and inconsistent. (3) The key to intergeneration division is not specific: it is not based on physical entities, so the generation features cannot be clearly described.
To address the above problems, the authors of [2] propose an ATS intergeneration division standard based on key components, which uses physical entities as a bridge to reflect changes between generations. All elements depend on components for expression. This concept is adopted in this work, and an ATS intergeneration division method based on big data is proposed. Using the LDA2vec topic model [7] from natural language processing (NLP), topics are extracted from the divided text set as condensed text content. Text similarity between topics and components is then computed to quantify the relationship between components and the other elements. The word distribution and the year distribution of the text set, together with the relationship values, jointly constitute the component word set, which is then clustered to obtain the desired generations. Finally, the key components are selected according to the clustering results. In this process, components serve as the physical entities that connect the entire system framework, and the information used to establish the intergeneration division also depends on components. In addition, the selected components capture the feature differences between two generations. Therefore, the selected components are representative. The main contributions are as follows: (1) A machine learning method based on the LDA2vec model is proposed for ATS intergeneration division from an objective view, using a massive collection of texts. (2) Physical entities are used to perform ATS intergeneration division from a more specific view, and a method for determining the key components of intergeneration division is proposed.
2. Related Work
2.1. Autonomous Transportation System (ATS)
The development of the ATS framework is based on ITS. In 1992, the ITS system framework was first established in the United States by ARC-IT [8], which defined user needs and services, logical architecture, physical architecture, and policies as its work content. Since then, ITS system frameworks have been proposed by China [9], the European Union [10], and Japan [11] in accordance with their respective national conditions. However, the ITS framework still needs refinement at present. Big data, autonomous driving, vehicle-road cooperation, 5G, and other technologies are now used in transportation, and the role of human beings as decision makers in transportation is being replaced by artificial intelligence. The ATS framework was therefore proposed to reduce human intervention. To liberate human beings from driving, researchers deconstruct ATS and propose a method in which needs, services, functions, technologies, and components evolve in a self-organizing way; this is the ATS element framework. Adaptive adjustment is used in this framework to offer theoretical support. ATS is still in the exploratory stage; therefore, most ATS research is theoretical. Wei et al. [12] study ATS needs and propose a method for constructing a user need system. Zhang et al. [13] use Petri nets to build a relationship network between elements and model the evolution of ATS. The relationship between the ATS elements is shown in Figure 1, which is an important theoretical basis for the ATS architecture. Deng et al. [1] build an architecture mapping relation using text analysis. Xu et al. [14] analyze the ATS logical architecture and prove its reliability. For the ATS physical architecture, Deng et al. [15] define the physical architecture of ATS and form the theoretical basis. Zhou et al. [16] build the physical architecture of ATS and lay the groundwork for application.

2.2. Topic Model
In 2003, Blei et al. [17] proposed the Latent Dirichlet Allocation (LDA) topic model, which represents the topics of a text set as probability distributions. This technique can summarize multiple latent topics of a text collection in a small number of words. Numerous LDA-based natural language processing (NLP) studies followed. Weng et al. [18] use LDA to implement opinion monitoring analysis on Twitter. Bian et al. [19] use LDA for multidocument summarization. Since LDA is a topic model based on bag-of-words, it can effectively summarize macro-level contextual information. However, it does not work well for micro-level, sentence-level information. To address this issue, Moody et al. [7] combine word vectors and LDA to create LDA2vec, which integrates context information and sentence information to obtain topics. Using LDA2vec, Luo and Shi [20] collect information from aviation safety reports and successfully complete the task of mining latent topics. In [21–23], the topic model is applied to different languages with equally satisfactory results.
2.3. Intergeneration Division
Intergeneration division is a sociological concept. Many works define differences between social groups as generations and perform division analysis on that basis. Research is broadly divided into two types. The first divides by predetermined years to examine intergeneration variations within the same time period, for example, treating every decade as a generation to study changes in human behaviour [24, 25]. The second does not rely on time. Instead, it categorizes individuals who satisfy certain criteria and then divides generations based on the temporal aggregation that corresponds to the features [26]. In medical research, a generation is defined by biological activity, such as the relationship between children and parents, or by several offspring that form a generation with differences in biological traits [27–29]. Unlike in sociology, intergeneration divisions in medicine are not reflected in equal time intervals but rather in different features. Nevertheless, both types of studies share the view that evolution produces generations. In the field of transportation, transportation systems also exhibit evolutionary characteristics. Guan [30] proposes that ITS can be divided into three generations based on the evolutionary law of transportation systems and describes the future application features of ITS 3.0. Zhang et al. [13] propose to output element replacements through the evolution relationship between nodes and edges in a complex network of the ATS evolution model and then select a replacement as the criterion for intergeneration division. Yu et al. [2] propose an intergeneration division criterion for ATS based on key components and give several indicators as references.
Intergeneration division has mostly been studied qualitatively, especially in sociology, and has received less attention in transportation. Building on the ideas of these predecessors, we propose a quantitative perspective for studying the intergeneration division of transportation systems, whose features can be found in the relationships between elements. A topic model can find the connections between elements as they are reflected by human cognition in recorded texts; on this basis, we propose a method based on massive texts.
3. Proposed Method
Figure 2 shows the process of our method. Our intention is to divide the development process of the transportation system into several time periods, each with its own generation features. The goal of the model is to extract the relationships between components and the other elements from massive text. The collected data are organized by subelement items. A component word set extraction module based on LDA2vec is proposed to calculate the component word set, which consists of a word distribution and a year distribution. Finally, KMeans [31] is used to cluster the word set, which yields the generations and their features.

3.1. Data Collection
The authors in [2] demonstrate that the component is the key element of ATS intergeneration division because the remaining four elements (technology, need, service, and function) rely on components as their carrier. The objective of this subsection is to collect texts on these four aspects and determine how they relate to each other and to components. To obtain a complete intergeneration division, the granularity of data collection needs to be accurate to the year, and the relationships between components and the other elements need to be calculated for each year [2]. Additionally, [2] breaks down the ATS generation evaluation methodology into six indicators. Combined with cognition, the ATS generation texts are divided into three types: papers, reports, and policies. The keywords of each item in the four elements are filtered manually. The texts that make up our training dataset are then retrieved for the three types based on these keywords.
3.2. LDA2vec Topic Model
The collected texts of all types contain a lot of redundant and repeated information, and it is impractical to use all of it as expressive features of a generation. The LDA2vec [7] topic model can effectively summarize information from both context and sentence semantics and yields a group of extracted keywords that serve as surface features.
LDA2vec is a combination of LDA and word2vec; it learns word vectors using the skip-gram model together with topic vectors. It can therefore mix global and local keyword features, which meets our need to condense the texts. Using LDA2vec as a summarization and redundancy-removal technique is thus viable.
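From word2vec, the word vector is encoded with the skip-gram architecture [32]. From the LDA model, the topic vector is generated by a Dirichlet distribution, as shown in Figure 3. To express sentences, a document vector $\vec{d}_j$ is formed as a mixture of topic vectors $\vec{t}_k$:
$$\vec{d}_j = \sum_{k} p_{jk}\,\vec{t}_k,$$
where $p_{jk}$ is a fraction that denotes the membership of document $j$ in topic $k$. The context vector $\vec{c}_j$, which carries the information from the context, is defined as follows:
$$\vec{c}_j = \vec{w}_j + \vec{d}_j.$$
The loss function is defined as follows:
$$L = L^{d} + \sum_{ij} L^{\mathrm{neg}}_{ij},$$
$$L^{\mathrm{neg}}_{ij} = \log \sigma(\vec{c}_j \cdot \vec{w}_i) + \sum_{l} \log \sigma(-\vec{c}_j \cdot \vec{w}_l),$$
$$L^{d} = \lambda \sum_{jk} (\alpha - 1) \log p_{jk},$$
where $L^{\mathrm{neg}}_{ij}$ is the skip-gram negative sampling loss (SGNS) [33] and $L^{d}$ is the addition of a Dirichlet-likelihood term over document weights. $\vec{w}_i$ is the target word vector, and $\vec{w}_l$ is the negatively sampled word vector. $\sigma$ is the sigmoid function. The strength of the Dirichlet-likelihood term is modulated by the tuning parameter $\lambda$. $\alpha$ is a constant parameter, which is 1 less than the number of topics.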
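As an illustration of how the document vector, the context vector, and the two loss terms fit together, the following minimal Python sketch evaluates them in the form given in [7]. The function names, the softmax used to turn raw document weights into proportions, and the toy vectors are assumptions made for illustration; this is not the training code used in this work.

```python
import math


def softmax(weights):
    """Turn unnormalized document weights into topic proportions p_jk."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def document_vector(proportions, topic_vectors):
    """d_j = sum_k p_jk * t_k (mixture of topic vectors)."""
    dim = len(topic_vectors[0])
    return [sum(p * t[i] for p, t in zip(proportions, topic_vectors))
            for i in range(dim)]


def context_vector(pivot_word_vec, doc_vec):
    """c_j = w_j + d_j."""
    return [w + d for w, d in zip(pivot_word_vec, doc_vec)]


def sgns_term(context, target_vec, negative_vecs):
    """log sigma(c.w_i) + sum_l log sigma(-c.w_l), the SGNS term of [7]."""
    value = log_sigmoid(dot(context, target_vec))
    value += sum(log_sigmoid(-dot(context, neg)) for neg in negative_vecs)
    return value


def dirichlet_term(proportions, alpha, lam):
    """lambda * sum_k (alpha - 1) * log p_jk for one document."""
    return lam * sum((alpha - 1.0) * math.log(p) for p in proportions)


# Toy example with two topics in a two-dimensional space (made-up numbers).
topics = [[1.0, 0.0], [0.0, 1.0]]
p = softmax([0.0, 0.0])              # equal membership in both topics
d = document_vector(p, topics)       # mixture of the two topic vectors
c = context_vector([1.0, 2.0], d)    # pivot word vector plus document vector
```

In an actual training loop these terms would be summed over all word pairs and documents and optimized by gradient descent; the sketch only evaluates them for a single pair.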
The total loss is backpropagated through the LDA2vec training process, and training is iterated until the loss converges. After the model is trained, the topic vectors can be calculated from the result. The topic vectors have been shown to contain both sentence semantic information and context information. The topic vectors can then be converted into topic descriptions expressed in words.
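A common way to turn a topic vector into words, and the one used in [7], is to list the vocabulary words whose vectors lie closest to it in the shared embedding space. The sketch below does this with cosine similarity; the function names and the toy vocabulary are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def describe_topic(topic_vec, word_vectors, top_n=3):
    """Return the top_n words whose vectors are most similar to the topic."""
    ranked = sorted(word_vectors,
                    key=lambda w: cosine(topic_vec, word_vectors[w]),
                    reverse=True)
    return ranked[:top_n]


# Made-up two-dimensional word vectors for illustration only.
vocab = {
    "signal": [0.9, 0.1],
    "control": [0.8, 0.3],
    "cloud": [0.1, 0.9],
    "data": [0.2, 0.8],
}
topic = [1.0, 0.2]
words = describe_topic(topic, vocab, top_n=2)
```

Here the topic vector points mostly along the first axis, so the two words aligned with that axis are returned as its description.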
3.3. Component Word Set Extraction Module
The topic description of one subelement item, which is related to all topics in its directory, is defined as the set of words contained in those topics. For each subelement item, there is a topic description for each year. The component description is defined per component, indexed by the component, as the list of words that describe it; the number of words differs from component to component. The Baidu stop word package [34] is used when splitting the sentences into words. In experiments, the stop word package was found to be limited. Therefore, some words that are often used in papers and patents were added to the stop word package to make stop word filtering more effective.
The similarity between topics and component descriptions can be used to express the similarity between components and the other elements. A relationship value is calculated from the closeness of each component’s word description vector to the topics of the four other elements, using the similarity function defined by the Python package [35]. This function encodes all words into [1, 100] vectors, namely, word vectors, and then calculates the similarity between two word vectors. The similarity value is bounded, and the closer it is to 1, the more similar the two descriptions are. For each subelement item, we sort all components by their relationship values and select the top 5 as winners. After counting all words in the text set with a bag-of-words model and sorting them by frequency, the top five words are sampled as the word distribution awarded to the winners. Their weights, which are obtained by normalizing the word frequencies, are also awarded to the winners. The purpose is to eliminate irrelevant connections and make the model more relevant. The years of all texts in the text set of one subelement item are counted as a year distribution, which is also awarded to the winners. Once all text sets have been traversed, each component has a word set with different word weights and a year statistics table built from the awarded year distributions. To lessen the impact of repeated words on the results, the weights of identical words for each year under each component are added together. After this integration is finished, the component word set is built.
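The winner selection and awarding steps can be sketched as follows. This is a minimal illustration under stated assumptions: the relationship values are taken as given (the similarity function of [35] is not reproduced), weights are normalized over the top five words only, and all names and numbers are hypothetical.

```python
from collections import Counter, defaultdict


def top_components(relation_values, k=5):
    """Select the k components with the highest relationship values."""
    return sorted(relation_values, key=relation_values.get, reverse=True)[:k]


def word_distribution(tokens, k=5):
    """Top-k words of a text set with frequency-normalized weights.

    Assumption: weights are normalized over the top-k words only.
    """
    top = Counter(tokens).most_common(k)
    total = sum(count for _, count in top)
    return {word: count / total for word, count in top}


def new_word_set():
    """Component word set: per component, word weights and year counts."""
    return defaultdict(lambda: {"words": Counter(), "years": Counter()})


def award(word_set, winners, distribution, year_counts):
    """Add the word and year distributions to every winning component.

    Weights of identical words are summed, as are the year counts.
    """
    for comp in winners:
        word_set[comp]["words"].update(distribution)
        word_set[comp]["years"].update(year_counts)


# Hypothetical relationship values for one subelement item.
relations = {
    "transport vehicle": 0.82, "driver": 0.74, "station": 0.69,
    "road": 0.40, "roadside device": 0.66, "traffic manager": 0.71,
    "passenger": 0.35,
}
tokens = ["data"] * 4 + ["network"] * 3 + ["vehicle"] * 2 + ["cloud"]

winners = top_components(relations)
dist = word_distribution(tokens)
word_set = new_word_set()
award(word_set, winners, dist, Counter({2016: 3, 2017: 2}))
```

Calling `award` once per subelement item accumulates the component word set; components that never win simply do not appear in it.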
3.4. Clustering
As introduced in Section 3.1, the component is the key element of ATS intergeneration division. Therefore, generations can be divided by clustering components. To integrate the information about components in ATS, a component word set is built. Unlike the component description, the component word set is drawn from the topics of the four other elements’ subelement items, which means that components are described through the relations between ATS elements. However, one link is still missing: a unified, numerical representation of all component word sets.
To meet the requirement of clustering, an improved K-means [31] clustering model is used to cluster the component word set. The original K-means distance equation is as follows:
$$d(x, c) = \sqrt{\sum_{i=1}^{n} (x_i - c_i)^2},$$
where $c$ is the clustering center of each class, $x$ is the input data, and $n$ is the number of input dimensions. Considering the input dimensions of K-means and the data obtained by the previous method, a group of words and their weights from the same element in one year is used as one single piece of information for K-means; the package [35] integrates this information into a [1, 100] vector. On this basis, the improved K-means distance equation additionally takes the word weight and the year information into account: its terms are the weight of the word, a single entry of the [1, 100] vector, and the year in the word year list.
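To make the idea of a weight- and year-aware distance concrete, the sketch below contrasts the plain Euclidean distance with one hypothetical weighted variant and uses it in a K-means assignment step. The exact form of the improved distance is not reproduced here: the way the word weight scales the word-vector part, the `year_scale` factor, and the toy points are all assumptions for illustration.

```python
import math


def euclidean(x, c):
    """Standard K-means distance between a point x and a center c."""
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))


def weighted_distance(x, c, word_weight, year_scale=1.0):
    """Hypothetical weighted variant.

    x and c are laid out as [word-vector entries..., year]. The word-vector
    part is scaled by the word weight and the year part by year_scale.
    """
    *x_vec, x_year = x
    *c_vec, c_year = c
    word_part = word_weight * sum((xi - ci) ** 2 for xi, ci in zip(x_vec, c_vec))
    year_part = year_scale * (x_year - c_year) ** 2
    return math.sqrt(word_part + year_part)


def assign(points, weights, centers, year_scale=1.0):
    """K-means assignment step: index of the nearest center for each point."""
    labels = []
    for x, w in zip(points, weights):
        dists = [weighted_distance(x, c, w, year_scale) for c in centers]
        labels.append(dists.index(min(dists)))
    return labels


# Toy points: two word-vector entries followed by a year (made-up values).
points = [[0.1, 0.2, 2011], [0.1, 0.25, 2012], [0.8, 0.9, 2021]]
weights = [1.0, 1.0, 1.0]
centers = [[0.1, 0.2, 2011], [0.8, 0.9, 2021]]
labels = assign(points, weights, centers)
```

With raw years, the year term dominates the distance because years differ by whole units while word-vector entries differ by fractions; a factor such as `year_scale` is one way to balance the two parts.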
4. Experiments
4.1. Data Collection
The policies and reports in the text dataset were collected from the Internet with crawlers. The papers were collected from the CNKI paper database. More details about the collected data are given in Table 1. Needs describe a vision for the future; therefore, no papers on needs could be collected for this work. To ensure that the amount of data is sufficient for massive text computation, we collected a total of 200,032 reports, 72,311 policies, and 1,223 papers from 2010 to 2022. To ensure the quality of the collected papers, we counted their sources: 15.0% are doctoral dissertations, 37.0% are master’s dissertations, 29.1% are papers published in Chinese core journals, and 18.9% come from other sources. The percentage of paper sources is shown in Figure 4, and the year distribution of the collected data is shown in Figure 5.


4.2. Results
In equation (6), the value suggested by the author in [7] is used. In equation (11), the parameters are set so that the proposed model learns more about the year feature information. The number of topics is set to 20. To allow the model to fully summarize the text set, the number of training epochs of the topic model is set to 60. As shown in Figure 6, the total loss falls to a low value and the model converges once the skip-gram negative sampling (SGNS) loss and the Dirichlet-likelihood loss reach a reasonable level of overfitting. For K-means clustering, each result vector has shape [1, 101]. For visualization, a parameter computed from the first hundred entries of the result vector, which represent the word vector, and scaled by a constant is plotted on the y-axis. The x-axis is the year. The result is shown in Figure 7.


In Figure 7, with three clusters, the three generations are 2010–2015, 2016–2019, and 2020–2022. The first intergeneration change occurred in 2015 and the second in 2019.
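Reading generation boundaries off a clustering result amounts to finding runs of consecutive years that share a cluster label. The sketch below assumes one (for example, majority) cluster label per year; the labels are toy input chosen to mirror the three generations reported here, not the actual clustering output.

```python
from itertools import groupby


def generation_ranges(years, labels):
    """Collapse per-year cluster labels into (first_year, last_year) ranges."""
    pairs = sorted(zip(years, labels))
    ranges = []
    for _, group in groupby(pairs, key=lambda pair: pair[1]):
        block = [year for year, _ in group]
        ranges.append((block[0], block[-1]))
    return ranges


years = list(range(2010, 2023))
labels = [0] * 6 + [1] * 4 + [2] * 3   # toy labels, one per year
ranges = generation_ranges(years, labels)
```

If a label reappears after a gap, the function returns separate ranges for each run, which makes non-contiguous clusters easy to spot.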
4.3. Discussion
4.3.1. Time and Generation Features
In Guan’s research [30], the transportation system is considered to have evolved from ITS1.0 to ITS2.0 in 2015 and, with the development of automation and intelligence, to be evolving toward the next generation. Our results support the conclusion that 2015 is a time node for the transformation of the transportation system. The other intergeneration division node, 2019, is the time when autonomous driving systems became widely installed in vehicles and V2X started to be applied with 5G.
According to the top ten words of each cluster in Table 2, in 2010–2015 the discussion of the traffic system centers on system control and traffic planning. Electronic information technology developed rapidly, and electronic control products were mass produced during this period. The main applications are electronic automatic control of vehicles and adaptive control of signal lights, which exhibit automation features. In 2016–2019, communication, data, and information are mentioned in the clusters. In this period, the 4G network plays a prominent role in the transportation system. The Internet, cloud computing, and big data are all booming, and the transportation system is affected by them. Faster communication increases the amount of data; therefore, traffic big data is widely used. The Internet gives the traffic system the feature of network connection. From 2020 to 2022, coordination is mentioned and takes an important position, which means that V2X enters the traffic system. More advanced sensing devices enable drivers and the transportation system to access more data and information; correspondingly, better edge computing devices are needed. The transportation system exhibits the coordination feature.
4.3.2. Key Components
Two types of key components are significant in the intergeneration division: the first accounts for the largest portion of the clustering results, and the second shows the largest change in connotation between two adjacent generations.
The key components are found in the K-means clustering result. In each class of the clustering result, every component is associated with a set of words, which result from the aforementioned classification of the component word set into generations. Each word also carries a word weight, which indicates how much that word contributed to the clustering outcome. To obtain the contribution of each component to the clustering outcome, the weights of all words in that component are added. Table 3 shows the three components with the highest contribution in each generation. These components stand out because the development of transportation technology in the last decade has concentrated on the transformation of the transport vehicle, the innovation of station infrastructure, and the improvement of driver comfort. These three elements are also present to varying degrees in transportation research. Among them, current research has focused heavily on automobile technology. The scientific community and the private sector are strongly interested in automatic driving, vehicle control, vehicle-side edge computing, and changes in energy structure. Since the transport vehicle is where change in the transportation system is most visible, it has played a significant role in shaping that change. With the improvement of information technology, station infrastructure is, besides the transport vehicle, the most significant change in ITS driven by technology. In recent years, the technology of bus stations, high-speed railway stations, airports, and other infrastructure for the interaction of incoming and outgoing carriers has made tremendous progress. Additionally, new kinds of stations, such as smart parking lots and charging stations, reflect the most cutting-edge infrastructure technologies. The driver also plays a significant part in the intergeneration division, as driving tasks are changing with the development of autonomous driving.
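The contribution calculation described here, summing the word weights of each component within a cluster and ranking the components, can be sketched as follows. The normalization to shares, the function names, and the toy weights are illustrative assumptions, not values from Table 3.

```python
def component_shares(cluster_words):
    """Sum word weights per component and normalize to shares of the cluster."""
    totals = {comp: sum(words.values()) for comp, words in cluster_words.items()}
    grand_total = sum(totals.values())
    return {comp: total / grand_total for comp, total in totals.items()}


def top_k(shares, k=3):
    """Components with the k highest shares, highest first."""
    return sorted(shares, key=shares.get, reverse=True)[:k]


# Made-up word weights for the components of one generation cluster.
cluster = {
    "transport vehicle": {"control": 0.5, "vehicle": 0.5},
    "station infrastructure": {"station": 0.5, "parking": 0.25},
    "driver": {"driving": 0.5},
    "road": {"road": 0.25},
}
shares = component_shares(cluster)
key_components = top_k(shares)
```

Running the same calculation on each generation's cluster gives per-generation shares that can be compared across generations.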
As shown in Table 3 and Figure 8, the percentage of the driver component is gradually rising, which is related to the improvement of automation and of driving comfort in recent years. The current stage of vigorous development of vehicle-road collaboration and autonomous driving aims to liberate drivers, whose role in the transportation system has been changing. In addition, the driver’s experience has been significantly improved, and the level of autonomous driving is defined by the tasks that the driver still needs to perform. The generation 2020–2022 is a period of rapid development of autonomous driving, which explains the increase in the percentage of the driver component.

On the other hand, the components that change most between two adjacent generations also play a key role in the intergeneration transition, and generations may change along with these components. Table 4 shows the two components with the greatest change for each transition. This result is in line with the generation features mentioned previously. From 2010–2015 to 2016–2019, traffic managers change from manual to automatic, and passenger vehicles begin to connect to the Internet. From 2016–2019 to 2020–2022, V2X drives the updating of line devices and edge devices. Both auxiliary operating devices and roadside smart devices support the generation change.
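One simple way to rank components by how much they change between adjacent generations is to compare their contribution shares. The sketch below uses the absolute difference in share as the change measure; this measure, the function name, and the numbers are hypothetical stand-ins, since the exact measure of connotation change is not specified here.

```python
def largest_changes(prev_shares, curr_shares, k=2):
    """Components with the k largest absolute changes in share.

    Components missing from one generation are treated as having share 0.
    Ties are broken alphabetically so the result is deterministic.
    """
    components = set(prev_shares) | set(curr_shares)
    change = {
        comp: abs(curr_shares.get(comp, 0.0) - prev_shares.get(comp, 0.0))
        for comp in components
    }
    return sorted(change, key=lambda comp: (-change[comp], comp))[:k]


# Made-up shares for two adjacent generations.
gen_a = {"traffic manager": 0.10, "passenger vehicle": 0.15,
         "driver": 0.20, "road": 0.05}
gen_b = {"traffic manager": 0.30, "passenger vehicle": 0.30,
         "driver": 0.22, "road": 0.05}
changed = largest_changes(gen_a, gen_b)
```

A measure based on the words attached to each component, rather than on shares alone, would capture changes in meaning more directly; the share difference is only the simplest option.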
4.3.3. Clustering Parameters
Clustering into three classes gives the best result. To verify this, the clustering result for four classes is shown in Figure 9.

With four classes, the year division also shows clear boundaries, but a generation is defined for 2018–2020, which is difficult to explain in terms of practical cognition. Therefore, clustering into four classes is not suitable, which in turn supports the effectiveness of clustering into three classes. In addition, if we cluster into more than four or fewer than three classes, a single point becomes a class by itself, so these settings are also discarded.
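The singleton criterion used to discard cluster counts can be checked mechanically. The sketch below flags any labeling in which some cluster contains exactly one point; the labelings are toy input arranged to mirror the outcome described here, and the function names are illustrative.

```python
from collections import Counter


def has_singleton_cluster(labels):
    """True if any cluster contains exactly one point."""
    return any(count == 1 for count in Counter(labels).values())


def acceptable_cluster_counts(labelings):
    """Cluster counts whose labeling contains no singleton cluster."""
    return [k for k, labels in sorted(labelings.items())
            if not has_singleton_cluster(labels)]


# Toy labelings of 13 points for k = 2..5 (made-up assignments).
labelings = {
    2: [0] * 12 + [1],
    3: [0] * 6 + [1] * 4 + [2] * 3,
    4: [0] * 5 + [1] * 3 + [2] * 3 + [3] * 2,
    5: [0] * 5 + [1] * 3 + [2] * 2 + [3] * 2 + [4],
}
candidates = acceptable_cluster_counts(labelings)
```

This check only removes degenerate settings; choosing between the remaining candidates still relies on whether the resulting generations can be interpreted, as argued for three versus four classes.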
4.3.4. Disadvantage
Our method searches for the connections between elements in textual big data. In effect, it learns from what has been documented rather than from connections that actually exist; the real connections should be found in industrial projects. Although the method can learn implicit relationships between elements, it is hard to know whether these implicit relationships exist in the real world.
In addition, generations are actually fuzzy because engineering applications take time, usually several years. The clear boundary obtained by our method should therefore be read as a fuzzy value: the time nodes of the intergeneration division are around 2015 and 2019.
5. Conclusion
In this work, a novel intergeneration division method based on text big data is proposed, and it successfully finds the time nodes and key components of ATS intergeneration division in massive text data. Our method retrieves massive text records on ATS elements by keywords and uses LDA2vec to extract the topics of the text sets of ATS subelements, which are used to calculate the relationship between components and the other elements. Then, the word distribution and the year distribution are fused through the element relationships to obtain the word sets of the components. Finally, we improve K-means, cluster the word sets into the generations we need, and calculate the key components of each generation. The experimental results show that both the time nodes of the intergeneration division and the key components are in line with realistic expectations.
The Discussion section describes the limitations of this work. What our method learns are relationships based on textual features, and there is a gap between these relationships and reality. In addition, the boundary between generations should be fuzzy because engineering projects take time, so the clear year boundaries we obtain should be read as fuzzy borders.
In future work, we will take the actual intergeneration relationships into account when performing intergeneration division. Based on the evolution model [13], we will try to learn the features of future generations, divide possible future generations, and give planning advice for future transportation development.
Data Availability
The data used to support the findings of this study can be obtained from the corresponding author upon request at xiongch8@mail.sysu.edu.cn.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Key Research and Development Program of China under grant no. 2020YFB1600400, the National Natural Science Foundation of China under grant no. 12002403, and the Shenzhen Postdoctoral Research Fund under grant no. SZBH202101.