Abstract
As 5G and other technologies are widely adopted in the Internet of Vehicles, intrusion detection plays an increasingly important role as a vital tool for information security. However, because the structure of the Internet of Vehicles changes rapidly, the data flow is large, and the forms of intrusion are complex and diverse, traditional detection methods cannot meet the accuracy and real-time requirements and cannot be applied directly to the Internet of Vehicles. In response to these problems, a new distributed combined deep learning intrusion detection method for the Internet of Vehicles based on the Apache Spark framework is proposed. The cluster combines a deep learning convolutional neural network (CNN) and a long short-term memory (LSTM) network to extract features from large-scale Internet of Vehicles data traffic, detect intrusions, and discover abnormal behavior. The experimental results show that, compared with other existing models, the proposed model can reach 20 in the fastest time, and its accuracy is up to 99.7%, indicating a good detection effect.
1. Introduction
With the practical application of emerging technologies in the field of the Internet of Vehicles, its development has become more rapid. The Internet of Vehicles has particular characteristics: vehicles themselves are not designed with sufficient attention to network security, vehicle computing capacity is limited, the application environment is complex, distributed nodes and sensor networks are numerous, and safety requirements are extremely high. Therefore, security has increasingly become a stumbling block to the application of the Internet of Vehicles. Ensuring the security of car-road-cloud communications in the Internet of Vehicles security system and identifying various malicious attacks have become a focus of close attention for industry insiders and information security experts. Intrusion detection is a network security technology used to detect intruders and attack behavior in any communication system.
In the Internet of Vehicles, its role is to monitor and analyze network traffic, classify normal and abnormal behavior, and identify suspicious activities such as threats in the network. As an active defense technology, it has become one of the primary mechanisms for ensuring the safety of the Internet of Vehicles. The application of machine learning algorithms in traditional Internet intrusion detection systems is the current mainstream research direction. Wisanwanichthan and Thammawichai [1] apply machine learning to intrusion detection systems (IDS), using SVM and Naive Bayes algorithms together with normalization and feature reduction for analysis and comparison. However, the key disadvantage of machine learning-based intrusion detection is that it requires a lot of training time to process the large datasets of previous data streams in the network. In a network environment, deep learning technology offers good self-learning, associative memory, and high-speed optimization functions, which make it well suited to processing today's complex network traffic data, especially in the complex Internet of Vehicles environment.
At present, a great deal of research is being conducted on intrusion detection using deep learning and distributed big data technologies. Chen et al. [2] developed a hybrid deep neural network (DNN) model for classifying and detecting unknown network threats. Chen et al. [2] observe that deep learning has received a lot of attention recently, and they compared conventional techniques with new deep learning methods. Chen et al. [3] constructed an intelligent intrusion detection system using the capabilities of deep learning. Vijayanand et al. [3] presented a technique for detecting anomalous intrusions using a hybrid MLP/CNN. Parimala and Kayalvizhi [4] developed a deep learning-based technique for detecting network intrusions, in which the KDD-CUP99 dataset was examined using a BP neural network to identify the kinds of intrusion. Karatas et al. [5] developed an intrusion detection technique based on deep convolutional neural networks, which lowers the dimensionality of network data by converting it to images; the detection accuracy, false alarm rate, and detection rate are all improved through training and recognition. Shettar et al. [6] used Keras on top of TensorFlow to categorize various attacks using supervised deep learning and achieved the best accuracy with an RNN. Zhang et al. [7] implemented random forests and SVMs using the Spark framework, and other machine learning methods were evaluated and compared with deep multilayer perceptrons. These studies suggest that although deep learning algorithms are more accurate than conventional machine learning algorithms, they need more time to examine data. In traditional static networks, intrusion detection is usually classified as either host-based or network-based. In the Internet of Vehicles, intrusion detection is accomplished by filtering the data transferred between vehicles.
Because the Internet of Vehicles is also linked to the Internet or to a specialized network, traditional attack techniques are also effective against the Internet of Vehicles. They are more damaging there, which calls for more stringent intrusion detection standards. Considering the massive traffic and multidimensional complexity of the Internet of Vehicles, as well as the speed and efficiency of distributed parallel computing, this article proposes using a combined deep learning algorithm with the Spark framework [8, 9] for intrusion detection. By using the Spark architecture, the traditional deep learning algorithm is improved. Combining CNN and LSTM, Dey [10] proposed the CNN-LSTM algorithm model, which was used to analyze the NSL-KDD dataset [11] and the UNSW-NB15 dataset [12–15] in order to minimize security attacks on connected vehicles. Its primary objective is to decrease the time needed to identify attacks and to increase the accuracy of classification tasks, which makes it more appropriate for the real environment of the Internet of Vehicles. Experiments show that each indicator is improved. Researchers are also proposing various protocol schemes [16–20] to maintain the integrity, confidentiality, and security of the information shared among users and servers.
The rest of this paper is structured as follows: Section 2 describes the CNN-LSTM algorithm. The Spark framework and the NSL-KDD dataset are described in Sections 3 and 4, respectively. Result analysis is given in Section 5, followed by the conclusions in Section 6.
2. CNN-LSTM Algorithm
CNN is suitable for extracting data features; LSTM is suitable for processing time series, handling the dependencies between time-series data, and improving recognition accuracy. This paper combines the advantages of the two algorithms in the CNN-LSTM algorithm. The convolutional neural network (CNN) [21] evolved from the multilayer perceptron (MLP) [22]. Compared with traditional feature selection algorithms, it learns features better: the more traffic data a CNN learns from, the more useful features it obtains and the better the classification, which makes it suitable for large-scale network environments. As shown in Figure 1, its structure is divided into a convolution layer, a pooling layer, and a fully connected layer. The convolution layer extracts features, and the pooling layer downsamples them. Finally, the fully connected layer combines the extracted features and obtains the classification results through the classifier.
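The three layer roles described above can be illustrated with a minimal pure-Python sketch. The record values, the kernel, and the weight matrix are made-up numbers chosen only for illustration; they are not parameters from the paper, and a real model would learn them during training.

```python
def conv1d(signal, kernel):
    """Valid 1-D convolution (cross-correlation, as deep learning libraries compute it)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def max_pool1d(features, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(features[i:i + size])
            for i in range(0, len(features) - size + 1, size)]


def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: pass positives through, scale negatives by alpha."""
    return x if x > 0 else alpha * x


def dense(features, weights, bias):
    """Fully connected layer: one output per weight row, then leaky ReLU."""
    return [leaky_relu(sum(w * f for w, f in zip(row, features)) + b)
            for row, b in zip(weights, bias)]


# Hypothetical normalized traffic record with 8 features (illustrative values).
record = [0.0, 0.2, 0.9, 0.1, 0.4, 0.8, 0.3, 0.0]
kernel = [1.0, -1.0]                      # illustrative difference filter

conv_out = conv1d(record, kernel)         # feature extraction, 7 values
pooled = max_pool1d(conv_out, 2)          # downsampling, 3 values
scores = dense(pooled, [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]], [0.0, 0.0])
print(len(conv_out), len(pooled), scores)
```

The pooling step roughly halves the feature length, which is the dimensionality reduction that the pooling layer is responsible for.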

The long short-term memory (LSTM) network is an improved recurrent neural network (RNN) that aims to alleviate the gradient vanishing and explosion problems. Compared with traditional RNN units, LSTM uses a set of gate functions to control feedback, so that short-term errors are eventually discarded while persistent features are retained. The data processing flow is shown in Figure 2.

The LSTM is abstracted into four subnets (p-net, g-net, f-net, and q-net), a collection of gate controllers, and a link to the memory component. The sizes of the input and output in the figure are determined by the size of the vector x(t). The state s(t) holds the information learned up to the present time step.
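The gating behavior can be made concrete with a minimal sketch of one step of a scalar LSTM cell in its standard textbook form. The gate names (input, forget, output, candidate) and all parameter values are assumptions for illustration; the sketch does not claim a one-to-one mapping to the p-net, g-net, f-net, and q-net of Figure 2.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, s_prev, params):
    """One step of a scalar LSTM cell.

    params maps each of 'input', 'forget', 'output', 'candidate'
    to a tuple (w_x, w_h, b) of illustrative scalar weights.
    """
    def affine(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(affine("input"))        # how much new information to write
    f = sigmoid(affine("forget"))       # how much of the old state to keep
    o = sigmoid(affine("output"))       # how much of the state to expose
    c = math.tanh(affine("candidate"))  # proposed new content
    s = f * s_prev + i * c              # updated internal state s(t)
    h = o * math.tanh(s)                # cell output h(t)
    return h, s


# Hypothetical parameters: forget gate saturated open, input gate closed,
# so the previous state is carried forward almost unchanged.
keep = {"input": (0.0, 0.0, -20.0), "forget": (0.0, 0.0, 20.0),
        "output": (0.0, 0.0, 20.0), "candidate": (1.0, 0.0, 0.0)}
# Opposite setting: the old state is dropped and replaced by the candidate.
overwrite = {"input": (0.0, 0.0, 20.0), "forget": (0.0, 0.0, -20.0),
             "output": (0.0, 0.0, 20.0), "candidate": (1.0, 0.0, 0.0)}

h_keep, s_keep = lstm_step(0.5, 0.0, 1.0, keep)
h_over, s_over = lstm_step(0.5, 0.0, 1.0, overwrite)
print(s_keep, s_over)
```

With the first parameter set the state stays close to its previous value of 1.0, and with the second it is replaced by roughly tanh(0.5), which is the retain-or-forget behavior the gates provide.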
The CNN-LSTM method is capable of expressing both temporal and spatial information. Because intrusion attacks occur in real time, the methods of attack are varied, as are the targets or points of attack. A CNN is used to extract features, and high-level features can be obtained through the convolution kernel operation, which has been used successfully in image processing [23–25]. In addition, LSTM uses gate functions to regulate the remembering and forgetting of previous data, which makes it well suited to processing long sequence data and increases detection accuracy [26, 27]. As a result, the CNN-LSTM algorithm model is suitable for intrusion detection in this study. Figure 3 illustrates the CNN-LSTM algorithm model, and the specific stages are as follows:
(1) The input layer collects real-time Internet of Vehicles data through the flow data collection module. This article uses the dataset to analyze characteristics including network protocol types, network service types, network connection status, and connection time [28, 29].
(2) Following the data processing steps, the data are preprocessed, digitized, and normalized. The specific operation steps will be described in detail later.
(3) The processed data are sent to the convolution layer for feature extraction, and the features are output through a one-dimensional convolution operation. Each convolution layer is followed by a pooling layer to reduce feature dimensions, accelerate convergence, and remove redundant features to prevent network overfitting. All local features are then integrated through the fully connected layer to form an overall feature. Finally, the leaky ReLU activation function is applied in the fully connected layer [30–32].
(4) The features extracted by the CNN are input into the LSTM. After the SoftMax function, the classification result of the network data is obtained [23, 33, 34].
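The digitization and normalization in stage (2) can be sketched as follows. This is a minimal illustration that assumes one-hot encoding for categorical fields and min-max scaling for numeric fields; the paper does not specify the exact scheme, and the three records below are invented sample values that borrow NSL-KDD style field names.

```python
PROTOCOLS = ["tcp", "udp", "icmp"]      # assumed vocabulary for the example
NUMERIC = ["duration", "src_bytes"]     # assumed numeric fields


def one_hot(value, vocabulary):
    """Digitize a categorical value as a one-hot vector."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def min_max(column):
    """Scale a numeric column to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def preprocess(records):
    """Turn raw records into numeric feature rows ready for the CNN input."""
    scaled = {name: min_max([r[name] for r in records]) for name in NUMERIC}
    rows = []
    for idx, r in enumerate(records):
        row = one_hot(r["protocol_type"], PROTOCOLS)
        row += [scaled[name][idx] for name in NUMERIC]
        rows.append(row)
    return rows


# Invented sample records, for illustration only.
records = [
    {"protocol_type": "tcp", "duration": 0.0, "src_bytes": 181.0},
    {"protocol_type": "udp", "duration": 2.0, "src_bytes": 105.0},
    {"protocol_type": "icmp", "duration": 10.0, "src_bytes": 1032.0},
]

features = preprocess(records)
for row in features:
    print(row)
```

Scaling every numeric field to the same range keeps fields with large raw values, such as byte counts, from dominating the convolution output.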

3. Spark Framework
To enhance detection efficiency, this study makes use of the Apache Spark framework, a big data processing platform focused on speed, ease of use, and sophisticated analysis. It was created in 2009 [8–10] at the University of California, Berkeley, and later became an Apache open-source project. In comparison with other big data technologies such as Hadoop, Storm, and MapReduce, Spark offers the following advantages [35–37]:
(a) Spark offers a consistent and comprehensive framework for handling diverse datasets and data sources (batch or real-time streaming data) with varying characteristics (text data, chart data, etc.) [38, 39].
(b) Spark can run Hadoop cluster applications up to 100 times faster in memory and up to 10 times faster on disk [14, 40].
(c) Compared with MapReduce, Spark performs data calculations faster and offers more robust functions [25, 41].
Spark is useful when the quantity of data to be processed exceeds the capacity of a single machine (for example, a computer with 4 GB of memory must process more than 100 GB of data), or when the amount of data is small but the calculation is complex and time-consuming. In both cases, the Spark cluster can use its massive computational capabilities to perform the analysis in an organized fashion. The schematic design of the architecture is shown in Figure 4.

Using the Spark distributed open-source framework, the experimental PCs are connected to form a master-slave control structure. The master node performs task scheduling, distribution, and fault tolerance for the slave nodes, and the slave nodes carry out the parallel computing. This structure has been shown to provide a distributed design with high reliability, high concurrency, and high-performance computing capabilities. The HDFS storage system of the nodes is then used to store the data, and the combined deep learning algorithm is used for intrusion detection.
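The master-slave pattern can be illustrated without a cluster. The sketch below is a single-machine stand-in that uses Python's standard `concurrent.futures` thread pool rather than Spark itself: the "master" function splits the data into partitions and collects partial results, and the "worker" function processes one partition. The anomaly scores and the 0.5 threshold are hypothetical values, not outputs of the paper's model.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, n_parts):
    """Master side: cut the dataset into roughly equal partitions."""
    return [records[i::n_parts] for i in range(n_parts)]


def score_partition(partition, threshold):
    """Worker side: count records whose anomaly score exceeds the threshold."""
    return sum(1 for score in partition if score > threshold)


def distributed_count(records, n_workers=4, threshold=0.5):
    """Distribute partitions to workers, then reduce the partial counts."""
    partitions = split(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = list(pool.map(lambda p: score_partition(p, threshold),
                                partitions))
    return sum(partial)


# Hypothetical anomaly scores 0.00, 0.01, ..., 0.99.
scores = [i / 100 for i in range(100)]
total = distributed_count(scores)
print(total)
```

In a real Spark deployment the partitioning, scheduling, and fault tolerance are handled by the framework, and the per-partition function would run the trained detection model instead of a threshold test.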
4. NSL-KDD Dataset
In contrast to a conventional network, the heterogeneous communication network created by vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communication is formed by the self-organization of vehicle nodes. Vehicle movement, fast channel fading, a strong Doppler effect, and rapid network topology changes are all characteristic of this network. However, the attack techniques used against the Internet of Vehicles during communication are very similar to those used against conventional networks, including backdoor attacks and denial of service attacks. To evaluate the proposed Spark-based distributed combined deep learning intrusion detection method, the proposed deep learning algorithm is applied to two intrusion detection benchmark datasets, namely, NSL-KDD [11] and UNSW-NB15 [12], in order to develop an effective intrusion detection system for the external communication of the Internet of Vehicles. The experimental dataset contains a total of 21,473 training records and 51,025 test records.
The NSL-KDD dataset is a refinement of the KDD CUP 99 dataset [12]. It eliminates redundant records from the KDD CUP 99 dataset and thereby reduces the bias of classifiers toward frequently repeated records. Compared with KDD CUP 99, classification on NSL-KDD provides comparable or better accuracy. As a result, it is widely regarded as one of the most effective datasets for intrusion detection studies. The attacks in the dataset are classified into four groups:
(1) Denial of service (DoS): the intruder sends many malicious requests to the server, causing the machine's memory and computational resources to become too full or too busy to handle genuine traffic, thus denying service to regular users.
(2) User-to-root (U2R): the attacker tries to acquire administrator privileges starting from regular user access.
(3) Remote-to-local (R2L): the attacker sends data to a machine over the network in order to obtain unauthorized access to that machine.
(4) Probing attack (Probe): the network is scanned to obtain detailed information about the user's device.
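In practice, the raw NSL-KDD labels are individual attack names that have to be folded into these four groups before training. The sketch below shows one way to do this; the mapping covers commonly cited attack names from the NSL-KDD training set and is not claimed to be complete, and the `Unknown` fallback and the sample label list are assumptions for illustration.

```python
from collections import Counter

# Partial mapping from attack name to category (training-set names only).
ATTACK_CATEGORY = {
    "neptune": "DoS", "smurf": "DoS", "back": "DoS",
    "teardrop": "DoS", "pod": "DoS", "land": "DoS",
    "buffer_overflow": "U2R", "rootkit": "U2R",
    "loadmodule": "U2R", "perl": "U2R",
    "guess_passwd": "R2L", "ftp_write": "R2L", "imap": "R2L",
    "phf": "R2L", "multihop": "R2L", "warezmaster": "R2L",
    "warezclient": "R2L", "spy": "R2L",
    "ipsweep": "Probe", "nmap": "Probe",
    "portsweep": "Probe", "satan": "Probe",
}


def categorize(label):
    """Map a raw label to Normal, DoS, U2R, R2L, Probe, or Unknown."""
    if label == "normal":
        return "Normal"
    return ATTACK_CATEGORY.get(label, "Unknown")


# Invented sample of raw labels, for illustration only.
sample_labels = ["normal", "neptune", "smurf", "nmap", "rootkit",
                 "guess_passwd", "normal", "some_new_attack"]

category_counts = Counter(categorize(label) for label in sample_labels)
print(category_counts)
```

Keeping an explicit fallback category makes attack names that appear only in the test set visible instead of silently merging them into another class.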
In addition, the UNSW-NB15 dataset contains 49 features, which describe the traffic between hosts and the network packets and are used to distinguish normal from abnormal observations. Compared with other datasets, it contains both real-scene data and synthetic data. The variety of attack behavior and the complexity of UNSW-NB15 mean that the dataset is valid and reliable.
5. Result Analysis
The CNN-LSTM algorithm is compared with the SVM, RNN, CNN, and LSTM algorithms in terms of the accuracy rate (AC) and false alarm rate (FPR) for different attack types. The CNN-LSTM method has a high classification detection rate and a lower false alarm rate than the other algorithms.
To verify the overall effectiveness of the method, this paper uses the two datasets NSL-KDD and UNSW-NB15 to compare the accuracy rate (AC) and false alarm rate (FPR) of the above five algorithms. The experimental results are shown in Figures 5 and 6.
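The paper does not spell out how the two metrics are computed. The sketch below uses the standard binary definitions, AC = (TP + TN) / (TP + TN + FP + FN) and FPR = FP / (FP + TN), with attack traffic as the positive class; these definitions and the small label vectors are assumptions for illustration, not the paper's data.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) with 1 = attack and 0 = normal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Share of all records that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def false_positive_rate(tp, tn, fp, fn):
    """Share of normal records that were wrongly flagged as attacks."""
    return fp / (fp + tn)


# Invented labels: 3 attacks and 5 normal records.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
ac = accuracy(*counts)
fpr = false_positive_rate(*counts)
print(counts, ac, fpr)
```

Reporting both values matters because a detector can reach high accuracy on imbalanced traffic while still raising many false alarms on normal records.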


It can be seen from Figures 5 and 6 that CNN-LSTM performs well on the NSL-KDD and UNSW-NB15 datasets, reaching accuracy rates of 99.7% and 99.4%, respectively. It also has the lowest false alarm rates on the two datasets, 24% and 2.17%, respectively. Therefore, this algorithm performs better than similar algorithms.


All the deep learning algorithms discussed in this article are implemented in a distributed manner on Apache Spark. The experimental results are shown in Figures 7 and 8. Compared with traditional non-parallel machine learning and deep learning algorithms, the training and testing times are significantly shortened. Furthermore, the experimental results show that the CNN-LSTM algorithm has the shortest training time and test time.
6. Conclusion
To address the slow detection speed and low detection efficiency of intrusion detection systems on big data, this paper takes full account of the advantages of distributed frameworks and deep learning algorithms and combines a distributed architecture with the deep learning CNN-LSTM algorithm. After the data are standardized through preprocessing and other methods, detection efficiency and detection time are improved. Experimental verification on the NSL-KDD and UNSW-NB15 datasets shows that, compared with other deep learning algorithms, the CNN-LSTM algorithm running on the Spark framework reduces training time and test time and improves the detection rate. It can therefore meet the real-time requirements of intrusion detection and better satisfies the practical needs of the Internet of Vehicles. In future work, we will continue to improve intrusion detection performance and reduce detection time, focus further on the detection capabilities of deep learning algorithms on distributed platforms, and explore suitable distributed deep learning algorithms for Internet of Vehicles information security, so that the network data traffic of the Internet of Vehicles can be handled more efficiently and the adaptability of the algorithm is enhanced [42].
Data Availability
The data used to support the findings of this study are available from the author upon request (kusumasyadav0@gmail.com).
Conflicts of Interest
The authors declare that they have no conflicts of interest.