Abstract

Intelligent transportation systems (ITS) are advanced applications that make transportation safer and more intelligent. The dynamic and diverse nature of these systems creates many challenges for ITS deployment and security. Recent progress in machine learning offers potentially strong methods for exploiting data from transportation networks. Machine learning-based approaches promise to address various network challenges because they can adapt to changing network topology and scale. This paper investigates the application of machine-learning models to jamming detection by analyzing how the approach works on individual observations of vehicles in different scenarios. We propose a machine learning-based approach that explores hidden rules of how the observation changes under a reactive jamming attack. The proposed approach improves detection accuracy compared to existing methods, and its performance depends closely on dataset selection. Though we evaluate our approach with synthetic data, the data are generated with justifiable simulations calibrated to match a real reference dataset. Our analysis connects the machine-learning domain and vehicular networks by utilizing domain-specific techniques for jamming detection.

1. Introduction

Intelligent transportation systems include an extensive range of applications, with the common aim of making transportation safer and more convenient [1]. These applications are enabled by vehicular networks, which provide vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communication [2]. Vehicles and roadside infrastructure exchange application-specific data and essential safety-related information, such as the acceleration of vehicles and hazardous incident warnings. To keep information fresh and compact, basic safety messages (BSMs), which carry this safety-related information, must be broadcast periodically every 100 ms [3] with minimal message length. This characteristic makes BSMs vulnerable to wireless attacks, especially jamming attacks, which are simple to implement and operate.

Along with their development, vehicular networks face a wide range of security attacks at different layers, from the application layer down to the physical layer [4]. At the application layer, an attacker can jeopardize the safety of neighboring vehicles by injecting malicious information or spoofing coordinates or speed [5]. The attacker may also exploit routing protocols to manipulate network traffic, for example by redirecting (wormhole attacks), relaying, or dropping incoming packets without forwarding them. Attacks may also occur at the physical and medium access control layers, such as interference and jamming attacks [6]. By broadcasting a meaningless radio signal on the physical channel, the attacker blocks any communication within its transmission range.

The simplicity and significant impact of jamming attacks in wireless networks make them an interesting research topic [7]. According to the attack strategy, jamming attacks can be classified into three types: constant, reactive, and random jamming. Constant and random jamming attacks broadcast radio signals continually or randomly without regard to communication rules. Therefore, they are evident and easy to detect. Reactive jamming, which is our concern, is the most dangerous and the most challenging to detect among the three types. It blends with legitimate communication activities by broadcasting a jamming signal only when a transmission is present on the physical channel. This attack strategy effectively prevents the exchange of information. Any missing or obsolete safety-related information may lead to severe consequences.

The features of a vehicular environment, such as the large-scale and frequently changing topology, the vulnerability of BSMs, and the stealth of reactive jamming attacks, together create challenges in detecting this kind of attack on safety applications in vehicular networks. Since BSMs are vital for safety applications, vehicles should communicate directly with each other and with the infrastructure to keep their view of surrounding conditions up to date. This communication has to deal with the inherent characteristics of radio channels, highly dynamic operating environments, the lack of centralized management, the requirement of highly reliable but low-latency communication, and large-scale networks [8]. In an exposed radio channel, message transmissions are easily tracked and intercepted by attackers.

Besides, vehicles may move at different speeds and along different trajectories while operating in ad hoc mode (direct communication). Since no vehicle has knowledge of the whole network, each has little information with which to detect jamming by itself. A vehicle can exchange its observations with other vehicles to assist detection; however, this coordination takes time, which may violate the latency constraints of safety applications.

Although there have been notable studies on detecting jamming attacks in wireless networks, they consider various applications and scenarios. In particular, jamming detection methods that comprehensively match the above problem are few. Primarily, messages supporting safety applications are supposed to be transmitted only on the control channel and renewed at short intervals (100 ms as suggested by [3]). This multichannel operation is specified in the IEEE 1609 standards for wireless access in vehicular environments (WAVE). Vehicles alternate between one control channel (CCH) and any of the service channels (SCHs). The IEEE 802.11p standard is adopted for medium access in vehicular networks. Vehicles follow the CSMA/CA protocol; hence, they can observe the medium and collect information such as the number of successfully received packets, the number of failed transmissions, and the number of neighbor vehicles. The observation process is depicted in Figure 1.

Considering jamming attacks on BSMs, which must be exchanged only on the CCH under multichannel operation, two detection approaches have been introduced in the literature: model-based and data-driven approaches [9]. Although machine-learning techniques were applied early in data-driven approaches, the cases studied are limited to the scenario of two transceivers because of the shortage of vehicular data. In safety applications, however, every node works in broadcast mode, and vehicle mobility cannot be ignored. This leads to a large diversity in data properties.

In this paper, we answer two questions: Can an individual vehicle detect a jamming attack based on its medium access observations with the assistance of a pretrained machine-learning model? How do data impact the proposed machine learning-based approach for jamming detection? Our contributions are twofold:
(i) We design an expandable simulation model with real-data validation to generate three justifiable datasets.
(ii) We propose a machine learning-based approach to detect jamming attacks, specifically on BSMs in vehicular networks. The detection is made based on medium access observations at the individual vehicle. The approach is evaluated and analyzed with different training datasets.

The paper is organized as follows: The threat of jamming attacks on safety applications in vehicular networks and the detection approaches are briefly described in Section 1. Section 2 discusses related works on jamming detection methods, efforts to collect data, and the application of machine learning in vehicular networks. The design and validation of the simulation model and our data generation are described in Section 3. Our proposed machine learning-based method is elaborated in Section 4. Section 5 is reserved for performance evaluation and analysis of our machine-learning method. Section 6 concludes the paper.

2. Related Works

Model-based approaches detect jamming attacks based on communication rules in the network. A rule violation may lead to unreasonable values of metrics such as the packet delivery ratio [10], the number of messages forwarded by neighbor vehicles [11], or an index computed from medium access observations. Among model-based approaches, the most relevant research [8] proposes a real-time detection method that achieves a high detection probability in different scenarios. However, the method is susceptible to wrong estimation of its neighborhood. The number of surrounding vehicles, called neighbors, is prone to being miscounted due to vehicle movement and the unreliable communication environment.

Without prior network knowledge, data-driven approaches use machine-learning and data-mining techniques to analyze collected data from networks. They promise to solve the problem comprehensively in a dynamic vehicular environment. However, because of difficulties in collecting data, research results on machine learning-based jamming detection methods are so far very few and specific to certain scenarios. An experimental evaluation is performed in [12] for the scenario of jamming attacks on one communication link between two vehicles. Results from analysis in [9] are for the scenario of a platoon. Advantages and disadvantages of the related works belonging to the two jamming detection approaches are discussed in Table 1.

The evolution of machine-learning techniques within the last decade has brought powerful tools for extracting knowledge from various data sources. However, it is not easy to verify whether a machine-learning solution works, as a machine-learning model's performance depends on the data [14]. Although studies applying the machine-learning approach to vehicular network problems show fruitful results [9], most of them use synthetic data due to the shortage of real datasets [15]. At the same time, the synthetic datasets are described only briefly, and no justification is given for their reliability.

The scale of vehicular networks requires huge resources, including time, equipment, and workforce, to collect data, especially the diversity of experiment scenarios (urban, highway environment, various trajectories, etc.). Therefore, researchers have turned to simulated data as a substitute. However, simulation is limited in describing real networks as not everything is modeled in the simulation. That leads to a gap between the real data and simulated data. It is important to validate the simulated data with reference to the real one.

Due to the lack of available datasets, researchers have generated their own datasets through experiments and/or simulations. Real data can be collected via experiments on real communication systems or testbeds, while synthetic data can be collected from simulations. Of the two types, real datasets are far fewer than synthetic ones because of the hardware requirement. According to the statistical analysis in [16], real datasets account for only a small share of the total number of vehicular datasets.

Moreover, datasets are collected and generated with different aims such as performance evaluation [17], enhancement [18], resource management [14, 19, 20], or security [21, 22]. Data attributes range from network traffic patterns collected at the application layer and traffic routing information from the network layer [23] to transmission monitoring collected at the physical layer [13]. Despite the wide range of vehicular datasets, relevant ones specifically about jamming attacks in vehicular networks are few [10, 24]. Table 2 compares the two most relevant published datasets with our dataset. The two published datasets provide observations at the physical layer, including transmission information (power, the time stamps of sending and receiving, packet size, etc.) and geographical information. UCI [13] demonstrates V2R communication, while VANET2014 [10] focuses on V2V mode between one transmitter and one receiver. We are not aware of any dataset on medium access observations under jamming attacks in a vehicular environment where many vehicles have to communicate frequently. Another issue in data collection for vehicular networks is how to validate synthetic datasets and justify their reliability.

Machine learning can approach tasks that are too difficult to solve with fixed programs. Classification is the most common machine-learning task and can be performed on many types of data. In this task, the algorithm specifies which category an input belongs to. There are many variants of classification, which can be grouped into two types: binary classification (only two possible outcomes) and multiclass classification (more than two possible outcomes) [25]. The problem of jamming detection can be treated as a binary classification in which the input is the individual medium access observation, and the output is a decision on whether the vehicle is under a jamming attack or not.

Applying machine-learning algorithms to jamming detection has been studied from different aspects, with different data, and in specific scenarios [26]. Arjoune et al. [27] generated physical-signal data with features including received signal strength (RSS), bad packet ratio, packet delivery ratio, and clear channel assessment. They compared the performance of three machine-learning classification algorithms on their dataset: random forest, SVM, and neural network. Although good results are obtained for those models after many rounds of training, testing, and fine-tuning, their dataset is not analyzed. Moreover, the scenario they consider is general wireless communication, and the use case of their approach is not specified. Upadhyaya et al. [28] applied several machine-learning techniques to a real network dataset of RSS collected from IoT nodes within an indoor office building. The performance of the method is measured by commonly used metrics such as accuracy, false negatives, false positives, and F-measure. However, their application is specific to the given dataset, and the data are not analyzed to show their characteristics. The selection of training data and classification algorithms appears arbitrary. In [29], a machine-learning approach is proposed to detect and classify jamming attacks on unmanned aerial vehicles (UAVs). Many experiments with different classification algorithms are conducted in that work, but the study is specific to jamming attacks on unicast communication between a fixed transmitter and a drone. Krayani et al. [30] proposed a method to detect a combined attack of a spoofer and a jammer based on physical-layer data, including GPS and received signal characterization. They considered the presence of roadside units (RSUs), which have advantages in collecting information: RSUs can monitor the GPS and received signals of their attached vehicles.

Benslimane and Nguyen-Minh [8] proposed a jamming detection method built on a rule describing how the medium access observation changes under a reactive attack. The rule theoretically helps detect jamming with high probability, as it holds in the majority of observations. However, the performance of the method is capped, because there is a minority of cases in which the rule cannot be applied. A machine-learning approach may discover further hidden rules and thereby exceed this cap.

Data analysis is essential in the study of machine-learning approaches, as it helps ensure that the data used to train models are relevant and properly processed for optimal performance. Besides, most works focus on machine-learning algorithms without considering conventional approaches such as model-based ones. Hence, no reference value exists against which to evaluate a newly proposed machine-learning approach. Considering the above open issues, we design a simulation model of reactive jamming attacks and generate three datasets with different characteristics. The data generator is validated by measuring the correlation between the generated data and a real dataset. It is verifiable and can be reused by further studies on jamming attacks. In this work, we propose to apply the machine learning-based approach to detect jamming on broadcast communication among vehicles, based on medium access observations at an individual vehicle. Characteristics of the datasets are analyzed to predict and explain the performance of the proposed approach with different training data selections. Moreover, our proposed method is compared to the existing model-based method in the platoon application.

3. Reactive Jamming Model Validation and Datasets

There is always a gap between real and simulated data. Real data collected in vehicular networks are limited because large-scale data collection is difficult, while the reliability of synthetic data is hard to evaluate. Specifically, real data on jamming attacks in vehicular networks are scarce. For this reason, we design and validate a reactive jamming model in the network simulator NS-3 to generate justifiable synthetic data that contribute to research on jamming attacks in vehicular networks. In particular, we consider the same data collected at the MAC and physical layers as in our previous work [8], with a novel approach that yields higher performance.

3.1. Reactive Jamming Model

Reactive jamming is activated only when the jammer detects radio activity on the operating channel. Therefore, it blends easily with legitimate transmissions. Whenever there is radio activity, the jammer broadcasts a jamming signal with a certain probability; i.e., legitimate transmissions are only partly attacked. Reactive jamming can be considered a protocol-aware attack as it senses and obeys medium access protocols. Although it has been widely studied in wireless networks [10, 31–33], its use cases are primarily unicast (communication between only two devices). In vehicular networks, the broadcasting mode is more common. Vehicles frequently exchange beacons, which carry instantaneous information about the sender, to support safety applications. A recent reactive jamming model specific to broadcasting was proposed in our previous work [8]. In this work, we reconstruct the analytical model in detail as an abstraction in a network simulator. This allows us to validate existing works and generate more synthetic data for further research (the source code is available on request: nguyen-minh.huong@usth.edu.vn).

We assume that each vehicle broadcasts one fixed-length beacon in each CCH interval (CCHI). All vehicles follow the medium access control protocol specified in the IEEE 802.11p standard. The jamming attacker is a transceiver with higher sensitivity than other vehicles and can adjust its transmitting power level. Its operating algorithm is elaborated in Algorithm 1.

if a transmission is sensed in the medium then
    if the jammer is not transmitting then
        start broadcasting the jamming signal after a predefined waiting time;
    else
        ignore;
    end if
end if

The transmission time of the jamming signal and waiting time before transmitting define a reactive jamming type. They have a different impact on the reception of the legitimate signal at vehicles in attacked vicinity [10].
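The reaction logic of Algorithm 1 can be illustrated with a short Python sketch. It is not the NS-3 model used in this work: the class name, field names, and numeric values are hypothetical, time is in arbitrary units, and the sketch makes the simplifying assumption that the jammer counts as busy from the moment it decides to jam until the end of the burst.

```python
import random
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class ReactiveJammer:
    """Illustrative reactive jammer (all names and values are hypothetical)."""
    wait_time: float              # predefined waiting time before the burst
    jam_duration: float           # transmission time of one jamming burst
    jam_probability: float = 1.0  # chance of reacting to a sensed transmission
    rng: random.Random = field(default_factory=lambda: random.Random(0))
    busy_until: float = float("-inf")  # end of the pending or ongoing burst

    def on_transmission_sensed(self, now: float) -> Optional[Tuple[float, float]]:
        """Return (start, end) of a new jamming burst, or None if ignored."""
        if now < self.busy_until:
            # A burst is pending or in progress: ignore (else branch).
            return None
        if self.rng.random() >= self.jam_probability:
            # The jammer chooses not to attack this transmission.
            return None
        start = now + self.wait_time
        self.busy_until = start + self.jam_duration
        return (start, self.busy_until)


jammer = ReactiveJammer(wait_time=1.0, jam_duration=2.0)
bursts = [jammer.on_transmission_sensed(t) for t in (0.0, 2.0, 3.0)]
print(bursts)  # [(1.0, 3.0), None, (4.0, 6.0)]
```

The pair `wait_time` and `jam_duration` plays the role of the two parameters that define a reactive jamming type; varying them changes which part of a legitimate frame overlaps with the jamming burst.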

3.2. Model Validation

We validate our jamming simulation model by comparing the simulated data with real data under the same scenario. The comparison assists us in calibrating our simulation model in order to make the simulation as realistic as possible. The calibrated model can be extended to simulate more complicated scenarios that are worthwhile for further investigation.

The real datasets for [10] were collected in an open field rural area located in the periphery of Aachen in Germany. The area has two perpendicular roads, the 600 m main road, and a 120 m long side road. The jammer is placed at the crossroad between the main and side roads. Communicating vehicles move at a constant speed of along the main road, come closer then leave the jamming impacted area. The transmitter sends packets to the receiver at a rate of 100 per second over 2 minutes. This procedure is repeated for different scenarios. The closest scenario to our study is the experiment with the presence of a reactive jammer with a signal length of and reaction delays of 12 . We set up a simulation scenario similar to the experiment.

First, we set up a simulation with the same parameters as the real dataset description and captured how the packet delivery rate (PDR), computed over consecutive fixed-length intervals, varies with simulation time. The simulated PDR follows the same trend as the real one, as illustrated in Figure 2. However, the simulation and the real data differ in the length of the blackout area in which communication is blocked, because jamming blocks communication for different durations in the two datasets.

If we consider the PDR over time as a random variable, the similarity between the simulated PDR (denoted by $P_s$) and the real PDR (denoted by $P_r$) can be estimated by the correlation coefficient between them. Taking into account the constraints of IEEE 802.11p devices [8], we calibrate the physical parameters in the simulation: the transmitting power and sensitivity of the jammer and of the vehicles. Each parameter set yields one simulated dataset. A sliding window algorithm is applied to find the reference time period in which the simulated and real PDR are most correlated. This reference period coincides with the blocking time interval and the blackout area.

The correlation coefficient, denoted as $\rho$, is defined in equation (1), where $\mu_s$, $\mu_r$ and $\sigma_s$, $\sigma_r$ are the means and standard deviations of the simulated and real PDR, respectively, and $E[\cdot]$ is the expected value operator:

$$\rho = \frac{E\left[(P_s - \mu_s)(P_r - \mu_r)\right]}{\sigma_s \sigma_r} \tag{1}$$

Calibration continues until the correlation coefficient reaches its maximum over the reference time period. The PDR curves after calibration are shown in Figure 3. The blocking time caused by jamming attacks now matches between the simulation and the real data, and the correlation coefficient between the two PDR series reaches its maximum of 0.7871. The corresponding parameter set after calibration, which gives the highest similarity between the two datasets, is described in Table 3.
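As an illustration of the validation step, the sketch below computes the Pearson correlation coefficient of two PDR series and slides a fixed-width window to find where they agree best. It is a minimal sketch under stated assumptions: the two series are taken to be sampled at the same instants, the function names and the window width are hypothetical, and the exact sliding-window procedure used for the calibration may differ.

```python
from math import sqrt
from typing import Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        # A constant window has no defined correlation; treated as 0 here.
        return 0.0
    return cov / sqrt(vx * vy)


def best_window(sim: Sequence[float], real: Sequence[float],
                width: int) -> Tuple[int, float]:
    """Return (start index, correlation) of the most correlated window."""
    best = (0, -1.0)
    for start in range(min(len(sim), len(real)) - width + 1):
        r = pearson(sim[start:start + width], real[start:start + width])
        if r > best[1]:
            best = (start, r)
    return best


# Toy PDR-like series (made-up numbers, not taken from any dataset).
sim_pdr = [0.0, 0.0, 1.0, 2.0, 3.0, 0.0]
real_pdr = [5.0, 5.0, 2.0, 4.0, 6.0, 5.0]
start, rho = best_window(sim_pdr, real_pdr, width=3)
```

In a calibration loop, `best_window` would be evaluated once per candidate parameter set, and the set giving the largest coefficient would be retained.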

3.3. Data Generation

Vehicles travel with different trajectories and speeds on the road. This creates a variety of communication scenarios. From the viewpoint of model-based jamming detection methods, the more slowly the network topology changes, the easier it is to detect an abnormality. Therefore, these methods perform best in platoon scenarios, where vehicles travel in line and close together. They are susceptible to wrong estimation of the surrounding environment, such as the number of neighbor nodes.

In this work, we run the calibrated simulation model described above to generate three datasets corresponding to three commonly considered scenarios in vehicular networks: one-hop communication, platoon, and cluster. Each scenario is defined by how vehicles move in the network (the mobility model). A safety application is served, and each vehicle's observation of the access medium is collected. The one-hop communication scenario represents the case in which every vehicle is within the transmission range of every other vehicle and of the jammer. The consequence of the attack is always visible, as this case is ideal for jamming attacks. The platoon scenario mimics the real platoon application in ITS, where vehicles travel close to each other at nearly identical speeds. Platoon applications promise many benefits for transportation management, as traveling in platoons increases the capacity of roads [34]. Platooning is a key to the realization of an automated highway system. Without platooning protocols, a compact platoon of vehicles is hard to form and restrictive. In cluster scenarios, vehicles may travel in groups but do not follow each other. This situation is common in reality, where vehicles can be distributed randomly within a considered transmission range. A reactive jammer is randomly located on the considered road segment. The number of vehicles joining the network varies from 5 to 20.

In each scenario, medium access information is observed and collected at each vehicle. It includes the number of successfully received packets, the number of failed transmissions on the medium, the number of packets the given vehicle transmits in each CCHI, and the estimated number of neighbor vehicles. The observation in each CCHI forms one sample in the dataset. Corresponding to the three scenarios, we generated three datasets, which are published on IEEE Dataport (https://doi.org/10.21227/yvxd-mf03). Each dataset includes two subdatasets for the same scenario, one with and one without attacks.

4. Machine Learning-Based Jamming Detection Approach

In this work, we propose to apply machine-learning techniques to exploit medium access observations at an individual vehicle. The results help vehicles detect jamming attacks without any coordination. From the viewpoint of the network layering model, various types of information are available at different layers. We focus on the medium access control (MAC) and physical layers, that is, on what a vehicle can observe by monitoring the medium status, including idle and busy states, successful and failed receptions of packets, and its own transmissions. When there are many vehicles in the network, a failed packet reception can result either from a transmission collision, which is normal, or from a jamming attack. We formalize the problem of jamming detection as the task of classifying network activities as normal or as a jamming attack.

We aim to overcome two shortcomings of our previous work [8]. First, the jamming detection method in [8] is highly sensitive to the estimated number of neighbors, which is prone to being wrongly observed at individual vehicles. Second, the analysis in that work shows a theoretical limit on the probability of detecting reactive jamming attacks, even if detection is assumed to be performed by a central detector (a device dedicated to observing medium access). Our proposed machine learning-based method is expected to work for individual vehicles and not to rely heavily on the estimated number of neighbors. The method is fed with data that a vehicle can observe and estimate, which may include wrong estimates due to the dynamic network topology.

4.1. Machine-Learning Model Representation

As described above, we study the machine-learning model on three synthetic and validated datasets corresponding to three typical vehicular communication scenarios. Each dataset comprises approximately 15,000 to 28,000 samples, which are the MAC observations in each CCHI of all vehicles in the network while they travel along a road segment. Each sample includes the six features of greatest interest: the number of successfully received packets, the number of failed transmissions on the medium, the number of packets received with errors, the number of packets transmitted by the given vehicle, the time step, and the estimated number of neighbor vehicles. Besides, there is other information such as the node ID of the observing vehicle, the system time, the idle time computed in slot times according to the IEEE 802.11p standard, and monitoring of the medium state. This extra information may be useful for further investigation.
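To make the sample layout concrete, the following sketch models one observation as a small Python record with the six features listed above. The field names are illustrative and are not the column names of the published datasets; the label convention (1 for a vehicle within the sensing range of the activated jammer, 0 otherwise) is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MacObservation:
    """One sample: what a vehicle observes during one CCH interval.

    Field names are hypothetical, chosen only for this illustration.
    """
    n_success: int    # successfully received packets
    n_failed: int     # failed transmissions seen on the medium
    n_error: int      # packets received with errors
    n_sent: int       # packets sent by the observing vehicle
    time_step: int    # index of the CCH interval
    n_neighbors: int  # estimated number of neighbor vehicles
    label: int        # assumed convention: 1 = under attack, 0 = normal

    def features(self) -> List[float]:
        """Return the six-dimensional feature vector fed to a classifier."""
        return [float(self.n_success), float(self.n_failed),
                float(self.n_error), float(self.n_sent),
                float(self.time_step), float(self.n_neighbors)]


# Made-up values, not taken from the datasets.
sample = MacObservation(n_success=8, n_failed=2, n_error=1, n_sent=1,
                        time_step=42, n_neighbors=10, label=1)
vector = sample.features()
```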

Treating jamming detection as a classification problem, we start with classical machine-learning classification algorithms such as SVM and Adaboost. The output of the models is a decision on whether the vehicle is under a jamming attack or not. Samples in the datasets are labeled 0 or 1 according to whether the given vehicle is, at that instant, within the sensing range of the activated jammer.

Besides SVM and Adaboost, several classical binary classifiers may also be suitable for our case: logistic regression, decision tree, and random forest. SVM and logistic regression share the idea of separating data points into two classes. Adaboost and random forest are ensemble-learning techniques that combine multiple decision trees to strengthen the algorithm. Hence, there are two groups of algorithms that we can compare. In the scope of this work, we aim to create a proof of concept for applying machine learning to jamming detection in vehicular networks. For this reason, we initially implemented and tested only SVM and Adaboost.

The SVM algorithm represents the dataset samples, each having n features (n = 6 in our datasets), as points in an n-dimensional space that are separated into classes (2 classes in our case) by a hyperplane [35]. It is one of the most influential approaches to supervised learning in classification problems. Before SVM was proposed, decision trees had been commonly applied for classification. A decision tree classifies dataset samples by building a logical tree with various conditions. After several branches, the dataset is divided into 2 classes (binary classification). A single decision tree may not be able to predict the class of an object accurately, but multiple trees, each progressively learning from the mistakes of the previous ones, can form a robust model. This is the idea of the Adaboost algorithm [36].
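The boosting idea can be sketched in a few lines of plain Python. The code below is a textbook discrete AdaBoost with one-level decision trees (stumps), written for illustration only; it is not the implementation used in our experiments, and the function names, the label encoding (+1 and -1), and the toy data are assumptions of this example.

```python
from math import exp, log
from typing import List, Sequence, Tuple

Stump = Tuple[int, float, int]  # (feature index, threshold, polarity)


def stump_predict(stump: Stump, x: Sequence[float]) -> int:
    """Predict +1 or -1 by thresholding a single feature."""
    feature, threshold, polarity = stump
    return polarity if x[feature] > threshold else -polarity


def best_stump(X: Sequence[Sequence[float]], y: Sequence[int],
               w: Sequence[float]) -> Tuple[Stump, float]:
    """Exhaustively pick the stump with the lowest weighted error."""
    best, best_err = (0, 0.0, 1), float("inf")
    for f in range(len(X[0])):
        for threshold in sorted({row[f] for row in X}):
            for polarity in (1, -1):
                stump = (f, threshold, polarity)
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(stump, xi) != yi)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err


def adaboost_fit(X: Sequence[Sequence[float]], y: Sequence[int],
                 rounds: int = 10) -> List[Tuple[float, Stump]]:
    """Train discrete AdaBoost; labels in y must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n
    model: List[Tuple[float, Stump]] = []
    for _ in range(rounds):
        stump, err = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * log((1.0 - err) / err)
        model.append((alpha, stump))
        # Raise the weight of misclassified samples and lower the others,
        # so that the next stump focuses on the current mistakes.
        w = [wi * exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def adaboost_predict(model: List[Tuple[float, Stump]],
                     x: Sequence[float]) -> int:
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in model)
    return 1 if score >= 0 else -1


# Toy one-feature data that no single stump can separate.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost_fit(X, y, rounds=3)
predictions = [adaboost_predict(model, x) for x in X]
```

On this toy data, the best single stump still misclassifies two of the six points, whereas the weighted vote of three stumps classifies all of them correctly.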

As the data already have a vector format, they conform better to SVM, which is originally a vector classification algorithm. Indeed, the experimental results show better performance with SVM than with Adaboost for specific training data selections.

4.2. Data Preparation

From the viewpoint of machine learning, the more diverse the samples in the training dataset are, the more likely the machine-learning model is to perform well. Comparing the diversity of the three generated datasets, the cluster dataset contains the most varied observations. Experiments are conducted with different training dataset selections, and performance is evaluated on the platoon dataset.

4.3. Assessment Metrics

In jamming detection in a vehicular environment, the three metrics of greatest interest are the detection probability, the reliability of a jamming alarm, and the false alarm probability. As depicted in Table 4, all three are derived from the confusion matrix, which is used for visualizing the performance of a machine-learning algorithm [37]. The three derived quantities, namely the true-positive rate or recall, the positive predictive value or precision, and the false-positive rate (FPR), are computed in equations (2)–(4), respectively.

Detection probability is defined as the ratio of the number of detected attacks to the number of actual attacks, which is the recall metric. Whenever an attack is detected, an alarm is raised. Among the alarms, there are false ones; i.e., there is no jamming attack, but an alarm is still raised. The reliability of an alarm can be estimated by the ratio of true alarms to the total number of alarms, which is the precision metric. When there is no jamming attack, the prediction can be either attack or no attack. The false alarm probability, represented by the FPR metric, is the ratio of false alarms to the number of predictions made in this circumstance. With TP, FP, TN, and FN denoting the numbers of true positives, false positives, true negatives, and false negatives in the confusion matrix, the three metrics are:

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{3}$$

$$\text{FPR} = \frac{FP}{FP + TN} \tag{4}$$

The detection probability and the reliability of detection are expected to be high, while the false alarm probability should be small. We propose an overall assessment metric, computed as an average over the three metrics (equation (5)), for evaluating the performance of the algorithms in our experiments.

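Besides, accuracy, computed as in equation (6), is an effective metric for comparing the performance of machine-learning models. It measures how precisely a given model predicts the state of the network, either under attack or not. With TP, TN, FP, and FN denoting the entries of the confusion matrix:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{6}$$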
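The metrics of this section can be computed from predicted and true labels as in the sketch below. Two points are assumptions of this example rather than statements of our definitions: label 1 is taken to mean "under attack", and the overall score is taken as the mean of recall, precision, and (1 - FPR), so that a higher value is always better. The function names are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int],
                     y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN), with label 1 assumed to mean 'attack'."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def detection_metrics(y_true: Sequence[int],
                      y_pred: Sequence[int]) -> Dict[str, float]:
    """Recall, precision, FPR, accuracy, and an assumed overall score."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn) if tp + fn else 0.0     # detection probability
    precision = tp / (tp + fp) if tp + fp else 0.0  # alarm reliability
    fpr = fp / (fp + tn) if fp + tn else 0.0        # false alarm probability
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    # ASSUMPTION: the overall score averages recall, precision, and (1 - FPR).
    overall = (recall + precision + (1.0 - fpr)) / 3.0
    return {"recall": recall, "precision": precision, "fpr": fpr,
            "accuracy": accuracy, "overall": overall}


# Made-up labels for illustration.
truth = [1, 1, 1, 0, 0, 0, 0, 1]
preds = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = detection_metrics(truth, preds)
```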

5. Performance Evaluation

Two of our three datasets, one-hop communication and cluster, are used to train the SVM and Adaboost models. The models are evaluated on our platoon dataset and compared with the most relevant and recent model-based method, proposed in [8] and denoted as TVT. To the best of our knowledge, the TVT method is the latest work that detects reactive jamming attacks based on medium access observations and shares our research objectives [9]. Furthermore, the TVT method was compared with related methods in [8], which showed its advantage on broadcast messages. For these reasons, we choose TVT as a representative model-based method to compare with our proposed machine learning-based method.

Using our three datasets for training and evaluation, seven experiments are conducted, as described in Table 5. The six models and the TVT method are all evaluated on the platoon dataset: SVM1 and Adaboost1 are trained on the one-hop communication dataset; SVM2 and Adaboost2 are trained on the cluster dataset; SVM3 and Adaboost3 are trained on both; TVT requires no training dataset.
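The experiment layout described above can be written down as a small configuration, sketched here under stated assumptions. The dictionary keys follow the model names in the text, while the dataset labels ("one-hop", "cluster", "platoon") and the helper `training_pool` are hypothetical names chosen for illustration.

```python
# Each experiment maps to (algorithm, training datasets).
# All seven are evaluated on the same held-out dataset.
EXPERIMENTS = {
    "SVM1": ("SVM", ("one-hop",)),
    "Adaboost1": ("Adaboost", ("one-hop",)),
    "SVM2": ("SVM", ("cluster",)),
    "Adaboost2": ("Adaboost", ("cluster",)),
    "SVM3": ("SVM", ("one-hop", "cluster")),
    "Adaboost3": ("Adaboost", ("one-hop", "cluster")),
    "TVT": ("model-based", ()),  # requires no training data
}
EVAL_DATASET = "platoon"


def training_pool(experiment, datasets):
    """Concatenate the samples of every training dataset of one experiment.

    `datasets` maps a dataset label to a list of samples.
    """
    _, train_names = EXPERIMENTS[experiment]
    pool = []
    for name in train_names:
        pool.extend(datasets[name])
    return pool
```

Keeping the evaluation dataset fixed across all experiments means that differences in the results can be attributed to the algorithm and the training data alone.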

The difference among the three datasets lies in the mobility of vehicles in the three scenarios. No vehicle moves in the one-hop communication scenario. Hence, vehicles observe a static environment; i.e., the number of neighbors is fixed, and the distance from each vehicle to the jammer does not change. This leads to a limited set of communication situations, so from a data-analysis point of view the dataset is not diverse. In the two other scenarios, vehicles move in platoons or in a cluster, and the number of neighbors and the distance to the jammer vary with the movement of the vehicles. In particular, vehicles communicate in a broadcast manner, move forward, and then leave the jammer's sensitivity range. Many communication cases occur as the neighborhood of each vehicle changes, and the distances between transmitters and receivers, as well as their distances to the jammer, vary during the simulation. The cluster dataset is the most diverse of the three and should therefore be the most suitable training dataset.

5.1. Vehicle Density

Vehicle density is a common factor that may impact vehicular network solutions. The density at the location of an observing vehicle can be measured as the ratio of its number of neighbors to its communication range. Figure 4 shows a histogram of the number of neighbors in the samples of the three datasets. The cluster dataset covers a wide range of values that includes those of the two other datasets.
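The density measure and the histogram described above can be sketched as follows. The function names and the example values are hypothetical; the communication range is assumed to be given in meters, which the text does not specify.

```python
from collections import Counter


def local_density(num_neighbors, comm_range_m):
    """Density seen by one vehicle: neighbors per unit of communication range.

    ASSUMPTION: the range is expressed in meters.
    """
    if comm_range_m <= 0:
        raise ValueError("communication range must be positive")
    return num_neighbors / comm_range_m


def neighbor_histogram(neighbor_counts):
    """Count how many samples were observed with each number of neighbors."""
    return dict(sorted(Counter(neighbor_counts).items()))


# Made-up neighbor counts taken from five samples.
print(neighbor_histogram([3, 1, 3, 2, 3]))
print(local_density(10, 500.0))
```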

We analyze the impact of the number of neighbors on the performance of machine-learning models to study how density affects our proposed approach. The models are trained on the cluster dataset and evaluated on the one-hop dataset.

Figure 5 shows how the detection accuracy at individual vehicles in the one-hop dataset varies with the number of neighbors. With the SVM model, the accuracy stays at a low value, 0.21, regardless of the number of neighbors. In comparison, the Adaboost model achieves a high detection accuracy, up to 0.87, which decreases slightly for vehicles with more neighbors.
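A per-density breakdown of this kind can be obtained by grouping predictions by the number of neighbors of the observing vehicle, as in the sketch below. The sample layout (number of neighbors, true label, predicted label) and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_neighbors(samples):
    """Return {number of neighbors: accuracy} for labeled predictions.

    `samples` is an iterable of (num_neighbors, true_label, predicted_label).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for num_neighbors, y_true, y_pred in samples:
        total[num_neighbors] += 1
        correct[num_neighbors] += int(y_true == y_pred)
    return {n: correct[n] / total[n] for n in sorted(total)}


# Made-up predictions: 1 = jamming, 0 = no jamming.
demo = [(2, 1, 1), (2, 0, 1), (5, 1, 1), (5, 0, 0)]
print(accuracy_by_neighbors(demo))
```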

5.2. Machine-Learning Models

Table 6 reports the performance on the platoon dataset of the SVM and Adaboost models trained on different datasets and of the TVT method. The models and the TVT method are compared based on the four assessment metrics defined in Section 4.

First, we compare the performance of machine-learning models trained on different datasets. The experimental results in Figure 6 confirm that the selection of the training dataset affects the performance of the models. Among the three possible selections (training only on the one-hop communication dataset, only on the cluster dataset, or on both), the models trained only on the cluster dataset show the best performance in terms of the overall metric B for both the SVM and Adaboost algorithms. Training on both datasets ranks second, and training only on the one-hop communication dataset ranks last. Indeed, the most diverse dataset yields the best-performing models.

Among the SVM models, the highest overall performance is obtained with SVM2, followed by SVM3 and then SVM1 (see Table 6). Although Adaboost outperforms SVM when trained on the one-hop communication dataset or on both datasets, the best model across all experiments is SVM2, i.e., the SVM algorithm trained on the cluster dataset.

SVM2 and Adaboost2 achieve the best accuracy, 0.84 and 0.83, respectively. Their ROC curves in Figure 7 show that, for a randomly chosen threshold, Adaboost2 performs better than SVM2, as its area under the ROC curve (AUC), 0.93, is higher than that of SVM2, 0.88. However, if the threshold is chosen at the best operating point, SVM2 has an FPR of 0.12, whereas Adaboost2 reaches this point only at a higher FPR of 0.2.
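For readers less familiar with ROC analysis, the sketch below shows how an ROC curve, its AUC, and the FPR needed to reach a given detection rate can be computed from classifier scores. It is a generic illustration, not the evaluation code used in the experiments. The function names, the example labels and scores, and the `target_tpr` parameter are hypothetical; in particular, the criterion that defines the best operating point in the text is not reproduced here.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold.

    labels: 1 = jamming, 0 = no jamming.
    scores: higher values mean "more likely jamming".
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    # Lower the threshold step by step, from the highest score down.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Samples with equal scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def min_fpr_reaching(points, target_tpr):
    """Smallest FPR at which the curve reaches at least `target_tpr`."""
    return min(fpr for fpr, tpr in points if tpr >= target_tpr)


# Made-up scores for four samples.
curve = roc_points([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
print(auc(curve), min_fpr_reaching(curve, 1.0))
```

A model can have the larger AUC and still need a higher FPR to reach a particular detection rate, which is the situation described for Adaboost2 and SVM2.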

5.3. Machine Learning-Based Method and Model-Based Method

Second, we compare the machine learning-based method, trained on the cluster dataset, with the model-based method, i.e., TVT. The TVT method detects jamming by computing, from what a vehicle observes, the average number of vehicles that transmit at the same time. If this number is smaller than 2, an alarm is raised. Figure 8 compares the two methods in terms of recall, precision, specificity, and the overall metric B. TVT detects jamming attacks with a high probability, nearly as high as that of the machine learning-based model SVM2. However, SVM2 outperforms the TVT method in terms of precision and specificity. It improves the reliability of raised alarms, i.e., the precision, compared to TVT. The specificity, which is the complement of the false alarm probability (1 − FPR), is also higher in SVM2 than the 0.543 obtained by TVT. Correspondingly, SVM2 lowers the percentage of false alarms relative to TVT. Overall, SVM2 improves on TVT in terms of the overall metric B (the values are shown in Figure 8).
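The threshold rule described for TVT can be sketched as below. This is only an illustration of the decision step: how TVT estimates the number of simultaneously transmitting vehicles from medium access observations is specified in [8] and is not reproduced here, so the input list and the function name are hypothetical.

```python
def tvt_like_alarm(simultaneous_transmitters, threshold=2.0):
    """Raise an alarm when the average number of vehicles transmitting
    at once falls below the threshold.

    ASSUMPTION: `simultaneous_transmitters` already holds one estimate
    per observed transmission event; producing these estimates is the
    core of the TVT method and is outside this sketch.
    """
    if not simultaneous_transmitters:
        raise ValueError("at least one observation is required")
    average = sum(simultaneous_transmitters) / len(simultaneous_transmitters)
    return average < threshold


print(tvt_like_alarm([1, 2, 1]))  # average below 2
print(tvt_like_alarm([2, 3, 2]))  # average above 2
```

A fixed threshold of this kind needs no training data, which is why TVT appears in the experiments without a training dataset.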

5.4. Reducing Data Size

The original datasets contain the individual observations of all vehicles in the network in the presence of a jammer. The data size is computed as the product of the number of observations and the number of features, which is initially 6. We evaluate the SVM algorithm on the original data, whose size is 169518. Vehicular networks are highly dynamic, so resources are limited in terms of both time (for processing, communicating, and responding to any incident) and computational performance. To save the memory needed to store the data and to make machine-learning models lighter and faster, we propose a method that reduces the data size in two dimensions: removing duplicated samples and selecting the most important features, i.e., those with the highest effect on the performance of the model. Experiments are conducted for the SVM2 model. The performance and data size after each reduction step are shown in Table 7 and Figure 9.

First, we reduce the data size by removing duplicate samples. As depicted in Figure 9, the performance of the SVM2 model trained on this first-step reduced cluster dataset, denoted as SVM2-RR1, decreases only slightly, while the number of samples is reduced (see Table 7).

Second, we consider the four features that are likely to have the highest impact on the performance of the models: the number of successfully received packets, the number of collisions, the time stamp, and the number of neighbors. The data size can be reduced further by removing samples that have identical values for these four features. After this second step, the training dataset, SVM2-RR2, is slightly smaller than SVM2-RR1, and the performance of the model remains stable.

Third, by keeping only the four most important features and removing all others, the data size of the resulting dataset, named SVM2-RR2-FS, is reduced to 66664, which is only about 39% of the original data size of 169518. Only a slight degradation in performance compared to the original model is observed (see Table 7).
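The three reduction steps above can be sketched on a toy dataset as follows. The helper names and the example rows are hypothetical. Each row is assumed to hold only the six feature values, with the four selected features in the first four columns; labels are left out for brevity, and how duplicates with conflicting labels are treated is not covered by this sketch.

```python
def drop_duplicates(rows, key_columns):
    """Keep the first row for each distinct combination of key columns."""
    key_columns = tuple(key_columns)
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def select_columns(rows, columns):
    """Project every row onto the chosen feature columns."""
    columns = tuple(columns)
    return [tuple(row[c] for c in columns) for row in rows]


def data_size(rows):
    """Data size = number of observations x number of features."""
    return len(rows) * len(rows[0]) if rows else 0


# Columns 0-3: received packets, collisions, time stamp, neighbors.
# Columns 4-5: the two remaining features (made-up values throughout).
SELECTED = (0, 1, 2, 3)
rows = [
    (5, 1, 10, 3, 0, 0),
    (5, 1, 10, 3, 0, 0),  # exact duplicate, removed in step 1
    (5, 1, 10, 3, 1, 0),  # same four key features, removed in step 2
    (4, 0, 11, 2, 0, 1),
]
step1 = drop_duplicates(rows, range(6))     # analogous to SVM2-RR1
step2 = drop_duplicates(step1, SELECTED)    # analogous to SVM2-RR2
step3 = select_columns(step2, SELECTED)     # analogous to SVM2-RR2-FS
print(data_size(rows), data_size(step1), data_size(step2), data_size(step3))
```

With six features, the original data size of 169518 corresponds to 28253 observations, and the final size of 66664 with four features corresponds to 16666 observations.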

6. Conclusion and Future Works

In conclusion, we have considered the open issues of detecting jamming attacks in a broadcast environment, the shortage of datasets, and the lack of central observation in vehicular networks. We have proposed a machine learning-based approach for jamming detection based on medium access observations at an individual vehicle. To facilitate the investigation of our proposal, we have designed and validated a simulation model of jamming attacks on vehicular networks. The model can be verified and further used to generate various data. Based on the evaluation of different classification algorithms with training data from three mobility scenarios, i.e., datasets of different diversity, we propose a pipeline for applying machine learning to the problem of detecting jamming attacks using medium access observations at an individual vehicle. We suggest using the SVM algorithm trained on the most diverse data, which come from the cluster scenario. The SVM model trained on the cluster dataset demonstrates significant improvements in detection probability and a reduction in false alarms, the latter being a shortcoming of the model-based method.

Since the optimal choice of machine-learning algorithm remains an open question, we plan to investigate two further algorithms (logistic regression and random forest) as well as deep neural networks. More simulations in various scenarios will be needed to analyze their effectiveness comprehensively.

Data Availability

Each observation in a CCHI constitutes one sample in the dataset. Corresponding to the three scenarios, we generated three datasets, which are published on IEEE Dataport at https://doi.org/10.21227/yvxd-mf03.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was funded by the Graduate University of Science and Technology (GUST), Vietnam Academy of Science and Technology, 18 Hoang Quoc Viet, Hanoi, Vietnam, under the post-doctoral Scheme 2020 in collaboration with the Institute of Information Technology, award number GUST.STS.DT2020-TT02.