Abstract

Intelligent transportation systems necessitate a fine-grained and accurate estimation of vehicular traffic flows across critical paths of the underlying road network. However, such statistics should be collected in a manner that does not disclose the trajectories of individual users. To this end, we introduce a privacy-preserving protocol that leverages roadside units (RSUs) to communicate with the passing vehicles, in order to construct encrypted Bloom filters stemming from random vehicle IDs that are chosen secretly by the individual vehicles. Each Bloom filter represents the set of vehicle IDs that contacted the RSU but may also be used to estimate the traffic flow between any number of RSUs. More precisely, we designed a probabilistic model that approximates multipoint traffic flows by estimating the number of common vehicles among a given set of RSUs. Through extensive simulation experiments, we demonstrate that our protocol is very accurate—with a minor deviation from the real traffic flow—and show that it reduces the estimation error by a large factor, when compared to the current state-of-the-art approaches. Furthermore, our implementation of the underlying cryptographic primitives illustrates the feasibility, practicality, and scalability of the system.

1. Introduction

Traffic statistics support transportation authorities in many use cases, including investment planning, signal timing, and road expansion. Traffic statistics are measured in terms of single-point, point-to-point, or multipoint traffic flows. Single-point traffic flow refers to the number of vehicles passing through a specific location, which can be measured by placing fixed sensors at roadsides, such as inductive loop detectors, wireless magnetometer sensors, road cameras, or microwave radar sensors. Single-point statistics are useful for measuring the annual average daily traffic (AADT) [14]. On the other hand, point-to-point traffic flow, sometimes referred to as origin-destination (O-D) flow, is defined as the total number of vehicles moving from one location to another. Finally, multipoint traffic refers to the traffic flow that passes through a set of specific locations. Point-to-point and multipoint traffic statistics may be deduced from single-point counters [5]; however, there are serious concerns about the accuracy of such approaches because they are oblivious to the identities of the underlying vehicles.

Consequently, to derive accurate traffic statistics, we need alternative methods for collecting data from moving vehicles. One approach is to use the drivers’ smartphones [6, 7] or the GPS systems integrated in most vehicles [8, 9], in order to generate detailed object trajectories. Alternatively, recent advancements in vehicular communications and networking technologies have brought cyber physical systems (CPS) into road networks, where roadside units are used to collect traffic data [10]. The dedicated short-range communications (DSRC) protocol is standardized by the IEEE (802.11p) [11] and enables the direct communication between vehicles and roadside units (RSUs). In this scenario, generating accurate traffic statistics is trivial: each vehicle reports its ID to every RSU it encounters, with all the reports being aggregated to a centralized server (the transportation authority). Nevertheless, this approach violates the privacy of the vehicle owners and may reveal sensitive personal information, such as home and work locations, habits, etc.

To address these privacy concerns, Zhou et al. [12, 13] have proposed the use of a bit array as an alternative to individual vehicle IDs. In particular, every vehicle selects (in advance) a set of s bit locations that it may reveal to an RSU. When the vehicle identifies a new RSU, it randomly selects one of the s locations and sends it to the RSU as its identifier. Each RSU gradually aggregates the bit array data from all passing vehicles by setting a bit to “1” if it is selected by at least one vehicle (nonselected bits are set to “0”). Point-to-point [13] or multipoint [12] traffic flows are then constructed by comparing the bit arrays of the corresponding RSUs. While this approach improves the privacy compared to the trivial method, it still has several limitations. First, the bit information is exchanged in plaintext and is, thus, vulnerable to timing attacks. Indeed, any number of RSUs may collude to deduce the vehicles’ secret information (the s bit locations), by correlating successive samples based on the driving distance between the RSUs. Second, for sufficient privacy, s should be relatively large (e.g., s ≥ 5), which negatively affects the accuracy of the traffic flow statistics.

To this end, our work advances the state of the art in two directions. Our major contribution is a novel method that derives from our earlier conference paper [14] and summarizes the vehicle information at each RSU into an encrypted Bloom filter [15]. We employ a two-tiered approach where (i) the vehicles’ Bloom filters are encrypted with a simple one-time pad (OTP) cipher and (ii) the OTP keys are encrypted with a homomorphic threshold public key cryptosystem. The underlying cryptosystems make it infeasible for an adversary to decrypt the individual Bloom filters, thus significantly enhancing the vehicles’ privacy. Each RSU aggregates the Bloom filters and OTP keys from multiple vehicles and sends the corresponding ciphertexts to the transportation authority. Finally, the transportation authority engages multiple trusted parties to decrypt the aggregate OTP keys and retrieve the plaintext Bloom filters.

The second contribution is the extension of our previous work [14] from origin-destination to multipoint traffic flow estimation. This is an important feature, because it allows transportation authorities to estimate the traffic flows across arbitrary road network paths. Specifically, we introduce a simple and accurate probabilistic method for estimating the number of common vehicles among any number of distinct Bloom filters, which is an indication of the traffic flow volume between the underlying RSUs. Furthermore, we significantly extended our simulation experiments, by considering more diverse settings where the number of vehicles at each RSU exhibits a large variance. Our results demonstrate that, compared to the current state-of-the-art approaches, our methods improve the accuracy of both point-to-point and multipoint traffic flow estimation by a large factor. In addition, our software implementation of the basic cryptographic primitives at the RSUs and the server illustrates the feasibility and scalability of our approach.

The remainder of this paper is organized as follows. Section 2 surveys the existing literature on traffic flow estimation. Section 3 discusses the data structures and cryptographic primitives utilized in our work, and it also presents the underlying system and threat models. Section 4 introduces our privacy-preserving aggregation protocol, and Section 5 presents the probabilistic model for estimating the traffic flows. Section 6 shows the results of our experimental evaluation, and Section 7 concludes our work.

2. Related Work

Estimating and predicting traffic flow is a mature research problem that has evolved over time. Early work on road network traffic flows focused on the prediction of the annual average daily traffic (AADT). To this end, a variety of machine-learning models have been applied. Mohammed et al. [4] predicted the AADT on county roads in the state of Indiana, USA, using a linear regression model. Lam and Xu [3] estimated the AADT on urban roads of Hong Kong with a neural network, using short counts. A similar prediction model was designed by Eom et al. [2], who computed the spatial dependencies with the use of a regression method. Specifically, the authors leveraged a geostatistical technique, called kriging, for modeling spatial trends and spatial correlations between monitoring stations and neighboring stations. Neto et al. [1] utilized support vector machine for regression (SVR), by applying data-dependent parameters based on the distribution of the training data. Similarly, Tsapakis et al. [16] designed twelve models based on regression and Bayesian analysis, using various parameters (such as roadway functional class, population density, and spatial location) collected from five regions in the state of Ohio, USA. Finally, other research work has addressed the feature selection problem for AADT prediction. For example, Yang et al. [17] selected the AADT estimation parameters via the smoothly clipped absolute deviation (SCAD) penalty. All the above systems exploit the capabilities of either sensors or roadside units for traffic data collection.

On the other hand, the authors of [6, 7, 9] utilize data from mobile phones and GPS devices to extract origin-destination information. They exploited the global system for mobile communications (GSM) to observe the flow of mobile phones using cellular network technology, and correlate it to the road network traffic flow. This is similar to the approach used by Google Maps and Waze (https://www.google.com/maps, https://www.waze.com/) to optimize their routing decisions [8].

Hoh et al. [18] highlighted the serious privacy threats in traffic monitoring systems, as they were able to locate 85% of the drivers’ home locations from the collected data. This is also true for popular apps, like Google Maps and Waze, which use a static ID for each reporting client, even across different trips [8]. Therefore, to protect the privacy of trajectory data, Hoh and Gruteser [19] proposed a path perturbation algorithm for a centralized, trusted server, by employing path confusion. PADAVAN [20] is another scheme for anonymous data collection that allows anonymous data reporting, while avoiding fake submissions and linkage between submitted samples. Likewise, Rass et al. [21] introduced trajectory anonymization by deriving pseudonyms for trips and samples. Finally, Hoh et al. [22] proposed a distributed and privacy-preserving traffic monitoring system that utilizes virtual trip lines, i.e., geographical markers where the vehicle has to report its location. Based on the trip line ID, the scheme allows for aggregation and cloaking of several locations, without revealing the precise locations. With a distributed architecture, no single entity has full information of probe IDs and fine-grained locations. However, if multiple compromised entities collaborate, location updates may be revealed with high accuracy.

The abovementioned schemes introduce some level of privacy in traffic monitoring systems, but they are not well-suited toward estimating point-to-point or multipoint traffic flows with high accuracy. To this end, several researchers applied intricate cryptographic protocols to protect the privacy of drivers and their trips. Forster et al. [23] proposed a distributed secret sharing algorithm, using location- and time-specific keys, and enforced k-anonymity of location data in a decentralized environment. At the end of a trip, the vehicle reports its O-D locations with the corresponding time information. Trip reports are encrypted with location and time-specific keys, and uploaded to a centralized and untrusted database. The distributed key sharing algorithm assures that trip reports are available once k vehicles have the same trip with matching O-D and start-end times. Zhou et al. [24] measured origin-destination traffic flows using commutative one-way hash functions that are constructed from an RSA-like cryptosystem. Although the hash function hides the vehicle’s ID from the RSU, it fails to protect the privacy of the trajectory when all the data are aggregated at the centralized server.

The current state-of-the-art protocols for privacy-preserving traffic data collection are due to Zhou et al. [12, 13] and Sun et al. [25, 26], which provide privacy without the use of cryptographic techniques. Specifically, in their methods, every receiver maintains a physical bit array of size m (similar to a Bloom filter), and each passing vehicle sets one of the bits to “1”. To avoid being identified across multiple receivers, the vehicle uses a logical bit array of size sm, i.e., it can only set (randomly) one of s preselected bits at each receiver. Zhou et al. [12] applied maximum-likelihood estimation (MLE) on the bit arrays to measure the intensity of the traffic flow between any pair of receivers. In their other work [13], the authors introduced variable sized bit arrays to estimate the traffic flow across multiple RSUs. Using the same data collection approach, Sun et al. [25] computed the number of vehicles that persistently (in every measurement period) travel on a given road network edge. Furthermore, the same authors [26] also analyzed the number of vehicles that travel x out of T measurement periods on the network edge.

While the aforementioned data collection approach is very efficient, it has several shortcomings. For sufficient privacy, s should be large, but this negatively affects the accuracy of the traffic flow estimation. In addition, the lack of encryption makes it possible to launch a variety of attacks, such as timing or direct observation, which may disclose the contents of a vehicle’s logical bit array. Our method overcomes these limitations by (i) revealing only aggregate Bloom filter data and (ii) utilizing encryption for the submission of individual measurements.

3. Preliminaries

3.1. Bloom Filter

A Bloom filter is a fast, memory-efficient, probabilistic data structure that was designed by Burton Howard Bloom [15] in 1970 for rapid set-membership testing. It is essentially a bit array of size m, whose bits are initially set to “0.” Prior to adding elements into the Bloom filter, we must define k hash functions H1, H2, ..., Hk, each returning a value in the range [0, m). Then, to add an element x into the Bloom filter, we derive k random indices by hashing x with each of the k hash functions, i.e., i1 = H1(x), i2 = H2(x), ..., ik = Hk(x). Finally, for each computed index ij, we set the corresponding bit of the bit array to “1.” Similarly, to search for an element x in the Bloom filter, we first derive the k indices i1, i2, ..., ik from the hash functions; if at least one of the bits in these locations is “0,” x is definitely not a member of the set. Otherwise, x is most likely included in the set, although there is a small probability of a false positive because of collisions (multiple elements may share the same index).

To reduce the false-positive rate, the optimal value for m must be selected based on the number k of hash functions, and the expected number of elements that will be inserted into the Bloom filter. Note that a larger k increases the complexity of the Bloom filter but, in our protocol, a large k also results in more user privacy. It is also worth noting that it is possible to produce the Bloom filter of the union of two (or more) sets, by computing the bitwise OR of the individual Bloom filters. We will take advantage of this property in our methods described in Section 5.
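
The insertion, membership test, and union operations described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the class name `BloomFilter` and the choice of salted SHA-256 as the k hash functions are assumptions made only for the example.

```python
import hashlib


class BloomFilter:
    """Plain (unencrypted) Bloom filter with m bits and k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m

    def indices(self, item):
        # Assumption: H_j(item) is SHA-256 over "j:item", reduced modulo m.
        # The k indices of one item are not guaranteed to be distinct.
        return [
            int.from_bytes(hashlib.sha256(f"{j}:{item}".encode()).digest(), "big") % self.m
            for j in range(self.k)
        ]

    def add(self, item):
        for i in self.indices(item):
            self.bits[i] = 1

    def might_contain(self, item):
        # False means "definitely absent"; True means "probably present".
        return all(self.bits[i] for i in self.indices(item))

    def union(self, other):
        # Bitwise OR yields the Bloom filter of the union of the two sets.
        assert (self.m, self.k) == (other.m, other.k)
        out = BloomFilter(self.m, self.k)
        out.bits = [a | b for a, b in zip(self.bits, other.bits)]
        return out
```

The `union` method reflects the bitwise-OR property mentioned above: the result is identical to the filter obtained by inserting all elements of both sets into one empty filter.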

3.2. Paillier Cryptosystem

Public-key homomorphic cryptosystems [27] allow for the manipulation (through mathematical operations) of encrypted plaintexts without requiring access to the private decryption key. In particular, fully homomorphic encryption (FHE) [28] permits arbitrary computations over ciphertexts (both additions and multiplications) and can, thus, be used with any circuit over encrypted data. However, FHE is still very inefficient and not suitable for real-time traffic data collection. Instead, in our work, we rely on Paillier’s additively homomorphic cryptosystem [29]. Paillier encryption is semantically secure and its security is based on the decisional composite residuosity assumption (DCRA). The scheme works as follows.

3.2.1. Key Generation

Select two large primes p, q of equal length and compute the RSA modulus n = pq. Choose a uniformly random integer $g \in \mathbb{Z}^{*}_{n^2}$ and compute λ = lcm(p−1, q−1) and $\mu = \left(L(g^{\lambda} \bmod n^2)\right)^{-1} \bmod n$, where $L(x) = (x-1)/n$. Output public key (n, g) and the corresponding private key (λ, µ).

3.2.2. Encryption

Given a message m < n, choose a uniformly random integer $r \in \mathbb{Z}^{*}_{n}$ and output ciphertext $c = g^{m} \cdot r^{n} \bmod n^2$.

3.2.3. Decryption

Given a ciphertext c, compute the plaintext message $m = L(c^{\lambda} \bmod n^2) \cdot \mu \bmod n$.

To illustrate the additively homomorphic property of the Paillier cryptosystem, consider the encryption of two messages Enc(m1) and Enc(m2). Then, the following equation holds:

$$\mathrm{Enc}(m_1) \cdot \mathrm{Enc}(m_2) \bmod n^2 = \mathrm{Enc}(m_1 + m_2 \bmod n). \tag{1}$$

In our work, we leverage a threshold variant of Paillier’s encryption scheme [30], where the secret key is shared among t parties. While this is easily done with the help of a trusted third party, there are protocols that allow the t parties to compute the Paillier key pair in a distributed manner. With the threshold cryptosystem, all t parties must cooperate to decrypt a Paillier ciphertext.
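
The key generation, encryption, decryption, and additive property of Section 3.2 can be illustrated with a toy, non-threshold sketch. It is not secure and is not the implementation used in this work: the primes are tiny, the function names are hypothetical, and g is fixed to n + 1 (a common valid choice) rather than drawn at random.

```python
import math
import secrets


def L(x, n):
    # L(x) = (x - 1) / n, defined for x congruent to 1 modulo n.
    return (x - 1) // n


def keygen(p, q):
    # p and q must be distinct primes of equal length (toy sizes in this sketch).
    n = p * q
    n2 = n * n
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1  # assumption: fixed generator instead of a random g
    mu = pow(L(pow(g, lam, n2), n), -1, n)
    return (n, g), (lam, mu)


def encrypt(pk, m):
    n, g = pk
    n2 = n * n
    while True:
        r = 1 + secrets.randbelow(n - 1)  # r in [1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return (L(pow(c, lam, n * n), n) * mu) % n


def add_encrypted(pk, c1, c2):
    # Multiplying ciphertexts adds the underlying plaintexts modulo n.
    n, _ = pk
    return (c1 * c2) % (n * n)
```

For example, with `pk, sk = keygen(1009, 1013)`, decrypting `add_encrypted(pk, encrypt(pk, 123), encrypt(pk, 456))` recovers the sum of the two plaintexts without either one being decrypted individually.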

3.3. System Model

We consider the cyber physical system depicted in Figure 1, where roadside units (also called receivers) are installed at different geographical locations along the road network. Every vehicle and receiver is equipped with a computing unit, and is able to communicate with other devices through the IEEE 802.11p protocol. Whenever a vehicle comes in close proximity to an RSU (e.g., within 100–500 m), it transmits an encrypted Bloom filter that is a representation of its unique identifier (to enhance privacy, a vehicle may choose a different identifier at the start of a new trip.). The RSU blindly aggregates the individual Bloom filters and submits the resulting ciphertexts to the transportation authority. Note that we are interested in collecting fine-grained traffic statistics, so each RSU will produce a new Bloom filter when (i) a timer expires (e.g., every 5–10 minutes) or (ii) a threshold number of measurements is reached (e.g., 1000–2000 vehicles). To satisfy basic privacy requirements, an RSU will not submit a Bloom filter if the number of vehicles is below a lower threshold (e.g., 100–200 vehicles). Such fine-grained measurements allow for the collection of important traffic statistics, including point-to-point or multipoint travel speed estimation.

Prior to system deployment, the transportation authority initializes a threshold Paillier cryptosystem in collaboration with several third-party trusted entities. For example, these entities may include various consumer protection agencies and other nonprofit organizations. The reason for employing a threshold cryptosystem is to prevent individual parties—especially the transportation authority that has access to all data through the RSUs—from decrypting the Bloom filters that are transmitted by the passing vehicles. Threshold decryption necessitates the collaboration of all involved parties, so a single honest player is sufficient for ensuring that only aggregate Bloom filters are being decrypted.

3.4. Threat Model

We assume that all entities in the aforementioned system model are semi-honest, i.e., they will execute all protocols correctly but will try to gain an advantage (in identifying vehicle-specific secret information) by examining the exchanged messages. We also allow collusions among the transportation authority and the RSUs, i.e., the adversary is given the full communication transcript of the underlying network. Furthermore, our methods do not require TLS-based communications to thwart eavesdropping attacks, because all transmitted information is encrypted (nevertheless, if we wish to authenticate the vehicles and/or receivers, we may utilize the TLS protocol with the existing public key infrastructure). The only requirement with regard to privacy is that, out of the t trusted parties engaged in the threshold cryptosystem, at most t−1 can collude. Finally, we assume that the vehicles can remove all identifying information when communicating with a receiver, e.g., by spoofing the actual MAC address of their network interface card. While it is still possible to track a vehicle using road cameras or other devices, such attacks are out of the scope of this paper.

4. Privacy-Preserving Data Aggregation

Our solution comprises three distinct phases: Bloom filter encryption, aggregation, and decryption. We describe all of them in detail in the following sections.

4.1. Bloom Filter Encryption

To add a vehicle v into the Bloom filter, we compute Hi(v), i ∈ {1, 2, ..., k}, and set the corresponding bits in the bit vector to “1.” Unfortunately, aggregating Bloom filters with an additively homomorphic public key encryption scheme (such as Paillier) is infeasible, because the logical OR operation necessitates a fully homomorphic cryptosystem [28]. Instead, we may utilize counting Bloom filters [31], which are integer vectors that enable element deletions as well. In a counting Bloom filter, instead of setting the bits at the k vector positions, we simply increment the corresponding counters. However, this approach is vulnerable to correlation attacks that inspect the Bloom filters of two or more successive receivers. If their counters differ in only a small fraction of the m vector locations, an adversary may be able to deduce the k secret indexes of the noncommon vehicles.

As a result, in this work, we employ a one-time pad cipher to encrypt the Bloom filter vectors. In particular, let q be a prime power and let Fq be the finite field of integers modulo q. For a vehicle v, its Bloom filter is represented as a vector $\mathbf{b} = (b_0, b_1, \ldots, b_{m-1}) \in \mathbb{F}_q^m$, where each $b_i$ with $i \in \{H_1(v), H_2(v), \ldots, H_k(v)\}$ is chosen uniformly at random from the range [1, q). The remaining values are all set to zero. To encrypt its Bloom filter, the vehicle chooses a random vector $\mathbf{e} \in \mathbb{F}_q^m$ and outputs the following ciphertext (in modulo q arithmetic):

$$\mathbf{c} = \mathbf{b} + \mathbf{e}. \tag{2}$$

The next step is to devise an efficient method for vehicles to communicate the aggregate encryption keys to the transportation authority. For that purpose, we employ the additively homomorphic Paillier cryptosystem that is instantiated prior to system deployment. Assume now that the number of vehicles that may be summarized into a single Bloom filter is upper bounded by n. Then, the number of bits required to store a single Bloom filter entry is log n + log q. Based on the maximum message size that can be encrypted under Paillier (which depends on its RSA composite), we denote as l the maximum number of Bloom filter entries that can fit into a single Paillier ciphertext. Then, the vehicle will output the following ciphertext vector r, where “|” denotes the concatenation operator:

$$\mathbf{r} = \big(\mathrm{Enc}(e_0 \,|\, e_1 \,|\, \cdots \,|\, e_{l-1}),\ \mathrm{Enc}(e_l \,|\, \cdots \,|\, e_{2l-1}),\ \ldots,\ \mathrm{Enc}(\cdots \,|\, e_{m-1})\big). \tag{3}$$

The size of vector r is $\lceil m/l \rceil$, and the values $e_i$ are the elements of the key vector e. The tuple 〈c, r〉 is the encrypted Bloom filter (i.e., identification) of that vehicle.
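
The vehicle-side steps of this section can be sketched as follows, under stated assumptions. The Paillier encryption of each packed word is omitted (the packed integers stand in for the plaintexts that would be encrypted), the function names are hypothetical, and each key is stored in a slot of ⌈log2 n⌉ + ⌈log2 q⌉ bits so that adding up to n keys cannot overflow into the neighboring slot.

```python
import secrets


def slot_bits(n_max, q):
    # ceil(log2(n_max)) + ceil(log2(q)) bits per entry, computed with integers only.
    return (n_max - 1).bit_length() + (q - 1).bit_length()


def encrypt_bloom(indices, m, q):
    """Return (b, e, c): Bloom vector, OTP key vector, and ciphertext c = b + e mod q."""
    b = [0] * m
    for i in indices:
        b[i] = 1 + secrets.randbelow(q - 1)  # uniform in [1, q)
    e = [secrets.randbelow(q) for _ in range(m)]  # uniform in [0, q)
    c = [(bi + ei) % q for bi, ei in zip(b, e)]
    return b, e, c


def pack(e, bits, l):
    """Concatenate l key entries per word; each word would then be Paillier-encrypted."""
    packed = []
    for start in range(0, len(e), l):
        word = 0
        for ei in e[start:start + l]:
            word = (word << bits) | ei
        packed.append(word)
    return packed


def unpack(packed, bits, l, m):
    """Inverse of pack; also valid for sums of packed words if no slot overflows."""
    mask = (1 << bits) - 1
    out = []
    for j, word in enumerate(packed):
        count = min(l, m - j * l)  # the last word may hold fewer entries
        out.extend((word >> (bits * (count - 1 - t))) & mask for t in range(count))
    return out
```

With n = 2000 and q = 2^10, `slot_bits` returns 21; because the packed words are added homomorphically at the RSU, the slot width, rather than the size of a single key, determines how many entries fit into one Paillier plaintext.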

4.2. Bloom Filter Aggregation

When the vehicle encounters a new RSU, it will transmit its Bloom filter 〈c, r〉. After that, it will compute a re-randomized version of the Bloom filter, by choosing fresh random values for vectors b and e. In this way, the vehicle cannot be identified across multiple RSUs. Each RSU maintains, locally, an aggregate (encrypted) Bloom filter 〈cA, rA〉 that summarizes the vehicles that have passed during the current measurement period. Once it receives a new sample 〈c, r〉 from a passing vehicle, it updates the vectors as follows:

$$\mathbf{c}_A \leftarrow \mathbf{c}_A + \mathbf{c} \pmod{q}, \qquad \mathbf{r}_A \leftarrow \mathbf{r}_A \odot \mathbf{r} \pmod{n^2}, \tag{4}$$

where $\odot$ denotes element-wise multiplication and $n^2$ is the square of the Paillier modulus. When the current measurement period ends, the RSU will send 〈cA, rA〉 to the transportation authority, along with the duration of the measurement period (start and end time).

4.3. Bloom Filter Decryption

Decryption is a two-step process that involves the transportation authority and all the trusted third parties. Once the transportation authority receives a new encrypted Bloom filter 〈cA, rA〉, it engages all t trusted parties to collectively decrypt the aggregate encryption key eA from the ciphertext vector rA. This step entails the threshold decryption of Paillier ciphertexts. Next, it reduces (element-wise) eA modulo q, and computes the plaintext of the aggregate Bloom filter as follows:

$$\mathbf{b}_A = \mathbf{c}_A - \mathbf{e}_A \pmod{q}. \tag{5}$$
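
The aggregation and decryption phases can be simulated end to end with the sketch below. It makes one loud simplification: the Paillier layer is replaced by plain integer addition of the key vectors, which is what the homomorphic product of ciphertexts computes on the underlying plaintexts. All function names are hypothetical.

```python
import secrets


def vehicle_report(indices, m, q):
    """One vehicle's sample: OTP ciphertext c and key vector e (sent encrypted in practice)."""
    b = [0] * m
    for i in indices:
        b[i] = 1 + secrets.randbelow(q - 1)  # nonzero marker in [1, q)
    e = [secrets.randbelow(q) for _ in range(m)]
    c = [(bi + ei) % q for bi, ei in zip(b, e)]
    return c, e


def aggregate(reports, m, q):
    """RSU side: add ciphertexts modulo q and accumulate the keys."""
    c_agg = [0] * m
    e_agg = [0] * m  # stand-in for the Paillier-encrypted aggregate r_A
    for c, e in reports:
        c_agg = [(x + y) % q for x, y in zip(c_agg, c)]
        e_agg = [x + y for x, y in zip(e_agg, e)]
    return c_agg, e_agg


def decrypt_aggregate(c_agg, e_agg, q):
    """Authority side: reduce e_A modulo q, subtract, and map nonzero entries to '1' bits."""
    b_agg = [(c - (e % q)) % q for c, e in zip(c_agg, e_agg)]
    return [1 if v else 0 for v in b_agg]
```

When two vehicles select disjoint positions, the recovered bit array is exact; when several vehicles share a position, the sum of their random markers can wrap to zero modulo q, which is the bit error discussed next.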

It is important to note that an adversary cannot determine the number of vehicles that have set a certain bit, because the corresponding value is uniformly random in the range [0, q). The downside of this approach is that certain Bloom filter entries that have been selected by at least two vehicles may produce an incorrect value of zero (which signifies a “0” bit). However, the probability of that event is low. More specifically, let P(i) be the probability that a certain Bloom filter entry is selected by exactly i out of n vehicles:

$$P(i) = \binom{n}{i}\left(\frac{k}{m}\right)^{i}\left(1-\frac{k}{m}\right)^{n-i}, \tag{6}$$

where m is the Bloom filter size, and k is the number of hash functions. From this formula, we may compute the bit error probability Perr as follows:

$$P_{err} = \frac{1}{q}\sum_{i=2}^{n} P(i) = \frac{1 - P(0) - P(1)}{q}. \tag{7}$$

As an example, if n = 2000, m = 8000, k = 4, and $q = 2^{10}$, the bit error probability is just 0.026%. As we will show in our simulation experiments, the effect of bit errors on the accuracy of our protocol is negligible.
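
The quoted figure can be reproduced numerically with the short sketch below. It rests on two modeling assumptions that are stated here rather than taken verbatim from the text: each vehicle selects a given entry with probability k/m, and an entry shared by two or more vehicles sums to zero modulo q with probability 1/q. The function names are hypothetical.

```python
from math import comb


def p_selected(i, n, m, k):
    # Binomial probability that exactly i of the n vehicles select a given entry.
    p = k / m
    return comb(n, i) * p**i * (1 - p) ** (n - i)


def bit_error_probability(n, m, k, q):
    # An error needs at least two vehicles on the entry and a sum of zero modulo q.
    shared = 1 - p_selected(0, n, m, k) - p_selected(1, n, m, k)
    return shared / q


p_err = bit_error_probability(n=2000, m=8000, k=4, q=2**10)
print(f"{100 * p_err:.3f}%")
```

Under these assumptions the result rounds to 0.026%, in agreement with the example above.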

4.4. Privacy Analysis

We define as a privacy breach the disclosure of a vehicle’s secret Bloom filter. This may be accomplished by (i) performing ciphertext-only attacks on the underlying cryptosystems or (ii) analyzing Bloom filters from different receivers.

Regarding the first type of attack, we argue that it is infeasible because of the semantic security of the OTP and Paillier cryptosystems that render every message indistinguishable. Furthermore, according to our threat model stated in Section 3.4, at least one of the trusted third-parties will not collude to decrypt individual Bloom filters. Instead, the only plaintext information available to the adversary is the aggregate encryption key from the OTP ciphertexts of the n vehicles. That information alone is not sufficient to decrypt individual Bloom filters.

Indeed, let us consider a single Bloom filter entry j, and the n ciphertext values that are known by the adversary, namely c1, c2, ..., cn. To retrieve the corresponding plaintext values bi, the adversary must solve the following system of equations:

Clearly, a unique solution does not exist, since there are n + 1 equations with 3n unknowns. In fact, we can easily produce a solution for any combination of vehicles that have selected a nonzero value for that exact Bloom filter entry.

Therefore, the only viable attack vector for an adversary is to examine the Bloom filters from two different RSUs, whose Hamming distance is small. To this end, we consider the worst-case scenario for our approach, which entails two almost identical datasets. In particular, consider the unlikely scenario where the adversary has knowledge (e.g., using external observations) that sets A and B (containing n − 1 and n vehicles, respectively) are constructed such that all of A’s vehicles are also present in B (this is similar to the definition of differential privacy [32]). We can then determine the probability that the adversary can derive one or more of the k bit locations that identify the extra vehicle. Let Pi be the probability that we can identify exactly i out of k bits. This is equal to the probability that the i bits have been set by exactly one vehicle, which, using P(1) from equation (6), can be written as

$$P_i = \binom{k}{i}\, P(1)^{i}\, \big(1 - P(1)\big)^{k-i}. \tag{9}$$

For instance, if n = 2000, m = 8000, and k = 4, the probability of recovering the vehicle’s entire Bloom filter is just 1.8%.
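
The 1.8% figure can be checked with a few lines of code. The sketch assumes that a given entry is selected by each vehicle with probability k/m and that the k positions of the extra vehicle behave independently; the function names are hypothetical.

```python
from math import comb


def p_exactly_one(n, m, k):
    # Probability that a given entry is selected by exactly one of n vehicles.
    p = k / m
    return n * p * (1 - p) ** (n - 1)


def p_identify(i, n, m, k):
    # Probability that exactly i of the k positions of the extra vehicle are exposed.
    p1 = p_exactly_one(n, m, k)
    return comb(k, i) * p1**i * (1 - p1) ** (k - i)


full_recovery = p_identify(4, n=2000, m=8000, k=4)
print(f"{100 * full_recovery:.1f}%")
```

Because full recovery requires all k positions to be exposed at once, the probability decays geometrically in k, which is why a larger k yields more privacy.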

5. Traffic Flow Estimation

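Let $A_1, A_2, \ldots, A_N$ be N sets containing the vehicles that contacted receivers $\mathrm{RSU}_{A_1}, \mathrm{RSU}_{A_2}, \ldots, \mathrm{RSU}_{A_N}$, respectively. Also, let $|A_i|$ denote the cardinality of $A_i$. Using basic set theory (the inclusion-exclusion principle), the number of common vehicles across the N RSUs (i.e., the traffic flow on that specific road network path) can be computed as follows:

$$\Big|\bigcap_{i=1}^{N} A_i\Big| = \sum_{i=1}^{N} (-1)^{i+1} \sum_{1 \le j_1 < \cdots < j_i \le N} \big|A_{j_1} \cup \cdots \cup A_{j_i}\big|. \tag{10}$$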

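For example, if we consider N = 3 RSUs, then the number of common vehicles is equal to:

$$|A_1 \cap A_2 \cap A_3| = |A_1| + |A_2| + |A_3| - |A_1 \cup A_2| - |A_1 \cup A_3| - |A_2 \cup A_3| + |A_1 \cup A_2 \cup A_3|. \tag{11}$$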
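
The identity can be verified on explicit sets, where exact cardinalities are available. The function name below is hypothetical; the code simply enumerates every combination of i out of N sets and adds or subtracts the size of its union.

```python
from itertools import combinations


def intersection_size_from_unions(sets):
    """Size of the intersection of all sets, computed only from sizes of unions."""
    total = 0
    for i in range(1, len(sets) + 1):
        sign = 1 if i % 2 == 1 else -1  # (-1)^(i+1)
        for combo in combinations(sets, i):
            total += sign * len(set().union(*combo))
    return total


a1 = {1, 2, 3, 4, 5}
a2 = {3, 4, 5, 6}
a3 = {4, 5, 6, 7, 8}
print(intersection_size_from_unions([a1, a2, a3]))  # vehicles 4 and 5 are common
```

Only union cardinalities appear on the right-hand side, which matters here because unions, unlike intersections, can be formed directly on Bloom filters.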

Nevertheless, in our traffic monitoring scenario, no information is available regarding the vehicle IDs that are present in each of the above sets, and therefore, it is not possible to compute an exact solution. Specifically, even though the individual cardinalities |Ai| can be disclosed by the corresponding RSUs, this is not advisable in terms of privacy, because it contributes to the adversary’s knowledge. On the other hand, it is infeasible to compute the exact cardinality for any of the set unions. As such, we can only rely on cardinality estimations that may be derived from the individual Bloom filters.

Recall, from Section 3.1, that we can compute the correct Bloom filter of the union of an arbitrary number of distinct Bloom filters, by applying the logical OR operation on the corresponding bit arrays. Therefore, to estimate the traffic flow across any number of receivers, we need a formula that estimates the number of elements stored in a Bloom filter, based on the number of bits that are set. To this end, we employ equation (6) to approximate the cardinality of the underlying set, by measuring the fraction of the bits that are “0.” More specifically, the probability that a bit is not set is given below:

$$P(0) = \left(1 - \frac{k}{m}\right)^{n}. \tag{12}$$

Equating P(0) with the observed fraction z/m of “0” bits, where z is the number of “0” bits in the Bloom filter, and solving for n gives us our estimate $\hat{n}$, which can be written as

$$\hat{n} = \frac{\ln(z/m)}{\ln(1 - k/m)}. \tag{13}$$

To summarize, given N unique RSUs, the transportation authority will estimate the underlying traffic flow as follows:
(1) With the cooperation of the trusted entities, decrypt the Bloom filters from the N receivers.
(2) Compute (with a logical OR) the Bloom filters from all the set unions that appear in equation (10).
(3) Use equation (13) to estimate the cardinality of every set appearing in equation (10).
(4) Substitute the computed values into equation (10) to obtain the traffic flow estimate.
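
Steps (2) to (4) can be sketched on plaintext bit arrays as shown below. The sketch assumes the zero-fraction estimator ln(z/m) / ln(1 − k/m) for the cardinality of one filter, and it models each vehicle as k distinct, fixed positions; the function names and the helper `make_filter` are hypothetical.

```python
import math
from itertools import combinations


def estimate_cardinality(bits, k):
    """Estimate the number of inserted vehicles from the fraction of '0' bits."""
    m = len(bits)
    zeros = bits.count(0)
    if zeros == 0:
        raise ValueError("Bloom filter is saturated; no estimate is possible")
    return math.log(zeros / m) / math.log(1 - k / m)


def estimate_flow(filters, k):
    """Inclusion-exclusion over the estimated cardinalities of all OR-ed filters."""
    total = 0.0
    for i in range(1, len(filters) + 1):
        sign = 1 if i % 2 == 1 else -1
        for combo in combinations(filters, i):
            union = [int(any(column)) for column in zip(*combo)]  # bitwise OR
            total += sign * estimate_cardinality(union, k)
    return total


def make_filter(vehicle_ids, positions, m):
    """Build a plaintext bit array; positions[v] holds the k indices of vehicle v."""
    bits = [0] * m
    for v in vehicle_ids:
        for i in positions[v]:
            bits[i] = 1
    return bits
```

The loop visits all 2^N − 1 non-empty subsets of the N filters, so the running time doubles with every additional receiver on the path.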

Note that equation (10) necessitates the enumeration of all combinations of i out of N elements, for i = 1 to N. As such, the computational complexity grows exponentially with N. Nevertheless, in our simulations, we were able to get reasonable running times for up to N = 14 receivers.

6. Simulation Experiments

In this section, we evaluate the performance of the proposed system in terms of accuracy and efficiency, and we compare our results with the current state-of-the-art approach. In addition, to test the accuracy of the system in a real environment, we run the protocol on a simulated road network model.

6.1. Simulation Environment

We simulated our system with a maximum of 14 receivers (RSUs) on a desktop machine with sixteen 3.0 GHz CPU cores and 64 GB of memory. To simulate the limited computational capabilities of the vehicles and RSUs, we employed a single CPU core to perform their tasks, i.e., Bloom filter encryption and aggregation, respectively. On the other hand, the server process that performs Bloom filter decryption and flow estimation utilized all sixteen cores in a multi-threaded implementation. Our code is written in C++ and we leveraged the OpenSSL library (https://www.openssl.org/) for arbitrary precision arithmetic operations. For sufficient security, the RSA modulus of the Paillier cryptosystem was set to 2048 bits, which produces ciphertexts of size 512 bytes.

6.2. Accuracy

Our system requires each RSU to maintain a Bloom filter of size m, containing n entries (vehicles). Since the main motive is to design a system for fine-grained traffic estimation, we considered a small number of vehicles (i.e., n ≤ 2000) at each receiver. In this scenario, new Bloom filters are generated at relatively short time intervals. Each passing vehicle v sets k random bits at positions H1(v), H2(v), ..., Hk(v) and, when n reaches a certain threshold (or a timer expires), the Bloom filter is sent to the transportation authority. A larger k results in more privacy (more uncertainty for an adversary) but incurs higher cost, because it requires larger Bloom filters. Specifically, we observed that the traffic flow estimation is more accurate when m ≥ kn. However, the system assures sufficient privacy to vehicles with k ≥ 4, as evident in equation (9). Similarly, the size of q also affects the accuracy of the system, because the bit error probability Perr is inversely proportional to q, as shown in equation (7). Nevertheless, our results indicated that the effect is negligible, even for a value as low as 128. To illustrate the simplicity and accuracy of our method, we used the same parameter settings in all the experiments: k = 4, m = 8000, and q = 128.

In our first set of experiments, we tested the traffic flow estimation accuracy for a varying number of receivers N (path length). We set the number of vehicles passing through a receiver to lie in the interval [nl, nh] and varied the number of common vehicles across the N receivers in the range of 10%–75% of nl (in increments of 1%). We also considered the case where the number of vehicles is identical across all receivers, i.e., nl = nh, which is common during peak hour traffic. As a performance metric, we used the average absolute difference (AAD) between the real and estimated traffic flows. In particular, we performed each experiment 1000 times and computed the AAD as follows:

$$\mathrm{AAD} = \frac{1}{1000} \sum_{i=1}^{1000} \left| n_i - \hat{n}_i \right|,$$

where $n_i$ and $\hat{n}_i$ are the actual and estimated traffic flows at iteration i.

Figure 2 shows the AAD (in percentage w.r.t. the real traffic flow) as a function of the real traffic flow, for the case where the number of vehicles at each RSU is constant (nl = nh = n). One important observation is that the accuracy does not decline when estimating the traffic flow across multiple RSUs; rather, the accuracy improves significantly when more RSUs are involved in the estimation. The second observation is that the estimation accuracy improves with increasing traffic flow, i.e., when the Bloom filters share a large number of common vehicles. For example, for 10 RSUs, n = 2000, and a real traffic flow of 200, the AAD is equal to 15, i.e., 7.5% of the real traffic flow. On the other hand, when the real traffic flow rises to 1500 vehicles, the AAD is just 12 (or 0.8%).

Nevertheless, in most cases, the number of vehicles captured by each RSU will vary. Therefore, in the second set of experiments, we measured the accuracy under this more realistic scenario. Figure 3 illustrates the results, which are quite similar to the case where n is fixed. In particular, we observe a minor impact on the estimation accuracy under longer path lengths, especially when the real traffic flow is low.

6.3. Comparison against the State-of-the-Art Protocol

In our earlier work [14], we compared our point-to-point traffic flow estimation protocol against Zhou et al.’s [13] method that is designed specifically for origin-destination flows. Therefore, in this study, we focus our comparison on Zhou et al.’s multipoint traffic estimation protocol [12] that can handle paths of arbitrary length. Recall that their method is similar to ours, in that they utilize a bit array of size m to encode the vehicle IDs that pass through an RSU. However, instead of a Bloom filter with k hash functions, they require each vehicle to submit one of s precomputed indices (chosen randomly) to each encountered RSU. In addition to the AAD, we also computed the standard deviation (σ) of the estimated values from the actual traffic, based on the following formula:

$$\sigma = \sqrt{\frac{1}{1000} \sum_{i=1}^{1000} \left( n_i - \hat{n}_i \right)^2}.$$

Again, $n_i$ and $\hat{n}_i$ are the actual and estimated traffic flows at iteration i.
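As a minimal sketch of how the two performance metrics might be computed from a series of runs, the snippet below implements the AAD as the mean of the absolute estimation errors and σ as the root of the mean squared error. The normalization of σ by the number of iterations is an assumption, and the function names and sample values are hypothetical, not experimental data.

```python
import math

def average_absolute_difference(actual, estimated):
    """Mean of |n_i - n_hat_i| over all iterations."""
    if not actual or len(actual) != len(estimated):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / len(actual)

def deviation_from_actual(actual, estimated):
    """Root of the mean squared difference between estimates and real flows.

    Assumption: the sum of squares is divided by the number of iterations.
    """
    if not actual or len(actual) != len(estimated):
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((a - e) ** 2 for a, e in zip(actual, estimated))
    return math.sqrt(squared / len(actual))

def as_percentage(value, real_flow):
    """Express a metric relative to the real traffic flow, in percent."""
    return 100.0 * value / real_flow

# Toy input: four iterations with a real flow of 200 vehicles each.
actual = [200, 200, 200, 200]
estimated = [190, 215, 205, 180]

aad = average_absolute_difference(actual, estimated)
sigma = deviation_from_actual(actual, estimated)
print(aad, as_percentage(aad, 200))  # 12.5 6.25
print(sigma)
```

The percentage form matches the way the AAD is reported in the figures, i.e., relative to the real traffic flow.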

In the first experiment, we consider a fixed number n of vehicles at each RSU, and also use the same bit array size m = 8000 for both methods. For Zhou et al., we tested three different versions, namely, for s ∈ {2, 4, 7}. Figures 4 and 5 depict the AAD (%) and standard deviation, respectively, as a function of the real traffic flow (for N = 2 receivers). Our approach clearly outperforms the competitors, especially for larger values of s. When s ≥ 4, our method reduces the AAD by a factor of 3–15. Note that the value s = 2 is not considered privacy-preserving, because it is very easy for an adversary to track the same vehicle across different RSUs (a vehicle can only choose between two random indices). Figures 6 and 7 show the same experiments for N = 3 receivers. Our approach is clearly superior to the state-of-the-art protocol, and the performance gap increases significantly compared to the case of 2 receivers. We do not present results for N > 3 receivers, because the accuracy of Zhou et al.’s approach drops considerably (this was also observed by the authors of [12]). On the other hand, our method remains very accurate when the number of RSUs increases, as illustrated in the previous section.

Next, we consider the case where the number of vehicles n varies across the different RSUs. Figures 8 and 9 illustrate the AAD (%) and standard deviation, respectively, as a function of the real traffic flow (N = 2). Similarly, Figures 10 and 11 plot the same performance metrics for the case of N = 3 receivers. Here, our protocol outperforms the state-of-the-art method by several orders of magnitude, especially for the case of 3 receivers.

We also tested the effect of m on Zhou et al.’s accuracy. In particular, the authors proposed to configure the bit array size at each RSU as $m = 2^{\log(nf)}$, where n is the number of vehicles in the current measurement period and f is the load factor. Following the authors’ recommendations, we set f = 4 and repeated the experiment where the number of vehicles n varies across the RSUs. The results are depicted in Figures 12–15. Overall, there is no significant improvement in terms of AAD and standard deviation. In fact, the accuracy is negatively affected when the variance of n across the RSUs is small (i.e., when n ∈ [1500, 2000]). Our protocol consistently outperforms the competitors by a very large factor and, more importantly, it does not exhibit any loss in accuracy when increasing the path length of the measurement. This is a very desirable feature for intelligent transportation systems that need to estimate the traffic flow on specific paths along the road network, instead of on an origin-destination basis.

6.4. Overhead

The main advantage of the current state-of-the-art protocols is the absence of cryptographic operations, which results in a very efficient implementation. However, this has a negative impact on both the privacy of the vehicles and the accuracy of the traffic flow estimation. Therefore, to illustrate the feasibility of our approach, we need to investigate the overhead of the cryptographic operations in terms of both computation and communication costs. These costs are directly related to the size of the encrypted Bloom filter, which is a function of the chosen parameters n, m, and q.

6.4.1. Computational Overhead

The computational overhead is the processing time spent on the cryptographic operations at each involved entity, i.e., the vehicles, the RSUs, and the server. The two basic operations involved in our method are modular exponentiation (for Paillier encryption/decryption) and modular multiplication (for Paillier homomorphic addition). In our software implementation on a single-core processor, these operations cost, on average, 8 ms and 0.015 ms, respectively. Notice that the overhead of the OTP operations is negligible compared to the public key operations and is, thus, not measured in our results.
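To make the two basic operations concrete, the following toy sketch shows textbook Paillier with the common generator choice g = n + 1: encryption and decryption each rely on modular exponentiation, whereas the homomorphic addition of two plaintexts is a single modular multiplication of their ciphertexts. The tiny primes are an assumption for readability only and offer no security; the function names are hypothetical and unrelated to our C++ implementation. The modular inverse via `pow(x, -1, n)` requires Python 3.8 or later.

```python
import math
import secrets

def keygen(p, q):
    """Toy Paillier key generation from two primes (generator g = n + 1)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # with g = n + 1, L(g^lam mod n^2) = lam mod n
    return n, (lam, mu)

def encrypt(n, m):
    """c = (n + 1)^m * r^n mod n^2, using modular exponentiations."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random r in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def add(n, c1, c2):
    """Homomorphic addition: one modular multiplication of ciphertexts."""
    return (c1 * c2) % (n * n)

def decrypt(n, private_key, c):
    """m = L(c^lam mod n^2) * mu mod n, where L(u) = (u - 1) / n."""
    lam, mu = private_key
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

# Toy primes (assumption); the real system uses a 2048-bit modulus.
n, private_key = keygen(1009, 1013)
c_sum = add(n, encrypt(n, 5), encrypt(n, 7))
print(decrypt(n, private_key, c_sum))  # 12
```

This asymmetry explains the measured costs: the vehicles and the server perform exponentiations, while the RSU only multiplies ciphertexts.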

Figure 16 shows the computational cost at the vehicle and the server as a function of the bit-length of q. For the vehicle, the cost involves the encryption of the OTP keys into multiple Paillier ciphertexts. The processing time grows slowly with increasing bit-length, because more ciphertexts are needed to store the entire Bloom filter. Even in the worst-case configuration, when $q = 2^{16}$ and m = 16000, the cost is just 1.8 sec, which is short enough for the vehicle to compute a new Bloom filter before reaching the next RSU. It is also possible for the vehicle to precompute (offline) several Bloom filters before the start of a new trip.

Similarly, the cost at the server shows the time needed to decrypt a single aggregate Bloom filter (each trusted party will incur this cost). Here, the server application employs all sixteen CPU cores, so the cost is greatly reduced. With the worst-case configuration, i.e., $q = 2^{16}$ and m = 16000, the decryption cost for one Bloom filter is just 310 ms. On the other hand, Figure 17 depicts the processing time required at the server to estimate the traffic flow across N receivers, using equation (10). As we explained in Section 5, the cost grows exponentially with N, because it necessitates the enumeration of all combinations of i out of N elements, for i = 1 to N. This cost clearly limits the maximum path length N that can be supported, but Figure 17 shows that the cost is quite reasonable for N < 15.
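The exponential growth can be made tangible by counting the combinations that have to be enumerated. The sketch below does not evaluate equation (10); it only iterates over all subsets of i out of N receivers, for i = 1 to N, and confirms that their number is 2^N − 1. The function name is hypothetical.

```python
from itertools import combinations

def count_enumerated_subsets(num_rsus):
    """Count all combinations of i out of N receivers, for i = 1..N."""
    rsus = range(num_rsus)
    return sum(
        1
        for i in range(1, num_rsus + 1)
        for _ in combinations(rsus, i)
    )

for n_rsus in (2, 3, 5, 10, 14):
    total = count_enumerated_subsets(n_rsus)
    assert total == 2 ** n_rsus - 1
    print(n_rsus, total)
```

For N = 14, the largest path length in our simulations, this amounts to 16383 subsets, and every additional receiver roughly doubles the work.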

Finally, the computational cost at the RSU involves only modular multiplications, which are considerably cheaper than exponentiations. In our implementation of the RSU-related operations, the RSU requires just 1 ms of CPU time to add one vehicle to the aggregate Bloom filter (with the configuration of n = 2000, m = 8000, and q = 128). At this rate, the RSU can process approximately 1000 vehicles/sec.

6.4.2. Communication Overhead

The communication cost is measured as the bandwidth usage between (i) the vehicles and the RSUs, and (ii) the RSUs and the server. It is worth noting that the communication cost is the bottleneck with regard to the vehicle processing throughput at the RSU. Indeed, the data rate of the DSRC protocol is between 6 and 27 Mbps, which can only support a limited number of Bloom filter transmissions within any given time period. As an example, when n = 2000, m = 8000, and q = 128, the size of a single Bloom filter is just 43 KB. Figure 18 shows the processing throughput as a function of the available bandwidth, which demonstrates that, at 10 Mbps, the system is able to accommodate the load of typical rush-hour traffic.

On the other hand, the bandwidth usage between the RSUs and the server is considerably lower. Specifically, after each aggregation period (which is in the order of a few minutes), the RSU has to send to the server a single aggregate Bloom filter. In our example above, this entails just 43 KB of data, i.e., an average data rate of 144 bytes/sec, for an aggregation period of 5 min. This is a data rate that is easily supported by 3G communication technologies.
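The back-of-the-envelope arithmetic behind these figures can be sketched as follows. The calculation assumes that 1 KB equals 1000 bytes and ignores any protocol headers, retransmissions, or channel contention, so the results are rough upper bounds rather than measured throughput. The function names are hypothetical.

```python
BLOOM_FILTER_BYTES = 43 * 1000  # 43 KB, assuming 1 KB = 1000 bytes

def vehicles_per_second(bandwidth_mbps, filter_bytes=BLOOM_FILTER_BYTES):
    """Bloom filter uploads per second that fit in the given bandwidth."""
    return bandwidth_mbps * 1_000_000 / (filter_bytes * 8)

def uplink_bytes_per_second(period_seconds, filter_bytes=BLOOM_FILTER_BYTES):
    """Average RSU-to-server rate for one aggregate filter per period."""
    return filter_bytes / period_seconds

# DSRC data rates quoted in the text range from 6 to 27 Mbps.
for mbps in (6, 10, 27):
    print(mbps, "Mbps:", round(vehicles_per_second(mbps), 1), "vehicles/sec")

# One aggregate Bloom filter every 5 minutes.
print(round(uplink_bytes_per_second(5 * 60), 1), "bytes/sec")
```

Under these assumptions, a 10 Mbps link carries about 29 Bloom filters per second, well below the approximately 1000 vehicles/sec that the RSU can process computationally, which is why the communication link is the bottleneck. The uplink figure comes out at about 143 bytes/sec, in line with the value quoted above.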

6.4.3. Road Network Simulation

To test the accuracy of our system in a more realistic environment, we estimated the traffic flows on a simulated traffic model, using a road segment with multiple entry and exit points for vehicles. The model is depicted in Figure 19. In particular, we consider five RSUs deployed along the road segment, where each consecutive pair is separated by a distance of 5 km (i.e., the length of the road segment is 20 km). There are a total of 2000 vehicles, with their entry-exit points fixed, as shown in the figure. For simplicity, all vehicles start entering the road segment at time t0 = 0, at a rate of one vehicle per 60 ms. Furthermore, we assume that every vehicle moves at a speed that is uniformly distributed in the interval 60–80 km/h. As such, the time needed to pass through two consecutive RSUs is upper bounded by 5 min, and the total simulation time is just over 20 min. Finally, we assume that each RSU generates a new Bloom filter every 5 min, so, to compute the traffic flow for a given time window (ti, tj), the server has to aggregate (for each involved RSU) the Bloom filters that fall within this time window. The aggregation is done with a logical OR of the individual Bloom filters, and the traffic flow among the selected RSUs is estimated via equation (10). The source code of our simulation is available on GitHub (https://github.com/rathoremazhar/Traffic-Flow-Estimation).
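The window-based aggregation described above can be sketched as follows, on decrypted (plaintext) bit arrays. The data layout is an assumption made for illustration: each RSU's Bloom filters are stored in a dictionary keyed by the end time (in minutes) of their 5-minute period, and the helper name `aggregate_window` is hypothetical. The flow estimation of equation (10) is not reproduced here.

```python
def aggregate_window(filters_by_period_end, t_start, t_end):
    """OR together the filters whose period end lies in (t_start, t_end].

    filters_by_period_end: dict mapping period end time (minutes) to a
    list of 0/1 bits; all filters are assumed to have the same length.
    Returns None if no filter falls within the window.
    """
    aggregate = None
    for period_end in sorted(filters_by_period_end):
        if t_start < period_end <= t_end:
            bits = filters_by_period_end[period_end]
            if aggregate is None:
                aggregate = list(bits)
            else:
                aggregate = [a | b for a, b in zip(aggregate, bits)]
    return aggregate

# Toy 8-bit filters for one RSU, one per 5-minute period.
rsu_filters = {
    5:  [1, 0, 0, 0, 1, 0, 0, 0],
    10: [0, 1, 0, 0, 1, 0, 0, 0],
    15: [0, 0, 0, 1, 0, 0, 0, 0],
    20: [0, 0, 0, 0, 0, 0, 1, 1],
}

# Window (0, 15] covers the first three periods only.
print(aggregate_window(rsu_filters, 0, 15))  # [1, 1, 0, 1, 1, 0, 0, 0]
```

The same aggregation would be repeated for every RSU on the path before the common vehicles are estimated from the resulting filters.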

Figure 20 illustrates the real and estimated traffic flows for different combinations of RSUs. Specifically, each arrow indicates the involved RSUs, while the time window T is depicted on the left-hand side of the figure. The number above each arrow conveys the estimated value and the one below shows the real traffic flow, i.e., the number of common vehicles. In our simulation, a vehicle needs at most five minutes to travel from one RSU to the next, so, to accurately compute the traffic flow between N RSUs, we must aggregate the Bloom filters from a time window of at least 5N minutes. For instance, to estimate the traffic flow between receivers A, B, and C, the correct time window to choose is (0, 15]. As evident in Figure 20, our protocol produces very accurate results, regardless of the number of RSUs or the underlying traffic volume.

7. Conclusions

We proposed a very simple and accurate protocol for estimating—in a privacy-preserving manner—the traffic flow across arbitrary paths on a road network. Our solution leverages roadside units that interact with the passing vehicles, in order to construct encrypted Bloom filters that summarize the underlying vehicle IDs. We performed an extensive simulation study, using diverse traffic characteristics, and our results are extremely promising. In particular, we demonstrated that our protocol’s estimations exhibit only a minor deviation from the real traffic flow. More importantly, the accuracy is maintained regardless of the underlying path length. We also compared our approach with the current state-of-the-art protocols and showed that it improves the estimation accuracy by a large factor. Finally, we implemented the cryptographic primitives involved in our method and demonstrated the feasibility and scalability of the system.

Data Availability

No data were used to support this study.

Disclosure

This paper is an extension of the conference paper “Privacy-preserving traffic flow estimation for road networks,” by E. Bentafat, M. M. Rathore, and S. Bakiras, published in Proc. IEEE GLOBECOM 2020 (https://ieeexplore.ieee.org/document/9322454).

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.