Abstract
There are various means of monitoring traffic conditions on roads. With the rise of artificial intelligence (AI) based image processing technology, there is growing interest in developing traffic monitoring systems that use camera vision data. This study provides a method for deriving traffic information using a camera installed at an intersection to improve road monitoring systems. The method uses a deep-learning-based approach (YOLOv4) for image processing for vehicle detection and vehicle type classification. Lane-by-lane vehicle trajectories are estimated by matching the detected vehicle locations with a high-definition map (HD map). Based on the estimated vehicle trajectories, the traffic volumes of each lane-by-lane traveling direction and the queue lengths of each lane are estimated. The performance of the proposed method was tested with thousands of samples according to five evaluation criteria: vehicle detection rate, vehicle type classification, trajectory prediction, traffic volume estimation, and queue length estimation. The results show a 99% vehicle detection rate, with errors below 20% in classifying vehicle types and estimating lane-by-lane travel volumes, which is reasonable. Hence, the method proposed in this study demonstrates the feasibility of collecting detailed traffic information using a camera installed at an intersection. The approach of combining AI and HD map techniques is the main contribution of this study and shows strong potential for improving current traffic monitoring systems.
1. Introduction
Urban road traffic is a complex phenomenon caused by interactions among various moving entities, such as vehicles and pedestrians. The growth in urban population during the past decades has raised the severity of urban traffic congestion, leading to socioeconomic and environmental problems in modern cities. To mitigate this issue, active efforts have been made to apply intelligent transportation systems (ITS) to urban roads. In this regard, traffic monitoring is one of the most valuable functions of traffic management systems (TMSs). Particularly in advanced TMSs (ATMSs), the real-time collection of precise information through traffic monitoring plays a crucial role for traffic managers when they develop various control strategies [1–3]. Furthermore, the detailed numerical status of real-time traffic, such as lane-by-lane travel volume and queue length, can be used as supplementary information for cooperative intelligent transportation system (C-ITS) operations based on autonomous vehicles [4, 5].
Traffic monitoring systems have been developed in various ways, and traffic information is collected indirectly or directly depending on the characteristics of a specific monitoring system. Indirect methods estimate traffic status, such as travel volume and travel time within a road section, based on data samples collected via roadside units (RSU) or global positioning systems (GPS), which are instances of automatic vehicle identification (AVI) technologies [6–8]. However, the estimation performance of these methods is highly dependent on the market penetration rate (MPR) of vehicles equipped for vehicle-to-infrastructure (V2I) communication. In contrast, direct methods measure traffic conditions using point sensors such as loop detectors [9–11], radars [12–14], and video cameras [15, 16]. Loop detectors have been widely used for traffic monitoring owing to their relatively high reliability in collecting travel volume, occupancy, and spot speed, but their installation and maintenance are complicated because they are normally installed on road surfaces [17]. Radar-based monitoring systems are relatively easier to install, but the hardware itself is more expensive [18]. Moreover, a common limitation of both loop detectors and radars is the difficulty in classifying vehicle types. Cameras, in contrast, are cheaper than radars, and camera-based monitoring systems can classify vehicle types [19]. They can also obtain traffic information separately for each lane at a road spot [17]. Cameras therefore have high potential for extracting more detailed traffic information at a specific location, but they require advanced image processing techniques to obtain reliable information. Automatic traffic data collection via camera-based monitoring systems can be operated at low cost only when proper image processing techniques support the system.
Various methods have been proposed in several studies related to automatic image processing techniques. Some studies from the early 2000s had focused on improving the poor performance with respect to vehicle detection owing to several technical issues, such as segmentation of objects in the background and shadows [20], difficulties in detecting dark-colored vehicles [21], differences in day- and night-vision data [22], and influences of weather conditions [23]. An attempt to develop a technique to detect accidents automatically was also reported [24].
Recently, studies began to focus on machine-learning and deep-learning techniques, and one of the most popular examples is the application of You Only Look Once (YOLO) to process traffic vision data [25]. YOLO has high applicability to real-time traffic monitoring because it can process multiple images faster than conventional region-based convolutional neural networks (R-CNNs). With the aid of deep-learning techniques such as YOLO or Faster R-CNN, several studies have attempted to improve the performance of detecting vehicles from real-time traffic vision data. Their common purpose was to accurately count vehicles for estimating traffic conditions at specific road spots [26]. Some of them specifically focused on detecting vehicles in captured scenes with many densely packed objects (vehicles) [27], while others focused on detecting small objects (vehicles) in complex scenes [28, 29]. Some studies have also attempted to distinguish road vehicles from pedestrians [30, 31].
Such object detection techniques have evolved into real-time visual object tracking approaches. Several studies have proposed methods for tracking multiple objects over time based on convolutional neural networks (CNNs) [32–34]. There are also examples of using the kernelized correlation filter (KCF) for high-speed tracking of objects on roads and even in waterway traffic [35, 36]. Within the context of object tracking on roads, a few studies have addressed tracking moving vehicles particularly for the purpose of collecting more detailed traffic behaviors [37]. They have proposed methods for extracting and analyzing the trajectories of multiple vehicles within a specified road spot to capture lane-change events [38] or measure the speeds of individual vehicles [39]. However, trajectories have so far been estimated only roughly, without accurately measuring vehicle positions over time. For example, current machine-learning-based image processing techniques may detect multiple vehicles as a single object when they travel along similar paths at similar speeds, even if they are in different road lanes. Hence, it is still difficult to obtain an accurate trajectory of a vehicle by tracking the exact position of the road lane where the vehicle is located. Obtaining accurate trajectories of multiple vehicles would be advantageous to traffic managers intending to improve the accuracy of collecting travel volume or queue length values in each traveling direction at an intersection. Furthermore, it would enable us to obtain information on different road lanes, which can be useful for deeper analysis of traffic flow behavior and for supporting autonomous vehicle operations.
Therefore, we present a method for deriving traffic information using a camera installed at an intersection for improving monitoring performance. The method uses a deep-learning-based approach for image processing for vehicle detection and vehicle type classification. Then, the method estimates lane-by-lane vehicle trajectories by matching the detected vehicle locations with the high-definition map (HD map). While estimating the vehicle trajectories, we attempt to reduce the error of estimating the center points of the bounding boxes in the images of vehicles to ensure proper performance of the HD map-matching process. Based on the estimated vehicle trajectories, the traffic volumes of each lane-by-lane traveling direction and queue lengths of each lane were estimated as well. In fact, this is not the first attempt to increase the accuracy of trajectory estimation to the lane level. The work in [40] had a similar purpose and approach but differs from the present study in that recent deep-learning techniques and HD map technology are combined for estimating vehicle positions accurately.
The remainder of this paper is organized as follows. Section 2 provides a description of the method of vehicle detection and classification, along with the method of matching the detected vehicle positions with the HD map for lane-by-lane trajectory estimation. Section 3 describes the settings for testing the performance of the proposed method, and Section 4 presents the test results. Section 5 concludes this paper and offers suggestions for further work.
2. AI-Based Vehicle Detection System at Intersection
2.1. Data Flow Framework
Current image processing technology can easily identify a single vehicle in a captured image, provided the image resolution is sufficient. The focus of this study, however, is on how to precisely extract traffic information about multiple vehicles on the road and how to handle the extracted data from a traffic monitoring perspective. Hence, it is necessary to consider the data flow framework of the camera-based vehicle detection system.
Figure 1 shows the data flow framework of the artificial intelligence (AI) based vehicle detection system for C-ITS. As shown, the system consists of four components: a roadside sensor, a traffic monitoring center, an RSU for communication, and an on-board unit (OBU) in the vehicle. In this study, traffic cameras installed at intersections were considered as the main roadside sensors. First, the vision data on the traffic status at an intersection are collected in real time via the roadside sensor and sent to a data collecting server in the traffic monitoring center. Using the vision data, the center then performs the vehicle detection task with the AI-based image processing technique. The information gathered from the vehicle detection task is used to extract and predict the trajectories of vehicles, and the trajectory data then undergo the HD map-matching task to improve the prediction accuracy. The information message containing the detected vehicles and their predicted trajectories is sent to the OBU in a subject vehicle via the RSU using infrastructure-to-vehicle (I2V) communication. When the message is received, the collision risk of the subject vehicle can be calculated based on the predicted trajectories and displayed on the vehicle monitoring system. The status of the subject vehicle can be sent back to the traffic monitoring center via the RSU using V2I communication.

The framework described above provides two major advantages in terms of C-ITS operations. The first is that vehicle-to-vehicle collisions can be prevented by providing vehicles traveling through the intersection with detection information about one another. A service that provides detailed information, such as vehicle location, speed, and abnormal status, can be implemented, and predictive information a few seconds ahead can be provided using the previously detected information. Second, a more detailed road status can be provided by extracting lane-by-lane traffic conditions near intersections. A service that provides the traffic volume and vehicle queue of each lane is possible, and a service that detects illegally parked vehicles on streets can also be implemented. In this study, we aim to realize these advantages of the framework. The focus of this study is to develop the methodologies for AI-based vehicle detection and HD map matching, which are the tasks of the traffic monitoring center described above.
2.2. AI-Based Vehicle Detection and Trajectory Prediction
In this study, a deep-learning algorithm is adopted to extract, from roadside sensor data, object information such as vehicle location, movement trajectory, and vehicle speed at intersections and their surrounding areas, and to estimate useful traffic information such as traffic volume and queue length. The proposed algorithm operates on vision data transmitted from the roadside sensors to a vision data collecting server located in the traffic monitoring center, and the predicted data are stored in a real-time database for real-time data communication. As shown in Figure 2, the proposed algorithm consists of (1) vehicle detection and classification with deep learning, (2) trajectory extraction, (3) trajectory correction, and (4) trajectory prediction, and the details are outlined as follows.

2.2.1. Vehicle Detection and Classification with Deep Learning
We used a deep-learning-based algorithm for vehicle detection because it has higher applicability to real-time traffic monitoring than other image processing techniques, such as traditional labeling, owing to its capability to process multiple images faster. The proposed system performs real-time detection of vehicle location and speed from the vision data sent from the vision data collecting server based on the YOLOv4 deep-learning algorithm and performs vehicle type classification. The YOLOv4 algorithm uses state-of-the-art deep-learning methods and is optimized, showing a 10% improvement in the detection accuracy index (MAP: mean average precision) and a 12% improvement in the detection speed index (FPS: frames per second) compared with YOLOv3, the previous version of the algorithm. In particular, YOLOv4 can process vision data efficiently, enhancing its applicability in the traffic safety sector, where detection, preprocessing, and warning message generation must be completed within 0.1 seconds.
In the process of vehicle detection and classification with deep learning, the algorithm processes the vision data frame by frame and primarily generates vehicle type information (e.g., cars, trucks, and buses) and pixel-based vehicle location information. For the vehicle type information, the classes derived from YOLOv4 can be used directly, and additional separate training was performed on the target site data to improve the accuracy of the vehicle type information. The vehicle location information was generated from the vertices and center point of the bounding box. This information was then converted into longitude and latitude coordinates referenced to the center point of the vehicle's bottom through correction.
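As an illustration of this detection step, the following is a minimal sketch of running a trained YOLOv4 model through OpenCV's DNN module and extracting the class, confidence, bounding box, and box center for each detected vehicle. The file names, input size, thresholds, and class list are placeholders, not the configuration used in this study.

```python
import cv2
import numpy as np

# Hypothetical file names for a YOLOv4 model trained on the intersection data.
CFG, WEIGHTS = "yolov4-intersection.cfg", "yolov4-intersection.weights"
CLASS_NAMES = ["car", "truck", "bus"]

net = cv2.dnn.readNetFromDarknet(CFG, WEIGHTS)
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(608, 608), scale=1 / 255.0, swapRB=True)

def detect_vehicles(frame, conf_thr=0.5, nms_thr=0.4):
    """Return (class name, confidence, bounding box, box center in pixels) per detected vehicle."""
    class_ids, confidences, boxes = model.detect(frame, conf_thr, nms_thr)
    results = []
    for cid, conf, (x, y, w, h) in zip(np.ravel(class_ids), np.ravel(confidences), boxes):
        center = (x + w / 2.0, y + h / 2.0)  # pixel center of the bounding box
        results.append((CLASS_NAMES[int(cid)], float(conf),
                        (int(x), int(y), int(w), int(h)), center))
    return results
```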
2.2.2. Trajectory Extraction
The vision data collected from the roadside are distorted when converting 3D real-world scenes into 2D images. Because of this distortion, a significant error occurs between the actual physical coordinates and the image coordinates, depending on the degree of distortion, when the location information detected in pixel units is converted directly into longitude and latitude coordinates. In this study, to remove this error, corrected vision data were generated from the distorted vision data by inverse application of the camera intrinsic parameters extracted through calibration. Note that the focal length, principal point, and distortion coefficients are the intrinsic parameters of the camera. The values of the intrinsic parameters were determined by projecting a 2D image into 3D world space. Also, note that an existing method is used for the distortion correction process in this study; for details of the distortion correction method, refer to the work by Seong et al. [40]. The distortion model used for the correction is as follows:

$$x_d = x_u\left(1 + k_1 r^2 + k_2 r^4\right) + 2p_1 x_u y_u + p_2\left(r^2 + 2x_u^2\right),$$
$$y_d = y_u\left(1 + k_1 r^2 + k_2 r^4\right) + p_1\left(r^2 + 2y_u^2\right) + 2p_2 x_u y_u,$$
$$u = f_x x_d + c_x, \quad v = f_y y_d + c_y, \quad u' = f_x x_u + c_x, \quad v' = f_y y_u + c_y,$$

where $(u, v)$ are the pixel coordinates of an image, $(u', v')$ are the pixel coordinates of the image corrected for distortion, $(x_d, y_d)$ are the normalized planar coordinates with distortion, $(x_u, y_u)$ are the normalized planar coordinates with corrected distortion, and $r^2 = x_u^2 + y_u^2$. Focal length: $f_x$; $f_y$. Principal point: $c_x$; $c_y$. Distortion: $k_1$, $k_2$, $p_1$, and $p_2$.
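The sketch below illustrates this kind of intrinsic-parameter-based correction using OpenCV's undistortion routines. The numeric intrinsic and distortion values are placeholders, not the calibration results of this study.

```python
import cv2
import numpy as np

# Placeholder intrinsic parameters from camera calibration (not the study's actual values).
fx, fy = 1400.0, 1400.0                         # focal lengths (pixels)
cx, cy = 960.0, 540.0                           # principal point (pixels)
k1, k2, p1, p2 = -0.30, 0.08, 0.001, -0.0005    # radial/tangential distortion coefficients

camera_matrix = np.array([[fx, 0, cx],
                          [0, fy, cy],
                          [0,  0,  1]], dtype=np.float64)
dist_coeffs = np.array([k1, k2, p1, p2], dtype=np.float64)

def undistort_frame(frame):
    """Apply the inverse of the lens distortion so pixels follow the ideal pinhole model."""
    return cv2.undistort(frame, camera_matrix, dist_coeffs)

def undistort_points(pixel_points):
    """Correct individual pixel coordinates (e.g., bounding-box centers) instead of the full image."""
    pts = np.asarray(pixel_points, dtype=np.float64).reshape(-1, 1, 2)
    return cv2.undistortPoints(pts, camera_matrix, dist_coeffs, P=camera_matrix).reshape(-1, 2)
```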
The vehicle location information detected in each image frame was extended across consecutive frames to extract the vehicle trajectory information and the data used for vehicle location correction. In the video images captured by a camera, the similarity between the feature information of objects in consecutive frames is used to track changes in object location. To track a vehicle's location, its location and size in the previous frame were compared with those of the vehicle objects detected in the next frame. The vehicle with the largest intersection over union (IoU) was classified as the same vehicle as the one in the previous frame, and the continuous movement of the vehicle was tracked based on this classification. If the location and size of a detected object did not overlap (no IoU) with any object in the previous frames within a set window (0.2 s), the object was recognized as a new object, and a new vehicle tracking ID was assigned.
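A minimal sketch of this IoU-based frame-to-frame association is shown below; the greedy matching strategy and the zero-overlap criterion for new tracks are simplifying assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h) in pixels."""
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def assign_track_ids(prev_tracks, detections, next_id, iou_thr=0.0):
    """Greedy IoU matching: each detection inherits the ID of the previous-frame box it
    overlaps most; detections with no overlap are registered as new vehicles."""
    tracks = {}
    for det in detections:
        best_id, best_iou = None, iou_thr
        for tid, prev_box in prev_tracks.items():
            score = iou(prev_box, det)
            if score > best_iou:
                best_id, best_iou = tid, score
        if best_id is None:
            best_id, next_id = next_id, next_id + 1  # no overlap: new vehicle tracking ID
        tracks[best_id] = det
    return tracks, next_id
```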
The pixel coordinates extracted from an image are converted using a matrix derived from the Transverse Mercator coordinates of four designated points in the HD map. The transformation matrix is obtained via homography, which generalizes the transformation relationship after collecting the map coordinates corresponding to the sample image coordinates. If there are four points $p_1$, $p_2$, $p_3$, and $p_4$ in a plane and these points are projected onto another plane as $p_1'$, $p_2'$, $p_3'$, and $p_4'$, there exists a $3 \times 3$ homography matrix $H$ satisfying $p_i' \sim H p_i$ for these corresponding points. The camera image coordinates are converted to real-world coordinates using this mechanism.
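The following sketch illustrates the homography-based conversion from image pixels to map coordinates using OpenCV; the four reference point coordinates are placeholder values, not the points used in this study.

```python
import cv2
import numpy as np

# Four reference pixel points in the camera image and their Transverse Mercator
# counterparts taken from the HD map (placeholder values for illustration).
image_pts = np.array([[412, 655], [1493, 640], [1710, 982], [205, 1001]], dtype=np.float32)
world_pts = np.array([[200512.3, 551728.9], [200545.1, 551730.2],
                      [200546.8, 551702.7], [200510.9, 551700.4]], dtype=np.float32)

H, _ = cv2.findHomography(image_pts, world_pts)  # 3x3 matrix relating the two planes

def pixel_to_world(u, v):
    """Project an image pixel onto the road plane in map coordinates using the homography."""
    p = np.array([[[u, v]]], dtype=np.float32)
    x, y = cv2.perspectiveTransform(p, H)[0, 0]
    return float(x), float(y)
```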
2.2.3. Trajectory Correction
In general, deep-learning-based vehicle detection extracts information in the form of a bounding box, and the center point of the bounding box represents the overall vehicle location. However, as shown in the example in Figure 3, when the vehicle location is extracted with reference to the center point of the bounding box, the result differs from the location referenced to the center point of the vehicle bottom, which is the information actually required for traffic monitoring. In addition, when the center point is estimated from the bounding box, an error occurs in the estimated position according to the heading shown (by the captured angle) in the vehicle image. This type of error can lead to a further error in trajectory prediction, which in turn can lower the performance of the HD map-matching process used later to extract lane-by-lane traffic information. Furthermore, if a trajectory prediction containing such an error were used in a vehicle's collision warning or avoidance system, it could degrade the performance of the safety system. Hence, it is necessary to reduce the errors in estimating the center point of the bounding box.

In this study, to reduce the error in center point estimation, real-time correction of vehicle location was performed through the following two steps: (1) extracting the heading and determining the traveling direction of the vehicle and (2) estimating the shape of the vehicle bottom and correcting the location.
For the first task, the vehicle heading was obtained using the pixel coordinates detected in the vision data collected from the road (the bounding box center point) and the pixel coordinates of the previous frame, as shown in Figure 4. The heading of a vehicle is extracted through the following steps: (1) The vehicle position in the previous image frame and the position in the current image frame are converted into map coordinates using the transformation matrix. (2) The angle formed by the two positions is calculated from the coordinate differences, and the distance between the two positions is calculated using the Pythagorean theorem. The extracted heading for each frame was corrected with a low-pass filter as follows:

$$\theta_c = \alpha\,\theta_{t-1} + \left(1 - \alpha\right)\theta_t,$$

where $\theta_c$ is the corrected heading, $\theta_{t-1}$ is the heading at the previous time step, $\theta_t$ is the heading at the current time step, and $\alpha$ is the weight.
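A minimal sketch of the heading extraction and low-pass filtering is given below; the weight value and the handling of the 0/360 degree boundary are illustrative assumptions, not taken from the paper.

```python
import math

def heading_between(prev_xy, curr_xy):
    """Heading (degrees, clockwise from north) of the segment between two map positions."""
    dx, dy = curr_xy[0] - prev_xy[0], curr_xy[1] - prev_xy[1]
    return math.degrees(math.atan2(dx, dy)) % 360.0

def smooth_heading(prev_heading, new_heading, weight=0.8):
    """First-order low-pass filter on the heading; the shortest angular difference is used
    so the filter does not jump at the 0/360 degree boundary."""
    diff = (new_heading - prev_heading + 180.0) % 360.0 - 180.0
    return (prev_heading + (1.0 - weight) * diff) % 360.0
```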

The vehicle traveling direction and the vertical direction are derived using the heading obtained from the real-time estimation and the detected pixel coordinates. Figure 4(a) shows the corrected results of the low-pass filter. In Figure 4(b), the orange and blue colored lines represent the filtered and raw data, respectively. Noisy data points and variation in heading information are smoothed using a low-pass filter.
Based on the previously derived vehicle traveling direction and vehicle type, the shape of the vehicle bottom was estimated, as shown in Figure 3. Because vehicle height varies by type, the shape of the bottom surface within the bounding box is estimated by applying the average vehicle height for each vehicle type. The bottom surface is estimated through the following steps: (1) The center points of the bounding boxes in the previous image frame and the current image frame are converted into Transverse Mercator coordinates. (2) Since the vector formed by the two center points is the moving direction of the vehicle, a hypothetical vector perpendicular to the moving direction is drawn to create a rectangular vehicle bottom shape (assuming that vehicles are rectangular when viewed from the top). (3) Let $H$ be the height of the camera above the ground surface, $h$ the height of the vehicle, $d$ the distance along the surface between the camera and the vehicle, and $D$ the distance along the surface between the camera and the point where the line connecting the camera and the top of the vehicle meets the surface. Here, $H$, $h$, and $D$ are obtained directly from the image data, and $d$ can then be calculated by the triangle proportionality theorem. Note that the vehicle height is taken as half of the actual height because the center point of the bounding box detected in the image usually lies at half the vehicle height. Based on this method, the four corner points (in 3D coordinates) of the vehicle bottom are estimated. (4) The 3D coordinates of the vehicle bottom are then converted into image coordinates using the inverse transformation matrix, which finalizes the estimation of the vehicle bottom. The center point of the vehicle's bottom surface is extracted from the estimated pixel information of the bottom surface, and the final pixel-based location of the vehicle is derived from this point.
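The following sketch illustrates one reading of the triangle proportionality step, in which the ground distance to the vehicle bottom is recovered from the camera height, the assumed (halved) vehicle height, and the ground intersection of the ray through the box center. The geometric relation and the numeric values are illustrative assumptions, not the paper's stated formula.

```python
def bottom_distance(camera_height, vehicle_height, top_ray_ground_distance):
    """Similar-triangle correction: the ray from the camera through the detected box center
    hits the ground at distance D beyond the vehicle; the vehicle bottom itself is at
    d = D * (H - h) / H (assumed relation, H = camera height, h = halved vehicle height)."""
    H, h, D = camera_height, vehicle_height, top_ray_ground_distance
    return D * (H - h) / H

# Example with placeholder numbers: camera 12 m above the road, box center at half of an
# assumed 1.5 m car height, ray hitting the ground 40 m away -> bottom at about 37.5 m.
d = bottom_distance(camera_height=12.0, vehicle_height=0.75, top_ray_ground_distance=40.0)
```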
2.2.4. Trajectory Prediction
Using the previously derived real-time trajectory data of the vehicle, the upcoming vehicle trajectory from 1 to 3 seconds ahead was estimated. The location information for each time slot was used to estimate the future trajectory of the vehicle. A polynomial curve fitting algorithm was used, as shown in the following equation, applying a linear (first-degree) fit if the past data indicate a vehicle traveling straight or a quadratic (second-degree) fit for a turning vehicle, to extrapolate the future location of the vehicle:

$$\hat{y}(x) = \sum_{k=0}^{n} a_k x^k, \quad n = 1 \text{ (traveling straight)}, \; n = 2 \text{ (turning)},$$

where $a_k$ are the fitted polynomial coefficients.
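A minimal sketch of this polynomial extrapolation using NumPy is shown below; the function name, the fixed 1 to 3 second horizon, and the separate fits of x(t) and y(t) are illustrative assumptions.

```python
import numpy as np

def predict_future_positions(times, xs, ys, horizon=(1.0, 2.0, 3.0), turning=False):
    """Fit x(t) and y(t) with a 1st-degree (straight) or 2nd-degree (turning) polynomial
    over the recent trajectory and extrapolate 1-3 s ahead of the last observation."""
    deg = 2 if turning else 1
    px = np.polyfit(times, xs, deg)
    py = np.polyfit(times, ys, deg)
    t_future = np.asarray(horizon) + times[-1]
    return np.polyval(px, t_future), np.polyval(py, t_future)

# Example usage with placeholder trajectory samples (t in seconds, x/y in metres).
xf, yf = predict_future_positions([0.0, 0.1, 0.2, 0.3], [0.0, 1.2, 2.5, 3.7],
                                  [0.0, 0.1, 0.1, 0.2], turning=False)
```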
When the future location of the vehicle is estimated from the detected vehicle location information alone, the prediction performance decreases at the intersection approach, where fewer points exist in the trajectory data. To address this limitation, the HD map previously built for the intersection was used, as shown by the solid black lines in Figure 5. Using the location information of each link in the HD map, the future vehicle location was estimated under the assumption that the vehicle trajectory follows the shape of the HD map link; the estimated result is shown in Figure 5. The blue solid line represents the ground truth, the green and blue dotted lines represent the HD map links to which the detected vehicle is assigned, and the red dotted line represents the estimated future location of the vehicle.

2.3. Provision of V2X Communication-Based Detection Information
2.3.1. Generation of HD Map-Based Information
The current C-ITS provides information on events such as unforeseen incidents or accidents via messages that include longitude and latitude data. This type of information has the advantage of general usability but becomes disadvantageous when the number of messages increases sharply with the number of related pieces of information. Furthermore, in existing C-ITS based on location information, the computational load increases rapidly with the number of messages when matching the predicted vehicle trajectory with the point of occurrence of each event. Moreover, when the trajectory is predicted from the past trajectory alone, the accuracy decreases at curved sections and intersections, which further reduces the accuracy of event matching. Therefore, in this study, to overcome the limitations of sending location information based only on longitude and latitude, the AI-based detection and prediction information described in the previous subsection was combined with the HD map link information, as shown in Figure 6(a).

Figure 6(b) shows the HD map link allocation algorithm. First, in the link information extraction process, attributes such as the length, linearity, and type of each link and the longitude and latitude of its start and end points are extracted from the link attribute information of the HD map. This HD map information is compared with the detected location coordinates of the vehicle, and the vehicle is matched to the nearest link, which identifies the lane on which the vehicle is currently traveling. Figure 7 shows an example of HD map link allocation based on the trajectories of a forward-traveling vehicle and a turning vehicle. As shown in the figure, whether the vehicle is traveling forward or turning is determined from the vehicle trajectory over the past 1 s. If the vehicle is determined to be traveling forward, links of the forward-traveling type are extracted from the HD map links, and the extracted candidate links and the 1 s vehicle trajectory are matched based on their start and end points, thereby selecting the closest HD map link. Finally, the distance-based candidate link is compared with the heading of the vehicle traveling direction, and when the two are consistent within a set threshold, the HD map link is allocated.
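The following is a simplified sketch of this link allocation logic. It matches a vehicle position and heading against candidate links of the appropriate movement type, whereas the paper matches the start and end points of the 1 s trajectory; the link attribute fields and the 30 degree threshold are assumptions for illustration.

```python
import math

def allocate_link(vehicle_xy, vehicle_heading, is_turning, links, heading_thr=30.0):
    """Assign the detected vehicle to the nearest HD map link of the matching movement type
    whose link heading agrees with the vehicle heading within a threshold."""
    best_link, best_dist = None, float("inf")
    for link in links:  # link: dict with 'type' ('straight'/'turn'), 'start', 'end', 'heading'
        if (link["type"] == "turn") != is_turning:
            continue  # keep only candidate links of the same movement type
        dist = min(math.dist(vehicle_xy, link["start"]), math.dist(vehicle_xy, link["end"]))
        diff = abs((vehicle_heading - link["heading"] + 180.0) % 360.0 - 180.0)
        if diff <= heading_thr and dist < best_dist:
            best_link, best_dist = link, dist
    return best_link
```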

To enhance the applicability of the AI-extracted information, the information extracted from the vision sensor is allocated in HD map link units. Then, the number of vehicles present in each link, which represents density, the most necessary information in traffic management, and the queue length are generated per link. The density is calculated as the difference between the number of vehicles entering at the start point of the link and the number leaving at the end point of the link. For the queue length, when a vehicle's average speed over the last 1 s is smaller than the set speed for the corresponding HD map link, the vehicle is classified as a vehicle in the queue. To improve the applicability of the information, the queue length is expressed as offsets along the HD map link. For example, if the length of the HD map link is 50 m, the start point of the link is set to 0 and the end point to 50 along the vehicle traveling direction. Based on these values, when the vehicle queue extends 20 m from the end point of the link, the start of the queue is at offset 30 and the end at offset 50.
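A minimal sketch of the queue classification and offset computation is shown below; the speed threshold and the vehicle record fields are illustrative assumptions.

```python
def link_queue_offsets(vehicles, link_length, speed_threshold=1.0):
    """Classify queued vehicles (average speed over the last 1 s below a threshold) and express
    the queue as start/end offsets measured along the link from its start point (0) to its end."""
    queued = [v for v in vehicles if v["avg_speed_1s"] < speed_threshold]
    if not queued:
        return None
    start_offset = min(v["offset"] for v in queued)                  # v["offset"]: metres from link start
    end_offset = min(max(v["offset"] for v in queued), link_length)
    return start_offset, end_offset  # e.g., (30, 50) for a 20 m queue at the end of a 50 m link
```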
2.3.2. Data Design for V2X Communication-Based Information Provision
Data converted based on the link format of the HD map are stored in the server in the format shown in Tables 1 and 2 so that they can be utilized in C-ITS messages in the future. Table 1 shows the storage format of vehicle information, which is used for storing and sharing the object information (vehicle type and longitude and latitude coordinates) extracted by the AI-based detection. In addition, to improve the applicability of the information and the accuracy of matching with the vehicle trajectory, the information allocated to the HD map link is combined. Providing predicted information of vehicle objects based on the HD map link ID also facilitates the calculation of the collision probability in the future traveling direction of an autonomous vehicle.
Table 2 shows the storage format of the data processed primarily to facilitate the application of the AI-based detection information to the traffic management field. As described above, the number of vehicles present in the link (density), the queue length, and the average speed are generated with reference to the HD map link. As with the storage format of vehicle information (Table 1), the predicted information is provided to facilitate the calculation of the collision probability in the future traveling direction of an autonomous vehicle. As shown in Tables 1 and 2, the proposed system not only enhances the applicability of the information by actively utilizing HD map links but also provides the object extraction results and the traffic-related information in a common format, considering that different information attributes may be needed depending on the situation.
3. Target Site for Application of the Proposed Method and Evaluation
3.1. Target Site
Figure 8 shows the target site for applying the AI-based vehicle detection and prediction technique proposed in this study. The proposed system was evaluated using data collected over three days, and the data for accuracy verification were generated in two steps as follows. First, the ground truth data for calculating the accuracy of the vehicle location information were generated using a drone by capturing vertical images of the same area covered by the roadside vision sensor. Second, information such as the vehicle type, the number of vehicles in a link, and the queue length was generated based on a field survey, and the vehicle type and the number of vehicles were counted manually from visual observation of the image data. To prevent human errors in counting, a cross-check and a final check were performed using labeled image data.

3.2. Accuracy Evaluation
This subsection describes how the performance of the proposed methodologies is evaluated, based on five evaluation criteria: vehicle detection rate, vehicle type classification, trajectory prediction, traffic volume estimation, and queue length estimation.
3.2.1. Accuracy of Vehicle Detection and Classification
To evaluate the vehicle detection performance, the detection rate was calculated to determine whether all vehicles were successfully detected regardless of vehicle type. As in equation (4), it is defined as the ratio of the total number of detected objects to the total number of ground truths:

$$\text{Detection rate} = \frac{N_{\text{detected}}}{N_{\text{ground truth}}} \times 100\ (\%). \quad (4)$$
Vehicle classification performance is evaluated through MAP (mean average precision), a performance evaluation index widely used in the field of computer vision. MAP is the mean of the average precision (AP) values of each vehicle type. The AP represents the performance of the classification algorithm as a single value and is calculated as the area under the precision-recall curve. As in equation (5), the precision is the ratio of the number of correct vehicle type classifications (true positives) to the number of all detected vehicles (the sum of true and false positives). The recall is the ratio of the number of correct answers (true positives) to the number of all ground truths (the sum of true positives and false negatives), as in equation (6):

$$\text{Precision} = \frac{TP}{TP + FP}, \quad (5)$$
$$\text{Recall} = \frac{TP}{TP + FN}. \quad (6)$$

Precision and recall generally trade off against each other; hence, changes in this relationship are analyzed to properly evaluate the overall performance of the proposed method.
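For illustration, the following sketch computes precision, recall, and a trapezoidal approximation of AP from a set of (recall, precision) points. The paper does not state which AP interpolation convention it uses, so this is only one common choice.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """Area under the precision-recall curve via trapezoidal integration
    (one of several common AP conventions)."""
    r = np.asarray(recalls, dtype=float)
    p = np.asarray(precisions, dtype=float)
    order = np.argsort(r)
    r, p = r[order], p[order]
    return float(np.sum(np.diff(r) * (p[1:] + p[:-1]) / 2.0))
```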
3.2.2. Accuracy of Vehicle Trajectory Estimation
The performance of the vehicle trajectory prediction was evaluated by comparing the predicted and actual trajectories. The predicted trajectory is the set of coordinates within the intersection derived by the AI-based detection technique, while the actual trajectory is generated directly from the image data. The average Euclidean distance is used to calculate the prediction accuracy, as shown in equation (7):

$$\bar{d} = \frac{1}{n}\sum_{t=1}^{n}\sqrt{\left(x_t - \hat{x}_t\right)^2 + \left(y_t - \hat{y}_t\right)^2}, \quad (7)$$

where $n$ is the number of coordinate pairs used for the comparison, $(x_t, y_t)$ are the actual coordinates of the vehicle location at time $t$, and $(\hat{x}_t, \hat{y}_t)$ are the predicted coordinates at time $t$.
3.2.3. Accuracy of Traffic Volume Estimation
Traffic volume estimation was evaluated by comparing the number of vehicles counted by the image processing technique (estimated value) and the number counted manually (actual value). The evaluation was performed by calculating the root mean square error (RMSE) and the mean absolute percentage error (MAPE). The former indicates the magnitude of the difference between the estimated and actual values and is calculated using equation (8):

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}, \quad (8)$$

where $n$ is the number of data points used for the comparison, $\hat{y}_i$ is the $i$-th predicted value, and $y_i$ is the $i$-th actual value of the traffic volume.
However, the RMSE is highly influenced by the scale of the estimation subject (scale-dependent error) and tends to emphasize large errors over small ones. Hence, we also calculate the MAPE, which is independent of the scale of the estimation subject, using equation (9):

$$\text{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|\ (\%), \quad (9)$$

where $n$ is the number of data points used for the comparison, $\hat{y}_i$ is the $i$-th predicted value, and $y_i$ is the $i$-th actual value of the traffic volume.
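A direct transcription of equations (8) and (9) as a sketch (assuming nonzero actual values for the MAPE) is given below.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean square error, equation (8)."""
    a, p = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((p - a) ** 2)))

def mape(actual, predicted):
    """Mean absolute percentage error in percent, equation (9); actual values must be nonzero."""
    a, p = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((a - p) / a)) * 100.0)
```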
3.2.4. Accuracy of Queue Length Estimation
The evaluation of the queue length estimation is similar to that of the traffic volume estimation. It is performed by comparing the queue length in meters derived by the image processing technique (estimated value) with that collected from the drone images (actual value). Here, we calculate the RMSE of the queue length estimation using equation (8) and the MAPE using equation (9), where $y_i$ is now the $i$-th actual value of the queue length and $\hat{y}_i$ is the corresponding estimated value.
4. Result of Applying AI-Based Vehicle Detection and Trajectory Prediction
4.1. Accuracy of Vehicle Detection and Classification
Figure 9 shows an example of vehicle detection using the proposed training model based on YOLOv4. The system detects vehicles within the detection range and saves the vehicle classification results and the bounding box coordinates as an image file (.jpg) and a data file (.txt) with the same filename. Using the data collected by the drone (considered the actual data) and the data extracted by the classification model, the performance evaluation was performed with the detection rate and MAP described in the previous section. The number of tested samples was 6,804. As a result, the detection rate was 99%, indicating that detected objects were identified as vehicles very reliably. In terms of vehicle type classification, the AP was 95% for cars, 87% for trucks, and 81% for buses.

4.2. Accuracy of Vehicle Trajectory Extraction
For vehicle trajectory extraction, the locations of a vehicle are first extracted in each frame by the proposed method during preprocessing. The vehicle locations are then projected onto the drone image. By mapping the vehicle locations from the drone image and those from the proposed method into the same coordinate system, the pair of areas with the largest overlap is identified as the same vehicle. As shown in the figure, the vehicle locations extracted by the proposed method (red box area) are projected onto the drone image, in which the actual vehicle locations are displayed (blue box area), to determine that they belong to the same vehicle. The performance of the vehicle trajectory prediction is evaluated using these two trajectories, as described in the previous section. The number of tested samples was 60,531. As a result, the average Euclidean distance was 1.138 m.
4.3. Accuracy of Traffic Volume Estimation
Figure 10 shows an example of comparing drone and camera images for the performance evaluation of traffic volume estimation. The test area is the blue box area within the intersection. As shown in the figure, identification names are assigned to each entry/exit lane, and the lane-by-lane travel direction pairs of the vehicles can be seen in Tables 3 and 4. The number of vehicles in each lane-by-lane traveling direction was counted manually from the drone images to obtain the actual data. In contrast, the proposed method extracts the number of vehicles in each lane-level traveling direction from the camera images to obtain the estimated data. Table 3 shows the traffic volume in each traveling direction counted from the drone images, and Table 4 shows that extracted from the camera data. In these tables, the notations in the second column represent the identification numbers of the departure lanes (I_1 to I_12) on the approaching roads (Road_1 to Road_4), while those in the second row are the identification numbers of the arrival lanes (O_1 to O_12) on the roads in each direction. For example, if five vehicles pass through the intersection from the right-most lane of Road_1 (I_1) to the left-most lane of Road_3 (O_5), the count of 5 is recorded in the corresponding cell. Hence, each table represents the lane-by-lane vehicle counts (travel volumes) for all departure-arrival pairs. The "unknown" entries in the latter table correspond to cases in which the camera-based system failed to detect a vehicle. Comparing the two results, the RMSE is 4.20 vehicles, and the MAPE is 16.41%.

4.4. Accuracy of Queue Length Derivation
Figure 11 shows an example of a drone image used for queue length derivation, which was also performed manually. A person selects the starting and ending points of the vehicles within the delayed section of the road. Then, data containing the bounding box information of the vehicles at the start and end points, the map coordinates, and the queue length within the image are saved. Using this information, the true value of the queue length is calculated by converting the values to the real-world scale, which is considered the actual queue length. In contrast, the proposed method derives the queue length directly through HD map matching to obtain the estimated data, which is compared with the actual queue length from the drone image. The number of tested samples was 62,205. Comparing the two, the RMSE is 2.37 m, and the MAPE is 13.25%.

4.5. Comprehensive Evaluation
The overall performance of the proposed method is summarized in Table 5. As described in the previous subsections, the detection rate is the total number of detected objects over the total number of ground truths, and a detection performance of 99% over 6,804 attempts is achieved, which can be judged to be highly consistent. Vehicle classification performance is evaluated in terms of MAP; with 6,804 test samples, the AP values were 95%, 87%, and 81% for cars, trucks, and buses, respectively. Hence, the proposed method also shows reasonable performance in classifying vehicle types. In terms of trajectory prediction, the average Euclidean distance was 1.138 m over 60,531 test samples; such a low error indicates the high performance of the proposed method. For the traffic volume and queue length estimations, the RMSE values are only 4.20 vehicles for vehicle counting and 3.08 m for queue length estimation over more than 60,000 test samples, and the MAPE values are below 20%, which means that the performance of the proposed method is reasonable, particularly for estimating lane-by-lane traffic information. Overall, based on the analyses of the five evaluation criteria, the method proposed in this study demonstrates the feasibility of collecting detailed traffic information with a camera installed at an intersection. In addition, the average time taken from image collection through data processing to data storage in the server is 0.034 seconds, showing that the entire process can generally be completed within 0.1 seconds. Considering these results, the proposed method is a highly promising technology for application in the fields of ITS and C-ITS.
5. Conclusion
In this study, we considered a method for deriving traffic information using a camera installed at an intersection to improve the monitoring system for roads. The method uses a deep-learning-based approach for image processing for vehicle detection and vehicle type classification. The method then estimates the lane-by-lane vehicle trajectories using the detected locations of vehicles. Based on the estimated vehicle trajectories, the traffic volumes of each lane-by-lane traveling direction and queue lengths of each lane were estimated. The performance of the proposed method was tested with thousands of samples according to five different evaluation criteria: vehicle detection rate, vehicle type classification, trajectory prediction, traffic volume estimation, and queue length estimation. As a result, the method shows the feasibility of collecting detailed traffic information with a camera installed at an intersection.
The proposed method makes two main research contributions. It has shown high accuracy in (1) real-time vehicle detection and classification based on deep-learning-based image processing and (2) estimating lane-by-lane vehicle trajectories by matching the detected vehicle locations with the HD map. While estimating the vehicle trajectories, this study attempted to reduce the error in estimating the center points of the bounding boxes of vehicles to ensure proper performance of the HD map-matching process. Hence, the approach of combining AI and HD map techniques is the main contribution of this study, and it shows strong potential for improving current traffic monitoring systems.
Although the proposed method has shown reasonable performance, this study is not without limitations. The error rates for both lane-by-lane traffic volume and queue length estimations exceed 15% even though vehicle detection achieved 99% performance, which is reasonable but not yet sufficient in terms of the reliability of the traffic information. This is due to intermittent mismatches between the vehicle locations in the camera images and the HD map coordinates. Hence, further studies should consider enhancing the matching performance between camera image-based data and map data. Furthermore, the results of this study confirmed that the error increases with the distance between the camera and the vehicle. Thus, investigating the minimum required distance between the camera and the intersection area can be a topic for future studies. In addition, further research is required to develop a lane-level vehicle location correction algorithm. It is also necessary to perform additional training on trucks and buses to further improve the detection rate. Subsequent studies should consider these limitations for the further development of image processing-based traffic monitoring systems.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
The authors would like to acknowledge Editage (http://www.editage.co.kr) for English language editing. This research was supported by the Ministry of Land, Infrastructure, and Transport (MOLIT, Korea) under the Connected and Automated Public Transport Innovation National R&D Project (Grant no. 21TLRP-B146733-04).