Abstract

Environmental protection is a fundamental policy in many countries, and vehicle emission pollution stands out as a main component of the pollutants tracked in environmental monitoring. Remote sensing technology has recently been widely applied to vehicle emission detection, mainly because of the speed, realism, and large scale of the detection data it retrieves. During the remote sensing process, the fuel type and registration time of new cars and nonlocally registered vehicles usually cannot be accessed, so the vehicle pollution situation cannot be assessed directly from the measured emission pollutants. To handle this problem, this paper applies data mining methods to the remote sensing data to predict fuel type and registration time. It makes full use of decision tree, random forest, AdaBoost, XgBoost, and their fusion models to predict these two essential pieces of information precisely and further employs them in an essential application: vehicle emission evaluation.

1. Introduction

Vehicles have become ever more common in daily life with the expansion of urbanization around the world. Gasoline-engine vehicles remain far more popular and widely used than new energy ones, and the pollutant gases they emit, such as carbon dioxide, carbon monoxide, hydrocarbons, and nitrogen oxides, have become the main contaminants in urban atmospheric pollution [1]. Efficient vehicle pollution detection has therefore become an urgent task that attracts more and more attention. Exhaust emission detection methods have evolved from periodic inspection at environmental monitoring stations to daily on-road detection with remote sensing technology. This paper studies vehicle emission detection in cities of China, one of the largest developing countries.

In the USA, the EPA (Environmental Protection Agency) proposed the MOVES model [2] to calculate vehicle emission rates at fixed locations and periods of time. The Japanese government enforces a nationwide vehicle exhaust emission monitoring system, and the emission behaviour of each vehicle in Japan can be checked on the official website of the national transportation authority [3]. To capture emission detection results rapidly, a French transport agency collects emission-related information from different places and aggregates it into a shared network for vehicle emission detection [4]. Related research in this area started somewhat later in China. In 2011, Cheng et al. [5] systematically analyzed the harm caused by vehicle emissions, verifying the necessity of exhaust emission control. The following year, Wu [6] collected the CO2, HC, CO, and NO values exhausted by 1092 vehicles in Xianyang city using a simplified loaded mode. They established regression equations between the emission values and vehicle information and found that the average emission value was highly related to the vehicle condition and age. Referring to the local standards, they further gave a systematic explanation for the rationality of the local mean emission limits based on their research. With the development of remote sensing technology, large amounts of practical exhaust emission data can now be obtained by environmental protection agencies in China. This paper applies data mining technology to these valuable data to extract useful information for vehicle exhaust emission detection. This research has great potential to help environmental protection departments accurately identify unqualified vehicles and to provide a theoretical basis for policymakers.

The first successful vehicle emissions demonstration system was probably the across-road vehicle emissions remote sensing system (VERSS) developed by Gary Bishop and colleagues at the University of Denver in the late 1980s [7, 8]. Their first instrument, a liquid-nitrogen-cooled nondispersive infrared (NDIR) sensor, could measure only CO and CO2. Over the next two decades, the team continuously refined the system: they added hydrocarbon, H2O, and NO channels to the NDIR system [9, 10], integrated and improved an ultraviolet spectrophotometer to enhance NO measurement [11, 12], and removed the dependence on liquid nitrogen cooling [13]. The Denver group also designed another commonly used remote sensing device, known as the fuel efficiency automobile test, providing some of the earliest results on across-road particulate measurement [14]. Many other sensing systems, typically based on multiple spectrometric approaches, have been proposed for detecting passing vehicle emissions [15–17]. More recently, Hager Environmental and Atmospheric Technologies introduced an infrared laser-based VERSS named the Emission Detection and Reporting (EDAR) system, which incorporates several new functions, making it a particularly interesting system for vehicle emission detection.

Important information is buried in vehicle emission remote sensing data. This paper exploits data mining methods to process these data and obtain valuable knowledge from them. There are three main directions in data mining: improvements of classical data mining algorithms, ensemble learning algorithms, and data mining with deep learning. Improvements on classical algorithms are usually designed and employed for specific application scenarios, taking additional information into consideration. Ensemble learning is the integration of multiple learners within a certain structure; it completes learning tasks by constructing and combining different learners. Its general procedure can be summarized as follows: first generate a set of individual learners, then combine them with some strategy. The combining strategies mainly include averaging, voting, and learning methods. Bagging and boosting [18] are the most commonly used ensemble learning algorithms, improving the accuracy and robustness of prediction models. With the rapid development and popularization of deep learning, it plays an increasingly important role in data mining with the support of big data and high-performance computing. Many traffic engineering-related studies focus on analyzing relevant data, such as traffic diversion [19], traffic safety monitoring [20], engine diagnosis [21], road safety [22], traffic accidents [23], and remote sensing image processing [24–35], extracting useful information and digging out valuable knowledge. Few works address vehicle emission evaluation with data mining, which is the key subject of this paper. Xu et al. [36] used XgBoost to develop prediction models for CO2eq and PM2.5 emissions at the trip level. In [37], Ferreira et al. applied online analytical processing (OLAP) and knowledge discovery (KD) techniques to handle the high volume of their dataset, determine the major factors that influence average fuel consumption, and then classify the drivers involved according to their driving efficiency. Chen et al. [38] proposed a driving-events-based ecodriving behaviour evaluation model that proved to be highly accurate (96.72%).

Relevant environmental policies in China define different emission limit standards based on vehicle fuel type and registration time. The vehicle license plate number, plate color, speed, acceleration, VSP (vehicle specific power), etc., are captured by the surveillance system when vehicles pass the remote survey stations. Analysis of the smoke plume generated by the exhaust is simultaneously conducted by laser instruments at the stations, from which the exhaust emission values are calculated. With the fuel type and registration time learned from vehicle plate numbers, the applicable emission standard value can be obtained to judge whether a vehicle's emission is eligible. However, the registration information of nonlocal vehicles and some local vehicles is not recorded in the official database due to policy limitations, so the fuel type and registration time cannot be provided for emission detection. According to the National Telemetry Standard in China, relevant departments treat such information-missing vehicles as diesel-consuming ones; consequently, the applicable emission limits for some vehicles remain unknown, and their evaluation cannot be carried out. Therefore, precise information on the fuel type and registration time of vehicles is an essential prerequisite for finding pollution-exceeding vehicles. This paper adopts multiple data mining methods to learn the fuel type and registration information of vehicles from remote sensing data and further utilizes a cascaded classification framework to make accurate predictions of vehicle emission-related information, providing valuable reference standards for the evaluation of different vehicles.

2. Data Mining Models for Analysis

In this section, detailed descriptions of the models and dataset used in this study are given [39].

2.1. Data Mining Methods
2.1.1. Decision Tree Model

The decision tree model [40] is a commonly used data mining method based on information theory and a greedy algorithm-like framework, proposed for classification or prediction. The model divides the whole dataset into branch-like parts to construct an inverted tree with a root node, internal nodes, and leaf nodes. Its nonparametric design enhances the efficiency and generalization ability of the decision tree when processing large and complex datasets.

Five core components make up a decision tree: (1) nodes: the root node, internal nodes, and leaf nodes are the three types, representing different choice operations for data distribution. (2) Branches: they represent the splitting process of nodes in the decision tree, and each path from the root node to a leaf node represents a corresponding classification decision rule. (3) Splitting: the procedure for generating child nodes from parent nodes, which terminates when a predetermined homogeneity or stopping criterion is met. (4) Stopping: stopping rules are applied to prevent overfitting and inaccuracy. (5) Pruning: an alternative approach that first grows a large tree and then prunes it to an optimal structure by removing useless nodes. This paper uses decision trees to classify vehicle fuel type, specifically calculating the information gain of the corresponding attributes to generate the rule model for fuel type prediction. The attributes that greatly affect the final results can thus be shown in an intuitive way. Figure 1 illustrates the decision tree model for fuel type prediction.
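As a concrete illustration of the information gain computation the tree relies on, the following sketch evaluates the entropy reduction obtained by splitting on one attribute. The plate-color/fuel-type toy sample is hypothetical, not the paper's data:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """Entropy reduction obtained by splitting on a discrete feature."""
    total = entropy(labels)
    conditional = 0.0
    for v in np.unique(feature):
        mask = feature == v
        conditional += mask.mean() * entropy(labels[mask])
    return total - conditional

# hypothetical toy sample: plate color vs. fuel type
color = np.array(["blue", "blue", "yellow", "yellow", "blue", "yellow"])
fuel = np.array(["gasoline", "gasoline", "diesel", "diesel", "gasoline", "diesel"])
print(round(information_gain(color, fuel), 3))  # color separates fuel perfectly -> 1.0
```

An attribute with higher information gain is chosen earlier in the tree, which is exactly why the most influential attributes appear near the root.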

2.1.2. Random Forest Model

The random forest model is another classic and efficient data mining method belonging to the bagging family. In 2001, Leo Breiman combined ensemble learning theory [41] with the random subspace method [42], proposing the well-known machine learning methodology: random forest. It is a data-driven nonparametric model requiring no a priori knowledge; it tolerates noise and abnormal values well and offers excellent extendibility and parallelism for high-dimensional data classification. Its ensemble structure enables random forest to overcome, to some extent, the performance bottleneck and overfitting of single classifiers such as SVM.

Given a dataset D = {(x_i, y_i)}, i = 1, ..., N, a random forest is essentially a combined classifier made up of decision trees {h(x, Θ_k)}, k = 1, ..., K. The classification result is decided by the vote of every decision tree and relies on two vital randomizations: sample bagging and the feature random subspace. The sample bagging process randomly draws, with replacement, a training set of the same size as the original dataset and constructs a corresponding decision tree from it. When a node in a decision tree is split, the model randomly selects a subspace of m features from the whole set of M features (usually m = sqrt(M)), from which an optimal splitting feature is chosen. These features constitute the feature random subspace and contain a more discriminative feature combination for classification. Since the random drawing of training data and feature subspaces is independent for each tree, and the constructions are procedurally identical, {Θ_k} is a sequence of independent, identically distributed random variables. This character makes random forest applicable and efficient in parallel computing and simultaneously ensures its high extendibility. The structure and construction of random forest are shown in Figure 2.
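The two randomizations can be sketched in a few lines; the dataset sizes below are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 16              # samples and features (illustrative sizes)
m = int(np.sqrt(M))          # sqrt(M) candidate features per split, a common default

# sample bagging: each tree trains on N indices drawn with replacement
bag = rng.integers(0, N, size=N)
unique_frac = len(np.unique(bag)) / N   # roughly 63.2% of samples land in a bag

# feature random subspace: m candidate features for one node split
subspace = rng.choice(M, size=m, replace=False)

print(m, round(unique_frac, 2), sorted(subspace.tolist()))
```

The samples left out of each bag (about 36.8%) are what out-of-bag error estimation is built on, and the independence of `bag` and `subspace` draws across trees is what makes parallel construction straightforward.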

2.1.3. AdaBoost Model

Similar to the bagging method, boosting also belongs to ensemble learning; it enhances the classification/prediction accuracy of a base learner continuously through an iterative update process. AdaBoost [43], based on boosting, was proposed by Freund and Schapire in 1995. The algorithm has been widely used in various classification/prediction fields due to its outstanding performance. The central idea of AdaBoost is to continuously update the sample weights according to the classification results. The weights of samples misclassified by the previous base classifier are increased while those of correctly classified samples are decreased, and the reweighted samples are used to train the next base classifier. A new weak classifier is added to the cascade in each iteration, and the final strong classifier is not determined until a predetermined, sufficiently small error rate or the prespecified maximum number of iterations is reached. The concrete procedure is as follows. Given the training set T = {(x_1, y_1), ..., (x_N, y_N)}, x_i is the training data with its label y_i ∈ {+1, −1}, where +1 and −1 denote the positive and negative labels, respectively, and G_m(x) is the m-th base classifier of AdaBoost. Initialize the weights with the uniform distribution:

D_1 = (w_11, ..., w_1N), w_1i = 1/N, i = 1, ..., N.

Then, iterate: the base classifier G_m(x) is trained under the weight distribution D_m, and the weighted accumulation of the misclassified samples gives the classification error rate, defined as

e_m = Σ_{i=1}^{N} w_mi I(G_m(x_i) ≠ y_i).

The coefficient α_m of the base classifier G_m(x) is calculated by the following formula:

α_m = (1/2) ln((1 − e_m)/e_m).

In the next iteration, the weight distribution of the samples is updated as

w_{m+1,i} = (w_mi / Z_m) exp(−α_m y_i G_m(x_i)),

where the normalization factor Z_m is defined as

Z_m = Σ_{i=1}^{N} w_mi exp(−α_m y_i G_m(x_i)).

The final strong classifier is the weighted combination of the trained base classifiers, denoted as

G(x) = sign(Σ_{m=1}^{M} α_m G_m(x)).
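The update procedure above can be assembled into a minimal AdaBoost with one-feature threshold stumps as base classifiers. This is a sketch on an invented toy set, not the paper's configuration:

```python
import numpy as np

def adaboost(X, y, n_rounds=10):
    """Minimal AdaBoost with threshold stumps; y must take values in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)          # D_1: uniform initial weights
    stumps = []
    for _ in range(n_rounds):
        best = None
        # exhaustively pick the stump minimizing the weighted error e_m
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] <= thr, 1, -1)
                    err = np.sum(w[pred != y])
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign, pred)
        e, j, thr, sign, pred = best
        e = min(max(e, 1e-12), 1 - 1e-12)        # clamp to keep the log finite
        alpha = 0.5 * np.log((1 - e) / e)        # base-classifier coefficient
        w = w * np.exp(-alpha * y * pred)        # raise wrong, lower right
        w /= w.sum()                             # Z_m normalization
        stumps.append((alpha, j, thr, sign))
    return stumps

def predict(stumps, X):
    """Weighted vote of the base classifiers: sign of the alpha-weighted sum."""
    score = sum(a * s * np.where(X[:, j] <= t, 1, -1) for a, j, t, s in stumps)
    return np.sign(score)

# toy separable data
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1, 1, -1, -1])
model = adaboost(X, y, n_rounds=3)
print(predict(model, X))   # matches y on this toy set
```

Each `(alpha, feature, threshold, sign)` tuple is one weak classifier; the exponential reweighting step is what forces later stumps to concentrate on the previously misclassified samples.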

2.1.4. XgBoost Model

XgBoost [44] is a modified algorithm based on GBDT (gradient boosting decision tree) [45]. Both follow the boosting methodology: in each iteration, the current decision trees are learned from the previous iteration's results and move in the direction that diminishes the residual. For multiclassification problems with K classes, the logarithmic likelihood loss function is defined as

L(y_i, F(x_i)) = −Σ_{k=1}^{K} I(y_i = k) log p_k(x_i),

where y_i is the label of the input data x_i and I(·) is an indicator function equal to 1 when y_i = k and 0 otherwise. The prediction probability is denoted as

p_k(x) = exp(F_k(x)) / Σ_{l=1}^{K} exp(F_l(x)).

At the m-th iteration, for a sample x_i with label y_i, the current negative gradient can be calculated as

r_{mik} = I(y_i = k) − p_{m−1,k}(x_i).

As a classic algorithm, GBDT offers high accuracy, robustness, and convenience. However, it uses only the first-order partial derivative to calculate the negative gradient, which may cause relatively large errors. To overcome this shortcoming, XgBoost expands the loss function with a second-order Taylor expansion, and the results become closer to the ground truth when this expansion is used to calculate the leaf node weights. In addition, XgBoost pre-sorts the sample data and stores the records in blocks, making it much faster than GBDT on the same training data.
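The multiclass softmax probability and the first-order negative gradient that gradient boosting iterates on can be written out in a few lines of numpy. The score matrix F below is hypothetical, standing in for the current tree-ensemble outputs:

```python
import numpy as np

def softmax(F):
    """p_k = exp(F_k) / sum_l exp(F_l): the multiclass prediction probability."""
    z = np.exp(F - F.max(axis=1, keepdims=True))   # shift for numerical stability
    return z / z.sum(axis=1, keepdims=True)

def negative_gradient(F, y, K):
    """First-order negative gradient of the multiclass log-loss:
    r_ik = I(y_i = k) - p_k(x_i)."""
    onehot = np.eye(K)[y]
    return onehot - softmax(F)

F = np.array([[2.0, 1.0, 0.1]])   # hypothetical current scores, K = 3 classes
y = np.array([0])                 # true class of this sample
r = negative_gradient(F, y, K=3)
print(np.round(r, 3))             # positive for the true class, negative elsewhere
```

Each boosting round fits the next tree to these residuals; XgBoost additionally uses the second derivative p_k(1 − p_k) of the same loss when computing leaf weights.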

2.2. Model Fusion

Although the single models described above perform satisfactorily on data classification and prediction, fusion methods that exploit the differences between the characteristics of the single models can further improve the accuracy and robustness of the final results. Voting is a common method for model fusion. This paper adopts two fusion methods to construct the combined model: hard voting and soft voting [46].

The hard voting classifier follows the simple idea that the minority obeys the majority. Given the classification/prediction results for a sample x, each classifier's predicted label counts as one vote, and the most-voted label becomes the final result. This process can be defined as

H(x) = c_{argmax_j Σ_{i=1}^{T} h_i^j(x)},

where c_j denotes the j-th label, H(x) is the hard voting result, and h_i^j(x) is an indicator showing whether the i-th classifier deems the label of x to be c_j: h_i^j(x) = 1 when the probability p_i^j(x) that the i-th classifier assigns to label c_j exceeds some threshold; otherwise, h_i^j(x) = 0.

The soft voting classifier is another fusion strategy, which takes the average of the probabilities that all classifiers assign to each label as the criterion. The label with the highest average probability is the final result, and the voting can be expressed as

H(x) = c_{argmax_j (1/T) Σ_{i=1}^{T} p_i^j(x)},

where T is the total number of classifiers and j = 1, ..., L indexes the L labels.

This paper combines the different models described in Section 2 with both methods, denoted the hard voting model and the soft voting model, respectively, and compares their final results.
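The difference between the two strategies shows up even in a tiny sketch with three hypothetical classifiers and two labels (the probabilities are invented for illustration):

```python
import numpy as np

# predicted probabilities for one sample from three hypothetical classifiers,
# over the labels [gasoline, diesel]
proba = np.array([[0.55, 0.45],
                  [0.60, 0.40],
                  [0.30, 0.70]])

# hard voting: each classifier casts one vote for its most probable label
votes = proba.argmax(axis=1)                  # -> [0, 0, 1]
hard = np.bincount(votes, minlength=2).argmax()

# soft voting: average the probabilities across classifiers, then take the argmax
soft = proba.mean(axis=0).argmax()

# hard picks gasoline (2 votes to 1); soft picks diesel (mean 0.483 vs 0.517),
# showing that the two fusion strategies can disagree on the same inputs
print(hard, soft)
```

Soft voting weights confident classifiers more heavily, which is why it often edges out hard voting when the base models output calibrated probabilities.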

3. Data Description

The data source is collected by means of remote sensing. The vehicle remote sensing system is a complex synthetic system made up of several subsystems, including tail gas analysis, environmental information monitoring, traffic condition monitoring, and vehicle identification. When vehicles pass the remote sensing devices, the equipment detects the smoke plume produced by the diffusion of the vehicle exhaust. The specific process can be summarized as follows: the probe light from the remote sensing device passes through the air mass and then returns to the detection unit through the right-angle displacement unit, completing the detection of carbon dioxide, carbon monoxide, hydrocarbons, nitrogen oxides, etc. The intensity of the opaque smoke exhausted by diesel vehicles can also be monitored thanks to the integrated gasoline-diesel design. As parameters of the vehicles' running status, the instantaneous velocity and acceleration are obtained synchronously by the remote sensing system. The exhaust and running data together form the final remote sensing results.

Vehicle exhaust includes water vapor, oxygen, hydrogen, nitrogen, carbon dioxide, carbon monoxide, hydrocarbons, nitrogen oxides, sulfur dioxide, and particulate matter. There are two main on-road remote sensing methods for vehicle exhaust analysis: the nondispersive infrared (NDIR) analyzer, which measures carbon monoxide, carbon dioxide, and hydrocarbons, and the dispersive ultraviolet (DUV) method, which measures nitric oxide, smoke factors, and particulate matter (including opacity). The exhaust diffuses and dilutes in the air immediately after being discharged, and the variation of the dilution concentration is affected by factors such as air disturbance, wind direction, and wind speed. Direct measurement of the concentration of each pollutant in the exhaust plume therefore may not reflect the vehicle emissions accurately and efficiently, so carbon dioxide is adopted as the reference gas when measuring the exhaust pollutants in vehicle remote sensing technology. A single exhaust remote sensing optical path (whether erected horizontally or vertically) cannot remotely measure the exhaust of multiple vehicles at the same time. Vehicles must pass one by one, and the time slot between passing vehicles should be greater than one second so that the remote sensing device has enough time to measure the exhaust of the preceding vehicle. This also allows the exhaust of the preceding vehicle to disperse in time without affecting the remote sensing of the vehicles behind.

The remote sensing data used in the experiments come from the real detection records in the database of the Environmental Protection Bureau of a certain city. Each record consists of three parts of information. The first part is the basic vehicle information, containing the license plate number, the license plate color, and the passing time. The second part is the vehicle condition information, which includes the vehicle bodywork length, the speed, and the acceleration. The third part is the remote sensing result, which includes the detected values of carbon dioxide, carbon monoxide, hydrocarbons, nitric oxide, and smoke intensity, together with the environmental measurements: wind speed, wind direction, temperature, humidity, and atmospheric pressure. Each record used in the research is composed of these three parts of information, and the vehicle fuel type and the registration time period are the prediction targets.

4. Experiments

4.1. Data Preprocessing
4.1.1. Fuel Type Prediction

The data for fuel type prediction come from two tables: the vehicle information table and the remote sensing record table. The ID and fuel type fields are extracted from the former table, and the latter contains (a) vehicle running conditions (speed, acceleration, passing time, etc.); (b) environmental meteorological conditions (lane, wind direction, wind speed, temperature, humidity, and atmospheric pressure); and (c) remote sensing results (detected values of carbon dioxide, carbon monoxide, hydrocarbons, nitrogen oxides, and smoke intensity). This paper joins the two tables by vehicle ID; the specific process is described as follows:

(1). Data analysis. A preliminary statistical analysis of the number of vehicles with different fuel types is made, and the result (shown in Figure 3) demonstrates that vehicles with fuel types other than gasoline and diesel, such as mixed oil and natural gas, account for a very low proportion. The ratio of gasoline to diesel cars in the data is about 4.6 : 1, so this is an unbalanced dataset.

(2). Feature layering. For an unbalanced distribution, all the data in the long tail of the distribution graph are merged into a single level, which concentrates the data and reduces the number of levels. A temporary feature is generated for stratified sampling, dividing the dataset into different sections. This paper divides the vehicle passing time into three periods: morning, afternoon, and evening.

(3). Data cleaning. Handling the character-valued vehicle features and the missing values are the main components of the data cleaning procedure. The character features include the license plate number, the license plate color, and the test results; they are usually converted into one-hot codes in machine learning. There are three common ways to deal with missing values: ignoring the record with the missing value, ignoring the missing feature, and median/mean padding. If the license plate number, license plate color, or detection result is missing, the record is ignored directly; missing continuous-valued parameters are filled with the mean value.
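Record dropping, mean padding, and one-hot encoding can be sketched with pandas; the column names below are hypothetical stand-ins, not the actual schema:

```python
import pandas as pd

# hypothetical records: plate color is categorical, CO2 has a missing value
df = pd.DataFrame({
    "plate_color": ["blue", "yellow", "blue", None],
    "co2": [1.2, 3.4, None, 2.0],
})

# drop records whose key character feature is missing, as in step (3)
df = df.dropna(subset=["plate_color"])

# fill remaining continuous gaps with the column mean
df["co2"] = df["co2"].fillna(df["co2"].mean())

# one-hot encode the categorical feature
df = pd.get_dummies(df, columns=["plate_color"])
print(df)
```

After these steps the frame has no missing values and only numeric columns, which is what the tree-based models downstream expect.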

(4). Feature selection. The main work is to find correlations and feature combinations by generating a correlation matrix for the original data. The features most relevant to the fuel type are checked first: features with a positive correlation include the license plate color and nitric oxide, while features with an obvious negative correlation include validity, transit time, and carbon dioxide. The features most relevant to the registration date of diesel vehicles are then checked: those with a clearly positive correlation include the license plate color, and those with a negative correlation include validity, detection line, and nitric oxide. In terms of feature combination, because the plume concentration is affected by the wind speed and the vehicle's own speed during the remote sensing process, the CO2 concentration is used as a reference when recording the exhaust pollutant concentrations, and the new combined features are the ratios of the other pollution items to CO2. The amount of data in this study is enormous; after filtering, about 30 features remain. The decision tree algorithm is used to calculate the feature importance for fuel type prediction, with results shown in Figure 4. The feature importance for the gasoline vehicle registration time period, calculated with random forest, is shown in Figure 5, and that for the diesel vehicle registration time period is shown in Figure 6.

(5). Feature scaling. The experiments revealed that the value ranges of different features differ greatly, which severely hampers the decision part of the model during learning, since features with larger values usually cause greater changes in the model. Commonly used scaling methods, standardization and normalization, are therefore introduced. The standardization formula is (x − mean)/std, where x denotes a feature value, mean is the average of all values of that feature, and std is their standard deviation. The normalization formula is (x − min)/(max − min), where min and max are the minimum and maximum values of the feature, respectively. With normalization, all feature values are mapped into the interval [0, 1].
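Both scaling formulas can be checked on a small invented example:

```python
import numpy as np

x = np.array([5.0, 10.0, 15.0, 30.0])   # one feature column, values invented

# standardization: (x - mean) / standard deviation -> zero mean, unit variance
standardized = (x - x.mean()) / x.std()

# min-max normalization: (x - min) / (max - min) -> values mapped into [0, 1]
normalized = (x - x.min()) / (x.max() - x.min())

print(np.round(standardized, 3), np.round(normalized, 3))
```

Standardization suits features without hard bounds, while min-max normalization is convenient when a fixed [0, 1] range is needed; both remove the scale disparity described above.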

(6). Dataset division. This paper randomly divides the source dataset into two parts: 85% as the training dataset and 15% as the validation dataset.
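The random 85/15 split can be sketched by shuffling indices (the sample count is illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000                              # illustrative number of records
idx = rng.permutation(n)              # shuffle all record indices
cut = int(0.85 * n)
train_idx, val_idx = idx[:cut], idx[cut:]
print(len(train_idx), len(val_idx))   # 850 150
```

Shuffling before cutting ensures both partitions follow the same distribution, which matters given the class imbalance noted in step (1).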

(7). Model training. The processed data are fed into the three models for training and verification. The comparison among the results from these models is presented in the experiments section.

4.1.2. Registration Time Prediction

According to the national and local motor vehicle standards, the vehicle registration time is subdivided as follows. Relevant departments stipulate that gasoline-powered vehicles are divided into two categories: before 2001-10-1 and after 2001-10-1.

The registration time of diesel-powered vehicles is divided into three periods, i.e., before 2008-7-1, between 2008-7-1 and 2013-7-1, and after 2013-7-1.

Once the period division is done, the prediction of registration time can be treated as a multiclassification problem.
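Under the period divisions above, a labeling function could look like the following sketch; the class indices are an assumption for illustration, since the paper does not specify its encoding:

```python
from datetime import date

def registration_class(fuel, reg_date):
    """Map (fuel type, registration date) to a class index for the
    multiclass target. Index values are hypothetical, not the paper's."""
    if fuel == "gasoline":
        return 0 if reg_date < date(2001, 10, 1) else 1
    if fuel == "diesel":
        if reg_date < date(2008, 7, 1):
            return 0
        return 1 if reg_date < date(2013, 7, 1) else 2
    raise ValueError("unknown fuel type")

print(registration_class("gasoline", date(2005, 3, 1)),   # 1: after 2001-10-1
      registration_class("diesel", date(2010, 1, 1)),     # 1: 2008-7-1..2013-7-1
      registration_class("diesel", date(2014, 1, 1)))     # 2: after 2013-7-1
```

Combined with the predicted fuel type, such a class index is all that is needed to look up the applicable emission limit.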

The remote sensing record table contains three parts of information: (a) remote sensing data on vehicle running conditions, including speed, acceleration, and passing time; (b) environmental meteorological conditions, including lane condition, wind direction, wind speed, temperature, humidity, and atmospheric pressure; and (c) exhaust remote sensing results containing the detected values of carbon dioxide, carbon monoxide, hydrocarbons, nitrogen oxides, and smoke intensity. The division by the national and local standards mentioned above introduces a source of large error: vehicles registered close to a segmentation point can hardly differ much from one another. Therefore, this paper discards the records of the three months before and after each segmentation point.

4.2. Parameter Optimization
4.2.1. Decision Tree Tuning

There are three main parameters to adjust in the decision tree algorithm.

D_max_depth represents the maximum depth of a decision tree. From the principle of decision trees, a deeper tree has more power to divide attributes thoroughly and mine the deep relationships in the data. This paper experimentally sets max_depth to range from 1 to 32; the F1 curve is shown in Figure 7.

D_min_samples_split denotes the minimum number of samples required to split an internal node. At least one sample is required for each node to perform splitting. As this threshold increases, more samples must participate in a split, so the decision tree is more constrained, with more reference data used in node splitting; this also affects the speed of model execution. This paper sets the minimum number of samples for internal nodes to range from 10 to 500; the F1 curve is shown in Figure 8.

D_min_samples_leaf is the minimum number of samples required in a leaf node; a leaf node is pruned if its number of samples falls below this minimum.

In the experiments, D_min_samples_leaf is set to range from 1 to 100, and the F1 curve is plotted in Figure 9.

According to Figures 7–9, the optimal parameters of the decision tree algorithm obtained experimentally in this paper are as follows: D_max_depth is 12, D_min_samples_split is 150, and D_min_samples_leaf is 10.
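A parameter sweep of this kind can be sketched with scikit-learn; the synthetic dataset below is a stand-in for the remote sensing features, and only the max-depth sweep is shown:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the remote sensing features (binary target)
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.15, random_state=0)

best = (None, -1.0)
for depth in range(1, 33):   # sweep D_max_depth from 1 to 32, as in the paper
    clf = DecisionTreeClassifier(max_depth=depth, min_samples_split=150,
                                 min_samples_leaf=10, random_state=0)
    clf.fit(X_tr, y_tr)
    f1 = f1_score(y_val, clf.predict(X_val))
    if f1 > best[1]:
        best = (depth, f1)
print("best depth:", best[0])
```

Plotting the per-depth F1 values from such a sweep is exactly how curves like Figure 7 are produced; the same loop structure applies to the other two parameters.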

4.2.2. Random Forest Tuning

There are four main parameters to adjust in the random forest model.

n_estimators is the number of decision trees in the random forest, which plays a significant role in the performance of the model. A small value of n_estimators means fewer base classifiers participate in the decision process, decreasing prediction accuracy, while a large number of decision trees adds computational burden and running time. This paper sets n_estimators from 10 to 100 and plots the F1 curve, as shown in Figure 10.

max_features represents the maximum number of features that can be considered when a decision tree node splits. Each node considers all M features when max_features = M, and commonly no more than sqrt(M) or log2(M) features otherwise, where M is the total number of features. Since there are few sample attributes in the experiment, this paper sets max_features to 5.

R_max_depth is the maximum depth of the decision trees in the random forest. This paper sets R_max_depth to range from 1 to 100, and the F1 curve is shown in Figure 11.

R_min_samples_leaf is the minimum number of samples in a leaf node for splitting. This paper sets it to range from 1 to 100, and the F1 curve is shown in Figure 12.

According to the results in Figures 10–12, this paper sets n_estimators to 100, max_features to 5, R_max_depth to 24, and R_min_samples_leaf to 2.

4.2.3. AdaBoost Parameters Setting

The default base classifier of AdaBoost is the decision tree, making it a classic ensemble learning algorithm with a boosting structure. The base classifier parameters follow the optimal decision tree parameters above: max_depth = 12, min_samples_split = 150, and min_samples_leaf = 10. Two significant parameters of AdaBoost are the number of base classifiers and the learning rate. The model easily overfits if the learning rate is too large and underfits if it is too small; this paper keeps the learning rate at its default value of 1.

4.3. Experimental Results and Analysis
4.3.1. Fuel Type Prediction Model

This section uses the decision tree, random forest, and AdaBoost algorithms for fuel type prediction. Vehicle fuel type prediction is a typical binary classification problem. Five classification models are used in the experiment: decision tree, random forest, AdaBoost, the hard voting fusion model, and the soft voting fusion model. After parameter optimization of all models, the single-classifier models are compared with the fusion ones. The whole technological process is illustrated in Figure 13. More details are given in the following subsections.

Section 4.2 optimized the parameters of each single model. Although the performance of the single models is already very good, the differences between them make model fusion worthwhile.

It can be seen from Table 1 that random forest performs best among the single models. The fusion models obtained by voting also perform very well: compared with most single models, they achieve higher prediction accuracy. In predicting the vehicle's fuel type, this paper obtains good prediction results, with the random forest algorithm and the fusion models performing best. The F1 value of the random forest is 90.41%, and the F1 value of the soft voting fusion model is 90.3%. Because the random forest predicts faster and its model is easier to interpret, it is selected as the final model for fuel type prediction.

4.3.2. Registration Time Prediction Model: Mixed Fuel Type

Vehicle registration time prediction is divided into diesel vehicle registration time prediction and gasoline vehicle registration time prediction. Statistical analysis of the data shows that diesel vehicles mainly fall into those registered between 2009 and 2013 and those registered after 2013; the proportion registered before 2009 is very low. Gasoline vehicles are mainly registered after 2008, accounting for about 90%. The purpose of predicting the registration date and fuel type is to determine the emission limit value applicable to unknown vehicles, so as to judge whether their emissions are qualified. The earlier the registration, the higher the limit value standard. Misclassifying a vehicle registered before 2008 as one registered after 2008 is equivalent to selecting a lower limit value, which makes it easy to judge a vehicle with qualified emissions as unqualified and wastes the resources of car owners and environmental inspection workstations. Based on the above analysis, random forest and XgBoost are used in this section to build the registration time prediction model.

The classification categories are as follows: gasoline + after 2001-10-1, gasoline + before 2001-10-1, diesel + between 2008-7-1 and 2013-7-1, diesel + before 2008-7-1, and diesel + after 2013-7-1. Since this is a multiclassification problem, this paper uses random forest and XgBoost for prediction; the results are shown in Table 2. To reduce the randomness of the learning algorithms, the results report the mean and variance over 10 independent runs.
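A hypothetical sketch of how the five classes listed above could be derived from a vehicle's fuel type and registration date. The function and label names are illustrative and not taken from the original data set; only the category boundaries follow the text.

```python
# Map (fuel type, registration date) to one of the five classes in the text.
# Boundary dates: gasoline split at 2001-10-1; diesel split at 2008-7-1 and
# 2013-7-1. Boundary handling (>= vs >) is an assumption.
from datetime import date

def registration_class(fuel: str, registered: date) -> str:
    if fuel == "gasoline":
        return ("gasoline_after_2001-10-1" if registered >= date(2001, 10, 1)
                else "gasoline_before_2001-10-1")
    # Diesel: three registration-time bands.
    if registered < date(2008, 7, 1):
        return "diesel_before_2008-7-1"
    if registered < date(2013, 7, 1):
        return "diesel_2008-7-1_to_2013-7-1"
    return "diesel_after_2013-7-1"
```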

This paper finds that when the data are divided into five categories, the verification results are unbalanced: gasoline-powered vehicles registered after 2001 are far more numerous, and the verification accuracy is much lower than the training accuracy. The random forest model is superior to the XgBoost model in both training and verification accuracy; its training accuracy reaches 99.0% and its verification accuracy is about 91.7%, indicating that some overfitting occurs and lowers the verification accuracy.

4.3.3. Registration Time Prediction Model: Gasoline Vehicle

Gasoline cars are classified into two categories: gasoline + after 2001-10-1 and gasoline + before 2001-10-1. The results are shown in Table 3.

Both the training accuracy and the verification accuracy exceed 99% with the random forest model. This is mainly due to the severely unbalanced data distribution and the insufficient amount of data.

5. Conclusions

Environmental protection has been a hot topic in academic and industrial communities. This paper focuses on predicting the missing basic information of vehicles from telemetry data to monitor vehicle emissions. A variety of data mining methods are adopted to perform predictions based on the vehicle telemetry data provided by an environmental protection agency in a certain city, and precise inferences are successfully made on fuel type and gasoline-powered vehicle registration time. In predicting the registration time of diesel vehicles, the prediction accuracy only reaches about 70%, because the division of registration time is artificially defined and vehicle status varies greatly across users. Further work will build on more related data and improved algorithms to make more precise predictions of vehicle emission-related information.

Data Availability

The SQL data used to support the findings of this study have not been made available because the data provider is the Municipal Bureau of Ecology and Environment of a certain city and the data is not public.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants 61772386 and 61862015 and in part by the State Grid Hubei Electric Power Co., Ltd., under Grant SGHBDK00DWJS1800134.