Abstract

As the logistics industry accounts for a growing share of the economy, the study of the vehicle routing problem has practical significance for economic development. Building on the vehicle routing problem (VRP), the customer presence probability is introduced as an uncertain parameter, and a VRP model with uncertain customers is established. The robust uncertainty model is then refined with a data-driven kernel density estimation method that fits the distribution characteristics of historical data samples, yielding a distributionally robust vehicle routing model for uncertain customers. The Q-learning algorithm from reinforcement learning is embedded in the high-level selection strategy of a hyper-heuristic framework, and the resulting hyper-heuristic algorithm based on Q-learning is designed to solve the problem. Compared with the deterministic method, the distributionally robust model effectively reduces the total cost and the conservatism of the robust solution while maintaining customer satisfaction, and the improved algorithm performs well.

1. Introduction

Because it governs the transportation cost of vehicle distribution, the vehicle routing problem (VRP) has long been a key problem in logistics scheduling. The VRP was first proposed by Dantzig and Ramser in 1959 [1]; its basic form is the capacitated vehicle routing problem (CVRP). Under the constraints of a known vehicle load and known customer demand, the problem aims to minimize the vehicle transportation distance needed to serve all customers, thereby reducing the logistics distribution cost.

In the classical vehicle routing problem, it is generally assumed that all information is complete and fixed; that is, the path planner has all the information, including customer information and road network information, before planning the path and then finds an optimal or satisfactory solution satisfying the constraints. However, in the actual distribution process this information is not fixed, and it is difficult for the path planner to fully grasp the information of all distribution nodes in advance. The resulting uncertainty can make the original optimized plan suboptimal or even infeasible. The VRP with uncertain parameters is called the vehicle routing problem with uncertainty (UVRP).

At present, most research on the UVRP addresses uncertainty in customer demand or travel time. In real-world distribution, however, ignoring the impact of customer absence incurs considerable additional transportation cost and efficiency loss, so studying the VRP with uncertain customers (VRPUC) is of great significance. Lei et al. [2] first studied the uncertain-customer problem and its districting variant and proposed a two-stage solution. In research on the VRPUC, most studies either assume that the probability distribution of customer presence is a known, deterministic quantity [3] or formulate a strategy of re-optimizing the path in real time [4]. However, the probability distribution of customer presence cannot be measured accurately and can only be estimated from historical data. At the same time, re-optimization leads to vehicles frequently returning to the warehouse to pick up goods, which raises transportation costs and lowers efficiency. Therefore, in our study, we consider the uncertainty of the customer presence probability and predict its distribution from the customers' historical demand information.

For the UVRP, the most common approach is to build an uncertainty model and then convert it into a deterministic model to solve. Researchers have adopted many approaches to this transformation: Markovic et al. [5] regarded the urban garbage collection problem as a VRP with random demand and travel time and expressed it as a chance-constrained programming model with a normal distribution. Hou et al. [6] established a stochastic programming model of the VRPUC with time windows and solved it with an adaptive genetic algorithm. Ge et al. [7] studied the electric VRP with random demand and active remedies and established a model with probability constraints. Souyris et al. [8] proposed a robust model for the VRP with soft time windows and service time uncertainty and verified it on real-world data using a branch-and-price method. Solano-Charris et al. [9] proposed a robust optimization method based on discrete scenarios to handle uncertain travel costs in the VRP. For uncertain problems, stochastic programming requires the exact probability distribution of the random variables, which often cannot be obtained from observations; classical robust optimization avoids this requirement, but its solutions are too conservative. Several heuristic algorithms have also been proposed for the UVRP, including variable neighborhood search [10], tabu search [11], and particle swarm optimization [12].

In contrast, distributionally robust optimization has developed rapidly in recent years and can compensate for the shortcomings of both classical robust optimization and stochastic optimization. It combines statistical learning with optimization theory and obtains a sufficiently good solution by assuming that the parameters follow one of a set of possible distributions. To date, distributionally robust optimization has been applied to energy scheduling [13, 14], outpatient scheduling [15, 16], fault detection [17], and other fields with good results. However, its application to the UVRP, and especially to the VRPUC, remains rare, so using distributionally robust optimization theory to solve the UVRP is also meaningful.

Reinforcement learning, as a powerful decision-making tool, has attracted extensive attention. It is therefore natural to apply reinforcement learning as the high-level strategy of a hyper-heuristic algorithm to select among the low-level strategies. Zhang et al. [18] used a reinforcement learning algorithm based on a deep Q-network as the high-level strategy of a hyper-heuristic algorithm to evaluate the performance of the underlying operators and, through interaction with the environment, to learn which operator to use in a given situation to obtain the maximum reward. Qin et al. [19] introduced several metaheuristic algorithms with different characteristics as the low-level heuristic strategies and policy-based reinforcement learning as the high-level selection strategy; deep learning was also used to extract hidden patterns in the collected data to better combine the advantages of the underlying heuristics. However, their proposed algorithm is complex and its performance is limited. Overall, reinforcement learning is increasingly considered in the design of hyper-heuristic algorithms, and great progress has been made in recent years.

For the VRPUC, we propose a data-driven distributionally robust optimization method. Considering the uncertainty of the customer presence probability, we establish a distributionally robust vehicle routing model and use a data-driven kernel density estimation method to fit the historical data and refine the robust model. A hyper-heuristic algorithm based on the Q-learning algorithm is designed to solve the model. In summary, the main contributions of this paper are as follows:
(1) Whereas previous studies generally assumed that the presence of customers obeys a known probability distribution, we consider the uncertainty of the customer presence probability itself.
(2) We introduce distributionally robust optimization theory to build a mathematical model of the problem and use the historical data of the customer points to refine the model. The experiments show that the proposed method solves this problem well.
(3) We propose a hyper-heuristic algorithm based on reinforcement learning to solve the model, using the Q-learning algorithm as the high-level selection strategy to choose among the low-level operators. The experiments show that our algorithm is effective.

2. VRPUC Model

2.1. Problem Description

The VRPUC is described as follows: there is a distribution center with a known location and a group of customers with known locations. The demand of each customer is known, and the vehicle type is fixed. Some customers have fixed demand and must be served by a vehicle every time; the remaining customers are present only with some probability, so they may turn out not to require service, and this presence probability is itself uncertain. A certain number of vehicles are assigned to serve the customers. Each customer can be served by only one vehicle, and the vehicle must meet the demand of each customer it serves. The vehicle must return to the distribution center after serving all customers that require service. Because routes are planned according to expected demand, a vehicle may be unable to meet the demand of all customers on its route; in that case, it returns to the distribution center for replenishment and then continues to serve the remaining customers. The goal is to minimize the total distance cost, including the replenishment distance, while meeting all customer demand.

2.2. Mathematical Model

The formulation for the VRPUC is as follows: let i = 0 be the distribution center, let the set of customers be indexed by i (i = 1, 2, 3, …, L) and the set of vehicles by k (k = 1, 2, 3, …, K), and let Q be the standard capacity of a vehicle. The demand of each customer is denoted di (i = 1, 2, 3, …, L). The uncertain demand probability of each customer is denoted pi, and the pi values are mutually independent; if a customer's demand is certain, then pi = 1. The overall event set is denoted R, and a customer presence event is denoted ζ; R\ζ denotes the remaining events after removing the events ζ, that is, the events in which the customer does not require service. The set of customer points whose demand is not met is O, the penalty coefficient is ε, the distance cost is DT, the penalty cost is S, the total cost is Cost(xijk, yik), and cij denotes the distance from customer i to customer j. The following variables are defined:

The mathematical model is as follows:

Objective:

Subject to:

Equation (2) is the objective, the minimum expected cost. Constraint (3) indicates that the total demand of the customer points served by each vehicle does not exceed the standard load capacity of the vehicle. Constraint (4) gives the probability of an event: the first term is the product of the probabilities of the demand events that occur, and the second is the product of the probabilities of the demand events that do not occur. Constraint (5) defines the total cost: the first term is the distance cost and the second is the penalty cost. Constraint (6) ensures that each customer point is served. Constraint (7) ensures that each customer is served by only one vehicle. Constraint (8) eliminates subtours. Constraint (9) indicates that the probability of customer demand lies in [0, 1]; that is, a customer may have deterministic demand or demand with some probability distribution. Equation (10) indicates whether the customer's demand is met: if so, the value is 0; otherwise, it is 1.

3. Robust Optimization of the VRPUC Formulation

3.1. Robust Model Optimization

Distributionally robust optimization is generally used when only an approximate distribution of the uncertain parameters is known. A set of candidate distributions is constructed from this approximate distribution together with historical data, the worst case within this set is identified, and the solution is optimized against that worst case. This mitigates both the mismatch between the approximate and actual distributions and the excessive conservatism of classical robust methods. For the model with an uncertain customer presence probability, distributionally robust optimization is adopted because the actual distribution of each customer's presence probability is unknown, whereas its first- and second-order moments can be obtained from historical data.

According to Reference [20], the minimum is taken under a given probability distribution, that is, as an expected minimum value. Therefore, (2) can also be written as follows, where P represents the probability distribution that the uncertain parameters obey and EP(·) represents the expectation of the solution under P. Distributionally robust optimization is introduced to optimize it: P' is designed as a set of probability measures that includes all distributions the demand probability pi of each customer could follow in reality, and its range is larger than the set P. Clearly, if the worst case (the maximum value, an upper bound) over the set P' satisfies the required conditions, then every distribution in P also satisfies them. Therefore, (11) can be rewritten as

Because the latter part of (12) is only related to the result of path planning and is not directly related to the probability of an event at a certain point, constraint (12) can be written as (13), which implies (14).

According to (14), the key to the solution is the first half, which does not involve the penalty cost. Because the penalty cost can only be determined by the path planning, that is, according to the actual situation, its maximum value cannot be obtained through a functional expression. Therefore, (14) can be solved by using (15).

The solution is divided into two stages. First, we solve the first half, that is,

Expanding its expected formula gives

Unifying pi and (1 − pi) in (17) and rewriting them into corresponding piecewise functions yield

Let μ be the average value obtained from historical data and σ2 be the variance. Then, the probability distribution of set P has the following formulas:

Because there are only two possible outcomes (the customer either requires service or does not), the presence probability should roughly follow a two-point (Bernoulli) distribution; that is, the variable is discrete, so (24) can also be expressed as

Equations (18)–(21) can be transformed by using duality theory. Firstly, the following equation and constraint condition (20) are transformed into their dual forms, where αi and βi are Lagrange multipliers. Therefore, (15) can be expressed as

Similarly, (29) can be expressed as

Again, (31) and (22)–(24) can be transformed using conic duality theory, and because cijxijk is a constant whose value is not yet determined, the corresponding conditions can be omitted. This yields the following, where αi, βi, and υ are also Lagrange multipliers. Because the conditions before the transformation are equalities, the three variables satisfy αi ∈ Rn×n, βi ∈ Rn, and υ ∈ R. With the subscript removed, each variable can be written as the corresponding matrix or vector, and because this condition reflects a strong duality relationship, it can be written as follows by using constraint condition (32):

Therefore, for (33) to hold in all cases, the following condition must be met:

According to semidefinite optimization theory, constraint (32) can also be written as

Finally, (15) can be simplified to the following dual formula:

The final formula is

In (36), when pi is treated as the variable and the other parameters are held fixed, the inner maximization is a linear program; its objective is a convex function that attains a maximum. The maximum occurs when the following holds, from which it can be obtained that

According to constraints (28), (34), and (35) and the bounds on the probability pi itself (0 ≤ pi ≤ 1), the optimal solution in the variables (αi, βi, α, β, yik) can be obtained, the minimum value determined, and a fixed value of pi derived, yielding the final deterministic robust optimization model with parameters.

3.2. Optimizing Robust Models through Data Driving

Solving the VRPUC model based on distributionally robust optimization requires knowledge of the distribution of the uncertain customer presence probability and of its characteristic values, such as the mean and variance. Samples obtained from simple data collection cannot represent the overall population and give a one-sided picture. To obtain the distribution characteristics of the uncertain parameters more accurately, a data-driven nonparametric kernel density estimation method is used. The uncertain variable to be estimated is the probability that a customer point requires service, denoted p. Suppose P1, P2, …, Pn are n random samples of this discrete variable, with probability density function f(x) whose specific form and parameters are unknown. The empirical distribution function can be expressed as

Suppose the kernel function is K0(p). Then, its density estimation function is

Selecting the most widely used Gaussian function as the kernel function transforms the above formula to

Bandwidth h has a great influence on the estimated distribution. An empirical method proposed in Reference [21] can be used to estimate the bandwidth: for sample data close to a normal distribution or a bimodal probability function, we select the following, where σ is the standard deviation of the sample data and R is the interquartile range of the sample data. As an illustration, we construct 1000 historical samples of the presence probability of one customer point, for which the demand probability most often exceeds 55%; the above method can then be used to estimate the data distribution, as shown in Figure 1.
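To make the estimation step concrete, the sketch below (in Python, rather than the paper's MATLAB implementation) fits a Gaussian kernel density estimate to simulated presence-probability samples. The bandwidth uses the widely known Silverman-type rule of thumb, which has the form described above (it combines σ and the interquartile range R); the exact constants in Reference [21] may differ, and the sample data here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative historical samples of one customer's presence probability
# (1000 values concentrated above 0.55, mimicking the example in the text).
samples = np.clip(rng.normal(loc=0.6, scale=0.08, size=1000), 0.0, 1.0)

n = samples.size
sigma = samples.std(ddof=1)                      # sample standard deviation
q75, q25 = np.percentile(samples, [75, 25])
R = q75 - q25                                    # interquartile range

# Silverman-type rule of thumb (assumed form; the constants in [21] may differ).
h = 0.9 * min(sigma, R / 1.34) * n ** (-1 / 5)

def kde(p, data, h):
    """Gaussian kernel density estimate evaluated at the points p."""
    u = (np.asarray(p)[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

grid = np.linspace(0.0, 1.0, 101)
density = kde(grid, samples, h)
print("bandwidth h = %.4f, density peak near p = %.2f" % (h, grid[np.argmax(density)]))
```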

4. Hyper-Heuristic Algorithm

In the field of reinforcement learning, the Q-learning algorithm is widely used because it is a classical method with an easy-to-understand principle [22]. A hyper-heuristic algorithm based on the Q-learning algorithm (HH-QL) is designed here. The main design elements of the Q-learning algorithm are the state, the action, the reward and punishment values, the Q value, the discount rate γ, and the learning rate α.

4.1. Design of the State Value

state = [s1, s2, …, sn] represents the performance of the algorithm's fitness value at the current stage. For problems whose final solution is known, it is easy to judge the gap between the current solution and the final one. For the VRP, however, the state of the algorithm cannot be expressed so directly: a currently poor fitness value may simply mean that the search has just escaped a local optimum and is moving toward another peak, and when that peak is reached the fitness value may be better than the current one. Therefore, the state value design of [19] is adopted, where fit represents the current fitness value, fit' represents the fitness value of the previous generation, and Ck represents a cardinality that differs between operator types; mutation-type and non-mutation-type operators use values of 20 and 40, respectively, so as to distinguish the impact of different operator types on the current state.
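The exact state formula from [19] is not reproduced above. As a purely illustrative sketch of the idea, the relative fitness change can be scaled by the operator-type cardinality Ck and truncated to an integer index; the mapping below is an assumption for illustration, not the paper's formula.

```python
import math

def state_index(fit, fit_prev, is_mutation_operator):
    """Illustrative state discretization (not the paper's exact formula).

    The relative fitness change is scaled by the operator-type cardinality
    Ck (20 for mutation-type operators, 40 otherwise) and truncated to an
    integer state index in [0, Ck - 1].
    """
    ck = 20 if is_mutation_operator else 40
    rel_change = (fit_prev - fit) / max(fit_prev, 1e-12)   # > 0 means improvement
    return min(int(math.floor(abs(rel_change) * ck)), ck - 1)

# Example: the fitness improved from 1050 to 1020 after a non-mutation operator.
print(state_index(1020.0, 1050.0, is_mutation_operator=False))   # prints 1
```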

4.2. Design of the Action and Reward Values

Action = [a1, a2, …, an] represents the next action that the algorithm will perform. Only low-level heuristic (LLH) operators are selected, so the action represents the sequence number of an operator; for instance, a1 is operator No. 1, a single-path operator. The operators proposed in Reference [23] are used as the underlying operators. They fall into three categories, local search (LLH-L), mutation (LLH-M), and location-based radial ruin (LLH-LR), for a total of 11 LLH underlying operators.

reward = [r1, r2, …, rn] indicates the evaluation of the impact of an action on the algorithm at a given historical stage. The evaluation can be good or bad: a good evaluation guides the algorithm to increase the probability of selecting that action when it encounters the same state in the future, while a bad evaluation reduces it. The reward must be designed differently for different types of operators. For instance, the impact of the 2-OPT operator is immediate, while the impact of destruction-and-recombination operators is delayed. Therefore, the reward consists of an immediate reward and a delayed reward, aimed respectively at operators with an immediate influence and operators with a delayed influence. The design is as follows, where I_re represents the immediate reward value, given by

And F_re represents the delayed reward value.

If the fitness value is improved, the immediate reward is +1; otherwise, it is −1; the delayed reward for the immediate reward operator is always set to 0. For the delayed reward operator, its value is set to 0 in the current calculation. At this time, the serial number of the delayed reward operator and relevant information, such as the Q value, are saved. If the fitness value has been improved before the next delayed operator appears, the value of the corresponding operator reward is +1, and its Q value and other information will be updated at the same time.
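A minimal sketch of this bookkeeping is given below; the record of the most recent delayed-impact operator is passed around as a `pending` tuple, and all names are illustrative rather than the paper's.

```python
def immediate_reward(fit, fit_prev):
    """+1 if the fitness value improved (smaller total distance), otherwise -1."""
    return 1 if fit < fit_prev else -1

def assign_rewards(state, action, fit, fit_prev, is_delayed_operator, pending):
    """Sketch of the immediate/delayed reward bookkeeping described above.

    pending holds the record (state, action, fit_at_use) of the last
    delayed-impact operator, or None. Returns (reward, resolved, pending),
    where resolved is a (state, action, +1) credit for a delayed operator
    whose effect has now paid off, or None.
    """
    resolved = None
    if pending is not None and fit < pending[2]:
        resolved = (pending[0], pending[1], +1)
        pending = None
    if is_delayed_operator:
        return 0, resolved, (state, action, fit)   # reward deferred for now
    return immediate_reward(fit, fit_prev), resolved, pending

# Example: a delayed-impact operator is applied at fitness 1000, and the next
# (immediate) operator then improves the fitness to 980, crediting the delayed one.
r1, credit1, pend = assign_rewards(3, 7, 1000.0, 1000.0, True, None)
r2, credit2, pend = assign_rewards(3, 2, 980.0, 1000.0, False, pend)
print(r1, credit1, r2, credit2)   # 0 None 1 (3, 7, 1)
```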

4.3. Design of the Q Value

The Q value is a function value calculated from the reward value and the historical Q value. It prevents the evaluation from being too one-sided, as it would be if the reward alone determined the behavior of the algorithm. The calculation formula is as follows, where s represents the state, a the action, r the reward, t the corresponding value of the previous generation, t + 1 the updated value of this generation, and max_a the largest Q value over all actions.
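This corresponds to the standard tabular Q-learning update. A minimal sketch with a NumPy Q-table indexed by (state, action) is shown below; the table dimensions (40 states, 11 operators) are illustrative.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Standard tabular Q-learning update described above:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s_next, a') - Q(s, a))."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Example with 40 states and the 11 low-level operators as actions.
Q = np.zeros((40, 11))
Q = q_update(Q, s=3, a=5, r=1, s_next=7, alpha=0.1, gamma=0.8)
print(Q[3, 5])   # 0.1
```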

4.4. Design of γ and α Values

γ indicates the weight placed on future reward and punishment values. A larger γ places more emphasis on predicted future results, that is, it takes a more optimistic view of the current behavior; a smaller γ places more emphasis on the immediate return and imposes stricter requirements on the current behavior. The value of γ has been studied in detail in the literature [23], and the recommended value is 0.8, which works well for the algorithm in most cases.

α represents the learning rate of the Q value function. A larger α means that the new value replaces the old value to a greater extent, whereas a smaller α means that the old value still dominates and the algorithm pays more attention to the historical effect of the action. In the VRP, the algorithm may converge to a local optimum, so the α value needs to be increased when the fitness value has not improved for a long time. By exploiting the periodic fluctuation of the cosine function, the α value is determined as

Here, SN records the number of consecutive iterations in which the fitness value has not improved.
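The exact cosine schedule is not reproduced above. One form consistent with the description, in which α rises smoothly as the stall counter SN grows, is sketched below; the constants are assumptions for illustration, not the paper's values.

```python
import math

def learning_rate(sn, sn_max=50, alpha_min=0.1, alpha_max=0.9):
    """Illustrative cosine-based learning-rate schedule (not the paper's exact formula).

    sn counts the consecutive iterations without improvement of the fitness value;
    as sn approaches sn_max, alpha rises smoothly from alpha_min to alpha_max so
    that new information increasingly overrides the historical Q values.
    """
    frac = min(sn, sn_max) / sn_max
    return alpha_min + (alpha_max - alpha_min) * 0.5 * (1.0 - math.cos(math.pi * frac))

print([round(learning_rate(sn), 3) for sn in (0, 10, 25, 50)])   # [0.1, 0.176, 0.5, 0.9]
```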

4.5. Other Considerations

When a customer point with a high presence probability is assigned to a service path but ultimately does not require service, the route can incur a long detour between the customer points immediately before and after it, and hence a sharp increase in cost. To avoid this situation, a customer-point bundling method is applied before the population is generated: nearby customer points are bundled according to the distance between them, as sketched below.
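A minimal sketch of this bundling step, assuming a precomputed distance matrix dist and index lists of fixed and uncertain customers (all names illustrative), is:

```python
import numpy as np

def bundle_uncertain_customers(dist, fixed, uncertain):
    """Pair each uncertain customer with its nearest fixed customer.

    dist is the full customer-to-customer distance matrix; fixed and uncertain
    are lists of customer indices. Returns (i, j) pairs meaning that uncertain
    customer i is bundled with fixed customer j into a combined point C[i, j].
    """
    fixed = np.asarray(fixed)
    bundles = []
    for i in uncertain:
        j = fixed[np.argmin(dist[i, fixed])]
        bundles.append((i, int(j)))
    return bundles

# Example: six customers on a line; customers 4 and 5 are the uncertain ones.
coords = np.array([[0, 0], [1, 0], [2, 0], [5, 0], [1.2, 0], [4.8, 0]])
dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
print(bundle_uncertain_customers(dist, fixed=[0, 1, 2, 3], uncertain=[4, 5]))  # [(4, 1), (5, 3)]
```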

The algorithm flow is as follows (see Figure 2):
Step 1: Bundle the customer points. First, uncertain customers are bundled with fixed customers. Let an uncertain customer point be UCi and, from the original distance matrix, find the fixed customer point j for which the distance cij is minimal among all fixed customer points; customer point j is then bundled with i to form a new customer point C[i,j].
Step 2: Initialize the population. Randomly generate Npop individuals to form the feasible solution group P (pi = p1, p2, …, pNP), calculate the fitness values f (fi = f1, f2, f3, …, fNP), and randomly select one individual from P. Initialize the global optimal population PB and the optimal solution FB, initialize the state value to 0, initialize the action to any of the 11 operators, and initialize the Q-Table to 0.
Step 3: Calculate and update. Apply the operator of the current action to search the solution, obtaining a new individual Ind with fitness value fit. If fit ≤ FB, the current solution is better, so update the global optimal population PB and FB such that PB = Ind and FB = fit.
Step 4: Calculate the state. First, judge whether to accept the solution according to the simulated annealing criterion. If it is accepted, calculate the state value; otherwise, the value remains unchanged.
Step 5: Evaluate the action. Determine the operator corresponding to the current action. If it is an operator with a delayed impact on the algorithm and DLN is empty, record DLN = [action, reward, Q, fit]; if DLN is not empty, update the corresponding historical value in the Q-Table.
Step 6: Calculate the reward and Q values. Calculate the reward and Q values according to equations (45)–(47) and update the Q-Table.
Step 7: Select the action. According to the Q-Table, select the action with the largest Q value for the current state as the algorithm's next action.
Step 8: Update the learning rate α according to equation (48).
Step 9: Check the iteration count. If the current iteration number t ≤ tmax, return to Step 3 for another iteration; otherwise, exit the algorithm.
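For concreteness, the skeleton below sketches this loop in Python (the paper's implementation is in MATLAB). It is a structural sketch only: the low-level operators are stand-in placeholders, the reward is the immediate +1/−1 form for every operator, and the learning rate is fixed; the delayed-reward bookkeeping and the cosine learning-rate schedule sketched earlier would slot into Steps 5–8. All names and parameter values here are illustrative, not the paper's.

```python
import math
import random
import numpy as np

def hh_ql(init_solution, fitness, operators, t_max=10000, gamma=0.8,
          alpha=0.5, n_states=40, temperature=1.0, cooling=0.999):
    """Structural sketch of the HH-QL loop (Steps 2-9 above); illustrative only."""
    Q = np.zeros((n_states, len(operators)))
    sol, fit = init_solution, fitness(init_solution)
    best_sol, best_fit = sol, fit
    state = 0
    action = random.randrange(len(operators))          # Step 2: random initial action

    for _ in range(t_max):                              # Step 9: iteration limit
        new_sol = operators[action](sol)                # Step 3: apply the selected operator
        new_fit = fitness(new_sol)
        if new_fit <= best_fit:                         # Step 3: update the global best
            best_sol, best_fit = new_sol, new_fit

        # Step 4: simulated-annealing acceptance; the state changes only if accepted.
        if new_fit < fit or random.random() < math.exp((fit - new_fit) / max(temperature, 1e-9)):
            next_state = min(int(abs(fit - new_fit) / max(fit, 1e-9) * n_states), n_states - 1)
            reward = 1 if new_fit < fit else -1         # Steps 5-6: immediate reward only
            sol, fit = new_sol, new_fit
        else:
            next_state, reward = state, -1
        temperature *= cooling

        # Step 6: standard Q-learning update of the Q-Table.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])

        # Step 7: choose the action with the largest Q value for the new state.
        state = next_state
        action = int(np.argmax(Q[state]))

    return best_sol, best_fit

# Tiny usage example with two dummy operators on a permutation "route".
rng = random.Random(0)
def swap(route):
    r = route[:]; i, j = rng.sample(range(len(r)), 2); r[i], r[j] = r[j], r[i]; return r
def reverse_segment(route):
    r = route[:]; i, j = sorted(rng.sample(range(len(r)), 2)); r[i:j + 1] = reversed(r[i:j + 1]); return r
dummy_fitness = lambda route: sum(abs(a - b) for a, b in zip(route, route[1:]))
start = list(range(20)); rng.shuffle(start)
best, cost = hh_ql(start, dummy_fitness, [swap, reverse_segment], t_max=2000)
print("best cost:", cost)
```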

5. Computational Experiments

5.1. Implementation Aspects, Configuration of Parameters, and Instances

The algorithm was coded in MATLAB, and all experiments were conducted on a computer with an Intel Core i5-3230M and 12 GB of RAM. After repeated testing, the parameters of the algorithm were set as follows: discount rate γ = 0.8 in the Q value function, initial value ε = 0.5, maximum number of iterations tmax = 10^6, experience pool size NE = 800, and number of samples selected for learning NS = 600.

All instances in set A of the CVRP benchmark (https://www.bernabe.dorronsoro.es/vrp/) were selected for the experiments. In each instance, 25% of the customer points were randomly selected as uncertain customers. The random presence probability was set in the range 0.3–0.9, and different interval conditions, such as [0.3, 0.5] and [0.6, 0.9], were used to generate 1000 random {0, 1} samples that are not uniformly distributed, where 0 represents no demand and 1 represents demand. The demand of the next customer point becomes known only when the vehicle arrives at the previous customer point. The demand probability was tallied every 10 samples within the 1000 random samples. In this way, a random demand based on the original demand of each customer point was generated, and 1000 historical samples were produced for each customer point (for a customer point with deterministic demand, the demand remains constant). Because there are many uncertain customer points in each instance, they are not listed one by one; only the statistical mean μ and variance σ2 of some customer points in a randomly selected instance are displayed, as listed in Table 1.
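As a small illustration of this data-generation procedure (Python; the exact sampling scheme used in the experiments may differ), uncertain customers can be drawn and given 1000 binary presence samples as follows, with the per-customer mean and variance then computed from the history, as in Table 1.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_uncertain_customers(n_customers, frac=0.25, prob_interval=(0.3, 0.9), n_hist=1000):
    """Generate illustrative historical presence samples for uncertain customers.

    A fraction frac of the customers is marked as uncertain; each gets a presence
    probability drawn from prob_interval and n_hist binary samples (1 = demand
    occurs, 0 = no demand). Returns the uncertain indices, the 0/1 history, and
    the per-customer mean and variance of that history.
    """
    uncertain = rng.choice(n_customers, size=int(frac * n_customers), replace=False)
    p_true = rng.uniform(*prob_interval, size=uncertain.size)
    history = (rng.random((uncertain.size, n_hist)) < p_true[:, None]).astype(int)
    return uncertain, history, history.mean(axis=1), history.var(axis=1, ddof=1)

uncertain, history, mu, var = make_uncertain_customers(32)
print(uncertain[:5], np.round(mu[:5], 3), np.round(var[:5], 3))
```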

5.2. Comparison with Other Algorithms

To test the performance of the algorithm, the HH-QL algorithm is compared with the algorithms in the literature [24–26]. The results are shown in Table 2. Each instance is solved 20 times; the objective function is the shortest total traveling distance, and the smaller the distance, the better the performance of the algorithm. Column BKS lists the best known solution distance for the instance, column Min the best distance obtained by the algorithm, column Avg the average of the obtained distances, and column DEV the gap between the obtained cost and the BKS, calculated as shown in (49). Values in bold are the optimal solutions.
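Equation (49) is not reproduced here; the conventional definition of this gap, which we assume is what is intended, is DEV = (Cost − BKS) / BKS × 100%, where Cost is the distance obtained by the algorithm (Min or Avg, as appropriate).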

It can be seen from the results that the proportion of instances for which the optimal solution is obtained is 62.96% for LNS-ACO, 74.00% for HVNSOS, and 51.85% for OHGA, whereas for HH-QL it is 81.48%. Compared with the other algorithms, the HH-QL algorithm therefore has an advantage in the overall optimality rate. The average deviation of the four algorithms from the optimal solutions of the 27 instances is 0.60%, 0.13%, 0.37%, and 0.07%, respectively; that is, the deviation of HH-QL is 0.53%, 0.06%, and 0.30% lower than those of the other three algorithms. In all instances with more than 60 customers, the proposed algorithm is more accurate than the other algorithms. It can be concluded that, compared with the other three multigroup-search algorithms, the proposed algorithm has a better search effect on the set A standard instances.

5.3. Comparison with Certain Parameters Problem

To test the effectiveness of the distributionally robust model and algorithm described above, the proportion of unsatisfied customers and the additional cost they cause are evaluated.

Table 2 lists the path distance cost and the customer satisfaction ratio of each instance of set A after robust optimization. The greater the percentage of cost reduction and the larger the proportion of satisfied customers after optimization, the better the optimization effect. The Instance column lists the serial number of the problem; in the Type column, Certain denotes the problem with certain parameters and Robust denotes the robust optimization problem with uncertain parameters; the Cost column lists the distance cost; the Decrease column lists the percentage by which the expected cost is lower than the originally planned cost; and the Total column lists the proportion of the 1000 historical samples with uncertain customer presence probability in which the minimum-cost path plan can be fully satisfied. E1 represents the proportion of samples in which all customer points are satisfied, E2 the proportion in which two customer points are not satisfied, and E3 the proportion in which three or more customer points are not satisfied.

After the route is planned, if the demand of a customer point is not met, the vehicle immediately returns to the depot for replenishment and then returns to that customer point to continue the distribution service along the original route. Table 3 shows, for the set A instances, the additional distance cost incurred by returning to the warehouse to pick up goods and continue delivery when customer demand cannot be satisfied. The smaller the increase in distance cost, the better the robust optimization effect; an increased cost of 0 means that the vehicle never needs to return to the warehouse. If the penalty coefficient is set to 1, the penalty cost equals the distance cost of the driving path. The Customer number column lists the maximum number of customers whose demand is not met over the 1000 historical samples. The More cost column lists the additional cost of returning to the depot because customer demand is not met in one of the 1000 historical samples; the Maximum and Minimum columns list the maximum and minimum of this cost, respectively. The Fact cost column lists the distance cost obtained by driving the 1000 historical samples according to the actual customer points. The symbol — indicates that no customer is left unsatisfied by the certain solution (Table 4).

Table 2 lists, for the set A instances, the certain and robust optimization methods compared in terms of expected value and customer point satisfaction. Because the paths obtained by the certain method account for the presence of all customer points, the Total column does not apply to them. The path obtained by the robust optimization method is based on the expectation of the customer presence probability, so the demand of a customer point is appropriately reduced according to its presence probability. Therefore, when the customer point actually exists, the vehicle load may be insufficient and the demand at some point cannot be met; if the vehicle returns for replenishment, this point and the subsequent points are affected. Combining this with the comparison of cost and expected value in Figure 3, we find that the overall distance after robust optimization is smaller than that obtained by the certain method, with a reduction of 17%–30%, which is substantial and reflects the adjustment of the distance cost. Moreover, in the customer satisfaction chart, the E1 bar accounts for almost the entire height of the bar chart of each instance. Meanwhile, as the number of customer points increases, the probability that three or more customer points are not satisfied also grows; that is, the more customers there are, the greater the possible disturbance of the uncertain parameters, the greater the difficulty of the problem, and the higher the requirements on the planned path. It can be concluded that the satisfaction of the robustly optimized path over the actual samples remains high, and the probability that one or two customers are not satisfied is low.

Table 3 lists, for the set A test experiments, the actual transportation distance cost computed from the customer presence data of the 1000 historical samples, together with the additional distance cost generated when replenishment is required. As shown in Figure 4, whether for the maximum, minimum, or average value, the curve after robust optimization lies almost entirely below the curve of the certain results; that is, the a priori path obtained by robust optimization reduces the transportation distance cost and prevents unnecessary cost waste compared with the path obtained by the certain method. In the maximum-value plot, the savings are larger when the customer scale is large and less obvious when it is small. In the minimum-value and average-value plots, the effect becomes quite obvious once the number of customer points reaches 48.

As shown in Figure 4(d), if the penalty coefficient is assumed to be 1 and only the impact of the replenishment distance cost is considered, the total cost of the robust solution is the average actual distance cost over the 1000 historical samples of uncertain customers plus the average penalty cost caused by replenishment. The figure plots the difference with respect to the certain method, that is, certain cost average − (robust cost average + more cost average). It can be seen that, for most instances, the total cost after robust optimization is lower than that of the certain method; there are only three exceptions in set A, accounting for 18.75% of the total. Even where the curves are nearly coincident, the low replenishment probability and short replenishment distances of the robust paths keep the total cost on a downward trend, which fully reflects the cost-saving advantage of the robust model.

Figure 5 shows the path diagrams of the certain and robust optimization methods for instances with a large total cost reduction. The triangular points represent uncertain customer points. The path distributions of the two methods are roughly the same. The certain method searches for the shortest path assuming that all customers certainly need service, so its paths intersect less, whereas the robust optimization method pursues a reasonable allocation of the uncertain customers. Because the bundling method is used in the algorithm, uncertain customer points in the robust solution tend to lie on the same path as their nearest customer points, such as points 49, 35, and 4 in A-n62; whether or not such an uncertain customer point turns out to have demand, the change in the total path is then small. Of course, the many path intersections suggest that the algorithm may need more iterations and that the result shown is not optimal; if the optimal solution were reached, the distance cost would only be shorter, reflecting the effect of robust optimization even better.

6. Conclusions and Future Work

A distributionally robust optimization method is proposed to solve the VRP with an uncertain customer presence probability. The method uses distributionally robust optimization theory to transform the uncertain model into a deterministic robust optimization model with parameters, and the model is further refined with a data-driven kernel density estimation method. We also develop a hyper-heuristic algorithm to solve the problem.

Compared with other algorithms, our hyper-heuristic algorithm based on reinforcement learning performs well. In solving the UVRP, the data-driven distributionally robust optimization model produces much better results than the deterministic model, and our method can effectively solve the problem and reduce the additional cost.

This paper studies the VRPUC, but in the real world there are more uncertain factors, such as vehicle travel time and customer demand, and it will be necessary to consider the joint influence of multiple uncertain factors in future work. The hyper-heuristic low-level operators used in this paper are common and relatively general; designing problem-specific low-level operators for such problems is also a meaningful direction for future work.

Data Availability

All calculation instances (https://www.bernabe.dorronsoro.es/vrp/) in set A in the CVRP are selected for the experiment. All the data used to support the findings of this study are also available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The study was supported by the National Natural Science Foundation of China (no. 61402409), the Natural Science Foundation of Zhejiang Province, China (no. LY19F030017), and the Foundation of the State Key Laboratory of Digital Manufacturing Equipment and Technology (Grant no. DMETKF2022024).