Abstract

Intelligent metro systems produce massive passenger flow and traffic data every day, among which route, station, and operation data are important for optimizing the train operation scheme. In this paper, we collect passenger flow information from Shenzhen metro, analyze the passenger flow pattern and its distribution characteristics using a data warehouse on the Hadoop platform, and optimize the train operation scheme. Using dynamic passenger flow data, we develop an optimization model with train departure time and dwell time as decision variables and with passenger waiting time, passenger ride time, train full load ratio, and train operation balance as objectives. An improved parallel genetic algorithm (GA) incorporating a simulated annealing algorithm (SAA) and an optimal individual retention strategy is used to find the optimal result. To verify the usefulness of the method, simulation experiments are conducted on the optimization model and method using real passenger flow and train operation data from Shenzhen metro, and the simulation results are compared with the original plan.

1. Introduction

The metro system is characterized by large capacity, high speed, high frequency, and punctuality, and it has become one of the best options for alleviating urban traffic congestion [1]. Metro systems produce a large amount of passenger flow data [2], such as passenger origin-destination (OD) information, as well as train operation data. Analyzing passenger flow data with big data techniques can improve the transportation efficiency of rail transit trains [3] and passenger satisfaction.

Intelligent metro construction is an important means of relieving urban traffic pressure, and train schedule optimization is one of its key components [4]. In the metro system, passenger OD information is very important because it can be used to optimize the metro train operation plan. Train operation plans are developed from historical traffic data. A plan determines each train's arrival time, dwell time, and departure time at each station, and it must meet operational constraints such as the train full load factor and travel time. Through the analysis of OD data and passenger flow data, we can optimize the train operation scheme to improve passenger satisfaction [5] and reduce the operating cost of the metro.

Many scholars have studied metro schedule optimization. In terms of optimization models and objectives, Wang et al. [6] proposed a mixed integer programming model based on time-varying demand that minimizes passenger waiting time and the number of passengers unable to transfer, with train capacity as a constraint. Zhang et al. [7] developed two nonlinear nonconvex programming models that consider variations in train frequency, train running time, and stopping time; under constraints on train operation and on the passenger boarding and alighting process, they designed the train timetable with the minimum total passenger travel time. Qu et al. [8] proposed a two-step optimization model for adjusting the metro schedule. In the first step, the train departure interval is the decision variable and passenger waiting time is reduced. In the second step, the departure and arrival times of trains at the stations are the decision variables and the total energy consumption of all trains is minimized. Wu et al. [9] proposed a multi-objective train schedule optimization method that minimizes total energy consumption, average waiting time, and average maximum load deviation, and demonstrated through a case study that the method can reduce all three. Xie et al. [10] designed a synchronized optimization model of the metro schedule and stopping timetable that serves both passengers and energy saving, and demonstrated experimentally that it is very effective in reducing train energy consumption, running time, and delay probability.

In terms of optimization methods, Wihartiko et al. [11] used an improved genetic algorithm to solve an integer programming model of the bus schedule problem, with improvements in chromosome design, the initial population recovery technique, chromosome reconstruction, and generation-specific chromosome extinction. Shang et al. [12] established a model that minimizes total passenger travel time and proposed a spatial branch-and-bound algorithm to solve it. Wang et al. [13] proposed a linear weighted compromise algorithm and a heuristic algorithm to find the best solution of a bi-objective integer programming model with train stopping time control. Guo et al. [14] proposed a mixed integer nonlinear programming model for generating optimal train schedules that maximizes interchange synchronization events, designed a hybrid optimization algorithm (PSO-SA) combining particle swarm optimization and simulated annealing, and demonstrated its superiority by comparison with several other algorithms. Tang et al. [15] combined the genetic algorithm and the simulated annealing algorithm to find the best result of an optimization model with multiple constraints. Liu et al. [16] developed a mathematical model of the problem considering headway and dwell time and designed an improved artificial bee colony algorithm to solve it. Tang et al. [17] developed a bi-objective optimization model that minimizes total passenger waiting time and departure time, and designed an improved non-dominated sorting genetic algorithm (NSGA-II) with a specific coding scheme for fast search of Pareto optimal solutions. Huang et al. [18] proposed a two-step model for matching metro passenger relationships and reducing the total waiting time of passengers, and designed a hybrid MCMC-GASA (Markov chain Monte Carlo genetic algorithm simulated annealing) approach to solve the problem.

A review of the literature shows that the metro train schedule optimization problem has been studied extensively. Previous studies commonly assumed a constant passenger flow at a particular moment and then optimized the train operation plan for that moment. In reality, passenger flows vary dynamically over time [19], and previous train schedule optimization often first assumed that the passenger flow follows a normal distribution or some other distribution pattern. Such approximate estimation models describe passenger flow patterns in complex scenarios inaccurately, which may make the optimization model inapplicable under real operating conditions. With the rapid development of big data technology, big data analysis provides new methods and techniques for metro train schedule optimization. We collect historical passenger ticket card data from the metro AFC system, clean the data on a Hadoop big data platform, and then calculate the time-varying passenger arrival rate at each station and the time-varying passenger disembarkation rate between stations. A multi-objective train schedule optimization model that takes into account train movements and passenger demand is proposed. A parallel genetic algorithm (GA) incorporating a modified simulated annealing algorithm is then designed, and an optimal individual retention strategy is added to obtain the best result. We use measured data from Shenzhen metro to evaluate the proposed model and solution method, and the results show that the method is effective and accurate.

The rest of this paper is organized as follows. In Section 2, we describe the methodology for AFC data acquisition and processing. In Section 3, we develop a multi-objective optimization model considering metro operations and passenger travel demand. In Section 4, we propose a parallel improved genetic algorithm incorporating a simulated annealing algorithm to solve the multi-objective optimization function. In Section 5, we apply the multi-objective optimization model to real historical passenger flow data of the Shenzhen metro and solve for the optimal solution. Finally, Section 6 concludes the paper.

2. Data Acquisition and Processing

2.1. Description of Data

The raw data we capture is the ticket card information from the metro automatic fare collection (AFC) system. When a passenger passes through the gate to ride the subway, the passenger information is saved in the AFC system and a corresponding travel record is generated. The record includes the start station address, start line, start station time, destination station address, destination line, and destination time. Shenzhen metro generates approximately 5.9 million records per day, each containing more than 60 attributes. To facilitate subsequent statistics, the source data is cleaned and transformed, and only the fields we use are retained, as shown in Table 1.

The start station ID (denoted by s_station) is the number of the station where the passenger enters. The start line ID (denoted by s_line) is the line on which the passenger enters. The inbound time (denoted by s_time) is the time when the passenger enters the station. The destination station ID (denoted by d_station) is the station where the passenger exits. The destination line ID (denoted by d_line) is the line on which the passenger exits. The exit time (denoted by d_time) is the time when the passenger leaves the metro station. Thus, a passenger's ride record can be expressed as the tuple (s_station, s_line, s_time, d_station, d_line, d_time).

2.2. Data Processing

In recent years, big data analysis technology has developed steadily, and big data platforms have become increasingly mature [20]. The core features of big data platforms are scalable distributed storage and efficient parallel data processing and computing capabilities. In this paper, we set up a multinode Hadoop platform, add the corresponding ecosystem components, such as Hive and HBase, and then complete data processing and model building on this platform.

To reduce data interference and computational effort, we take the raw data stored in HDFS for data cleaning and then use Hive to store the data. Calculations are performed using Hive to get the passenger arrival and disembarkation rates.

Count the number of passengers who enter the metro at each station on the same line in each time period.

Count the number of passengers who leave at each station on the same line in each time period.

Count the number of passengers who enter at one station and get off at another station on the same line in each time period.

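The passenger arrival rate at a station is obtained by dividing the number of passengers entering the station in a period by the length of that period.

The proportion of passengers leaving at a station is obtained by dividing the station-to-station passenger count by the number of passengers who entered at the origin station in the same period.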
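As an illustration, the statistics above can be sketched in plain Python. This is a minimal sketch and not the Hive implementation used on the Hadoop platform: the record layout follows the six fields of Section 2.1, while the function names, the use of seconds since midnight for times, and the fixed period length are assumptions made for the example.

```python
from collections import Counter

# Assumed record layout (fields from Section 2.1):
# (s_station, s_line, s_time, d_station, d_line, d_time), times in seconds since midnight.

def period_index(t_seconds, period_seconds):
    """Index of the time period that contains time t (hypothetical helper)."""
    return int(t_seconds // period_seconds)

def arrival_rates(records, period_seconds):
    """Passengers entering per second, keyed by (line, station, period)."""
    entries = Counter()
    for s_station, s_line, s_time, _d_station, _d_line, _d_time in records:
        entries[(s_line, s_station, period_index(s_time, period_seconds))] += 1
    return {key: n / period_seconds for key, n in entries.items()}

def alighting_ratios(records, period_seconds):
    """Share of passengers entering at an origin in a period who exit at each
    destination, keyed by (line, origin, destination, period)."""
    od = Counter()
    entries = Counter()
    for s_station, s_line, s_time, d_station, d_line, _d_time in records:
        if s_line != d_line:
            continue  # this sketch only covers trips within one line
        p = period_index(s_time, period_seconds)
        od[(s_line, s_station, d_station, p)] += 1
        entries[(s_line, s_station, p)] += 1
    return {k: n / entries[(k[0], k[1], k[3])] for k, n in od.items()}
```

On the real data set the same grouping and counting would be expressed as aggregate queries over the cleaned table; the pure-Python form is only meant to make the definitions concrete.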

3. Multi-Objective Optimization Model

To improve the operational efficiency of the metro, in this section we develop a passenger flow data-driven dynamic optimization model of the metro train operation plan, based on the passenger flow and travel data preprocessed by the Hadoop platform as described in Section 2. The optimization model considers both metro operation, including train operation stability and train loading efficiency, and passenger experience, including passenger ride time, waiting time, and the number of passengers on the train. We take a single metro line, consisting of a sequence of stations served by a set of trains [21], as the object of study, with trains running from the starting station to the ending station. To quantify the parameters of the mathematical model, to match the actual situation of metro operations, and to simplify the overall optimization model, the following assumptions about passengers and metro trains are made.

(1) Only one train can stop at a station in a given direction at any one time, and trains running on the same line do not overtake one another.

(2) When a train enters a station, all passengers line up to get off and get on following the principle of "first off, then on; first to arrive, first to be served."

(3) The maximum capacity of each train is a fixed value. When the number of passengers waiting on the platform exceeds the remaining capacity of the train, the remaining passengers continue to wait on the platform for the next train.

Assumption (1) is generally applicable to most urban transportation systems to ensure that trains operate in sequence. Assumption (2) is in line with the mainstream passenger queuing principle, and assumption (3) can improve the running stability of the train and the comfort of passengers.

3.1. Model of Train Operation

Train operation is generally described by the departure time, inter-station running time, arrival time, and dwell time [22]. For a given train and station, the headway between the train and its preceding train is the difference between the departure times of the two trains at that station. The departure time of a train from a station is the moment when the train arrives at the station plus its stopping time at the station.

The time at which a train arrives at a station is the sum of its departure time from the previous station and the running time between the two stations.

The running time is usually a preset fixed value because the distance between stations is certain and the train runs in autopilot mode between the two stations.

The stopping time of a train at a station is determined by the following quantities: the minimum stopping time of the train; two parameters that denote the time required for one passenger to board and to alight, respectively, which can be obtained analytically; the number of train doors that open at the station; and the numbers of passengers boarding and alighting from the train at the station, which can be estimated from the historical data. For convenience of calculation, we assume that boarding passengers form two lines at each door and alighting passengers form one line inside the train.

In addition, to ensure safe train operation, two adjacent trains must satisfy the minimum headway constraint: the difference between the arrival time of a train at a station and the departure time of the preceding train from the same station must be greater than a constant.
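The timing relations of this subsection can be sketched as a small Python routine. It is only an illustration under stated assumptions: the names `build_timetable` and `headway_ok`, the list-based layout, and the convention that `headways[0]` is the offset of the first train from time zero are all hypothetical and not taken from the paper's implementation. Dwell times are passed in directly rather than computed from passenger counts.

```python
def build_timetable(headways, dwell, run_times):
    """Compute arrival and departure times for every train and station.

    headways[i]  : departure interval of train i after train i-1 at the first station
    dwell[i][j]  : stopping time of train i at station j
    run_times[j] : fixed running time from station j to station j+1
    """
    n_trains, n_stations = len(dwell), len(dwell[0])
    arrive = [[0.0] * n_stations for _ in range(n_trains)]
    depart = [[0.0] * n_stations for _ in range(n_trains)]
    for i in range(n_trains):
        previous = depart[i - 1][0] if i > 0 else 0.0
        depart[i][0] = previous + headways[i]
        arrive[i][0] = depart[i][0] - dwell[i][0]
        for j in range(1, n_stations):
            # arrival = departure from the previous station + running time
            arrive[i][j] = depart[i][j - 1] + run_times[j - 1]
            # departure = arrival + stopping time
            depart[i][j] = arrive[i][j] + dwell[i][j]
    return arrive, depart

def headway_ok(arrive, depart, min_headway):
    """Check that each train arrives more than min_headway after the
    preceding train has departed from the same station."""
    return all(
        arrive[i][j] - depart[i - 1][j] > min_headway
        for i in range(1, len(arrive))
        for j in range(len(arrive[i]))
    )
```

For example, with two trains, two stations, a 300 s departure interval, and a 120 s running time, `build_timetable([0, 300], [[30, 30], [30, 40]], [120])` places the second train's arrival at the second station at 420 s, 270 s after the first train left it.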

3.2. Model of Passenger Demand

The number of passengers on a train when it leaves a station equals the number of passengers on the train when it left the previous station, minus the number of passengers who get off at the station, plus the number of passengers who board at the station.

A train can carry only a maximum number of passengers. As a result, passengers may become stranded at stations during peak traffic. The number of passengers boarding a train at a station is the smaller of the remaining capacity of the train at the station and the number of passengers waiting at the station. The remaining capacity is the maximum number of passengers the train can carry, minus the number of passengers on board when the train arrives, plus the number of passengers who get off at the station.

The number of passengers waiting for a train at a station is the number of passengers stranded at the station by the previous train plus the number of passengers arriving during the headway between the two adjacent trains, which is obtained from the passenger arrival rate in that interval [23].

The number of passengers stranded by a train at a station is the number of passengers waiting for the train minus the number of passengers who board it.

The number of passengers on a train who get off at a station is computed from the numbers of passengers who boarded at the previous stations and the boarding and alighting ratios given by the O-D matrix.

Big data analysis techniques can be used to statistically analyze historical passenger flow data to determine the proportion of passengers boarding and disembarking at each stop.
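The passenger demand relations above can be traced station by station with a short simulation. The following is a minimal sketch under simplifying assumptions that are not part of the paper: the arrival rate at each station is held constant over the study period, `alight_ratio[k][j]` is taken as the share of passengers who boarded at station k and alight at station j, and the function name and data layout are hypothetical.

```python
def simulate_loads(headway, arrival_rate, alight_ratio, capacity):
    """Propagate passenger counts along the line for each train in order.

    headway[i][j]      : seconds between train i and its predecessor at station j
    arrival_rate[j]    : passengers per second arriving at station j (assumed constant)
    alight_ratio[k][j] : share of passengers boarding at k who alight at j
    capacity           : maximum number of passengers on a train
    Returns (loads, boards, stranded), each indexed as [train][station].
    """
    n_trains, n_stations = len(headway), len(arrival_rate)
    left_behind = [0.0] * n_stations  # passengers stranded by the previous train
    loads, boards, stranded = [], [], []
    for i in range(n_trains):
        on_board = 0.0
        boarded_at = [0.0] * n_stations
        row_load, row_board, row_stranded = [], [], []
        for j in range(n_stations):
            alight = sum(boarded_at[k] * alight_ratio[k][j] for k in range(j))
            remaining = capacity - on_board + alight
            waiting = left_behind[j] + arrival_rate[j] * headway[i][j]
            board = min(remaining, waiting)
            left_behind[j] = waiting - board
            on_board = on_board - alight + board
            boarded_at[j] = board
            row_load.append(on_board)
            row_board.append(board)
            row_stranded.append(left_behind[j])
        loads.append(row_load)
        boards.append(row_board)
        stranded.append(row_stranded)
    return loads, boards, stranded
```

With a capacity of 100, a 300 s headway, and 0.5 passengers per second arriving at the first station, 150 passengers are waiting for the first train, 100 board, and 50 are carried over to the next train, which is the stranding effect described above.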

3.3. Multi-Objective Optimization Function

The optimization of train schedules based on dynamic and uneven passenger flows mainly includes train operation optimization and passenger satisfaction optimization. The train operation optimization mainly includes reducing the deviation of the actual train capacity from the desired capacity and ensuring the balance of train operation. Passenger satisfaction optimization consists of reducing the waiting time in the station and the travel time between stations.

The waiting time of passengers at the platform is the sum of the waiting time of passengers who were stranded after the departure of the previous train and the waiting time of passengers who newly arrive during the interval between two trains.

Passenger travel time is the sum of the time passengers spend on board while the train is running between stations and the time they spend on board while the train is stopped at each station.

The train running balance is measured by the differences between the stopping times of two adjacent trains at each station.

The train loading efficiency is measured by the difference between the actual capacity of the train and the desired capacity of the train.

Considering the above elements, the multi-objective optimization function combines the four objectives using the weights a, b, c, and d, which apply to passenger waiting time, passenger travel time, train operation balance, and train capacity deviation, respectively, and which are set according to the optimization needs. During peak passenger periods, the values of a and b should be increased so that passengers are carried quickly and waiting and journey times are reduced. During off-peak periods, the stability of train operation should be improved and the operating cost decreased, so the values of c and d should be increased. During periods of stable passenger flow, the weights can be set in a balanced manner that takes into account both the stability of train operation and passenger waiting time. In short, the weights of the optimization objectives should reflect both the passenger flow and the optimization requirements, and the best weights should be chosen after repeated tests.
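To make the role of the weights concrete, the sketch below combines four objective values as a weighted sum and computes one possible balance measure. Several points are assumptions for illustration only: that the combination is a plain weighted sum, that the four values have already been scaled to comparable magnitudes, that balance is measured with absolute differences, and the function names themselves. The default weights are the values a = 0.4, b = 0.3, c = 0.2, d = 0.1 used in Section 5.

```python
def operation_balance(dwell):
    """Sum of absolute differences between the stopping times of adjacent
    trains at each station (absolute value is an assumed choice)."""
    return sum(
        abs(dwell[i][j] - dwell[i - 1][j])
        for i in range(1, len(dwell))
        for j in range(len(dwell[i]))
    )

def weighted_objective(wait_time, ride_time, balance, load_deviation,
                       weights=(0.4, 0.3, 0.2, 0.1)):
    """Weighted sum of the four objectives; smaller is better.

    The inputs are assumed to be normalized beforehand, since the raw
    quantities have different units.
    """
    a, b, c, d = weights
    return a * wait_time + b * ride_time + c * balance + d * load_deviation
```

Raising a and b, as suggested for peak periods, makes the same reduction in waiting or travel time count for more in the combined value than an equal reduction in the balance or capacity terms.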

4. Solution Method

To find the best solution of the multi-objective optimization model proposed in the previous section, we designed an improved parallelized genetic algorithm and implemented it on the Hadoop big data platform.

4.1. Improved Genetic Algorithm

The genetic algorithm is a computational model of natural selection and biological evolution that searches for optimal solutions by simulating the natural evolutionary process. GA provides a number of benefits, including the capacity to handle continuous and discrete variables, flexible constraint definition, the capacity to handle huge search spaces, and the capacity to provide multiple optimal or good solutions [24]. The simulated annealing algorithm is derived from the principle of solid annealing and has been shown to be quite successful in locating the global optimum for a variety of NP-hard combinatorial problems [25]. Starting from a certain initial temperature, the probabilistic jumping property of SAA helps the objective function reach the global optimal solution within the desired time as the temperature decreases [26]. Given the benefits of these two methods, Gandomkar et al. [27] presented a hybrid algorithm that combines GA and SAA to optimize the distributed generation resource allocation problem.

The advantages of the genetic algorithm are that it searches the whole solution space quickly, has excellent global search ability, and avoids the trap of fast descent into local optima that affects other algorithms; it is also suitable for distributed computing, and its natural parallelism speeds up convergence. On the other hand, the local search ability of the genetic algorithm is insufficient, and a simple genetic algorithm is time-consuming and inefficient in the late stage of evolution. SAA has relatively powerful local search ability [28], but it cannot direct the search process toward the most promising region. Therefore, we improved the genetic algorithm and designed an adaptive genetic algorithm incorporating a simulated annealing algorithm with an optimal individual replacement strategy as follows:

(a) Encoding: The code consists of each train's departure time at the origin station and its stopping time at each station, using real-number coding with values generated within the departure interval and stopping time constraints. An individual in the initial population is a vector that contains, for every train, the interval between its departure and that of the preceding train from the first station, followed by its stopping time at each station.

(b) Selection: The genetic algorithm uses the roulette wheel selection method. Because this probabilistic selection is random, we use the best individual replacement strategy to retain good individuals, i.e., we replace the individuals with low fitness values with those with high fitness values, thus increasing the fitness of the offspring. The specific selection method is as follows:

(1) Calculate the fitness of each individual in the current population and find the individual with the highest fitness. Let the number of individuals in the population be N.

(2) Calculate the probability that each individual is selected and the corresponding cumulative probability.

(3) Randomly generate N numbers in [0, 1] and store them in the array m as selection thresholds. For an element of m, compare it with the cumulative probabilities of the individuals in order: if the cumulative probability of an individual is greater than the element, that individual is selected; otherwise, the next individual is compared, until an individual is selected.

(4) Repeat step (3) until N individuals are selected.

(5) Recalculate the fitness of the N newly selected individuals and find the individual with the lowest fitness.

(6) Replace the worst individual with the best individual found in step (1) to form the next generation population.

(c) Crossover: Two individuals are selected for the simulated binary crossover operation according to the set crossover probability, and then the child fitness value and the parent fitness value are calculated for the simulated annealing operation. Starting from an initial temperature, the temperature is lowered step by step using a cooling coefficient, which is a positive number less than 1 and generally takes values between 0.8 and 0.99. During annealing, the new state is accepted with a probability given by the Metropolis criterion.

(d) Mutation: Polynomial mutation for real-coded chromosomes is applied, according to a set mutation probability, to the chromosomes that have completed the crossover operation.
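The selection and annealing steps above can be sketched as follows. This is a minimal illustration rather than the implementation evaluated in this paper. It assumes that fitness values are positive and are to be maximized, that the selection probability is an individual's fitness divided by the population total, that cooling is geometric, and that the Metropolis acceptance probability takes the usual exponential form. All function names are hypothetical.

```python
import math
import random

def roulette_with_elitism(population, fitness, rng):
    """Roulette-wheel selection followed by best-individual replacement."""
    scores = [fitness(ind) for ind in population]
    best = population[max(range(len(population)), key=scores.__getitem__)]
    total = sum(scores)
    cumulative, acc = [], 0.0
    for s in scores:
        acc += s / total              # selection probability of this individual
        cumulative.append(acc)        # cumulative probability
    selected = []
    for _ in population:
        r = rng.random()
        # first individual whose cumulative probability reaches r
        idx = next((k for k, c in enumerate(cumulative) if c >= r),
                   len(population) - 1)
        selected.append(population[idx])
    worst = min(range(len(selected)), key=lambda k: fitness(selected[k]))
    selected[worst] = best            # keep the previous best individual
    return selected

def metropolis_accept(parent_fit, child_fit, temperature, rng):
    """Always accept a child that is at least as fit; accept a worse child
    with probability exp(delta / T), where delta < 0."""
    delta = child_fit - parent_fit
    if delta >= 0:
        return True
    return rng.random() < math.exp(delta / temperature)

def cool(temperature, rate=0.9):
    """Geometric cooling with a coefficient between 0.8 and 0.99."""
    return temperature * rate
```

Because the best individual is written back after the roulette draw, the highest fitness in the population cannot decrease from one generation to the next, while the Metropolis test still lets some worse children through at high temperature so the search does not stall early.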

4.2. Parallel Genetic Algorithm

Based on the improved genetic algorithm proposed earlier, we have proposed an improved parallel genetic algorithm. The specific algorithm is described as follows:

Algorithm 1: Genetic evolution within one population.
Input: <key, value> pairs, where key is an individual in one population and value is the fitness of that individual.
Output: <key′, value′> pair, where key′ is the best individual found during the iterations and value′ is the fitness value corresponding to key′.
Algorithm Procedure:
(1) Set the number of iterations to M.
(2) Initialize integer i = 0.
(3) While (i < M):
  Compute individual fitness;
  Compute individual cumulative probability;
  Select;
  Crossover;
  Mutation;
  i++;
  End while
(4) Compute individual fitness;
(5) Find the best chromosome and fitness;
(6) Output the chromosome and fitness;

Algorithm 2: Selection of the best individual across all populations.
Input: <key, value> pairs, where key is the best individual in each population and value is the best fitness in each population.
Output: <key′, value′> pair, where key′ is the best individual over all populations and value′ is the best fitness among the optimal individuals of the populations.
Algorithm Procedure:
(1) Identify the number of populations N;
(2) For i = 1 to N:
  Find maximum fitness;
  End For
(3) Output the chromosome and fitness;

In Algorithm 1, Step (3) is the regular genetic operation, which includes selecting individuals with high fitness from the population and eliminating individuals with low fitness, crossing chromosomes with a certain probability, and mutating chromosomes with a certain probability. Step (4) computes the individual fitness after the iterations. Step (5) chooses the optimal chromosome and fitness. Step (6) outputs the intermediate <key, value> pair and <key′, value′> pair. In Algorithm 2, Step (2) finds the optimal chromosome and fitness among the optimal solutions of all populations. Step (3) outputs the final chromosome <key, value> pairs and fitness value <key′, value′> pairs to the sequence file on HDFS.
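The two-stage structure of Algorithms 1 and 2 can be imitated on a single machine as follows. This sketch makes several substitutions that should be read as assumptions: Python threads stand in for the Hadoop tasks, and the per-population evolution is a deliberately simplified GA (truncation selection with Gaussian mutation, no crossover or annealing) used only as a placeholder for the operators of Section 4.1. The function names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def evolve_population(seed, fitness, n_individuals=20, n_genes=4, iterations=50):
    """Stand-in for Algorithm 1: evolve one population and return
    (best individual, best fitness). Simplified GA for illustration only."""
    rng = random.Random(seed)
    pop = [[rng.uniform(0, 1) for _ in range(n_genes)]
           for _ in range(n_individuals)]
    for _ in range(iterations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: n_individuals // 2]          # keep the fitter half
        children = [[g + rng.gauss(0, 0.05) for g in p] for p in parents]
        pop = parents + children
    best = max(pop, key=fitness)
    return best, fitness(best)

def best_over_populations(results):
    """Stand-in for Algorithm 2: pick the pair with the highest fitness."""
    return max(results, key=lambda pair: pair[1])

def parallel_ga(fitness, seeds):
    """Evolve one population per seed concurrently, then reduce to the best."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda s: evolve_population(s, fitness), seeds))
    return best_over_populations(results)
```

Each population evolves independently and only its best (individual, fitness) pair is passed on, so the second stage has to compare just one candidate per population, which mirrors the key-value flow described above.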

5. Numerical Results

With the intention of verifying the performance of our designed optimization method in the multi-objective optimization model of the metro schedules, we collected the AFC data of Shenzhen metro line 6. The dataset contains a total of 15 million passenger trips and the data file size is over 25 GB. Shenzhen metro line 6 has a total of 27 stations, the distribution of which is shown in the map below (Figure 1).

The existing train schedules have fixed stopping times at each station, as shown in Table 2.

Since the subway trains are in automatic mode, the train runs between two adjacent stations for a fixed period of time. This is shown in the following table (Table 3).

The train is a six-car A-type train, and the other information about the train is listed in Table 4.

The passenger arrival rate with time distribution is obtained using the historical passenger flow data statistics with the Hadoop big data platform for the study period. The following figure shows the distribution of passenger arrival rate at each station of Shenzhen metro line 6 over time in a day (Figure 2).

We focus on two hours of the morning peak period to perform more precise schedule optimization research. Figure 3 shows the statistical exit ratios between stations, where the final station is on the horizontal axis and the starting station is on the vertical axis. A value of 0 in the figure means that few or no passengers from the starting station get off at that station during the period.

A total of 17 trains are scheduled to depart during this period with a departure interval of 435 s. Using the departure interval of trains at the first station and stopping time at each station as the decision variables, the improved genetic algorithm introduced above is used to find the best result. The input information for setting up the genetic algorithm is listed below (Table 5).

The waiting time and travel time of passengers are the primary optimization objectives, and the train operation balance is the secondary optimization objective. Therefore, the weights of the optimization function are set as a = 0.4, b = 0.3, c = 0.2, and d = 0.1. The optimized train schedule does not increase the number of departures, and the departure interval of each train at the departure station is shown in Table 6.

The results of the comparison between the original train timetable and the optimized timetable are shown in Figure 4, where the horizontal axis shows the arrival and departure times of trains at the stations and the vertical axis shows the stations of line 6 (Table 7).

The experimental results show that, compared with the existing schedule, the optimized metro schedule reduces passenger waiting time by 21.42%, reduces passenger travel time by 22.56%, and increases the train full load ratio by 2.65%. The optimized metro timetable driven by passenger flow data therefore improves passenger satisfaction and train operation efficiency relative to the existing planned schedule.

6. Conclusion

The metro system creates passenger flow data in large quantities, and analyzing and mining this historical data makes it possible to significantly increase operational efficiency and passenger satisfaction. In this paper, we built a Hadoop big data platform to process and analyze the enormous historical passenger flow data of the Shenzhen metro, and then built a data warehouse with the Hive component to calculate the passenger inbound rate of each station and the station-to-station disembarkation ratio, both of which vary over the day. A multi-objective model considering both trains and passengers is proposed to optimize the train timetable. We designed a parallel genetic algorithm improved with a simulated annealing algorithm, using the best individual replacement strategy to retain the best individuals and obtain the best solution. Experiments using actual data from Shenzhen metro line 6 show that the improved train timetable can decrease passengers' waiting and travel times while also enhancing the balance of train operations and transportation effectiveness.

In future studies, we will further develop the proposed model with AFC data for interchanges between multiple lines and will consider train turnaround and turn-back operations. We will also analyze passenger travel characteristics on holidays and weekends in order to optimize train schedules for nonworking days.

Data Availability

Due to the nature of this research, participants of this study did not agree for their data to be shared publicly, so supporting data is not available.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was funded by the Beijing Municipal Natural Science Foundation (Grant No. L201015); the National Key R&D Program of China (Grant No. 2020YFC0833104); and The Green, Intelligent and Safe Mining of Coal Resources (Grant No. 52121003).