Abstract

Aiming at the low efficiency of the prefix sum reduction performed during the execution of the parallel differential evolutionary particle filtering algorithm, a filtering algorithm based on CUDA loop-unrolled prefix sums is proposed. Loop unrolling and warp unrolling remove the thread divergence and thread idleness present in the parallel prefix sum, optimize the loop, and improve prefix sum execution efficiency. By introducing a parallel strategy, the differential evolutionary particle filtering algorithm is implemented in parallel and executed on the GPU, using the improved prefix sum computation during the algorithm's update step. Simulation results show that this parallel differential evolutionary particle filtering algorithm with the improved prefix sum reduction can effectively improve differential evolutionary particle filtering of nonlinear system states and its real-time performance on heterogeneous parallel processing systems.

1. Introduction

Particle filtering is a sequential Monte Carlo method that employs particles to approximate the posterior probability density distribution. In [1], a multiagent coevolution mechanism is introduced into particle filtering, and the resampling process is realized through competition, crossover, mutation, and self-learning among particles, which effectively alleviates particle degradation and particle scarcity. Literature [2] compared the filtering accuracy of particle filtering under different search strategies; the accuracy of the differential evolutionary particle filtering algorithm was improved, but at the cost of higher computational complexity. To address this computational cost, literature [3-5] proposed GPU-based parallel particle filtering algorithms, which combine the traditional particle filtering algorithm with the GPU to exploit the performance of GPU parallel computing and accelerate the particle filter. Literature [6, 7] proposed a GPU-based parallel optimization design and implementation of particle filtering to improve the computational speed of the tracking algorithm. Literature [8-10] designed and implemented a CUDA-based parallel particle swarm optimization algorithm, which uses a large number of GPU threads to accelerate the convergence of the whole swarm. The abovementioned works use parallel reduction to simplify thread operations in parallel particle filtering. The prefix sum is an important primitive of parallel programming and serves as a basic building block of many algorithms. Compared with serial algorithms, CUDA-based parallel algorithms execute in a single-instruction multiple-thread fashion, performing more operations per step and improving execution efficiency. However, owing to the execution and memory access patterns of the prefix sum algorithm [11-13], its execution is prone to thread divergence and memory access conflicts, which prevent the GPU hardware from being used effectively. The prefix sum contains a large number of repetitive operations that are simple but inefficient; the segmented prefix sum avoids redundant thread work but suffers from serious memory access problems, keeping GPU utilization low. Literature [14] introduces additional instructions and demonstrates their use in constructing efficient parallel primitives such as prefix sums and segmented prefix sums. In literature [15], researchers used parallel segmented prefix sums for data processing and optimized them to improve overall algorithm performance. In literature [16], researchers used GPUs and parallel particle swarm optimization to solve facility location problems, demonstrating that particle swarm optimization is a flexible optimization technique. In literature [17], several tree data structures are studied for the prefix sum problem, providing a variety of practical solutions, all of which achieve good speedup factors.

To address thread divergence during execution of the parallel differential evolutionary particle filtering algorithm, this paper proposes, on the CUDA architecture, a differential evolutionary particle filtering algorithm based on loop-unrolled prefix sum optimization. The optimization removes thread divergence and reduces the stalls caused by judgment and branch misprediction, progressively improving the computational performance of the particle filtering algorithm.

2. Differential Evolutionary Particle Filtering Algorithm

The differential evolution (DE) algorithm is a stochastic, parallel, direct search algorithm. Its basic idea is to start from a randomly generated initial population and iterate continuously according to fixed operation rules: based on the fitness value of each individual, good individuals are kept and inferior ones are eliminated, guiding the search toward the optimal solution. The algorithm has a simple structure, is easy to implement, requires no gradient information, has few parameters, and supports a variety of search strategies.

The calculation process of the DE-PF algorithm in this paper is as follows.

Step 1. Initialization: at time $k = 0$, draw $N$ particles $\{x_0^i\}_{i=1}^{N}$ from the prior distribution $p(x_0)$ as the initial samples. All particles receive the same initial weight $w_0^i = 1/N$. Repeat the following steps for $k = 1, 2, \ldots$.

Step 2. Prediction: at the current time $k$, sample each particle $x_k^i \sim p(x_k \mid x_{k-1}^i)$ through the state transition model and compute the corresponding predicted measurement $\hat{z}_k^i$.

Step 3. Weight calculation and normalization: after receiving the measurement $z_k$ from Step 2, each particle updates its weight according to the likelihood function, $w_k^i = w_{k-1}^i \, p(z_k \mid x_k^i)$. Normalization makes the particle weights sum to one: $\tilde{w}_k^i = w_k^i / \sum_{j=1}^{N} w_k^j$.

Step 4. Differential evolutionary resampling, as sketched in the kernel below:
(1) Take the current particle set as the initial population for evolution.
(2) Perform the mutation operation on the particle set and then the crossover operation to obtain the candidate particle set.
(3) Calculate the fitness value of the candidate particle set and perform the selection operation to obtain the resulting particle set.
(4) If the evolution termination conditions are satisfied, return to Step 2; otherwise, go to the next step.
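As an illustration of Step 4(2), the following is a minimal CUDA sketch of DE/rand/1 mutation with binomial crossover over a one-dimensional particle set. The kernel name, the scale factor F, the crossover probability CR, and the curand state handling are illustrative assumptions, not the paper's implementation; initialization of the random states is omitted.

```cuda
#include <curand_kernel.h>

// Hypothetical sketch of Step 4(2): DE/rand/1 mutation plus binomial
// crossover over a one-dimensional particle set. F, CR, and the random
// index draws follow the classic DE formulation, not the paper's code.
__global__ void deMutateCrossover(const float *particles, float *candidates,
                                  int n, float F, float CR,
                                  curandState *states)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    curandState localState = states[i];
    // Pick three random indices (distinctness check omitted for brevity).
    int r1 = curand(&localState) % n;
    int r2 = curand(&localState) % n;
    int r3 = curand(&localState) % n;

    // Mutation: v_i = x_r1 + F * (x_r2 - x_r3).
    float v = particles[r1] + F * (particles[r2] - particles[r3]);

    // Binomial crossover (state is one-dimensional, so a single draw).
    candidates[i] = (curand_uniform(&localState) < CR) ? v : particles[i];
    states[i] = localState;
}
```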

Step 5. Arrange the particles in descending order of weight.

Step 6. Count the number of times each particle is copied, excluding the particle itself.

Step 7. Calculate the cumulative sum of the weights from Step 4, excluding each particle's own weight (an exclusive prefix sum).

Step 8. Eliminate the particles with small weights.

Step 9. State output: the optimized particle set is used as an equally weighted sample, and the state estimate is computed as $\hat{x}_k = \frac{1}{N}\sum_{i=1}^{N} x_k^i$.

3. Improved Parallel Prefix Sum

The parallel algorithm needs to compute the cumulative distribution function (CDF) of the particle weights, which is a simple inclusive prefix sum operation: $y_i = \sum_{j=0}^{i} x_j$, where $0 \le i < n$ and $n$ is the size of the data. The sequential computation is straightforward, but the dependency between successive outputs makes parallelization difficult. For small prefix sum problems, a single thread block with a recursive doubling scheme suffices. However, parallel particle filtering must compute much longer prefix sums when the number of particles is large, and Figure 1 shows the same operation applied across particles, i.e., the parallel form of the prefix sum.
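The serial dependency is easiest to see in code; a minimal sketch of the inclusive scan described above:

```cuda
// Inclusive serial prefix sum: y[i] = x[0] + x[1] + ... + x[i].
// Each output depends on the previous one, which is what blocks
// naive parallelization.
void prefixSumSerial(const float *x, float *y, int n)
{
    float running = 0.0f;
    for (int i = 0; i < n; ++i) {
        running += x[i];   // y[i] depends on y[i-1]
        y[i] = running;
    }
}
```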

The parallel prefix sum can be understood as parallelizing the summation of all the numbers in an array. In general, the parallelization follows a binary tree-based reduction, as shown in Figures 2 and 3. Parallel prefix summation can be implemented in two ways:

(1) Direct Prefix Sum. Elements are paired with their immediate neighbors to compute partial sums.

(2) Interleaved Prefix Sum. Elements are paired at a given stride; see the kernel sketch after this list.
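The interleaved pairing scheme is the basis for the improvements that follow. Below is a minimal CUDA sketch of an interleaved-pair, block-level reduction kernel in the style of Figure 3; the kernel name and data layout are illustrative rather than taken from the paper.

```cuda
// Interleaved-pair reduction over one data block per thread block. The
// stride starts at half the block size and halves each round; in every
// round, threads with tid < stride add the element one stride away.
// Minimal sketch: assumes the data size is a multiple of blockDim.x.
__global__ void reduceInterleaved(float *g_idata, float *g_odata)
{
    unsigned int tid = threadIdx.x;
    float *idata = g_idata + blockIdx.x * blockDim.x;  // this block's slice

    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)                 // upper half of threads sit idle
            idata[tid] += idata[tid + stride];
        __syncthreads();                  // results visible before next round
    }
    if (tid == 0)
        g_odata[blockIdx.x] = idata[0];   // per-block partial sum
}
```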

To address the problem of idle threads in the parallel operation of the interleaved prefix sum algorithm, this paper proposes a loop-unrolled prefix sum method to reduce idle threads and improve the efficiency of prefix sum execution.

Examining the interleaved prefix sum method, the initial value of the stride is half of blockDim.x, and only threads with tid less than the stride execute the subsequent instructions. This means that half of the threads are already idle in the first iteration, which wastes GPU computing resources; this idleness is the new problem to target. The performance of the parallel algorithm can still be improved if all threads are utilized, which is the goal of the next optimization step.

Loop unrolling is a technique for optimizing loops by reducing the frequency of branches and loop-maintenance instructions. In an unrolled loop, the loop body is written out multiple times in the code rather than written once and executed repeatedly by the loop construct. Any countable loop can have its iteration count reduced or removed altogether. The number of copies of the loop body is called the loop unrolling factor, and the number of iterations becomes the original iteration count divided by the unrolling factor. For sequential arrays, loop unrolling is the most effective way to improve performance when the iteration count is known before the loop executes, as in the sketch below.

Assume a thread block length of 1024. The threads participating in the reduction iterations at strides 512, 256, 128, and 64 are distributed across different warps (each warp executes only 32 threads simultaneously), so the SM executes these warps in some order, and each reduction iteration must be synchronized within the block. Only when the reduction reaches strides 32, 16, 8, 4, and 2 do the participating threads fall within a single warp that is independent of the other warps, so no interwarp synchronization is needed; within a warp, every instruction is followed by implicit synchronization, which resolves the intrawarp synchronization problem and lets the corresponding global array entries be updated in time without affecting the next instruction.
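As a host-side illustration of the technique, the following is a minimal sketch of a sum loop unrolled by a factor of 4; the function name and the divisibility assumption are illustrative.

```cuda
// Loop unrolling by a factor of 4: the loop body is copied four times,
// cutting branch checks and loop-maintenance instructions to a quarter.
// Assumes n is a multiple of 4 for brevity.
float sumUnrolled4(const float *x, int n)
{
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    for (int i = 0; i < n; i += 4) {   // one branch per 4 elements
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return s0 + s1 + s2 + s3;   // independent partial sums aid ILP
}
```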

In the preceding prefix sum computation, each thread block was responsible for one data block. Now each thread block handles the prefix sum of two data blocks, eliminating instruction overhead and exposing more independent instructions for scheduling, which improves performance. Schematic diagrams of the prefix sum with unrolling factors of 2 and 4 follow. Three unrolling scales are used, 2, 4, and 8, in which one thread block computes 2, 4, or 8 data blocks, respectively: the adjacent data blocks are first added into the data block corresponding to the current thread block, and the result is then summed, as listed in Tables 1-3.
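A minimal sketch of the unrolling-factor-2 scheme described above, in the common CUDA reduction style; the kernel name is illustrative.

```cuda
// Unrolling factor 2: each thread block reduces two data blocks. Before
// the in-block reduction, every thread adds the element one full data
// block ahead, halving the number of blocks launched. Sketch only;
// assumes the data size n is a multiple of 2 * blockDim.x.
__global__ void reduceUnrolling2(float *g_idata, float *g_odata, unsigned int n)
{
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x * 2 + threadIdx.x;
    float *idata = g_idata + blockIdx.x * blockDim.x * 2;

    // Fold the neighboring data block into this one.
    if (idx + blockDim.x < n)
        g_idata[idx] += g_idata[idx + blockDim.x];
    __syncthreads();

    // Usual interleaved in-block reduction.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            idata[tid] += idata[tid + stride];
        __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = idata[0];
}
```

Because each block now covers two data blocks, the kernel would be launched with half as many blocks, e.g. reduceUnrolling2<<<grid / 2, block>>>(d_in, d_out, n).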

The arithmetic intensity of the parallel prefix sum is low, so the bottleneck may lie in instruction scheduling; the remedy is to unroll the for loop. __syncthreads is used for intrablock synchronization: in the reduction kernel, it ensures that all threads have written their partial results to global memory in each round before any thread moves to the next round. As the reduction proceeds, the number of active threads decreases, and once fewer than 32 threads remain active, only one warp is left. Within a single warp, instructions execute in SIMD (single instruction, multiple data) fashion; i.e., when fewer than 32 threads are active, no explicit synchronization control is needed, because each instruction is followed by an implicit intrawarp synchronization. The remaining problem is therefore the loop control and thread synchronization when only one warp is left. Based on this, the warp unrolling method for the interleaved prefix sum is proposed, sketched below.
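A minimal sketch of warp unrolling for the last 32 threads, in the classic pre-Volta style using a volatile pointer; on newer architectures one would use __syncwarp() or warp shuffle intrinsics instead. The kernel name is illustrative.

```cuda
// Warp unrolling: once 32 or fewer threads remain, they sit in a single
// warp and execute in SIMD lockstep, so the last six rounds are written
// out explicitly with no __syncthreads(). volatile forces each store to
// memory rather than a register. Assumes blockDim.x >= 64 and data size
// a multiple of blockDim.x.
__global__ void reduceUnrollWarps(float *g_idata, float *g_odata)
{
    unsigned int tid = threadIdx.x;
    float *idata = g_idata + blockIdx.x * blockDim.x;

    // In-block reduction, stopping before the last warp.
    for (unsigned int stride = blockDim.x / 2; stride > 32; stride >>= 1) {
        if (tid < stride)
            idata[tid] += idata[tid + stride];
        __syncthreads();
    }

    // Final warp: fully unrolled, implicit intrawarp synchronization.
    if (tid < 32) {
        volatile float *vmem = idata;
        vmem[tid] += vmem[tid + 32];
        vmem[tid] += vmem[tid + 16];
        vmem[tid] += vmem[tid +  8];
        vmem[tid] += vmem[tid +  4];
        vmem[tid] += vmem[tid +  2];
        vmem[tid] += vmem[tid +  1];
    }
    if (tid == 0) g_odata[blockIdx.x] = idata[0];
}
```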

Following the preceding experimental analysis, the iterative loop below 32 threads is unrolled. In fact, because the thread block length is bounded (generally 1024), the number of loop iterations is known in advance, so the loop can be fully unrolled, computing the strides 1024, 512, 256, 128, and 64 explicitly; the only caveat is that each step must be followed by a synchronization. Table 4 shows the pseudocode for the fully unrolled loop, and a sketch in the same spirit follows.
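In the same spirit as the pseudocode in Table 4, the following is a minimal sketch of a completely unrolled reduction for a block size of up to 1024; the kernel name is illustrative.

```cuda
// Complete unrolling for block sizes up to 1024 (the CUDA block limit):
// every reduction round is written explicitly, removing all loop
// overhead. The blockDim.x checks are uniform across the block, so the
// __syncthreads() calls outside them are safe.
__global__ void reduceCompleteUnroll(float *g_idata, float *g_odata)
{
    unsigned int tid = threadIdx.x;
    float *idata = g_idata + blockIdx.x * blockDim.x;

    if (blockDim.x >= 1024 && tid < 512) idata[tid] += idata[tid + 512];
    __syncthreads();
    if (blockDim.x >= 512 && tid < 256) idata[tid] += idata[tid + 256];
    __syncthreads();
    if (blockDim.x >= 256 && tid < 128) idata[tid] += idata[tid + 128];
    __syncthreads();
    if (blockDim.x >= 128 && tid < 64) idata[tid] += idata[tid + 64];
    __syncthreads();

    if (tid < 32) {               // last warp: implicit SIMD sync
        volatile float *vmem = idata;
        vmem[tid] += vmem[tid + 32];
        vmem[tid] += vmem[tid + 16];
        vmem[tid] += vmem[tid +  8];
        vmem[tid] += vmem[tid +  4];
        vmem[tid] += vmem[tid +  2];
        vmem[tid] += vmem[tid +  1];
    }
    if (tid == 0) g_odata[blockIdx.x] = idata[0];
}
```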

4. Experiment and Performance Analysis

To verify the basic performance of the parallel algorithm with the improved prefix sum, the algorithm is simulated on a typical one-dimensional nonlinear system model and compared against the loop-unrolled parallel prefix sum, the warp-unrolled parallel prefix sum, and the fully unrolled parallel prefix sum filtering algorithms. The experimental platform consists of a 64-bit Windows 10 system, Visual Studio 2013, and the CUDA 9.2 programming framework, with a GTX 1080 Ti GPU and an i5-4460 CPU. Detailed parameters are listed in Table 5.

The one-dimensional nonlinear system model is as follows:
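A typical choice for this benchmark in the particle filtering literature is the univariate nonstationary growth model; the following equations assume that model rather than reproducing the paper's own:

```latex
% Univariate nonstationary growth model (assumed benchmark):
x_k = \frac{x_{k-1}}{2} + \frac{25\,x_{k-1}}{1 + x_{k-1}^{2}}
      + 8\cos\bigl(1.2(k-1)\bigr) + v_k,
\qquad
z_k = \frac{x_k^{2}}{20} + n_k
```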

The model's system noise, total observation time, crossover probability of the evolutionary algorithm, and maximum number of evolution iterations are set accordingly. This paper uses loop-unrolled parallel prefix sums with unrolling factors of 2, 4, and 8 (2U-PRPDE-PF, 4U-PRPDE-PF, and 8U-PRPDE-PF). Comparison experiments are conducted among PF, 8U-PRPDE-PF, the warp-unrolled PRPDE-PF, and the fully unrolled PRPDE-PF.

4.1. Experimental Analysis of Root Mean Square Error

The algorithm is run for $M$ independent Monte Carlo simulations, and the root mean square error at time $k$ is defined as $\mathrm{RMSE}(k) = \sqrt{\frac{1}{M} \sum_{m=1}^{M} \left(x_k^m - \hat{x}_k^m\right)^2}$.

Here, $x_k^m$ and $\hat{x}_k^m$ denote the actual and estimated states at time $k$ in the $m$th simulation, respectively, and the measurement noise is set as in the model. Figure 4 compares the momentary root mean square error of the five algorithms under two particle number settings.

The state estimation performance of the algorithms is essentially the same. Figure 4 shows that, under the same particle number, the mean square error of the five improved algorithms, 2U-PRPDE-PF, 4U-PRPDE-PF, 8U-PRPDE-PF, WU-PRPDE-PF, and CU-PRPDE-PF, is reduced relative to the IIPRPDE-PF algorithm; all of them preserve the state estimation capability while improving accuracy to some extent, indicating that the improvements enhance the state tracking performance of the filtering algorithm.

4.2. Particle Distribution and Calculation Time Experiments

Figures 5 and 6 show the particle number variation and computation time of the improved loop-unrolled prefix sum algorithms over 60 simulation steps. The simulation curves in Figure 5 show that the particle numbers of 2U-PRPDE-PF, 4U-PRPDE-PF, 8U-PRPDE-PF, WU-PRPDE-PF, and CU-PRPDE-PF decrease gradually and adapt over time. In Figure 6, the computation time of CU-PRPDE-PF is lower than that of the other filters because complete unrolling fully streamlines the recursive loop and raises prefix sum execution efficiency, i.e., it increases the resampling execution rate and lowers the time consumed, while the times of 2U-PRPDE-PF, 4U-PRPDE-PF, 8U-PRPDE-PF, and WU-PRPDE-PF are smaller than that of IIPRPDE-PF with a decreasing trend.

After the recursive loop in resampling is improved, the particle count decreases adaptively, and the computation time of all five loop-unrolled parallel filtering algorithms is reduced below that of IIPRPDE-PF. Unrolling the recursive loop inside resampling raises the overall complexity of the algorithm, and when the number of particles is small, the time needed for recursive sampling to update the particle count in real time is not offset by the time saved through the reduced particle count; this effect disappears once the computation time of each loop-unrolled parallel differential evolutionary particle filter falls below that of its corresponding parallel differential evolutionary particle filter, which also indicates that the loop-unrolled PRPDE-PF sampling yields the more significant improvement in computation time. The filter computation times in Table 6 are obtained from 60 independent Monte Carlo experiments, averaging each filter's running time across experiments; CU-PRPDE-PF requires the least computation time. Combining the accuracy and computation time indicators of each filter, the fully unrolled CU-PRPDE-PF has the least computation time and performs best among the five improved filtering algorithms in this paper.

Compared with the three intelligently optimized parallel particle filtering algorithms of Wang et al. [18], the computation times of the improved algorithms in this paper (2U-PRPDE-PF, 4U-PRPDE-PF) and of the block-parallel intelligently optimized particle filtering algorithm are shown in Table 7. The table shows that, among the three intelligently optimized parallel algorithms, the block-parallel CRPF performs best, followed by the optimized block-parallel variant; the optimization step increases algorithm complexity, so its computational performance drops relative to the plain block-parallel algorithm. Comparing the proposed 2U-PRPDE-PF and 4U-PRPDE-PF algorithms with the block-parallel algorithm, Table 7 shows that 2U-PRPDE-PF has stronger computational performance than block-parallel CRPF, achieving a 3.82x speedup as the number of particles grows, and 4U-PRPDE-PF achieves a 4.689x speedup as the number of particles increases; thus, the proposed algorithms improve performance and obtain a good speedup over the block-parallel algorithm.

The running times of the five parallel differential evolutionary particle filtering algorithms, 2U-PRPDE-PF, 4U-PRPDE-PF, 8U-PRPDE-PF, WU-PRPDE-PF, and CU-PRPDE-PF, based on CUDA loop unrolling with the improved prefix sum, were compared on the same GPU before and after the improvement. Tables 8 and 9 show the running times and speedup ratios of the five improved algorithms, and Figures 7 and 8 correspond to Tables 8 and 9, respectively, where the speedup ratio is defined as the running time of the original algorithm divided by that of the improved algorithm under the same particle count. Figure 8 plots the running time of the original IIPRPDE-PF algorithm divided by the running times of 2U-PRPDE-PF, 4U-PRPDE-PF, 8U-PRPDE-PF, WU-PRPDE-PF, and CU-PRPDE-PF, respectively. The speedup of CU-PRPDE-PF is the largest and stays around 1.19 as the number of particles increases, while the speedups of the other four eventually settle at fixed values as the number of particles grows. On the GTX 1080 Ti with 1024 particles, direct unrolling with factors of 2, 4, and 8 reduces the time from 39.2 ms to 35.64 ms, showing that direct loop unrolling has a substantial impact on efficiency: it not only saves extra thread block launches but also exposes more independent memory load and store operations, yielding better performance and better latency hiding. The incremental improvements in this paper deliver an overall performance improvement of up to 1.19 times.

4.3. Real-Time Performance of Algorithms on Different GPUs

Experimental simulations are performed with the above five improved algorithms on different GPUs. The experimental platform consists of a Windows 10 system, Visual Studio 2013, and the CUDA 9.2 programming framework with an i5-4460 CPU, as listed in Table 10, running the algorithms on four different GPUs over a range of particle numbers.

The performance experiments of the five improved algorithms are conducted on the same CPU and different GPUs. According to the analysis in Figures 9-13, compared with the IIPRPDE-PF algorithm, the five loop-unrolled improvements of IIPRPDE-PF proposed in this paper, 2U-PRPDE-PF, 4U-PRPDE-PF, 8U-PRPDE-PF, WU-PRPDE-PF, and CU-PRPDE-PF, exhibit approximately the same growth rate on different GPUs. The speedup of the algorithm under different GPU conditions is essentially proportional to the computational power of the GPU itself, and the algorithm performs best in the GTX 1080 Ti environment. In this paper, the GPU performance improvement is confined to the improved prefix sum problem. Relative to the IIPRPDE-PF algorithm based on the direct segmented prefix sum, the overall performance improvement of CU-PRPDE-PF reaches 19%, and the performance improvement factor of CU-PRPDE-PF reaches up to 1.45 compared with the original PDE-PF algorithm. The main reason the improvement is limited is that the algorithm is not simply parallel on the GPU: it requires complex operations and even contains quite a few logical judgments. Nevertheless, measurable gains are achieved through the incremental prefix sum improvements.

5. Conclusion

In this paper, we propose a CUDA loop-unrolling-based state estimation method for differential evolutionary particle filtering to address the inefficiency of parallel execution threads in parallel differential evolutionary particle filtering, improving prefix sum execution efficiency through loop unrolling and warp unrolling. The proposed method uses the segmented prefix sums after loop unrolling, the improved resampling, and the latest observation to update the proposal distribution of the optimized particle filter in real time, and it adaptively reduces the number of particles to be sampled through differential evolutionary resampling. In addition, during execution of the particle filtering algorithm, the prefix sum suffers from inefficient thread execution, and the GPU lacks branch prediction capability at every branch it encounters; the algorithm therefore removes the warp divergence and thread idleness present in the parallel prefix sum through loop unrolling and warp unrolling, eliminating the stalls caused by failed judgment and branch prediction and further improving overall computational performance. The current CUDA compiler cannot perform this optimization automatically, so the loop inside the kernel function must be unrolled by hand, which can greatly improve kernel performance. Loop unrolling in CUDA serves two purposes: reducing instruction overhead and adding more independent scheduling instructions to improve performance and reduce pipeline stalls. Simulation results show that the loop-unrolled parallel differential evolutionary particle filtering algorithm can effectively improve intelligent optimized particle filtering of nonlinear system states and its real-time performance. Finally, the experiments show that the algorithm with the improved prefix sum achieves a best speedup of 1.19 relative to the IIPRPDE-PF algorithm and 1.48 relative to the PDE-PF algorithm on the GTX 1080 Ti, and the experimental data show that the overall performance of the algorithm across different GPUs is proportional to GPU computational power, indicating that the improved algorithm has universal applicability.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.