Abstract

Loop selection for multilevel nested loops is a difficult problem for which neither hardware-based loop selection techniques nor traditional software-based static compilation techniques are effective. This study proposes a genetic algorithm (GA)-based method to solve it. First, the formal specification and mathematical model of the loop selection problem are presented; then, the overall framework of the GA is designed based on the mathematical model; finally, we provide the chromosome representation and fitness function, the initial population generation algorithm and chromosome improvement methods, the implementation of the genetic operators (crossover, mutation, and selection), the offspring population generation method, and the GA stopping criterion. Experiments with the SPEC2006 and NPB3.3.1 standard test sets were performed on the Sunway TaihuLight supercomputer. The results indicate that the proposed method achieves a better speedup than the current mainstream methods, which confirms its effectiveness. Solving the loop selection problem of multilevel nested loops is of great practical significance for exploiting the parallelism of general scientific computing programs and for making full use of multicore processors.

1. Introduction

With the rapid development of multicore processor technology, effectively using multicore processors to improve the performance of general scientific computing programs has become a challenge. One way is to create multiple threads that can be executed in parallel. Scientific computing programs generally contain an enormous number of loops, which are well structured and account for a large share of the execution time. Therefore, many studies have focused on creating multiple threads to execute the loops in parallel to improve the overall performance of the program. However, due to the presence of complex dependencies, loops cannot be directly executed in parallel without a parallelizing transformation (e.g., loop distribution and loop exchange). With single-level loops or a small number of nested loop levels (), parallelism can usually be readily achieved through loop transformation. With many nested loop levels, two cases must be distinguished. In the first case, the multilevel nested loops contain loop levels that can be directly executed in parallel, and these levels are moved to the outermost layer for parallel execution through loop exchange. Unfortunately, this situation is very rare. In the second and more common case, no loop level of the multilevel nested loops can be directly executed in parallel, and a set of loop levels must be selected and moved to the outermost layer for serial execution to expose possible new parallelism. As the number of nested loop layers increases, the dependencies between the loop iterations become very complicated, which makes it difficult to analyze and search for such a group of loop levels.

There are currently two approaches to the parallelism problem of multilevel nested loops. One approach is to use the multithreading technology of the underlying hardware, in which the appropriate loop layer is chosen for multithread parallelization with the assistance of the hardware, e.g., the thread-level speculation (TLS) technique and the transactional memory (TM) technique. However, the dependency analysis capability of the underlying hardware is limited and inflexible, which makes it difficult for the TLS technique to analyze the complex dependencies of multilevel nested loops and thus impossible for it to provide effective parallelism [1–6]. The hardware TM technique has limitations on the transaction size (transactional buffer) and duration (operating system events and disruptions can abort the transaction); therefore, when applied to the thread-level parallelism of nested loops, it often overflows the underlying hardware resources because the transactions themselves are too coarse-grained, which ultimately results in poor parallel performance [7–12]. Another common approach is to use the traditional software-based static compilation techniques, in which the dependency analysis test module in the compiler tests, layer by layer, whether there are dependencies between the loop iterations. However, as the number of levels of nested loops to be tested increases, the number of dependencies that must be tested increases exponentially or even factorially, which makes the compilation time unbearably long. In this situation, mainstream compilers (e.g., gcc) usually adopt a conservative approach; i.e., they assume that dependencies are present between all iterations of the loops, and thus, the loops are not parallelized. Typically, scientific computing programs contain a large number of loops, many of which are deeply nested. Researchers from the University of Minnesota and Intel surveyed the maximum numbers of nested loop levels and loops in some scientific computing programs in SPEC2006, and the result is shown in Table 1 [13]. The number of nested loop levels in the SPEC2006 standard test set is clearly very large (e.g., the nesting depth of 445.gobmk reaches 22 levels), and such loops are numerous. Therefore, solving the parallelism problem in nested loops not only enhances the overall performance of general scientific computing programs but is also of great practical significance for expanding the application scope of thread-level parallelism and making effective use of multicore processors.

Since common multilevel nested loops lack loop layers that can be directly executed in parallel, the key to solving the nested loop parallelism issue is to select a group of appropriate loop layers and move them to the outermost layer, where they are executed serially, to expose new parallelism. To expose parallelism with maximum granularity, the number of loop layers that are exchanged to the outermost layer for serial execution must be as small as possible, because the deeper the loop layer at which parallel execution begins, the smaller the granularity of the parallelism. Therefore, to improve the parallel performance of multilevel nested loops in general scientific computing programs, we propose in this study a method that uses the smallest number of loop layers to expose the maximum parallelism in multilevel nested loops.

Many researchers have informally pointed out that selecting the smallest set of loop layers from multilevel nested loops to obtain maximum parallelism is an NP-complete (NPC) problem [14, 15]. As the number of nested loop levels increases, the scale of the problem grows very fast (in factorial order). Therefore, the loop selection problem of multilevel nested loops can only be solved approximately with heuristic methods, and a satisfactory solution is then chosen from the approximate solutions obtained. Compared with traditional heuristic methods, genetic algorithms (GAs) have a very strong search ability; they can find the global optimal solution of a problem with a high probability, and their inherent parallelism makes them well suited to optimization problems [16–23]. Therefore, a GA was adopted in this study to solve the loop selection problem.

The main contributions of this study are as follows: (1) a formal specification and mathematical model of the loop selection problem of multilevel nested loops is presented; (2) a new idea for solving the loop selection problem of multilevel nested loops is provided; and (3) the initial population generation algorithm, chromosome repair algorithm, and the calculation method for a series of genetic parameters are designed for GAs to solve the multilevel nested loop selection problem. The experimental results show that compared with the current mainstream methods for the loop selection problem of multilevel nested loops (e.g., static compilation combined with a dynamic operation method and machine learning methods), the proposed method is more effective, which indicates that the parallel compilation of programs is still a key factor that restricts the capabilities of hardware multicore processors.

2. Overview of Solutions to the Loop Selection Problem of Multilevel Nested Loops

In this section, we outline the proposed solution to the loop selection problem of multilevel nested loops. First, we present the formal specification of the loop selection problem; second, we further abstract the mathematical model of the problem; last, we design the solving algorithm for the problem based on the mathematical model of the problem, as shown in Figure 1.

We first use a direction matrix to represent the complex dependencies of multilevel nested loops, and then, we symbolize the feasible solutions, constraints, and optimal solution of the problem with sets. Next, we use the 0-1 matrix to represent the direction matrix and the selection vector to represent the set of feasible solutions, and we give a mathematical representation of the constraints of the feasible solutions to further abstract the mathematical model of the problem.

Based on the mathematical model of the problem, we found that the solution space of the problem is determined and all possible solutions can be represented through coding. Moreover, we found that the fitness function of the possible solutions can be determined, thus meeting the preconditions of using GAs to solve a problem. Therefore, we opt to use GAs to solve the problem. In the following sections, we will present the formal specification and mathematical model of the problem and the specific steps of the GA when solving the problem, in detail.

3. Formal Specification and Mathematical Model of the Loop Selection Problem of Multilevel Nested Loops

3.1. Example Problem

Before giving the formal specification of the loop selection problem of multilevel nested loops, we first illustrate the difficulty of the problem with the following example, which is derived by simplifying a nested loop segment from a large scientific computing program, as shown in Figure 2. Figure 3 is the direction matrix based on the dependencies in the nested loop segment, in which each row is the direction vector of one dependency between statements in the nested loop segment and each column gives the directions of one nested loop layer across all direction vectors. According to the theorem proposed by Allen and Kennedy, it is safe to exchange loops in the nested loop segment [14], but all loop layers carry dependencies (because “<” is present in each column of the direction matrix); as a result, this loop segment has no loop layer that can be directly parallelized.

In Figure 3, the set composed of loop layers in the nested loop segment is , and the direction vector set is . At the position of the row of the column of the direction matrix, “<” indicates that the nested layer of carries dependencies at the direction vector of , and the serial execution of can satisfy all of the dependencies carried in ; i.e., covers . Table 2 shows the set of loop layers that cover all direction vectors. By randomly selecting a loop layer, respectively, from the set of , a loop layer set that covers all direction vectors (, and so on) is obtained.

In Set , the safe parallel execution of the loop layer can be achieved by serial execution of loop layers , , and . In Set , by serially executing and while loop-exchanging and , the safe parallel execution of the loop layer can be achieved. In Set , by changing and to the outermost layers for serial execution through the loop exchange technique, the safe parallel execution of the and loop layers can be achieved and so on. Finally, we found that, in this example, the maximum safe parallel execution granularity can be obtained through Set , as shown in Figure 4. Therefore, the loop selection problem of multilevel nested loops is essentially the search for the coverage of all direction vectors for maximum granularity of parallelism through the smallest nested loop layer set.

This example shows that, for a small number of nested loops (e.g., three or fewer layers), the problem can be directly solved through enumeration. As the number of nested loops increases, a combinatorial explosion occurs, which makes the problem difficult to solve. Wang et al. have informally argued that selecting the least number of loop layers in multilevel nested loops to cover all direction vectors, and thus to reveal maximum parallelism, is an NPC problem [15]. On the premise of legitimate loop exchanges and based on a direction matrix, we propose a method of processing multilevel () nested loops. In the next section, we will present the formal specification of the loop selection problem of multilevel nested loops and then the mathematical model of the problem to provide a theoretical basis for the design of an algorithm for solving the problem.
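The enumeration mentioned above can be illustrated with a minimal sketch. The 0-1 matrix `EXAMPLE` is hypothetical (it is not the matrix of Figure 3), and the function name `smallest_cover` is an assumption introduced only for illustration; an entry of 1 in row i and column j is taken to mean that loop layer j covers direction vector i.

```python
from itertools import combinations


def smallest_cover(matrix):
    """Return one smallest set of loop layers (column indices) covering every row.

    matrix[i][j] == 1 is assumed to mean that loop layer j covers direction vector i.
    """
    n_layers = len(matrix[0])
    for size in range(1, n_layers + 1):  # try the smallest candidate sets first
        for layers in combinations(range(n_layers), size):
            if all(any(row[j] for j in layers) for row in matrix):
                return set(layers)
    return None  # reached only if some direction vector is covered by no layer


# Hypothetical matrix: 4 direction vectors (rows) x 4 loop layers (columns).
# Every column contains a 1, so no layer is free of dependencies.
EXAMPLE = [
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 1, 0],
]
best = smallest_cover(EXAMPLE)
```

For n loop layers, this search may visit up to 2^n - 1 candidate subsets, which is why it is only practical for shallow nests.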

3.2. Formal Specification of the Loop Selection Problem of Multilevel Nested Loops

As shown in Figure 5, without losing generality, we assume that loop exchanges of nested loops are legitimate in nested loops with layers that contain dependencies (, , , ), with a direction matrix, as shown in Figure 6. The goal is to seek the set that has the least number of loop layers that cover all direction vectors.

The following two aspects should be noted here. First, loop exchange must be legal for the multilevel nested loops to be solved. According to Allen and Kennedy, loop selection must be based on legal loop exchange [14]; if the legality of the loop exchange cannot be guaranteed, then the loop selection technique cannot be applied. Therefore, in this study, it is required that the loop exchange of multilevel nested loops be legal. Second, in this study, we focus on loop layers that cannot be directly executed in parallel, i.e., each loop layer of the -layer nested loops carries at least one dependency, and therefore, it is required that .

Next, we proceed to the formal specification on the loop selection problem for the -layer nested loops shown in Figure 5. As shown in Figure 6, for the set that is composed of all nested loop layers () and the set that is composed of all direction vectors (), assuming that there are subsets () of Set that cover all direction vectors and the sets that are composed of the loops that cover the direction vectors of are , respectively, and then the subsets must satisfy the following:

Furthermore, the subset that satisfies the condition below is the desired set:

3.3. Mathematical Model of the Loop Selection Problem of Multilevel Nested Loops

In the direction matrix of the -layer nested loops shown in Figure 6, the position in the direction matrix with “” is represented by 1 and that with “” by 0, which gives rise to , and the 0-1 matrix that corresponds to the direction matrix, as shown in Figure 7. In the 0-1 matrix, means that the loop layer of covers the direction vector of . To facilitate further specification, it is necessary to introduce the selection vector.

Definition 1. (selection vector). Let be a set that is composed of nested loop layers (i.e., ), and let be the corresponding 0-1 selection numbers of the nested loop layers of . If , then ; if , then . The vector , which is composed of , is the selection vector that corresponds to .

According to the definition of the selection vector, in the example described in Section 3.1, the selection vectors of , , and are , , and , respectively. According to the mutual exclusiveness of sets, the set that is composed of the different loop layers has its unique corresponding selection vector, and thus, the above three selection vectors are not identical. At the same time, the size of the set can be calculated through the selection vectors; for example, , , and .

Next, groups of sets () that cover all direction vectors () are represented by the selection vector. The selection vector of is represented by , with a size of ; the selection vector of is represented by , with a size of ; the selection vector of is represented by , with a size of .

According to the definitions of the selection vector () and the 0-1 matrix (), we have

From the constraint formula (1), we have

According to formula (2), the selection vector that satisfies the equation below is the selection vector that corresponds to the desired set:

3.4. Formal Specification and Mathematical Model

In this section, we show the graphical comparison of the formal specification and mathematical model of the loop selection problem of the -layer nested loops, as shown in Figure 8.
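Read together, the descriptions in Sections 3.2 and 3.3 have the shape of a minimum set cover written as a 0-1 program. The following is only a sketch in assumed notation, not the original formulas: $A = (a_{ij})$ denotes the $m \times n$ 0-1 matrix with $a_{ij} = 1$ when loop layer $j$ covers direction vector $i$, and $x = (x_1, \dots, x_n)$ denotes the selection vector; the symbols $A$, $a_{ij}$, $x_j$, $m$, and $n$ are assumptions made for this illustration.

```latex
\begin{aligned}
\min_{x}\quad & \sum_{j=1}^{n} x_j \\
\text{s.t.}\quad & \sum_{j=1}^{n} a_{ij}\, x_j \ge 1, \qquad i = 1, \dots, m, \\
& x_j \in \{0, 1\}, \qquad j = 1, \dots, n.
\end{aligned}
```

In the vocabulary of Section 3.1, each constraint states that a direction vector is covered by at least one selected loop layer, and the objective counts the selected loop layers.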

4. Solving the Loop Selection Problem of Multilevel Nested Loops Using GAs

In this section, we will present the steps of solving the loop selection problem of multilevel nested loops using a GA. First, we will give a brief introduction to GAs; second, we will show the overall framework of the solution with GAs; last, we will describe the chromosome representation method and fitness function calculation method in the GA operation process, the initial population generation algorithm and chromosome improvement methods, the specific implementation method of genetic operators (crossover, mutation, and selection), and the offspring population generation method and the stopping criterion of the GA.

4.1. Basic Framework of the GA

Conceived based on the principle of biological evolution, the GA is a bionic algorithm for searching for the optimal solution. It simulates the natural process of gene recombination and evolution and encodes the parameters of the problem to be solved into binary code or decimal code (or other numerical code), i.e., genes; multiple genes form a chromosome (individual). Paired crossover and mutation operations similar to natural selection are then performed on many chromosomes and iterated (i.e., generational inheritance) until the final optimization result is obtained [24]. The basic framework of the GA is as follows:

begin /* GA */
  t ← 0; /* number of evolutionary generations */
  Generate initial population P(t);
  Calculate the fitness value of each individual in the initial population P(t);
  while (the stopping criterion is not met) do
    /* Use the following operations to generate new individuals, and select better individuals to form a new population */
    (1) Recombine individuals in population P(t) through reproduction, crossover, or mutation to generate candidate population C(t); /* Note that C(t) does not include individuals of P(t) */
    (2) Calculate the fitness value of each individual in C(t);
    (3) Choose better individuals from C(t) and P(t) to form a new population P(t+1) according to the fitness values;
    (4) t ← t+1
  end while
  Choose the best individual in P(t) as the solution;
end
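The framework can be sketched as a small generic loop. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function name `run_ga` and its parameters are hypothetical, lower fitness is treated as better, and the stopping criterion is reduced to a fixed generation cap. The toy problem at the end (clearing bits in an 8-bit string) only exercises the loop.

```python
import random


def run_ga(init_population, fitness, make_offspring, max_generations=50):
    """Generic loop P(t) -> C(t) -> P(t+1); lower fitness is assumed to be better."""
    population = init_population()
    size = len(population)
    for _ in range(max_generations):  # stopping criterion: generation cap only
        candidates = make_offspring(population)  # C(t): newly generated individuals
        pool = population + candidates  # choose from both C(t) and P(t)
        population = sorted(pool, key=fitness)[:size]  # P(t+1)
    return min(population, key=fitness)


# Toy usage: individuals are 8-bit lists and the fitness is the number of 1 bits.
rng = random.Random(0)


def toy_init():
    return [[rng.randint(0, 1) for _ in range(8)] for _ in range(6)]


def toy_offspring(population):
    children = []
    for parent in population:
        child = parent[:]
        child[rng.randrange(len(child))] ^= 1  # flip one randomly chosen bit
        children.append(child)
    return children


best = run_ga(toy_init, sum, toy_offspring)
```

Because each new population is drawn from parents and candidates together, the best fitness in the population never gets worse from one generation to the next.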

4.2. Basic Framework for Solving the Loop Selection Problem of Multilevel Nested Loops Using a GA

As shown in Figure 9, the basic framework for solving the loop selection problem of multilevel nested loops using GAs consists of the following steps:
(1) Determine the chromosome representation scheme and the calculation method for the fitness function.
(2) Generate the initial population. Calculate the size of the population () and input the loop layer set () and direction vector set () into Algorithm 1 to generate the initial population.
(3) Crossover operation. Select parents using the tournament selection method; generate offspring chromosomes using the single-point crossover technique.
(4) Mutation operation. Select individuals from offspring chromosomes to perform a mutation operation to generate new offspring chromosomes.
(5) Chromosome repair. Perform feasibility judgment on offspring chromosomes generated through crossover and mutation; repair infeasible chromosomes using Algorithm 2. Enter the repaired offspring chromosomes along with the parent chromosomes to the candidate set of the offspring population.
(6) Fitness evaluation. Form the candidate set of the offspring population using the repaired offspring chromosomes and parent chromosomes; calculate the fitness of all of the chromosomes in the candidate set; select chromosomes with the best fitness to form the offspring population.
(7) Check the stopping criterion. If it is not met, then return to step (3); otherwise, stop the algorithm and output the chromosome individual with the best fitness in the current population.
(8) Generate the loop layer set that corresponds to the chromosome with the best fitness.

Algorithm 1: Generation of the initial population.
Input: all loop layer set , all direction vector set
Output: chromosome set of the initial population
function INITIALISEPOPULATION()
  Calculate the population size
  for do
    Calculate the set of loop layers () that cover the direction vector of
  end for
  for do
    for do
      Randomly choose a loop layer () from and add it to the set
    end for
    while do /* delete the redundant loop layer */
      () randomly remove the loop layer from the set
      for do
        Calculate the number of loop layers () in the set that covers the direction vector of
        if then
          Keep the loop layer, which is not redundant
          break;
        end if
      end for
    end while
    Generate Chromosome that corresponds to and add it to the set
  end for
end function
Algorithm 2: Chromosome repair.
Input: , chromosome to be repaired (), the set of all direction vectors ()
Output: repaired chromosome ()
function REPAIRCHROMOSOME()
  Calculate the set of loop layers () contained in Chromosome
  for each do /* calculate the set of all direction vectors () covered by Chromosome */
    Calculate the set of direction vectors () covered by
    Add these direction vectors to the set of all covered direction vectors
  end for
  Calculate the set of direction vectors () not covered by Chromosome
  if then /* chromosome repair */
    for each do
      Calculate the set of loop layers () that cover
      Randomly select a loop layer () from and add it to the set of
    end for
  end if
  while do /* delete redundant genes */
    randomly delete a loop layer () from the set of
    for each do
      Calculate the number of loop layers () in the set of that cover the direction vector of
      if then
        Keep the loop layer, which is not redundant
        break;
      end if
    end for
  end while
  Generate Chromosome that corresponds to
end function

Next, we give a detailed description of the specific implementation method of each step.

4.2.1. Chromosome Representation Scheme and Calculation Method of the Fitness Function

To solve the loop selection problem of multilevel nested loops using GAs, we must first determine the representation scheme of the chromosomes. According to the mathematical model of the problem detailed in Section 3.3, the selection vector () that corresponds to a feasible solution (, the set of loop layers that covers all of the direction vectors) is a 0-1 vector, and thus, encoding feasible solutions as -bit binary numbers is a suitable chromosome representation scheme.

Analogous to the method of representing feasible solutions with the selection vector, -bit binary coding () is used to represent the chromosomes of the feasible solution , in which is the number of nested loop layers, and the value of 1 at the bit () represents that the loop layer is in this set of solutions (i.e., ), and the value of 0 at the bit of represents that the loop layer is not in this set of solutions (i.e., ). Figure 10 shows the chromosome that corresponds to the feasible solution.

Analogous to the method of using the selection vector to calculate the size of the corresponding feasible solution set of , Chromosome is represented with a binary (0-1) code (). Accordingly, the fitness function of Chromosome can be calculated using the following formula:

Table 3 shows the representation of the chromosome that corresponds to the feasible solutions and the calculation of the fitness values of the example program shown in Figure 2.
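The representation and the fitness calculation can be sketched as follows. This is an illustration under assumptions rather than the formula of this study: the fitness is taken to be the number of 1 bits in the chromosome (the size of the corresponding loop layer set, smaller being better), loop layers are indexed from 0, and the helper names `encode`, `decode`, `fitness`, and `is_feasible` are hypothetical.

```python
def encode(layer_set, n_layers):
    """Return the n-bit chromosome (list of 0/1) for a set of 0-based layer indices."""
    return [1 if j in layer_set else 0 for j in range(n_layers)]


def decode(chromosome):
    """Return the set of loop layer indices whose bit is 1."""
    return {j for j, bit in enumerate(chromosome) if bit == 1}


def fitness(chromosome):
    """Number of selected loop layers; a smaller value is assumed to be better."""
    return sum(chromosome)


def is_feasible(chromosome, matrix):
    """True if every direction vector (row) is covered by a selected loop layer."""
    return all(any(a and x for a, x in zip(row, chromosome)) for row in matrix)


chromosome = encode({0, 3}, 4)  # layers 0 and 3 selected out of 4
```

Here `encode({0, 3}, 4)` yields the bit list `[1, 0, 0, 1]`, whose fitness is 2.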

4.2.2. Generation of the Initial Population

After the chromosome representation scheme is determined, the generation of the initial population is performed. First, the size of the population is decided. If the size of the initial population is too large, the execution efficiency is low although the optimal solution can be obtained with relatively fast convergence (within a small number of generations); if the size of the initial population is too small, then the convergence will be too slow. Therefore, the population size has a large impact on the quality of the solution obtained using GA, and choosing an appropriate population size can improve the algorithm’s efficiency.

In GAs, the product of the chromosome length and the population density () is usually used as an approximate value of the population size, of which the population density refers to the ratio of the number of dependencies in the direction matrix to the size of the entire matrix [25]. In principle, it is required that chromosomes in an initial population must contain all of the genes (i.e., all of the nested layers). Assuming that the population size is and the average number of genes contained in each chromosome in the population is , then the total number of genes in the population is . To ensure that (in which is the set of nested layers that correspond to any chromosome in the population and is the set of all nested loop layers), we must have (, and is the total number of loops), and thus, we have . Through repeated experiments, it was found that the determinant value of the direction matrix of the multilevel nested loops is low, and the dependence density varies greatly with different numbers of loops. Therefore, in actual operation, the size of is manually set according to the characteristics of the multilevel nested loops, to achieve a good outcome; for example, when there are four loop layers, ; when there are 15 loop layers, .
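The rule of thumb quoted at the start of this paragraph can be written out as a short calculation. The sketch assumes that the "number of dependencies" is the number of 1 entries in the 0-1 matrix; the function names are hypothetical, and the result is only a starting estimate, since the size is set manually in practice as noted above.

```python
def dependence_density(matrix):
    """Ratio of 1 entries to the total number of entries in the 0-1 matrix."""
    ones = sum(sum(row) for row in matrix)
    return ones / (len(matrix) * len(matrix[0]))


def approximate_population_size(matrix):
    """Chromosome length (number of loop layers) times the dependence density."""
    n_layers = len(matrix[0])
    return n_layers * dependence_density(matrix)


# Hypothetical 4 x 4 matrix with 8 ones out of 16 entries.
MATRIX = [
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 1, 0],
]
estimate = approximate_population_size(MATRIX)
```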

After the size of the initial population is determined, we must also consider the feasibility of generating new chromosomes in the initial population (i.e., whether the set of loop layers represented by the chromosomes covers all direction vectors). The feasibility of the chromosomes in the initial population has an important impact on the quality of the solution and the convergence rate of the GA. Therefore, we require that the gene set contained in each chromosome in the initial population must cover all of the direction vectors.

By combining the initial population size and the feasibility of the chromosomes, we propose the generation algorithm for the initial population, as shown in Algorithm 1. The rationale of the algorithm is as follows. First, a loop layer is randomly chosen from the loop layer set ( ()) of each direction vector ( ()) and added to the set in such a way that the set covers all of the direction vectors. Second, the redundant loop layers in the set are deleted to form the set and, ultimately, the that corresponds to Chromosome . The above steps are iterated times, which give rise to the initial population.

The deletion of the redundant loop layers of the set is performed as follows. From the set, a loop layer () is randomly deleted, and then, the number of loop layers ( ()) in the set that covers the direction vector ( ()) is examined. If , then the loop layer of is not a redundant layer; then, it is included in the chromosome generation set (); otherwise, the loop layer of is redundant and is removed.
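The rationale of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch and not a line-by-line transcription: redundant layers are pruned by visiting the chosen layers once in random order and dropping each layer whose removal keeps all direction vectors covered, which serves the same purpose as the deletion step described above. The names `covering_layers`, `covers_all`, `prune_redundant`, and `initialise_population`, the `seed` parameter, and the example matrix are assumptions; every row of the matrix is assumed to contain at least one 1.

```python
import random


def covering_layers(matrix, i):
    """Loop layers (column indices) that cover direction vector i."""
    return [j for j, a in enumerate(matrix[i]) if a == 1]


def covers_all(layers, matrix):
    """True if every direction vector is covered by at least one of the layers."""
    return all(any(row[j] for j in layers) for row in matrix)


def prune_redundant(layers, matrix, rng):
    """Drop, in random order, each layer whose removal keeps the cover intact."""
    kept = set(layers)
    for layer in rng.sample(sorted(kept), len(kept)):
        if covers_all(kept - {layer}, matrix):
            kept.discard(layer)
    return kept


def initialise_population(matrix, pop_size, seed=None):
    """Build pop_size feasible chromosomes without redundant loop layers."""
    rng = random.Random(seed)
    n_layers = len(matrix[0])
    population = []
    for _ in range(pop_size):
        # One randomly chosen covering layer per direction vector gives a cover.
        chosen = {rng.choice(covering_layers(matrix, i)) for i in range(len(matrix))}
        chosen = prune_redundant(chosen, matrix, rng)
        population.append([1 if j in chosen else 0 for j in range(n_layers)])
    return population


# Hypothetical 0-1 matrix: rows are direction vectors, columns are loop layers.
MATRIX = [
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 1, 0],
]
population = initialise_population(MATRIX, pop_size=5, seed=1)
```

Different chromosomes may coincide when the matrix is small; the sketch does not remove duplicates.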

4.2.3. Crossover and Mutation

After the initial population is determined, we discuss the genetic operations, including parent selection, crossover, and mutation.
(1) Parent selection. Parent selection gives each individual in the population the opportunity to produce offspring, and it can be implemented in many ways, including proportional selection, tournament selection, and roulette selection [26]. In this study, we use the tournament selection method to select the parent individuals in the population. Specifically, two mating pools are generated, each of which consists of () individuals who are randomly selected from the population, and the individuals with the best fitness in their respective mating pools are chosen to mate. In this study, we evenly divide the chromosomes in the population into two groups ( or ), and then, we choose the chromosome with the best fitness in each group to mate until one group is empty or both groups are empty. We take this approach because selection is relatively easy and efficient to implement: it does not require calculating the probability of each individual in the population being selected, as the proportional selection method does, which introduces additional computational overhead that affects the overall execution efficiency of the algorithm. In the actual experiments, we also found that the tournament selection method is more efficient than the proportional selection method.
(2) Crossover operation. Crossover is the main method for introducing new individuals, and single-point crossover or multipoint crossover is often used in GAs. In both crossover operations, the crossover point is first set, and the chromosome segments of two parents are exchanged to produce two new offspring. In this study, single-point crossover with a fixed crossover point is adopted. Specifically, let and be two parent chromosomes, and let the set crossover point be (); then, the two offspring chromosomes are generated by exchanging the segments of the two parents at this crossover point. After setting up the single fixed-point crossover operation, we must also set the crossover rate to control the number of chromosomes in the mating pools that participate in crossovers. In this study, the chromosomes in the two pools are all subjected to the crossover operation. The reason is that the numbers of rows and columns of the direction matrix of the multilevel nested loops at issue in this study are not too high, and the size of the population is, thus, not large; to ensure the efficient convergence of the GA, the structure of the offspring chromosomes should not be vastly different from that of the parent chromosomes, and thus, the single-point fixed crossover method is used. Moreover, the single-point fixed crossover operation can ensure that new individuals are continuously produced in the mating pool, while preventing the number of new individuals from becoming so large that the genetic order is broken. Therefore, all individuals in the parent selection pools are subjected to the crossover operation.
(3) Mutation operation. Mutation, as another way to generate new individuals, is an operation on offspring chromosomes that are produced by crossover operations; it is achieved by flipping a certain bit of a certain offspring chromosome (i.e., flipping 0 to 1 or flipping 1 to 0) at a low probability to produce new individuals in the population. It is usually regarded as a process in which genes on chromosomes mutate to generate genes that do not exist in the initial population or to reintroduce genetic information lost due to crossover operations. The real purpose of mutation is to provide a defense mechanism that prevents the population from converging prematurely and failing to find a better solution [27].

The mutation rate is the ratio of the number of mutated genes to the total number of genes in a population. When applying a GA to solve various problems, the mutation rate of the selected operation is an important factor that affects the efficiency of the algorithm. A reasonable mutation rate allows the offspring to obtain new genes without losing the good traits inherited from both parents. According to an early study by Back, the mutation rate of a population cannot be lower than ( is the chromosome length); otherwise, it will affect the efficiency of the GA in finding the optimal solution [28]. Thus, in this study, we randomly choose chromosomes from the offspring chromosomes produced through crossover to perform the mutation operation at the “1” bit to generate new chromosomes and then determine the feasibility of the newly generated offspring chromosomes and repair infeasible chromosomes (the repair algorithm will be detailed in the next section), which are then added to the next generation of the candidate population. Figure 11 illustrates the mutation operation of two chromosomes.
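The three operators described in this section can be sketched as follows. This is one reading of the text, stated as assumptions: parents are paired by splitting a shuffled population into two groups and matching the groups best with best; crossover exchanges the tails of the two parents after a fixed point; and mutation flips one randomly chosen 1 bit to 0. The function names, the `rng` parameter, and the convention that lower fitness is better are hypothetical choices for this illustration.

```python
import random


def pair_parents(population, fitness, rng):
    """Split the shuffled population into two groups and pair best with best."""
    shuffled = population[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    group_a = sorted(shuffled[:half], key=fitness)
    group_b = sorted(shuffled[half:], key=fitness)
    return list(zip(group_a, group_b))  # zip stops when the smaller group is empty


def single_point_crossover(parent_a, parent_b, point):
    """Exchange the gene segments after the fixed crossover point."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def mutate_one_bit(chromosome, rng):
    """Return a copy in which one randomly chosen 1 bit is flipped to 0."""
    ones = [j for j, bit in enumerate(chromosome) if bit == 1]
    mutant = chromosome[:]
    if ones:
        mutant[rng.choice(ones)] = 0
    return mutant


rng = random.Random(0)
parents = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 1], [0, 1, 1, 1]]
pairs = pair_parents(parents, sum, rng)
children = [c for a, b in pairs for c in single_point_crossover(a, b, point=2)]
```

Offspring produced in this way are not guaranteed to cover all direction vectors, which is why a feasibility check and repair step follows.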

4.2.4. Chromosome Improvement

Although Algorithm 1 ensures the feasibility of the chromosomes in the initial population, crossover and mutation will inevitably produce infeasible chromosomes. There can be two issues with the newly generated chromosomes. First, a new chromosome may cover all direction vectors but contain redundant genes, which must be deleted. Second, a new chromosome may fail to cover all direction vectors. There are two ways to address these issues: abandon and repair. If such chromosomes are abandoned, the optimal solution could also be lost, thereby slowing down convergence. For this reason, we adopt the repair method to address the above-described chromosome issues. The chromosome repair algorithm is detailed in Algorithm 2.

Using Algorithm 2, we can repair infeasible chromosomes that are generated through crossover and mutation, to guarantee the feasibility of chromosomes in the population while removing redundant genes, to further enhance the efficiency of the algorithm.
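To make the two repair goals concrete (restore coverage, then remove redundant genes), the following sketch uses a generic greedy set-cover repair. It is not Algorithm 2 of this study, whose details are not reproduced here. It assumes a hypothetical 0/1 matrix `cover`, with one row per direction vector and one column per loop level, where `cover[j][i] == 1` means that selecting loop level `i` covers direction vector `j`; the names `covers_all` and `repair` are likewise illustrative.

```python
def covers_all(chromosome, cover):
    """Return True if every direction vector is covered by a selected level."""
    return all(
        any(bit and row[i] for i, bit in enumerate(chromosome))
        for row in cover
    )


def repair(chromosome, cover):
    """Make a chromosome feasible, then strip genes that are not needed."""
    genes = list(chromosome)
    n = len(genes)

    # Phase 1: greedily add loop levels until every direction vector is covered.
    while True:
        uncovered = [
            row for row in cover
            if not any(genes[i] and row[i] for i in range(n))
        ]
        if not uncovered:
            break
        best = max(
            (i for i in range(n) if not genes[i]),
            key=lambda i: sum(row[i] for row in uncovered),
            default=None,
        )
        if best is None or sum(row[best] for row in uncovered) == 0:
            raise ValueError("no selection of levels covers every direction vector")
        genes[best] = 1

    # Phase 2: drop each selected level whose removal keeps full coverage.
    for i in range(n):
        if genes[i]:
            genes[i] = 0
            if not covers_all(genes, cover):
                genes[i] = 1
    return genes
```

With `cover = [[1, 0, 0], [0, 1, 0], [1, 1, 0]]`, the infeasible chromosome `[0, 0, 1]` is first extended with levels 0 and 1, after which the now redundant level 2 is removed. A greedy choice keeps the sketch short; it does not guarantee a minimum cover, which is left to the evolutionary search.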

4.2.5. Offspring Population Generation and Stopping Criterion

(1) Method for generating new offspring population. Two methods are commonly used in producing offspring populations: the steady-state replacement method and the whole generation replacement method. In the steady-state replacement method, once a new generation of chromosomes is produced, the newly produced chromosomes and the chromosomes in the parent population are combined to form the candidate population, from which chromosomes with higher fitness are selected to form the new offspring population. In the whole generation replacement method, the newly generated offspring chromosomes directly replace the original parent chromosomes [29]. In this study, we adopt the steady-state replacement method because it not only uses the newly generated offspring chromosomes for various operations (e.g., selection and crossover) but also ensures that the chromosome with the highest fitness is always retained in the population. Hence, the steady-state replacement method converges faster and finds the optimal solution more efficiently than the whole generation replacement method, an advantage that was also confirmed in our experiment.

(2) Stopping criterion. In practice, a GA must evolve over multiple generations until the fitness of the chromosomes in the population stabilizes, the optimal solution is found, or the preset maximum number of generations is reached; at any of these points, the evolutionary process ends. Because GAs do not use information such as the gradient of the objective function, it is impossible to determine the location of an individual in the solution space during evolution or to establish a theoretical convergence criterion [30]. A commonly used method is to stop the evolution when a predetermined number of generations is reached. Given that the numbers of rows and columns of the direction matrix of the multilevel nested loops are rather low, and to reduce the execution time of the actual operation, the maximum number of generations was set to 100 after repeated experiments on the search for the optimal solution for nested loops of 20 or fewer layers. In the actual operation process, parameters are set to detect whether the chromosomes have changed between two consecutive generations of the population; the convergence of the algorithm is assessed on this basis, and the iterative solution process ends early once convergence is reached.
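The steady-state replacement and the two stopping conditions can be sketched as follows. This is an illustrative outline under stated assumptions, not the code integrated into the compiler: `steady_state_step` and `evolve` are hypothetical names, `make_offspring` stands in for selection, crossover, mutation, and repair, and `fitness` is any function for which larger values are better. Only the limit of 100 generations and the comparison of two consecutive generations are taken from the text.

```python
def steady_state_step(parents, offspring, fitness):
    """Merge parents and offspring and keep the fittest len(parents) chromosomes."""
    candidates = sorted(parents + offspring, key=fitness, reverse=True)
    return candidates[:len(parents)]


def evolve(population, make_offspring, fitness, max_generations=100):
    """Run the GA until the population stops changing or the limit is reached.

    Returns the final population and the number of generations executed.
    """
    for generation in range(1, max_generations + 1):
        offspring = make_offspring(population)
        next_population = steady_state_step(population, offspring, fitness)
        # Early stop: no chromosome changed between two consecutive generations.
        if sorted(next_population) == sorted(population):
            return next_population, generation
        population = next_population
    return population, max_generations
```

Because the candidate pool always contains the parents, the fittest chromosome found so far can never be lost, which is the elitism property that motivates the choice of steady-state replacement. A fitness such as `lambda c: -sum(c)` (fewer selected loop levels is better) is one possible choice for experimentation and is an assumption of this sketch.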

5. Experimental Evaluation

The GA proposed in this study to solve the loop selection problem of multilevel nested loops was integrated in the test version of the basic compiler of Sunway TaihuLight supercomputer, and it passed the correctness test of the multithreaded automatic parallel optimization process.

5.1. Test Platform and Content
5.1.1. Test Platform

The test platform was the Sunway TaihuLight supercomputer with the SW26010 heterogeneous many-core processor (with a total of 40,960 processors), with a peak system performance of 125.436 PFlops, under the operating system of Raise Linux, which was integrated with the basic compiler for the proposed method and equipped with a complete compiler tool chain and performance analysis tools. The tests of this study were performed on one computing node of the Sunway TaihuLight supercomputer, with the following parameters:

CPU: 4 × (1 master core + 64 slave cores)
Main frequency: 1.25 GHz
Memory: 8 GB
L1 Cache: 32 KB private
L2 Cache: 256 KB private
Vector unit: 256 bits

5.1.2. Test Content

We chose ten scientific computing programs of the SPEC CPU2006 benchmark test set and the NPB3.3.1 parallel computation test set as the test benchmark. Tables 4 and 5 show the brief description of the SPEC CPU2006 and NPB3.3.1 component suites. The input scales of the SPEC CPU2006 benchmark program test set are categorized, from small to large, into three scales, i.e., test, train, and reference. The reference scale, which is the largest, was adopted in this study. The input scales of the NPB3.3.1 parallel computing test set are categorized, from small to large, into six levels, i.e., S, W, A, B, C, and D. The B level was adopted in this study. Because the proposed method was integrated in the basic compiler of the Sunway TaihuLight supercomputer, we chose the single-node four-thread automatic parallelization performance improvement test.

To test the effectiveness of the proposed method, we conducted two sets of experiments on the Sunway TaihuLight supercomputer. The first set compares the performance of the basic compiler using the proposed method with that of the basic compiler using loop selection methods that rely on the underlying hardware techniques (the thread-level speculation technique and the TM technique). The second set compares the performance of the basic compiler using the proposed method with that of the basic compiler using the loop selection methods most commonly adopted in previous studies (i.e., those combining the compile time and runtime and those based on machine learning).

5.2. Comparison between the Proposed Method and the Underlying Hardware Techniques

On the Sunway TaihuLight platform, we conducted automatic parallelization performance tests of (1) the basic compiler (as a benchmark for the performance test), (2) the basic compiler in combination with the loop selection method based on the TM technique, (3) the basic compiler in combination with the loop selection method based on the thread-level speculation technique, and (4) the basic compiler using the proposed method, respectively. The test results are shown in Figures 12 and 13.

Figure 12 shows the automatic parallelization performance test results of the ten scientific computing programs chosen from the SPEC CPU2006 test set. All of the programs contain many multilevel nested loops. Examination of the printed intermediate representation files generated by the automatic parallelization process during compilation indicated that the conventional compiler technique cannot accurately analyze the complex dependencies in the nested loops and therefore shows poor parallel acceleration performance. The dependency analysis ability of the TM and thread-level speculation (TLS) techniques relies on the traditional compiler technique, and as a result, their parallel acceleration performance is basically identical to that of the traditional compiler technique. Because the proposed method can provide the parallelization information of multiple loop layers after loop exchange, it enables multigranularity parallelization in the compilation process and achieves a remarkable speedup improvement. For example, for 401.bzip2 and 433.milc, the compiler using the proposed method not only successfully parallelized the outer loops of the multilevel nested loops, which could not be parallelized before, but also provided the automatic vectorization information of the inner loops. The acceleration of several other scientific computing programs was less pronounced because the multilevel nested loops in their source code are not a hotspot and account for very little of the execution time of the entire program.

Figure 13 shows the automatic parallelization test results on the NPB3.3.1 test set. The proposed method showed significant acceleration improvement on five programs, i.e., BT, FT, LU, MG, and SP. The reason is that these programs contain many multilevel nested loops, some of which cannot be parallelized through the basic compiler but can be successfully parallelized by the proposed method, thus achieving a notable acceleration effect. In comparison, the acceleration effect of the automatic parallelization of five other programs, i.e., CG, DC, EP, IS, and LU-HP, was insignificant. The reason is that their loop structure is either irregular or based on operations with a complex pointer structure or contains an indirect array access mode, which is beyond the scope of this study.

5.3. Comparison of the Proposed Method with Existing Techniques

We selected two widely investigated loop selection methods, Method A and Method B, for comparison. Method A uses a loop selection method that combines the compile time and the runtime [31], and Method B uses a loop selection method based on machine learning [32]. The automatic parallelization performance of the basic compiler was again used as the benchmark of the test. Then, the ten scientific computing programs selected from the SPEC CPU2006 test set and the NPB3.3.1 parallel computing test set were tested, with the results shown in Figures 14 and 15, respectively.

Figure 14 shows the automatic parallelization performance test results of the proposed method and Methods A and B. Method A greedily selected multiple different nested layers of the same loop as candidate speculation objects through the compiler and determined the nested loop layer with the best parallel performance based on the execution delay and the number of cancellations, as dynamically monitored at runtime. Through the target program hotspot analysis tool and the automatic parallelization optimization results printed during compilation, we found that Method A has only a limited capability in speculating on nested loop layers and executing the model in parallel; as a result, the compiler gave no parallelization prompt for some nested loop layers that could have been executed in parallel. Moreover, the scope of the runtime performance analysis cost model of Method A is loosely defined, and as a result, some nested loop layers whose parallel execution would have yielded a limited benefit were not executed in parallel at runtime. Therefore, the number of parallelized multilevel nested loops obtained using Method A was lower than that obtained using the proposed method.

Method B uses the K-nearest neighbor (KNN) algorithm in machine learning to focus on analyzing the thread granularity, data dependency distance, thread creation, and the performance impact caused by excitation to determine the optimal loop nesting layer for parallel execution. We constructed a machine learning-based loop selection model using Method B, based on which compilation prompts were given in the compilation process. However, through experiments, we found that the nested loop layer selected by Method B was not necessarily the loop layer with the optimal benefit and thus did not achieve an ideal speedup improvement effect. We also found that the multilevel nested loop parallelization benefit is associated not only with characteristics such as the loop dependency distance and thread granularity of Method B but also with the software environment and hardware parameters of the actual machine; for example, the thread synchronization method, the division of task granularity, and the memory size of the operating system of the software environment affect the actual execution performance of the loop.

Based on the direction matrix constructed through dependency relationships, the proposed method successfully exposed more parallelism while ensuring the safe parallel execution of multilevel loops in multilevel nested loops. In the experiment, we found that the compiler that used the proposed method prompted not only the coarse-grained thread-level parallelization at the outer loop layers of the multilevel nested loops but also the fine-grained instruction-level parallelization of the inner loops, showing multigranularity parallelization; for example, through the proposed method, the compiler guaranteed the correctness of the parallel execution and performed automatic vectorization and loop unrolling on inner-level loops to further gain the instruction-level parallelism while obtaining the thread-level coarse-grained parallelism in the outer-level loops. This property is also the reason that the proposed method is superior to Methods A and B in terms of automatic parallelization performance improvement.

Figure 15 shows that, on the NPB3.3.1 test set, the acceleration effect of Methods A and B on the CG, DC, EP, IS, and LU-HP programs was superior to that of the proposed method. The reason is that these five programs do not have complex dependencies between multilevel nested loop iterations, so the effect of parallelization depends mainly on the performance evaluation of the parallelization benefit of the loop layers. Method A combines the loop-layer parallelization benefit evaluation at compile time and runtime, and Method B, through the machine learning method, combines the information from a large number of nested loop layers to heuristically select beneficial loop layers for parallelization, whereas the proposed method relies only on the compiler's static analysis of the parallelization benefit of the loop layers. Thus, the acceleration effect of the proposed method was lower than that of Methods A and B. Moreover, when examining the assembly code that corresponds to the multilevel nested loop segment, we found, somewhat surprisingly, that once the correctness of the dependencies was ensured, the proposed method obtained a much larger instruction scheduling space in the subsequent optimization process through techniques such as loop unrolling and register renaming, while eliminating the instruction conflicts caused by interinstruction dependencies, thereby further improving the parallelism between the instructions. Therefore, combining the proposed method with Methods A and B is expected to achieve better results.

6. Related Work

The authors in [1, 4, 33–37] obtained profiling information that reflects the behavior of the program through multiple preexecutions, using the thread-level speculation technique in combination with the information collected during preexecution; the appropriate loop layer was then selected for parallel execution based on the predicted performance. However, the dependence analysis ability of the hardware was limited, and thus, the success rate of these techniques in triggering multithreaded parallel execution in multilevel nested loops was not high [1, 4, 33–37].

In [31, 38–41], a multilevel nested loop thread-level parallelization scheme that combines static analysis and dynamic scheduling was adopted. Most of these methods used traditional compilation techniques to assess the benefits of the loops to obtain information on parallelization and made corresponding dynamic adjustments for the program behaviors caused by different platforms and different input data in actual operations, which is also the main direction of current studies. However, reliance on traditional compilation techniques alone cannot give accurate loop-level parallelization prompts for multilevel nested loops. Therefore, the methods described in these references lack generality [31, 38–41].

As reported in [32, 42–44], relevant features, such as the number of loop iterations of multilevel nested loops and the number of nested layers, were extracted from the intermediate representation of the compiler to construct a loop selection assessment model for multilevel nested loops. Compared with the models based on the thread-level speculation technique, these models improve the parallelization effect of multilevel nested loops to some extent. However, as the number of nested loop levels of the program increases, the iteration dependencies between loops become complicated, and these models are still unable to achieve the desired performance improvement [32, 42–44].

The authors in [45–47] proposed compiler frameworks that estimate the misspeculation cost of loops. Based on the control flow graph and data dependency graph in the compiler, these frameworks analyze the probability that the parallel execution of loop layers in multilevel nested loops fails, in order to estimate the cost of executing the loop layers in parallel; on this basis, the multilevel nested loops are aggressively transformed into loops executed in parallel [45–47]. The accuracy of this speculation depends on the compiler's limited ability to analyze the dependencies between loop iterations, and thus, these frameworks have little effect on the benefit of loop layer parallelization of multilevel nested loops.

In this study, we proposed a solution for the parallel static compilation of multilevel nested loops using GAs, which employs the least number of loop layers to cover all loop-carried dependencies, thereby obtaining the highest parallel granularity. This method can be readily combined with other methods and thus has a wider range of applications.

7. Discussion

Experimental results indicated that the proposed method can effectively solve the loop selection problem for multilevel nested loops and find a set of fewer loop layers that cover all direction vectors in a limited time to expose the parallelism of multilevel nested loops, to obtain a higher parallel granularity, and to further improve the performance of the entire program. In this study, the number of nested loop layers was used to represent the chromosome length, which is short in this case and proportional to the population size, and the preset stopping criterion ensured sufficient reproduction generations, while the settings of the parameters, such as the chromosome crossover rate and mutation rate, prevented the population from falling into locally optimal solutions. Therefore, the optimal solution to the loop selection problem of the multilevel nested loops was successfully found by the proposed method multiple times.

Compared with the method that combines static compilation benefit analysis and a dynamic runtime loop scheduling scheme [31], the proposed method can obtain a higher parallelism granularity. Compared with the multilevel nested loop parallelization model proposed by Liu et al. [32] based on the KNN algorithm of machine learning, the proposed method is simpler and more intuitive, which makes it easier to integrate with mainstream compilers. Dice et al. [9] considered the actual running state of the multilevel nested loops to be closely related to the underlying hardware, and Li et al. [2] extracted the relevant parameters of multilevel nested loops from the intermediate representation of the Prophet compiler, which relies on the organizational structure of that specific compiler. In contrast, the proposed method starts from the direction matrix of multilevel nested loops and thus has a higher applicability in mainstream compilers.

The proposed method can not only solve the loop selection problem of multilevel nested loops but also assist the compiler in finding the maximum parallel granularity in multilevel nested loops in a short time, thereby improving the overall performance of the program. This method also provides a new idea for solving the parallelism of multilevel nested loops with complex dependencies as well as a new method for mainstream compilers to explore the parallelism of multilevel nested loops.

8. Conclusions

Due to the complexity of the dependency analysis of the multilevel nested loops and the limitations of the static analysis capability of basic compilers, the loop selection problem of multilevel nested loops remains a very difficult problem. With the advent of the era of artificial intelligence, the multilevel nested loops that are present in a large number of machine learning algorithms seriously affect the overall performance of the programs. Therefore, solving the loop selection problem of multilevel nested loops and finding the thread-level parallelism with the maximum granularity in the multilevel nested loops are of great practical significance for making full use of multicore processors and even many-core processors, thus improving the overall performance of a program.

Data Availability

The SPEC CPU2006 and NPB3.3.1 data used to support the findings of this study are available from the SPEC and NPB websites (http://www.spec.org/benchmarks.html#cpu and https://www.nas.nasa.gov/publications/npb.html).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Key Research and Development Program “High-Performance Computing” Key Special (Grant no. 2016YFB0200503).