Abstract
Multi-cores offer higher processing power than single-core processors. However, as the number of cores available on a single processor increases, efficiently programming them becomes increasingly complex, often to the point where the limiting factor in speeding up tasks is the software.
We present Grammatical Automatic Parallel Programming (GAPP), a system that synthesizes parallel code on multi-cores using OpenMP parallelization primitives in problem-specific grammars. As a result, GAPP obviates the need for programmers to think parallel while still letting them produce parallel code.
The performance of GAPP on a number of difficult proof-of-concept benchmarks informs further optimization of both the design of the grammars and the fitness function to extract additional parallelism. We demonstrate improved performance of the evolved programs with a controlled degree of parallelism. These programs adapt to the number of cores on which they are scheduled to execute.
1 Introduction
As multi-core processors become the norm, researchers are fabricating thousands of cores on a single chip [6, 28, 32]. As the number of cores on a chip increases, efficiently programming them becomes increasingly complex, often to the point where the limiting factor in speeding up tasks is the software. Indeed, high performance computing developers [37, 46, 49] have identified that software is trailing behind the rise of multi-cores. The inability of sequential software to scale with multi-cores forces programmers to write parallel programs that exploit multi-cores.
Parallel programming APIs such as MPI [30] and OpenMP [26] help exploit the higher processing power of multi-cores; OpenMP, in particular, targets shared memory architectures. Writing parallel programs with either of these two standards is challenging compared to sequential programming [41]. Challenges include identifying the available parallelism, configuring the shared data, using locks for mutual exclusion to guarantee correctness of the code, and synchronizing and balancing the workload among multiple processors.
Alternatively, automatic parallelization transforms a sequential program into semantically equivalent parallel code. Automatic parallelization compilers include Polaris [8], SUIF [5], and the Vienna Fortran Compiler [7]. Automatic parallelization remains difficult: the burden simply moves from the software developer to a compiler engineer. Engineers' efforts were later augmented, and in some cases replaced, with machine learning [53]. Clearly, we need better tools to fully exploit multi-cores.
We introduce an automatic parallel programming tool, Grammatical Automatic Parallel Programming (GAPP), to bridge the gap between traditional parallel programming and the difficulties it poses for humans. GAPP combines Grammatical Evolution (GE) with the design of parallel context-free grammars (CFGs). GAPP predominantly addresses parallel programming on shared memory architectures; hence, we use OpenMP parallelization constructs to guarantee parallelism. OpenMP primitives are an integral part of the grammars, and GE together with these primitives creates a feasible solution space of parallel programs.
We examine GAPP in synthesizing parallel programs in two domains: recursion and iterative sorting. We study the performance, measured in terms of speed-up, and the effort required to synthesize a program, measured in terms of the number of generations. The results indicate that GAPP generates correct and efficient parallel programs. We extend GAPP and, as a result, witness a slight improvement in the performance of the resultant parallel programs. At this stage, the improvements expose a peculiar behaviour in the execution of the synthesized parallel programs, and this behaviour differs between the two problem domains: recursive parallel programs exhibit excessive parallelism, while iterative sorting programs suffer from false sharing. To address these challenges, we further extend GAPP by slightly modifying the design of the grammars. The enhancements resolve these hurdles while improving the performance of the synthesized parallel programs.
We organize the rest of the paper as follows: Sect. 2 describes the existing work; Sect. 3 describes GAPP on both the problem domains; Sect. 4 presents the experimental parameters; Sect. 5 demonstrates the experimental results; and Sect. 6 analyses and extends GAPP; finally, Sect. 7 concludes.
2 Related Research
2.1 Evolutionary Techniques for Recursion
Some of the earliest work on evolving recursion is by Koza [36, Chapter-18], who evolved the Fibonacci sequence; this work cached previously computed recursive calls for efficiency. Brave [10] used Automatically Defined Functions (ADFs) to evolve recursive tree search, in which recursion terminated upon reaching the tree depth. Later, [55] concluded that infinite recursion was a major obstacle to evolving recursive programs. However, Wong and Mun [57] successfully used an adaptive grammar to evolve recursive programs; the grammar adjusted the production rule weights while evolving solutions.
Spector et al. [48] evolved recursive programs using PushGP by explicitly manipulating its execution stack. The evolved programs were of \(O(n^2)\) complexity, which became \(O(n \log n)\) with an efficiency component in the fitness evaluation. Moraglio et al. [38] used a non-recursive scaffolding method to evolve recursive programs with a CFG-based GP. Recently, Agapitos et al. [4] presented a review of GP for recursion.
2.2 Evolutionary Techniques for Sorting
In evolving sorting networks, Hillis [31] evolved a minimal 16-input network for the sorting network problem. O’Reilly and Oppacher [44] initially failed to evolve sorting with genetic programming (GP); however, they succeeded in [45] with a swap primitive. Later, Kinnear [33, 34] generated a bubble sort by swapping the disordered adjacent elements. Abbott [1] used Object Oriented Genetic Programming (OOGP) for insertion and bubble sorts. Spector et al. [48] used PushGP for recursive sorting of \(O(n^2)\) complexity, enhanced to \(O(n \log n)\) by rewarding efficiency.
Recently, Agapitos and Lucas [2, 3] evolved efficient recursive quicksort using OOGP in Java; the evolved programs were of \(O(n \log n)\) complexity. O’Neill et al. [43] applied GE for program synthesis by evolving an iterative bubble sort in Python; the evolved programs had quadratic \(O(n^2)\) complexity. Most of these attempts fall into the quadratic \(O(n^2)\) class, while those in [2, 48] achieve \(O(n \log n)\).
2.3 Automatic Evolution of Parallel Programs
In general, automatic generation of parallel programs can be divided into two types: auto-parallelization of serial code and the generation of native parallel code.
Auto-parallelization requires a serial program. Using GP, [47, Chapter-5] proposed Paragen, which had initial success; however, the execution of candidate solutions for fitness evaluation ran into difficulties with complex and time-consuming loops. Later, Paragen-II [47, Chapter-7] dealt with loop inter-dependencies relying on a rough estimate of time. Then, [47] extended Paragen-II to merge independent tasks of loops.
Similarly, genetic algorithms evolved transformations: [40] and [56] proposed GAPS (Genetic Algorithm Parallelization System) and Revolver, respectively. GAPS evolved sequence restructuring, while Revolver transformed loops and programs; both optimized execution time. Native parallel code generation, on the other hand, produces a working program that is also parallel. With multi-tree GP, [51] concurrently executed autonomous agents for the automatic design of controllers.
Unlike Paragen-II [47], GAPP does not employ dependency analysis; instead, GE works out the data interdependencies by selecting pragmas that guarantee program correctness. Recently, Chennupati et al. [15, 17] evolved natively parallel regression programs. Thereafter, MCGE-II [20] evolved task-parallel recursive programs. The minimal execution time of the synthesized programs was merely due to the presence of OpenMP pragmas, which automatically map threads to cores. However, the choice of OpenMP pragma alters the performance of a parallel program, and skilled parallel programmers choose pragmas carefully when writing code. To that end, in this paper, we extend MCGE-II in two ways: we re-structure the grammars so that task and data level parallelism are separated, and we explicitly penalize long executions.
3 Grammatical Automatic Parallel Programming
Grammatical Automatic Parallel Programming (GAPP) is the first instance of using grammars for the task of automatic parallel programming. GAPP provides an alternative to the craftsman approach to parallel programming. This is significantly different from other parallel EC approaches, because not only do we produce individuals that, in their final form, can exploit parallel architectures, we also exploit the same parallel architecture during evolution to reduce execution time.
Figure 1 presents an overview of GAPP, which operates on a string of codons that separates the search and solution spaces. Like any application of GE, GAPP uses the typical search process, genetic operations, and genotype-phenotype mapping. The major contribution of GAPP, however, is the design of grammars that produce parallel programs: OpenMP primitives are an integral part of the grammars and create a feasible solution space of parallel programs. The GE search process then finds [near] optimal parallel programs in that space. These programs are evaluated on a number of fitness cases such that the best among them is identified as a parallel program. We now discuss the parallelization strategy of the end programs.
3.1 GAPP for Parallel Recursion
GAPP relies on the grammars designed to produce parallel recursive programs [20]. We discuss the design of parallel recursive grammars for Fibonacci.
OpenMP Pragmas
OpenMP is a portable, scalable, directive-based specification for writing parallel programs on shared memory systems. It consists of compiler directives, environment variables, and run-time libraries that designate parallelism in C/C++ and Fortran. The directives are special preprocessor instructions, termed pragmas, that follow fork-join parallelism.
Some relevant OpenMP pragmas follow. parallel for is a loop construct that distributes the iterations of a loop among the threads; its use is limited to a for loop with defined boundaries, that is, a loop with a terminating condition. Another pragma, parallel sections, defines a parallel region in which each block is handled independently: if there are more threads than independent blocks, the remaining threads stay idle; otherwise, some threads execute multiple code blocks. task is another work-sharing pragma that works similarly to the section construct; note, however, that unlike the first section directive inside a sections block, the omp task pragma is never optional. A detailed description of the OpenMP API can be found in [12].
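As a minimal, self-contained sketch (the variable names, loop bounds, and bodies are illustrative and not taken from the GAPP grammars), the three pragmas are used as follows:

```c
#include <omp.h>
#include <stdio.h>

#define N 1000

int main(void) {
    int a[N], b[N];

    /* parallel for: iterations of a bounded loop are divided among threads */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = i * i;

    /* parallel sections: each independent block is executed by some thread */
    #pragma omp parallel sections
    {
        #pragma omp section
        { b[0] = a[0] + 1; }
        #pragma omp section
        { b[1] = a[1] + 1; }
    }

    /* task: work units created at run time and scheduled onto the team */
    #pragma omp parallel
    {
        #pragma omp single
        {
            #pragma omp task
            printf("task ran on thread %d\n", omp_get_thread_num());
        }
    }
    return 0;
}
```

Compiled with an OpenMP-capable compiler (e.g., gcc -fopenmp), each pragma transparently forks a team of threads and joins them at the end of the construct.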
Design of Grammars
Figure 2 presents the GAPP grammar for the synthesis of a Fibonacci program. The program begins at <program>, which derives the symbols <condition> and <parcode>. The symbols <omptask> and <ompdata> represent the task and data parallel pragmas, while <omppragma> selects one of the two options, giving a clear separation between task and data parallelism. This helps to accelerate the evolution of solutions because of the grammatical bias [54]: the design constrains the search to explore either the data parallel or the task parallel space rather than both.
The grammar has shared (<shared>) and private (<private>) clauses. The input (<input>) and two variables (temp, res) are shared among the threads. The input represents the nth Fibonacci number, while the variable res returns the result of parallel execution. The local variables (temp, a) store the auxiliary results of recursive calls.
The variable “a” is thread private. OpenMP offers three private clauses: private(a) makes a variable thread-specific, such that any changes to the variable are invisible outside the parallel region; firstprivate(a) additionally initializes each thread's copy with the value the variable held before the region; lastprivate(a) copies the value from the last thread (or logically last iteration) in the parallel region back to the original variable. Evolution selects one of the three private clauses depending on the problem.
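The difference between the three clauses is easiest to see in a small, hand-written example (not an evolved program; the values are arbitrary):

```c
#include <stdio.h>

int main(void) {
    int a = 10;

    /* private: each thread gets its own uninitialized copy of a;
       the original a is untouched after the region */
    #pragma omp parallel private(a)
    { a = 0; /* thread-local write, invisible outside */ }

    /* firstprivate: each thread's copy starts with the value 10 */
    #pragma omp parallel for firstprivate(a)
    for (int i = 0; i < 4; i++)
        a += i; /* modifies the thread-local copy only */

    /* lastprivate: after the loop, a holds the value written by
       the logically last iteration */
    #pragma omp parallel for lastprivate(a)
    for (int i = 0; i < 4; i++)
        a = i;

    printf("a = %d\n", a); /* prints 3 */
    return 0;
}
```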
The non-terminal <parblocks> produces parallel code blocks that are mapped through <blocks>. The non-terminal <blocks> generates a sequence of parallel blocks, each containing an independent recursive call; these blocks ensure task-level parallel execution. The non-terminal <stmt> depicts the recursive call of the Fibonacci program, while the symbols <bop> and <lop> refer to the binary arithmetic and logical operators, respectively. The symbol <const> maps to integer constants. The base case is generated from the input variable, logical operators, and constants through the non-terminals <line1> and <line2>. The non-terminal <expr> expresses the recursive calls and is invoked in <parcode> and <blocks>.
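To make the derivation concrete, the following is a hedged sketch of the kind of task-parallel Fibonacci program that the grammar of Fig. 2 can derive (the exact evolved code varies from run to run):

```c
/* Sketch of a grammar-derived parallel Fibonacci; temp and res are
   shared among the threads of the region, as dictated by <shared> */
int fib(int input) {
    int temp, res;
    if (input < 2)                      /* base case from <condition> */
        return input;
    #pragma omp parallel sections shared(temp, res)
    {
        #pragma omp section
        { temp = fib(input - 1); }      /* independent recursive call */
        #pragma omp section
        { res = fib(input - 2); }       /* independent recursive call */
    }
    return temp + res;                  /* combined via <bop> */
}
```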
Performance Optimization
We encourage parallelism by including execution time in the fitness function. The run time exerts external selection pressure, which helps select an appropriate parallelization primitive. The fitness function is therefore a product of two factors, execution time and mean absolute error, both normalized into the range [0, 1], yielding a maximization function. Equation (1) computes the fitness of an evolving parallel recursive program (\(f_{rprog}\)):

\[ f_{rprog} = \frac{1}{1+t} \times \frac{1}{1 + \frac{1}{N}\sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert} \qquad (1) \]
where t is the execution time of a program, and \(y_i\) and \(\hat{y}_i\) are the actual and evolved outputs on the ith of the N fitness cases, respectively. The choice of an OpenMP pragma can significantly impact the execution time of a program, and the presence of an incorrect pragma in the end program adversely affects fitness evaluation. To limit such effects, the first term of Eq. (1), the normalized execution time, helps select the correct pragma: changes in the time component influence the performance of the resultant programs, so those with minimum execution time become the best parallel programs. Meanwhile, the second term, the normalized mean absolute error, enforces program correctness. Together, the twin objectives push for a correct and efficient parallel program.
3.2 GAPP for Parallel Iterative Sorting
Design of Grammars
We describe the design of grammars for the synthesis of parallel odd-even sort programs [21]; like the recursive grammars, the odd-even sort grammars are given in [13, Appendix B]. The generation of an end program starts with the <program> symbol, which derives the <for_out> and <condition> symbols. The non-terminal <for_out> maps to an outer “for loop”. Note that GE struggles to generate correct loop structures [43]; hence, we preserve the loops when synthesizing the iterative sorts. The non-terminal <condition> derives problem-specific base/termination conditions.
The symbols <schedule> and <type> derive the type of scheduling strategy. For scheduling a parallel for loop, OpenMP offers three clauses: static, dynamic, and guided. static divides the work among the threads before the loop executes, while dynamic allocates the work during execution. The third type, guided, also allocates work during execution, but starts with large chunks that shrink as the loop proceeds, with the given chunk size (CHUNK) as the minimum.
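For illustration (the array, bounds, and CHUNK value are placeholders, not values from the evolved grammars), the three strategies appear in code as:

```c
#define CHUNK 100

void schedule_demo(int *a, int n) {
    /* static: iterations are divided into CHUNK-sized pieces
       and assigned to threads before the loop executes */
    #pragma omp parallel for schedule(static, CHUNK)
    for (int i = 0; i < n; i++) a[i] = i;

    /* dynamic: each thread grabs the next CHUNK iterations at run time */
    #pragma omp parallel for schedule(dynamic, CHUNK)
    for (int i = 0; i < n; i++) a[i] += 1;

    /* guided: run-time allocation with chunks that start large
       and shrink toward CHUNK, the minimum chunk size */
    #pragma omp parallel for schedule(guided, CHUNK)
    for (int i = 0; i < n; i++) a[i] *= 2;
}
```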
We include the mechanism of swapping adjacent elements in two phases (odd and even). The input, the index (<index>), and the size of the array are shared among all the cores. The temporary variable (temp in <index>) is private to each thread. We use absolute values (abs in <for_in_line> and <swap>) to avoid negative indexes.
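A hedged, hand-written sketch of the two-phase structure the grammar targets is given below (evolved programs differ in details such as the schedule clause and the abs-based index expressions):

```c
/* Parallel odd-even transposition sort: each phase compares and swaps
   disjoint adjacent pairs, so the inner loop iterations are independent */
void odd_even_sort(int *a, int size) {
    for (int phase = 0; phase < size; phase++) {
        int start = phase % 2;          /* 0: even phase, 1: odd phase */
        #pragma omp parallel for shared(a)
        for (int i = start; i < size - 1; i += 2) {
            if (a[i] > a[i + 1]) {
                int temp = a[i];        /* temp is private to the thread */
                a[i] = a[i + 1];
                a[i + 1] = temp;
            }
        }
    }
}
```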
Performance Optimization
As with recursion (Eq. (1)), the time term in the fitness evaluation of parallel iterative sorting helps choose an appropriate pragma. The accuracy is defined via mean inversions: if \(a_1 a_2 \ldots a_n\) is a permutation of the set \(\{1, 2, \ldots, n\}\), then the pair \((a_i, a_j)\) is an inversion of the permutation iff \(i < j\) and \(a_i > a_j\) [35]. The fitness function (\(f_{sprog}\)) is shown in Eq. (2):

\[ f_{sprog} = \frac{1}{1+t} \times \left(1 - \frac{\sum_{i=1}^{N} n(I(A_i))}{TP}\right) \qquad (2) \]
where t stands for the execution time of the evolved parallel program over all the fitness cases (N); \(n(I(A_i))\) is the number of inversions in the ith array (\(A_i\); N arrays in total); and TP is the total number of pairs in all the fitness cases (N).
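For concreteness, a straightforward \(O(n^2)\) inversion counter that could implement the accuracy term of Eq. (2) looks as follows (a sketch, not the exact fitness code of GAPP):

```c
/* Count pairs (i, j) with i < j and a[i] > a[j]; a sorted array has
   zero inversions, a reverse-sorted one has n(n-1)/2 */
long count_inversions(const int *a, int n) {
    long inv = 0;
    for (int i = 0; i < n - 1; i++)
        for (int j = i + 1; j < n; j++)
            if (a[i] > a[j])
                inv++;
    return inv;
}
```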
4 Experiments
We evaluate GAPP on six recursive and four iterative sorting benchmark problems; Table 1 presents all the benchmarks with their properties. Of the six recursive problems, the first three (Sum-of-N, Factorial, Fibonacci) accept a positive integer as input: for Sum-of-N, it is randomly generated from the range [1, 1000], while for Factorial and Fibonacci it lies in the range [1, 60] owing to the limitations of data types in C. The remaining three problems (Binary-Sum, Reverse, Quicksort) accept an array of integers with its start and end indexes as input, for which an array of 1000 elements is randomly generated from the range [1, 1000]. For the four iterative sorting benchmarks, 100 training cases, each an array of 1000 elements, are randomly generated from the range [1, 1000]. The end programs of these four benchmarks use conditional (if), iterative (for), and variable indexing structures.
Table 2 describes the algorithmic and hardware parameters. The grammars are general enough except for a few minor changes with respect to the problem at hand.
Generality of Grammars
The grammars for the benchmarks Fibonacci (Fig. 2) and Odd-Even sort represent both experimental domains. The grammars for the other benchmarks are 90% similar: all of them share common OpenMP pragmas while differing in some domain-specific knowledge. Grammars for all the benchmarks are presented in [13, Appendix B]. We evolve programs in C; however, GAPP is general enough to apply to any programming language that offers OpenMP-like parallelism. For example, JOMP [11] is an OpenMP-like API for Java, with which GAPP could synthesize parallel Java programs.
4.1 GAPP Variants
Given the two distinct features of GAPP (the design of grammars and the performance optimization), we study their influence on the synthesizability and fitness evaluation of the parallel programs. The study compares four GAPP variants. The first, GAPP (Unoptimized), uses neither the separation of task and data parallel primitives nor the time component of the performance optimization (shown in Eqs. (1) and (2)). The second, GAPP (Grammar), uses the grammar design with separated parallel primitives but not the time component. The third, GAPP (Time), neglects the separation of task and data parallel primitives but uses time in the performance optimization. Finally, the fourth, GAPP (Combined), uses both the grammar design and the performance optimization.
5 Results
We present experimental results of GAPP for both the recursion and iterative sorting domains. The results report two measures: speed-up and mean best generation (MBG). Speed-up captures the performance of the synthesized parallel programs, while MBG shows the effort taken to synthesize a best-of-run program in terms of generations.
Speed-Up
Speed-up is defined as the ratio of the mean best execution time (MBT) of the synthesized parallel programs on 1 core to that on n cores, as shown in Eq. (3):

\[ \text{Speed-up} = \frac{T_{MBT\text{-}1\text{-}core}}{T_{MBT\text{-}n\text{-}cores}} \qquad (3) \]
where \(T_{MBT\text{-}1\text{-}core}\) is the mean best execution time on a single core, while \(T_{MBT\text{-}n\text{-}cores}\) is that on n cores of a processor. The mean best execution time (MBT) is defined as the mean of all the execution times of the best-of-generation programs across all the experimental runs of GAPP, as shown in Eq. (4):

\[ T_{MBT} = \frac{1}{R} \sum_{r=1}^{R} \left( \frac{1}{G} \sum_{g=1}^{G} T_{bprog}(g) \right) \qquad (4) \]
where \(T_{bprog}(g)\) is the execution time of the best program in a given generation g, G is the number of generations, r indexes a run, and R is the number of runs.
Mean Best Generation (MBG)
Mean best generation (MBG) is defined as the number of generations required to converge to the best fitness, averaged across R runs, with the pre-condition that the program under consideration must be correct. MBG helps investigate the effect of restructuring the grammars on the synthesizability (ease of evolving) of correct parallel programs.
5.1 Recursion
Figure 3 presents the speed-up of each of the four GAPP variants (Unoptimized, Grammar, Time, Combined) at different core counts for all six recursive benchmarks. The results indicate that the performance of GAPP improves as the number of cores increases. Non-parametric Friedman tests [27] show the significance of these results.
Table 3 shows the non-parametric Friedman tests with Hommel’s post-hoc analysis [29] on the speed-up of GAPP for the recursive problems at α = 0.05. The first column shows the number of cores, the second the GAPP variant, the third the average rank, and the fourth and fifth the p-value and p-Hommel. The lowest average rank identifies the best variant (GAPP (Combined)), marked with an asterisk (*). A variant is significantly different from the best variant, and shown in boldface, if its p-value is less than the critical p-Hommel at α = 0.05.
The performance on 2 cores is insignificant, as the cost of thread overheads offsets the performance gains. For 4 cores, GAPP (Combined) significantly outperforms the remaining three variants. For 8 and 16 cores, GAPP (Combined) outperforms the GAPP (Unoptimized, Grammar) variants, while the difference with GAPP (Time) is insignificant due to the presence of execution time in both of their fitness evaluations.
Table 4 shows the MBG of the four GAPP variants together with statistical tests. GAPP (Grammar) outperforms GAPP (Unoptimized, Time, Combined), requiring fewer generations than the remaining variants to synthesize the best programs.
Although GAPP (Grammar) needs fewer generations to synthesize parallel recursive programs, the performance results (Fig. 3) show that GAPP (Time, Combined) outperform it: the programs synthesized by GAPP (Grammar) are not as efficient as those of GAPP (Time, Combined). GAPP (Combined), in turn, outperforms GAPP (Time) in terms of MBG (Table 4), being quicker to synthesize efficient parallel recursive programs. Therefore, GAPP (Combined) is the best variant, reporting an average speed-up (over the recursive problems) of 8.13 on 16 cores, an improvement of 23.86% over GAPP (Unoptimized), which reports a speed-up of 6.19.
5.2 Iterative Sorting
Figure 4 shows the speed-up of GAPP (Unoptimized, Grammar, Time, Combined) on the four iterative sorting benchmarks for 2, 4, 8, and 16 cores. Table 5 shows the Friedman tests with Hommel’s post-hoc analysis on the speed-up of the four variants for 16 cores. The variant with the lowest rank is the best variant (GAPP (Combined)) and is marked with an asterisk (*).
For 4 cores, GAPP (Combined) outperforms GAPP (Unoptimized), while its difference from GAPP (Grammar, Time) is insignificant. For 8 and 16 cores, GAPP (Combined) outperforms GAPP (Unoptimized, Grammar) and is statistically indistinguishable from GAPP (Time). GAPP (Combined) shows an average speed-up of 11.03, an improvement of 15.75% over GAPP (Unoptimized), which has an average speed-up of 9.29.
Table 6 compares the MBG of GAPP (Unoptimized, Grammar, Time, Combined) and their statistical significance. GAPP (Grammar) outperforms GAPP (Unoptimized, Time), while its difference from GAPP (Combined) is insignificant. GAPP (Grammar) produces parallel iterative sorting programs in fewer generations than GAPP (Time) because of the variation in grammar design among the variants, which impacts the evolution of programs. However, GAPP (Combined) evolves more efficient programs than GAPP (Grammar) (see Fig. 4). Therefore, GAPP (Combined) is the best variant for the evolution of efficient parallel iterative sorting programs.
6 Enhancements in GAPP
We analyze the effect of OpenMP thread scheduling on the performance of the GAPP-evolved parallel programs, in both the recursion and iterative sorting domains. Code growth in GAPP is surprisingly insignificant [19, 22, 23] and therefore does not affect program execution; thread scheduling, however, has a distinct influence on both problem domains.
6.1 Recursion
The quality of parallel code is difficult to quantify, as execution time often depends on the ability of the OS to schedule the tasks efficiently, a job complicated by other parallel threads (from other programs) running at the same time. OpenMP abstracts much of these concerns away from programmers, which makes their job easier at the cost of some fine control. We compensate for this by adapting a program to the hardware.
Excessive Parallelism
Hardware caps the maximum number of threads; in the grammars, however, each recursive call spawns a new thread, producing exponentially many threads for a recursive benchmark such as Fibonacci. The OS, specifically the Linux kernel, eventually fails to scale when scheduling such a high number of threads [9]. Moreover, when a parent thread spawns child threads, it sleeps until all of them have finished, an expensive process when a large number of threads is involved. Memory access restrictions over shared and private variables add to the complexity of the executing code; complexity here comes from the vagaries of scheduling what can be a high number of threads. We extend GAPP to overcome these limitations.
Extending GAPP for Recursion
Armed with the knowledge of excessive parallelism, we constrain the system so as to optimize the degree of parallelism. We combine parallel and serial implementations of the evolved programs, which further improves performance: the top-level recursive calls distribute the load across a number of threads, whereas the lower-level calls carry out the work directly instead of merely invoking more threads. Evolution detects the exact level at which recursion switches from parallel to serial. The intermediate results are saved temporarily in an auxiliary variable shared amongst all the executing threads. This stops the creation of an exponential number of threads and thereby reduces the overhead caused by excessive parallelism.
The GAPP grammars used for the recursive benchmarks (Sect. 3.1) are modified as shown in Fig. 5; the resulting variant is termed GAPP (Scaled) hereafter. We alter the non-terminal <condition> to synthesize nested if-else condition blocks. The change lets the evolved programs contain a two-digit thread-limiting constant, below which the program executes sequentially, thus reducing the execution time of the final programs.
Figure 6 shows an example of a GAPP (Scaled) generated Fibonacci program. GAPP evolves a thread-limiting constant for a given problem and computational environment; this constant arrests the further creation of threads, after which execution continues serially. The intermediate result (temp in the else if) is shared among the threads, which further optimizes the execution time and thereby exploits the power of multi-cores efficiently.
The constant (39) in the else if condition (in Fig. 6) is the thread-limiting constant for 16 cores. Figure 7 shows the thread-limiting constants (with standard deviation) with respect to the number of cores for the six benchmarks. These constants adapt to the underlying hardware architecture. Note, however, that the constant (39) may not be optimal for a much larger input (say, a 1000000 element array), where a bigger constant may be needed; this is addressed with digit concatenation grammars [42, Chapter 5].
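Following the description of Fig. 6, a hedged C sketch of a GAPP (Scaled)-style program is shown below (the cutoff 39 is the reported thread-limiting constant for 16 cores; the exact evolved code differs from run to run):

```c
/* Parallel above the evolved cutoff, serial below it */
long fib(long input) {
    long temp, res;
    if (input < 2)
        return input;                            /* base case */
    else if (input <= 39)                        /* thread-limiting constant */
        return fib(input - 1) + fib(input - 2);  /* serial lower levels */
    #pragma omp parallel sections shared(temp, res)
    {
        #pragma omp section
        { temp = fib(input - 1); }               /* parallel top levels */
        #pragma omp section
        { res = fib(input - 2); }
    }
    return temp + res;
}
```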
Figure 8 shows the speed-up of GAPP (Scaled) over all six benchmarks for 2, 4, 8, and 16 cores. Like the other GAPP variants, the speed-up of GAPP (Scaled) improves with an increase in the number of cores; notably, its performance is much better than that of its counterparts.
Table 7 presents the mean best generation (MBG) of the GAPP (Unoptimized, Grammar, Time, Combined, Scaled) variants. The results show that GAPP (Grammar) generates a program faster than the GAPP (Combined, Scaled) variants because of the grammatical bias, whereas the latter two use execution time in fitness evaluation, which makes evolution harder. Nevertheless, GAPP (Scaled) generates the most efficient task-parallel recursive programs.
Table 8 shows the non-parametric Friedman tests with Hommel’s post-hoc analysis on speed-up and MBG of GAPP variants. The best variant with the lowest rank is marked with an asterisk (*), and significantly different variants are in boldface.
For speed-up, GAPP (Scaled) outperforms the remaining four GAPP variants. Note that these results are for 16 cores of a processor; they are similar for 8 cores, while for 4 cores and below the differences are insignificant. On average, for 16 cores, GAPP (Scaled) shows a speed-up of 9.97, a significant improvement of 17.45% over GAPP (Combined) and of 37.91% over GAPP (Unoptimized).
For MBG, GAPP (Grammar) outperforms the other four GAPP variants. These results are for 16 cores and are similar for 8 and below. Although GAPP (Scaled) requires slightly more generations than the other variants, it is the best variant in this paper, as it generates efficient parallel recursive programs.
A related alternative is to keep a table that records the result of each recursive call on its first evaluation and then consult the table for repeated recursive calls, similar to Koza [36]; that approach, however, has been criticized [38] for not being an exact recursion. We now analyze and extend GAPP for the iterative sorting domain.
6.2 Iterative Sorting
In contrast to the excessive parallelism in recursion, factors such as OpenMP workload scheduling play a vital role in optimizing the performance of the synthesized iterative sorting programs. Interestingly, OpenMP hides these details from the developer, which makes it easy to use but hard to realize its full potential. Load balancing among parallel threads is a serious concern on shared memory processors. The OpenMP scheduling strategies (static, dynamic, guided) address these performance issues effectively. However, explicitly assigning the optional chunk size (chunk) optimally is difficult, as the ideal value often requires problem-specific knowledge: the best chunk size changes with the loop iterations, the number of cores, and the threads under execution. Moreover, small chunks of data lead to the well-known parallel programming challenge of false sharing.
False Sharing
False sharing is a performance problem that occurs when threads on different cores modify variables that reside on the same cache line [50]; this invalidates the cache line and forces a memory update, thereby reducing performance. Precisely, if one core loads a cache line already loaded by another core, that line is marked as “shared”. If a core then stores to a shared cache line, the line is marked as “modified” and all the remaining cores receive a cache line “invalid” message. If any core subsequently accesses the modified line, it is written back to memory and marked as “shared” again, and the other cores that try to access the same cache line incur a cache miss. This frequent coordination among cores, cache lines, and memory caused by false sharing significantly degrades the performance of an application.
False sharing can be avoided by placing the variables far apart in memory (using compiler directives) so that they do not fall on the same cache line. In the case of arrays, it can be avoided by aligning the elements on a cache line boundary; if that is impossible, the array size can be set to double the cache line, which is possible when array sizes are allocated dynamically. Our extensions of GAPP control the array sizes to deal with false sharing and thereby improve the performance of the evolving programs.
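As an illustration of the padding remedy (the 64-byte cache line and the per-thread counter example are assumptions for exposition, not GAPP code), per-thread data can be forced onto separate cache lines:

```c
#include <omp.h>

#define NUM_THREADS 16
#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* Each counter is padded to fill a whole cache line, so updates from
   different cores never invalidate one another's lines */
typedef struct {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
} padded_counter_t;

static padded_counter_t counters[NUM_THREADS];

void count_events(const int *events, int n) {
    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int tid = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < n; i++)
            if (events[i])
                counters[tid].value++;  /* touches only this thread's line */
    }
}
```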
Extending GAPP for Iterative Sorting
This section addresses false sharing by further extending GAPP to evolve more efficient parallel iterative sorting programs. We overcome the problem of ideal load balancing by evolving an appropriate chunk size, an approach that works irrespective of the problem and the number of cores on which the program executes. To do so, we adopt the digit concatenation grammars [42] originally used for symbolic regression.
Figure 9 shows the modified GAPP grammar, which automatically generates a sequence of digits. The evolved chunk size adapts to the number of cores, the amount of load, and the number of threads. The proposed enhancements evolve more efficient programs.
Figure 10 presents a successfully evolved parallel iterative Odd-Even sort program using the GAPP (Scaled) grammars. Note that the program contains two constants (89, 87), as it operates in two phases (odd and even).
Table 9 shows the GAPP (Scaled) evolved constants (chunk sizes), averaged over the best-of-run programs across 50 runs. The chunk results are reported for 8 and 16 cores, where they showed significant performance optimization; the results for 2 and 4 cores are insignificant and hence omitted. The evolved constants balance the load effectively: these chunk sizes create larger data arrays than the default chunk size of 10, and the larger chunks of data help control false sharing, which is evident at higher core counts, improving performance.
Figure 11 shows the speed-up of the GAPP (Scaled) evolved programs. The results indicate that performance improves with an increase in the number of cores: an average speed-up of 12.52 for 16 cores, an improvement of 11.91% over GAPP (Combined) and of 25.79% over GAPP (Unoptimized).
Table 10 presents the Wilcoxon Signed Rank Sum significance tests between GAPP (Scaled) and GAPP (Combined) at α = 0.05. It contains the p-value for the corresponding problem, while “☑” indicates that the difference between the results of the two methods is significant, that is, p < 0.05.
The Vargha and Delaney [52] A-measure states how often GAPP (Scaled) outperforms GAPP (Combined). The A-measure lies between 0 and 1: above 0.5, GAPP (Scaled) is better than GAPP (Combined); at 0.5, the two are equal; below 0.5, GAPP (Combined) is better. Values close to 0.5 indicate a small difference, values far from it a large one. For example, on Bubble sort with 16 cores, GAPP (Scaled) performs better than GAPP (Combined) 35% of the time; in other words, GAPP (Combined) performs better 65% of the time. Overall, however, GAPP (Scaled) performs better than GAPP (Combined). Similarly, for MBG, GAPP (Grammar) takes fewer generations to evolve a parallel program, but GAPP (Scaled) exhibits better performance at the cost of a few extra generations.
7 Conclusion
We presented GAPP to automatically generate efficient task-parallel recursive and data-parallel iterative sorting programs. GAPP separates task and data parallelism in the design of the grammars and includes execution time in fitness evaluation. The modifications to the grammar favored quick generation of programs, while the execution time helped optimize their performance. We then analyzed the effect of OpenMP thread scheduling on the performance in both problem domains. We curbed the excessive parallelism by restricting the degree of parallelism in the evolving programs: programs evolve to run both in serial (for lower-level recursive calls) and in parallel, further optimizing performance. The most interesting contribution is the automatic load balancing that adapts to the experimental hardware environment, with which the system further improved the performance of the evolving sorting programs; the digit concatenation grammars let the limiting constants of the iterative sorting programs grow or shrink as needed. GAPP can be further extended to synthesize lock-free parallel programs, as in [24], applicable in the gaming industry. Moreover, the synthesizability of GAPP can be further leveraged as in [14, 16, 18], improving the performance of the evolving parallel programs in fewer generations. Similarly, probabilistic approaches to performance prediction [25] could help further optimize execution time.
References
R. Abbott, J.G.B. Parviz, Guided genetic programming, in Proceedings of the International Conference on Machine Learning; Models, Technologies and Applications, ed. by H.R. Arabnia, E.B. Kozerenko (CSREA Press, Las Vegas, 2003), pp. 28–34
A. Agapitos, S.M. Lucas, Evolving efficient recursive sorting algorithms, in IEEE Congress on Evolutionary Computation (IEEE, New York, 2006), pp. 2677–2684
A. Agapitos, S.M. Lucas, Evolving modular recursive sorting algorithms, in Genetic Programming, ed. by M. Ebner, M. O’Neill, A. Ekárt, L. Vanneschi, A.I. Esparcia-Alcázar. Lecture Notes in Computer Science, vol. 4445 (Springer, Berlin, 2007), pp. 301–310
A. Agapitos, M. O’Neill, A. Kattan, S.M. Lucas, Recursion in tree-based genetic programming. Genet. Program Evolvable Mach. 18(2), 149–183 (2017)
S.P. Amarasinghe, J.-A.M. Anderson, M.S. Lam, C.-W. Tseng, An overview of the SUIF compiler for scalable parallel machines, in Proceedings of the 7th SIAM Conference on Parallel Processing for Scientific Computing (1995), pp. 662–667
S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L. Bao, J. Brown, M. Mattina, C.-C. Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, J. Zook, TILE64 processor: a 64-core SoC with mesh interconnect, in Proceedings of the 14th International Solid-State Circuits Conference, ISSCC ’08 (IEEE, New York, 2008), pp. 88–598
S. Benkner, VFC: the Vienna Fortran Compiler. Sci. Program. 7(1), 67–81 (1999)
B. Blume, R. Eigenmann, K. Faigin, J. Grout, J. Hoeflinger, D. Padua, P. Petersen, B. Pottenger, L. Rauchwerger, P. Tu, S. Weatherford, Polaris: the next generation in parallelizing compilers, in Proceedings of the Workshop on Languages and Compilers for Parallel Computing (Springer, Berlin, 1994), pp. 10.1–10.18
S. Boyd-Wickizer, A.T. Clements, Y. Mao, A. Pesterev, M.F. Kaashoek, R. Morris, N. Zeldovich, An analysis of Linux scalability to many cores, in Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI ’10 (USENIX Association, Berkeley, 2010), pp. 1–8
S. Brave, Evolving recursive programs for tree search, in Advances in Genetic Programming, vol. 2 (MIT Press, Cambridge, MA, 1996), pp. 203–220
J.M. Bull, M.E. Kambites, JOMP: an OpenMP-like interface for Java, in Proceedings of the ACM 2000 Conference on Java Grande, JAVA’00 (ACM, New York, 2000), pp. 44–53
B. Chapman, G. Jost, R. van der Pas, Using OpenMP: Portable Shared Memory Parallel Programming. Scientific and Engineering Computation (The MIT Press, Cambridge, MA, 2007)
G. Chennupati, Grammatical evolution + multi-cores = automatic parallel programming!, PhD thesis, University of Limerick, Limerick, Ireland, 2015
G. Chennupati, C. Ryan, R.M.A. Azad, An empirical analysis through the time complexity of GE problems, in 19th International Conference on Soft Computing, MENDEL’13, Brno, Czech Republic, ed. by R. Matousek (2013), pp. 37–44
G. Chennupati, J. Fitzgerald, C. Ryan, On the efficiency of multi-core grammatical evolution (MCGE) evolving multi-core parallel programs, in Proceedings of the Sixth World Congress on Nature and Biologically Inspired Computing (IEEE, New York, 2014), pp. 238–243
G. Chennupati, C. Ryan, R.M.A. Azad, Predict the success or failure of an evolutionary algorithm run, in Proceedings of the Annual Conference on Genetic and Evolutionary Computation Companion, GECCO Comp ’14 (ACM, New York, 2014), pp. 131–132
G. Chennupati, R.M.A. Azad, C. Ryan, Multi-core GE: automatic evolution of CPU based multi-core parallel programs, in Proceedings of the Genetic and Evolutionary Computation Conference Companion (ACM, New York, 2014), pp. 1041–1044
G. Chennupati, R.M.A. Azad, C. Ryan, Predict the performance of GE with an ACO based machine learning algorithm, in Proceedings of the Genetic and Evolutionary Computation Conference Companion, ed. by D.V. Arnold, E. Alba (ACM, New York, 2014), pp. 1353–1360
G. Chennupati, R.M.A. Azad, C. Ryan, On the automatic generation of efficient parallel iterative sorting algorithms, in Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO Companion ’15 (ACM, New York, 2015), pp. 1369–1370
G. Chennupati, R.M.A. Azad, C. Ryan, Automatic evolution of parallel recursive programs, in Proceedings of the 18th European Conference on Genetic Programming, EuroGP’15, ed. by P. Machado, M.I. Heywood, J. McDermott, M. Castelli, P. García-Sánchez, P. Burelli, S. Risi, K. Sim (Springer, Berlin, 2015), pp. 167–178
G. Chennupati, R.M.A. Azad, C. Ryan, Automatic evolution of parallel sorting programs on multi-cores, in Proceedings of the 18th European Conference on Applications of Evolutionary Computation, EvoApplications’15, ed. by A.M. Mora, G. Squillero (Springer, Berlin, 2015), pp. 706–717
G. Chennupati, R.M.A. Azad, C. Ryan, Performance optimization of multi-core grammatical evolution generated parallel recursive programs, in Proceedings of Genetic and Evolutionary Computation Conference, GECCO’15 (ACM, New York, 2015), pp. 1007–1014
G. Chennupati, R.M.A. Azad, C. Ryan, Synthesis of parallel iterative sorts with multi-core grammatical evolution, in Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO Companion ’15 (ACM, New York, 2015), pp. 1059–1066
G. Chennupati, R.M.A. Azad, C. Ryan, Automatic lock-free parallel programming on multi-core processors, in Proceedings of the IEEE Congress on Evolutionary Computation, CEC ’16 (IEEE, New York, 2016), pp. 4143–4150
G. Chennupati, N. Santhi, S. Eidenbenz, S. Thulasidasan, AMM: scalable memory reuse model to predict the performance of physics codes, in International Conference on Cluster Computing (CLUSTER) (2017), pp. 649–650
L. Dagum, R. Menon, OpenMP: an industry-standard API for shared-memory programming. IEEE Comput. Sci. Eng. 5(1), 46–55 (1998)
J. Demšar, Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)
H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, D. Burger, Dark silicon and the end of multicore scaling. SIGARCH Comput. Archit. News 39(3), 365–376 (2011)
S. García, F. Herrera, An extension on “statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. J. Mach. Learn. Res. 9, 2677–2694 (2008)
W. Gropp, E. Lusk, N. Doss, A. Skjellum, A high-performance, portable implementation of the MPI message passing interface standard. Parallel Comput. 22(6), 789–828 (1996)
W.D. Hillis, Co-evolving parasites improve simulated evolution as an optimization procedure. Phys. D Nonlinear Phenom. 42(1), 228–234 (1990)
J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla, M. Konow, M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K. Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. Van Der Wijngaart, T. Mattson, A 48-core IA-32 message-passing processor with DVFS in 45 nm CMOS, in Proceedings of the 16th International Solid-State Circuits Conference, ISSCC ’10 (IEEE, New York, 2010), pp. 108–109
K.E.J. Kinnear, Evolving a sort: lessons in genetic programming, in IEEE International Conference on Neural Networks (IEEE, New York, 1993), pp. 881–888
K.E.J. Kinnear, Generality and difficulty in genetic programming: evolving a sort, in Proceedings of the 5th International Conference on Genetic Algorithms, ed. by S. Forrest (Morgan Kaufmann, Los Altos, 1993), pp. 287–294
D.E. Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching, 2nd edn. (Addison Wesley Longman Publishing, Redwood City, 1998)
J.R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection (MIT Press, Cambridge, MA, 1992)
T. Mattson, M. Wrinn, Parallel programming: can we PLEASE get it right this time?, in 45th Design Automation Conference (IEEE, New York, 2008), pp. 7–11
A. Moraglio, F.E.B. Otero, C.G. Johnson, S. Thompson, A.A. Freitas, Evolving recursive programs using non-recursive scaffolding, in IEEE Congress on Evolutionary Computation (IEEE, New York, 2012), pp. 1–8
M. Nicolau, D. Slattery, libGE: grammatical evolution library (2006), http://bds.ul.ie/libGE/index.html
A. Nisbet, GAPS: a compiler framework for genetic algorithm (GA) optimised parallelisation, in High-Performance Computing and Networking, ed. by P. Sloot, M. Bubak, B. Hertzberger. Lecture Notes in Computer Science, vol. 1401 (Springer, Berlin, 1998), pp. 987–989
M.F.P. O’Boyle, J.M. Bull, Expert programmer versus parallelizing compiler: a comparative study of two approaches for distributed shared memory. Sci. Program. Parallel Comput. Proj. Swiss Prior. Programme 5(1), 63–88 (1996)
M. O’Neill, C. Ryan, Grammatical Evolution: Evolutionary Automatic Programming in an Arbitrary Language (Kluwer Academic Publishers, Norwell, 2003)
M. O’Neill, M. Nicolau, A. Agapitos, Experiments in program synthesis with grammatical evolution: a focus on integer sorting, in IEEE Congress on Evolutionary Computation (IEEE, New York, 2014), pp. 1504–1511
U.-M. O’Reilly, F. Oppacher, An experimental perspective on genetic programming, in Parallel Problem Solving from Nature, ed. by R. Männer, B. Manderick, vol. 2 (Elsevier Science, Amsterdam, 1992), pp. 331–340
U.-M. O’Reilly, F. Oppacher, Chapter 2: A comparative analysis of genetic programming, in Advances in Genetic Programming, ed. by P.J. Angeline, K.E. Kinnear Jr., vol. 2 (MIT Press, Cambridge, MA, 1996), pp. 23–44
D. Patterson, The trouble with multi-core. IEEE Spectr. 47(7), 28–32, 53 (2010)
C. Ryan, Automatic Re-engineering of Software Using Genetic Programming. Genetic Programming, vol. 2 (Springer, Berlin, 1999)
L. Spector, J. Klein, M. Keijzer, The Push3 execution stack and the evolution of control, in Proceedings of the Genetic and Evolutionary Computation Conference (ACM, New York, 2005), pp. 1689–1696
C. Stephen, Multicore processors create software headaches, Technical report, MIT Technology Review, April 2010
J. Torrellas, M. Lam, J.L. Hennessy, False sharing and spatial locality in multiprocessor caches. IEEE Trans. Comput. 43(6), 651–663 (1994)
A. Trenaman, Concurrent genetic programming, tartarus and dancing agents, in Genetic Programming, ed. by R. Poli, P. Nordin, W.B. Langdon, T.C. Fogarty. Lecture Notes in Computer Science, vol. 1598 (Springer, Berlin, 1999), pp. 270–282
A. Vargha, H.D. Delaney, A critique and improvement of the “CL” common language effect size statistics of McGraw and Wong. J. Educ. Behav. Stat. 25(2), 101–132 (2000)
Z. Wang, M.F. O’Boyle, Mapping parallelism to multi-cores: A machine learning based approach, in Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’09 (ACM, New York, 2009), pp. 75–84
P.A. Whigham, Grammatical bias for evolutionary learning, PhD thesis, University of New South Wales, New South Wales, Australia, 1996
P.A. Whigham, R.I. McKay, Genetic approaches to learning recursive relations, in Progress in Evolutionary Computation, ed. by X. Yao. Lecture Notes in Artificial Intelligence (Springer, Berlin, 1995), pp. 17–27
K.P. Williams, Evolutionary algorithms for automatic parallelization, PhD thesis, University of Reading, 1998
M.L. Wong, T. Mun, Evolving recursive programs by using adaptive grammar based genetic programming. Genet. Program Evolvable Mach. 6(4), 421–455 (2005)