1 Introduction

As multi-core processors become the norm, researchers are fabricating thousands of cores on a single chip [6, 28, 32]. As the number of cores on a chip increases, programming them efficiently becomes increasingly complex, often to the point where software is the limiting factor in speeding up tasks. Indeed, high performance computing developers [37, 46, 49] have identified that software is trailing behind the rise of multi-cores. The inability of sequential software to scale with multi-cores forces programmers to write parallel programs that exploit them.

Parallel programming APIs such as MPI [30] and OpenMP [26] help exploit the higher processing power of multi-cores; OpenMP targets shared memory architectures. Writing parallel programs with either of these standards is challenging compared to sequential programming [41]. The challenges include identifying the available parallelism, configuring the shared data, using locks for mutual exclusion to guarantee correctness of the code, and synchronizing and balancing the workload among multiple processors.

Alternatively, automatic parallelization transforms a sequential program into semantically equivalent parallel code. Automatic parallelization compilers include Polaris [8], SUIF [5], and the Vienna Fortran Compiler [7]. Automatic parallelization remains difficult, and the burden merely moves from the software developer to a compiler engineer. Later, engineers' efforts were augmented, and in some cases replaced, with machine learning [53]. Clearly, we need better tools to fully exploit multi-cores.

We introduce an automatic parallel programming tool, Grammatical Automatic Parallel Programming (GAPP), to reduce the gap between traditional parallel programming and the difficulties it poses for humans. GAPP combines Grammatical Evolution (GE) with the design of parallel context-free grammars (CFGs). GAPP predominantly addresses parallel programming on shared memory architectures; therefore, we use OpenMP parallelization constructs to guarantee parallelism. OpenMP primitives are an integral part of the grammars, and GE together with these primitives creates a feasible solution space of parallel programs.

We examine GAPP in synthesizing parallel programs in both the recursion and iterative sorting domains. We study the performance, measured in terms of speed-up, and the effort required to synthesize a program, measured in terms of the number of generations. The results indicate that GAPP generates correct and efficient parallel programs. We extend GAPP and, as a result, witness a slight improvement in the performance of the resultant parallel programs. At this stage, as a result of the improvements, we encounter a peculiar behaviour in the execution of the synthesized parallel programs. This behaviour differs between the two problem domains: recursive parallel programs exhibit excessive parallelism, while iterative sorting programs suffer from false sharing. To address these challenges, we further extend GAPP by slightly modifying the design of the grammars. The enhancements resolve these hurdles while improving the performance of the synthesized parallel programs.

We organize the rest of the paper as follows: Sect. 2 describes the existing work; Sect. 3 describes GAPP on both the problem domains; Sect. 4 presents the experimental parameters; Sect. 5 demonstrates the experimental results; and Sect. 6 analyses and extends GAPP; finally, Sect. 7 concludes.

2 Related Research

2.1 Evolutionary Techniques for Recursion

Some of the earliest work on evolving recursion is from Koza [36, Chapter-18], which evolved the Fibonacci sequence; this work cached previously computed recursive calls for efficiency. Brave [10] used Automatically Defined Functions (ADFs) to evolve recursive tree search, in which recursion terminated upon reaching the tree depth. Later, [55] concluded that infinite recursion was a major obstacle to evolving recursive programs. However, Wong and Mun [57] successfully used an adaptive grammar to evolve recursive programs; the grammar adjusted the production rule weights while evolving solutions.

Spector et al. [48] evolved recursive programs using PushGP by explicitly manipulating its execution stack. The evolved programs were of \(O(n^2)\) complexity, which became \(O(n \log n)\) with an efficiency component in the fitness evaluation. Moraglio et al. [38] used a non-recursive scaffolding method to evolve recursive programs with a CFG-based GP. Recently, Agapitos et al. [4] presented a review of GP for recursion.

2.2 Evolutionary Techniques for Sorting

In evolving sorting networks, Hillis [31] evolved a minimal 16-input network. O'Reilly and Oppacher [44] initially failed to evolve sorting with genetic programming (GP); however, they succeeded in [45] with a swap primitive. Later, Kinnear [33, 34] generated a bubble sort by swapping disordered adjacent elements. Abbott [1] used Object Oriented Genetic Programming (OOGP) for insertion and bubble sorts. Spector et al. [48] used PushGP for recursive sorting of \(O(n^2)\) complexity, enhanced to \(O(n \log n)\) by adding an efficiency component.

Recently, Agapitos and Lucas [2, 3] evolved efficient recursive quicksort using OOGP in Java; the evolved programs were of \(O(n \log n)\) complexity. Then, O'Neill et al. [43] applied GE to program synthesis by evolving an iterative bubble sort in Python; the evolved programs had quadratic \(O(n^2)\) complexity. Most of these attempts are of quadratic \(O(n^2)\) complexity, while those in [2, 48] are \(O(n \log n)\).

2.3 Automatic Evolution of Parallel Programs

In general, automatic generation of parallel programs can be divided into two types: auto-parallelization of serial code and the generation of native parallel code.

Auto-parallelization requires a serial program. Using GP, [47, Chapter-5] proposed Paragen, which had initial success; however, the execution of candidate solutions for fitness evaluation ran into difficulties with complex and time consuming loops. Later, Paragen-II [47, Chapter-7] dealt with loop inter-dependencies relying on a rough estimate of time. Then, [47] extended Paragen-II to merge independent tasks of loops.

Similarly, genetic algorithms evolved transformations: [40] and [56] proposed GAPS (Genetic Algorithm Parallelization System) and Revolver respectively. GAPS evolved sequence restructuring, while Revolver transformed loops and programs; both optimized the execution time. On the other hand, native parallel code generation produces a working program that is also parallel. With multi-tree GP, [51] concurrently executed autonomous agents for the automatic design of controllers.

Unlike Paragen-II [47], GAPP does not utilize dependency analysis; instead, GE works the data interdependencies out by selecting pragmas that guarantee program correctness. Recently, Chennupati et al. [15, 17] evolved natively parallel regression programs. Thereafter, MCGE-II [20] evolved task parallel recursive programs. The minimal execution time of the synthesized programs was merely due to the presence of OpenMP pragmas, which automatically map threads to cores. However, the use of a different OpenMP pragma alters the performance of a parallel program, and skilled parallel programmers carefully choose the pragmas when writing code. To that end, in this paper, we extend MCGE-II in two ways: we re-structure the grammars so that task and data level parallelism are separated, and we explicitly penalize long executions.

3 Grammatical Automatic Parallel Programming

Grammatical Automatic Parallel Programming (GAPP) presents the first instance of using grammars for the task of automatic parallel programming. GAPP provides an alternative to the craftsman approach of parallel programming. This is significantly different from other parallel EC approaches, because not only do we produce individuals that, in their final form, can exploit parallel architectures, we also exploit the same parallel architecture during evolution to reduce the execution time.

Figure 1 presents an overview of GAPP, which operates on a string of codons that separates the search and solution spaces. Like any application of GE, GAPP uses the typical search process, genetic operations and genotype-phenotype mapping. However, the major contribution of GAPP is the design of grammars that produce parallel programs, in which OpenMP primitives are an integral part of the grammars and create a feasible solution space of parallel programs. The GE search process helps to find the [near] optimal parallel programs in that space. These programs are evaluated on a number of fitness cases such that the best among them is identified as a parallel program. We now discuss the parallelization strategy of the end programs.

Fig. 1 An overview of GAPP parallel program synthesis

3.1 GAPP for Parallel Recursion

GAPP relies on the grammars designed to produce parallel recursive programs [20]. We discuss the design of parallel recursive grammars for Fibonacci.

OpenMP Pragmas

OpenMP is a portable, scalable, directive-based specification for writing parallel programs on shared memory systems. It consists of compiler directives, environment variables and run time libraries that are used to designate parallelism in C/C++ and Fortran. The directives are special preprocessor instructions, termed pragmas, that follow fork-join parallelism.

Some OpenMP pragmas are as follows. parallel for is a loop construct that distributes the iterations of a loop among the threads; its use is limited to a for loop with defined boundaries, that is, a loop with a terminating condition. Another pragma, parallel sections, defines a parallel region in which each section is handled independently. If there are more threads than independent blocks, the remaining threads stay idle; otherwise, threads execute multiple code blocks. parallel task is another work-sharing pragma that works similarly to the section construct. Notice that the use of omp task is not optional. A detailed description of the OpenMP API can be found in [12].
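As a rough illustration (not taken from the GAPP grammars; the function and variable names are ours), the following C fragment shows how these constructs might be written:

#include <omp.h>

void pragma_examples(int *a, int n) {
    /* parallel for: iterations of a bounded loop are divided among the threads */
    #pragma omp parallel for shared(a)
    for (int i = 0; i < n; i++)
        a[i] *= 2;

    /* parallel sections: each section is an independent block run by one thread */
    #pragma omp parallel sections
    {
        #pragma omp section
        { /* independent code block 1 */ }
        #pragma omp section
        { /* independent code block 2 */ }
    }

    /* task: units of work created inside a parallel region and deferred to idle threads */
    #pragma omp parallel
    {
        #pragma omp single
        {
            #pragma omp task
            { /* deferred unit of work */ }
        }
    }
}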

Design of Grammars

Figure 2 presents the GAPP grammar for the synthesis of the Fibonacci program. The program begins at <program>, which derives the symbols <condition> and <parcode>. The symbols <omptask> and <ompdata> represent the task and data parallel pragmas, while <omppragma> selects one of the two, giving a clear separation between task and data parallelism. This helps to accelerate the evolution of solutions because of the grammatical bias [54]. The design constrains the search to explore either the data or the task parallel space rather than both.

Fig. 2 Design of GAPP grammar to synthesize parallel recursive Fibonacci programs

The grammar has shared (<shared>) and private (<private>) clauses. The input (<input>) and two variables (temp, res) are shared among the threads. Input represents the nth Fibonacci number, while variable res returns the result of parallel execution. The local variables (temp, a) store the auxiliary results of recursive calls.

The variable "a" is thread private. OpenMP offers three private (<private>) clauses: private(a) makes the variable thread-specific, such that any changes to it are invisible outside the parallel region; firstprivate(a) additionally initializes each thread's copy with the value the variable held before the region; lastprivate(a) copies the value from the last iteration (or section) of the parallel region back to the original variable. Evolution selects one of the three private clauses depending on the problem.
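A minimal C sketch of the three clauses, with an assumed variable a, is given below; it is only meant to illustrate the semantics described above:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int a = 10;

    /* private: each thread gets its own uninitialized copy of a; the original a is untouched */
    #pragma omp parallel private(a)
    {
        a = omp_get_thread_num();   /* visible only inside this thread's copy */
    }

    /* firstprivate: each thread's copy starts with the value a held before the region (10) */
    #pragma omp parallel firstprivate(a)
    {
        a = a + omp_get_thread_num();
    }

    /* lastprivate: the copy from the last loop iteration is written back to a */
    #pragma omp parallel for lastprivate(a)
    for (int i = 0; i < 100; i++)
        a = i;

    printf("a after the lastprivate loop: %d\n", a);   /* prints 99 */
    return 0;
}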

The non-terminal <parblocks> produces parallel code blocks that are mapped through <blocks>. The non-terminal <blocks> generates a sequence of parallel blocks, each containing an independent recursive call; these parallel code blocks ensure task level parallel execution. The non-terminal <stmt> depicts the recursive call of the Fibonacci program, while the symbols <bop> and <lop> refer to binary arithmetic and logical operators respectively. The symbol <const> maps to integer constants. The base case is generated from the input variable, logical operators and constants through the non-terminals <line1> and <line2>. The non-terminal <expr> expresses the recursive calls and is used in <parcode> and <blocks>.
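Since the grammar itself appears in Fig. 2, the following hand-written C sketch only suggests the kind of task parallel Fibonacci program that lies in this solution space; the chosen pragma (parallel sections), the variable names and the structure are illustrative assumptions rather than an actual GAPP output:

#include <omp.h>

/* Hypothetical task parallel Fibonacci of the kind the Fig. 2 grammar can derive. */
long fib(long input) {
    long temp = 0, res = 0;      /* shared across the threads of the region */
    long a = 0;                  /* thread private auxiliary variable */
    if (input < 2)               /* base case derived from <condition> */
        return input;
    #pragma omp parallel sections shared(input, temp, res) private(a)
    {
        #pragma omp section
        {                        /* first independent recursive call (<blocks>) */
            a = fib(input - 1);
            temp = a;
        }
        #pragma omp section
        {                        /* second independent recursive call (<blocks>) */
            res = fib(input - 2);
        }
    }
    return temp + res;           /* combined with a binary operator (<bop>) */
}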

Performance Optimization

We encourage parallelism by including the execution time in the fitness function. The run time exerts external selection pressure, which helps in selecting an appropriate parallelization primitive. Therefore, the fitness function is the product of two factors, execution time and mean absolute error, both normalized into the range [0, 1]; it is a maximization function. The following equation computes the fitness of an evolving parallel recursive program (\(f_{rprog}\)):

$$\displaystyle \begin{aligned} f_{rprog} =\dfrac{1}{\big(1+t\big)} * \dfrac{1}{\bigg(1+\dfrac{1}{N}\sum_{i=1}^{N}{\lvert y_{i}-\hat y_{i}\rvert}\bigg)} \end{aligned} $$
(1)

where t is the execution time of a program; \(y_{i}\) and \(\hat y_{i}\) are the actual and evolved outputs respectively. The choice of an OpenMP pragma can significantly impact the execution time of a program, and the presence of an incorrect pragma in the end program has an adverse effect on the fitness evaluation. To limit such effects, the first term of Eq. (1), the normalized execution time, helps to select the correct pragma. That is, the time component influences the performance of the resultant programs, whereby those with minimum execution time become the best parallel programs. Meanwhile, the second term, the normalized mean absolute error, enforces program correctness. Together, the twin objectives push for a correct and efficient parallel program.
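A minimal sketch of how Eq. (1) might be computed for one individual is given below; the function name, the argument layout and the assumption that the execution time t is measured elsewhere are ours:

#include <math.h>

/* Hypothetical fitness evaluation following Eq. (1): t is the measured execution
   time, y and y_hat hold the expected and evolved outputs over N fitness cases. */
double fitness_recursive(double t, const double *y, const double *y_hat, int N) {
    double mae = 0.0;
    for (int i = 0; i < N; i++)
        mae += fabs(y[i] - y_hat[i]);
    mae /= N;
    return (1.0 / (1.0 + t)) * (1.0 / (1.0 + mae));   /* maximized; lies in (0, 1] */
}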

3.2 GAPP for Parallel Iterative Sorting

Design of Grammars

We describe the design of grammars for the synthesis of parallel odd-even sort programs [21]. As with the recursive grammars, the Odd-Even sort grammars are given in [13, Appendix B]. The generation of an end program starts with the <program> symbol, which derives the <for_out> and <condition> symbols. The non-terminal <for_out> maps to an outer "for loop". Note that GE fails to generate correct loop structures [43]; hence, we preserve the loops in synthesizing the iterative sorts. The non-terminal <condition> derives problem specific base/termination conditions.

The symbols <schedule> and <type> derive the type of scheduling strategy. For scheduling a parallel for loop, OpenMP offers three clauses: static, dynamic and guided. static divides the work among threads before the loop executes, and dynamic allocates the work during execution. The third type, guided, also divides work during execution, but the allocation begins with the given chunk size (CHUNK) and decreases.

We include the mechanism of swapping the adjacent elements in two phases (odd and even). The input, index (<index>), and the size of the array are shared among all the cores. The temporary variable (temp in <index>) is private to the thread. We use absolute values (abs in <for_in_line> and <swap>) to avoid negative indexes.
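The actual evolved programs are derived from the grammar; the following hand-written C sketch only illustrates the kind of data parallel odd-even sort that the grammar targets, with the schedule clause of the previous paragraph included. The function name, the chosen schedule kind and the placeholder chunk size are assumptions:

#include <omp.h>

#define CHUNK 10   /* placeholder chunk size; GAPP later evolves this value (Sect. 6.2) */

/* Hypothetical data parallel odd-even sort of the kind the sorting grammars target. */
void odd_even_sort(int *input, int size) {
    for (int phase = 0; phase < size; phase++) {          /* outer loop (<for_out>) */
        int start = phase % 2;                            /* odd or even phase */
        #pragma omp parallel for shared(input, size) schedule(static, CHUNK)
        for (int index = start; index < size - 1; index += 2) {
            if (input[index] > input[index + 1]) {        /* disordered adjacent pair */
                int temp = input[index];                  /* temp is private to the thread */
                input[index] = input[index + 1];
                input[index + 1] = temp;
            }
        }
    }
}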

Performance Optimization

As with recursion (Eq. (1)), the time term in the fitness evaluation of parallel iterative sorting helps to choose an appropriate pragma. The accuracy is defined in terms of mean inversions. For example, if \(a_1 a_2 a_3 \ldots a_n\) is a permutation of the set 1, 2, …, n, then the pair \((a_i, a_j)\) is an inversion of the permutation iff i < j and \(a_i > a_j\) [35]. The fitness function (\(f_{sprog}\)) is shown in Eq. (2).

$$\displaystyle \begin{aligned} f_{sprog} =\dfrac{1}{\big(1+t\big)} * \dfrac{1}{\bigg(1+\dfrac{\sum_{i=1}^{N}n\big(I(A_{i})\big)}{TP}\bigg)} \end{aligned} $$
(2)

where t stands for the execution time of the evolved parallel program over all the fitness cases (N); \(n(I(A_{i}))\) is the number of inversions in the ith array \(A_{i}\) (there are N arrays in total); and TP is the total number of pairs in all the N fitness cases.
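A minimal sketch of how the inversion-based error term of Eq. (2) might be computed is shown below; the function names and the quadratic counting method are assumptions made for illustration:

/* Count inversions in one output array A of length n (quadratic, adequate for fitness cases). */
long count_inversions(const int *A, int n) {
    long inv = 0;
    for (int i = 0; i < n - 1; i++)
        for (int j = i + 1; j < n; j++)
            if (A[i] > A[j])
                inv++;
    return inv;
}

/* Hypothetical fitness following Eq. (2): t is the execution time over all N fitness
   cases, arrays holds the N evolved output arrays of length n each, and the total
   number of pairs is TP = N * n * (n - 1) / 2. */
double fitness_sorting(double t, int **arrays, int N, int n) {
    long total_inv = 0;
    for (int i = 0; i < N; i++)
        total_inv += count_inversions(arrays[i], n);
    double TP = (double)N * n * (n - 1) / 2.0;
    return (1.0 / (1.0 + t)) * (1.0 / (1.0 + total_inv / TP));
}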

4 Experiments

We evaluate GAPP on six recursive and four iterative sorting benchmark problems. Table 1 presents all the benchmarks with their properties. Of the six recursive problems, the first three (Sum-of-N, Factorial, Fibonacci) accept a positive integer as input; for Sum-of-N, it is randomly generated from the range [1, 1000], while for Factorial and Fibonacci it is in the range [1, 60] due to the limitations of data types in C. The remaining three problems (Binary-Sum, Reverse, Quicksort) accept an array of integers with its start and end indexes as input, for which an array of 1000 elements is randomly generated from the range [1, 1000]. For the four iterative sorting benchmarks, 100 training cases, each an array of 1000 elements, are randomly generated from the range [1, 1000]. The end programs of these four benchmarks use conditional (if), iterative (for) and variable indexing structures.

Table 1 Summary of both the recursive and iterative sorting benchmarks under investigation with their properties used in the experiments

Table 2 describes the algorithmic and hardware parameters. The grammars are general enough except for a few minor changes with respect to the problem at hand.

Table 2 Parameters and experimental environment

Generality of Grammars

The grammars for the Fibonacci (Fig. 2) and Odd-Even sort benchmarks represent both experimental domains. The grammars for the other benchmarks are 90% similar: all of them share the common OpenMP pragmas while differing in some domain specific knowledge. Grammars for all the benchmarks are presented in [13, Appendix B]. We evolve programs in C; however, GAPP is general enough to apply to any programming language that offers OpenMP like parallelism. For example, JOMP [11] is an OpenMP API for Java with which GAPP could synthesize parallel Java programs.

4.1 GAPP Variants

Given the two distinguishing features of GAPP (the design of grammars and the performance optimization), we study their influence on the synthesizability and fitness evaluation of the parallel programs. The study contains four GAPP variants. The first, named GAPP (Unoptimized), uses neither the separation of task and data parallel primitives nor the time component of the performance optimization (shown in Eqs. (1) and (2)). The second, GAPP (Grammar), uses the design of grammars with parallel primitives but not the time component. The third, GAPP (Time), neglects the separation of task and data parallel primitives but uses the time in the performance optimization. Finally, the fourth, GAPP (Combined), uses both the design of grammars and the performance optimization.

5 Results

We present experimental results of GAPP for both the recursion and iterative sorting domains. The results report two measures, speed-up and mean best generation (MBG): speed-up reflects the performance of the synthesized parallel programs, while MBG shows the time taken to synthesize a best-of-run program in terms of generations.

Speed-Up

Speed-up is defined as the ratio of the mean best execution time (MBT) of the synthesized parallel programs on 1 core to that on n cores, as shown in Eq. (3):

$$\displaystyle \begin{aligned} \mathrm{Speed-up} =\dfrac{T_{MBT-1-core}}{T_{MBT-n-cores}} \end{aligned} $$
(3)

where \(T_{MBT-1-core}\) is the mean best execution time on a single core, while \(T_{MBT-n-cores}\) is that on n cores of a processor. The mean best execution time (MBT) is defined as the mean of the execution times of the best-of-generation programs across all generations and experimental runs of GAPP, as shown in Eq. (4):

$$\displaystyle \begin{aligned} T_{MBT} =\dfrac{\sum\nolimits_{r=1}^{R}\sum\nolimits_{g=1}^{G}{T_{bprog}(g)}}{R\times G} \end{aligned} $$
(4)

where \(T_{bprog}(g)\) is the execution time of the best program in a given generation g, G is the number of generations, r is a run, and R is the number of runs.

Mean Best Generation (MBG)

Mean best generation (MBG) is defined as the number of generations required to converge to the best fitness, with a pre-condition that the program under consideration must be correct, averaged across R runs. MBG helps to investigate the effect of restructuring grammars on the synthesizability (ease of evolving) of the correct parallel programs.

5.1 Recursion

Figure 3 presents the speed-up of each of the four GAPP variants (Unoptimized, Grammar, Time, Combined) at different core counts for all six recursive benchmarks. The results indicate that the performance of GAPP improves as the number of cores increases. Non-parametric Friedman tests [27] are used to assess the significance of these results.

Fig. 3 The speed-up of the GAPP (Unoptimized, Grammar, Time, Combined) variants for all the six experimental problems. The number of cores varies as 2, 4, 8 and 16. The horizontal dashed (- -) line represents a speed-up of 1 and acts as a reference for the remaining results

Table 3 shows the non-parametric Friedman tests with Hommel's post-hoc [29] analysis on the speed-up of GAPP for the recursive problems at α = 0.05. The first column shows the number of cores, the second the GAPP variant, and the third the average rank. The fourth and fifth columns show the p-value and p-Hommel. The variant with the lowest average rank is the best variant (GAPP (Combined)) and is marked with an asterisk (*). A variant is significantly different from the best variant, and is shown in boldface, if its p-value is less than the critical p-Hommel at α = 0.05.

Table 3 Friedman statistical tests with Hommel’s post-hoc analysis on speed-up of all the four GAPP variants

The performance on 2 cores is insignificant as the cost of thread overheads offsets the performance gains. For 4 cores, GAPP (Combined) significantly outperforms the remaining three variants. For 8 and 16 cores, GAPP (Combined) outperforms GAPP (Unoptimized) and GAPP (Grammar), while the difference with GAPP (Time) is insignificant due to the presence of execution time in both of their fitness evaluations.

Table 4 shows the MBG of the four GAPP variants together with the statistical tests. GAPP (Grammar) outperforms GAPP (Unoptimized, Time, Combined), requiring fewer generations than the remaining variants to synthesize the best programs.

Table 4 The mean best generation (MBG ± [standard deviation]) of all the four GAPP (Unoptimized, Grammar, Time, Combined) variants on 16 cores and the lowest MBG is in boldface

Although GAPP (Grammar) takes fewer generations to synthesize parallel recursive programs, the performance results (Fig. 3) show that GAPP (Time, Combined) outperform GAPP (Grammar): the programs synthesized by GAPP (Grammar) are not as efficient as those of GAPP (Time, Combined). However, GAPP (Combined) outperforms GAPP (Time) in terms of MBG (Table 4), so GAPP (Combined) is quicker to synthesize efficient parallel recursive programs. Therefore, GAPP (Combined) is the best variant; it reports an average (over the recursive problems) speed-up of 8.13 on 16 cores, an improvement of 23.86% over GAPP (Unoptimized), which reports a speed-up of 6.19.

5.2 Iterative Sorting

Figure 4 shows the speed-up of GAPP (Unoptimized, Grammar, Time, Combined) on the four iterative sorting benchmarks for 2, 4, 8 and 16 cores. Table 5 shows the Friedman tests with Hommel's post-hoc analysis on the speed-up of GAPP (Unoptimized, Grammar, Time, Combined) for 16 cores. The variant with the lowest rank is the best variant (GAPP (Combined)) and is marked with an asterisk (*).

Fig. 4 The speed-up of all the four GAPP (Unoptimized, Grammar, Time, Combined) variants on the four iterative sorting problems for 2, 4, 8, and 16 cores

Table 5 Statistical tests on speed-up of GAPP for iterative sorting with 16 cores

For 4 cores, GAPP (Combined) outperforms GAPP (Unoptimized), while its difference from GAPP (Grammar, Time) is insignificant. For 8 and 16 cores, GAPP (Combined) outperforms GAPP (Unoptimized, Grammar), and its difference from GAPP (Time) is insignificant. GAPP (Combined) shows an average speed-up of 11.03, an improvement of 15.75% over GAPP (Unoptimized), which has an average speed-up of 9.29.

Table 6 compares the MBG of GAPP (Unoptimized, Grammar, Time, Combined) and their statistical significance. GAPP (Grammar) outperforms GAPP (Unoptimized, Time), while its difference from GAPP (Combined) is insignificant. GAPP (Grammar) produces parallel iterative sorting programs in fewer generations than GAPP (Time) because of the variation in the design of grammars among the variants, which impacts the evolution of programs. However, GAPP (Combined) evolves more efficient programs than GAPP (Grammar) (see Fig. 4). Therefore, GAPP (Combined) is the best variant for the evolution of efficient parallel iterative sorting programs.

Table 6 The mean best generation (MBG ± [standard deviation]) of all the GAPP (Unoptimized, Grammar, Time, Combined) variants (The lowest generation is in boldface)

6 Enhancements in GAPP

We analyze the effect of OpenMP thread scheduling on the performance of the GAPP evolved parallel programs, in both the recursion and iterative sorting domains. We find that code growth in GAPP is surprisingly insignificant [19, 22, 23] and therefore does not affect program execution. We also find that thread scheduling has a distinct influence in each of the two problem domains.

6.1 Recursion

The quality of parallel code is difficult to quantify as the execution time often depends on the ability of the OS to schedule the tasks efficiently. This job itself is complicated by other parallel threads (from other programs) running at the same time. OpenMP abstracts much of these concerns away from programmers, which makes parallel programming easier at the cost of some fine control. We compensate for this by adapting a program to the hardware.

Excessive Parallelism

The hardware caps the maximum number of threads; however, in the grammars each recursive call spawns a new thread. The OS, specifically the Linux kernel, eventually fails to scale when scheduling a high number of threads [9]. Moreover, when a parent thread spawns child threads, it sleeps until all the child threads have finished; this is expensive when a large number of threads are involved. Memory access restrictions over shared and private variables can add to the complexity of the executing code. Complexity in this instance comes from the vagaries of scheduling what can be a high number of threads. We extend GAPP to overcome these limitations.

Extending GAPP for Recursion

Armed with the knowledge of excessive parallelism, we constrain the system so as to optimize the degree of parallelism. We combine parallel and serial implementations of the evolved programs, which further improves the performance. This reduces the overhead caused by excessive parallelism: the top level recursive calls distribute the load across a number of threads, whereas the lower level calls carry out the work directly instead of merely invoking more threads. Evolution detects the exact level at which recursion switches from parallel to serial. The intermediate results are saved temporarily in an auxiliary variable and are shared amongst all the executing threads. This stops the creation of an exponential number of threads and thereby reduces the overhead caused by excessive parallelism.

The GAPP grammars used for the recursive benchmarks (Sect. 3.1) are modified as shown in Fig. 5, termed GAPP (Scaled) hereafter. We alter the non-terminal <condition> to synthesize nested if-else condition blocks. The modified grammar evolves a two digit thread limiting constant at which the program switches to sequential execution, which reduces the execution time of the final programs.

Fig. 5 The enhanced GAPP grammar to synthesize parallel recursive Fibonacci programs

Figure 6 shows an example of a GAPP (Scaled) generated Fibonacci program. Evolution finds a thread limiting constant for the given problem and computational environment; below this constant, the program stops creating threads and continues to execute serially. The intermediate result (temp in the else if branch) is shared among the threads, which further optimizes the execution time and thereby exploits the power of multi-cores efficiently.

Fig. 6 GAPP (Scaled) evolved program that combines both parallel and serial execution

The constant (39) in the else if condition (in Fig. 6) is the thread limiting constant for 16 cores. Figure 7 shows the thread limiting constants (with standard deviations) with respect to the number of cores for the six benchmarks. These constants adapt to the underlying hardware architecture. For example, the constant 39 may not be optimal for a much larger input (say, an array of 1,000,000 elements), where a bigger constant would be needed; this is addressed with digit concatenation grammars [42, Chapter 5].
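The evolved program itself is shown in Fig. 6; the following hand-written sketch only illustrates how such a thread limiting constant might switch execution from parallel to serial. The serial helper, the direction of the comparison and the use of parallel sections are assumptions:

#include <omp.h>

#define LIMIT 39   /* evolved thread limiting constant for 16 cores (Fig. 7) */

long fib_serial(long n) {                  /* plain serial recursion below the limit */
    return (n < 2) ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

long fib_scaled(long input) {
    long temp = 0, res = 0;
    if (input < 2)
        return input;
    else if (input < LIMIT)                /* evolved cutoff: stop spawning threads */
        return fib_serial(input);
    #pragma omp parallel sections shared(input, temp, res)
    {
        #pragma omp section
        temp = fib_scaled(input - 1);      /* upper recursion levels remain parallel */
        #pragma omp section
        res = fib_scaled(input - 2);
    }
    return temp + res;
}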

Fig. 7 GAPP (Scaled) evolved thread limiting constants of the six recursive benchmarks

Figure 8 shows the speed-up of GAPP (Scaled) over all six benchmarks for 2, 4, 8, and 16 cores. Like the other GAPP variants, the speed-up of GAPP (Scaled) improves with an increase in the number of cores. In particular, GAPP (Scaled) performs noticeably better than its counterparts.

Fig. 8 The performance of GAPP (Scaled) programs for all the six recursive benchmarks

Table 7 presents the mean best generation (MBG) of the GAPP (Unoptimized, Grammar, Time, Combined, Scaled) variants. The results show that GAPP (Grammar) generates a program faster than the GAPP (Combined, Scaled) variants because of the grammatical bias. The latter two variants use execution time in the fitness evaluation, which makes evolution harder; nevertheless, GAPP (Scaled) generates the most efficient task parallel recursive programs.

Table 7 The mean best generation (MBG ± [standard deviation]) of GAPP (Grammar, Combined, Scaled)

Table 8 shows the non-parametric Friedman tests with Hommel’s post-hoc analysis on speed-up and MBG of GAPP variants. The best variant with the lowest rank is marked with an asterisk (*), and significantly different variants are in boldface.

Table 8 Friedman statistical tests with Hommel’s post-hoc analysis on speed-up and MBG of GAPP (Unoptimized, Grammar, Time, Combined, Scaled)

For speed-up, GAPP (Scaled) outperforms the remaining four GAPP variants. Note that these results are for 16 cores of a processor and are similar for 8 cores, while they are insignificant for 4 cores and below. On average, for 16 cores, GAPP (Scaled) shows a speed-up of 9.97, a significant improvement of 17.45% over GAPP (Combined) and of 37.91% over GAPP (Unoptimized).

For MBG, GAPP (Grammar) outperforms the other four GAPP variants. These results are for 16 cores and are similar for 8 and below. Although GAPP (Scaled) requires slightly more generations than the other variants, it is the best amongst all the GAPP variants in this paper, as it generates the most efficient parallel recursive programs.

A related solution is to keep a table that records the result of each recursive call on its first evaluation and then consult the table for repeated recursive calls, similar to Koza [36]. However, that approach has often been criticized [38] for not being exact recursion. We now analyze and extend GAPP for the iterative sorting domain.

6.2 Iterative Sorting

In contrast to the excessive parallelism in recursion, factors such as OpenMP workload scheduling play a vital role in optimizing the performance of the synthesized iterative sorting programs. OpenMP hides these details from the developer, which makes it easy to use but, at the same time, hard to realize its full potential. Load balancing among parallel threads is a serious concern on shared memory processors. The OpenMP scheduling strategies (static, dynamic, guided) address these performance issues effectively. However, explicitly assigning an optimal chunk size (chunk) is difficult, as the ideal value often requires problem specific knowledge; the appropriate chunk size changes with the loop iterations, the number of cores, and the threads under execution. Moreover, small chunks of data lead to the well known parallel programming challenge of false sharing.

False Sharing

False sharing is a performance problem that occurs when threads on different cores modify variables that reside on the same cache line [50], which invalidates the cache line and forces a memory update, thereby reducing performance. Precisely, if one core loads a cache line already loaded by another core, that line is marked as "shared". If a core then stores to the shared cache line, the line is marked as "modified" and all the remaining cores receive a cache line "invalid" message. Thereafter, if any core tries to access the cache line marked as modified, that line is written back to memory and marked as "shared" again, and the other cores that try to access the same cache line incur a cache miss. This frequent coordination among the cores, cache lines and memory caused by false sharing significantly degrades the performance of an application.

False sharing can be avoided by placing the variables far apart in memory (using compiler directives) so that they do not fall on the same cache line. In the case of arrays, it can be avoided by aligning the array elements on a cache line boundary. If this is impossible, we can set the array size to double the cache line size, which is possible when array sizes are allocated dynamically. Our extensions of GAPP show that controlling the array sizes helps to deal with false sharing, improving the performance of the evolving programs.
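As a hedged illustration of the padding and alignment ideas above (this is the general technique, not the mechanism GAPP itself uses; GAPP instead controls chunk sizes), a C11 sketch might look as follows:

#include <stdlib.h>

#define CACHE_LINE 64   /* assumed cache line size in bytes */

/* Per-thread counters padded to one cache line each, so that two threads
   never update fields that fall on the same cache line. */
typedef struct {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
} padded_counter_t;

/* Allocate an integer array aligned on a cache line boundary (C11 aligned_alloc
   requires the size to be a multiple of the alignment, hence the rounding). */
int *make_aligned_array(size_t n) {
    size_t bytes = ((n * sizeof(int) + CACHE_LINE - 1) / CACHE_LINE) * CACHE_LINE;
    return aligned_alloc(CACHE_LINE, bytes);
}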

Extending GAPP for Iterative Sorting

This section proposes a solution to false sharing that further extends GAPP to evolve more efficient parallel iterative sorting programs. We overcome the load balancing problem by evolving an appropriate chunk size that is independent of the problem and of the number of cores on which it executes. We adopt digit concatenation grammars [42], originally used for symbolic regression.

Figure 9 shows the modified GAPP grammar that automatically generates a sequence of digits. The evolved chunk size adapts to the number of cores, the amount of load, and the number of threads. The proposed enhancements evolve more efficient programs.

Fig. 9 The enhanced GAPP grammar to synthesize parallel iterative sorting programs with evolved chunk sizes

Figure 10 presents a successfully evolved parallel iterative Odd-Even sort program using the GAPP (Scaled) grammars. Note that the program contains two constants (89, 87) as it operates in two phases (odd and even).

Fig. 10 Evolved Odd-Even program that shows efficient performance
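A minimal sketch of how the two evolved constants might plug into the schedule clauses of the two phases is given below; the actual evolved program is the one in Fig. 10, and the schedule kind and loop structure here are assumptions:

#include <omp.h>

/* The two evolved constants (89, 87 in Fig. 10) plugged into the schedule
   clauses of the odd and even phases; the schedule kind is an assumption. */
void odd_even_sort_evolved(int *input, int size) {
    for (int phase = 0; phase < size; phase++) {
        if (phase % 2 == 1) {
            #pragma omp parallel for shared(input, size) schedule(static, 89)
            for (int i = 1; i < size - 1; i += 2)
                if (input[i] > input[i + 1]) {
                    int t = input[i]; input[i] = input[i + 1]; input[i + 1] = t;
                }
        } else {
            #pragma omp parallel for shared(input, size) schedule(static, 87)
            for (int i = 0; i < size - 1; i += 2)
                if (input[i] > input[i + 1]) {
                    int t = input[i]; input[i] = input[i + 1]; input[i + 1] = t;
                }
        }
    }
}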

Table 9 shows the GAPP (Scaled) evolved constants (chunk sizes), averaged over the best-of-run programs across 50 runs. The chunk results are reported for 8 and 16 cores, where they showed significant performance optimization; the results for 2 and 4 cores are insignificant and hence omitted. The evolved constants balance the load effectively. These chunk sizes create larger chunks of data than the default chunk size of 10, which helps in controlling false sharing; this is evident at the higher core counts, where the performance improves.

Table 9 GAPP (Scaled) evolved chunk size (mean ± [standard deviation]), averaged across 50 runs for all the four experimental problems on 8 and 16 cores respectively

Figure 11 shows the speed-up of the GAPP (Scaled) evolved programs. The results indicate that the performance improves with an increase in the number of cores. GAPP (Scaled) shows an average speed-up of 12.52 for 16 cores, an improvement of 11.91% over GAPP (Combined) and of 25.79% over GAPP (Unoptimized).

Fig. 11 Performance of GAPP (Scaled) on four iterative sorting benchmarks

Table 10 presents the Wilcoxon Signed Rank Sum significance tests between GAPP (Scaled) and GAPP (Combined) at α = 0.05. It reports the p-value for each problem; a p-value below 0.05 indicates that the difference between the results of the two methods is significant.

Table 10 Significance tests (at α = 0.05) show that GAPP (Scaled) outperforms GAPP (Combined) for 8 and 16 cores

The Vargha and Delaney [52] A-measure states how often GAPP (Scaled) outperforms GAPP (Combined). The A-measure lies between 0 and 1: when it is above 0.5, GAPP (Scaled) is better than GAPP (Combined); when it is 0.5, both are equal; when it is below 0.5, GAPP (Combined) is better than GAPP (Scaled). A value close to 0.5 indicates a small difference, otherwise the difference is large. For example, on Bubble sort with 16 cores, GAPP (Scaled) performs better than GAPP (Combined) 35% of the time; in other words, GAPP (Combined) performs better 65% of the time. Overall, GAPP (Scaled) performs better than GAPP (Combined). Similarly, for MBG, GAPP (Grammar) takes fewer generations to evolve a parallel program; however, GAPP (Scaled) exhibits better performance at the cost of a few extra generations.
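For reference, the standard pairwise definition of the A-measure (not code taken from the paper) can be computed as follows:

/* Vargha-Delaney A-measure: the probability that a value drawn from sample x
   exceeds a value drawn from sample y, with ties counted as half. */
double a_measure(const double *x, int m, const double *y, int n) {
    double wins = 0.0;
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            if (x[i] > y[j])       wins += 1.0;
            else if (x[i] == y[j]) wins += 0.5;
        }
    return wins / ((double)m * n);
}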

7 Conclusion

We presented GAPP to automatically generate efficient task parallel recursive and data parallel iterative sorting programs. GAPP separates task and data parallelism in the design of the grammars and includes the execution time in the fitness evaluation. The modifications to the grammar favored quick generation of programs, while the execution time helped in optimizing their performance. We then analyzed the effect of OpenMP thread scheduling on the performance in both problem domains. We curbed excessive parallelism by restricting the degree of parallelism in the evolving programs: we evolved programs that run both in serial (for the lower level recursive calls) and in parallel, which further optimized the performance. The most interesting contribution is the automatic load balancing that adapts to the experimental hardware environment, with which the system further improved the performance of the evolving sorting programs. The chunk size constants for the iterative sorting programs can be made larger or smaller with the help of digit concatenation grammars. GAPP can be further extended to synthesize lock-free parallel programs, as in [24], applicable in the gaming industry. Moreover, the synthesizability of GAPP can be further leveraged, as in [14, 16, 18], to improve the performance of the evolving parallel programs in fewer generations. Similarly, probabilistic approaches [25] for performance prediction can help to further optimize the execution time.