Accelerated parallel genetic programming tree evaluation with OpenCL
Highlights
- We propose a parallel GP implementation in OpenCL for accelerated tree evaluation.
- On the GPU we achieved 13 billion node evaluations per second.
- The GPU throughput is highly sensitive to the underlying parallelization strategy.
- The GPUs are remarkably faster than the multi-core CPUs (up to 24× speedup).
Introduction
Proposed as a technique to automatically build arbitrarily complex computer programs, genetic programming (GP) [19] is beyond doubt a highly ambitious paradigm from the field of evolutionary computation. Strictly speaking, only two requirements need to be satisfied in order to enable GP as a potential solver for a given problem: (i) the problem's solution can be described as a computer program, and (ii) there exists a graduated evaluation metric that, given two candidate solutions, is capable of indicating, with reasonable accuracy, which one is better. The first condition says that candidate solutions must be representable in a GP system, so that they can undergo sustained genetic operations, such as crossover and mutation, which is how programs evolve in GP. The second allows for the implementation of the principle of natural selection, an essential component of every evolutionary computation algorithm.
It follows that, at least in theory, GP is applicable not only to countless problems, but also to a large range of complexities. Although this fact emphasizes how versatile GP can be, there is a consequence: genetic programming is destined to be computationally demanding, whatever the availability of computational resources, because there will always be a more complex problem to be solved. This does not mean, however, that GP is an inefficient, "processor-hungry" technique; rather, being capable of tackling increasingly difficult problems as computational power advances means that GP is able to take full advantage of an arbitrary amount of resources when handling a problem.
Most optimization algorithms can somehow take advantage of available computational resources, but what distinguishes GP from most of them is that GP can efficiently exploit a wider class of hardware architectures to their full extent. The main reason is the high degree of parallelism inherent in GP, both fine- and coarse-grained [2].
The microprocessor industry players have agreed that keeping up with Moore's Law [23] is becoming increasingly difficult as the manufacturing process involved in making microchips approaches the physical limits of the technology [14]. In other words, unless some breakthrough in manufacturing and materials is accomplished, there is not much room for further improvement in transistor density per area; moreover, even before the physical limits of the technology are effectively reached, the cost of producing ever denser microprocessors may not be commercially viable.
An alternative approach to ensure continuous advances in processor performance is to soften the focus on transistor density per se and instead couple many independent "not-so-dense" cores in a single processor; that is, the parallel multi-core design. This approach, however, does not improve single-core performance, but rather total performance, which is fully achieved only when all processor cores operate simultaneously at full capacity.
The multi-core design has become ubiquitous in almost every computing platform and has thus displaced the dominance of serial single-core processors. Not only that, but the demand for computational power has pushed forward all kinds of parallel architectures and has made them mainstream, the trend for the future.
According to [8], the variety of parallel processors can be categorized into two classes, namely the latency- and the throughput-oriented processors. The first class comprises processors whose design is optimized towards the fast processing of sequential tasks, even in multi-core architectures. The second refers to processors whose goal is to process the greatest number of tasks per unit of time. A representative of the latency-oriented class is the ordinary CPU, be it single- or multi-core. As for the throughput-oriented class, a noteworthy example is the Graphics Processing Unit (GPU) architecture.
One of the early attempts to use GPUs to accelerate the execution of evolutionary algorithms was made by Yu et al. [33]. They implemented a parallel genetic algorithm on an Nvidia GeForce 6800GT GPU using the Cg language. Not only was the fitness evaluation implemented on the GPU, but also the genetic operators. Compared with the CPU, the authors reported a peak speedup of about 17× for the fitness evaluation. Harding and Banzhaf [11] implemented genetic programming on the GPU using the Accelerator toolkit, reporting excellent results on an Nvidia GeForce 7300 GO across different kinds of benchmarks. Langdon and Banzhaf [20] used the RapidMind framework to implement a SIMD interpreter for linear genetic programming on the GPU, achieving 895 million GPop/s on the Mackey–Glass chaotic time series using an Nvidia 8800 GTX. Robilliard et al. [27], [26] implemented and evaluated different GP population-parallel approaches using Nvidia's Compute Unified Device Architecture (CUDA); their peak performance on an Nvidia G80 GPU was an impressive 2.8 billion GPop/s. Also using the CUDA framework, Maitre et al. [21] obtained a peak speedup of 250× when comparing the performance of a dual-GPU Nvidia GTX-295 (using half of its cores) with an Intel quad-core Q8200 CPU in sequential mode.
With a few exceptions, such as the use of the proprietary GPU.NET [12], most of the recent studies exploiting GPU power have been built exclusively around the CUDA technology. While CUDA is a powerful and mature toolkit, it is a closed technology that only works on Nvidia GPUs, precluding the use of the computational power available from other vendors and architectures.
Although the related works found in the literature have shown the remarkable power of the GPU in speeding up genetic programming using different frameworks, so far none of them have implemented and evaluated a parallel genetic programming system using the OpenCL specification, which holds the following properties: (i) it is an open standard; (ii) it is portable across many parallel architectures and vendors; and (iii) it allows low-level access to hardware. One consequence is that there are neither references detailing how to map a parallel GP onto the OpenCL framework nor studies directly comparing the performance of accelerated parallel genetic programming on multi-core CPUs and GPUs from different vendors.
This paper is organized as follows. Section 2 introduces OpenCL, a portable multi-vendor language for parallel programming on heterogeneous devices. Different strategies for the parallel implementation of a genetic programming system using OpenCL to speed up tree evaluation on both CPU and GPU devices are described in Section 3. In Section 4, computational experiments comparing raw performance in terms of GPop/s for different classes of problems, parallel strategies, hardware architectures, and vendors are presented. Optimization techniques implemented by GPOCL are presented in Section 5. Finally, Section 6 points out conclusions and some directions for future work.
Section snippets
Open computing language—OpenCL
OpenCL is an open standard [31] for uniform and portable parallel programming across heterogeneous computing platforms. It aims to provide low- and high-level access to data- or task-parallel devices, either individually or simultaneously. Besides conventional multi-core CPUs and GPUs from multiple vendors, OpenCL also supports other parallel devices, such as Field-Programmable Gate Arrays (FPGAs), digital signal processors (DSPs), IBM's Cell Broadband Engine [15], and more.
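OpenCL expresses data parallelism by executing a kernel over an N-dimensional index space (an NDRange): each work-item learns its position in that space and processes one element of the data. The following pure-Python sketch mimics this execution model for illustration only; the names `launch_kernel` and `vec_add` are assumptions that echo OpenCL's concepts, not the actual API, and a real device would run the work-items concurrently rather than in a loop:

```python
# Pure-Python sketch of OpenCL's data-parallel model: a "kernel"
# function is invoked once per work-item over a 1-D NDRange; each
# invocation receives its global index (what get_global_id(0) would
# return in OpenCL C) and handles exactly one element.
# On a real device all work-items execute in parallel; here we loop.

def launch_kernel(kernel, global_size, *buffers):
    for gid in range(global_size):   # the device would run these concurrently
        kernel(gid, *buffers)

def vec_add(gid, a, b, c):           # kernel body: c[gid] = a[gid] + b[gid]
    c[gid] = a[gid] + b[gid]

a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]
c = [0.0] * 3
launch_kernel(vec_add, 3, a, b, c)
print(c)  # [11.0, 22.0, 33.0]
```

The same mental model applies to GP tree evaluation: each work-item can be mapped to a fitness case, an individual, or a combination of both, which is precisely what the parallelization strategies of Section 3 vary.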
Parallel genetic programming with OpenCL
This section discusses the main aspects of a genetic programming implementation that uses OpenCL to exploit the multi-core CPU and GPU processors. The actual program is called GPOCL, a freely available high-performance implementation written in C++ and OpenCL that implements a canonical GP system [19] using a prefix linear tree representation [3], [17].
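In a prefix linear representation, a tree is stored as a flat array of tokens in prefix order and can be evaluated by a small stack-based interpreter instead of by pointer chasing, which is what makes this layout attractive for GPU evaluation. A minimal Python sketch of such an interpreter follows; the token set, arities, and protected division are illustrative assumptions, not GPOCL's actual encoding:

```python
# Evaluate a GP tree stored as a linear prefix token list, e.g.
# ['+', 'x', '*', 'x', 'x'] encodes x + x*x.
# Scanning the program right-to-left, terminals are pushed and
# functions pop their arguments -- no tree pointers are needed.

ARITY = {'+': 2, '-': 2, '*': 2, '/': 2, 'neg': 1}

def eval_prefix(program, x):
    stack = []
    for tok in reversed(program):
        if tok == 'x':                       # terminal: input variable
            stack.append(x)
        elif isinstance(tok, (int, float)):  # terminal: constant
            stack.append(float(tok))
        else:                                # function node
            args = [stack.pop() for _ in range(ARITY[tok])]
            if tok == '+':   stack.append(args[0] + args[1])
            elif tok == '-': stack.append(args[0] - args[1])
            elif tok == '*': stack.append(args[0] * args[1])
            elif tok == '/': # protected division, a common GP convention
                stack.append(args[0] / args[1] if args[1] != 0 else 1.0)
            elif tok == 'neg': stack.append(-args[0])
    return stack.pop()

print(eval_prefix(['+', 'x', '*', 'x', 'x'], 3.0))  # x + x*x = 12.0
```

An interpreter of this shape ports naturally to an OpenCL kernel: the token array is copied to device memory and each work-item runs the same scan over its own fitness case, keeping the small evaluation stack in private memory.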
Computational experiments
This section presents the results of two batches of experiments, whose goal is to compare and discuss the performance of the parallel implementation strategies of Section 3 on different processor architectures and models/vendors. What distinguishes these two sets of experiments is the workload pattern they present to the compute devices. While the first batch simulates the evaluation of primitives commonly found in symbolic regression problems, i.e., mainly …
Optimization techniques
This section discusses the main optimization techniques employed in GPOCL, and also addresses the optimal choice of parameter settings. Most of the topics are specific to the GPUs, as this architecture offers a greater extent of optimization opportunities while being more sensitive to certain practices.
Conclusions
We have presented a detailed high-performance GP implementation in OpenCL for accelerated tree evaluation on the CPU and GPU architectures. Three GPU parallelization strategies were assessed in two domains: symbolic regression and data classification. Based on the presented study we can conclude that:
- The GPU is considerably faster and more power efficient than the CPU; even a mid-range GPU can beat a very high-end CPU.
- The population-parallel per compute unit strategy is clearly the most …
Acknowledgments
The authors would like to thank the reviewers for their valuable suggestions, and the support provided by FAPERJ (grants E-26/102.825/2008 and E-26/102.025/2009), and CNPq (grant 311651/2006-2).
Douglas A. Augusto is a post-doctoral researcher at the Laboratório Nacional de Computação Científica (LNCC), Brazil. He received his M.Sc. and D.Sc. degrees from the Federal University of Rio de Janeiro, Brazil, in 2004 and 2009, respectively. His research interest includes parallel and distributed computing, general purpose GPU programming, and metaheuristics.
References (33)
- et al., A parallel implementation of genetic programming that achieves super-linear performance, Inform. Sci. (1998)
- Advanced Micro Devices, AMD accelerated parallel processing programming guide-OpenCL, 12...
- et al., Symbolic regression via genetic programming
- et al., Green supercomputing comes of age, IT Prof. (2008)
- Intel Corporation, Writing optimal OpenCL code with Intel OpenCL SDK, 12...
- Advanced Micro Devices, Coming soon: the AMD fusion family of APUs, 2010. URL...
- et al., An extension on "statistical comparisons of classifiers over multiple data sets" for all pairwise comparisons, J. Mach. Learn. Res. (2009)
- et al., Understanding throughput-oriented architectures, Commun. ACM (2010)
- et al., Heterogeneous Computing with OpenCL (2011)
- et al., Introduction to Parallel Computing (2nd edition) (2003)
- Fast genetic programming and artificial developmental systems on GPUs
- Implementing cartesian genetic programming classifiers on graphics processing units using GPU.NET
- Data parallel algorithms, Commun. ACM
- Introduction to the Cell Broadband Engine architecture, IBM J. Res. Dev.
- Massively parallel genetic programming
Helio J.C. Barbosa is a Senior Technologist at the Laboratório Nacional de Computação Científica, Brazil. He received a Civil Engineering degree (1974) from the Federal University of Juiz de Fora, where he is an Associate Professor in the Computer Science Department, and M.Sc. (1978) and D.Sc. (1986) degrees in Civil Engineering from the Federal University of Rio de Janeiro, Brazil. During 1988–1990 he was a visiting scholar at the Division of Applied Mechanics, Stanford University, USA. He is currently mainly interested in the design and application of nature-inspired metaheuristics in engineering and biology.