Accelerated parallel genetic programming tree evaluation with OpenCL

https://doi.org/10.1016/j.jpdc.2012.01.012

Abstract

Inspired by the process of natural selection, genetic programming (GP) aims at automatically building arbitrarily complex computer programs. Being classified as an “embarrassingly” parallel technique, GP can theoretically scale up to tackle very diverse problems by increasingly adding computational power to its arsenal. With today’s availability of many powerful parallel architectures, a challenge is to take advantage of all those heterogeneous compute devices in a portable and uniform way. This work proposes both (i) a transcription of existing GP parallelization strategies into the OpenCL programming platform; and (ii) a freely available implementation to evaluate its suitability for GP, by assessing the performance of parallel strategies on the CPU and GPU processors from different vendors. Benchmarks on the symbolic regression and data classification domains were performed. On the GPU we could achieve 13 billion node evaluations per second, delivering almost 10 times the throughput of a twelve-core CPU.

Highlights

  • We propose a parallel GP implementation in OpenCL for accelerated tree evaluation.

  • On the GPU we could achieve 13 billion node evaluations per second.

  • The GPU throughput is highly sensitive to the underlying parallelization strategy.

  • The GPUs are remarkably faster than the multi-core CPUs (up to 24× speedup).

Introduction

Proposed as a technique to automatically build arbitrarily complex computer programs, genetic programming (GP) [19] is beyond doubt a highly ambitious paradigm from the field of evolutionary computation. Strictly speaking, only two requirements need to be satisfied for GP to act as a potential solver for a given problem: (i) the problem’s solution can be described as a computer program, and (ii) there exists a graduated evaluation metric that, given two candidate solutions, is capable of indicating, with reasonable accuracy, which one is better. The first condition requires that candidate solutions be representable in a GP system, so that they can undergo repeated genetic operations, such as crossover and mutation, which is how programs evolve in GP. The second allows for the implementation of the principle of natural selection, an essential component of every evolutionary computation algorithm.

It follows that, at least in theory, GP is applicable not only to countless problems but also to a wide range of complexities. Although this fact emphasizes how versatile GP can be, there is a consequence: genetic programming is destined to be computationally demanding, regardless of the available computational resources, because there will always be a more complex problem to be solved. This does not mean, however, that GP is an inefficient, “processor-hungry” technique; being capable of tackling increasingly difficult problems as computational power advances means that GP can take full advantage of an arbitrary amount of resources to handle a problem.

Most optimization algorithms can somehow take advantage of available computational resources, but what distinguishes GP from most of them is that GP is able to efficiently exploit a wider class of hardware architectures to their full extent. The main reason is the high degree of parallelism inherent in GP, both fine- and coarse-grained [2].

The microprocessor industry players have agreed that keeping up with Moore’s Law [23] is becoming increasingly difficult as the manufacturing process involved in making microchips approaches the physical limits of the technology [14]. In other words, unless some breakthrough in manufacturing and materials is accomplished, there is not much room for further improvement in transistor density per unit area; moreover, even before the physical limits of the technology are effectively reached, producing ever-denser microprocessors may cease to be commercially viable.

An alternative approach to ensure continuous advances in processor performance is to soften the focus on transistor density per se and instead couple many independent “not-so-dense” cores in a single processor; that is, the parallel multi-core design. However, this approach does not improve single-core performance, but rather the total performance, which is fully achieved only when all processor cores operate simultaneously at full capacity.

The multi-core design has become ubiquitous in almost every computing platform and has thus displaced the once-dominant serial single-core processors. Moreover, the demand for computational power has pushed forward all kinds of parallel architectures and has made them mainstream, the future trend.

According to [8], the variety of parallel processors can be categorized into two classes, namely latency-oriented and throughput-oriented processors. The first class concerns processors whose design is optimized towards the fast processing of sequential tasks, even for multi-core architectures. The second refers to processors whose goal is to process the greatest number of tasks per unit of time. A representative of the latency-oriented class is the ordinary CPU, be it single- or multi-core. As for the throughput-oriented ones, a noteworthy example is the Graphics Processing Unit (GPU) architecture.

One of the early attempts to use GPUs to accelerate the execution of evolutionary algorithms was made by Yu et al. [33]. They implemented a parallel genetic algorithm on an Nvidia GeForce 6800GT GPU using the Cg language. Not only the fitness evaluation but also the genetic operators were implemented on the GPU. Compared with the CPU, the authors reported a peak speedup of about 17× for the fitness evaluation. Harding and Banzhaf [11] implemented genetic programming on the GPU using the Accelerator toolkit. They reported excellent results on an Nvidia GeForce 7300 GO on different kinds of benchmarks. Langdon and Banzhaf [20] used the RapidMind framework to implement a SIMD interpreter for linear genetic programming on the GPU. They achieved 895 million GPop/s on the Mackey–Glass chaotic time series using an Nvidia 8800 GTX. Robilliard et al. [27], [26] implemented and evaluated different GP population-parallel approaches using Nvidia’s Compute Unified Device Architecture (CUDA). Their peak performance on an Nvidia G80 GPU was an impressive 2.8 billion GPop/s. Also using the CUDA framework, Maitre et al. [21] obtained a peak speedup of 250× when comparing the performance of a dual-GPU Nvidia GTX-295 (using half of its cores) with an Intel quad-core Q8200 CPU running in sequential mode.

With few exceptions, such as the use of the proprietary GPU.NET [12], most recent studies exploiting GPU power have been built exclusively around the CUDA technology. While CUDA is a powerful and mature toolkit, it is a closed technology that only works on Nvidia GPUs, precluding the use of the computational power available from other vendors and architectures.

Although the related works found in the literature have shown the remarkable power of the GPU in speeding up genetic programming using different frameworks, so far none of them has implemented and evaluated a parallel genetic programming system using the OpenCL specification, which holds the following properties: (i) it is an open standard; (ii) it is portable across many parallel architectures and vendors; and (iii) it allows low-level access to hardware. One consequence is that there are neither references detailing how to map a parallel GP onto the OpenCL framework nor studies directly comparing accelerated parallel genetic programming performance on multi-core CPUs and GPUs from different vendors.

This paper is organized as follows. Section 2 introduces OpenCL, a portable multi-vendor language for parallel programming on heterogeneous devices. Section 3 describes different parallel implementation strategies of a genetic programming system using OpenCL to speed up tree evaluation on both CPU and GPU devices. In Section 4, computational experiments comparing raw performance in terms of GPop/s for different classes of problems, parallel strategies, hardware architectures, and vendors are presented. Optimization techniques implemented by GPOCL are presented in Section 5. Finally, Section 6 points out conclusions and some directions for future work.

Section snippets

Open Computing Language (OpenCL)

OpenCL is an open standard [31] for uniform and portable parallel programming across heterogeneous computing platforms. It aims to provide low- and high-level access to data- or task-parallel devices, either individually or simultaneously. Besides conventional multi-core CPUs and GPUs from multiple vendors, OpenCL also supports other parallel devices, such as field-programmable gate arrays (FPGAs), digital signal processors (DSPs), IBM’s Cell Broadband Engine [15], and more.
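For illustration, a minimal host-side sketch (generic OpenCL API usage, not code from the paper) shows how a program discovers the heterogeneous devices OpenCL exposes, using only the standard C API (clGetPlatformIDs, clGetDeviceIDs, clGetDeviceInfo):

    // Minimal OpenCL host sketch: list every platform (vendor) and its devices.
    // Generic OpenCL API usage; this is not GPOCL code.
    #include <CL/cl.h>
    #include <cstdio>
    #include <vector>

    int main() {
        cl_uint np = 0;
        clGetPlatformIDs(0, nullptr, &np);               // query platform count
        std::vector<cl_platform_id> platforms(np);
        clGetPlatformIDs(np, platforms.data(), nullptr);

        for (cl_platform_id p : platforms) {
            char vendor[256];
            clGetPlatformInfo(p, CL_PLATFORM_VENDOR, sizeof(vendor), vendor, nullptr);

            cl_uint nd = 0;
            clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, 0, nullptr, &nd);
            std::vector<cl_device_id> devices(nd);
            clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, nd, devices.data(), nullptr);

            for (cl_device_id d : devices) {             // CPUs, GPUs, accelerators...
                char name[256];
                clGetDeviceInfo(d, CL_DEVICE_NAME, sizeof(name), name, nullptr);
                std::printf("%s: %s\n", vendor, name);
            }
        }
        return 0;
    }

The same binary can thus target a multi-core CPU, a GPU from any vendor, or both, simply by selecting different device IDs at run time.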

Parallel genetic programming with OpenCL

This section discusses the main aspects of a genetic programming implementation that uses OpenCL to exploit multi-core CPU and GPU processors. The actual program is called GPOCL, a freely available high-performance system, written in C++ and OpenCL, that implements canonical GP [19] using a prefix linear tree representation [3], [17].
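While the snippet above does not show the evaluator itself, a prefix linear tree can be interpreted without recursion by scanning its token string right to left with a small stack, exactly as in postfix evaluation. Below is a minimal OpenCL kernel sketching this idea under a simple fitness-case-parallel mapping (one work-item per fitness case) and a hypothetical opcode encoding; GPOCL’s actual kernels, opcode packing, and parallelization strategies may differ.

    /* Hypothetical opcode encoding for the sketch; GPOCL's packing differs. */
    #define OP_X   0   /* push the input variable */
    #define OP_ADD 1
    #define OP_SUB 2
    #define OP_MUL 3

    __kernel void eval_program(__global const int   *program,  /* prefix tokens */
                               const int             length,   /* token count */
                               __global const float *X,        /* one input per case */
                               __global float       *out,      /* one output per case */
                               const int             num_cases)
    {
        const int gid = (int)get_global_id(0); /* one work-item per fitness case */
        if (gid >= num_cases) return;

        float stack[32];                       /* assumes tree depth <= 32 */
        int top = -1;

        /* Scanning prefix right-to-left behaves like postfix evaluation:
           operands are pushed; an operator pops its left then right argument. */
        for (int i = length - 1; i >= 0; --i) {
            float a, b;
            switch (program[i]) {
            case OP_X:   stack[++top] = X[gid]; break;
            case OP_ADD: a = stack[top--]; b = stack[top--]; stack[++top] = a + b; break;
            case OP_SUB: a = stack[top--]; b = stack[top--]; stack[++top] = a - b; break;
            case OP_MUL: a = stack[top--]; b = stack[top--]; stack[++top] = a * b; break;
            }
        }
        out[gid] = stack[0];                   /* final value for this case */
    }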

Computational experiments

The results from two batches of experiments are presented in this section, whose goal is to compare and discuss the performance of the parallel implementation strategies of Section 3 on different processor architectures and models/vendors. What distinguishes these two sets of experiments is the workload pattern they present to the compute devices. While the first batch simulates the evaluation of primitives commonly found in symbolic regression problems, i.e., mainly…
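For reference, the throughput unit used in these experiments, GPop/s (genetic programming operations per second), counts primitive node evaluations per second: every node of every program, interpreted over every fitness case, divided by the elapsed time; this is the same quantity the abstract reports as 13 billion node evaluations per second. A minimal sketch of the arithmetic, with purely illustrative parameter values (not measurements from the paper):

    // GPop/s = total node evaluations / elapsed seconds.
    // All parameter values below are illustrative, not the paper's settings.
    #include <cstdio>

    int main() {
        const double population    = 500;   // programs per generation (assumed)
        const double avg_nodes     = 64;    // average tree size (assumed)
        const double fitness_cases = 4096;  // samples per program (assumed)
        const double generations   = 100;   // (assumed)
        const double seconds       = 1.0;   // measured wall-clock evaluation time

        const double gpops = population * avg_nodes * fitness_cases * generations / seconds;
        std::printf("%.2e GPop/s\n", gpops);  // prints 1.31e+10 for these inputs
        return 0;
    }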

Optimization techniques

This section discusses the main optimization techniques employed in GPOCL and also addresses the optimal choice of parameter settings. Most of the topics are specific to GPUs, as this architecture offers a wider range of optimization opportunities while being more sensitive to certain practices.
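As one concrete example of the kind of GPU-specific technique this section covers, a work-group that interprets the same individual over many fitness cases can first stage the program tokens in fast __local memory, so they are fetched from slow global memory only once per group. The fragment below is a sketch of this common optimization, not necessarily GPOCL’s exact code:

    /* Sketch: cooperative copy of the program into __local memory before
       interpretation; a common optimization when one work-group evaluates
       the same individual over many fitness cases. */
    __kernel void eval_local(__global const int *program, const int length,
                             __global const float *X, __global float *out,
                             __local int *lprogram)  /* sized via clSetKernelArg */
    {
        /* Each work-item copies a strided slice of the tokens. */
        for (int i = (int)get_local_id(0); i < length; i += (int)get_local_size(0))
            lprogram[i] = program[i];
        barrier(CLK_LOCAL_MEM_FENCE);  /* whole group sees the staged program */

        /* ... interpret lprogram[] over this work-item's fitness case(s) ... */
    }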

Conclusions

We have presented a detailed high-performance GP implementation in OpenCL for accelerated tree evaluation on the CPU and GPU architectures. Three GPU parallelization strategies were assessed in two domains: symbolic regression and data classification. Based on the presented study we can conclude that:

  • The GPU is considerably faster and more power efficient than the CPU; even a mid-range GPU can beat a very high-end CPU.

  • The population-parallel per compute unit strategy is clearly the most…

Acknowledgments

The authors would like to thank the reviewers for their valuable suggestions, and the support provided by FAPERJ (grants E-26/102.825/2008 and E-26/102.025/2009), and CNPq (grant 311651/2006-2).

Douglas A. Augusto is a post-doctoral researcher at the Laboratório Nacional de Computação Científica (LNCC), Brazil. He received his M.Sc. and D.Sc. degrees from the Federal University of Rio de Janeiro, Brazil, in 2004 and 2009, respectively. His research interests include parallel and distributed computing, general-purpose GPU programming, and metaheuristics.

References (33)

  • D. Andre et al., A parallel implementation of genetic programming that achieves super-linear performance, Inform. Sci. (1998)

  • Advanced Micro Devices, AMD accelerated parallel processing programming guide-OpenCL, 12...

  • D.A. Augusto et al., Symbolic regression via genetic programming

  • W.-c. Feng et al., Green supercomputing comes of age, IT Prof. (2008)

  • Intel Corporation, Writing optimal OpenCL code with Intel OpenCL SDK, 12...

  • Advanced Micro Devices, Coming soon: the AMD Fusion family of APUs, 2010. URL...

  • S. García et al., An extension on “statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons, J. Mach. Learn. Res. (2009)

  • M. Garland et al., Understanding throughput-oriented architectures, Commun. ACM (2010)

  • B. Gaster et al., Heterogeneous Computing with OpenCL (2011)

  • A. Grama et al., Introduction to Parallel Computing, 2nd edition (2003)

  • S. Harding et al., Fast genetic programming and artificial developmental systems on GPUs

  • S. Harding et al., Implementing cartesian genetic programming classifiers on graphics processing units using GPU.NET

  • W.D. Hillis et al., Data parallel algorithms, Commun. ACM (1986)

  • R. Hiremane, From Moore’s Law to Intel Innovation - Prediction to Reality, Intel Magazine 1–9 (April...

  • C.R. Johns et al., Introduction to the Cell Broadband Engine architecture, IBM J. Res. Dev. (2007)

  • H. Juille et al., Massively parallel genetic programming

Helio J.C. Barbosa is a Senior Technologist at the Laboratório Nacional de Computação Científica, Brazil. He received a Civil Engineering degree (1974) from the Federal University of Juiz de Fora, where he is an Associate Professor in the Computer Science Department, and M.Sc. (1978) and D.Sc. (1986) degrees in Civil Engineering from the Federal University of Rio de Janeiro, Brazil. During 1988–1990 he was a visiting scholar at the Division of Applied Mechanics, Stanford University, USA. He is currently mainly interested in the design and application of nature-inspired metaheuristics in engineering and biology.
