Using Genetic Programming for Data Science: Lessons Learned

Genetic Programming Theory and Practice XIII

Part of the book series: Genetic and Evolutionary Computation ((GEVO))

Abstract

In this chapter we present a case study to demonstrate how current state-of-the-art Genetic Programming (GP) fares as a tool for the emerging field of Data Science. Data Science refers to the practice of extracting knowledge from data, often Big Data, to glean insights useful for predicting business, political or societal outcomes. Data Science tools are important to the practice because they allow Data Scientists to be productive and accurate. GP has many features that make it well suited as a tool for Data Science, but it is not yet widely considered a Data Science method. We therefore performed a real-world comparison of GP with a popular Data Science method to understand its strengths and weaknesses. GP found equally strong solutions, leveraged the new Big Data infrastructure, and provided several benefits such as direct feature importance and solution confidence. However, GP lacked the ability to build and test models quickly, required much more intensive computing power, and, due to its lack of commercial maturity, created some challenges for productization as well as for integration with data management and visualization capabilities. The lessons learned lead to several recommendations that provide a path for future research to focus on key areas to improve GP as a Data Science tool.

Notes

  1. http://protege.stanford.edu/

Corresponding author

Correspondence to Steven Gustafson.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Gustafson, S., Narasimhan, R., Palla, R., Yousuf, A. (2016). Using Genetic Programming for Data Science: Lessons Learned. In: Riolo, R., Worzel, W., Kotanchek, M., Kordon, A. (eds) Genetic Programming Theory and Practice XIII. Genetic and Evolutionary Computation. Springer, Cham. https://doi.org/10.1007/978-3-319-34223-8_7

  • DOI: https://doi.org/10.1007/978-3-319-34223-8_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-34221-4

  • Online ISBN: 978-3-319-34223-8

  • eBook Packages: Computer Science, Computer Science (R0)
