Algorithm vs Processing Manipulation to Scale Genetic Programming to Big Data Mining

Metaheuristics for Machine Learning

Part of the book series: Computational Intelligence Methods and Applications (CIMA)

Abstract

In the petabyte era, robust machine learning tools are needed to cope with the volume and high dimensionality of the data to mine. Evolutionary Algorithms (EA), such as Genetic Programming (GP), are powerful machine learning techniques with great potential to deal with big data challenges. To better exploit their capacities, additional manipulations can help the EA reduce its computational cost and thus gain better insight into large data sets. This chapter summarizes some solutions and trends that address the difficulties of training EA/GP on big data sets and proposes a taxonomy that classifies these solutions into three categories: processing manipulation, algorithm manipulation and data manipulation. Two approaches are then presented and discussed. The first one, from the processing manipulation category, parallelizes GP over Spark. The second one, from the algorithm manipulation category, extends GP with active learning using dynamic and adaptive sampling. For each approach, some guidelines for implementing it in the GP loop on top of an EA Python framework are given. A combination of the two approaches is also presented. The efficiency of these solutions is then discussed in light of published experimental studies.
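To make the two categories concrete, the following is a minimal sketch and not the chapter's implementation. It assumes plain Python in place of Spark and of an EA framework: fitness evaluation is split into a per-partition map step and a summing reduce step (processing manipulation), and each call draws a fresh random training subset (algorithm manipulation). The synthetic data set, the three candidate classifiers, and every function name here are hypothetical and chosen only for illustration.

```python
import random
from functools import reduce


def make_dataset(n, seed=0):
    """Synthetic binary classification data: label is 1 when x0 + x1 > 1."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x0, x1 = rng.random(), rng.random()
        rows.append(((x0, x1), int(x0 + x1 > 1.0)))
    return rows


def partition(rows, n_parts):
    """Split rows into n_parts chunks (a stand-in for distributed partitions)."""
    return [rows[i::n_parts] for i in range(n_parts)]


def partial_errors(candidate, chunk):
    """Map step: count the misclassified rows of one partition."""
    return sum(1 for x, y in chunk if candidate(x) != y)


def distributed_error(candidate, partitions):
    """Reduce step: add up the per-partition error counts."""
    counts = (partial_errors(candidate, p) for p in partitions)
    return reduce(lambda a, b: a + b, counts, 0)


def random_subset(rows, size, rng):
    """Sampling step: draw a fresh training subset for one generation."""
    return rng.sample(rows, size)


# Hypothetical candidate "programs"; a real GP run would evolve expression trees.
CANDIDATES = {
    "sum_gt_1": lambda x: int(x[0] + x[1] > 1.0),
    "x0_gt_half": lambda x: int(x[0] > 0.5),
    "always_0": lambda x: 0,
}


def best_on_sample(rows, sample_size, n_parts, seed):
    """Score every candidate on one sampled, partitioned subset."""
    rng = random.Random(seed)
    sample = random_subset(rows, sample_size, rng)
    parts = partition(sample, n_parts)
    scores = {name: distributed_error(f, parts) for name, f in CANDIDATES.items()}
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    data = make_dataset(1000)
    best, scores = best_on_sample(data, sample_size=200, n_parts=4, seed=1)
    print(best, scores)
```

In this sketch the map and reduce steps run sequentially in one process; on a cluster the same decomposition would let each partition be scored where its data lives, while sampling shrinks the number of rows each candidate has to be scored on.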

Notes

  1. ZB: Zettabyte = 10^21 bytes.

  2. https://spark.apache.org.

  3. https://github.com/hhmida/gp-spark.

  4. https://hadoop.apache.org.

  5. https://spark.apache.org.

  6. https://github.com/deap/deap.

  7. https://spark.apache.org/docs/latest/api/python/reference/pyspark.html.

Author information

Corresponding author

Correspondence to S. Ben Hamida.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Ben Hamida, S., Hmida, H. (2023). Algorithm vs Processing Manipulation to Scale Genetic Programming to Big Data Mining. In: Eddaly, M., Jarboui, B., Siarry, P. (eds) Metaheuristics for Machine Learning. Computational Intelligence Methods and Applications. Springer, Singapore. https://doi.org/10.1007/978-981-19-3888-7_7

  • DOI: https://doi.org/10.1007/978-981-19-3888-7_7

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-19-3887-0

  • Online ISBN: 978-981-19-3888-7

  • eBook Packages: Computer Science, Computer Science (R0)
