Algorithm vs Processing Manipulation to Scale Genetic Programming to Big Data Mining

Ben Hamida, S.; Hmida, H.

doi:10.1007/978-981-19-3888-7_7


Part of the book series: Computational Intelligence Methods and Applications (CIMA)


Abstract

In the petabyte era, robust machine learning tools are needed to cope with the volume and high dimensionality of the data to mine. Evolutionary Algorithms (EA), such as Genetic Programming (GP), are powerful machine learning techniques with great potential to deal with big data challenges. To better exploit their capacities, additional manipulations can help the EA reduce computational cost and thus gain better insight into large data sets. This chapter summarizes some solutions and trends that address the difficulties of training EA/GP on big data sets and proposes a taxonomy that classifies these solutions into three categories: processing manipulation, algorithm manipulation, and data manipulation. Two approaches are then presented and discussed. The first, from the processing manipulation category, parallelizes GP over Spark. The second, from the algorithm manipulation category, extends GP with active learning using dynamic and adaptive sampling. For each approach, some guidelines are given for implementing it in the GP loop of a Python EA framework. A combination of the two approaches is also presented. The efficiency of these solutions is then discussed with reference to published experimental studies.
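To make the idea of dynamic sampling inside an evolutionary loop concrete, the following is a minimal sketch and not the chapter's implementation. It assumes a toy task in which each individual is a single threshold rather than a GP tree, and it uses plain random re-sampling of the training subset at every generation, which is only the simplest form of dynamic sampling. All names (`make_dataset`, `accuracy`, `evolve`) and parameter values are hypothetical and chosen for illustration.

```python
import random


def make_dataset(n, rng):
    """Toy labelled data (assumption): x in [0, 1), label 1 when x > 0.6."""
    xs = [rng.random() for _ in range(n)]
    return [(x, int(x > 0.6)) for x in xs]


def accuracy(threshold, cases):
    """Fraction of cases that the rule `x > threshold` classifies correctly."""
    hits = sum(int(x > threshold) == y for x, y in cases)
    return hits / len(cases)


def evolve(data, pop_size=20, generations=30, sample_size=200, seed=0):
    """Toy evolutionary loop whose fitness cases are re-sampled each generation."""
    rng = random.Random(seed)
    # Individuals are thresholds here; in GP they would be program trees.
    population = [rng.random() for _ in range(pop_size)]
    for _ in range(generations):
        # Dynamic sampling: draw a fresh, small training subset every generation,
        # so each fitness evaluation costs sample_size cases instead of len(data).
        sample = rng.sample(data, sample_size)
        ranked = sorted(population, key=lambda t: accuracy(t, sample), reverse=True)
        parents = ranked[: pop_size // 2]
        # Gaussian mutation of the survivors, clipped to the valid range [0, 1].
        children = [min(1.0, max(0.0, p + rng.gauss(0, 0.05))) for p in parents]
        population = parents + children
    # The full data set is scanned only once, to pick the final model.
    return max(population, key=lambda t: accuracy(t, data))


if __name__ == "__main__":
    data = make_dataset(10_000, random.Random(1))
    best = evolve(data)
    print(f"best threshold: {best:.3f}, accuracy on full data: {accuracy(best, data):.3f}")
```

In this sketch every generation evaluates each individual on 200 cases instead of 10,000, which is where the cost saving comes from. An adaptive variant would additionally change the sample size or the re-sampling frequency according to the state of the search.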


Notes

1. ZB: Zettabyte = 10²¹ bytes.
2. https://spark.apache.org
3. https://github.com/hhmida/gp-spark
4. https://hadoop.apache.org
5. https://spark.apache.org
6. https://github.com/deap/deap
7. https://spark.apache.org/docs/latest/api/python/reference/pyspark.html


Author information

Authors and Affiliations

LAMSADE, Paris Dauphine University, PSL Research University, Paris, France
S. Ben Hamida
Institut Supérieur des Études Technologiques de Bizerte, Menzel Abderrahmane, Tunisia
H. Hmida


Corresponding author

Correspondence to S. Ben Hamida.

Editor information

Editors and Affiliations

Qassim University, Buraydah, Saudi Arabia
Mansour Eddaly
Abu Dhabi Women Campus, Higher Colleges of Technology, Abu Dhabi, United Arab Emirates
Bassem Jarboui
Paris-Est Créteil University, Paris, France
Patrick Siarry


About this chapter

Cite this chapter

Ben Hamida, S., Hmida, H. (2023). Algorithm vs Processing Manipulation to Scale Genetic Programming to Big Data Mining. In: Eddaly, M., Jarboui, B., Siarry, P. (eds) Metaheuristics for Machine Learning. Computational Intelligence Methods and Applications. Springer, Singapore. https://doi.org/10.1007/978-981-19-3888-7_7


DOI: https://doi.org/10.1007/978-981-19-3888-7_7
Published: 13 August 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-3887-0
Online ISBN: 978-981-19-3888-7
eBook Packages: Computer Science, Computer Science (R0)
