Abstract
In the petabyte era, robust machine learning tools are needed to cope with the volume and high dimensionality of the data to mine. Evolutionary Algorithms (EA), such as Genetic Programming (GP), are powerful machine learning techniques with great potential to deal with big data challenges. To better exploit their capacities, additional manipulations can reduce the computational cost of EA and help them gain better insight into large data sets. This chapter summarizes some solutions and trends that address the difficulties of training EA/GP on big data sets and proposes a taxonomy that classifies these solutions into three categories: processing manipulation, algorithm manipulation, and data manipulation. Two approaches are then presented and discussed. The first one, from the processing manipulation category, parallelizes GP over Spark. The second one, from the algorithm manipulation category, extends GP with active learning using dynamic and adaptive sampling. For each approach, implementation guidelines for integrating it into the GP loop of a Python EA framework are given. A combination of the two approaches is also presented. The efficiency of these solutions is then discussed in light of published experimental studies.
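As a rough illustration of the dynamic sampling idea mentioned above, the following sketch draws a fresh random training subset at every generation, so each fitness evaluation touches only a small part of the data. It is a minimal sketch under stated assumptions and not the chapter's implementation: individuals are plain decision thresholds rather than GP trees, the toy data set is synthetic, and all names (make_dataset, accuracy, evolve, sample_size) are hypothetical.

```python
import random


def make_dataset(n, rng):
    """Hypothetical toy data: the label is 1 when x > 0.6, else 0."""
    data = []
    for _ in range(n):
        x = rng.random()
        data.append((x, 1 if x > 0.6 else 0))
    return data


def accuracy(threshold, cases):
    """Fraction of cases that the rule `x > threshold` classifies correctly."""
    hits = sum(1 for x, y in cases if (1 if x > threshold else 0) == y)
    return hits / len(cases)


def evolve(data, generations=30, pop_size=20, sample_size=200, seed=0):
    """Evolutionary loop whose fitness is computed on a per-generation sample."""
    rng = random.Random(seed)
    # Individuals are thresholds in [0, 1] (a stand-in for GP trees).
    population = [rng.random() for _ in range(pop_size)]
    for _ in range(generations):
        # Dynamic sampling: a new random subset replaces the full training set.
        sample = rng.sample(data, sample_size)
        ranked = sorted(population, key=lambda t: accuracy(t, sample), reverse=True)
        # Truncation selection keeps the better half as parents.
        parents = ranked[: pop_size // 2]
        # Gaussian mutation, clipped to [0, 1], produces one child per parent.
        children = [min(1.0, max(0.0, p + rng.gauss(0, 0.05))) for p in parents]
        population = parents + children
    # The final individual is chosen once on the whole data set.
    return max(population, key=lambda t: accuracy(t, data))


data = make_dataset(10_000, random.Random(42))
best = evolve(data)
full_accuracy = accuracy(best, data)
```

With these settings each generation evaluates 20 individuals on 200 cases instead of 10,000, which is where the saving comes from. An adaptive variant would adjust sample_size or the resampling frequency from the observed learning state; that part is left out here.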
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Ben Hamida, S., Hmida, H. (2023). Algorithm vs Processing Manipulation to Scale Genetic Programming to Big Data Mining. In: Eddaly, M., Jarboui, B., Siarry, P. (eds) Metaheuristics for Machine Learning. Computational Intelligence Methods and Applications. Springer, Singapore. https://doi.org/10.1007/978-981-19-3888-7_7
DOI: https://doi.org/10.1007/978-981-19-3888-7_7
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-3887-0
Online ISBN: 978-981-19-3888-7
eBook Packages: Computer Science, Computer Science (R0)