Genetic Programming over Spark for Higgs Boson Classification

Hmida, Hmida; Ben Hamida, Sana; Borgi, Amel; Rukoz, Marta

doi:10.1007/978-3-030-20485-3_23

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 353)

Included in the following conference series:

International Conference on Business Information Systems

Abstract

With the growing number of available databases containing very large numbers of records, existing knowledge discovery tools need to be adapted and new tools need to be created. Genetic Programming (GP) has proven to be an efficient algorithm, in particular for classification problems. However, GP is hampered by its computational cost, which becomes more acute with large datasets. This paper presents how an existing GP implementation (DEAP) can be adapted by distributing evaluations on a Spark cluster. An additional sampling step is then applied so that the approach fits tiny clusters. Experiments are carried out on Higgs Boson classification with different settings. They show the benefits of using Spark as a parallelization technology for GP.
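
The idea of distributing fitness evaluations over data partitions and then sampling each partition can be illustrated with a minimal sketch. It is not the authors' code and uses neither DEAP nor Spark: the partitions are plain Python lists, the map and reduce steps are the built-in `map` and `functools.reduce`, and all names (`make_individual`, `partition`, `sample_partition`, `count_correct`, `evaluate`), the linear "individuals", and the synthetic data are assumptions made purely for illustration.

```python
import random
from functools import reduce

# Hypothetical stand-in for a GP individual: a callable mapping a feature
# vector to a real value; an output > 0 is read as class 1 (signal).
def make_individual(weights):
    return lambda x: sum(w * xi for w, xi in zip(weights, x))

def partition(rows, n_parts):
    """Split rows into n_parts chunks, mimicking data partitions on a cluster."""
    return [rows[i::n_parts] for i in range(n_parts)]

def sample_partition(part, fraction, rng):
    """Random subsample of one partition (the additional sampling step)."""
    k = max(1, int(len(part) * fraction))
    return rng.sample(part, k)

def count_correct(part, population):
    """Map step: per-partition count of correct predictions per individual."""
    counts = [0] * len(population)
    for features, label in part:
        for i, ind in enumerate(population):
            if (ind(features) > 0) == (label == 1):
                counts[i] += 1
    return counts

def evaluate(parts, population):
    """Reduce step: sum the partial counts and normalise to an accuracy."""
    partial = map(lambda p: count_correct(p, population), parts)
    totals = reduce(lambda a, b: [x + y for x, y in zip(a, b)], partial)
    n = sum(len(p) for p in parts)
    return [t / n for t in totals]

rng = random.Random(0)
# Synthetic two-feature data (not the Higgs dataset): label is 1 when x0 > x1.
rows = []
for _ in range(1000):
    x = (rng.random(), rng.random())
    rows.append((x, 1 if x[0] > x[1] else 0))

population = [make_individual((1.0, -1.0)), make_individual((-1.0, 1.0))]
parts = partition(rows, 4)
sampled = [sample_partition(p, 0.1, rng) for p in parts]

full_fitness = evaluate(parts, population)       # fitness on all records
sampled_fitness = evaluate(sampled, population)  # fitness on a 10% sample
```

On a real cluster the list of partitions would be replaced by a distributed dataset and the individuals by evolved GP trees; the sketch only shows why the evaluation decomposes into independent per-partition counts that can be summed.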

Notes

1. ‘A data lake is a collection of storage instances of various data assets additional to the originating data sources.’ (Source: Gartner).
2. https://hadoop.apache.org
3. https://spark.apache.org

Author information

Authors and Affiliations

Faculté des Sciences de Tunis, LR11ES14 LIPAH, Université de Tunis El Manar, 2092, Tunis, Tunisia
Hmida Hmida & Amel Borgi
Université Paris Dauphine, PSL Research University, CNRS, UMR 7243, LAMSADE, 75016, Paris, France
Hmida Hmida, Sana Ben Hamida & Marta Rukoz

Corresponding author

Correspondence to Hmida Hmida.

Editor information

Editors and Affiliations

Poznań University of Economics and Business, Poznań, Poland
Witold Abramowicz
ETSI Informática, University of Seville, Seville, Spain
Rafael Corchuelo

About this paper

Cite this paper

Hmida, H., Ben Hamida, S., Borgi, A., Rukoz, M. (2019). Genetic Programming over Spark for Higgs Boson Classification. In: Abramowicz, W., Corchuelo, R. (eds) Business Information Systems. BIS 2019. Lecture Notes in Business Information Processing, vol 353. Springer, Cham. https://doi.org/10.1007/978-3-030-20485-3_23

DOI: https://doi.org/10.1007/978-3-030-20485-3_23
Published: 18 May 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20484-6
Online ISBN: 978-3-030-20485-3
eBook Packages: Computer Science, Computer Science (R0)
