Abstract
The amount of data available for data mining and knowledge discovery continues to grow rapidly in the era of Big Data. Genetic Programming (GP) algorithms, which are efficient machine learning techniques, face a new challenge: dealing with the sheer volume of data provided. Active Sampling, already used for Active Learning, might be a good solution for improving the training of Evolutionary Algorithms (EA) on very large data sets. This paper presents a review of sampling techniques already used with active GP learners and discusses their ability to improve GP training on very large data sets. One method from each sampling strategy is implemented and applied to the KDD intrusion detection problem using very similar parameters. Experimental results show that the sampling methods yield better results than those obtained with the full dataset, but some of them cannot be scaled to large datasets.
Notes
1. See the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Hmida, H., Hamida, S.B., Borgi, A., Rukoz, M. (2017). Sampling Methods in Genetic Programming Learners from Large Datasets: A Comparative Study. In: Angelov, P., Manolopoulos, Y., Iliadis, L., Roy, A., Vellasco, M. (eds) Advances in Big Data. INNS 2016. Advances in Intelligent Systems and Computing, vol 529. Springer, Cham. https://doi.org/10.1007/978-3-319-47898-2_6
DOI: https://doi.org/10.1007/978-3-319-47898-2_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-47897-5
Online ISBN: 978-3-319-47898-2
eBook Packages: Engineering, Engineering (R0)