
GP Classification under Imbalanced Data sets: Active Sub-sampling and AUC Approximation

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 4971)

Abstract

The problem of evolving binary classification models under increasingly unbalanced data sets is approached through a strategy with two components: sub-sampling and ‘robust’ fitness function design. In particular, recent work in the wider machine learning literature has recognized that retaining the original distribution of exemplars during training is often inappropriate for designing classifiers that are robust to degenerate classifier behavior. To this end we propose a ‘Simple Active Learning Heuristic’ (SALH), in which a subset of exemplars is sampled with uniform probability under a class-balance enforcing rule for fitness evaluation. In addition, an efficient estimator of the Area Under the Curve (AUC) performance metric is adopted in the form of a modified Wilcoxon-Mann-Whitney (WMW) statistic. Performance is evaluated on six representative UCI data sets and benchmarked against canonical GP, SALH-based GP, SALH with the modified WMW statistic, and deterministic classifiers (Naive Bayes and C4.5). The resulting SALH-WMW model is demonstrated to be both efficient and effective at providing solutions that maximize performance assessed in terms of AUC.
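The two components described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names are invented for illustration, binary labels are assumed to be 1 (minority/positive) and 0 (majority/negative), and the AUC estimator shown is the plain pairwise WMW statistic rather than the paper's modified form.

```python
import random

def salh_sample(X, y, subset_size, rng=random):
    """Sketch of the Simple Active Learning Heuristic (SALH): draw a
    fitness-evaluation subset with uniform probability within each class,
    enforcing class balance regardless of the original distribution."""
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    per_class = subset_size // 2
    idx = (rng.sample(minority, min(per_class, len(minority)))
           + rng.sample(majority, min(per_class, len(majority))))
    rng.shuffle(idx)
    return [(X[i], y[i]) for i in idx]

def wmw_auc(pos_scores, neg_scores):
    """Plain Wilcoxon-Mann-Whitney estimate of AUC: the fraction of
    (positive, negative) score pairs ranked correctly, with ties
    counting one half."""
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos_scores) * len(neg_scores))
```

Using such a pairwise rank statistic as the fitness score rewards correct relative ordering of the two classes, which is exactly what AUC measures and is insensitive to class imbalance, unlike raw accuracy.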



Author information

Authors: Doucette, J., Heywood, M.I.

Editor information

Michael O’Neill, Leonardo Vanneschi, Steven Gustafson, Anna Isabel Esparcia-Alcázar, Ivanoe De Falco, Antonio Della Cioppa, Ernesto Tarantino


Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Doucette, J., Heywood, M.I. (2008). GP Classification under Imbalanced Data sets: Active Sub-sampling and AUC Approximation. In: O’Neill, M., et al. Genetic Programming. EuroGP 2008. Lecture Notes in Computer Science, vol 4971. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78671-9_23


  • DOI: https://doi.org/10.1007/978-3-540-78671-9_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-78670-2

  • Online ISBN: 978-3-540-78671-9

  • eBook Packages: Computer Science (R0)
