
GP Classification under Imbalanced Data sets: Active Sub-sampling and AUC Approximation

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 4971)

Abstract

The problem of evolving binary classification models under increasingly unbalanced data sets is approached through a strategy with two components: sub-sampling and ‘robust’ fitness function design. In particular, recent work in the wider machine learning literature has recognized that retaining the original distribution of exemplars during training is often inappropriate for designing classifiers that are robust to degenerate classifier behavior. To this end we propose a ‘Simple Active Learning Heuristic’ (SALH), in which a subset of exemplars is sampled with uniform probability under a class-balance enforcing rule for fitness evaluation. In addition, an efficient estimator of the Area Under the Curve (AUC) performance metric is adopted in the form of a modified Wilcoxon-Mann-Whitney (WMW) statistic. Performance is evaluated on six representative UCI data sets and benchmarked against canonical GP, SALH-based GP, SALH with the modified WMW statistic, and deterministic classifiers (Naive Bayes and C4.5). The resulting SALH-WMW model is demonstrated to be both efficient and effective at providing solutions that maximize performance assessed in terms of AUC.
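The two components described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names are invented for illustration, binary labels are assumed to be 1 (minority/positive) and 0 (majority/negative), and the AUC estimator shown is the plain pairwise WMW statistic rather than the paper's modified form.

```python
import random

def salh_sample(X, y, subset_size, rng=random):
    """Sketch of the Simple Active Learning Heuristic (SALH): draw a
    fitness-evaluation subset with uniform probability within each class,
    enforcing class balance regardless of the original distribution."""
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    per_class = subset_size // 2
    idx = (rng.sample(minority, min(per_class, len(minority)))
           + rng.sample(majority, min(per_class, len(majority))))
    rng.shuffle(idx)
    return [(X[i], y[i]) for i in idx]

def wmw_auc(pos_scores, neg_scores):
    """Plain Wilcoxon-Mann-Whitney estimate of AUC: the fraction of
    (positive, negative) score pairs ranked correctly, with ties
    counting one half."""
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos_scores) * len(neg_scores))
```

Using such a pairwise rank statistic as the fitness score rewards correct relative ordering of the two classes, which is exactly what AUC measures and is insensitive to class imbalance, unlike raw accuracy.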



Author information

Authors: Doucette, J., Heywood, M.I.

Editor information

Michael O’Neill, Leonardo Vanneschi, Steven Gustafson, Anna Isabel Esparcia-Alcázar, Ivanoe De Falco, Antonio Della Cioppa, Ernesto Tarantino


Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Doucette, J., Heywood, M.I. (2008). GP Classification under Imbalanced Data sets: Active Sub-sampling and AUC Approximation. In: O’Neill, M., et al. Genetic Programming. EuroGP 2008. Lecture Notes in Computer Science, vol 4971. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78671-9_23


  • DOI: https://doi.org/10.1007/978-3-540-78671-9_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-78670-2

  • Online ISBN: 978-3-540-78671-9

  • eBook Packages: Computer Science (R0)
