
GM4OS: An Evolutionary Oversampling Approach for Imbalanced Binary Classification Tasks

  • Conference paper
Applications of Evolutionary Computation (EvoApplications 2024)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14634))

Abstract

Imbalanced datasets pose a significant and longstanding challenge to machine learning algorithms, particularly in binary classification tasks. Over the past few years, various solutions have emerged, with substantial focus on the automated generation of synthetic observations for the minority class, a technique known as oversampling. Among these approaches, the Synthetic Minority Oversampling Technique (SMOTE) has garnered considerable attention as a highly promising method. SMOTE generates new observations by creating points along the line segment connecting two existing minority-class observations. Nevertheless, the performance of SMOTE frequently hinges on the specific selection of the observation pairs used for resampling. This research introduces Genetic Methods for OverSampling (GM4OS), a novel oversampling technique that addresses this challenge. In GM4OS, individuals are represented as pairs of objects: the first is a GP-like function operating on vectors, while the second is a GA-like genome containing pairs of minority-class observations. By co-evolving these two elements, GM4OS simultaneously searches for the most suitable resampling pairs and the most effective oversampling function. Experimental results, obtained on ten imbalanced binary classification problems, show that GM4OS consistently outperforms, or yields results at least comparable to, linear regression used alone and linear regression combined with SMOTE.
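To make the mechanisms described in the abstract concrete, the minimal Python sketch below illustrates a SMOTE-style interpolation between two minority-class observations and a GM4OS-style individual that couples a vector-combining function with a genome of resampling pairs. The Individual class, the combine_fn and pair_indices names, and the toy data are illustrative assumptions, not the authors' implementation.

import numpy as np

def smote_interpolate(x_a, x_b, rng):
    # SMOTE-style synthesis: a random point on the line segment between
    # two minority-class observations x_a and x_b.
    lam = rng.uniform(0.0, 1.0)
    return x_a + lam * (x_b - x_a)

# Hypothetical GM4OS-style individual: (i) a GP-like function that maps two
# parent vectors to a new vector, and (ii) a GA-like genome listing which
# minority-class pairs to resample. The representation is an assumption
# made only for illustration.
class Individual:
    def __init__(self, combine_fn, pair_indices):
        self.combine_fn = combine_fn      # GP-like vector function
        self.pair_indices = pair_indices  # GA-like genome: list of (i, j)

    def oversample(self, X_minority):
        # Apply the evolved function to every pair selected by the genome.
        synthetic = [self.combine_fn(X_minority[i], X_minority[j])
                     for i, j in self.pair_indices]
        return np.vstack(synthetic)

# Usage: three synthetic minority observations from a toy 5x4 minority set.
rng = np.random.default_rng(0)
X_min = rng.normal(size=(5, 4))
ind = Individual(lambda a, b: smote_interpolate(a, b, rng),
                 pair_indices=[(0, 1), (2, 4), (1, 3)])
X_new = ind.oversample(X_min)  # shape (3, 4)

In a full co-evolutionary loop, the fitness of such an individual would presumably be measured by training a classifier on the original data augmented with the synthetic observations and scoring it on a validation split; only the representation and the oversampling step are sketched here.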

Acknowledgments

This work was supported by national funds through FCT (Fundação para a Ciência e a Tecnologia), under the project - UIDB/04152/2020 - Centro de Investigação em Gestão de Informação (MagIC)/NOVA IMS.

Author information

Corresponding author

Correspondence to Davide Farinati.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Farinati, D., Vanneschi, L. (2024). GM4OS: An Evolutionary Oversampling Approach for Imbalanced Binary Classification Tasks. In: Smith, S., Correia, J., Cintrano, C. (eds) Applications of Evolutionary Computation. EvoApplications 2024. Lecture Notes in Computer Science, vol 14634. Springer, Cham. https://doi.org/10.1007/978-3-031-56852-7_5

  • DOI: https://doi.org/10.1007/978-3-031-56852-7_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-56851-0

  • Online ISBN: 978-3-031-56852-7

  • eBook Packages: Computer Science, Computer Science (R0)
