Genetic programming for high-dimensional imbalanced classification with a new fitness function and program reuse mechanism

  • Methodologies and Application
  • Soft Computing

Abstract

Genetic programming (GP) has been successfully applied to classification. However, GP may evolve biased classifiers when the classes are imbalanced, and such classifiers are often not reliable enough for some real-world applications. High dimensionality makes it even more difficult for classifiers to separate the majority class from the minority class effectively. The use of GP to handle the joint effect of high dimensionality and class imbalance has not been extensively investigated. In this paper, we propose a GP approach to high-dimensional imbalanced classification, with the goals of increasing classification performance and saving training time. To achieve these goals, a new fitness function is developed to address class imbalance, and a strategy is proposed to reuse previously evolved good GP individuals to improve efficiency. The proposed method is examined on ten high-dimensional imbalanced datasets. Experimental results show that, for high-dimensional imbalanced classification, the proposed method generally outperforms other GP methods and traditional classification algorithms that use sampling methods to address class imbalance.

Notes

  1. http://www.gems-system.org; https://schlieplab.org/Static/Supplements/CompCancer/datasets.htm.

Acknowledgements

This work was supported in part by the Marsden Fund of the New Zealand Government under contracts VUW1509 (Mengjie Zhang) and VUW1615 (Bing Xue and Mengjie Zhang), the Science for Technological Innovation Challenge (SfTI) fund under grant E3603/2903 (Mengjie Zhang and Bing Xue), the University Research Fund at Victoria University of Wellington under grants 216378/3764 and 223805/3986 (Bing Xue and Mengjie Zhang), the MBIE Data Science SSIF Fund under contract RTVU1914 (Mengjie Zhang and Bing Xue), and the National Natural Science Foundation of China (NSFC) under grants 61876169 (Bing Xue) and 61672276 (Lin Shang). Wenbin Pei was supported by a China Scholarship Council/Victoria University Scholarship.

Author information

Corresponding author

Correspondence to Wenbin Pei.

Ethics declarations

Conflict of interest

All authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Pei, W., Xue, B., Shang, L. et al. Genetic programming for high-dimensional imbalanced classification with a new fitness function and program reuse mechanism. Soft Comput 24, 18021–18038 (2020). https://doi.org/10.1007/s00500-020-05056-7

  • DOI: https://doi.org/10.1007/s00500-020-05056-7
