Evolving ensembles using multi-objective genetic programming for imbalanced classification
Introduction
Classifying imbalanced data remains a major challenge in machine learning, with applications in medical diagnosis, fraud detection, and text categorization [1], [2], [3], [4], [5], [6]. A dataset is imbalanced when at least one class is rare (a minority class) while the remaining classes account for most of the samples (majority classes).
Although classification approaches have shown great ability in machine learning tasks, they still struggle with imbalanced training data because learning is biased toward the majority class [7]. Re-weighting or sampling strategies are usually employed to address imbalanced classification problems, but they have deficiencies: for example, the weights need to be set manually, and the contribution of re-sampled data is weak. To our knowledge, genetic programming (GP) has the advantage of learning from imbalanced data. GP is an evolutionary computing technique based on the principles of evolution and natural selection, and it has been shown to solve a range of real-world classification problems [8], [9]. GP-based methods use the raw imbalanced training data in the learning process without the need to manually rebalance the class distributions. Some studies suggest that evolutionary methods outperform non-evolutionary models on imbalanced datasets [10]. Compared with traditional sampling-based methods, GP has two major advantages. First, it can use raw imbalanced data in training without prior sampling. Second, compared with a single classifier, the combined knowledge of the evolved classifiers can be used collaboratively in an ensemble to achieve better generalization, which provides an effective way to study complex real-world classification problems.
This motivated us to combine a multi-objective GP framework with efficient many-objective evolutionary algorithms (MaOEAs) to optimize multiple objectives such as classifier accuracy, classifier size, and the size of the solution tree, thereby evolving the GP classifier set, and further improving the overall algorithm performance.
In this paper, a multi-objective genetic programming (MGP) based algorithm is designed for high-performance imbalanced classification. It is combined with an efficient MaOEA and a weighted ensemble decision strategy to evolve precise and diverse classifiers. Its performance is compared with standard (single prediction) GP and several competitive imbalanced classification methods. The main contributions are as follows:
(1) An efficient evolutionary strategy is integrated into MGP to optimize the false positive rate and the false negative rate while reducing the size of the solution tree. Fast nondominated sorting, environmental selection, and archiving mechanisms are adopted to implement a novel algorithm MGP+.
(2) Inspired by the ability of ensemble learning to improve the performance of classifiers, a comprehensive weighted ensemble decision is made based on the MGP+ framework, which is termed WMGP+.
(3) Experimental results show that WMGP+, which uses MGP to evolve classifiers and combines them with a weighted ensemble decision strategy, performs better than other competitive algorithms on both binary-class and multi-class imbalanced datasets.
The rest of this paper is structured as follows. Existing imbalanced classification algorithms are reviewed in Section 2. Section 3 summarizes MGP+, which evolves the base classifiers. Section 4 presents details of WMGP+. Section 5 shows the experimental results and related comparisons. Section 6 concludes this paper and suggests future directions.
Section snippets
Overview of related work for class imbalance
Research on imbalanced classification algorithms can be roughly divided into two categories: data-level methods and algorithm-level methods. Data resampling is the most commonly used data-level method, including under-sampling and over-sampling algorithms. The former includes RU (random under-sampling), NCL (neighborhood cleaning rule) [11], and WU-SVM (weighted under-sampling support vector machine) [18]. The latter includes ADASYN (adaptive synthetic sampling approach) [12], SMOTE (small
MGP+
In imbalanced classification, it is more desirable to minimize the false positive rate and the false negative rate than to minimize the overall classification error. To achieve this, we use an MGP framework and optimize three objectives: (1) the false positive rate, (2) the false negative rate, and (3) the number of leaf nodes of the resulting tree. The goal is to improve classification accuracy while reducing the complexity of the final model. In the MGP framework, the three learning objectives conflict with each other, where a set of
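As a rough illustration of the three objectives, the following Python sketch computes the two error rates for a binary problem and keeps only the candidates that no other candidate dominates. It is not the authors' implementation: the function names, the convention that label 1 is the minority class, and the candidate values are all assumptions made for this example, and the simple pairwise filter stands in for the fast nondominated sorting used in MGP+.

```python
from typing import List, Sequence, Tuple


def error_rates(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (FPR, FNR), assuming label 1 is the minority (positive) class."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly better
    in at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def nondominated(points: Sequence[Sequence[float]]) -> List[int]:
    """Indices of points that no other point dominates (simple O(n^2) filter)."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Eight majority samples (0) and two minority samples (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]
print(error_rates(y_true, y_pred))  # (0.125, 0.5)

# Hypothetical candidates as (FPR, FNR, number of leaf nodes).
candidates = [
    (0.125, 0.5, 7),
    (0.25, 0.0, 15),
    (0.25, 0.5, 9),   # dominated by the first candidate
    (0.0, 1.0, 1),    # tiny tree that always predicts the majority class
]
print(nondominated(candidates))  # [0, 1, 3]
```

The last candidate shows why the objectives conflict: a one-leaf tree that always predicts the majority class has the best false positive rate and size but the worst false negative rate, so it survives the filter alongside the more accurate, larger trees.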
Weighted ensemble decision
After a group of optimized base classifiers is generated by MGP+, we adopt a weighted ensemble decision strategy to obtain final classification results.
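A weighted ensemble decision of this kind can be sketched in Python as follows. The snippet above does not give the actual weighting formula, so this example assumes each classifier's weight is its balanced accuracy, 1 - (FPR + FNR) / 2; the function names and the numbers are hypothetical and only illustrate the general idea of weighted voting.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def performance_weight(fpr: float, fnr: float) -> float:
    """Assumed weighting rule: balanced accuracy of a base classifier."""
    return 1.0 - (fpr + fnr) / 2.0


def weighted_vote(predictions: Sequence[Hashable], weights: Sequence[float]) -> Hashable:
    """Return the class label with the largest total weight.

    predictions[i] is the label predicted by base classifier i for one sample,
    and weights[i] is that classifier's weight. If totals tie, the label that
    was seen first wins.
    """
    if len(predictions) != len(weights):
        raise ValueError("predictions and weights must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


# Three hypothetical base classifiers: one strong, two weak.
weights = [0.9, 0.4, 0.4]
# The strong classifier predicts the minority class (1); the weak ones predict 0.
print(weighted_vote([1, 0, 0], weights))  # 1, because 0.9 > 0.4 + 0.4
print(performance_weight(0.125, 0.5))     # 0.6875
```

In this example a plain majority vote would return class 0, whereas the weighted vote follows the single stronger classifier, which is the behaviour a performance-based weighting is meant to provide.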
Experiments and results
In this section, experiments are performed to test the proposed WMGP+. All experiments are conducted on a computer with an Intel Xeon(R) Gold 6130 CPU @ 2.10 GHz, 16 GB of memory, and the Ubuntu 16.04 operating system. The experiments are implemented in Python 3.6.13. We first introduce the benchmark datasets from the University of California, Irvine (UCI) repository, the experimental settings, and the evaluation metrics used for imbalanced class learning. Then, experimental results of
Conclusion
This paper presents a novel algorithm called WMGP+ to solve imbalanced classification problems by using multi-objective genetic programming and weighted ensemble decision making. In the process of population evolution, both population diversity and convergence are improved. WMGP+ determines the weight of each classifier according to its performance and makes a comprehensive weighted ensemble decision to obtain high-quality classification results. Based on the experimental results of binary classes and
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (51775385, 61703279); in part by the Strategy Research Project of Artificial Intelligence Algorithms of Ministry of Education of China (000011); in part by the Shanghai Industrial Collaborative Science and Technology Innovation Project, China (2021-cyxt2-kj10); in part by the Shanghai Municipal Science and Technology Major Project, China (2021SHZDZX0100); in part by the Science and Technology Project of Suzhou,
References (59)
- et al. Machine learning based mobile malware detection using highly imbalanced network traffic. Inform. Sci. (2018)
- et al. A cluster-based oversampling algorithm combining SMOTE and k-means for imbalanced medical data. Inform. Sci. (2021)
- et al. Multi-view ensemble learning based on distance-to-model and adaptive clustering for imbalanced credit risk assessment in P2P lending. Inform. Sci. (2020)
- et al. Strategies for learning in class imbalance problems. Pattern Recognit. (2003)
- et al. Affinity and class probability-based fuzzy support vector machine for imbalanced data sets. Neural Netw. (2020)
- et al. Self-organizing map oversampling (SOMO) for imbalanced data set learning. Expert Syst. Appl. (2017)
- et al. Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Inform. Sci. (2018)
- et al. An online fault detection model and strategies based on SVM-grid in clouds. IEEE/CAA J. Autom. Sin. (2018)
- et al. Geometric structural ensemble learning for imbalanced problems. IEEE Trans. Cybern. (2020)
- et al. Joint imbalanced classification and feature selection for hospital readmissions. Knowl.-Based Syst. (2020)
- Genetic programming-based discriminative feature learning for low-quality image classification. IEEE Trans. Cybern. Early Access
- Genetic programming with image-related operators and a flexible program structure for feature learning in image classification. IEEE Trans. Evol. Comput.
- Combined weighted multi-objective optimizer for instance reduction in two-class imbalanced data problem. Eng. Appl. Artif. Intell.
- Fuzzy support vector machines. IEEE Trans. Neural Netw.
- Entropy-based hybrid sampling ensemble learning for imbalanced data. Int. J. Intell. Syst.
- SMOTE: Synthetic minority over-sampling technique. J. Artificial Intelligence Res.
- Fuzzy support vector machine with relative density information for classifying imbalanced data. IEEE Trans. Fuzzy Syst.
- A weighted hybrid ensemble method for classifying imbalanced data. Knowl.-Based Syst.
- A distance-based weighted under sampling scheme for support vector machines and its application to imbalanced classification. IEEE Trans. Neural Netw. Learn. Syst.
- Borderline over-sampling for imbalanced data classification. Int. J. Knowl. Eng. Soft Data Paradig.
- Sample and feature selecting based ensemble learning for imbalanced problems. Appl. Soft Comput.
- Genetic programming for development of cost-sensitive classifiers for binary high-dimensional unbalanced classification. Appl. Soft Comput.
- Unbalanced breast cancer data classification using novel fitness functions in genetic programming. Expert Syst. Appl.
- A novel fitness function in genetic programming for medical data classification. J. Biomed. Inform.
- Evolving diverse ensembles using genetic programming for classification with unbalanced data. IEEE Trans. Evol. Comput.
- Reusing genetic programming for ensemble selection in classification of unbalanced data. IEEE Trans. Evol. Comput.