Genetic programming for development of cost-sensitive classifiers for binary high-dimensional unbalanced classification
Introduction
In unbalanced classification, the number of instances per class is disproportionate. Class imbalance is a common issue in many real-world applications, such as medical diagnosis, bioinformatics and fault diagnostics [1]. In these applications, the minority class is usually more important than the majority class. However, many classification algorithms are biased towards the majority class, thereby encountering a performance bias issue [2]. Moreover, because of overlapping areas and within-class imbalance, it is usually difficult for classifiers to correctly discriminate the boundary between the majority class and the minority class [3], [4], [5].
Many classification tasks suffer from both high dimensionality and class imbalance. Undoubtedly, high dimensionality makes it more difficult to effectively solve the problem of class imbalance [6]. In many high-dimensional unbalanced datasets, neither the majority class nor the minority class has a sufficient number of instances. Because of this data sparsity, it is often difficult to develop a classifier that generalizes the data characteristics well. A popular remedy is feature selection, which selects a minimal subset of features that is necessary and sufficient to describe the target labels [7], [8]. However, in high-dimensional classification, feature selection is a difficult task because of the large search space and feature interactions [9]. Moreover, class imbalance makes it even harder to select informative features that improve both the true positive rate (also called sensitivity) and the true negative rate (also called specificity). If a feature selection approach does not address class imbalance, the selected features are often biased towards the majority class. As a consequence, classifiers using these selected features are more likely to be biased towards the majority class.
Genetic programming (GP) [10] can automatically evolve computer programs as solutions to a problem, and it has shown its effectiveness in classification. Usually, an individual (also called a tree or program) is regarded as a classifier. The built-in feature selection ability of GP makes it useful for constructing classifiers on high-dimensional data. However, like other classification algorithms, the classification performance of GP is negatively influenced by class imbalance [2]. This is because many GP methods use the overall classification accuracy as the fitness function, which treats all instances equally. Accordingly, the majority class contributes more to improving the overall classification accuracy than the minority class, causing the constructed classifiers to be biased towards the majority class.
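As a minimal illustration (not the paper's implementation), a GP individual can be viewed as an arithmetic expression over feature values whose output sign assigns the class. The expression below is a made-up "evolved" individual; note that any feature it does not reference is implicitly deselected, which is the built-in feature selection mentioned above:

```python
# Sketch: a GP "tree" acting as a binary classifier (illustrative only).
# The expression is a hypothetical evolved individual over features
# x[0], x[1] and x[3]; feature x[2] is implicitly deselected.

def evolved_program(x):
    # Hypothetical evolved arithmetic expression
    return (x[0] - x[3]) * 0.5 + x[1]

def classify(x):
    # A common GP convention: non-negative output -> minority (positive) class
    return 1 if evolved_program(x) >= 0 else 0

print(classify([0.9, 0.2, 0.7, 0.1]))   # -> 1 (x[2] plays no role)
print(classify([0.0, -1.0, 0.0, 0.5]))  # -> 0
```

With an accuracy-based fitness, an individual like this is rewarded equally for every correct instance, which is exactly why the majority class dominates the search.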
Cost-sensitive learning [11] is one of the most important methods for addressing class imbalance: it treats different types of error differently. Cost-sensitive learning can be combined with classification algorithms to make them sensitive to different types of misclassification. However, most existing cost-sensitive algorithms work with a cost matrix that is typically supplied by domain experts. Unfortunately, in many cases, experts may find it difficult to provide exact cost values because of a lack of domain or specialized knowledge about the actual situation. Moreover, different experts may hold different opinions when evaluating the same mistake. Therefore, in many real-world applications, the misclassification cost values are unknown [12].
Without cost information from domain experts, a simple approach is to use the class imbalance ratio as the cost information [13]. However, this approach is often criticized, mainly because it is over-simplified and ignores data characteristics [13]. Moreover, it assumes a direct relationship between class imbalance and cost sensitivity, which does not always hold. Many existing cost-sensitive algorithms instead use trial and error to determine misclassification costs, which incurs additional computation and may not lead to an optimal solution [12]. Therefore, it is essential to investigate how cost-sensitive classifiers can be learned automatically.
The goal of this study is to develop a cost-sensitive GP method, which does not require manually-designed cost matrices and can achieve good classification performance for high-dimensional unbalanced classification. In order to achieve this goal, there are three sub-goals:
Investigate the use of cost-sensitive learning with GP when the cost information is unknown,
Develop a GP method to construct cost-sensitive classifiers, where a cost matrix is automatically learned, and
Investigate whether the proposed method can achieve at least similar performance, compared with other methods.
This study focuses on binary classification for the following reasons. First, many real-world applications involving unbalanced data are binary classification tasks. Second, binary classification is still very challenging when the data is both high-dimensional and unbalanced. Third, a multi-class classification task can be decomposed into multiple binary classification tasks. Hence, we believe high-dimensional unbalanced binary classification is still worth investigating, but we are also open to addressing multi-class classification problems in the future.
Cost-sensitive learning for unbalanced classification
Classification algorithms may ignore the minority class if they assume the same misclassification cost (or loss) for different mistakes in unbalanced classification. However, in many domains, different mistakes cause different losses. For example, in medical diagnosis, consider the two possible mistakes:
A patient without cancer is misdiagnosed as a cancer patient;
A cancer patient is misdiagnosed as a patient without cancer.
The first misdiagnosis is definitely troublesome because it wastes
A cost matrix
A cost matrix is used to indicate costs of correct and incorrect predictions. In binary classification, a class-dependent cost matrix is shown in Table 1 [11].
The minority class and the majority class are regarded as the positive set and the negative set, respectively. In Table 1, C_FN denotes the cost of a false negative, and C_FP denotes the cost of a false positive. C_FN is greater than or equal to C_FP, reflecting that the misclassification cost of the minority class is greater than or equal to that of the majority class.
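A class-dependent cost matrix like the one in Table 1 can be used to score predictions by their total misclassification cost rather than by error count. The sketch below assumes illustrative cost values (the symbols C_FN and C_FP follow the common notation; the numbers are made up):

```python
# Sketch of using a class-dependent cost matrix (cf. Table 1).
# Assumed illustrative costs, with C_FN >= C_FP as required.
C_FN = 5.0   # cost of a false negative (minority instance misclassified)
C_FP = 1.0   # cost of a false positive (majority instance misclassified)

def total_cost(y_true, y_pred):
    # y = 1: minority (positive) class; y = 0: majority (negative) class
    cost = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 0:
            cost += C_FN  # false negative
        elif t == 0 and p == 1:
            cost += C_FP  # false positive
    return cost

print(total_cost([1, 1, 0, 0], [0, 1, 1, 0]))  # one FN + one FP -> 6.0
```

Under this scoring, a classifier that sacrifices a few majority-class instances to recover minority-class instances is rewarded, which is the intended effect of cost-sensitive learning.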
The proposed method
In this section, we introduce the proposed method, named Cost-Sensitive Genetic Programming (CS-GP).
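CS-GP evolves a tree whose left sub-tree represents a classifier and whose right sub-tree represents a cost value used when evaluating that classifier. As a rough illustration of how an evolved cost value could enter the fitness evaluation (the function name and weighting below are assumptions, not the exact CS-GP fitness function):

```python
# Illustrative cost-sensitive fitness: weight minority-class errors (false
# negatives) by an evolved cost value c. This is a sketch of the idea only,
# not the exact CS-GP formulation.
def fitness(y_true, y_pred, c):
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Lower weighted misclassification cost -> higher fitness
    return -(c * fn + fp)

print(fitness([1, 0, 1, 0], [0, 0, 1, 1], c=4.0))  # -(4*1 + 1) = -5.0
```

Because c is part of the evolved individual, the search can tune the classifier and the cost value jointly instead of requiring a hand-designed cost matrix.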
Datasets
In the experiments, we used ten gene expression datasets1 to examine and investigate the effectiveness of CS-GP. Gene expression datasets often have thousands of features, and many of them suffer from class imbalance. The details of the ten datasets are shown in Table 3. The class imbalance ratio (IR) [12] is the number of instances in the majority class divided by the number of instances in the minority class.
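The imbalance ratio is computed directly from the class counts; a minimal sketch:

```python
from collections import Counter

def imbalance_ratio(labels):
    # IR = size of the majority class / size of the minority class
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Hypothetical binary dataset: 90 majority (0) vs. 10 minority (1) instances
print(imbalance_ratio([0] * 90 + [1] * 10))  # -> 9.0
```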
Results and discussions
Table 6 reports AUC results of CS-GP and the baseline GP methods on the test sets. The Wilcoxon statistical significance test [35] is conducted to compare CS-GP with each baseline GP method, at a significance level of 0.05. In Table 6, "+", "=" and "-" indicate that CS-GP is significantly better than, similar to, and significantly worse than a baseline method, respectively.
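A per-dataset comparison of this kind can be sketched with `scipy.stats.wilcoxon`; the AUC values below are made up for illustration and do not come from Table 6:

```python
from scipy.stats import wilcoxon

# Hypothetical per-run AUC values for two methods on one dataset
# (the paper uses 30 independent runs; a handful shown here).
auc_cs_gp    = [0.91, 0.88, 0.93, 0.90, 0.89, 0.92, 0.87, 0.94]
auc_baseline = [0.85, 0.84, 0.88, 0.83, 0.86, 0.87, 0.82, 0.85]

stat, p_value = wilcoxon(auc_cs_gp, auc_baseline)
if p_value < 0.05:
    print("significant difference at the 0.05 level")
else:
    print("no significant difference")
```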
Further analysis on the evolved cost values
Fig. 4 shows the evolved cost values (for the minority class) on the ten datasets over the 30 independent runs. The vertical axis shows the 30 cost values, each of which is evolved by the best individual (its right sub-tree) from the final generation of a run. The horizontal axis shows in which run (from 1 to 30) each cost value is evolved.
On Armstrong-2002-v1, most of the evolved cost values fall in the interval (1, 10), which is similar to those evolved for Leukemia and
Conclusion and future work
This paper designs a new GP method (i.e., CS-GP) to construct classifiers and learn cost values automatically and simultaneously when the needed cost information is unknown. In CS-GP, a new tree representation and new terminal and function sets have been developed. In the evolved tree, the cost value represented by the right sub-tree is used by the classifier represented by the left sub-tree during evaluation, making the classifier sensitive to different classification mistakes. Therefore, CS-GP is
CRediT authorship contribution statement
Wenbin Pei: Investigation, Conceptualization, Methodology, Software, Writing - original draft. Bing Xue: Supervision, Methodology, Writing - review & editing. Lin Shang: Supervision, Writing - review & editing. Mengjie Zhang: Supervision, Methodology, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported in part by the Marsden Fund of New Zealand government under contracts VUW1509 and VUW1615, the Science for Technological Innovation Challenge (SfTI) fund under grant E3603/2903, the University Research Fund at Victoria University of Wellington (grant number 223805/3986), MBIE Data Science SSIF Fund under the contract RTVU1914, and National Natural Science Foundation of China (NSFC), under grant 61876169, 61672276 and 51975294. Wenbin Pei was supported by China
References (35)
- et al., Addressing class imbalance in deep learning for small lesion detection on medical images, Comput. Biol. Med. (2020)
- et al., Cost-sensitive support vector machines, Neurocomputing (2019)
- et al., A hierarchical genetic fuzzy system based on genetic programming for addressing classification with highly imbalanced and borderline data-sets, Knowl.-Based Syst. (2013)
- et al., Developing new fitness functions in genetic programming for classification with unbalanced data, IEEE Trans. Syst. Man Cybern. B (2012)
- et al., A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches, IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) (2012)
- et al., Cost-sensitive learning of deep feature representations from imbalanced data, IEEE Trans. Neural Netw. Learn. Syst. (2017)
- Dealing with data difficulty factors while learning from imbalanced data
- et al., Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. (2008)
- et al., An introduction to variable and feature selection, J. Mach. Learn. Res. (2003)
- et al., Genetic programming for feature construction and selection in classification on high-dimensional data, Memet. Comput. (2016)
- A survey on evolutionary computation approaches to feature selection, IEEE Trans. Evol. Comput.
- A field guide to genetic programming
- The foundations of cost-sensitive learning
- A cost-sensitive deep belief network for imbalanced classification, IEEE Trans. Neural Netw. Learn. Syst.
- Cost-sensitive learning