Genetic programming for development of cost-sensitive classifiers for binary high-dimensional unbalanced classification

https://doi.org/10.1016/j.asoc.2020.106989

Highlights

  • This paper investigates the use of cost-sensitive learning with genetic programming (GP) when the cost matrix is unknown.

  • This paper proposes a new GP method to automatically develop cost-sensitive classifiers.

  • The proposed method is independent of manually designed cost matrices.

  • We show that cost-sensitive learning can help GP to solve its performance bias issue.

Abstract

Genetic programming (GP) has a built-in ability for feature selection when developing classifiers for classification with high-dimensional data. However, due to class imbalance, classifiers developed by GP are prone to bias towards the majority class. Cost-sensitive learning has been shown to be effective in addressing class imbalance. In cost-sensitive learning, cost matrices are often manually designed and then used by classification algorithms to treat different mistakes differently. However, in many real-world applications, cost matrices are unknown because of limited domain knowledge in complex situations. Therefore, in this paper, we propose a novel GP method to develop cost-sensitive classifiers, where a cost matrix is automatically learned instead of being required from domain experts. The proposed method is examined and compared with existing methods on ten high-dimensional unbalanced datasets. Experimental results show that the proposed method outperforms the compared GP methods in most cases.

Introduction

In unbalanced classification, the number of instances per class is disproportionate. Class imbalance is a common issue in many real-world applications, such as medical diagnosis, bioinformatics and fault diagnostics [1]. In these applications, the minority class is usually more important than the majority class. However, many classification algorithms are biased towards the majority class, thereby encountering a performance bias issue [2]. Moreover, because of overlapping areas and within-class imbalance, it is usually difficult for classifiers to correctly discriminate the boundary between the majority class and the minority class [3], [4], [5].

There are many classification tasks suffering from both high dimensionality and class imbalance. Undoubtedly, high dimensionality makes it more difficult to effectively solve the problem of class imbalance [6]. In many high-dimensional unbalanced datasets, neither the majority class nor the minority class has a sufficient number of instances. Because of the sparsity of the data, it is often difficult to develop a classifier that generalizes the data characteristics effectively. Feature selection is a popular method for addressing this issue: it selects a minimal subset of features that is necessary and sufficient to describe the target labels [7], [8]. However, in high-dimensional classification, feature selection is difficult because of the large search space and feature interactions [9]. Moreover, class imbalance brings further difficulty in selecting informative features to improve the true positive rate (also called sensitivity) and the true negative rate (also called specificity). If a feature selection approach does not account for class imbalance, the selected features are often biased towards the majority class. As a consequence, classifiers using these selected features are more likely to be biased towards the majority class.

Genetic programming (GP) [10] can automatically evolve computer programs as solutions to a problem, and it has shown its effectiveness in classification. Usually, an individual (also called a tree or program) is regarded as a classifier. The built-in feature selection ability of GP makes it useful for constructing classifiers for classification with high-dimensional data. However, like other classification algorithms, the classification performance of GP is negatively influenced by class imbalance [2]. This is because many GP methods use overall classification accuracy as the fitness function, which treats all instances equally. Accordingly, the majority class contributes more to improving overall accuracy than the minority class, causing the constructed classifiers to be biased towards the majority class.
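The accuracy bias described above is easy to reproduce. The sketch below is illustrative only (not the paper's code): on a 90:10 dataset, a classifier that always predicts the majority class reaches 90% overall accuracy while its true positive rate is zero.

```python
# Illustrative only: overall accuracy rewards majority-class-only predictions
# on unbalanced data, which is why it is a poor GP fitness function here.

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# 90 majority-class (1) instances vs 10 minority-class (0) instances
y_true = [1] * 90 + [0] * 10
always_majority = [1] * 100      # a classifier that ignores the minority class

print(accuracy(y_true, always_majority))  # 0.9, yet sensitivity is 0
```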

Cost-sensitive learning [11] is one of the most important methods for addressing class imbalance; it treats different errors differently. Cost-sensitive learning can be combined with classification algorithms to make them sensitive to different types of misclassification. However, most existing cost-sensitive algorithms work with a cost matrix that must be supplied by domain experts. Unfortunately, in many cases, experts may find it difficult to provide exact cost values due to a lack of domain or specialized knowledge of the actual situation. Moreover, different experts may hold different opinions on the severity of the same mistake. Therefore, in many real-world applications, the misclassification cost values are unknown [12].

Without cost information from domain experts, a simple approach is to use the class imbalance ratio as the cost information [13]. However, this approach is often criticized as over-simplified, mainly because it ignores data characteristics [13]. Moreover, it assumes a direct relationship between class imbalance and cost sensitivity, which does not always hold. Many existing cost-sensitive algorithms instead determine misclassification costs by trial and error, which incurs additional computation and may not lead to an optimal solution [12]. Therefore, it is essential to investigate how cost-sensitive classifiers can be learned automatically.
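As a concrete illustration of the criticized baseline, the sketch below fixes the minority-class misclassification cost to the imbalance ratio (IR) and computes a weighted error. The function and values are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical baseline: use the class imbalance ratio (IR) directly as the
# cost of misclassifying a minority-class instance (Class0 in the paper).

def weighted_error(y_true, y_pred, minority=0):
    n_min = sum(1 for t in y_true if t == minority)
    n_maj = len(y_true) - n_min
    ir = n_maj / n_min               # imbalance ratio used as a fixed cost
    return sum((ir if t == minority else 1.0)
               for t, p in zip(y_true, y_pred) if t != p)

y_true = [1] * 90 + [0] * 10          # IR = 90 / 10 = 9
always_majority = [1] * 100
print(weighted_error(y_true, always_majority))  # 10 false negatives * 9 = 90.0
```

Note that the weighting is determined entirely by the class counts, regardless of how the classes actually overlap — the over-simplification the paragraph above criticizes.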

The goal of this study is to develop a cost-sensitive GP method, which does not require manually-designed cost matrices and can achieve good classification performance for high-dimensional unbalanced classification. In order to achieve this goal, there are three sub-goals:

  • (1)

    Investigate the use of cost-sensitive learning with GP when the cost information is unknown,

  • (2)

    Develop a GP method to construct cost-sensitive classifiers, where a cost matrix is automatically learned, and

  • (3)

    Investigate whether the proposed method can achieve at least similar performance, compared with other methods.

This study focuses on binary classification for the following reasons. First, many real-world applications involving unbalanced data are binary classification problems. Second, binary classification is still very challenging when the data is both high-dimensional and unbalanced. Third, a multi-class classification task can be tackled by decomposing it into multiple binary classification tasks. Hence, we believe high-dimensional unbalanced binary classification is still worth investigating, but we are also open to addressing multi-class classification problems in the future.

Section snippets

Cost-sensitive learning for unbalanced classification

Classification algorithms may ignore the minority class if they assume the same misclassification cost (or loss) for different mistakes in unbalanced classification. However, in many domains, different mistakes often cause different losses. For example, in medical diagnosis, consider two possible mistakes:

  • A patient without cancer is misdiagnosed as a cancer patient;

  • A cancer patient is misdiagnosed as a patient without cancer.

The first misdiagnosis is definitely troublesome because it wastes

A cost matrix

A cost matrix is used to indicate costs of correct and incorrect predictions. In binary classification, a class-dependent cost matrix is shown in Table 1 [11].

The minority class and the majority class are seen as the positive set (Class0) and the negative set (Class1), respectively. In Table 1, C10 is the cost of a false negative, and C01 is the cost of a false positive. C10 is greater than or equal to C01, to show that the misclassification cost of the minority class is greater than or equal
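Under the notation above (Class0 = minority/positive, Class1 = majority/negative, C10 = false-negative cost, C01 = false-positive cost), a total misclassification cost can be computed as sketched below. The concrete cost values are hypothetical: the paper learns them rather than fixing them.

```python
# Hypothetical class-dependent cost matrix, keyed by (actual, predicted).
# The values for C10 and C01 are made up for illustration, with C10 >= C01.
COSTS = {
    (0, 0): 0.0,   # minority correctly classified (true positive)
    (0, 1): 5.0,   # C10: false negative (hypothetical value)
    (1, 0): 1.0,   # C01: false positive (hypothetical value)
    (1, 1): 0.0,   # majority correctly classified (true negative)
}

def total_cost(y_true, y_pred):
    return sum(COSTS[(t, p)] for t, p in zip(y_true, y_pred))

# one false negative and one false positive
print(total_cost([0, 0, 1, 1], [1, 0, 0, 1]))  # 5.0 + 0.0 + 1.0 + 0.0 = 6.0
```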

The proposed method

In this section, we introduce the proposed method, named Cost-Sensitive Genetic Programming (CS-GP).
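CS-GP's core idea, as described later in the paper, is that each evolved tree couples a classifier (its left sub-tree) with a learned cost value (its right sub-tree) that weights errors during fitness evaluation. The sketch below is an assumed, simplified rendering of that evaluation; the thresholding rule and the way the cost is applied are illustrative guesses, not the authors' implementation.

```python
# Assumed sketch of CS-GP fitness evaluation: `outputs` are the numeric
# per-instance outputs of the classifier (left) sub-tree, `cost` the value
# evolved by the right sub-tree. An output >= 0 is read as the minority class
# (Class0), and false negatives are penalized by the learned cost.

def cost_sensitive_penalty(outputs, cost, y_true, minority=0):
    penalty = 0.0
    for out, actual in zip(outputs, y_true):
        predicted = minority if out >= 0 else 1 - minority
        if predicted != actual:
            # false negatives cost `cost`, false positives cost 1
            penalty += cost if actual == minority else 1.0
    return penalty

# one false negative (cost 5.0) and one false positive (cost 1.0)
print(cost_sensitive_penalty([-1.0, 1.0], 5.0, [0, 1]))  # 6.0
```

Because `cost` is part of the evolved individual, selection pressure tunes it together with the classifier, which is what removes the need for a manually designed cost matrix.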

Datasets

In the experiments, we used ten gene expression datasets1 to examine and investigate the effectiveness of CS-GP. Gene expression datasets often have thousands of features, and many of these datasets encounter the problem of class imbalance. The details of the ten datasets are shown in Table 3. Class imbalance ratio (IR) [12] is the number of instances in the majority class divided

Results and discussions

Table 6 reports the AUC results of CS-GP and the baseline GP methods on the test sets. The Wilcoxon statistical significance test [35] is conducted to compare CS-GP with each baseline GP method, with a significance level of 0.05. In Table 6, "+", "=" and "−" indicate that CS-GP is significantly better than, similar to, and significantly worse than a baseline method, respectively.
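This comparison protocol can be reproduced with SciPy's Wilcoxon signed-rank test over paired per-run AUC values. The AUC numbers below are fabricated for illustration (the paper uses 30 runs per dataset).

```python
# Illustrative significance check: Wilcoxon signed-rank test on paired AUC
# values (one pair per independent run), significance level 0.05.
# All AUC values here are made up for illustration only.
from scipy.stats import wilcoxon

auc_cs_gp    = [0.91, 0.88, 0.93, 0.87, 0.90, 0.92, 0.89, 0.94, 0.86, 0.95]
auc_baseline = [0.90, 0.86, 0.90, 0.83, 0.85, 0.86, 0.82, 0.86, 0.77, 0.85]

stat, p_value = wilcoxon(auc_cs_gp, auc_baseline)
symbol = "+" if p_value < 0.05 else "="   # "+": CS-GP significantly better
print(symbol, p_value)
```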

Further analysis on the evolved cost values

Fig. 4 shows the evolved cost values (for the minority class) on the ten datasets in the 30 independent runs. The vertical axis shows 30 cost values, each of which is evolved by the best individual (its right sub-tree) from the final generation in a run. The horizontal axis is used to show which run (from 1 to 30) a cost value is evolved.

On Armstrong-2002-v1 (IR=2), most of the evolved cost values are in the interval of (1, 10), which is similar to that of being evolved for Leukemia (IR=2) and

Conclusion and future work

This paper designs a new GP method (i.e. CS-GP) to construct classifiers and learn cost values automatically and simultaneously when the needed cost information is unknown. In CS-GP, a new tree representation, terminal and function sets have been developed. In the evolved tree, the cost value represented by its right sub-tree is used by the classifier represented by its left sub-tree in the evaluation to make this classifier sensitive to different classification mistakes. Therefore, CS-GP is

CRediT authorship contribution statement

Wenbin Pei: Investigation, Conceptualization, Methodology, Software, Writing - original draft. Bing Xue: Supervision, Methodology, Writing - review & editing. Lin Shang: Supervision, Writing - review & editing. Mengjie Zhang: Supervision, Methodology, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the Marsden Fund of New Zealand government under contracts VUW1509 and VUW1615, the Science for Technological Innovation Challenge (SfTI) fund under grant E3603/2903, the University Research Fund at Victoria University of Wellington (grant number 223805/3986), MBIE Data Science SSIF Fund under the contract RTVU1914, and the National Natural Science Foundation of China (NSFC) under grants 61876169, 61672276 and 51975294. Wenbin Pei was supported by China

References (35)

  • Xue, B., et al., A survey on evolutionary computation approaches to feature selection, IEEE Trans. Evol. Comput. (2015)

  • Poli, R., et al., A Field Guide to Genetic Programming (2008)

  • Elkan, C., The foundations of cost-sensitive learning

  • Zhang, C., et al., A cost-sensitive deep belief network for imbalanced classification, IEEE Trans. Neural Netw. Learn. Syst. (2018)

  • Fernández, A., et al., Cost-sensitive learning

  • Ling, C.X., et al., Cost-sensitive learning

  • Zhou, Z.H., Cost-sensitive learning