Information Sciences

Volume 258, 10 February 2014, Pages 200-216

Using Bayesian networks for selecting classifiers in GP ensembles

https://doi.org/10.1016/j.ins.2013.09.049

Abstract

Ensemble techniques have been widely used to improve classification performance, including in the case of GP-based systems. These techniques improve classification accuracy by using voting strategies to combine the responses of different classifiers. However, even reducing the number of classifiers composing the ensemble, by selecting only those that are appropriately "diverse" according to a given measure, gives no guarantee of obtaining significant improvements in both classification accuracy and generalization capacity. This paper presents a novel approach for combining GP-based ensembles by means of a Bayesian network. The proposed system learns and combines decision tree ensembles effectively by using two different strategies: in the first, decision tree ensembles are learned by means of a boosted GP algorithm; in the second, the responses of the ensemble are combined using a Bayesian network, which also implements a selection strategy to reduce the number of classifiers. Experiments on several data sets show that the approach obtains comparable or better accuracy with respect to other methods proposed in the literature, while considerably reducing the number of classifiers used. In addition, a comparison with similar approaches confirmed the effectiveness of our method and its superiority with respect to other selection techniques based on diversity.

Introduction

Ensemble techniques have received considerable attention in recent years [2], [8], [25] as a way of further improving the performance of classification algorithms. They aim to effectively combine the responses provided by several experts, in order to improve the overall classification accuracy [28]. Such techniques rely on: (i) a diversification heuristic, used to extract sufficiently diverse classifiers; (ii) a voting mechanism, to combine the responses provided by the learned classifiers. If the classifiers are sufficiently diverse, i.e. they make uncorrelated errors, then the majority vote rule tends to the Bayesian error as the number of classifiers increases [28].
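
As a point of reference, the majority vote rule mentioned above can be sketched in a few lines of Python; the function name and the list-based representation of the ensemble responses are illustrative assumptions, not part of the original system.

from collections import Counter

def majority_vote(labels):
    """Assign the class label occurring most often among the
    responses provided by the ensemble members for one sample."""
    return Counter(labels).most_common(1)[0][0]

# Five classifiers, three of which agree on class 'A':
print(majority_vote(['A', 'B', 'A', 'A', 'C']))  # -> 'A'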

Ensemble techniques have also been used to enhance the performance of classification systems in which decision trees are learned by means of Genetic Programming (GP). Examples of GP-based approaches using ensemble techniques can be found in [6], [14], [18], [24], [31]. In particular, in [6], [14], [24], ensembles of decision trees are evolved, and the diversity among the ensemble members is obtained by using bagging or boosting techniques. According to these approaches, an ensemble can be obtained by evolving each decision tree with reference to a different subset of the original data. In the bagging approach [5], different subsets (called bags), the same size as the original training set, are obtained by applying random sampling with replacement. The ensemble is then built by training each classifier on a different bag, and the responses provided by these classifiers are combined by means of the majority vote rule: an unknown sample is assigned the class label with the highest occurrence among those provided by the whole set of classifiers. In the boosting approach [16], instead, the classifier ensemble is trained by means of a stepwise procedure. At each step, a new classifier is trained by choosing the training samples from the original training set according to a suitable probability distribution. This distribution is adaptively changed at each step so that samples misclassified in the previous steps have a better chance of being chosen. Eventually, classifier responses are combined by the weighted majority vote rule, where the weight associated with a classifier takes into account its overall accuracy on the training data.
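
The two schemes can be illustrated by the minimal sketch below. It assumes a list-based training set and per-classifier weights; it is not the BoostCGPC implementation, only an illustration of bootstrap sampling and of boosting-style weighted voting.

import random
from collections import defaultdict

def make_bag(training_set, rng=random):
    """Bagging: draw a bootstrap sample (a 'bag') of the same size as the
    original training set by random sampling with replacement."""
    n = len(training_set)
    return [training_set[rng.randrange(n)] for _ in range(n)]

def weighted_majority_vote(labels, weights):
    """Boosting-style combination: each classifier's response is weighted
    by a score reflecting its overall accuracy on the training data."""
    scores = defaultdict(float)
    for label, weight in zip(labels, weights):
        scores[label] += weight
    return max(scores, key=scores.get)

# Three classifiers vote 'A', 'B', 'A' with accuracy-based weights:
print(weighted_majority_vote(['A', 'B', 'A'], [0.4, 0.9, 0.3]))  # -> 'B'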

In [13] a novel GP-based classification system, named Cellular GP for data Classification (CGPC), is presented. In the CGPC approach, individuals interact according to a cellular automata inspired model, whose goal is to enable a fine-grained parallel implementation of GP. In this model, each individual has a spatial location on a low-dimensional grid and interacts only with other individuals within a small neighborhood. In [14], an extension of CGPC, called BoostCGPC, is presented. It is based on two different ensemble techniques: Breiman's bagging algorithm [5] and the AdaBoost.M2 boosting algorithm [16]. Despite the significant performance improvements classifier ensembles can provide, their major drawback is that it is usually necessary to combine a large number of classifiers in order to obtain a marked error reduction. This implies large memory requirements and slow classification speeds, which can be critical in some applications [32], [36]. The problem can be addressed by selecting a fraction of the classifiers from the original ensemble. This reduction, often called "ensemble pruning" in the literature, has other potential benefits. In particular, an appropriate subset of complementary classifiers can perform better than the whole ensemble [32], [43], [44], [33], [4]. When the cardinality L of the whole ensemble is high, the problem of finding the optimal subset of classifiers becomes computationally intractable because of the resulting exponential growth of the search space, made up of all 2^L possible subsets. Several heuristic algorithms can be found in the literature for finding near-optimal solutions; examples include genetic algorithms (GAs) [43], [44] and semidefinite programming [42].
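
To make the size of this search space concrete, the brute-force routine below enumerates all 2^L - 1 non-empty subsets of an ensemble for a user-supplied evaluation function. It is only an illustration of why exact search is feasible for small L alone and heuristics such as GAs or semidefinite programming are used in practice; it is not one of the cited methods.

from itertools import combinations

def exhaustive_prune(classifiers, evaluate):
    """Enumerate all 2**L - 1 non-empty subsets of the ensemble and keep
    the best one according to `evaluate`; only feasible for small L."""
    best_subset, best_score = None, float('-inf')
    for k in range(1, len(classifiers) + 1):
        for subset in combinations(classifiers, k):
            score = evaluate(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score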

In order to be successful, any ensemble learning strategy should ensure that the classifiers making up the ensemble are suitably diverse, so as to avoid correlated errors [28]. In fact, as the ensemble size increases, it may happen that a correct classification provided by some classifiers is overturned by the convergence of other classifiers on the same wrong decision. If the ensemble classifiers are not sufficiently diverse, this event becomes much more likely and can reduce the obtainable performance, regardless of the combination strategy. Classifier diversity for bagging and boosting is experimentally investigated in [26], [27]. The results show that these techniques do not guarantee sufficiently diverse classifiers. As regards boosting, in [26] it is observed that highly diverse classifiers are obtained in the first steps, but diversity strongly decreases as the boosting process proceeds.
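
A common way of quantifying the pairwise diversity discussed here is the disagreement rate, i.e. the fraction of samples on which two classifiers give different labels. The helper below is an illustrative measure of this kind, not necessarily the one used in [26], [27].

def disagreement(preds_i, preds_j):
    """Pairwise diversity: fraction of samples on which two classifiers
    provide different labels (0 = identical behaviour, 1 = always different)."""
    assert len(preds_i) == len(preds_j) and preds_i
    return sum(a != b for a, b in zip(preds_i, preds_j)) / len(preds_i)

# Two classifiers that disagree on one sample out of four:
print(disagreement(['A', 'B', 'A', 'A'], ['A', 'B', 'B', 'A']))  # -> 0.25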

More recently, AdaBoost has also been applied to generate additional training subsets, used to learn an ensemble of Extreme Learning Machine (ELM) classifiers [40]. This approach tries to overcome some of the drawbacks of traditional gradient-based learning algorithms, and can also alleviate the instability and over-fitting problems of ELM. The diversity issue has also been considered in the unsupervised learning of cluster ensembles [41].

In a previous work [10], the classifier combination problem was reformulated as a pattern recognition one, in which the pattern is represented by the set of class labels provided by the classifiers when classifying a sample. Following this approach, a Bayesian network (BN) [35] was learned in order to estimate the conditional probability of each class, given the set of labels provided by the classifiers for each sample of a training set. We used Bayesian networks because they provide a natural and compact way to encode joint probability distributions through graphical models, and allow probabilistic relationships among variables to be derived by using effective learning algorithms for both the graphical structure and its parameters. In particular, the joint probability among variables is modeled through the structure of a Directed Acyclic Graph (DAG), whose nodes are the variables and whose arcs are their statistical dependencies. In this way, the conditional probability of each class, given the set of responses provided by the classifiers, can be derived directly from the DAG structure by applying the Bayes rule. Thus, the combining rule is automatically provided by the learned BN. Moreover, this approach makes it possible to identify redundant classifiers, i.e. classifiers whose outputs do not influence the output of the combiner: the behavior of these classifiers is very similar to that of other classifiers in the ensemble. For this reason, they may be discarded without affecting the overall performance of the combiner, thus overcoming the main drawback of the combining methods discussed above. In [11] the learning of the BN is performed by means of an evolutionary algorithm using a direct encoding scheme of the BN structure (DAG). This encoding scheme is based on a specifically devised data structure, called Multilist, which allows an easy and effective implementation of the genetic operators. The rationale behind the choice of an evolutionary approach is to address one of the main drawbacks of standard learning algorithms, such as K2, which adopt a greedy search strategy and are thus prone to being trapped in local optima. Evolutionary algorithms, on the contrary, allow us to effectively explore complex, high-dimensional search spaces.
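
The "combination as pattern recognition" idea can be illustrated with a deliberately simplified stand-in for the BN combiner: estimate P(class | pattern of classifier responses) by counting on a labelled set, then pick the most probable class. The real system factorizes this distribution through an evolutionarily learned DAG; the flat table below is only a sketch of the underlying idea, and all names are ours.

from collections import Counter, defaultdict

def learn_response_table(response_patterns, true_labels):
    """Estimate P(class | responses) by counting how often each pattern of
    classifier outputs co-occurs with each true class on a labelled set."""
    counts = defaultdict(Counter)
    for pattern, label in zip(response_patterns, true_labels):
        counts[tuple(pattern)][label] += 1
    return counts

def combine(counts, pattern):
    """Return the most probable class for a pattern of responses, falling
    back to plain majority vote for patterns never seen during training."""
    seen = counts.get(tuple(pattern))
    if seen:
        return seen.most_common(1)[0][0]
    return Counter(pattern).most_common(1)[0][0]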

This paper presents a new classification system that merges the two aforementioned approaches. The goal is to build a high-performance classification system, able to deal with large data sets while selecting only a reduced number of classifiers. For this purpose, we built a two-module system that combines the BoostCGPC algorithm [14], which produces a high-performing ensemble of decision tree classifiers, with the BN-based approach to classifier combination [11]. In particular, the BN module evaluates classifier diversity by estimating the statistical dependencies of the responses they provide. This ability is used to select, among the classifiers provided by the BoostCGPC module, the minimum number required to effectively classify the data at hand. Moreover, the responses provided by the selected classifiers are effectively combined by means of a rule inferred by the BN module. In this way the proposed system exploits the advantages of both techniques and allows us to greatly reduce the number of classifiers in the ensemble. Finally, as regards the evolutionary learning of the BN, the mutation operator has been reformulated in order to ensure that any permutation of the elements of the main list leaves the connection topology of the Multilist unchanged. This property is very important for managing the trade-off between the exploration and exploitation abilities of the evolutionary algorithm.

In order to assess the effectiveness of the proposed system, several experiments were performed. More specifically, seven data sets with different sizes, numbers of attributes and numbers of classes were considered. Two kinds of comparison were performed: in the first, the results of our approach were compared with those achieved by using different combination strategies; in the second, our results were compared with those obtained by combining subsets of classifiers selected according to different selection strategies. Moreover, a diversity analysis of the selected classifiers was carried out taking into account two diversity measures: a genotypic one, which compares the structures of the related decision trees, and a phenotypic one, which considers the responses provided by the classifiers to be combined.

The remainder of the paper is organized as follows: Section 2 reviews previous work on ensemble pruning; Section 3 details the system architecture; Section 4 describes several diversity measures, including those used in the experiments; Section 5 illustrates the experimental results; finally, Section 6 presents a discussion and some concluding remarks.

Section snippets

Related works

As mentioned in the introduction, the larger the number of classifiers included in the pool, the greater the memory requirements and the computational time needed for combining their results. For this reason the use of a large number of experts can be critical in some applications, which explains why, in the last few years, many researchers have focused their attention on ensemble pruning techniques. In [4], the initial ensemble is pruned by a sequential backward selection procedure that attempts to

The architecture of the system

The proposed system aims to classify large data sets by exploiting the advantages of both an ensemble-based GP classifier and a Bayesian network-based combining technique. In practice, the system comprises two main modules: the first one (denoted as the ensemble module) is composed of an ensemble of decision tree classifiers (experts) built by means of the BoostCGPC algorithm, while the second one (denoted as the combining module) uses a Bayesian Network (BN)-based combiner to effectively combine the

Diversity measures and correlation

Performing a diversity analysis and exploring its correlation with the accuracy of an ensemble is a useful step toward understanding the performance of our system. Other motivations also call for this kind of analysis. In fact, an important property of ensemble-based classifiers is the ability to achieve good accuracy by using only a very limited number of classifiers. However, the approach described in this paper needs a considerable overhead to

Experimental analysis

The proposed approach was tested on seven real data sets: Adult, Census, Covtype, Phoneme, PhotoObject, Satimage, and Segment. The size and class distribution of these data sets are described in Table 1. They present different characteristics as regards the number and type (continuous and discrete) of attributes, the number of classes (two-class and multi-class problems) and the number of samples. In particular, Adult and Census contain census data collected by the U.S. Census Bureau.

Conclusions

We presented a novel approach for improving the performance of decision tree ensembles learned by means of a boosted GP algorithm. Our basic idea is to combine the classifiers in the ensemble by using a BN to estimate the conditional probability of each class given the responses of these classifiers. This approach allows us to explicitly model the dependencies among the experts, addressing the problems arising from the error correlation that can occur even with boosting techniques. The

References (44)

  • I. Christou et al., A classifier ensemble approach to the tv-viewer profile adaptation problem, International Journal of Machine Learning and Cybernetics (2012)
  • G. Cooper, E. Herskovits, A Bayesian method for constructing Bayesian belief networks from databases, in: Proceedings...
  • C. De Stefano et al., Classifier combination by Bayesian networks for handwriting recognition, International Journal of Pattern Recognition and Artificial Intelligence (2009)
  • C. De Stefano, F. Fontanella, C. Marrocco, A. Scotto di Freca, A Hybrid Evolutionary Algorithm for Bayesian Networks...
  • A. Ekárt et al., Maintaining the diversity of genetic programs, Lecture Notes in Computer Science, EuroGP 2002 (2002)
  • G. Folino et al., A cellular genetic programming approach to classification
  • G. Folino et al., GP ensembles for large-scale data classification, IEEE Transactions on Evolutionary Computation (2006)
  • G. Folino et al., Training distributed GP ensemble with a selective algorithm based on clustering and pruning for pattern classification, IEEE Transactions on Evolutionary Computation (2008)
  • Y. Freund, R. Shapire, Experiments with a new boosting algorithm, in: Proceedings of the 13th Int. Conference on...
  • N. Friedman et al., Bayesian network classifiers, Machine Learning (1997)
  • C. Gagné, M. Sebag, M. Schoenauer, M. Tomassini, Ensemble learning for free with evolutionary algorithms?, in:...
  • M. Hall et al., The WEKA data mining software: an update, SIGKDD Explorations (2009)