Information Sciences

Volume 258, 10 February 2014, Pages 200-216

Using Bayesian networks for selecting classifiers in GP ensembles

https://doi.org/10.1016/j.ins.2013.09.049

Abstract

Ensemble techniques have been widely used to improve classification performance, including in the case of GP-based systems. These techniques improve classification accuracy by using voting strategies to combine the responses of different classifiers. However, even reducing the number of classifiers composing the ensemble, by selecting only those that are appropriately "diverse" according to a given measure, gives no guarantee of obtaining significant improvements in both classification accuracy and generalization capacity. This paper presents a novel approach for combining GP-based ensembles by means of a Bayesian network. The proposed system learns and combines decision tree ensembles effectively by using two different strategies: in the first, decision tree ensembles are learned by means of a boosted GP algorithm; in the second, the responses of the ensemble are combined using a Bayesian network, which also implements a selection strategy to reduce the number of classifiers. Experiments on several data sets show that the approach obtains comparable or better accuracy with respect to other methods proposed in the literature, while considerably reducing the number of classifiers used. In addition, a comparison with similar approaches confirmed the effectiveness of our method and its superiority with respect to other selection techniques based on diversity.

Introduction

Ensemble techniques have received considerable attention in recent years [2], [8], [25] as a way of further improving the performance of classification algorithms. They aim to effectively combine the responses provided by several experts, in order to improve the overall classification accuracy [28]. Such techniques rely on: (i) a diversification heuristic, used to extract sufficiently diverse classifiers; (ii) a voting mechanism, to combine the responses provided by the learned classifiers. If the classifiers are sufficiently diverse, i.e. they make uncorrelated errors, then the majority vote rule tends to the Bayesian error as the number of classifiers increases [28].
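
As a point of reference, the majority vote rule mentioned above can be sketched in a few lines of Python; the function name and the list-based representation of the ensemble responses are illustrative assumptions, not part of the original system.

from collections import Counter

def majority_vote(labels):
    """Assign the class label occurring most often among the
    responses provided by the ensemble members for one sample."""
    return Counter(labels).most_common(1)[0][0]

# Five classifiers, three of which agree on class 'A':
print(majority_vote(['A', 'B', 'A', 'A', 'C']))  # -> 'A'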

Ensemble techniques have also been used to enhance the performance of classification systems in which decision trees are learned by means of Genetic Programming (GP). Examples of GP-based approaches using ensemble techniques can be found in [6], [14], [18], [24], [31]. In particular, in [6], [14], [24], ensembles of decision trees are evolved, and the diversity among the ensemble members is obtained by using bagging or boosting techniques. According to these approaches, an ensemble can be obtained by evolving each decision tree with reference to a different subset of the original data. In the bagging approach [5], different subsets (called bags), the same size as the original training set, are obtained by applying random sampling with replacement. The ensemble is then built by training each classifier on a different bag, and the responses provided by these classifiers are combined by means of the majority vote rule: an unknown sample is assigned the class label with the highest occurrence among those provided by the whole set of classifiers. In the boosting approach [16], instead, the classifier ensemble is trained by means of a stepwise procedure. At each step, a new classifier is trained by choosing the training samples from the original training set according to a suitable probability distribution. This distribution is adaptively changed at each step so that samples misclassified in the previous steps have a better chance of being chosen. Eventually, classifier responses are combined by the weighted majority vote rule, where the weight associated with a classifier takes into account its overall accuracy on the training data.
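
The two schemes can be illustrated by the minimal sketch below. It assumes a list-based training set and per-classifier weights; it is not the BoostCGPC implementation, only an illustration of bootstrap sampling and of boosting-style weighted voting.

import random
from collections import defaultdict

def make_bag(training_set, rng=random):
    """Bagging: draw a bootstrap sample (a 'bag') of the same size as the
    original training set by random sampling with replacement."""
    n = len(training_set)
    return [training_set[rng.randrange(n)] for _ in range(n)]

def weighted_majority_vote(labels, weights):
    """Boosting-style combination: each classifier's response is weighted
    by a score reflecting its overall accuracy on the training data."""
    scores = defaultdict(float)
    for label, weight in zip(labels, weights):
        scores[label] += weight
    return max(scores, key=scores.get)

# Three classifiers vote 'A', 'B', 'A' with accuracy-based weights:
print(weighted_majority_vote(['A', 'B', 'A'], [0.4, 0.9, 0.3]))  # -> 'B'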

In [13] a novel GP-based classification system, named Cellular GP for data Classification (CGPC), is presented. In the CGPC approach, individuals interact according to a cellular automata inspired model, whose goal is to enable a fine-grained parallel implementation of GP. In this model, each individual has a spatial location on a low-dimensional grid and interacts only with other individuals within a small neighborhood. In [14], an extension of CGPC, called BoostCGPC, is presented. It is based on two different ensemble techniques: Breiman's bagging algorithm [5] and the AdaBoost.M2 boosting algorithm [16]. Despite the significant performance improvements classifier ensembles can provide, their major drawback is that it is usually necessary to combine a large number of classifiers in order to obtain a marked error reduction. This implies large memory requirements and slow classification speeds, which can be critical in some applications [32], [36]. The problem can be addressed by selecting a fraction of the classifiers from the original ensemble. This reduction, often called "ensemble pruning" in the literature, has other potential benefits. In particular, an appropriate subset of complementary classifiers can perform better than the whole ensemble [32], [43], [44], [33], [4]. When the cardinality L of the whole ensemble is high, the problem of finding the optimal subset of classifiers becomes computationally intractable because of the resulting exponential growth of the search space, made up of all 2^L possible subsets. Several heuristic algorithms can be found in the literature for finding near-optimal solutions; examples include genetic algorithms (GAs) [43], [44] and semidefinite programming [42].
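
To make the size of this search space concrete, the brute-force routine below enumerates all 2^L - 1 non-empty subsets of an ensemble for a user-supplied evaluation function. It is only an illustration of why exact search is feasible for small L alone and heuristics such as GAs or semidefinite programming are used in practice; it is not one of the cited methods.

from itertools import combinations

def exhaustive_prune(classifiers, evaluate):
    """Enumerate all 2**L - 1 non-empty subsets of the ensemble and keep
    the best one according to `evaluate`; only feasible for small L."""
    best_subset, best_score = None, float('-inf')
    for k in range(1, len(classifiers) + 1):
        for subset in combinations(classifiers, k):
            score = evaluate(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score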

In order to be successful, any ensemble learning strategy should ensure that the classifiers making up the ensemble are suitably diverse, so as to avoid correlated errors [28]. In fact, as the ensemble size increases, it may happen that a correct classification provided by some classifiers is overturned by the convergence of other classifiers on the same wrong decision. If the ensemble classifiers are not sufficiently diverse, this event becomes much more likely and can reduce the obtainable performance, regardless of the combination strategy. Classifier diversity for bagging and boosting is experimentally investigated in [26], [27]. The results show that these techniques do not guarantee sufficiently diverse classifiers. As regards boosting, in [26] it is observed that highly diverse classifiers are obtained in the first steps, but diversity strongly decreases as the boosting process proceeds.
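
A common way of quantifying the pairwise diversity discussed here is the disagreement rate, i.e. the fraction of samples on which two classifiers give different labels. The helper below is an illustrative measure of this kind, not necessarily the one used in [26], [27].

def disagreement(preds_i, preds_j):
    """Pairwise diversity: fraction of samples on which two classifiers
    provide different labels (0 = identical behaviour, 1 = always different)."""
    assert len(preds_i) == len(preds_j) and preds_i
    return sum(a != b for a, b in zip(preds_i, preds_j)) / len(preds_i)

# Two classifiers that disagree on one sample out of four:
print(disagreement(['A', 'B', 'A', 'A'], ['A', 'B', 'B', 'A']))  # -> 0.25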

More recently, AdaBoost has also been applied to generate additional training subsets, used to learn an ensemble of Extreme Learning Machine (ELM) classifiers [40]. This approach tries to overcome some of the drawbacks of traditional gradient-based learning algorithms, and can also alleviate the instability and over-fitting problems of ELM. The diversity issue has also been considered in the unsupervised learning of cluster ensembles [41].

In a previous work [10], the classifier combination problem was reformulated as a pattern recognition one, in which the pattern is represented by the set of class labels provided by the classifiers when classifying a sample. Following this approach, a Bayesian network (BN) [35] was learned in order to estimate the conditional probability of each class, given the set of labels provided by the classifiers for each sample of a training set. We used Bayesian networks because they provide a natural and compact way to encode joint probability distributions through graphical models, and allow probabilistic relationships among variables to be derived by using effective learning algorithms for both the graphical structure and its parameters. In particular, the joint probability among variables is modeled through the structure of a Directed Acyclic Graph (DAG), whose nodes are the variables and whose arcs are their statistical dependencies. In this way, the conditional probability of each class, given the set of responses provided by the classifiers, can be derived directly from the DAG structure by applying the Bayes rule. Thus, the combining rule is automatically provided by the learned BN. Moreover, this approach makes it possible to identify redundant classifiers, i.e. classifiers whose outputs do not influence the output of the combiner: the behavior of these classifiers is very similar to that of other classifiers in the ensemble. For this reason, they may be discarded without affecting the overall performance of the combiner, thus overcoming the main drawback of the combining methods discussed above. In [11] the learning of the BN is performed by means of an evolutionary algorithm using a direct encoding scheme of the BN structure (DAG). This encoding scheme is based on a specifically devised data structure, called Multilist, which allows an easy and effective implementation of the genetic operators. The rationale behind the choice of an evolutionary approach is to address one of the main drawbacks of standard learning algorithms, such as K2, which adopt a greedy search strategy and are thus prone to being trapped in local optima. Evolutionary algorithms, on the contrary, allow us to effectively explore complex, high-dimensional search spaces.
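
The "combination as pattern recognition" idea can be illustrated with a deliberately simplified stand-in for the BN combiner: estimate P(class | pattern of classifier responses) by counting on a labelled set, then pick the most probable class. The real system factorizes this distribution through an evolutionarily learned DAG; the flat table below is only a sketch of the underlying idea, and all names are ours.

from collections import Counter, defaultdict

def learn_response_table(response_patterns, true_labels):
    """Estimate P(class | responses) by counting how often each pattern of
    classifier outputs co-occurs with each true class on a labelled set."""
    counts = defaultdict(Counter)
    for pattern, label in zip(response_patterns, true_labels):
        counts[tuple(pattern)][label] += 1
    return counts

def combine(counts, pattern):
    """Return the most probable class for a pattern of responses, falling
    back to plain majority vote for patterns never seen during training."""
    seen = counts.get(tuple(pattern))
    if seen:
        return seen.most_common(1)[0][0]
    return Counter(pattern).most_common(1)[0][0]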

This paper presents a new classification system that merges the two aforementioned approaches. The goal is to build a high-performance classification system, able to deal with large data sets while selecting only a reduced number of classifiers. For this purpose, we built a two-module system that combines the BoostCGPC algorithm [14], which produces a high-performing ensemble of decision tree classifiers, with the BN-based approach to classifier combination [11]. In particular, the BN module evaluates classifier diversity by estimating the statistical dependencies of the responses they provide. This ability is used to select, among the classifiers provided by the BoostCGPC module, the minimum number required to effectively classify the data at hand. Moreover, the responses provided by the selected classifiers are effectively combined by means of a rule inferred by the BN module. In this way the proposed system exploits the advantages of both techniques and allows us to greatly reduce the number of classifiers in the ensemble. Finally, as regards the evolutionary learning of the BN, the mutation operator has been reformulated in order to ensure that any permutation of the elements of the main list leaves the connection topology of the Multilist unchanged. This property is very important for managing the trade-off between the exploration and exploitation abilities of the evolutionary algorithm.

In order to assess the effectiveness of the proposed system, several experiments were performed. More specifically, seven data sets with different sizes, numbers of attributes and numbers of classes were considered. Two kinds of comparison were performed: in the first, the results of our approach were compared with those achieved by using different combination strategies; in the second, our results were compared with those obtained by combining subsets of classifiers selected according to different selection strategies. Moreover, a diversity analysis of the selected classifiers was carried out taking into account two diversity measures: a genotypic one, which compares the structures of the related decision trees, and a phenotypic one, which considers the responses provided by the classifiers to be combined.

The remainder of the paper is organized as follows: Section 2 reviews previous work on ensemble pruning; Section 3 details the system architecture; Section 4 describes several diversity measures, including those used in the experiments; Section 5 illustrates the experimental results; finally, Section 6 presents a discussion and some concluding remarks.

Section snippets

Related works

As mentioned in the introduction, the larger the number of classifiers included in the pool, the greater the memory requirements and the computational time needed for combining their results. For this reason the use of a large number of experts can be critical in some applications, which explains why, in the last few years, many researchers have focused their attention on ensemble pruning techniques. In [4], the initial ensemble is pruned by a sequential backward selection procedure that attempts to

The architecture of the system

The proposed system aims to classify large data sets by exploiting the advantages of both an ensemble-based GP classifier and a Bayesian network-based combining technique. In practice, the system comprises two main modules: the first one (denoted as the ensemble module) is composed of an ensemble of decision tree classifiers (experts) built by means of the BoostCGPC algorithm, while the second one (denoted as the combining module) uses a Bayesian Network (BN)-based combiner to effectively combine the

Diversity measures and correlation

Performing a diversity analysis and exploring its correlation with the accuracy of an ensemble is a useful step toward understanding the performance of our system. Other motivations also call for this kind of analysis. In fact, an important property of ensemble-based classifiers is the ability to achieve good accuracy by using only a very limited number of classifiers. However, the approach described in this paper needs a considerable overhead to

Experimental analysis

The proposed approach was tested on seven real data sets: Adult, Census, Covtype, Phoneme, PhotoObject, Satimage, and Segment. The size and class distribution of these data sets are described in Table 1. They present different characteristics as regards the number and type (continuous and discrete) of attributes, the number of classes (two-class and multi-class problems) and the number of samples. In particular, Adult and Census contain census data collected by the U.S. Census Bureau.

Conclusions

We presented a novel approach for improving the performance of decision tree ensembles learned by means of a boosted GP algorithm. Our basic idea is to combine the classifiers in the ensemble by using a BN to estimate the conditional probability of each class given the responses of these classifiers. This approach allows us to explicitly model the dependencies among the experts, addressing the problems arising from the error correlation that can occur even with boosting techniques. The

References (44)

  • I. Christou et al., A classifier ensemble approach to the tv-viewer profile adaptation problem, International Journal of Machine Learning and Cybernetics (2012)
  • G. Cooper, E. Herskovits, A Bayesian method for constructing Bayesian belief networks from databases, in: Proceedings...
  • C. De Stefano et al., Classifier combination by Bayesian networks for handwriting recognition, International Journal of Pattern Recognition and Artificial Intelligence (2009)
  • C. De Stefano, F. Fontanella, C. Marrocco, A. Scotto di Freca, A Hybrid Evolutionary Algorithm for Bayesian Networks...
  • A. Ekárt et al., Maintaining the diversity of genetic programs, Lecture Notes in Computer Science, EuroGP 2002 (2002)
  • G. Folino et al., A cellular genetic programming approach to classification
  • G. Folino et al., GP ensembles for large-scale data classification, IEEE Transactions on Evolutionary Computation (2006)
  • G. Folino et al., Training distributed GP ensemble with a selective algorithm based on clustering and pruning for pattern classification, IEEE Transactions on Evolutionary Computation (2008)
  • Y. Freund, R. Shapire, Experiments with a new boosting algorithm, in: Proceedings of the 13th Int. Conference on...
  • N. Friedman et al., Bayesian network classifiers, Machine Learning (1997)
  • C. Gagné, M. Sebag, M. Schoenauer, M. Tomassini, Ensemble learning for free with evolutionary algorithms?, in:...
  • M. Hall et al., The WEKA data mining software: an update, SIGKDD Explorations (2009)