Elsevier

Applied Soft Computing

Volume 11, Issue 1, January 2011, Pages 1087-1097

A generic optimising feature extraction method using multiobjective genetic programming

https://doi.org/10.1016/j.asoc.2010.02.008

Abstract

In this paper, we present a generic, optimising feature extraction method using multiobjective genetic programming. We re-examine the feature extraction problem and show that effective feature extraction can significantly enhance the performance of pattern recognition systems with simple classifiers. A framework is presented to evolve optimised feature extractors that transform an input pattern space into a decision space in which maximal class separability is obtained. We have applied this method to real-world datasets from the UCI Machine Learning and StatLog databases to verify our approach and to compare our proposed method with other reported results. We conclude that our algorithm is able to produce classifiers of performance superior (or equivalent) to that of the conventional classifiers examined, suggesting that the need to exhaustively evaluate a large family of conventional classifiers on any new problem can be removed.

Introduction

Despite its prominence in the field of pattern recognition up to the 1970s, the area of feature extraction – also termed feature construction – together with the related area of feature selection, has been largely overtaken by work on classifier design, principally neural networks. Indeed many elegant theoretical results have been obtained in the classification domain in the intervening years. Nonetheless, feature extraction retains a key position in the field since the performance of a pattern classifier is well-known to be enhanced by proper preprocessing of the raw measurement data – this topic is the main focus of the present work.

Fig. 1 shows a prototypical pattern recognition system in which a vector of raw measurements is mapped into a decision space. Often the feature selection and/or extraction stages are either omitted or are implicit in the recognition paradigm – a multi-layer perceptron (MLP) is a good example of a classification paradigm where a distinct feature extraction stage is not readily identifiable. Addison et al. [1] and Park et al. [2] have reviewed existing feature extraction and selection techniques while Guyon and Elisseeff [3] have discussed feature extraction in terms of filter and wrapper methods. In this paper we focus on feature extraction.

The principal difficulty with designing the feature extraction stage for a classifier is that it usually requires deep domain-specific knowledge. (Indeed much of the work in image processing on detecting image cues such as edges and corners is actually feature extraction.) Even for feature extractors designed by domain experts, the issue of optimality is rarely addressed. Ideally, we would require some measure of class separability in the transformed decision space to be maximised but with handcrafted methods this is hard to guarantee.

In general terms, finding the optimal (possibly nonlinear) transformation x → y from the input vector x to the decision-space vector y, where y = f(x), is a challenging task. In the sense that the feature extraction preprocessing stage is a transformation or mapping from input space to decision space, for a given classification problem we seek the mapping which maximises the separability of the classes in decision space. Thus feature extraction can be regarded as the search for an optimal sequence of operations subject to some criterion.

Genetic programming (GP) is an evolutionary problem solving method which has been extensively used to evolve programs or sequences of operations [8]. Typically, a prospective solution in GP is represented as a parse tree which can be interpreted as a sequence of operations and thus evaluated. Fig. 2 shows example GP trees together with the crossover operation typically used in the search process; the output of the tree on the left (Parent 1), for example, evaluates to the expression y = log(X3 · X4), where X3 and X4 are input features from the pattern being processed.
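To make the tree representation concrete, the following minimal Python sketch (our own illustration, not the authors' implementation; the Node class, operator names and the 'protected' log are assumptions) shows how such a parse tree can be stored and interpreted as a sequence of operations, using the Parent 1 tree of Fig. 2 as the example:

```python
import math

# Illustrative sketch: a GP individual is a parse tree whose internal
# nodes are operators and whose leaves are input features from the
# pattern being processed.
class Node:
    def __init__(self, op, children=(), feature=None):
        self.op = op            # 'log', 'mul', or 'feat' for a leaf
        self.children = list(children)
        self.feature = feature  # index into the input pattern (leaves only)

    def evaluate(self, x):
        """Interpret the tree as a sequence of operations on pattern x."""
        if self.op == "feat":
            return x[self.feature]
        if self.op == "mul":
            return self.children[0].evaluate(x) * self.children[1].evaluate(x)
        if self.op == "log":
            arg = self.children[0].evaluate(x)
            # 'protected' log, a common GP convention to avoid domain errors
            return math.log(abs(arg)) if arg != 0 else 0.0
        raise ValueError("unknown operator: " + self.op)

# The Parent 1 tree of Fig. 2 evaluates to y = log(X3 * X4); feature
# indices here are 0-based, so X3 and X4 become indices 2 and 3.
parent1 = Node("log", [Node("mul", [Node("feat", feature=2),
                                    Node("feat", feature=3)])])
print(parent1.evaluate([0.5, 1.2, 2.0, 3.0]))   # log(2.0 * 3.0) ≈ 1.79
```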

During evolutionary search, two parents are selected with a bias towards higher fitness and these may undergo crossover to produce two new offspring. A crossover point is selected in each parent and the two subtrees – shown in the dashed boxes in Fig. 2 – are exchanged. The two offspring may each be modified by a mutation operator in which a subtree in an offspring tree is selected and replaced by a new, randomly-generated subtree. See Section 2.2 for details of the selection, crossover and mutation operations used in the present work. The cycle of selection/crossover/mutation is repeated either for a fixed number of iterations or until some pre-specified error target is attained. (Genetic programming has been comprehensively reviewed in a recent book by Poli et al. [4].)
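Continuing the sketch above (again purely illustrative, reusing the hypothetical Node class; random_subtree stands in for whatever random tree generator a GP system uses), subtree crossover and mutation can be realised as follows:

```python
import copy
import random

def all_subtrees(node, parent=None, idx=None, acc=None):
    """Collect (parent, child-index, node) triples for every node in a tree."""
    if acc is None:
        acc = []
    acc.append((parent, idx, node))
    for i, child in enumerate(node.children):
        all_subtrees(child, node, i, acc)
    return acc

def crossover(p1, p2):
    """Subtree crossover: pick a crossover point in each parent and swap.
    Works on deep copies, so the parents themselves are unchanged."""
    c1, c2 = copy.deepcopy(p1), copy.deepcopy(p2)
    par1, i1, sub1 = random.choice(all_subtrees(c1))
    par2, i2, sub2 = random.choice(all_subtrees(c2))
    if par1 is None or par2 is None:   # a root was chosen: swap whole trees
        return c2, c1
    par1.children[i1], par2.children[i2] = sub2, sub1
    return c1, c2

def mutate(tree, random_subtree):
    """Subtree mutation: replace a randomly chosen subtree with a new,
    randomly-generated one supplied by the random_subtree() callable."""
    child = copy.deepcopy(tree)
    parent, i, _ = random.choice(all_subtrees(child))
    if parent is None:
        return random_subtree()
    parent.children[i] = random_subtree()
    return child
```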

GP has been used before to optimise feature extraction and selection. Ebner [5], [6] has evolved image processing operators using GP. Bot [7] has used GP to evolve decision space features, adding these one-at-a-time to a k-NN classifier if the newly evolved feature improved the classification performance by more than a certain amount. Bot’s approach is a greedy algorithm and therefore almost certainly sub-optimal. In addition, Koza [8] has produced character detectors using genetic programming while Tackett [9] evolved a symbolic expression for image classification based on image features.

Harvey et al. [10] evolved pipelined image processing operations to transform multi-spectral input synthetic aperture radar (SAR) image planes into a new set of image planes, and a conventional supervised classifier was used to label the transformed features. Training data were used to derive a Fisher linear discriminant, and GP was applied to find a threshold to reduce the output from the discriminant-finding phase to a binary image. However, discriminability is constrained by the discriminant-finding phase, and GP was used only as a one-dimensional search tool to find a threshold.

Sherrah et al. [11] proposed the Evolutionary PreProcessor (EPrep) system which used GP to evolve a good feature mapping by minimising misclassification error. Three typical classifiers – a generalised linear machine (GLIM), a k-nearest neighbour (k-NN) classifier and a maximum likelihood classifier – were selected randomly and trained in conjunction with the search for the optimal feature extractors. The misclassification errors on the validation set from those classifiers were used as fitness values for the individuals in the evolutionary population. The same procedure was used in the co-evolution of feature extraction/classifiers in [12]. This approach, however, makes the feature extraction procedure dependent on the classifier in an opaque way: there is a potential risk that the evolved preprocessing is excellent but the classifier poor (or vice versa), giving poor overall performance.

Kotani et al. [13] used GP to determine the polynomial combination of raw features to be fed into a k-NN classifier and reported an improvement in classification performance. Krawiec [14] constructed a fixed-length decision vector using GP, proposing an extended method to protect ‘useful’ blocks during the evolution. This protection method, however, contributes to overfitting, which is evident from his experiments. Indeed, Krawiec’s results show that for some datasets, the application of his feature extraction method actually produces worse classification performance than using the raw input data alone. Estébanez et al. [15] have followed a similar approach to Krawiec in projecting to a vector decision space of pre-determined dimensionality. Recently, Guo et al. [16] have evolved features in a condition monitoring task, although it is not clear whether the elements in the vector of decision variables were evolved at the same time or hand-selected after evolution. Smith and Bull [17] have used GP together with a GA to perform feature construction and feature selection.

Broadly, the previous work on GP feature extraction can be categorised as evolving either a discrete feature extraction stage which then feeds into a traditional classifier, or a combined feature extraction/classification method which directly outputs a class label. Of the two possible routes, we argue that there is little merit in investing computational effort in evolving classifiers since this area is well understood and has solid theoretical underpinnings. We argue that the available computational effort should be expended on producing good feature extraction; in addition, we question the speed of convergence when exploring a search space which contains not only the set of feature extractors but also the set of all classifiers. Consequently, we adopt the approach here of evolving optimal feature extraction algorithms and performing the classification task using a standard, simple and fast-to-train classifier, since the classifier has to be included inside the evolutionary loop to evaluate an individual’s fitness in terms of a separability measure in the decision space. We draw a distinction in the present work between evolving a feature extraction stage and evolving a classifier since the outcome of our evolutionary optimisation is a mapping into a real-valued (1D) decision space, not a mapping into the space of object labels, which is what would result from evolving a classifier. Clearly, our overall ‘system’ does constitute a classification system and our feature extraction stages are conditioned on our choice of classifier, in the present case a single threshold. As a future extension to the present framework, we envisage mapping the input patterns into a multidimensional decision space (see [11], for example) in which case the choice of classifier is explicitly much more open.
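To illustrate what a fitness evaluation of this kind might look like, the following sketch (a hedged reconstruction of the general idea, not the paper's actual code; the authors' separability measure may well differ from the simple error rate used here) maps each pattern through an evolved tree into a one-dimensional decision space and fits the single-threshold classifier inside the evolutionary loop:

```python
import numpy as np

def fitness(tree, X, y):
    """Illustrative fitness: project every pattern into the 1D decision
    space via the evolved tree (the Node sketch above), then find the
    single threshold with the lowest training error.  Returns that error,
    to be minimised by the evolutionary search."""
    z = np.array([tree.evaluate(x) for x in X])   # 1D decision values
    best_err = 1.0
    for t in np.unique(z):                        # candidate thresholds
        pred = (z >= t).astype(int)
        # allow either polarity of the threshold decision
        err = min(np.mean(pred != y), np.mean(pred == y))
        best_err = min(best_err, err)
    return best_err
```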

It is noteworthy that all the previous work on evolving feature extractors/classifiers by GP has used a single objective, typically minimising the classification error over a training set, which is disadvantageous from a number of standpoints. In particular, unless specific measures are taken to prevent it, the trees in a GP optimisation tend to grow without limit with no corresponding improvement in fitness, a phenomenon which is termed tree bloat. This is analogous to overfitting in neural networks and can lead to poor generalisation of the trained classifier as well as excessive computational demands. Various heuristic and indirect techniques have been used to suppress bloat, but Ekárt and Németh [18] have shown that using a multiobjective fitness function [19] within GP, where one of the objectives is to minimise tree size, prevents bloat by exerting selective pressure in favour of smaller trees; see also [20]. We have thus used a multiobjective framework with Pareto optimality [19] in the present work.

Rather than a single solution, the converged output of a multiobjective optimisation is a set of equivalent solutions whose members are superior to all the other feasible solutions; the members of this so-called Pareto set are said to dominate the other possible solutions [19]. Within this set, none can be considered ‘better’ than any other from the point of view of the simultaneous optimisation of multiple objectives and it is left to a Decision Maker (DM) to select one of the optima according to some utility function which expresses their preferences. See [19] for a detailed review of multiobjective evolutionary methods; Jin and Sendhoff [21] have recently presented a review of Pareto-based multiobjective machine learning with particular emphasis on neural networks.
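For concreteness, a minimal sketch of Pareto dominance and non-dominated filtering (again illustrative rather than the authors' code, assuming all objectives are minimised, e.g. misclassification error and tree size):

```python
def dominates(a, b):
    """True if objective vector a dominates b: no worse in every
    objective and strictly better in at least one (all minimised)."""
    return all(ai <= bi for ai, bi in zip(a, b)) and \
           any(ai < bi for ai, bi in zip(a, b))

def pareto_front(population, objectives):
    """Return the non-dominated members of a population, where
    objectives(ind) yields e.g. (misclassification error, tree size)."""
    scored = [(ind, objectives(ind)) for ind in population]
    return [ind for ind, f in scored
            if not any(dominates(g, f) for _, g in scored if g is not f)]
```

Selection that retains the non-dominated set under (error, tree size) exerts exactly the anti-bloat pressure described above: a larger tree survives only if it achieves a strictly lower error.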

Our overall objective in the present work has been to identify the (near-)optimal series of mathematical transformations of pattern data that produces the best class separation in the transformed decision space. Further, our aim has been to produce a generic, domain-independent method such that the transformed patterns (or extracted features) can then be accurately classified with a simple and fast classifier. We make no assumptions about the statistical distributions of the original feature data.

For convenience and without loss of generality, we focus here on two-class problems. As with other approaches, extension to three or more classes is somewhat more involved and will be the subject of a future publication.

The rest of this paper is organised as follows: We present our generic framework to evolve optimal feature extractors in Section 2 and demonstrate its utility in Section 3 by applying it to eight datasets from the UCI Machine Learning [22] and StatLog [23] databases. We make comparison with nine popular classifiers as well as previous evolutionary results reported by other researchers. We offer conclusions and suggestions for future work in Section 4.

Section snippets

Methodology

Over the years much effort has been expended in the pattern recognition community on finding a ‘best’ classifier (e.g. [23], [24]), the conclusion of which is that there is no single classifier which is best for every problem. In the present work, we focus on the feature preprocessing stage in classification systems. We propose a generic framework to evolve an optimal feature extraction stage for a given problem, independent of the dimensionality of the input pattern space and with optimised …

Results

In this section we address our guiding issue of producing a generic methodology by examining performance across a wide range of two-class classification problems from the UCI Machine Learning [22] and StatLog [23] databases. Since GP is able to synthesise a feature extraction stage which is (near-)optimal with respect to the learning task at hand, we conjecture that the classification performance of our method should, at worst, be identical to the very best conventional classifier on any given …

Conclusions and future work

In this paper we have demonstrated the use of multiobjective genetic programming (MOGP) to evolve an ‘optimal’ feature extractor which transforms input patterns into a decision space such that class separability is maximised. In the present work we have projected the input pattern to a one-dimensional decision space since this transformation naturally arises from a genetic programming tree although potentially, superior classification performance could be obtained by projecting into a …

Acknowledgements

The financial support of a Universities UK Overseas Research Student Award Scheme (ORSAS) scholarship and the Henry Lester Trust is gratefully acknowledged.

References (45)

  • A. Jaszkiewicz, Genetic local search for multi-objective combinatorial optimization, European Journal of Operational Research (2002).
  • D. Addison et al., A comparison of feature extraction and selection techniques.
  • C.H. Park, H. Park, P. Pardalos, A comparative study of linear and nonlinear feature extraction methods, Technical...
  • I. Guyon, A. Elisseeff, An introduction to feature extraction, in: I. Guyon, S. Gunn, M. Nikravesh, L. Zadeh (Eds.),...
  • R. Poli, W.B. Langdon, N.F. McPhee, A Field Guide to Genetic Programming, Published via http://lulu.com and freely...
  • M. Ebner, On the evolution of interest operators using genetic programming.
  • M. Ebner et al., Evolving a task specific image operator.
  • M.J.C. Bot, Feature extraction for the k-nearest neighbour classifier with genetic programming.
  • J.R. Koza, Genetic Programming II: Automatic Discovery of Reusable Programs (1994).
  • W.A. Tackett, Genetic programming for feature discovery and image discrimination.
  • N.R. Harvey, J. Theiler, S.P. Brumby, S. Perkins, J.J. Szymanski, J.J. Bloch, R.B. Porter, G. Mark, A. Young, C,...
  • J.R. Sherrah et al., The evolutionary preprocessor: automatic feature extraction for supervised classification using genetic programming.
  • C. Harris, An investigation into the application of genetic programming techniques to signal analysis and feature...
  • M. Kotani et al., Feature extraction using evolutionary computation.
  • K. Krawiec, Genetic programming-based construction of features for machine learning and knowledge discovery tasks, Genetic Programming and Evolvable Machines (2002).
  • C. Estébanez et al., A method based on genetic programming for improving the quality of datasets in classification problems, International Journal of Computer Science and Applications (2007).
  • H. Guo et al., Feature generation using genetic programming with application to fault classification, IEEE Transactions on Systems, Man and Cybernetics, Part B (2005).
  • M.G. Smith et al., Genetic programming with a genetic algorithm for feature construction and selection, Genetic Programming and Evolvable Machines (2005).
  • A. Ekárt et al., Selection based on the Pareto nondomination criterion for controlling code growth in genetic programming, Genetic Programming and Evolvable Machines (2001).
  • C.A.C. Coello, An updated survey of GA-based multiobjective optimization techniques, ACM Computing Surveys (2000).
  • S. Bleuler et al., Multiobjective genetic programming: reducing bloat using SPEA2, in: Congress on Evolutionary Computation (2001).
  • Y. Jin et al., Pareto-based multiobjective machine learning: an overview and case studies, IEEE Transactions on Systems, Man and Cybernetics, Part C (2008).