Elsevier

Applied Soft Computing

Volume 11, Issue 1, January 2011, Pages 1087-1097

A generic optimising feature extraction method using multiobjective genetic programming

https://doi.org/10.1016/j.asoc.2010.02.008

Abstract

In this paper, we present a generic, optimising feature extraction method using multiobjective genetic programming. We re-examine the feature extraction problem and show that effective feature extraction can significantly enhance the performance of pattern recognition systems with simple classifiers. A framework is presented to evolve optimised feature extractors that transform an input pattern space into a decision space in which maximal class separability is obtained. We have applied this method to real-world datasets from the UCI Machine Learning and StatLog databases to verify our approach and to compare our proposed method with other reported results. We conclude that our algorithm is able to produce classifiers of performance superior (or equivalent) to that of the conventional classifiers examined, suggesting that the need to exhaustively evaluate a large family of conventional classifiers on any new problem can be removed.

Introduction

Despite its prominence in the field of pattern recognition up to the 1970s, the area of feature extraction – also termed feature construction – together with the related area of feature selection, has been largely overtaken by work on classifier design, principally neural networks. Indeed many elegant theoretical results have been obtained in the classification domain in the intervening years. Nonetheless, feature extraction retains a key position in the field since the performance of a pattern classifier is well-known to be enhanced by proper preprocessing of the raw measurement data – this topic is the main focus of the present work.

Fig. 1 shows a prototypical pattern recognition system in which a vector of raw measurements is mapped into a decision space. Often the feature selection and/or extraction stages are either omitted or are implicit in the recognition paradigm – a multi-layer perceptron (MLP) is a good example of a classification paradigm where a distinct feature extraction stage is not readily identifiable. Addison et al. [1] and Park et al. [2] have reviewed existing feature extraction and selection techniques while Guyon and Elisseeff [3] have discussed feature extraction in terms of filter and wrapper methods. In this paper we focus on feature extraction.

The principal difficulty with designing the feature extraction stage for a classifier is that it usually requires deep domain-specific knowledge. (Indeed much of the work in image processing on detecting image cues such as edges and corners is actually feature extraction.) Even for feature extractors designed by domain experts, the issue of optimality is rarely addressed. Ideally, we would require some measure of class separability in the transformed decision space to be maximised but with handcrafted methods this is hard to guarantee.

In general terms, finding the optimal (possibly nonlinear) transformation x → y from the input vector x to the decision-space vector y, where y = f(x), is a challenging task. In the sense that the feature extraction preprocessing stage is a transformation or mapping from input space to decision space, for a given classification problem we seek the mapping which maximises the separability of the classes in decision space. Thus feature extraction can be regarded as the search for an optimal sequence of operations subject to some criterion.

Genetic programming (GP) is an evolutionary problem solving method which has been extensively used to evolve programs or sequences of operations [8]. Typically, a prospective solution in GP is represented as a parse tree which can be interpreted as a sequence of operations and thus evaluated. Fig. 2 shows example GP trees together with the crossover operation typically used in the search process; the output of the tree on the left (Parent 1), for example, evaluates to the expression y = log(X3 · X4), where X3 and X4 are input features from the pattern being processed.
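To make the tree representation concrete, the following minimal Python sketch (our own illustration, not the authors' implementation; the Node class, operator names and the 'protected' log are assumptions) shows how such a parse tree can be stored and interpreted as a sequence of operations, using the Parent 1 tree of Fig. 2 as the example:

```python
import math

# Illustrative sketch: a GP individual is a parse tree whose internal
# nodes are operators and whose leaves are input features from the
# pattern being processed.
class Node:
    def __init__(self, op, children=(), feature=None):
        self.op = op            # 'log', 'mul', or 'feat' for a leaf
        self.children = list(children)
        self.feature = feature  # index into the input pattern (leaves only)

    def evaluate(self, x):
        """Interpret the tree as a sequence of operations on pattern x."""
        if self.op == "feat":
            return x[self.feature]
        if self.op == "mul":
            return self.children[0].evaluate(x) * self.children[1].evaluate(x)
        if self.op == "log":
            arg = self.children[0].evaluate(x)
            # 'protected' log, a common GP convention to avoid domain errors
            return math.log(abs(arg)) if arg != 0 else 0.0
        raise ValueError("unknown operator: " + self.op)

# The Parent 1 tree of Fig. 2 evaluates to y = log(X3 * X4); feature
# indices here are 0-based, so X3 and X4 become indices 2 and 3.
parent1 = Node("log", [Node("mul", [Node("feat", feature=2),
                                    Node("feat", feature=3)])])
print(parent1.evaluate([0.5, 1.2, 2.0, 3.0]))   # log(2.0 * 3.0) ≈ 1.79
```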

During evolutionary search, two parents are selected with a bias towards higher fitness and these may undergo crossover to produce two new offspring. A crossover point is selected in each parent and the two subtrees – shown in the dashed boxes in Fig. 2 – are exchanged. The two offspring may each be modified by a mutation operator in which a subtree in an offspring tree is selected and replaced by a new, randomly-generated subtree. See Section 2.2 for details of the selection, crossover and mutation operations used in the present work. The cycle of selection/crossover/mutation is repeated either for a fixed number of iterations or until some pre-specified error target is attained. (Genetic programming has been comprehensively reviewed in a recent book by Poli et al. [4].)
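Continuing the sketch above (again purely illustrative, reusing the hypothetical Node class; random_subtree stands in for whatever random tree generator a GP system uses), subtree crossover and mutation can be realised as follows:

```python
import copy
import random

def all_subtrees(node, parent=None, idx=None, acc=None):
    """Collect (parent, child-index, node) triples for every node in a tree."""
    if acc is None:
        acc = []
    acc.append((parent, idx, node))
    for i, child in enumerate(node.children):
        all_subtrees(child, node, i, acc)
    return acc

def crossover(p1, p2):
    """Subtree crossover: pick a crossover point in each parent and swap.
    Works on deep copies, so the parents themselves are unchanged."""
    c1, c2 = copy.deepcopy(p1), copy.deepcopy(p2)
    par1, i1, sub1 = random.choice(all_subtrees(c1))
    par2, i2, sub2 = random.choice(all_subtrees(c2))
    if par1 is None or par2 is None:   # a root was chosen: swap whole trees
        return c2, c1
    par1.children[i1], par2.children[i2] = sub2, sub1
    return c1, c2

def mutate(tree, random_subtree):
    """Subtree mutation: replace a randomly chosen subtree with a new,
    randomly-generated one supplied by the random_subtree() callable."""
    child = copy.deepcopy(tree)
    parent, i, _ = random.choice(all_subtrees(child))
    if parent is None:
        return random_subtree()
    parent.children[i] = random_subtree()
    return child
```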

GP has been used before to optimise feature extraction and selection. Ebner [5], [6] has evolved image processing operators using GP. Bot [7] has used GP to evolve decision space features, adding these one-at-a-time to a k-NN classifier if the newly evolved feature improved the classification performance by more than a certain amount. Bot’s approach is a greedy algorithm and therefore almost certainly sub-optimal. In addition, Koza [8] has produced character detectors using genetic programming while Tackett [9] evolved a symbolic expression for image classification based on image features.

Harvey et al. [10] evolved pipelined image processing operations to transform multi-spectral input synthetic aperture radar (SAR) image planes into a new set of image planes, and a conventional supervised classifier was used to label the transformed features. Training data were used to derive a Fisher linear discriminant, and GP was applied to find a threshold to reduce the output from the discriminant-finding phase to a binary image. However, discriminability is constrained by the discriminant-finding phase, and GP was used only as a one-dimensional search tool to find a threshold.

Sherrah et al. [11] proposed the Evolutionary PreProcessor (EPrep) system which used GP to evolve a good feature mapping by minimising misclassification error. Three typical classifiers – a generalised linear machine (GLIM), a k-nearest neighbour (k-NN) classifier and a maximum likelihood classifier – were selected randomly and trained in conjunction with the search for the optimal feature extractors. The misclassification errors on the validation set from those classifiers were used as fitness values for the individuals in the evolutionary population. The same procedure was used in the co-evolution of feature extraction/classifiers in [12]. This approach, however, makes the feature extraction procedure dependent on the classifier in an opaque way: there is a potential risk that the evolved preprocessing is excellent but the classifier poor (or vice versa), giving poor overall performance.

Kotani et al. [13] used GP to determine the polynomial combination of raw features to be fed into a k-NN classifier and reported an improvement in classification performance. Krawiec [14] constructed a fixed-length decision vector using GP, proposing an extended method to protect ‘useful’ blocks during the evolution. This protection method, however, contributes to overfitting, which is evident from his experiments. Indeed, Krawiec’s results show that for some datasets, the application of his feature extraction method actually produces worse classification performance than using the raw input data alone. Estébanez et al. [15] have followed a similar approach to Krawiec in projecting to a vector decision space of pre-determined dimensionality. Recently, Guo et al. [16] have evolved features in a condition monitoring task, although it is not clear whether the elements in the vector of decision variables were evolved at the same time or hand-selected after evolution. Smith and Bull [17] have used GP together with a GA to perform feature construction and feature selection.

Broadly, the previous work on GP feature extraction can be categorised as evolving either a discrete feature extraction stage which then feeds into a traditional classifier, or a combined feature extraction/classification method which directly outputs a class label. Of the two possible routes, we argue that there is little merit in investing computational effort in evolving classifiers since this area is well understood and has solid theoretical underpinnings. We argue that the available computational effort should be expended on producing good feature extraction; in addition, we question the speed of convergence when exploring a search space which contains not only the set of feature extractors but also the set of all classifiers. Consequently, we adopt the approach here of evolving optimal feature extraction algorithms and performing the classification task using a standard, simple and fast-to-train classifier, since the classifier has to be included inside the evolutionary loop to evaluate an individual’s fitness in terms of a separability measure in the decision space. We draw a distinction in the present work between evolving a feature extraction stage and evolving a classifier since the outcome of our evolutionary optimisation is a mapping into a real-valued (1D) decision space, not a mapping into the space of object labels, which is what would result from evolving a classifier. Clearly, our overall ‘system’ does constitute a classification system and our feature extraction stages are conditioned on our choice of classifier, in the present case a single threshold. As a future extension to the present framework, we envisage mapping the input patterns into a multidimensional decision space (see [11], for example) in which case the choice of classifier is explicitly much more open.
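To illustrate what a fitness evaluation of this kind might look like, the following sketch (a hedged reconstruction of the general idea, not the paper's actual code; the authors' separability measure may well differ from the simple error rate used here) maps each pattern through an evolved tree into a one-dimensional decision space and fits the single-threshold classifier inside the evolutionary loop:

```python
import numpy as np

def fitness(tree, X, y):
    """Illustrative fitness: project every pattern into the 1D decision
    space via the evolved tree (the Node sketch above), then find the
    single threshold with the lowest training error.  Returns that error,
    to be minimised by the evolutionary search."""
    z = np.array([tree.evaluate(x) for x in X])   # 1D decision values
    best_err = 1.0
    for t in np.unique(z):                        # candidate thresholds
        pred = (z >= t).astype(int)
        # allow either polarity of the threshold decision
        err = min(np.mean(pred != y), np.mean(pred == y))
        best_err = min(best_err, err)
    return best_err
```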

It is noteworthy that all the previous work on evolving feature extractors/classifiers by GP has used a single objective, typically minimising the classification error over a training set, which is disadvantageous from a number of standpoints. In particular, unless specific measures are taken to prevent it, the trees in a GP optimisation tend to grow without limit with no corresponding improvement in fitness, a phenomenon which is termed tree bloat. This is analogous to overfitting in neural networks and can lead to poor generalisation of the trained classifier as well as excessive computational demands. Various heuristic and indirect techniques have been used to suppress bloat, but Ekárt and Németh [18] have shown that using a multiobjective fitness function [19] within GP, where one of the objectives is to minimise tree size, prevents bloat by exerting selective pressure in favour of smaller trees; see also [20]. We have thus used a multiobjective framework with Pareto optimality [19] in the present work.

Rather than a single solution, the converged output of a multiobjective optimisation is a set of equivalent solutions whose members are superior to all the other feasible solutions; the members of this so-called Pareto set are said to dominate the other possible solutions [19]. Within this set, none can be considered ‘better’ than any other from the point of view of the simultaneous optimisation of multiple objectives and it is left to a Decision Maker (DM) to select one of the optima according to some utility function which expresses their preferences. See [19] for a detailed review of multiobjective evolutionary methods; Jin and Sendhoff [21] have recently presented a review of Pareto-based multiobjective machine learning with particular emphasis on neural networks.
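For concreteness, a minimal sketch of Pareto dominance and non-dominated filtering (again illustrative rather than the authors' code, assuming all objectives are minimised, e.g. misclassification error and tree size):

```python
def dominates(a, b):
    """True if objective vector a dominates b: no worse in every
    objective and strictly better in at least one (all minimised)."""
    return all(ai <= bi for ai, bi in zip(a, b)) and \
           any(ai < bi for ai, bi in zip(a, b))

def pareto_front(population, objectives):
    """Return the non-dominated members of a population, where
    objectives(ind) yields e.g. (misclassification error, tree size)."""
    scored = [(ind, objectives(ind)) for ind in population]
    return [ind for ind, f in scored
            if not any(dominates(g, f) for _, g in scored if g is not f)]
```

Selection that retains the non-dominated set under (error, tree size) exerts exactly the anti-bloat pressure described above: a larger tree survives only if it achieves a strictly lower error.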

Our overall objective in the present work has been to identify the (near-)optimal series of mathematical transformations of pattern data that produces the best class separation in the transformed decision space. Further, our aim has been to produce a generic, domain-independent method such that the transformed patterns (or extracted features) can then be accurately classified with a simple and fast classifier. We make no assumptions about the statistical distributions of the original feature data.

For convenience and without loss of generality, we focus here on two-class problems. As with other approaches, extension to three or more classes is somewhat more involved and will be the subject of a future publication.

The rest of this paper is organised as follows: We present our generic framework to evolve optimal feature extractors in Section 2 and demonstrate its utility in Section 3 by applying it to eight datasets from the UCI Machine Learning [22] and StatLog [23] databases. We make comparison with nine popular classifiers as well as previous evolutionary results reported by other researchers. We offer conclusions and suggestions for future work in Section 4.

Section snippets

Methodology

Over the years much effort has been expended in the pattern recognition community on finding a ‘best’ classifier (e.g. [23], [24]), the conclusion of which is that there is no single classifier which is best for every problem. In the present work, we focus on the feature preprocessing stage in classification systems. We propose a generic framework to evolve an optimal feature extraction stage for a given problem, independent of the dimensionality of the input pattern space and with optimised …

Results

In this section we address our guiding issue of producing a generic methodology by examining performance across a wide range of two-class classification problems from the UCI Machine Learning [22] and StatLog [23] databases. Since GP is able to synthesise a feature extraction stage which is (near-)optimal with respect to the learning task at hand, we conjecture that the classification performance of our method should, at worst, be identical to the very best conventional classifier on any given …

Conclusions and future work

In this paper we have demonstrated the use of multiobjective genetic programming (MOGP) to evolve an ‘optimal’ feature extractor which transforms input patterns into a decision space such that class separability is maximised. In the present work we have projected the input pattern to a one-dimensional decision space since this transformation naturally arises from a genetic programming tree although potentially, superior classification performance could be obtained by projecting into a …

Acknowledgements

The financial support of a Universities UK Overseas Research Student Award Scheme (ORSAS) scholarship and the Henry Lester Trust is gratefully acknowledged.

References (45)

  • A. Jaszkiewicz, Genetic local search for multi-objective combinatorial optimization, European Journal of Operational Research (2002).
  • D. Addison et al., A comparison of feature extraction and selection techniques.
  • C.H. Park, H. Park, P. Pardalos, A comparative study of linear and nonlinear feature extraction methods, Technical...
  • I. Guyon, A. Elisseeff, An introduction to feature extraction, in: I. Guyon, S. Gunn, M. Nikravesh, L. Zadeh (Eds.),...
  • R. Poli, W.B. Langdon, N.F. McPhee, A Field Guide to Genetic Programming, Published via http://lulu.com and freely...
  • M. Ebner, On the evolution of interest operators using genetic programming.
  • M. Ebner et al., Evolving a task specific image operator.
  • M.J.C. Bot, Feature extraction for the k-nearest neighbour classifier with genetic programming.
  • J.R. Koza, Genetic Programming II: Automatic Discovery of Reusable Programs (1994).
  • W.A. Tackett, Genetic programming for feature discovery and image discrimination.
  • N.R. Harvey, J. Theiler, S.P. Brumby, S. Perkins, J.J. Szymanski, J.J. Bloch, R.B. Porter, G. Mark, A. Young, C,...
  • J.R. Sherrah et al., The evolutionary preprocessor: automatic feature extraction for supervised classification using genetic programming.
  • C. Harris, An investigation into the application of genetic programming techniques to signal analysis and feature...
  • M. Kotani et al., Feature extraction using evolutionary computation.
  • K. Krawiec, Genetic programming-based construction of features for machine learning and knowledge discovery tasks, Genetic Programming and Evolvable Machines (2002).
  • C. Estébanez et al., A method based on genetic programming for improving the quality of datasets in classification problems, International Journal of Computer Science and Applications (2007).
  • H. Guo et al., Feature generation using genetic programming with application to fault classification, IEEE Transactions on Systems, Man and Cybernetics, Part B (2005).
  • M.G. Smith et al., Genetic programming with a genetic algorithm for feature construction and selection, Genetic Programming and Evolvable Machines (2005).
  • A. Ekárt et al., Selection based on the Pareto nondomination criterion for controlling code growth in genetic programming, Genetic Programming and Evolvable Machines (2001).
  • C.A.C. Coello, An updated survey of GA-based multiobjective optimization techniques, ACM Computing Surveys (2000).
  • S. Bleuler et al., Multiobjective genetic programming: reducing bloat using SPEA2, in: Congress on Evolutionary Computation (2001).
  • Y. Jin et al., Pareto-based multiobjective machine learning: an overview and case studies, IEEE Transactions on Systems, Man and Cybernetics, Part C (2008).