Elsevier

Applied Soft Computing

Volume 12, Issue 1, January 2012, Pages 416-422

Two layered Genetic Programming for mixed-attribute data classification

https://doi.org/10.1016/j.asoc.2011.08.029

Abstract

Data classification is an important problem that spans numerous real-life applications, and Genetic Programming has been applied to it successfully in many ways. Most approaches focus on a single type of data. However, most real-world data contain a mixture of categorical and continuous attributes. In this paper, we present an approach to classify mixed-attribute data using Two Layered Genetic Programming (L2GP). The approach does not transform the data into any other type; it combines the properties of arithmetic expressions (over numerical data) and logical expressions (over categorical data). The outer layer is a tree of logical functions whose terminal nodes hold the inner layer. Each inner-layer tree is either a logical or an arithmetic expression. Logical expressions pass their Boolean output to the outer tree. Arithmetic expressions produce a real value, where a positive value is treated as true and a negative value as false. The outputs of the inner layer are used to evaluate the outer layer, which determines the classification decision. The proposed technique has been applied to various heterogeneous data classification problems and found successful.

Introduction

Data classification is of high interest because it applies to several critical domains such as disease diagnosis, feature recognition, fraud detection and decision making. Real-life data is highly unpredictable, which makes classification a challenging task. This increases the need for automated classification systems with little or no human intervention. Some properties of a good classification system are:

  • Robustness: The classifier should be able to output good results over a variety of problems.

  • Applicability: It should be readily applicable to data without any preprocessing.

  • Accuracy: The resultant classifier should be reliable and exhibit good generalizing abilities.

  • Efficient modeling: The structure should be flexible enough to adapt to the properties of the data. It should be independent of the data distribution.

  • Comprehensibility: The classifier should be comprehensible to help in future decision making.

  • Portability: The classifier should be portable to other tools for future use and efficacy.

Fortunately, one of the recent evolutionary algorithms, Genetic Programming (GP), possesses the above-mentioned properties. This has been recognized since GP's inception, and GP has been widely used for classification tasks. While successful in various application domains [1], [2], [3], [4], GP suffers from limitations such as code bloat, long training time and lack of convergence. Researchers have been trying to overcome these limitations to make the most of this powerful classification tool.

Broadly, GP has been applied to classification in two different ways. The first evolves classifiers as logical rules applicable to categorical data (continuous attributes need discretization). The second evolves arithmetic classifier expressions applicable to numerical attributes (categorical attributes are encoded as numeric values). The data transformation in either case can result in loss of information or bias, in addition to added computational effort.
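The two transformations mentioned above can be illustrated with a toy sketch. It is not taken from the paper: the helper names `discretize` and `ordinal_encode`, the cut point and the category values are all hypothetical, chosen only to show how binning can discard information and how integer encoding can impose an ordering the data does not have.

```python
from bisect import bisect_right
from typing import Dict, Iterable, List


def discretize(value: float, cut_points: List[float]) -> int:
    """Map a continuous value to the index of the bin it falls into.

    cut_points must be sorted in ascending order (assumption of this sketch).
    """
    return bisect_right(cut_points, value)


def ordinal_encode(categories: Iterable[str]) -> Dict[str, int]:
    """Assign each category an integer code in alphabetical order."""
    return {name: code for code, name in enumerate(sorted(categories))}


# Loss of information: two different readings collapse into the same bin.
same_bin = discretize(36.6, [37.5]) == discretize(37.4, [37.5])

# Bias: the codes suggest an order and a distance between categories
# that have no meaning for nominal values.
codes = ordinal_encode(["red", "green", "blue"])
artificial_distance = codes["red"] - codes["blue"]

print(same_bin, codes, artificial_distance)
```

After discretization the readings 36.6 and 37.4 are indistinguishable, and after encoding an arithmetic expression can treat "red" as two units away from "blue", which is the kind of artefact a method that avoids both transformations sidesteps.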

We propose a novel GP-based classification system that is applicable to mixed-attribute data without any preprocessing. It is a two-layered approach in which the outer layer is a logical expression tree, and the leaf nodes of that tree are the inner-layer trees. The inner-layer expressions can be of two types: logical expressions for the categorical attributes and arithmetic expressions for the continuous attributes of the data. A logical expression gives a Boolean value as output, which the outer layer can use directly. The real-valued output of an arithmetic expression, on the other hand, is considered true if it is positive and false if it is negative. The outputs of the inner layer are used to evaluate the outer layer, which determines the classification decision. This novel GP-based representation has been tested on various binary datasets from the UCI repository [5]. Its performance has been found comparable to that of various other GP-based classification algorithms.
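The two-layer evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names, the attribute names (`age`, `bmi`, `smoker`, `sex`) and the example expressions are hypothetical, inner trees are stood in for by plain Python callables, and an arithmetic output of exactly zero is treated as false because the text only specifies positive and negative values.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

Record = Dict[str, Union[float, str]]


@dataclass
class ArithmeticLeaf:
    """Inner-layer tree over continuous attributes; yields a real value."""
    expr: Callable[[Record], float]

    def evaluate(self, record: Record) -> bool:
        # Positive output is true, negative is false.
        # Mapping exactly 0.0 to false is an assumption of this sketch.
        return self.expr(record) > 0.0


@dataclass
class LogicalLeaf:
    """Inner-layer tree over categorical attributes; yields a Boolean."""
    expr: Callable[[Record], bool]

    def evaluate(self, record: Record) -> bool:
        return bool(self.expr(record))


@dataclass
class And:
    left: "Node"
    right: "Node"

    def evaluate(self, record: Record) -> bool:
        return self.left.evaluate(record) and self.right.evaluate(record)


@dataclass
class Or:
    left: "Node"
    right: "Node"

    def evaluate(self, record: Record) -> bool:
        return self.left.evaluate(record) or self.right.evaluate(record)


@dataclass
class Not:
    child: "Node"

    def evaluate(self, record: Record) -> bool:
        return not self.child.evaluate(record)


Node = Union[ArithmeticLeaf, LogicalLeaf, And, Or, Not]

# Hypothetical outer tree: (arithmetic AND logical) OR NOT(logical).
classifier = Or(
    And(
        ArithmeticLeaf(lambda r: r["age"] - 2.0 * r["bmi"] + 10.0),
        LogicalLeaf(lambda r: r["smoker"] == "yes"),
    ),
    Not(LogicalLeaf(lambda r: r["sex"] == "male")),
)

positive = {"age": 60.0, "bmi": 25.0, "smoker": "yes", "sex": "male"}
negative = {"age": 30.0, "bmi": 25.0, "smoker": "yes", "sex": "male"}
print(classifier.evaluate(positive), classifier.evaluate(negative))
```

Each record keeps its attributes in their original types: the arithmetic leaf reads only numeric fields and the logical leaves read only categorical ones, so no discretization or numeric encoding is needed before the outer tree combines the Boolean results.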

Section snippets

Related work

GP was introduced by Koza [6] in 1992 for the automatic evolution of computer programs. Its ability to evolve classifiers has been recognized since its beginning. Decision trees are among the simpler classifiers, and GP has been successfully used for decision tree evolution since 1991 [7]. Advancements continue to be made to date [8]. Other classifier evolution approaches include evolution of neural networks [9], [10], [11], autonomous systems [12], rule induction algorithms [13], fuzzy rule based

Methodology

The first step in any GP system is defining the solution representation by selecting a function set and a terminal set. The proposed solution (classifier) is a two-layered tree. The outer layer is a logical tree with function and terminal nodes (antecedents). The function set contains the ‘and’, ‘or’ and ‘not’ operators. The terminal nodes of the outer layer tree are the inner layer trees. One instance of the outer layer logical tree (Fig. 1) is: OuterLogicalTree = [(Innertree1) OR (Inner

Results

We have used ten-fold cross-validation to obtain the classification results. The data is divided into ten equal parts; nine parts are used for training and one part is used for testing. This process is repeated so that each of the ten parts serves as the testing data once. The ten-fold cross-validation process is repeated 5 times. For each fold, we have performed two independent runs with different random seeds. This means that, for each dataset, there are 100 runs in total (10 folds × 5 repetitions × 2 runs). All the reported
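The run schedule described in this section can be sketched as below. This is an illustrative reconstruction under assumptions, not the authors' code: the function name `run_schedule`, the seed numbering, and the choice to reshuffle the data before each of the 5 repetitions are all hypothetical, and the folds are only approximately equal when the dataset size is not a multiple of ten.

```python
import random
from typing import Iterator, List, Tuple


def run_schedule(
    n_samples: int,
    n_folds: int = 10,
    n_repeats: int = 5,
    runs_per_fold: int = 2,
    base_seed: int = 0,
) -> Iterator[Tuple[List[int], List[int], int]]:
    """Yield (train_indices, test_indices, run_seed) for every GP run."""
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        # Assumption: each repetition uses a fresh random partition.
        random.Random(base_seed + repeat).shuffle(indices)
        folds = [indices[k::n_folds] for k in range(n_folds)]
        for k, test_idx in enumerate(folds):
            # The nine remaining folds form the training set.
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            for run in range(runs_per_fold):
                # A distinct seed for every independent run.
                run_seed = (repeat * n_folds + k) * runs_per_fold + run
                yield train_idx, test_idx, run_seed


schedule = list(run_schedule(n_samples=200))
print(len(schedule))
```

With the default parameters the generator yields 10 × 5 × 2 = 100 train/test splits per dataset, each paired with its own seed, which matches the run count stated above.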

Conclusions and future work

In this paper, we have proposed a novel GP-based classification technique that operates on data with mixed attribute types. The technique does not require any transformation or preprocessing of the data. We have tested the system on several benchmark datasets and compared its performance with that of various GP-based classification methods. The results show that the presented technique offers comparable performance owing to its flexible two-layered representation. Future work includes

Acknowledgement

The authors would like to thank the Higher Education Commission, Pakistan, for its financial support and for providing the opportunity to perform this research.

References (49)

  • A. Asuncion et al.

    Machine Learning Repository

    (2007)
  • J.R. Koza

    Genetic Programming: On the Programming of Computers by Means of Natural Selection

    (1992)
  • J.R. Koza
    (1991)
  • Q. Li

    Dynamic split-point selection method for decision tree evolved by gene expression programming

    IEEE Congress on Evolutionary Computation

    (2009)
  • D. Rivero et al.
    (2008)
  • M. Oltean et al.
    (2009)
  • G.A. Pappa et al.

    Evolving rule induction algorithms with multiobjective grammar based genetic programming

    Knowledge and Information Systems

    (2008)
  • J. Eggermont

    Evolving fuzzy decision trees for data classification

    Proceedings of the 14th Belgium Netherlands Artificial Intelligence Conference

    (2002)
  • R. Konig et al.

    Genetic programming – a tool for flexible rule extraction

    IEEE Congress on Evolutionary Computation

    (2007)
  • A.P. Engelbrecht et al.

    A building block approach to genetic programming for rule discovery

    Data Mining: A Heuristic Approach

  • E. Carreno et al.

    Evolution of classification rules for comprehensible knowledge discovery

    IEEE Congress on Evolutionary Computation

    (2007)
  • A.A. Freitas

    A genetic programming framework for two data mining tasks: classification and generalized rule induction

    Genetic Programming

    (1997)
  • C.S. Kuo et al.
    (2007)
  • J. Eggermont et al.

    A comparison of genetic programming variants for data classification

    Proceedings of the Eleventh Belgium Netherlands Conference on Artificial Intelligence

    (1999)