A constrained-syntax genetic programming system for discovering classification rules: application to medical data sets

https://doi.org/10.1016/j.artmed.2003.06.001

Abstract

This paper proposes a new constrained-syntax genetic programming (GP) algorithm for discovering classification rules in medical data sets. The proposed GP contains several syntactic constraints to be enforced by the system using a disjunctive normal form representation, so that individuals represent valid rule sets that are easy to interpret. The GP is compared with C4.5, a well-known decision-tree-building algorithm, and with another GP that uses Boolean inputs (BGP), on five medical data sets: chest pain, Ljubljana breast cancer, dermatology, Wisconsin breast cancer, and pediatric adrenocortical tumor. For this last data set a new preprocessing step was devised for survival prediction. Computational experiments show that, overall, the GP algorithm obtained good results with respect to predictive accuracy and rule comprehensibility, in comparison with C4.5 and BGP.

Introduction

Classification is an important problem extensively studied in several research areas, such as statistical pattern recognition, machine learning and data mining [6], [14], [20], [28]. The basic idea is to predict the class of an instance (a record of a given data set), based on the values of predictor attributes of that instance.

Medical diagnosis can be considered a classification problem: a record is a given patient’s case, predictor attributes are all of the patient’s data (including symptoms, signals, clinical history and results of laboratory tests), and the class is the diagnosis (the disease or clinical condition that the physician has to discover, based on the patient’s data). Finally, in this analogy, the medical knowledge and past experience of the physician take the role of the classifier. Nevertheless, physicians often cannot easily state the final diagnosis of a patient using simple classification rules, usually because of the underlying complexity of the information needed to reach such a diagnosis. This paper proposes a new genetic programming (GP) system for discovering simple classification rules.

GP is a search algorithm based on the principle of natural selection [1], [17]. In essence, it evolves a population of individuals, where each individual represents a candidate solution for a given problem. In each “generation” (iteration) individuals are selected for reproduction. This selection is probabilistically biased towards the current best individuals. New offspring are produced by operators that modify the selected individuals. This process of selection and creation of new individuals is repeated for a number of generations, so that the quality of the individuals (candidate solutions) is expected to improve with time. At the end of this evolution process, the best individual ever produced by GP is presented to the user as the final solution.
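The generational loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system used in the paper: tournament selection stands in for the unspecified "probabilistically biased" selection, a single variation operator stands in for the genetic operators, and the toy individuals are bit tuples rather than program trees. The names `evolve`, `count_ones` and `flip_one_bit` are hypothetical.

```python
import random

def evolve(population, fitness, vary, generations=30, tournament_size=3, seed=0):
    """Generic generational loop: select, vary, repeat; return the best individual ever seen."""
    rng = random.Random(seed)
    best = max(population, key=fitness)
    for _ in range(generations):
        offspring = []
        for _ in range(len(population)):
            # Tournament selection (an assumption): biased towards fitter
            # individuals, but weaker ones can still be chosen.
            contestants = rng.sample(population, tournament_size)
            parent = max(contestants, key=fitness)
            offspring.append(vary(parent, rng))
        population = offspring
        # Keep track of the best individual ever produced, not just the last generation's.
        generation_best = max(population, key=fitness)
        if fitness(generation_best) > fitness(best):
            best = generation_best
    return best

# Toy problem (hypothetical): individuals are bit tuples, fitness counts the ones.
def count_ones(bits):
    return sum(bits)

def flip_one_bit(bits, rng):
    i = rng.randrange(len(bits))
    return bits[:i] + (1 - bits[i],) + bits[i + 1:]

init_rng = random.Random(1)
start = [tuple(init_rng.randint(0, 1) for _ in range(12)) for _ in range(20)]
best = evolve(start, count_ones, flip_one_bit)
print(count_ones(best))
```

Because `best` is only replaced by a strictly fitter individual, the returned solution is never worse than the best member of the initial population.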

The GP algorithm proposed in this paper discovers classification rules in the following format: IF (a-certain-combination-of-attribute–values-is-satisfied) THEN (predict-a-certain-disease). Each individual represents a set of these IF–THEN rules. This rule format has the advantage of being intuitively comprehensible for the user. Hence, he/she can combine the knowledge contained in the discovered rules with his/her own knowledge, in order to make intelligent decisions about the target classification problem—for instance, medical diagnosis.
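To make the IF–THEN format concrete, the following Python sketch represents a rule set as a list of (conditions, predicted class) pairs and applies it to one record. The attribute names, class labels, first-match policy and default class are all assumptions made for illustration; they are not taken from the paper or its data sets.

```python
import operator

# Relational operators a condition may use ("!=" is an ASCII stand-in for "≠").
OPS = {"=": operator.eq, "!=": operator.ne, "<=": operator.le, ">": operator.gt}

def antecedent_holds(conditions, record):
    """True when every (attribute, op, value) condition is satisfied by the record."""
    return all(OPS[op](record[attr], value) for attr, op, value in conditions)

def classify(rule_set, record, default):
    """Return the class of the first rule whose IF part holds (first-match is an assumption)."""
    for conditions, predicted_class in rule_set:
        if antecedent_holds(conditions, record):
            return predicted_class
    return default

# Hypothetical rule set: IF age > 60 AND chest_pain = typical THEN disease_A, and so on.
rules = [
    ([("age", ">", 60), ("chest_pain", "=", "typical")], "disease_A"),
    ([("fever", "=", True)], "disease_B"),
]
patient = {"age": 67, "chest_pain": "typical", "fever": False}
print(classify(rules, patient, default="no_diagnosis"))  # prints "disease_A"
```

Each rule reads directly as a sentence a clinician could check, which is the comprehensibility argument made above.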

The use of GP for discovering comprehensible IF–THEN classification rules is much less explored in the literature than traditional rule induction and decision-tree-induction methods [21], [30]. We believe such a use of GP is a promising research area, since GP has the advantage of performing a global search in the space of candidate rules. In classification rule discovery, this generally lets GP cope better with attribute interaction than conventional, greedy rule induction and decision-tree-building algorithms [5], [8], [10], [11], [23].
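A small worked example may help show what attribute interaction means here. In the XOR-style data below (a textbook illustration, not data from the paper), neither attribute reduces class entropy on its own, so a greedy learner that scores one attribute at a time with an information-gain style criterion sees no signal, although the two attributes together determine the class exactly. The function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attrs):
    """Entropy reduction from splitting on the joint value of the given attribute indices."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(tuple(row[a] for a in attrs), []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Class is the XOR of two binary attributes.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [a ^ b for a, b in rows]

gain_a = information_gain(rows, labels, [0])      # first attribute alone
gain_b = information_gain(rows, labels, [1])      # second attribute alone
gain_ab = information_gain(rows, labels, [0, 1])  # both attributes jointly
print(gain_a, gain_b, gain_ab)
```

A search that evaluates whole candidate rules, as GP does, can score a condition pair such as the joint split directly instead of having to pick one attribute first.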

The GP algorithm proposed in this paper is a constrained-syntax one in the sense that it contains several syntactic constraints to be enforced by the system, so that individuals represent rule sets that are valid and easy to interpret, due to the use of a disjunctive normal form representation.
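One plausible reading of a disjunctive normal form constraint is sketched below: a rule antecedent is a disjunction of conjunctions, so an OR node may never appear beneath an AND node, and relational operators sit directly above terminals. The paper's actual constraint set is defined in its Section 3 and is not reproduced here; the tuple encoding, the function `is_valid_dnf` and the attribute names are assumptions for illustration.

```python
# Relational operators ("!=" is an ASCII stand-in for "≠").
RELATIONAL = {"=", "!=", "<=", ">"}

def is_valid_dnf(node, under_and=False):
    """Check that a tree of nested tuples is a disjunction of conjunctions of conditions."""
    op = node[0]
    if op == "OR":
        # A disjunction nested inside a conjunction would break DNF.
        return not under_and and all(is_valid_dnf(child, False) for child in node[1:])
    if op == "AND":
        return all(is_valid_dnf(child, True) for child in node[1:])
    if op in RELATIONAL:
        # A condition is (operator, attribute name, constant value).
        return len(node) == 3 and isinstance(node[1], str) and not isinstance(node[2], tuple)
    return False

# (sex = F AND age > 50) OR (tumor_size <= 2): valid DNF.
valid = ("OR",
         ("AND", ("=", "sex", "F"), (">", "age", 50)),
         ("<=", "tumor_size", 2))

# (sex = F OR age > 50) AND (tumor_size <= 2): OR beneath AND, not DNF.
invalid = ("AND",
           ("OR", ("=", "sex", "F"), (">", "age", 50)),
           ("<=", "tumor_size", 2))

print(is_valid_dnf(valid), is_valid_dnf(invalid))  # prints "True False"
```

A constrained-syntax system would enforce such a check when creating and recombining individuals, so that invalid trees never enter the population.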

The remainder of this paper is organized as follows. Section 2 presents an overview of the area of GP, in order to make this paper self-contained. Section 3 describes the proposed constrained-syntax GP for discovering classification rules. Section 4 reports the results of computational experiments comparing the GP with C4.5 and a “Booleanized” GP [2]. Finally, Section 5 presents the conclusions and suggested research directions.

Section snippets

An overview of genetic programming

In GP, the basic idea is the evolution of a population of “programs”; i.e., candidate solutions to the specific problem at hand. A program (an individual of the population) is usually represented as a tree, where the internal nodes are functions (operators) and the leaf nodes are terminal symbols. More complex representations, such as graphs or multiple trees, were proposed in the past [13], [27], but they are far less popular than the tree representation, which was popularized by Koza’s first

A constrained-syntax GP for discovering classification rules

This section describes a constrained-syntax GP system developed for discovering IF–THEN classification rules. Hence, the system addresses the classification problem, as discussed in Section 1. The design of the system involved the following aspects: individual representation, genetic operators, fitness function, and classification of new instances. These subjects are discussed in the next subsections.

Data sets and running parameters

Experiments were done with five data sets—namely chest pain, Ljubljana breast cancer, dermatology, Wisconsin breast cancer, and pediatric adrenocortical tumor. Table 2 summarizes the information about the first four data sets: the number of examples (records), attributes and classes. Ljubljana breast cancer, Wisconsin breast cancer and dermatology are public domain data sets, available from the Machine Learning Repository at the University of California at Irvine (//www.ics.uci.edu/~mlearn/mlrepository.html

Conclusions

We have proposed a constrained-syntax GP for discovering classification rules. This GP uses a function set consisting of both logical operators (AND, OR) and relational operators (“=”, “≠”, “≤”, “>”). Rule sets are represented by individuals where these operators are arranged in a hierarchical, tree-like structure. We have specified several syntactic constraints to be enforced by the GP, so that individuals represent rule sets that are valid and easy to interpret, due to the use of the

References (31)

  • Banzhaf W, Nordin P, Keller RE, Francone FD. Genetic programming—an introduction: on the automatic evolution of...
  • Bojarczuk CC, et al. Genetic programming for knowledge discovery in chest pain diagnosis. IEEE Eng. Med. Biol. Mag. (2000)
  • Bojarczuk CC, Lopes HS, Freitas AA. An innovative application of a constrained-syntax genetic programming system to the...
  • Clack C, Yu T. PolyGP: a polymorphic genetic programming system in Haskell. In: Proceedings of the 3rd Annual...
  • Dhar V, et al. Discovering interesting patterns for investment decision making with GLOWER—a genetic learner overlaid with entropy reduction. Data Mining Know. Discovery J. (2000)
  • Fayyad UM, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery: an overview. In: Fayyad UM, et al.,...
  • Freeman JJ. A linear representation for GP using context free grammars. In: Proceedings of the 3rd Annual Conference on...
  • Freitas AA. Understanding the crucial role of attribute interaction in data mining. Artif. Intell. Rev. (2001)
  • Freitas AA. A survey of evolutionary algorithms for data mining and knowledge discovery. In: Ghosh A, Tsutsui S,...
  • Freitas AA. Data mining and knowledge discovery with evolutionary algorithms. Berlin: Springer-Verlag;...
  • Freitas AA. Evolutionary algorithms. In: Zytkow J, Klosgen W, editors. Handbook of data mining and knowledge discovery....
  • Greene DP, et al. Competition-based induction of decision models from examples. Machine Learn. (1993)
  • Gruau F. Cellular encoding of genetic neural networks. Technical Report 92-21. Ecole Normale Superieure de Lyon...
  • Hand DJ. Construction and assessment of classification rules. Chichester: Wiley;...
  • Janikow CZ, Deweese S. Processing constraints in genetic programming with CGP2.1. In: Proceedings of the 3rd Annual...