1 Introduction

Classification is understood as the act of placing an object into one of a set of categories based on the object's properties. Objects are classified according to a classification scheme, in most cases a hierarchical one, which is also called a taxonomy. Among many other possible applications, examples of taxonomic classification are biological classification (for example, categorizing and grouping organisms into species) and medical classification.

A statistical classification algorithm takes feature representations of objects and maps them to one of a set of predefined classification labels. Such classification algorithms are designed to learn (i.e., to approximate the behavior of) a function which maps a vector of object features to one of several classes; this is done by analyzing a set of input-output examples ("training samples") of the function. The basic steps can be summarized as follows: after training (i.e., after learning a classifier or a set of classifiers) on a set of training samples, one should be able to classify a new, unknown object into one of the predefined classes by analyzing its measured features using the classifier(s) returned by the training algorithm. Since statistical classification algorithms are supposed to "learn" such functions, we are dealing with a specific subarea of machine learning and, more generally, artificial intelligence. Several approaches are nowadays used for solving data mining and, more specifically, classification problems; the most common ones are (as described, for example, in [32]) decision tree learning, instance-based learning, inductive logic programming (using Prolog, for example) and reinforcement learning.

The task of a classification algorithm in the context of patient classification is to find a formal description that models the underlying classification rules, so that the diagnosis can be predicted from a set of other measured features (for example, blood parameter values).

The application of data-based machine learning techniques for learning classifiers is one of the major topics not only in computer science in general but especially in medical data mining. When it comes to automatically classifying patients into those who possibly suffer from a certain disease and those who are healthy, one often faces the problem that special tests (such as blood analyses) are costly in terms of time and money. This is why one has to look for hidden relationships between the symptoms of a certain disease and other features which are much easier to measure.

In this paper we present a classification algorithm based on genetic programming (GP) [25] using a structure identification framework [51] described in Sect. 2.1, in combination with an enhanced, hybrid selection scheme [3] (as explained in Sect. 2.2.2). This GP approach has been implemented as a part of HeuristicLab [47], a framework for prototyping and analyzing optimization techniques for which both generic concepts of evolutionary algorithms and many functions to evaluate and analyze them are available.

The results obtained using this enhanced GP based identification approach are compared to those achieved by methods that are nowadays standard techniques in machine learning, including (but not limited to) linear regression modeling (Lin), (artificial) neural networks (NN), k-nearest-neighbor (kNN) classification, support vector machines, and rule learning. We also compare these results to those achieved using other variants of GP based machine learning.

The comparison of the methods mentioned is done on the basis of five medical benchmark classification problems: the Wisconsin, Heart, Diabetes and Thyroid data sets taken from the UCI machine learning repository, as well as the Melanoma data set prepared by members of the Department of Dermatology of the Medical University Vienna. In Sect. 3 we document the results obtained using the enhanced GP method presented here and compare them to those returned by the classification algorithms mentioned previously.

2 GP-based classification in medical diagnostics

2.1 The GP based structure identification approach

2.1.1 General remarks

Preliminary work for the approach presented in this paper was done in the project "Specification, Design, and Implementation of a Genetic Programming Approach for Identifying Nonlinear Models of Mechatronic Systems", part of a strategic project at the Johannes Kepler University Linz, Austria, focusing on the identification of mechatronic systems. It has been shown (see for instance [5] or [51]) that GP methods are suitable for determining an appropriate mathematical representation of a physical, mechatronic system; in particular, NO_x and soot emissions of Diesel engines have been investigated in this context, yielding satisfying models (especially for NO_x). The methods implemented for this project have been used for developing a GP-based statistical classification algorithm. This algorithm works on a set of training examples with known properties X_1, ..., X_n; one of these properties (X_t) has to represent the membership information with respect to the underlying taxonomy. In the context of medical applications, this target property carries the information about the respective patient's diagnosis (e.g., "diseased" versus "not diseased", or "malign" versus "benign"). On the basis of the training examples, the algorithm tries to evolve (or, as one could also say, to "learn") a solution, that is, a formula representing a function which maps a vector of object features to one of the given classes. In other words, each instance of the classification problem to be solved is interpreted as an instance of an optimization problem; a solution is found by a heuristic optimization algorithm.

The goal of the implemented GP classification process is to produce functions from a database containing the measured results of the experiments to be analyzed. Thus, the GP algorithm works with solution candidates that are tree structure representations of symbolic expressions.

As we will see later in Sect. 3, the computational effort needed for GP based learning of classifiers can be rather high, especially when using large populations or enhanced selection strategies (as described later in Sect. 2.2.2). Still, the following advantages of evolutionary learning of classifiers outweigh these drawbacks:

  • The produced models are made up of well-known function and terminal primitives, so that they can be understood and analyzed by people who are not necessarily experts in machine learning or even informatics.

  • In fact, the functional basis (as described in the next subsection) can also be used in other data based modeling tasks as for example described in [55], [56] and [51].

2.1.2 Solution candidate representation using hybrid tree structures

The selection of the library functions is an important part of any GP modeling process because this library should be able to represent a wide range of systems; Table 1 gives an overview of the function set as well as the terminal nodes used for the classification experiments documented in this paper. As the reader can see in Table 1, mathematical functions and terminal nodes are used as well as Boolean operators for building complex arithmetic expressions. Apart from the obvious requirement that [Then/Else] branches and Boolean conditions have to be attached to [IF] nodes, there are no structural restrictions on the use of Boolean blocks within mathematical expressions. Thus, the concept of decision trees is included in this approach together with the standard structure identification concept that tries to evolve nonlinear mathematical expressions. An example showing the structure tree representation of a combined formula including arithmetic as well as logical functions is displayed in Fig. 1.

Table 1 Set of function and terminal definitions for enhanced GP based classification

Fig. 1 An exemplary hybrid structure tree

Even though our GP approach is not grammar driven, the evolutionary process is still supposed to produce syntactically correct structures; for example, the first child (argument) of an [IF] node should always be a Boolean or logical expression (node). This is why the following syntactic constraints are defined:

  • The first input to a conditional node has to be a Boolean or a logical expression.

  • The inputs to logical functions have to be logical or Boolean expressions.

  • Nodes representing logical or Boolean functions are only allowed to be children of nodes that represent conditional or logical functions.

These constraints are saved as meta-information about the functional basis and are considered by the crossover and mutation operators; these operators have been implemented in such a way that only syntactically correct structure trees are produced.
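To make these constraints concrete, the following minimal Python sketch (not the actual HeuristicLab implementation) shows one way such type information could be stored as meta-information and queried before a subtree is inserted by crossover or mutation; all symbol names and the NODE_CONSTRAINTS table are illustrative assumptions.

    # Minimal sketch: node "types" are stored as meta-information and checked
    # before a subtree is inserted by crossover or mutation.
    ARITHMETIC, BOOLEAN, CONDITIONAL = "arithmetic", "boolean", "conditional"

    # For each function symbol: its own type and the required types of its children.
    NODE_CONSTRAINTS = {
        "+":   (ARITHMETIC,  [ARITHMETIC, ARITHMETIC]),
        "*":   (ARITHMETIC,  [ARITHMETIC, ARITHMETIC]),
        "<":   (BOOLEAN,     [ARITHMETIC, ARITHMETIC]),
        "AND": (BOOLEAN,     [BOOLEAN, BOOLEAN]),
        "IF":  (CONDITIONAL, [BOOLEAN, ARITHMETIC, ARITHMETIC]),  # condition, then, else
    }

    def insertion_allowed(parent_symbol, child_index, child_symbol):
        """Return True if a subtree rooted at child_symbol may become the
        child_index-th argument of parent_symbol."""
        _, required = NODE_CONSTRAINTS[parent_symbol]
        child_type, _ = NODE_CONSTRAINTS.get(child_symbol, (ARITHMETIC, []))
        return child_type == required[child_index]

    # A Boolean subtree is a valid first argument of an [IF] node ...
    assert insertion_allowed("IF", 0, "<")
    # ... but not a valid argument of an arithmetic node.
    assert not insertion_allowed("+", 0, "AND")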

2.1.3 Evaluation of classification models

There are several possible functions that can serve as fitness functions for a GP process. For example, the ratio of misclassifications (using optimal thresholds) or the area under the corresponding receiver operating characteristic (ROC) curve ([61], [10]) could be used. Another function frequently used for quantifying the quality of models is the R² function, which takes into account the sum of squared errors as well as the sum of squared target values; an alternative, the so-called adjusted R² function, is also used in many applications.

In [8] and [9], for example, another classification specific fitness function for GP is described: roughly speaking, four types of results can be observed for the prediction, namely true positive, false positive, true negative and false negative classifications; the fitness function combines two indicators commonly used in the medical domain, namely the sensitivity and the specificity, whose product is returned as the fitness of a classifier. The goal of a GP process therefore is to maximize both sensitivity and specificity at the same time.

Still, we have decided to use a variant of the squared error function for estimating the quality of a classification model. The most important aspect to be considered when using this function is the following: errors of predicted values that are lower than the lowest class value or greater than the greatest class value do not contribute fully quadratically, but only linearly to the fitness value. To be more precise: N given samples with original classifications o_i are divided into n classes with class labels c_1, ..., c_n (where c_1 is the lowest and c_n the greatest class value), which means that each original classification (target) value o_i of sample i equals one of the n class labels c_1, ..., c_n:

$$ \forall_{i \in [1,N]}: o_i \in \{c_1, \ldots, c_n\} $$
(1)

The fitness value F of a classification model producing the estimated classification values e i is then evaluated as follows:

$$ \begin{array}{c} \forall_{i \in [1,N]}:\\ (e_i < c_1) \Rightarrow f_i=(o_i-c_1)^2 + \mid c_1-e_i \mid,\\ (c_1 \leq e_i \leq c_n) \Rightarrow f_i=(e_i-o_i)^2,\\ (e_i > c_n) \Rightarrow f_i=(o_i-c_n)^2 + \mid c_n-e_i \mid \end{array} $$
(2)
$$ F = \frac{1}{N}\sum_{i=1}^{N}{f_i} $$
(3)

The reason for this is that values greater than the greatest class value or below the lowest class value are in any case classified as belonging to the class with the greatest or the lowest class label, respectively; a standard implementation of the squared error function would therefore penalize a formula producing such values more than necessary.
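For illustration, the following NumPy sketch evaluates Eqs. (1)-(3) for given original classifications, estimated values and class labels; it is a minimal re-implementation for clarity, not the HeuristicLab code.

    import numpy as np

    def classification_fitness(o, e, class_values):
        """Mean adapted squared error following Eqs. (1)-(3): predictions outside the
        class-value range contribute only linearly beyond the nearest class value."""
        o = np.asarray(o, dtype=float)
        e = np.asarray(e, dtype=float)
        c_lo, c_hi = min(class_values), max(class_values)
        f = np.empty_like(e)
        below = e < c_lo
        above = e > c_hi
        inside = ~(below | above)
        f[below] = (o[below] - c_lo) ** 2 + np.abs(c_lo - e[below])   # first case of Eq. (2)
        f[inside] = (e[inside] - o[inside]) ** 2                      # second case of Eq. (2)
        f[above] = (o[above] - c_hi) ** 2 + np.abs(c_hi - e[above])   # third case of Eq. (2)
        return f.mean()                                               # Eq. (3)

    # Example: the estimate 3.4 for a sample of class 2 (classes 1, 2, 3)
    # yields f = (2 - 3)^2 + |3 - 3.4| = 1.4
    print(classification_fitness([2], [3.4], class_values=[1, 2, 3]))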

2.1.4 Finding appropriate class thresholds: dynamic range selection

Of course, a mathematical expression alone does not yet define a classification model; thresholds are needed for dividing its output into multiple ranges, each corresponding to exactly one class. In static range selection (SRS; see for example [27] for explanations) these regions are defined before starting the training algorithm, which brings along the difficulty of having to determine appropriate range boundaries a priori. Within the GP based classification framework discussed in this paper we have therefore used dynamic range selection (DRS), which attempts to overcome this problem by evolving the range thresholds along with the classification models: thresholds are chosen in such a way that the sum of class-wise ratios of misclassifications over all given classes is minimized (on the training data, of course).

In detail, let us consider the following: given N (training) samples with original classifications o_i divided into n classes c_1, ..., c_n (with c_1 being the lowest and c_n the greatest class value), models produced by GP can in general be used for calculating estimated values e_i for all N samples. Assuming thresholds T = (t_1, ..., t_{n-1}) (with c_j < t_j < c_{j+1} for j ∈ [1, n−1]), each sample k is classified as ec_k:

$$ e_k < t_1\rightarrow ec_k(T)=c_1 $$
(4)
$$ t_j < e_k < t_{j+1}\rightarrow ec_k(T) = c_{j+1} $$
(5)
$$ e_k > t_{n-1} \rightarrow ec_k(T)=c_n $$
(6)

Thus, assuming a set of thresholds T_m, for each class c_k we get the ratio of correctly classified samples cr_k as

$$ total_k(T_m)=|\{a : (\forall_{x \in a}: o_x = ec_k(T_m)) \}| $$
(7)
$$ correct_k(T_m)=|\{b : (\forall_{x \in b}: o_x = ec_k(T_m) \wedge o_x = c_k) \}| $$
(8)
$$ cr_k(T_m)=\frac{correct_k(T_m)}{total_k(T_m)}. $$
(9)

The sum of ratios of correctly classified samples, depending on the set of thresholds T_m, is calculated as

$$ cr(T_m) = \sum_{i=1}^{n}cr_i(T_m). $$
(10)

Finally, we define the set of thresholds to be applied as the set T_opt such that every other set of thresholds leads to a lower sum of class-wise classification accuracies:

$$ T_d \neq T_{opt} \rightarrow cr(T_d)<cr(T_{opt}) $$
(11)

The optimal set of thresholds T_opt is calculated in the following way: for each pair of adjacent class values (c_i, c_{i+1}) we calculate nt equidistant candidate thresholds t_{tmp(i,d)} (for d ∈ [1, ..., nt]) between c_i and c_{i+1}; thus, \(c_i<t_{tmp(i,d_1)}<t_{tmp(i,d_2)}< c_{i+1}\) and \((t_{tmp(i,d_1+1)}-t_{tmp(i,d_1)})=(t_{tmp(i,d_2+1)}-t_{tmp(i,d_2)})\) for 1 ≤ d_1 < d_2 ≤ nt. All possible combinations of these candidate thresholds are evaluated on the training data; the threshold set T_opt that is optimal for the training samples with respect to cr(T_opt) is fixed and also applied to the test samples.

Please note that this sum of class-wise classification accuracies is not equal to the total ratio of correctly classified samples, which is used later in Sect. 3 for estimating the quality of the classifiers produced by the various algorithms; the total classification accuracy for a set of thresholds acc(T_m) (assuming original values o and estimated values e) is defined as

$$ z(T_m)=|\{x : o_x = ec_x(T_m) \}| $$
(12)
$$ acc(T_m)=\frac{z(T_m)}{N}. $$
(13)
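The following Python sketch illustrates this threshold search and the accuracy definitions; it is an exhaustive, illustrative re-implementation (not the HeuristicLab code), and the class-wise ratio is computed over the samples assigned to each class, which is one possible reading of Eqs. (7)-(9).

    import itertools
    import numpy as np

    def best_thresholds(o, e, class_values, nt=10):
        """Dynamic range selection sketch: try nt equidistant candidate thresholds
        between each pair of adjacent class values and keep the combination that
        maximizes the sum of class-wise accuracies (Eq. 10). Exhaustive search;
        class_values is the sorted list of class labels c_1 < ... < c_n."""
        o, e = np.asarray(o), np.asarray(e)
        cands = [np.linspace(c_lo, c_hi, nt + 2)[1:-1]          # interior points only
                 for c_lo, c_hi in zip(class_values[:-1], class_values[1:])]
        best, best_cr = None, -np.inf
        for T in itertools.product(*cands):
            ec = np.array([class_values[np.searchsorted(T, v)] for v in e])
            cr = 0.0
            for c in class_values:                              # class-wise ratio (Eqs. 7-9)
                predicted = ec == c
                if predicted.any():
                    cr += np.mean(o[predicted] == c)
            if cr > best_cr:
                best, best_cr = T, cr
        return best

    def total_accuracy(o, e, T, class_values):
        """Total classification accuracy (Eqs. 12-13) for a fixed threshold set,
        e.g. for applying training-optimal thresholds to the test partition."""
        ec = np.array([class_values[np.searchsorted(T, v)] for v in e])
        return np.mean(np.asarray(o) == ec)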

2.2 Advanced algorithmic concepts applied within GP-based structure identification

Our first attempt at adjustable selection pressure handling was the so-called Segregative Genetic Algorithm (SEGA) [1], which introduces birth surplus in the sense of a (μ,λ) evolution strategy [7] into the general concept of a GA and uses this additional flexibility primarily for adaptive selection pressure steering in the migration phases of the parallel GA in order to improve the achievable global solution quality. The SASEGASA [3] is an extension of SEGA; its most important feature is its ability to self-adaptively adjust selection pressure in order to achieve progress in solution quality without losing essential genetic information, the loss of which would lead to unwanted premature convergence. The SASEGASA is generic in the sense that all algorithmic extensions are problem-independent, i.e., they do not depend on a certain problem representation and the corresponding operators.

Therefore, we can easily combine the enhanced algorithmic concepts of the SASEGASA with genetic programming (GP). However, compared to the experience gathered when applying SASEGASA in other domains such as combinatorial or real-valued optimization [3], we have observed two major differences:

  • The potential in terms of achievable solution quality in comparison with the standard algorithms seems to be considerably higher in the field of GP than in standard applications of GAs.

  • By far not all algorithmic extensions of SASEGASA are relevant in GP: only some aspects of the rather complex SASEGASA concept matter in the GP domain, which makes the handling, and especially the parameter adjustment, easier and more robust.

Therefore, the discussion in this article will focus on the algorithmic parts of SASEGASA that are really relevant for GP. This section is structured as follows: the first subsection describes the general idea of SASEGASA in a compact way, whereas the second subsection discusses in further detail those parts of SASEGASA that are relevant for the present contribution, and the reasons why.

For a more detailed description of all involved algorithmic aspects the interested reader is referred to the book [2].

2.2.1 A short description of SASEGASA

In principle, the SASEGASA introduces two enhancements to the basic concept of Genetic Algorithms. Firstly, it brings in a variable and self-adaptive selection pressure in order to control the diversity of the evolving population in a goal-oriented way with respect to the objective function. Secondly, it introduces a separation of the population in order to broaden the search process, and joins the subpopulations after their evolution so as to end up with a population containing all the genetic information needed for locating a global optimum.

At the beginning of the evolutionary process the whole population is divided into a certain number of subpopulations. These subpopulations evolve independently of each other until the fitness increase stagnates in all of them because their individuals have become too similar, i.e., until local premature convergence occurs. Thanks to offspring selection, this can be detected exactly when an upper limit of selection pressure is exceeded (cf. Sect. 2.2.2); subsequently, a reunification from n to (n−1) subpopulations is performed by joining an appropriate number of adjacent subpopulation members.

Thus, a certain number of villages (subpopulations) slowly grow together during the evolutionary process, forming bigger towns and eventually ending up as one big city containing the whole population at the end of evolution. This ensures that essential building blocks can evolve independently in different regions of the search space, especially in the early phases of the evolutionary process.

2.2.2 Relevant algorithmic parts of SASEGASA for genetic programming based machine learning

In [3] it has been shown that the aspect of segregation and reunification is highly relevant for systematically improving the achievable global solution quality of combinatorial optimization problems such as the traveling salesman problem (TSP). Still, we have not used this parallel approach for our GP-based modeling studies for the following reasons: on the one hand, it would lead to a considerable increase in runtime; on the other hand, we do not expect any significant increase in solution quality from this concept in GP-based modeling, as the results summarized in [2] indicate that this parallel approach does not remarkably affect the solution quality of optimization problems other than combinatorial ones.

In contrast to this, the possibility of self-adaptive selection pressure steering within SASEGASA, provided by offspring selection (OS), has proven extremely powerful especially for GP problem representations; this has already been shown in [52], [53], [55] and [54], and will also be treated in the discussion of the experimental results of the present article (Sect. 3).

An essential question about the general performance of GAs and GP is whether or not good parents are able to produce children of comparable or even better fitness (the building block hypothesis implicitly relies on this). In natural evolution, this is almost always true. For artificial evolution, and especially for GP, this property is not so easy to guarantee; OS is designed to assure exactly this property.

In order to produce a child for the ongoing evolutionary process, offspring selection does not only consider the fitness of the parents; additionally, the fitness value of the newly produced offspring is compared to the fitness values of its own parents: an offspring is accepted as a candidate for the further evolutionary process if and only if the reproduction operator was able to produce an offspring that outperforms the fitness of its own parents. This strategy guarantees that evolution proceeds mainly with crossover results that were able to mix the properties of their parents in an advantageous way.

As in the case of conventional GAs or GP, offspring are generated by parent selection, crossover, and mutation. In a second (offspring) selection step, a predefined parameter determines the fraction of the next generation's members that have to outperform their own parents; this parameter is called the success ratio (SuccRatio). As long as this ratio is not fulfilled, children are created and only successful offspring definitely become members of the next generation; this is schematically displayed in Fig. 2. As soon as the postulated ratio is reached, the rest of the next generation's members are randomly chosen from the children that did not reach the success criterion. Within this selection model, selection pressure is defined as the ratio of generated candidates to the population size. An upper limit for the selection pressure gives a quite intuitive termination criterion: as soon as it is no longer possible to find a sufficient number of offspring that outperform their parents, the algorithm terminates (in the simple version used here).

Fig. 2 Flowchart for embedding offspring selection into a Genetic Algorithm

Research results obtained within the last two years have led to the conclusion that a simplified version of offspring selection, used in combination with enhanced parent selection [48], shows the best and most robust results in the context of GP applications. Thus, SuccRatio should be set to 1.0, i.e., every offspring for the next generation is forced to pass the success criterion. Furthermore, in GP applications it is beneficial to define that a child is better than its parents if and only if it is better than the better of the two parents. In the context of combinatorial optimization problems, where some intermediate value of the parents' fitness values is used as the threshold for the success criterion, such settings tend to support premature convergence; still, in genetic programming applications these parameter settings quite robustly lead to high quality results.

However, there is one aspect concerning parent selection that has to be considered in this application domain: It is—applying the parameter settings of offspring selection mentioned above—most effective to use different selection methods for the selection of the two parents which are chosen for crossover. In the present context this gender specific selection aspect [48] is implemented most effectively by selecting one parent conventionally by roulette-wheel selection and the other parent randomly.

Altogether, this special variant of adapted sexual selection combined with a simplified version of offspring selection crosses one above-average parent with a randomly selected parent (which brings in diversity) until a whole new population has been filled with children that are better than their better parent. An upper limit for the selection pressure acts as termination criterion in the sense that the algorithm stops if too many trials (|POP| · maxSelPress) have already been made and still no new population consisting of successful offspring could be generated; in other words, the current gene pool no longer allows the generation of enough children that outperform their parents, which is a reasonable termination criterion for an evolutionary algorithm. This special version of SASEGASA (offspring selection) is schematically shown in Fig. 3.

Fig. 3 Flowchart for embedding a simplified version of offspring selection into the GP based machine learning process
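The following Python sketch summarizes this simplified, strict offspring selection step; fitness, crossover, mutate and roulette_select are problem-specific placeholders, and the sketch assumes a fitness value that is to be minimized (such as the adapted squared error of Sect. 2.1.3).

    import random

    def next_generation(pop, fitness, crossover, mutate, roulette_select,
                        max_sel_press=100, mutation_rate=0.1):
        """One generation of strict offspring selection (SuccRatio = comparison
        factor = 1.0): a child is accepted only if it outperforms the better of
        its two parents (lower fitness is better). Returns the new population,
        or None once the limit |POP| * maxSelPress is exceeded (termination)."""
        pop_size, trials, new_pop = len(pop), 0, []
        while len(new_pop) < pop_size:
            if trials >= pop_size * max_sel_press:
                return None                      # gene pool exhausted -> terminate
            # gender-specific parent selection: one roulette-wheel parent, one random parent
            p1 = roulette_select(pop, fitness)
            p2 = random.choice(pop)
            child = crossover(p1, p2)
            if random.random() < mutation_rate:
                child = mutate(child)
            trials += 1
            # strict success criterion: better than the better of the two parents
            if fitness(child) < min(fitness(p1), fitness(p2)):
                new_pop.append(child)
        return new_pop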

3 Empirical studies

3.1 Classification methods used

For comparing GP based classification with other machine learning methods, the following techniques for training classifiers have been examined: genetic programming (enhanced approach as described in Sect. 2), linear modeling, neural networks, the k-nearest-neighbor method, and support vector machines.

3.1.1 GP-based training of classifiers

We have used the following parameter settings for our GP test series:

  • Single population approach; population size: 500–1,000

  • Mutation rate: 10%

  • Maximum formula tree height: 8

  • Parent selection: gender specific (random & roulette)

  • Offspring selection: strict offspring selection (success ratio as well as comparison factor set to 1.0)

  • Termination criteria:

    • Maximum number of generations: 1,000 (never reached; all executions were terminated via the maximum selection pressure criterion)

    • Maximum selection pressure: 100 (as described in Sect. 2.2.2, details can be found in [3] and [4])

  • Function set: all functions as described in Sect. 2.1

In addition to splitting the given data into training and test data, enhanced GP (EGP) based training is implemented in such a way that a part of the given training data is not used for training models but serves as a validation set; in the end, when it comes to returning classifiers, the algorithm returns those models that perform best on the validation data. This approach has been chosen because it is assumed to help cope with over-fitting; it is also applied in other GP based machine learning algorithms, as described for example in [6]. The same was done in our standard GP tests for the Melanoma data set.

3.1.2 Linear modeling

Given a data collection including m input features storing the information about N samples, a linear model is defined by the vector of coefficients θ_1, ..., θ_m. For calculating the vector of modeled values e from the given input value matrix u_{1...m}, these input values are multiplied with the corresponding coefficients and summed:

$$ e=u_{1 \ldots m} * \theta $$
(14)

The coefficient vector can be computed by simply applying matrix division. For conducting the test series documented here we have used the matrix division operator provided by MATLAB:

$$ {\tt theta = InputValues}\backslash {\tt TargetValues;} $$

If a constant additive factor is to be included into the model (i.e., the coefficients vector), this command has to be extended:

$$ \begin{array}{c} {\tt r=size(InputValues,1);}\\ {\tt theta=[InputValues ones(r,1)]} \backslash {\tt TargetValues;} \end{array} $$

Theoretical background of this approach can be found in [29].
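For readers who do not work with MATLAB, the following NumPy sketch is an equivalent least-squares formulation (the function names are ours, chosen only for illustration).

    import numpy as np

    def fit_linear_model(input_values, target_values):
        r"""Least-squares equivalent of the MATLAB call
        theta = [InputValues ones(r,1)] \ TargetValues"""
        r = input_values.shape[0]
        X = np.hstack([input_values, np.ones((r, 1))])     # constant (intercept) column
        theta, *_ = np.linalg.lstsq(X, target_values, rcond=None)
        return theta

    def linear_predict(input_values, theta):
        """e = u * theta (Eq. 14), with the intercept column appended as above."""
        r = input_values.shape[0]
        return np.hstack([input_values, np.ones((r, 1))]) @ theta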

3.1.3 Neural networks

For training neural network (NN) models, three-layer feed-forward neural networks with one output neuron were created using backpropagation as well as the Levenberg-Marquardt training method. Theoretical background and details can be found in [35] (Chap. 11, "Neural Networks"), [31], [26] or [20].

The following two approaches have been applied for training neural networks:

  • On the one hand we have trained networks with 5 neurons in the hidden layer (referred to as “NN1” in the test series documentation in Sect. 3.3) as well as networks with 10 hidden neurons (referred to as “NN2” in the test series documentation); the number of iterations of the training process was set to 100 (in the first variant, “NN1”) and 300 (in the second variant, “NN2”). In the context of analyzing the benchmark problems used here, higher numbers of nodes or iterations are likely to lead to overfitting (i.e., a better fit on the training data, but worse test results).

    The NN training framework used to collect the results reported in this paper is the NNSYSID20 package, a neural network toolbox for MATLAB implementing the Levenberg-Marquardt training method; it has been implemented by Magnus Nørgaard at the Technical University of Denmark [37].

  • On the other hand, the multilayer perceptron training algorithm available in WEKA [57] has also been used for training classifiers. In this case the number of hidden nodes was set to (a + c)/2, where a is the number of attributes (features) and c the number of classes. The number of iterations was not pre-defined; instead, 10% of the data were designated as validation data and, in order to combat the danger of overfitting, the training algorithm was terminated as soon as the error on the validation data worsened in 20 consecutive iterations. This training method, which applies backpropagation learning, is in the following referred to as the "NN3" method.

3.1.4 kNN classification

Unlike data based modeling methods that rely on linear models, neural networks or GP, k-nearest-neighbor classification works without creating any explicit model. During the training phase, the data are simply stored; when it comes to classifying a new, unknown sample x_new, the distance between x_new and all training samples x_train is calculated, and the classification is done on the basis of those k training samples (x_NN) showing the smallest distances from x_new.

The distance between two samples is calculated as follows: first, all features are normalized by subtracting the respective mean values and dividing the result by the respective variable's standard deviation. Given a data matrix x including m features storing the information about N samples, the normalized values x_norm are calculated as

$$ \forall_{i \in [1,m]} \forall_{j \in [1,N]}: x_{norm}(i,j) = \frac{x(i,j)-\frac{1}{N}\sum_{k=1}^{N}{x(i,k)}}{\sigma(x(i,1 \ldots N))} $$
(15)

where the standard deviation σ of a given variable x storing N values is calculated as

$$ \sigma(x) = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}{(x_i-\bar{x})^2}} $$
(16)

with \(\bar{x}\) denoting the mean value of x.

Then, on the basis of the normalized data, the distance between two samples a and b, d(a,b), is calculated as the mean squared variable-wise distance:

$$ d(a,b) = \frac{1}{m}\sum_{i=1}^{m}{(a_{norm}(i)-b_{norm}(i))^2} $$
(17)

where m is, as above, the number of features stored for each sample.

In the context of classification, the number of instances (among the k nearest neighbors) is counted for each given class, and the algorithm predicts the class represented by the highest number of instances in x_NN. In the test series documented in this paper we have applied weighting to kNN classification: the distance between x_new and the neighbors in x_NN is relevant for the classification decision, so that "nearer" samples are weighted more heavily than samples that are "further" away from x_new.

There is a lot of literature on kNN classification; very good explanations and compact overviews (including several possible variants and applications) are given, for example, in [17] and [41].
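A minimal NumPy sketch of this procedure is given below; since the exact weighting scheme is not specified above, inverse-distance weighting is used here as one plausible assumption.

    import numpy as np

    def knn_classify(x_train, y_train, x_new, k=5):
        """Distance-weighted kNN: z-score normalization (Eqs. 15-16), mean squared
        feature-wise distance (Eq. 17), and inverse-distance voting (an assumption)."""
        mu = x_train.mean(axis=0)
        sigma = x_train.std(axis=0, ddof=1)                 # Eq. (16)
        train_n = (x_train - mu) / sigma                    # Eq. (15)
        new_n = (x_new - mu) / sigma
        d = np.mean((train_n - new_n) ** 2, axis=1)         # Eq. (17)
        nearest = np.argsort(d)[:k]                         # indices of the k nearest samples
        votes = {}
        for i in nearest:
            weight = 1.0 / (d[i] + 1e-12)                   # nearer samples weigh more
            votes[y_train[i]] = votes.get(y_train[i], 0.0) + weight
        return max(votes, key=votes.get)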

3.1.5 Support vector machines

Support vector machines (SVMs) are a widely used machine learning approach based on statistical learning theory [46]; examples of the application of SVMs in the medical domain are reported in [34], for instance.

The most important aspect of SVMs is that it is possible to give bounds on the generalization error of the models produced and to select the best model from a set of models following the principle of structural risk minimization [46]. SVMs are designed to calculate hyperplanes that separate the classes of data points from each other and maximize the margin between them. While the basic training algorithm is only able to construct linear separators, so-called kernel functions can be used to calculate scalar products in higher-dimensional spaces; if the kernel functions used are non-linear, then the separating boundaries will be non-linear, too.

In this work we have used the SVM training algorithm described in [38] and [24], in the implementation available in the WEKA machine learning framework [57]; polynomial kernels as well as Gaussian radial basis function kernels were used, with the γ parameter (defining the inverse variance) set to 0.01 and the complexity parameter c set to 10,000.
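As an illustration only (the experiments themselves were run with WEKA's SMO implementation), an equivalent configuration in scikit-learn could look as follows; the polynomial degree is not stated above and is set to 3 here merely as an example.

    from sklearn.svm import SVC

    # RBF and polynomial kernel SVMs with the parameter values stated above
    # (gamma = 0.01, complexity/regularization parameter 10,000); scikit-learn's
    # SVC is used here as a stand-in for the WEKA SMO implementation.
    svm_rbf = SVC(kernel="rbf", gamma=0.01, C=10000.0)
    svm_poly = SVC(kernel="poly", degree=3, C=10000.0)

    # Typical usage: svm_rbf.fit(X_train, y_train); svm_rbf.score(X_test, y_test)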

3.2 Data sets

We will later report on the quality of the classifiers produced for the following five data collections:

  • The Wisconsin data set is part of the UCI Machine Learning Repository; in short, it represents medical measurements which were recorded while investigating patients potentially suffering from breast cancer. The number of features recorded is 9 (all of them continuous numeric ones); the file version we have used contains 683 recorded examples (by now, 699 examples are available, since the database is updated regularly).

  • The Heart Disease data collection contains data recorded by European medical scientists. In the context of this classification problem the main goal is to model the presence of potential heart diseases in patients. The target values of the given samples (again representing patients) are integer valued from 0 (no presence) to 4; experiments published in the machine learning community have concentrated on simply attempting to distinguish presence (target classification values 1,2,3,4; 139 samples in total) from absence (class value 0; 164 samples), thus turning this classification situation into a binary one.

    The original data contain 76 attributes, but, as is stated on the respective UCI webpage, all published experiments refer to a subset of 14 of them. Furthermore, the "Cleveland" data set is the only one that has been used by machine learning researchers; this is why we have used this data partition containing 14 variables and 303 samples, treating it as a binary classification task as described above.

  • The Pima Indians Diabetes data set (in this paper referred to as the “Pima” data set) is a data collection compiled by scientists at the Johns Hopkins University in Baltimore, MD, USA. The samples in this data set (in total 768) represent patients, in particular adult females of Pima Indian heritage. Based on 8 input variables the goal here is to classify samples as diseased or not; 500 samples originally belong to class “0” (representing patients who were tested negative) and 268 to class “1” (representing positive original classifications).

  • The Thyroid data set represents medical measurements which were recorded while investigating patients potentially suffering from hypo- or hyperthyroidism; this data set has also been taken from the UCI Machine Learning Repository. In short, the task is to determine whether a patient is hypothyroid or not. Three classes are formed: euthyroid (the state of having normal thyroid gland function), hyperthyroid (overactive thyroid) and hypothyroid (underactive thyroid); a good classifier has to be significantly better than 92%, simply because 92% of the patients are neither hyper- nor hypothyroid. In total, the data set contains 7,200 samples.

    The samples of the Thyroid data set are not equally distributed to the three given classes; in fact, 166 samples belong to class “1” (“subnormal functioning”), 368 samples are classified as “2” (“hyperfunction”), and the remaining 6,666 samples belong to class “3” (“euthyroid”). Twenty-one attributes (15 binary and 6 continuous ones) are stored in this data set.

  • The Melanoma data set represents medical measurements which were recorded while investigating patients potentially suffering from skin cancer. It contains 1,311 examples for which 30 features have been recorded; each of the 1,311 samples represents a pigmented skin lesion which has to be classified as a melanoma or a nonhazardous nevus. This data set has been provided to us by Prof. Michael Binder from the Department of Dermatology at the Medical University Vienna, Austria.

    A comparison of machine learning methods for the diagnosis of pigmented skin lesions (i.e., detecting skin cancer based on the analysis of visual data) can be found in [15]; in this paper the authors describe the quality of classifiers produced for a comparable data collection using kNN classification, ANNs, decision trees, and SVM. The difference is that in the data collection used in [15] all lesions were separated into three classes (common nevi, dysplastic nevi, or melanoma); here we use data representing lesions that have been classified as benign or malign, i.e. we are facing a binary classification problem.

As an alternative to the UCI repository, the Proben1 benchmark set [39] is a collection of real world problems originally established for neural networks. This benchmark collection contains data sets that have been taken from the UCI repository; still, the data have been modified insofar as they have been encoded for direct neural network use as described in [39]. For each available data set several different file versions (storing different data partitions) have been formed and frequently used for estimating the quality of machine learning algorithms; in the following we will also mention results that have been obtained using these data sets.

3.3 Results

Since the Wisconsin, Heart, Pima and Thyroid data sets are publicly available, the results produced by GP are compared to those that have been published previously for various machine learning methods; the Melanoma data set is not publicly available, therefore we have applied all machine learning approaches mentioned above for training classifiers for this data set ourselves.

All five data sets were investigated via 10-fold cross-validation (CV). This means that each original data set was divided into 10 disjoint sets of (approximately) equal size; thus, 10 different pairs of training (90% of the data) and test data sets (10% of the data) can be formed and used for testing the classification algorithm. For each data collection, each of the resulting 10 pairs of training and test data partitions has been used in 5 independent GP test runs; for the Melanoma data set, all machine learning algorithms mentioned previously have also been applied to all pairs of training and test data, the stochastic algorithms again applied 5 times independently. The results presented in this section have already been partially published in [53].
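The evaluation protocol can be sketched as follows; train_classifier and test_accuracy are placeholders for any of the training methods and the accuracy measure described above.

    import numpy as np

    def cross_validate(X, y, train_classifier, test_accuracy, folds=10, repeats=5, seed=0):
        """10-fold cross-validation with `repeats` independent runs per fold, as used
        for the stochastic algorithms (GP, NNs) in the test series described above."""
        rng = np.random.default_rng(seed)
        parts = np.array_split(rng.permutation(len(y)), folds)
        accuracies = []
        for i in range(folds):                      # each fold serves once as test set
            test_idx = parts[i]
            train_idx = np.concatenate([parts[j] for j in range(folds) if j != i])
            for _ in range(repeats):                # 5 independent runs per partition
                model = train_classifier(X[train_idx], y[train_idx])
                accuracies.append(test_accuracy(model, X[test_idx], y[test_idx]))
        return np.mean(accuracies), np.std(accuracies)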

In the following we will also report on results obtained using other GP variants in medical data analysis, as documented for example in [19], [11], [30], [14], [6], [9], [18], and [27]. The main results of these test series can be summarized in the following way: genetic programming is indeed able to produce classifiers for medical benchmark data sets, and its biggest advantages are its general applicability and the fact that it produces models that are understandable and interpretable. We will show here how the results produced by GP with strict offspring selection compare to those of other GP implementations as well as other frequently used machine learning algorithms.

3.3.1 Results for the Wisconsin data set

Table 2 summarizes the results for the 10-fold cross validation produced by GP with offspring selection as described in Sect. 3.1.1. These figures boil down to the fact that extended GP has in this case been able to produce classifiers that on average correctly classify 97.91% of training samples and 97.53% of test samples.

Table 2 Summary of training and test results for the Wisconsin data set: correct classification rates (average values and standard deviation values) for 10-fold CV partitions, produced by GP with offspring selection

In order to compare the quality of these results to those reported in the literature, Table 3 summarizes test accuracies that have been obtained using 10-fold cross validation. For each method listed we give the references to the respective articles in which these results have been reported. As can be seen, the results summarized in Table 2 outperform those of the other algorithms listed here.

Table 3 Comparison of machine learning methods: average test accuracy of classifiers for the Wisconsin data set

In the case of the Wisconsin data set, the results retrieved using this enhanced GP (EGP) classification algorithm were compared to those obtained using other GP variants. In [9], for example, results for steady state GP with 500 individuals are documented; for the Wisconsin breast cancer data set this GP implementation produced classifiers with 93.5% average classification accuracy (standard deviation: 0.79). In [27] recent results for several classification benchmark problems are documented; the Wisconsin data set was there analyzed using standard GP as well as three other GP based classification variants (POPE-GP, DecMO-GP and DecMOP-GP), and the respective results are also listed in Table 3.

In this context we have to admit that the computational effort of GP for producing these classifiers is higher than the runtime or memory consumed by most other machine learning algorithms; in our GP tests using the Wisconsin data set and populations with 500 individuals, the average number of generations executed was 51.6 and the average number of solutions evaluated ∼1,296,742.

In [11] and [14], for example, results for the Cancer data set of the Proben1 collection are documented; this data set has been created based on the Wisconsin data set and is available as three data partitions. According to Brameier and Banzhaf [11], linear GP was able to produce classifiers that correctly classify 94.28–97.82% of the given test samples (with varying accuracies for the three given problem instances), and De Falco et al. report that GP was used for producing classifiers with 93.92–97.54% classification accuracy. ANNs were also trained; these tests resulted in classifiers that correctly classify 95.23–98.62% of the test data.

3.3.2 Results for the Heart data set

Table 4 summarizes the results for the 10-fold cross validation produced by GP with offspring selection; extended GP with offspring selection (and populations containing 1,000 solutions) has in this case been able to produce classifiers that on average correctly classify 91.50% of training samples and 81.22% of test samples. As already mentioned previously, we have trained binary classifiers, so that the goal is to distinguish between the presence and absence of heart disease.

Table 4 Summary of training and test results for the Heart data set: correct classification rates (average values and standard deviation values) for 10-fold CV partitions, produced by GP with offspring selection

In order to compare the quality of these results to those reported in the literature, Table 5 summarizes test accuracies; in most cases, these results have been obtained using 10-fold or 5-fold cross validation. For each method listed we give the references to the respective articles in which these results have been reported for the binary classification. Negative correlation learning as documented in [13] was able to produce classifiers that correctly classify up to 81.98% of the given test samples; the enhanced GP reported in [18] yields up to 81.3% correct classifications on test samples. Using extended GP with strict offspring selection and populations containing 1,000 solutions we have been able to produce classifiers that on average correctly classify 81.22% of test samples; using populations containing only 100 solutions, the correct classification rate of GP with OS on test data is 80.74%.

Table 5 Comparison of machine learning methods: average test accuracy of classifiers for the Heart data set

3.3.3 Results for the Pima data set

Table 6 summarizes the results for the 10-fold cross validation produced by GP with offspring selection as described in Sect. 3.1.1. These figures boil down to the fact that extended GP (with strict offspring selection and populations containing 1,000 individuals) has in this case been able to produce classifiers that on average correctly classify 78.25% of training samples and 75.53% of test samples.

Table 6 Summary of training and test results for the Pima data set: correct classification rates (average values and standard deviation values) for 10-fold CV partitions, produced by GP with offspring selection

In order to compare the quality of these results to those reported in the literature, Table 7 summarizes test accuracies that have been obtained using cross validation; for each method listed we give the references to the respective articles in which these results have been reported. As we see here, the results summarized in Table 6 have to be considered rather satisfying, as other algorithms (including the GP approaches reported in [30] and [18], for example) show comparable or even slightly weaker results.

Table 7 Comparison of machine learning methods: average test accuracy of classifiers for the Pima data set

Please note that the other GP results have been produced using smaller populations and less computational effort: the GP approach discussed in [18] was applied with populations of 100 solutions, producing 200 offspring over not more than 99 generations, and the results documented in [30] have been produced by GP with 500 individuals evolving over 50 generations. These approaches obviously consume less runtime than ours; using populations containing 1,000 individuals over 59.2 generations on average and selection pressures eventually reaching 100, ∼1,420,935 solutions are evaluated here on average. Using smaller populations of only 100 solutions, GP with strict OS produces classifiers that correctly classify 74.87% (±4.95) of the given test samples.

In [11] and [14], results for the Diabetes data set of the Proben1 collection are also documented; this data set has been created on the basis of the Pima data set of the UCI repository. According to Brameier and Banzhaf [11], linear GP was able to produce classifiers that correctly classify 72.15–76.91% of the given test samples, and De Falco et al. report that GP was used for producing classifiers with 69.64–75.16% classification accuracy. Artificial neural networks were also trained, resulting in classifiers that correctly classify 73.58–77.41% of the test data.

3.3.4 Results for the Thyroid data set

In this section we report the results achieved for the Thyroid data set. Table 8 summarizes the results for the 10-fold cross validation produced by GP with offspring selection, again as described in Sect. 3.1.1. For each class we characterize the classification accuracy on training and test data: for each partition we state average as well as standard deviation values. These figures boil down to the fact that extended GP has in this case been able to produce classifiers that on average correctly classify 99.10% of training samples and 98.76% of test samples, the total standard deviation values being 0.73 and 0.92, respectively.

Table 8 Summary of training and test results for the Thyroid data set: correct classification rates (average values and standard deviation values) for 10-fold CV partitions, produced by GP with offspring selection

In order to compare the quality of these results to those reported in the literature, Table 9 summarizes a selection of test accuracies that have been obtained using 10-fold cross validation; again, for each method listed we give the references to the respective articles in which these results have been reported. Obviously the results summarized in Table 8 have to be considered quite good, though not perfect, as they are outperformed by the results reported in [49] and [16].

Table 9 Comparison of machine learning methods: average test accuracy of classifiers for the Thyroid data set

GP has also been repeatedly applied to the Thyroid problem; some of the published results are the following:

  • In [28] (Table 8), results produced by a Pareto-coevolutionary GP classifier system for the Thyroid problem are reported; in Table 9 they are listed as the "PGPC" results. In fact, these figures are not mean accuracy values but rather median values, which is why they are not fully comparable to the other results stated here. For obtaining these results, two populations consisting of 10 so-called learners and 50 so-called testers were trained over 50,000 epochs; details regarding this GP approach and its evaluation effort can be found in [28].

  • Loveard and Ciesielski reported that classifiers for the Thyroid problem could be produced using GP with test accuracies ranging from 94.9% to 98.2% (depending on the range selection strategy used); details are given in [30]. Populations containing 1,000 individuals were trained over 50 generations, so that the number of evaluated solutions totals 50,000.

  • In [11], results for the Thyroid data set of the Proben1 collection are also documented: linear GP was able to produce classifiers that correctly classify 98.56–99.11% of the given test samples; ANNs were also trained, resulting in classifiers that correctly classify 97.62–98.09% of the test data.

  • According to Banzhaf and Lasarczyk [6], GP-evolved programs consisting of register machine instructions misclassify on average 2.29% of the given test samples, and optimal classifiers are able to correctly classify 98.64% of the test data. Populations of size 100 were used, and in each generation 500 offspring solutions were created and tested; the number of generations was set to 500, so that in each test run 250,000 solutions were evaluated.

  • Furthermore, Gathercole and Ross [19] report classification errors between 0.73% and 1.6% as their best results using tree-based GP (evaluating 1.6–3.2 × 10^8 formulas in each test run), and note that a classification error of 1.52% for neural networks is reported in [42]. In fact, Gathercole and Ross reformulated the Thyroid problem to classifying cases as "class 3" or "not class 3". Furthermore, as documented in [19], it was relatively straightforward for their GP implementation (DSS-GP) to produce function tree expressions which could distinguish between classes "1" and "2" completely correctly on both the training and test sets. "To be fair, in splitting up the problem into two phases (class 3 or not, then class 1 or 2) the GP has been presented with an easier problem [...]. This could be taken in different ways: Splitting up the problem is mildly cheating, or demonstrating the flexibility of the GP approach." (Taken from [19], p. 7.)

GP with strict offspring selection was here applied with populations of 1,000 individuals; on average, the number of generations executed in our GP tests for the Thyroid test studies was 73.9, and on average ∼2,463,635 models were evaluated in each GP test run.

One of the most common and simplest ways to illustrate classification results is to plot the target values and the estimated values in one chart; Fig. 4 shows a graphical representation of one of the results obtained for the Thyroid data set, cross-validation set 9: the blue spots represent the original classification values for each sample, green spots the estimated values for the training samples, and red spots the estimated values for the test samples; the orange lines display the optimal thresholds (with respect to the training samples) that are also applied to the test samples.

Fig. 4 Graphical representation of one of the results obtained for the Thyroid data set, CV-partition 9: comparison of original and estimated class values

Figure 5 shows one of the models produced by GP with offspring selection for 10-fold CV partition 3 of the Thyroid data set. This model was trained on a scaled version of the data set in which all variables were linearly scaled to the interval [0; 100]; the threshold values are therefore also values between 0 and 100, in this case 43.135 and 89.025. The interested reader is also referred to the website of our upcoming book on modern concepts for GAs and GP, to appear at CRC Press at the beginning of 2009; in the materials section of this webpage we report on models for 10-fold CV analysis of the Thyroid data set.

Fig. 5 Graphical representation of one of the models produced by SASEGASA-GP for 10-fold cross validation partition 3 of the Thyroid data set

3.3.5 Results for the Melanoma data set

For the Melanoma data set, finally, no results are available in the literature; therefore, we have tested all machine learning algorithms mentioned previously in order to obtain an objective evaluation of our GP methods.

First, in Table 10 we summarize original versus estimated classifications obtained by applying the classifiers produced by GP with offspring selection; in total, 97.17% of the training and 95.42% of the test samples are classified correctly (with standard deviations 0.87 and 2.13, respectively). These GP tests using the Melanoma data set were done with populations containing 1,000 individuals; the average number of generations executed was 54.4 and the average number of solutions evaluated ∼2,372,629.

Table 10 Confusion matrices for average classification results produced by GP with OS for the Melanoma data set

Classification results obtained using several machine learning algorithms are collected in Table 11; we give training accuracies (wherever available) as well as test accuracies. Support vector machine based training was done with radial as well as polynomial kernel functions; furthermore, we used γ values of 0.001 and 0.01. In the standard GP (SGP) tests we used tournament parent selection (k = 3), 8% mutation, single point crossover and the same structural limitations as in GP with OS; in order to get a fair comparison, the population size was set to 1,000 and the number of generations to 1,000, yielding 1,000,000 evaluations per test run.

Table 11 Comparison of machine learning methods: Average test accuracy of classifiers for the Melanoma data set

As can be seen in Table 11, GP with strict OS performs approximately as well as the support vector machines and neural nets with those settings that are optimal in this test case: GP with OS was able to classify 95.42% of the test cases correctly, SVMs correctly classified 94.89–95.47%, and neural nets (with validation set based stopping) 95.27% of the evaluated test cases. Standard GP as well as kNN, linear regression and standard ANNs clearly perform worse.

Even though we see that the average accuracy recorded for models produced by GP with OS is rather high, the relatively high standard deviation of this method’s performance (2.13, compared to 0.41 recorded for optimal SVMs) has to be seen as a negative aspect of these results.

4 Conclusion

In this paper we have presented an enhanced genetic programming method that was successfully used for investigating machine learning problems in the context of medical classification. The approach works with hybrid formula structures combining logical expressions (as used, for example, in decision trees) and classical mathematical functions; the enhanced selection scheme originally applied successfully to combinatorial optimization problems with genetic algorithms was also applied here, yielding high quality results.

We have investigated GP in the context of learning classifiers for five medical data collections, namely the Wisconsin, Heart, Diabetes and Thyroid data sets taken from the UCI machine learning repository, and the Melanoma data set, a collection of medical measurements recorded while investigating patients potentially suffering from skin cancer.

The results presented in this article lead the authors to believe that applying the techniques presented here in real-world medical data analysis frameworks is worthwhile. As documented in the test results summary, GP based classification using gender specific parent selection as well as strict offspring selection is able to produce results that are, in terms of classification accuracy, at least comparable to the classifiers produced by classical machine learning algorithms frequently used for solving classification problems, namely linear regression, neural networks, neighborhood based classification and support vector machines, as well as to other GP implementations that have been applied to the data sets investigated in our test studies.