1 Introduction

Classification is understood as the act of placing an object into one of a set of categories based on the object's properties. Objects are classified according to a classification scheme, in most cases a hierarchical one, which is also called a taxonomy. Among many other possible applications, examples of taxonomic classification are biological classification (for example, categorizing and grouping organisms into species) and medical classification.

A statistical classification algorithm takes feature representations of objects and maps them to one of a set of predefined classification labels. Such classification algorithms are designed to learn (i.e., to approximate the behavior of) a function which maps a vector of object features to one of several classes; this is done by analyzing a set of input-output examples ("training samples") of the function. The basic steps can be summarized as follows: after training (i.e., after learning a classifier or a set of classifiers) on a set of training samples, one should be able to classify a new, unknown object into one of the predefined classes by analyzing its measured features using the classifier(s) returned by the training algorithm. Since statistical classification algorithms are supposed to "learn" such functions, we are dealing with a specific subarea of machine learning and, more generally, artificial intelligence. Several approaches are nowadays used for solving data mining and, more specifically, classification problems; the most common ones are (as described, for example, in [32]) decision tree learning, instance-based learning, inductive logic programming (using Prolog, for example) and reinforcement learning.

The task of a classification algorithm in the context of patient classification is to find a formal description that models the underlying classification rules, so that the diagnosis can be predicted from a set of other measured features (for example, blood parameter values).

The application of data-based machine learning techniques for learning classifiers is one of the major topics not only in computer science in general but especially in medical data mining. When it comes to automatically classifying patients into those who possibly suffer from a certain disease and those who are healthy, one often faces the problem that special tests (such as blood analyses) are costly in terms of time and money. This is why one has to look for hidden relationships between the symptoms of a certain disease and other features which are much easier to measure.

In this paper we present a classification algorithm based on genetic programming (GP) [25] using a structure identification framework [51] described in Sect. 2.1, in combination with an enhanced, hybrid selection scheme [3] (as explained in Sect. 2.2.2). This GP approach has been implemented as a part of HeuristicLab [47], a framework for prototyping and analyzing optimization techniques for which both generic concepts of evolutionary algorithms and many functions to evaluate and analyze them are available.

The results obtained using this enhanced GP based identification approach are compared to those achieved by methods that are nowadays standard techniques in machine learning, including (but not limited to) linear regression modeling (Lin), (artificial) neural networks (NN), k-nearest-neighbor (kNN) classification, support vector machines, and rule learning. We also compare these results to those achieved using other variants of GP based machine learning.

The comparison of the methods mentioned is done on the basis of five medical benchmark classification problems: the Wisconsin, Heart, Diabetes and Thyroid data sets taken from the UCI machine learning repository, as well as the Melanoma data set prepared by members of the Department of Dermatology of the Medical University Vienna. In Sect. 3 we document the results obtained using the enhanced GP method presented here and compare them to those returned by the classification algorithms mentioned previously.

2 GP-based classification in medical diagnostics

2.1 The GP based structure identification approach

2.1.1 General remarks

Preliminary work for the approach presented in this paper was done in the project "Specification, Design, and Implementation of a Genetic Programming Approach for Identifying Nonlinear Models of Mechatronic Systems", part of a strategic project at the Johannes Kepler University Linz, Austria, focusing on the identification of mechatronic systems. It has been shown (see for instance [5] or [51]) that GP methods are suitable for determining an appropriate mathematical representation of a physical, mechatronic system; in particular, NO_x and soot emissions of Diesel engines have been investigated in this context, yielding satisfying models (especially for NO_x). The methods implemented for this project have been used for developing a GP-based statistical classification algorithm. This algorithm works on a set of training examples with known properties X_1, ..., X_n; one of these properties (X_t) has to represent the membership information with respect to the underlying taxonomy. In the context of medical applications, this target property carries the information about the respective patient's diagnosis (e.g., "diseased" versus "not diseased", or "malign" versus "benign"). On the basis of the training examples, the algorithm tries to evolve (or, as one could also say, to "learn") a solution, that is, a formula representing a function which maps a vector of object features to one of the given classes. In other words, each instance of the classification problem to be solved is interpreted as an instance of an optimization problem; a solution is found by a heuristic optimization algorithm.

The goal of the implemented GP classification process is to produce functions from a database containing the measured results of the experiments to be analyzed. Thus, the GP algorithm works with solution candidates that are tree structure representations of symbolic expressions.

As we will see later in Sect. 3, the computational effort needed for GP based learning of classifiers can be rather high, especially when using large populations or enhanced selection strategies (as described later in Sect. 2.2.2). Still, the following advantages of evolutionary learning of classifiers outweigh these drawbacks:

  • The produced models are made up of well-known function and terminal primitives, so that they can be understood and analyzed by people who are not necessarily experts in machine learning or even informatics.

  • In fact, the functional basis (as described in the next subsection) can also be used in other data based modeling tasks as for example described in [55], [56] and [51].

2.1.2 Solution candidate representation using hybrid tree structures

The selection of the library functions is an important part of any GP modeling process because this library should be able to represent a wide range of systems; Table 1 gives an overview of the function set as well as the terminal nodes used for the classification experiments documented in this paper. As the reader can see in Table 1, mathematical functions and terminal nodes are used as well as Boolean operators for building complex arithmetic expressions. Apart from the obvious requirement that [Then/Else] branches and Boolean conditions have to be attached to [IF] nodes, there are no structural restrictions on the use of Boolean blocks within mathematical expressions. Thus, the concept of decision trees is included in this approach together with the standard structure identification concept that tries to evolve nonlinear mathematical expressions. An example showing the structure tree representation of a combined formula including arithmetic as well as logical functions is displayed in Fig. 1.

Table 1 Set of function and terminal definitions for enhanced GP based classification

Fig. 1 An exemplary hybrid structure tree

Even though our GP approach is not grammar driven, the evolutionary process is still supposed to produce syntactically correct structures; for example, the first child (argument) of an [IF] node should always be a Boolean or logical expression (node). This is why the following syntactic constraints are defined:

  • The first input to a conditional node has to be a Boolean or a logical expression.

  • The inputs to logical functions have to be logical or Boolean expressions.

  • Nodes representing logical or Boolean functions are only allowed to be children of nodes that represent conditional or logical functions.

These constraints are saved as meta-information about the functional basis and are considered by the crossover and mutation operators; these operators have been implemented in such a way that only syntactically correct structure trees are produced.
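To make these constraints concrete, the following minimal Python sketch (not the actual HeuristicLab implementation) shows one way such type information could be stored as meta-information and queried before a subtree is inserted by crossover or mutation; all symbol names and the NODE_CONSTRAINTS table are illustrative assumptions.

    # Minimal sketch: node "types" are stored as meta-information and checked
    # before a subtree is inserted by crossover or mutation.
    ARITHMETIC, BOOLEAN, CONDITIONAL = "arithmetic", "boolean", "conditional"

    # For each function symbol: its own type and the required types of its children.
    NODE_CONSTRAINTS = {
        "+":   (ARITHMETIC,  [ARITHMETIC, ARITHMETIC]),
        "*":   (ARITHMETIC,  [ARITHMETIC, ARITHMETIC]),
        "<":   (BOOLEAN,     [ARITHMETIC, ARITHMETIC]),
        "AND": (BOOLEAN,     [BOOLEAN, BOOLEAN]),
        "IF":  (CONDITIONAL, [BOOLEAN, ARITHMETIC, ARITHMETIC]),  # condition, then, else
    }

    def insertion_allowed(parent_symbol, child_index, child_symbol):
        """Return True if a subtree rooted at child_symbol may become the
        child_index-th argument of parent_symbol."""
        _, required = NODE_CONSTRAINTS[parent_symbol]
        child_type, _ = NODE_CONSTRAINTS.get(child_symbol, (ARITHMETIC, []))
        return child_type == required[child_index]

    # A Boolean subtree is a valid first argument of an [IF] node ...
    assert insertion_allowed("IF", 0, "<")
    # ... but not a valid argument of an arithmetic node.
    assert not insertion_allowed("+", 0, "AND")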

2.1.3 Evaluation of classification models

There are several possible functions that can serve as fitness functions for a GP process. For example, the ratio of misclassifications (using optimal thresholds) or the area under the corresponding receiver operating characteristic (ROC) curve ([61], [10]) could be used. Another function frequently used for quantifying the quality of models is the R² function, which takes into account the sum of squared errors as well as the sum of squared target values; an alternative, the so-called adjusted R² function, is also used in many applications.

In [8] and [9], for example, another classification specific fitness function for GP is described: roughly speaking, four types of results can be observed for the prediction, namely true positive, false positive, true negative and false negative classifications; the fitness function combines two indicators commonly used in the medical domain, namely the sensitivity and the specificity, whose product is returned as the fitness of a classifier. The goal of a GP process therefore is to maximize both sensitivity and specificity at the same time.

Still, we have decided to use a variant of the squared error function for estimating the quality of a classification model. The most important aspect to be considered when using this function is the following: errors of predicted values that are lower than the lowest class value or greater than the greatest class value do not contribute fully quadratically, but only linearly to the fitness value. To be more precise: N given samples with original classifications o_i are divided into n classes with class labels c_1, ..., c_n (where c_1 is the lowest and c_n the greatest class value), which means that each original classification (target) value o_i of sample i equals one of the n class labels c_1, ..., c_n:

$$ \forall_{i \in [1,N]}: o_i \in \{c_1, \ldots, c_n\} $$
(1)

The fitness value F of a classification model producing the estimated classification values e i is then evaluated as follows:

$$ \begin{array}{c} \forall_{i \in [1,N]}:\\ (e_i < c_1) \Rightarrow f_i=(o_i-c_1)^2 + \mid c_1-e_i \mid,\\ (c_1 \leq e_i \leq c_n) \Rightarrow f_i=(e_i-o_i)^2,\\ (e_i > c_n) \Rightarrow f_i=(o_i-c_n)^2 + \mid c_n-e_i \mid \end{array} $$
(2)
$$ F = \frac{1}{N}\sum_{i=1}^{N}{f_i} $$
(3)

The reason for this is that values greater than the greatest class value or below the lowest class value are in any case classified as belonging to the class with the greatest or the lowest class label, respectively; a standard implementation of the squared error function would therefore penalize a formula producing such values more than necessary.
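For illustration, the following NumPy sketch evaluates Eqs. (1)-(3) for given original classifications, estimated values and class labels; it is a minimal re-implementation for clarity, not the HeuristicLab code.

    import numpy as np

    def classification_fitness(o, e, class_values):
        """Mean adapted squared error following Eqs. (1)-(3): predictions outside the
        class-value range contribute only linearly beyond the nearest class value."""
        o = np.asarray(o, dtype=float)
        e = np.asarray(e, dtype=float)
        c_lo, c_hi = min(class_values), max(class_values)
        f = np.empty_like(e)
        below = e < c_lo
        above = e > c_hi
        inside = ~(below | above)
        f[below] = (o[below] - c_lo) ** 2 + np.abs(c_lo - e[below])   # first case of Eq. (2)
        f[inside] = (e[inside] - o[inside]) ** 2                      # second case of Eq. (2)
        f[above] = (o[above] - c_hi) ** 2 + np.abs(c_hi - e[above])   # third case of Eq. (2)
        return f.mean()                                               # Eq. (3)

    # Example: the estimate 3.4 for a sample of class 2 (classes 1, 2, 3)
    # yields f = (2 - 3)^2 + |3 - 3.4| = 1.4
    print(classification_fitness([2], [3.4], class_values=[1, 2, 3]))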

2.1.4 Finding appropriate class thresholds: dynamic range selection

Of course, a mathematical expression alone does not yet define a classification model; thresholds are needed for dividing its output into multiple ranges, each corresponding to exactly one class. In static range selection (SRS; see for example [27] for explanations) these regions are defined before starting the training algorithm, which brings along the difficulty of having to determine appropriate range boundaries a priori. Within the GP based classification framework discussed in this paper we have therefore used dynamic range selection (DRS), which attempts to overcome this problem by evolving the range thresholds along with the classification models: thresholds are chosen in such a way that the sum of class-wise ratios of misclassifications over all given classes is minimized (on the training data, of course).

In detail, let us consider the following: given N (training) samples with original classifications o_i divided into n classes c_1, ..., c_n (with c_1 being the lowest and c_n the greatest class value), models produced by GP can in general be used for calculating estimated values e_i for all N samples. Assuming thresholds T = (t_1, ..., t_{n-1}) (with c_j < t_j < c_{j+1} for j ∈ [1, n−1]), each sample k is classified as ec_k:

$$ e_k < t_1\rightarrow ec_k(T)=c_1 $$
(4)
$$ t_j < e_k < t_{j+1}\rightarrow ec_k(T) = c_{j+1} $$
(5)
$$ e_k > t_{n-1} \rightarrow ec_k(T)=c_n $$
(6)

Thus, assuming a set of thresholds T_m, for each class c_k we get the ratio of correctly classified samples cr_k as

$$ total_k(T_m)=|\{a : (\forall_{x \in a}: o_x = ec_k(T_m)) \}| $$
(7)
$$ correct_k(T_m)=|\{b : (\forall_{x \in b}: o_x = ec_k(T_m) \wedge o_x = c_k) \}| $$
(8)
$$ cr_k(T_m)=\frac{correct_k(T_m)}{total_k(T_m)}. $$
(9)

The sum of ratios of correctly classified samples, depending on the set of thresholds T_m, is calculated as

$$ cr(T_m) = \sum_{i=1}^{n}cr_i(T_m). $$
(10)

Finally, we define the set of thresholds to be applied as the set T_opt such that every other set of thresholds leads to a lower sum of class-wise classification accuracies:

$$ T_d \neq T_{opt} \rightarrow cr(T_d)<cr(T_{opt}) $$
(11)

The optimal set of thresholds T_opt is calculated in the following way: for each pair of adjacent class values (c_i, c_{i+1}) we calculate nt equidistant candidate thresholds t_{tmp(i,d)} (for d ∈ [1, ..., nt]) between c_i and c_{i+1}; thus, \(c_i<t_{tmp(i,d_1)}<t_{tmp(i,d_2)}< c_{i+1}\) and \((t_{tmp(i,d_1+1)}-t_{tmp(i,d_1)})=(t_{tmp(i,d_2+1)}-t_{tmp(i,d_2)})\) for 1 ≤ d_1 < d_2 ≤ nt. All possible combinations of these candidate thresholds are evaluated on the training data; the threshold set T_opt that is optimal for the training samples with respect to cr(T_opt) is fixed and also applied to the test samples.

Please note that this sum of class-wise classification accuracies is not equal to the total ratio of correctly classified samples, which is used later in Sect. 3 for estimating the quality of the classifiers produced by the various algorithms; the total classification accuracy for a set of thresholds acc(T_m) (assuming original values o and estimated values e) is defined as

$$ z(T_m)=|\{x : o_x = ec_x(T_m) \}| $$
(12)
$$ acc(T_m)=\frac{z(T_m)}{N}. $$
(13)
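The following Python sketch illustrates this threshold search and the accuracy definitions; it is an exhaustive, illustrative re-implementation (not the HeuristicLab code), and the class-wise ratio is computed over the samples assigned to each class, which is one possible reading of Eqs. (7)-(9).

    import itertools
    import numpy as np

    def best_thresholds(o, e, class_values, nt=10):
        """Dynamic range selection sketch: try nt equidistant candidate thresholds
        between each pair of adjacent class values and keep the combination that
        maximizes the sum of class-wise accuracies (Eq. 10). Exhaustive search;
        class_values is the sorted list of class labels c_1 < ... < c_n."""
        o, e = np.asarray(o), np.asarray(e)
        cands = [np.linspace(c_lo, c_hi, nt + 2)[1:-1]          # interior points only
                 for c_lo, c_hi in zip(class_values[:-1], class_values[1:])]
        best, best_cr = None, -np.inf
        for T in itertools.product(*cands):
            ec = np.array([class_values[np.searchsorted(T, v)] for v in e])
            cr = 0.0
            for c in class_values:                              # class-wise ratio (Eqs. 7-9)
                predicted = ec == c
                if predicted.any():
                    cr += np.mean(o[predicted] == c)
            if cr > best_cr:
                best, best_cr = T, cr
        return best

    def total_accuracy(o, e, T, class_values):
        """Total classification accuracy (Eqs. 12-13) for a fixed threshold set,
        e.g. for applying training-optimal thresholds to the test partition."""
        ec = np.array([class_values[np.searchsorted(T, v)] for v in e])
        return np.mean(np.asarray(o) == ec)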

2.2 Advanced algorithmic concepts applied within GP-based structure identification

Our first attempt at adjustable selection pressure handling was the so-called Segregative Genetic Algorithm (SEGA) [1], which introduces birth surplus in the sense of a (μ,λ) evolution strategy [7] into the general concept of a GA and uses this additional flexibility primarily for adaptive selection pressure steering in the migration phases of the parallel GA in order to improve the achievable global solution quality. The SASEGASA [3] is an extension of SEGA; its most important feature is its ability to self-adaptively adjust selection pressure in order to achieve progress in solution quality without losing essential genetic information, the loss of which would lead to unwanted premature convergence. The SASEGASA is generic in the sense that all algorithmic extensions are problem-independent, i.e., they do not depend on a certain problem representation and the corresponding operators.

Therefore, we can easily combine the enhanced algorithmic concepts of the SASEGASA with genetic programming (GP). However, compared to the experience gathered when applying SASEGASA in other domains such as combinatorial or real-valued optimization [3], we have observed two major differences:

  • The potential in terms of achievable solution quality in comparison with the standard algorithms seems to be considerably higher in the field of GP than in standard applications of GAs.

  • By far not all algorithmic extensions of SASEGASA are relevant in GP: only some aspects of the rather complex SASEGASA concept matter in the GP domain, which makes the handling, and especially the parameter adjustment, easier and more robust.

Therefore, the discussion in this article will focus on the algorithmic parts of SASEGASA that are really relevant for GP. This section is structured as follows: the first subsection describes the general idea of SASEGASA in a compact way, whereas the second subsection discusses in further detail those parts of SASEGASA that are relevant for the present contribution, and the reasons why.

For a more detailed description of all involved algorithmic aspects the interested reader is referred to the book [2].

2.2.1 A short description of SASEGASA

In principle, the SASEGASA introduces two enhancements to the basic concept of Genetic Algorithms. Firstly, it brings in a variable and self-adaptive selection pressure in order to control the diversity of the evolving population in a goal-oriented way with respect to the objective function. Secondly, it introduces a separation of the population in order to broaden the search process, and joins the subpopulations after their evolution so as to end up with a population containing all the genetic information needed for locating a global optimum.

At the beginning of the evolutionary process the whole population is divided into a certain number of subpopulations. These subpopulations evolve independently of each other until the fitness increase stagnates in all of them because their individuals have become too similar, i.e., until local premature convergence occurs. Thanks to offspring selection, this can be detected exactly when an upper limit of selection pressure is exceeded (cf. Sect. 2.2.2); subsequently, a reunification from n to (n−1) subpopulations is performed by joining an appropriate number of adjacent subpopulation members.

Thus, a certain number of villages (subpopulations) slowly grow together during the evolutionary process, forming bigger towns and eventually ending up as one big city containing the whole population at the end of evolution. This ensures that essential building blocks can evolve independently in different regions of the search space, especially in the early phases of the evolutionary process.

2.2.2 Relevant algorithmic parts of SASEGASA for genetic programming based machine learning

In [3] it has been shown that the aspect of segregation and reunification is highly relevant for systematically improving the achievable global solution quality of combinatorial optimization problems such as the traveling salesman problem (TSP). Still, we have not used this parallel approach for our GP-based modeling studies for the following reasons: on the one hand, it would lead to a considerable increase in runtime; on the other hand, we do not expect any significant increase in solution quality from this concept in GP-based modeling, as the results summarized in [2] indicate that this parallel approach does not remarkably affect the solution quality of optimization problems other than combinatorial ones.

In contrast to this, the possibility of self-adaptive selection pressure steering within SASEGASA, provided by offspring selection (OS), has proven extremely powerful especially for GP problem representations; this has already been shown in [52], [53], [55] and [54], and will also be treated in the discussion of the experimental results of the present article (Sect. 3).

An essential question about the general performance of GAs and GP is whether or not good parents are able to produce children of comparable or even better fitness (the building block hypothesis implicitly relies on this). In natural evolution, this is almost always true. For artificial evolution, and especially for GP, this property is not so easy to guarantee; OS is designed to assure exactly this property.

In order to produce a child for the ongoing evolutionary process, offspring selection does not only consider the fitness of the parents; additionally, the fitness value of the newly produced offspring is compared to the fitness values of its own parents: an offspring is accepted as a candidate for the further evolutionary process if and only if the reproduction operator was able to produce an offspring that outperforms the fitness of its own parents. This strategy guarantees that evolution proceeds mainly with crossover results that were able to mix the properties of their parents in an advantageous way.

As in the case of conventional GAs or GP, offspring are generated by parent selection, crossover, and mutation. In a second (offspring) selection step, a predefined parameter determines the fraction of the next generation's members that have to outperform their own parents; this parameter is called the success ratio (SuccRatio). As long as this ratio is not fulfilled, children are created and only successful offspring definitely become members of the next generation; this is schematically displayed in Fig. 2. As soon as the postulated ratio is reached, the rest of the next generation's members are randomly chosen from the children that did not reach the success criterion. Within this selection model, selection pressure is defined as the ratio of generated candidates to the population size. An upper limit for the selection pressure gives a quite intuitive termination criterion: as soon as it is no longer possible to find a sufficient number of offspring that outperform their parents, the algorithm terminates (in the simple version used here).

Fig. 2 Flowchart for embedding offspring selection into a Genetic Algorithm

Research results obtained within the last two years have led to the conclusion that a simplified version of offspring selection, used in combination with enhanced parent selection [48], shows the best and most robust results in the context of GP applications. Thus, SuccRatio should be set to 1.0, i.e., every offspring for the next generation is forced to pass the success criterion. Furthermore, in GP applications it is beneficial to define that a child is better than its parents if and only if it is better than the better of the two parents. In the context of combinatorial optimization problems, where some intermediate value of the parents' fitness values is used as the threshold for the success criterion, such settings tend to support premature convergence; still, in genetic programming applications these parameter settings quite robustly lead to high quality results.

However, there is one aspect concerning parent selection that has to be considered in this application domain: It is—applying the parameter settings of offspring selection mentioned above—most effective to use different selection methods for the selection of the two parents which are chosen for crossover. In the present context this gender specific selection aspect [48] is implemented most effectively by selecting one parent conventionally by roulette-wheel selection and the other parent randomly.

Altogether, this special variant of adapted sexual selection combined with a simplified version of offspring selection crosses one above-average parent with a randomly selected parent (which brings in diversity) until a whole new population has been filled with children that are better than their better parent. An upper limit for the selection pressure acts as termination criterion in the sense that the algorithm stops if too many trials (|POP| · maxSelPress) have already been made and still no new population consisting of successful offspring could be generated; in other words, the current gene pool no longer allows the generation of enough children that outperform their parents, which is a reasonable termination criterion for an evolutionary algorithm. This special version of SASEGASA (offspring selection) is schematically shown in Fig. 3.

Fig. 3 Flowchart for embedding a simplified version of offspring selection into the GP based machine learning process
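The following Python sketch summarizes this simplified, strict offspring selection step; fitness, crossover, mutate and roulette_select are problem-specific placeholders, and the sketch assumes a fitness value that is to be minimized (such as the adapted squared error of Sect. 2.1.3).

    import random

    def next_generation(pop, fitness, crossover, mutate, roulette_select,
                        max_sel_press=100, mutation_rate=0.1):
        """One generation of strict offspring selection (SuccRatio = comparison
        factor = 1.0): a child is accepted only if it outperforms the better of
        its two parents (lower fitness is better). Returns the new population,
        or None once the limit |POP| * maxSelPress is exceeded (termination)."""
        pop_size, trials, new_pop = len(pop), 0, []
        while len(new_pop) < pop_size:
            if trials >= pop_size * max_sel_press:
                return None                      # gene pool exhausted -> terminate
            # gender-specific parent selection: one roulette-wheel parent, one random parent
            p1 = roulette_select(pop, fitness)
            p2 = random.choice(pop)
            child = crossover(p1, p2)
            if random.random() < mutation_rate:
                child = mutate(child)
            trials += 1
            # strict success criterion: better than the better of the two parents
            if fitness(child) < min(fitness(p1), fitness(p2)):
                new_pop.append(child)
        return new_pop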

3 Empirical studies

3.1 Classification methods used

For comparing GP based classification with other machine learning methods, the following techniques for training classifiers have been examined: genetic programming (enhanced approach as described in Sect. 2), linear modeling, neural networks, the k-nearest-neighbor method, and support vector machines.

3.1.1 GP-based training of classifiers

We have used the following parameter settings for our GP test series:

  • Single population approach; population size: 500–1,000

  • Mutation rate: 10%

  • Maximum formula tree height: 8

  • Parent selection: gender specific (random & roulette)

  • Offspring selection: strict offspring selection (success ratio as well as comparison factor set to 1.0)

  • Termination criteria:

    • Maximum number of generations: 1,000 (never reached; all executions were terminated via the maximum selection pressure criterion)

    • Maximum selection pressure: 100 (as described in Sect. 2.2.2, details can be found in [3] and [4])

  • Function set: all functions as described in Sect. 2.1

In addition to splitting the given data into training and test data, enhanced GP (EGP) based training is implemented in such a way that a part of the given training data is not used for training models but serves as a validation set; in the end, when it comes to returning classifiers, the algorithm returns those models that perform best on the validation data. This approach has been chosen because it is assumed to help cope with over-fitting; it is also applied in other GP based machine learning algorithms, as described for example in [6]. The same was done in our standard GP tests for the Melanoma data set.

3.1.2 Linear modeling

Given a data collection including m input features storing the information about N samples, a linear model is defined by the vector of coefficients θ_1, ..., θ_m. For calculating the vector of modeled values e from the given input value matrix u_{1...m}, these input values are multiplied with the corresponding coefficients and summed:

$$ e=u_{1 \ldots m} * \theta $$
(14)

The coefficient vector can be computed by simply applying matrix division. For conducting the test series documented here we have used the matrix division operator provided by MATLAB:

$$ {\tt theta = InputValues}\backslash {\tt TargetValues;} $$

If a constant additive factor is to be included into the model (i.e., the coefficients vector), this command has to be extended:

$$ \begin{array}{c} {\tt r=size(InputValues,1);}\\ {\tt theta=[InputValues ones(r,1)]} \backslash {\tt TargetValues;} \end{array} $$

Theoretical background of this approach can be found in [29].
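For readers who do not work with MATLAB, the following NumPy sketch is an equivalent least-squares formulation (the function names are ours, chosen only for illustration).

    import numpy as np

    def fit_linear_model(input_values, target_values):
        r"""Least-squares equivalent of the MATLAB call
        theta = [InputValues ones(r,1)] \ TargetValues"""
        r = input_values.shape[0]
        X = np.hstack([input_values, np.ones((r, 1))])     # constant (intercept) column
        theta, *_ = np.linalg.lstsq(X, target_values, rcond=None)
        return theta

    def linear_predict(input_values, theta):
        """e = u * theta (Eq. 14), with the intercept column appended as above."""
        r = input_values.shape[0]
        return np.hstack([input_values, np.ones((r, 1))]) @ theta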

3.1.3 Neural networks

For training neural network (NN) models, three-layer feed-forward neural networks with one output neuron were created using backpropagation as well as the Levenberg-Marquardt training method. Theoretical background and details can be found in [35] (Chap. 11, "Neural Networks"), [31], [26] or [20].

The following two approaches have been applied for training neural networks:

  • On the one hand we have trained networks with 5 neurons in the hidden layer (referred to as “NN1” in the test series documentation in Sect. 3.3) as well as networks with 10 hidden neurons (referred to as “NN2” in the test series documentation); the number of iterations of the training process was set to 100 (in the first variant, “NN1”) and 300 (in the second variant, “NN2”). In the context of analyzing the benchmark problems used here, higher numbers of nodes or iterations are likely to lead to overfitting (i.e., a better fit on the training data, but worse test results).

    The NN training framework used to collect the results reported in this paper is the NNSYSID20 package, a neural network toolbox for MATLAB implementing the Levenberg-Marquardt training method; it has been implemented by Magnus Nørgaard at the Technical University of Denmark [37].

  • On the other hand, the multilayer perceptron training algorithm available in WEKA [57] has also been used for training classifiers. In this case the number of hidden nodes was set to (a + c)/2, where a is the number of attributes (features) and c the number of classes. The number of iterations was not pre-defined; instead, 10% of the data were designated as validation data and, in order to combat the danger of overfitting, the training algorithm was terminated as soon as the error on the validation data worsened in 20 consecutive iterations. This training method, which applies backpropagation learning, is in the following referred to as the "NN3" method.

3.1.4 kNN classification

Unlike data based modeling methods that rely on linear models, neural networks or GP, k-nearest-neighbor classification works without creating any explicit model. During the training phase, the data are simply stored; when it comes to classifying a new, unknown sample x_new, the distance between x_new and all training samples x_train is calculated, and the classification is done on the basis of those k training samples (x_NN) showing the smallest distances from x_new.

The distance between two samples is calculated as follows: first, all features are normalized by subtracting the respective mean values and dividing the result by the respective variable's standard deviation. Given a data matrix x including m features storing the information about N samples, the normalized values x_norm are calculated as

$$ \forall_{i \in [1,m]} \forall_{j \in [1,N]}: x_{norm}(i,j) = \frac{x(i,j)-\frac{1}{N}\sum_{k=1}^{N}{x(i,k)}}{\sigma(x(i,1 \ldots N))} $$
(15)

where the standard deviation σ of a given variable x storing N values is calculated as

$$ \sigma(x) = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}{(x_i-\bar{x})^2}} $$
(16)

with \(\bar{x}\) denoting the mean value of x.

Then, on the basis of the normalized data, the distance between two samples a and b, d(a,b), is calculated as the mean squared variable-wise distance:

$$ d(a,b) = \frac{1}{m}\sum_{i=1}^{m}{(a_{norm}(i)-b_{norm}(i))^2} $$
(17)

where m is, as above, the number of features stored for each sample.

In the context of classification, the number of instances (among the k nearest neighbors) is counted for each given class, and the algorithm predicts the class represented by the highest number of instances in x_NN. In the test series documented in this paper we have applied weighting to kNN classification: the distance between x_new and the neighbors in x_NN is relevant for the classification decision, so that "nearer" samples are weighted more heavily than samples that are "further" away from x_new.

There is a lot of literature on kNN classification; very good explanations and compact overviews (including several possible variants and applications) are given, for example, in [17] and [41].
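A minimal NumPy sketch of this procedure is given below; since the exact weighting scheme is not specified above, inverse-distance weighting is used here as one plausible assumption.

    import numpy as np

    def knn_classify(x_train, y_train, x_new, k=5):
        """Distance-weighted kNN: z-score normalization (Eqs. 15-16), mean squared
        feature-wise distance (Eq. 17), and inverse-distance voting (an assumption)."""
        mu = x_train.mean(axis=0)
        sigma = x_train.std(axis=0, ddof=1)                 # Eq. (16)
        train_n = (x_train - mu) / sigma                    # Eq. (15)
        new_n = (x_new - mu) / sigma
        d = np.mean((train_n - new_n) ** 2, axis=1)         # Eq. (17)
        nearest = np.argsort(d)[:k]                         # indices of the k nearest samples
        votes = {}
        for i in nearest:
            weight = 1.0 / (d[i] + 1e-12)                   # nearer samples weigh more
            votes[y_train[i]] = votes.get(y_train[i], 0.0) + weight
        return max(votes, key=votes.get)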

3.1.5 Support vector machines

Support vector machines (SVMs) are a widely used machine learning approach based on statistical learning theory [46]; examples of the application of SVMs in the medical domain are reported in [34], for instance.

The most important aspect of SVMs is that it is possible to give bounds on the generalization error of the models produced and to select the best model from a set of models following the principle of structural risk minimization [46]. SVMs are designed to calculate hyperplanes that separate the classes of data points from each other and maximize the margin between them. While the basic training algorithm is only able to construct linear separators, so-called kernel functions can be used to calculate scalar products in higher-dimensional spaces; if the kernel functions used are non-linear, then the separating boundaries will be non-linear, too.

In this work we have used the SVM training algorithm described in [38] and [24], in the implementation available in the WEKA machine learning framework [57]; polynomial kernels as well as Gaussian radial basis function kernels were used, with the γ parameter (defining the inverse variance) set to 0.01 and the complexity parameter c set to 10,000.
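As an illustration only (the experiments themselves were run with WEKA's SMO implementation), an equivalent configuration in scikit-learn could look as follows; the polynomial degree is not stated above and is set to 3 here merely as an example.

    from sklearn.svm import SVC

    # RBF and polynomial kernel SVMs with the parameter values stated above
    # (gamma = 0.01, complexity/regularization parameter 10,000); scikit-learn's
    # SVC is used here as a stand-in for the WEKA SMO implementation.
    svm_rbf = SVC(kernel="rbf", gamma=0.01, C=10000.0)
    svm_poly = SVC(kernel="poly", degree=3, C=10000.0)

    # Typical usage: svm_rbf.fit(X_train, y_train); svm_rbf.score(X_test, y_test)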

3.2 Data sets

We will later report on the quality of the classifiers produced for the following five data collections:

  • The Wisconsin data set is part of the UCI Machine Learning Repository; in short, it represents medical measurements which were recorded while investigating patients potentially suffering from breast cancer. The number of features recorded is 9 (all of them continuous numeric ones); the file version we have used contains 683 recorded examples (by now, 699 examples are available, since the database is updated regularly).

  • The Heart Disease data collection contains data recorded by European medical scientists. In the context of this classification problem the main goal is to model the presence of potential heart diseases in patients. The target values of the given samples (again representing patients) are integer valued from 0 (no presence) to 4; experiments published in the machine learning community have concentrated on simply attempting to distinguish presence (target classification values 1,2,3,4; 139 samples in total) from absence (class value 0; 164 samples), thus turning this classification situation into a binary one.

    The original data contain 76 attributes, but, as is stated on the respective UCI webpage, all published experiments refer to a subset of 14 of them. Furthermore, the "Cleveland" data set is the only one that has been used by machine learning researchers; this is why we have used this data partition containing 14 variables and 303 samples, treating it as a binary classification task as described above.

  • The Pima Indians Diabetes data set (in this paper referred to as the “Pima” data set) is a data collection compiled by scientists at the Johns Hopkins University in Baltimore, MD, USA. The samples in this data set (in total 768) represent patients, in particular adult females of Pima Indian heritage. Based on 8 input variables the goal here is to classify samples as diseased or not; 500 samples originally belong to class “0” (representing patients who were tested negative) and 268 to class “1” (representing positive original classifications).

  • The Thyroid data set represents medical measurements which were recorded while investigating patients potentially suffering from hypo- or hyperthyroidism; this data set has also been taken from the UCI Machine Learning Repository. In short, the task is to determine whether a patient is hypothyroid or not. Three classes are formed: euthyroid (the state of having normal thyroid gland function), hyperthyroid (overactive thyroid) and hypothyroid (underactive thyroid); a good classifier has to be significantly better than 92%, simply because 92% of the patients are neither hyper- nor hypothyroid. In total, the data set contains 7,200 samples.

    The samples of the Thyroid data set are not equally distributed to the three given classes; in fact, 166 samples belong to class “1” (“subnormal functioning”), 368 samples are classified as “2” (“hyperfunction”), and the remaining 6,666 samples belong to class “3” (“euthyroid”). Twenty-one attributes (15 binary and 6 continuous ones) are stored in this data set.

  • The Melanoma data set represents medical measurements which were recorded while investigating patients potentially suffering from skin cancer. It contains 1,311 examples for which 30 features have been recorded; each of the 1,311 samples represents a pigmented skin lesion which has to be classified as a melanoma or a nonhazardous nevus. This data set has been provided to us by Prof. Michael Binder from the Department of Dermatology at the Medical University Vienna, Austria.

    A comparison of machine learning methods for the diagnosis of pigmented skin lesions (i.e., detecting skin cancer based on the analysis of visual data) can be found in [15]; in this paper the authors describe the quality of classifiers produced for a comparable data collection using kNN classification, ANNs, decision trees, and SVM. The difference is that in the data collection used in [15] all lesions were separated into three classes (common nevi, dysplastic nevi, or melanoma); here we use data representing lesions that have been classified as benign or malign, i.e. we are facing a binary classification problem.

As an alternative to the UCI repository, the Proben1 benchmark set [39] is a collection of real world problems originally established for neural networks. This benchmark collection contains data sets that have been taken from the UCI repository; still, the data have been modified insofar as they have been encoded for direct neural network use as described in [39]. For each available data set several different file versions (storing different data partitions) have been formed and frequently used for estimating the quality of machine learning algorithms; in the following we will also mention results that have been obtained using these data sets.

3.3 Results

Since the Wisconsin, Heart, Pima and Thyroid data sets are publicly available, the results produced by GP are compared to those that have been published previously for various machine learning methods; the Melanoma data set is not publicly available, therefore we have applied all machine learning approaches mentioned above for training classifiers for this data set ourselves.

All five data sets were investigated via 10-fold cross-validation (CV). This means that each original data set was divided into 10 disjoint sets of (approximately) equal size; thus, 10 different pairs of training (90% of the data) and test data sets (10% of the data) can be formed and used for testing the classification algorithm. For each data collection, each of the resulting 10 pairs of training and test data partitions has been used in 5 independent GP test runs; for the Melanoma data set, all machine learning algorithms mentioned previously have also been applied to all pairs of training and test data, the stochastic algorithms again applied 5 times independently. The results presented in this section have already been partially published in [53].
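The evaluation protocol can be sketched as follows; train_classifier and test_accuracy are placeholders for any of the training methods and the accuracy measure described above.

    import numpy as np

    def cross_validate(X, y, train_classifier, test_accuracy, folds=10, repeats=5, seed=0):
        """10-fold cross-validation with `repeats` independent runs per fold, as used
        for the stochastic algorithms (GP, NNs) in the test series described above."""
        rng = np.random.default_rng(seed)
        parts = np.array_split(rng.permutation(len(y)), folds)
        accuracies = []
        for i in range(folds):                      # each fold serves once as test set
            test_idx = parts[i]
            train_idx = np.concatenate([parts[j] for j in range(folds) if j != i])
            for _ in range(repeats):                # 5 independent runs per partition
                model = train_classifier(X[train_idx], y[train_idx])
                accuracies.append(test_accuracy(model, X[test_idx], y[test_idx]))
        return np.mean(accuracies), np.std(accuracies)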

In the following we will also report on results obtained using other GP variants in medical data analysis, as documented for example in [19], [11], [30], [14], [6], [9], [18], and [27]. The main results of these test series can be summarized in the following way: genetic programming is indeed able to produce classifiers for medical benchmark data sets, and its biggest advantages are its general applicability and the fact that it produces models that are understandable and interpretable. We will show here how the results produced by GP with strict offspring selection compare to those of other GP implementations as well as other frequently used machine learning algorithms.

3.3.1 Results for the Wisconsin data set

Table 2 summarizes the results for the 10-fold cross validation produced by GP with offspring selection as described in Sect. 3.1.1. These figures boil down to the fact that extended GP has in this case been able to produce classifiers that on average correctly classify 97.91% of training samples and 97.53% of test samples.

Table 2 Summary of training and test results for the Wisconsin data set: correct classification rates (average values and standard deviation values) for 10-fold CV partitions, produced by GP with offspring selection

In order to compare the quality of these results to those reported in the literature, Table 3 summarizes test accuracies that have been obtained using 10-fold cross validation. For each method listed we give the references to the respective articles in which these results have been reported. As can be seen, the results summarized in Table 2 outperform those of the other algorithms listed here.

Table 3 Comparison of machine learning methods: average test accuracy of classifiers for the Wisconsin data set

In the case of the Wisconsin data set, the results retrieved using this enhanced GP (EGP) classification algorithm were compared to those obtained using other GP variants. In [9], for example, results for steady state GP with 500 individuals are documented; for the Wisconsin breast cancer data set this GP implementation produced classifiers with 93.5% average classification accuracy (standard deviation: 0.79). In [27] recent results for several classification benchmark problems are documented; the Wisconsin data set was there analyzed using standard GP as well as three other GP based classification variants (POPE-GP, DecMO-GP and DecMOP-GP), and the respective results are also listed in Table 3.

In this context we have to admit that the computational effort of GP for producing these classifiers is higher than the runtime or memory consumed by most other machine learning algorithms; in our GP tests using the Wisconsin data set and populations with 500 individuals, the average number of generations executed was 51.6 and the average number of solutions evaluated ∼1,296,742.

In [11] and [14], for example, results for the Cancer data set of the Proben1 collection are documented; this data set has been created based on the Wisconsin data set and is available as three data partitions. According to Brameier and Banzhaf [11], linear GP was able to produce classifiers that correctly classify 94.28–97.82% of the given test samples (with varying accuracies for the three given problem instances), and De Falco et al. report that GP was used for producing classifiers with 93.92–97.54% classification accuracy. ANNs were also trained; these tests resulted in classifiers that correctly classify 95.23–98.62% of the test data.

3.3.2 Results for the Heart data set

Table 4 summarizes the results for the 10-fold cross validation produced by GP with offspring selection; extended GP with offspring selection (and populations containing 1,000 solutions) has in this case been able to produce classifiers that on average correctly classify 91.50% of training samples and 81.22% of test samples. As already mentioned previously, we have trained binary classifiers, so that the goal is to distinguish between the presence and absence of heart disease.

Table 4 Summary of training and test results for the Heart data set: correct classification rates (average values and standard deviation values) for 10-fold CV partitions, produced by GP with offspring selection

In order to compare the quality of these results to those reported in the literature, Table 5 summarizes test accuracies; in most cases, these results have been obtained using 10-fold or 5-fold cross validation. For each method listed we give the references to the respective articles in which these results have been reported for the binary classification. Negative correlation learning as documented in [13] was able to produce classifiers that correctly classify up to 81.98% of the given test samples; the enhanced GP reported in [18] yields up to 81.3% correct classifications on test samples. Using extended GP with strict offspring selection and populations containing 1,000 solutions we have been able to produce classifiers that on average correctly classify 81.22% of test samples; using populations containing only 100 solutions, the correct classification rate of GP with OS on test data is 80.74%.

Table 5 Comparison of machine learning methods: average test accuracy of classifiers for the Heart data set

3.3.3 Results for the Pima data set

Table 6 summarizes the results for the 10-fold cross validation produced by GP with offspring selection as described in Sect. 3.1.1. These figures boil down to the fact that extended GP (with strict offspring selection and populations containing 1,000 individuals) has in this case been able to produce classifiers that on average correctly classify 78.25% of training samples and 75.53% of test samples.

Table 6 Summary of training and test results for the Pima data set: correct classification rates (average values and standard deviation values) for 10-fold CV partitions, produced by GP with offspring selection

In order to compare the quality of these results to those reported in the literature, Table 7 summarizes test accuracies that have been obtained using cross validation; for each method listed we give the references to the respective articles in which these results have been reported. As we see here, the results summarized in Table 6 have to be considered rather satisfying, as other algorithms (including the GP approaches reported in [30] and [18], for example) show comparable or even slightly weaker results.

Table 7 Comparison of machine learning methods: average test accuracy of classifiers for the Pima data set

Please note that the other GP results have been produced using smaller populations and less computational effort: the GP approach discussed in [18] was applied with populations of 100 solutions, producing 200 offspring over not more than 99 generations, and the results documented in [30] have been produced by GP with 500 individuals evolving over 50 generations. These approaches obviously consume less runtime than ours; using populations containing 1,000 individuals over 59.2 generations on average and selection pressures eventually reaching 100, ∼1,420,935 solutions are evaluated here on average. Using smaller populations of only 100 solutions, GP with strict OS produces classifiers that correctly classify 74.87% (±4.95) of the given test samples.

In [11] and [14], results for the Diabetes data set of the Proben1 collection are also documented; this data set has been created on the basis of the Pima data set of the UCI repository. According to Brameier and Banzhaf [11], linear GP was able to produce classifiers that correctly classify 72.15–76.91% of the given test samples, and De Falco et al. report that GP was used for producing classifiers with 69.64–75.16% classification accuracy. Artificial neural networks were also trained, resulting in classifiers that correctly classify 73.58–77.41% of the test data.

3.3.4 Results for the Thyroid data set

In this section we report the results achieved for the Thyroid data set. Table 8 summarizes the results for the 10-fold cross validation produced by GP with offspring selection, again as described in Sect. 3.1.1. For each class we characterize the classification accuracy on training and test data: for each partition we state average as well as standard deviation values. These figures boil down to the fact that extended GP has in this case been able to produce classifiers that on average correctly classify 99.10% of training samples and 98.76% of test samples, the total standard deviation values being 0.73 and 0.92, respectively.

Table 8 Summary of training and test results for the Thyroid data set: correct classification rates (average values and standard deviation values) for 10-fold CV partitions, produced by GP with offspring selection

In order to compare the quality of these results to those reported in the literature, Table 9 summarizes a selection of test accuracies that have been obtained using 10-fold cross validation; again, for each method listed we give the references to the respective articles in which these results have been reported. Obviously the results summarized in Table 8 have to be considered quite good, though not perfect, as they are outperformed by the results reported in [49] and [16].

Table 9 Comparison of machine learning methods: average test accuracy of classifiers for the Thyroid data set

GP has also been repeatedly applied to the Thyroid problem; some of the published results are the following:

  • In [28] (Table 8), results produced by a Pareto-coevolutionary GP classifier system for the Thyroid problem are reported; in Table 9 they are listed as the "PGPC" results. In fact, these figures are not mean accuracy values but rather median values, which is why they are not fully comparable to the other results stated here. For obtaining these results, two populations consisting of 10 so-called learners and 50 so-called testers were trained over 50,000 epochs; details regarding this GP approach and its evaluation effort can be found in [28].

  • Loveard and Ciesielski reported that classifiers for the Thyroid problem could be produced using GP with test accuracies ranging from 94.9% to 98.2% (depending on the range selection strategy used); details are given in [30]. Populations containing 1,000 individuals were trained over 50 generations, so that the number of evaluated solutions totals 50,000.

  • In [11], results for the Thyroid data set of the Proben1 collection are also documented: linear GP was able to produce classifiers that correctly classify 98.56–99.11% of the given test samples; ANNs were also trained, resulting in classifiers that correctly classify 97.62–98.09% of the test data.

  • According to Banzhaf and Lasarczyk [6], GP-evolved programs consisting of register machine instructions misclassify on average 2.29% of the given test samples, and optimal classifiers are able to correctly classify 98.64% of the test data. Populations of size 100 were used, and in each generation 500 offspring solutions were created and tested; the number of generations was set to 500, so that in each test run 250,000 solutions were evaluated.

  • Furthermore, Gathercole and Ross [19] report classification errors between 0.73% and 1.6% as their best results using tree-based GP (evaluating 1.6–3.2 × 10^8 formulas in each test run), and note that a classification error of 1.52% for neural networks is reported in [42]. In fact, Gathercole and Ross reformulated the Thyroid problem to classifying cases as "class 3" or "not class 3". Furthermore, as documented in [19], it was relatively straightforward for their GP implementation (DSS-GP) to produce function tree expressions which could distinguish between classes "1" and "2" completely correctly on both the training and test sets. "To be fair, in splitting up the problem into two phases (class 3 or not, then class 1 or 2) the GP has been presented with an easier problem [...]. This could be taken in different ways: Splitting up the problem is mildly cheating, or demonstrating the flexibility of the GP approach." (Taken from [19], p. 7.)

GP with strict offspring selection was here applied with populations of 1,000 individuals; on average, the number of generations executed in our GP tests for the Thyroid test studies was 73.9, and on average ∼2,463,635 models were evaluated in each GP test run.

One of the most common and simplest ways to illustrate classification results is to plot the target values and the estimated values in one chart; Fig. 4 shows a graphical representation of one of the results obtained for the Thyroid data set, cross-validation set 9: the blue spots represent the original classification values for each sample, green spots the estimated values for the training samples, and red spots the estimated values for the test samples; the orange lines display the optimal thresholds (with respect to the training samples) that are also applied to the test samples.

Fig. 4 Graphical representation of one of the results obtained for the Thyroid data set, CV-partition 9: comparison of original and estimated class values

Figure 5 shows one of the models produced by GP with offspring selection for 10-fold CV partition 3 of the Thyroid data set. This model was trained on a scaled version of the data set in which all variables were linearly scaled to the interval [0; 100]; the threshold values are therefore also values between 0 and 100, in this case 43.135 and 89.025. The interested reader is also referred to the website of our upcoming book on modern concepts for GAs and GP, to appear at CRC Press at the beginning of 2009; in the materials section of this webpage we report on models for 10-fold CV analysis of the Thyroid data set.

Fig. 5 Graphical representation of one of the models produced by SASEGASA-GP for 10-fold cross validation partition 3 of the Thyroid data set

3.3.5 Results for the Melanoma data set

For the Melanoma data set, finally, no results are available in the literature; therefore, we have tested all machine learning algorithms mentioned previously in order to obtain an objective evaluation of our GP methods.

First, in Table 10 we summarize original versus estimated classifications obtained by applying the classifiers produced by GP with offspring selection; in total, 97.17% of the training and 95.42% of the test samples are classified correctly (with standard deviations 0.87 and 2.13, respectively). These GP tests using the Melanoma data set were done with populations containing 1,000 individuals; the average number of generations executed was 54.4 and the average number of solutions evaluated ∼2,372,629.

Table 10 Confusion matrices for average classification results produced by GP with OS for the Melanoma data set

Classification results obtained using several machine learning algorithms are collected in Table 11; we give training accuracies (wherever available) as well as test accuracies. Support vector machine based training was done with radial as well as polynomial kernel functions; furthermore, we used γ values of 0.001 and 0.01. In the standard GP (SGP) tests we used tournament parent selection (k = 3), 8% mutation, single point crossover and the same structural limitations as in GP with OS; in order to get a fair comparison, the population size was set to 1,000 and the number of generations to 1,000, yielding 1,000,000 evaluations per test run.

Table 11 Comparison of machine learning methods: Average test accuracy of classifiers for the Melanoma data set

As can be seen in Table 11, GP with strict OS performs approximately as well as the support vector machines and neural nets with those settings that are optimal in this test case: GP with OS was able to classify 95.42% of the test cases correctly, SVMs correctly classified 94.89–95.47%, and neural nets (with validation set based stopping) 95.27% of the evaluated test cases. Standard GP as well as kNN, linear regression and standard ANNs clearly perform worse.

Even though we see that the average accuracy recorded for models produced by GP with OS is rather high, the relatively high standard deviation of this method’s performance (2.13, compared to 0.41 recorded for optimal SVMs) has to be seen as a negative aspect of these results.

4 Conclusion

In this paper we have presented an enhanced genetic programming method that was successfully used for investigating machine learning problems in the context of medical classification. The approach works with hybrid formula structures combining logical expressions (as used, for example, in decision trees) and classical mathematical functions; the enhanced selection scheme originally applied successfully to combinatorial optimization problems with genetic algorithms was also applied here, yielding high quality results.

We have investigated GP in the context of learning classifiers for five medical data collections, namely the Wisconsin, Heart, Diabetes and Thyroid data sets taken from the UCI machine learning repository, and the Melanoma data set, a collection of medical measurements recorded while investigating patients potentially suffering from skin cancer.

The results presented in this article lead the authors to believe that applying the techniques presented here in real-world medical data analysis frameworks is worthwhile. As documented in the test results summary, GP based classification using gender specific parent selection as well as strict offspring selection is able to produce results that are, in terms of classification accuracy, at least comparable to the classifiers produced by classical machine learning algorithms frequently used for solving classification problems, namely linear regression, neural networks, neighborhood based classification and support vector machines, as well as to other GP implementations that have been applied to the data sets investigated in our test studies.