Biosystems

Volume 72, Issues 1–2, November 2003, Pages 187–196

Model selection methodology in supervised learning with evolutionary computation

https://doi.org/10.1016/S0303-2647(03)00143-6

Abstract

The expressive power, powerful search capability, and the explicit nature of the resulting models make evolutionary methods very attractive for supervised learning applications in bioinformatics. However, their characteristics also make them highly susceptible to overtraining or to discovering chance relationships in the data. Identification of appropriate criteria for terminating evolution and for selecting an appropriately validated model is vital. Some approaches that are commonly applied to other modelling methods are not necessarily applicable in a straightforward manner to evolutionary methods. An approach to model selection is presented that is not unduly computationally intensive. To illustrate the issues and the technique two bioinformatic datasets are used, one relating to metabolite determination and the other to disease prediction from gene expression data.

Introduction

Many successful approaches to discovering new knowledge about the structure and function of biological systems rely on supervised learning. This involves the formation of empirical models using data whose meaning is known. The models are then used to interpret new data of similar origin. Evolutionary computation (EC), supervised and unsupervised, has been particularly successful in many areas of bioinformatics. These include sequence alignment, structure prediction, drug design, gene expression analysis, proteomics, metabolomics and more (Fogel and Corne, 2003).

Supervised methods aim to form a model that relates the Y data, the known value or values, to the X variables that represent the corresponding measurements or observed attributes. A successful model, once validated, can be used to predict unknown Y values from new sets of X measurements. The nature of the model can also reveal important information about the data and the problem domain. Noise is always present in such datasets, originating from instrument variability, sample variability, human error, and so on. In addition, the datasets often contain relatively few ‘data objects’, or individual samples of the X and Y variables, and the number of X variables per object is often much larger than the number of objects. The probability of modelling chance relationships between the X and Y data is therefore high.
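As a concrete illustration of this risk, the short sketch below (an illustrative sketch, not taken from the paper) draws a purely random X block with far more variables than objects and a purely random Y, yet still finds an individual X variable that correlates strongly with Y by chance.

```python
import numpy as np

# Few data objects, many X variables per object: the regime described above.
rng = np.random.default_rng(0)
n_objects, n_variables = 30, 1000
X = rng.normal(size=(n_objects, n_variables))  # purely random "measurements"
y = rng.normal(size=n_objects)                 # purely random "known values"

# Correlation of every X variable with Y; any structure found here is coincidental.
correlations = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_variables)])
print(f"strongest chance correlation |r| = {np.abs(correlations).max():.2f}")
# Typically well above 0.5, even though X and Y are unrelated by construction.
```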

There are two main reasons for attempting to form models that capture the characteristics embodied in a dataset:

  • to extract knowledge from the data by discovering explicit relationships between X and Y, or …

  • to form a model that can be used to interpret new X data by predicting the corresponding Y.

Even a single run of an evolutionary algorithm (EA) produces numerous candidate models. For a model to apply meaningfully to similar data, in which the noise is necessarily different and in which the discovery of chance relationships must be minimised as far as possible, it needs to be properly trained and validated. It is also desirable to obtain an estimate of the model's likely performance when it is used to make predictions from ‘new’ data. Such an estimate is vital when drawing scientific conclusions from any results subsequently derived from the model, or from any knowledge derived from the structure of the model itself.
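The sketch below gives a minimal illustration of that requirement. The partitioning scheme and the ridge-regression candidates are assumptions made purely for illustration (they stand in for the many candidate models an evolutionary run would produce), not the paper's implementation: models are fitted on one partition, selected on a second, and the generalisation estimate is taken from a third partition that played no part in either step.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def three_way_split(X, y, seed=0):
    """Training data fits the models, the test partition selects among them,
    and the validation partition is used once only, to estimate generalisation."""
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=seed)
    X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=0.5, random_state=seed)
    return (X_train, y_train), (X_test, y_test), (X_val, y_val)

# Synthetic regression data; the ridge models stand in for candidate models from an EA run.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 50))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=120)
train, test, val = three_way_split(X, y)

candidates = [Ridge(alpha=a).fit(*train) for a in (0.01, 0.1, 1.0, 10.0)]
test_errors = [mean_squared_error(test[1], m.predict(test[0])) for m in candidates]
best = candidates[int(np.argmin(test_errors))]           # selection uses the test partition only
print("estimated generalisation error (MSE):",
      mean_squared_error(val[1], best.predict(val[0])))  # reported on the untouched partition
```

Because the selection step itself consults the test partition, the test error of the chosen model is optimistically biased; only the third, untouched partition gives an honest estimate of performance on new data.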

The EAs used as illustrations in this paper are Genetic Programming (GP) (Cramer, 1985; Koza, 1992) and Genetic Algorithms (GA) (Holland, 1975), but the data modelling principles discussed also apply to supervised learning using other EA variants, such as Evolutionary Strategies (ES) (Schwefel, 1981) or Evolutionary Programming (EP) (Fogel, 1966), and indeed to non-evolutionary methods.

The subsequent sections of this paper begin by considering important issues in model selection and generalization in relation to various modelling methods, including those based on EAs. A simple but reasonably effective heuristic is then introduced that favours the selection of general models and is applicable to EA-based methods (a sketch of a rule in this spirit is given below). Finally, the approach is illustrated by applying it to a well-known dataset for the classification of disease from gene expression data.
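The heuristic itself is introduced in Section 3; as a hedged illustration of the general idea, the sketch below shows one plausible rule in a similar spirit (an assumption, not necessarily the published heuristic): penalise the divergence between training and test error, so that a general model is preferred over one that merely minimises the test error after overtraining has set in.

```python
def select_generation(train_errors, test_errors, gap_weight=1.0):
    """Pick the generation (or epoch) minimising the test error plus a penalty
    on the train/test divergence that signals overtraining."""
    scores = [te + gap_weight * abs(te - tr)
              for tr, te in zip(train_errors, test_errors)]
    return min(range(len(scores)), key=scores.__getitem__)

# Example error curves: the test error keeps creeping down while the gap to the
# training error widens, the classic signature of overtraining.
train_errs = [0.90, 0.60, 0.40, 0.25, 0.15, 0.08, 0.04]
test_errs  = [0.95, 0.70, 0.55, 0.50, 0.48, 0.47, 0.46]
print(select_generation(train_errs, test_errs))  # prints 2, earlier than the raw test-error minimum at 6
```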

Section snippets

Model selection and generalization

Supervised methods commonly used for interpreting analytical data in biological applications include partial least squares regression (PLS) (Martens and Naes, 1989), artificial neural networks (ANN) (e.g. Bishop, 1995), GP (Cramer, 1985; Koza, 1992), numerous variants of GAs (Holland, 1975) and other EAs. All of these methods are capable of producing a number of different models on the same dataset. PLS is a deterministic method and we only have to decide how many latent variables, or factors, to

Model selection criteria

In many examples in the literature, the model that minimises the test set error is taken as the point at which training is optimal. However, it is quite common for the training set error to be significantly lower than the test set error at this point. Fig. 1 shows such an example, in this case taken from a neural network being trained on the noisy spectral regression data described above. The divergence of the curves demonstrates that as training progresses beyond about 50 epochs the

Application to gene expression data

The aim of this study was to illustrate the effectiveness of the heuristic introduced above (Section 3) on a well-known gene expression dataset (Golub et al., 1999). This dataset was also one of the datasets used in the CAMDA Challenge in 2000 (CAMDA, 2000). In the work presented here, modelling used a GP-based discriminant classifier running on an ordinary desktop PC and model selection was based on the use of three data partitions.
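As a hedged, much-simplified stand-in for this workflow (the single-gene threshold rules, data shapes and partition sizes are assumptions for illustration; they are not the paper's GP classifier or the Golub data), the sketch below fits many candidate ‘discriminants’ on a training partition, selects one on a second partition, and reports its accuracy on a third, untouched partition.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes = 500

def make_partition(n_samples):
    """Toy expression-like data in which gene 42 (arbitrarily) carries the class signal."""
    X = rng.normal(size=(n_samples, n_genes))
    y = (X[:, 42] + 0.3 * rng.normal(size=n_samples) > 0).astype(int)
    return X, y

train, test, val = make_partition(40), make_partition(34), make_partition(34)

def accuracy(rule, partition):
    gene, threshold = rule
    X, y = partition
    return np.mean((X[:, gene] > threshold).astype(int) == y)

# Stand-in for an evolutionary search: generate one threshold rule per gene,
# "fit" each threshold on the training partition, select the rule that scores
# best on the test partition, then assess it once on the third partition.
candidates = [(g, np.median(train[0][:, g])) for g in range(n_genes)]
best_rule = max(candidates, key=lambda rule: accuracy(rule, test))
print("validation accuracy of the selected rule:", accuracy(best_rule, val))
```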

The dataset is presented as three files (along with other

Conclusions

This paper has aimed to clarify some of the issues concerning model selection and generality when forming predictive models using evolutionary methods. The generality of a model—its estimated performance on unseen data—is a far more important attribute than its performance on data that was used in forming it or selecting it. A simple heuristic has been presented that can be used with EA-based methods and whose application has been illustrated on a well-known gene-expression dataset. The

References (30)

  • Ritchie, M.D., et al., 2001. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am. J. Hum. Genet.
  • Bishop, C., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford,...
  • Breiman, L., 1994. Bagging predictors. Technical Report 421, Department of Statistics, University of California,...
  • CAMDA’00, 2000. CAMDA’00: Contest Datasets....
  • Cavaretta, M.J., Chellapilla, K., 1999. Data mining using genetic programming: the implications of parsimony on...
  • Cramer, N.L., 1985. A representation for the adaptive generation of simple sequential programs. In: Grefenstette, J.J....
  • Efron, B., Tibshirani, R., 1993. An Introduction to the Bootstrap. Chapman & Hall,...
  • Eiben, A., Jelasity, M., 2002. A critical note on experimental research methodology in EC. In: IEEE Congress on...
  • Fogel, L., 1966. Artificial Intelligence Through Simulated Evolution. Wiley, New...
  • Fogel, G., Corne, D. (Eds.), 2003. Evolutionary Computation in Bioinformatics. Morgan Kaufmann, San Francisco,...
  • Freitas, A., 2002. Data Mining and Knowledge Discovery with Evolutionary Algorithms. Springer-Verlag,...
  • Freund, Y., Schapire, R., 1996. Experiments with a new boosting algorithm. In: Machine Learning: Proceedings of 13th...
  • Golub, T., 1999. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression....
  • Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R.,...
  • Hand, D., Mannila, H., Smyth, P., 2001. Data Mining. MIT Press, Cambridge,...