Model selection methodology in supervised learning with evolutionary computation
Introduction
Many successful approaches to discovering new knowledge about the structure and function of biological systems rely on supervised learning. This involves the formation of empirical models using data whose meaning is known; the models are then used to interpret new data of similar origin. Evolutionary computation (EC), both supervised and unsupervised, has been particularly successful in many areas of bioinformatics, including sequence alignment, structure prediction, drug design, gene expression analysis, proteomics and metabolomics (Fogel and Corne, 2003).
Supervised methods aim to form a model that relates the Y data, the known value or values, to the X variables that represent the corresponding measurements or observed attributes. A successful model, once validated, can be used to predict unknown Y values from new sets of X measurements. The nature of the model can also reveal important information about the data and the problem domain. Noise is always present in datasets, originating from instrument variability, sample variability, human error and so on. In addition, such datasets often contain relatively few ‘data objects’, or individual samples of X and Y variables, and the number of X variables per object is often much larger than the number of objects. The probability of modelling chance relationships between X and Y data is therefore high.
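The risk of chance relationships when variables greatly outnumber objects can be demonstrated with a small numerical sketch (the dimensions and the use of a least-squares fit are illustrative assumptions, not taken from the paper): a model fitted to purely random data achieves a perfect fit on its training objects yet has no predictive power on new data from the same process.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 20 objects, 500 random X variables,
# and a Y that is, by construction, unrelated to any of them.
n_objects, n_vars = 20, 500
X = rng.normal(size=(n_objects, n_vars))
y = rng.normal(size=n_objects)

# Least-squares fit; with n_vars > n_objects this returns the
# minimum-norm solution that reproduces the training Y exactly.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def r2(X, y, coef):
    """Coefficient of determination of the linear model on (X, y)."""
    resid = y - X @ coef
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

# New data from the same (entirely unrelated) process.
X_new = rng.normal(size=(n_objects, n_vars))
y_new = rng.normal(size=n_objects)

print(f"training R^2: {r2(X, y, coef):.2f}")      # ~1.00: a perfect chance fit
print(f"unseen R^2:   {r2(X_new, y_new, coef):.2f}")  # no real predictive power
```

The training fit is perfect purely because the system is underdetermined, which is exactly why performance on the data used to form a model says little about its real worth.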
There are two main reasons for attempting to form models that capture the characteristics embodied in a dataset:
- to extract knowledge from the data by discovering explicit relationships between X and Y, or
- to form a model that can be used to interpret new X data by predicting the corresponding Y.
Even a single run of an evolutionary algorithm (EA) produces numerous candidate models. To produce a model that is applicable in a meaningful way to similar data, wherein the noise is of course different, and in which the discovery of chance relationships needs to be minimised as far as possible, the model needs to be properly trained and validated. It is also desirable to obtain an estimate of the likely performance of the model when used to make predictions from ‘new’ data. This is vital in drawing scientific conclusions from any results that are subsequently derived from use of that model or from any knowledge that is derived from the structure of the model itself.
The EAs used as illustrations in this paper are Genetic Programming (GP) (Cramer, 1985; Koza, 1992) and Genetic Algorithms (GAs) (Holland, 1975), but the data modelling principles discussed apply equally to supervised learning using other EA variants, such as Evolution Strategies (ES) (Schwefel, 1981) or Evolutionary Programming (EP) (Fogel, 1966), and indeed to non-evolutionary methods.
The subsequent sections of this paper begin by considering important issues in model selection and generalization in relation to various modelling methods including those based on EAs. A simple but reasonably effective heuristic is then introduced that favours the selection of general models and is applicable to EA-based methods. Finally, successful application of the approach is illustrated by applying it to a well-known dataset for classification of disease from gene expression data.
Section snippets
Model selection and generalization
Supervised methods commonly used for interpreting analytical data in biological applications include partial least squares regression (PLS) (Martens and Naes, 1989), artificial neural networks (ANNs) (e.g. Bishop, 1995), GP (Cramer, 1985; Koza, 1992), numerous variants of GAs (Holland, 1975) and other EAs. All of these methods are capable of producing a number of different models on the same dataset. PLS is a deterministic method and we only have to decide how many latent variables, or factors, to
Model selection criteria
In many examples in the literature, the model that minimises the test set error is taken as the point at which training is optimal. However, it is quite common for the training set error to be significantly lower than the test set error at this point. Fig. 1 shows such an example, taken from a neural network being trained on the noisy spectral regression data described above. The divergence of the curves demonstrates that as training progresses beyond about 50 epochs the
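A criterion of this kind can be sketched in a few lines. The exact penalty form below is an assumption for illustration, not the paper's heuristic: among candidate models, prefer one whose test error is low and close to its training error, rather than the bare minimum of the test error curve.

```python
# Hedged sketch: select among candidate models by penalising the
# divergence between training and test error, since a large gap
# suggests the model has begun to fit noise in the training data.
def select_model(candidates):
    """candidates: list of (label, train_error, test_error) tuples."""
    def score(entry):
        _, train_err, test_err = entry
        # Illustrative criterion: test error plus the train/test gap.
        return test_err + abs(test_err - train_err)
    return min(candidates, key=score)

# Usage: three hypothetical snapshots taken during training.
snapshots = [
    ("epoch 30", 0.20, 0.22),  # undertrained: both errors still high
    ("epoch 50", 0.12, 0.14),  # low error, small train/test gap
    ("epoch 90", 0.03, 0.13),  # marginally lower test error, large gap
]
best = select_model(snapshots)
print(best[0])  # "epoch 50" under this criterion
```

Under the bare minimum-test-error rule the epoch-90 model would win despite its clear divergence; the gap-penalised criterion instead favours the more general epoch-50 model.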
Application to gene expression data
The aim of this study was to illustrate the effectiveness of the heuristic introduced above (Section 3) on a well-known gene expression dataset (Golub et al., 1999). This dataset was also one of the datasets used in the CAMDA Challenge in 2000 (CAMDA, 2000). In the work presented here, modelling used a GP-based discriminant classifier running on an ordinary desktop PC and model selection was based on the use of three data partitions.
The dataset is presented as three files (along with other
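The three-partition protocol referred to above can be sketched as follows (the split fractions and function names are illustrative assumptions): the training set drives fitness evaluation inside the EA, the test set is used only for model selection, and the validation set is touched once, to estimate performance on genuinely unseen data.

```python
import random

def three_way_split(objects, seed=0, fractions=(0.5, 0.25, 0.25)):
    """Randomly partition objects into train / test / validation sets.

    Illustrative split fractions; the validation set absorbs any
    rounding remainder so every object is assigned exactly once.
    """
    rng = random.Random(seed)
    shuffled = objects[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(fractions[0] * n)
    n_test = int(fractions[1] * n)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation

# Usage: 72 sample indices, matching the size of the Golub et al. dataset.
train, test, validation = three_way_split(list(range(72)))
print(len(train), len(test), len(validation))  # 36 18 18
```

Keeping the third partition entirely out of both training and selection is what allows its error to serve as an honest estimate of performance on new data.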
Conclusions
This paper has aimed to clarify some of the issues concerning model selection and generality when forming predictive models using evolutionary methods. The generality of a model—its estimated performance on unseen data—is a far more important attribute than its performance on data that was used in forming it or selecting it. A simple heuristic has been presented that can be used with EA-based methods and whose application has been illustrated on a well-known gene-expression dataset. The
References (30)
- et al., 2001. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am. J. Hum. Genet.
- Bishop, C., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford,...
- Breiman, L., 1994. Bagging predictors. Technical Report 421, Department of Statistics, University of California,...
- CAMDA’00, 2000. CAMDA’00: Contest Datasets....
- Cavaretta, M.J., Chellapilla, K., 1999. Data mining using genetic programming: the implications of parsimony on...
- Cramer, N.L., 1985. A representation for the adaptive generation of simple sequential programs. In: Grefenstette, J.J....
- Efron, B., Tibshirani, R., 1993. An Introduction to the Bootstrap. Chapman & Hall,...
- Eiben, A., Jelasity, M., 2002. A critical note on experimental research methodology in EC. In: IEEE Congress on...
- Fogel, L., 1966. Artificial Intelligence Through Simulated Evolution. Wiley, New...
- Fogel, G., Corne, D. (Eds.), 2003. Evolutionary Computation in Bioinformatics. Morgan Kauffmann, San Francisco,...