Investigation of the importance of the genotype–phenotype mapping in information retrieval

https://doi.org/10.1016/S0167-739X(02)00108-5

Abstract

An investigation of the role of the genotype–phenotype mapping (G-Pm) is presented for an evolutionary optimization task. A simple genetic algorithm (SGA) combined with a mapping creates a new mapping genetic algorithm (MGA), which is used to optimize a Boolean decision tree for an information retrieval task, the tree being created via a relatively complex mapping. Its performance is contrasted with that of a genetic programming algorithm, British Telecom Genetic Programming (BTGP), which operates directly on phenotypic trees. The mapping is observed to play an important role in the time evolution of the system, allowing the MGA to achieve better results than the BTGP. We conclude that an appropriate G-Pm can improve the evolvability of evolutionary algorithms.

Introduction

Evolutionary methods have been the focus of much attention in computer science, principally because of their potential for performing partially directed search in very large combinatorial spaces. Evolutionary Algorithms (EAs) have the potential to balance exploration of the search space with exploitation of useful features of that search space. However, the correct balance is difficult to achieve, and this places limits on what can be predicted about an algorithm's behavior. In addition, EAs are often implemented in system-specific ways, making it very difficult to predict and evaluate performance across different implementations.

This paper focuses upon the comparison between algorithms for information filtering, one of the tasks at which EAs have been found particularly effective. Such algorithms deal with the situation where a relevant subset of documents or records must be isolated from a larger pool. This paper considers two such algorithms which were developed for the task of information filtering in a telecommunications context. The British Telecom Genetic Programming (BTGP) is a genetic programming system whose evolved programs execute Boolean searches through keywords [2]. The MGA (mapping genetic algorithm) is a genetic algorithm which also uses a Boolean tree representation, but reaches it through a relatively complicated mapping between genotype and phenotype.
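To make the shared phenotypic representation concrete, the following sketch shows one way a Boolean decision tree over keyword-presence tests might be represented and evaluated. The node encoding and function names here are our illustrative assumptions, not the internals of the BTGP or MGA systems.

```python
# Illustrative sketch of a Boolean decision-tree phenotype for keyword
# filtering: internal nodes are AND/OR/NOT, leaves test keyword presence.
# Representation and names are assumptions, not the paper's actual code.

def make_leaf(keyword):
    return ("KEY", keyword)

def make_node(op, *children):
    return (op,) + children

def evaluate(tree, keywords):
    """Return True if the document (given as a set of keywords) matches."""
    op = tree[0]
    if op == "KEY":
        return tree[1] in keywords
    if op == "NOT":
        return not evaluate(tree[1], keywords)
    if op == "AND":
        return all(evaluate(c, keywords) for c in tree[1:])
    if op == "OR":
        return any(evaluate(c, keywords) for c in tree[1:])
    raise ValueError(f"unknown operator {op!r}")

# Example tree: (network AND evolution) OR (NOT spam)
query = make_node("OR",
                  make_node("AND", make_leaf("network"), make_leaf("evolution")),
                  make_node("NOT", make_leaf("spam")))
```

A document containing both "network" and "evolution" matches via the left branch; a document containing only "spam" fails both branches and is filtered out.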

There has recently been a growing interest in the nature of the genotype–phenotype mapping (hereafter G-Pm) in evolutionary systems, and in particular in the effect that it might have on evolutionary dynamics. This mapping generally introduces a non-linear function between the representation of a solution (chromosome) and the physical realization of the solution upon which fitness is calculated. For instance, investigations of the mapping between sequences (genotype) and secondary structures (phenotype) for ribonucleic acid (RNA) have shown that the mapping possesses a number of properties which strongly influence the search dynamics for a particular structure [4]. In particular, the mapping is redundant, with many different genomes producing the same phenotype. There also exist extended neutral networks in genotype space which connect (by point mutations) genomes mapping to the same phenotype. This allows for continued exploration of the genotypic space during the evolutionary process, in between periods of adaptive change, and helps prevent the system from becoming trapped in local minima. The importance of neutral mutation in biological systems has also been studied by Kimura [10].
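A toy example can make the notions of redundancy and neutral mutation concrete. The mapping below is purely illustrative; it is neither the RNA mapping of [4] nor the MGA's mapping. Because several genotype codons map to the same phenotype symbol, some point mutations leave the phenotype unchanged, and chains of such mutations form neutral paths through genotype space.

```python
# Toy redundant genotype-phenotype mapping (illustrative only): a binary
# genotype is read in non-overlapping pairs, and each pair maps to one
# phenotype symbol. Because '00' and '01' both map to 'A', flipping the
# second bit of an '00' pair is a neutral mutation.
CODE = {"00": "A", "01": "A", "10": "B", "11": "C"}

def develop(genotype):
    """Map a binary genotype string to its phenotype string."""
    return "".join(CODE[genotype[i:i + 2]] for i in range(0, len(genotype), 2))

def neutral_neighbours(genotype):
    """Single-point mutants that leave the phenotype unchanged."""
    target = develop(genotype)
    flips = [genotype[:i] + ("1" if g == "0" else "0") + genotype[i + 1:]
             for i, g in enumerate(genotype)]
    return [m for m in flips if develop(m) == target]
```

For example, `develop("0010")` yields `"AB"`, and the mutant `"0110"` develops to the same phenotype, so the two genotypes are connected by a neutral point mutation.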

The distinction between genotype and phenotype in evolutionary computation systems has been made explicit for quite a while [12] and yet there has been relatively little attention paid to the potential impact of the G-Pm on the evolutionary dynamics of such systems. Genetic algorithms [5], [6] typically use a very direct (trivial) mapping of parameters onto the genotype string, while genetic programming [11] in its simplest form works directly on the phenotype. Historically, in fields such as artificial intelligence and computer science the issue of representation has often proved crucial to the successful solution of problems [17]. The G-Pm is essentially the representation of the phenotype by the genotype, and is likely to prove similarly important in evolutionary computation systems. Previous work which has looked at some aspects of the G-Pm in computational systems includes the Paterson and Livesey system [13] and a system which maps genotypes to phenotypic programs using grammars developed by Keller and Banzhaf [8]. More recent work [14], [15] has explored a number of redundant G-Pms with a view to ascertaining whether these fundamental properties of living systems can be encouraged in our artificial systems.

The authors of this paper investigate the evolutionary dynamics and relative performance of a genetic programming system (BTGP) [2] versus a genetic algorithm (MGA) which employs a relatively complex G-Pm. Both systems are applied to a non-trivial information retrieval task and employ the same Boolean decision tree phenotypic representation. The ability of the algorithms to generate a good final solution is of particular interest, rather than seeking fast convergence alone. Previous work using genetic programming for data mining [16] and genetic algorithms for query optimization [9] already exists.

Section snippets

Experimental design

In order to compare the BTGP and the MGA an application task must be specified. The task here is to evolve a Boolean decision tree capable of discriminating between two document classes, those sought in a retrieval task and those which are of no interest. The data used is generated in a preprocessing step from internet documents which have been labeled by a user as either of interest (positive) or of no interest (negative). Preprocessing consists of extraction of a set of keywords across all

Experiments

A number of experiments were carried out to compare the performance of the BTGP and the MGA applied to the information retrieval task described in Section 2. Two separate data sets were used. Data set 1 consists of 84 cases which are divided into 66 training cases (30 positive, 36 negative) and 18 test cases (8 positive, 10 negative). Each case consists of a positive or negative label together with one value for each keyword specifying its presence (1) or absence (0). These cases correspond to
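Given cases of this shape (a positive/negative label plus one 0/1 value per keyword), a natural fitness measure is classification accuracy over the training cases. The following is a sketch only; the fitness function actually used by the BTGP and MGA is not given in this excerpt, and `predict` stands in for evaluating an evolved tree.

```python
# Hypothetical fitness: the fraction of labeled cases a classifier gets
# right. A case is (label, features), where features maps each keyword to
# its presence (1) or absence (0), matching the data format in the text.

def fitness(predict, cases):
    """predict: features dict -> bool; cases: iterable of (label, features)."""
    correct = sum(1 for label, features in cases if predict(features) == label)
    return correct / len(cases)

# Tiny illustrative data set in the same shape as the paper's cases.
cases = [
    (True,  {"network": 1, "spam": 0}),
    (True,  {"network": 1, "spam": 1}),
    (False, {"network": 0, "spam": 1}),
    (False, {"network": 0, "spam": 0}),
]

def predict(features):
    # Stand-in for an evolved Boolean tree: match on "network" alone.
    return features["network"] == 1

print(fitness(predict, cases))  # 1.0 on this toy split
```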

Discussion

The experimental results have highlighted the need to consider carefully how the phenotype is represented by the genotype. This representation is embodied in the details of the G-Pm. The genetic operators must be considered together with this mapping to gain insight into evolutionary dynamics.

The redundant nature of the MGA mapping effectively dampens the effects of mutations on the phenotype while for the BTGP the majority of mutations do affect the phenotype fitness directly. Other

Conclusions

The role of a G-Pm has been investigated in the context of two different EAs applied to an optimization problem. The algorithms are required to create Boolean decision trees for use in information retrieval which are able to distinguish between two classes of example documents. One algorithm (the MGA) employs a non-trivial G-Pm, whereas the other algorithm (the BTGP) operates directly on the phenotypic representation.

The MGA introduces a high degree of redundancy into the genotypic


References (17)

  • L. Altenberg, Fitness distance correlation analysis: an instructive counterexample, in: Proceedings of the Seventh...
  • J.L. Fernández-Villacañas Martı́n, BTGP and information retrieval, in: Proceedings of the Second International...
  • J.L. Fernández-Villacañas Martı́n, P. Marrow, M. Shackleton, On measuring the attributes of evolutionary algorithms: an...
  • W. Fontana, P. Schuster, Shaping space: the possible and the attainable in RNA genotype–phenotype mapping, Working...
  • D.E. Goldberg, Genetic Algorithms in Search, Optimisation and Machine Learning, Addison-Wesley, Reading, MA,...
  • J.H. Holland, Adaptation in Natural and Artificial Systems, The University of Michigan Press, Ann Arbor,...
  • M. Huynen, P. Stadler, W. Fontana, Smoothness within ruggedness: the role of neutrality in adaption, in: Proceedings of...
  • R.E. Keller, W. Banzhaf, Genetic programming using mutation, reproduction and genotype–phenotype mapping from linear...

Cited by (12)

  • Local search: A guide for the information retrieval practitioner

    2009, Information Processing and Management
    Citation Excerpt :

    The size of a collection to be used for experiments is a very important issue. Many practitioners in the area still use Cranfield test collections (Cordon et al., 2006; Lopez-Pujalte et al., 2003a; Vrajitoru, 1998), and other use even small datasets e.g. a set of 359 abstracts by Cordon et al. (2002), and a set of 200 ‘cases’ by Fernandez-Villacanas Martin and Shackleton (2003). Whilst it is difficult to assert what the best size is for using test collections in experimentation, these datasets are almost certainly too small as the collection statistics (such as IDF, TF) will be very different from larger datasets which are of more interest to users.

  • Improving metaheuristics convergence properties in inductive query by example using two strategies for reducing the search space

    2007, Computers and Operations Research
    Citation Excerpt :

    In addition to this first work in IQBE, GAs have been profusely applied to improve query definition, for example in [8–11]. There are also approaches based on GP, for example [12,13], and on SA [14], where a hybrid-simulated annealing-GP has been presented. In spite of the huge work carried out in improving metaheuristic algorithms for IQBE, there are still several unresolved problems related to algorithms performance (specifically in GA and SA), such as poor results obtained in large queries design or convergence issues, mainly the time needed for convergence.


José-Luis Fernández-Villacañas Martı́n graduated in Physics at the Complutense University in Madrid and received his PhD in Astrophysics in 1989. From then on, he was a Member of the Theoretical Physics Department at Oxford University until he moved to British Telecom Research Labs in 1992. At BT he was a Senior Researcher in Artificial Life and Evolutionary Computation. He left the labs to join the European Commission in 1999 as a Project Officer in Future Emerging Technologies. Since October 2000 he has been Visiting Professor at the Charles III University in Madrid in the Department of Communications Technology.

Mark Shackleton graduated from Sheffield University (UK) in 1986 with a degree in Computer Science. He first worked for Singer Link-Miles, manufacturers of commercial flight simulators, developing real-time 3D computer graphics algorithms and systems. He joined the Image Processing and Computer Vision research group at BT (British Telecommunications) in 1989. In this group he designed and implemented a number of systems in areas such as automatic face recognition, model-based coding, and content retrieval from images and video sequences. During this period he spent time seconded to the MIT Media Laboratory, working closely alongside researchers there. In 1996 he moved across to the Future Technologies Group, now part of BTexact's Intelligent Systems Laboratory at Adastral Park. He now leads this group, whose remit is to develop novel solutions to problems using nature-inspired algorithms and approaches. Mark is Project Manager of BTexact's Pervasive Computing Research Programme, which is seeking to address the issues of complexity inherent in the next generation of large-scale, complex, dynamic networks of computational devices.
