Abstract
This paper summarizes the use of a genetic programming (GP) system to develop classification rules for gene expression data that hold promise for the development of new molecular diagnostics. The work focuses on discovering simple, accurate rules that diagnose diseases based on changes in gene expression profiles within a diseased cell. GP is shown to be a useful technique for discovering classification rules in a supervised learning mode, where the biological genotype is paired with a biological phenotype such as a disease state. Developing these rules required new techniques for establishing fitness and interpreting the results of evolutionary runs, because the number of independent variables is large and the number of samples is comparatively small. These techniques are described, along with issues of overfitting caused by small sample sizes and the behavior of the GP system when variables are missing from the samples.
Copyright information
© 2003 Springer Science+Business Media New York
About this chapter
Cite this chapter
Driscoll, J.A., Worzel, B., MacLean, D. (2003). Classification of Gene Expression Data with Genetic Programming. In: Riolo, R., Worzel, B. (eds) Genetic Programming Theory and Practice. Genetic Programming Series, vol 6. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-8983-3_3
DOI: https://doi.org/10.1007/978-1-4419-8983-3_3
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4613-4747-7
Online ISBN: 978-1-4419-8983-3
eBook Packages: Springer Book Archive