Abstract
Biological sequence analysis presents interesting challenges for machine learning. With an important problem — the recognition of functional target sites for microRNA molecules — as an example, we show how multiple genetic programming classifiers improve accuracy and stability. Moving from single classifiers to bagging and boosting with crossvalidation and parameter optimization requires more computing power. A special-purpose search processor for fitness evaluation renders boosted genetic programming practical for our purposes.
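To make the boosting idea concrete, the following is a minimal AdaBoost sketch in Python. It is not the chapter's method: the weak learners here are simple "does this k-mer occur in the sequence" tests, used as a stand-in for the evolved genetic programming classifiers, and the names `kmer_stump`, `candidate_stumps`, `adaboost`, and `predict` are hypothetical. Labels are assumed to be +1 (target site) or -1 (non-target).

```python
import math

def kmer_stump(kmer, polarity):
    """Weak classifier (assumed stand-in for a GP pattern):
    predict `polarity` if `kmer` occurs in the sequence, else -polarity."""
    return lambda seq: polarity if kmer in seq else -polarity

def candidate_stumps(seqs, k=3):
    """Enumerate one stump per observed k-mer and polarity."""
    kmers = sorted({s[i:i + k] for s in seqs for i in range(len(s) - k + 1)})
    return [kmer_stump(m, p) for m in kmers for p in (1, -1)]

def adaboost(seqs, labels, rounds=5, k=3):
    """Return a list of (alpha, weak_classifier) pairs."""
    n = len(seqs)
    weights = [1.0 / n] * n          # start with uniform example weights
    stumps = candidate_stumps(seqs, k)
    ensemble = []

    for _ in range(rounds):
        def weighted_error(h):
            return sum(w for w, s, y in zip(weights, seqs, labels) if h(s) != y)

        best = min(stumps, key=weighted_error)
        err = weighted_error(best)
        if err >= 0.5:               # no better than chance: stop
            break

        # Vote weight of this classifier; clamp err to avoid division by zero.
        alpha = 0.5 * math.log((1.0 - err) / max(err, 1e-10))
        ensemble.append((alpha, best))
        if err == 0:                 # training set already separated
            break

        # Increase the weight of misclassified examples, then renormalize.
        weights = [w * math.exp(-alpha * y * best(s))
                   for w, s, y in zip(weights, seqs, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]

    return ensemble

def predict(ensemble, seq):
    """Weighted majority vote of the weak classifiers."""
    score = sum(alpha * h(seq) for alpha, h in ensemble)
    return 1 if score >= 0 else -1

# Toy data (made up for illustration only).
toy_seqs = ["ACGUAC", "UUACGA", "GGACGU", "UUUGGA", "GAUUGC", "CCUUAG"]
toy_labels = [1, 1, 1, -1, -1, -1]
toy_ensemble = adaboost(toy_seqs, toy_labels)
```

In a setting like the one the abstract describes, each round would instead run a full genetic programming search on the reweighted examples, which is why fitness evaluation dominates the computing cost.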
Copyright information
© 2007 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Saetrom, P., Birkeland, O.R., Snøve, O. (2007). Boosting Improves Stability and Accuracy of Genetic Programming in Biological Sequence Classification. In: Riolo, R., Soule, T., Worzel, B. (eds) Genetic Programming Theory and Practice IV. Genetic and Evolutionary Computation. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-49650-4_5
DOI: https://doi.org/10.1007/978-0-387-49650-4_5
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-33375-5
Online ISBN: 978-0-387-49650-4
eBook Packages: Computer Science, Computer Science (R0)