Abstract
One of the central challenges of machine learning is the selection of methods for feature selection, feature engineering, and classification or regression algorithms for building an analytics pipeline. This is true for both novices and experts. Automated machine learning (AutoML) has emerged as a useful approach to generate machine learning pipelines without the need for manual construction and evaluation. We review here some challenges of building pipelines and present several of the first and most widely used AutoML methods and open-source software. We present in detail the Tree-based Pipeline Optimization Tool (TPOT) that represents pipelines as expression trees and uses genetic programming (GP) for discovery and optimization. We present some of the extensions of TPOT and its application to real-world big data. We end with some thoughts about the future of AutoML and evolutionary machine learning.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Chicco, D., Oneto, L., Tavazzi, E.: Eleven quick tips for data cleaning and feature engineering. PLoS Comput. Biol. 18, e1010718 (2022)
Urbanowicz, R.J., Meeker, M., La Cava, W., Olson, R.S., Moore, J.H.: Relief-based feature selection: introduction and review. J. Biomed. Inform. 85, 189–203 (2018)
Geng, L., Hamilton, H.J.: Interestingness measures for data mining: a survey. ACM Comput. Surv. 38 (2006)
Smits, G.F., Kotanchek, M.: Pareto-front exploitation in symbolic regression. In: O’Reilly, U.-M., Yu, T., Riolo, R., Worzel, B. (eds.) Genetic Programming Theory and Practice II. pp. 283–299 (2006)
Combi, C., Amico, B., Bellazzi, R., Holzinger, A., Moore, J.H., Zitnik, M., Holmes, J.H.: A manifesto on explainability for artificial intelligence in medicine. Artif. Intell. Med. 133, 102423 (2022)
Hutter, F., Kotthoff, L., Vanschoren, J. (eds.): Automated Machine Learning: Methods, Systems, Challenges. Springer International Publishing (2019)
Thornton, C., Hutter, F., Hoos, H.H., Leyton-Brown, K.: Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 847–855. ACM, New York, NY, USA (2013)
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 11, 10–18 (2009)
Wang, H.-L., Hsu, W.-Y., Lee, M.-H., Weng, H.-H., Chang, S.-W., Yang, J.-T., Tsai, Y.-H.: Automatic machine-learning-based outcome prediction in patients with primary intracerebral hemorrhage. Front. Neurol. 10, 910 (2019)
Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.: Efficient and robust automated machine learning. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28. pp. 2962–2970. Curran Associates, Inc. (2015)
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Howard, D., Maslej, M.M., Lee, J., Ritchie, J., Woollard, G., French, L.: Transfer learning for risk classification of social media posts: model evaluation study. J. Med. Internet Res. 22, e15371 (2020)
Olson, R.S., Urbanowicz, R.J., Andrews, P.C., Lavender, N.A., Kidd, L.C., Moore, J.H.: Automating biomedical data science through tree-based pipeline optimization. In: Squillero, G., Burelli, P. (eds.) Applications of Evolutionary Computation, pp. 123–137. Springer International Publishing, Cham (2016)
Olson, R.S., Bartley, N., Urbanowicz, R.J., Moore, J.H.: Evaluation of a tree-based pipeline optimization tool for automating data science. In: Proceedings of the Genetic and Evolutionary Computation Conference 2016, pp. 485–492. ACM, New York, NY, USA (2016)
Olson, R.S., Moore, J.H.: TPOT: a tree-based pipeline optimization tool for automating machine learning. In: Hutter, F., Kotthoff, L., Vanschoren, J. (eds.) Automated Machine Learning: Methods, Systems, Challenges, pp. 151–160. Springer International Publishing, Cham (2019)
Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, USA (1992)
Fortin, F., De Rainville, F., Gardner, M.A., Parizeau, M., Gagné, C.: DEAP: evolutionary algorithms made easy. J. Mach. Learn. Res. 13, 2171–2175 (2012)
Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6, 182–197 (2002)
Helmuth, T., McPhee, N.F., Spector, L.: Lexicase selection for program synthesis: a diversity analysis. In: Riolo, R., Worzel, W.P., Kotanchek, M., Kordon, A. (eds.) Genetic Programming Theory and Practice XIII, pp. 151–167. Springer International Publishing, Cham (2016)
Le, T.T., Fu, W., Moore, J.H.: Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinforma. Oxf. Engl. 36, 250–256 (2020)
Romano, J., Le, T., Fu, W., Moore, J.: TPOT-NN: augmenting tree-based automated machine learning with neural network estimators. Genet. Program. Evolvable Mach. 1–21 (2021)
Manduchi, E., Romano, J.D., Moore, J.H.: The promise of automated machine learning for the genetic analysis of complex traits. Hum. Genet. 141, 1529–1544 (2022)
Orlenko, A., Kofink, D., Lyytikäinen, L.-P., Nikus, K., Mishra, P., Kuukasjärvi, P., Karhunen, P.J., Kähönen, M., Laurikka, J.O., Lehtimäki, T., Asselbergs, F.W., Moore, J.H.: Model selection for metabolomics: predicting diagnosis of coronary artery disease using automated machine learning. Bioinforma. Oxf. Engl. 36, 1772–1778 (2020)
Manduchi, E., Fu, W., Romano, J.D., Ruberto, S., Moore, J.H.: Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses. BMC Bioinform. 21, 430 (2020)
Purkayastha, S., Zhao, Y., Wu, J., Hu, R., McGirr, A., Singh, S., Chang, K., Huang, R.Y., Zhang, P.J., Silva, A., Soulen, M.C., Stavropoulos, S.W., Zhang, Z., Bai, H.X.: Differentiation of low and high grade renal cell carcinoma on routine MRI with an externally validated automatic machine learning algorithm. Sci. Rep. 10, 19503 (2020)
Heimisdottir, L.H., Lin, B.M., Cho, H., Orlenko, A., Ribeiro, A.A., Simon-Soro, A., Roach, J., Shungin, D., Ginnis, J., Simancas-Pallares, M.A., Spangler, H.D., Zandoná, A.G.F., Wright, J.T., Ramamoorthy, P., Moore, J.H., Koo, H., Wu, D., Divaris, K.: Metabolomics insights in early childhood caries. J. Dent. Res. 100, 615–622 (2021)
Manduchi, E., Le, T.T., Fu, W., Moore, J.H.: Genetic analysis of coronary artery disease using tree-based automated machine learning informed by biology-based feature selection. IEEE/ACM Trans. Comput. Biol. Bioinform. 19, 1379–1386 (2022)
Lundberg, S.M., Lee, S.-I.: A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems. Curran Associates, Inc. (2017)
Sipper, M., Moore, J.H.: Genetic programming theory and practice: a fifteen-year trajectory. Genet. Program Evolvable Mach. 21, 169–179 (2020)
La Cava, W., Williams, H., Fu, W., Vitale, S., Srivatsan, D., Moore, J.H.: Evaluating recommender systems for AI-driven biomedical informatics. Bioinforma. Oxf. Engl. 37, 250–256 (2021)
Moore, J.H., Parker, J.S., Olsen, N.J., Aune, T.M.: Symbolic discriminant analysis of microarray data in autoimmune disease. Genet. Epidemiol. 23, 57–69 (2002)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Moore, J.H., Ribeiro, P.H., Matsumoto, N., Saini, A.K. (2024). Genetic Programming as an Innovation Engine for Automated Machine Learning: The Tree-Based Pipeline Optimization Tool (TPOT). In: Banzhaf, W., Machado, P., Zhang, M. (eds) Handbook of Evolutionary Machine Learning. Genetic and Evolutionary Computation. Springer, Singapore. https://doi.org/10.1007/978-981-99-3814-8_14
Download citation
DOI: https://doi.org/10.1007/978-981-99-3814-8_14
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-3813-1
Online ISBN: 978-981-99-3814-8
eBook Packages: Computer ScienceComputer Science (R0)