ABSTRACT
This paper introduces MLStar, our AutoML system, which uses genetic programming to synthesize scikit-learn- and Keras-based Python programs for supervised learning. MLStar builds on our own genetic programming system, GPStar4, and explores a larger search space than traditional genetic programming frameworks.
Key elements behind MLStar's performance include representing individuals as directed acyclic graphs (DAGs), a rich type system that shapes the kinds of graphs generated, novel genetic operators that act directly on the DAG structure, and advanced hyperparameter tuning via the Optuna optimization framework. MLStar also supports multi-objective fitness functions and a variety of complex population types.
We show that MLStar compares favorably with several other AutoML frameworks on benchmark tests, and that it produces competitive solutions even when its computationally expensive features are disabled.
- ClimbsRocks. 2023. auto_ml. https://github.com/ClimbsRocks/auto_ml
- Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
- James Bergstra, Daniel Yamins, and David Cox. 2013. Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures. In Proceedings of the 30th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 28), Sanjoy Dasgupta and David McAllester (Eds.). PMLR, Atlanta, Georgia, USA, 115--123. https://proceedings.mlr.press/v28/bergstra13.html
- Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD '16). ACM, New York, NY, USA, 785--794.
- François Chollet et al. 2015. Keras. https://github.com/fchollet/keras
- Piali Das, Nikita Ivkin, Tanya Bansal, Laurence Rouesnel, Philip Gautier, Zohar Karnin, Leo Dirac, Lakshmi Ramakrishnan, Andre Perunicic, Iaroslav Shcherbatyi, Wilton Wu, Aida Zolic, Huibin Shen, Amr Ahmed, Fela Winkelmolen, Miroslav Miladinovic, Cédric Archambeau, Alex Tang, Bhaskar Dutt, Patricia Grao, and Kumar Venkateswar. 2020. Amazon SageMaker Autopilot: A White Box AutoML Solution at Scale. In Proceedings of the Fourth International Workshop on Data Management for End-to-End Machine Learning (Portland, OR, USA) (DEEM'20). Association for Computing Machinery, New York, NY, USA, Article 2, 7 pages.
- Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. 2015. Efficient and Robust Automated Machine Learning. In Advances in Neural Information Processing Systems 28 (2015). 2962--2970.
- Haifeng Jin, Qingquan Song, and Xia Hu. 2019. Auto-Keras: An Efficient Neural Architecture Search System. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Anchorage, AK, USA) (KDD '19). Association for Computing Machinery, New York, NY, USA, 1946--1956.
- Julien Amblard, Robert Filman, and Gabriel Kopito. 2023. GPStar4: A Flexible Framework for Experimenting with Genetic Programming. Submitted to GECCO 2023.
- James Max Kanter and Kalyan Veeramachaneni. 2015. Deep feature synthesis: Towards automating data science endeavors. 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA) (2015), 1--10.
- Donald E. Knuth. 1968. Semantics of Context-Free Languages. Math. Syst. Theory 2, 2 (1968), 127--145.
- Erin LeDell and Sebastien Poirier. 2020. H2O AutoML: Scalable Automatic Machine Learning. 7th ICML Workshop on Automated Machine Learning (AutoML) (July 2020). https://www.automl.org/wp-content/uploads/2020/07/AutoML_2020_paper_61.pdf
- Edgar Galván López and Katya Rodríguez-Vázquez. 2007. Multiple Interactive Outputs in a Single Tree: An Empirical Investigation. In Genetic Programming, 10th European Conference, EuroGP 2007, Valencia, Spain, April 11--13, 2007, Proceedings (Lecture Notes in Computer Science, Vol. 4445), Marc Ebner, Michael O'Neill, Anikó Ekárt, Leonardo Vanneschi, and Anna Esparcia-Alcázar (Eds.). Springer, 341--350.
- Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore. 2016. Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science.
- Randal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore. 2017. PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Mining 10, 36 (11 Dec 2017), 1--13.
- Riccardo Poli, William B. Langdon, and Nicholas F. McPhee. 2008. A Field Guide to Genetic Programming. Lulu.com. 250 pp. ISBN 978-1-4092-0073-4.
- F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825--2830.
- Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. 2018. CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2018/file/14491b756b3a51daac41c24863285549-Paper.pdf
- Léo Françoso D. P. Sotto, Paul Kaufmann, Timothy Atkinson, Roman Kalkreuth, and Márcio Porto Basgalupp. 2020. A Study on Graph Representations for Genetic Programming. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference (Cancún, Mexico) (GECCO '20). Association for Computing Machinery, New York, NY, USA, 931--939.
- In-Kwon Yeo and Richard A. Johnson. 2000. A new family of power transformations to improve normality or symmetry. Biometrika 87, 4 (December 2000), 954--959.
Index Terms
- MLStar: A System for Synthesis of Machine-Learning Programs