STREAMLINE: A Simple, Transparent, End-To-End Automated Machine Learning Pipeline Facilitating Data Analysis and Algorithm Comparison

Genetic Programming Theory and Practice XIX

Part of the book series: Genetic and Evolutionary Computation (GEVO)

Abstract

Machine learning (ML) offers powerful methods for detecting and modeling associations, often in data with large feature spaces and complex relationships. Many useful tools and packages (e.g., scikit-learn) have been developed to make the various elements of data handling, processing, modeling, and interpretation accessible. However, it is not trivial for most investigators to assemble these elements into a rigorous, replicable, unbiased, and effective data analysis pipeline. Automated machine learning (AutoML) seeks to address these issues by simplifying the process of ML analysis for all. Here, we introduce STREAMLINE, a simple, transparent, end-to-end AutoML pipeline designed as a framework for easily conducting rigorous ML modeling and analysis (limited initially to binary classification). STREAMLINE is specifically designed to compare performance between datasets, ML algorithms, and other AutoML tools. It is unique among AutoML tools in offering a fully transparent and consistent baseline of comparison using a carefully designed series of pipeline elements, including (1) exploratory analysis, (2) basic data cleaning, (3) cross-validation partitioning, (4) data scaling and imputation, (5) filter-based feature importance estimation, (6) collective feature selection, (7) ML modeling with 'Optuna' hyperparameter optimization across 15 established algorithms (including less well-known Genetic Programming and rule-based ML), (8) evaluation across 16 classification metrics, (9) model feature importance estimation, (10) statistical significance comparisons, and (11) automated export of all results, plots, a PDF summary report, and models that can be easily applied to replication data.
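To make pipeline elements (3) and (4) concrete, the following is a minimal sketch in plain Python, not STREAMLINE's own code. It assumes a toy dataset, median imputation, and standard scaling; the function names `stratified_folds`, `fit_preprocessor`, and `transform` are hypothetical and chosen only for illustration. The point it shows is that the imputation and scaling parameters are learned from each training partition alone and then applied to the matching testing partition, which is one common way to keep preprocessing from leaking information across the split.

```python
import random
import statistics
from collections import defaultdict


def stratified_folds(labels, k, seed=42):
    """Assign each row index to one of k folds, preserving class balance."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets a near-equal class share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def fit_preprocessor(train_rows):
    """Learn per-feature median, mean, and std from training rows only."""
    params = []
    for column in zip(*train_rows):
        observed = [v for v in column if v is not None]
        median = statistics.median(observed)
        filled = [median if v is None else v for v in column]
        mean = statistics.fmean(filled)
        std = statistics.pstdev(filled) or 1.0  # avoid dividing by zero
        params.append((median, mean, std))
    return params


def transform(rows, params):
    """Impute missing values, then standardize, using training-derived params."""
    return [
        [((median if v is None else v) - mean) / std
         for v, (median, mean, std) in zip(row, params)]
        for row in rows
    ]


# Toy binary-classification data (assumed for illustration); None marks a missing value.
X = [[1.0, 10.0], [2.0, None], [3.0, 30.0], [4.0, 40.0],
     [5.0, 50.0], [None, 60.0], [7.0, 70.0], [8.0, 80.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

for test_idx in stratified_folds(y, k=2):
    train_idx = [i for i in range(len(y)) if i not in test_idx]
    params = fit_preprocessor([X[i] for i in train_idx])
    X_train = transform([X[i] for i in train_idx], params)
    X_test = transform([X[i] for i in test_idx], params)
    print(len(X_train), "training rows,", len(X_test), "testing rows")
```

In a fuller analysis, the later elements (feature selection, hyperparameter optimization, model training, and evaluation) would sit inside the same loop, so that each step sees only the training partition.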

Acknowledgements

The study was supported by the following NIH grants: R01s LM010098 and AG066833. STREAMLINE development benefited from multiple biomedical research collaborators at the University of Pennsylvania, Fox Chase Cancer Center, Cedars Sinai Medical Center, and the University of Kansas Medical Center. Special thanks to Patryk Orzechowski, Trang Le, Sy Hwang, Richard Zhang, Wilson Zhang, and Pedro Ribeiro for their code contributions and feedback. We also thank the following collaborators for their feedback on the application of the pipeline during development: Shannon Lynch, Rachael Stolzenberg-Solomon, Ulysses Magalang, Allan Pack, Brendan Keenan, Danielle Mowery, Jason Moore, and Diego Mazzotti.

Author information

Corresponding author

Correspondence to Ryan Urbanowicz.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Urbanowicz, R., Zhang, R., Cui, Y., Suri, P. (2023). STREAMLINE: A Simple, Transparent, End-To-End Automated Machine Learning Pipeline Facilitating Data Analysis and Algorithm Comparison. In: Trujillo, L., Winkler, S.M., Silva, S., Banzhaf, W. (eds) Genetic Programming Theory and Practice XIX. Genetic and Evolutionary Computation. Springer, Singapore. https://doi.org/10.1007/978-981-19-8460-0_9

  • DOI: https://doi.org/10.1007/978-981-19-8460-0_9

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-19-8459-4

  • Online ISBN: 978-981-19-8460-0

  • eBook Packages: Computer Science, Computer Science (R0)