Abstract
Feature engineering is a key step in a machine learning study. We propose FERMAT, a grammatical evolution framework for the automatic discovery of an optimal set of engineered features with an enhanced ability to characterize data. The framework contains a grammar specifying the original features and the operations that can be applied to the data. The optimization process searches for a transformation strategy to apply to the original dataset, aiming to create a novel characterization composed of a combination of original and engineered attributes. FERMAT was applied to two real-world drug development datasets, and the results reveal that the framework is able to craft novel representations of the data that improve the predictive ability of tree-based regression models.
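To make the idea of a grammar over original features and operations more concrete, the following is a minimal sketch and not the FERMAT implementation. It assumes a toy grammar with three hypothetical features (x0, x1, x2), a small illustrative operator set, and the classic grammatical evolution modulo rule for mapping integer codons to productions. The names GRAMMAR, derive, and engineer are invented for this example. The evolutionary search and the fitness evaluation (for instance, the error of a tree-based regressor) are omitted.

```python
import math

# Hypothetical toy grammar: each non-terminal maps to a list of productions.
# Feature names x0..x2 and the operator set are illustrative assumptions.
GRAMMAR = {
    "<expr>": [
        ["(", "<expr>", "<op>", "<expr>", ")"],
        ["<fn>", "(", "<expr>", ")"],
        ["<var>"],
    ],
    "<op>": [["+"], ["-"], ["*"]],
    "<fn>": [["sqrt_abs"], ["log1p_abs"]],
    "<var>": [["x0"], ["x1"], ["x2"]],
}

# Protected functions so that every derived expression is defined on real inputs.
SAFE_FUNCS = {
    "sqrt_abs": lambda v: math.sqrt(abs(v)),
    "log1p_abs": lambda v: math.log1p(abs(v)),
}


def derive(codons, start="<expr>", max_depth=4):
    """Map a list of integer codons to an expression string (modulo rule)."""
    pos = 0

    def expand(symbol, depth):
        nonlocal pos
        if symbol not in GRAMMAR:
            return symbol  # terminal symbol, emitted as-is
        options = GRAMMAR[symbol]
        if symbol == "<expr>" and depth >= max_depth:
            options = [["<var>"]]  # force termination of deep derivations
        # Codons are reused cyclically (wrapping) when the list is exhausted.
        choice = options[codons[pos % len(codons)] % len(options)]
        pos += 1
        return "".join(expand(s, depth + 1) for s in choice)

    return expand(start, 0)


def engineer(rows, expressions):
    """Return each original row extended with one column per expression."""
    out = []
    for row in rows:
        env = {f"x{i}": v for i, v in enumerate(row)}
        env.update(SAFE_FUNCS)
        # eval is acceptable here only because the expressions come from
        # the fixed grammar above, never from untrusted input.
        new = [eval(e, {"__builtins__": {}}, env) for e in expressions]
        out.append(list(row) + new)
    return out


if __name__ == "__main__":
    exprs = [derive([0, 2, 0, 0, 2, 1]), derive([1, 2, 2])]
    print(exprs)
    print(engineer([[1.0, 4.0, 3.0]], exprs))
```

In this sketch the transformed dataset keeps every original column and appends the engineered ones. A search procedure would additionally decide which original attributes to retain, which is what allows solutions composed of both kinds of attributes.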
Notes
Some of the best solutions are composed solely of engineered features. In these runs, it is not possible to design a solution for FERMAT-Sel. Accordingly, the number of repetitions for this variant is lower than for the remaining alternatives.
Acknowledgments
This work was funded by FEDER funds through the Operational Programme Competitiveness Factors - COMPETE and national funds by FCT - Foundation for Science and Technology (POCI-01-0145-FEDER-029297, CISUC - UID/CEC/00326/2020) and within the scope of the project A4A: Audiology for All (CENTRO-01-0247-FEDER-047083) financed by the Operational Program for Competitiveness and Internationalisation of PORTUGAL 2020 through the European Regional Development Fund.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Monteiro, M., Lourenço, N., Pereira, F.B. (2021). FERMAT: Feature Engineering with Grammatical Evolution. In: Marreiros, G., Melo, F.S., Lau, N., Lopes Cardoso, H., Reis, L.P. (eds) Progress in Artificial Intelligence. EPIA 2021. Lecture Notes in Computer Science, vol 12981. Springer, Cham. https://doi.org/10.1007/978-3-030-86230-5_19
DOI: https://doi.org/10.1007/978-3-030-86230-5_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86229-9
Online ISBN: 978-3-030-86230-5
eBook Packages: Computer Science, Computer Science (R0)