ABSTRACT
Difficult benchmark problems are in increasing demand in Genetic Programming (GP). One problem seeing increased usage is the oral bioavailability problem, which is often presented as a challenging problem to both GP and other machine learning methods. However, few properties of the bioavailability data set have been demonstrated, so attributes that make it a challenging problem are largely unknown. This work uncovers important properties of the bioavailability data set, and suggests that the perceived difficulty in this problem can be partially attributed to a lack of pre-processing, including features within the data set that contain no information, and contradictory relationships between the dependent and independent features of the data set. The paper then re-examines the performance of GP on this data set, and contextualises this performance relative to other regression methods. Results suggest that a large component of the observed performance differences on the bioavailability data set can be attributed to variance in the selection of training and testing data. Differences in performance between GP and other methods disappear when multiple training/testing splits are used within experimental work, with performance typically no better than a null modelling approach of reporting the mean of the training data.
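The evaluation idea described above (dropping uninformative features, then comparing a regressor against a mean-of-training null model over many random training/testing splits) can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the use of ordinary least squares as a stand-in regressor, and the defaults of 30 splits with a 30% test fraction are all assumptions made for the example.

```python
import numpy as np


def drop_constant_features(X):
    """Remove columns whose value never changes (they carry no information)."""
    keep = X.max(axis=0) != X.min(axis=0)
    return X[:, keep], keep


def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def null_model_rmse(y_train, y_test):
    """Null model: always predict the mean of the training targets."""
    prediction = np.full_like(y_test, y_train.mean(), dtype=float)
    return rmse(y_test, prediction)


def linear_model_rmse(X_train, y_train, X_test, y_test):
    """Stand-in regressor (assumption): ordinary least squares with intercept."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return rmse(y_test, A_test @ coef)


def repeated_split_comparison(X, y, n_splits=30, test_fraction=0.3, seed=0):
    """Return an (n_splits, 2) array of [null RMSE, model RMSE] per split.

    Each row uses a fresh random training/testing partition, so the spread
    of each column shows how much performance depends on the split alone.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = int(round(test_fraction * n))
    results = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        test_idx, train_idx = perm[:n_test], perm[n_test:]
        null = null_model_rmse(y[train_idx], y[test_idx])
        model = linear_model_rmse(
            X[train_idx], y[train_idx], X[test_idx], y[test_idx]
        )
        results.append((null, model))
    return np.array(results)
```

Comparing the two columns of the returned array across splits (for example, their medians and ranges) indicates whether a method's apparent advantage survives resampling or is within the variation caused by the choice of split.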
Index Terms
- A Re-Examination of the Use of Genetic Programming on the Oral Bioavailability Problem