ABSTRACT
Feature selection in high-dimensional data sets is an open problem with no universally satisfactory method available. In this paper we discuss the requirements for such a method with respect to various aspects of feature importance and explore them using regression random forests and symbolic regression. We study 'conventional' feature selection with both methods on several test problems and a case study, compare the results, and identify conceptual differences in the generated feature importances.
We demonstrate that random forests may overlook important variables (variables significantly related to the response) for several reasons, whereas symbolic regression identifies all important variables provided that models of sufficient quality are found. We attribute these results to the fact that the variable importances produced by the two methods have different semantics.
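The masking effect behind this claim is easy to reproduce. The following minimal sketch (not from the paper; it assumes scikit-learn and synthetic data) shows how a near-duplicate predictor splits the credit that a random forest assigns to a truly important variable, so that both impurity-based and permutation importances understate it:

```python
# Sketch: random forest importances can understate a variable that is
# significantly related to the response when a correlated proxy exists.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 2000

x1 = rng.normal(size=n)                 # truly important driver
x2 = x1 + 0.05 * rng.normal(size=n)     # near-duplicate proxy of x1
x3 = rng.normal(size=n)                 # irrelevant noise feature
X = np.column_stack([x1, x2, x3])
y = np.sin(x1) + 0.1 * rng.normal(size=n)

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)

# Impurity-based importance splits credit between x1 and its proxy x2,
# so each appears only about half as important as the signal really is.
print("impurity importances:   ", rf.feature_importances_.round(3))

# Permutation importance shows the same dilution: permuting x1 alone
# barely hurts the fit because x2 carries nearly the same information.
pi = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print("permutation importances:", pi.importances_mean.round(3))
```

A symbolic regression model of sufficient quality, by contrast, tends to include such a variable (or its proxy) explicitly in its expression, which is the different semantics of importance the paper contrasts.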