DOI: 10.1145/2001858.2002059
tutorial

Separating the wheat from the chaff: on feature selection and feature importance in regression random forests and symbolic regression

Published: 12 July 2011

ABSTRACT

Feature selection in high-dimensional data sets is an open problem with no universally satisfactory method available. In this paper we discuss the requirements for such a method with respect to the various aspects of feature importance, and we explore them using regression random forests and symbolic regression. We study 'conventional' feature selection with both methods on several test problems and a case study, compare the results, and identify conceptual differences in the feature importances they generate.
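
To make the setting concrete, the sketch below shows what 'conventional' feature selection with a regression random forest typically looks like. This is an illustration, not the paper's code: it assumes scikit-learn and the synthetic Friedman #1 problem (in which only the first five of ten inputs drive the response), and it ranks inputs by the forest's impurity-based importance.

    # Minimal sketch (assumes scikit-learn; not the paper's setup):
    # rank inputs of a synthetic regression problem by random forest
    # impurity importance. In Friedman #1 only x0..x4 are informative.
    import numpy as np
    from sklearn.datasets import make_friedman1
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_friedman1(n_samples=500, n_features=10, random_state=0)
    rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

    # Mean decrease in impurity, averaged over trees; higher = more important.
    for i in np.argsort(rf.feature_importances_)[::-1]:
        print(f"x{i}: {rf.feature_importances_[i]:.3f}")

A variable is then 'selected' by thresholding or taking the top-k of this ranking, which is the conventional procedure being compared here.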

We demonstrate that random forests may overlook important variables (variables significantly related to the response) for various reasons, whereas symbolic regression identifies all important variables provided that models of sufficient quality are found. We explain these results by the fact that the variable importances obtained by the two methods have different semantics.
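
One such reason is easy to reproduce. The sketch below (again an assumption for illustration, not the paper's experiment) appends an exact copy of an informative input: the forest then splits that input's importance between the two copies, so either copy alone can rank below genuinely weaker variables even though both remain significantly related to the response. A marginal test of each variable against the response would still flag both.

    # Minimal sketch (assumes scikit-learn): duplicating an informative
    # input splits its impurity importance between the two copies.
    import numpy as np
    from sklearn.datasets import make_friedman1
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_friedman1(n_samples=500, n_features=10, random_state=0)
    X_dup = np.hstack([X, X[:, [0]]])  # append an exact copy of x0

    rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_dup, y)
    imp = rf.feature_importances_
    print(f"x0: {imp[0]:.3f}  copy of x0: {imp[10]:.3f}  x1: {imp[1]:.3f}")
    # x0's credit is shared with its copy; each copy alone may now rank
    # below variables that contribute less to the response.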

References

  1. H. D. Bondell and B. J. Reich. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with oscar. Biometrics, 64:115--123, Mar. 2008.Google ScholarGoogle ScholarCross RefCross Ref
  2. L. Breiman. Bagging predictors. Machine Learning, 24:123--140, August 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. L. Breiman. Random forests. Machine Learning, 45:5--32, 2001. 10.1023/A:1010933404324. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Evolved Analytics LLC. DataModeler Release 1.0. Evolved Analytics LLC, 2010.Google ScholarGoogle Scholar
  5. M. Gashler, C. Giraud-Carrier, and T. Martinez. Decision Tree Ensemble: Small Heterogeneous Is Better Than Large Homogeneous. In 2008 Seventh International Conference on Machine Learning and Applications, pages 900--905. IEEE, Dec. 2008.Google ScholarGoogle Scholar
  6. R. Genuer, J.-M. Poggi, and C. Tuleau-Malot. Variable selection using random forests. Pattern Recogn. Lett., 31:2225--2236, October 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. U. Grömping. Variable importance assessment in regression: Linear regression versus random forest. The American Statistician, 64:308--319, Nov. 2009.Google ScholarGoogle ScholarCross RefCross Ref
  8. I. Guyon and A. Elisseeff. An introduction to variable and feature selection. J. Mach. Learn. Res., 3:1157--1182, March 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46:389--422, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Human Development Reports. http://www.hdr.undp.org/.Google ScholarGoogle Scholar
  11. Human Development Research Papers. http://hdr.undp.org/en/reports/global/hdr2010/papers/.Google ScholarGoogle Scholar
  12. H. Ishwaran. Variable importance in binary regression trees and forests. Electronic Journal of Statistics, 1:519--537, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  13. Y. Lin and Y. Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101:578--590, June 2006.Google ScholarGoogle ScholarCross RefCross Ref
  14. T. McConaghy. Latent Variable Symbolic Regression for High-Dimensional Inputs, pages 103--118. Springer, 2010.Google ScholarGoogle Scholar
  15. R. K. McRee. Symbolic regression using nearest neighbor indexing. In Proceedings of the 12th annual conference companion on Genetic and evolutionary computation, GECCO '10, pages 1983--1990, New York, NY, USA, 2010. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. H. Peng, F. Long, and C. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226--1238, Aug. 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. A. Rakotomamonjy. Variable selection using svm-based criteria. Journal of Machine Learning Research, 3:1357--1370, 2003. Google ScholarGoogle ScholarCross RefCross Ref
  18. D. S. Siroky. Navigating random forests and related advances in algorithmic modeling. Statistics Surveys, 3:147--163, 2009.Google ScholarGoogle ScholarCross RefCross Ref
  19. G. Smits and M. Kotanchek. Pareto-front exploitation in symbolic regression. In U.-M. O'Reilly, T. Yu, R. L. Riolo, and B. Worzel, editors, Genetic Programming Theory and Practice II, chapter 17, pages 283--299. Springer, Ann Arbor, 13--15 May 2004.Google ScholarGoogle Scholar
  20. C. Strobl, A. L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis. Conditional variable importance for random forests. BMC Bioinformatics, 9(1):307+, July 2008.Google ScholarGoogle Scholar
  21. C. Strobl, A. L. Boulesteix, A. Zeileis, and T. Hothorn. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1):25+, Jan. 2007.Google ScholarGoogle Scholar
  22. K. Vladislavleva. Model-based Problem Solving through Symbolic Regression via Pareto Genetic Programming. PhD thesis, Tilburg University, 2008.Google ScholarGoogle Scholar
  23. K. Vladislavleva, K. Veeramachaneni, M. Burland, J. Parcon, and U.-M. O'Reilly. Knowledge mining with genetic programming methods for variable selection in flavor design. In GECCO, pages 941--948, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Z. Zhao, F. Morstatter, S. Sharma, S. Alelyani, A. Anand, and H. Liu. Advancing feature selection research - asu feature selection repository. Technical report, Arizona State University, June 2010.Google ScholarGoogle Scholar
  25. H. Zou and T. Hastie. Regularization and variable selection via the Elastic Net. Journal of the Royal Statistical Society B, 67:301--320, 2005.Google ScholarGoogle ScholarCross RefCross Ref

Published in

GECCO '11: Proceedings of the 13th annual conference companion on Genetic and evolutionary computation
July 2011, 1548 pages
ISBN: 9781450306904
DOI: 10.1145/2001858

        Copyright © 2011 ACM


Publisher

Association for Computing Machinery, New York, NY, United States




Acceptance Rates

Overall acceptance rate: 1,669 of 4,410 submissions, 38%

