research-article

Evolutionary feature manipulation in data mining/big data

Published: 02 May 2017

Abstract

According to the GIGO (Garbage In, Garbage Out) principle, the quality of the input data strongly influences, or even determines, the quality of the output of any machine learning, big data, or data mining algorithm. The input data, which is often represented by a set of features, may suffer from many issues. Feature manipulation is an effective means of improving the quality of the feature set, but it is a challenging task. Evolutionary computation (EC) techniques have shown advantages and achieved good performance in feature manipulation. This paper reviews recent advances in EC-based feature manipulation methods for classification, clustering, regression, incomplete data, and image analysis, to provide the community with the state-of-the-art work in the field.

Published in

ACM SIGEVOlution, Volume 10, Issue 1
March 2017
10 pages
EISSN: 1931-8499
DOI: 10.1145/3089251

Copyright © 2017 Authors

Publisher: Association for Computing Machinery, New York, NY, United States
