On explaining machine learning models by evolving crucial and compact features

https://doi.org/10.1016/j.swevo.2019.100640

Abstract

Feature construction can substantially improve the accuracy of Machine Learning (ML) algorithms. Genetic Programming (GP) has been proven to be effective at this task by evolving non-linear combinations of input features. GP additionally has the potential to improve ML explainability since explicit expressions are evolved. Yet, in most GP works the complexity of evolved features is not explicitly bound or minimized, though this is arguably key for explainability. In this article, we assess to what extent GP still performs favorably at feature construction when constructing features that are (1) of small-enough number, to enable visualization of the behavior of the ML model; (2) of small-enough size, to enable interpretability of the features themselves; (3) of sufficient informative power, to retain or even improve the performance of the ML algorithm. We consider a simple feature construction scheme using three different GP algorithms, as well as random search, to evolve features for five ML algorithms, including support vector machines and random forest. Our results on 21 datasets pertaining to classification and regression problems show that constructing only two compact features can be sufficient to rival the use of the entire original feature set. We further find that a modern GP algorithm, GP-GOMEA, performs best overall. These results, combined with examples that we provide of readable constructed features and of 2D visualizations of ML behavior, lead us to positively conclude that GP-based feature construction still works well when explicitly searching for compact features, making it extremely helpful to explain ML models.

Introduction

Feature selection and feature construction are two important steps to improve the performance of any Machine Learning (ML) algorithm [1,2]. Feature selection is the task of excluding features that are redundant or misleading. Feature construction is the task of transforming (parts of) the original feature space into one that the ML algorithm can better exploit.

A very interesting method to perform feature construction automatically is Genetic Programming (GP) [3,4]. GP can synthesize functions without many prior assumptions on their form, unlike, e.g., logistic regression or regression splines [5,6]. Moreover, feature construction depends not only on the data at hand, but also on the way a specific ML algorithm can model that data. Evolutionary methods in general are highly flexible in their use due to the way they perform search (i.e., derivative-free). This makes it possible, for example, to evaluate the quality of a feature for a specific ML algorithm by directly measuring its impact on that algorithm's performance (i.e., by training and validating the ML algorithm when using that feature).
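
To make this concrete, below is a minimal sketch of such a wrapper-style fitness evaluation (not the authors' implementation; `feature_fitness` and the example `candidate` expression are hypothetical): a candidate feature is a function of the original inputs, and its fitness is the cross-validated performance of the ML algorithm trained on it.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def feature_fitness(candidate, X, y, estimator=None, cv=5):
    """Score a constructed feature by the cross-validated performance
    of the ML algorithm that uses it (hypothetical helper)."""
    if estimator is None:
        estimator = SVC()
    Z = np.asarray(candidate(X)).reshape(len(X), -1)  # constructed feature column(s)
    return cross_val_score(estimator, Z, y, cv=cv).mean()

def candidate(X):
    # Purely illustrative expression a GP run might evolve: x0 * log(1 + |x3|)
    return X[:, 0] * np.log1p(np.abs(X[:, 3]))

# fitness = feature_fitness(candidate, X, y)  # higher is better
```

Because the score is obtained by running the ML algorithm itself, no gradient of the fitness is needed, which is exactly what makes derivative-free evolutionary search applicable here.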

Explaining what constructed features mean can shed light on the behavior of ML-inferred models that use such features. Reducing the number of features is also important to improve interpretability. If the original feature space is reduced to a few constructed features (e.g., up to two for regression and up to three for classification), the function learned by the ML model can be straightforwardly visualized w.r.t. the new features. In fact, how to make ML models more understandable is a key topic of modern ML research, as many practical, sensitive applications exist where explaining (part of) the behavior of ML models is essential to trust their use (e.g., in medical applications) [[7], [8], [9], [10]]. Typically, GP for feature construction searches in a subspace of mathematical expressions. Adding to the appeal and potential of GP, these expressions can be human-interpretable if simple enough [8,11].

Fig. 1 presents an example of the potential held by such an approach: a multi-dimensional dataset transformed into a 2D one, where both the behavior of the ML algorithm and the meaning of the new features are clear, while the performance of the ML algorithm is not compromised w.r.t. the use of the original feature set (it is actually improved).
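
A visualization of this kind is simple to produce once the data is mapped to two constructed features. The sketch below (assumed code, not the paper's; it presumes a classifier `clf` already fitted on a two-column array `Z` of constructed feature values) draws the model's decision regions directly in the new space:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_decision_regions(clf, Z, y, steps=200):
    """Draw a fitted classifier's decision regions over two constructed
    features, with the (numerically labeled) data overlaid."""
    g1, g2 = np.meshgrid(
        np.linspace(Z[:, 0].min(), Z[:, 0].max(), steps),
        np.linspace(Z[:, 1].min(), Z[:, 1].max(), steps))
    pred = clf.predict(np.c_[g1.ravel(), g2.ravel()]).reshape(g1.shape)
    plt.contourf(g1, g2, pred, alpha=0.3)               # model behavior
    plt.scatter(Z[:, 0], Z[:, 1], c=y, edgecolor='k')   # data in the new space
    plt.xlabel('constructed feature z1')
    plt.ylabel('constructed feature z2')
    plt.show()
```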

In this article we study whether GP can be useful to construct a low number of small features, to increase the chance of obtaining interpretable ML models without compromising their accuracy (compared to using the original feature set). To this end, we design a simple, iterative feature construction scheme, and perform a wide set of experiments: we consider four feature construction methods (three GP algorithms and random search) and five ML algorithms. We apply their combinations to 21 classification and regression datasets to determine to what extent they are capable of effectively and efficiently finding crucial and compact features for specific ML algorithms.

The main original scientific contribution of this work is an investigation of whether GP can be used to construct features that are:

  • Of small-enough number, to enable visualization of the behavior of the ML model;

  • Of small-enough size, to enable interpretability of the features themselves;

  • Of sufficient informative power, to retain or even improve the performance of the ML algorithm, compared to using the original feature set.

These aspects are assessed under different circumstances:

  • We test different search algorithms, including modern model-based GP and random search;

  • We test different ML algorithms.

The remainder of this article is organized as follows. Related work is reported in Section 2. The proposed feature construction scheme is presented in Section 3. The search algorithms to construct features, as well as the considered ML algorithms, are presented in Section 4. The experimental setup is described in Section 5. Results related to performance are reported in Sections 6 and 7, while results concerning interpretability are reported in Section 8. Section 9 presents typical running times. Section 10 discusses our findings, and Section 11 concludes this article.

Section snippets

Related work

In this article, we consider GP for feature construction to achieve more explainable ML models. Different forms of GP to obtain explainable ML have been explored in the literature, but they do not necessarily leverage feature construction. For example, Ref. [12] introduced a form of GP for the automatic synthesis of interpretable classifiers, generated from scratch as self-contained ML models, made of IF-THEN rules. A very different paradigm for explainable ML by GP is considered in Ref. [13], …

Iterative evolutionary feature construction

We use a remarkably simple scheme to construct features. Our approach constructs K ∈ ℕ⁺ features by iterating K GP runs. The evolution of the k-th feature (k ∈ {1, …, K}) uses the previously constructed k − 1 features.
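
In pseudocode-like Python, the scheme reduces to a single loop (a minimal sketch; `run_gp` is a hypothetical routine standing in for one GP run that returns the best new feature found given the ones constructed so far):

```python
def construct_features(X, y, K, run_gp):
    """Iteratively evolve K features; the k-th run can reuse and be
    evaluated together with the k-1 previously constructed features."""
    constructed = []
    for k in range(K):
        new_feature = run_gp(X, y, previous=constructed)
        constructed.append(new_feature)
    return constructed

# The new dataset is then the K constructed columns, e.g.:
# Z = np.column_stack([f(X) for f in constructed])
```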

Considered search algorithms and machine learning algorithms

We consider Standard GP (SGP), Random Search (RS), and the GP version of the Gene-pool Optimal Mixing Evolutionary Algorithm (GP-GOMEA) as competing search algorithms to construct features. SGP is widely used in feature construction (see related work in Sec. 2). RS is not typically considered, yet we believe it is important to assess whether evolution brings any benefit over random enumeration within the confines of our study, i.e., when the search is forced to find small features. GP-GOMEA is a recently introduced …

Experiments

We perform 30 runs of our Feature Construction Scheme (FCS), with SGP, SGPb, RS, and GP-GOMEART, in combination with each ML algorithm (NB only for classification and LR only for regression), on each problem. Each run of the FCS uses a random train-test split of 80%–20%, and considers up to K = 5 feature construction rounds. We use a population size of 100 for the search algorithms, and assign a maximum budget of 10,000 function evaluations to each FCS iteration. This results in relatively …
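
For concreteness, the protocol can be summarized by the following sketch (hypothetical `search_algorithm` argument; the constants are taken from the description above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

N_RUNS, K, POP_SIZE, BUDGET = 30, 5, 100, 10_000  # values from the text

def run_experiment(X, y, search_algorithm, estimator):
    """One FCS experiment: 30 repetitions, each on a fresh 80%-20% split."""
    test_scores = []
    for seed in range(N_RUNS):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        features = search_algorithm(X_tr, y_tr, K=K,
                                    pop_size=POP_SIZE, budget=BUDGET)
        Z_tr = np.column_stack([f(X_tr) for f in features])
        Z_te = np.column_stack([f(X_te) for f in features])
        estimator.fit(Z_tr, y_tr)
        test_scores.append(estimator.score(Z_te, y_te))
    return np.mean(test_scores)
```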

Results: performance on traditional datasets

The results described in this section aim at assessing whether it is possible to construct few and small features that lead to an equal or better performance than the original set, and whether some search algorithms can construct better features than others.

Results: performance on a highly-dimensional dataset

We further consider the RNA-Seq cancer gene expression dataset, comparing FCS by GP-GOMEART with h = 4 against the use of the original feature set, when using NB. Fig. 7 shows that NB with the original feature set overfits: the training performance is maximal, while the test performance reaches an F1 of approximately 0.65. Even though NB is typically considered a weak estimator, the system described by the data is so severely underdetermined (over 20,000 features vs. fewer than 1,000 examples) …

Results: improving interpretability

The results presented in Sections 6 and 7 showed that the original feature set can already be outperformed by two small constructed features in many cases. We now aim at assessing whether constraining feature size can enable interpretability of the features themselves, as well as whether extra insight can be achieved by plotting and visualizing the behavior of a trained ML model in the new two-dimensional …

Running time

Our results are made possible by evaluating the fitness of constructed features with cross-validation, a procedure that is particularly expensive. Table 6 shows the (mean over 30 runs) serial running time to construct five features on the smallest and largest classification and regression datasets, using GP-GOMEART with h = 4 and the parameter settings of Sec. 5, on a relatively old AMD Opteron 6386 SE processor. Running time has a large …

Discussion

We believe this is one of the few works on evolutionary feature construction where the focus is put both on improving the performance of an ML algorithm and on human interpretability at the same time. The interpretability we aimed for is twofold: understanding the meaning of the features themselves, as well as reducing their number. GP algorithms are key, as they can provide constructed features as interpretable expressions given basic functional components, and a complexity limit (e.g., tree …

Conclusion

With a simple evolutionary feature construction framework we have studied the feasibility of constructing few crucial and compact features with Genetic Programming (GP), towards improving the explainability of Machine Learning (ML) models without losing prediction accuracy. Within the proposed framework, we compared standard GP, random search, and the GP version of the Gene-pool Optimal Mixing Evolutionary Algorithm (GP-GOMEA) as feature constructors, and found that GP-GOMEA is overall …

Acknowledgments

The authors acknowledge the Kinderen Kankervrij foundation for financial support (project #187). The majority of the computations for this work were performed on the Lisa Compute Cluster with the support of SURFsara.

References (44)

  • Z.C. Lipton, The mythos of model interpretability, Queue (2018)
  • R. Guidotti et al., A survey of methods for explaining black box models, ACM Comput. Surv. (2018)
  • A. Adadi et al., Peeking inside the black-box: a survey on explainable artificial intelligence (XAI), IEEE Access (2018)
  • B. Goodman et al., European Union regulations on algorithmic decision-making and a “right to explanation”, AI Mag. (2017)
  • M. Virgolin, T. Alderliesten, C. Witteveen, P.A.N. Bosman, Improving model-based genetic programming for symbolic...
  • B.P. Evans et al., What's inside the black-box?: a genetic programming method for interpreting complex machine learning models
  • B. Xue et al., A survey on evolutionary computation approaches to feature selection, IEEE Trans. Evol. Comput. (2016)
  • K. Krawiec, Genetic programming-based construction of features for machine learning and knowledge discovery tasks, Genet. Program. Evolvable Mach. (2002)
  • L. Breiman, Classification and Regression Trees (2017)
  • M. Muharram et al., Evolutionary constructive induction, IEEE Trans. Knowl. Data Eng. (2005)
  • B. Tran et al., Genetic programming for feature construction and selection in classification on high-dimensional data, Memetic Computing (2016)
  • N.S. Altman, An introduction to kernel and nearest-neighbor nonparametric regression, Am. Stat. (1992)