Elsevier

Information Sciences

Volume 181, Issue 5, 1 March 2011, Pages 954-971
Evolutionary model trees for handling continuous classes in machine learning

https://doi.org/10.1016/j.ins.2010.11.010

Abstract

Model trees are a particular case of decision trees employed to solve regression problems. They have the advantage of presenting an interpretable output, which helps the end-user gain confidence in the prediction and provides the basis for new insight about the data, confirming or rejecting hypotheses previously formed. Moreover, model trees present an acceptable level of predictive performance in comparison to most techniques used for solving regression problems. Since generating the optimal model tree is an NP-Complete problem, traditional model tree induction algorithms make use of a greedy top-down divide-and-conquer strategy, which may not converge to the globally optimal solution. In this paper, we propose a novel algorithm based on the evolutionary algorithms paradigm as an alternative heuristic to generate model trees in order to improve the convergence to globally near-optimal solutions. We call our new approach evolutionary model tree induction (E-Motion). We test its predictive performance using public UCI data sets, and we compare the results to traditional greedy regression/model tree induction algorithms, as well as to other evolutionary approaches. Results show that our method presents a good trade-off between predictive performance and model comprehensibility, which may be crucial in many machine learning applications.

Introduction

Model trees are a popular alternative to classical regression methods, presenting good predictive performance and an intuitive interpretable output. Similar to decision/regression trees, they are structured trees that represent graphically if–then–else rules, which seek to extract implicit knowledge from data sets. While decision trees are used to solve classification problems (i.e., the output is a nominal value), both model and regression trees are used to solve regression problems (i.e., the output is a continuous value). The main difference between the model trees and the regression trees is that whereas regression trees have a single value as the output in their leaves, model trees hold linear models used for calculating the final output.

A model tree is composed of non-terminal nodes, each one representing a test over a data set attribute, and linking edges that partition the data according to the test result. At the bottom of the tree, the terminal nodes hold linear regression models built from the data that reached each given node. Thus, to predict the target-attribute value for a given data set instance, we follow the tree down from the root node until a terminal node is reached, and then we apply the corresponding linear model.
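The prediction procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class names `Leaf` and `Split`, the numeric `<=` threshold test, and the example coefficients are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class Leaf:
    """Terminal node holding a linear model: intercept + sum(w_i * x_i)."""
    intercept: float
    weights: Sequence[float]

    def predict(self, x: Sequence[float]) -> float:
        return self.intercept + sum(w * v for w, v in zip(self.weights, x))


@dataclass
class Split:
    """Non-terminal node testing one attribute against a threshold (assumed test form)."""
    attribute: int
    threshold: float
    left: "Node"   # followed when x[attribute] <= threshold
    right: "Node"  # followed otherwise


Node = Union[Leaf, Split]


def predict(node: Node, x: Sequence[float]) -> float:
    """Walk from the root to a leaf, then apply that leaf's linear model."""
    while isinstance(node, Split):
        node = node.left if x[node.attribute] <= node.threshold else node.right
    return node.predict(x)


# Hypothetical two-leaf tree over two numeric attributes.
tree = Split(
    attribute=0,
    threshold=5.0,
    left=Leaf(intercept=1.0, weights=[2.0, 0.0]),
    right=Leaf(intercept=10.0, weights=[0.0, -1.0]),
)
```

With this tree, the instance `[3.0, 4.0]` reaches the left leaf and is predicted as 1 + 2·3 = 7.0, while `[6.0, 4.0]` reaches the right leaf and is predicted as 10 − 4 = 6.0. A regression tree would differ only in that each leaf returns a constant.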

Model trees are traditionally induced by divide-and-conquer greedy algorithms which are sequential in nature and locally optimal at each node split [10]. Since inducing the best tree is an NP-Complete problem [32], a greedy heuristic may not derive the best tree overall. In addition, recursive partitioning iteratively degrades the quality of the data set for the purpose of statistical inference, because the more times the data is partitioned, the smaller the data sample that fits each specific split becomes, leading to results without statistical significance and to a model that overfits the training data [4].
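To make "locally optimal at each node split" concrete, the sketch below picks the single best threshold on one numeric attribute by minimizing the summed squared error of the two resulting subsets. The squared-error criterion and the function names `sse` and `best_split` are assumptions chosen for illustration; the greedy algorithms cited here may use a different split measure.

```python
from typing import Optional, Sequence, Tuple


def sse(ys: Sequence[float]) -> float:
    """Sum of squared errors around the mean (0.0 for an empty sample)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(xs: Sequence[float], ys: Sequence[float]) -> Optional[Tuple[float, float]]:
    """Return (threshold, total SSE) of the locally best binary split, or None."""
    best = None
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


# Toy data: two flat plateaus separated by a gap.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
split = best_split(xs, ys)
```

On this toy data the chosen threshold is 6.5 with zero residual error. Each child then sees only three of the six instances, and a greedy inducer would repeat the same local choice on each half without ever revisiting the first split, which is the source of both drawbacks named above.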

In order to avoid the drawbacks of the greedy tree-induction algorithms, recent works have focused on powerful ensemble methods [18], [30], which attempt to take advantage of the unstable induction of models by growing a forest of trees from the data and later averaging their predictions. While presenting very good predictive performance, ensemble methods fail to produce a single-tree solution, operating in a black-box fashion.

We highlight the importance of validation and interpretation of discovered knowledge in many data mining applications, because comprehensible models can lead to new insights and hypotheses about the data [15]. Hence, we believe there should be a trade-off between predictive performance and model comprehensibility, so that a predictive system can be useful in real-world applications.

Evolutionary algorithms (EAs) are a solid heuristic able to deal with a variety of optimization problems. An evolutionary approach for producing trees could enhance the chances of converging to global optima, avoiding solutions that get trapped in local optima and that are too sensitive to small changes in the data. Evolutionary induction of decision trees is well explored in the research community. Basgalupp et al. [3] proposed an evolutionary algorithm for the induction of decision trees named LEGAL-Tree, which looks for a good trade-off between accuracy and model comprehensibility. Very few works, however, propose evolving regression/model trees as an alternative to greedy approaches.

We propose a new algorithm based on the EAs paradigm and also on the core idea presented in LEGAL-Tree [3] to deal with regression problems. By evolving model trees with an evolutionary algorithm, we seek to avoid local-optima convergence by performing a robust global search in the space of candidate solutions. Our approach is, to the best of our knowledge, the first EA that evolves model trees with linear-regression models in their leaves. It is also the first EA that evolves model trees through distinct multi-objective optimization strategies, allowing the user to choose between them.
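One way to trade predictive performance against comprehensibility when ranking candidate trees is a lexicographic comparison, the strategy named in the title of the LEGAL-Tree reference. The sketch below is an assumption-laden illustration: this excerpt does not say which multi-objective strategies E-Motion offers, and the `Candidate` fields, the `tolerance` value, and the priority order (error first, then size) are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    error: float  # e.g. validation error; lower is better
    size: int     # number of tree nodes; lower is better


def lexicographic_better(a: Candidate, b: Candidate, tolerance: float = 0.05) -> bool:
    """True if `a` beats `b`: error decides first; size breaks near-ties."""
    if abs(a.error - b.error) > tolerance:
        return a.error < b.error
    return a.size < b.size


# Hypothetical candidates from one generation.
big = Candidate("big", error=0.50, size=41)
small = Candidate("small", error=0.52, size=9)
accurate = Candidate("accurate", error=0.30, size=60)
```

Here `small` beats `big` because their errors differ by less than the tolerance and `small` has fewer nodes, whereas `accurate` beats `small` on error alone. An alternative design would fold both objectives into a single weighted score, at the cost of having to choose the weights.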

We test the predictive performance and comprehensibility of this new algorithm using UCI regression data sets [13], and we compare the results to those produced by well-known algorithms, such as M5 [28] and REPTree [35], as well as to other evolutionary approaches. Experimentation shows that our approach presents a good trade-off between predictive performance and comprehensibility.

The remainder of this work is organized as follows. We review tree-structured approaches for mining continuous classes in Section 2. Section 3 details our novel algorithm, namely E-Motion. Sections 4 and 5 present the experiments we have executed, comparing E-Motion to greedy and evolutionary approaches, respectively. In Section 6 we discuss important issues of this work such as comprehensibility of the models produced, overfitting, local-optima avoidance and related issues. We end this work with our conclusions and future work directions in Section 7.

Tree-structured approaches for mining continuous classes

We have divided this background section on tree-structured methods for mining continuous classes into two perspectives: (i) greedy approaches, which are the most commonly employed technique for building predictive trees; and (ii) evolutionary approaches, which seek globally near-optimal solutions at the expense of computational resources and execution time.

Evolutionary model trees

Evolutionary model tree induction (E-Motion) is a novel multi-objective genetic programming (GP) algorithm for model tree induction. Each step of E-Motion is presented in this section following the natural flow of an evolutionary algorithm (Fig. 2). In addition, we describe two specific aspects of our implementation: (i) consistency check (Section 3.8), and (ii) prediction smoothing (Section 3.9).

Comparison to greedy approaches

In this section we present the comparison between E-Motion and greedy tree-induction algorithms. More specifically, we detail the data sets we have selected for the experiments, the environment we have set up to run the experiments, and the respective results.

Comparison to evolutionary approaches

We have compared E-Motion’s two approaches to the evolutionary algorithms described in Section 2.2. Because we did not have access to the source code of these algorithms, we used the same methodology and data sets described in each paper [10], [25], and we present the results reported in them.

For the comparison of E-Motion and GPMCC, the training set for each database consisted of 80% of the patterns,

Discussion

In this section we discuss important issues related to the models which are generated by E-Motion.

Conclusions and future work

Model trees are a popular alternative to classical regression methods, mainly because the models they provide resemble human reasoning. We emphasize that the comprehensibility of the discovered model is important in many applications where decisions will be made by human beings based on the discovered knowledge. Therefore, there is a clear motivation to provide model trees that are not only accurate but also relatively simple.

Traditional model tree induction algorithms which rely on a recursive

Acknowledgements

Our thanks to Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) and the European Research Consortium for Informatics and Mathematics (ERCIM) for supporting this research. We would also like to thank the anonymous reviewers who have contributed enormously to this work.

Rodrigo Coelho Barros obtained his B.Sc. in Computer Science from UFPel-RS, Brazil, in 2007; his M.Sc. in Computer Science from PUC-RS, Brazil, in 2009; and he is currently a Ph.D. student in Computer Science at University of São Paulo where he works with machine learning and data mining topics. He has published papers in peer-reviewed journals and conferences. His current research interests are data mining, knowledge discovery and biologically-inspired computational intelligence algorithms.

References (35)

  • M.P. Basgalupp et al., Legal-Tree: a lexicographic multi-objective genetic algorithm for decision tree induction
  • Å. Björck, Numerical Methods for Least Squares Problems (1996)
  • L. Breiman et al., Classification and Regression Trees (1984)
  • H. Chipman et al., Bayesian CART model search, J. Am. Stat. Assoc. (1997)
  • D. Denison et al., A Bayesian CART algorithm, Biometrika (1998)
  • G. Fan et al., Regression tree analysis using TARGET, J. Comput. Graph. Stat. (2005)
  • P. Fernandes et al., The impact of random samples in ensemble classifiers

    Duncan Dubugras Alcoba Ruiz is an associate professor at the Graduate Program on Computer Science of PUC-RS, Brazil. His research interests include: (a) information systems modeling and design, in particular workflow modeling and automation; (b) applications of database technology, in special data mining, data warehousing, temporal and active databases; and (c) business process intelligence and e-science. In 2001, he spent a sabbatical year at the College of Computing, GeorgiaTech, USA, working on self-restructuring workflow systems. He received his Ph.D. degree in Computer Science from UFRGS – Brazil in 1995, and his M.Sc. degree in 1987 from the same university.

    Márcio Porto Basgalupp is an associate professor at UNIFESP, São Paulo, Brazil. He obtained his B.Sc. in Computer Science from UFPel-RS, Brazil, in 2005; his M.Sc. in Computer Science from PUC-RS, Brazil, in 2007; his Ph.D. in Computer Science from University of São Paulo, Brazil, in 2010. He took a post-doctoral position in 2010 at NTNU, Trondheim, Norway, where he worked with bio-medical data mining. He has published papers in peer-reviewed journals and conferences. His current research interests are machine learning, data mining and bio-inspired computation.