Evolutionary model trees for handling continuous classes in machine learning

doi:10.1016/j.ins.2010.11.010

Information Sciences

Volume 181, Issue 5, 1 March 2011, Pages 954-971

https://doi.org/10.1016/j.ins.2010.11.010

Abstract

Model trees are a particular case of decision trees employed to solve regression problems. They have the advantage of presenting an interpretable output, which helps the end-user gain confidence in the prediction and provides a basis for new insight about the data, confirming or rejecting previously formed hypotheses. Moreover, model trees present an acceptable level of predictive performance in comparison to most techniques used for solving regression problems. Since generating the optimal model tree is an NP-Complete problem, traditional model tree induction algorithms use a greedy top-down divide-and-conquer strategy, which may not converge to the globally optimal solution. In this paper, we propose a novel algorithm that uses the evolutionary algorithms paradigm as an alternative heuristic for generating model trees, in order to improve convergence to globally near-optimal solutions. We call our new approach evolutionary model tree induction (E-Motion). We test its predictive performance using public UCI data sets, and we compare the results to traditional greedy regression/model tree induction algorithms, as well as to other evolutionary approaches. Results show that our method presents a good trade-off between predictive performance and model comprehensibility, which may be crucial in many machine learning applications.

Introduction

Model trees are a popular alternative to classical regression methods, presenting good predictive performance and an intuitive, interpretable output. Similar to decision/regression trees, they are structured trees that graphically represent if–then–else rules, which seek to extract implicit knowledge from data sets. While decision trees are used to solve classification problems (i.e., the output is a nominal value), both model and regression trees are used to solve regression problems (i.e., the output is a continuous value). The main difference between model trees and regression trees is that, whereas regression trees have a single value as the output in their leaves, model trees hold linear models used for calculating the final output.

A model tree is composed of non-terminal nodes, each one representing a test over a data set attribute, and linking edges that partition the data according to the test result. At the bottom of the tree, the terminal nodes hold linear regression models built from the data that reached each given node. Thus, to predict the target-attribute value for a given data set instance, we follow the tree down from the root node until a terminal node is reached, and then we apply the corresponding linear model.
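The prediction procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class names `Leaf` and `Split`, the binary `<=` test on a numeric attribute, and the toy coefficients are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class Leaf:
    """Terminal node holding a linear model: intercept + sum(coef_i * x_i)."""
    intercept: float
    coefficients: Sequence[float]

    def predict(self, instance: Sequence[float]) -> float:
        return self.intercept + sum(
            c * x for c, x in zip(self.coefficients, instance)
        )


@dataclass
class Split:
    """Non-terminal node testing `instance[attribute] <= threshold` (assumed test form)."""
    attribute: int
    threshold: float
    left: "Tree"   # followed when the test is true
    right: "Tree"  # followed when the test is false


Tree = Union[Leaf, Split]


def predict(tree: Tree, instance: Sequence[float]) -> float:
    """Follow the tests from the root down to a leaf, then apply its linear model."""
    node = tree
    while isinstance(node, Split):
        if instance[node.attribute] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.predict(instance)


# Hypothetical toy tree: one test on attribute 0, a different linear model per side.
toy_tree = Split(
    attribute=0,
    threshold=5.0,
    left=Leaf(intercept=1.0, coefficients=[2.0, 0.0]),
    right=Leaf(intercept=10.0, coefficients=[0.0, -1.0]),
)

print(predict(toy_tree, [3.0, 4.0]))  # left leaf:  1 + 2*3 + 0*4
print(predict(toy_tree, [6.0, 4.0]))  # right leaf: 10 + 0*6 - 1*4
```

A regression tree would use the same traversal, with each leaf returning a single constant instead of evaluating a linear model.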

Model trees are traditionally induced by divide-and-conquer greedy algorithms, which are sequential in nature and locally optimal at each node split [10]. Since inducing the best tree is an NP-Complete problem [32], a greedy heuristic may not derive the best tree overall. In addition, recursive partitioning iteratively degrades the quality of the data set for the purpose of statistical inference: the more times the data is partitioned, the smaller the data sample that reaches each split becomes, leading to results without statistical significance and to a model that overfits the training data [4].
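To make the greedy strategy and its shrinking samples concrete, the sketch below picks the locally best threshold on a single numeric attribute and then recurses. It is an illustration under stated assumptions: the split criterion (sum of squared errors around the mean), the `min_size` stopping rule, and the toy data are chosen for the example and are not taken from any specific algorithm cited here.

```python
from typing import List, Optional, Sequence, Tuple


def sse(values: Sequence[float]) -> float:
    """Sum of squared errors around the mean (0.0 for an empty sample)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs: Sequence[float], ys: Sequence[float]) -> Optional[Tuple[float, float]]:
    """Return (threshold, total SSE) of the locally best test `x <= threshold`."""
    best = None
    candidates = sorted(set(xs))
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best  # None when all x values are identical


def leaf_sizes(xs: Sequence[float], ys: Sequence[float], min_size: int = 2) -> List[int]:
    """Split greedily and recursively; report how many instances reach each leaf."""
    split = best_split(xs, ys) if len(xs) >= 2 * min_size else None
    if split is None:
        return [len(xs)]
    threshold, _ = split
    left = [(x, y) for x, y in zip(xs, ys) if x <= threshold]
    right = [(x, y) for x, y in zip(xs, ys) if x > threshold]
    if len(left) < min_size or len(right) < min_size:
        return [len(xs)]
    left_xs, left_ys = zip(*left)
    right_xs, right_ys = zip(*right)
    return leaf_sizes(left_xs, left_ys, min_size) + leaf_sizes(right_xs, right_ys, min_size)


# Hypothetical toy data: eight instances, one attribute.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0, 1.2, 3.0, 3.1, 7.0, 7.2, 9.0, 9.1]

print(best_split(xs, ys))
print(leaf_sizes(xs, ys))
```

Each split is chosen using only the instances that reached that node and is never revisited, and the per-leaf counts show how quickly the sample available for fitting a model shrinks.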

In order to avoid the drawbacks of the greedy tree-induction algorithms, recent works have focused on powerful ensemble methods¹ [18], [30], which attempt to take advantage of the unstable induction of models by growing a forest of trees from the data and later averaging their predictions. While presenting very good predictive performance, ensemble methods fail to produce a single-tree solution, operating in a black-box fashion.

We highlight the importance of validation and interpretation of discovered knowledge in many data mining applications, because comprehensible models can lead to new insights and hypotheses about the data [15]. Hence, we believe there should be a trade-off between predictive performance and model comprehensibility, so that a predictive system can be useful and helpful in real-world applications.

Evolutionary algorithms (EAs) are a solid heuristic able to deal with a variety of optimization problems. An evolutionary approach for producing trees could enhance the chances of converging to global optima, avoiding solutions that get trapped in local optima and that are too sensitive to small changes in the data. Evolutionary induction of decision trees is well explored in the research community. Basgalupp et al. [3] proposed an evolutionary algorithm for the induction of decision trees named LEGAL-Tree, which looks for a good trade-off between accuracy and model comprehensibility. Very few works, however, propose evolving regression/model trees as an alternative to greedy approaches.

We propose a new algorithm based on the EAs paradigm and also on the core idea presented in LEGAL-Tree [3] to deal with regression problems. By evolving model trees with an evolutionary algorithm, we seek to avoid local-optima convergence by performing a robust global search in the space of candidate solutions. Our approach is, to the best of our knowledge, the first EA that evolves model trees with linear-regression models in their leaves. It is also the first EA that evolves model trees through distinct multi-objective optimization strategies, allowing the user to choose between them.
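As an illustration of what distinct multi-objective strategies can look like when trading error against tree size, the sketch below compares candidate trees in two ways: a weighted sum of normalized objectives and a lexicographic comparison with a tolerance. The text shown here does not say which strategies E-Motion offers, so both are assumptions (the lexicographic one is suggested only by the LEGAL-Tree title in the reference list), and every name, weight, tolerance and number is hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    """Hypothetical summary of one evolved tree: an error measure and a node count."""
    name: str
    error: float
    size: int


def weighted_fitness(c: Candidate, w_error: float = 0.8, w_size: float = 0.2,
                     max_error: float = 10.0, max_size: int = 50) -> float:
    """Single score (lower is better) from normalized, weighted objectives."""
    return w_error * (c.error / max_error) + w_size * (c.size / max_size)


def lexicographic_better(a: Candidate, b: Candidate, tolerance: float = 0.1) -> bool:
    """True if `a` beats `b`: error decides unless the gap is within
    `tolerance`, in which case the smaller tree wins."""
    if abs(a.error - b.error) > tolerance:
        return a.error < b.error
    return a.size < b.size


def best_weighted(population: List[Candidate]) -> Candidate:
    return min(population, key=weighted_fitness)


def best_lexicographic(population: List[Candidate], tolerance: float = 0.1) -> Candidate:
    best = population[0]
    for candidate in population[1:]:
        if lexicographic_better(candidate, best, tolerance):
            best = candidate
    return best


# Hypothetical population: A is slightly more accurate than B but far larger.
population = [
    Candidate("A", error=2.00, size=41),
    Candidate("B", error=2.05, size=9),
    Candidate("C", error=3.50, size=5),
]

print(best_weighted(population).name)
print(best_lexicographic(population).name)
print(best_lexicographic(population, tolerance=0.0).name)
```

With a zero tolerance the lexicographic comparison reduces to picking the lowest error regardless of size, which is why the tolerance is what lets comprehensibility influence the choice.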

We test the predictive performance and comprehensibility of this new algorithm using UCI regression data sets [13], and we compare the results to those produced by well-known algorithms, such as M5 [28] and REPTree [35], as well as to other evolutionary approaches. Experimentation shows that our approach presents a good trade-off between predictive performance and comprehensibility.

The remainder of this work is organized as follows. We review tree-structured approaches for mining continuous classes in Section 2. Section 3 details our novel algorithm, namely E-Motion. Sections 4 and 5 present the experiments we have executed, comparing E-Motion to greedy and evolutionary approaches, respectively. In Section 6 we discuss important issues of this work, such as comprehensibility of the models produced, overfitting, local-optima avoidance and related issues. We end this work with our conclusions and future work directions in Section 7.

Section snippets

Tree-structured approaches for mining continuous classes

We have divided this background section on tree-structured methods for mining continuous classes into two perspectives: (i) greedy approaches, which are the most commonly employed techniques for building predictive trees; and (ii) evolutionary approaches, which seek globally near-optimal solutions at the expense of computational resources and execution time.

Evolutionary model trees

Evolutionary model tree induction (E-Motion) is a novel multi-objective genetic programming (GP) algorithm for model tree induction. Each step of E-Motion is presented in this section following the natural flow of an evolutionary algorithm (Fig. 2). In addition, we describe two specific aspects of our implementation: (i) consistency check (Section 3.8), and (ii) prediction smoothing (Section 3.9).

Comparison to greedy approaches

In this section we present the comparison between E-Motion and greedy tree-induction algorithms. More specifically, we detail the data sets we have selected for the experiments, the environment we have set up to run the experiments, and the respective results.

Comparison to evolutionary approaches

We have compared E-Motion’s two approaches to the evolutionary algorithms described in Section 2.2. Because we did not have access to the source code of these algorithms,³ we used the same methodology and data sets described in each paper [10], [25], and we present the results reported in them.

For the comparison of E-Motion and GPMCC, the training set for each database consisted of 80% of the patterns,

Discussion

In this section we discuss important issues related to the models which are generated by E-Motion.

Conclusions and future work

Model trees are a popular alternative to classical regression methods, mainly because the models they provide resemble human reasoning. We emphasize that the comprehensibility of the discovered model is important in many applications where decisions will be made by human beings based on the discovered knowledge. Therefore, there is a clear motivation to provide model trees that are not only accurate but also relatively simple.

Traditional model tree induction algorithms which rely on a recursive

Acknowledgements

Our thanks to Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) and the European Research Consortium for Informatics and Mathematics (ERCIM) for supporting this research. We would also like to thank the anonymous reviewers who have contributed enormously to this work.

Rodrigo Coelho Barros obtained his B.Sc. in Computer Science from UFPel-RS, Brazil, in 2007; his M.Sc. in Computer Science from PUC-RS, Brazil, in 2009; and he is currently a Ph.D. student in Computer Science at University of São Paulo where he works with machine learning and data mining topics. He has published papers in peer-reviewed journals and conferences. His current research interests are data mining, knowledge discovery and biologically-inspired computational intelligence algorithms.

References (35)

C. Burgess et al., Can genetic programming improve software effort estimation? A comparative evaluation, Inf. Softw. Technol. (2001)

E. Fernandez et al., Increasing selective pressure towards the best compromise in evolutionary multiobjective optimization: the extended NOSGA method, Inf. Sci. (2011)

J. Gray et al., Classification tree analysis using TARGET, Comput. Stat. Data Anal. (2008)

I. Partalas et al., Greedy regression ensemble selection: theory and an application to water quality prediction, Inf. Sci. (2008)

G. Potgieter et al., Genetic algorithms for the structural optimisation of learned polynomial expressions, Appl. Math. Comput. (2007)

G. Potgieter et al., Evolving model trees for mining data sets with continuous-valued classes, Expert Syst. Appl. (2008)

A. Tsakonas, A comparison of classification accuracy of four genetic programming-evolved intelligent structures, Inf. Sci. (2006)

W. Banzhaf et al., Genetic Programming: An Introduction: On the Automatic Evolution of Computer Programs and Its Applications (The Morgan Kaufmann Series in Artificial Intelligence) (1997)

R.C. Barros et al., Evolutionary model tree induction

M. Basgalupp et al., Lexicographic multi-objective evolutionary induction of decision trees, Int. J. Bio-Insp. Comput. (2009)

M.P. Basgalupp et al., LEGAL-Tree: a lexicographic multi-objective genetic algorithm for decision tree induction

Å. Björck, Numerical Methods for Least Squares Problems (1996)

L. Breiman et al., Classification and Regression Trees (1984)

H. Chipman et al., Bayesian CART model search, J. Am. Stat. Assoc. (1997)

D. Denison et al., A Bayesian CART algorithm, Biometrika (1998)

G. Fan et al., Regression tree analysis using TARGET, J. Comput. Graph. Stat. (2005)

P. Fernandes et al., The impact of random samples in ensemble classifiers

Duncan Dubugras Alcoba Ruiz is an associate professor at the Graduate Program in Computer Science of PUC-RS, Brazil. His research interests include: (a) information systems modeling and design, in particular workflow modeling and automation; (b) applications of database technology, especially data mining, data warehousing, temporal and active databases; and (c) business process intelligence and e-science. In 2001, he spent a sabbatical year at the College of Computing, GeorgiaTech, USA, working on self-restructuring workflow systems. He received his Ph.D. degree in Computer Science from UFRGS, Brazil, in 1995, and his M.Sc. degree in 1987 from the same university.

Márcio Porto Basgalupp is an associate professor at UNIFESP, São Paulo, Brazil. He obtained his B.Sc. in Computer Science from UFPel-RS, Brazil, in 2005; his M.Sc. in Computer Science from PUC-RS, Brazil, in 2007; his Ph.D. in Computer Science from University of São Paulo, Brazil, in 2010. He took a post-doctoral position in 2010 at NTNU, Trondheim, Norway, where he worked with bio-medical data mining. He has published papers in peer-reviewed journals and conferences. His current research interests are machine learning, data mining and bio-inspired computation.
