PS-Tree: A piecewise symbolic regression tree

https://doi.org/10.1016/j.swevo.2022.101061

Abstract

Symbolic methods have recently regained popularity due to their good interpretability compared with neural network-based artificial intelligence techniques. The regression tree is one such symbolic method: it divides the feature space into several subregions and builds a simple response surface model, such as a constant value or a linear model, for each subregion. However, this strategy may fail when nonlinear structures exist within the subregions. To overcome this problem, this paper proposes a new regression model, named the piecewise symbolic regression tree (PS-Tree). Instead of using constant values or linear models in the leaf nodes, PS-Tree builds a symbolic regressor for each leaf node, i.e., each subregion. In addition, we propose an adaptive space partition strategy that dynamically adjusts the partition of the feature space to alleviate problems caused by incorrect partitioning. PS-Tree is applied to 122 synthetic and real-world datasets, and the results show that it outperforms several state-of-the-art regression methods.

Introduction

Regression is a fundamental problem in machine learning. In this paper, we consider the following regression problem. Given a set of training data $D=\{(x_i, y_i)\}_{i=1}^{n}$, the goal is to find the best model $f^* \in \Omega$ that fits the training data. Using the mean squared error (MSE) as the loss function, the regression problem can be cast as the optimization problem
$$f^* = \operatorname*{arg\,min}_{f \in \Omega} \sum_{i=1}^{n} \big( f(x_i) - y_i \big)^2,$$
where $x_i$ is a feature vector, $y_i$ is the corresponding target value, $\Omega$ defines the model space, and $f$ denotes a regression model.
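To make the objective concrete, here is a minimal sketch (our illustration on toy data, assuming NumPy; not code from the paper) that evaluates the MSE of a candidate model. Note that minimizing the mean of the squared residuals has the same minimizer as minimizing their sum.

```python
import numpy as np

def mse(f, X, y):
    """Mean squared error of a candidate model f on training data (X, y)."""
    residuals = f(X) - y
    return np.mean(residuals ** 2)  # same minimizer as the sum of squares

# Hypothetical toy data generated from a known linear model plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# A candidate f close to the ground truth achieves a small loss.
candidate = lambda X: 3 * X[:, 0] - 2 * X[:, 1]
print(mse(candidate, X, y))  # roughly the noise variance, ~0.01
```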

There is a large variety of regression methods [1] available, including linear models [2], Gaussian processes [3], support vector regression [4], gradient boosting regression trees [5], and neural networks [6]. From the perspective of interpretability, we can divide these methods into two categories: black-box methods and white-box methods. A typical example of a black-box method is the deep neural network [7]. These methods have recently demonstrated good performance in areas such as image processing [8] and natural language processing [9]. However, they are also criticized for their low explainability, which poses a challenge when they are applied in critical fields such as transportation, finance, and health care [10]. As a result, some researchers resort to the second category, namely white-box techniques. Regression-tree-based methods [11] have received particular attention due to their high accuracy and interpretability in a wide range of tasks [12]. The basic idea behind regression trees is to divide the feature space into several subregions and then fit the data in each subregion with a simple model. In the classification and regression tree (CART) [13], for example, a constant value is used as the simple model in each subregion. Compared to neural networks, regression trees are easier to implement and understand because they use a set of simple models to capture the structure of the dataset. However, traditional regression trees may struggle to portray a smooth relationship between independent and dependent variables because they are, in essence, piecewise constant functions [14]. Thus, developing an explainable algorithm for solving nonlinear piecewise regression problems remains a critical issue.
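The piecewise constant behavior of CART is easy to observe in code. The following minimal sketch (using scikit-learn on synthetic data; our illustration, not the paper's code) fits a depth-3 regression tree to a smooth function and counts the distinct predicted values:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)
y = np.sin(X).ravel()  # smooth nonlinear target

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
pred = tree.predict(X)

# A depth-3 tree has at most 2**3 = 8 leaves, so the prediction takes at
# most 8 distinct values: one constant per leaf.
print(len(np.unique(pred)))  # <= 8, confirming the piecewise constant structure
```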

Evolutionary algorithms (EAs) are a type of heuristic search algorithm [15]. Given a set of candidate solutions, EAs search for the optimal solution with respect to a given objective function. Unlike traditional mathematical programming methods, EAs do not make strong assumptions about the problems to be solved. As a result, they can be used with various types of solution representations. When a solution is represented by a variable-length representation, such as dispatching rules [16], syntax trees [17], or even complex graphs [18], EAs are also referred to as genetic programming (GP) [19]. For interpretable machine learning problems, GP has been used to generate nonlinear regression models and has shown some promising results [20]. In this paper, we focus on tree-based GP (Tree-GP) [19], which uses a tree structure to represent a candidate solution. A candidate solution (tree) in Tree-GP is made up of function nodes (non-leaf nodes) and terminal nodes (leaf nodes). Function nodes include arithmetic operators $\{+, -, \dots\}$, logical operators $\{\mathrm{and}, \mathrm{or}, \dots\}$, and rules $\{\text{if-then}, \text{if-then-else}, \dots\}$. The terminal nodes are variable nodes that correspond to the training data $x$ and constant nodes $c \in \mathbb{R}$. An example of a Tree-GP solution is shown in Fig. 1. Although the structure and parameters of a Tree-GP solution can be trained concurrently, the Tree-GP evolutionary process typically consists of two stages. In the first stage, genetic operators on expression trees are used to generate new tree structures. In the second stage, numerical optimization methods such as linear scaling [21] and gradient descent [22] are used to optimize the parameters (constants) in the expression trees. Thus, Tree-GP methods can evolve a high-accuracy model while remaining interpretable [23]. However, building a high-precision model takes a long time because it typically requires GP to search for a complex tree structure in an enormous search space. Consequently, when computational resources are limited, the performance of Tree-GP methods may be improved by partitioning the feature space and evolving multiple simple models that together compose a powerful model, rather than evolving a single complex model.
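The two-stage process above can be illustrated with a small sketch (our own example, not the paper's implementation): a hand-written expression tree stands in for an evolved structure, and linear scaling fits an intercept $a$ and slope $b$ that minimize $\sum_i \big(y_i - (a + b\,f(x_i))\big)^2$ in closed form.

```python
import numpy as np

def gp_tree(X):
    """A hand-written expression tree: f(x) = x0 * x1 + x0 (function nodes: *, +)."""
    return X[:, 0] * X[:, 1] + X[:, 0]

def linear_scale(f_out, y):
    """Closed-form intercept/slope aligning raw GP outputs with the targets."""
    b = np.cov(f_out, y, ddof=1)[0, 1] / np.var(f_out, ddof=1)
    a = np.mean(y) - b * np.mean(f_out)
    return a, b

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(50, 2))
y = 5 * gp_tree(X) + 2  # a scaled and shifted version of the tree's output

a, b = linear_scale(gp_tree(X), y)
print(a, b)  # recovers the constants: a = 2, b = 5 (up to floating point)
```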

In light of CART's shortcomings, machine learning practitioners have proposed numerous approaches that replace the simple models in decision tree leaf nodes with more complex ones. The most well-known example is the piecewise linear regression tree (PL-Tree) [24]. As shown in Fig. 1, a PL-Tree employs a linear model rather than a constant value as the leaf node model. The concept of the PL-Tree originated with an algorithm known as M5 [14], which was followed by several variants such as M5′ [25] and the gradient boosted PL-Tree [26]. Although the PL-Tree has high interpretability and accuracy, it may have difficulty learning the relationships within subregions when the training dataset exhibits complex characteristics such as high nonlinearity. Some researchers have attempted to incorporate kernel regressors into the PL-Tree to capture nonlinearity in the data, but at the cost of either training speed or interpretability [27]. Hence, designing an appropriate tree structure that balances accuracy and interpretability remains a challenge. GP may be well suited to this scenario. Unlike other interpretable machine learning methods, GP employs EAs to find the optimal symbolic model that fits the training data [28]. Thus, it may be able to discover a non-linear model with strong predictive performance and good interpretability.
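The PL-Tree idea can be sketched in a few lines (assuming scikit-learn; this is our illustration of the concept, not the M5 algorithm itself): partition the space with a shallow CART, then fit one linear model per leaf.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(300, 1))
y = np.where(X[:, 0] < 0, 1 - 2 * X[:, 0], 3 * X[:, 0])  # piecewise linear target

# Upper level: CART partitions the input space.
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
leaves = tree.apply(X)  # leaf id of each training point

# Lower level: one linear model per leaf instead of a leaf-wise constant.
leaf_models = {leaf: LinearRegression().fit(X[leaves == leaf], y[leaves == leaf])
               for leaf in np.unique(leaves)}

def predict(X_new):
    ids = tree.apply(X_new)
    return np.array([leaf_models[i].predict(row.reshape(1, -1))[0]
                     for i, row in zip(ids, X_new)])

# Residuals are small wherever a leaf aligns with a single linear piece.
print(np.max(np.abs(predict(X) - y)))
```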

From the GP perspective, owing to the high demand for interpretable models, GP-based piecewise symbolic regression techniques have also received attention in recent years. GPMCC [29] (Genetic Program for the Mining of Continuous-valued Classes) evolves a decision tree using GP, with symbolic models produced by GASOPE [30] as leaf nodes. However, GPMCC performs worse than Cubist and has been criticized in several respects, including being overly complex [31] and prone to overfitting [32]. Another type of piecewise symbolic regression is clustered symbolic regression (CSR) [33], which optimizes piecewise symbolic regressors within an expectation-maximization (EM) framework. Nonetheless, CSR focuses on modeling dynamical systems rather than solving traditional regression problems.

Taking into account the benefits and drawbacks of both CART and GP, this paper proposes a new regression method called the piecewise symbolic regression tree (PS-Tree). The basic idea is to divide the feature space into several subregions with CART and then use GP and ridge regression to construct a simple regression model for each subregion. Thus, in terms of representation, a PS-Tree is made up of two parts: the upper level is a decision tree, and the lower level is a set of symbolic regressors that act as local models; see Fig. 1 for an example. In terms of the training algorithm, we use a classification tree to learn the most appropriate data assignment scheme for each partition on the fly, and we evolve a set of GP trees as expressive non-linear features for constructing local ridge regression models in all subregions. From the perspective of evolutionary computation, this training process can also be described as evolving a set of non-linear features to enhance the existing PL-Tree with GP. Specifically, in the evolutionary process, each GP individual represents a feature. To drive the GP algorithm to discover more powerful features, we use the feature importance of each GP individual in the local models as its fitness values. To save computational resources, and inspired by related work in representation learning, we build every local ridge regression model on the same set of GP individuals. Our goal is thus to optimize a set of features that performs well on all subregions, and we convert this into a multi-objective optimization problem. Based on this design, we are able to discover a powerful PS-Tree by running the evolutionary algorithm only once; a minimal sketch of this training procedure follows the contribution list below. The following are the main contributions of our work:

  • We propose a new regression model, named PS-Tree. It divides the feature space into several subregions with the CART algorithm and uses the tree-based GP algorithm to evolve a set of non-linear features for constructing non-linear models in subspaces. As a result, PS-Tree is able to achieve high predictive performance while retaining a high level of interpretability.

  • The GP-based feature construction problem for non-linear local models is transformed into a multi-objective optimization problem. By doing so, we can obtain a set of important features across all subregions at the same time. The results of the experiments show that constructing features in a multi-objective paradigm is preferable to random search.

  • Given that the initial space partition paradigm may be incorrect, we propose an adaptive approach for dynamically adjusting the partition scheme based on runtime information.
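The following is a minimal conceptual sketch (our own illustration using scikit-learn, not the authors' implementation) of the evaluation step outlined above: a single shared set of constructed features, one ridge model per subregion, and a per-region importance vector that serves as each feature's multi-objective fitness. The hand-written feature functions stand in for evolved GP individuals, and a plain regression tree stands in for the paper's adaptive partitioning.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(400, 2))
y = np.where(X[:, 0] < 0, np.sin(3 * X[:, 1]), X[:, 0] * X[:, 1])

# Stand-ins for GP individuals: each "individual" is one constructed feature.
features = [lambda X: np.sin(3 * X[:, 1]),
            lambda X: X[:, 0] * X[:, 1],
            lambda X: X[:, 0] ** 2]
Phi = np.column_stack([f(X) for f in features])

# Upper level: a tree partitions the feature space into subregions.
partition = DecisionTreeRegressor(max_depth=2).fit(X, y)
regions = partition.apply(X)

# Lower level: one ridge model per subregion over the shared features; the
# absolute coefficients give each feature one fitness value per region.
fitness = {}
for r in np.unique(regions):
    mask = regions == r
    model = Ridge(alpha=1e-3).fit(Phi[mask], y[mask])
    fitness[r] = np.abs(model.coef_)  # feature importance in region r

# Each GP individual's fitness is a vector (one objective per subregion).
print(np.column_stack([fitness[r] for r in sorted(fitness)]))
```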

The rest of this article is structured as follows. Section 2 illustrates the motivation for our work with a real-world example. Section 3 then presents the proposed algorithm. Experimental results are reported in Section 4. Finally, Section 5 concludes the paper and suggests some future directions.


Adaptive non-linear piecewise regression

Non-linear piecewise regression is used extensively in metrology [34], physics [35], and geology [36]. However, such real-world problems also present challenges for current regression methods, which we demonstrate using the preliminary reference earth model (PREM) [36].

PREM is a classical model designed by experts based on observatory data, and it provides basic earth properties, such as gravity and density, as functions of depth in the geological domain. We uniformly sample a set of data points from this model to form a regression task.

Piecewise symbolic regression tree

In the previous section, we discussed why it is necessary to develop an automated piecewise non-linear regression method. Following this idea, this section proposes the piecewise symbolic regression tree, named PS-Tree.

Experimental study

This section studies the performance of PS-Tree. We use all datasets from the PMLB benchmark suite [46], which contains 122 regression datasets, to test the effectiveness of our algorithm in various scenarios. These datasets are characterized by the number of features and instances in Fig. 6. The smallest dataset in PMLB has 47 instances, while the largest has 1,025,010 instances; the number of features per dataset likewise varies over a wide range.
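For reference, here is a minimal sketch (assuming the `pmlb` Python package; the dataset name is just an example, not one highlighted by the paper) of fetching one of the benchmark regression datasets:

```python
# Fetch a PMLB regression dataset as (features, target) arrays.
from pmlb import fetch_data

X, y = fetch_data('529_pollen', return_X_y=True)
print(X.shape, y.shape)
```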

Conclusion

In this paper, we propose PS-Tree, a novel machine learning algorithm that combines the benefits of decision trees and genetic programming. Like a decision tree, PS-Tree divides the feature space into several subregions. Rather than fitting a constant or linear model in each region, we evolve a set of shared non-linear features and form a set of non-linear models to capture the complex relationships between variables in the different regions. We convert the underlying representation learning problem into a multi-objective optimization problem, so that a single evolutionary run yields features that are important across all subregions.

CRediT authorship contribution statement

Hengzhe Zhang: Investigation, Writing – original draft. Aimin Zhou: Methodology, Writing – review & editing. Hong Qian: Data curation. Hu Zhang: Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work is supported by the Science and Technology Commission of Shanghai Municipality under Grant No. 19511120601, the Scientific and Technological Innovation 2030 Major Projects under Grant No. 2018AAA0100902, and the National Natural Science Foundation of China under Grant Nos. 61731009 and 61907015.

References (59)

  • W. La Cava et al., Multidimensional genetic programming for multiclass classification, Swarm Evol. Comput. (2019)

  • J. Friedman et al., The Elements of Statistical Learning (2001)

  • J.A. Nelder et al., Generalized linear models, J. R. Stat. Soc. Ser. A (1972)

  • C.K.I. Williams et al., Gaussian processes for regression

  • H. Drucker et al., Support vector regression machines, Adv. Neural Inf. Process. Syst. (1996)

  • J.H. Friedman, Greedy function approximation: a gradient boosting machine, Ann. Stat. (2001)

  • D.F. Specht, A general regression neural network, IEEE Trans. Neural Netw. (1991)

  • I. Goodfellow et al., Deep Learning (2016)

  • K. He et al., Delving deep into rectifiers: surpassing human-level performance on ImageNet classification, Proceedings of the IEEE International Conference on Computer Vision (2015)

  • J. Devlin et al., BERT: pre-training of deep bidirectional transformers for language understanding

  • C. Rudin, Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead, Nat. Mach. Intell. (2019)

  • G. Ke et al., LightGBM: a highly efficient gradient boosting decision tree, Advances in Neural Information Processing Systems (2017)

  • B.P. Evans et al., What’s inside the black-box? A genetic programming method for interpreting complex machine learning models, Proceedings of the Genetic and Evolutionary Computation Conference (2019)

  • L. Breiman et al., Classification and Regression Trees (1984)

  • J.R. Quinlan, Learning with continuous classes, 5th Australian Joint Conference on Artificial Intelligence (1992)

  • A.E. Eiben et al., What is an evolutionary algorithm?, Introduction to Evolutionary Computing (2015)

  • Y. Mei et al., An efficient feature selection algorithm for evolving job shop scheduling rules with genetic programming, IEEE Trans. Emerg. Top. Comput. Intell. (2017)

  • Y. Yuan et al., ARJA: automated repair of Java programs via multi-objective genetic programming, IEEE Trans. Softw. Eng. (2020)

  • M. Suganuma et al., Evolution of deep convolutional neural networks using Cartesian genetic programming, Evol. Comput. (2020)