Elsevier

Neurocomputing

Volume 432, 7 April 2021, Pages 275-287

Performing multi-target regression via gene expression programming-based ensemble models

https://doi.org/10.1016/j.neucom.2020.12.060

Highlights

  • Three multi-target regression ensemble models with different architectures.

  • Use gene-expression programming to build each member of the ensemble.

  • Individuals encode a full solution to the problem, using as many genes as targets.

  • Competitive results against 5 state-of-the-art methods over 18 datasets.

Abstract

The multi-target regression (MTR) problem comprises the prediction of multiple continuous variables given a common set of input features, unlike traditional regression tasks, where just one output target is available. There are two major challenges when addressing this problem, namely the exploration of the inter-target dependencies and the modeling of complex input–output relationships. This work proposes a Symbolic Regression method, called GEPMTR, that follows the Gene Expression Programming paradigm to solve the multi-target regression problem. It evolves a population of individuals, where each one represents a complete solution to the problem by using a multi-genic chromosome, and encodes a mathematical function for each target variable involving the input attributes. The proposed model can estimate the inter-target dependencies by applying some genetic operators. Furthermore, three ensemble-based methods are developed to better exploit the inter-target and input–output relationships. The effectiveness of the proposals is analyzed through an extensive experimental study on 18 datasets. The codification schema and the process followed to ensure a diverse population in GEPMTR yield an effective method for solving the MTR problem. Furthermore, the EGEPMTR-B ensemble method obtained the best performance across all proposed models, being the best in 8 out of 11 cases, demonstrating that more sophisticated mechanisms were not needed to ensure that the GEPMTR method properly models the existing inter-target dependencies. Finally, the experimental results also showed that the proposed approach attains competitive results compared to the state of the art, showing the potential of this research line for effectively solving the MTR problem.

Introduction

In the last two decades, the study of problems where samples are associated with several output variables at the same time has gained a lot of attention in the machine learning community. Multi-target regression (henceforth, MTR) is one of these problems, which comprises the prediction of multiple continuous target variables using a common set of input variables [1]. The great interest in studying such multi-output problems has been mainly due to the high number of real-world applications that can be analyzed under this framework [2]. For example, MTR has been successfully applied to ecological modeling [3], signal processing [4], and energy efficiency [5].

Two major challenges appear when solving MTR: the modeling of existing dependencies between target variables, and the estimation of nonlinear relationships that may exist between the input and output spaces of a problem [6], [7]. To date, many methods have been proposed to study MTR [8], such as algorithms inspired by multi-label learning approaches [10], statistical methods [11], regression trees [3], and Support Vector Machines (SVMs) [12]. However, despite the considerable number of existing contributions, there are few solutions tackling the two aforementioned challenges simultaneously [6].

Recently, ensemble-based methods that decompose an MTR problem into several single-target regression tasks have shown their ability to partially capture inter-target relationships [10], [1]. These methods are also able to exploit some input–output relationships by expanding the input space with target variables. The main assumption behind these ensemble methods is that combining several simpler regressors lowers the probability of poor predictions [13]. However, it should be stressed that they largely depend on the capacity of the internal regressor used to solve each single-target regression task. In this sense, Reyes et al. [1] demonstrated that ensemble models composed of members handling the MTR problem directly (i.e. each member of the ensemble directly handles multi-target data) are effective, being able to better model the inter-target and input–output relationships. The main challenges of this last approach are, however, first to select as a member of the ensemble a regressor that is able to effectively solve the MTR problem on its own, and second to develop an adequate ensemble schema and aggregation strategy in order to effectively combine all the predictions.
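As an illustration of what expanding the input space with target variables can look like, the following minimal Python sketch trains one single-target regressor per target in a chain, appending each earlier target to the inputs of the next regressor. It is not the implementation of any method cited above: the least-squares base learner, the fixed target order, the synthetic data, and the names fit_chain and predict_chain are all assumptions chosen to keep the example short.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with a bias term (illustrative base regressor)."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def predict_linear(w, X):
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    return A @ w

def fit_chain(X, Y):
    """Train one regressor per target; target j sees X plus targets 0..j-1."""
    models, X_aug = [], X
    for j in range(Y.shape[1]):
        models.append(fit_linear(X_aug, Y[:, j]))
        # During training, the true target values expand the input space.
        X_aug = np.hstack([X_aug, Y[:, [j]]])
    return models

def predict_chain(models, X):
    """At prediction time, earlier predictions replace the unknown targets."""
    preds, X_aug = [], X
    for w in models:
        p = predict_linear(w, X_aug)
        preds.append(p)
        X_aug = np.hstack([X_aug, p[:, None]])
    return np.column_stack(preds)

# Synthetic data (assumption): the second target depends on the first one.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y1 = 2.0 * X[:, 0] + 1.0
y2 = 3.0 * y1 - X[:, 1]
Y = np.column_stack([y1, y2])

models = fit_chain(X, Y)
pred = predict_chain(models, X)
```

Because the second regressor receives the first target as an extra feature, any dependency between the two targets is available to it directly. With a stronger base regressor than the linear one assumed here, the same loop would also capture nonlinear input–output relationships.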

As for developing an effective regressor that is able to directly tackle MTR, it is interesting to analyze one of the most popular frameworks that has been widely used to solve regression problems, the so-called Symbolic Regression (hereafter, SR) [14]. The aim of SR is to find a mathematical expression that best fits a given dataset [15], and it is commonly studied by applying evolutionary algorithms, in particular Genetic Programming (GP) [16] and Gene Expression Programming (hereafter, GEP) [15] methods. In recent years, GEP-based SR methods have increased in popularity [17], since they leverage the benefits of the Genetic Algorithm (GA) and GP paradigms. This type of method uses a population of individuals, selects parents according to their fitness, and evolves the population using genetic operators. The main difference, however, is that individuals are encoded as linear strings of fixed length (as in GAs), but are then expressed as trees of different sizes and shapes (as in GP) [18]. To date, the existing GEP-based SR methods have been restricted to solving single-target regression problems. However, applying the GEP-based SR approach to MTR would give the opportunity to obtain more interpretable models than the black-box ones obtained with some popular learning algorithms, such as SVMs [19]. Furthermore, this technique does not require a particular model form to be specified in advance; instead, the expressions are built by combining mathematical blocks, thus providing an implicit feature selection process and detection of complex data relationships.
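To make the genotype/phenotype distinction concrete, the following Python sketch decodes a fixed-length linear gene into an expression tree by reading its symbols breadth-first, in the style of the Karva notation commonly used in GEP. It is a simplified illustration and not the authors' implementation: the function set, the single-character terminal names a, b, c, and the helper names decode and evaluate are assumptions.

```python
import operator

# Function set: symbol -> (arity, implementation). Chosen for illustration only.
FUNCTIONS = {
    "+": (2, operator.add),
    "-": (2, operator.sub),
    "*": (2, operator.mul),
}

def decode(gene):
    """Build a nested [symbol, children] tree from a linear gene, level by level."""
    root = [gene[0], []]
    queue = [root]
    pos = 1  # next unread symbol in the linear string
    while queue:
        node = queue.pop(0)
        arity = FUNCTIONS[node[0]][0] if node[0] in FUNCTIONS else 0
        for _ in range(arity):
            child = [gene[pos], []]
            pos += 1
            node[1].append(child)
            queue.append(child)
    return root  # symbols after `pos` are simply not expressed

def evaluate(node, inputs):
    """Recursively evaluate a decoded tree for one sample."""
    symbol, children = node
    if symbol in FUNCTIONS:
        _, fn = FUNCTIONS[symbol]
        return fn(*(evaluate(child, inputs) for child in children))
    return inputs[symbol]  # terminal: look up the input attribute

sample = {"a": 2, "b": 3, "c": 4}

# Two genes of the same length (head of 3 symbols, tail of 4 terminals).
big = evaluate(decode("+*abcab"), sample)    # encodes (b * c) + a
small = evaluate(decode("a+*bcab"), sample)  # encodes just a
```

Both genes have seven symbols, yet the first is expressed as a five-node tree and the second as a single terminal. This is the property described above: fixed-length strings at the genotype level, trees of different sizes and shapes at the phenotype level.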

In this work, a GEP-based SR method (henceforth, GEPMTR) and three ensemble-based methods (composed of GEPMTR members) are proposed for MTR. GEPMTR directly deals with multi-target data (i.e. an MTR problem is not decomposed into several single-target regression tasks) and, therefore, it has an acceptable computational cost in problems comprising a large number of target variables. GEPMTR solves an MTR problem by evolving a population of individuals, where each individual represents a complete solution to the problem; an individual is coded by a multi-genic chromosome, where each gene represents a mathematical function which predicts the value of a target variable. The creative power provided by the GEP technique allows new genetic material to be created constantly, which helps to better explore the search space and to detect inter-target dependencies by applying transposition operators. On the other hand, the three proposed ensemble-based methods differ in the way that base regressors are generated; two of these ensemble models are able to implicitly model the inter-target dependencies and input–output relationships by considering target variables as predictive ones.
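The idea of one individual encoding a complete solution, with one gene per target, can be sketched in Python as follows. This is a hedged illustration rather than the GEPMTR code: the gene strings, the binary function set, the breadth-first decoding, and the use of a squared error averaged over samples and targets as a quality measure are assumptions, since this excerpt does not specify them.

```python
import operator

# Binary functions only, to keep the decoding short (assumption).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def eval_gene(gene, x):
    """Evaluate one linear gene for sample x, reading symbols breadth-first."""
    children, nxt, i = [], 1, 0
    while i < nxt:  # only the expressed part of the gene is visited
        k = 2 if gene[i] in OPS else 0
        children.append(range(nxt, nxt + k))
        nxt += k
        i += 1
    values = [0.0] * nxt
    for i in reversed(range(nxt)):  # children always come after their parent
        symbol = gene[i]
        if symbol in OPS:
            left, right = (values[c] for c in children[i])
            values[i] = OPS[symbol](left, right)
        else:
            values[i] = x[symbol]
    return values[0]

def predict(chromosome, x):
    """One gene per target: the chromosome yields the full output vector."""
    return [eval_gene(gene, x) for gene in chromosome]

def mean_squared_error(chromosome, samples, targets):
    """Squared error averaged over all samples and targets (assumed measure)."""
    total, count = 0.0, 0
    for x, y in zip(samples, targets):
        for p, t in zip(predict(chromosome, x), y):
            total += (p - t) ** 2
            count += 1
    return total / count

# A chromosome for a two-target problem: gene 0 -> target 0, gene 1 -> target 1.
chromosome = ["+*abcab", "-*cabab"]  # (b * c) + a  and  (a * b) - c
x = {"a": 2.0, "b": 3.0, "c": 4.0}

outputs = predict(chromosome, x)
error = mean_squared_error(chromosome, [x], [[13.0, 4.0]])
```

Evaluating a whole chromosome at once is what lets a single individual be scored on all targets together, instead of training and scoring a separate model for each target.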

The main contribution of this work is the introduction of a GEP-based SR method and its corresponding ensemble-based models that can effectively solve MTR, being able to tackle the two main challenges that commonly appear in MTR, namely the modeling of inter-target and nonlinear input–output relationships. To demonstrate the effectiveness of the proposal, an extensive experimental study was conducted in a collection of 18 datasets, and the results showed that the proposed methods obtained competitive results in comparison with the state-of-the-art MTR algorithms.

The rest of this work is organized as follows: Section 2 presents a formal definition of the MTR problem, and briefly portrays the state-of-the-art in MTR and SR; Section 3 describes the GEPMTR method and the proposed ensemble-based models; Section 4 shows the experimental study carried out, where the obtained results are discussed; and finally, Section 5 presents some concluding remarks.

Section snippets

Related works

In this section, first the MTR problem is formulated, and then the state-of-the-art methods are briefly described. Second, a general overview of the evolutionary methods, particularly those methods based on GEP, that have been proposed to perform SR is portrayed.

A GEP-based SR method for MTR

In this section, first a GEP-based SR method for MTR is proposed, and then, three ensemble-based approaches are presented.

Experimental study

In this section, first the details and settings of the experiments are explained, and then, the results are presented and discussed.

Conclusions

In this work, an effective GEP-based SR method has been proposed for solving the MTR problem. The codification schema selected and the way in which the individuals are evaluated allowed us to design a method with an acceptable computational cost, enabling its use in large-scale datasets. Also, the process followed to ensure a diverse population prevented the search from getting trapped in local optima in early generations, obtaining better models as the number of generations increased. Furthermore, several components

CRediT authorship contribution statement

Jose M. Moyano: Formal analysis, Investigation, Software, Validation, Writing - original draft, Writing - review & editing. Oscar Reyes: Formal analysis, Investigation, Software, Validation, Writing - original draft, Writing - review & editing. Habib M. Fardoun: Formal analysis, Investigation, Software, Validation, Writing - original draft, Writing - review & editing. Sebastián Ventura: Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Resources, Supervision,

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This research was supported by the Spanish Ministry of Economy and Competitiveness and the European Regional Development Fund, project TIN2017-83445-P, and by the University of Cordoba and Andalusian Council of Economy, Knowledge, Business, and Universities, project 1262678-F. This research was also supported by the Spanish Ministry of Education under FPU Grant FPU15/02948.

References (49)

  • O. Reyes et al.

    A locally weighted learning method based on a data gravitation model for multi-target regression

    Int. J. Comput. Intell. Syst.

    (2018)
  • D. Tuia et al.

    Multioutput support vector regression for remote sensing biophysical parameter estimation

    IEEE Geosci. Remote Sens. Lett.

    (2011)
  • X. Zhen et al.

    Multi-target regression via robust low-rank learning

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2018)
  • O. Reyes et al.

    Performing multi-target regression via a parameter sharing-based deep network

    Int. J. Neural Syst.

    (2019)
  • H. Borchani et al.

    A survey on multi-output regression

    Wiley Interdisc. Rev. Data Min. Knowl. Disc.

    (2015)
  • O. Reyes et al.

    Evolutionary strategy to perform batch-mode active learning on multi-label data

    ACM Trans. Intell. Syst. Technol.

    (2018)
  • E. Spyromitros-Xioufis et al.

    Multi-target regression via input space expansion: treating targets as inputs

    Mach. Learn.

    (2016)
  • J.M. Moyano, E. Gibaja, S. Ventura, An evolutionary algorithm for optimizing the target ordering in ensemble of...
  • S. Stijven, E. Vladislavleva, A. Kordon, L. Willem, M.E. Kotanchek, Genetic Programming Theory and Practice XIII,...
  • M. Haeri et al.

    Statistical genetic programming for symbolic regression

    Appl. Soft Comput.

    (2017)
  • J. Zhong et al.

    Gene expression programming: a survey

    IEEE Comput. Intell. Mag.

    (2017)
  • C. Ferreira

    Gene expression programming: a new adaptive algorithm for solving problems

    Complex Syst.

    (2001)
  • O. Reyes et al.

    A gene expression programming method for multi-target regression

    International Conference on Learning and Optimization Algorithms: Theory and Applications

    (2018)
  • G. Tsoumakas, E. Spyromitros-Xioufis, A. Vrekou, I. Vlahavas, Multi-target regression via random linear target...

Cited by (12)

    • Multi-target regression via non-linear output structure learning

      2022, Neurocomputing
      Citation Excerpt :

      Nevertheless, the performance of this algorithm is slightly worse than that of the multi-objective random forests method [26]. In [27], a symbolic regression approach based on Gene Expression Programming is proposed for the multi-target regression problem. The method can estimate the inter-target correlations using some genetic operators.

    • VMFS: A VIKOR-based multi-target feature selection

      2021, Expert Systems with Applications
      Citation Excerpt :

      If the predicting outputs are binary, the learning process is a classification problem called Multi-Label (ML) learning (Bayati et al., 2020). If the outputs are continuous values, it is a regression problem that refers to Multi-Target (MT) learning (Moyano et al., 2021). In ML data, each label's value indicates each sample belongs to a class label, but in MT data, this value represents the quality of each instance for a particular target (Borchani et al., 2015; Spyromitros-Xioufis et al., 2016).

    Jose M. Moyano obtained his Ph.D. in Computer Science from the University of Córdoba (Spain) and the Virginia Commonwealth University (USA) in 2020. He also received his B.Sc. and M.Sc. degrees in Computer Science from the University of Córdoba (Spain), in 2014 and 2016, respectively. His research is mainly focused on ensemble methods for multi-label classification. He is a member of the KDIS Research Group of the University of Córdoba, and he has published 15 papers in journals and international conferences.

    Oscar Reyes received the B.S. and M.Sc. degrees in Computer Science from the University of Holguin (Cuba), in 2008 and 2011, respectively. He obtained the Ph.D. degree in Computer Science from the University of Cordoba (Spain) in 2016. He is currently a researcher at the Knowledge Discovery and Intelligent Systems Research Laboratory of the University of Cordoba, Spain. He has also been engaged in 2 national research projects. His current research interests are in the fields of data mining, machine learning, metaheuristics, and their applications.

    Habib M. Fardoun is an Associate Professor at the Faculty of Computing and Information Technology of the King Abdulaziz University. He is the project manager of the ISE Research group at the Computer Engineering Research Institute of Albacete. He is the author of more than 40 international publications. His research interests fall in the fields of artificial intelligence, machine learning, educational systems and human computer interaction.

    Sebastian Ventura is currently a Full Professor in the Department of Computer Science and Numerical Analysis at the University of Cordoba, where he heads the Knowledge Discovery and Intelligent Systems Research Laboratory. He received his B.Sc. and Ph.D. degrees in sciences from the University of Cordoba, Spain, in 1989 and 1996, respectively. He has published three books and about 300 papers in journals and scientific conferences, and he has edited about ten books and special issues in international journals in his area of expertise. He has also been engaged in 15 research projects (being the coordinator of seven of them) supported by the Spanish and Andalusian governments and the European Union. He has also served as General and Program Chair in several conferences in the fields of machine learning and artificial intelligence, and he currently holds different positions on the editorial boards of journals such as Engineering Applications of Artificial Intelligence, Computational Intelligence, International Journal of Educational Technology in Higher Education, and Information Fusion, serving also as Editor in Chief of the Progress in Artificial Intelligence journal. His main research interests are in the fields of data science, computational intelligence, and their applications. Dr. Ventura is a senior member of the IEEE Computer, the IEEE Computational Intelligence and the IEEE Systems, Man and Cybernetics Societies, as well as the Association for Computing Machinery (ACM).