Performing multi-target regression via gene expression programming-based ensemble models
Introduction
In the last two decades, the study of problems where samples are associated with several output variables at the same time has gained a lot of attention in the machine learning community. Multi-target regression (henceforth, MTR) is one of these problems; it comprises the prediction of multiple continuous target variables using a common set of input variables [1]. The great interest in studying such multi-output problems is mainly due to the high number of real-world applications that can be analyzed under this framework [2]. For example, MTR has been successfully applied to ecological modeling [3], signal processing [4], and energy efficiency [5].
Two major challenges appear when solving MTR: modeling the existing dependencies between target variables, and estimating the nonlinear relationships that may exist between the input and output spaces of a problem [6], [7]. To date, many methods have been proposed to study MTR [8], such as algorithms inspired by multi-label learning approaches [10], statistical methods [11], regression trees [3], and Support Vector Machines (SVMs) [12]. However, despite the considerable number of existing contributions, there are few solutions tackling the two aforementioned challenges simultaneously [6].
Recently, ensemble-based methods that decompose an MTR problem into several single-target regression tasks have shown their ability to partially capture inter-target relationships [10], [1]. These methods are also able to exploit some input–output relationships by expanding the input space with target variables. The main assumption behind these ensemble methods is that a lower probability of poor predictions can be attained by combining several simpler regressors [13]. However, it should be stressed that they largely depend on the capacity of the internal regressor used to solve each single-target regression task. In this sense, Reyes et al. [1] demonstrated that ensemble models composed of members handling the MTR problem directly (i.e. each member of the ensemble directly handles multi-target data) are effective, being able to better model the inter-target and input–output relationships. The main challenges of this last approach are, however, first to select as a member of the ensemble a regressor that is able to effectively solve the MTR problem on its own, and second to develop an adequate ensemble scheme and aggregation strategy in order to effectively combine all the predictions.
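To make the idea of expanding the input space with target variables concrete, the following minimal Python sketch builds the training data for one single-target task. The function name `expand_inputs` and the toy data are hypothetical and are not taken from any of the cited methods; the sketch shows only the data transformation, not a regressor.

```python
def expand_inputs(X, Y, target):
    """Return (X_expanded, y) for one single-target regression task.

    Each expanded row holds the original input features followed by the
    values of every target variable except `target`; `y` holds the values
    of `target` itself. (Illustrative helper, assumed for this sketch.)
    """
    X_expanded = [
        list(x) + [v for j, v in enumerate(y_row) if j != target]
        for x, y_row in zip(X, Y)
    ]
    y = [y_row[target] for y_row in Y]
    return X_expanded, y


# Toy data: 2 samples, 2 input variables, 3 target variables.
X = [[0.1, 0.2], [0.3, 0.4]]
Y = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# Training data for the task that predicts target 0 from the inputs
# plus targets 1 and 2.
X0, y0 = expand_inputs(X, Y, target=0)
```

Note that the true target values are only available during training; at prediction time the appended columns would have to be filled with estimates, which is one reason the quality of the internal regressor matters so much in these methods.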
As for developing an effective regressor that is able to directly tackle MTR, it is interesting to analyze one of the most popular frameworks that has been widely used to solve regression problems, the so-called Symbolic Regression (hereafter, SR) [14]. The aim of SR is to find a mathematical expression that best fits a given dataset [15], and it is commonly addressed by applying evolutionary algorithms, in particular Genetic Programming (GP) [16] and Gene Expression Programming (hereafter, GEP) [15] methods. In recent years, GEP-based SR methods have increased in popularity [17], since they leverage the benefits of both the Genetic Algorithm (GA) and GP paradigms. Like GAs and GP, this type of method uses a population of individuals, selects parents according to their fitness, and evolves the population using genetic operators. The main difference is that individuals are encoded as linear strings of fixed length (as in GAs), but are then expressed as trees of different sizes and shapes (as in GP) [18]. To date, the existing GEP-based SR methods have been restricted to solving single-target regression problems. However, applying the GEP-based SR approach to MTR would make it possible to obtain more interpretable models than the black-box ones produced by some popular learning algorithms, such as SVMs [19]. Furthermore, this technique does not require a particular model structure to be specified in advance; instead, the expressions are built by combining mathematical building blocks, thus providing an implicit feature selection process and the detection of complex data relationships.
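The relation between a fixed-length linear string and the tree it expresses can be illustrated with a short Python sketch. It assumes the breadth-first (level-by-level) reading of a gene that is standard in GEP, a small hypothetical function set (`+`, `-`, `*`, and `Q` for square root), and a well-formed gene; it is not the encoding used by any specific method discussed here.

```python
import math
from collections import deque

# Hypothetical function set: symbol -> number of arguments.
# Any symbol not listed here is treated as a terminal (input variable).
ARITY = {"+": 2, "-": 2, "*": 2, "Q": 1}


def decode(gene):
    """Express a linear gene as a tree, filling it level by level.

    A node is a two-item list [symbol, children]. Symbols left over once
    every function has received its arguments are simply not expressed.
    """
    root = [gene[0], []]
    queue = deque([root])
    pos = 1
    while queue:
        node = queue.popleft()
        for _ in range(ARITY.get(node[0], 0)):
            child = [gene[pos], []]
            pos += 1
            node[1].append(child)
            queue.append(child)
    return root


def evaluate(node, env):
    """Evaluate an expression tree given variable values in `env`."""
    symbol, children = node
    args = [evaluate(child, env) for child in children]
    if symbol == "+":
        return args[0] + args[1]
    if symbol == "-":
        return args[0] - args[1]
    if symbol == "*":
        return args[0] * args[1]
    if symbol == "Q":
        return math.sqrt(args[0])
    return env[symbol]  # terminal


env = {"a": 1.0, "b": 2.0, "c": 7.0, "d": 4.0}

# Two genes of the same length (10 symbols) expressing trees of different size.
big = evaluate(decode("Q*+-abcdab"), env)    # sqrt((a + b) * (c - d))
small = evaluate(decode("+abQ*-cdab"), env)  # a + b; the rest is unexpressed
```

Both genes have ten symbols, yet the first is expressed as an eight-node tree and the second as a three-node tree, which is the fixed-length genotype and variable-size phenotype distinction described above.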
In this work, a GEP-based SR method (henceforth, GEPMTR) and three ensemble-based methods (each composed of GEPMTR members) are proposed for MTR. GEPMTR deals directly with multi-target data (i.e. an MTR problem is not decomposed into several single-target regression tasks) and, therefore, it has an acceptable computational cost in problems comprising a large number of target variables. GEPMTR solves an MTR problem by evolving a population of individuals, where each individual represents a complete solution to the problem; an individual is coded by a multi-genic chromosome, where each gene represents a mathematical function that predicts the value of one target variable. The creative power of the GEP technique makes it possible to constantly create new genetic material, to better explore the search space, and to detect inter-target dependencies by applying transposition operators. The three proposed ensemble-based methods, in turn, differ in the way the base regressors are generated; two of these ensemble models are able to implicitly model the inter-target dependencies and input–output relationships by considering target variables as predictive ones.
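As a rough illustration of an individual that represents a complete solution, the Python sketch below models a multi-genic chromosome as a list of functions, one per target variable. The lambdas are hypothetical stand-ins for decoded genes, and the error measure (root mean squared error averaged over targets) is an assumption made for the example, not necessarily the fitness function used by GEPMTR.

```python
import math

# Hypothetical chromosome for a problem with two targets: each entry plays
# the role of one decoded gene and predicts one target from the inputs x.
chromosome = [
    lambda x: x[0] + x[1],        # gene 0 -> target 0
    lambda x: x[0] * x[1] - 1.0,  # gene 1 -> target 1
]


def predict(chromosome, x):
    """Predict all targets for one sample with a single individual."""
    return [gene(x) for gene in chromosome]


def average_rmse(chromosome, X, Y):
    """Mean over targets of the per-target RMSE (assumed error measure)."""
    total = 0.0
    for t, gene in enumerate(chromosome):
        squared = [(gene(x) - y[t]) ** 2 for x, y in zip(X, Y)]
        total += math.sqrt(sum(squared) / len(squared))
    return total / len(chromosome)


X = [[1.0, 2.0], [3.0, 4.0]]
Y_exact = [[3.0, 1.0], [7.0, 11.0]]  # matches the chromosome exactly
Y_noisy = [[3.0, 1.0], [7.0, 13.0]]  # target 1 is off by 2 on one sample

perfect = average_rmse(chromosome, X, Y_exact)
imperfect = average_rmse(chromosome, X, Y_noisy)
```

Because one individual yields predictions for every target in a single pass, the problem does not have to be split into one evolutionary run per target, which is the design choice the paragraph above attributes to GEPMTR.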
The main contribution of this work is the introduction of a GEP-based SR method and its corresponding ensemble-based models that can effectively solve MTR by tackling the two main challenges that commonly appear in this problem, namely the modeling of inter-target and nonlinear input–output relationships. To demonstrate the effectiveness of the proposal, an extensive experimental study was conducted on a collection of 18 datasets, and the results showed that the proposed methods are competitive with state-of-the-art MTR algorithms.
The rest of this work is organized as follows: Section 2 presents a formal definition of the MTR problem, and briefly portrays the state-of-the-art in MTR and SR; Section 3 describes the GEPMTR method and the proposed ensemble-based models; Section 4 shows the experimental study carried out, where the obtained results are discussed; and finally, Section 5 presents some concluding remarks.
Section snippets
Related works
In this section, the MTR problem is first formulated and the state-of-the-art methods are briefly described. Then, a general overview is given of the evolutionary methods that have been proposed to perform SR, particularly those based on GEP.
A GEP-based SR method for MTR
In this section, first a GEP-based SR method for MTR is proposed, and then, three ensemble-based approaches are presented.
Experimental study
In this section, first the details and settings of the experiments are explained, and then, the results are presented and discussed.
Conclusions
In this work, an effective GEP-based SR method has been proposed for solving the MTR problem. The selected codification schema and the way in which the individuals are evaluated allowed us to design a method with an acceptable computational cost, enabling its use on large-scale datasets. Also, the process followed to ensure a diverse population prevented the search from falling into local optima in early generations, so that better models were obtained as the number of generations increased. Furthermore, several components
CRediT authorship contribution statement
Jose M. Moyano: Formal analysis, Investigation, Software, Validation, Writing - original draft, Writing - review & editing. Oscar Reyes: Formal analysis, Investigation, Software, Validation, Writing - original draft, Writing - review & editing. Habib M. Fardoun: Formal analysis, Investigation, Software, Validation, Writing - original draft, Writing - review & editing. Sebastián Ventura: Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Resources, Supervision,
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This research was supported by the Spanish Ministry of Economy and Competitiveness and the European Regional Development Fund, project TIN2017-83445-P, and by the University of Cordoba and Andalusian Council of Economy, Knowledge, Business, and Universities, project 1262678-F. This research was also supported by the Spanish Ministry of Education under FPU Grant FPU15/02948.
References (49)
- et al. Using single and multi-target regression trees and ensembles to model a compound index of vegetation condition. Ecol. Model. (2009)
- et al. Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools. Energy Build. (2012)
- et al. Input selection and shrinkage in multiresponse linear regression. Comput. Stat. Data Anal. (2007)
- et al. Multi-target support vector regression via correlation regressor chains. Inf. Sci. (2017)
- et al. An improved gene expression programming approach for symbolic regression problems. Neurocomputing (2014)
- et al. Artificial bee colony programming for symbolic regression. Inf. Sci. (2012)
- et al. Robust gene expression programming. Procedia Comput. Sci. (2011)
- Modeling slump flow of concrete using second-order regressions and artificial neural networks. Cem. Concr. Compos. (2007)
- et al. An empirical study on sea water quality prediction. Knowl.-Based Syst. (2008)
- et al. An ensemble-based method for the selection of instances in the multi-target regression problem. Integr. Comput. Aided Eng. (2018)
- A locally weighted learning method based on a data gravitation model for multi-target regression. Int. J. Comput. Intell. Syst.
- Multioutput support vector regression for remote sensing biophysical parameter estimation. IEEE Geosci. Remote Sens. Lett.
- Multi-target regression via robust low-rank learning. IEEE Trans. Pattern Anal. Mach. Intell.
- Performing multi-target regression via a parameter sharing-based deep network. Int. J. Neural Syst.
- A survey on multi-output regression. Wiley Interdisc. Rev. Data Min. Knowl. Disc.
- Evolutionary strategy to perform batch-mode active learning on multi-label data. ACM Trans. Intell. Syst. Technol.
- Multi-target regression via input space expansion: treating targets as inputs. Mach. Learn.
- Statistical genetic programming for symbolic regression. Appl. Soft Comput.
- Gene expression programming: a survey. IEEE Comput. Intell. Mag.
- Gene expression programming: a new adaptive algorithm for solving problems. Complex Syst.
- A gene expression programming method for multi-target regression. International Conference on Learning and Optimization Algorithms: Theory and Applications
Jose M. Moyano obtained his Ph.D. in Computer Science from the University of Córdoba (Spain) and the Virginia Commonwealth University (USA) in 2020. He also received his B.Sc. and M.Sc. degrees in Computer Science from the University of Córdoba (Spain), in 2014 and 2016, respectively. His research is mainly focused on ensemble methods for multi-label classification. He is a member of the KDIS Research Group of the University of Córdoba, and he has published 15 papers in journals and international conferences.
Oscar Reyes received the B.S. and M.Sc. degrees in Computer Science from the University of Holguin (Cuba), in 2008 and 2011, respectively. He obtained the Ph.D. degree in Computer Science from the University of Cordoba (Spain) in 2016. He is currently a researcher at the Knowledge Discovery and Intelligent Systems Research Laboratory of the University of Cordoba, Spain. He has also been engaged in 2 national research projects. His current research interests are in the fields of data mining, machine learning, metaheuristics, and their applications.
Habib M. Fardoun is an Associate Professor at the Faculty of Computing and Information Technology of King Abdulaziz University. He is the project manager of the ISE Research group at the Computer Engineering Research Institute of Albacete. He is the author of more than 40 international publications. His research interests fall in the fields of artificial intelligence, machine learning, educational systems, and human-computer interaction.
Sebastián Ventura is currently a Full Professor in the Department of Computer Science and Numerical Analysis at the University of Cordoba, where he heads the Knowledge Discovery and Intelligent Systems Research Laboratory. He received his B.Sc. and Ph.D. degrees in sciences from the University of Cordoba, Spain, in 1989 and 1996, respectively. He has published three books and about 300 papers in journals and scientific conferences, and he has edited about ten books and special issues in international journals in his area of expertise. He has also been engaged in 15 research projects (being the coordinator of seven of them) supported by the Spanish and Andalusian governments and the European Union. He has also served as General and Program Chair of several conferences in the fields of machine learning and artificial intelligence, and he currently holds different positions on the editorial boards of journals such as Engineering Applications of Artificial Intelligence, Computational Intelligence, International Journal of Educational Technology in Higher Education, and Information Fusion, also serving as Editor in Chief of the Progress in Artificial Intelligence journal. His main research interests are in the fields of data science, computational intelligence, and their applications. Dr. Ventura is a senior member of the IEEE Computer, the IEEE Computational Intelligence, and the IEEE Systems, Man and Cybernetics Societies, as well as of the Association for Computing Machinery (ACM).