Created by W.Langdon from gp-bibliography.bib Revision:1.8562
Many different techniques have been proposed over the years to extract hidden, interesting and previously unknown information from large quantities of data. All of them can be categorized into two main groups: descriptive tasks, which depict intrinsic and important properties of data; and predictive tasks, which predict an output variable for unseen data. Classification based on association rule mining, generally known as Associative Classification (AC), integrates a descriptive task into the process of generating a classifier. Several studies have shown that AC algorithms obtain accurate and interpretable results efficiently by leveraging association rule discovery methods in the training phase. This makes it possible to obtain all the hidden relationships among attribute values, which less exhaustive methodologies may miss. Furthermore, AC makes it possible to update and tune a subset of rules without having to redraw the whole tree, as happens in decision tree approaches. Last but not least, the main advantage of AC over other techniques is the final model representation, which consists of simple rules that enable end users to understand and interpret the results.
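As a rough illustration of the idea described above, the following Python sketch scores a class association rule by support and confidence and predicts with the first matching rule. The toy records, the function names and the first-match strategy are assumptions made for this example only; they are not taken from the thesis.

```python
# Toy records: attribute-value pairs plus a class label (illustrative values only).
records = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ({"outlook": "sunny", "windy": "no"}, "play"),
]


def matches(antecedent, attributes):
    """True when every attribute-value pair of the antecedent holds in the record."""
    return all(attributes.get(key) == value for key, value in antecedent.items())


def support_confidence(antecedent, label, data):
    """Support and confidence of the rule `antecedent -> label` over `data`."""
    covered = [cls for attrs, cls in data if matches(antecedent, attrs)]
    hits = sum(1 for cls in covered if cls == label)
    support = hits / len(data)
    confidence = hits / len(covered) if covered else 0.0
    return support, confidence


def classify(attributes, rules, default):
    """Predict with the first rule, in the given order, whose antecedent matches."""
    for antecedent, label in rules:
        if matches(antecedent, attributes):
            return label
    return default
```

Because each rule is an independent (antecedent, class) pair, a single rule can be replaced or re-scored without touching the rest of the model, which is the property the abstract contrasts with decision trees.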
This doctoral thesis addresses the challenging problem of AC and its application to very large datasets. Its main contributions are summarized in the following points:
The AC state of the art has been studied and analyzed, and a new tool has been proposed that covers the whole taxonomy of algorithms and provides many different measures. The goal of this tool is twofold: 1) to unify comparisons, since existing works compare using very different measures; 2) to provide a single tool that includes at least one algorithm from each category of the taxonomy.
AC has been analyzed on very large quantities of data. In this regard, many different platforms for distributed computing have been studied, and different proposals have been developed on them. These proposals make it possible to deal with very large data efficiently by distributing the load across different compute nodes.
As one of the most important parts of AC is extracting high-quality rules, a novel grammar-guided genetic programming algorithm has been proposed that obtains interesting association rules with regard to different metrics and in different kinds of data, including truly Big Data datasets. This proposal has been shown to obtain very good results in terms of both quality and interpretability, while providing a very flexible way of representing the solutions and allowing subjective knowledge to be introduced into the search process. Then, a novel algorithm has been proposed for AC that uses a non-trivial adaptation of the aforementioned algorithm to obtain the rules forming the classifier. This methodology is also based on grammar-guided genetic programming, enabling the user to constrain not only the form of the rules but also the final form of the classifier. Results have shown that this algorithm obtains very accurate classifiers while maintaining a good level of interpretability.
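To make the grammar-guided part more concrete, the sketch below derives random rules from a small context-free grammar, so that every generated candidate has the form the grammar allows. The grammar, the attribute names and the `derive` helper are hypothetical and chosen only for illustration; the thesis's actual grammar, genetic operators and fitness evaluation are not reproduced here.

```python
import random

# Hypothetical grammar: each non-terminal maps to a list of productions.
GRAMMAR = {
    "<rule>": [["<antecedent>", "->", "<class>"]],
    "<antecedent>": [["<condition>"], ["<condition>", "AND", "<antecedent>"]],
    "<condition>": [["outlook", "=", "<outlook>"], ["windy", "=", "<windy>"]],
    "<outlook>": [["sunny"], ["rainy"]],
    "<windy>": [["yes"], ["no"]],
    "<class>": [["class", "=", "play"], ["class", "=", "stay"]],
}


def derive(symbol, rng, depth=0, max_depth=4):
    """Expand `symbol` into a list of terminal tokens by random derivation."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal symbol: emit as-is
    productions = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth limit, take the shortest production so recursion ends.
        productions = [min(productions, key=len)]
    tokens = []
    for child in rng.choice(productions):
        tokens.extend(derive(child, rng, depth + 1, max_depth))
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(" ".join(derive("<rule>", rng)))
```

Editing the grammar, for example removing the `AND` production or restricting `<class>`, changes the shape of every rule the search can produce. This is one way a user could constrain the form of the rules, under the assumptions stated above.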
TIN2014-55252-P and TIN-2017-83445-P
Supervisors: Sebastian Ventura Soto and Jose Maria Luna Ariza
Genetic Programming entries for Francisco Padillo