Created by W.Langdon from gp-bibliography.bib Revision:1.8178
In this thesis we will describe the work developed at the Machine Learning Lab1 at University of Trieste, consisting in novel Machine Learning techniques aimed at the solution of real world problems of practical interest: automatic synthesis of regular expressions for text extraction and text classification tasks; an approach for the continuous reauthentication of web users; design of algorithms for author verification for text documents; author profiling for text messages; automatic generation of fake textual reviews. Among them the main contribution of this thesis is the design and implementation of new algorithms for the automatic generation of regular expressions for text extraction, based solely on examples of the desired behavior [21, 23, 31]. This is a long-standing problem where we aim at generating a regular expression that generalises the extraction behavior represented by some examples, i.e., strings annotated by a user with the desired portions to be extracted. The proposed algorithms are based on an evolutionary approach called Genetic Programming (GP), that is an evolutionary computing paradigm which implements an heuristic search, in a space of candidate solution, in a way that mimic the natural evolution process.
The results demonstrate that our new algorithms have higher effectiveness than previous proposals and demonstrate that our algorithms are able to generate regular expressions in a way that is competitive with human experts both in terms of effectiveness and generation time [24, 33]. Thanks to these achievements,the proposed method has been awarded with the Silver Medal at the 13th Annual Humies Award 2, an international competition that establishes the state of the art in genetic and evolutionary computation and is open to human-competitive results that are equal to or better than the most recent human-created solution to a long-standing problem. The result of our research has been also released as an opensource framework 3 and as a web application demo 4 where users are free to provide text extraction examples to the application and obtain the corresponding regular expression.
Later in this thesis we will extend our work on automatic generation of regular expressions for text extraction from examples in order to operate in an Active learning scenario. In this scenario the user is not required to annotate all the examples at once but the Active learning tool interacts with the user in order to assist him during the annotation of the extractions in examples. We will propose our Active learning method [22, 26] that is based on our previous GP algorithms and the results will demonstrate that our active learning tool reduces the user annotation effort while providing comparable effectiveness for the generated regular expressions.
Moreover, in this thesis we will consider two applications of the proposed regular expressions generator, adapted in order to cope with text categorization problems that are different from text extraction:(i) the Regex Golf game and (ii) the identification of Genic Interactions in sentences. The Regex Golf is a game where the player should write he shortest regular expression that accepts the strings in a positive set and does not accept strings in a negative set. We will show that our GP algorithm is able to play this game effectively and we will demonstrate that our algorithm is competitive with human players [20]. In the second case, we will consider the problem of automatically identifying sentences that contain interactions between genes and proteins inside a text document [30]. Our proposal requires solely a dictionary of genes and proteins and a small set of sample sentences in natural language. The proposed method generates a model in form of regular expressions that represents the relevant syntax patterns in terms of standard part-of-speech annotations. We will assess our approach on realistic datasets and show an accuracy that is sufficiently high to be of practical interest and that is in line with significant baseline methods.
The following contributions leave the field of the Genetic Programming algorithms and will propose solutions based on other Machine Learning methodologies, ranging from Grammatical Evolution to Support Vector Machines and Random Forests to Recurrent Neural Networks.
We will propose a methodology for predicting the accuracy of the text extractor [25] that may be inferred with the proposed GP method. We will employ several prediction techniques and the results suggest that reliable predictions for tasks of practical complexity may indeed be obtained quickly and without actually generating the entity extractor. Later, we will approach the problem of the automatic text extraction from another perspective and we will propose a novel learning algorithm that is able to generate a string similarity function tailored to problems of syntax-based entity extraction from unstructured text streams [27]. The proposed algorithm, based on an evolutionary paradigm named Grammatical Evolution, takes in input pairs of strings along with an indication of whether they adhere or not adhere to the same syntactic pattern. The results suggest that the proposed approach is indeed feasible and that the learned similarity function is more effective than the Levenshtein distance and the Jaccard similarity index. Hence, we will propose a system for continuous reauthentication of web users based on the observed mouse dynamics [144]; the key feature of our proposal is that no specific software needs to be installed on client machines. We obtain accuracy in the order of 97percent, which is aligned with earlier proposals. Then, we will approach the user authentication problem [14], this task consists in determining if an unknown document was authored by the same author of a set of documents with the same author. Our methods has been submitted to the 2015 PAN competition and achieved the first position in the final rank for the Spanish language. Hence, we will approach the user profiling problem [19], this task consists in predicting some BibTeX entry too long. Truncated
Genetic Programming entries for Fabiano Tarlao