Created by W.Langdon from gp-bibliography.bib Revision:1.8010
In this thesis, we first propose a genetic programming approach to record deduplication. This approach combines several pieces of evidence extracted from the data present in the repositories to suggest a deduplication function that is able to identify whether or not two entries in a repository are replicas. As shown by our experiments, our approach outperforms existing state-of-the-art methods found in the literature. Moreover, the suggested function is computationally less demanding, since it uses less evidence. Finally, it is also important to note that our approach is capable of automatically adapting to a given fixed replica identification boundary, freeing the user from the burden of having to choose and tune this parameter.
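To make the idea of a deduplication function concrete, the following is a minimal sketch and not the thesis's implementation. It assumes that a piece of evidence is an (attribute, similarity function) pair, that a candidate function is an expression tree over evidence values, and that a pair of records is flagged as replicas when the tree's output reaches a fixed boundary. All names (`EVIDENCE`, `TREE`, `BOUNDARY`, `is_replica`) are hypothetical, the tree is written by hand rather than evolved, and `difflib.SequenceMatcher` is only a stand-in for whatever string metrics the thesis uses.

```python
from difflib import SequenceMatcher


def edit_sim(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for any string metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


# Assumption: each piece of evidence pairs an attribute with a similarity function.
EVIDENCE = {
    "name_edit": ("name", edit_sim),
    "addr_jacc": ("address", jaccard),
}

# Assumption: a candidate function is a tree of (operator, left, right) tuples.
# This one is hand-written for illustration: 2 * name_edit + addr_jacc.
TREE = ("+", ("*", "name_edit", 2.0), "addr_jacc")

# Assumption: the fixed replica identification boundary, chosen arbitrarily here.
BOUNDARY = 2.0


def evidence_values(r1: dict, r2: dict) -> dict:
    """Compute every evidence value for a pair of records."""
    return {key: fn(r1[attr], r2[attr]) for key, (attr, fn) in EVIDENCE.items()}


def evaluate(node, env: dict) -> float:
    """Recursively evaluate an expression tree against evidence values."""
    if isinstance(node, str):
        return env[node]
    if isinstance(node, (int, float)):
        return float(node)
    op, left, right = node
    x, y = evaluate(left, env), evaluate(right, env)
    if op == "+":
        return x + y
    if op == "-":
        return x - y
    if op == "*":
        return x * y
    raise ValueError(f"unknown operator: {op}")


def is_replica(r1: dict, r2: dict, tree=TREE, boundary: float = BOUNDARY) -> bool:
    """Flag the pair as replicas when the tree output reaches the boundary."""
    return evaluate(tree, evidence_values(r1, r2)) >= boundary
```

In a genetic programming setting, trees such as `TREE` would be the individuals being evolved, with fitness measured on labeled record pairs; because the boundary stays fixed, the evolved tree has to scale its output to it, which is one way to read the abstract's claim about adapting to a given boundary.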
Based on the previous approach, we also devised a novel evolutionary approach that is able to automatically find complex schema matches. Our aim was to develop a method to find semantic relationships between schema elements in a restricted scenario in which only the data instances are available. To the best of our knowledge, this is the first approach capable of discovering complex schema matches using only the data instances. It does so by exploiting record deduplication and information retrieval techniques to find schema matches during the evolutionary process. To demonstrate the effectiveness of our approach, we conducted an experimental evaluation using real-world and synthetic datasets. Our results show that our approach is able to find complex matches with high accuracy, despite using only the data instances.
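As an illustration of what scoring a complex match from data instances alone might look like, here is a deliberately simplified sketch, not the method from the thesis. It assumes a complex match is a function deriving a value from several source attributes (here, a concatenation) and scores it by the fraction of derived values that also occur in the target column. The names `concat` and `match_score` are hypothetical, and exact matching after normalization replaces the deduplication and information retrieval similarity measures the abstract refers to.

```python
def concat(*cols):
    """Build a candidate complex match: join the named source attributes."""
    return lambda row: " ".join(row[c] for c in cols)


def match_score(candidate, source_rows, target_values) -> float:
    """Fraction of source rows whose derived value occurs in the target column.

    Assumption: exact match after case and whitespace normalization; a real
    system would use a similarity measure instead.
    """
    target = {v.strip().lower() for v in target_values}
    derived = [candidate(row).strip().lower() for row in source_rows]
    if not derived:
        return 0.0
    return sum(d in target for d in derived) / len(derived)
```

An evolutionary search could use a score like this as fitness, preferring for instance `concat("first", "last")` over `concat("last", "first")` when the target column stores full names in first-last order.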
Genetic Programming entries for Moises G de Carvalho