abstract = "A central problem in data integration and data
cleansing is to identify pairs of entities in data sets
that describe the same real-world object. Many existing
methods for matching entities rely on explicit linkage
rules, which specify how two entities are compared for
equivalence. Unfortunately, writing accurate linkage
rules by hand is a non-trivial problem that requires
detailed knowledge of the involved data sets. Another
important issue is the efficient execution of linkage
rules. In this thesis, we propose a set of novel
methods that cover the complete entity matching
workflow from the generation of linkage rules using
genetic programming algorithms to their efficient
execution on distributed systems. First, we propose a
supervised learning algorithm that is capable of
generating linkage rules from a gold standard
consisting of a set of entity pairs that have been
labelled as duplicates or non-duplicates. We show that
the introduced algorithm outperforms previously
proposed entity matching approaches including the
state-of-the-art genetic programming approach by de
Carvalho et al. and is capable of learning linkage
rules that achieve an accuracy similar to that of the
human-written rule for the same problem. In order to also
cover use cases for which no gold standard is
available, we propose a complementary active learning
algorithm that generates a gold standard interactively
by asking the user to confirm or decline the
equivalence of a small number of entity pairs. In the
experimental evaluation, labelling at most 50 link
candidates was necessary in order to match the
performance that is achieved by the supervised GenLink
algorithm on the entire gold standard. Finally, we
propose an efficient execution workflow that can be
run on a cluster of multiple machines. The execution
workflow employs a novel multidimensional indexing
method that allows the efficient execution of learnt
linkage rules by reducing the number of required
comparisons significantly.",
notes = "Supervised by Professor Bizer and Professor
Stuckenschmidt. Broken Feb 2019
http://dws.informatik.uni-mannheim.de/en/news/detail/news/2013/09/02/congratulations-to-robert-isele-for-receiving-his-doctoral-degree/