An unsupervised heuristic-based approach for bibliographic metadata deduplication

https://doi.org/10.1016/j.ipm.2011.01.009

Abstract

Digital libraries of scientific articles contain collections of digital objects that are usually described by bibliographic metadata records. These records can be acquired from different sources and be represented using several metadata standards, which may be heterogeneous in both content and structure. All of this implies that many records may be duplicated in the repository, thus affecting the quality of services such as searching and browsing. In this article we present an approach that identifies duplicated bibliographic metadata records in an efficient and effective way. We propose similarity functions especially designed for the digital library domain and experimentally evaluate them. Our results show that the proposed functions improve the quality of metadata deduplication by up to 188% compared to four different baselines. We also show that our approach achieves statistically equivalent results when compared to a state-of-the-art method for replica identification based on genetic programming, without the burden and cost of any training process.

Research highlights

  • An efficient and effective approach for bibliographic metadata record deduplication.

  • Up to 188% improvement in the quality of metadata deduplication.

  • Up to 44% of failure cases solved by the proposed similarity functions.

Introduction

Digital libraries (DLs) are complex information systems built to address the information needs of specific target communities (Gonçalves, Fox, Watson, & Kipp, 2004). DLs are composed of collections of rich (possibly multimedia) digital objects along with services, such as searching, browsing and recommendation, that allow easy access to and retrieval of these objects by the members of the target community (Fox et al., 1995, Gonçalves et al., 2004).

Collections of digital objects are usually described by means of metadata records (usually organized in a metadata catalog) whose function is to describe, organize and specify how these objects can be manipulated and retrieved, including who has the rights to do so. In order to promote interoperability among DLs and similar systems, metadata records usually conform to one or more metadata standards that specify, among other things, a standardized set of metadata fields and their semantics for the description of digital objects. The Dublin Core, for example, is a general descriptive metadata standard for the representation and storage of information about scientific publications and Web pages.

Although very useful, these standards do not completely solve all interoperability problems, as there is no consensus among existing digital libraries on a single ‘de facto’ standard. Moreover, even if such a consensus existed, differences in practices and in the way some metadata elements are filled, not to mention possible errors in this process (e.g., misspellings and typos), allow for the existence of several different records describing the same digital object.

Consider the example of Fig. 1, which presents excerpts of metadata records from three distinct digital libraries: BDBComp, DBLP and IEEE Xplore. All records refer to the same digital object. The field source in the BDBComp metadata record (line 3) corresponds to the field booktitle from DBLP (line 6). The metadata structures are different, but both refer to the same information, i.e., the publication venue of a specific paper. Also, the author of the paper, represented by the creator and author fields, appears as “Carlos H. Morimoto” in the BDBComp record (line 2), “Carlos Hitoshi Morimoto” in the DBLP record (line 5) and “Morimoto, C.H.” in the IEEE Xplore record (line 8). The value of the title field also differs in the word “Remote” (lines 1, 4 and 7).
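To make this heterogeneity concrete, the sketch below shows one way such a field correspondence could be encoded and applied before any record comparison takes place. It is an illustration only: the field names follow Fig. 1, but the mapping structure, the helper function and the placeholder values (marked "...") are assumptions, not part of the original article.

# Illustrative sketch: aligning heterogeneous metadata fields into a common
# schema before comparison. Field names follow Fig. 1; the mapping dictionary,
# the helper function and the "..." placeholder values are assumptions.

FIELD_MAP = {
    "bdbcomp": {"creator": "author", "source": "venue", "title": "title"},
    "dblp":    {"author": "author", "booktitle": "venue", "title": "title"},
}

def normalize(record: dict, source: str) -> dict:
    """Rename a record's fields according to the mapping defined for its source."""
    mapping = FIELD_MAP[source]
    return {mapping[field]: value for field, value in record.items() if field in mapping}

# Author values are the ones quoted above; the other values are placeholders.
bdbcomp_record = {"title": "...", "creator": "Carlos H. Morimoto", "source": "..."}
dblp_record = {"title": "...", "author": "Carlos Hitoshi Morimoto", "booktitle": "..."}

print(normalize(bdbcomp_record, "bdbcomp"))  # {'title': '...', 'author': 'Carlos H. Morimoto', 'venue': '...'}
print(normalize(dblp_record, "dblp"))        # {'title': '...', 'author': 'Carlos Hitoshi Morimoto', 'venue': '...'}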

Deduplication is the task of identifying in a data repository duplicated records that refer to the same real-world entity. These records may be hard to identify due to, as mentioned before, variations in spelling, writing style, metadata standard use, or even typos (Carvalho, Gonçalves, Laender, & da Silva, 2006). Deduplication is also known as record linkage, object matching or instance matching (Doan, Noy, & Halevy, 2004). In fact, this is not a new problem but a long-standing issue, for example, in the Library and Information Science field, in the context of Online Public Access Catalogs (OPACs) (Large & Beheshti, 1997), as well as in the database realm.

Several approaches to record deduplication have been proposed in recent years (Bilenko and Mooney, 2003, Dorneles et al., 2009, Carvalho et al., 2006, Carvalho et al., 2008, Chaudhuri et al., 2003, Cohen and Richman, 2002, Tejada et al., 2001). Most of these works focus on the deduplication task in the context of the integration of relational databases. Few automatic approaches, if any, have been specifically developed for the realm of digital libraries or, in a more general sense, for bibliographic metadata records. For example, the metadata fields that specify the authors of a digital object are among the most discriminative fields of a record, and this information should be used as strong evidence for the deduplication process. In fact, there may be several objects with similar titles, but there is a very small chance that they will also have authors with similar names and still refer to different real-world objects. For instance, Baeza-Yates and Ribeiro-Neto as well as Manning have published books with similar titles (Baeza-Yates and Ribeiro-Neto, 1999, Manning et al., 2008). Another specific problem to deal with is the variation in the way author names are represented in bibliographic citations. Variations include abbreviations, inversions of names, different spellings and the omission of suffixes such as Jr. (Ley, 2002). Deduplication techniques applied to the digital library domain should therefore take into special consideration the fields that refer to author names in order to correctly identify duplicated metadata records. These techniques may even exploit a number of other sources of information, such as authority files, to help with the task of comparing author names, although this is not the focus of this work.
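As an illustration of the kind of heuristic such considerations lead to, the following sketch compares author names while tolerating abbreviations and "Surname, Initials" inversions. It is not one of the similarity functions proposed in this article (those are defined in Section 3); the matching rules below are assumptions made purely for illustration.

# Minimal sketch (not the functions proposed in this article): an
# initials-aware author name comparison that tolerates abbreviations and
# "Surname, Initials" inversions, as in the variants quoted above.
import re

def name_parts(name: str):
    """Split a name into (surname, given-name parts), handling 'Surname, Given' order."""
    if "," in name:
        surname, given = name.split(",", 1)
    else:
        tokens = name.split()
        surname, given = tokens[-1], " ".join(tokens[:-1])
    given_parts = [p for p in re.split(r"[.\s]+", given) if p]
    return surname.strip().lower(), [p.lower() for p in given_parts]

def same_author(a: str, b: str) -> bool:
    """Surnames must match exactly; given names match if each pair agrees on its
    initial and, when both are spelled out, on the full form."""
    surname_a, given_a = name_parts(a)
    surname_b, given_b = name_parts(b)
    if surname_a != surname_b:
        return False
    for ga, gb in zip(given_a, given_b):
        if ga[0] != gb[0]:
            return False
        if len(ga) > 1 and len(gb) > 1 and ga != gb:
            return False
    return True

assert same_author("Carlos H. Morimoto", "Morimoto, C.H.")
assert same_author("Carlos H. Morimoto", "Carlos Hitoshi Morimoto")
assert not same_author("Carlos H. Morimoto", "Carlos H. Manning")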

This article presents an approach to identifying duplicated bibliographic metadata records. We assume that a mapping between the metadata fields in different standards is provided and focus on the application of specially designed similarity functions to the metadata content. We are aware of the problem of schema matching (Rahm & Bernstein, 2001); however, it is not the focus of our work, and recent solutions from the literature could be used for it (Fagin et al., 2009). Here, instead, we are interested in the instance matching problem, specifically in the realm of digital libraries. In this context, the main contributions of this work are:

  • an efficient and effective approach for metadata record deduplication that is based on a set of similarity functions specially designed for the digital library domain;

  • the identification and analysis of the failure cases of the evaluated deduplication functions, which are valuable for the development of new approaches for automatic bibliographic metadata deduplication.

The quality of the proposed deduplication functions is evaluated in experiments with two real datasets and compared with four other related approaches. The results of the experiments show that the proposed functions improve the deduplication quality from 2% (with regard to an already very accurate result which is difficult to improve) to 62% when identifying replicas in a dataset containing portions of the metadata records from two real digital libraries, and from 7% to 188% in a dataset with article citation data. We also show that our approach achieves slightly superior results when compared to a state-of-the-art method for replica identification based on genetic programming, without the burden and cost of any training process.

The rest of this article is organized as follows. In Section 2, we discuss related work. In Section 3, we present our approach to deduplicating bibliographic metadata and define a set of functions and algorithms specific to this approach, which is especially designed for bibliographic metadata. In Section 4, we give details on the performed experiments and discuss the obtained results. Finally, in Section 5, we draw our conclusions and point out some future work directions.

Section snippets

Related work

Chaudhuri et al. (2003) explore a vector space model representation of records and propose a similarity function that considers the weight of words using the Inverse Document Frequency (IDF) (Baeza-Yates & Ribeiro-Neto, 1999). Carvalho and da Silva (2003) also use the vector space model to calculate the similarity between objects from multiple sources. Their approach can be used to deduplicate objects with complex structures such as XML documents.
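As a concrete illustration of this vector space view of records, the sketch below computes an IDF-weighted cosine similarity between record strings. It is a generic textbook formulation, not the specific fuzzy match function of Chaudhuri et al. (2003) nor the structure-aware similarity of Carvalho and da Silva (2003); the tokenization and weighting choices here are assumptions.

# Generic IDF-weighted cosine similarity between records represented as token
# vectors (vector space model). This is an illustration, not the exact
# functions used in the works cited above.
import math
from collections import Counter

def idf_weights(records):
    """Inverse document frequency of each token over the whole record collection."""
    n = len(records)
    df = Counter(token for rec in records for token in set(rec.lower().split()))
    return {token: math.log(n / count) for token, count in df.items()}

def cosine_similarity(a, b, idf):
    """Cosine of the IDF-weighted term-frequency vectors of two records."""
    wa = {t: tf * idf.get(t, 0.0) for t, tf in Counter(a.lower().split()).items()}
    wb = {t: tf * idf.get(t, 0.0) for t, tf in Counter(b.lower().split()).items()}
    dot = sum(w * wb.get(t, 0.0) for t, w in wa.items())
    na = math.sqrt(sum(w * w for w in wa.values()))
    nb = math.sqrt(sum(w * w for w in wb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy collection built from titles that appear in the reference list.
records = ["Modern information retrieval",
           "Learning object identification rules for information integration",
           "Robust and efficient fuzzy match for online data cleaning"]
idf = idf_weights(records)
print(cosine_similarity(records[0], records[1], idf))  # small but non-zero: only "information" is shared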

Dorneles, Heuser, Lima, da Silva, and de Moura

The metadata deduplication approach

This section presents our approach to deduplicating bibliographic metadata records. We define as duplicates or replicas two or more metadata records that are semantically equivalent, i.e., records that describe the same publication item (digital object) indexed by a digital library. We assume that the mapping between metadata standards or structures is provided. Again, we are aware of the problem of schema matching, but it is outside the scope of this work.
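The snippet above is truncated, so the concrete similarity functions are not reproduced here. Purely as an outline of the general shape of such a heuristic decision, the sketch below combines field-level similarities of already-normalized records; the weights, the threshold and the plugged-in functions are placeholders, not the values or functions defined in Section 3.

# Outline only: a weighted combination of field-level similarities for a pair
# of normalized records. Weights, threshold and the plugged-in similarity
# functions are placeholders, not the actual definitions of Section 3.

def is_duplicate(rec_a, rec_b, title_sim, author_sim, threshold=0.75):
    """Declare two normalized records duplicates when the weighted combination of
    their title and author similarities reaches the threshold."""
    score = 0.6 * title_sim(rec_a["title"], rec_b["title"]) \
            + 0.4 * author_sim(rec_a["author"], rec_b["author"])
    return score >= threshold

# Usage idea: plug in the sketches above, e.g.
# is_duplicate(norm_a, norm_b,
#              title_sim=lambda x, y: cosine_similarity(x, y, idf),
#              author_sim=lambda x, y: 1.0 if same_author(x, y) else 0.0)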

Experimental evaluation

In this section, we describe the experiments we conducted in order to empirically validate and assess the quality of our bibliographic metadata deduplication approach. Two real datasets were used in the experiments. The first dataset contains a portion of the metadata records of two real digital libraries, and the second one is composed of article citation data. The experiments are divided into two groups:

  1. The proposed functions are compared with four different baselines (Guth, Acronyms, Fragments
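For reference, deduplication quality in experiments of this kind is commonly summarized by pairwise precision, recall and F1 over the identified duplicate pairs. The short sketch below illustrates these metrics with hypothetical record identifiers; it is not necessarily the exact evaluation protocol used in the article.

# Pairwise precision, recall and F1 over duplicate pairs: a common way to
# score deduplication output, shown here only as an illustration.

def pairwise_quality(predicted, gold):
    """Compare predicted duplicate pairs against gold-standard pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical record identifiers, for illustration only.
predicted = {frozenset({"r1", "r2"}), frozenset({"r3", "r4"})}
gold = {frozenset({"r1", "r2"}), frozenset({"r5", "r6"})}
print(pairwise_quality(predicted, gold))  # (0.5, 0.5, 0.5)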

Conclusions

This paper proposes an effective and efficient heuristic-based approach to deduplicating bibliographic metadata. We propose a set of functions that are applied together to identify duplicate records with high precision and recall. Unlike several related approaches presented in Section 2, which are based on machine learning techniques, ours does not require any type of training, which can be very expensive to carry out.

The experimental results show that the performance of the

Acknowledgments

This research is partially supported by the Brazilian National Institute for Web Research (Grant number 573871/2008-6), CNPq Universal project ApproxMatch (Grant number 481055/2007-0), MCT/CNPq/CT-INFO projects Gestão de Grandes Volumes de Documentos Textuais (Grant number 550891/2007-2) and InfoWeb (Grant number 550874/2007-0), and by the authors' scholarships and individual research grants from CAPES, CNPq and FAPEMIG.

References (31)

  • C.F. Dorneles et al. A strategy for allowing meaningful and comparable scores in approximate matching. Information Systems (2009).
  • A. Large et al. OPACs: A research review. Library & Information Science Research (1997).
  • S. Tejada et al. Learning object identification rules for information integration. Information Systems (2001).
  • R. Baeza-Yates et al. Modern information retrieval (1999).
  • Bilenko, M., & Mooney, R. J. (2003). Adaptive duplicate detection using learnable string similarity measures. In...
  • Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In Proceedings...
  • Carvalho, J. C. P., & da Silva, A. S. (2003). Finding similar identities among objects from multiple web sources. In...
  • Carvalho, M. G., Gonçalves, M. A., Laender, A. H. F., & da Silva, A. S. (2006). Learning to deduplicate. In Proceedings...
  • Carvalho, M. G., Laender, A. H. F., Gonçalves, M. A., & da Silva, A. S. (2008). Replica identification using genetic...
  • Chaudhuri, S., Ganjam, K., Ganti, V., & Motwani, R. (2003). Robust and efficient fuzzy match for online data cleaning....
  • Cohen, W. W., & Richman, J. (2002). Learning to match and cluster large high-dimensional data sets for data integration....
  • Convis, D. B., Glickman, D., & Rosenbaum, W. S. (1982). Alpha content match prescan method for automatic spelling error...
  • Cota, R., Gonçalves, M. A., & Laender, A. H. F. (2007). A heuristic-based hierarchical clustering method for author...
  • A. Doan et al. Introduction to the special issue on semantic integration. SIGMOD Record (2004).
  • Dorneles, C. F., Heuser, C. A., Lima, A. E. N., da Silva, A. S., & de Moura, E. S. (2004). Measuring similarity between...
Cited by (22)

    • Configurable assembly of classification rules for enhancing entity resolution results

      2020, Information Processing and Management
      Citation Excerpt:

      The task of identifying duplicate entities is denominated Entity Resolution (ER) (also known as Deduplication, Entity Matching and others). As depicted in Fig. 1, a typical workflow for executing this task involves the following steps (Christen, 2012a): i) Indexing (Christen, 2012b; Steorts, Ventura, Sadinle, & Fienberg, 2014), which aims to identify pairs of entities that share common properties (e.g., a pair of entities that produces the same blocking key value) to produce candidate pairs of entities to be considered for comparison; ii) pruning (Araújo, Pires, & da Nóbrega, 2017; Papadakis, Koutrika, Palpanas, & Nejdl, 2014a; Papadakis, Papastefanatos, Palpanas, & Koubarakis, 2016), which aims to filter the candidate pairs of entities by selecting the most promising ones to be compared; iii) comparison, which performs the actual comparison between the entities, usually employing and aggregating similarity values produced by similarity functions; and iv) classification (Borges, de Carvalho, Galante, Gonçalves, & Laender, 2011; Christen, 2008; do Nascimento, Pires, & Mestre, 2018), which aims to indicate, based on the computed similarities, which pairs of entities represent the same real-world object. Nowadays, these steps are challenged by the underlying characteristics of contemporary datasets (Gruenheid, Dong, & Srivastava, 2014): significant volume of data, high frequency of data updates, a variety of representation formats and privacy requirements (da Nóbrega, Pires, Araújo, & Mestre, 2018).

    • DedupCloud: An optimized efficient virtual machine deduplication algorithm in cloud computing environment

      2020, Data Deduplication Approaches: Concepts, Strategies, and Challenges
    • Essentials of data deduplication using open-source toolkit

      2020, Data Deduplication Approaches: Concepts, Strategies, and Challenges