Abstract

Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this article, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and a brief discussion of the major open problems in the area.
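One of the character-based similarity metrics commonly discussed in this literature is Levenshtein edit distance. The sketch below (an illustration, not code from the article; the function names are ours) shows how an edit-distance score can be normalized into a field-level similarity in [0, 1] for comparing record fields such as names:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def field_similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical fields."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# Two field entries differing by one transcription error score highly:
print(field_similarity("Jon Smith", "John Smith"))  # 0.9
```

In practice, a duplicate detection algorithm would combine such per-field scores across all fields of a record pair before deciding whether the pair is an approximate duplicate.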
