Conference,

Indexing based Genetic Programming Approach to Record Deduplication

(2013)

Abstract

In this paper, we present a genetic programming (GP) approach to record deduplication that incorporates indexing techniques. Data deduplication is the process of cleaning data of duplicate records that arise from misspellings, swapped fields, or other errors and inconsistencies. It requires identifying objects that appear in more than one list. Detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning and data quality in data warehouses, so an algorithm is needed that detects and eliminates as many duplicates as possible. GP with indexing is an optimization technique that helps find the maximum number of duplicates in a database. We use a deduplication function that identifies whether two or more entries in a repository are replicas. Many industries and systems depend on the accuracy and reliability of databases to carry out their operations; the quality of the information stored in those databases can therefore have significant cost implications for a system that relies on that information to function and conduct business. Moreover, clean and replica-free repositories not only allow the retrieval of higher-quality information but also lead to more concise data and to potential savings in the computational time and resources needed to process it.
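
As a rough illustration of the two ideas named in the abstract, the sketch below pairs a simple index (blocking) step with a pairwise deduplication function. It is not the paper's method: the field names `name` and `city`, the three-letter blocking key, the weights 0.7 and 0.3, and the 0.8 threshold are all assumptions made for the example, and the hand-written `is_replica` only stands in for the kind of function a GP run would evolve.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def blocking_key(record):
    # Assumed indexing scheme: group records by the first three
    # letters of the (hypothetical) "city" field.
    return record["city"].strip().lower()[:3]


def build_index(records):
    """Map each blocking key to the positions of the records that share it."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[blocking_key(record)].append(position)
    return index


def is_replica(r1, r2, threshold=0.8):
    # Hand-written stand-in for an evolved deduplication function:
    # a weighted sum of field similarities compared with a threshold.
    # The weights and threshold are illustrative assumptions.
    score = (0.7 * similarity(r1["name"], r2["name"])
             + 0.3 * similarity(r1["city"], r2["city"]))
    return score >= threshold


def find_duplicates(records):
    """Compare only records that fall into the same index block."""
    pairs = []
    for positions in build_index(records).values():
        for i, j in combinations(positions, 2):
            if is_replica(records[i], records[j]):
                pairs.append((i, j))
    return sorted(pairs)


records = [
    {"name": "John Smith", "city": "Boston"},
    {"name": "Jon Smith", "city": "Boston"},
    {"name": "Mary Jones", "city": "Boston"},
    {"name": "John Smith", "city": "Chicago"},
]

print(find_duplicates(records))
```

The index keeps the number of comparisons below the full all-pairs count, at the cost of missing duplicates whose blocking keys differ, for example when the error is in the first letters of the key field.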
