Near-duplicate Detection by Instance-level Constrained Clustering
, and .
Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, page 421--428. New York, NY, USA, ACM, (2006)

For the task of near-duplicated document detection, both traditional fingerprinting techniques used in database community and bag-of-word comparison approaches used in information retrieval community are not sufficiently accurate. This is due to the fact that the characteristics of near-duplicated documents are different from that of both älmost-identical" documents in the data cleaning task and "relevant" documents in the search task. This paper presents an instance-level constrained clustering approach for near-duplicate detection. The framework incorporates information such as document attributes and content structure into the clustering process to form near-duplicate clusters. Gathered from several collections of public comments sent to U.S. government agencies on proposed new regulations, the experimental results demonstrate that our approach outperforms other near-duplicate detection algorithms and as about as effective as human assessors.
  • @brusilovsky
  • @stroeh
  • @aho
  • @hotho
  • @dblp
This publication has not been reviewed yet.

rating distribution
average user rating0.0 out of 5.0 based on 0 reviews
    Please log in to take part in the discussion (add own reviews or comments).