Near-duplicate Detection by Instance-level Constrained Clustering

H. Yang and J. Callan. Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 421–428. New York, NY, USA, ACM (2006)
DOI: 10.1145/1148170.1148243

Abstract

For the task of near-duplicate document detection, neither the traditional fingerprinting techniques used in the database community nor the bag-of-words comparison approaches used in the information retrieval community are sufficiently accurate. This is because the characteristics of near-duplicate documents differ from those of both "almost-identical" documents in the data cleaning task and "relevant" documents in the search task. This paper presents an instance-level constrained clustering approach for near-duplicate detection. The framework incorporates information such as document attributes and content structure into the clustering process to form near-duplicate clusters. Experimental results on several collections of public comments sent to U.S. government agencies on proposed new regulations demonstrate that our approach outperforms other near-duplicate detection algorithms and is about as effective as human assessors.

Links and resources

BibTeX key: citeulike:2295267
entry type: inproceedings
address: New York, NY, USA
booktitle: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
year: 2006
pages: 421--428
publisher: ACM
series: SIGIR '06
citeulike-article-id: 2295267
citeulike-linkout-0: http://portal.acm.org/citation.cfm?id=1148243
isbn: 1-59593-369-7
citeulike-linkout-1: http://dx.doi.org/10.1145/1148170.1148243
location: Seattle, Washington, USA
priority: 2
posted-at: 2011-01-11 17:15:07
DOI: 10.1145/1148170.1148243
url: http://dx.doi.org/10.1145/1148170.1148243

Tags

clustering



