Detecting near-duplicates for web crawling

G. Manku, A. Jain, and A. Sarma. WWW '07: Proceedings of the 16th international conference on World Wide Web, pages 141--150. New York, NY, USA, ACM, (2007)
DOI: http://doi.acm.org/10.1145/1242572.1242592

Abstract

Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar's fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and all batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.
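
The two ideas named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the 64-bit width, the SHA-256 token hash, the unweighted tokens, and the helper names (token_hash, simhash, hamming_distance, near_duplicates) are all assumptions made for illustration. The lookup shown is a plain linear scan, which only demonstrates the "at most k bit-positions" condition; the paper's contribution is a faster technique for that search, which this sketch does not reproduce.

```python
import hashlib
from typing import Iterable, List

F = 64  # fingerprint width in bits (assumed for this sketch)


def token_hash(token: str) -> int:
    """Hash one token to an F-bit integer (SHA-256 is an arbitrary choice)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: F // 8], "big")


def simhash(tokens: Iterable[str]) -> int:
    """Charikar-style fingerprint: per-bit vote over all token hashes.

    Every token has weight 1 here; a real system would likely weight features.
    """
    counts = [0] * F
    for tok in tokens:
        h = token_hash(tok)
        for i in range(F):
            # A set bit votes +1 for position i, a clear bit votes -1.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(F):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def near_duplicates(query: int, stored: List[int], k: int) -> List[int]:
    """Linear scan for stored fingerprints within k bits of the query."""
    return [fp for fp in stored if hamming_distance(query, fp) <= k]
```

With this sketch, two pages would be treated as near-duplicates when hamming_distance(simhash(tokens_a), simhash(tokens_b)) is at most k, for a small k chosen by the operator.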

Description

THIS IS A COMMENT

Links and resources

BibTeX key: 1242592
entry type: inproceedings
address: New York, NY, USA
booktitle: WWW '07: Proceedings of the 16th international conference on World Wide Web
year: 2007
pages: 141--150
publisher: ACM
location: Banff, Alberta, Canada
isbn: 978-1-59593-654-7
DOI: http://doi.acm.org/10.1145/1242572.1242592
url: http://portal.acm.org/citation.cfm?id=1242592#
