
Predicting Pair Similarities for Near-Duplicate Detection in High Dimensional Spaces

M. Fisichella, A. Ceroni, F. Deng, and W. Nejdl. Database and Expert Systems Applications, volume 8645 of Lecture Notes in Computer Science, pages 59-73. Springer International Publishing, 2014.
DOI: 10.1007/978-3-319-10085-2_5

Abstract

The problem of near-duplicate detection consists of finding those elements within a data set that are closest to a new input element, according to a given distance function and a given closeness threshold. Solving this problem for high-dimensional data sets is computationally expensive, since the amount of computation required to assess the similarity between any two elements increases with the number of dimensions. As a motivating example, an image or video sharing website would benefit from detecting near-duplicates whenever new multimedia content is uploaded. Among different approaches, near-duplicate detection in high-dimensional data sets has been effectively addressed by SimPair LSH [11]. Built on top of Locality Sensitive Hashing (LSH), SimPair LSH computes and stores a small set of near-duplicate pairs in advance, and uses them to prune the candidate set generated by LSH for a given new element. In this paper, we develop an algorithm to predict a lower bound on the number of elements pruned by SimPair LSH from the candidate set generated by LSH. Since the computational overhead that SimPair LSH incurs to compute near-duplicate pairs in advance is repaid by the ability to use that information to prune the candidate set, predicting the number of pruned points is crucial. The pruning prediction has been evaluated through experiments over three real-world data sets. We also performed further experiments on SimPair LSH, confirming that it consistently outperforms LSH with respect to memory space and running time.
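The candidate-then-prune idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the pruning rule, so the triangle-inequality test below is an assumption, as are the names `ToyLSH`, `query_with_pruning`, `pair_dist`, and `tau`, the use of Euclidean distance, and the random-projection hash family.

```python
import math
import random
from collections import defaultdict


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ToyLSH:
    """Toy random-projection LSH index (illustrative, not SimPair LSH itself)."""

    def __init__(self, dim, num_tables=4, hashes_per_table=2, width=4.0, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.points = {}
        self.tables = []
        for _ in range(num_tables):
            # Each hash function is a random direction plus a random offset.
            funcs = [([rng.gauss(0, 1) for _ in range(dim)], rng.uniform(0, width))
                     for _ in range(hashes_per_table)]
            self.tables.append((funcs, defaultdict(list)))

    def _key(self, funcs, v):
        return tuple(
            math.floor((sum(a * x for a, x in zip(proj, v)) + b) / self.width)
            for proj, b in funcs
        )

    def add(self, pid, v):
        self.points[pid] = v
        for funcs, buckets in self.tables:
            buckets[self._key(funcs, v)].append(pid)

    def candidates(self, v):
        # Union of the buckets the query falls into, one bucket per table.
        out = set()
        for funcs, buckets in self.tables:
            out.update(buckets.get(self._key(funcs, v), []))
        return out


def query_with_pruning(query, candidates, points, pair_dist, tau):
    """Return (ids within tau of query, number of distance computations).

    pair_dist maps a point id to a list of (other_id, distance) pairs that
    were computed and stored in advance (hypothetical layout).
    """
    results, pruned, computed = [], set(), 0
    for pid in sorted(candidates):
        if pid in pruned:
            continue  # skipped without computing a distance
        d = euclidean(query, points[pid])
        computed += 1
        if d <= tau:
            results.append(pid)
        for other, d_pair in pair_dist.get(pid, []):
            # Assumed rule: d(q, other) >= d(q, pid) - d(pid, other) > tau,
            # so `other` cannot be a near-duplicate of the query.
            if other in candidates and d - d_pair > tau:
                pruned.add(other)
    return results, computed


# Usage: points 0 and 1 are a stored near-duplicate pair far from the query.
pts = {0: (10.0, 0.0), 1: (10.5, 0.0), 2: (0.1, 0.0)}
pairs = {0: [(1, 0.5)], 1: [(0, 0.5)]}
print(query_with_pruning((0.0, 0.0), {0, 1, 2}, pts, pairs, tau=1.0))  # ([2], 2)
```

In this usage example only two of the three candidate distances are computed, because point 1 is ruled out through its stored pair with point 0. The number of such skipped candidates is the quantity for which the paper predicts a lower bound.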


Cite this publication

%0 Book Section
%1 noKey
%A Fisichella, Marco
%A Ceroni, Andrea
%A Deng, Fan
%A Nejdl, Wolfgang
%B Database and Expert Systems Applications
%D 2014
%E Decker, Hendrik
%E Lhotská, Lenka
%E Link, Sebastian
%E Spies, Marcus
%E Wagner, Roland R.
%I Springer International Publishing
%K duraark hashing high-dimensional locality myown near-duplicates sensitive spaces sync3
%P 59-73
%R 10.1007/978-3-319-10085-2_5
%T Predicting Pair Similarities for Near-Duplicate Detection in High Dimensional Spaces
%U http://dx.doi.org/10.1007/978-3-319-10085-2_5
%V 8645
%X The problem of near–duplicate detection consists in finding those elements within a data set which are closest to a new input element, according to a given distance function and a given closeness threshold. Solving such problem for high–dimensional data sets is computationally expensive, since the amount of computation required to assess the similarity between any two elements increases with the number of dimensions. As a motivating example, an image or video sharing website would take advantage of detecting near–duplicates whenever new multimedia content is uploaded. Among different approaches, near–duplicate detection in high–dimensional data sets has been effectively addressed by SimPair LSH [11]. Built on top of Locality Sensitive Hashing (LSH), SimPair LSH computes and stores a small set of near-duplicate pairs in advance, and uses them to prune the candidate set generated by LSH for a given new element. In this paper, we develop an algorithm to predict a lower bound of the number of elements pruned by SimPair LSH from the candidate set generated by LSH. Since the computational overhead introduced by SimPair LSH to compute near-duplicate pairs in advance is rewarded by the possibility of using that information to prune the candidate set, predicting the number of pruned points would be crucial. The pruning prediction has been evaluated through experiments over three real–world data sets. We also performed further experiments on SimPair LSH, confirming that it consistently outperforms LSH with respect to memory space and running time.
%@ 978-3-319-10084-5

@incollection{noKey,
  abstract = {The problem of near–duplicate detection consists in finding those elements within a data set which are closest to a new input element, according to a given distance function and a given closeness threshold. Solving such problem for high–dimensional data sets is computationally expensive, since the amount of computation required to assess the similarity between any two elements increases with the number of dimensions. As a motivating example, an image or video sharing website would take advantage of detecting near–duplicates whenever new multimedia content is uploaded. Among different approaches, near–duplicate detection in high–dimensional data sets has been effectively addressed by SimPair LSH [11]. Built on top of Locality Sensitive Hashing (LSH), SimPair LSH computes and stores a small set of near-duplicate pairs in advance, and uses them to prune the candidate set generated by LSH for a given new element. In this paper, we develop an algorithm to predict a lower bound of the number of elements pruned by SimPair LSH from the candidate set generated by LSH. Since the computational overhead introduced by SimPair LSH to compute near-duplicate pairs in advance is rewarded by the possibility of using that information to prune the candidate set, predicting the number of pruned points would be crucial. The pruning prediction has been evaluated through experiments over three real–world data sets. We also performed further experiments on SimPair LSH, confirming that it consistently outperforms LSH with respect to memory space and running time.},
  added-at = {2014-09-02T12:28:05.000+0200},
  author = {Fisichella, Marco and Ceroni, Andrea and Deng, Fan and Nejdl, Wolfgang},
  biburl = {https://www.bibsonomy.org/bibtex/25babdd3ea8fc74353c6446b7b65abccf/xander71988},
  booktitle = {Database and Expert Systems Applications},
  description = {Predicting Pair Similarities for Near-Duplicate Detection in High Dimensional Spaces - Springer},
  doi = {10.1007/978-3-319-10085-2_5},
  editor = {Decker, Hendrik and Lhotská, Lenka and Link, Sebastian and Spies, Marcus and Wagner, Roland R.},
  interhash = {b1fc30b516efe7d5ef9f487e91381db8},
  intrahash = {5babdd3ea8fc74353c6446b7b65abccf},
  isbn = {978-3-319-10084-5},
  keywords = {duraark hashing high-dimensional locality myown near-duplicates sensitive spaces sync3},
  language = {English},
  pages = {59-73},
  publisher = {Springer International Publishing},
  series = {Lecture Notes in Computer Science},
  timestamp = {2014-09-02T12:28:05.000+0200},
  title = {Predicting Pair Similarities for Near-Duplicate Detection in High Dimensional Spaces},
  url = {http://dx.doi.org/10.1007/978-3-319-10085-2_5},
  volume = 8645,
  year = 2014
}
