
In Defense of MinHash Over SimHash

A. Shrivastava and P. Li. Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 33, Reykjavik, Iceland (2014)

Abstract

MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem is an important question that has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments): MinHash virtually always outperforms SimHash when the data are binary, as is common in practice, for example in search. The collision probability of MinHash is a function of resemblance similarity ($R$), while the collision probability of SimHash is a function of cosine similarity ($S$). To provide a common basis for comparison, we evaluate retrieval results in terms of $S$ for both MinHash and SimHash. This evaluation is valid because we can prove that MinHash is a valid LSH with respect to $S$, using the general inequality $S^2 \le R \le \frac{S}{2-S}$. Our worst-case analysis shows that MinHash significantly outperforms SimHash in the high-similarity region. Interestingly, our intensive experiments reveal that MinHash is also substantially better than SimHash even on datasets where most of the data points are not too similar to each other. This is partly because, in practical data, $R \ge \frac{S}{z-S}$ often holds, where $z$ is only slightly larger than 2 (e.g., $z \le 2.1$). Our restricted worst-case analysis, which assumes $\frac{S}{z-S} \le R \le \frac{S}{2-S}$, shows that MinHash indeed significantly outperforms SimHash even in the low-similarity region. We believe the results in this paper will provide valuable guidelines for search in practice, especially when the data are sparse.
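The quantities named in the abstract can be checked numerically on a small example. The sketch below is illustrative only and is not taken from the paper: the function names are hypothetical, binary data are represented as non-empty Python sets, and the SimHash collision probability uses the standard sign-random-projection formula 1 - arccos(S)/pi. The abstract does not define z; the code assumes z = sqrt(f1/f2) + sqrt(f2/f1), where f1 and f2 are the set sizes. With that choice, R = S/(z - S) follows directly from the definitions of R and S, and z >= 2 gives the upper bound S/(2 - S).

```python
import math


def resemblance(a: set, b: set) -> float:
    """Resemblance (Jaccard) similarity R = |A & B| / |A | B|."""
    return len(a & b) / len(a | b)


def cosine_binary(a: set, b: set) -> float:
    """Cosine similarity S = |A & B| / sqrt(|A| * |B|) for binary vectors."""
    return len(a & b) / math.sqrt(len(a) * len(b))


def z_ratio(a: set, b: set) -> float:
    """Assumed definition: z = sqrt(f1/f2) + sqrt(f2/f1), always >= 2."""
    f1, f2 = len(a), len(b)
    return math.sqrt(f1 / f2) + math.sqrt(f2 / f1)


def minhash_collision(r: float) -> float:
    """Probability that one MinHash value agrees: equal to R."""
    return r


def simhash_collision(s: float) -> float:
    """Probability that one SimHash bit agrees: 1 - arccos(S) / pi."""
    return 1.0 - math.acos(min(1.0, s)) / math.pi


def within_general_bounds(a: set, b: set, tol: float = 1e-12) -> bool:
    """Check the general inequality S^2 <= R <= S / (2 - S)."""
    r, s = resemblance(a, b), cosine_binary(a, b)
    return s * s - tol <= r <= s / (2.0 - s) + tol


# Hypothetical example: two overlapping sets of feature indices.
A = set(range(0, 60))    # f1 = 60
B = set(range(20, 100))  # f2 = 80, overlap = 40, union = 100

R = resemblance(A, B)    # 40 / 100 = 0.4
S = cosine_binary(A, B)  # 40 / sqrt(4800)
z = z_ratio(A, B)

print(f"R = {R:.4f}, S = {S:.4f}, z = {z:.4f}")
print(f"S^2 = {S * S:.4f}, S/(2-S) = {S / (2 - S):.4f}")
print(f"MinHash collision prob. = {minhash_collision(R):.4f}")
print(f"SimHash collision prob. = {simhash_collision(S):.4f}")
```

In this example the two sets have similar sizes, so z is close to 2 and R sits near the upper bound S/(2 - S), which is the regime the abstract describes as common in practical data.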

Links and resources

BibTeX key: Ping2014
entry type: inproceedings
address: Reykjavik, Iceland
booktitle: Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS)
year: 2014
volume: 33
file: shrivastava14.pdf:http\://jmlr.org/proceedings/papers/v33/shrivastava14.pdf:PDF
owner: vilhuber
Document: http://jmlr.org/proceedings/papers/v33/shrivastava14.html


Cite this publication

@inproceedings{Ping2014,
  abstract = {MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem is an important question that has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments): MinHash virtually always outperforms SimHash when the data are binary, as is common in practice, for example in search. The collision probability of MinHash is a function of resemblance similarity ($R$), while the collision probability of SimHash is a function of cosine similarity ($S$). To provide a common basis for comparison, we evaluate retrieval results in terms of $S$ for both MinHash and SimHash. This evaluation is valid because we can prove that MinHash is a valid LSH with respect to $S$, using the general inequality $S^2 \le R \le \frac{S}{2-S}$. Our worst-case analysis shows that MinHash significantly outperforms SimHash in the high-similarity region. Interestingly, our intensive experiments reveal that MinHash is also substantially better than SimHash even on datasets where most of the data points are not too similar to each other. This is partly because, in practical data, $R \ge \frac{S}{z-S}$ often holds, where $z$ is only slightly larger than 2 (e.g., $z \le 2.1$). Our restricted worst-case analysis, which assumes $\frac{S}{z-S} \le R \le \frac{S}{2-S}$, shows that MinHash indeed significantly outperforms SimHash even in the low-similarity region. We believe the results in this paper will provide valuable guidelines for search in practice, especially when the data are sparse.},
  added-at = {2015-01-19T16:50:57.000+0100},
  address = {Reykjavik, Iceland},
  author = {Shrivastava, Anshumali and Li, Ping},
  biburl = {https://www.bibsonomy.org/bibtex/202c0771c5decc841b9b231f7c97ee5ef/ncrn-cornell},
  booktitle = {Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS)},
  file = {shrivastava14.pdf:http\://jmlr.org/proceedings/papers/v33/shrivastava14.pdf:PDF},
  interhash = {d3312ee5edfb4842bc01d9b4fd883287},
  intrahash = {02c0771c5decc841b9b231f7c97ee5ef},
  keywords = {imported},
  owner = {vilhuber},
  timestamp = {2015-01-19T16:50:57.000+0100},
  title = {In Defense of MinHash Over SimHash},
  url = {http://jmlr.org/proceedings/papers/v33/shrivastava14.html},
  volume = 33,
  year = 2014
}
