P. Li, A. Owen, and C.-H. Zhang. One Permutation Hashing. In Advances in Neural Information Processing Systems 25, 2012.
Abstract
While minwise hashing is promising for large-scale learning on massive binary data, the preprocessing cost is prohibitive because it requires applying, e.g., k = 500 permutations to the data. Testing time is also expensive when a new data point (e.g., a new document or a new image) has not yet been processed. In this paper, we develop a simple one permutation hashing scheme to address this important issue. While it is true that the preprocessing step can be parallelized, that comes at the cost of additional hardware and implementation effort. Moreover, reducing k permutations to just one is far more energy-efficient, an important consideration since minwise hashing is widely deployed in the search industry. Beyond the theoretical probability analysis, our experiments on similarity estimation and on SVM and logistic regression confirm the theoretical results.
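The contrast in the abstract can be sketched in a few lines: classic minwise hashing simulates k independent permutations and keeps one minimum per permutation, while one permutation hashing permutes the data once, splits the range into k bins, and keeps the minimum within each bin. This is a hedged illustration, not the authors' reference implementation: the linear map (a·x + b) mod p stands in for a true random permutation (valid because p is prime), and empty bins are simply marked None rather than handled by the paper's estimators.

```python
import random

P = 10007  # prime modulus, so (a*x + b) % P is a permutation of [0, P)

def minwise_hash(feature_set, k, seed=0):
    """Classic minwise hashing: k independent permutations (simulated
    with k random linear hash functions); preprocessing cost grows with k."""
    rng = random.Random(seed)
    sketches = []
    for _ in range(k):
        a = rng.randrange(1, P)
        b = rng.randrange(P)
        sketches.append(min((a * x + b) % P for x in feature_set))
    return sketches

def one_permutation_hash(feature_set, k, seed=0):
    """One permutation hashing: permute once, divide [0, P) into k
    equal-width bins, and record the minimum permuted value per bin.
    A bin that receives no elements stays None (an 'empty bin')."""
    rng = random.Random(seed)
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    bin_width = P // k
    bins = [None] * k
    for x in feature_set:
        h = (a * x + b) % P
        j = min(h // bin_width, k - 1)  # clamp the last, slightly wider bin
        if bins[j] is None or h < bins[j]:
            bins[j] = h
    return bins
```

Note the asymmetry the abstract stresses: `minwise_hash` scans the data k times (one pass per permutation), whereas `one_permutation_hash` makes a single pass regardless of k, which is where the preprocessing and energy savings come from.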
%0 Book Section
%1 NIPS2012_1436
%A Li, Ping
%A Owen, Art
%A Zhang, Cun-Hui
%B Advances in Neural Information Processing Systems 25
%D 2012
%E Bartlett, P.
%E Pereira, F.C.N.
%E Burges, C.J.C.
%E Bottou, L.
%E Weinberger, K.Q.
%K imported
%P 3122--3130
%T One Permutation Hashing
%U http://papers.nips.cc/paper/4778-one-permutation-hashing
%X While minwise hashing is promising for large-scale learning in massive binary data, the preprocessing cost is prohibitive as it requires applying (e.g.,) k=500 permutations on the data. The testing time is also expensive if a new data point (e.g., a new document or a new image) has not been processed. In this paper, we develop a simple one permutation hashing scheme to address this important issue. While it is true that the preprocessing step can be parallelized, it comes at the cost of additional hardware and implementation. Also, reducing k permutations to just one would be much more energy-efficient, which might be an important perspective as minwise hashing is commonly deployed in the search industry. While the theoretical probability analysis is interesting, our experiments on similarity estimation and SVM & logistic regression also confirm the theoretical results.
@incollection{NIPS2012_1436,
abstract = {While minwise hashing is promising for large-scale learning in massive binary data, the preprocessing cost is prohibitive as it requires applying (e.g.,) k=500 permutations on the data. The testing time is also expensive if a new data point (e.g., a new document or a new image) has not been processed. In this paper, we develop a simple \textbf{one permutation hashing} scheme to address this important issue. While it is true that the preprocessing step can be parallelized, it comes at the cost of additional hardware and implementation. Also, reducing k permutations to just one would be much more \textbf{energy-efficient}, which might be an important perspective as minwise hashing is commonly deployed in the search industry. While the theoretical probability analysis is interesting, our experiments on similarity estimation and SVM \& logistic regression also confirm the theoretical results.},
added-at = {2015-01-19T16:50:57.000+0100},
author = {Li, Ping and Owen, Art and Zhang, Cun-Hui},
biburl = {https://www.bibsonomy.org/bibtex/255fcbbf6a063035ed478b9c654f4fbff/ncrn-cornell},
booktitle = {Advances in Neural Information Processing Systems 25},
editor = {Bartlett, P. and Pereira, F.C.N. and Burges, C.J.C. and Bottou, L. and Weinberger, K.Q.},
file = {4778-one-permutation-hashing.pdf:http\://papers.nips.cc/paper/4778-one-permutation-hashing.pdf:PDF},
interhash = {3bffcbe3bda9b523eae165f6e4c998be},
intrahash = {55fcbbf6a063035ed478b9c654f4fbff},
keywords = {imported},
pages = {3122--3130},
timestamp = {2015-01-19T16:50:57.000+0100},
title = {One Permutation Hashing},
url = {http://papers.nips.cc/paper/4778-one-permutation-hashing},
year = 2012
}