Locality-Sensitive Hashing (LSH) Resources (with Links)

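I have been looking into large-scale retrieval of similar data recently and came across a good collection of resources, which I am sharing here. Original source: https://janzhou.org/lsh/, with thanks to the author.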

Similarity Measure

A similarity measure is a real-valued function that quantifies how alike two objects are. No single definition exists, but such measures are usually, in some sense, the inverse of a distance metric: the smaller the distance, the higher the similarity.

Cosine similarity
Euclidean similarity
Nucleotide similarity
Amino acid similarity
Hamming similarity
Jaccard similarity
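
As a rough illustration of three of the measures listed above, here is a minimal Python sketch. The function names are my own and do not come from any of the libraries linked on this page; inputs are assumed to be plain lists, sets, or equal-length strings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


def hamming_similarity(x, y):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    matches = sum(1 for p, q in zip(x, y) if p == q)
    return matches / len(x)
```

For example, `jaccard_similarity({1, 2, 3}, {2, 3, 4})` returns 0.5 (two shared items out of four distinct ones), and `hamming_similarity("1011", "1001")` returns 0.75.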

Types of LSH

HyperplaneLSH for Cosine Distance
Super-Bit Locality-Sensitive Hashing for cosine (angular) similarity
Min Hash for Jaccard similarity
Min-wise independent permutations
Nilsimsa Hash
Random projection
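
To make two of the schemes above more concrete, the following is a small sketch of hyperplane (sign random projection) hashing for cosine distance and MinHash for Jaccard similarity. It uses only the Python standard library. All names, the prime modulus, and the use of linear hash functions in place of true random permutations are assumptions made for illustration, not the approach of any specific library listed below.

```python
import random

PRIME = 2_147_483_647  # 2^31 - 1; assumed larger than any item id


def make_planes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]


def hyperplane_signature(vec, planes):
    """One bit per hyperplane: 1 if vec lies on its non-negative side."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def make_minhash_funcs(k, seed=0):
    """k linear hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def minhash_signature(items, funcs):
    """For each hash function, keep the minimum hash over the (non-empty) set.

    Items are assumed to be non-negative integers smaller than PRIME.
    """
    return [min((a * x + b) % PRIME for x in items) for a, b in funcs]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(1 for p, q in zip(sig_a, sig_b) if p == q) / len(sig_a)
```

Vectors separated by a small angle tend to agree on most hyperplane bits, and sets with high Jaccard similarity tend to agree on most MinHash positions. In practice the signatures are then split into bands and used as bucket keys, so that only items sharing a bucket are compared.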

Implementations

tdebatty/java-LSH A Java implementation of Locality Sensitive Hashing (LSH) MinHash & Super-Bit
apache/incubator-datafu A collection of libraries for working with large-scale data in Hadoop.
marufaytekin/lsh-spark HyperplaneLSH for Spark
soundcloud/cosine-lsh-join-spark Approximate Nearest Neighbors in Spark
karlhigley/spark-neighbors Spark-based approximate nearest neighbor search using locality-sensitive hashing. Supports Hamming, Jaccard, Euclidean, and cosine distance.
rholder/nilsimsa Nilsimsa locality-sensitive hashing algorithm in Java.
chrisjmccormick/MinHash MinHash tutorial with Python code, including an example of mining document similarity.
barneygovan/lsh-scala A Locality-Sensitive Hashing Library for Scala with optional Redis storage.
treadstone90/Locality-Sensitive-Hashing Works only on text and supports only Jaccard similarity.
richwhitjr/DistNN Distributed LSH Implementation in Scala.
beckgael/Mean-Shift-LSH Distributed Nearest Neighbours Mean Shift with Locality Sensitive Hashing DNNMS-LSH. Scala/Spark implementation.
ohtaman/LSH MinHash and SimHash implemented in C++.
JorenSix/TarsosLSH A Java library implementing Locality-sensitive Hashing (LSH), a practical nearest neighbour search algorithm for multidimensional vectors that operates in sublinear time.

Papers

Practical and Optimal LSH for Angular Distance
Optimal Data-Dependent Hashing for Approximate Near Neighbors
Beyond Locality Sensitive Hashing
Original LSH algorithm (1999)
Efficient Distributed Locality Sensitive Hashing
Jaccard distance: Mining of Massive Datasets, Chapter 3
Hamming norm A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In Proc. of the 25th Intl. Conf. on Very Large Data Bases, VLDB(1999).
Lp norms M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In Proc. of the 20th ACM Annual http://people.csail.mit.edu/indyk/nips-nn.ps
Cosine distance and earth mover's distance (EMD) M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In Proc. of the 34th Annual ACM Symposium on Theory of Computing, STOC (2002).
Very Sparse Random Projections Ping Li, T. Hastie and K. W. Church, 2006
Similarity Estimation Techniques from Rounding Algorithms
Random projection Random projection in dimensionality reduction: Applications to image and text data
An Introduction to Sequence Similarity (“Homology”) Searching
Efficient large-scale sequence comparison by locality-sensitive hashing

Finding Nearest Neighbors

Additional Reading

Issues for LSH

SPARK-5992 Locality Sensitive Hashing (LSH) for Spark
spark/pull/15148

Implement Locality Sensitive Hashing along with approximate nearest neighbors and approximate similarity join based on the design doc.

Detailed changes are as follows:

Implement abstract LSH, LSHModel classes as Estimator-Model
Implement approxNearestNeighbors and approxSimilarityJoin in the abstract LSHModel
Implement Random Projection as LSH subclass for Euclidean distance, Min Hash for Jaccard Distance
Implement unit test utility methods including checkLshProperty, checkNearestNeighbor and checkSimilarityJoin
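
To give a feel for what an approximate similarity join over random-projection hashes involves, here is a standalone Python sketch. It is not Spark's code or API: the class name `RandomProjectionJoinSketch`, its parameters, and the bucketing rule `floor(direction . x / bucket_length)` are assumptions chosen for illustration. Candidate pairs are those that share a bucket in at least one hash table; each candidate is then checked against the true Euclidean distance.

```python
import math
import random
from collections import defaultdict


class RandomProjectionJoinSketch:
    """Hypothetical illustration of LSH for Euclidean distance."""

    def __init__(self, dim, num_hash_tables, bucket_length, seed=0):
        rng = random.Random(seed)
        self.bucket_length = bucket_length
        self.directions = []
        for _ in range(num_hash_tables):
            v = [rng.gauss(0, 1) for _ in range(dim)]
            norm = math.sqrt(sum(c * c for c in v))
            self.directions.append([c / norm for c in v])  # unit vector

    def hashes(self, point):
        """One bucket index per hash table."""
        return [
            math.floor(sum(d * x for d, x in zip(direction, point)) / self.bucket_length)
            for direction in self.directions
        ]

    def approx_similarity_join(self, left, right, threshold):
        """Return (i, j, distance) for candidate pairs within threshold."""
        # Index the right side by (table index, bucket index).
        buckets = defaultdict(list)
        for j, q in enumerate(right):
            for t, h in enumerate(self.hashes(q)):
                buckets[(t, h)].append(j)

        # A pair is a candidate if it collides in at least one table.
        candidates = set()
        for i, p in enumerate(left):
            for t, h in enumerate(self.hashes(p)):
                for j in buckets.get((t, h), ()):
                    candidates.add((i, j))

        # Verify candidates with the exact distance; drop false positives.
        result = []
        for i, j in sorted(candidates):
            d = math.dist(left[i], right[j])
            if d <= threshold:
                result.append((i, j, d))
        return result
```

Because candidates are verified with the exact distance, the join never returns a pair farther apart than the threshold, but it may miss close pairs that happen to fall into different buckets in every table. Adding hash tables reduces misses at the cost of more candidates to check.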

Things that will be implemented in a follow-up PR:

Bit Sampling for Hamming Distance, SignRandomProjection for Cosine Distance
PySpark integration for the Scala classes and methods.

Datasets

Text Technologies for Data Science Assessment 3