最近在看大规模相似数据召回,翻到一个好的资料,分享给大家。原文地址:https://janzhou.org/lsh/,感谢博主。
Similarity Measure
Similarity measure is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity measure exists, usually such measures are in some sense the inverse of distance metrics.
- Cosine similarity
- Euclidean similarity
- Nucleotide similarity
- Amino acid similarity
- Hamming similarity
- Jaccard similarity
Types of LSH
- HyperplaneLSH for Cosine Distance
- Super-Bit Locality-Sensitive Hashing for Hamming distance
- Min Hash for Jaccard similarity
- Min-wise independent permutations
- Nilsimsa Hash
- Random projection
Implementations
- tdebatty/java-LSH A Java implementation of Locality Sensitive Hashing (LSH) MinHash & Super-Bit
- apache/incubator-datafu a collection of libraries for working with large-scale data in Hadoop.
- marufaytekin/lsh-spark HyperplaneLSH for Spark
- soundcloud/cosine-lsh-join-spark Approximate Nearest Neighbors in Spark
- karlhigley/spark-neighbors Spark-based approximate nearest neighbor search using locality-sensitive hashing supports Hamming, Jaccard, Euclidean, and cosine distance.
- rholder/nilsimsa Nilsimsa locality-sensitive hashing algorithm in Java.
- chrisjmccormick/MinHash MinHash Tutorial with Python Code with example to mining documents similarity.
- barneygovan/lsh-scala A Locality-Sensitive Hashing Library for Scala with optional Redis storage.
- treadstone90/Locality-Sensitive-Hashing works only for the text and can support only Jaccard Similarity.
- richwhitjr/DistNN Distributed LSH Implementation in Scala.
- beckgael/Mean-Shift-LSH Distributed Nearest Neighbours Mean Shift with Locality Sensitive Hashing DNNMS-LSH. Scala/Spark implementation.
- ohtaman/LSH C++ implemented MinHash and SimHash.
- JorenSix/TarsosLSH A Java library implementing Locality-sensitive Hashing (LSH), a practical nearest neighbour search algorithm for multidimensional vectors that operates in sublinear time.
Papers
- Practical and Optimal LSH for Angular Distance
- Optimal Data-Dependent Hashing for Approximate Near Neighbors
- Beyond Locality Sensitive Hashing
- Original LSH algorithm (1999)
- Efficient Distributed Locality Sensitive Hashing
- Jaccard distance: Mining Massive Data Sets chapter#3
- Hamming norm A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In Proc. of the 25th Intl. Conf. on Very Large Data Bases, VLDB(1999).
- Lp norms M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In Proc. of the 20th ACM Annualhttp://people.csail.mit.edu/indyk/nips-nn.ps
- Cosine distance and Earth movers distance (EMD) M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In Proc. of the 34th Annual ACM Symposium on Theory of Computing, STOC (2002).
- Very Sparse Random Projections Ping Li, T. Hastie and K. W. Church, 2006
- Similarity Estimation Techniques from RoundingAlgorithms
- Random projection Random projection in dimensionality reduction: Applications to image and text data
- An Introduction to Sequence Similarity (“Homology”) Searching
- Efficient large-scale sequence comparison by locality-sensitive hashing
Finding Nearest Neighbors
Additional Reading
Issues for LSH
- SPARK-5992 Locality Sensitive Hashing (LSH) for Spark
- spark/pull/15148
Implement Locality Sensitive Hashing along with approximate nearest neighbors and approximate similarity join based on the design doc.
Detailed changes are as follows:
- Implement abstract LSH, LSHModel classes as Estimator-Model
- Implement approxNearestNeighbors and approxSimilarityJoin in the abstract S.Model
- Implement Random Projection as LSH subclass for Euclidean distance, Min a.h for Jaccard Distance
- Implement unit test utility methods including checkLshProperty, checkNearestNeighbor and checkSimilarityJoin
Things that will be implemented in a follow-up PR:
- Bit Sampling for Hamming Distance, SignRandomProjection for Cosine Distance
- PySpark Integration for the scala classes and methods.