局部敏感哈希(LSH)资料(附链接)

最近在看大规模相似数据召回,翻到一个好的资料,分享给大家。原文地址:https://janzhou.org/lsh/,感谢博主。

Similarity Measure

Similarity measure is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity measure exists, usually such measures are in some sense the inverse of distance metrics.

  • Cosine similarity
  • Euclidean similarity
  • Nucleotide similarity
  • Amino acid similarity
  • Hamming similarity
  • Jaccard similarity

Types of LSH

Implementations

Papers

  1. Practical and Optimal LSH for Angular Distance
  2. Optimal Data-Dependent Hashing for Approximate Near Neighbors
  3. Beyond Locality Sensitive Hashing
  4. Original LSH algorithm (1999)
  5. Efficient Distributed Locality Sensitive Hashing
  6. Jaccard distance: Mining Massive Data Sets chapter#3
  7. Hamming norm A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In Proc. of the 25th Intl. Conf. on Very Large Data Bases, VLDB(1999).
  8. Lp norms M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In Proc. of the 20th ACM Annualhttp://people.csail.mit.edu/indyk/nips-nn.ps
  9. Cosine distance and Earth movers distance (EMD) M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In Proc. of the 34th Annual ACM Symposium on Theory of Computing, STOC (2002).
  10. Very Sparse Random Projections Ping Li, T. Hastie and K. W. Church, 2006
  11. Similarity Estimation Techniques from RoundingAlgorithms
  12. Random projection Random projection in dimensionality reduction: Applications to image and text data
  13. An Introduction to Sequence Similarity (“Homology”) Searching
  14. Efficient large-scale sequence comparison by locality-sensitive hashing

Finding Nearest Neighbors

Additional Reading

Issues for LSH

  1. SPARK-5992 Locality Sensitive Hashing (LSH) for Spark
  2. spark/pull/15148

Implement Locality Sensitive Hashing along with approximate nearest neighbors and approximate similarity join based on the design doc.

Detailed changes are as follows:

  1. Implement abstract LSH, LSHModel classes as Estimator-Model
  2. Implement approxNearestNeighbors and approxSimilarityJoin in the abstract S.Model
  3. Implement Random Projection as LSH subclass for Euclidean distance, Min a.h for Jaccard Distance
  4. Implement unit test utility methods including checkLshProperty, checkNearestNeighbor and checkSimilarityJoin

Things that will be implemented in a follow-up PR:

  • Bit Sampling for Hamming Distance, SignRandomProjection for Cosine Distance
  • PySpark Integration for the scala classes and methods.

Datasets


猜你喜欢

转载自blog.csdn.net/mimi9919/article/details/80351546
LSH
今日推荐