Big data course K10 - Spark's Vector_Space_Model algorithm

Email of the author of the article: [email protected] Address: Huizhou, Guangdong

 ▲ Objectives of this chapter

⚪ Master the Vector Space Model algorithm in Spark;

⚪ Master computing the cosine of the angle between vectors in Spark.

1. Vector Space Model (VSM) algorithm

1 Overview

The Vector Space Model (VSM) was proposed by Salton et al. in the 1970s and was successfully applied in text retrieval systems.

VSM is conceptually simple: it reduces the processing of text content to vector operations in a vector space and uses spatial similarity to express semantic similarity, which makes it intuitive and easy to understand. Once documents are represented as vectors in the document space, the similarity between documents can be measured by computing the similarity between their vectors. The most commonly used measure in text processing is cosine similarity, that is, the cosine of the angle between two vectors.
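The cosine measure can be illustrated with a minimal sketch in plain Python. This is not Spark code; the function name `cosine_similarity`, the toy vectors, and the choice to return 0.0 for a zero vector are assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumed convention: a zero vector is treated as similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical term-weight vectors for three documents.
doc_a = [2, 5, 1, 0]
doc_b = [4, 10, 2, 0]  # same direction as doc_a, twice the length
doc_c = [0, 0, 0, 6]   # shares no terms with doc_a

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

Because the measure depends only on the angle, `doc_a` and `doc_b` score as essentially identical even though their lengths differ, while `doc_a` and `doc_c`, which have no term in common, score 0.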

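Given M unordered feature terms t_i (stems, words, phrases, or other units), each document d_j can be represented by a feature vector (a_1j, a_2j, ..., a_Mj), where a_ij is the computed weight of term t_i in document d_j. N training documents then form the M×N matrix A = (a_ij), and documents are compared by the similarity of their column vectors.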
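As a rough sketch of this representation, the following plain Python builds a small term-document matrix with one row per term and one column per document. The helper name `build_term_document_matrix`, the use of raw counts as weights, and the alphabetical ordering of terms are all assumptions; the model itself leaves the terms unordered and the weighting scheme open.

```python
def build_term_document_matrix(docs):
    """docs: {doc_name: {term: weight}} -> (terms, doc_names, matrix).

    matrix[i][j] is the weight of terms[i] in doc_names[j]; terms that
    do not occur in a document get weight 0.
    """
    terms = sorted({t for weights in docs.values() for t in weights})
    doc_names = sorted(docs)
    matrix = [[docs[d].get(t, 0) for d in doc_names] for t in terms]
    return terms, doc_names, matrix

# Toy input; raw term counts stand in for the weights (an assumption).
docs = {
    "1.txt": {"hello": 2, "spark": 5, "AI": 1},
    "2.txt": {"world": 1, "hadoop": 6},
}
terms, doc_names, A = build_term_document_matrix(docs)

# Column j of A is the feature vector of document doc_names[j].
vector_of_first_doc = [row[0] for row in A]
print(terms)
print(vector_of_first_doc)
```

Here M is 5 and N is 2, so `A` has five rows and two columns.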

The vector space model (also called the term vector model) is an algebraic model used for information filtering, information retrieval, indexing, and evaluating relevance.

This algorithm can be used for document ranking. Learning it requires three prerequisites:

1. The inverted index.

2. The concept of similarity.

3. TF-IDF algorithm.
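Of the three prerequisites, TF-IDF is the one that supplies the term weights. The sketch below uses one common textbook variant, assumed here rather than taken from the course: term frequency is the count divided by the document's total term count, and inverse document frequency is the natural log of N over the document frequency. Library implementations, Spark MLlib included, typically use smoothed variants, so their numbers will differ. The function name `tf_idf` and the toy counts are hypothetical.

```python
import math

def tf_idf(docs):
    """docs: {doc_name: {term: raw_count}} -> {doc_name: {term: weight}}."""
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    df = {}
    for counts in docs.values():
        for term in counts:
            df[term] = df.get(term, 0) + 1

    weights = {}
    for name, counts in docs.items():
        total = sum(counts.values())
        weights[name] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights

# Toy counts: "hello" occurs in both documents, the other terms in one.
docs = {
    "1.txt": {"hello": 2, "spark": 5},
    "2.txt": {"hello": 1, "hadoop": 6},
}
w = tf_idf(docs)
print(w["1.txt"])
```

A term that appears in every document, such as "hello" here, gets weight 0 under this variant, which matches the intuition that it does not help tell documents apart.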

Forward index: maps each document to the terms it contains (document -> term), for example:

1.txt -> hello 2; spark 5; AI 1;

2.txt -> world 1; hadoop 6;

... ...

Inverted index: maps each term to the documents that contain it (term -> document), for example:

hello -> 1.txt 2; 3.txt 10;

spark -> 1.txt 5; 4.txt 7;
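The relationship between the two indexes can be sketched in plain Python by turning a forward index into an inverted one. The function name `invert_index` and the dictionary layout are assumptions for illustration, and the numbers are treated as per-document term counts. Only the two documents listed in the forward index example are used, so the result does not contain entries for 3.txt or 4.txt.

```python
from collections import defaultdict

def invert_index(forward_index):
    """{doc: {term: count}} -> {term: {doc: count}}."""
    inverted = defaultdict(dict)
    for doc, counts in forward_index.items():
        for term, count in counts.items():
            inverted[term][doc] = count
    return dict(inverted)

forward_index = {
    "1.txt": {"hello": 2, "spark": 5, "AI": 1},
    "2.txt": {"world": 1, "hadoop": 6},
}
inverted = invert_index(forward_index)

# Print each term with its posting list, in the same style as the example.
for term, postings in sorted(inverted.items()):
    line = " ".join(f"{doc} {count};" for doc, count in sorted(postings.items()))
    print(f"{term} -> {line}")
```

With the inverted form, answering "which documents contain this term?" becomes a single dictionary lookup instead of a scan over every document.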


Origin blog.csdn.net/u013955758/article/details/132438313