Use word vectors to mathematically find words with similar meaning

1. Description

        In short, a word vector is a sequence of real numbers that represents the meaning of a natural language word. This technology is an important enabler of NLP, allowing machines to "understand" human language. This article discusses how to use word vectors to programmatically compute the semantic similarity of texts, which is useful, for example, when you need to classify texts according to the topics they cover. It starts with a conceptual view and examples, then shows how to use spaCy, a leading Python NLP library, to determine the semantic similarity of texts.

2. The concept of word vectors

        Let's take a conceptual look at word vectors so you can get a basic idea of how to mathematically compute the semantic similarity between words represented in vector form. You'll then look at spaCy's similarity() method, which compares the word vectors of container objects (Doc, Span, Token) to determine how close they are in meaning.

        In statistical modeling, words are mapped to vectors of real numbers that reflect their semantic similarity. You can imagine the word vector space as a cloud in which vectors for words with similar meanings sit close together. For example, the vector representing the word "apple" should be closer to the vector for the word "pear" than to the vector for the word "car", since the first two refer to edible fruits while the latter refers to a four-wheeled road vehicle. To generate such vectors, you need to encode the meanings of the words, and there are several ways of doing so.
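
        Closeness between vectors is usually measured with cosine similarity: the cosine of the angle between two vectors, which approaches 1.0 when they point in the same direction. Here is a minimal sketch using made-up three-dimensional vectors (the numbers are illustrative, not taken from a real model):

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1.0 means similar.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors; real models use hundreds of dimensions.
apple = np.array([0.9, 0.1, 0.0])
pear = np.array([0.8, 0.2, 0.1])
car = np.array([0.0, 0.2, 0.9])

print(cosine_similarity(apple, pear))  # high: both are fruits
print(cosine_similarity(apple, car))   # low: unrelated meanings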

3. Define meaning with coordinates

        One way to generate meaningful word vectors is to assign a real-world object or category to each coordinate of the vector. For example, suppose you are generating word vectors for the following words: Rome, Italy, Athens, and Greece. The word vectors should mathematically reflect the fact that Rome is the capital of Italy and therefore has a different relationship to Italy than Athens does. At the same time, they should reflect the fact that Athens and Rome are capitals, and that Greece and Italy are countries.

        The table below illustrates what this vector space might look like in matrix form.
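
        One plausible layout, with each row holding a word vector, 1 marking membership in a category, and 0 marking its absence:

              Country  Capital  Greek  Italian
    Rome         0        1       0       1
    Italy        1        0       0       1
    Athens       0        1       1       0
    Greece       1        0       1       0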

        Here, you distribute the meaning of each word across the coordinates of a four-dimensional space whose dimensions represent the categories "Country", "Capital", "Greek", and "Italian". In this simplified example, the coordinate values can be 1 or 0, indicating whether the corresponding word belongs to the category.

        Once you have a vector space in which the vectors of numbers capture the meaning of the corresponding words, you can use vector arithmetic on this space to gain insight into the meanings of the words. To find out which country Athens is the capital of, you can use the following equation, where each word stands for its corresponding vector and X is the unknown vector:

        Italy - Rome = X - Athens

        This equation expresses an analogy: X represents the word vector that has the same relationship to Athens as Italy has to Rome. To solve for X, you can rewrite the equation like this:

        X = Italy - Rome + Athens

        First, subtract the vector for Rome from the vector for Italy by subtracting the corresponding vector elements. Then add the vector for Athens to the result. This calculation is summarized below.

        By subtracting the word vector for Rome from the word vector for Italy, and then adding the word vector for Athens, you obtain a vector equal to the vector for Greece.
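
        You can check this with a few lines of NumPy, using the toy category vectors from the table above (illustrative values, not from a trained model):

import numpy as np

# Columns: Country, Capital, Greek, Italian
vectors = {
    'Rome': np.array([0, 1, 0, 1]),
    'Italy': np.array([1, 0, 0, 1]),
    'Athens': np.array([0, 1, 1, 0]),
    'Greece': np.array([1, 0, 1, 0]),
}

# X = Italy - Rome + Athens
x = vectors['Italy'] - vectors['Rome'] + vectors['Athens']

# Find the known word whose vector is closest to X.
closest = min(vectors, key=lambda w: np.linalg.norm(vectors[w] - x))
print(closest)  # Greece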

4. Use dimensions to express meaning

        Although the vector space you just created has only four categories, a real-world vector space built this way might require tens of thousands of categories. A vector space of this size is impractical for most applications because it requires a huge word embedding matrix. For example, to encode 10,000 categories for 1,000,000 entities, you would need a 10,000 × 1,000,000 embedding matrix.
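
        To put that in perspective, a quick back-of-the-envelope calculation (assuming 4-byte float32 entries) shows how large such a matrix would be:

entries = 10_000 * 1_000_000   # categories × entities
size_bytes = entries * 4       # float32: 4 bytes per entry
print(size_bytes / 1024**3)    # roughly 37 GiB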

        An obvious way to reduce the size of the embedding matrix is to reduce the number of categories in the vector space. Instead of dedicating a coordinate to every category, practical implementations of word vector spaces use the distances between vectors to quantify and classify semantic similarity. The individual dimensions generally have no inherent meaning; instead, they define positions in the vector space, and the distance between vectors reflects the similarity in meaning of the corresponding words. To see an example of a real vector space, you can download the pretrained fastText English word vectors (https://fasttext.cc/docs/en/english-vectors.html), which distribute the meaning of words across a 300-dimensional vector space.
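
        The distributed .vec files are plain text, so you can peek at one with a few lines of Python. This is a sketch assuming you have downloaded and unzipped one of the archives; the file name used here, wiki-news-300d-1M.vec, depends on which archive you pick:

# The first line of a fastText .vec file is a header:
# "<number of words> <number of dimensions>".
with open('wiki-news-300d-1M.vec', encoding='utf-8') as f:
    n_words, dims = f.readline().split()
    print(f'{n_words} words, {dims} dimensions')
    for _ in range(3):
        parts = f.readline().split()
        word, vector = parts[0], [float(x) for x in parts[1:]]
        print(word, vector[:5], '...')  # first 5 of 300 coordinates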

5. spaCy's similarity() method

        In spaCy, each type of container object has a similarity() method that lets you estimate the semantic similarity between two container objects of any type by comparing their word vectors. To compute the similarity of spans and documents, which do not have word vectors of their own, spaCy averages the word vectors of the tokens they contain.
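
        You can verify this averaging directly. The following sketch assumes a model with word vectors, such as en_core_web_md, is installed (the small en_core_web_sm model ships without real word vectors):

import numpy as np
import spacy

nlp = spacy.load('en_core_web_md')
doc = nlp('I want a green apple.')

# The Doc's vector defaults to the average of its tokens' vectors,
# so this comparison is expected to print True.
token_mean = np.mean([token.vector for token in doc], axis=0)
print(np.allclose(doc.vector, token_mean))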

        The semantic similarity of two container objects can be computed even if the two objects are of different types. For example, you can compare a Token object to a Span object, a Span object to a Doc object, and so on.

        The following example calculates how similar a Span object is to a Doc object:

>>> doc = nlp('I want a green apple.')
>>> doc.similarity(doc[2:5])
0.7305813588233471

        This code computes a semantic similarity estimate between the sentence "I want a green apple" and the phrase "a green apple" derived from that same sentence. As you can see, the calculated similarity is high enough to consider the content of the two objects similar (similarity scores typically range from 0 to 1). Not surprisingly, the similarity() method returns 1 when you compare an object to itself:

>>> doc.similarity(doc)
1.0
>>> doc[2:5].similarity(doc[2:5])
1.0
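
        The comparison is more interesting when the two objects share no text at all. Here is a quick sketch; the exact scores depend on the model, so none are shown:

doc_fruit = nlp('I like oranges.')
doc_car = nlp('The car broke down.')

# A sentence about fruit should score noticeably higher against
# "I want a green apple." than an unrelated sentence about cars.
print(doc.similarity(doc_fruit))
print(doc.similarity(doc_car))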

6. Postscript

        Note: The examples used in this article are taken from my recent book "Natural Language Processing with Python and spaCy" (https://nostarch.com/NLPPython), published by No Starch Press (https://nostarch.com/).
