From the Pythagorean Theorem to Cosine Similarity: A Programmer's Mathematical Foundation

Most programmers, thanks to their science and engineering backgrounds, have some foundation in advanced mathematics, linear algebra, probability theory, and mathematical statistics. So when the machine learning boom arrived, they were eager to dive in, with a strong desire to explore machine learning algorithms and the mathematical ideas behind them.

The author of this article is one of them. In practice, however, I found my understanding of the mathematics somewhat lacking: when trying to grasp the meaning behind certain formulas, I felt a little powerless. So I sorted out some of my mathematical blind spots, straightened out the knowledge, and am sharing it with those who need it.

This article explains the knowledge behind cosine similarity. Similarity calculation has a wide range of uses; it sits at the core of business scenarios such as search engines, recommendation engines, and classification and clustering. To trace the ins and outs of cosine similarity, I will start from the simplest junior high school mathematics, gradually derive the cosine formula, and then walk through some practical examples built on it.

1. Business background

In daily development, we may encounter business scenarios like the following.

[Figure: three business scenarios: precision marketing, image classification, and search]

Precision marketing, image processing, and search engines: these three seemingly unrelated business scenarios actually share a common problem, the calculation of similarity. Crowd expansion in precision marketing involves computing user similarity; image classification involves computing image similarity; a search engine involves computing the similarity between query terms and documents. When it comes to similarity, perhaps under the influence of the book "The Beauty of Mathematics", the most familiar measure is cosine similarity. So how is cosine similarity derived?

2. Mathematical foundation

To understand cosine similarity, let us start with the pyramids. The base of a pyramid is a huge square; the side length of the Great Pyramid of Giza, for example, exceeds 230 m. When laying out such a huge square, how do you ensure the figure does not get out of shape, ending up as a rhombus or a trapezoid?


1. Pythagorean Theorem

To ensure the constructed quadrilateral is a square, two conditions must hold: all four sides are equal in length, and the corners are right angles. Equal side lengths are easy to achieve: in engineering practice, a rope of fixed length can serve as the side. But how do you guarantee a right angle? The ancients solved this with the Pythagorean Theorem, or more precisely, with its converse.

Construct a triangle whose three sides measure 3, 4, and 5; then the angle opposite the side of length 5 is a right angle. There is a Chinese idiom, "without the compass and square, no circle or square can be made"; the "square" here is the carpenter's square.
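The converse is easy to check in code. A minimal sketch, plain Python with no dependencies:

def is_right_triangle(a, b, c):
    # Converse of the Pythagorean theorem: sort so c is the candidate hypotenuse
    a, b, c = sorted((a, b, c))
    return abs(a * a + b * b - c * c) < 1e-9

print(is_right_triangle(3, 4, 5))  # True: the rope-stretcher's triangle
print(is_right_triangle(3, 4, 6))  # False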


The Pythagorean theorem is junior high school material and easy to understand; its proof is also very simple. Einstein is said to have found a proof at the age of 11, and by some counts there are more than 400 proofs of the theorem; interested readers can explore them on their own. The Pythagorean theorem was also the inspiration for Fermat's Last Theorem, which plagued the world's best minds for more than 300 years and produced many anecdotes, so I won't repeat them here.

2. The law of cosines

The Pythagorean theorem has one big limitation: the triangle must be a right triangle. So for an ordinary triangle, what is the relationship among the three sides? This leads to the law of cosines.

[Figure: the law of cosines, c² = a² + b² - 2ab·cos C]

The law of cosines states the relationship among the three sides of any triangle. It, too, is junior high school mathematics, and the proof is relatively simple, so I will skip it here.

In fact, once you understand the Pythagorean theorem and the law of cosines, you have already mastered many characteristics and secrets of the triangle. For example, from an equilateral triangle you can derive cos(60°) = 1/2. But to uncover more of geometry's secrets, you need to enter the world of analytic geometry. This mathematics is not very deep either: high school level.
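To make that concrete, here is a minimal sketch that solves the law of cosines for the angle, cos C = (a² + b² - c²) / (2ab), and confirms both claims above:

import math

def cos_of_angle(a, b, c):
    # Law of cosines solved for the angle opposite side c
    return (a * a + b * b - c * c) / (2 * a * b)

print(cos_of_angle(1, 1, 1))  # 0.5, i.e. cos(60°) for an equilateral triangle
print(math.degrees(math.acos(cos_of_angle(3, 4, 5))))  # 90.0, the 3-4-5 right angle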

Here we look at the simplest case: representing a triangle in the rectangular coordinate system. As the poem goes, "viewed from the front, a ridge; from the side, a peak": the same triangle can be seen in another form.

[Figure: a triangle represented by two vectors in the rectangular coordinate system]

For example, we can describe a triangle by its three sides a, b, and c; in a rectangular coordinate system, we can instead represent it by two vectors.

3. Cosine similarity

With the Cartesian coordinate system, the representation of triangles enters a more flexible, powerful, and abstract realm. Geometric figures can be computed with algebraic methods, and algebra can be visualized with geometric figures, which greatly lowers the difficulty of understanding. For example, once we use a vector to represent a side of a triangle, we can extend directly from two-dimensional space to high-dimensional space.

[Figure: vector dot product and vector length in N-dimensional space]

Here, a vector is defined the same way as a point; the dot product of two vectors is just the multiply-and-accumulate of their values in each dimension; and the length of a vector looks like something new but is essentially the Pythagorean theorem, extended from two-dimensional space to N-dimensional space. Indeed, the length of a vector is just a special case of the dot product: the square root of a vector multiplied by itself. The rigor of mathematics is fully reflected here.

Combining the Pythagorean theorem, the law of cosines, the rectangular coordinate system, and vectors, we can derive the cosine formula naturally. The only difficulty is that both the Pythagorean theorem and the law of cosines must be expressed in vector form.
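A minimal NumPy sketch of the two routes: the law of cosines applied to the triangle with sides |A|, |B|, and |A - B| yields the same value as the dot-product form cos θ = (A · B) / (|A| |B|):

import numpy as np

A = np.array([3.0, 4.0])
B = np.array([5.0, 0.0])

# Route 1: the dot-product form of the cosine formula
cos_dot = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

# Route 2: the law of cosines on the triangle with sides |A|, |B|, |A - B|
a, b, c = np.linalg.norm(A), np.linalg.norm(B), np.linalg.norm(A - B)
cos_law = (a * a + b * b - c * c) / (2 * a * b)

print(cos_dot, cos_law)  # both print 0.6: the two derivations agree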

[Figure: the cosine formula, cos θ = (A · B) / (|A| |B|)]

Having obtained the cosine formula, how should we understand it?

[Figure: interpreting the cosine of the angle between two vectors]

In the extreme case, if two vectors overlap, they are considered completely similar. Note, however, that the similarity here is really about direction. A vector has two elements, direction and length; only direction is used, which plants a hidden pitfall for practice. Still, a mathematical model has been established, and we can use it to solve some practical problems.
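A tiny illustration of that pitfall: a vector and its double point the same way, so cosine similarity calls them identical even though their lengths differ.

import numpy as np

a = np.array([1.0, 2.0])
b = 2 * a  # same direction, twice the length

cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos)  # 1.0: "completely similar", although |b| = 2|a|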

A so-called mathematical model does not necessarily require advanced mathematics; its outward form is just a formula. The law of cosines, for example, is a mathematical model that high school mathematics suffices to understand. And about models there is an interesting saying: "All models are wrong, but some are useful." Here we focus on the useful side.

We now understand the angle between vectors, but how should we understand the vector itself? Is it just one side of a triangle?


Life has geometry, and everything can be a vector. The vector is a simple mathematical abstraction, and a great many practical scenarios can be mapped onto it: vectors representing user tags, vectors representing colors, vectors driving the logic of a search engine...

3. Business practice

Now that we understand the law of cosines and the way of mathematical modeling, we can do some interesting things: for example, see how cosine similarity addresses the three business scenarios mentioned earlier. Real problems are certainly far more complicated, but the core ideas are similar.

Case 1: Precision Marketing

Suppose an operations plan has delineated a seed crowd of 10,000 users. How do we expand it to 100,000 people?

[Figure: expanding a seed crowd of 10,000 users to 100,000]

To apply cosine similarity, the most important question here is: how do we vectorize users?

Treat each user as a vector, with each user tag value as one dimension of that vector. A real project involves details such as feature normalization and feature weighting; this is only a demonstration, so we will not get into those details.

For a crowd, we can take the average of all its users' values in each dimension as the crowd vector. After this processing, the cosine formula can be used to compute each user's similarity to the crowd.


We compute the similarity between each user in the market and the delineated crowd, then take the top N to achieve crowd expansion.

Just "show me the code"! 

# -*- coding: utf-8 -*-
import numpy as np
import numpy.linalg as linalg

def cos_similarity(v1, v2):
    num = float(np.dot(v1, v2))  # dot product of the two vectors
    denom = linalg.norm(v1) * linalg.norm(v2)
    if denom > 0:
        cos = num / denom  # cosine value, in [-1, 1]
        sim = 0.5 + 0.5 * cos  # normalize to [0, 1]
        return sim
    return 0

if __name__ == '__main__':

    # Raw user tags: gender, age, married, occupation
    u_tag_list = [
        ["女", "26", "是", "白领"],
        ["女", "35", "是", "白领"],
        ["女", "30", "是", "白领"],
        ["女", "22", "是", "白领"],
        ["女", "20", "是", "白领"]
    ]

    # The same users vectorized: female=1, married=1, white-collar=1
    u_tag_vector = np.array([
        [1, 26, 1, 1],
        [1, 35, 1, 1],
        [1, 30, 1, 1],
        [1, 22, 1, 1],
        [1, 20, 1, 1]
    ])

    # Crowd vector: the per-dimension mean over all users in the crowd
    c1 = u_tag_vector.mean(axis=0)

    new_user_v1 = np.array([1, 36, 1, 1])   # female, 36, married, white-collar
    new_user_v2 = np.array([-1, 20, 0, 1])  # male, 20, unmarried, white-collar
    print("vector-u1: ", list(map(lambda x: '%.2f' % x, new_user_v1.tolist())))
    print("vector-u2: ", list(map(lambda x: '%.2f' % x, new_user_v2.tolist())))
    print("vector-c1: ", list(map(lambda x: '%.2f' % x, c1.tolist())))
    print("sim<u1,c1>: ", cos_similarity(c1, new_user_v1))
    print("sim<u2,c1>: ", cos_similarity(c1, new_user_v2))

Case 2: Image classification

There are two categories of pictures, food and cute pets. How do we automatically classify a new picture?


The core question here is: how do we vectorize an image?

A picture is composed of pixels, and each pixel has three channels (RGB). Because pixel granularity is too fine, we divide the picture into coarser grid cells; each cell contributes 3 dimensions, whose values are the averages of the pixels in the cell, channel by channel.

[Figure: dividing an image into grid cells and averaging each cell's RGB channels]

Reference blog: Image Basics 7: Image Classification - Cosine Similarity

The sample code is also given below:

# -*- coding: utf-8 -*-
import numpy as np
import numpy.linalg as linalg
import cv2

def cos_similarity(v1, v2):
    num = float(np.dot(v1, v2))  # dot product of the two vectors
    denom = linalg.norm(v1) * linalg.norm(v2)
    if denom > 0:
        cos = num / denom  # cosine value, in [-1, 1]
        sim = 0.5 + 0.5 * cos  # normalize to [0, 1]
        return sim
    return 0

def build_image_vector(im):
    """Resize the image to 500x300, cut it into a 10x10 grid of cells,
    and use each cell's mean B, G, R values as 3 vector dimensions,
    giving 10 * 10 * 3 = 300 dimensions in total.

    :param im: image as returned by cv2.imread
    :return: 300-dimensional numpy vector
    """
    im_vector = []

    im2 = cv2.resize(im, (500, 300))
    w = im2.shape[1]
    h = im2.shape[0]
    h_step = 30  # 300 / 30 = 10 cell rows
    w_step = 50  # 500 / 50 = 10 cell columns

    for i in range(0, w, w_step):
        for j in range(0, h, h_step):
            each = im2[j:j+h_step, i:i+w_step]
            b, g, r = each[:, :, 0], each[:, :, 1], each[:, :, 2]  # OpenCV channel order is BGR
            im_vector.append(np.mean(b))
            im_vector.append(np.mean(g))
            im_vector.append(np.mean(r))
    return np.array(im_vector)

def show(imm):
    """Debugging helper for visually inspecting grid cells (not called in main)."""
    imm2 = cv2.resize(imm, (510, 300))
    print(imm2.shape)
    imm3 = imm2[0:50, 0:30]
    cv2.imshow("aa", imm3)

    cv2.waitKey()
    cv2.destroyAllWindows()
    imm4 = imm2[51:100, 0:30]
    cv2.imshow("bb", imm4)
    cv2.waitKey()
    cv2.destroyAllWindows()
    imm2.fill(0)

def build_image_collection_vector(p_name):
    """Average several images' vectors into one category vector."""
    path = "D:\\python-workspace\\cos-similarity\\images\\"

    c1_vector = np.zeros(300)
    for pic in p_name:
        imm = cv2.imread(path + pic)
        each_v = build_image_vector(imm)
        a = list(map(lambda x: '%.2f' % x, each_v.tolist()[0:10]))  # first 10 dimensions, for inspection
        print("p1: ", a)
        c1_vector += each_v
    return c1_vector/len(p_name)

if __name__ == '__main__':

    v1 = build_image_collection_vector(["food1.jpg", "food2.jpg", "food3.jpg"])
    v2 = build_image_collection_vector(["pet1.jpg", "pet2.jpg", "pet3.jpg"])

    im = cv2.imread("D:\\python-workspace\\cos-similarity\\images\\pet4.jpg")
    v3 = build_image_vector(im)
    print("v1,v3:", cos_similarity(v1,v3))
    print("v2,v3:", cos_similarity(v2,v3))
    a = list(map(lambda x: '%.2f' % x, v3.tolist()[0:10]))
    print("p1: ", a)
    im2 = cv2.imread("D:\\python-workspace\\cos-similarity\\images\\food4.jpg")
    v4 = build_image_vector(im2)

    print("v1,v4:", cos_similarity(v1, v4))
    print("v2,v4:", cos_similarity(v2, v4))

As for the pictures used in the code, readers can collect their own; the author simply grabbed some from a search engine. The program's output is intuitive: the similarity between v2 (cute pets) and image D1 is 0.956626, higher than the similarity between v1 (food) and image D1 at 0.942010, so the classification is clear.


Case 3: Text retrieval

Suppose there are three documents: one about Apple (the company) in the context of the epidemic, and two about fruit. Given the query "Apple is my favorite fruit", how do we find the most relevant document?

[Figure: the three sample documents]

The core issue here is: how do we vectorize the documents and the query?

In fact, the query can itself be regarded as a document, so the question simplifies to: how do we vectorize a document?

Taking the simplest possible answer: a document is composed of words, so let each word be a dimension, with the word's frequency in the document as the dimension value.

In real systems the dimension values are computed in more sophisticated ways, for example with TF-IDF. The plain term frequency (TF) used here does not affect the demonstration, so we keep it simple.

Once the text is vectorized, the rest follows the same pattern as before: use the cosine formula to compute similarity. The process is as follows:

[Figure: text retrieval flow: segment the text, build term-frequency vectors, compute cosine similarity]

Finally, the code:

# -*- coding: utf-8 -*-
import numpy as np
import numpy.linalg as linalg
import jieba

def cos_similarity(v1, v2):
    num = float(np.dot(v1, v2))  # dot product of the two vectors
    denom = linalg.norm(v1) * linalg.norm(v2)
    if denom > 0:
        cos = num / denom  # cosine value, in [-1, 1]
        sim = 0.5 + 0.5 * cos  # normalize to [0, 1]
        return sim
    return 0

def build_doc_tf_vector(doc_list):
    num = 0
    doc_seg_list = []
    word_dic = {}  # maps each word to its dimension index
    for d in doc_list:
        seg_list = jieba.cut(d, cut_all=False)
        seg_filtered = filter(lambda x: len(x) > 1, seg_list)  # drop single-character tokens

        w_list = []
        for w in seg_filtered:
            w_list.append(w)
            if w not in word_dic:
                word_dic[w] = num
                num += 1

        doc_seg_list.append(w_list)

    print(word_dic)

    doc_vec = []

    # Build a term-frequency vector for each document
    for d in doc_seg_list:
        vi = [0] * len(word_dic)
        for w in d:
            vi[word_dic[w]] += 1
        doc_vec.append(np.array(vi))
        print(vi[0:40])
    return doc_vec, word_dic

def build_query_tf_vector(query, word_dic):
    seg_list = jieba.cut(query, cut_all=False)
    vi = [0] * len(word_dic)
    for w in seg_list:
        if w in word_dic:
            vi[word_dic[w]] += 1
    return vi

if __name__ == '__main__':
    doc_list = [
        """
         受全球疫情影响,3月苹果宣布关闭除大中华区之外数百家全球门店,其庞大的供应链体系也受到冲击,
         尽管目前富士康等代工厂已经开足马力恢复生产,但相比之前产能依然受限。中国是iPhone生产的大本营,
         为了转移风险,iPhone零部件能否实现印度制造?实现印度生产的最大难点就是,相对中国,印度制造业仍然欠发达
        """,
        """
        苹果是一种低热量的水果,每100克产生大约60千卡左右的热量。苹果中营养成分可溶性大,容易被人体吸收,故有“活水”之称。
        它有利于溶解硫元素,使皮肤润滑柔嫩。
        """,
        """
        在生活当中,香蕉是一种很常见的水果,一年四季都能吃得着,因其肉质香甜软糯,且营养价值高,所以深受老百姓的喜爱。
        那么香蕉有什么具体的功效,你了解吗?
        """
    ]

    query = "苹果是我喜欢的水果"

    doc_vector, word_dic = build_doc_tf_vector(doc_list)

    query_vector = build_query_tf_vector(query, word_dic)

    print(query_vector[0:35])

    for i, doc in enumerate(doc_vector):
        si = cos_similarity(doc, query_vector)
        print("doc", i, ":", si)

The results of our search and sorting are as follows:

[Figure: similarity scores of the three documents against the query]

Document D2 is the most similar, matching our expectation. We used the simplest possible method to implement a search scoring and ranking example; it has no production value, but it demonstrates the working principle of a search engine.

4. Beyond cosine

The three simple cases above demonstrated the use of the cosine formula, but its power has not yet been fully released. Next, let's see how it is used in an industrial-grade system. The subject is Lucene, the kernel of the open source search engine Elasticsearch (ES), and the research question is: how does Lucene use cosine similarity to score document relevance?

Lucene's realization of this idea has its own name: the vector space model. That is, many vectorized documents together form a vector space. Let's first look directly at the formula:

[Figure: Lucene's practical scoring formula]

Obviously, the practical formula looks very different from the theoretical one. How do we make sense of it? In other words, how do we derive the practical formula from the theoretical one?

The first thing to note is that in Lucene, the features of the document vector are no longer the raw term frequencies of case 3 but TF-IDF values. TF-IDF is relatively simple; its main ideas are:

How do we quantify how critical a word is within a document? TF-IDF's answer is to combine two factors: term frequency (how many times the word appears in the current document) and inverse document frequency (how many documents the word appears in).

  1. The more often a word appears in the current document (TF), the more important it is to that document.
  2. The fewer other documents a word appears in (IDF), the more distinctive it is.

If you are interested, you can consult other materials; here, the small worked sketch below will do.
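A minimal textbook sketch of the idea (note this is not Lucene's exact formula; classic Lucene uses idf(t) = 1 + ln(numDocs / (docFreq + 1)) plus boosts). The toy corpus and its tokenization are made up for illustration:

import math

# Toy corpus: each document is a list of tokens, assumed already segmented
docs = [
    ["apple", "iphone", "factory", "india"],
    ["apple", "fruit", "calorie", "skin"],
    ["banana", "fruit", "sweet", "nutrition"],
]

def tf_idf(term, doc, docs):
    # Textbook TF-IDF: term frequency times inverse document frequency
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)  # documents containing the term
    idf = math.log(len(docs) / (1 + df))    # +1 avoids division by zero
    return tf * idf

print(tf_idf("fruit", docs[1], docs))   # 0.0: appears in 2 of 3 docs, idf = ln(3/3) = 0
print(tf_idf("banana", docs[2], docs))  # ~0.10: appears in only 1 doc, so it is distinctive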

Back to our core question: how do we derive the practical formula from the theoretical one?

Just four steps, as shown below:

[Figure: the four derivation steps]

Step 1: Calculate the vector dot product

The dot product is just the application of the mathematical formula. Note two simplifying observations:

  1. For words that do not appear in the query, tf(t,q) = 0, so they drop out of the sum.
  2. A query sentence rarely repeats a word, so tf(t,q) = 1 for the words that do appear.

With these, the first derivation is completed quite simply:

[Figure: step 1 of the derivation]
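Reconstructing that step under the definitions above (each vector coordinate is the term's tf × idf, and tf(t,q) = 1 for query terms):

V(q) · V(d) = Σ_{t ∈ q} [tf(t,q) · idf(t)] · [tf(t,d) · idf(t)]
            = Σ_{t ∈ q} tf(t,d) · idf(t)²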

Step 2: Calculate the length of the query vector |V(q)|

Calculating the length of a vector is again the Pythagorean theorem, only now the Pythagorean theorem for multidimensional space.

[Figure: step 2, queryNorm]

The name queryNorm indicates that this operation normalizes the vector: multiplying a vector by its queryNorm turns it into a unit vector. A unit vector has length 1, hence the name normalization, shortened to norm. Once you understand this layer, the Lucene source code becomes much easier to read. As the line from Nirvana in Fire (Langya Bang) goes: the question comes from the court, but the answer lies in the jianghu. Here the question comes from the Lucene source code, but the answer lies in mathematics.
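Spelled out under the step-1 assumptions (the query vector's coordinates are just the idf values; classic Lucene additionally folds query-time boosts into this sum):

queryNorm(q) = 1 / |V(q)| = 1 / sqrt( Σ_{t ∈ q} idf(t)² )

so that V(q) · queryNorm(q) has length 1.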

Step 3: Calculate the document vector length |V(d)|

In fact, the approach of step 2 cannot simply be reused here. As mentioned earlier, a vector has two elements, direction and length, and the cosine formula considers only direction; dividing by |V(d)| would make the score entirely independent of document length.


In a search engine, when a query hits both a long document and a short document, the cosine formula with TF-IDF features is biased toward giving the short document a higher score. That is unfair to long documents, so it needs optimization.


The optimization idea is to base the document norm on the number of terms in the document, thereby shrinking the gap between long and short documents. Business requirements vary, so the Lucene implementation exposes an interface that lets users customize this normalization, improving flexibility.
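For reference, a sketch of what classic Lucene (TFIDFSimilarity) does by default, with boosts omitted: the strict vector length |V(d)| is replaced by a length norm of the form

lengthNorm(d) = 1 / sqrt(number of terms in d)

which dampens, rather than fully removes, the scoring difference between long and short documents.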

Step 4: Mix in user weights and the scoring factor

User weight refers to a weight assigned to a query term, typically by the user or the application; bid ranking, for example, artificially raises the weight of certain terms. The scoring factor means that a document matching more of the query's keywords than other documents receives a larger value; it accounts for multi-word query scenarios. After these four steps, comparing the derived formula with the practical one again, the resemblance is very strong.
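Assembled, this is the classic Lucene practical scoring function as documented for TFIDFSimilarity (coord is the scoring factor, the boosts are the user weights, and norm(t,d) carries the document length normalization):

score(q,d) = coord(q,d) · queryNorm(q) · Σ_{t ∈ q} [ tf(t,d) · idf(t)² · t.getBoost() · norm(t,d) ]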

[Figure: the derived formula side by side with Lucene's official formula]

The deduced formula is basically the same as the official formula.

5. Summary

This article briefly introduced the mathematical background of cosine similarity: starting from the construction of the Egyptian pyramids, we arrived at the Pythagorean theorem, then the law of cosines, and finally derived the cosine formula in vector form.

Next, three business scenarios illustrated how the cosine formula, as a mathematical model, can be applied in practice. Each of the three examples runs under a hundred lines of code, which should help readers better understand cosine similarity.

Finally, an industrial-grade example was introduced. ES, built on Lucene, is currently the hottest search engine solution; studying the cosine formula inside Lucene helps one understand how the industry really plays, and further deepens the understanding of the cosine formula.

6. References

  1. The Beauty of Mathematics, by Wu Jun

  2. Image Basics 7: Image Classification - Cosine Similarity

Author: Shuai Guangying


Source: blog.51cto.com/14291117/2546415