Cosine similarity formula derivation and code implementation

1. Why use cosine similarity?

  • Defining similarity between two points in space
    • Two points in space, viewed as vectors from the origin, each have a direction, and the angle between those directions lies in [0°, 180°].
    • The more similar two points are, the closer their directions, so the angle should be smaller (closer to 0°).
    • If two points have opposite meanings, their directions are closer to opposite, so the angle should be larger (closer to 180°).
    • If two points are unrelated, their directions are perpendicular, so the angle should be close to 90°.
  • Why use cosine instead of sine?
    • Given the definition above, we need a function that is monotonic over the angle range [0°, 180°].
    • As the figure below shows, sine is not monotonic on [0°, 180°], so it does not meet the requirement.
    • Cosine is monotonic (strictly decreasing) on [0°, 180°]:
      • When the angle is 0°, the points should be most similar, and the cosine takes its largest value (1).
      • When the angle is 90°, there should be no correlation, and the cosine is 0.
      • When the angle is 180°, the meanings should be opposite, and the cosine takes its smallest value (−1).
        [Figure: sine and cosine curves over the angle range [0°, 180°]]
  • Summary
    • The monotonicity of cosine matches how we want similarity to vary with direction in space, which is why cosine similarity is used (the author's own understanding).
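
The monotonicity argument above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the variable names are illustrative and not part of the original post.

```python
import numpy as np

# Sample the angle range [0°, 180°] in 1° steps.
angles_deg = np.arange(0, 181)
angles_rad = np.deg2rad(angles_deg)

cos_values = np.cos(angles_rad)
sin_values = np.sin(angles_rad)

# Cosine strictly decreases over the whole range,
# so every angle maps to a distinct value.
cos_is_monotonic = bool(np.all(np.diff(cos_values) < 0))

# Sine rises up to 90° and then falls,
# so two different angles can share the same value.
sin_diffs = np.diff(sin_values)
sin_is_monotonic = bool(np.all(sin_diffs > 0) or np.all(sin_diffs < 0))

print("cosine monotonic on [0, 180]:", cos_is_monotonic)
print("sine monotonic on [0, 180]:", sin_is_monotonic)
print("sin(30°) and sin(150°):", sin_values[30], sin_values[150])
```

Because sin(30°) and sin(150°) coincide, sine cannot tell a nearly aligned pair of directions from a nearly opposite one, whereas cosine can.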

2. Formula derivation

2.1 Derivation of the law of cosines

$$
\begin{aligned}
\cos(\theta) &= \frac{AB^{2}+AC^{2}-BC^{2}}{2 \cdot AB \cdot AC}
\end{aligned}
$$
[Figure: triangle ABC with θ the angle at A and D the foot of the perpendicular from C onto AB]

  • In the right triangle ADC (D is the foot of the perpendicular from C onto AB), the definition of cosine gives
    $$
    \cos(\theta) = \frac{AD}{AC}
    $$
  • By the Pythagorean theorem,
    $$
    \begin{aligned}
    AC^{2} &= AD^{2}+CD^{2} \quad\quad\quad (1)\\
    BC^{2} &= BD^{2}+CD^{2} \quad\quad\quad (2)
    \end{aligned}
    $$
  • Subtract equation (2) from equation (1)
    $$
    \begin{aligned}
    AC^{2}-BC^{2} &= AD^{2}-BD^{2} \quad\quad\quad (3)
    \end{aligned}
    $$
  • Relationship between AB, AD, and BD: AD is unknown (but it is the quantity the cosine formula needs) and AB is known, so express BD in terms of AB and AD in order to eliminate BD
    $$
    \begin{aligned}
    AB &= AD+BD\\
    BD &= AB-AD \quad\quad\quad (4)
    \end{aligned}
    $$
  • Substitute equation (4) into equation (3)
    $$
    \begin{aligned}
    AC^{2}-BC^{2} &= AD^{2}-(AB-AD)^{2}\\
    AC^{2}-BC^{2} &= AD^{2}-AB^{2}+2 \cdot AB \cdot AD-AD^{2}\\
    AC^{2}-BC^{2} &= -AB^{2}+2 \cdot AB \cdot AD\\
    AD &= \frac{AB^{2}+AC^{2}-BC^{2}}{2 \cdot AB} \quad\quad\quad (5)
    \end{aligned}
    $$
  • Substitute equation (5) into the cosine formula
    $$
    \begin{aligned}
    \cos(\theta) &= \frac{AD}{AC}\\
    &= \frac{\frac{AB^{2}+AC^{2}-BC^{2}}{2 \cdot AB}}{AC}\\
    &= \frac{AB^{2}+AC^{2}-BC^{2}}{2 \cdot AB \cdot AC}
    \end{aligned}
    $$
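
As a sanity check on the derivation, the result can be evaluated on a concrete triangle. This is a small sketch under stated assumptions: the coordinates of A, B, and C are hypothetical, chosen so that AB lies on the x-axis and D is simply the point below C on that axis.

```python
import math

# Hypothetical triangle: A at the origin, B on the x-axis, C above the axis.
A = (0.0, 0.0)
B = (5.0, 0.0)
C = (2.0, 3.0)

def dist(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

AB, AC, BC = dist(A, B), dist(A, C), dist(B, C)

# Equation (5): length of AD expressed through the three side lengths.
AD = (AB**2 + AC**2 - BC**2) / (2 * AB)

# Final formula: cosine of the angle at A from the side lengths only.
cos_from_sides = (AB**2 + AC**2 - BC**2) / (2 * AB * AC)

# Independent check: measure the angle at A directly from C's coordinates.
cos_from_angle = math.cos(math.atan2(C[1], C[0]))

print("AD:", AD)  # matches C's x-coordinate, since D is the foot of the perpendicular
print("cos from sides:", cos_from_sides)
print("cos from angle:", cos_from_angle)
```

With this layout AD equals the x-coordinate of C, and both cosine values agree, which is what equation (5) and the final formula predict.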

2.2 Derivation of the vector form of the cosine formula

$$
\begin{aligned}
\cos(\theta) &= \frac{\vec{a} \cdot \vec{b}}{\lVert \vec{a} \rVert \cdot \lVert \vec{b} \rVert}
\end{aligned}
$$

  • Vector definitions
    $$
    \begin{aligned}
    \vec{c} &= \vec{a}-\vec{b} \quad\quad (1)\\
    AC &= \lVert \vec{b} \rVert \quad\quad (2)\\
    AB &= \lVert \vec{a} \rVert \quad\quad (3)\\
    BC &= \lVert \vec{c} \rVert \quad\quad (4)
    \end{aligned}
    $$
    [Figure: vectors a and b drawn from point A along AB and AC, with c = a − b joining C to B]
  • Substitute equations (1), (2), (3), and (4) into the law of cosines, expanding the squared norm as ‖a − b‖² = ‖a‖² − 2 a·b + ‖b‖²
    $$
    \begin{aligned}
    \cos(\theta) &= \frac{AB^{2}+AC^{2}-BC^{2}}{2 \cdot AB \cdot AC}\\
    \cos(\theta) &= \frac{\lVert \vec{a} \rVert^{2}+\lVert \vec{b} \rVert^{2}-\lVert \vec{a}-\vec{b} \rVert^{2}}{2 \cdot \lVert \vec{a} \rVert \cdot \lVert \vec{b} \rVert}\\
    \cos(\theta) &= \frac{\lVert \vec{a} \rVert^{2}+\lVert \vec{b} \rVert^{2}-\lVert \vec{a} \rVert^{2}+2 \, \vec{a} \cdot \vec{b}-\lVert \vec{b} \rVert^{2}}{2 \cdot \lVert \vec{a} \rVert \cdot \lVert \vec{b} \rVert}\\
    \cos(\theta) &= \frac{\vec{a} \cdot \vec{b}}{\lVert \vec{a} \rVert \cdot \lVert \vec{b} \rVert}
    \end{aligned}
    $$
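
The equivalence of the side-length form and the dot-product form can be confirmed numerically. The sketch below assumes NumPy; the two example vectors are arbitrary values picked for illustration.

```python
import numpy as np

# Two arbitrary example vectors (illustrative values only).
a = np.array([3.0, 4.0, 0.0])
b = np.array([1.0, 2.0, 2.0])
c = a - b  # equation (1)

# Side lengths of the triangle spanned by a and b: equations (2) to (4).
norm_a = np.linalg.norm(a)  # AB
norm_b = np.linalg.norm(b)  # AC
norm_c = np.linalg.norm(c)  # BC

# Law of cosines written with the vector norms.
cos_from_sides = (norm_a**2 + norm_b**2 - norm_c**2) / (2 * norm_a * norm_b)

# Dot-product form obtained at the end of the derivation.
cos_from_dot = np.dot(a, b) / (norm_a * norm_b)

print("law of cosines:", cos_from_sides)
print("dot product form:", cos_from_dot)
```

Both expressions evaluate to 11/15 for these vectors, since a·b = 11, ‖a‖ = 5, and ‖b‖ = 3.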

3. Cosine similarity code implementation

  • The code is from the book: Advanced Deep Learning: Natural Language Processing
    import numpy as np
    
    def preprocess(text):
        """
        Preprocess the corpus.

        :param text: sentence string
        :return:
            corpus: array of word IDs
            word_to_id: dictionary mapping each word to its word ID
            id_to_word: dictionary mapping each word ID to its word
        """
        text = text.lower().replace('.', ' .')  # lowercase everything and split off the period
        words = text.split(' ')  # split on spaces
        word_to_id = {}
        id_to_word = {}
        for word in words:
            if word not in word_to_id:
                # Assign IDs in order of first appearance.
                new_id = len(word_to_id)
                word_to_id[word] = new_id
                id_to_word[new_id] = word
        corpus = np.array([word_to_id[w] for w in words])
        return corpus, word_to_id, id_to_word

    def create_co_matrix(corpus, vocab_size, window_size=1):
        """
        Build the co-occurrence matrix from the corpus.

        :param corpus: array of word IDs
        :param vocab_size: number of distinct words
        :param window_size: context window size on each side
        :return:
            co-occurrence matrix
        """
        corpus_size = len(corpus)
        co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)
        for idx, word_id in enumerate(corpus):
            for i in range(1, window_size + 1):
                left_idx = idx - i
                right_idx = idx + i
                # Count the neighbor on the left, if it exists.
                if left_idx >= 0:
                    left_word_id = corpus[left_idx]
                    co_matrix[word_id, left_word_id] += 1
                # Count the neighbor on the right, if it exists.
                if right_idx < corpus_size:
                    right_word_id = corpus[right_idx]
                    co_matrix[word_id, right_word_id] += 1
        return co_matrix

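    def cos_similarity(x, y, eps=1e-8):
        """
        Cosine similarity of two vectors.

        :param x: first vector
        :param y: second vector
        :param eps: small constant (default 1e-8) that keeps the denominator from being 0
        :return:
            cosine similarity value
        """
        nx = x / (np.sqrt(np.sum(x ** 2)) + eps)  # normalize x
        ny = y / (np.sqrt(np.sum(y ** 2)) + eps)  # normalize y
        return np.dot(nx, ny)
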
    text = 'I say hello and You say goodbye.'
    corpus, word_to_id, id_to_word = preprocess(text)
    print("corpus:", corpus)
    print("word_to_id:", word_to_id)
    print("id_to_word:", id_to_word)
    vocab_size = len(word_to_id)
    C = create_co_matrix(corpus, vocab_size, window_size=1)
    print("co-occurrence matrix:", C)
    c0 = C[word_to_id['you']]  # word vector for "you"
    c1 = C[word_to_id['i']]  # word vector for "i"
    print('similarity between "you" and "i":', cos_similarity(c0, c1))
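
To connect the function back to the three cases in section 1, the sketch below evaluates it on vectors that point the same way, perpendicular, and opposite, and then on a zero vector to show what `eps` is for. It repeats `cos_similarity` so that it runs on its own; the test vectors are illustrative assumptions, not data from the book.

```python
import numpy as np

def cos_similarity(x, y, eps=1e-8):
    """Cosine similarity; eps keeps the denominator away from zero."""
    nx = x / (np.sqrt(np.sum(x ** 2)) + eps)
    ny = y / (np.sqrt(np.sum(y ** 2)) + eps)
    return np.dot(nx, ny)

v = np.array([1.0, 2.0])

same_direction = cos_similarity(v, 3 * v)                 # angle 0°, close to 1
perpendicular = cos_similarity(v, np.array([-2.0, 1.0]))  # angle 90°, equal to 0
opposite = cos_similarity(v, -v)                          # angle 180°, close to -1

# Without eps, a zero vector would cause a division by zero and yield nan.
with_zero_vector = cos_similarity(v, np.zeros(2))

print("same direction:", same_direction)
print("perpendicular:", perpendicular)
print("opposite:", opposite)
print("with zero vector:", with_zero_vector)
```

Because `eps` is added to each norm, the results for parallel and opposite vectors fall very slightly short of exactly 1 and −1, which is the price paid for numerical safety.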
    
    


Origin blog.csdn.net/m0_46926492/article/details/130504554