Python case study | Text similarity comparison


This case study designs and implements two classes for text similarity comparison, Vector and Sketch, to give you further practice in designing Python classes that solve practical problems.

01. Overview of text similarity comparison

Text similarity is measured by computing a summary (sketch) of each document and then comparing the summaries.

The simplest document summary is a vector of the relative frequencies of the k-grams (substrings of k consecutive characters) in the document. If each character can take 128 different values (ASCII), there are $128^k$ possible k-grams, so the vector would need $128^k$ dimensions, and the number is astronomically larger for Unicode. In practice, therefore, a hash function is used: hash(s) % d maps each k-gram string s to an integer between 0 and d-1, so the document summary vector has only d dimensions.
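As a rough illustration of the mapping just described, the following sketch extracts the k-grams of a short string and counts them in d buckets. The function name kgram_buckets and the sample string are hypothetical, chosen only for this example; which bucket each k-gram lands in depends on Python's built-in hash() and may differ between runs.

```python
def kgram_buckets(text, k, d):
    """Count the k-grams of text in d buckets chosen by hash(s) % d."""
    counts = [0] * d
    # A string of length n contains n - k + 1 k-grams (none if n < k).
    for i in range(len(text) - k + 1):
        kgram = text[i:i + k]
        counts[hash(kgram) % d] += 1
    return counts


counts = kgram_buckets("abracadabra", k=3, d=16)
print(len(counts))  # the vector has d = 16 entries
print(sum(counts))  # one count per k-gram: 11 - 3 + 1 = 9
```

Because several k-grams can share a bucket, the vector is a lossy summary; a larger d reduces collisions at the cost of memory.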

After the document summary vector is created, the similarity between the two documents can be judged by comparing the distance between the two document summary vectors.

The following first describes the design and implementation of the vector class (Vector) and the document summary class (Sketch), and then uses the document summary class to compare the similarity of documents.

02. Vector (Vector) class design and implementation

A vector is a mathematical abstraction. An n-dimensional vector can be represented as an ordered list of n real numbers $(x_0, x_1, \ldots, x_{n-1})$. Vectors support basic arithmetic operations, which can be implemented in Python through operator overloading.

Basic operations on vectors include: addition and subtraction of two vectors, multiplication of a vector by a scalar (a real number), the dot product of two vectors, and the magnitude and direction of a vector.

(1) Addition: $x + y = (x_0 + y_0, x_1 + y_1, \ldots, x_{n-1} + y_{n-1})$

(2) Subtraction: $x - y = (x_0 - y_0, x_1 - y_1, \ldots, x_{n-1} - y_{n-1})$

(3) Scalar product: $\alpha x = (\alpha x_0, \alpha x_1, \ldots, \alpha x_{n-1})$

(4) Dot product: $x \cdot y = x_0 y_0 + x_1 y_1 + \cdots + x_{n-1} y_{n-1}$

(5) Magnitude: $|x| = (x_0^2 + x_1^2 + \cdots + x_{n-1}^2)^{1/2}$

(6) Direction: $x / |x| = (x_0/|x|, x_1/|x|, \ldots, x_{n-1}/|x|)$
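To make the formulas concrete, here is a small worked example using plain Python lists. The variable names and the sample vectors x = (3, 4) and y = (1, 2) are chosen purely for illustration.

```python
import math

x = [3.0, 4.0]
y = [1.0, 2.0]

add = [a + b for a, b in zip(x, y)]            # (1) addition
sub = [a - b for a, b in zip(x, y)]            # (2) subtraction
scaled = [2.0 * a for a in x]                  # (3) scalar product with alpha = 2
dot = sum(a * b for a, b in zip(x, y))         # (4) dot product
magnitude = math.sqrt(sum(a * a for a in x))   # (5) magnitude
direction = [a / magnitude for a in x]         # (6) direction (unit vector)

print(add)        # [4.0, 6.0]
print(sub)        # [2.0, 2.0]
print(scaled)     # [6.0, 8.0]
print(dot)        # 11.0
print(magnitude)  # 5.0
print(direction)  # [0.6, 0.8]
```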

The Vector class is designed as follows.

(1) Define a constructor that takes a list parameter (the coordinates of the vector, of any dimension) and uses it to initialize the instance attribute _coords.

(2) Overload the method __getitem__() to return the i-th element, that is, the i-th coordinate.

(3) Overload the methods __add__(), __sub__(), and __abs__() to implement vector addition, subtraction, and magnitude (modulus).

(4) Define the methods scale(), dot(), and direction() to implement the scalar product, the dot product, and the direction.

(5) Overload the method __len__() to return the dimension of the vector.

(6) Overload the method __str__() to return the string representation of the vector.

Based on the above design idea, the implementation and test code of Vector are as follows.

[Example 1] Implementation and testing of the vector class (Vector) (vector.py).

import math
class Vector:
    """Vector in a Cartesian coordinate system"""
    def __init__(self, a):
        """Constructor: slice-copy the list argument a into the instance variable _coords"""
        self._coords = a[:] # list of coordinates
        self._n = len(a) # dimension
    def __getitem__(self, i):
        """Return the i-th element, i.e. the i-th coordinate"""
        return self._coords[i]
    def __add__(self, other):
        """Return the sum of two vectors"""
        result = []
        for i in range(self._n):
            result.append(self._coords[i] + other._coords[i])
        return Vector(result)
    def __sub__(self, other):
        """Return the difference of two vectors"""
        result = []
        for i in range(self._n):
            result.append(self._coords[i] - other._coords[i])
        return Vector(result)
    def scale(self, n):
        """Return the product of the vector and the scalar n"""
        result = []
        for i in range(self._n):
            result.append(self._coords[i] * n)
        return Vector(result)
    def dot(self, other):
        """Return the dot (inner) product of two vectors"""
        result = 0
        for i in range(self._n):
            result += self._coords[i] * other._coords[i]
        return result
    def __abs__(self):
        """Return the magnitude (modulus) of the vector"""
        return math.sqrt(self.dot(self))
    def direction(self):
        """Return the unit vector in the direction of this vector"""
        return self.scale(1.0 / abs(self))
    def __str__(self):
        """Return the string representation of the vector"""
        return str(self._coords)
    def __len__(self):
        """Return the dimension of the vector"""
        return self._n
# Test code
def main():
    xCoords = [2.0, 2.0, 2.0]
    yCoords = [5.0, 5.0, 0.0]
    x = Vector(xCoords)
    y = Vector(yCoords)
    print('x = {}, y = {}'.format(x, y))
    print('x + y = {}'.format(x + y))
    print('10x = {}'.format(x.scale(10.0)))
    print('|x| = {}'.format(abs(x)))
    print('x . y = {}'.format(x.dot(y)))
    print('|x-y| = {}'.format(abs(x-y)))
if __name__ == '__main__':
    main()

The program output is as follows.

x = [2.0, 2.0, 2.0], y = [5.0, 5.0, 0.0]

x + y = [7.0, 7.0, 2.0]

10x = [20.0, 20.0, 20.0]

|x| = 3.4641016151377544

x . y = 20.0

|x-y| = 4.69041575982343
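Example 1 assumes that both operands have the same dimension: a shorter right-hand operand raises an IndexError, and a longer one is silently truncated. One possible guard is sketched below. Vec is a hypothetical, minimal stand-in written only for this illustration, not part of the Vector class above.

```python
class Vec:
    """Minimal stand-in for Vector, used only to illustrate a dimension check."""

    def __init__(self, coords):
        self._coords = list(coords)

    def __len__(self):
        return len(self._coords)

    def __add__(self, other):
        # Refuse to add vectors of different dimensions instead of
        # silently truncating or failing with an IndexError.
        if len(self) != len(other):
            raise ValueError(
                "dimension mismatch: {} vs {}".format(len(self), len(other)))
        return Vec(a + b for a, b in zip(self._coords, other._coords))

    def __str__(self):
        return str(self._coords)


print(Vec([2.0, 2.0, 2.0]) + Vec([5.0, 5.0, 0.0]))  # [7.0, 7.0, 2.0]
try:
    Vec([1.0, 2.0]) + Vec([1.0, 2.0, 3.0])
except ValueError as err:
    print(err)  # dimension mismatch: 2 vs 3
```

The same check could be applied to __sub__() and dot() if stricter behavior is wanted.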

03. Design and implementation of document summary class (Sketch)

The document summary class (Sketch) is used to encapsulate the summary information of the document. The design idea is as follows.

(1) Define a constructor with three parameters: text (the text), k (the k-gram length), and d (the dimension of the document summary vector). Use a list comprehension to create a list freq of d elements, all initialized to 0, to store the k-gram frequencies. Loop over all k-grams of the text, use the hash function to map each one to an integer between 0 and d-1, and increment the corresponding element of freq. Then create a Vector object vector from freq and call its direction() method to normalize it. Finally, save the normalized document summary vector in the instance attribute _sketch.

(2) Define the method similarTo() to calculate the cosine similarity of two document summary vectors.

Common measures for comparing two vectors include Euclidean distance and cosine similarity. Given vectors x and y, their Euclidean distance is defined as:

$|x - y| = \left( (x_0 - y_0)^2 + (x_1 - y_1)^2 + \cdots + (x_{n-1} - y_{n-1})^2 \right)^{1/2}$

The cosine similarity is defined as:

$\cos\theta = \dfrac{x \cdot y}{|x|\,|y|}$

With Vector objects x and y, the Euclidean distance is abs(x - y). Because the document summary vectors are normalized to unit length (|x| = |y| = 1), their cosine similarity reduces to x.dot(y).
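For unit vectors the two measures carry the same information, since $|x - y|^2 = |x|^2 + |y|^2 - 2\,x \cdot y = 2 - 2\,x \cdot y$. The following sketch checks this numerically with plain lists; the helper names (dot, norm, unit) and the sample vectors are illustrative only.

```python
import math


def dot(x, y):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    """Euclidean magnitude of a list of numbers."""
    return math.sqrt(dot(x, x))


def unit(x):
    """Scale x to unit length."""
    n = norm(x)
    return [a / n for a in x]


x = unit([2.0, 2.0, 2.0])
y = unit([5.0, 5.0, 0.0])

cosine = dot(x, y)                               # cosine similarity of unit vectors
distance = norm([a - b for a, b in zip(x, y)])   # Euclidean distance

# For unit vectors: distance**2 equals 2 - 2 * cosine, up to rounding.
print(math.isclose(distance ** 2, 2 - 2 * cosine))  # True
```

A larger cosine similarity therefore always corresponds to a smaller Euclidean distance between normalized sketches.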

(3) Overload the method __str__() to return the string representation of the document summary vector.

Based on the above design idea, the implementation and test code of the document summary class (Sketch) are as follows.

[Example 2] Implementation and testing of the document summary class (Sketch) (sketch.py).

from vector import Vector
class Sketch:
    """Document summary vector (d-dimensional) of the k-grams of a text"""
    def __init__(self, text, k, d):
        """Constructor: compute the document summary vector of text"""
        freq = [0 for i in range(d)] # list of length d, initialized to 0
        for i in range(len(text) - k + 1): # extract every k-gram (including the last) and count it
            kgram = text[i:i+k]
            freq[hash(kgram) % d] += 1
        vector = Vector(freq) # create the document summary vector
        self._sketch = vector.direction() # normalize and store in the instance attribute
    def similarTo(self, other):
        """Return the cosine similarity of two Sketch objects"""
        return self._sketch.dot(other._sketch)
    def __str__(self):
        return str(self._sketch)
# Test code
def main():
    with open("tomsawyer.txt", "r") as f:
        text = f.read()
    sketch = Sketch(text, 5, 100)
    print(sketch)
if __name__ == '__main__':
    main()

 

The result of running the program is as follows.

[0.09151094195152963, …, 0.08903767325013694]

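Note

Python's built-in hash() for strings depends on a random seed. Since Python 3.3 this seed is chosen afresh for each interpreter process by default, so the hash value of a given string, and therefore the program output, may vary from run to run. Within a single run the hash values are consistent, so sketches computed in the same process can still be compared with each other.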
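If reproducible sketches are needed, for example to store them and compare them across runs, one option is to replace the built-in hash() with a checksum that does not depend on a per-process seed. The sketch below uses zlib.crc32 from the standard library; this is a substitution made for illustration, not what Example 2 does, and stable_bucket is a hypothetical helper name. Alternatively, setting the PYTHONHASHSEED environment variable to a fixed integer makes hash() repeatable for strings.

```python
import zlib


def stable_bucket(kgram, d):
    """Map a k-gram to a bucket in [0, d-1] without relying on hash()."""
    # crc32 works on bytes and returns the same unsigned integer on every run.
    return zlib.crc32(kgram.encode("utf-8")) % d


print(stable_bucket("hello", 100) == stable_bucket("hello", 100))  # True
print(0 <= stable_bucket("world", 100) < 100)                      # True
```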

 

04. Determine the similarity of documents by comparing document summaries

Using the Sketch class designed and implemented above, the similarity of documents can be compared.

[Example 3] Use the Sketch class to compare the similarity of documents (document_compare.py).

from sketch import Sketch
# List of test documents
filenames = ['gene.txt', 'pride.txt', 'tomsawyer.txt']
k = 5      # k-gram length
d = 100000 # dimension of the document summary vector
sketches = [0 for i in filenames]
for i in range(len(filenames)):
    with open(filenames[i], 'r') as f:
        text = f.read()
        sketches[i] = Sketch(text, k, d)
# Print the header row
print(' '*15, end='')
for filename in filenames:
    print('{:>22}'.format(filename), end='')
print()
# Print the pairwise similarity of every pair of documents
for i in range(len(filenames)):
    print('{:15}'.format(filenames[i]), end='')
    for j in range(len(filenames)):
        print('{:22}'.format(sketches[i].similarTo(sketches[j])), end='')
    print()

The result of running the program is as follows:

[Figure: table of pairwise similarity values for gene.txt, pride.txt, and tomsawyer.txt]

The results show that each document has a similarity of 1 with itself, that documents of the same type (pride.txt and tomsawyer.txt) have a relatively high similarity, and that documents of different types (gene.txt and pride.txt) have a relatively low similarity.
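The same comparison can be tried without the text files. The following self-contained sketch repeats the idea of Examples 2 and 3 on three short strings. The function names (sketch, similarity), the sample strings, and the parameter values k = 3 and d = 10007 are made up for illustration, and the code does not import the vector or sketch modules.

```python
import math


def sketch(text, k, d):
    """Return the unit-length k-gram frequency vector of text (d buckets)."""
    freq = [0] * d
    for i in range(len(text) - k + 1):
        freq[hash(text[i:i + k]) % d] += 1
    length = math.sqrt(sum(f * f for f in freq))
    return [f / length for f in freq]


def similarity(a, b):
    """Cosine similarity of two unit vectors, i.e. their dot product."""
    return sum(x * y for x, y in zip(a, b))


texts = {
    "fox_dog": "the quick brown fox jumps over the lazy dog",
    "fox_cat": "the quick brown fox jumps over the lazy cat",
    "dna": "ACGTTGCAACGTTGCATTGACCAGT",
}
sketches = {name: sketch(t, k=3, d=10007) for name, t in texts.items()}

# Print one row of pairwise similarities per text.
for a in sketches:
    row = ["{:.3f}".format(similarity(sketches[a], sketches[b])) for b in sketches]
    print("{:8}".format(a), " ".join(row))
```

Each string should score 1 (up to rounding) against itself, and the two sentences that differ by one word should score far higher against each other than against the DNA-like string, which shares no k-grams with them apart from possible hash collisions.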

 


Origin blog.csdn.net/qq_41640218/article/details/131715402