NLP Primer (1): Extracting Article Keywords and Measuring the Similarity of Two Texts

1. Data preparation

# Contents of the text.txt file
1.In her mind she followed the white BUICK  along the road somewhere between here and the Niagara River.
2.There were some sweet machines other than women, an old BUGATTI , a lean Farina coachwork on an American chassis, a Swallow, a type 540- K Mercedes and lots more.
3.Only three standard models- BUICK , Chrysler, and Mercury- had slight year-to-year gains in March sales in the county.
4.The white BUICK  hadn't moved away yet.
5. The simple mechanical strain of overweight, says New York's Dr Norman Jolliffe, can overburden and damage the heart "for much the same reason that a Chevrolet engine in a CADILLAC  body would wear out sooner than if it were in a body for which it was built".
6. It is well to bear in mind that gasoline will cost from 80 to 90 for the equivalent of a United States gallon and while you might prefer a familiar Ford, Chevrolet or even a CADILLAC , which are available in some countries, it is probably wiser to choose the smaller European makes which average thirty, thirty-five and even forty miles to the gallon.
7. Your chauffeur's expenses will average between $7.00 to $12.00 a day, but this charge is the same whether you rent a 7-passenger CADILLAC  limousine or a 4-passenger Peugeot or Fiat 1800.
8.Of course, if you want to throw all caution to the winds and rent an Imperial or CADILLAC  limousine just for you and your bride, you'll have a memorable tour, but it won't be cheap, and it is not recommended unless you own a producing oil well or you've had a winner in the Irish Sweepstakes.
9.They answered him in monosyllables, nods, occasionally muttering in Greek to one another, awaiting the word from Papa, who restlessly cracked his knuckles, anxious to stuff himself into his white CADILLAC  and burst off to the freeway.
10. It was a CADILLAC , black grayed with the dust of the road, its windows closed tight so you knew that the people who climbed out of it would be cool and unwrinkled.
11. Almost immediately Howard and his daughter Debora drove up in the CADILLAC .
12. There was really no reason to refuse, and Linda Kay had never ridden in a CADILLAC .
13. Rates for American cars are somewhat higher, ranging from about $8.00 a day up to $14.00 a day for a CHEVROLET  Convertible, but the rate per kilometer driven is roughly the same as for the larger European models.
14.Friends, a picture magazine distributed by CHEVROLET  dealers, describes a paramilitary organization of employees of the Gulf Telephone Company at Foley, Alabama.
15.He had a perfectly good Audi when he moved here last year.
16.We're getting an Audi.
17. The string was walking round in a circle at the end of the gallops when Bill's Audi drew up.
18. She parked the hired Audi Coupe in front of the wire fence .
19. Sabrina zipped up her anorak as she stepped out into the cold night air and rummaged in her pockets for the keys to the Audi Coupe.
20. Ellwood drove an Audi -- fast but not flashy.
21.They parked the Audi where the guardhouse had once stood, on a small patch of concreted ground to the side of the road.
22.Adam grabbed Billie and hid her behind the Audi, glad that he'd chosen a four-wheel-drive Quattro.
23.He switched the engine on and swung the Audi out of the car-park, down Yorckstrasse towards the outskirts of the city.
24.With his other arm he wrenched the wheel to the right, forced the Audi on to the pavement and against the wall.
25.With the flames engulfing the roof of the Audi, Adam lay across the two front seats, aimed the machine-gun and shot the bomber dead.
26.Adam crawled out of the Audi, grabbed Billie and ran with her before the petrol tanks exploded.
27.As they dragged her away from the flaming Audi, she had turned and seen Adam lying on the road, shielding himself.
28.The Audi slammed into the side of the Volvo and Donna had to use all her strength to keep control of the car.
29.Minutes later cannabis worth 234,000 was found hidden in Melms's Audi at Newhaven, Sussex.
30.Gus halted the Aston Martin at the doorway instead of driving straight on to the garage, and was out of the driving-seat like a greyhound out of a trap, to dart round to the passenger side and hand Charlotte out.

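2. Extracting keywords with TF-IDF
TF: term frequency, i.e. how often a word occurs within a text.
IDF: inverse document frequency. Count the documents in which the word occurs, then take the logarithm of the total number of documents divided by that count.
Let k be the importance of a word in a text; then k = (number of times the word occurs in the text) * log(total number of documents / number of documents containing the word). Words such as "the", "I" or "are" occur in almost every document, so log(total number of documents / number of documents containing the word) is close to 0 and k stays small however often they appear; such words are clearly not keywords. TF-IDF therefore tends to filter out common words and keep the important ones. The code below divides the count by the length of the text, which rescales every k within a text by the same factor and leaves the ranking unchanged.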
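To make the formula concrete, here is a minimal sketch on a three-sentence toy corpus. The corpus and the helper name `tf_idf` are made up for illustration and are not part of the class in this article; the sketch uses the raw count as TF and assumes the word occurs in at least one document.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each inner list is one tokenized document.
docs = [
    ["the", "white", "buick", "moved"],
    ["the", "audi", "hit", "the", "volvo"],
    ["the", "cadillac", "was", "black"],
]

def tf_idf(word, doc, docs):
    """k = (count of word in doc) * log(N / number of docs containing word)."""
    tf = Counter(doc)[word]
    # Assumption: word occurs in at least one document, so df > 0.
    df = sum(1 for d in docs if word in d)
    return tf * math.log(len(docs) / df)

# "the" occurs in every document, so its IDF is log(3/3) = 0.
print(tf_idf("the", docs[1], docs))
# "audi" occurs in a single document, so its score is 1 * log(3/1).
print(tf_idf("audi", docs[1], docs))
```

Although "the" appears twice in the second sentence, its score is zero, while the rarer "audi" gets a positive score.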

from collections import Counter
import math
import numpy as np
class tfIdf():
    def __init__(self, path, topK):
        self.path = path
        self.topK = topK

    def ReadTxtFile(self):
        """
        function: read the txt file; return a list of sentences (one per line)
                  and a 2-D list holding the words of each sentence
        path: path to the txt file
        """
        dataByLine = []
        dataByWord = []
        with open(self.path, 'r', encoding='utf-8', errors='ignore') as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # skip blank lines so freq() never divides by zero
                dataByLine.append(line)
                # split() without arguments collapses repeated spaces, so no empty
                # tokens are produced and the last word of the line is kept
                dataByWord.append(line.split())
        return dataByLine, dataByWord

    def freq(self):
        """
        function: return, for every word of every text, its relative frequency in that text
        """
        _, dataByWord = self.ReadTxtFile()
        freqMat = []
        for line in dataByWord:
            temp = Counter(line)
            # term frequency = occurrences of the word / number of words in the text
            freqMat.append([temp[word]/len(line) for word in line])
        return freqMat

    def _wordCount(self, Word):
        """
        function: return the number of texts that contain Word
        """
        _, dataByWord = self.ReadTxtFile()
        return sum(1 for line in dataByWord if Word in line)

    def IDF(self):
        """
        function: compute the inverse document frequency of every word in every text
                  idf = log(total number of documents / number of documents containing the word)
        """
        _, dataByWord = self.ReadTxtFile()
        # Count once how many texts each word appears in, instead of
        # re-reading the file for every single word
        docFreq = Counter(word for line in dataByWord for word in set(line))
        idfMat = []
        for line in dataByWord:
            idfMat.append([math.log(len(dataByWord)/docFreq[word]) for word in line])
        return idfMat

    def getKeyWord(self):
        """
        function: use TF-IDF to pick the topK keywords of every text;
                  returns a 2-D list of shape (number of texts) x topK
        """
        _, dataByWord = self.ReadTxtFile()
        freqMat = self.freq()
        idfMat  = self.IDF()
        keyWord = []
        for i in range(len(freqMat)):
            # TF-IDF score of every word position in text i
            scores = [tf*idf for tf, idf in zip(freqMat[i], idfMat[i])]
            # word positions sorted by descending score
            index = np.argsort(scores)[::-1]
            _keyWord = []
            for item in index:
                word = dataByWord[i][item]
                if word not in _keyWord:  # report a repeated word only once
                    _keyWord.append(word)
            keyWord.append(_keyWord[:self.topK])
        return keyWord

if __name__ == '__main__':
    fileName = r'data/text.txt'
    model = tfIdf(fileName, topK=6)
    keyWords = model.getKeyWord()  # one list of up to topK keywords per line of the file
    for line in keyWords:
        print(line)

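TF-IDF lets us find the keywords of a text. Building on that, we can use the cosine distance to evaluate how similar two documents are.
3. Measuring the similarity of two documents with the cosine distance
The steps are simple. Use the TF-IDF class we just wrote to get the keywords of the two texts as two keyword vectors, keyWordA and keyWordB; merge them and remove duplicates to obtain a bag of words, bag. Then, for every word in bag, record how many times it occurs in keyWordA and in keyWordB (0 if it is absent), which gives two count vectors A and B. The cosine distance between the two texts then follows directly from the vector formula. Strictly speaking this quantity is the cosine similarity: for count vectors it is 1 when the two vectors point in the same direction and 0 when the texts share no words.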
For example,
keyWordA=["I", "love", "you", "deeply"]
keyWordB=["I", "don't", "love", "you"]
Then bag=["I", "love", "you", "deeply", "don't"]
A=[1, 1, 1, 1, 0]
B=[1, 1, 1, 0, 1]
The cosine distance is A·B/(|A||B|) = 3/(2*2) = 3/4
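The worked example can be checked with a few lines of plain Python. This is only a sketch using the standard library; the function name `cosine_sim` is made up for illustration and is not the class defined in this article.

```python
import math
from collections import Counter

def cosine_sim(words_a, words_b):
    """Cosine of the angle between the word-count vectors of two word lists."""
    bag = sorted(set(words_a) | set(words_b))      # merged, de-duplicated vocabulary
    count_a, count_b = Counter(words_a), Counter(words_b)
    a = [count_a[w] for w in bag]                  # Counter gives 0 for absent words
    b = [count_b[w] for w in bag]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0             # empty input: define similarity as 0

keyWordA = ["I", "love", "you", "deeply"]
keyWordB = ["I", "don't", "love", "you"]
print(cosine_sim(keyWordA, keyWordB))  # 3 / (2 * 2) = 0.75
```

Both vectors have four entries equal to 1, so each has length 2, and they share three words, which gives 3/4.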

from myTFIDF import tfIdf   # import the TF-IDF class we just wrote
from collections import Counter
import numpy as np
import warnings
warnings.filterwarnings("ignore") # suppress warnings

class cosDistance():
    def __init__(self, path):
        self.path = path

    def dis(self, keyWordA, keyWordB):
        """
        function: compute the cosine distance between two texts
                  for two vectors A and B it equals (A . B)/(|A|*|B|)
        input: the keyword lists of the two texts
        """
        bag = set(keyWordA+keyWordB)
        counterA = Counter(keyWordA)
        counterB = Counter(keyWordB)
        # A Counter returns 0 for a missing word, so no explicit branch is needed
        A = np.array([counterA[word] for word in bag])
        B = np.array([counterB[word] for word in bag])
        norm = np.linalg.norm(A) * np.linalg.norm(B)
        if norm == 0:
            # At least one keyword list is empty; return 0 instead of adding a
            # constant to the denominator, which would bias every result
            return 0.0
        cosDis = np.dot(A, B)/norm
        return cosDis

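With these basic operations we can compute how similar two texts are. You do not have to implement them yourself, though: scikit-learn, for example, ships a well-packaged TF-IDF implementation.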
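As a rough sketch of what that looks like, the snippet below uses scikit-learn's `TfidfVectorizer` and `cosine_similarity` on three sentences taken from text.txt. It assumes scikit-learn is installed. Its default weighting uses a smoothed IDF and L2-normalized rows, so the numbers will not match the hand-written class above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three sentences from text.txt; the first two mention an Audi.
docs = [
    "He had a perfectly good Audi when he moved here last year.",
    "Ellwood drove an Audi -- fast but not flashy.",
    "It was a CADILLAC, black grayed with the dust of the road.",
]

vectorizer = TfidfVectorizer()           # lowercases and tokenizes by default
tfidf = vectorizer.fit_transform(docs)   # sparse matrix, one row per document
sim = cosine_similarity(tfidf)           # dense 3 x 3 matrix of pairwise similarities

print(sim.round(2))
```

The entry `sim[0, 1]` is positive because both sentences contain "Audi", while `sim[0, 2]` is 0 because the first and third sentences share no token.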
