Word-Frequency Analysis of English Papers with Python

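When reading an English paper, the high-frequency words are often technical terms that you may not know. It can help to analyze the paper's word frequencies first and learn the meanings of the most frequent words before reading, so that the paper is easier to follow. The rest of this post shows how to use Python to output how often each word appears in an English paper.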

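The first step is to convert the paper from PDF to txt. Converting directly usually produces garbled text, so it is better to convert the PDF to Word format first and then copy the content into a plain-text txt file.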
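Before counting words, it can be worth checking that the converted file really is free of garbled characters. The following is only a rough sketch under stated assumptions: the helper names `garbled_ratio` and `check_file` are made up for illustration, the file is assumed to be saved as UTF-8, and "garbled" is approximated as the Unicode replacement character or a non-printable control character.

```python
def garbled_ratio(text):
    """Return the share of characters that look like conversion debris."""
    if not text:
        return 0.0
    suspicious = sum(
        1 for ch in text
        # U+FFFD marks bytes that could not be decoded; control characters
        # other than newline, carriage return and tab are also unexpected
        if ch == '\ufffd' or (not ch.isprintable() and ch not in '\n\r\t')
    )
    return suspicious / len(text)


def check_file(path, encoding='utf-8'):
    """Read a text file and report its share of suspicious characters."""
    # errors='replace' turns undecodable bytes into U+FFFD instead of raising
    with open(path, 'r', encoding=encoding, errors='replace') as fr:
        return garbled_ratio(fr.read())
```

A ratio close to zero suggests the conversion went well. A clearly higher value suggests redoing the conversion or trying a different `encoding`.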

The code below includes detailed comments.

# Word-frequency analysis of a paper
# Convert the paper to plain-text format before running this script

__author__ = 'Chen Hong'

# Read the text file and collect all of its words in a list
def readtxt(filename):
    wordsL = []  # list that collects every word in the file
    # the with-statement closes the file even if reading fails
    with open(filename, 'r') as fr:
        for line in fr:
            # split() without arguments splits on any whitespace and drops empty strings
            wordsL.extend(line.split())
    return wordsL

# Count how often each word occurs and store the counts in a dictionary,
# then sort the entries by count from largest to smallest
def count(wordsL):
    wordsD = {}
    for x in wordsL:
        # skip tokens that we do not want to count
        if Judge(x):
            continue
        # the first occurrence starts at 1; later occurrences add 1
        if x not in wordsD:
            wordsD[x] = 1
        else:
            wordsD[x] += 1
    # sort the (word, count) pairs by count, largest first
    wordsInorder = sorted(wordsD.items(), key=lambda x: x[1], reverse=True)
    return wordsInorder
    
# Decide whether a token should be removed, e.g. a punctuation mark or a single letter
# You can extend this function to remove more tokens, such as numbers
def Judge(word):
    punctList = [' ','\t','\n',',','.',':','?']  # punctuation and whitespace tokens to remove
    letterList = ['a','b','c','d','m','n','x','p','t']  # single letters to remove
    if word in punctList:
        return True
    elif word in letterList:
        return True
    else:
        return False


# Read the input file, count the words, and write the result to an output file
filename = 'F:\\python\\Paper1.txt'
wordsL = readtxt(filename)
words = count(wordsL)
# one line per word: the word, a space, then its count
with open('F:\\python\\Words In Order_1.txt', 'w') as fw:
    for item in words:
        fw.write(item[0] + ' ' + str(item[1]) + '\n')
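Because the script above splits only on whitespace, a token such as "model," is counted separately from "model", and "The" separately from "the". One possible variation is sketched below. It is not the original author's code, and the names `WORD_RE`, `word_frequencies` and `min_length` are chosen here for illustration. It uses `re` and `collections.Counter` from the standard library to lowercase the text, keep only alphabetic words (optionally hyphenated), and drop single letters and numbers.

```python
import re
from collections import Counter

# one or more letters, optionally joined by hyphens (e.g. "state-of-the-art")
WORD_RE = re.compile(r"[a-z]+(?:-[a-z]+)*")


def word_frequencies(text, min_length=2):
    """Return (word, count) pairs sorted by count, largest first."""
    # lowercase first so that "The" and "the" are counted together
    words = WORD_RE.findall(text.lower())
    # words shorter than min_length (single letters by default) are dropped
    counts = Counter(w for w in words if len(w) >= min_length)
    return counts.most_common()


if __name__ == '__main__':
    sample = "The model, the MODEL. A state-of-the-art model in 2018!"
    for word, freq in word_frequencies(sample):
        print(word, freq)
```

The result has the same shape as the return value of `count`, a list of `(word, count)` pairs, so it can be written to a file with the same loop as above.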



Reposted from blog.csdn.net/watermelon_learn/article/details/82383025