Learning Python Together 3, Discovering Groups: Word Segmentation and Generating the Big Table

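In the previous posts (Learning Python Together: Data Crawling, and Learning Python Together: Crawling Articles and Filtering the Format) we fetched the full text of every article. The goal now is to compute how related the articles are automatically. We take a common approach: judge relatedness by how often words occur in each article.

Many high-frequency words carry no real meaning, though, such as "的", commas, periods, "你" (you), "我" (I), and "他" (he), and they need to be filtered out. In this post we set 10% as the lower bound and 50% as the upper bound on the fraction of articles a word counts toward; a word is kept only if it falls between the two. If too many common or too many rare words get through, adjust the bounds. The output format will be: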

Keywords

keyword xx xxx dd ...

blogname 0 1 2 ...

blogname2 1 2 3 ...

...

The end result is a matrix like this. Why it has to be a matrix is left for the next article.
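To make the frequency bounds and the table layout concrete, here is a minimal sketch in Python. The function name `build_table` and the sample counts are made up for illustration, and it assumes the per-article word counts already exist as dictionaries. To keep the sample small it counts an article toward a word on any occurrence, whereas the full script later in this post only counts an article when the word occurs in it more than once.

```python
def build_table(wordcounts, lower=0.1, upper=0.5):
    """Return (wordlist, rows) keeping words whose article fraction lies strictly between the bounds."""
    # Number of articles each word appears in
    article_count = {}
    for wc in wordcounts.values():
        for word in wc:
            article_count[word] = article_count.get(word, 0) + 1

    total = len(wordcounts)
    # Sorted only so that the column order is reproducible
    wordlist = sorted(w for w, n in article_count.items() if lower < n / total < upper)

    # One row per article, one column per kept word, 0 when the word is absent
    rows = {title: [wc.get(w, 0) for w in wordlist] for title, wc in wordcounts.items()}
    return wordlist, rows


# Hypothetical sample data: four articles
sample = {
    'blog1': {'的': 9, '矩阵': 3},
    'blog2': {'的': 7, '分词': 2},
    'blog3': {'的': 5},
    'blog4': {'的': 4},
}
wordlist, rows = build_table(sample)
print('Blog\t' + '\t'.join(wordlist))
for title, counts in rows.items():
    print(title + '\t' + '\t'.join(str(c) for c in counts))
```

In this sample "的" appears in all four articles (100%), so it falls outside the upper bound and is dropped, while the two words that each appear in one article out of four (25%) become the columns.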

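The first attempt ran into some trouble: the output contained a lot of junk such as digits, brackets, and special characters. So the text is filtered once more: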

 r1 = u'[a-zA-Z0-9’!"#$%&\'()*+,-./:;<=>?@，。?★、…【】《》？“”‘’！[\\]^_`{|}（）；~]+'
  txt = re.sub(r1, '', txt)
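As a quick check of what this pattern removes, the sketch below applies it to a made-up sample string (the sample text is an assumption, not taken from the crawled data). Note that the character class also covers ASCII letters and digits, so English words and numbers are dropped along with the punctuation.

```python
import re

# Same character class as above: ASCII letters, digits, and ASCII / full-width punctuation
r1 = u'[a-zA-Z0-9’!"#$%&\'()*+,-./:;<=>?@，。?★、…【】《》？“”‘’！[\\]^_`{|}（）；~]+'

sample = '《Python3》学习笔记（第2篇），分词很简单！'
cleaned = re.sub(r1, '', sample)
print(cleaned)

# Characters that are not listed in the class pass through unchanged,
# for example the full-width colon
colon_kept = re.sub(r1, '', '：')
```

Only characters listed in the class are removed. Whitespace and unlisted symbols such as the full-width colon survive, so extend the class if they show up in the output.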

The text also has to be segmented into words. Here we use jieba, which is simple to use: one line of code does it.

 words=jieba.cut(txt)
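`jieba.cut` returns a generator of tokens rather than a list (`jieba.lcut` returns a list directly). The sketch below shows how such tokens can be turned into per-article counts. To keep it runnable without jieba installed, the tokenizer is passed in as a parameter; `count_words` and the stand-in tokenizer `fake_cut` are hypothetical names used only for illustration, and in this post's setting one would pass `jieba.cut` instead.

```python
from collections import Counter


def count_words(txt, cut):
    """Count the tokens produced by `cut`, skipping empty and whitespace-only tokens."""
    return Counter(word for word in cut(txt) if word.strip())


def fake_cut(txt):
    # Stand-in tokenizer (an assumption for this sketch): splits on single spaces
    # and yields tokens lazily, as a generator
    for token in txt.split(' '):
        yield token


# The double space produces an empty token, which count_words skips
counts = count_words('分词 矩阵 分词  相关性', fake_cut)
print(counts)
```

With jieba installed, the call would be `count_words(txt, jieba.cut)`. Skipping whitespace-only tokens matters because a tokenizer may return spaces and line breaks as tokens of their own.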

With that in place, here is the full code:

# python 3
# encoding: utf-8
import re
import sys
import jieba

def getwordcounts(urlline):
  # Each line of urllist.txt is tab-separated: index 2 holds the URL, index 3 the article title
  fields = urlline.split("\t")
  url = fields[2]
  # Article title
  title = fields[3]
  print(url)
  # Read the article text saved by the earlier crawling step
  with open('E:\\blog\\70171506\\' + url.split("/t/")[1] + '.txt', 'r', encoding='utf-8') as f:
    txt = f.read()
  wc = {}
  # Extract a list of words
  words = getwords(txt)
  for word in words:
    wc.setdefault(word, 0)
    wc[word] += 1
  return title, wc

def getwords(html):
  # Remove all the HTML tags
  txt = re.compile(r'<[^>]+>').sub('', html)
  # Strip ASCII letters, digits, and ASCII / full-width punctuation
  r1 = u'[a-zA-Z0-9’!"#$%&\'()*+,-./:;<=>?@，。?★、…【】《》？“”‘’！[\\]^_`{|}（）；~]+'
  txt = re.sub(r1, '', txt)
  # Segment the remaining text into words with jieba (returns a generator)
  words = jieba.cut(txt)
  # Drop empty and whitespace-only tokens so that spaces and line breaks do not become keywords
  return [word.lower() for word in words if word.strip() != '']


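# apcount: for each word, the number of articles in which it occurs more than once
apcount = {}
# wordcounts: article title -> {word: count}
wordcounts = {}
with open('urllist.txt', 'r', encoding='utf-8') as f:
  feedlist = [line for line in f]
for feedurl in feedlist:
  try:
    title, wc = getwordcounts(feedurl)
    wordcounts[title] = wc
    for word, count in wc.items():
      apcount.setdefault(word, 0)
      if count > 1:
        apcount[word] += 1
  except Exception as e:
    print('Failed to parse feed %s' % feedurl)
    # Python 3 exceptions have no .message attribute, so print the exception itself
    print(e)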

# Keep only words whose article fraction lies strictly between the 10% and 50% bounds
wordlist = []
for w, bc in apcount.items():
  frac = float(bc) / len(feedlist)
  if 0.1 < frac < 0.5:
    wordlist.append(w)

# reload(sys) and sys.setdefaultencoding("utf-8") are not needed in Python 3
with open('blogbigtable.txt', 'w', encoding='utf-8') as out:
  # Header row: 'Blog' followed by one column per keyword
  out.write('Blog')
  for word in wordlist:
    out.write('\t%s' % word)
  out.write('\n')
  # One row per article: the title, then the count of each keyword
  for blog, wc in wordcounts.items():
    print(blog)
    # Strip the trailing newline from the title
    out.write(blog.replace("\n", ""))
    for word in wordlist:
      out.write('\t%d' % wc.get(word, 0))
    out.write('\n')
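Since the next step treats this table as a matrix, it can help to confirm that the file parses back cleanly. Below is a minimal sketch of a reader; `read_table` is a hypothetical helper, and the in-memory sample stands in for `blogbigtable.txt` so that the snippet runs on its own.

```python
import io


def read_table(f):
    """Parse the tab-separated table into (column words, row titles, count matrix)."""
    lines = [line.rstrip('\n') for line in f if line.strip()]
    # The first cell of the header row is the literal 'Blog' label, so skip it
    colnames = lines[0].split('\t')[1:]
    rownames, data = [], []
    for line in lines[1:]:
        cells = line.split('\t')
        rownames.append(cells[0])
        data.append([int(x) for x in cells[1:]])
    return colnames, rownames, data


# Hypothetical sample with the same layout as the generated file
sample = io.StringIO('Blog\t分词\t矩阵\nblog1\t0\t3\nblog2\t2\t0\n')
colnames, rownames, data = read_table(sample)
print(colnames, rownames, data)
```

For the real file, open it with `open('blogbigtable.txt', encoding='utf-8')` and pass the file object to `read_table`. Every row should then have exactly as many counts as there are column words.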

Run result

The next article covers the Pearson correlation score.

