Note the library imports; the full list appears in the complete code below.
1. Scraping the short comments: the comments span more than one page, so pagination is needed; here I use a while loop and str(k) to build each page's URL (a small sketch follows the getHtml function below).
- First, open the page and inspect its source code.
- Fetch the HTML of the part to be scraped with requests:
def getHtml(url):
    try:
        r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        r.raise_for_status()
        r.encoding = "utf-8"
        return r.text
    except requests.RequestException:
        print("Failed!!!")  # returns None on failure
- Parse the HTML with BeautifulSoup and write the results to xxx.txt via open(). Pay attention to the write mode (wb+/ab+) and to Chinese encoding issues; see the Runoob Python tutorial for details, and the short sketch after the getData function below:
f = open("E:/movieComment.txt",'wb+')
def getData(html):
soup = BeautifulSoup(html,"html.parser")
comment_list = soup.find('div',attrs={'class':'mod-bd'})
for comment in comment_list.find_all('div',attrs={'class':'comment-item'}):
comment_content = comment.find('span',attrs={'class':'short'}).get_text()
f.write(comment_content.encode('UTF-8'))
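On the wb+/ab+ point above: a binary-mode file takes bytes, so the comment string must be encoded to UTF-8 first; opening the file in text mode with an explicit encoding does the same job. A minimal sketch contrasting the two (the sample string is made up):

# 1) Binary append: encode the string yourself
with open("E:/movieComment.txt", 'ab+') as fb:
    fb.write("这是一条示例短评\n".encode('utf-8'))

# 2) Text-mode append: open() handles the encoding
with open("E:/movieComment.txt", 'a', encoding='utf-8') as ft:
    ft.write("这是一条示例短评\n")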
2. The scraped comments are saved in movieComment.txt; next, remove the stopwords.
- Removing stopwords requires a stopword list; the HIT (Harbin Institute of Technology) stopword list is well known and easy to find with a web search.
- Import the jieba library, segment the text with jieba, and filter out the stopwords:
def seg_sentence():
    # Build the stopword list
    filepath = 'E:/stopwords.txt'
    stopwords = [line.strip() for line in open(filepath, 'r', encoding='utf-8').readlines()]  # adjust the encoding to match your stopword file
    # Segment the scraped text
    final = ''
    fn1 = open("E:/movieComment.txt", 'r', encoding='utf-8').read()  # load the scraped comments
    sentence_seged = jieba.cut(fn1, cut_all=False)  # jieba segmentation: precise mode
    fn2 = open("E:/new.txt", "w", encoding='utf-8')
    for word in sentence_seged:
        if word not in stopwords:
            if word != '\t':
                final += word
                final += " "
    fn2.write(final)  # write out the text with stopwords removed
    fn2.close()
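A quick look at what jieba's precise mode produces and how the stopword filter works, using a made-up sentence and a toy stopword set:

import jieba

sample = "这部电影的剧情非常精彩"
stop = {"的", "非常"}  # toy stopword set
words = [w for w in jieba.cut(sample, cut_all=False) if w not in stop]
print(" ".join(words))  # roughly: 这部 电影 剧情 精彩

Loading the real stopword file into a set instead of a list also makes the word-not-in-stopwords test constant-time, which helps when filtering thousands of tokens.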
- This generates a new file, new.txt, whose content looks as follows:
3. Finally, generate and display the word cloud image.
- Installing wordcloud is the fiddly part: on Windows a plain pip install xxxxxx may not work. Download the wordcloud wheel matching your Windows/Python environment from https://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud, then install it from the cmd prompt with pip install, taking care to use the correct path.
def wordcloud():
    # Load the mask image
    image = Image.open("E:/wc.jpg", 'r')
    img = np.array(image)
    # Build the word cloud
    cut_text = open('E:/new.txt', 'r', encoding='utf-8').read()  # load the text with stopwords removed
    wordcloud = WordCloud(
        mask=img,                 # when a mask is given, height/width are ignored
        height=2000,              # image height
        width=4000,               # image width
        background_color='white',
        max_words=1000,           # maximum number of words
        max_font_size=400,
        font_path=r"C:\Windows\Fonts\msyh.ttc",  # switch to another font here if characters show up as boxes
    ).generate(cut_text)
    # Display the image
    plot.imshow(wordcloud, interpolation='bilinear')
    plot.axis('off')              # hide the axes
    # plot.savefig('E:/wc1.jpg')  # to save the figure, call this before plot.show()
    plot.show()                   # show the figure
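A note on saving: plot.savefig() is usually called before plot.show(), otherwise the saved figure may come out blank. Alternatively, the WordCloud object can write the image itself; a two-line sketch (the output path is just an example):

wc = WordCloud(background_color='white').generate("sample text for the demo")
wc.to_file("E:/wc1.png")  # writes the rendered cloud directly, no matplotlib needed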
4. Complete code:
import requests
from bs4 import BeautifulSoup
import time
import jieba
from wordcloud import WordCloud
from PIL import Image
import numpy as np
import matplotlib.pyplot as plot


def getHtml(url):
    try:
        r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        r.raise_for_status()
        r.encoding = "utf-8"
        return r.text
    except requests.RequestException:
        print("Failed!!!")  # returns None on failure


f = open("E:/movieComment.txt", 'wb+')  # binary write mode, so text must be encoded before writing


def getData(html):
    soup = BeautifulSoup(html, "html.parser")
    comment_list = soup.find('div', attrs={'class': 'mod-bd'})
    for comment in comment_list.find_all('div', attrs={'class': 'comment-item'}):
        comment_content = comment.find('span', attrs={'class': 'short'}).get_text()
        f.write(comment_content.encode('UTF-8'))


def seg_sentence():
    # Build the stopword list
    filepath = 'E:/stopwords.txt'
    stopwords = [line.strip() for line in open(filepath, 'r', encoding='utf-8').readlines()]  # adjust the encoding to match your stopword file
    # Segment the scraped text
    final = ''
    fn1 = open("E:/movieComment.txt", 'r', encoding='utf-8').read()  # load the scraped comments
    sentence_seged = jieba.cut(fn1, cut_all=False)  # jieba segmentation: precise mode
    fn2 = open("E:/new.txt", "w", encoding='utf-8')
    for word in sentence_seged:
        if word not in stopwords:
            if word != '\t':
                final += word
                final += " "
    fn2.write(final)  # write out the text with stopwords removed
    fn2.close()


def wordcloud():
    # Load the mask image
    image = Image.open("E:/wc.jpg", 'r')
    img = np.array(image)
    # Build the word cloud
    cut_text = open('E:/new.txt', 'r', encoding='utf-8').read()  # load the text with stopwords removed
    wordcloud = WordCloud(
        mask=img,                 # when a mask is given, height/width are ignored
        height=2000,              # image height
        width=4000,               # image width
        background_color='white',
        max_words=1000,           # maximum number of words
        max_font_size=400,
        font_path=r"C:\Windows\Fonts\msyh.ttc",  # switch to another font here if characters show up as boxes
    ).generate(cut_text)
    # Display the image
    plot.imshow(wordcloud, interpolation='bilinear')
    plot.axis('off')              # hide the axes
    # plot.savefig('E:/wc1.jpg')  # to save the figure, call this before plot.show()
    plot.show()                   # show the figure


def main():
    # Pagination: max(start) = 200
    k = 0  # start = k
    i = 0
    while k < 200:
        url = 'https://movie.douban.com/subject/26752088/comments?start=' + str(k) + '&limit=20&sort=new_score&status=P'
        k += 20
        i += 1
        print("Crawling page " + str(i))
        # time.sleep(2)  # optional delay between requests
        html = getHtml(url)
        getData(html)
    seg_sentence()
    wordcloud()


if __name__ == "__main__":
    main()
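If Douban starts rejecting requests, re-enable the time.sleep(2) line in main(), or use a randomized delay between pages, e.g.:

import random, time
time.sleep(random.uniform(1, 3))  # polite, randomized pause between page requests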
5. Results