After many straight days of being busy, I finally wrapped everything up today, so I can sit down and properly study some tech of my own!
This time my goal is to scrape the live danmaku (bullet comments) for《啥是佩奇》("What Is Peppa?") on Bilibili. Let's get started!
Find the URL, inspect the page, and locate the live-danmaku API endpoint
Honestly, I couldn't find the live-danmaku API on my own; I only tracked it down by borrowing someone else's findings. The search goes through the browser's developer tools, as the two screenshots here showed (screenshots omitted): the captured requests reveal an endpoint of the form https://api.bilibili.com/x/v1/dm/list.so?oid=..., where oid identifies the video's danmaku pool. Entering that URL in a browser dumps the raw danmaku XML (screenshot omitted).
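You can reproduce that browser view in a couple of lines; a minimal sketch, assuming requests is installed (the endpoint and oid are the ones used throughout this post, and the sample danmaku in the comment is purely illustrative):

import requests

# The response body is XML; every danmaku is one <d> element, e.g.
#   <d p="12.3,1,25,16777215,...">啥是佩奇?</d>   (attribute values illustrative)
resp = requests.get("https://api.bilibili.com/x/v1/dm/list.so",
                    params={"oid": "72036218"})
print(resp.content[:300])  # peek at the start of the raw XML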
With this groundwork done, we can get to the real code.
Find the necessary information, then fetch the data
Using the request details found above, we first fetch the raw XML (the danmaku text itself will be pulled out with lxml later). The fetching code is as follows:
import requests

def getHtml(url):
    # Fetch the danmaku XML from the API and return the raw bytes.
    try:
        # oid identifies the video's danmaku pool.
        param = {"oid": "72036218"}
        headers = {
            "Host": "api.bilibili.com",
            "Connection": "keep-alive",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6788.400 QQBrowser/10.3.2864.400",
            "Origin": "https://www.bilibili.com",
            "If-Modified-Since": "Wed, 13 Feb 2019 19:41:25 GMT"
        }
        response = requests.get(url, params=param, headers=headers)
        response.raise_for_status()
        # Return the raw bytes (content), not the decoded text -- see the note below.
        return response.content
    except requests.RequestException:
        return None  # fetch failed

if __name__ == '__main__':
    url = "https://api.bilibili.com/x/v1/dm/list.so"
    page = getHtml(url)
    print(page)
That gets us the data. Note that the function returns content (raw bytes), not text!!!
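Why does that matter? The API's XML starts with an encoding declaration, and lxml rejects Unicode strings that carry one, while bytes parse fine. A minimal sketch (the sample string just mimics the shape of the response):

from lxml import etree

sample = '<?xml version="1.0" encoding="UTF-8"?><i><d p="0,1,25">前方高能</d></i>'
# etree.HTML(sample)  # ValueError: Unicode strings with encoding declaration
#                     # are not supported
root = etree.HTML(sample.encode("utf-8"))  # bytes input is accepted
print(root.xpath("//d/text()"))            # ['前方高能']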
With the data in hand, we can use jieba and wordcloud to generate the word cloud.
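Before the full pipeline, here is a quick look at what jieba segmentation produces (a sketch; the exact split depends on jieba's bundled dictionary):

import jieba

# lcut() returns the segments as a list, e.g. ['我', '觉得', '佩奇', '好', '可爱'];
# the actual output can vary with the dictionary version.
print(jieba.lcut("我觉得佩奇好可爱"))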
Segment with jieba and build the word cloud
# f is the space-joined segmentation result (see the full code below).
wordcloud = WordCloud(background_color="white",
                      width=2000,
                      height=1500,
                      font_path=r"C:\Windows\Fonts\STZHONGS.TTF",  # raw string avoids backslash escapes
                      random_state=30,
                      margin=2).generate(f)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
wordcloud.to_file('test.png')
The font_path line is extremely important!! For strings containing Chinese, the cloud only renders correctly when a CJK font is supplied there; WordCloud's default font cannot draw Chinese characters, so without this line you get no usable word cloud.
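One caveat: STZHONGS.TTF (STZhongsong) is not present on every Windows install, and WordCloud raises an OSError if the font file is missing. A small guard with an assumed fallback font (SimHei here, path also assumed) helps:

import os

font = r"C:\Windows\Fonts\STZHONGS.TTF"
if not os.path.exists(font):
    # Fall back to SimHei, another common CJK font on Windows.
    font = r"C:\Windows\Fonts\simhei.ttf"
# then pass font_path=font to WordCloud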
The resulting word cloud looks like this (screenshot omitted).
Next, the complete code:
import requests
import jieba
from lxml import etree
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def getHtml(url):
    # Fetch the danmaku XML from the API and return the raw bytes.
    try:
        # oid identifies the video's danmaku pool.
        param = {"oid": "72036218"}
        headers = {
            "Host": "api.bilibili.com",
            "Connection": "keep-alive",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6788.400 QQBrowser/10.3.2864.400",
            "Origin": "https://www.bilibili.com",
            "If-Modified-Since": "Wed, 13 Feb 2019 19:41:25 GMT"
        }
        response = requests.get(url, params=param, headers=headers)
        response.raise_for_status()
        return response.content  # bytes, so lxml accepts the XML encoding declaration
    except requests.RequestException:
        return None  # fetch failed

def parse_html(p):
    # Pull the text of every <d> element (one per danmaku) and segment it with jieba.
    danci = []
    html = etree.HTML(p)
    content = html.xpath("//d/text()")
    for i in content:
        danci.extend(jieba.lcut(i))
    return danci

if __name__ == '__main__':
    url = "https://api.bilibili.com/x/v1/dm/list.so"
    page = getHtml(url)
    danci = parse_html(page)
    # WordCloud.generate() expects a single space-separated string.
    f = " ".join(danci)
    wordcloud = WordCloud(background_color="white",
                          width=2000,
                          height=1500,
                          font_path=r"C:\Windows\Fonts\STZHONGS.TTF",  # CJK font, required for Chinese
                          random_state=30,
                          margin=2).generate(f)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
    wordcloud.to_file('test.png')
And that's it: Chinese word segmentation and word-cloud rendering for Bilibili live danmaku, done!
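If you want a cleaner cloud, one optional tweak (my addition, not part of the original steps) is to drop single-character tokens, which are mostly particles, before joining:

# Keep only multi-character words before building the string for generate().
danci = [w for w in danci if len(w) > 1]
f = " ".join(danci)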
If you're interested, come chat and swap notes!