Scraping the real-time danmaku of 《啥是佩奇》 (What Is Peppa Pig) on Bilibili and building a word cloud with jieba segmentation

I've been swamped for days on end; as of today I'm finally done and free to sit down and study some technology of my own again!

This time the goal is to scrape the real-time danmaku of 《啥是佩奇》 on Bilibili. Let's get started!

Find the page, open the inspector, and locate the real-time danmaku API endpoint

Honestly, with my own skills I couldn't find the real-time danmaku API at all; I only tracked it down by borrowing other people's results. The specific way to find it was shown in screenshots:

(Screenshot: locating the danmaku request in the browser's inspector.)

(Screenshot: a second view of the same request.)

(Screenshot: what the API URL returns when opened directly in the browser.)
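For reference, opening that URL dumps the full danmaku list as XML. A trimmed, hypothetical sample of its shape (attribute values abbreviated, the two comment lines made up; the real file also carries a few extra metadata tags):

<?xml version="1.0" encoding="UTF-8"?>
<i>
    <chatid>72036218</chatid>
    <d p="12.3,1,25,16777215,...">啥是佩奇?</d>
    <d p="45.6,1,25,16777215,...">前方高能</d>
</i>

Every comment we want sits inside a <d> element, which is exactly what the parsing code below relies on.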

Once the preliminary work is done, we can start writing code.

Find the necessary information, then fetch the data

Based on the request details found above, we download the danmaku XML with requests; the text itself will be parsed out with lxml later. The code is as follows:

import requests

def getHtml(url):
    # Fetch the danmaku XML and return the raw bytes, or None on failure.
    try:
        headers = {
            "Host": "api.bilibili.com",
            "Connection": "keep-alive",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6788.400 QQBrowser/10.3.2864.400",
            "Origin": "https://www.bilibili.com"
            # Do not copy the browser's If-Modified-Since header here: the
            # server may then answer 304 Not Modified with an empty body.
        }
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        # Return bytes, not text: lxml will pick up the encoding from the
        # XML declaration itself.
        return response.content
    except requests.RequestException:
        return None


if __name__ == '__main__':
    # oid identifies the video; 72036218 is 《啥是佩奇》
    url = "https://api.bilibili.com/x/v1/dm/list.so?oid=72036218"
    page = getHtml(url)
    print(page)

That gets us the data. Note that the function returns content, raw bytes, not text!
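Why bytes? The XML carries its own encoding declaration, and lxml reads it straight off the raw bytes, so we never have to guess at a decoding ourselves. A minimal self-contained check (the one-danmaku document is made up):

from lxml import etree

# hypothetical single-danmaku XML, encoded the way the API serves it
sample = '<?xml version="1.0" encoding="UTF-8"?>' \
         '<i><d p="0,1,25">啥是佩奇?</d></i>'
print(etree.HTML(sample.encode("utf-8")).xpath("//d/text()"))  # ['啥是佩奇?']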

With the data in hand, we can use jieba and wordcloud to generate the word cloud.

Segment with jieba and build the word cloud
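Before anything can be drawn, the danmaku text has to come out of the XML and through jieba. This is the parse_html function from the complete listing further down: lxml pulls every <d> text node, and jieba.lcut cuts each line into a list of words:

def parse_html(p):
    # Pull every danmaku line out of the XML and cut it into words.
    danci = []
    html = etree.HTML(p)
    content = html.xpath("//d/text()")  # one entry per danmaku
    for i in content:
        danci.extend(jieba.lcut(i))     # append the words of this line
    return danci

The word list is then joined into a single space-separated string, f = " ".join(danci), which is the form WordCloud.generate expects.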

# f is the space-separated string produced by the segmentation step above
wordcloud = WordCloud(background_color="white",
                      width=2000,
                      height=1500,
                      font_path=r"C:\Windows\Fonts\STZHONGS.TTF",
                      random_state=30,
                      margin=2).generate(f)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
wordcloud.to_file('test.png')

The font_path line here is crucial! WordCloud's bundled default font has no Chinese glyphs, so for text containing Chinese you must point font_path at a font that does, or the words in the cloud come out as empty boxes.
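If you are not on Windows, point font_path at any font file that covers CJK glyphs. A small sketch (both paths are assumptions; substitute a font that actually exists on your machine):

import os

font = r"C:\Windows\Fonts\STZHONGS.TTF"  # STZhongsong on Windows
if not os.path.exists(font):
    # hypothetical fallback for a Linux box with the Noto CJK fonts installed
    font = "/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc"

Then pass font_path=font to WordCloud and the rest of the code stays the same.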

(Image: the generated word cloud.)

Next, the complete code:

import requests
import jieba
from lxml import etree
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def getHtml(url):
    # Fetch the danmaku XML and return the raw bytes, or None on failure.
    try:
        headers = {
            "Host": "api.bilibili.com",
            "Connection": "keep-alive",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6788.400 QQBrowser/10.3.2864.400",
            "Origin": "https://www.bilibili.com"
        }
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        return response.content  # bytes: lxml reads the encoding declaration
    except requests.RequestException:
        return None

def parse_html(p):
    # Pull every danmaku line out of the XML and cut it into words.
    danci = []
    html = etree.HTML(p)
    content = html.xpath("//d/text()")  # one entry per danmaku
    for i in content:
        danci.extend(jieba.lcut(i))     # append the words of this line
    return danci

if __name__ == '__main__':
    # oid identifies the video; 72036218 is 《啥是佩奇》
    url = "https://api.bilibili.com/x/v1/dm/list.so?oid=72036218"
    page = getHtml(url)
    if page is None:
        raise SystemExit("Failed to fetch the danmaku XML")
    danci = parse_html(page)
    f = " ".join(danci)  # WordCloud wants one space-separated string
    wordcloud = WordCloud(background_color="white",
                          width=2000,
                          height=1500,
                          font_path=r"C:\Windows\Fonts\STZHONGS.TTF",
                          random_state=30,
                          margin=2).generate(f)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
    wordcloud.to_file('test.png')

And that completes the Chinese word segmentation and word-cloud rendering of Bilibili's real-time danmaku!

Anyone interested is welcome to get in touch and discuss!

Reposted from blog.csdn.net/yanzhiguo98/article/details/87164313