"Da An Du Shu" (大案牍术) tells you why "The Longest Day in Chang'an" is such a hit!

This summer, "The Longest Day in Chang'an" is a hit.
It stars Yi Yangqianxi (Jackson Yee) and Lei Jiayin and is adapted from Ma Boyong's novel of the same name.
The production spent heavily and took seven months to build a seventy-acre Chang'an city set.
This historical drama is rigorous and elegant in style, and has been called a conscientious production.
Its Douban score currently stands at 8.6.

One gripe: Youku's ads are too long. But here is a tip: Alipay Platinum members can redeem points for a one-month Youku membership!

I. Background of the requirement

The story is set in the third year of the Tianbao era of the Tang dynasty, on the day of the Lantern Festival in Chang'an. Beneath the city's prosperity and calm, a band of Turkic Wolf Guards has slipped in and is plotting to destroy it. Only a condemned prisoner can save Chang'an, and he has just twelve hours to do it, which sets off a thrilling story.
In the first episode a term came up that made Brother Pig curious: "Da An Du Shu" (大案牍术).
So, question in hand, I went to ask Baidu, and sure enough there was an entry.
However, the Baidu Baike entry said next to nothing. It was only with pointers from the danmaku (bullet comment) crowd that I learned Da An Du Shu is the equivalent of today's big data analysis. Afterwards I did not forget to update the Baidu Baike entry.
After that I was no longer curious about Da An Du Shu itself but about the danmaku crowd: why does everyone like this show, what do its viewers have in common, and what is it about the show that draws them in?

II. Functional description

Since we have recently been covering web crawlers and data analysis, let's use the modern Da An Du Shu to work out why this drama is such a hit and how viewers rate it. (The word cloud uses the 900 most frequent words across all the danmaku.)

III. Technical plan

  1. Analyze how Youku loads its danmaku, then crawl them with the requests library
  2. A lot of data is needed, so the crawl may be fairly large
  3. Data cleaning is key, for example: the drama title, character names, "danmaku-jun", and so on
  4. Turn the danmaku into a word cloud

IV. Technical implementation

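Brother Pig will explain every step in detail. I hope interested readers will read carefully and then try it themselves; that is the only way to really learn.

This tutorial is for learning and exchange only and must not be used for commercial profit; you bear the consequences yourself!
If it infringes any rights or adversely affects any company or individual, please let me know and I will remove it.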

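1. Analyze and obtain the URL of the danmaku API

Step 1: Open the Youku website and start playing the drama, then right-click on the page and choose Inspect (or press F12) to open the browser's debugging window.
Step 2: Copy any danmaku, then click in the debugging window and press Control+F to search for it!
Step 3: Click the Headers tab of the matching request to view the request URL, and take note of the Referer and User-Agent parameters in the request headers.
With just these three steps we have found the URL that loads the danmaku: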

https://service.danmu.youku.com/list?jsoncallback=jQuery111205151507831610791_1562918614483&mat=0&mcount=1&ct=1001&iid=1061156738&aid=322943&cid=97&lid=0&ouid=0&_=1562918614486
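
To see what the URL is made of, it can help to split it into the endpoint and its query parameters. The following is a minimal sketch using only the standard library; the helper name `split_danmu_url` is made up for illustration and is not from the original project.

```python
from urllib.parse import urlsplit, parse_qs

# The URL found in the browser, split over several lines for readability.
DANMU_URL = (
    "https://service.danmu.youku.com/list"
    "?jsoncallback=jQuery111205151507831610791_1562918614483"
    "&mat=0&mcount=1&ct=1001&iid=1061156738&aid=322943&cid=97"
    "&lid=0&ouid=0&_=1562918614486"
)


def split_danmu_url(url):
    """Return (endpoint, params), where params maps each query key to its value."""
    parts = urlsplit(url)
    endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
    # parse_qs returns a list per key; each key appears once here, so take the first.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return endpoint, params


endpoint, params = split_danmu_url(DANMU_URL)
print(endpoint)
for key, value in params.items():
    print(f"{key} = {value}")
```

Listing the parameters this way makes it easier to compare two requests later and spot which values change.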

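2. Crawl the danmaku data

Once we have the URL we can start coding. As usual, we first fetch, extract, and save a single piece of data, and only when all of that works do we look at batch crawling.

Here we again use the requests library. If you don't know what requests is, read this article first: Introduction to the requests library
Having learned our lesson last time, this time we add the request headers right away and get the danmaku data in one go.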
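
A single request with the headers attached might look roughly like the sketch below. The header values are placeholders to be replaced with the ones shown in your own browser, and the `http_get` argument is an addition made here so the function can be tried without network access; neither is taken from the original code.

```python
# Placeholder header values: copy the real Referer and User-Agent
# from the Headers tab of the browser's debugging window.
HEADERS = {
    "Referer": "https://v.youku.com/",
    "User-Agent": "Mozilla/5.0",
}


def fetch_danmu_text(url, http_get=None):
    """Fetch one danmaku response and return its raw text (still JSONP)."""
    if http_get is None:
        import requests  # third-party: pip install requests
        http_get = requests.get
    response = http_get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text
```

Calling `fetch_danmu_text(url)` with the URL found in section 1 should return the raw response text, provided the headers are accepted by the server.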

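3. Data extraction

Step 1: extract the JSON data
Looking at the returned data, we find that, as in the previous article, the cross-origin request uses JSONP, so we need to trim the response slightly: strip the leading jQuery111203412576115734338_1562833192066( and the trailing ), keeping only the JSON in the middle.
Here we made one small change from the previous article: Brother Pig uses r.text.index('(') to get the index of the callback's opening parenthesis and then slices from that index. The benefit is that this is general: it still works even if the length of the JSONP callback name changes.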
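
The slicing idea can be sketched as follows. The function name `jsonp_to_json` and the sample response are made up for illustration, and using `rindex(")")` to find the closing parenthesis is a choice made here rather than something the article specifies.

```python
import json


def jsonp_to_json(text):
    """Strip the JSONP callback wrapper and parse the JSON inside."""
    start = text.index("(") + 1  # the first "(" ends the callback name
    end = text.rindex(")")       # the last ")" closes the callback call
    return json.loads(text[start:end])


# Made-up response in the same shape as the real one.
sample = 'jQuery111203412576115734338_1562833192066({"result": [{"content": "hello"}]})'
print(jsonp_to_json(sample))
```

Because the slice starts at the parenthesis rather than at a fixed offset, a longer or shorter callback name does not break it.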

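Step 2: extract the danmaku data
With the JSON in hand, we work out where the danmaku data lives; we can check this in the Preview tab of the browser's debugging window.
We can see that the result field holds the danmaku data, and that it is a list of danmaku objects; the content field of each object is the actual danmaku text. So let's use json to extract them and print them out.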
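
Given the structure described above (a `result` list whose items each have a `content` field), the extraction could be sketched like this. The sample data is invented for illustration.

```python
def extract_contents(data):
    """Return the text of every danmaku in one parsed response."""
    return [item["content"] for item in data["result"]]


# Invented sample in the shape described above.
data = {"result": [{"content": "first comment"}, {"content": "second comment"}]}
for content in extract_contents(data):
    print(content)
```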

4. Saving the data

Once the data we want has been extracted, we can save it. We again save to a file, because that is easy to do and meets our needs.
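
Saving to a file can be as simple as the sketch below. The file name `danmu.txt` and the one-danmaku-per-line layout are assumptions made here, not details confirmed by the article.

```python
def save_contents(contents, path="danmu.txt"):
    """Append danmaku to a UTF-8 text file, one per line."""
    # Mode "a" appends, so repeated calls keep adding to the same file.
    with open(path, "a", encoding="utf-8") as f:
        for content in contents:
            f.write(content + "\n")
```

Opening the file in append mode means each later request can add its danmaku without overwriting what was saved before.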

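5. Batch crawling

Now that fetching, extracting, and saving work for a single request, let's look at how to save data in bulk. This case differs from other batch crawls in one respect: how do we crawl the data for multiple episodes?

When facing problems and difficulties, Brother Pig always likes to quantify the task or workload, then break it down and solve it step by step!

Here we split batch crawling into two steps: first, crawl all the danmaku of one episode; second, crawl the danmaku of multiple episodes!

Step 1: crawl all the danmaku of one episode
The key to batch crawling is finding the paging parameter, and the trick for finding it is to compare the parameters of two request URLs and see what differs.
Comparing the URLs of the first and second requests within the same episode, we find that the mat parameter differs and increases step by step. This is the paging parameter we are looking for (in fact mat is the minute number, meaning which minute's danmaku to fetch). Having found it, we can rework the original method as follows:

Turn the paging parameter in the original URL into a variable that is passed in to the method. Then create a batch-crawl method that calls the single-request method in a loop, passing in the page number each time!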
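
The rework described above could be sketched as follows. `fetch_page` stands in for the single-request method (anything that takes a URL and returns a list of danmaku), and `minutes` is an assumption: the article does not say how to tell how many minutes an episode has, so the caller supplies it.

```python
import time

BASE_URL = "https://service.danmu.youku.com/list"


def build_page_url(mat, iid="1061156738"):
    """Build the danmaku URL for one minute (mat) of one episode (iid)."""
    return (
        f"{BASE_URL}?jsoncallback=jQuery111205151507831610791_1562918614483"
        f"&mat={mat}&mcount=1&ct=1001&iid={iid}&aid=322943&cid=97&lid=0&ouid=0"
        f"&_=1562918614486"
    )


def crawl_episode(fetch_page, minutes, iid="1061156738", delay=1.0):
    """Call fetch_page(url) for mat = 0 .. minutes-1 and collect the danmaku."""
    contents = []
    for mat in range(minutes):
        contents.extend(fetch_page(build_page_url(mat, iid)))
        time.sleep(delay)  # pause between requests to go easy on the server
    return contents
```

Keeping the URL building separate from the loop means the same loop can later be reused for other episodes by passing a different `iid`.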

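Step 2: crawl all the danmaku of multiple episodes
The key to this step is finding the parameter that represents the episode. We can use the same comparison method: compare the first danmaku request URL of episode 1 with that of episode 2 to find the parameter that differs!
We find that episode 1 has iid=1061156738 and episode 2 has iid=1061112026, but this iid parameter does not simply increase. How do we find the pattern?

At this point we go back to the web page for the answer: copy episode 1's iid value 1061156738 into the search box of the browser's debugging window, and we find that iid is the vid value returned by a certain API.
Having found the episode parameter, we can write a function that crawls the parameters for all episodes.
The response says the token is empty? Strange: we filled in both the URL and the headers, so why does it still fail when the browser succeeds?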

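Here we need to bring in another request header: Cookie. What is a Cookie for?

HTTP is a stateless protocol, which means the server does not know who you are the next time you send a request, so Cookie and Session are used to record state. The simplest example: after a user logs in, the server passes the browser an encrypted string (a key) and caches a key-value pair on its own side. The browser then sends this key with every request, so the server knows which user you are!

Space is limited, so today is only a brief introduction; given how important the topic is, Brother Pig will write a dedicated article on Cookies later.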

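So where do we find the Cookie? In the browser, of course!
With so many Cookies, which one is the one we need? Nobody knows, and we don't need to find out: we simply copy all the Cookies into the code.

But the table view cannot be copied at all. Is there a trick that makes copying Cookies easier? Of course: click the Console tab in the browser's debugging window and type document.cookie to see the Cookies visible to the page (those marked HttpOnly are not included), then copy them straight out. Handy, isn't it!
Let's copy the Cookie into the code and try it. Note that Cookies have an expiry time; this token in particular may expire after ten-odd minutes. When it does, just copy it again from the browser!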
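
One way to use the copied text is to turn it into a dictionary, which requests accepts through its `cookies=` argument; alternatively the raw string can be sent as a `Cookie` header. The sketch below shows the conversion. The helper name and the sample cookie text are made up for illustration.

```python
def cookie_string_to_dict(cookie_string):
    """Turn the text printed by document.cookie into a dict of name -> value."""
    cookies = {}
    for pair in cookie_string.split(";"):
        if "=" not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.strip().split("=", 1)  # values may themselves contain "="
        cookies[name] = value
    return cookies


# Made-up cookie text; paste your own from the Console instead.
cookies = cookie_string_to_dict("token_example=abc123==; lang=zh")
print(cookies)
```

Since the token expires quickly, keeping the cookie text in one place makes it easy to paste in a fresh copy when requests start failing.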
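We can see that the returned data is again a JSONP function, and we again need to extract the JSON inside it, so we can wrap this in a shared method that extracts the JSONP response and converts it to a JSON object, which improves reuse!
With the JSON data in hand, we can work out its structure by inspection, then extract the vid values and return them; in his code Brother Pig returns a generator!

Now that we have the ids representing the episodes, we can crawl all the danmaku with a nested loop.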
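
The nested loop could be sketched as below. `fetch_minute` is a hypothetical single-request helper that returns the danmaku for one episode id and one minute, and `minutes_per_episode` is an assumed fixed count supplied by the caller; the original code may organize this differently.

```python
def crawl_all(episode_ids, minutes_per_episode, fetch_minute):
    """Yield every danmaku of every episode.

    episode_ids: iterable of iid/vid values (a generator works too).
    fetch_minute(iid, mat): hypothetical helper returning a list of danmaku.
    """
    for iid in episode_ids:                     # outer loop: episodes
        for mat in range(minutes_per_episode):  # inner loop: minutes
            yield from fetch_minute(iid, mat)
```

Because `crawl_all` is itself a generator, danmaku can be written to the file as they arrive instead of being held in memory until the whole crawl finishes.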
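In total we crawled nearly 300,000 records in about 40 minutes. If you think the interval between requests is too long you can shorten it, but don't send requests too frequently: it puts load on their server, and getting flagged by monitoring would not be good!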

6. Data cleaning + word cloud generation

What data do we need to clean? That is hard to guess in advance, so we skip cleaning at first, generate the word cloud directly to see how it looks, and then adjust. Brother Pig already covered word cloud generation in the previous article, on crawling JD.com (Jingdong) product reviews and generating a word cloud!

In the generated word cloud we can see a lot of words like "haha", "not", "this", and "what". Words like these have little analytical value, so now our data cleaning has a direction. (PS: Youku's danmaku don't have much substance...)
Brother Pig added a list of words to clean out so that these words are blocked; now let's look at the result!
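
The cleaning step amounts to counting words while skipping a block list. The sketch below uses only the standard library and a toy word list. In practice the Chinese text would first need to be segmented (for example with a tokenizer such as jieba), and the resulting frequencies could be handed to the wordcloud package, for instance through `WordCloud.generate_from_frequencies`; both are assumptions here, since this article does not show which calls were used.

```python
from collections import Counter


def top_words(words, stop_words, n=900):
    """Count words, skip the stop words, and return the n most common."""
    counts = Counter(w for w in words if w.strip() and w not in stop_words)
    return dict(counts.most_common(n))


# Toy input; real input would be the segmented danmaku text.
words = ["haha", "Chang'an", "haha", "this", "Chang'an", "Tang", "Chang'an"]
stop_words = {"haha", "this"}
print(top_words(words, stop_words))
```

Growing `stop_words` a little after each look at the word cloud mirrors the generate, inspect, adjust loop described above.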

7. Analyzing the word cloud

From the word cloud we can infer the following:

  1. Some of the main characters in the drama: Zhang Xiaojing, Li Bi, Cui Qi, Long Bo, Xu Bin; some viewers even like Cao Poyan.
  2. Some people say it is great, others say they can't follow it, which suggests the plot has some depth
  3. The style may be a bit like Assassin's Creed
  4. "Four-character brother", "Qianxi", and "Yi Yangqianxi" all show that Yi Yangqianxi is in the drama
  5. The OST may hold some surprises
  6. "Tang" and "Chang'an" describe the setting of the story
  7. "Danmaku" and "IQ" may be a reminder to us all: turn off the danmaku to protect your IQ!

So far the first season (20 episodes) has been fully released. It really is a conscientious domestic production: its production quality, costumes, etiquette, cinematography, screenplay, and acting are all world-class. Recommended!

V. Summary

Let's summarize today's article from a technical angle. The process is very similar to the previous article, on crawling JD.com product reviews and generating a word cloud, but a little harder:

  1. Crawling the danmaku requires finding not only the paging parameter but also the episode parameter
  2. Crawling the danmaku requires a Cookie, and the Cookie has an expiry time
  3. The data volume is large, which may tax your computer a little
  4. The data must be cleaned when generating the word cloud

On the weekend: melon seeds, peanuts, and beer, watching the drama and coding without neglecting either. Isn't life sweet!

Project Address: https://github.com/pig6/youku_danmu_spider

Origin www.cnblogs.com/pig66/p/11181571.html