Crawling Weibo users' public information: are the fans charting for Jay Chou a "sunset red" senior group? The data tells you their real ages!

A while ago, the topic "Jay Chou charting" quickly climbed onto Weibo's trending list
because fans of cxk (Cai Xukun) questioned why Jay Chou had no Weibo data
(Jay Chou has never opened a Weibo account).
As a result, Jay Chou's older fans, hidden in every corner of the country for years,
could not stand it any longer and were forced into action.
So Jay Chou's middle-aged fans took on Cai Xukun's die-hard fans,
and a charting war broke out on Weibo.
Fans who had listened to Jay Chou for so many years
had to swallow their pride
and ask the post-00s youngsters from Weibo's fan circles
how to boost data, learning it all from scratch.

First, background

The iKun (Cai Xukun's fans) have all said that Jay Chou's fans are a "sunset red" senior group.
Today we will let the data speak and settle it with facts, and show the iKun whether Jay Chou's fans really are middle-aged and elderly!

Second, functional description

Use a crawler to crawl the posts under #周杰伦超话# (the Jay Chou super topic), then crawl the posters' profile information to obtain age, region, gender and other fields, then analyze the data and present it visually!

Note: the Weibo posts and profile information mentioned in this article are all public information and contain no private information, and no individual's personal information appears anywhere in the article. The analysis is for learning only. No one may use this tutorial for commercial purposes; violators bear the consequences themselves!

Third, the technical plan

Let us roughly break down the technical steps and the technologies used

  1. Crawl the posts under #周杰伦超话#
  2. Crawl the profile information of each post's author
  3. Save the information to a csv file
  4. Analyze the users' age and gender distribution
  5. Analyze the regional distribution of the fans
  6. Use a word cloud to analyze the content of the charting posts

For crawling the data we can use the requests library, for saving the csv file we can use the built-in csv library, and for data analysis and visualization I will introduce a super easy-to-use library, pyecharts. With the technology selection done, we can start implementing!

Fourth, crawling the super topic posts

1. Find the super topic URL and load the data

In Google Chrome, we open the #周杰伦超话# page, bring up the developer tools, switch to mobile mode, filter the requests so that only asynchronous (XHR) requests are shown, inspect the format of the returned data, and find where the post content lives!
Weibo request URL: https://m.weibo.cn/api/container/getIndex?jumpfrom=weibocom&containerid=1008087a8941058aaf4df5147042ce104568da_-_feed

2. Simulating the request in code

With the link in hand we can simulate the request. Here we again use the familiar requests library. A few lines are enough to fetch the posts!
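A minimal sketch of such a request, assuming the requests library is installed. Only the URL and its two query parameters come from the link above; the function name fetch_topic_page, the User-Agent string, the timeout and the optional session argument are illustrative assumptions, not the author's original code.

```python
import requests

API_URL = "https://m.weibo.cn/api/container/getIndex"
# Query parameters copied from the request link above.
BASE_PARAMS = {
    "jumpfrom": "weibocom",
    "containerid": "1008087a8941058aaf4df5147042ce104568da_-_feed",
}
# Assumption: a mobile-style User-Agent, because the page was inspected in mobile mode.
HEADERS = {"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 13_0 like Mac OS X)"}


def fetch_topic_page(session=None, timeout=10):
    """Request one page of the super topic feed and return the decoded JSON."""
    http = session if session is not None else requests
    response = http.get(API_URL, params=BASE_PARAMS, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # surface HTTP errors instead of parsing an error page
    return response.json()
```

Accepting an optional session keeps the function usable both for one-off calls and with a shared requests.Session.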

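3. Extracting the post content

We can see that the returned data is in JSON format. Digging down layer by layer, we can find where the post content and the user id are!
Once we understand the structure of the returned data, we can extract the post content and the id!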
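The layer-by-layer lookup can be sketched as follows. The key path used here (data, cards, mblog, user) is an assumption about the response layout and should be checked against the JSON seen in the developer tools; extract_posts is a hypothetical helper name.

```python
def extract_posts(payload):
    """Return (user_id, post_id, text) tuples from one decoded API response."""
    posts = []
    # Assumed layout: payload["data"]["cards"][i]["mblog"] holds one post.
    for card in payload.get("data", {}).get("cards", []):
        mblog = card.get("mblog")
        if not mblog:  # skip cards that are not posts (banners, headers, ...)
            continue
        user = mblog.get("user") or {}
        posts.append((user.get("id"), mblog.get("id"), mblog.get("text", "")))
    return posts
```

Using .get at every level means a card with an unexpected shape is skipped instead of raising a KeyError in the middle of a long crawl.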

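4. Crawling posts in batches

After extracting one post, we can crawl posts in batches. How? By paging, of course. And how do we page? Here Brother Pig will once again teach you the trick for finding the paging parameter:

Trick for finding the paging parameter: compare the URLs of the first and second requests, see how they differ, and pick out the parameter that changed! A text comparison tool I recommend: Beyond Compare

Comparing the two request URLs shows that the second request has one extra parameter, since_id, and this since_id is the id of a post!

Weibo paging mechanism: paging is by time. Every post has a since_id, and the later the post, the larger its since_id. Passing a since_id in the request loads the posts under the topic whose since_id is smaller than it. We then take the smallest since_id from that result and pass it in the next request, and so on. That is how paging works.

Knowing the paging mechanism, we can settle our paging strategy: use the smallest since_id among the posts returned by the previous request as the parameter of the next request, which amounts to crawling the data page by page in reverse chronological order.

Then we simply write a for loop that calls the crawl method: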

    # Crawl in batches
    for i in range(1000):
        print('Page %d' % (i + 1))
        spider_topic()
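The since_id strategy itself can be sketched independently of the network code. This is a simplified illustration, not the author's spider_topic: crawl_pages and fetch_page are hypothetical names, and fetch_page(since_id) is assumed to return the post ids of one page, all smaller than since_id.

```python
def crawl_pages(fetch_page, max_pages):
    """Walk backwards in time, one page per request, and collect post ids."""
    collected = []
    since_id = None  # the first request carries no since_id
    for _ in range(max_pages):
        post_ids = fetch_page(since_id)
        if not post_ids:  # nothing older is left
            break
        collected.extend(post_ids)
        since_id = min(post_ids)  # smallest id on this page drives the next request
    return collected
```

Stopping on an empty page avoids issuing the remaining requests once the topic has been exhausted.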

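Fifth, crawling user information

With batch crawling of posts done, we can start crawling user information!

First we need to know that the link to a user's basic information page is https://weibo.cn/<user id>/info. As an example we take the home page of a certain classmate who **likes singing, dancing, rap and basketball**!
So once we have a user's id, we can get that user's public basic information!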

1. Getting the user id

Looking back at the post data format we analyzed earlier, we find that it already contains the user id we need!
So when extracting the post content we can extract the user id along the way!

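2. Simulating login

Once we have the user id, requesting the URL https://weibo.cn/<user id>/info gives us the public information. Viewing another user's home page requires being logged in, though, so we first simulate a login in code!

We already covered simulated login when we crawled Douban, so the same approach is reused here!
For login we use a requests.Session() object, which automatically stores cookies and sends them with subsequent requests!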
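The cookie behaviour of requests.Session() can be observed without sending anything over the network. This is only a demonstration of the mechanism, assuming the requests library is installed: the cookie name and value and the user id in the URL are made up, and setting the cookie by hand stands in for a real login response.

```python
import requests

session = requests.Session()
# Stand-in for a successful login: store a cookie the way a login response would.
# The cookie name and value are invented for this demonstration.
session.cookies.set("SUB", "demo-token", domain="weibo.cn", path="/")

# Build (but do not send) a later request through the same session.
request = requests.Request("GET", "https://weibo.cn/1234567890/info")
prepared = session.prepare_request(request)

# The session attached the stored cookie to the outgoing request headers.
cookie_header = prepared.headers.get("Cookie")
```

Any request made through the same session after the login call carries the cookies in the same way, which is why the profile pages become readable.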

3. Crawling users' public information

With the user id in hand and the login done, we can start crawling the user's public information!
The only public information we need is username, gender, region and birthday, so we extract just those fields!
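One way to pull out the four fields is a regular expression per label. This sketch assumes the profile page lists each field as a Chinese label followed by a colon and the value (for example 昵称:..., 性别:...), separated by HTML tags; that layout and the helper name parse_profile are assumptions to verify against the real page.

```python
import re

# Page label -> output key. The labels are an assumption about the page text.
FIELDS = {"昵称": "username", "性别": "gender", "地区": "region", "生日": "birthday"}


def parse_profile(html):
    """Pull the public fields out of the profile page text."""
    profile = {}
    for label, key in FIELDS.items():
        # Accept ASCII and full-width colons; the value ends at the next tag.
        match = re.search(label + r"[::]\s*([^<]+)", html)
        profile[key] = match.group(1).strip() if match else ""
    return profile
```

Missing fields come back as empty strings, so every user yields a row with the same columns.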
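Do not crawl user information too frequently, otherwise requests will start to fail (response status code = 418). Your IP will not be banned, though. In fact many big companies are reluctant to ban IPs, because it is too easy to hit innocent users: a single ban might take out a whole residential compound or more!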
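A simple way to stay under the limit is to back off whenever a 418 comes back. The sketch below is one possible approach, not the author's code: fetch_with_backoff is a hypothetical helper, fetch is assumed to return a (status, body) pair, and the delays are arbitrary starting values.

```python
import time


def fetch_with_backoff(fetch, max_retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until it stops returning status 418, waiting longer each time."""
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status != 418:
            return status, body
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...
    return status, body  # still 418 after all retries; the caller decides what to do
```

Passing sleep as a parameter makes the function easy to exercise without actually waiting.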

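Sixth, saving the csv file

We now have the post information and the user information, so let us save the data to make the later analysis easier!

Until now we have always saved data as txt, because there was only one field. This time there are several fields (post content, username, region, age, gender and so on), so we choose the CSV (Comma Separated Values) format!
We build a list, put the data into it in order, and then write it to the csv file!
Take a look at the generated csv file. Note that opening it in WPS or Excel may show garbled characters, because we write the file as UTF-8 while WPS and Excel by default assume the system's local encoding (GBK on Chinese Windows) for csv files. An ordinary text editor opens it fine, and so does PyCharm!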
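Writing the rows with the built-in csv library might look like this. The English column names and the helper name write_rows are assumptions. The sketch also swaps in the utf-8-sig encoding, which writes a byte order mark so that Excel and WPS detect UTF-8; this is a variation on the plain utf-8 described above, not the author's choice.

```python
import csv

# Column order follows the fields collected above; the English names are assumptions.
HEADER = ["user_id", "username", "gender", "region", "birthday", "post_id", "post_text"]


def write_rows(path, rows):
    """Write a header plus the given rows (lists in HEADER order) to a csv file."""
    # newline="" lets the csv module control line endings.
    # utf-8-sig prepends a BOM so spreadsheet programs detect UTF-8.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(rows)
```

The csv writer quotes fields that contain commas or line breaks, which matters here because post text is free-form.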

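Seventh, data analysis

With the data saved we can start the analysis. First we need to decide what to analyze:

  1. Turn the gender data into a pie chart, simple and intuitive
  2. Turn the age data into a bar chart for easy comparison, to see whether this really is a "sunset red" senior group
  3. Turn the region data into a heat map of China, to see which region's fans are the most active
  4. Finally, turn the post content into a word cloud, to see at a glance what everyone is saying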

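1. Reading a column of the csv file

The saved data consists of many rows in the format 'user id', 'username', 'gender', 'region', 'birthday', 'post id', 'post content', whereas the analysis needs one specific column at a time, for example the gender column. So we wrap the reading of a given column in a method!
Brother Pig also uses the Counter class here to count how often each value occurs, which makes the later analysis easier. It returns data in the form {'女': 1062, '男': 637} (女 = female, 男 = male).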
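A column reader with Counter can be sketched as below. It assumes the csv file starts with a header row whose names are passed in as the column argument; count_column is a hypothetical helper name, not the author's method.

```python
import csv
from collections import Counter


def count_column(path, column):
    """Count how often each value appears in one named column of the csv file."""
    # utf-8-sig reads files with or without a leading BOM.
    with open(path, newline="", encoding="utf-8-sig") as f:
        reader = csv.DictReader(f)
        # Empty cells are skipped so they do not show up as a category.
        return Counter(row[column] for row in reader if row.get(column))
```

Because Counter is a dict subclass, its items can be passed on as (label, count) pairs to a charting library.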

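2. The visualization library pyecharts

Before we start the analysis there is one important thing to do: choose a suitable visualization library! Python has a great many visualization libraries. Until now we have used matplotlib for word clouds, and matplotlib is very convenient for simple plots. Today, however, we need a nationwide distribution map, so after comparing the options Brother Pig chose pyecharts, a library developed in China. The reasons for choosing it: it is open source and free, well documented, rich in chart types and concise to code with. In one word: satisfying!

Their official documentation includes very detailed examples that you can copy and run directly to get a chart!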

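3. Analyzing gender

Having chosen the visualization library, let us put it to use!
Why is the output HTML? Because the chart is interactive: you can click to choose what is displayed, which is very user-friendly! Running the code generates a gender.html file that you can open in a browser!
The resulting chart shows that among the charting fans, women outnumber men, with women making up about 62%!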
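The share of each gender can be checked from the counts quoted earlier, {'女': 1062, '男': 637}. The sketch below only does the arithmetic behind the pie chart; shares is a hypothetical helper and is not part of pyecharts.

```python
from collections import Counter

# Counts quoted earlier in the article (女 = female, 男 = male).
gender_counts = Counter({"女": 1062, "男": 637})


def shares(counts):
    """Convert raw counts to percentages rounded to one decimal place."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


gender_share = shares(gender_counts)
```

With 1062 of 1699 users, the female share works out to 62.5%, in line with the roughly 62% read off the chart.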

4. Analyzing age

This is the part everyone cares about most: are the fans really a "sunset red" group?
The chart shows that the main force charting for Jay Chou is the post-90s generation!
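Grouping birthdays into generations can be sketched as follows. It assumes birthdays are stored as YYYY-MM-DD strings and that values outside a plausible year range are placeholders to drop; the bounds, the labels such as '90s' and the function names are all illustrative assumptions.

```python
import re
from collections import Counter


def birth_cohort(birthday, min_year=1950, max_year=2019):
    """Map a birthday string to a decade label such as '90s', or None if unusable."""
    match = re.match(r"(\d{4})-\d{2}-\d{2}$", birthday.strip())
    if not match:  # e.g. only a zodiac sign was given
        return None
    year = int(match.group(1))
    if not min_year <= year <= max_year:  # placeholder or implausible year
        return None
    return "%02ds" % (year % 100 // 10 * 10)


def count_cohorts(birthdays):
    """Count users per decade, ignoring birthdays that cannot be used."""
    cohorts = (birth_cohort(b) for b in birthdays)
    return Counter(c for c in cohorts if c is not None)
```

Filtering first matters because a profile birthday is optional and free-form, so unusable values would otherwise distort the bar chart.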

5. Analyzing region

Let us look at how the charting fans are distributed across the provinces!
The map shows that the three provinces (municipalities) with the most charting fans are Guangdong, Beijing and Shanghai!
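Counting fans per province can be sketched like this. It assumes region values look like "province city" separated by a space and that entries such as 其他 (other) or 海外 (overseas) should be left off a map of China; both points, and the function names, are assumptions.

```python
from collections import Counter

# Assumption: region values that are not Chinese provinces and should be skipped.
SKIP = {"其他", "海外"}


def province_of(region):
    """Return the first whitespace-separated token of a region string."""
    parts = region.split()
    return parts[0] if parts else ""


def top_provinces(regions, n=3):
    """Return the n most common provinces as (province, count) pairs."""
    provinces = map(province_of, regions)
    counts = Counter(p for p in provinces if p and p not in SKIP)
    return counts.most_common(n)
```

The (province, count) pairs have the shape a map chart typically expects as input.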

6. Analyzing the charting posts

Let us see what these charting fans are saying!
The word cloud brings out some interesting words: business, the elderly, milk tea!

It seems the charting fans themselves think they are getting old, ha ha ha!
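The word frequencies that feed a word cloud can be sketched as below. This assumes the post text contains HTML tags and #topic# hashtags that should be removed first. The default tokenizer just splits on whitespace, so real Chinese text would need a proper word segmenter passed in as tokenize; the function names and sample usage are hypothetical.

```python
import re
from collections import Counter


def clean_post(text):
    """Remove HTML tags and #topic# hashtags from one post."""
    text = re.sub(r"<[^>]+>", "", text)  # drop HTML tags
    text = re.sub(r"#[^#]+#", "", text)  # drop #topic# hashtags
    return text.strip()


def word_frequencies(posts, tokenize=str.split):
    """Count tokens across all posts; tokenize maps a string to a list of words."""
    counts = Counter()
    for post in posts:
        counts.update(tokenize(clean_post(post)))
    return counts
```

Removing the topic hashtag first keeps the super topic's own name from dominating the cloud.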

Eighth, summary

Judging from the results, the main force charting for Jay Chou is still the post-80s and post-90s generations; after all, he was part of their youth. Women outnumber men, though, and the region with the most charting fans is Guangdong!

On the technical side, today's example has plenty of new things to learn: Sina Weibo's paging mechanism, crawling public user information, saving files with the csv library, and data visualization with pyecharts!

Of course, I ran into many problems along the way, which you will only appreciate by trying it yourself. The source code is on GitHub (https://github.com/pig6/sina_topic_spider). If you are interested, remember to share and bookmark it, and try it while it is fresh!

Origin www.cnblogs.com/pig66/p/11299011.html