I crawled 9,674 Zhihu answers with Python and uncovered the TOP 98 books!

Original link: https://mp.weixin.qq.com/s?src=11&timestamp=1571985685&ver=1932&signature=QhR9plwBLWFqBimlJ1lLa7VsHopg8AtWmFaFGsTttoqDwUyeLH6FjgBVG7RdHrRbZiHTf*DTioU5K1Itim22PMerNxjjEa1HqY73tn3K6a1EN28Oe1mGyw8uyvm3ygJ2&new=1

Data collection

Zhihu does have a "Reading" topic, but when I looked through its questions, not all of them ask for book recommendations. If I crawled everything, perhaps 80% of the data would have nothing to do with recommending books.

So I searched Zhihu directly for "book" and picked six questions whose answers were especially popular:
[Image: the six selected questions]
Open the browser's "Inspect" panel on a question page and keep scrolling down.

Among the XHR requests there is a link that clearly contains the word "answer":
[Image: the XHR request in the Inspect panel]
Look at a few of these links and the pattern becomes clear.

(offset: 0, 5, 10, 15, 20, ...)

Pick whichever fields interest you and, whoosh, crawl them down.

The other five questions were handled the same way:
[Image: crawling the other five questions]
In total I got 9,674 answers, with these basic fields:
[Image: the basic fields of each answer]
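
The paging pattern described above can be sketched roughly as follows. This is only an illustration: the real XHR link is visible only in the screenshot, so `BASE_URL`, the `limit` parameter, and the field names in `pick_fields` are all assumptions, not Zhihu's actual API.

```python
from urllib.parse import urlencode

# Hypothetical endpoint template; the real link appears only in the screenshot.
BASE_URL = "https://example.com/api/questions/{question_id}/answers"
PAGE_SIZE = 5  # assumed from the offsets stepping by 5


def page_urls(question_id, total_answers, page_size=PAGE_SIZE):
    """Yield one URL per page, with offset = 0, 5, 10, ..."""
    for offset in range(0, total_answers, page_size):
        query = urlencode({"limit": page_size, "offset": offset})
        yield BASE_URL.format(question_id=question_id) + "?" + query


def pick_fields(answer, fields=("gender", "upvotes", "comments", "content")):
    """Keep only the fields of interest from one decoded JSON answer.

    The field names are placeholders; missing fields come back as None.
    """
    return {name: answer.get(name) for name in fields}


# 12 answers at 5 per page need offsets 0, 5 and 10.
urls = list(page_urls(question_id=123, total_answers=12))
for url in urls:
    print(url)
```

Each URL would then be fetched with an HTTP client of your choice, and every item in the returned JSON passed through `pick_fields`.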

Data cleaning

I used to think crawling the data was the hardest part: once it was crawled down, the rest would be easy. Process it however I like, analyze it however I like.

But this time the main purpose of the crawler is to list the books that appear most frequently. Some answers are nice and concise:

[Image: a concise answer]
Others come with a lot of recommendation text (filler):
[Image: a long-winded answer]
Look, the longest answer may well run to more than 30,000 characters! Working out the crawler took me about an hour, but figuring out how to analyze these answers gave me a headache for three nights. The main problems:

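  • Many answers do not use book title marks (《》), so a simple regular expression is not enough;

  • Zhihu users sometimes mistype a title when answering ("一句话顶一万句", where the actual title is 《一句顶一万句》), and titles are also abbreviated or phrased in different ways (for example, there are 11 different ways of referring to the Harry Potter series...);

  • Most importantly, I do not have the ability to "look at a word or a sentence and tell which parts are book titles and which are not". If I cannot do it myself, how am I supposed to make Python judge and extract them...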

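I also considered simply using a regular expression to match whatever sits inside 《》, only to find:
[Image: share of answers with and without book title marks]
44.96% of users answer in a very non-standard way: their answers contain no book title marks at all! Analyzing the data this way would mean losing nearly half of it!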
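
As a rough illustration of the regex-only approach and of how the share of answers without title marks can be measured, here is a minimal sketch. The sample answers are made up for the example and are not taken from the crawled data.

```python
import re

# Non-greedy match of whatever sits between a pair of book title marks.
TITLE_PATTERN = re.compile(r"《(.+?)》")


def titles_in(answer_text):
    """Return the titles wrapped in 《》 in one answer, in order of appearance."""
    return TITLE_PATTERN.findall(answer_text)


def share_without_marks(answers):
    """Fraction of answers that contain no 《》 pair at all."""
    if not answers:
        return 0.0
    unmarked = sum(1 for text in answers if not titles_in(text))
    return unmarked / len(answers)


# Toy data: two answers use title marks, two do not.
sample = ["推荐《活着》和《围城》", "活着,围城", "《飘》", "三体"]
print(titles_in(sample[0]))
print(share_without_marks(sample))
```

On this toy sample half of the answers would be invisible to the regex, which is the same kind of loss the 44.96% figure describes.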

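Unless... unless I had a book library containing the title of every book. Then I would only need to go through each answer, pull out a title whenever the answer mentions that book, and run the statistics at the end. But, how does the saying go, the ideal is plump while reality is skinny. I have no such library. With the data at hand, the best I could do was take the titles that appear in the other 55.04% of the answers, clean them up a little, and end up with a crude list of titles...

[Image: the crude list of titles]
and then go through every answer again...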
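
The two passes described above (collect titles from answers that use 《》, then scan every answer for those titles) might look something like the sketch below. It is not the author's code: the `min_length` filter and the plain substring match are assumptions made for the example, and the sample answers are invented.

```python
import re
from collections import Counter

TITLE_PATTERN = re.compile(r"《(.+?)》")


def build_title_list(answers, min_length=2):
    """First pass: collect every title that appears inside 《》.

    Titles shorter than min_length are skipped (an assumption of this
    sketch), because a one-character title matches far too much text
    in the substring search below.
    """
    titles = set()
    for text in answers:
        for title in TITLE_PATTERN.findall(text):
            title = title.strip()
            if len(title) >= min_length:
                titles.add(title)
    return titles


def count_mentions(answers, titles):
    """Second pass: count in how many answers each title appears,
    whether or not the answer wraps it in 《》."""
    counts = Counter()
    for text in answers:
        for title in titles:
            if title in text:
                counts[title] += 1
    return counts


# Toy data: only the first and third answers use title marks.
sample = ["推荐《活着》和《围城》", "活着,围城,三体", "《三体》必读", "围城不错"]
titles = build_title_list(sample)
counts = count_mentions(sample, titles)
print(counts.most_common())
```

Substring matching is crude: it cannot catch mistyped or abbreviated titles, and a title that is also an everyday word will be over-counted, so the result still needs manual cleaning.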

I will not go into the pain involved; complaining would not help. This is not a perfect solution, and it only barely serves the purpose of this crawl. Still, let me list some of the pits I fell into. There are plenty of pits ahead, but the fewer you fall into, the better:
[Images: five screenshots listing the pitfalls]

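Data analysis

Before getting to the final TOP book list, let's look at the basic picture of these answers, as usual.
[Image: overview of the answers]
Zhihu's backend reports gender as 0, 1 and -1. After studying the profiles of two or three specific users, I found that 0 means female, 1 means male, and -1 means unknown.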

It appears that, across the six questions, men make up a slightly larger share of the answers than women.
[Image: share of answers by gender]
Men's and women's answers are very close in length, which shows that everyone put in real effort. In terms of interaction, men's answers receive slightly more upvotes per answer than women's and 55% more comments per answer, so their answers may be more controversial. But reading is something that suits young and old alike, so gender differences on this topic should be small.
[Image: answer length, upvotes and comments by gender]
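
A per-gender comparison like the one above can be computed along these lines. The 0 / 1 / -1 mapping comes from the text; the field names (`gender`, `upvotes`, `comments`) and the three sample rows are assumptions made up for the sketch.

```python
from collections import defaultdict

# Mapping described in the text: 0 = female, 1 = male, -1 = unknown.
GENDER_LABELS = {0: "female", 1: "male", -1: "unknown"}


def per_answer_averages(answers):
    """Average upvotes and comments per answer, grouped by gender label."""
    totals = defaultdict(lambda: {"answers": 0, "upvotes": 0, "comments": 0})
    for answer in answers:
        label = GENDER_LABELS.get(answer["gender"], "unknown")
        group = totals[label]
        group["answers"] += 1
        group["upvotes"] += answer["upvotes"]
        group["comments"] += answer["comments"]
    return {
        label: {
            "answers": group["answers"],
            "avg_upvotes": group["upvotes"] / group["answers"],
            "avg_comments": group["comments"] / group["answers"],
        }
        for label, group in totals.items()
    }


# Toy rows, not real data.
sample = [
    {"gender": 1, "upvotes": 10, "comments": 4},
    {"gender": 1, "upvotes": 20, "comments": 2},
    {"gender": 0, "upvotes": 12, "comments": 2},
]
stats = per_answer_averages(sample)
print(stats)
```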
The timing closely matches a normal (modern) person's daily routine: most answers were submitted during the day. Still, 11% of users answered between 0:00 and 4:00, and I think these people were certainly not doing some bedtime reading.
[Image: number of answers by hour of day]
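
One way to get such an hour-of-day breakdown is sketched below. It assumes each answer carries a Unix timestamp for when it was created and that the hours are meant in China Standard Time (UTC+8); neither detail is stated in the article, and the sample timestamps are invented.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

CHINA_TZ = timezone(timedelta(hours=8))  # assumed time zone for the hours


def hour_counts(timestamps, tz=CHINA_TZ):
    """Count answers per local hour of day (0-23)."""
    return Counter(datetime.fromtimestamp(ts, tz).hour for ts in timestamps)


def share_between(counts, start_hour, end_hour):
    """Fraction of answers posted in [start_hour, end_hour)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    inside = sum(n for hour, n in counts.items() if start_hour <= hour < end_hour)
    return inside / total


# Toy timestamps at 01:30, 09:00, 14:15 and 21:45 local time.
sample = [
    datetime(2019, 10, 1, hour, minute, tzinfo=CHINA_TZ).timestamp()
    for hour, minute in [(1, 30), (9, 0), (14, 15), (21, 45)]
]
counts = hour_counts(sample)
print(share_between(counts, 0, 4))
```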
In the scatterplot of posting time against upvotes, the highly upvoted answers fall between 8:00 and 20:00. People are more energetic during these hours, so it is easier to write a high-quality answer. This health-conscious girl appeals once again: go to bed early! Someone asks what to do if you cannot sleep? I will tell you some other time (from my own experience).
[Image: scatterplot of posting time against upvotes]
As noted earlier, the shortest answer is a single character: 飘 (Gone with the Wind). The longest, at 32,210 characters, is 1.5 times the length of my thesis. Overall, about 84% of the answers are under 1,000 characters, which fits everyone's fragmented reading habits. However, the other 16% of answers earned 93% of the upvotes and 72% of the comments. Look, look (knocks on the blackboard), what a vivid example of the Pareto principle. Take notes, class!
[Image: distribution of answer length]
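
The split between short and long answers can be checked with a small calculation like the one below. The 1,000-character threshold is from the text; the field names and the toy rows are assumptions for illustration only.

```python
def long_answer_shares(answers, threshold=1000):
    """Share of answers, upvotes and comments that belong to answers
    of at least `threshold` characters."""
    long_answers = [a for a in answers if len(a["content"]) >= threshold]

    def ratio(part, whole):
        return part / whole if whole else 0.0

    return {
        "answers": ratio(len(long_answers), len(answers)),
        "upvotes": ratio(
            sum(a["upvotes"] for a in long_answers),
            sum(a["upvotes"] for a in answers),
        ),
        "comments": ratio(
            sum(a["comments"] for a in long_answers),
            sum(a["comments"] for a in answers),
        ),
    }


# Toy rows: one long answer that collects most of the interaction.
sample = [
    {"content": "书" * 1200, "upvotes": 90, "comments": 7},
    {"content": "飘", "upvotes": 2, "comments": 1},
    {"content": "活着", "upvotes": 3, "comments": 1},
    {"content": "围城", "upvotes": 2, "comments": 1},
    {"content": "三体", "upvotes": 3, "comments": 0},
]
shares = long_answer_shares(sample)
print(shares)
```

Here `len` counts characters rather than words, which is the natural unit for Chinese text.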
Finally, here is the book list that took me three days to put together (sorted by how often Zhihu users mentioned each book):
[Images: the TOP 98 book list]
How many of these 98 books have you read?

Someone may ask why it is a TOP 98 rather than a TOP 100. Because I think the list looks a little shorter this way, so you will be more motivated to "finish off" all the books.

Origin blog.csdn.net/fei347795790/article/details/102742779