Crawling Zhihu's Top "Sand Sculpture" Questions with Python

Foreword

The text and images in this article come from the Internet and are for learning and exchange only, not for any commercial purpose. Copyright belongs to the original author; if there is any problem, please contact us.

Author: Lin Sen data

PS: If you need Python learning materials, click the link below to get them:

http://note.youdao.com/noteshare?id=3054cce4add8a909e784ad934f956cef

A few days ago, while idly browsing the Internet, I was drawn in by a Zhihu question asking whether the Jade Emperor lives in the troposphere or the stratosphere. I assumed it was just a throwaway joke, but it turned out the question had triggered a strong response on Zhihu: 5,000,000+ views and 7,000+ followers. [Figure: the question as it appears on Zhihu]

Data Sources

Zhihu very "thoughtfully" has a dedicated question that meets our needs. The pleasant surprise is that the question has 243 answers, and the answer from user Tao Fei received 30,000+ upvotes.

[Figure: the question page and Tao Fei's highly upvoted answer]

We crawled the question links appearing in all of those answers, which yielded 400+ distinct questions, 200+ of them supplied by Tao Fei. Our thanks to Tao Fei for helping us build this "sand sculpture database". The code for this part is as follows:

import re
import time

import requests
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.maximize_window()

# Open the target question in a new tab, close the original blank window,
# then switch the driver to the remaining handle.
url = 'https://www.zhihu.com/question/37453271'
js = 'window.open("' + url + '")'
driver.execute_script(js)
driver.close()
driver.switch_to.window(driver.window_handles[0])

# Scroll to the bottom repeatedly so lazily loaded answers get rendered.
for i in range(100):
    js = "var q=document.documentElement.scrollTop=10000000"
    driver.execute_script(js)
    time.sleep(1)

# Grab each answer's HTML and pull out every question link it mentions.
all_html = [k.get_property('innerHTML') for k in driver.find_elements_by_class_name('AnswerItem')]
all_text = ''.join(all_html)

# all_text = all_text.replace('\u002F', '/')
all_text = all_text.replace('questions', 'question')  # normalize link variants
pat = r'question/\d+'
questions = list(set(re.findall(pat, all_text)))  # de-duplicated question IDs

 

Once we have each question's ID, we can visit its page and fetch the corresponding title, view count, and other information, as shown below: [Figure: the question-page fields to be scraped]

The code for this part is as follows:

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win32; x32; rv:54.0) Gecko/20100101 Firefox/54.0',
          'Connection': 'keep-alive'}
cookies = 'v=3; iuuid=1A6E888B4A4B29B16FBA1299108DBE9CDCB327A9713C232B36E4DB4FF222CF03; webp=true; ci=1%2C%E5%8C%97%E4%BA%AC; __guid=26581345.3954606544145667000.1530879049181.8303; _lxsdk_cuid=1646f808301c8-0a4e19f5421593-5d4e211f-100200-1646f808302c8; _lxsdk=1A6E888B4A4B29B16FBA1299108DBE9CDCB327A9713C232B36E4DB4FF222CF03; monitor_count=1; _lxsdk_s=16472ee89ec-de2-f91-ed0%7C%7C5; __mta=189118996.1530879050545.1530936763555.1530937843742.18'
cookie = {}
for line in cookies.split(';'):
    name, value = line.strip().split('=', 1)  # split each "name=value" pair
    cookie[name] = value

questions_df = pd.DataFrame(columns=['title', 'visit', 'follower', 'answer', 'is_open'])

for i in range(len(questions)):
    try:
        url = 'https://www.zhihu.com/' + questions[i]
        html = requests.get(url, cookies=cookie, headers=header).content
        bsObj = BeautifulSoup(html.decode('utf-8'), 'html.parser')
        text = str(bsObj)
        title = bsObj.find('h1', attrs={'class': 'QuestionHeader-title'}).text
        visit = int(re.findall(r'"visitCount":\d+', text)[0].replace('"visitCount":', ''))
        follower = int(re.findall(r'"followerCount":\d+', text)[0].replace('"followerCount":', ''))
        answer = int(re.findall(r'"answerCount":\d+', text)[0].replace('"answerCount":', ''))
        is_open = int(len(re.findall('问题已关闭', text)) == 0)  # 1 if the question is still open
        questions_df = questions_df.append({'title': title, 'visit': visit,
                                            'follower': follower, 'answer': answer,
                                            'is_open': is_open}, ignore_index=True)
        time.sleep(2)  # throttle requests
        print(i)
    except Exception:
        print('error ' + str(i))
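The article doesn't show how the crawled table is stored. As a small usage sketch (the filename is my assumption, not the author's), the results can be saved so the analysis below doesn't require re-scraping:

# Persist the crawl for the analysis step (hypothetical filename).
# 'utf-8-sig' keeps the Chinese titles readable when opened in Excel.
questions_df.to_csv('shadiao_questions.csv', index=False, encoding='utf-8-sig')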

 

Data Analysis

Before revealing the final "sand sculpture ranking", let's first run a serious, earnest (read: purely pro-forma) round of analysis, focusing on the keywords in the questions. First, a word cloud over all the words: [Figure: word cloud of all words]
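The article doesn't include the word-cloud code. A minimal sketch of how such a cloud could be produced, assuming the common jieba + wordcloud combination (the font path, output filename, and 'title' column are assumptions):

import jieba
from wordcloud import WordCloud

# Join all crawled question titles and segment them with jieba,
# since Chinese text has no spaces for WordCloud to split on.
text = ' '.join(questions_df['title'])
words = ' '.join(jieba.cut(text))

# A Chinese font must be supplied or the cloud renders as boxes.
wc = WordCloud(font_path='simhei.ttf', background_color='white',
               width=800, height=600)
wc.generate(words)
wc.to_file('wordcloud_all.png')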

It seems most of these questions stem from people's exploration of life; otherwise "why", "if", and "what to do" would not show up so often. Surprisingly, "experience", that signature Zhihu tag, is not especially common. Perhaps out of respect for Zhihu, "experience" questions just aren't asked in such a "sand sculpture" way.

Next we drop these function words and look at the result again (a sketch of the filtering follows the figure):

[Figure: word cloud with function words removed]
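The article doesn't show how the function words were removed. One way is to pass a stopword set to WordCloud; the word list here is illustrative, not the author's actual list:

# Illustrative stopword set; the article's real list isn't given.
stopwords = {'为什么', '如果', '怎么办', '是', '的', '了', '吗'}
wc = WordCloud(font_path='simhei.ttf', background_color='white',
               width=800, height=600, stopwords=stopwords)
wc.generate(words)
wc.to_file('wordcloud_filtered.png')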

Judging from this chart, readers' interests are rather polarized: on one side, boyfriend/girlfriend questions of the "you're cold, you're heartless, you're unreasonable" variety; on the other, questions about the universe and the Earth that concern all of humanity. That fits Zhihu's persona of "everyone went to a 985 university and everyone makes a million a year".

Both of these clouds are actually drawn inside the shape of a single meme image; not sure whether you spotted it:

[Figure: the meme image behind the word clouds]
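Shaping a word cloud with an image is done via wordcloud's mask parameter. A sketch, assuming the meme has been saved locally as meme.png (a hypothetical filename):

import numpy as np
from PIL import Image

# Words are drawn only inside the non-white regions of the mask image.
mask = np.array(Image.open('meme.png'))  # hypothetical local copy of the meme
wc = WordCloud(font_path='simhei.ttf', background_color='white',
               mask=mask, stopwords=stopwords)
wc.generate(words)
wc.to_file('wordcloud_masked.png')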

Well, if you can't see it, that's perfectly normal; if you can, you could probably go post a question on Zhihu right now and make the next edition of the list. Finally, a word cloud made from part of the questions:

[Figure: word cloud made from part of the questions]

Not sure whether you can make it out; honestly, I can't either, and it was never really meant to be legible. Its real purpose is to lead into the ranking below.

Top Sand Sculpture Questions

By combining each question's view count, follower count, and answer count with its follower ratio and answer ratio, we derive a traffic index and a novelty index, and from those a final overall score, as shown below: [Figure: the scoring scheme]

It sounds complicated, but in the end the ranking comes down to 90% data + 10% subjective judgment, and from it we selected the 15 most "sand sculpture" questions.
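The exact weights live in the figure above and aren't recoverable from the text, so the sketch below is only one plausible reading of "traffic index + novelty index, 90% data": the column names match the crawl, everything else is an assumption.

# Hypothetical reconstruction of the scoring scheme described above.
df = questions_df.copy()
num = ['visit', 'follower', 'answer']
df[num] = df[num].astype(float)  # rows added via append may carry object dtype

df['follow_ratio'] = df['follower'] / df['visit']  # followers per view
df['answer_ratio'] = df['answer'] / df['visit']    # answers per view

def norm(s):
    # Min-max scale a column to [0, 1] so the indices are comparable.
    return (s - s.min()) / (s.max() - s.min())

# Traffic index from raw counts, novelty index from the ratios.
traffic = norm(df['visit']) + norm(df['follower']) + norm(df['answer'])
novelty = norm(df['follow_ratio']) + norm(df['answer_ratio'])

# 90% data; the remaining 10% is the author's subjective adjustment.
df['score'] = 0.9 * norm(traffic + novelty)
top15 = df.sort_values('score', ascending=False).head(15)
print(top15[['title', 'score']])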


Source: www.cnblogs.com/Qqun821460695/p/11917918.html