The text and images in this article come from the internet and are shared for learning and exchange only, not for any commercial purpose. Copyright belongs to the original author; if you have any concerns, please contact us and we will address them.
Author: Lin Sen data
http://note.youdao.com/noteshare?id=3054cce4add8a909e784ad934f956cef
While browsing the web the other day, I was drawn in by a Zhihu question: "Does the Jade Emperor live in the troposphere or the stratosphere?" I assumed it was just idle fun, but the question turned out to have triggered a strong response on Zhihu, with more than 5 million views and more than 7,000 followers:
Data Sources
Conveniently, Zhihu already has a question dedicated to exactly what we need. Surprisingly, that question has 243 answers, and the answer by user Tao Fei has received more than 30,000 upvotes.
We crawled the question links from all of those answers and found more than 400 distinct questions in total, over 200 of them supplied by Tao Fei. Our thanks to Tao Fei for helping us build this "sand sculpture database" ("sand sculpture" is the literal rendering of 沙雕, internet slang for hilariously silly). The code for this part is as follows:
    import re
    import time

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.maximize_window()

    # Open the collection question in a new window, close the blank one,
    # and switch to the window that is left.
    url = 'https://www.zhihu.com/question/37453271'
    js = 'window.open("' + url + '")'
    driver.execute_script(js)
    driver.close()
    driver.switch_to.window(driver.window_handles[0])

    # Scroll to the bottom repeatedly so that more answers are loaded.
    for i in range(100):
        js = "var q=document.documentElement.scrollTop=10000000"
        driver.execute_script(js)
        time.sleep(0.5)  # short pause (added here) so new answers can render

    # Collect the HTML of every answer and join it into one string.
    all_html = [k.get_property('innerHTML')
                for k in driver.find_elements(By.CLASS_NAME, 'AnswerItem')]
    all_text = ''.join(all_html)

    #all_text = all_text.replace('\u002F','/')
    all_text = all_text.replace('questions', 'question')
    # Every distinct "question/<id>" path mentioned in the answers.
    pat = r'question/\d+'
    questions = list(set(re.findall(pat, all_text)))
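The ID-extraction step at the end of the script can be tried offline, without a browser. The following is a minimal sketch, not the author's code: the function name `extract_question_paths` and the sample markup are made up for illustration, and the only ID taken from the article is 37453271. The original script replaces 'questions' with 'question' before matching, presumably so that both spellings of the path are caught by one pattern; the sketch keeps that step.

```python
import re

# Same pattern as in the script above: "question/" followed by digits.
QUESTION_PATTERN = re.compile(r'question/\d+')


def extract_question_paths(html):
    """Return the unique 'question/<id>' paths found in html, sorted."""
    # Mirror the original normalisation of 'questions' to 'question'.
    normalised = html.replace('questions', 'question')
    return sorted(set(QUESTION_PATTERN.findall(normalised)))


# Made-up sample markup; apart from 37453271 the IDs are hypothetical.
sample_html = (
    '<a href="https://www.zhihu.com/question/37453271">a</a>'
    '<a href="https://www.zhihu.com/question/111/answer/222">b</a>'
    '<a href="https://api.zhihu.com/questions/111">c</a>'
)

print(extract_question_paths(sample_html))
# ['question/111', 'question/37453271']
```

Sorting the result is a choice made here so that the output is deterministic; the original keeps the arbitrary order of a set.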
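Once we have the ID of each question, we can visit each question's own page to get its title, view count, and other information, as shown in the figure below:
The code for this part is as follows: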
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win32; x32; rv:54.0) Gecko/20100101 Firefox/54.0',
              'Connection': 'keep-alive'}
    cookies = 'v=3; iuuid=1A6E888B4A4B29B16FBA1299108DBE9CDCB327A9713C232B36E4DB4FF222CF03; webp=true; ci=1%2C%E5%8C%97%E4%BA%AC; __guid=26581345.3954606544145667000.1530879049181.8303; _lxsdk_cuid=1646f808301c8-0a4e19f5421593-5d4e211f-100200-1646f808302c8; _lxsdk=1A6E888B4A4B29B16FBA1299108DBE9CDCB327A9713C232B36E4DB4FF222CF03; monitor_count=1; _lxsdk_s=16472ee89ec-de2-f91-ed0%7C%7C5; __mta=189118996.1530879050545.1530936763555.1530937843742.18'

    # Turn the raw cookie string into a dict for requests.
    cookie = {}
    for line in cookies.split(';'):
        # Split each "name=value" pair (the original split the whole string).
        name, value = line.strip().split('=', 1)
        cookie[name] = value

    rows = []
    for i in range(len(questions)):
        try:
            url = 'https://www.zhihu.com/' + questions[i]
            html = requests.get(url, cookies=cookie, headers=header).content
            bsObj = BeautifulSoup(html.decode('utf-8'), "html.parser")
            text = str(bsObj)
            title = bsObj.find('h1', attrs={'class': 'QuestionHeader-title'}).text
            # The counts are read from the JSON embedded in the page source.
            visit = int(re.findall(r'"visitCount":(\d+)', text)[0])
            follower = int(re.findall(r'"followerCount":(\d+)', text)[0])
            answer = int(re.findall(r'"answerCount":(\d+)', text)[0])
            # '问题已关闭' ("question closed") appears only on closed questions.
            is_open = int(len(re.findall('问题已关闭', text)) == 0)
            rows.append({'title': title, 'visit': visit,
                         'follower': follower, 'answer': answer,
                         'is_open': is_open})
            time.sleep(2)  # be polite: pause between requests
            print(i)
        except Exception as e:
            print('Error ' + str(i), e)

    # DataFrame.append was removed in pandas 2.0, so build the frame once.
    questions_df = pd.DataFrame(rows, columns=['title', 'visit', 'follower', 'answer', 'is_open'])
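Two small steps in the script above are easy to get wrong: splitting the cookie string into name/value pairs, and pulling the counts out of the page source. The sketch below isolates both so they can be checked without any network access. The helper names `parse_cookie_string` and `extract_count`, the sample cookie, and the sample numbers are all hypothetical; only the key names `visitCount`, `followerCount`, and `answerCount` come from the article.

```python
import re


def parse_cookie_string(raw):
    """Turn 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    cookie = {}
    for pair in raw.split(';'):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing ';'
        # Split only once, because a cookie value may itself contain '='.
        name, value = pair.split('=', 1)
        cookie[name] = value
    return cookie


def extract_count(page_text, key):
    """Return the integer that follows "key": in the page source."""
    match = re.search(r'"%s":(\d+)' % re.escape(key), page_text)
    if match is None:
        raise ValueError('%s not found in page' % key)
    return int(match.group(1))


# Made-up inputs for illustration only.
sample_cookie = 'v=3; webp=true; token=abc=def'
sample_page = '{"visitCount":1200,"followerCount":34,"answerCount":5}'

print(parse_cookie_string(sample_cookie))
print(extract_count(sample_page, 'visitCount'))
```

Raising an error when a key is missing is a choice made here; the original script instead relies on its surrounding try/except to skip pages that cannot be parsed.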
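Data Analysis
Before sharing the final "sand sculpture ranking", we first carry out a serious and earnest (lixinggongshi, that is, going-through-the-motions) round of analysis, mainly looking at the keywords in the questions, starting with a word cloud of all the words:
It seems most of these questions stem from people's exploration of life; otherwise "why", "if", and "what to do" would not appear so often. Surprisingly, "experience", Zhihu's signature tag, does not show up much. Perhaps out of respect for Zhihu, questions about an "experience" are simply not asked in such a "sand sculpture" way.
Next, we remove these function words and look at the result again: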
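The article does not show the code behind its word clouds. As a rough sketch of the filtering step only, the snippet below counts keywords after dropping a stop-word list. It assumes the titles have already been split into words (real Chinese titles would need a word segmenter first), and the stop-word list, the function name `keyword_counts`, and the sample tokens are all made up for illustration.

```python
from collections import Counter

# Hypothetical stop-word list: "why", "if", "what to do".
# The article names these words but does not give its full list.
STOP_WORDS = {'为什么', '如果', '怎么办'}


def keyword_counts(tokens, stop_words=STOP_WORDS):
    """Count tokens, skipping stop words and empty strings."""
    return Counter(t for t in tokens if t and t not in stop_words)


# Made-up, pre-segmented tokens: why, Earth, universe, if, Earth, boyfriend.
tokens = ['为什么', '地球', '宇宙', '如果', '地球', '男朋友']
counts = keyword_counts(tokens)

print(counts.most_common(1))
# [('地球', 2)]
```

A mapping of word to frequency like `counts` is the usual input for a word-cloud renderer, which is why the filtering is done before counting rather than on the finished image.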
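Judging from this image, readers' interests are quite polarized. On the one hand they care about boyfriend and girlfriend questions of the "you're cold, you're heartless, you're being unreasonable" kind; on the other they care about the universe, the Earth, and other matters that concern all of humanity. That fits Zhihu's persona of "everyone went to a 985 university and everyone earns over a million" very well.
Both of these images are actually shaped after a meme sticker; I wonder whether you spotted it: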
Well, not being able to see it is actually normal. Anyone who can see it should probably go and post a question on Zhihu right now, and they will make the list in the next edition. Finally, here is a word cloud made from some of the questions:
I don't know whether you can make it out. To be honest, I can't either, and it was never really meant to be legible; its real purpose is to lead into the ranking below.
Top Sand Sculpture Questions
By combining each question's view count, follower count, answer count, follower ratio, and answer ratio, we obtain a traffic index and a novelty index, and from these an overall score, as shown below:
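The article does not give the formulas behind its indices, so the following is only one possible way to combine such columns, not the author's method. It assumes min-max scaling, equal weights, ratios taken relative to the view count, and a view count above zero for every row; the names `min_max`, `score_questions`, `traffic_index`, and `ratio_index`, and the sample rows, are all hypothetical.

```python
def min_max(values):
    """Scale numbers to [0, 1]; a constant list becomes all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def score_questions(rows):
    """rows: non-empty list of dicts with title, visit, follower, answer."""
    visit = min_max([r['visit'] for r in rows])
    follower = min_max([r['follower'] for r in rows])
    answer = min_max([r['answer'] for r in rows])
    # ASSUMPTION: both ratios are relative to the view count (visit > 0).
    follower_ratio = min_max([r['follower'] / r['visit'] for r in rows])
    answer_ratio = min_max([r['answer'] / r['visit'] for r in rows])

    scored = []
    for i, r in enumerate(rows):
        # ASSUMPTION: equal weights everywhere; the article states none.
        traffic = (visit[i] + follower[i] + answer[i]) / 3
        ratio = (follower_ratio[i] + answer_ratio[i]) / 2
        scored.append({**r, 'traffic_index': traffic,
                       'ratio_index': ratio,
                       'score': (traffic + ratio) / 2})
    return sorted(scored, key=lambda r: r['score'], reverse=True)


# Made-up rows for illustration only.
sample_rows = [
    {'title': 'A', 'visit': 1000, 'follower': 10, 'answer': 5},
    {'title': 'B', 'visit': 100, 'follower': 10, 'answer': 5},
    {'title': 'C', 'visit': 500, 'follower': 1, 'answer': 1},
]

ranking = score_questions(sample_rows)
print([r['title'] for r in ranking])
# ['B', 'A', 'C']
```

In this sketch question B outranks A despite having a tenth of the views, because it attracts the same followers and answers from far fewer visitors; whether the article weights engagement this heavily is not stated.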