Task
Inspired by a classmate's idea: by digging into Weibo reposts and their repost counts, identify the factors that most influence how far a node's post spreads.
Of course, demonstrating this convincingly requires data at a very large scale; in the course of tackling the problem, the project gradually evolved into per-blogger data analysis.
Process
Angle: the influence of the posts a blogger reposts
Building on the success of the previous article, essentially any Weibo content can be fetched.
Data structure
As a first trial for the project, a small set of high-value fields is selected for analysis:
| | Post data | Blogger data |
|---|---|---|
| Repost | repost count, comment count, like count, posting time, abstract | follower count, personal info (known by default, not recorded) |
| Original | | id, nickname, personal info (optional)* |

\* Personal information: gender and location by default; the complete profile also covers constellation, university, company, etc.
Data collection
According to various articles and references, Weibo's mobile pages load their content asynchronously.
Use the browser's inspect tool (F12), switch to the mobile view, and refresh the page (F5) to find the URL that returns the needed content.
This exposes the API endpoint that serves the post data and its details.
The response is in JSON format; the key fields are selected by constructing a dictionary.
To fetch all posts of a given blogger, URLs carrying this information must be constructed. Observation shows a uniform format:
https://m.weibo.cn/api/container/getIndex?containerid=230413<blogger id>_-_WEIBO_SECOND_PROFILE_WEIBO&page_type=03&page=<page number>
So a series of URLs can be constructed:
```python
### All Weibo posts of one account
def contentURL(id, pages):
    urls = []
    for page in pages:
        if page != 0:   # skip page 0
            urls.append('https://m.weibo.cn/api/container/getIndex?containerid=230413'
                        + str(id) + '_-_WEIBO_SECOND_PROFILE_WEIBO&page_type=03&page='
                        + str(page))
    return urls
```
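As a quick check, the function above (restated here so the snippet is self-contained) skips page 0 and yields one URL per remaining page. The id 2418542712 is the 平安武汉 account that appears in the sample data below.

```python
def contentURL(id, pages):
    urls = []
    for page in pages:
        if page != 0:   # skip page 0
            urls.append('https://m.weibo.cn/api/container/getIndex?containerid=230413'
                        + str(id) + '_-_WEIBO_SECOND_PROFILE_WEIBO&page_type=03&page='
                        + str(page))
    return urls

urls = contentURL(2418542712, range(0, 3))
print(len(urls))   # 2 URLs: page 0 was skipped
print(urls[0])
```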
So that the retrieved fields can be adjusted flexibly, and so that the stored lists keep a standard layout, a dictionary is defined up front with the needed fields set to True:
```python
# Which post fields to collect, and their order
blogRangeDict = {
    'visible': False,            # {type: 0, list_id: 0}
    # posting time
    'created_at': True,          # "20分钟前"
    'id': False,                 # "4466073829119710"
    'idstr': False,              # "4466073829119710"
    'mid': False,                # "4466073829119710"
    'can_edit': False,           # false
    'show_additional_indication': False,  # 0
    # post text
    'text': True,                # "【情况通报】2019年12月31日,武汉市卫健部门发布关于肺炎疫情的情况通报。
    'textLength': False,         # 452
    'source': False,             # "360安全浏览器"
    'favorited': False,          # false
    'pic_types': False,          # ""
    'is_paid': False,            # false
    'mblog_vip_type': False,     # 0
    'user': False,               # {id: 2418542712, screen_name: "平安武汉",…}
    # repost, comment, and like counts
    'reposts_count': True,       # 1035
    'comments_count': True,      # 1886
    'attitudes_count': True,     # 7508
    'pending_approval_count': False,  # 0
    'isLongText': False,         # true
    'reward_exhibition_type': False,  # 0
    'hide_flag': False,          # 0
    'mblogtype': False,          # 0
    'more_info_type': False,     # 0
    'cardid': False,             # "star_11247_common"
    'content_auth': False,       # 0
    'pic_num': False,            # 0
    # value displayed when a field is missing:
    'infoNoExist': '未知'
}
```
The acquired JSON data can then be filtered through the dictionary defined above into the required list of values, which is written to a CSV file for storage.
```python
# Turn a dict (parsed JSON) into the required list of field values
# infoDict: the information as a dict (from JSON)
# rangeDict: the required fields (e.g. blogRangeDict, userRangeDict)
def getInfoList(infoDict, rangeDict):
    infoList = []
    for item in rangeDict:
        if rangeDict.get(item) is True:
            content = infoDict.get(item)
            if content is not None:
                infoList.append(content)
            else:
                infoList.append(rangeDict['infoNoExist'])
    return infoList
```
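A miniature example makes the filtering concrete. Both the shortened range dict and the sample record here are hypothetical, and the function is restated with the missing-field fallback so the snippet runs on its own:

```python
# Hypothetical shortened range dict: only True fields are emitted, in dict order
blogRangeDict = {'created_at': True, 'text': True, 'id': False,
                 'reposts_count': True, 'infoNoExist': '未知'}

def getInfoList(infoDict, rangeDict):
    infoList = []
    for item in rangeDict:
        if rangeDict.get(item) is True:
            content = infoDict.get(item)
            # fall back to the placeholder when the record lacks the field
            infoList.append(content if content is not None else rangeDict['infoNoExist'])
    return infoList

sample = {'created_at': '20分钟前', 'reposts_count': 1035}   # no 'text' field
row = getInfoList(sample, blogRangeDict)
print(row)   # ['20分钟前', '未知', 1035]
```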
The CSV header row is constructed similarly:
```python
# Build the CSV header row
# rangeDict: the required fields (e.g. blogRangeDict, userRangeDict)
# prefix: a prefix for each column name, to avoid duplicate names
def getInfoTitle(rangeDict, prefix):
    titleList = []
    for item in rangeDict:
        if rangeDict.get(item) is True:
            titleList.append(prefix + item)
    return titleList
```
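Using the same hypothetical shortened range dict, the header builder yields one prefixed column name per True field, so header columns line up with the value lists:

```python
# Hypothetical shortened range dict, for illustration only
blogRangeDict = {'created_at': True, 'text': True, 'id': False,
                 'reposts_count': True, 'infoNoExist': '未知'}

def getInfoTitle(rangeDict, prefix):
    titleList = []
    for item in rangeDict:
        if rangeDict.get(item) is True:
            titleList.append(prefix + item)
    return titleList

header = getInfoTitle(blogRangeDict, '转发')
print(header)   # ['转发created_at', '转发text', '转发reposts_count']
```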
Now crawl the data through the series of URLs already constructed, writing each row to a CSV file.

* Note: pass csvWriter in as a csv writer object, for example:

```python
fp = open(fileAddress, 'w+', newline='', encoding='utf-16')
writer = csv.writer(fp)
reRatio(urls, writer)
# ……
```

If the crawl reaches the end (a page with no post content), reRatio returns False and the program terminates; otherwise it returns True, so the next batch of URLs can be constructed and crawled in turn.
```python
### Operate over an existing series of URLs,
### keeping only reposted posts for processing
def reRatio(urls, csvWriter):
    notEnd = True
    # Column headers
    retweetBlogTitle = getInfoTitle(blogRangeDict, '转发')  # repost: post fields
    retweetUserTitle = getInfoTitle(userRangeDict, '转发')  # repost: blogger fields
    originBlogTitle = getInfoTitle(blogRangeDict, '原文')   # original: post fields
    originUserTitle = getInfoTitle(userRangeDict, '原文')   # original: blogger fields
    infoTitle = getInfoTitle(infoRangeDict, '')             # original blogger's profile fields
    # Write the header row
    if getConcreteInfoList is True:  # global switch: also collect profile-page info
        csvWriter.writerow(retweetBlogTitle + retweetUserTitle
                           + originBlogTitle + originUserTitle + infoTitle)
    else:
        csvWriter.writerow(retweetBlogTitle + retweetUserTitle
                           + originBlogTitle + originUserTitle)
    for url in urls:
        response = requests.get(url, headers=headers)
        resjson = json.loads(response.text)
        cards = resjson['data']['cards']
        # Stop when the last page is reached (only one card left)
        if len(cards) == 1:
            notEnd = False
            break
        # Walk every post on this page
        for card in cards:
            try:
                # Repost: post and blogger info
                retweetBlogInfoDict = card['mblog']
                retweetUserInfoDict = retweetBlogInfoDict['user']
                # Keep only posts that are reposts
                try:
                    originBlogInfoDict = retweetBlogInfoDict['retweeted_status']
                    if originBlogInfoDict is not None:
                        # Original post and its blogger's info
                        originUserInfoDict = originBlogInfoDict['user']
                        retweetUserID = retweetUserInfoDict['id']
                        originUserID = originUserInfoDict['id']
                        ### Process only if it is not a self-repost
                        if retweetUserID != originUserID:
                            infoList = []
                            # Repost: post data
                            infoList += getInfoList(retweetBlogInfoDict, blogRangeDict)
                            # Repost: blogger data (known by default)
                            infoList += getInfoList(retweetUserInfoDict, userRangeDict)
                            # Original: post data
                            infoList += getInfoList(originBlogInfoDict, blogRangeDict)
                            # Original: blogger data
                            infoList += getInfoList(originUserInfoDict, userRangeDict)
                            # originUserID is the original account's ID;
                            # extra profile info can be collected for it here
                            if getConcreteInfoList is True:
                                infoDict = getInfo(isLogin, originUserID)
                                infoList += getInfoList(infoDict, infoRangeDict)
                            # Save the row to the CSV file
                            csvWriter.writerow(infoList)
                except:
                    pass
            except:
                pass
        # Delay between pages to avoid anti-crawler measures
        time.sleep(3)
    return notEnd
```
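The filtering logic in the loop above — keep a card only if it carries a `retweeted_status` whose author differs from the reposter — can be checked offline on hypothetical card dicts that mimic the JSON structure:

```python
# Hypothetical minimal 'cards', mimicking the JSON structure described above
cards = [
    {'mblog': {'id': 'a', 'user': {'id': 1},
               'retweeted_status': {'id': 'b', 'user': {'id': 2}}}},  # a genuine repost
    {'mblog': {'id': 'c', 'user': {'id': 1}}},                        # an original post
    {'mblog': {'id': 'd', 'user': {'id': 1},
               'retweeted_status': {'id': 'e', 'user': {'id': 1}}}},  # a self-repost
]

reposts = []
for card in cards:
    mblog = card.get('mblog', {})
    origin = mblog.get('retweeted_status')
    # keep only reposts of someone else's post
    if origin is not None and mblog['user']['id'] != origin['user']['id']:
        reposts.append(mblog['id'])

print(reposts)   # ['a']
```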
Main program: given a blogger's id, run the complete data-collection process and save the results to a local CSV file.
```python
def downloadData(id):
    tweeter = getExatInfo('昵称', 2, int(id))  # look up the blogger's nickname by id
    batch = 0
    while True:
        fileAddr = addrFile(tweeter, 'batch' + str(batch))
        if os.path.exists(fileAddr):
            print(tweeter + ': file already exists, skipping collection')
        else:
            print('Writing file: ' + fileAddr)
            fp = open(fileAddr, 'w+', newline='', encoding='utf-16')
            writer = csv.writer(fp)
            if reRatio(contentURL(id, range(20 * batch, 20 * (batch + 1))), writer) is False:
                fp.close()
                break
            fp.close()
        print('Batch ' + str(batch) + ' recorded')
        batch += 1
```
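The batching arithmetic above (20 page numbers per batch, with page 0 dropped by contentURL) can be sanity-checked in isolation; `batchPages` is a hypothetical helper, not part of the original script:

```python
# Batch n covers page numbers 20*n .. 20*(n+1)-1; page 0 is dropped
def batchPages(batch):
    return [p for p in range(20 * batch, 20 * (batch + 1)) if p != 0]

print(len(batchPages(0)), batchPages(0)[0])   # 19 pages, starting at page 1
print(len(batchPages(1)), batchPages(1)[0])   # 20 pages, starting at page 20
```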
(Figure: a sample run)
(Figure: the data obtained)
Data processing
The key to this stage is visualization; everything the data can say is best shown in pictures.
After the crawled data is read back in, the interesting points in the numbers are brought out through plots.
(Figure: data on the blogger's original posts)
(Figure: data on the posts the blogger reposted)
To curb traffic fraud, Weibo caps the displayed repost and comment counts at 1,000,000, which is visible in the plots; another caveat is that emoji in post text cannot be rendered.
(Figure: comparison of the blogger's original and reposted post data)
(Figure: the ratio of each metric to follower count)
The point here is to observe how strongly each metric is associated with follower count, and thus roughly gauge the degree of data fraud.
It is easy to see that repost count correlates most weakly.
This also explains why a blogger's reposted posts were chosen for study: once a post has been reposted, the value of reposting it again declines.
At the same time, a blogger's ability to spread information, i.e. influence, should also be related to follower count, if follower count is taken to represent influence.
Therefore, a repost count that wanders far above the rest of the data is highly suspect as fraud: how many followers would mindlessly repost a piece of useless information?
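To make "weakest correlation" concrete, one could compute a Pearson correlation coefficient between follower count and each metric. A minimal sketch on synthetic numbers (not the actual crawled data):

```python
import math

# Pearson correlation coefficient between two equal-length sequences
def pearson(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Synthetic illustration: likes track followers, reposts wander independently
fans    = [100, 200, 300, 400, 500]
likes   = [10, 22, 29, 41, 48]
reposts = [5, 400, 2, 350, 8]
print(pearson(fans, likes))    # close to 1
print(pearson(fans, reposts))  # close to 0
```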
Typical bloggers
Below, several different types of bloggers are compared:
Wu Jing
Only a small amount of data.
Vista看天下
This set of data shows the strongest correlation.
Huazhong University of Science and Technology
The repost counts allow the impact of extreme posts to be analyzed; looking at the specific case, the outlier is a post from Martyrs' Day.
With the few extreme cases removed, the fit for repost count did not improve, but the fits for comment and like counts improved markedly.
Kobe Bryant
As a foreign sports star, his Weibo data peaked in the first wave of attention, reflecting his popularity, with occasional smaller waves afterwards.