Analyzing the Factors That Affect Weibo Retweet Counts with a Web Crawler

task

Inspired by a classmate's great idea: by crawling Weibo retweets, find which factors of a node have the most influence on its retweet count.
Of course, this needs large-scale data to be convincing; in the course of working on the problem, the project gradually evolved into a per-blogger data analysis.

process

Angle: the influence of the posts a blogger retweets

Building on the success of a previous article, basically any Weibo post can now be retrieved.

data structure

As a first test of the project, a small amount of high-value information is selected for analysis:

| | Post data | Blogger data |
| --- | --- | --- |
| Retweet | retweet count, comment count, like count, posting time, summary | follower count, personal information (assumed known, not recorded) |
| Original | same fields as above | id, nickname, personal information (optional)* |

* Personal information: either just gender and location, or a full profile: gender, location, zodiac sign, university, company, etc.

data collection

According to various articles and books, the Weibo mobile site loads its pages asynchronously.

Use the browser's inspect function (F12), switch to the mobile view, then refresh the page (F5) to find the URL that returns the content we need.

(Figure: schematic of the browser inspection step)

This reveals the API endpoint that returns the posts and their details.
The information arrives in JSON format; the key fields are picked out by constructing a dictionary.
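One detail worth settling up front: the crawler code later in this post calls `requests.get(url, headers=headers)` but never defines `headers`. A minimal sketch of what that dictionary could look like (the User-Agent string is just an example; any common browser UA works):

```python
# Hypothetical request headers; a browser-like User-Agent avoids being
# served the desktop page or rejected outright.
headers = {
    'User-Agent': ('Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) '
                   'AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 '
                   'Mobile/15E148 Safari/604.1')
}
```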

To get all the posts of a given blogger, URLs containing this information must be constructed. Observation shows that they follow a uniform format:

https://m.weibo.cn/api/container/getIndex?containerid=230413{blogger id}_-_WEIBO_SECOND_PROFILE_WEIBO&page_type=03&page={page number}

So a series of URLs is constructed:

### All posts of a given Weibo account
def contentURL(id,pages):
    urls=[]
    for page in pages:
        if page != 0:# skip page 0
            urls.append('https://m.weibo.cn/api/container/getIndex?containerid=230413'
                        +str(id)+'_-_WEIBO_SECOND_PROFILE_WEIBO&page_type=03&page='+str(page))
    return urls
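For a concrete feel of what these URLs look like, here is the string assembled for a hypothetical blogger id and page number (the id 1234567890 is made up):

```python
uid, page = 1234567890, 2  # hypothetical blogger id and page number
url = ('https://m.weibo.cn/api/container/getIndex?containerid=230413' + str(uid)
       + '_-_WEIBO_SECOND_PROFILE_WEIBO&page_type=03&page=' + str(page))
print(url)
```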

To flexibly adjust which fields are retrieved, and to store the data in a standardized list layout, a dictionary is defined up front in which every needed field is set to True:

# Range and ordering of the post fields to collect
blogRangeDict={
'visible': False,#{type: 0, list_id: 0}

# posting time
'created_at': True,#"20分钟前"

'id': False,#"4466073829119710"
'idstr': False,#"4466073829119710"
'mid': False,#"4466073829119710"
'can_edit': False,#false
'show_additional_indication': False,#0

# post text
'text': True,#"【情况通报】2019年12月31日,武汉市卫健部门发布关于肺炎疫情的情况通报。

'textLength': False,#452
'source': False,#"360安全浏览器"
'favorited': False,#false
'pic_types': False,#""
'is_paid': False,#false
'mblog_vip_type': False,#0
'user': False,#{id: 2418542712, screen_name: "平安武汉",…}

# retweet, comment, and like counts
'reposts_count': True,#1035
'comments_count': True,#1886
'attitudes_count': True,#7508

'pending_approval_count': False,#0
'isLongText': False,#true
'reward_exhibition_type': False,#0
'hide_flag': False,#0
'mblogtype': False,#0
'more_info_type': False,#0
'cardid': False,#"star_11247_common"
'content_auth': False,#0
'pic_num': False,#0

# value recorded when a field has no data:
'infoNoExist':'未知'
}

The acquired JSON data can then be filtered through the dictionary defined above, flattened into a list, and finally stored in CSV format.

# Convert dictionary-typed (json) information into the required list of fields
# infoDict: the information as a dictionary (json)
# rangeDict: the wanted fields (e.g. blogRangeDict, userRangeDict)
def getInfoList(infoDict,rangeDict):
    infoList=[]
    for item in rangeDict:
        if rangeDict.get(item) is True:
            content=infoDict.get(item)
            if content is None:
                # use the placeholder when the field is absent,
                # so the row stays aligned with the header
                content=rangeDict['infoNoExist']
            infoList.append(content)
    return infoList

Similarly, construct the CSV header row:

# Build the csv header row
# rangeDict: the wanted fields (e.g. blogRangeDict, userRangeDict)
# prefix: prefix for each header name, to avoid duplicate column names
def getInfoTitle(rangeDict,prefix):
    titleList=[]
    for item in rangeDict:
        if rangeDict.get(item) is True:
            titleList.append(prefix+item)
    return titleList
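To see how the header row and the data rows line up, here is a trimmed, self-contained sketch; this `blogRangeDict` is a hypothetical subset of the full dictionary above, and the sample post deliberately lacks a `text` field:

```python
# Trimmed range dictionary (hypothetical subset of the full one above)
blogRangeDict = {
    'created_at': True,
    'text': True,
    'reposts_count': True,
    'id': False,
    'infoNoExist': '未知',
}

def getInfoList(infoDict, rangeDict):
    infoList = []
    for item in rangeDict:
        if rangeDict.get(item) is True:
            content = infoDict.get(item)
            if content is None:
                content = rangeDict['infoNoExist']  # placeholder for missing fields
            infoList.append(content)
    return infoList

def getInfoTitle(rangeDict, prefix):
    return [prefix + item for item in rangeDict if rangeDict.get(item) is True]

post = {'created_at': '20分钟前', 'reposts_count': 1035}  # 'text' is missing
print(getInfoTitle(blogRangeDict, '原文'))  # ['原文created_at', '原文text', '原文reposts_count']
print(getInfoList(post, blogRangeDict))     # ['20分钟前', '未知', 1035]
```

Note that the header and the row have the same length, so the CSV columns stay aligned even when a field is missing.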

Now crawl through the series of URLs already constructed and write the data to a CSV file.
* Note: csvWriter is passed in as a csv.writer object, for example:

fp = open(fileAddress,'w+',newline='',encoding='utf-16')
writer=csv.writer(fp)
reRatio(urls,writer)
……

If the crawler reaches the end (a page with no more content), it returns False and the program terminates; otherwise it returns True, making it easy to construct the next batch of URLs and crawl again.

### Operate over an existing series of urls
### Filter out the retweeted posts and process them
def reRatio(urls,csvWriter):
    notEnd= True
    # Build the header rows
    retweetBlogTitle=getInfoTitle(blogRangeDict,'转发')# headers for the retweeting post
    retweetUserTitle=getInfoTitle(userRangeDict,'转发')# headers for the retweeting blogger
    
    originBlogTitle=getInfoTitle(blogRangeDict,'原文')# headers for the original post
    originUserTitle=getInfoTitle(userRangeDict,'原文')# headers for the original blogger
    infoTitle=getInfoTitle(infoRangeDict,'')# headers for the original blogger's profile page
    
    # Write the table header
    if getConcreteInfoList is True:# global switch: also collect profile-page details
        csvWriter.writerow(retweetBlogTitle+retweetUserTitle+originBlogTitle+originUserTitle+infoTitle)        
    else:
        csvWriter.writerow(retweetBlogTitle+retweetUserTitle+originBlogTitle+originUserTitle)
        
    for url in urls:        
        
        response = requests.get(url,headers=headers)
        resjson = json.loads(response.text)    
        cards=resjson['data']['cards']      
        
        # last page reached: stop
        if(len(cards)==1):
            notEnd=False
            break
        # iterate over every post on this page
        for card in cards:
            try:
                # the retweeting post and its blogger
                retweetBlogInfoDict=card['mblog']   
                retweetUserInfoDict=retweetBlogInfoDict['user']              
                    
                # keep only the posts that are retweets
                try:                
                    originBlogInfoDict=retweetBlogInfoDict['retweeted_status']
                    
                    if originBlogInfoDict is not None:                        
                        
                        # the original post and its blogger
                        originUserInfoDict=originBlogInfoDict['user']
                        retweetUserID=retweetUserInfoDict['id']
                        originUserID=originUserInfoDict['id']
                        ### process only posts not retweeted from oneself
                        if(retweetUserID!=originUserID):
                            infoList=[]                            
                            
                            # retweeting post data
                            retweetBlogInfoList=getInfoList(retweetBlogInfoDict,blogRangeDict)               
                            infoList+=retweetBlogInfoList                            
                            # retweeting blogger data
                            ## assumed already known
                            retweetUserInfoList=getInfoList(retweetUserInfoDict,userRangeDict)               
                            infoList+=retweetUserInfoList  
                            # original post data
                            originBlogInfoList=getInfoList(originBlogInfoDict,blogRangeDict)               
                            infoList+=originBlogInfoList
                            # original blogger data
                            originUserInfoList=getInfoList(originUserInfoDict,userRangeDict)               
                            infoList+=originUserInfoList                                           
                            
                            # originUserID is the id of the original account;
                            # extra profile information can be collected here
                            
                            if getConcreteInfoList is True:
                                infoDict=getInfo(isLogin,originUserID)
                                otherInfoList=getInfoList(infoDict,infoRangeDict)      
                                infoList+=otherInfoList                          
                            # save the row to the csv file
                            csvWriter.writerow(infoList)                       
                            
                        # keep going, to keep measuring this blogger's influence
                        #break
                except:
                    pass# not a retweet: skip
            except:
                pass# card without 'mblog' (e.g. an ad card): skip
        # delay, to avoid anti-crawler measures
        time.sleep(3)
        
    return notEnd

Main program: given a blogger's id, perform the complete data collection and save the results to a local CSV file.

def downloadData(id):
    tweeter=getExatInfo('昵称',2,int(id))# look up the blogger's nickname
    batch=0
    while(1):

        fileAddr=addrFile(tweeter,'batch'+str(batch))
        if os.path.exists(fileAddr) is True:
            print(tweeter+' already exists, skipping collection')
        else:
            print('Writing to file: '+fileAddr)
            fp = open(fileAddr,'w+',newline='',encoding='utf-16')
            writer=csv.writer(fp)
            if reRatio(contentURL(id,range(20*batch,20*(batch+1))),writer) is False:
                fp.close()
                break

            fp.close()
            print('Batch '+str(batch)+' recorded')
        batch+=1

The program run looks as follows:
(Figure: run output)

The acquired data:
(Figure: data files; four batches in total)

data processing

The key step is data visualization; the plots tell the whole story.
After reading the crawled data back in, the focus is on expressing the numbers as images.
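Reading a batch file back is straightforward. A minimal sketch, assuming the utf-16 encoding and header row used when writing (the file name and column names here are hypothetical):

```python
import csv, os, tempfile

def readBatch(fileAddr):
    # read one batch file back: header row plus data rows
    with open(fileAddr, newline='', encoding='utf-16') as fp:
        rows = list(csv.reader(fp))
    return rows[0], rows[1:]

# round-trip demo with a tiny hypothetical batch file
demo = os.path.join(tempfile.gettempdir(), 'demo_batch0.csv')
with open(demo, 'w+', newline='', encoding='utf-16') as fp:
    w = csv.writer(fp)
    w.writerow(['转发reposts_count', '原文reposts_count'])
    w.writerow(['1035', '7508'])

header, data = readBatch(demo)
print(header, data)
```

Note that csv stores everything as strings, so the count columns must be converted with `int()` before plotting.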

The blogger's original posts

(Figure)

The blogger's retweeted posts

To discourage traffic fraud, Weibo caps the displayed retweet and comment counts at 1,000,000, which shows up clearly in the plots. One more note: emoji cannot be displayed.
(Figure)

Comparing the blogger's original and retweeted posts

(Figure: data comparison)

The ratio of each metric to the follower count

The point here is to observe how closely each metric tracks the follower count, and thus to roughly gauge the degree of data fraud.
It is easy to see that the retweet count correlates most weakly.
This also explains why the posts a blogger retweets were chosen for study: once a post has already been retweeted, its value for being retweeted again declines.
At the same time, a blogger's ability to spread information is tied to influence, even if the follower count is only a proxy for influence.
Therefore, a retweet count that hovers at a high level is a strong sign of data fraud: how many fans would mindlessly retweet someone else's low-value posts?
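The correlation claim above can be checked numerically. A minimal sketch with a plain Pearson coefficient on made-up numbers (a real analysis would feed in the crawled columns; all values here are hypothetical):

```python
from statistics import mean

def pearson(x, y):
    # plain Pearson correlation coefficient
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# hypothetical per-post data: follower counts of the original accounts
# and the matching metrics of their posts
fans     = [1000, 2000, 4000, 8000]
comments = [110, 190, 420, 790]    # tracks followers closely
reposts  = [900, 50, 7000, 120]    # erratic, barely tied to followers

print(pearson(fans, comments))
print(pearson(fans, reposts))
```

A coefficient near 1 means the metric scales with the audience size; a value near 0, as the retweet column shows here, means the two are essentially unrelated.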
Here Insert Picture Description

Typical bloggers

Below, several different types of bloggers are compared:

Wu Jing

Only a small amount of data.
(Figure: Wu Jing's data)

Vista看天下

Of all the data sets, this one shows the strongest correlations.
(Figure)

Huazhong University of Science and Technology

(Figure)

The retweet counts allow an analysis of the impact of extreme posts; one specific post is worth observing: a post from Martyrs' Day.
(Figure)
After removing a few extreme cases, the retweet fit did not improve, but the fits for the comment and like counts improved noticeably.
(Figure)

Kobe Bryant

(Figure: Bryant's data)
As a foreign sports star, his Weibo data peaked at the first wave of attention, reflecting his popularity, with occasional smaller waves afterwards.
(Figure)


Origin blog.csdn.net/cascara/article/details/104090441