Users, topics, comments in one sweep: sharing one of the most powerful Weibo crawlers

Features

Weibo has always been the go-to place for following trending gossip, and the value of obtaining Weibo data is self-evident, so Weibo crawlers keep appearing: whether you work in operations or in data analysis, you will sooner or later need Weibo data, and many of my friends are no exception. After working on it on and off, I have finished writing what is probably the strongest Weibo crawler I have built to date.

The crawler's functionality is divided into three parts. The first main function is crawling all posts of a specified user (this tab can be opened quickly with the Ctrl+P hotkey); the user can be looked up by nickname, and you can choose whether to fetch only original posts, as shown below.

[Screenshot: the user-crawling tab of the GUI]

The crawled posts are stored in a CSV file whose header includes: post id, post text, picture URLs, publishing location, publishing time, publishing tool, number of likes, number of comments, and number of reposts. In the picture-URL field, all picture URLs of one post are spliced together, separated by commas.

[Screenshot: the CSV file of crawled posts]
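For readers who want a concrete picture of that CSV layout, here is a minimal sketch written with Python's standard csv module; it is not the project's actual write_to_csv code, and the field values and file name are made up for illustration:

# minimal sketch only: writing one post as a CSV row, with all of the post's
# picture URLs spliced into a single comma-separated field
import csv

header = ['weibo_id', 'text', 'picture_urls', 'location', 'time', 'tool', 'likes', 'comments', 'reposts']
pictures = ['https://example.com/pic1.jpg', 'https://example.com/pic2.jpg']  # hypothetical URLs
row = ['IaYZIu0Ko', 'example post text',
       ','.join(pictures),  # picture URLs joined with commas
       'Beijing', '2019-10-11 20:30', 'iPhone', 10, 2, 1]

with open('weibo.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerow(row)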

The second main function is crawling all posts under a specified topic, as shown below.

[Screenshot: the topic-crawling tab of the GUI]

The CSV format of the saved topic posts is basically the same as the format used for user posts.

The third main function is crawling all comments of a post, given the post's id. For example, all comments of the post with id IaYZIu0Ko look like this:

[Screenshot: the crawled comments of post IaYZIu0Ko]

In addition to the crawler's business logic, as you can see above, there is a fairly friendly user interface to make it easy to operate.

Technology Roadmap

The whole project is a bit over 1,000 lines of code. Readers who are not interested in the technical details can skip to the end for the link to get the program.

The crawler part mainly analyses the Weibo pages through Chrome DevTools to work out the request interfaces and their parameters; requests are simulated with the requests library and need to carry cookies. The bulk of my crawler is actually the parsing part, for which I mainly use the lxml library: there is a great deal to parse, and almost every field in the CSV needs its own block of code to extract.
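As a rough illustration of that request-plus-parse pattern (not the project's exact code; the cookie value, the user id, and the example XPath below are placeholders):

# sketch only: fetch a weibo.cn page with a logged-in cookie and parse it with lxml
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0',
    'Cookie': 'PASTE_YOUR_WEIBO_CN_COOKIE_HERE',  # the request must carry a valid cookie
}

res = requests.get('https://weibo.cn/u/1234567890', headers=headers)  # hypothetical user id
html = etree.HTML(res.text.encode('utf-8'))

# in the real crawler almost every CSV field has its own xpath; this is one illustrative query
texts = html.xpath("//span[@class='ctt']/text()")
print(texts[:3])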

The crawler implements three functions: crawling by user, crawling by topic, and crawling all comments of a post. I implemented them with three classes: WeiboUserScrapy, WeiboTopicScrapy, and WeiboCommentScrapy. The three classes have some functions that could be shared, but to reduce coupling between the classes and make packaging easier I chose not to share them, so each class can also run on its own with fewer dependencies.
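A purely hypothetical usage sketch of the three classes follows; the module layout and constructor arguments are my assumptions based on the fields self.wid and self.headers that appear in the snippets later in this post, not the classes' real API:

# all names and arguments below are assumptions for illustration only
from WeiboUserScrapy import WeiboUserScrapy
from WeiboTopicScrapy import WeiboTopicScrapy
from WeiboCommentScrapy import WeiboCommentScrapy

WeiboUserScrapy(user_id='1234567890', only_original=True)  # feature 1: crawl one user's posts
WeiboTopicScrapy(keyword='some topic')                     # feature 2: crawl posts under a topic
WeiboCommentScrapy(wid='IaYZIu0Ko')                        # feature 3: crawl all comments of one post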

Next, the interface module. I used to write interfaces with wxPython, but later studied PyQt5 in depth, so the crawler's interface is written with PyQt5. It mainly uses the ListView model-view pattern, custom signals and slots, and some common widgets.

Crawling is time-consuming, but the interface must not block, so multi-threading is needed, with custom signals acting as the communication bridge between the crawler classes and the interface class: for example, when a crawl starts or finishes, the corresponding signal is sent to the interface class so it can update the UI.
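Below is a minimal PyQt5 sketch of that thread-plus-signal bridge, not the project's actual interface code: a crawler thread emits custom signals when it starts and finishes, and the interface updates in the connected slots.

# sketch only: a worker QThread notifies the GUI through custom signals
import sys
from PyQt5.QtCore import QThread, pyqtSignal
from PyQt5.QtWidgets import QApplication, QLabel

class CrawlerThread(QThread):
    started_signal = pyqtSignal(str)   # emitted when the crawl starts
    finished_signal = pyqtSignal(str)  # emitted when the crawl finishes

    def run(self):
        self.started_signal.emit('crawl started')
        # ... time-consuming crawling work would go here ...
        self.finished_signal.emit('crawl finished')

app = QApplication(sys.argv)
label = QLabel('idle')
label.show()

worker = CrawlerThread()
worker.started_signal.connect(label.setText)   # slots update the interface
worker.finished_signal.connect(label.setText)
worker.start()

sys.exit(app.exec_())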

There are still imperfections: apart from a progress box and printed output for the background tasks, there is no other visual view of them, and scheduling between tasks is simply first come, first served. Later I plan to write a custom scheduler class to support pausing, resuming, prioritising and other smarter scheduling for each kind of task, along with a more advanced visual interface.

Core code explained

Take the WeiboCommentScrapy class as an example. First, a regular expression is used to get the total number of comments:

# assumes requests and re are imported; self.wid is the post id, self.headers carries the cookie
res = requests.get('https://weibo.cn/comment/{}'.format(self.wid), headers=self.headers, verify=False)
# match the "评论[N]" counter on the page, then strip the surrounding characters to get N
commentNum = re.findall("评论\[.*?\]", res.text)[0]
commentNum = int(commentNum[3:len(commentNum)-1])

Then the comment count is turned into a page count (weibo.cn shows 10 comments per page):

pageNum = ceil(commentNum / 10)  # ceil comes from the math module

Then come two loops: the outer loop iterates over the pages, the inner loop iterates over the comments on each page, and finally each comment is parsed:

# etree comes from lxml, sleep from time, randint from random
for page in range(pageNum):

    result = []

    res = requests.get('https://weibo.cn/comment/{}?page={}'.format(self.wid, page+1), headers=self.headers, verify=False)

    html = etree.HTML(res.text.encode('utf-8'))

    # every comment node is a div whose id starts with 'C'
    comments = html.xpath("/html/body/div[starts-with(@id,'C')]")

    print('第{}/{}页'.format(page+1, pageNum))  # progress: page X of Y

    for i in range(len(comments)):
        result.append(self.get_one_comment_struct(comments[i]))

    # write the header row only for the first page
    if page == 0:
        self.write_to_csv(result, isHeader=True)
    else:
        self.write_to_csv(result, isHeader=False)

    # sleep 1-5 seconds to avoid getting banned
    sleep(randint(1, 5))

Note the inner loop: each page appears to hold 10 comments, but in practice that is not always true. For example, the first page includes hot comments and can exceed 10, and the last page may have fewer than 10, so the inner loop uses for i in range(len(comments)): rather than for i in range(10):. The inner loop also calls the function get_one_comment_struct(), whose job is to parse the data we want out of each comment element obtained via XPath. It in turn calls several custom parsing functions; for example, the time is returned as a string such as "xx minutes ago" or "just now", and we have to do some string processing to get a concrete timestamp.
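As an illustration only (not the project's actual helper), converting those relative time strings into concrete timestamps could look roughly like this:

# sketch only: turn weibo.cn relative times such as "5分钟前" (5 minutes ago)
# or "刚刚" (just now) into absolute timestamps
import re
from datetime import datetime, timedelta

def parse_weibo_time(raw):
    now = datetime.now()
    if raw.startswith('刚刚'):
        return now.strftime('%Y-%m-%d %H:%M')
    m = re.match(r'(\d+)分钟前', raw)
    if m:
        return (now - timedelta(minutes=int(m.group(1)))).strftime('%Y-%m-%d %H:%M')
    return raw  # already an absolute time, leave it unchanged

print(parse_weibo_time('5分钟前'))
print(parse_weibo_time('刚刚'))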

Since all of the crawler's parsing is done with XPath via the lxml library, in which I picked up quite a few practical tricks, I will cover them in detail in a follow-up post.

How to get it

To make the crawler easy to use, I have packaged it into an exe, so no Python environment is required. You only need to reply "微博爬虫" (Weibo crawler) in the background of my WeChat official account to get it. Because Weibo's interfaces change from time to time, I will upgrade the program regularly; to avoid losing touch, you can also reply in the background to join the WeChat group and get updates in time.

WeChat official account: 月小水长

[Image: WeChat official account QR code]



Original post: blog.csdn.net/ygdxt/article/details/102508628