A Sina Weibo crawler that can grab 13 million records a day


Crawler functions:

  • This project is similar to the QQ space crawler; it mainly crawls Sina Weibo users' personal information, Weibo posts, fans, and followers (see here for details).

  • The code logs in by obtaining Sina Weibo cookies. Logging in with multiple accounts helps evade Sina's anti-crawling measures (the login accounts can be bought on Taobao, about seven for one yuan).

  • The project crawls the Sina Weibo WAP site, whose pages have a simpler structure, load faster, and carry weaker anti-crawling defenses.

  • Crawling speed can exceed 13 million Weibo posts per day, depending on network conditions. I used a campus network (Guangzhou University of Technology); an ordinary home connection may reach only half that speed, or less.

Environment, Architecture:

Development language: Python 2.7 
Development environment: 64-bit Windows 8, 4 GB RAM, Intel Core i7-3612QM processor. 
Database: MongoDB 3.2.0 
(Python editor: PyCharm 5.0.4; MongoDB management tool: MongoBooster 1.1.1)

  • The project is built on the Scrapy crawler framework.

  • A downloader middleware randomly picks a cookie from the cookie pool and a User-Agent from the User-Agent pool and attaches them to each request (see the middleware sketch after this list).

  • In start_requests, four Requests are issued per user ID, so personal information, Weibo posts, followers, and fans are crawled in parallel (see the spider sketch after this list).

  • Newly discovered follower and fan IDs are de-duplicated first and then added to the to-crawl queue, as shown in the spider sketch below.
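
The cookie/User-Agent rotation can be expressed as a small Scrapy downloader middleware. The sketch below is illustrative rather than the project's actual code: the pool contents are placeholders, and in the real project the cookie pool would be filled from the accounts configured in cookies.py.

    # -*- coding: utf-8 -*-
    import random

    # Placeholder pools; in practice the cookie pool is built by logging in
    # with each account from cookies.py, and the User-Agent pool holds
    # common browser UA strings.
    COOKIE_POOL = [
        {'SUB': 'cookie-for-account-1'},
        {'SUB': 'cookie-for-account-2'},
    ]
    USER_AGENT_POOL = [
        'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36',
        'Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X)',
    ]

    class RandomCookieUserAgentMiddleware(object):
        """Attach a random cookie and User-Agent to every request so the
        load is spread across the logged-in accounts."""

        def process_request(self, request, spider):
            request.cookies = random.choice(COOKIE_POOL)
            request.headers['User-Agent'] = random.choice(USER_AGENT_POOL)

Register the class under DOWNLOADER_MIDDLEWARES in settings.py, with a priority below 700 so it runs before Scrapy's built-in CookiesMiddleware.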
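
Likewise, a minimal sketch of the spider side, assuming the weibo.cn URL patterns shown here (the real URLs and parsing logic live in the project's spider code):

    # -*- coding: utf-8 -*-
    from scrapy import Spider, Request

    class SinaSpider(Spider):
        name = 'SinaSpider'
        host = 'https://weibo.cn'        # WAP site; URL patterns are illustrative
        start_uids = ['1234567890']      # seed user IDs (placeholders)
        seen_uids = set(start_uids)      # de-duplication set for queued IDs

        def start_requests(self):
            # Four requests per user ID: profile, tweets, follows, fans.
            for uid in self.start_uids:
                yield Request('%s/%s/info' % (self.host, uid), callback=self.parse_information)
                yield Request('%s/%s/profile?filter=1' % (self.host, uid), callback=self.parse_tweets)
                yield Request('%s/%s/follow' % (self.host, uid), callback=self.parse_follows)
                yield Request('%s/%s/fans' % (self.host, uid), callback=self.parse_fans)

        def parse_information(self, response):
            pass  # parse the profile page into an Information record

        def parse_tweets(self, response):
            pass  # parse posts into Tweets records, follow pagination

        def parse_follows(self, response):
            pass  # extract followed-user IDs, queue them as in parse_fans

        def parse_fans(self, response):
            # Extract fan IDs, de-duplicate, then schedule them for crawling.
            for uid in []:  # placeholder for IDs scraped from the page
                if uid not in self.seen_uids:
                    self.seen_uids.add(uid)
                    yield Request('%s/%s/info' % (self.host, uid), callback=self.parse_information)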

Instructions for use:

Configuration before startup:

  • MongoDB can be started right after installation; no extra configuration is required.

  • Install Scrapy for Python (with 64-bit Python, prefer 64-bit builds of dependent modules).

  • Other Python modules used: pymongo and requests (json and base64 come with the standard library).

  • Add the Weibo accounts and passwords used for logging in to the cookies.py file; two accounts are already there as a format reference (see the sketch after this list).

  • Other Scrapy settings (download delay, log level, number of concurrent requests, etc.) can be adjusted in settings.py as needed (an example snippet follows this list).
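
For reference, the idea behind cookies.py can be sketched as follows. The account list shows the expected format; the login endpoint and form fields are placeholders, since Weibo's real login flow changes over time (it typically involves base64-encoding the username, which is why the project depends on base64 and requests):

    # -*- coding: utf-8 -*-
    import base64
    import json
    import requests

    ACCOUNTS = [  # two accounts as a format reference, as in cookies.py
        {'user': 'account1@example.com', 'pwd': 'password1'},
        {'user': 'account2@example.com', 'pwd': 'password2'},
    ]

    def fetch_cookie(user, pwd):
        # Log one account in and return its cookies for the cookie pool.
        session = requests.Session()
        data = {
            'username': base64.b64encode(user.encode()).decode(),
            'password': pwd,
        }
        session.post('https://login.weibo.example/login', data=data)  # placeholder URL
        return session.cookies.get_dict()

    COOKIE_POOL = [fetch_cookie(a['user'], a['pwd']) for a in ACCOUNTS]
    print(json.dumps(COOKIE_POOL, indent=2))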
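
Typical settings.py knobs look like this (the values are examples, not the project's defaults, and the middleware path assumes the class sketched earlier):

    # settings.py
    DOWNLOAD_DELAY = 0.25        # interval between requests, in seconds
    CONCURRENT_REQUESTS = 64     # number of parallel requests
    LOG_LEVEL = 'INFO'           # use DEBUG when troubleshooting
    DOWNLOADER_MIDDLEWARES = {
        # enable the cookie/User-Agent rotation middleware
        'Sina.middlewares.RandomCookieUserAgentMiddleware': 401,
    }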

Runtime screenshots: (images omitted)

Database Description:

SinaSpider mainly crawls Sina Weibo users' personal information, Weibo posts, followers, and fans. The database contains four collections: Information, Tweets, Follows, and Fans. Only the fields of the first two are described here.

Information table:  
_id: the user ID, used as the unique identifier. 
Birthday: date of birth. 
City: city of residence. 
Gender: gender. 
Marriage: marital status. 
NickName: Weibo nickname. 
NumFans: number of fans. 
NumFollows: number of users followed. 
NumTweets: number of Weibo posts published. 
Province: province of residence. 
Signature: personal signature line. 
URL: the user's Weibo homepage.


Tweets table:  
_id: "user ID-Weibo ID", used as the unique identifier of a post. 
Coordinates: the latitude/longitude recorded when the post was made; passed to a map API, they reveal the exact location, down to the building. 
Comment: number of comments on the post. 
Content: text content of the post. 
ID: user ID of the author. 
Like: number of likes on the post. 
PubTime: publication time of the post. 
Tools: client used to post (phone model or platform). 
Transfer: number of times the post was reposted.
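
With the schema above, the data can be queried straight from pymongo. A minimal sketch, assuming the database is named Sina and the user ID is stored in the _id field:

    # -*- coding: utf-8 -*-
    import pymongo

    client = pymongo.MongoClient('localhost', 27017)
    db = client['Sina']  # assumed database name

    # One user's profile, looked up by user ID.
    info = db['Information'].find_one({'_id': '1234567890'})

    # That user's ten most recent posts.
    for tweet in db['Tweets'].find({'ID': '1234567890'}) \
                             .sort('PubTime', pymongo.DESCENDING).limit(10):
        print(tweet['PubTime'], tweet['Content'])

    # The five most-followed users crawled so far.
    top = db['Information'].find().sort('NumFans', pymongo.DESCENDING).limit(5)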
