A collection of 30 practical Python crawler projects

Reminder: since this article contains a large number of external links, you are strongly encouraged to click "Read the original text" to read and bookmark it. :)

WechatSogou

https://github.com/Chyroc/WechatSogou

WeChat Official Account crawler. Built on the Sogou WeChat search interface, it can be extended into a general crawler for Sogou search. Results are returned as a list, where each item is a dictionary of one official account's details.
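
A minimal usage sketch, following the API shown in the project's README (method and field names may differ between versions):

```python
# Usage sketch of WechatSogou, following the project's README;
# method names may differ between versions.
import wechatsogou

ws_api = wechatsogou.WechatSogouAPI()

# Search official accounts by keyword; each result is a dictionary
# of that account's details.
for gzh in ws_api.search_gzh('python'):
    print(gzh)
```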

DouBanSpider

https://github.com/lanbing510/DouBanSpider

Douban Books crawler. Crawls all the books under the Douban Books tags, sorts them by rating, and stores them in Excel for easy filtering and searching. It disguises itself as a browser via the User-Agent header and adds random delays to better imitate browser behavior and avoid being blocked.
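
Both tricks are easy to reproduce outside the project. A minimal sketch using the requests library (the tag URLs are placeholders, not the project's actual code):

```python
# Sketch of the two anti-blocking measures described above: a browser-like
# User-Agent header plus a random delay between requests.
import random
import time

import requests

HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0 Safari/537.36"),
}

# Placeholder list pages; the real project walks every Douban Books tag.
urls = ["https://book.douban.com/tag/小说?start=%d" % (i * 20) for i in range(3)]

for url in urls:
    resp = requests.get(url, headers=HEADERS, timeout=10)
    print(url, resp.status_code)
    time.sleep(random.uniform(1, 3))  # random pause to imitate a human reader
```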

zhihu_spider

https://github.com/LiuRoy/zhihu_spider

Zhihu crawler. This project crawls Zhihu user information and the topology of their follow relationships. The crawler is built on the Scrapy framework, and data is stored in MongoDB.
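
The storage step in a Scrapy-plus-MongoDB setup typically lives in an item pipeline. A minimal sketch (the class, collection, and field names are illustrative, not this project's actual code):

```python
# Illustrative Scrapy item pipeline that writes items into MongoDB;
# the collection name and the user_id field are placeholders.
import pymongo


class MongoPipeline:
    def __init__(self, mongo_uri="mongodb://localhost:27017", db_name="zhihu"):
        self.mongo_uri = mongo_uri
        self.db_name = db_name

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.db_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Upsert keyed on user id, so re-crawls update rather than duplicate.
        self.db["users"].update_one(
            {"user_id": item["user_id"]}, {"$set": dict(item)}, upsert=True
        )
        return item
```

The pipeline would then be enabled through the usual ITEM_PIPELINES entry in the Scrapy settings.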

bilibili-user

https://github.com/airingursb/bilibili-user

Bilibili user crawler. Total records captured: 20,119,918. Fields: user id, nickname, gender, avatar, level, experience points, follower count, birthday, location, registration time, signature, etc. After crawling, it generates a data report on Bilibili users.

SinaSpider

https://github.com/LiuXingMing/SinaSpider

Sina Weibo crawler. Crawls Sina Weibo users' personal information, posts, fans, and followees. The code logs in with Sina Weibo cookies and can rotate multiple accounts to avoid Sina's anti-scraping measures. Built mainly on the Scrapy crawler framework.
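
Rotating several logged-in accounts can be sketched as cycling through their cookie jars (the cookie names and values below are placeholders; the real login flow is more involved):

```python
# Sketch of spreading requests across several accounts by rotating
# their login cookies round-robin. Cookie values are placeholders.
import itertools

import requests

ACCOUNT_COOKIES = [
    {"SUB": "cookie-value-for-account-1"},
    {"SUB": "cookie-value-for-account-2"},
]
cookie_cycle = itertools.cycle(ACCOUNT_COOKIES)

def fetch(url):
    # Each request uses the next account's cookies in round-robin order.
    return requests.get(url, cookies=next(cookie_cycle), timeout=10)

print(fetch("https://weibo.cn/u/1234567890").status_code)  # placeholder URL
```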

distribute_crawler

https://github.com/gnemoug/distribute_crawler

Distributed novel-download crawler. A distributed web crawler built with Scrapy, Redis, MongoDB, and Graphite: the underlying storage is a MongoDB cluster, distribution is handled through Redis, and crawler status is reported via Graphite. It targets a single novel site.
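
The Redis part of such a design usually amounts to a shared task queue that workers on any machine pop from. A minimal sketch (the key names and seed URL are placeholders):

```python
# Sketch of Redis-based work distribution: every worker, on any machine,
# pops URLs from one shared queue. Key names and the URL are placeholders.
import redis
import requests

r = redis.Redis(host="localhost", port=6379)

def worker():
    while True:
        url = r.lpop("crawler:todo")   # shared task queue
        if url is None:
            break                      # queue drained
        resp = requests.get(url.decode(), timeout=10)
        r.sadd("crawler:done", url)    # record completed work
        print(url.decode(), resp.status_code)

# Seed the queue once, then run worker() on as many machines as needed.
r.rpush("crawler:todo", "https://example.com/novel/1")
worker()
```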

CnkiSpider

https://github.com/yanzhou/CnkiSpider

CNKI (China National Knowledge Infrastructure) crawler. After setting the search conditions, run src/CnkiSpider.py to fetch the data, which is stored under the /data directory. The first line of each data file holds the field names.
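
Since the first line is a header row, the output files can be read back with csv.DictReader. A small sketch (the filename and the tab delimiter are assumptions; adjust them to the actual output):

```python
# Reading a data file whose first line holds the field names.
# Filename and delimiter are assumptions about the output format.
import csv

with open("data/result.txt", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        print(row)  # each row is a dict keyed by the first-line field names
```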

LianJiaSpider

https://github.com/lanbing510/LianJiaSpider

Lianjia crawler. Crawls the historical transaction records of second-hand houses on Lianjia Beijing. Covers all the code from the Lianjia crawler article, including the Lianjia simulated-login code.

scrapyjingdong

https://github.com/taizilongxu/scrapyjingdong

JD.com (Jingdong) crawler. A Scrapy-based crawler for the JD.com site; results are saved as CSV.
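
Saving Scrapy output as CSV needs no custom code; a feed-export setting is enough. A sketch for a modern Scrapy settings.py (older versions, which this project likely used, configured FEED_URI/FEED_FORMAT instead; the filename is a placeholder):

```python
# settings.py — Scrapy's built-in feed export writes scraped items to CSV.
# The output filename is a placeholder.
FEEDS = {
    "jd_items.csv": {"format": "csv", "encoding": "utf-8"},
}
```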

QQ-Groups-Spider

https://github.com/caspartse/QQ-Groups-Spider

QQ group crawler. Captures QQ group information in batches, including group name, group number, member count, group owner, and group description, and finally generates an XLS(X)/CSV result file.

wooyunpublic

https://github.com/hanc00l/wooyunpublic

WooYun crawler. A crawler and search tool for WooYun's public vulnerabilities and knowledge base. The full list of public vulnerabilities and the text of each one are stored in MongoDB, about 2 GB in total; crawling the entire site with all text and images for offline browsing takes roughly 10 GB of space and 2 hours (on 10 Mbps telecom bandwidth); crawling the whole knowledge base takes about 500 MB. The vulnerability search uses Flask as the web server and Bootstrap for the front end.
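
The Flask search layer can be pictured as a small endpoint querying the MongoDB store. A minimal sketch (the collection and field names are placeholders, not the project's actual schema):

```python
# Minimal Flask search endpoint over MongoDB; the collection name and
# the title field are placeholders, not the project's schema.
from flask import Flask, jsonify, request
import pymongo

app = Flask(__name__)
db = pymongo.MongoClient("mongodb://localhost:27017")["wooyun"]

@app.route("/search")
def search():
    q = request.args.get("q", "")
    # Case-insensitive substring match on the vulnerability title.
    cursor = db["bugs"].find(
        {"title": {"$regex": q, "$options": "i"}}, {"_id": 0}
    ).limit(20)
    return jsonify(list(cursor))

if __name__ == "__main__":
    app.run()
```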

findtrip

https://github.com/fankcoder/findtrip

Flight ticket crawler (Qunar and Ctrip). Findtrip is a Scrapy-based flight ticket crawler that currently integrates the two major domestic ticketing sites, Qunar and Ctrip.

163spider

https://github.com/leyle/163spider

NetEase (163) client content crawler, based on requests, MySQLdb, and torndb.

doubanspider

https://github.com/fanpei91/doubanspider

A collection of crawlers for Douban movies, books, groups, albums, "dongxi" (stuff), and more.

QQSpider

https://github.com/LiuXingMing/QQSpider

Qzone (QQ Space) crawler, covering journals, comments, personal information, and more; it can capture four million records a day.

baidu-music-spider

https://github.com/Shu-Ji/baidu-music-spider

Baidu MP3 full-site crawler; uses Redis to support resuming interrupted crawls.
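
Resume-from-breakpoint with Redis usually means keeping the set of already-fetched URLs in a persistent Redis structure so a restarted crawler skips them. A minimal sketch (the key name and URLs are placeholders):

```python
# Sketch of Redis-backed resume: fetched URLs live in a persistent Redis
# set, so a restarted crawler skips them. Key name and URLs are placeholders.
import redis
import requests

r = redis.Redis()

def crawl(urls):
    for url in urls:
        if r.sismember("mp3:seen", url):
            continue                    # fetched before the interruption
        resp = requests.get(url, timeout=10)
        r.sadd("mp3:seen", url)         # survives crawler restarts
        print(url, resp.status_code)

crawl(["https://example.com/song/1", "https://example.com/song/2"])
```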

tbcrawler

https://github.com/pakoo/tbcrawler

Taobao and Tmall crawler. Fetches page information by search keyword or item id; data is stored in MongoDB.

stockholm

https://github.com/benitoro/stockholm

A stock data (Shanghai and Shenzhen markets) crawler and stock-picking strategy testing framework. Fetches market data for all Shanghai and Shenzhen stocks over a chosen date range. Supports defining stock-picking strategies with expressions and supports multithreading. Saves data to JSON or CSV files.
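
An expression-defined strategy can be pictured as evaluating a boolean expression against each stock's fields. A toy sketch with invented field names and sample data (not the project's actual schema):

```python
# Toy sketch of an expression-defined stock filter; the field names and
# sample data are invented for illustration.
stocks = [
    {"code": "600000", "pe": 6.2, "volume": 1_200_000},
    {"code": "000001", "pe": 25.0, "volume": 300_000},
]

strategy = "pe < 10 and volume > 1000000"  # user-supplied expression

# eval() is tolerable here because the expression comes from the local user.
selected = [s for s in stocks if eval(strategy, {}, s)]
print(selected)  # -> only the 600000 entry
```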

BaiduyunSpider

https://github.com/k1995/BaiduyunSpider

Baidu cloud disk crawler.

Spider

https://github.com/Qutan/Spider

Social data crawler. Supports Weibo, Zhihu, and Douban.

proxy_pool

https://github.com/jhao104/proxy_pool

A proxy IP pool for Python crawlers.
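
The pool is consumed over a small HTTP API. A sketch assuming the default local endpoint described in the project's README (port and routes may differ by version):

```python
# Fetching a proxy from the pool's HTTP API and using it for a request.
# The endpoint follows the project's README defaults and may differ.
import requests

def get_proxy():
    return requests.get("http://127.0.0.1:5010/get/", timeout=5).json().get("proxy")

proxy = get_proxy()
resp = requests.get(
    "https://httpbin.org/ip",
    proxies={"http": f"http://{proxy}", "https": f"http://{proxy}"},
    timeout=10,
)
print(resp.json())
```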

music-163

https://github.com/RitterHou/music-163

Crawls the comments on every song in NetEase Cloud Music.

jandan_spider

https://github.com/kulovecc/jandan_spider

Crawls the girl pictures ("meizitu") from Jandan.net.

CnblogsSpider

https://github.com/jackgitgz/CnblogsSpider

Cnblogs list-page crawler.

spider_smooc

https://github.com/qiyeboy/spider_smooc

Crawls course videos from the imooc MOOC site.

knowsecSpider2

https://github.com/littlethunder/knowsecSpider2

Crawler for the Knownsec (知道创宇) crawler assignment.

aiss-spider

https://github.com/x-spiders/aiss-spider

Aiss app image crawler, including a bypass to view VIP images without paying.

SinaSpider

https://github.com/szcf-weiya/SinaSpider

Uses dynamic IPs to get around Sina's anti-crawling mechanism and fetch content quickly.

csdn-spider

https://github.com/Kevinsss/csdn-spider

Crawls blog posts from CSDN.

ProxySpider

https://github.com/changetjut/ProxySpider

Crawls proxy IPs from the Xici (xicidaili) proxy site and verifies their availability.
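
Verifying a scraped proxy usually means making a test request through it and treating any error or timeout as a failure. A minimal sketch (the test URL is a common convention, not necessarily the project's choice):

```python
# Minimal proxy liveness check: a proxy passes if a test request through it
# succeeds within the timeout. Test URL and proxy address are placeholders.
import requests

def is_alive(proxy, test_url="https://httpbin.org/ip", timeout=5):
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    try:
        return requests.get(test_url, proxies=proxies, timeout=timeout).ok
    except requests.RequestException:
        return False

print(is_alive("127.0.0.1:8080"))
```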


Reprinted from: http://www.sohu.com/a/160870327_505818
