WechatSogou
https://github.com/Chyroc/WechatSogou
WeChat official account crawler. A crawler for official-account articles built on Sogou's WeChat search, which can be extended into a general Sogou-search crawler. Results are returned as a list of dictionaries, each holding the details of one official account.
DouBanSpider
https://github.com/lanbing510/DouBanSpider
Douban Books crawler. Crawls every book under the Douban Books tag, sorts them by rating, and saves them to Excel for easy filtering and searching. It disguises itself as a browser via the User-Agent header and adds random delays between requests to better imitate browser behavior and avoid being blocked.
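The two anti-blocking tricks mentioned above (User-Agent disguise plus a random delay) can be sketched roughly as follows; the header strings and delay bounds are illustrative, not taken from the project:

```python
import random
import time

# A few plausible browser User-Agent strings (illustrative only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def build_headers():
    """Pick a random User-Agent so requests look like a browser."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def polite_sleep(min_s=1.0, max_s=3.0):
    """Sleep a random interval between requests; returns the delay used."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

A crawl loop would call `build_headers()` per request and `polite_sleep()` between pages.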
zhihu_spider
https://github.com/LiuRoy/zhihu_spider
Zhihu crawler. Crawls Zhihu user information and the follower/followee topology between users. Built on the scrapy framework, with MongoDB for data storage.
bilibili-user
https://github.com/airingursb/bilibili-user
Bilibili user crawler. Total records: 20,119,918. Captured fields: user id, nickname, gender, avatar, level, experience points, follower count, birthday, location, registration time, and signature. After crawling, it generates a Bilibili user data report.
SinaSpider
https://github.com/LiuXingMing/SinaSpider
Sina Weibo crawler. Crawls Sina Weibo users' personal information, posts, followers, and followees. The code logs in with Sina Weibo cookies and can rotate multiple accounts to avoid Sina's anti-crawling measures. Built mainly on the scrapy framework.
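The multi-account trick described above amounts to rotating the cookies of several logged-in accounts so no single account trips the rate limits. A minimal sketch (`CookieRotator` is a hypothetical helper, not code from the repository):

```python
import itertools

class CookieRotator:
    """Rotate the cookie jars of several logged-in accounts round-robin,
    so requests are spread across accounts."""

    def __init__(self, cookie_jars):
        # cookie_jars: one dict of cookies per logged-in account
        self._cycle = itertools.cycle(cookie_jars)

    def next_cookies(self):
        """Return the next account's cookies."""
        return next(self._cycle)
```

Each request would then pass `rotator.next_cookies()` as its cookies.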
distribute_crawler
https://github.com/gnemoug/distribute_crawler
Novel-download distributed crawler. A distributed web crawler built with scrapy, Redis, MongoDB, and graphite, targeting a novel site: a MongoDB cluster for underlying storage, Redis for distribution, and graphite for crawler status display.
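Redis-based distribution of this kind typically boils down to a shared pending queue plus a seen-set, so multiple workers never schedule the same URL twice. A minimal in-memory stand-in (a deque and a Python set in place of the Redis structures) might look like:

```python
from collections import deque

class CrawlQueue:
    """In-memory stand-in for the Redis list/set pair a distributed
    crawler typically uses: a pending queue plus a deduplication set."""

    def __init__(self):
        self.pending = deque()  # Redis equivalent: LPUSH/RPOP on a list
        self.seen = set()       # Redis equivalent: SADD on a set

    def push(self, url):
        """Schedule a URL once; return False if it was already seen."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.pending.append(url)
        return True

    def pop(self):
        """Hand the next URL to a worker, or None if the queue is empty."""
        return self.pending.popleft() if self.pending else None
```

With real Redis, all workers would share one queue instance over the network.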
CnkiSpider
https://github.com/yanzhou/CnkiSpider
CNKI (China National Knowledge Infrastructure) crawler. After setting the search conditions, run src/CnkiSpider.py to grab the data, which is stored under the /data directory; the first line of each data file holds the field names.
LianJiaSpider
https://github.com/lanbing510/LianJiaSpider
Lianjia crawler. Crawls historical second-hand-house transaction records from Lianjia in Beijing. Includes all the code from the Lianjia crawler article, including the simulated-login code.
scrapyjingdong
https://github.com/taizilongxu/scrapyjingdong
Jingdong (JD.com) crawler. A scrapy-based crawler for the JD site; results are saved as CSV.
QQ-Groups-Spider
https://github.com/caspartse/QQ-Groups-Spider
QQ group crawler. Captures QQ group information in batches, including group name, group ID, group number, group owner, and group introduction, and finally generates an XLS(X)/CSV result file.
wooyunpublic
https://github.com/hanc00l/wooyunpublic
Wooyun crawler. A crawler and search tool for Wooyun's disclosed vulnerabilities and knowledge base. The list of all public vulnerabilities and the text of each one are stored in MongoDB, roughly 2 GB of content; crawling the whole site with all text and images for offline use takes about 10 GB of space and 2 hours (on a 10 Mbps telecom line), and crawling the full knowledge base takes about 500 MB. Vulnerability search uses Flask as the web server and Bootstrap for the front end.
findtrip
https://github.com/fankcoder/findtrip
Airline ticket crawler (Qunar and Ctrip). Findtrip is a Scrapy-based flight-ticket crawler that currently covers the two major domestic ticket sites, Qunar and Ctrip.
163spider
https://github.com/leyle/163spider
NetEase client content crawler, based on requests, MySQLdb, and torndb.
doubanspider
https://github.com/fanpei91/doubanspider
A collection of crawlers for Douban movies, books, groups, photo albums, "things", and more.
QQSpider
https://github.com/LiuXingMing/QQSpider
Qzone (QQ space) crawler, covering logs, comments, personal information, etc.; can capture 4 million records a day.
baidu-music-spider
https://github.com/Shu-Ji/baidu-music-spider
Baidu MP3 full-site crawler; uses Redis to support resuming from breakpoints.
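Breakpoint resume with Redis usually amounts to a persistent set of finished ids: a restarted crawler skips everything already recorded. A sketch with a plain Python set standing in for the Redis set (`ResumableCrawler` and its fields are hypothetical, not this project's API):

```python
class ResumableCrawler:
    """Skip items whose ids are already in the 'done' set, so a crawl
    can resume after a crash. With Redis, 'done' would be a server-side
    set (SMEMBERS on startup, SISMEMBER/SADD during the crawl)."""

    def __init__(self, done=None):
        self.done = set(done or [])

    def crawl(self, item_ids, fetch):
        """Fetch each unseen item, marking it done on success.
        Returns the ids actually fetched this run."""
        fetched = []
        for item_id in item_ids:
            if item_id in self.done:
                continue          # already crawled in a previous run
            fetch(item_id)        # do the actual download
            self.done.add(item_id)
            fetched.append(item_id)
        return fetched
```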
tbcrawler
https://github.com/pakoo/tbcrawler
Crawlers for Taobao and Tmall that grab page information by search keyword or item id; data is stored in MongoDB.
stockholm
https://github.com/benitoro/stockholm
A stock-data (Shanghai and Shenzhen) crawler and stock-selection strategy testing framework. Captures market data for all Shanghai and Shenzhen stocks over a chosen date range. Supports defining stock-selection strategies as expressions, supports multithreading, and saves data to JSON or CSV files.
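Expression-defined strategies can be as simple as evaluating a boolean expression against each stock's indicator fields. This sketch (`select_stocks` and the field names `pe`/`volume` are illustrative, not the framework's actual API) shows the idea:

```python
def select_stocks(stocks, expression):
    """Return the stocks for which the strategy expression holds.

    Each stock is a dict of indicators; `expression` is a boolean
    expression over those field names, e.g. "pe < 20 and volume > 1e6".
    """
    picked = []
    for stock in stocks:
        # Evaluate with the stock's fields as the only namespace.
        # Acceptable here because the strategy string is user-authored
        # and trusted, not external input.
        if eval(expression, {"__builtins__": {}}, stock):
            picked.append(stock)
    return picked
```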
BaiduyunSpider
https://github.com/k1995/BaiduyunSpider
Baidu cloud disk crawler.
Spider
https://github.com/Qutan/Spider
Social data crawler. Supports Weibo, Zhihu, and Douban.
proxy pool
https://github.com/jhao104/proxy_pool
Python crawler proxy IP pool (proxy pool).
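A proxy pool's core job is handing out proxies in rotation and retiring the ones that stop working. A minimal sketch (not this project's API; it serves proxies over an HTTP interface):

```python
class ProxyPool:
    """Hand out proxies round-robin and drop ones reported as dead."""

    def __init__(self, proxies):
        self.proxies = list(proxies)
        self.i = 0

    def get(self):
        """Return the next live proxy, or None if the pool is empty."""
        if not self.proxies:
            return None
        proxy = self.proxies[self.i % len(self.proxies)]
        self.i += 1
        return proxy

    def report_dead(self, proxy):
        """Remove a proxy that failed, so it is never handed out again."""
        if proxy in self.proxies:
            self.proxies.remove(proxy)
```

A crawler would `get()` a proxy per request and `report_dead()` on connection failures.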
music-163
https://github.com/RitterHou/music-163
Crawls the comments on all songs of NetEase Cloud Music.
jandan_spider
https://github.com/kulovecc/jandan_spider
Crawls the girl pictures (妹子图) from Jandan (jandan.net).
CnblogsSpider
https://github.com/jackgitgz/CnblogsSpider
cnblogs list page crawler.
spider_smooc
https://github.com/qiyeboy/spider_smooc
Crawls MOOC course videos.
knowsecSpider2
https://github.com/littlethunder/knowsecSpider2
Knownsec (知道创宇) crawler challenge problems.
aiss-spider
https://github.com/x-spiders/aiss-spider
Aiss app picture crawler; bypasses payment to access VIP pictures for free.
SinaSpider
https://github.com/szcf-weiya/SinaSpider
Uses dynamic IPs to get around Sina's anti-crawling mechanism and grab content quickly.
csdn-spider
https://github.com/Kevinsss/csdn-spider
Crawls blog articles from CSDN.
ProxySpider
https://github.com/changetjut/ProxySpider
Crawls proxy IPs from Xici (西刺) and verifies the availability of each proxy.
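Verifying a scraped proxy typically means fetching a known URL through it and checking the status code. In this sketch the HTTP callable is injected so the logic can be tested offline; in practice one would pass `requests.get`:

```python
def check_proxy(proxy, fetch, timeout=5):
    """Return True if the proxy can successfully fetch a test URL.

    `proxy` is a URL like "http://1.2.3.4:8080"; `fetch` is an HTTP
    callable with a requests.get-style signature, injected so the
    check can be exercised without a live network.
    """
    try:
        resp = fetch(
            "http://httpbin.org/ip",          # any stable test URL works
            proxies={"http": proxy},
            timeout=timeout,
        )
        return resp.status_code == 200
    except Exception:
        # Timeouts, refused connections, etc. all mean the proxy is dead.
        return False
```

Usage: `check_proxy("http://1.2.3.4:8080", requests.get)`.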
Reprinted from:
http://www.sohu.com/a/160870327_505818
A collection of 31 Python crawler combat projects