Selected Python web crawler projects 3

Reference links:
Collection of practical crawler projects
https://www.cnblogs.com/hankleo/p/9784544.html

WechatSogou
https://github.com/Chyroc/WechatSogou
WeChat official account crawler. A crawler interface for WeChat official accounts based on Sogou WeChat search; it can be extended into a general crawler based on Sogou search. It returns a list in which each item is a dictionary holding the details of one official account.

DouBanSpider
https://github.com/lanbing510/DouBanSpider
Douban Books crawler. It can crawl all books under a given Douban Books tag and store them in Excel ranked by rating, which makes filtering easy, for example picking highly rated books with more than 1,000 reviews. Different topics can be stored in different Excel sheets. It spoofs a browser User-Agent while crawling and adds random delays to better imitate browser behavior and avoid being blocked.
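
The User-Agent spoofing and random delay mentioned above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not DouBanSpider's actual code; the `USER_AGENTS` list, the function names, and the 1 to 3 second delay range are all assumptions made for the example.

```python
import random
import time
import urllib.request

# Hypothetical pool of browser User-Agent strings; any realistic values work.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def build_headers(rng=random):
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


def random_delay(low=1.0, high=3.0, rng=random):
    """Return a pause length in seconds, drawn uniformly from [low, high]."""
    return rng.uniform(low, high)


def fetch(url, timeout=10):
    """Fetch one page with a spoofed User-Agent, then pause for a random interval."""
    request = urllib.request.Request(url, headers=build_headers())
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
    time.sleep(random_delay())  # irregular pacing looks less like a bot
    return body
```

Calling `fetch(url)` in a loop gives each request a different browser identity and an irregular gap between requests.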

zhihu_spider
https://github.com/LiuRoy/zhihu_spider
Zhihu crawler. This project crawls Zhihu user information and the topology of relationships between users, using the Scrapy framework and MongoDB for storage.

bilibili-user
https://github.com/airingursb/bilibili-user
Bilibili user crawler. Total records: 20,119,918. Fields crawled: user ID, nickname, gender, avatar, level, experience points, number of followers, birthday, address, registration date, signature, and so on. After crawling, it generates a report on Bilibili users.

SinaSpider
https://github.com/LiuXingMing/SinaSpider
Sina Weibo crawler. It mainly crawls Weibo users' profile information, posts, followers, and followees. The code logs in to Sina Weibo to obtain cookies, and logging in with multiple accounts helps get around Sina's anti-crawling measures. It mainly uses the Scrapy framework.

distribute_crawler
https://github.com/gnemoug/distribute_crawler
Distributed crawler for downloading novels. A distributed web crawler built with Scrapy, Redis, MongoDB, and Graphite: the underlying storage is a MongoDB cluster, distribution is implemented with Redis, and crawler status is displayed with Graphite. It targets a single novel site.
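
The core idea behind distributing a crawl is a shared queue of pending URLs plus a shared record of URLs already scheduled. The sketch below shows that idea with an in-process stand-in; it is not distribute_crawler's code, and the `Frontier` class and its method names are hypothetical. In a real distributed setup the two structures would live in a shared store such as Redis so that several workers can use them at once.

```python
from collections import deque


class Frontier:
    """In-process stand-in for a shared crawl queue with URL de-duplication."""

    def __init__(self):
        self._queue = deque()  # pending URLs, first in, first out
        self._seen = set()     # every URL that has ever been scheduled

    def push(self, url):
        """Schedule url unless it was scheduled before; return True if added."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def pop(self):
        """Return the next pending URL, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None
```

Each worker would repeatedly `pop()` a URL, download it, and `push()` any newly discovered links; the seen-set keeps the same page from being fetched twice.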

CnkiSpider
https://github.com/yanzhou/CnkiSpider
CNKI (China National Knowledge Infrastructure) crawler. After setting the search conditions, run src/CnkiSpider.py to fetch the data. The results are stored in the /data directory, and the first line of each data file holds the field names.

LianJiaSpider
https://github.com/lanbing510/LianJiaSpider
Lianjia crawler. It crawls Lianjia's historical second-hand housing transaction records for Beijing. The repository contains all the code from the Lianjia crawler article, including the code that simulates a Lianjia login.

scrapyjingdong
https://github.com/taizilongxu/scrapyjingdong
JD.com (Jingdong) crawler. A Scrapy-based crawler for the JD.com website; results are saved as CSV.

QQ-Groups-Spider
https://github.com/caspartse/QQ-Groups-Spider
QQ group crawler. It fetches QQ group information in batches, including group name, group number, member count, group owner, and group description, and finally generates XLS(X) / CSV result files.
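
Writing crawled records out as CSV, as several projects in this list do, takes only a few lines with the standard library. The sketch below is illustrative and not taken from the project; the column names in `FIELDS` and the `groups_to_csv` helper are hypothetical.

```python
import csv
import io

# Hypothetical column names, loosely based on the fields listed above.
FIELDS = ["group_name", "group_number", "member_count", "owner", "description"]


def groups_to_csv(rows):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

To write to disk instead, pass a file opened with `newline=""` to `csv.DictWriter` in place of the `StringIO` buffer.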

wooyunpublic
https://github.com/hanc00l/wooyunpublic
WooYun crawler. A crawler and search tool for WooYun's public vulnerabilities and knowledge base. The list and text of all public vulnerabilities are stored in MongoDB, about 2 GB in total. Crawling the whole site, including all text and images for offline querying, takes about 10 GB of space and about 2 hours (on a 10M telecom connection). Crawling the entire knowledge base takes about 500 MB of space. Vulnerability search uses Flask as the web server and Bootstrap as the front end.

findtrip
https://github.com/fankcoder/findtrip
Flight ticket crawler (Qunar and Ctrip). Findtrip is a Scrapy-based flight ticket crawler that currently integrates China's two major ticket websites, Qunar and Ctrip.

163spider
https://github.com/leyle/163spider
Crawler for NetEase client content, based on requests, MySQLdb, and torndb.

doubanspiders
https://github.com/fanpei91/doubanspiders
A collection of crawlers for Douban movies, books, groups, albums, and more.

QQSpider
https://github.com/LiuXingMing/QQSpider
Qzone (QQ Zone) crawler covering blog posts, personal information, and more; it can fetch 4 million records a day.

baidu-music-spider
https://github.com/Shu-Ji/baidu-music-spider
Baidu MP3 whole-site crawler; uses Redis to support resuming interrupted crawls.

tbcrawler
https://github.com/pakoo/tbcrawler
Taobao and Tmall crawler. It can fetch page information by search keyword or item ID, and stores the data in MongoDB.

stockholm
https://github.com/benitoro/stockholm
A stock data crawler (Shanghai and Shenzhen markets) and stock-picking strategy testing framework. It fetches market data for all Shanghai and Shenzhen stocks within a selected date range, supports defining stock-picking strategies with expressions, supports multi-threading, and saves data to JSON and CSV files.

BaiduyunSpider
https://github.com/k1995/BaiduyunSpider
Baidu cloud disk crawler.

Spider
https://github.com/Qutan/Spider
Social data crawler. Supports Weibo, Zhihu, and Douban.

proxy_pool
https://github.com/jhao104/proxy_pool
Proxy IP pool for Python crawlers.

music-163
https://github.com/RitterHou/music-163
Crawls the comments on all songs on NetEase Cloud Music.

jandan_spider
https://github.com/kulovecc/jandan_spider
Crawls pictures from Jandan's "meizi" (girl pictures) section.

CnblogsSpider
https://github.com/jackgitgz/CnblogsSpider
Crawler for cnblogs list pages.

spider_smooc
https://github.com/qiyeboy/spider_smooc
Crawls videos from imooc.

knowsecSpider2
https://github.com/littlethunder/knowsecSpider2
Knownsec's crawler exercise.

aiss-spider
https://github.com/x-spiders/aiss-spider
AISS app picture crawler, which also bypasses payment to reach VIP picture sets.

SinaSpider
https://github.com/szcf-weiya/SinaSpider
Uses a dynamic IP mechanism to get around Sina's anti-crawling measures and crawl content quickly.

csdn-spider
https://github.com/Kevinsss/csdn-spider
Crawls blog articles on CSDN.

ProxySpider
https://github.com/changetjut/ProxySpider
Crawls proxy IPs from Xici and verifies that the proxies are usable.
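
Verifying that a proxy is usable usually means sending a test request through it and checking that a response comes back. The sketch below shows one way to do that with the standard library; it is not ProxySpider's code, and the function names, the `http://example.com/` test URL, and the 5 second timeout are assumptions.

```python
import http.client
import urllib.request


def check_proxy(proxy, test_url="http://example.com/", timeout=5):
    """Return True if an HTTP request routed through proxy ("host:port") succeeds."""
    handler = urllib.request.ProxyHandler({"http": f"http://{proxy}"})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # URLError and timeouts are OSError subclasses;
        # HTTPException covers malformed replies from a broken proxy.
        return False


def filter_working(proxies, checker=check_proxy):
    """Keep only the proxies for which checker returns True."""
    return [proxy for proxy in proxies if checker(proxy)]
```

Passing the checker in as a parameter makes it easy to swap in a stricter test, for example one that also measures response time, without changing the filtering loop.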

Origin blog.csdn.net/JxufeCarol/article/details/104364367