If you don't yet have any web-crawling background, some of this may feel difficult at first; I suggest working through the Python basics before tackling this crawler selection series.
Today I have compiled 23 Python crawler projects for you. I put this list together because crawling is a quick and easy topic to get started with, and it is great for helping newcomers build confidence. All links point to GitHub.
Selection of Python crawlers (23 GitHub crawlers to share)
Article Directory
- Selection of Python crawlers (23 GitHub crawlers to share)
- 1. WechatSogou - WeChat official account crawler
- 2. DouBanSpider - Douban Reading crawler
- 3. zhihu_spider - Zhihu crawler
- 4. bilibili-user - Bilibili user crawler
- 5. SinaSpider - Sina Weibo crawler
- 6. distribute_crawler - distributed novel-download crawler
- 7. CnkiSpider - CNKI crawler
- 8. LianJiaSpider - Lianjia.com crawler
- 9. scrapy_jingdong - JD.com crawler
- 10. QQ-Groups-Spider - QQ group crawler
- 11. wooyun_public - Wooyun crawler
- 12. Spider - hao123 website crawler
- 13. Findtrip - flight ticket crawler (Qunar and Ctrip)
- 14. 163spider - NetEase client content crawler based on requests, MySQLdb, and torndb
- 15. doubanspiders - crawlers for Douban movies, books, groups, albums, and more
- 16. QQSpider - QQ Space (Qzone) crawler
- 17. baidu-music-spider - Baidu MP3 site crawler
- 18. tbcrawler - Taobao and Tmall crawler
- 19. Stockholm - Shanghai/Shenzhen stock data crawler and stock-picking strategy backtesting framework
- 20. BaiduyunSpider - Baidu Cloud disk crawler
- 21. Spider - social data crawler
- 22. proxy_pool - Python crawler proxy IP pool
- 23. music-163 - NetEase Cloud Music comment crawler
1. WechatSogou - WeChat official account crawler
A WeChat official account crawler interface built on Sogou's WeChat search; it can be extended into a general crawler based on Sogou search. Results are returned as a list in which each item is a dictionary of details about one official account.
GitHub address:
https://github.com/Chyroc/WechatSogou
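As a quick illustration, here is a minimal usage sketch. The `wechatsogou` package name and the `WechatSogouAPI`/`search_gzh` names follow the project's README, but treat the exact API as an assumption and check the repository for current usage.

```python
# Minimal usage sketch (assumed API, based on the project's README; verify against the repo).
import wechatsogou  # pip install wechatsogou

ws_api = wechatsogou.WechatSogouAPI()

# Search official accounts by keyword; each result is expected to be a dict
# with fields such as the account name and profile URL.
for gzh in ws_api.search_gzh('python'):
    print(gzh)
```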
2. DouBanSpider - Douban Reading crawler
Crawls all books under a Douban Reading tag and stores them in Excel, ranked by rating, so you can easily filter and collect them, for example by selecting highly rated books with more than 1,000 ratings. Books can be saved to separate Excel sheets by topic. The crawler uses a User-Agent header to masquerade as a browser and adds random delays to better imitate browser behavior and avoid being blocked (a minimal sketch of that technique follows the link below).
GitHub address:
https://github.com/lanbing510/DouBanSpider
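The User-Agent disguise plus random-delay idea is straightforward with `requests`; the sketch below is illustrative only (the tag URL is an assumption, not taken from the repo):

```python
# Minimal sketch of crawling politely: browser-like User-Agent plus random delays.
import random
import time

import requests

HEADERS = {
    # Pretend to be a desktop browser so the server does not reject the request outright.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}

def fetch(url: str) -> str:
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    # Sleep 1-3 seconds between requests to imitate a human reader.
    time.sleep(random.uniform(1, 3))
    return resp.text

if __name__ == "__main__":
    html = fetch("https://book.douban.com/tag/%E7%BC%96%E7%A8%8B")  # the "programming" tag
    print(len(html))
```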
3. zhihu_spider - Zhihu crawler
This project crawls Zhihu user information and the topology of follow relationships between users. It uses the Scrapy crawler framework and stores data in MongoDB (a minimal pipeline sketch follows the link below).
GitHub address:
https://github.com/LiuRoy/zhihu_spider
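For readers new to Scrapy, storing items in MongoDB is usually done with an item pipeline. The sketch below shows the standard pattern; the database and collection names are illustrative, not taken from this repo:

```python
# A minimal Scrapy item pipeline that writes items to MongoDB.
# Database/collection names are illustrative; the project has its own pipeline.
import pymongo

class MongoPipeline:
    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.db = self.client["zhihu"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db["users"].insert_one(dict(item))
        return item
```

A pipeline like this is enabled by adding its class path to `ITEM_PIPELINES` in the project's `settings.py`.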
4. bilibili-user - Bilibili user crawler
Total records: 20,119,918. Captured fields: user ID, nickname, gender, avatar, level, experience points, number of fans, birthday, location, registration time, signature, and more. After crawling, it generates a Bilibili user data report.
GitHub address:
https://github.com/airingursb/bilibili-user
5. SinaSpider - Sina Weibo crawler
Mainly crawls Sina Weibo users' personal information, posts, fans, and followees. The code logs in with Sina Weibo cookies, and multiple accounts can be used to avoid being blocked by Sina (a sketch of cookie rotation follows the link below). It is built mainly on the Scrapy crawler framework.
GitHub address:
https://github.com/LiuXingMing/SinaSpider
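Multi-account, cookie-based login generally boils down to rotating a pool of cookie sets across requests. This is a generic sketch of the idea, not the project's actual code; the cookie values are placeholders:

```python
# Generic sketch of rotating logged-in cookie sets across requests to spread
# load over several accounts. Cookie names/values are placeholders.
import itertools

import requests

COOKIE_POOL = [
    {"SUB": "cookie-for-account-1"},  # placeholder values
    {"SUB": "cookie-for-account-2"},
]
_rotation = itertools.cycle(COOKIE_POOL)

def fetch(url: str) -> requests.Response:
    cookies = next(_rotation)  # a different account on each call
    return requests.get(url, cookies=cookies, timeout=10)
```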
6. distribute_crawler - distributed novel-download crawler
A distributed web crawler built with Scrapy, Redis, MongoDB, and Graphite: data is stored in a MongoDB cluster, tasks are distributed through Redis, and crawler status is displayed with Graphite. It mainly targets a single novel site.
GitHub address:
https://github.com/gnemoug/distribute_crawler
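The Redis part of such an architecture is essentially a shared task queue that every crawler node pops from. A minimal sketch with `redis-py` follows; the queue name and parsing step are illustrative:

```python
# Minimal sketch of a Redis-backed shared URL queue for distributed crawling.
# Every worker process or machine runs this loop against the same Redis server.
import redis
import requests

r = redis.Redis(host="localhost", port=6379)

def worker():
    while True:
        # BLPOP blocks until a URL is available, so idle workers just wait.
        _key, url = r.blpop("crawl:queue")
        html = requests.get(url.decode(), timeout=10).text
        # ... parse html, store the chapter, and push newly found URLs back:
        # r.rpush("crawl:queue", next_url)
```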
7. CnkiSpider - CNKI crawler
After setting the search conditions, run src/CnkiSpider.py to start crawling. The captured data is stored in the /data directory, and the first line of each data file contains the field names.
GitHub address:
https://github.com/yanzhou/CnkiSpider
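Since the first line of each data file holds the field names, the output can be loaded directly with `csv.DictReader`. The file name and delimiter below are assumptions; adjust them to the actual files under /data:

```python
# Read one of the crawler's output files, assuming the first line is a header row.
# File name and delimiter are assumptions; adjust to the actual files in /data.
import csv

with open("data/result.txt", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row)  # each row is a dict keyed by the field names
```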
8. LianJiaSpider - Lianjia.com crawler
Crawls Lianjia's historical second-hand housing transaction records in Beijing. It covers all the code from the accompanying Lianjia crawler article, including the simulated-login code.
GitHub address:
https://github.com/lanbing510/LianJiaSpider
9. scrapy_jingdong - JD.com crawler
A JD.com crawler based on Scrapy; results are saved in CSV format.
GitHub address:
https://github.com/taizilongxu/scrapy_jingdong
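With Scrapy, CSV output needs no custom code: the built-in feed exporter is selected on the command line. The skeleton below is illustrative; the spider name, URL, and selectors are assumptions, not this project's code:

```python
# Minimal Scrapy spider skeleton; run it with the built-in CSV feed exporter:
#   scrapy crawl jd -o items.csv
# Spider name, URL, and CSS selectors here are illustrative assumptions.
import scrapy

class JdSpider(scrapy.Spider):
    name = "jd"
    start_urls = ["https://search.jd.com/Search?keyword=python"]

    def parse(self, response):
        # Yielded dicts become rows in the CSV feed.
        for title in response.css("div.p-name a em::text").getall():
            yield {"title": title.strip()}
```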
10. QQ-Groups-Spider - QQ group crawler
Crawls QQ group information in batches, including group name, group number, member count, group owner, and group description, and finally generates XLS(X)/CSV result files.
GitHub address:
https://github.com/caspartse/QQ-Groups-Spider
11. wooyun_public - Wooyun crawler
Crawls and searches the public vulnerability reports and knowledge base of Wooyun (乌云, a former Chinese vulnerability-disclosure platform). The full list of public vulnerabilities and the text of every report are stored in MongoDB, about 2 GB in total. Crawling the whole site, including all text and images, for offline querying takes roughly 10 GB of space and 2 hours on a 10 Mb telecom connection; the complete knowledge base takes about 500 MB. Vulnerability search uses Flask as the web server and Bootstrap for the front end.
https://github.com/hanc00l/wooyun_public
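The Flask-plus-MongoDB search side of such a project can be as small as one route. The sketch below is generic; the route, database, collection, and field names are illustrative, not taken from wooyun_public:

```python
# Minimal sketch of a Flask search endpoint over vulnerability reports in MongoDB.
# Database/collection/field names are illustrative, not taken from wooyun_public.
from flask import Flask, request, jsonify
import pymongo

app = Flask(__name__)
coll = pymongo.MongoClient("mongodb://localhost:27017")["wooyun"]["bugs"]

@app.route("/search")
def search():
    keyword = request.args.get("q", "")
    # Simple substring match on the title; a real deployment would use a text index.
    docs = coll.find({"title": {"$regex": keyword}}, {"_id": 0, "title": 1}).limit(20)
    return jsonify(list(docs))

if __name__ == "__main__":
    app.run()
```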
12. Spider - hao123 website crawler
Uses hao123 as the entry page, iteratively crawls outbound links, collects URLs, and records each URL's internal and external link counts, page title, and other information. Tested on 32-bit Windows 7; it currently collects roughly 100,000 records every 24 hours.
https://github.com/simapple/spider
13. Findtrip - flight ticket crawler (Qunar and Ctrip)
Findtrip is a flight-ticket crawler based on Scrapy that currently integrates the two major Chinese ticket sites, Qunar and Ctrip.
https://github.com/fankcoder/findtrip
14. 163spider - NetEase client content crawler based on requests, MySQLdb, and torndb
https://github.com/leyle/163spider
15. doubanspiders - crawlers for Douban movies, books, groups, albums, and more
https://github.com/fanpei91/doubanspiders
16. QQSpider - QQ Space (Qzone) crawler
Crawls QQ Space blog posts, "shuoshuo" status updates, personal information, and more; it can capture about 4 million records per day.
https://github.com/LiuXingMing/QQSpider
17. baidu-music-spider - Baidu MP3 site crawler
Uses Redis to support resuming interrupted crawls (a sketch of the idea follows the link below).
https://github.com/Shu-Ji/baidu-music-spider
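Using Redis for resumable crawling usually means recording finished URLs in a persistent set so a restarted crawler skips them. A generic sketch, not the project's actual code:

```python
# Generic sketch of resumable crawling: completed URLs go into a Redis set,
# which survives restarts, so interrupted runs pick up where they left off.
import redis
import requests

r = redis.Redis()

def crawl(url: str):
    if r.sismember("crawled:urls", url):
        return  # already fetched in a previous run; skip
    html = requests.get(url, timeout=10).text
    # ... parse and store html ...
    r.sadd("crawled:urls", url)  # mark done only after success
```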
18. tbcrawler - Taobao and Tmall crawler
Fetches item page information from Taobao and Tmall by search keyword or item ID and stores the data in MongoDB.
https://github.com/pakoo/tbcrawler
19. Stockholm - Shanghai/Shenzhen stock data crawler and stock-picking strategy backtesting framework
Crawls all Shanghai and Shenzhen stock market data for a selected date range. Supports defining stock-picking strategies as expressions (a simplified sketch follows the link below) and multi-threaded processing, and saves data to JSON or CSV files.
https://github.com/benitoro/stockholm
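Defining a strategy "as an expression" typically means evaluating a boolean expression against each stock's fields. The sketch below shows one way that idea can work; the field names are invented for illustration and this is not the project's implementation:

```python
# Simplified sketch of expression-based stock picking: each stock is a dict of
# fields, and the strategy is a boolean expression over those fields.
# Field names below are invented for illustration.
stocks = [
    {"code": "600000", "pe": 8.2, "volume": 1_200_000},
    {"code": "000001", "pe": 35.0, "volume": 300_000},
]

strategy = "pe < 10 and volume > 1_000_000"

# eval() is used here only for brevity, on trusted input.
picked = [s for s in stocks if eval(strategy, {"__builtins__": {}}, s)]
print(picked)  # -> only the 600000 entry passes
```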
20. BaiduyunSpider - Baidu Cloud disk crawler
https://github.com/k1995/BaiduyunSpider
21. Spider - social data crawler
Supports Weibo, Zhihu, and Douban.
https://github.com/Qutan/Spider
22. proxy_pool - Python crawler proxy IP pool
https://github.com/jhao104/proxy_pool
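Once the pool is running, a crawler typically fetches a proxy from its local HTTP API and attaches it to outgoing requests. The endpoint and response format below are assumptions based on the project's README (default port 5010); verify the current API against the repo:

```python
# Sketch of consuming a local proxy pool from a crawler. The endpoint and
# response format are assumptions based on the project's README (default
# port 5010); check the repo for the current API.
import requests

def get_proxy() -> str:
    return requests.get("http://127.0.0.1:5010/get/").json()["proxy"]

def fetch(url: str) -> str:
    proxy = get_proxy()
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    return requests.get(url, proxies=proxies, timeout=10).text
```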
23. music-163 - NetEase Cloud Music comment crawler
Crawls the comments on every song on NetEase Cloud Music.
https://github.com/RitterHou/music-163