If you don't yet have any web-crawling background, some of this may feel difficult at first; I suggest working through the Python basics before tackling this crawler selection series.
Today I have compiled 23 Python crawler projects for you. I put this list together because crawling is a quick and easy topic to get started with, and it is great for helping newcomers build confidence. All links point to GitHub.
Selection of Python crawlers (23 GitHub crawlers to share)
Article Directory
- Selection of Python crawlers (23 GitHub crawlers to share)
- 1. WechatSogou - WeChat official account crawler
- 2. DouBanSpider - Douban Reading crawler
- 3. zhihu_spider - Zhihu crawler
- 4. bilibili-user - Bilibili user crawler
- 5. SinaSpider - Sina Weibo crawler
- 6. distribute_crawler - distributed novel-download crawler
- 7. CnkiSpider - CNKI crawler
- 8. LianJiaSpider - Lianjia.com crawler
- 9. scrapy_jingdong - JD.com crawler
- 10. QQ-Groups-Spider - QQ group crawler
- 11. wooyun_public - Wooyun crawler
- 12. Spider - hao123 website crawler
- 13. Findtrip - flight ticket crawler (Qunar and Ctrip)
- 14. 163spider - NetEase client content crawler based on requests, MySQLdb, and torndb
- 15. doubanspiders - crawlers for Douban movies, books, groups, albums, and more
- 16. QQSpider - QQ Space (Qzone) crawler
- 17. baidu-music-spider - Baidu MP3 site crawler
- 18. tbcrawler - Taobao and Tmall crawler
- 19. Stockholm - Shanghai/Shenzhen stock data crawler and stock-picking strategy backtesting framework
- 20. BaiduyunSpider - Baidu Cloud disk crawler
- 21. Spider - social data crawler
- 22. proxy_pool - Python crawler proxy IP pool
- 23. music-163 - NetEase Cloud Music comment crawler
1. WechatSogou - WeChat official account crawler
A WeChat official account crawler interface built on Sogou's WeChat search; it can be extended into a general crawler based on Sogou search. Results are returned as a list in which each item is a dictionary of details about one official account.
GitHub address:
https://github.com/Chyroc/WechatSogou
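As a quick illustration, here is a minimal usage sketch. The `wechatsogou` package name and the `WechatSogouAPI`/`search_gzh` names follow the project's README, but treat the exact API as an assumption and check the repository for current usage.

```python
# Minimal usage sketch (assumed API, based on the project's README; verify against the repo).
import wechatsogou  # pip install wechatsogou

ws_api = wechatsogou.WechatSogouAPI()

# Search official accounts by keyword; each result is expected to be a dict
# with fields such as the account name and profile URL.
for gzh in ws_api.search_gzh('python'):
    print(gzh)
```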
2. DouBanSpider - Douban Reading crawler
Crawls all books under a Douban Reading tag and stores them in Excel, ranked by rating, so you can easily filter and collect them, for example by selecting highly rated books with more than 1,000 ratings. Books can be saved to separate Excel sheets by topic. The crawler uses a User-Agent header to masquerade as a browser and adds random delays to better imitate browser behavior and avoid being blocked (a minimal sketch of that technique follows the link below).
GitHub address:
https://github.com/lanbing510/DouBanSpider
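The User-Agent disguise plus random-delay idea is straightforward with `requests`; the sketch below is illustrative only (the tag URL is an assumption, not taken from the repo):

```python
# Minimal sketch of crawling politely: browser-like User-Agent plus random delays.
import random
import time

import requests

HEADERS = {
    # Pretend to be a desktop browser so the server does not reject the request outright.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}

def fetch(url: str) -> str:
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    # Sleep 1-3 seconds between requests to imitate a human reader.
    time.sleep(random.uniform(1, 3))
    return resp.text

if __name__ == "__main__":
    html = fetch("https://book.douban.com/tag/%E7%BC%96%E7%A8%8B")  # the "programming" tag
    print(len(html))
```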
3. zhihu_spider - Zhihu crawler
This project crawls Zhihu user information and the topology of follow relationships between users. It uses the Scrapy crawler framework and stores data in MongoDB (a minimal pipeline sketch follows the link below).
GitHub address:
https://github.com/LiuRoy/zhihu_spider
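For readers new to Scrapy, storing items in MongoDB is usually done with an item pipeline. The sketch below shows the standard pattern; the database and collection names are illustrative, not taken from this repo:

```python
# A minimal Scrapy item pipeline that writes items to MongoDB.
# Database/collection names are illustrative; the project has its own pipeline.
import pymongo

class MongoPipeline:
    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.db = self.client["zhihu"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db["users"].insert_one(dict(item))
        return item
```

A pipeline like this is enabled by adding its class path to `ITEM_PIPELINES` in the project's `settings.py`.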
4. bilibili-user - Bilibili user crawler
Total records: 20,119,918. Captured fields: user ID, nickname, gender, avatar, level, experience points, number of fans, birthday, location, registration time, signature, and more. After crawling, it generates a Bilibili user data report.
GitHub address:
https://github.com/airingursb/bilibili-user
5. SinaSpider - Sina Weibo crawler
Mainly crawls Sina Weibo users' personal information, posts, fans, and followees. The code logs in with Sina Weibo cookies, and multiple accounts can be used to avoid being blocked by Sina (a sketch of cookie rotation follows the link below). It is built mainly on the Scrapy crawler framework.
GitHub address:
https://github.com/LiuXingMing/SinaSpider
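Multi-account, cookie-based login generally boils down to rotating a pool of cookie sets across requests. This is a generic sketch of the idea, not the project's actual code; the cookie values are placeholders:

```python
# Generic sketch of rotating logged-in cookie sets across requests to spread
# load over several accounts. Cookie names/values are placeholders.
import itertools

import requests

COOKIE_POOL = [
    {"SUB": "cookie-for-account-1"},  # placeholder values
    {"SUB": "cookie-for-account-2"},
]
_rotation = itertools.cycle(COOKIE_POOL)

def fetch(url: str) -> requests.Response:
    cookies = next(_rotation)  # a different account on each call
    return requests.get(url, cookies=cookies, timeout=10)
```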
6. distribute_crawler - distributed novel-download crawler
A distributed web crawler built with Scrapy, Redis, MongoDB, and Graphite: data is stored in a MongoDB cluster, tasks are distributed through Redis, and crawler status is displayed with Graphite. It mainly targets a single novel site.
GitHub address:
https://github.com/gnemoug/distribute_crawler
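The Redis part of such an architecture is essentially a shared task queue that every crawler node pops from. A minimal sketch with `redis-py` follows; the queue name and parsing step are illustrative:

```python
# Minimal sketch of a Redis-backed shared URL queue for distributed crawling.
# Every worker process or machine runs this loop against the same Redis server.
import redis
import requests

r = redis.Redis(host="localhost", port=6379)

def worker():
    while True:
        # BLPOP blocks until a URL is available, so idle workers just wait.
        _key, url = r.blpop("crawl:queue")
        html = requests.get(url.decode(), timeout=10).text
        # ... parse html, store the chapter, and push newly found URLs back:
        # r.rpush("crawl:queue", next_url)
```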
7. CnkiSpider - CNKI crawler
After setting the search conditions, run src/CnkiSpider.py to start crawling. The captured data is stored in the /data directory, and the first line of each data file contains the field names.
GitHub address:
https://github.com/yanzhou/CnkiSpider
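Since the first line of each data file holds the field names, the output can be loaded directly with `csv.DictReader`. The file name and delimiter below are assumptions; adjust them to the actual files under /data:

```python
# Read one of the crawler's output files, assuming the first line is a header row.
# File name and delimiter are assumptions; adjust to the actual files in /data.
import csv

with open("data/result.txt", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row)  # each row is a dict keyed by the field names
```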
8. LianJiaSpider - Lianjia.com crawler
Crawls Lianjia's historical second-hand housing transaction records in Beijing. It covers all the code from the accompanying Lianjia crawler article, including the simulated-login code.
GitHub address:
https://github.com/lanbing510/LianJiaSpider
9. scrapy_jingdong - JD.com crawler
A JD.com crawler based on Scrapy; results are saved in CSV format.
GitHub address:
https://github.com/taizilongxu/scrapy_jingdong
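With Scrapy, CSV output needs no custom code: the built-in feed exporter is selected on the command line. The skeleton below is illustrative; the spider name, URL, and selectors are assumptions, not this project's code:

```python
# Minimal Scrapy spider skeleton; run it with the built-in CSV feed exporter:
#   scrapy crawl jd -o items.csv
# Spider name, URL, and CSS selectors here are illustrative assumptions.
import scrapy

class JdSpider(scrapy.Spider):
    name = "jd"
    start_urls = ["https://search.jd.com/Search?keyword=python"]

    def parse(self, response):
        # Yielded dicts become rows in the CSV feed.
        for title in response.css("div.p-name a em::text").getall():
            yield {"title": title.strip()}
```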
10. QQ-Groups-Spider - QQ group crawler
Crawls QQ group information in batches, including group name, group number, member count, group owner, and group description, and finally generates XLS(X)/CSV result files.
GitHub address:
https://github.com/caspartse/QQ-Groups-Spider
11. wooyun_public - Wooyun crawler
Crawls and searches the public vulnerability reports and knowledge base of Wooyun (乌云, a former Chinese vulnerability-disclosure platform). The full list of public vulnerabilities and the text of every report are stored in MongoDB, about 2 GB in total. Crawling the whole site, including all text and images, for offline querying takes roughly 10 GB of space and 2 hours on a 10 Mb telecom connection; the complete knowledge base takes about 500 MB. Vulnerability search uses Flask as the web server and Bootstrap for the front end.
https://github.com/hanc00l/wooyun_public
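The Flask-plus-MongoDB search side of such a project can be as small as one route. The sketch below is generic; the route, database, collection, and field names are illustrative, not taken from wooyun_public:

```python
# Minimal sketch of a Flask search endpoint over vulnerability reports in MongoDB.
# Database/collection/field names are illustrative, not taken from wooyun_public.
from flask import Flask, request, jsonify
import pymongo

app = Flask(__name__)
coll = pymongo.MongoClient("mongodb://localhost:27017")["wooyun"]["bugs"]

@app.route("/search")
def search():
    keyword = request.args.get("q", "")
    # Simple substring match on the title; a real deployment would use a text index.
    docs = coll.find({"title": {"$regex": keyword}}, {"_id": 0, "title": 1}).limit(20)
    return jsonify(list(docs))

if __name__ == "__main__":
    app.run()
```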
12. Spider - hao123 website crawler
Uses hao123 as the entry page, iteratively crawls outbound links, collects URLs, and records each URL's internal and external link counts, page title, and other information. Tested on 32-bit Windows 7; it currently collects roughly 100,000 records every 24 hours.
https://github.com/simapple/spider
13. Findtrip - flight ticket crawler (Qunar and Ctrip)
Findtrip is a flight-ticket crawler based on Scrapy that currently integrates the two major Chinese ticket sites, Qunar and Ctrip.
https://github.com/fankcoder/findtrip
14. 163spider - NetEase client content crawler based on requests, MySQLdb, and torndb
https://github.com/leyle/163spider
15. doubanspiders - crawlers for Douban movies, books, groups, albums, and more
https://github.com/fanpei91/doubanspiders
16. QQSpider - QQ Space (Qzone) crawler
Crawls QQ Space blog posts, "shuoshuo" status updates, personal information, and more; it can capture about 4 million records per day.
https://github.com/LiuXingMing/QQSpider
17. baidu-music-spider - Baidu MP3 site crawler
Uses Redis to support resuming interrupted crawls (a sketch of the idea follows the link below).
https://github.com/Shu-Ji/baidu-music-spider
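Using Redis for resumable crawling usually means recording finished URLs in a persistent set so a restarted crawler skips them. A generic sketch, not the project's actual code:

```python
# Generic sketch of resumable crawling: completed URLs go into a Redis set,
# which survives restarts, so interrupted runs pick up where they left off.
import redis
import requests

r = redis.Redis()

def crawl(url: str):
    if r.sismember("crawled:urls", url):
        return  # already fetched in a previous run; skip
    html = requests.get(url, timeout=10).text
    # ... parse and store html ...
    r.sadd("crawled:urls", url)  # mark done only after success
```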
18. tbcrawler - Taobao and Tmall crawler
Fetches item page information from Taobao and Tmall by search keyword or item ID and stores the data in MongoDB.
https://github.com/pakoo/tbcrawler
19. Stockholm - Shanghai/Shenzhen stock data crawler and stock-picking strategy backtesting framework
Crawls all Shanghai and Shenzhen stock market data for a selected date range. Supports defining stock-picking strategies as expressions (a simplified sketch follows the link below) and multi-threaded processing, and saves data to JSON or CSV files.
https://github.com/benitoro/stockholm
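Defining a strategy "as an expression" typically means evaluating a boolean expression against each stock's fields. The sketch below shows one way that idea can work; the field names are invented for illustration and this is not the project's implementation:

```python
# Simplified sketch of expression-based stock picking: each stock is a dict of
# fields, and the strategy is a boolean expression over those fields.
# Field names below are invented for illustration.
stocks = [
    {"code": "600000", "pe": 8.2, "volume": 1_200_000},
    {"code": "000001", "pe": 35.0, "volume": 300_000},
]

strategy = "pe < 10 and volume > 1_000_000"

# eval() is used here only for brevity, on trusted input.
picked = [s for s in stocks if eval(strategy, {"__builtins__": {}}, s)]
print(picked)  # -> only the 600000 entry passes
```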
20. BaiduyunSpider - Baidu Cloud disk crawler
https://github.com/k1995/BaiduyunSpider
21. Spider - social data crawler
Supports Weibo, Zhihu, and Douban.
https://github.com/Qutan/Spider
22. proxy_pool - Python crawler proxy IP pool
https://github.com/jhao104/proxy_pool
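Once the pool is running, a crawler typically fetches a proxy from its local HTTP API and attaches it to outgoing requests. The endpoint and response format below are assumptions based on the project's README (default port 5010); verify the current API against the repo:

```python
# Sketch of consuming a local proxy pool from a crawler. The endpoint and
# response format are assumptions based on the project's README (default
# port 5010); check the repo for the current API.
import requests

def get_proxy() -> str:
    return requests.get("http://127.0.0.1:5010/get/").json()["proxy"]

def fetch(url: str) -> str:
    proxy = get_proxy()
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    return requests.get(url, proxies=proxies, timeout=10).text
```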
23. music-163 - NetEase Cloud Music comment crawler
Crawls the comments on every song on NetEase Cloud Music.
https://github.com/RitterHou/music-163