23 open-source Python crawler projects: WeChat, Taobao, Douban, Zhihu, Weibo, and more

Today we have compiled 23 Python crawler projects for everyone. We put this list together because crawlers are simple and quick to get started with, which also makes them well suited to beginners building confidence. All links point to GitHub; they cannot be opened directly inside WeChat, so follow the usual practice and open them on a computer.

Many people learn Python and don't know where to start.

Many people learn Python, master the basic syntax, and then don't know where to find practical cases to get hands-on experience.

Many people who have worked through the cases don't know how to go on to learn more advanced topics.

So, for these three kinds of people, I will provide you with a good learning platform where you can get video tutorials, e-books, and course source code for free!

QQ group: 701698587

1. WechatSogou – WeChat official account crawler

A WeChat official account crawler interface built on Sogou WeChat search, which can be extended into a crawler for Sogou search in general. The result is returned as a list, and each item is a dictionary of details about one official account.

github address:

https://github.com/Chyroc/WechatSogou
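
The description above boils down to: query Sogou's WeChat search and turn each result into a dictionary. As a rough illustration of that idea (not the project's actual API), a requests + BeautifulSoup sketch might look like this; the query parameters and CSS selectors are assumptions:

```python
# Minimal sketch of the idea: query Sogou WeChat search for official accounts
# and return a list of dictionaries. URL parameters and selectors are assumed
# for illustration and are not taken from the WechatSogou project itself.
import requests
from bs4 import BeautifulSoup

def search_official_accounts(keyword):
    resp = requests.get(
        "https://weixin.sogou.com/weixin",
        params={"type": 1, "query": keyword},   # type=1: account search (assumption)
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=10,
    )
    soup = BeautifulSoup(resp.text, "html.parser")
    results = []
    for node in soup.select("ul.news-list2 li"):    # hypothetical result selector
        name = node.select_one("p.tit")
        if name:
            results.append({"name": name.get_text(strip=True)})
    return results
```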

2. DouBanSpider – DouBan Book Crawler

Crawls all books under a Douban reading tag and stores them in Excel, ordered by rating, so they are easy to filter and collect, for example, high-scoring books with more than 1,000 ratings. Books can be stored in different Excel sheets by topic. It uses a User-Agent header to pretend to be a browser and adds random delays to better imitate browser behavior and keep the crawler from being blocked.

github address:

https://github.com/lanbing510/DouBanSpider
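
For readers new to the two anti-blocking tricks mentioned above, here is a minimal sketch of rotating a fake User-Agent and sleeping a random interval between requests; the tag URL follows Douban's public listing pattern, and everything else is illustrative:

```python
# Minimal sketch: pretend to be a browser via a random User-Agent header and
# add a random delay between requests, as the description above mentions.
import random
import time
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def fetch_tag_page(tag, start=0):
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    url = f"https://book.douban.com/tag/{tag}?start={start}"
    resp = requests.get(url, headers=headers, timeout=10)
    time.sleep(random.uniform(1, 3))   # random delay to imitate human browsing
    return resp.text
```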

3. zhihu_spider – Zhihu crawler

This project crawls Zhihu user information and the topology of users' follow relationships. The crawler is built with the Scrapy framework, and the data is stored in MongoDB.

github address:

https://github.com/LiuRoy/zhihu_spider
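
The Scrapy-plus-MongoDB combination usually comes down to an item pipeline that writes each scraped item into a collection. A minimal sketch, with database, collection, and field names assumed for illustration:

```python
# Minimal sketch of a Scrapy item pipeline that stores items in MongoDB.
# Names ("zhihu", "users", "user_id") are assumptions, not the project's code.
import pymongo

class MongoPipeline:
    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.collection = self.client["zhihu"]["users"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # upsert by user id so re-crawling the same user does not create duplicates
        self.collection.update_one(
            {"user_id": item["user_id"]}, {"$set": dict(item)}, upsert=True
        )
        return item
```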

4. bilibili-user – Bilibili user crawler

Total number of records: 20,119,918. Captured fields: user id, nickname, gender, avatar, level, experience points, number of followers, birthday, address, registration time, signature, etc. After crawling, a Bilibili user data report is generated.

github address:

https://github.com/airingursb/bilibili-user

5. SinaSpider – Sina Weibo Crawler

It mainly crawls Sina Weibo users' personal information, Weibo posts, fans, and followees. The code logs in with Sina Weibo cookies, and multiple accounts can be used to keep Sina from blocking the crawler. It is built mainly on the Scrapy framework.

github address:

https://github.com/LiuXingMing/SinaSpider
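
Logging in with cookies from several accounts typically means keeping a small cookie pool and picking one at random per request. A minimal sketch of that pattern, with placeholder cookie values and the mobile Weibo URL used only for illustration:

```python
# Minimal sketch of cookie-based login with account rotation, which is what
# "log in with multiple accounts" refers to. Cookie values are placeholders.
import random
import requests

COOKIE_POOL = [
    {"SUB": "cookie-value-for-account-1"},
    {"SUB": "cookie-value-for-account-2"},
]

def fetch_profile(uid):
    session = requests.Session()
    session.cookies.update(random.choice(COOKIE_POOL))   # rotate accounts
    resp = session.get(f"https://weibo.cn/{uid}", timeout=10)
    return resp.text
```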

6. distribute_crawler – distributed novel-download crawler

A distributed web crawler implemented with Scrapy, Redis, MongoDB, and Graphite: a MongoDB cluster for the underlying storage, Redis for distribution, and Graphite for displaying crawler status. It mainly targets a novel site.

github address:

https://github.com/gnemoug/distribute_crawler
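
The distribution part of such a setup usually amounts to a shared Redis queue of URLs that every worker machine pulls from. A minimal sketch of that idea (key names are assumptions; the real project layers this on Scrapy):

```python
# Minimal sketch of Redis-based work distribution: producers push URLs onto a
# shared list, and workers on any machine pop them off. Key name is assumed.
import redis

r = redis.StrictRedis(host="localhost", port=6379, db=0)

def push_task(url):
    r.lpush("novel:start_urls", url)                      # producer side

def pop_task(timeout=5):
    item = r.brpop("novel:start_urls", timeout=timeout)   # blocking pop on a worker
    return item[1].decode() if item else None
```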

7. CnkiSpider - CNKI (China National Knowledge Infrastructure) crawler.

After setting the search conditions, run src/CnkiSpider.py to capture the data, which is stored in the /data directory. The first line of each data file contains the field names.

github address:

https://github.com/yanzhou/CnkiSpider

8. LianJiaSpider - Lianjia crawler.

Crawls Lianjia's second-hand housing transaction records in Beijing over the years. Includes all the code from the accompanying Lianjia crawler article, including the simulated-login code.

github address:

https://github.com/lanbing510/LianJiaSpider
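
A "simulated login" in this context generally means posting credentials inside a requests.Session so that later page fetches carry the login cookies. A minimal sketch; the login URL and form fields are placeholders, not Lianjia's real endpoints:

```python
# Minimal sketch of a simulated login: the Session keeps the cookies returned by
# the login POST and reuses them on later requests. Endpoint and fields are placeholders.
import requests

session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0"

def login(username, password):
    session.post(
        "https://example.com/login",                      # placeholder login endpoint
        data={"username": username, "password": password},
        timeout=10,
    )
    return session                                        # session now carries cookies
```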

9. scrapy_jingdong - JD.com (Jingdong) crawler.

A JD.com crawler based on Scrapy; results are saved in CSV format.

github address:

https://github.com/taizilongxu/scrapy_jingdong
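
For readers wondering how a Scrapy project "saves as CSV", Scrapy's feed export does it declaratively. Below is a minimal, hedged spider sketch (the spider name, start URL, and selectors are placeholders, not the project's code) that writes every yielded item to items.csv:

```python
# Minimal sketch of a Scrapy spider whose items are exported to CSV via FEEDS.
# Run with: scrapy runspider jd_spider.py
import scrapy

class JdSpider(scrapy.Spider):
    name = "jingdong"                                            # placeholder name
    start_urls = ["https://search.jd.com/Search?keyword=python"] # placeholder entry
    custom_settings = {
        "FEEDS": {"items.csv": {"format": "csv", "encoding": "utf-8"}},
    }

    def parse(self, response):
        for li in response.css("li.gl-item"):                    # hypothetical selector
            yield {
                "title": li.css("div.p-name a::attr(title)").get(),
                "price": li.css("div.p-price i::text").get(),
            }
```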

10. QQ-Groups-Spider-QQ group crawler.

Grabs QQ group information in batches, including group name, group number, number of members, group owner, group introduction, etc., and finally generates XLS(X)/CSV result files.

github address:

https://github.com/caspartse/QQ-Groups-Spider

11. wooyun_public-wooyun crawler.

A crawler and search tool for WooYun's publicly disclosed vulnerabilities and knowledge base. The list of all public vulnerabilities and the full text of each vulnerability are stored in MongoDB, about 2 GB of content in total; crawling the whole site, with all text and images, for offline querying takes roughly 10 GB of space and about 2 hours (on 10 Mbps telecom bandwidth); crawling the entire knowledge base takes about 500 MB. The vulnerability search uses Flask as the web server and Bootstrap for the front end.

https://github.com/hanc00l/wooyun_public
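
The "Flask as web server" part of the description typically reduces to a route that queries the MongoDB collection by keyword. A minimal sketch, with the database, collection, and field names assumed for illustration:

```python
# Minimal sketch of a Flask search endpoint over a MongoDB vulnerability collection.
# Database, collection, and field names are assumptions, not the project's schema.
from flask import Flask, request, jsonify
import pymongo

app = Flask(__name__)
bugs = pymongo.MongoClient("mongodb://localhost:27017")["wooyun"]["bugs"]

@app.route("/search")
def search():
    keyword = request.args.get("q", "")
    docs = bugs.find({"title": {"$regex": keyword}}, {"_id": 0}).limit(20)
    return jsonify(list(docs))

if __name__ == "__main__":
    app.run()
```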

12. spider-hao123 website crawler.

Using hao123 as the entry page, it follows outbound links, collects URLs, and records each URL's number of internal and external links, page title, and other information. Tested on 32-bit Windows 7; it currently collects roughly 100,000 records every 24 hours.

https://github.com/simapple/spider
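
The bookkeeping described above (counting a page's internal versus external links and recording its title) can be sketched in a few lines; this is only an illustration of the idea, not the project's code:

```python
# Minimal sketch: extract every link on a page, count links that stay on the
# same domain (internal) versus links that leave it (external), and keep the title.
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

def count_links(page_url):
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    base_host = urlparse(page_url).netloc
    internal = external = 0
    for a in soup.find_all("a", href=True):
        host = urlparse(urljoin(page_url, a["href"])).netloc
        if host == base_host:
            internal += 1
        else:
            external += 1
    title = soup.title.string.strip() if soup.title and soup.title.string else ""
    return {"url": page_url, "title": title, "internal": internal, "external": external}
```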

13. findtrip-ticket crawler (Qunar and Ctrip).

Findtrip is a ticket crawler based on Scrapy, which currently integrates two major domestic ticket websites (Qunar + Ctrip).

https://github.com/fankcoder/findtrip

14. 163spider - NetEase client content crawler based on requests, MySQLdb, and torndb

https://github.com/leyle/163spider

15. doubanspiders - crawlers for Douban movies, books, groups, photo albums, Dongxi (things), and more

https://github.com/fanpei91/doubanspiders

16. QQSpider - QQ Zone (Qzone) crawler covering logs, shuoshuo posts, personal information, etc.; it can crawl about 4 million records a day.

https://github.com/LiuXingMing/QQSpider

17. baidu-music-spider - Baidu MP3 site crawler that uses Redis to support resuming an interrupted crawl.

https://github.com/Shu-Ji/baidu-music-spider
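
Using Redis to resume an interrupted crawl usually means remembering finished ids in a Redis set so a restarted crawler can skip them. A minimal sketch of that pattern; the key name and fetch callback are assumptions:

```python
# Minimal sketch of resumable crawling with Redis: record each finished id in a
# set, and skip ids already in the set after a restart. Key name is assumed.
import redis

r = redis.StrictRedis(host="localhost", port=6379, db=0)

def crawl_song(song_id, fetch):
    if r.sismember("crawled:song_ids", song_id):
        return None                          # already done before the interruption
    data = fetch(song_id)                    # perform the actual download
    r.sadd("crawled:song_ids", song_id)      # mark as done only after success
    return data
```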

18. tbcrawler - a crawler for Taobao and Tmall that fetches page information by search keyword and item id; the data is stored in MongoDB.

https://github.com/pakoo/tbcrawler

19. stockholm - a crawler for Shanghai and Shenzhen stock data and a framework for testing stock-selection strategies. It captures market data for all Shanghai and Shenzhen stocks over a selected date range, supports expressions for defining stock-selection strategies, supports multi-threaded processing, and saves data to JSON or CSV files.

https://github.com/benitoro/stockholm
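
An "expression to define a stock-selection strategy" can be pictured as a boolean expression evaluated against each stock's fields. The sketch below only illustrates that general idea with made-up field names and sample data; it is not the project's implementation:

```python
# Minimal sketch of expression-based stock selection: keep the rows for which a
# user-authored boolean expression evaluates to True. Data and fields are made up.
stocks = [
    {"code": "600000", "pe": 8.5, "volume": 1_200_000},
    {"code": "000001", "pe": 35.0, "volume": 300_000},
]

def select(expression, rows):
    # eval() is only acceptable here because the expression is written by the
    # user running the screen; each row's fields become names in the expression.
    return [row for row in rows if eval(expression, {}, row)]

print(select("pe < 20 and volume > 500000", stocks))   # -> keeps only "600000"
```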

20. BaiduyunSpider-Baidu cloud disk crawler.

https://github.com/k1995/BaiduyunSpider

21. Spider - social data crawler. Supports Weibo, Zhihu, and Douban.

https://github.com/Qutan/Spider

22. proxy_pool - a proxy IP pool for Python crawlers.

https://github.com/jhao104/proxy_pool

23. music-163 - crawls the comments on all songs on NetEase Cloud Music.

https://github.com/RitterHou/music-163

Source: blog.csdn.net/Python_kele/article/details/115038447