The Gods Are Silent: personal CSDN blog post directory
(The title riffs on an article from Tomato. It's a meme, so don't read too much into it.)
Last updated: 2023.2.6
First published: 2023.2.5
1. No-code crawler tools
- Octopus: I have used this one. It works very well on simple websites and is much easier than writing code.
Octopus Collector: free web crawler software and web page big-data scraping tool
- screen-scraper: data extraction software and services
- ivy
2. Crawlers written in code
2.1 IP Proxy
You can find free proxy pools on the Internet, but I ended up paying for one because the free pools were practically useless. The plan I used issues one IP address at a time, each valid for 3-5 minutes, and costs 150 yuan per month. I can't say how that price compares, because I haven't tried other providers.
Not tried yet:
A project for building your own proxy pool (the site requires access to the internet outside mainland China): Eeyhan/IPproxy: a proxy IP pool that crawls mainstream free proxies, de-duplicates them automatically, tests proxy availability automatically, and provides common request headers
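Since a paid service like the one above issues a single short-lived address at a time, the crawler has to notice when the address has expired and ask for a new one. Here is a minimal sketch of that bookkeeping. The `RotatingProxy` class, the `fetch_proxy` callback, and the 180-second lifetime are all assumptions made for illustration; a real provider will have its own API and its own expiry rules.

```python
import time
from typing import Callable, Dict, Optional


class RotatingProxy:
    """Caches one proxy address and refreshes it once it is considered expired."""

    def __init__(
        self,
        fetch_proxy: Callable[[], str],
        ttl_seconds: float = 180.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # Hypothetical callback: returns a "host:port" string from the provider.
        self._fetch_proxy = fetch_proxy
        # Assumed lifetime: 180 s is the low end of the 3-5 minute window.
        self._ttl = ttl_seconds
        self._clock = clock
        self._address: Optional[str] = None
        self._expires_at = 0.0

    def current(self) -> str:
        """Return the cached address, fetching a new one if the old one expired."""
        now = self._clock()
        if self._address is None or now >= self._expires_at:
            self._address = self._fetch_proxy()
            self._expires_at = now + self._ttl
        return self._address

    def as_requests_proxies(self) -> Dict[str, str]:
        """Build the mapping that requests accepts through its `proxies` argument."""
        address = self.current()
        return {"http": f"http://{address}", "https": f"http://{address}"}
```

With the requests library this could be used as `requests.get(url, proxies=pool.as_requests_proxies(), timeout=10)`, where `pool` is a `RotatingProxy` built around whatever call your provider offers.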
2.2 robots protocol
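As a small illustration of the robots protocol, the sketch below checks whether a URL may be crawled using `urllib.robotparser` from the Python standard library. The robots.txt content, the `my-crawler` user agent, and the example.com URLs are made up for the example; against a real site you would call `set_url()` and `read()` on the parser instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A path outside /private/ versus a path inside it.
print(parser.can_fetch("my-crawler", "https://example.com/books/1"))
print(parser.can_fetch("my-crawler", "https://example.com/private/x"))

# The delay (in seconds) the site asks crawlers to leave between requests.
print(parser.crawl_delay("my-crawler"))
```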
2.3 Python crawler helper tools
re
json
BeautifulSoup: parses HTML (better than regular expressions)
Beautiful Soup 4.4.0 Documentation - Beautiful Soup 4.2.0 Chinese Documentation
requests
urllib2 (Python 2 only; in Python 3 its functionality lives in urllib.request and urllib.error)
scrapy
Scrapy Tutorial Series: Web Scraping Using Python | AccordBox
Scrapy Getting Started Tutorial - Scrapy 0.24.1 Documentation
Fiddler: packet capture and analysis
Wireshark
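To show how a few of these tools fit together, here is a minimal sketch that uses only the standard library: `urllib.request` (the Python 3 successor to urllib2) downloads a page, `re` locates a JSON blob embedded in a `<script>` tag, and `json` decodes it. The function names, the user-agent string, and the sample HTML are assumptions for illustration, not taken from any real site.

```python
import json
import re
from typing import Any
from urllib.request import Request, urlopen

# Matches the body of the first <script type="application/json"> element.
SCRIPT_JSON = re.compile(
    r'<script[^>]*type="application/json"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def fetch_html(url: str, user_agent: str = "example-crawler/0.1") -> str:
    """Download a page and decode it using the charset the server declares."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_embedded_json(html: str) -> Any:
    """Return the decoded JSON embedded in the page, or None if there is none."""
    match = SCRIPT_JSON.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


# Inline sample so the parsing step can be tried without any network access.
SAMPLE_HTML = """
<html><body>
<script type="application/json" id="data">{"title": "Example Book", "rating": 8.5}</script>
</body></html>
"""

data = extract_embedded_json(SAMPLE_HTML)
print(data["title"], data["rating"])
```

A regular expression is fine for a narrow pattern like this one. For general HTML, a real parser such as BeautifulSoup is the more robust choice, as noted in the list above.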
2.4 Python crawler examples
CSDN does not allow this content to be published, so for now I am only announcing the planned targets here. The individual projects will probably not be updated any further.
- Douban
  - Crawling Douban book information found by keyword search: "Crawler practice project from scratch (1): searching Douban for books by keyword" - Juejin, or "How to write a crawler program to crawl Douban or Sina Weibo content?" - the answer of the wind, frost, sword and strict force - Zhihu
- Jinjiang
- Qidian (起点, "Starting Point")
- Fanqie (番茄, "Tomato")
- Sina News
- Xueqiu (雪球, "Snowball")
- East Money (东方财富网)
2.5 Other crawler learning materials
- Python advanced: crawlers, anti-crawling, and anti-anti-crawling, told through the pitfalls the author has run into, with a set of advanced crawler test questions - Eeyhan - cnblogs (Blog Garden): well written and detailed
- This may be the most comprehensive summary of practical web crawler know-how you have ever seen! - Tencent Cloud Developer Community: written by Cui Qingcai
- Introduction to the Three Ways Crawlers Crawl Dynamic Webpages | K0rz3n's Blog: focuses mainly on crawling dynamic web pages. In my Douban project I used the direct reverse backtracking method described there