Learn web crawlers from scratch, and you will be invincible in the end

The Gods Are Silent: personal CSDN blog post directory

(The title imitates the style of an article on Tomato. It's a meme, so don't read too much into it.)

Last update time: 2023.2.6
Earliest update time: 2023.2.5

1. No-code crawler tools

  1. Octopus: I have used this one. It works very well for simple websites and is much easier than writing code.
    Octopus Collector: free web crawler software and web-page big data scraping tool
  2. screen-scraper: Data extraction software and services
  3. ivy

2. Writing crawlers in code

2.1 IP Proxy

You can find free proxy pools on the Internet. I ended up paying for one, because the free pools were practically useless. The plan I used (one IP address issued at a time, each valid for 3-5 minutes) cost 150 yuan per month. I can't say whether that is a good price, since I haven't tried other providers.
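Because each paid IP is only valid for a few minutes, a crawler has to swap proxies as it runs. Below is a minimal sketch of one way to do that, not the provider's actual API: `fetch_address` is a hypothetical callable that asks your provider for a fresh "host:port" string, and the 180-second lifetime is an assumption taken from the lower end of the 3-5 minute range above.

```python
import time
from typing import Callable, Dict, Optional


def build_proxies(address: str) -> Dict[str, str]:
    """Turn 'host:port' into the mapping that requests accepts as `proxies=`."""
    url = f"http://{address}"
    return {"http": url, "https": url}


class ShortLivedProxy:
    """Cache one proxy address and fetch a new one once it is considered expired."""

    def __init__(
        self,
        fetch_address: Callable[[], str],  # hypothetical: returns a fresh "host:port"
        ttl_seconds: float = 180.0,        # assumption: treat each IP as valid for 3 minutes
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_address = fetch_address
        self._ttl = ttl_seconds
        self._clock = clock
        self._address: Optional[str] = None
        self._fetched_at = 0.0

    def current(self) -> Dict[str, str]:
        """Return a proxies mapping, refreshing the address when the TTL has passed."""
        now = self._clock()
        if self._address is None or now - self._fetched_at >= self._ttl:
            self._address = self._fetch_address()
            self._fetched_at = now
        return build_proxies(self._address)
```

With the requests library installed, the result can be passed straight through, for example `requests.get(url, proxies=pool.current(), timeout=10)`, where `pool` is a `ShortLivedProxy` instance.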

Not yet tried:
A proxy pool project (requires access to the external network): Eeyhan/IPproxy: a proxy IP pool that crawls mainstream free proxies, deduplicates them automatically, tests proxy availability automatically, and includes common request headers

2.2 robots protocol
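
A site's robots.txt file states which paths crawlers are asked to stay away from. As a minimal sketch, Python's standard `urllib.robotparser` module can check a URL against those rules. The robots.txt content and the `my-crawler` user agent below are made-up examples, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(SAMPLE_ROBOTS_TXT)
# Allowed by the "Allow: /" rule.
print(parser.can_fetch("my-crawler", "https://example.com/books/1"))
# Blocked by the "Disallow: /private/" rule.
print(parser.can_fetch("my-crawler", "https://example.com/private/data"))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` downloads the file instead of parsing a string.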

2.3 Python crawler auxiliary tools

re
json
BeautifulSoup: parses HTML (easier to use than regular expressions). Beautiful Soup 4.4.0 Documentation - Beautiful Soup 4.2.0 Chinese Documentation
requests
urllib2 (Python 2 only; in Python 3 the equivalent functionality lives in urllib.request)
scrapy
Scrapy Tutorial Series: Web Scraping Using Python | AccordBox
Scrapy Getting Started Tutorial - Scrapy 0.24.1 Documentation
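
To show how `re` and `json` from the list above fit together, here is a minimal sketch that pulls a JSON block out of a page and decodes it. The HTML snippet and its field names are invented for illustration. For anything beyond a simple, fixed pattern like this one, an HTML parser such as BeautifulSoup is usually the more robust choice.

```python
import json
import re

# Invented page content for illustration only.
SAMPLE_HTML = """
<html><body>
<script id="data" type="application/json">{"title": "Example Book", "rating": 8.5}</script>
</body></html>
"""

# Capture whatever sits between the opening and closing tags of a JSON script block.
SCRIPT_PATTERN = re.compile(
    r'<script[^>]*type="application/json"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_embedded_json(html: str) -> dict:
    """Return the first embedded JSON object in the page, decoded into a dict."""
    match = SCRIPT_PATTERN.search(html)
    if match is None:
        raise ValueError("no embedded JSON block found")
    return json.loads(match.group(1))


print(extract_embedded_json(SAMPLE_HTML)["title"])
```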

fiddler: packet capture analysis
wireshark

2.4 Python crawler example

CSDN does not allow these posts to be published, so for now I am only announcing the plan. The actual projects will probably not be updated further.

  1. Douban
    1. Crawling Douban book information by keyword search: "Crawler practice projects from zero (1): searching Douban books by keyword" (Juejin), or "How do I write a crawler to scrape Douban or Sina Weibo content?" (the answer of the wind, frost, sword and strict force, Zhihu)
  2. Jinjiang
  3. Qidian
  4. Tomato
  5. Sina News
  6. Xueqiu (Snowball)
  7. East Money

2.5 Other crawler learning materials

  1. Advanced Python: crawlers, anti-crawling, and anti-anti-crawling, based on pitfalls the author has hit, plus a set of advanced crawler exercises - Eeyhan - cnblogs (Blog Garden): well written and detailed
  2. This may be the most comprehensive summary of practical web crawler knowledge you have ever seen! - Tencent Cloud Developer Community: written by Cui Qingcai
  3. Introduction to the Three Ways Crawlers Crawl Dynamic Webpages | K0rz3n's Blog: focuses on crawling dynamic web pages. In my Douban project, I used the direct reverse backtracking method
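
As a rough sketch of what requesting a dynamic page's data directly can look like: this assumes "direct reverse backtracking" means finding the background request that returns the data (for example in the browser's network panel) and calling it yourself, rather than rendering the page. The endpoint, the `q` and `start` parameters, and the `items`/`title` response fields are all hypothetical.

```python
import json
from typing import List
from urllib.parse import urlencode

# Hypothetical endpoint; a real one has to be found by inspecting the site's requests.
API_BASE = "https://example.com/api/search"


def build_api_url(keyword: str, start: int = 0) -> str:
    """Build the URL of the (hypothetical) data request for one page of results."""
    return f"{API_BASE}?{urlencode({'q': keyword, 'start': start})}"


def parse_titles(response_text: str) -> List[str]:
    """Pull the title of each item out of the (assumed) JSON response body."""
    payload = json.loads(response_text)
    return [item["title"] for item in payload.get("items", [])]


print(build_api_url("python crawler", start=20))
print(parse_titles('{"items": [{"title": "A"}, {"title": "B"}]}'))
```

In a real crawler the response text would come from something like `requests.get(build_api_url(keyword), timeout=10).text`, usually with the same headers the browser sends.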

Origin blog.csdn.net/PolarisRisingWar/article/details/128891012