Python crawler column: learning notes

Notes on a Python column on Zhihu in which web crawlers are explained across several articles; this is a quick read-through of them.

Getting started

The basic principle of crawlers: use the simplest possible code to crawl the most basic web pages, show the core idea of a crawler, and let readers see that a crawler is actually a very simple thing.
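As a rough illustration of that "simplest possible crawler" idea (the URL below is just a placeholder, not one from the column), a minimal sketch with the requests library might look like this:

```python
import requests

# A minimal crawler: download one page and print part of its HTML source.
url = "https://example.com"                     # placeholder URL
response = requests.get(url)
response.encoding = response.apparent_encoding  # guard against mis-detected encodings
print(response.status_code)                     # 200 means the request succeeded
print(response.text[:500])                      # first 500 characters of the page
```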

Crawler code improvement: this part is a series of articles that looks at crawlers from a programming perspective and covers the basic code-design ideas you need to master. The earlier code is improved in two ways: first, from a code-design angle, getting readers used to defining functions, using generators, and so on; second, showing the code logic for crawling multiple pages and for grabbing second-level (detail) pages.
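A minimal sketch of those two ideas, assuming a hypothetical paginated listing at example.com with a made-up CSS class: functions organise the code, a generator yields page after page, and each detail link triggers a second-level request.

```python
import requests
from bs4 import BeautifulSoup

LIST_URL = "https://example.com/list?page={}"        # hypothetical paginated listing

def list_pages(max_page):
    """Generator that yields the HTML of each listing page in turn."""
    for page in range(1, max_page + 1):
        yield requests.get(LIST_URL.format(page)).text

def detail_links(html):
    """Extract links to second-level (detail) pages from one listing page."""
    soup = BeautifulSoup(html, "lxml")
    for a in soup.select("a.item-title"):             # hypothetical selector
        yield a["href"]

for html in list_pages(3):
    for link in detail_links(html):
        detail_html = requests.get(link).text         # grab the secondary page
        print(link, len(detail_html))
```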

Installation of crawler-related libraries: describes how to install every library used in this topic; some installs are simple, others a little more involved. This article mainly helps readers clear unnecessary obstacles out of the learning process.

After learning these three parts, readers can freely crawl websites that have a small amount of data and no anti-crawling mechanisms (and in fact there are still many such websites).

Web page parsing and data storage

Python offers many ways to parse web pages; at the same time, depending on the need, the data may be stored in different file formats or databases, so this part discusses the two together. Each part pairs theory with a practical example, and each uses one parsing library and stores the result in one file format or database.

Note: these parsing methods are largely interchangeable (although each has its own advantages and disadvantages), so you basically only need to master one of them, but it is best to master all of the file-storage methods.

Detailed explanation of beautifulsoup: this article comprehensively explains how to use the beautifulsoup parsing library.

bs4 + json crawling practice: an introduction to json; grab the latest Python question data from Stack Overflow and store it in a json file.
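A hedged sketch of the bs4 + json combination: the Stack Overflow tag URL is real, but the CSS class name is an assumption and the page layout may well have changed since the article was written.

```python
import json
import requests
from bs4 import BeautifulSoup

# Grab the newest Python questions from Stack Overflow and save them to a JSON file.
url = "https://stackoverflow.com/questions/tagged/python?sort=newest"
soup = BeautifulSoup(requests.get(url).text, "lxml")

questions = [
    {"title": a.get_text(strip=True),
     "link": "https://stackoverflow.com" + a["href"]}
    for a in soup.select("a.question-hyperlink")   # assumed class name; inspect the page first
]

with open("python_questions.json", "w", encoding="utf-8") as f:
    json.dump(questions, f, ensure_ascii=False, indent=2)
```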

Detailed explanation of xpath: this article comprehensively explains xpath parsing syntax.

xpath + mongodb crawling practice: crawl the Python crawler page data from Bole Online and store it in a mongodb database.
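A sketch of XPath parsing plus MongoDB storage, assuming a local MongoDB instance on the default port; the target URL and the XPath expressions are placeholders rather than the exact ones used in the article.

```python
import requests
from lxml import etree
import pymongo

client = pymongo.MongoClient("localhost", 27017)      # assumes a local MongoDB instance
collection = client["crawler"]["articles"]            # database/collection names are arbitrary

url = "https://example.com/articles"                   # placeholder listing URL
html = etree.HTML(requests.get(url).text)

titles = html.xpath("//div[@class='post']/h2/a/text()")   # hypothetical XPath expressions
links = html.xpath("//div[@class='post']/h2/a/@href")

for title, link in zip(titles, links):
    collection.insert_one({"title": title, "link": link})
```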

Detailed explanation of pyquery: this article comprehensively explains how to use pyquery's CSS parsing syntax.

pyquery + mysql crawling practice: crawl puppy listings from Ganji.com and store them in mysql.
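A sketch of pyquery's CSS selectors plus MySQL storage via the pymysql driver; the connection details, the `dogs` table, and the selectors are all assumptions.

```python
import requests
import pymysql
from pyquery import PyQuery as pq

# Assumes a local MySQL server with a `crawler` database and a `dogs(name, price)` table.
conn = pymysql.connect(host="localhost", user="root", password="password",
                       database="crawler", charset="utf8mb4")
cursor = conn.cursor()

doc = pq(requests.get("https://example.com/dogs").text)   # placeholder URL
for item in doc(".list .item").items():                   # hypothetical CSS selectors
    name = item.find(".title").text()
    price = item.find(".price").text()
    cursor.execute("INSERT INTO dogs (name, price) VALUES (%s, %s)", (name, price))

conn.commit()
conn.close()
```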

Regular expressions + csv/txt crawling practice: there are many online tutorials on regular expressions, so I will not go into detail here and only show how to apply them to web scraping. Here we grab the data from Douban Top250 that is relatively hard to structure.
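A hedged sketch of the regex + csv approach against Douban Top250; the regular expression is a simplified assumption about the page's HTML, and Douban normally expects a browser-like User-Agent.

```python
import csv
import re
import requests

headers = {"User-Agent": "Mozilla/5.0"}                 # Douban tends to reject bare requests
html = requests.get("https://movie.douban.com/top250", headers=headers).text

# Pull the movie titles out with a regular expression (a simplified pattern).
titles = re.findall(r'<span class="title">([^&<]+)</span>', html)

with open("douban_top250.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title"])
    for title in titles:
        writer.writerow([title])
```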

Detailed explanation of selenium: this article comprehensively explains how to use the selenium library.
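A minimal selenium sketch, assuming Chrome and a matching chromedriver are installed; the URL is a placeholder.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Drive a real browser so that JavaScript-rendered content ends up in the DOM.
driver = webdriver.Chrome()                              # assumes chromedriver is available
driver.get("https://example.com")                        # placeholder URL

for a in driver.find_elements(By.CSS_SELECTOR, "a"):     # print every link as a demo
    print(a.text, a.get_attribute("href"))

driver.quit()
```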

Selenium scraping practice: scraping Sina Weibo data.

Comparison of various web parsing libraries

So far, I have covered the popular web page parsing libraries and data storage methods. Once you have mastered these, web page parsing and data storage will no longer be difficult, and you can concentrate on conquering the various anti-crawling mechanisms.

Friendly reminder: students short on time do not actually need to study this part systematically. Skip it at first, look at some anti-crawling measures, and come back to these articles as reference documentation when you run into problems. That said, if you have learned it beforehand, writing the code will feel much more natural.

Gaining experience

In the course of actual crawling we will run into obstacles such as header checks, login verification, IP restrictions, dynamic loading and other anti-crawling measures, as well as topics like crawling apps and GET/POST requests.

These are all small problems. With the earlier foundation in place, they should not be hard to solve. The most important thing in this process is accumulation: the more you crawl and the more pits you fall into, the richer your experience naturally becomes.

This part takes the form of one article per problem. Given my limited experience, I certainly cannot cover every pit, so I will try to fill them in as I go!

Some simple anti-crawling techniques: including User-Agent settings and tricks, cookie settings, delays, and other basic ways of getting around anti-crawling measures.
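A sketch of these basics with requests; the header and cookie values are of course placeholders.

```python
import time
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # look like a browser
    "Referer": "https://example.com/",                          # placeholder referer
}
cookies = {"sessionid": "xxxx"}                                 # placeholder cookie value

for page in range(1, 4):
    url = f"https://example.com/list?page={page}"               # placeholder URL
    resp = requests.get(url, headers=headers, cookies=cookies)
    print(page, resp.status_code)
    time.sleep(2)   # a polite delay between requests helps avoid rate limits
```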

Using a proxy: when we crawl a lot of pages, our IP address may get blocked, so we can use proxies to keep switching IPs.
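A sketch of rotating proxies with requests; the proxy addresses are dummies and would come from a proxy pool in practice.

```python
import random
import requests

proxy_pool = [
    "http://127.0.0.1:8001",   # dummy proxy addresses; replace with a real proxy pool
    "http://127.0.0.1:8002",
]

def fetch(url):
    """Fetch a URL through a randomly chosen proxy."""
    proxy = random.choice(proxy_pool)
    try:
        return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=5)
    except requests.RequestException:
        return None            # dead proxy; a real crawler would retry with another

resp = fetch("https://example.com")    # placeholder URL
print(resp.status_code if resp else "all proxies failed")
```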

Crawling Ajax (dynamically loaded) web pages
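For Ajax pages, the usual approach is to find the underlying JSON interface in the browser's Network panel and call it directly; a hedged sketch with a placeholder endpoint and an assumed response structure:

```python
import requests

# Ajax pages typically load their data from a JSON API that can be requested directly,
# skipping HTML parsing. The endpoint, parameters, and response keys are placeholders
# that you would discover by watching the XHR requests in the developer tools.
api = "https://example.com/api/list"
params = {"page": 1, "size": 20}

data = requests.get(api, params=params).json()
for item in data.get("results", []):
    print(item.get("title"))
```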

After reading the above three articles, most web pages can be crawled freely. You can pause the anti-crawling techniques at this point and first learn the scrapy framework to make everyday crawling more convenient; if other web pages involve other anti-crawling measures, look them up when you meet them.

Introduction to packet capture

Web page status code analysis

POST requests

Grab app data

An in-depth introduction to requests

scrapy crawler framework series: this series starts from installing scrapy and its basic concepts, and gradually explains things in more depth.
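A minimal Scrapy spider sketch (normally created inside a project generated by `scrapy startproject`); the domain and the selectors are placeholders.

```python
import scrapy

class DemoSpider(scrapy.Spider):
    """Crawl listing pages, yield items, and follow the pagination link."""
    name = "demo"
    start_urls = ["https://example.com/list"]            # placeholder start URL

    def parse(self, response):
        for post in response.css("div.post"):            # hypothetical selectors
            yield {
                "title": post.css("h2 a::text").get(),
                "link": post.css("h2 a::attr(href)").get(),
            }
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

It can also be run standalone with `scrapy runspider demo_spider.py -o items.json`, which writes the yielded items to a file.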

 
