Python Crawler Tutorial (Very Detailed)

1. Basic introduction

1.1 What is a crawler

A crawler (also called a spider or web crawler) is a program that sends requests to a website, fetches the returned resources, and analyzes and extracts useful data from them.

From a technical point of view, a crawler uses a program to simulate the browser's behavior of requesting a site, downloads the HTML code / JSON data / binary data (images, videos) returned by the site, then extracts the data you need and stores it for later use.

1.2 Basic workflow of a crawler

Ways for users to obtain network data:

  • Method 1: browser submits request —> download web page code —> parse into page

  • Method 2: Simulate browser to send request (obtain webpage code) -> extract useful data -> store in database or file

What a crawler does is automate method 2.


1. Initiate a request

Use an HTTP library to send a Request to the target site

A Request contains the request headers, request body, etc.

Limitation of the requests module: it cannot execute JavaScript or CSS code

2. Get the response content

If the server can respond normally, you will get a Response

Response contains: HTML, JSON, images, videos, etc.

3. Parse content

Parsing HTML data: regular expressions (the re module), XPath (the most commonly used), Beautiful Soup, CSS selectors

Parsing JSON data: the json module

Parsing binary data: write it to a file in wb (binary write) mode

4. Save data

Save to a database (MySQL, MongoDB, Redis) or to a file.
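To make the flow concrete, here is a minimal sketch of steps 1 through 4 for binary data; the URL is a public test endpoint chosen for illustration, not one from the article:

```python
import requests

# Step 1-2: send the request and receive the response.
resp = requests.get("https://httpbin.org/image/png", timeout=10)

# Step 3-4: for binary data, "parsing" just means taking the raw bytes
# and writing them to disk in wb (binary write) mode.
if resp.status_code == 200:
    with open("sample.png", "wb") as f:
        f.write(resp.content)
```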

1.3 http protocol request and response

HTTP protocol

Request: the user sends a request to the server (socket server) through the browser (socket client)

Response: the server receives the request, analyzes the request information sent by the user, and returns data (the returned data may contain links to other resources, such as images, JS, and CSS)

ps: After the browser receives the Response, it parses the content and displays it to the user; a crawler program, after simulating the browser to send the request and receiving the Response, extracts the useful data from it.


1.3.1 request

(1) Request method

Common request methods: GET / POST

(2) Requested URL

A URL (Uniform Resource Locator) uniquely identifies a resource on the Internet. For example, an image, a file, or a video can each be uniquely located by its URL.

(3) Request header

User-Agent: if the request header carries no User-Agent, the server may treat the request as coming from an illegitimate client;

Cookie: cookies are used to carry login state

Note: crawlers generally add request headers.

Parameters that require attention in the request header:

Referer: where the visit came from (some large sites use the Referer header for hotlink protection, so crawlers should simulate it too)

User-Agent: identifies the visiting browser (add it, or the request will be treated as coming from a crawler)

Cookie: remember to carry it in the request header when login state is required

(4) Request body

For a GET request the body is empty (the parameters are appended to the URL and are directly visible); for a POST request the body is form data.

ps: 1. For login forms, file uploads, etc., the information is attached to the request body. 2. To inspect a login POST, submit a wrong username and password so the request stays visible; after a successful login the page usually redirects and the POST can no longer be captured.
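A minimal sketch of a request that carries these headers; the header values and the test URL are made up for illustration:

```python
import requests

# Hypothetical header values for illustration only.
headers = {
    "User-Agent": "Mozilla/5.0",            # pretend to be a browser
    "Referer": "https://www.example.com/",  # some sites check the access source
    "Cookie": "sessionid=xxxx",             # carry login state when needed
}

resp = requests.get("https://httpbin.org/headers", headers=headers, timeout=10)
print(resp.json())  # httpbin echoes back the headers it received
```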

1.3.2 response

(1) Response status code

200: success

301: permanent redirect

404: resource not found

403: access forbidden

502: bad gateway (server-side error)

(2)response header

Parameters to watch in the response header: Set-Cookie (e.g. Set-Cookie: BDSVRTM=0; path=/): there may be more than one of these headers, and they tell the browser to save the cookies.

(3) Response body

What the browser's Preview tab shows is the response content: the web page source code (HTML), JSON data, or binary data such as images.

2. Basic module

2.1 requests

requests is a simple, easy-to-use HTTP library written in Python; it is far more convenient than the built-in urllib.

Open source address:

https://github.com/pydmy…

Chinese API:

http://docs.python-requests.o…
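A minimal sketch of basic requests usage; the test URL is illustrative:

```python
import requests

resp = requests.get("https://httpbin.org/get", timeout=10)

print(resp.status_code)             # HTTP status code, e.g. 200
print(resp.headers["Content-Type"])  # response headers
print(resp.text[:200])              # response body as text
print(resp.content[:20])            # response body as raw bytes
```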

2.2 re regular expressions

In Python, regular expressions are used through the built-in re module.

Disadvantages: patterns are fragile when the page structure changes and take a lot of work to write and maintain
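A minimal sketch of extracting links with the re module; the HTML snippet is made up for illustration:

```python
import re

# A tiny HTML snippet made up for illustration.
html = '<a href="https://example.com/a">First</a> <a href="https://example.com/b">Second</a>'

# findall returns every capture-group match; .*? is a non-greedy match.
links = re.findall(r'<a href="(.*?)">(.*?)</a>', html)
print(links)  # [('https://example.com/a', 'First'), ('https://example.com/b', 'Second')]
```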

2.3 XPath

XPath (XML Path Language) is a language for finding information in XML documents, and can be used to traverse elements and attributes in XML documents.

In Python, the lxml library is mainly used for XPath extraction (inside the Scrapy framework you do not need lxml directly, since XPath can be applied to the framework's own selectors).

lxml is an HTML/XML parser whose main job is parsing and extracting HTML/XML data.

Like the re module, lxml is implemented in C, making it a high-performance Python HTML/XML parser; with XPath syntax we can quickly locate specific elements and node information.
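A minimal sketch of parsing HTML with lxml and XPath; the HTML fragment and the selectors are made up for illustration:

```python
from lxml import etree

# A made-up HTML fragment for illustration.
html = """
<ul>
  <li class="item"><a href="/story/1">Story one</a></li>
  <li class="item"><a href="/story/2">Story two</a></li>
</ul>
"""

tree = etree.HTML(html)  # build an element tree from the HTML text
titles = tree.xpath('//li[@class="item"]/a/text()')
hrefs = tree.xpath('//li[@class="item"]/a/@href')
print(titles)  # ['Story one', 'Story two']
print(hrefs)   # ['/story/1', '/story/2']
```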

2.4 BeautifulSoup

Like lxml, Beautiful Soup is also an HTML/XML parser, and its main job is parsing and extracting HTML/XML data.

Using Beautiful Soup requires importing the bs4 library.

Disadvantages: slower than regular expressions and XPath

Advantages: easy to use
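A minimal sketch of parsing a fragment with Beautiful Soup; the HTML and class names are made up for illustration:

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment for illustration.
html = '<div class="post"><h2>Title</h2><p class="body">Body text</p></div>'

soup = BeautifulSoup(html, "html.parser")        # "lxml" also works if lxml is installed
print(soup.h2.get_text())                        # Title
print(soup.find("p", class_="body").get_text())  # Body text
```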

2.5 Json

JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for people to read and write and easy for machines to parse and generate. It suits data-exchange scenarios, such as the interaction between a website's front end and back end.

In Python, the json module is mainly used to process JSON data. Online JSON viewer:

https://www.sojson.com/simple…
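A minimal sketch of converting between a JSON string and a Python dict with the json module:

```python
import json

raw = '{"name": "spider", "pages": 3}'  # a JSON string, e.g. from response.text

data = json.loads(raw)   # JSON string -> Python dict
data["pages"] += 1
print(data["name"], data["pages"])

text = json.dumps(data, ensure_ascii=False, indent=2)  # Python dict -> JSON string
print(text)
```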

2.6 threading

Use the threading module to create threads: inherit directly from threading.Thread, then override the __init__ method and the run method.
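A minimal sketch of that pattern; the class name and URLs are made up for illustration, and the crawling logic is only a placeholder:

```python
import threading

class CrawlThread(threading.Thread):
    """A thread that would fetch one URL; the crawling logic is a placeholder."""

    def __init__(self, url):
        super().__init__()
        self.url = url

    def run(self):
        # Real code would download and parse self.url here.
        print(f"{self.name} crawling {self.url}")

threads = [CrawlThread(f"https://example.com/page/{i}") for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```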

3. Method example

3.1 GET method example

demo_get.py
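The article names demo_get.py without showing its contents; a minimal sketch of what such a script might look like, with an illustrative test URL and made-up parameters:

```python
# demo_get.py (sketch)
import requests

headers = {"User-Agent": "Mozilla/5.0"}  # pretend to be a browser
params = {"kw": "python"}                # query string parameters

resp = requests.get("https://httpbin.org/get",
                    headers=headers, params=params, timeout=10)
print(resp.url)          # final URL with the parameters appended
print(resp.status_code)
print(resp.text[:200])
```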

3.2 POST method example

demo_post.py
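Likewise, a minimal sketch of what demo_post.py might look like; the form fields and test URL are made up for illustration:

```python
# demo_post.py (sketch)
import requests

headers = {"User-Agent": "Mozilla/5.0"}
data = {"username": "test", "password": "test"}  # made-up form fields

resp = requests.post("https://httpbin.org/post",
                     headers=headers, data=data, timeout=10)
print(resp.status_code)
print(resp.json()["form"])  # httpbin echoes back the submitted form data
```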

3.3 Using a proxy

demo_proxies.py
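A minimal sketch of what demo_proxies.py might look like; the proxy address is a placeholder and must be replaced with one you actually control:

```python
# demo_proxies.py (sketch)
import requests

# Placeholder proxy address; replace with a real proxy you have access to.
proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
}

resp = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
print(resp.text)  # shows the IP address the target site sees
```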

3.4 Fetching AJAX data example

demo_ajax.py
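A minimal sketch of what demo_ajax.py might look like; real AJAX endpoints are found in the browser's Network panel, and the test URL here is only illustrative:

```python
# demo_ajax.py (sketch)
import requests

headers = {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",  # many AJAX endpoints expect this header
}

resp = requests.get("https://httpbin.org/json", headers=headers, timeout=10)
data = resp.json()  # parse the JSON body into a Python dict
print(data)
```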

3.5 Multithreading example

demo_thread.py
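A minimal sketch of what demo_thread.py might look like; the URL list is made up for illustration:

```python
# demo_thread.py (sketch)
import threading
import requests

def fetch(url):
    resp = requests.get(url, timeout=10)
    print(url, resp.status_code, len(resp.text))

urls = [f"https://httpbin.org/get?page={i}" for i in range(5)]  # made-up URL list
threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]

for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for all downloads to finish
```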

4. Crawler framework

4.1 Scrapy framework

Scrapy is an application framework written in pure Python to crawl website data and extract structured data. It has a wide range of uses.

Scrapy uses the Twisted asynchronous networking framework to handle network communication, which speeds up downloads without us having to implement asynchrony ourselves, and it offers various middleware interfaces that can flexibly fulfill all kinds of needs.

4.2 Scrapy architecture diagram

4.3 Scrapy main components

Scrapy Engine (engine): responsible for the communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.

Scheduler (scheduler): responsible for accepting Requests sent by the engine, arranging them into a queue in a certain order, and returning them to the engine when the engine asks for them.

Downloader: responsible for downloading all the Requests sent by the Scrapy Engine and returning the Responses it obtains to the engine, which then hands them to the Spider for processing.

Spider (crawler): responsible for processing all Responses, analyzing and extracting data from them to fill the Item fields, and submitting any URLs that need to be followed up to the engine, which puts them into the Scheduler again.

Item Pipeline (pipeline): It is responsible for processing the items obtained in the Spider and performing post-processing (detailed analysis, filtering, storage, etc.).

Downloader Middlewares (download middleware): You can think of it as a component that can customize and extend the download function.

Spider Middlewares (Spider middleware): a component you can customize to extend and operate on the communication between the engine and the Spider (such as Responses entering the Spider and Requests going out from the Spider).

4.4 Operation process of Scrapy

Engine: Hi! Spider, which site are you dealing with?

Spider: Boss wants me to handle xxxx.com.

Engine: Give me the first URL that needs to be processed.

Spider: Here you are, the first URL is xxxxxxx.com.

Engine: Hi! Scheduler, I have a request to ask you to sort it into the queue for me.

Scheduler: OK, I'm processing it, please wait.

Engine: Hi! Scheduler, give me your processed request.

Scheduler: Here you are, this is the request I've processed.

Engine: Hi! Downloader, please help me download this request according to the boss's download middleware settings.

Downloader: OK! Here you go, here's the downloaded stuff. (If it fails: sorry, the download of this request failed. Then the engine tells the scheduler that the download of this request failed, please record it, we will download it later)

Engine: Hi! Spider, this is something that has been downloaded, and it has been processed according to the download middleware of the boss, you can handle it yourself (note! The responses here are handled by the def parse() function by default)

Spider: (for the URL that needs to be followed up after processing the data), Hi! Engine, I have two results here, this is the URL I need to follow up, and this is the Item data I got.

Engine: Hi! Pipeline, I have an Item here, please handle it for me! Scheduler, this is a URL that needs following up, please handle it for me. Then the cycle starts again from step four, until all the information the boss needs has been obtained.

Pipeline and Scheduler: OK, doing it now!

4.5 Making a Scrapy crawler in 4 steps

1. Create a new crawler project: scrapy startproject mySpider
2. Define the target (write items.py): open items.py in the mySpider directory
3. Create a spider (spiders/xxspider.py): scrapy genspider gushi365 "gushi365.com"
4. Store the content (pipelines.py): design a pipeline to store the crawled content
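A minimal sketch of the spider that step 3 would generate and that you would then fill in; the start URL and the parse logic are illustrative placeholders, not code from the article:

```python
# mySpider/spiders/gushi365.py (sketch)
import scrapy

class Gushi365Spider(scrapy.Spider):
    name = "gushi365"
    allowed_domains = ["gushi365.com"]
    start_urls = ["http://gushi365.com/"]

    def parse(self, response):
        # The selector below is a placeholder; adapt it to the real page structure.
        for title in response.xpath("//a/text()").getall():
            yield {"title": title}
```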

5. Common tools

5.1 Fiddler

Fiddler is a packet-capture tool, mainly used for capturing traffic from mobile phones.

5.2 XPath Helper

The XPath Helper plugin is a free Chrome extension for analyzing web pages when writing crawlers. It helps users solve problems such as XPath expressions failing to locate elements.

Installation and use of the Google Chrome plug-in xpath helper:

https://jingyan.baidu.com/art…

6. Distributed crawlers

6.1 scrapy-redis

Scrapy-redis provides some Redis-based components (install with pip install scrapy-redis) to make distributed crawling with Scrapy more convenient.
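A minimal sketch of the settings.py additions that switch a Scrapy project over to scrapy-redis; the Redis address is a placeholder:

```python
# settings.py additions (sketch)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # Redis-backed scheduler
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # Redis-based request fingerprint dedup
SCHEDULER_PERSIST = True                                    # keep the queue between runs

ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,            # store scraped items in Redis
}

REDIS_URL = "redis://127.0.0.1:6379"                        # placeholder Redis address
```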

6.2 Distributed strategy

Master side (core server): runs a Redis database; it does not crawl anything itself and is only responsible for URL fingerprint deduplication, Request distribution, and data storage.

