[Python] A Detailed Explanation of Web Crawlers (Understand at a Glance)

What is a crawler?
Simply put, a crawler is a program that automatically obtains data from the network.

The principle of crawlers
If we want to obtain data from the network, we first give the crawler a web address (usually written as url in code). The crawler sends an HTTP request to the server of the target web page, the server returns data to the client (that is, our crawler), and the crawler then performs a series of operations such as parsing and storing the data.
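As a preview of what this looks like in code, here is a minimal sketch of that request-parse-store cycle, written with the requests library introduced later in this article (example.com is just a placeholder URL):

```python
import requests

# Give the crawler a URL (a placeholder here)
url = "https://example.com"

# The crawler sends an HTTP request to the target server
response = requests.get(url)

# The server returns data to the client (our crawler)
html = response.text

# The crawler then analyzes and stores the data (here: save it to a file)
with open("page.html", "w", encoding="utf-8") as f:
    f.write(html)
```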

The process
A crawler can save us time. For example, suppose I want the Douban Movie Top250 list. Without a crawler, we would first enter the Douban Movie URL in a browser; the client (the browser) resolves the IP address of the Douban Movie server and establishes a connection with it; the browser then creates an HTTP request and sends it to the server; after receiving the request, the server extracts the Top250 list from its database, encapsulates it in an HTTP response, and returns the response to the browser; finally, the browser displays the response content and we see the data. Our crawler follows the same process, just in code form.
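Translated into code, that Douban flow might look like the sketch below; the User-Agent header (discussed in the request-header section) lets the crawler identify itself like a browser, since Douban may refuse anonymous requests:

```python
import requests

url = "https://movie.douban.com/top250"

# A browser-like identity; many sites, Douban included, may reject
# clients that do not send a User-Agent header.
headers = {"User-Agent": "Mozilla/5.0"}

# requests resolves the server's address, connects, builds the HTTP
# request, and sends it, i.e. the steps the browser performed for us.
response = requests.get(url, headers=headers)

# The body of the HTTP response is the page the browser would render.
print(response.status_code)
print(response.text[:200])  # first 200 characters of the HTML
```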

HTTP request

An HTTP request consists of a request line, a request header, a blank line, and a request body.
The request line consists of three parts:

    1. The request method; common methods include GET, POST, PUT, DELETE, and HEAD
    2. The path of the resource the client wants to access
    3. The version number of the HTTP protocol the client uses

The request headers are a supplementary description that the client sends to the server along with the request, for example indicating the identity of the visitor; this is discussed further below.

The request body is the data the client submits to the server, such as the account and password information a user fills in when logging in. A blank line separates the request headers from the request body. Not every request has a body; an ordinary GET request, for example, has none.

For example, when you log in to Douban, the browser sends an HTTP POST request to the server, with the username and password carried in the request body.
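To see those parts concretely, the sketch below builds such a POST request with the requests library and prints its components; the URL (httpbin.org, a public test service) and the field names are placeholders, not Douban's real login endpoint:

```python
import requests

# A hypothetical login form; URL and field names are made up for illustration.
req = requests.Request(
    "POST",
    "https://httpbin.org/post",
    data={"username": "alice", "password": "secret"},
)
prepared = req.prepare()

# Request line: the method and the resource being requested
print(prepared.method, prepared.url)

# Request headers: supplementary description of the request
print(prepared.headers)

# Request body: the form data submitted to the server
print(prepared.body)  # username=alice&password=secret
```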


HTTP response

The format of an HTTP response is very similar to that of a request: it consists of a response line, response headers, a blank line, and a response body.
The response line also contains three parts: the server's HTTP version number, the response status code, and the status description.

Here is a table of common status codes and the meaning of each one:

    200 OK: the request succeeded
    301 Moved Permanently: the resource has been moved to a new permanent URL
    403 Forbidden: the server refuses to fulfill the request
    404 Not Found: the requested resource does not exist
    500 Internal Server Error: the server encountered an internal error
The second part is the response headers. They correspond to the request headers and are additional instructions from the server about the response, such as the format of the response content, its length, and when it was returned to the client; some cookie information may also be placed in the response headers.

The third part is the response body, which carries the actual response data; for a web page, this is the HTML source code.
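All three parts can be inspected from Python as well; a small sketch, again with a placeholder URL:

```python
import requests

r = requests.get("https://example.com")

# Response line: the status code and its description
print(r.status_code, r.reason)  # e.g. 200 OK

# Response headers: format, length, cookies, and other metadata
print(r.headers.get("Content-Type"))
print(r.headers.get("Content-Length"))

# Response body: the HTML source of the page
print(r.text[:200])
```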


How to write crawler code

Crawlers can be written in many languages, such as Python and C++, but I think Python is the easiest, because Python has ready-made libraries that are almost perfectly encapsulated.

C++ also has ready-made libraries, but its crawler ecosystem is relatively small, the existing libraries are not simple enough to use, and the code is often incompatible across compilers, or even across different versions of the same compiler, so it is not particularly practical. That is why this article focuses on Python crawlers.

Install the requests library

In cmd, run pip install requests to install the requests library.

Then, in IDLE or your editor (I personally recommend VS Code or PyCharm), enter import requests and run it; if no error is reported, the installation succeeded.

Most libraries are installed the same way: pip install xxx (where xxx is the library's name).
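A quick way to confirm the installation from Python itself (the printed version will vary with your environment):

```python
import requests

# If the import succeeds, requests is installed; print its version to be sure.
print(requests.__version__)
```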

Methods of the requests library

requests.request() Constructs a request; the base method that underpins each of the methods below
requests.get() The main method for fetching an HTML page, corresponding to HTTP's GET
requests.head() Gets the header information of an HTML page, corresponding to HTTP's HEAD
requests.post() Submits a POST request to an HTML page, corresponding to HTTP's POST
requests.put() Submits a PUT request to an HTML page, corresponding to HTTP's PUT
requests.patch() Submits a partial-modification request to an HTML page, corresponding to HTTP's PATCH
requests.delete() Submits a delete request to an HTML page, corresponding to HTTP's DELETE
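Here is a hedged sketch exercising a few of these methods against httpbin.org, a public echo service used purely as a test endpoint:

```python
import requests

# GET: fetch a page (the workhorse method)
r = requests.get("https://httpbin.org/get")
print(r.status_code)

# HEAD: fetch only the headers, without the response body
r = requests.head("https://httpbin.org/get")
print(r.headers.get("Content-Type"))

# POST: submit data in the request body
r = requests.post("https://httpbin.org/post", data={"key": "value"})
print(r.status_code)
```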

The most commonly used get method

r = requests.get(url)

This call involves two important objects:

It constructs a Request object to request resources from the server, and it returns a Response object containing the server's resources.

r.status_code The status code of the HTTP response; 200 means success, while codes such as 404 indicate failure
r.text The string form of the HTTP response content, i.e. the page content at the url
r.encoding The encoding of the response content guessed from the HTTP headers (if no charset is present in the headers, the encoding is assumed to be ISO-8859-1)
r.apparent_encoding The encoding of the response content analyzed from the content itself (an alternative encoding)
r.content The binary form of the HTTP response content
requests.ConnectionError Network connection error, such as a DNS lookup failure or a refused connection
requests.HTTPError HTTP error exception
requests.URLRequired Missing-URL exception
requests.TooManyRedirects Raised when the maximum number of redirects is exceeded
requests.ConnectTimeout Raised when connecting to the remote server times out
requests.Timeout Raised when a request to a URL times out
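Putting the attributes and exceptions together, a common defensive pattern looks roughly like this sketch (the URL is a placeholder):

```python
import requests

def get_html(url, timeout=10):
    """Fetch a page and return its text, or an error message on failure."""
    try:
        r = requests.get(url, timeout=timeout)
        # Raises requests.HTTPError if the status code signals failure
        r.raise_for_status()
        # Fall back to the encoding analyzed from the content itself
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException as e:
        # Base class covering ConnectionError, Timeout, TooManyRedirects, ...
        return f"Request failed: {e}"

print(get_html("https://example.com")[:200])
```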

