Use of the Requests library

How crawlers work

First, let's review the basic principle of how a crawler works (stated below in more formal terms): a crawler is an automated program used to obtain information from the Internet. It typically proceeds as follows:

  1. Initiate a request: the crawler first sends an HTTP request to the target website to ask for the content of a page. The request can carry specific parameters, headers, authentication, and so on.
  2. Receive the response: after receiving the request, the target website returns an HTTP response. The response contains the page content and other related information, such as the status code and headers.
  3. Parse the page: after receiving the response, the crawler parses the page to extract the required data. Common parsing methods include regular expressions, XPath, and CSS selectors.
  4. Extract the data: by parsing the page, the crawler extracts the required data, which can take various forms such as text, images, and links.
  5. Store the data: the crawler stores the extracted data in local files or a database for subsequent processing and analysis.
  6. Process the next page: after crawling a single page, the crawler may need to move on to the next one. This usually involves pagination: the content of the next page can be obtained by simulating user clicks or by modifying URL parameters.
  7. Repeat: the crawler loops over the steps above as needed to obtain more data or information from different pages.

As users, what we see is almost always a graphical interface, but a crawler works with the data behind the page: details that are invisible on the surface, such as the status code and the cookies the page carries. In effect, the crawler simulates the process of a user visiting a web page, as the sketch below illustrates.
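To make these seven steps concrete, here is a minimal sketch that strings them together. The site address https://example.com/articles, the page parameter, and the <h2> title pattern are hypothetical placeholders, not a real target:

import re
import requests

# hypothetical paginated site; adapt the URL and pattern to a real target
base_url = 'https://example.com/articles'

with open('titles.txt', 'w', encoding='utf-8') as f:
    for page in range(1, 4):  # steps 6-7: turn pages by modifying a URL parameter
        response = requests.get(base_url, params={'page': page})  # step 1: initiate a request
        if response.status_code != 200:  # step 2: receive and check the response
            break
        # steps 3-4: parse the page and extract data with a regular expression
        titles = re.findall(r'<h2>(.*?)</h2>', response.text)
        for title in titles:
            f.write(title + '\n')  # step 5: store the data in a local file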

The Requests library

To access web pages from code, we need a corresponding tool: the Requests library. Requests is an HTTP library through which we can send requests to a website.

Installation

Install the library in PyCharm. If the installation is unsuccessful, you can switch to a domestic mirror source, which usually makes the installation smoother.
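If installation inside PyCharm fails, you can also install from the command line with pip; the Tsinghua PyPI mirror shown below is one example of a domestic mirror source:

pip install requests
# or, via a domestic mirror source:
pip install requests -i https://pypi.tuna.tsinghua.edu.cn/simple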

Use of the Requests library

Open PyCharm, create a new file, and enter import requests to check that it imports without errors. Next, let's look at the basic syntax of Requests:

In Python, requests is a popular third-party library for sending HTTP requests. It provides a simple and user-friendly interface that makes sending HTTP requests very easy. Here are some common usages and syntax of the requests library:

  1. Send a GET request:
import requests

url = 'https://www.baidu.com'  # example target URL
response = requests.get(url)

where url is the target URL you want to send the request to. The requests.get() method sends a GET request to the specified URL and returns a Response object.

  2. Send a GET request with parameters:
import requests

url = 'https://httpbin.org/get'  # example target URL, used here only for illustration
payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.get(url, params=payload)

In the example above, payload is a dictionary containing the parameters. The params argument of requests.get() appends these parameters to the requested URL.

Because the content of a GET request can be seen in the link, the payload here appears directly in the URL as a query string, for example ?key1=value1&key2=value2.
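You can confirm this by printing response.url after sending the request. A minimal sketch, using Baidu as an example target (it simply ignores the extra parameters):

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.get('https://www.baidu.com', params=payload)
# the parameters are appended to the URL as a query string
print(response.url)  # e.g. https://www.baidu.com/?key1=value1&key2=value2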

  3. Send a POST request:
import requests

url = 'https://httpbin.org/post'  # example test endpoint that accepts POST
payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.post(url, data=payload)

In this example, payload is a dictionary containing the data to send. The requests.post() method sends a POST request to the specified URL with payload as the request data. Unlike a GET request, these parameters are not visible in the URL.
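To see the difference in practice, the public echo service httpbin.org (used here purely as an example target) reflects back what it receives:

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.post('https://httpbin.org/post', data=payload)
print(response.url)   # https://httpbin.org/post -- the parameters do not appear in the URL
print(response.text)  # httpbin echoes the submitted form data in the response body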

  4. Handle the response:
import requests

url = 'https://www.baidu.com'  # example target URL
response = requests.get(url)
print(response.status_code)  # print the status code
print(response.text)  # print the response content

response.status_code is the status code of the response and is used to determine whether the request succeeded. response.text is the content of the response and can be used to get the data returned by the server.
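For example, a common pattern is to branch on the status code, or to let Requests raise an error automatically (raise_for_status() is part of the requests API):

import requests

response = requests.get('https://www.baidu.com')
if response.status_code == 200:  # 200 means the request succeeded
    print('request succeeded')
response.raise_for_status()      # raises requests.HTTPError for 4xx/5xx status codes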

Status code

Take Baidu as an example, with https://www.baidu.com as the URL to visit. Printing response.status_code gives a result consistent with printing response itself: the return code is 200, which means we can access the page normally.
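In code, this check looks roughly like the following sketch:

import requests

response = requests.get('https://www.baidu.com')
print(response)              # <Response [200]>
print(response.status_code)  # 200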

Web content

Next, let's obtain the content of the web page, again taking Baidu as an example. Here response.text is used for output, and you can see the content of the page, but some of it is garbled. The page source shows that the original encoding of the web page is utf-8, so we keep the response consistent with it by setting its encoding to utf-8. At this point, the content of the page is displayed normally.
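A minimal sketch of this fix (the encoding requests guesses from the headers may vary):

import requests

response = requests.get('https://www.baidu.com')
print(response.encoding)     # the encoding guessed from the response headers; may not be utf-8
response.encoding = 'utf-8'  # keep it consistent with the page's original encoding
print(response.text)         # the content is now decoded and displayed normally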
