[Python] network request

Table of contents

1. Network request process

  1. HTTP
  2. URL
  3. Network transmission model
  4. Long and short connections

2. Crawler basics

  1. Basic concepts
  2. Sending requests
  3. Request methods
  4. Cookies
  5. retrying

1. Network request process

1. HTTP

  1. The user enters a URL, such as www.baidu.com
  2. The browser first queries DNS to resolve the URL's domain name to an IP address (the port comes from the URL or the protocol's default)
  3. The browser sends a request to the server at that IP address, passing along the parameters contained in the URL
  4. The backend builds SQL statements from the received parameters and queries the database
  5. The database returns the results to the backend
  6. The backend organizes the data into a unified format and returns it to the frontend
  7. The frontend renders and processes the data with frontend code and displays it on the page for the user

2. URL

URL example:

https://www.bilibili.com/video/BV1dK4y1A7aM/?spm_id_from=333.1007.top_right_bar_window_history.content.click&vd_source=18bd1edf91a238e3df510f2409d7b427

The four parts of the URL:

  • Protocol: https://
  • Domain name: www.bilibili.com/
  • Resource path: video/BV1dK4y1A7aM/
  • Parameters: spm_id_from=333.1007.top_right_bar_window_history.content.click&vd_source=18bd1edf91a238e3df510f2409d7b427

In general, the resource path and the parameter part are separated by '?', and multiple parameters are separated by '&'.

A URL must at least contain a protocol and a domain name; if the protocol is omitted, the browser fills in https:// by default.

The domain name resolves to an IP address and a port number.
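
As a quick check, the standard library's urllib.parse can split a URL into these parts. A minimal sketch using the example URL above:

from urllib.parse import urlparse, parse_qs

url = ("https://www.bilibili.com/video/BV1dK4y1A7aM/"
       "?spm_id_from=333.1007.top_right_bar_window_history.content.click"
       "&vd_source=18bd1edf91a238e3df510f2409d7b427")

parts = urlparse(url)
print(parts.scheme)           # protocol: https
print(parts.netloc)           # domain name (host[:port]): www.bilibili.com
print(parts.path)             # resource path: /video/BV1dK4y1A7aM/
print(parse_qs(parts.query))  # parameters, split on '&' into a dict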

3. Network transmission model

The HTTP protocol (HyperText Transfer Protocol) is an application-layer protocol for transferring data between browsers and web servers; it is built on top of the TCP protocol.

Once the application layer, transport layer, network layer, and link layer are consistent on both sides:

  1. On the client, the data is passed down from the application layer to the link layer
  2. The data is transmitted over the network via Ethernet and IP
  3. The data is verified (for example with checksums) to ensure it is authentic and reliable
  4. On the server, the link layer passes the data back up to the application layer, where it can be processed further
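
To make the "HTTP rides on TCP" point concrete, here is a minimal sketch that opens a TCP socket with Python's standard library and sends a hand-written HTTP/1.1 request (example.com is used purely as a neutral illustration host, not taken from the article):

import socket

# Open a TCP connection (transport layer) to the web server
with socket.create_connection(("example.com", 80)) as sock:
    # HTTP is just text sent over that TCP connection (application layer)
    request = (
        "GET / HTTP/1.1\r\n"
        "Host: example.com\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    sock.sendall(request.encode())

    # Read the raw response bytes until the server closes the connection
    response = b""
    while chunk := sock.recv(4096):
        response += chunk

print(response.split(b"\r\n")[0].decode())  # e.g. "HTTP/1.1 200 OK"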

4. Long and short connections

In HTTP/1.0, the server uses short connections by default: the browser and the server establish a new connection for each HTTP operation, and the connection is closed as soon as the task ends.

Since HTTP/1.1, browsers use long connections by default. When the browser loads an HTML page (or other page type) that references other web resources such as JS files, image files, and CSS files, those resources can be fetched over the same established HTTP connection.

Connection: keep-alive

keep-alive is an agreement between the client and the server (i.e. a long connection): the server does not close the TCP connection after returning the response, the client does not close it after receiving the response, and the same connection is reused for the next HTTP request.

  • Short connections are relatively simple for the server to implement: every existing connection is an active one, and no extra control mechanism is needed
  • With short connections, the connection is torn down after each request/response pair, so a new connection must be established for every request; a large number of connections can be created in a short period of time, slowing down the server's responses
  • Short connections do not hold many resources on the server side, but they increase the user's waiting time and slow down access
  • Long connections save repeated TCP setup and teardown and improve efficiency
  • Once a long connection is established, multiple requests and responses can be sent over it; when the two sides stop communicating, the server disconnects it
  • Long connections increase the server's resource overhead, which may overload the server and eventually make the service unavailable
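
In requests, a Session object keeps the underlying TCP connection alive (Connection: keep-alive) and reuses it across requests. A minimal sketch comparing the two styles (the timing numbers are only indicative):

import requests
import time

url = "https://www.baidu.com"

# Short-connection style: each call sets up and tears down its own connection
start = time.time()
for _ in range(5):
    requests.get(url)
print("separate connections:", time.time() - start)

# Long-connection style: a Session reuses one keep-alive connection
session = requests.Session()
start = time.time()
for _ in range(5):
    session.get(url)
print("reused connection:", time.time() - start)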

2. Crawler basics

1. Basic concepts

What is a crawler? A crawler simulates a user's network requests to obtain data from a target website. In theory, any data that can be displayed in a browser can also be obtained by a crawler.

Crawler process (sketched in code below):

  • Send a request to a URL and get the response
  • Parse the response
  • If new URLs need to be extracted, send requests to them and get their responses
  • Extract the data and save it
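
Put together, the loop above might look like the rough sketch below. The start URL and the link-extraction regex are placeholders for illustration, not a real project:

import re
import requests

# Placeholder start page; a real crawler would use its own target site
start_url = "https://www.baidu.com"

# Send a request to the url and get the response
response = requests.get(start_url)
html = response.content.decode()

# Extract follow-up urls (a crude regex, purely for illustration)
links = re.findall(r'href="(https?://[^"]+)"', html)

# Continue sending requests to a few of the extracted urls
for link in links[:3]:
    sub_response = requests.get(link)
    print(link, sub_response.status_code)  # extracting/saving data goes here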

HTTP and HTTPS:

  • HTTP: Hypertext Transfer Protocol, the default port number is 80
  • HTTPS: HTTP + SSL (Secure Sockets Layer), Hypertext Transfer Protocol over a secure socket layer; the default port number is 443

Common HTTP request headers:

  • Host: host and port number
  • Connection: connection type
  • Upgrade-Insecure-Requests: asks the server to upgrade the request to HTTPS
  • User-Agent: browser identification (name and version)
  • Accept: acceptable response content types
  • Referer: the page the request came from (the jump-off point)
  • Accept-Encoding: acceptable content encodings (compression formats)
  • Cookie: state and session parameters

User-Agent and Cookie are the most important of these. They indicate where the request comes from and are the best parameters for making a crawler's request look like it comes from a real user.

Response status codes:

  • 200: success
  • 302: temporary redirect to a new url
  • 307: temporary redirect to a new url
  • 404: the page cannot be found
  • 403: access to the resource is forbidden
  • 500: internal server error
  • 503: the server is unavailable, usually the result of anti-crawler measures
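
In code, the status code can be inspected directly, or requests can be asked to raise an exception for error codes. A minimal sketch (note that requests follows redirects by default, so 302/307 usually do not surface as the final status):

import requests

response = requests.get("https://www.baidu.com")

print(response.status_code)          # e.g. 200 on success

if response.status_code == 200:
    print("request succeeded")
elif response.status_code in (302, 307):
    print("temporarily redirected to a new url")

# Alternatively, let requests raise an HTTPError for 4xx/5xx responses
response.raise_for_status()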

2. Sending requests

The requests library: older Python code uses the standard-library urllib to send requests; requests is a more convenient, higher-level wrapper around the same functionality.

Text request

import requests

url = "https://www.baidu.com"

# Send a GET request to the target url
response = requests.get(url)

# Response content
print(response.text)             # str type
print(response.content)          # bytes type
print(response.status_code)      # status code
print(response.request.headers)  # request headers
print(response.headers)          # response headers
print(response.cookies)          # response cookies

import requests

url = "https://www.baidu.com"

# Send a GET request to the target url
response = requests.get(url)

# Response content
print(response.text)                     # str type; Chinese characters may come out garbled
print(response.content)                  # bytes type
print(response.content.decode())         # Chinese characters display correctly
print(response.content.decode('utf-8'))  # Chinese characters display correctly
print(response.content.decode('gbk'))    # raises an error (the content is not GBK-encoded)

Image request

import requests

url = "https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/d-1592982809.jpg"

response = requests.get(url)

# Print the image's raw bytes
print(response.content)

with open("image.jpg", "wb") as f:
    # Write response.content (binary bytes) to the file
    f.write(response.content)

Request headers and parameters 

Default request headers and parameters are set automatically, but if we want to customize the request headers (headers) and parameters (params), we create them as dictionaries and pass them into the get or post request function, as sketched below.
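
A minimal sketch of passing custom headers and params (the User-Agent string and the search parameter are placeholders):

import requests

url = "https://www.baidu.com/s"

# Custom request headers as a dictionary (the User-Agent value is a placeholder)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}

# Query parameters as a dictionary; requests appends them to the url as ?wd=python
params = {"wd": "python"}

response = requests.get(url, headers=headers, params=params)

print(response.request.url)      # the final url with the parameters attached
print(response.request.headers)  # the headers that were actually sent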

3. Request methods

GET request: a request sent directly to the page; any parameters are attached to the URL.

POST request: used when we know what data we need to submit; the parameters are carried in the request body.
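
A minimal sketch of the two request methods (httpbin.org is used here only because it echoes back what was sent; it is not part of the original article):

import requests

# GET: the parameters ride on the url
get_response = requests.get("https://httpbin.org/get", params={"page": 1})
print(get_response.json()["args"])    # {'page': '1'}

# POST: the data is carried in the request body
post_response = requests.post("https://httpbin.org/post", data={"user": "test"})
print(post_response.json()["form"])   # {'user': 'test'}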

Proxies

Reverse proxy: the browser does not know the real address of the server, such as nginx

Forward proxy: the browser knows the real address of the server, such as VPN

Using a proxy essentially hides your own IP address. Insufficient login information, a banned IP, or many visits from the same IP within a short period of time can all lead the server to classify the IP as malicious and block access, so we use a proxy in place of the real IP when sending requests.

import requests
import random as rd
import time

for i in range(1, 100):
    # Build a random proxy address (for illustration only: random IPs are
    # almost never real proxies, so these requests will usually fail)
    proxie_str = f"https://{rd.randint(1, 100)}.{rd.randint(1, 100)}.{rd.randint(1, 100)}." \
                 f"{rd.randint(1, 100)}:{rd.randint(3000, 9000)}"

    print(proxie_str)

    # Set the proxy for both http and https requests
    proxie = {
        "http": proxie_str,
        "https": proxie_str,
    }

    url = "http://www.baidu.com"
    response = requests.get(url, proxies=proxie)

    print(response.content.decode())
    time.sleep(0.5)

4. Cookies

A cookie carries personal information when a crawler visits a website and is an important part of the request header.

The information in a cookie includes, but is not limited to, device information, browsing history, and access keys.

Three ways to use requests to handle cookies:

  1. Put the cookie string into the headers dictionary as a key-value pair

header = {
'Cookie': 'OUTFOX_SEARCH_USER_ID_NCOO=2067300732.9291213;\
OUTFOX_SEARCH_USER_ID="[email protected]"; _ga=GA1.2.831611348.1638177688;\
[email protected]|1647320299|0|youdao_jianwai|00&99|shh&1647226292&mailmas\
ter_ios#shh&null#10#0#0|&0|mailmaster_ios|[email protected]; fanyi-ad-id=305838;\
fanyi-ad-closed=1; ___rl__test__cookies=1653295115820'
}

  2. Pass a cookie dictionary to the cookies parameter of the request method

cookies = {"cookie key name": "cookie value"}
requests.get(url, headers=header, cookies=cookies)
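
If you only have the raw cookie string copied from the browser, one common approach (a small helper sketched here, not part of requests itself) is to split it into a dictionary first:

import requests

# Raw cookie string copied from the browser (shortened from the example above)
cookie_str = "OUTFOX_SEARCH_USER_ID_NCOO=2067300732.9291213; fanyi-ad-id=305838; fanyi-ad-closed=1"

# Split "key=value; key=value" pairs into a dict for the cookies parameter
cookies = {
    pair.split("=", 1)[0].strip(): pair.split("=", 1)[1].strip()
    for pair in cookie_str.split(";")
    if "=" in pair
}

response = requests.get("https://www.baidu.com", cookies=cookies)
print(response.status_code)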

  3. Use the session object provided by requests

header = {
'Cookie': 'OUTFOX_SEARCH_USER_ID_NCOO=2067300732.9291213;\
OUTFOX_SEARCH_USER_ID="[email protected]"; _ga=GA1.2.831611348.1638177688;\
[email protected]|1647320299|0|youdao_jianwai|00&99|shh&1647226292&mailmas\
ter_ios#shh&null#10#0#0|&0|mailmaster_ios|[email protected]; fanyi-ad-id=305838;\
fanyi-ad-closed=1; ___rl__test__cookies=1653295115820'
}

session = requests.session()
response = session.get(url, headers=header, verify=False)
# This approach suits long connections and third-party redirects,
# ensuring the cookie is carried across requests instead of being cleared

5. retrying

retrying is a module for automatically retrying failed requests.

import requests
from retrying import retry

@retry(stop_max_attempt_number=4)
def get_info(url):
    # A timeout raises an error and triggers a retry;
    # the wait timeout is set to 4 seconds
    response = requests.get(url, timeout=4)

    # A status code other than 200 also raises an error and retries
    assert response.status_code == 200
    return response

def parse_info(url):
    try:
        response = get_info(url)
    except Exception as e:
        print(e)
        response = None
    return response

if __name__ == "__main__":
    res = parse_info("https://www.baidu.com")
    print(res.content.decode())

Here retry is used as a decorator to monitor get_info: if the data still cannot be obtained after 4 attempts, it stops retrying and raises the error.
