Two ways to write a tunnel proxy in Scrapy, how they differ, and which one better protects the real IP

What is the difference between the following two ways of configuring a tunnel proxy in Scrapy?

Method 1

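from fake_useragent import UserAgent  # assumed source of UserAgent; the original does not show the import

tunnel_host = ""  # tunnel proxy host (left blank in the original)
tunnel_port = ""  # tunnel proxy port (left blank in the original)
# Tunnel ID and password
tid = ""
password = ""
proxies = {
    "http": "http://%s:%s@%s:%s/" % (tid, password, tunnel_host, tunnel_port),
    "https": "http://%s:%s@%s:%s/" % (tid, password, tunnel_host, tunnel_port),
}
# The lines below belong inside a downloader middleware's process_request(self, request, spider)
request.meta['proxy'] = proxies["http"]
request.headers["User-Agent"] = UserAgent().random
# request.headers["Connection"] = "close"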

Method 2

from w3lib.http import basic_auth_header

proxy = "tps.kdlapi.com:-----"
request.meta['proxy'] = "http://%(proxy)s" % {'proxy': proxy}
# Username/password authentication
request.headers['Proxy-Authorization'] = basic_auth_header('', '')  # comment out this line when using whitelist authentication
# request.headers["Connection"] = "close"

The two methods differ mainly in how the proxy credentials are supplied, not in the kind of proxy that is used.

The first method embeds the credentials in the proxy URL, in the form http://username:password@host:port/, where username and password are the tunnel ID and password, and host and port are the host name and port number of the proxy server. The URL is assigned to request.meta['proxy']. The example also sets a random User-Agent request header so that the target website is less likely to recognize the request as coming from a crawler; that setting is independent of the proxy.

The second method sets request.meta['proxy'] to a URL of the form http://host:port and supplies the credentials separately, in a Proxy-Authorization request header built with basic_auth_header (HTTP Basic authentication). With whitelist authentication, where the proxy provider recognizes the client by its whitelisted IP address, no credentials are needed and that line can be commented out. The tps.kdlapi.com host in the example appears to be a tunnel proxy endpoint as well, so this method also goes through a tunnel proxy.

In general, both methods send the request through the proxy server, so in either case the target website sees the proxy's exit IP rather than the real IP address of the client. When Scrapy's built-in HttpProxyMiddleware processes the request, it removes the credentials from the proxy URL and turns them into a Proxy-Authorization header, which makes the first method largely equivalent to the second. Choose between them according to the authentication method your proxy server is configured for: username and password, or whitelist.
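To make the relationship between the two forms concrete, the sketch below takes a proxy URL with embedded credentials and derives the bare proxy URL plus the Proxy-Authorization value that HTTP Basic authentication would carry. It is not Scrapy's own code, only an illustration using the standard library; the function name split_proxy_credentials and the host, port, and credentials are hypothetical placeholders.

```python
import base64
from urllib.parse import urlsplit, unquote


def split_proxy_credentials(proxy_url):
    """Return (proxy URL without credentials, Proxy-Authorization value or None)."""
    parts = urlsplit(proxy_url)
    # Rebuild the proxy address without the "user:password@" part.
    netloc = parts.hostname or ""
    if parts.port is not None:
        netloc = "%s:%d" % (netloc, parts.port)
    bare_url = "%s://%s" % (parts.scheme, netloc)
    if parts.username is None:
        return bare_url, None
    # HTTP Basic authentication is base64("user:password").
    user = unquote(parts.username)
    password = unquote(parts.password or "")
    token = base64.b64encode(("%s:%s" % (user, password)).encode("utf-8"))
    return bare_url, "Basic " + token.decode("ascii")


# Hypothetical placeholder values, not real credentials or a real proxy host.
url_with_credentials = "http://%s:%s@%s:%s/" % (
    "my_tid", "my_password", "proxy.example.com", 15818)
bare_url, auth_header = split_proxy_credentials(url_with_credentials)
print(bare_url)     # what would go in request.meta['proxy']
print(auth_header)  # what would go in the Proxy-Authorization header
```

Seen this way, the URL form and the header form carry the same information to the proxy; only the place where it is written differs.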

Why do both methods use an http:// proxy URL?

A tunneling proxy is a proxy server that establishes a channel between the client and the target server, so that the two can communicate through the proxy. This kind of proxy is usually used for HTTPS traffic, because HTTPS traffic is encrypted with SSL/TLS and therefore cannot be decrypted or tampered with by the proxy server.

To establish a tunnel, the client sends a CONNECT request to the proxy server, asking it to connect to a given port on the target server. On receiving the CONNECT request, the proxy server opens a TCP connection to the target server and, once that succeeds, answers the client with a success response. From then on, the client and the target server communicate over this connection, and the proxy server only relays the data in both directions without decrypting or modifying it.
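The handshake described here can be sketched in a few lines. The example below only builds the bytes of a CONNECT request and checks a proxy's reply; it opens no network connection. The function names and the example.com target are hypothetical, chosen for illustration.

```python
def build_connect_request(target_host, target_port, proxy_authorization=None):
    """Build the bytes a client sends to an HTTP proxy to open a tunnel."""
    authority = "%s:%d" % (target_host, target_port)
    lines = [
        "CONNECT %s HTTP/1.1" % authority,
        "Host: %s" % authority,
    ]
    if proxy_authorization is not None:
        lines.append("Proxy-Authorization: %s" % proxy_authorization)
    # A blank line ends the header section; a CONNECT request has no body.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def tunnel_established(response_head):
    """Return True if the proxy's reply carries a 2xx status code."""
    status_line = response_head.split(b"\r\n", 1)[0]
    parts = status_line.split(b" ", 2)
    if len(parts) < 2 or not parts[0].startswith(b"HTTP/"):
        return False
    status = parts[1]
    return len(status) == 3 and status.isdigit() and status.startswith(b"2")


request_bytes = build_connect_request("example.com", 443)
print(request_bytes.decode("ascii"))
print(tunnel_established(b"HTTP/1.1 200 Connection established\r\n\r\n"))
print(tunnel_established(b"HTTP/1.1 407 Proxy Authentication Required\r\n\r\n"))
```

After a 2xx reply, everything the client writes to the socket (typically the TLS handshake) is relayed to the target as opaque bytes.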

HTTP tunneling carries a lower-layer protocol (TCP) inside a higher-layer protocol (HTTP). HTTP defines a request method called CONNECT, which starts two-way communication with the requested resource and can be used to open a tunnel.

Only the initial connection request is HTTP; after that, the proxy server simply relays the established TCP connection. This mechanism is how clients behind an HTTP proxy can access websites that use SSL or TLS (i.e. HTTPS).

Because establishing a tunnel requires a CONNECT request, and a CONNECT request is an HTTP request, the client talks to a tunnel proxy over HTTP. This is why the proxy URL starts with http:// in both methods, including the "https" entry of the proxies dictionary in the first method: the scheme describes the connection to the proxy server, not the connection to the target website. A proxy server can also implement tunneling with other protocols (such as SOCKS), but HTTP is more common because it is one of the most widely used protocols on the Internet and passes through most firewalls and proxy servers.
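As a rough illustration of why the proxy is addressed over HTTP whatever the target's scheme, the sketch below returns the first request line a client would send to an HTTP proxy for a given target URL. It assumes a plain GET for http:// targets; the function name and URLs are hypothetical.

```python
from urllib.parse import urlsplit


def first_request_line_to_proxy(target_url):
    """Return the first line a client sends to an HTTP proxy for this target."""
    parts = urlsplit(target_url)
    if parts.scheme == "https":
        # HTTPS targets: ask the proxy to open a tunnel to host:port.
        port = parts.port or 443
        return "CONNECT %s:%d HTTP/1.1" % (parts.hostname, port)
    if parts.scheme == "http":
        # Plain HTTP targets: send the request with the full URL as its target.
        return "GET %s HTTP/1.1" % target_url
    raise ValueError("unsupported scheme: %r" % parts.scheme)


print(first_request_line_to_proxy("http://example.com/page"))
print(first_request_line_to_proxy("https://example.com/page"))
```

In both cases the line is an ordinary HTTP request line, sent to the proxy in clear text; for the HTTPS target, the encrypted traffic only begins once the tunnel is open.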

What does request.headers["Connection"] = "close" mean?

request.headers["Connection"] = "close" sets an HTTP request header that asks the server to close the connection to the client once it has finished sending the response.

In HTTP/1.1, connections are persistent by default: the same client can send multiple requests over one connection instead of opening a new one each time. This improves performance, but it can also cause problems. For example, the server cannot tell whether the client will send further requests, so it cannot release the connection's resources promptly. For this reason, HTTP/1.1 provides the Connection header to control how the connection behaves.

When a client sends a request with a Connection: close header, it tells the server to close the connection when it is done with the response. In this way, the server can release resources in time to avoid some problems, such as resource leaks, connection timeouts, and so on. Of course, this also means that the client needs to establish a new connection for each request, which may reduce performance.
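The rule can be written down as a small decision function. This is a simplified sketch covering only HTTP/1.0 and HTTP/1.1; the function name is hypothetical and headers are passed as a plain dictionary.

```python
def connection_is_persistent(http_version, headers):
    """Decide whether the connection stays open after the current response."""
    # The Connection header holds comma-separated, case-insensitive tokens.
    tokens = set()
    for name, value in headers.items():
        if name.lower() == "connection":
            tokens.update(token.strip().lower() for token in value.split(","))
    if "close" in tokens:
        return False  # explicit request to close after this response
    if http_version == "HTTP/1.1":
        return True   # persistent by default
    if http_version == "HTTP/1.0":
        return "keep-alive" in tokens  # persistent only when asked for
    raise ValueError("unsupported HTTP version: %r" % http_version)


print(connection_is_persistent("HTTP/1.1", {}))
print(connection_is_persistent("HTTP/1.1", {"Connection": "close"}))
```

The second call corresponds to the commented-out request.headers["Connection"] = "close" line in the two examples.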

When using a tunnel proxy, the proxy server holds a TCP connection to the target server on the client's behalf, so sending the Connection: close request header is advisable: it asks the server to close the connection after completing the response. Otherwise the connection may be kept open long after it is needed, wasting resources.


Origin blog.csdn.net/weixin_45934622/article/details/129799417