Using sockets to make low-level requests and download pictures

Table of contents

1. HTTP request

2. What is a socket?

3. Use socket to download http pictures

4. Use socket to download https pictures


Before getting into the hands-on part, a bit of theory is needed, so that we can get twice the result with half the effort.

1. HTTP request

First of all, let's look at the form of an HTTP (HTTPS) request.
The request is sent from the client to the server and can be divided into four parts: the request method, the request URL, the request headers, and the request body.
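As a rough illustration of those four parts, the sketch below assembles a raw request by hand. The helper name build_request, the host example.com, and the demo-client User-Agent are made up for this example and do not come from the article.

```python
def build_request(method, path, host, headers=None, body=b''):
    """Assemble a raw HTTP/1.0 request from its four parts (hypothetical helper)."""
    # Request line: method, request URL (here just the path), protocol version
    lines = [f'{method} {path} HTTP/1.0', f'Host: {host}']
    # Request headers, one "Name: value" pair per line
    for name, value in (headers or {}).items():
        lines.append(f'{name}: {value}')
    if body:
        lines.append(f'Content-Length: {len(body)}')
    # An empty line separates the headers from the (optional) body
    head = '\r\n'.join(lines) + '\r\n\r\n'
    return head.encode('ascii') + body


request = build_request('GET', '/index.html', 'example.com',
                        {'User-Agent': 'demo-client'})
print(request.decode())
```

A GET request normally has an empty body, so only the request line, the headers, and the terminating blank line are sent.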

So how do we build and send the request ourselves? This is where socket comes in (of course, the requests library is the most convenient option, because it wraps all of these low-level details for us).

2. What is a socket?

I'll assume everyone already knows about sockets, so I won't go over them in detail here. If you don't, the reference articles below will help.

Reference: Introduction to socket programming: 1 day to play with socket communication technology (very detailed)

A socket is an endpoint for network communication. Applications usually send requests to the network or respond to network requests through sockets, so that different hosts, or processes on the same computer, can communicate.

Reference for the three-way handshake and four-way wave: Knowledge summary of three-way handshake and four-way wave (super detailed) - Cloud Community - HUAWEI CLOUD (huaweicloud.com)

The socket methods used in this article are as follows:

Method                   Description
connect((host, port))    host is the server's host name or IP address; port is the port number the server process is bound to.
send                     Sends the request message.
recv                     Receives data.
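To see the three methods in action without touching the network, here is a minimal sketch that talks to a throwaway server on the local machine. The echo server, the serve_once helper, and the "echo: " prefix are assumptions made for this illustration only.

```python
import socket
import threading


def serve_once(server):
    """Accept one client, read its message, reply, then close (demo server)."""
    conn, _ = server.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(b'echo: ' + data)


# Start a local server on a free port chosen by the operating system
server = socket.socket()
server.bind(('127.0.0.1', 0))
server.listen(1)
port = server.getsockname()[1]
thread = threading.Thread(target=serve_once, args=(server,))
thread.start()

# Client side: connect, send, then recv until the server closes the connection
client = socket.socket()
client.connect(('127.0.0.1', port))
client.sendall(b'hello')
reply = b''
while True:
    data = client.recv(1024)
    if not data:  # empty bytes means the peer closed the connection
        break
    reply += data
client.close()

thread.join()
server.close()
print(reply)
```

The client loop has the same shape as the download code in the following sections: keep calling recv until it returns empty bytes.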

3. Use socket to download http pictures

The code is as follows:

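import socket
import re

# Address of the resource to download
url = 'http://image11.m1905.cn/uploadfile/2021/0922/thumb_0_647_500_20210922030733993182.jpg'
# Create the socket object (TCP by default)
client = socket.socket()
# Connect to the server
client.connect(('image11.m1905.cn', 80))  # HTTP defaults to port 80, HTTPS to port 443
# Build the HTTP request
http_res = 'GET ' + url + ' HTTP/1.0\r\nHost: image11.m1905.cn\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36\r\n\r\n'
# Send the request; sendall keeps sending until the whole message has gone out
client.sendall(http_res.encode())
# Bytes object that accumulates the data we receive
result = b''
data = client.recv(1024)
# Keep receiving until the server closes the connection (recv returns b'')
while data:
    result += data
    data = client.recv(1024)
print(result)
# Extract the data
# re.S makes . match every character, including newlines; the group captures everything after the response headers
images = re.findall(b'\r\n\r\n(.*)', result, re.S)  # returns a list
# print(images[0])
# Open a file and write the image bytes into it, i.e. save the picture locally (the data is a JPEG)
with open('picture.jpg', 'wb') as f:
    f.write(images[0])
client.close()  # close the socket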

Here \r is a carriage return and \n is a line feed. Each line of the request ends with \r\n, and an empty line (\r\n\r\n) marks the end of the headers.

The result is as follows:

[Figure: the downloaded picture]

If the differences between HTTP versions are unclear to you, you can refer to another article of mine: HTTP Detailed Explanation

Note: both HTTP/1.0 and HTTP/1.1 support persistent connections (Keep-Alive). In HTTP/1.0, the client has to add a "Connection: keep-alive" field to the request headers to ask for one. In HTTP/1.1, persistent connections are the default: after the server sends the response, the TCP connection stays open so that the client can send the next request over it.

If the client wants the connection closed, it adds a "Connection: close" field to the request headers. The server then closes the TCP connection after sending the response, and the client knows the response is complete when recv returns empty data. This is why the code above can simply read until recv returns nothing: it sends an HTTP/1.0 request without keep-alive, so the server closes the connection when it is done. The drawback is that every new request has to re-establish the connection, which has a certain impact on performance.
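On a persistent connection the server does not close the socket, so a client cannot wait for recv to return empty data; it has to read the Content-Length header to know where the body ends. The sketch below shows that check on a hand-written sample response. The parse_response helper and the sample bytes are assumptions for illustration, and chunked transfer encoding is deliberately left out.

```python
def parse_response(raw):
    """Split a raw HTTP response into status line, headers and body (hypothetical helper)."""
    head, _, body = raw.partition(b'\r\n\r\n')
    lines = head.decode('iso-8859-1').split('\r\n')
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(':')
        # Header names are case-insensitive, so store them in lowercase
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# Made-up response: the 4-byte body stands in for real image data
sample = (b'HTTP/1.1 200 OK\r\n'
          b'Content-Type: image/jpeg\r\n'
          b'Content-Length: 4\r\n'
          b'Connection: close\r\n'
          b'\r\n'
          b'\xff\xd8\xff\xe0')

status_line, headers, body = parse_response(sample)
# The body is complete once it is as long as Content-Length announces
complete = len(body) == int(headers['content-length'])
print(status_line, headers['content-type'], complete)
```

Splitting on the first blank line does the same job as the regular expression in the download code, and it also gives access to the status line, which is worth checking before saving the body as a picture.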

4. Use socket to download https pictures

The code is as follows:

import socket
import re


class GetImage:
    def __init__(self):
        pass

    def main(self, url):
        # Create the socket object
        client = socket.socket()  # TCP by default
        # Connect to the server
        client.connect(('pic.netbian.com', 80))

        # Build the request (sent as plain text over port 80)
        http_res = 'GET ' + url + ' HTTP/1.0\r\nHost: pic.netbian.com\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36\r\n\r\n'
        client.sendall(http_res.encode())
        result = b''
        data = client.recv(1024)
        # Keep receiving until the server closes the connection
        while data:
            result += data
            data = client.recv(1024)

        # Extract the data
        # re.S makes . match every character, including newlines; the group captures everything after the response headers
        image = re.findall(b'\r\n\r\n(.*)', result, re.S)
        # Use the last part of the URL as the file name
        filename = url.split('/')[-1]
        # Save the data
        with open(filename, 'wb') as f:
            f.write(image[0])

        # Close the socket
        client.close()


if __name__ == '__main__':
    url_list = ['https://pic.netbian.com/uploads/allimg/230327/194745-16799176658fd9.jpg',
                'https://pic.netbian.com/uploads/allimg/230303/004437-1677775477ee49.jpg',
                'https://pic.netbian.com/uploads/allimg/230411/225955-16812251959808.jpg'
                ]
    getimage = GetImage()
    for url in url_list:
        getimage.main(url)

The pictures were downloaded successfully.

You may have noticed that this time the picture URLs are https, yet we still connect to port 80. Shouldn't we use port 443?

Note: the request above is actually sent as plain, unencrypted HTTP. This server also answers on port 80, so the data comes back normally (not every site allows this; many only redirect port 80 traffic to HTTPS). If you switch the plain socket to port 443, the request fails instead. You can try it yourself.

Why is this?

Because HTTPS requires the communication to be encrypted with TLS/SSL, the server on port 443 expects a TLS handshake, not a plain-text request. If we do want to use port 443, we can do the following.

The ssl module from the standard library handles this, as follows:

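import socket
import ssl

# Create the socket object
client = socket.socket()  # TCP by default
# Create the SSL context (it verifies the server certificate by default)
ssl_context = ssl.create_default_context()
# Wrap the socket so that the traffic is encrypted
client = ssl_context.wrap_socket(client, server_hostname='pic.netbian.com')
# Connect to the server; the TLS handshake happens during connect
client.connect(('pic.netbian.com', 443))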

With the code above in place of the original socket creation and connection, the rest of the logic stays the same and the picture is downloaded normally.
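Putting the pieces together, a complete download over port 443 might look like the sketch below. The function names request_for_url and fetch, the shortened User-Agent, and the 10-second timeout are assumptions made for this example; unlike the code above, it takes the host and path from the URL with urllib.parse instead of hard-coding them, and it sends only the path in the request line.

```python
import socket
import ssl
from urllib.parse import urlparse


def request_for_url(url):
    """Derive scheme, host, port and the raw request bytes from a URL (hypothetical helper)."""
    parts = urlparse(url)
    host = parts.hostname
    port = parts.port or (443 if parts.scheme == 'https' else 80)
    path = parts.path or '/'
    if parts.query:
        path += '?' + parts.query
    request = (f'GET {path} HTTP/1.0\r\n'
               f'Host: {host}\r\n'
               'User-Agent: Mozilla/5.0\r\n'
               '\r\n').encode()
    return parts.scheme, host, port, request


def fetch(url, timeout=10):
    """Download url over a raw (optionally TLS-wrapped) socket; returns (headers, body)."""
    scheme, host, port, request = request_for_url(url)
    sock = socket.create_connection((host, port), timeout=timeout)
    if scheme == 'https':
        # Encrypt the connection and verify the certificate against the host name
        context = ssl.create_default_context()
        sock = context.wrap_socket(sock, server_hostname=host)
    with sock:
        sock.sendall(request)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closed the connection
                break
            chunks.append(data)
    header, _, body = b''.join(chunks).partition(b'\r\n\r\n')
    return header, body


if __name__ == '__main__':
    url = 'https://pic.netbian.com/uploads/allimg/230327/194745-16799176658fd9.jpg'
    header, body = fetch(url)
    with open(url.split('/')[-1], 'wb') as f:
        f.write(body)
```

Collecting the chunks in a list and joining them once at the end avoids rebuilding the bytes object on every recv call, which matters for larger pictures.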

Further reading:

TLS (Transport Layer Security) and SSL (Secure Sockets Layer) are protocols that secure network communication. TLS is the successor of SSL. Both run on top of the transport layer (TCP) and protect data from eavesdropping, tampering, and forgery while it travels across the network.

TLS/SSL combines public-key cryptography and digital signatures with symmetric encryption: X.509 digital certificates authenticate the server (and optionally the client), a key exchange establishes shared session keys, and those keys are then used to encrypt the data and verify its integrity. As a result, the communication between the client and the server is encrypted and protected against modification.
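The certificate check described above is what ssl.create_default_context() switches on by default. As a small sketch, the settings can be inspected without opening any connection:

```python
import ssl

# The default context is configured for client-side use
context = ssl.create_default_context()

# The server must present a certificate that chains to a trusted CA
requires_cert = context.verify_mode == ssl.CERT_REQUIRED
# The certificate must also match the server_hostname passed to wrap_socket
checks_hostname = context.check_hostname

print(requires_cert, checks_hostname)
```

This is why the server_hostname argument matters in the wrap_socket call: it is the name the certificate is checked against.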

TLS/SSL is widely used for data transmission on the Web, for example in online banking, e-mail, and online shopping. Common protocols that run over TLS/SSL include HTTPS, FTPS, and the TLS-secured variants of SMTP and POP3.

Note that although TLS and SSL are different protocols, their goals and design principles are basically the same, so the two names are often mentioned together in practice.

That's all for today's sharing. I hope it gives you a deeper understanding of sockets.

Origin blog.csdn.net/qq_69218005/article/details/130237634