Python crawler: downloading images with the socket module

What is a socket?

A socket is a mechanism for inter-process communication over a network. It is not a protocol itself but a programming abstraction, usually built on top of the TCP/IP protocol stack, that enables communication between different computers. A socket is essentially a file descriptor that exposes a set of API interfaces for network communication, supporting operations such as establishing connections, transmitting data, and listening on ports.

Sockets can be used for inter-process communication between different computers, such as between a client and a server, as well as between different processes on the same machine. Sockets support different transport protocols, such as TCP and UDP, and you can choose whichever protocol fits your needs.

In Python, the socket module implements socket programming, letting you create client and server programs for network communication. Socket programming can be used to develop network applications such as web crawlers, chat rooms, and file transfer tools.

Method                  Description
connect((host, port))   Connect to the server; host is the server hostname or IP, and port is the port number the server process is bound to.
send                    Send the request message.
recv                    Receive response data.
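
As a minimal sketch of these three methods together (example.com stands in for a real server), the following connects, sends a bare-bones HTTP request, and reads the first chunk of the response:

import socket

# connect: create a TCP socket and connect to the server on port 80
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("example.com", 80))

# send: transmit a minimal HTTP request
client.send(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")

# recv: read up to 4096 bytes of the response
print(client.recv(4096))

client.close()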

Workflow of a crawler


  1. Get the resource address:

The first thing a crawler needs to do is obtain the resource address of the target data. Only after we have an accurate address can we send a request for the data.

  2. Send a request to get the data:

The second step is to fetch the web page, that is, to obtain its source code. The source code contains the useful information on the page, so once you have it you can extract the data you want from it.

  3. Anti-crawler handling:

Some websites take anti-crawler measures, such as access frequency limits and CAPTCHAs, and these measures must be handled for the crawler to run normally. They can be dealt with using techniques such as browser-like request headers, request throttling, and Python CAPTCHA-recognition libraries; a minimal sketch follows.
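
As a hedged illustration (the URLs and header value are placeholders, and real sites may need more than this), here are the two simplest countermeasures, a browser-like User-Agent header plus a pause between requests, using the requests library:

import time
import requests

# Placeholder URLs for illustration
urls = ["http://example.com/page1", "http://example.com/page2"]

# Pretend to be an ordinary browser via the User-Agent header
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

for url in urls:
    response = requests.get(url, headers=headers)
    print(url, response.status_code)
    time.sleep(1)  # throttle requests to stay under frequency limits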

  4. Parse the page:

After obtaining the source code of the web page, the next step is to analyze it and extract the data we want. The most general method is regular expression extraction, which works everywhere but is complicated and error-prone as the expressions grow. Because web pages have a regular structure, there are also libraries that extract information based on node attributes, CSS selectors, or XPath, such as Beautiful Soup, pyquery, and lxml. With these libraries we can extract page information efficiently, such as node attributes and text values. Extracting information is a very important part of crawling: it turns messy data into something clear that we can process and analyze later. A small sketch of both approaches follows.
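
As a small sketch (the HTML snippet is made up for illustration), here is the same extraction done twice, once with a regular expression and once with Beautiful Soup, which needs pip install beautifulsoup4:

import re
from bs4 import BeautifulSoup

# A made-up sample page
html = '<div><img src="/images/a.jpg"><img src="/images/b.jpg"></div>'

# Option 1: regular expression -- universal but easy to get wrong
print(re.findall(r'<img src="(.*?)"', html))

# Option 2: Beautiful Soup -- select by node attributes
soup = BeautifulSoup(html, 'html.parser')
print([img['src'] for img in soup.find_all('img')])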

  5. Save the data:

After extracting the information, we generally save it somewhere for later use. There are many options: save it simply as TXT or JSON text, write it to a database such as MySQL or MongoDB, or store it on a remote server, for example over SFTP. For example:
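
A minimal sketch of the JSON option (the records themselves are placeholders):

import json

# Placeholder records extracted by the crawler
records = [{"title": "example", "url": "http://example.com/images/test.jpg"}]

# Save as JSON text for later processing
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)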

  6. Crawler control:

The crawler needs to control how many pages it fetches and how often, to avoid placing excessive load on the target website. You can use Python's multi-threading or multi-processing for concurrent crawling, and time-based throttling to control the crawl rate, as in the sketch below.
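
A minimal sketch of both ideas, combining a standard-library thread pool with a per-request delay (the URLs and the fetch body are placeholders):

import time
from concurrent.futures import ThreadPoolExecutor

# Placeholder URLs
urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]

def fetch(url):
    time.sleep(1)  # throttle: pause before each request
    print("fetching", url)  # a real crawler would download the page here

# Limit concurrency to two workers to keep the load on the site low
with ThreadPoolExecutor(max_workers=2) as pool:
    pool.map(fetch, urls)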

  7. Data analysis:

Once the data has been acquired, it can be analyzed and processed further, for example with data visualization, machine learning, or natural language processing, to obtain more valuable information.

Using a socket to crawl images

Why a socket can download images

Images can be downloaded with a socket because HTTP is an application layer protocol built on top of sockets, using the TCP/IP protocol family to transmit data. In HTTP, the client establishes a connection to the server through a socket and sends a request; the server receives the request and returns a response; the client then receives and processes the response data.

Since HTTP runs over sockets, a socket can be used to send HTTP requests and receive HTTP responses directly, implementing data transfer and download. Downloading images this way means constructing the HTTP request and parsing the HTTP response by hand, and writing more code for data transfer and error handling, but compared with other download methods it is more flexible and customizable and can cover more functions and scenarios.

The difference between downloading images with socket and with requests

Both socket and requests can be used to download images, but their implementations and purposes differ slightly.

Socket is a low-level network communication interface that bridges the application layer and the transport layer for data transmission. Downloading images with a socket requires manually constructing HTTP requests and parsing HTTP responses, plus extra code for data transfer and error handling. Sockets are better suited to low-level networking applications, such as custom protocols or online games.

requests is a Python library that encapsulates HTTP request and response handling, making network requests easy. Downloading an image with requests takes only a few lines, and request headers, parameters, and other options are easy to set. requests is better suited to developing applications such as web crawlers and data collection tools.

In general, downloading images with a socket is lower-level and more flexible but needs more code; downloading with requests is higher-level and more convenient, though somewhat less customizable. Which to use depends on the specific requirements and scenario.

Code examples:

Downloading an image with socket:

import socket

# Construct the HTTP request (Connection: close makes the server close the
# connection when the transfer is done, so recv() can detect the end of the response)
request = b"GET /images/test.jpg HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n"

# Establish the connection and send the request
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("example.com", 80))
s.send(request)

# Receive the full response
response = b""
data = s.recv(4096)
while data:
    response += data
    data = s.recv(4096)

# Split the headers from the body and save the image data
headers, _, body = response.partition(b"\r\n\r\n")
if b"Content-Type: image/jpeg" in headers:
    with open("test.jpg", "wb") as f:
        f.write(body)

# Close the connection
s.close()

Downloading an image with requests:

import requests

# Send the request and save the image
response = requests.get("http://example.com/images/test.jpg")
with open("test.jpg", "wb") as f:
    f.write(response.content)

As you can see, downloading an image with requests is simpler and more convenient, while the socket version has to construct the HTTP request and parse the HTTP response itself, which takes more code and handling.

Downloading a single image with socket

Take the image http://image11.m1905.cn/uploadfile/2021/0922/thumb_0_647_500_20210922030733993182.jpg as an example.
To download it with a socket, the steps are as follows:

  1. Get the resource address url.

  2. Create a socket client object client and connect to port 80 of the server image11.m1905.cn.

  3. Construct the HTTP request, including the request method, address, protocol version, headers, and other information, and send it.

  4. Receive the server's response in a loop, appending the data to the bytes object result.

  5. Use a regular expression to extract the image data from the response, i.e., strip off the response headers.

  6. Save the image data to a local file, i.e., download the image.

The code is as follows:

import socket
import re
import time

# The resource address
url = 'http://image11.m1905.cn/uploadfile/2021/0922/thumb_0_647_500_20210922030733993182.jpg'

start = time.time()

# Create the socket object
client = socket.socket()

# Connect to the server
client.connect(('image11.m1905.cn', 80))

# Construct the HTTP request (HTTP/1.0, so the server closes the
# connection after responding and the recv loop can finish)
http_res = 'GET ' + url + ' HTTP/1.0\r\nHost: image11.m1905.cn\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36\r\n\r\n'

# Send the request
client.send(http_res.encode())

# A bytes object to accumulate the data we receive
result = b''
data = client.recv(1024)

# Receive the response in a loop, appending each chunk
while data:
    result += data
    data = client.recv(1024)

# Extract the image data: re.S makes . match newlines as well,
# so everything after the blank line ends up in the capture group
# and the response headers are stripped off
images = re.findall(b'\r\n\r\n(.*)', result, re.S)

# Write the data to a local file, i.e., download the image
with open('image.jpg', 'wb') as f:
    f.write(images[0])

end_time = time.time()           # record the end time
elapsed_time = end_time - start  # compute the elapsed time
print(f'Image downloaded successfully in {elapsed_time:.2f} seconds')

Downloading multiple images with socket

Suppose we want to download multiple images with a socket. Here are two implementations.

Method 1

Put the URLs to download in a list, split each URL on /, obtain the host and the relative path by slicing, and then loop over the URL list to download each image. Note that these URLs use https, so the socket is wrapped in TLS before connecting. The code is as follows:

import socket
import ssl
import re
import time

start = time.time()

# The resource addresses
urls = [
    'https://pic.netbian.com/uploads/allimg/220211/004115-1644511275bc26.jpg',
    'https://pic.netbian.com/uploads/allimg/220215/233510-16449393101c46.jpg',
    'https://pic.netbian.com/uploads/allimg/211120/005250-1637340770807b.jpg'
]

for url in urls:
    # Parse the URL
    parts = url.split('/')  # ['https:', '', 'pic.netbian.com', 'uploads', 'allimg', '220211', '004115-1644511275bc26.jpg']
    host = parts[2]         # pic.netbian.com
    path = '/' + '/'.join(parts[3:])  # /uploads/allimg/220211/004115-1644511275bc26.jpg

    # Create a socket and connect to the host; https URLs need a
    # TLS-wrapped socket on port 443, plain http uses port 80
    client = socket.socket()
    if url.startswith('https'):
        client = ssl.create_default_context().wrap_socket(client, server_hostname=host)
        client.connect((host, 443))
    else:
        client.connect((host, 80))

    # Construct the HTTP request (Connection: close lets the recv loop end)
    http_req = f'GET {path} HTTP/1.1\r\nHost: {host}\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' \
               f'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36\r\nConnection: close\r\n\r\n'

    # Send the request
    client.sendall(http_req.encode())

    # Receive the response data
    result = b''
    data = client.recv(1024)
    while data:
        result += data
        data = client.recv(1024)

    # Extract the image data (everything after the response headers)
    images = re.findall(b'\r\n\r\n(.*)', result, re.S)

    # Write to a file named after the last path segment
    if images:
        with open(parts[-1], 'wb') as f:
            f.write(images[0])
    else:
        print(f'No image data received for {url}')

    # Close the connection
    client.close()

end_time = time.time()           # record the end time
elapsed_time = end_time - start  # compute the elapsed time
print(f'All images downloaded successfully in {elapsed_time:.2f} seconds')
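
As an aside, splitting on / works here, but the standard library's urllib.parse offers a more robust way to pull the host and path out of a URL; a small sketch:

from urllib.parse import urlparse

url = 'https://pic.netbian.com/uploads/allimg/220211/004115-1644511275bc26.jpg'
parsed = urlparse(url)
print(parsed.hostname)  # pic.netbian.com
print(parsed.path)      # /uploads/allimg/220211/004115-1644511275bc26.jpg
print(parsed.scheme)    # https -- useful for choosing port 80 vs 443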

Method 2

By creating a process pool, download all the images concurrently across multiple worker processes.

The code is as follows:

import socket
import ssl
import re
import multiprocessing
import time


def download_image(url):
    # Parse the URL
    parts = url.split('/')
    host = parts[2]
    path = '/' + '/'.join(parts[3:])

    # Create a socket and connect to the host, wrapping it in TLS for https
    client = socket.socket()
    if url.startswith('https'):
        client = ssl.create_default_context().wrap_socket(client, server_hostname=host)
        client.connect((host, 443))
    else:
        client.connect((host, 80))

    # Construct the HTTP request
    http_req = f'GET {path} HTTP/1.1\r\nHost: {host}\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' \
               f'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36\r\nConnection: close\r\n\r\n'

    # Send the request
    client.sendall(http_req.encode())

    # Receive the response data
    result = b''
    data = client.recv(1024)
    while data:
        result += data
        data = client.recv(1024)

    # Extract the image data
    images = re.findall(b'\r\n\r\n(.*)', result, re.S)

    # Write to a file
    if images:
        with open(parts[-1], 'wb') as f:
            f.write(images[0])
    else:
        print(f'No image data received for {url}')

    # Close the connection
    client.close()


if __name__ == '__main__':
    start = time.time()
    urls = [
        'https://pic.netbian.com/uploads/allimg/220211/004115-1644511275bc26.jpg',
        'https://pic.netbian.com/uploads/allimg/220215/233510-16449393101c46.jpg',
        'https://pic.netbian.com/uploads/allimg/211120/005250-1637340770807b.jpg'
    ]

    # Create a process pool
    pool = multiprocessing.Pool(processes=3)

    # Download all the images concurrently
    pool.map(download_image, urls)

    # Shut down the process pool
    pool.close()
    pool.join()

    end = time.time()
    elapsed_time = end - start  # compute the elapsed time
    print(f'All images downloaded successfully in {elapsed_time:.2f} seconds')

Origin: blog.csdn.net/m0_46467017/article/details/129339188