(Python) A detailed beginner's tutorial for small crawlers (fetching pictures, text, etc. from a website)

1. Introduction

I recently built a small image-recognition app and needed a lot of pictures for the dataset. Downloading them one by one was too slow, so I did some research, wrote a simple little crawler, and decided to write the experience down. Every website's HTML structure is different, so the code has to be adapted for each site. Reading this article may require a little front-end knowledge. The complete code is posted at the end.

Here we use a clothing website as an example (you can use any website; what matters is the method and the code): https://www.black-up.kr/product/detail.html?product_no=32474&cate_no=96&display_group=1

2. Code

2.1 Three libraries: requests, BeautifulSoup, and os

import requests
from bs4 import BeautifulSoup
import os

2.2 Obtain and format browser request headers

We need to prepare browser request headers in the form of a dictionary. The essence of a crawler is to use request headers to disguise itself as a browser before accessing the site. Each browser's request headers look slightly different, but the content is the same; all we have to do is turn them into a dictionary. I wrote a small script to help format the headers. It may only work for the Edge browser I use, but if you use a different browser you can still try it, or you can format the headers by hand, which is just a little more troublesome.

The steps to obtain are as follows:

1) Open the browser and go to the target URL (I use the Edge browser). Press F12 to open the developer tools, click Network, and select the first item in the Name list. The headers appear in the panel on the right; scroll down until you see "Request Headers". Copy everything from Accept down to User-Agent at the end; nothing before Accept is needed.

2) Format the copied request headers.

The formatting code is as follows:

The long string headerStr contains the request headers just copied. Simply delete the content of headerStr and replace it with the request headers you copied.

Note: the unformatted request headers inside headerStr must be positioned exactly as shown below. Do not leave spaces in front of Accept or the other header names.

import re

def formulate_head():
    # headerStr below holds the raw headers copied from the browser:
    # a header name ending with ':' on one line, its value on the next line
    headerStr = '''
Accept:
text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
Accept-Encoding:
gzip, deflate, br
Accept-Language:
zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6
Cookie:
_gcl_au=1.1.511254458.1684805008; CUK45=cuk45_skfo900815_4d9322403b6cab3d50371082910e0531; CUK2Y=cuk2y_skfo900815_4d9322403b6cab3d50371082910e0531; _fbp=fb.1.1684805009113.971249844; black-up.kr-crema_device_token=LzzlyHO6YLN0keIv2jauEH73Xy2PRdBz; CFAE_CUK1Y=CFAE_CUK1Y.skfo900815_1.PZPJ94T.1684805009865; ch-veil-id=dafe7ccc-ecc3-43cd-ade9-bfcb8e1187dd; _wp_uid=2-45d93ef5ddc801f3dc97e1ec636ad20b-s1684809260.904994|windows_10|chrome-eqdzf9; _ga=GA1.2.1044427573.1684823556; recent_plist=31429%7C31417%7C31495%7C31421%7C31458%7C31211; wcs_bt=s_47112a8eb19a:1686401209; cto_bundle=k7_3Fl9saEF4MGhSelVRU3EwVDFZMElMRXZCdHpEZUk0WnNFcG5GRG5lS3dGJTJCSkIydXM5aXlHdnglMkJ5JTJGdFJ3QkxMYTRKRjdkYnJQcHA2eVp4U1lTbUFscXlvOHU2VHduaCUyQldHQXhLR3BJNmp5dFlKVXJGcUdPejZmVTNRV24xOXlMU3JGdmhKY3BBZnFKSVMwZEIzRCUyRm11SW93JTNEJTNE
Sec-Ch-Ua:
"Not/A)Brand";v="99", "Microsoft Edge";v="115", "Chromium";v="115"
Sec-Ch-Ua-Mobile:
?0
Sec-Ch-Ua-Platform:
"Windows"
Sec-Fetch-Dest:
document
Sec-Fetch-Mode:
navigate
Sec-Fetch-Site:
none
Sec-Fetch-User:
?1
Upgrade-Insecure-Requests:
1
User-Agent:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.188
'''
    header_lines = headerStr.strip().split('\n')

    ret = ""
    jump = 0
    for i in range(len(header_lines)):
        if i >= jump:
            # a line ending with ':' is a header name; the next line is its value
            if header_lines[i].rfind(':') != -1:
                ret += '\'' + header_lines[i] + ' ' + header_lines[i + 1] + '\'' + ',\n'
                jump = i + 2  # skip the value line that was just consumed
    ret = re.sub(": ", "': '", ret)  # turn "Name: value" into "'Name': 'value'"
    ret = ret[:-2]  # drop the trailing ",\n"
    print(ret)
    return ret




formulate_head()

Run the code above; the result is shown below. You can see the headers have been formatted into dictionary entries. Copy the generated output and paste it into the headers parameter.
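By the way, if you would rather skip the copy-and-paste step, here is a small variant I sketched myself (it is not part of the original script) that parses the same copied header text straight into a Python dict; like the script above, it assumes the header names and values sit on alternating lines:

def headers_to_dict(header_str):
    # header_str is the raw text copied from DevTools: a header name ending
    # with ':' on one line, its value on the next line
    lines = [line.strip() for line in header_str.strip().split('\n') if line.strip()]
    headers = {}
    for name_line, value_line in zip(lines[0::2], lines[1::2]):
        headers[name_line.rstrip(':')] = value_line
    return headers

# example: headers = headers_to_dict(headerStr)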

Then put the formatted headers into the headers dictionary:

if __name__ == '__main__':
    # target URL to crawl
    url = "https://www.black-up.kr/product/detail.html?product_no=32474&cate_no=96&display_group=1"
    # file name prefix for the saved images
    pic_name = "download_pic_"
    # directory where the images are saved (raw string so the backslashes are not treated as escapes)
    save_dir = r"D:\lableimg1.8\csdn\spider_pic"
    # request headers
    headers = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
'Cookie': '_gcl_au=1.1.511254458.1684805008; CUK45=cuk45_skfo900815_4d9322403b6cab3d50371082910e0531; CUK2Y=cuk2y_skfo900815_4d9322403b6cab3d50371082910e0531; _fbp=fb.1.1684805009113.971249844; black-up.kr-crema_device_token=LzzlyHO6YLN0keIv2jauEH73Xy2PRdBz; CFAE_CUK1Y=CFAE_CUK1Y.skfo900815_1.PZPJ94T.1684805009865; ch-veil-id=dafe7ccc-ecc3-43cd-ade9-bfcb8e1187dd; _wp_uid=2-45d93ef5ddc801f3dc97e1ec636ad20b-s1684809260.904994|windows_10|chrome-eqdzf9; _ga=GA1.2.1044427573.1684823556; recent_plist=31429%7C31417%7C31495%7C31421%7C31458%7C31211; wcs_bt=s_47112a8eb19a:1686401209; cto_bundle=k7_3Fl9saEF4MGhSelVRU3EwVDFZMElMRXZCdHpEZUk0WnNFcG5GRG5lS3dGJTJCSkIydXM5aXlHdnglMkJ5JTJGdFJ3QkxMYTRKRjdkYnJQcHA2eVp4U1lTbUFscXlvOHU2VHduaCUyQldHQXhLR3BJNmp5dFlKVXJGcUdPejZmVTNRV24xOXlMU3JGdmhKY3BBZnFKSVMwZEIzRCUyRm11SW93JTNEJTNE',
'Sec-Ch-Ua': '"Not/A)Brand";v="99", "Microsoft Edge";v="115", "Chromium";v="115"',
'Sec-Ch-Ua-Mobile': '?0',
'Sec-Ch-Ua-Platform': '"Windows"',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
'Sec-Fetch-User': '?1',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.188'}
    get_content_black(url,pic_name,save_dir,headers)

2.3 Request the page

requests.get(url, headers=headers) fetches the page; url is the URL and headers is the request-header dictionary prepared above. The returned response.status_code holds the status code: 200 means the request succeeded, anything else means it failed.

response = requests.get(url, headers=headers)
if response.status_code != 200:
    print('Request failed')
    return
if response.status_code == 200:
    print('Request succeeded')
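As a side note, requests can also raise an exception on a failed request and give up on a slow one; this optional variant (not used in the rest of the article) relies on the library's raise_for_status() method and timeout parameter:

response = requests.get(url, headers=headers, timeout=10)  # give up after 10 seconds
response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses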

2.4 View the returned HTML

BeautifulSoup() parses the HTML returned in response with the lxml parser, which makes the later steps easier.

html = BeautifulSoup(response.content, 'lxml')
print(html)  # inspect the HTML

The printed html is as follows:

Next, you can pull content out of the HTML much like indexing into a list.
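For example, something like the following works (a purely illustrative snippet; the real selectors for this site come in the next section):

divs = html.find_all('div')      # find_all returns a list-like collection of tags
first_div = divs[0]              # index it like an ordinary list
links = first_div.find_all('a')  # and search again inside a single tag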

2.5 Locate the desired content in the HTML

1) First, find the HTML structure of your target. For example, if you want the detail pictures of the clothes this site sells, proceed as follows:

Press F12 to open the developer tools, click the element-picker button (the one with the mouse arrow on it), and then click the picture. The HTML structure of the picture is then shown on the right. You can see that these detail pictures are wrapped in a div whose class is detailArea, and that each image address sits in the src attribute of an img tag below it. That is all we need to know.

2) Code for finding the structure

First, use .find_all to get all the tags inside the element whose class name is detailArea:

content = html.find_all('div', class_="detailArea")
print(content)

The print result is as follows:

We can see that it contains the img tags we want, and they are exactly the detail pictures (you can compare them yourself). The next step is clear: use find_all again to extract all the img tags. content[0] is used here because the previous step actually returned a list (note the square brackets in the screenshot above), and that list has only one element, since only one element with the detailArea class was found. If there were multiple divs with that class, the list would have multiple elements, separated by commas. So even though that single element is very long, we still have to index it, which is why content[0] appears.

pics = content[0].find_all('img')  # the first (and only) matching element
print(pics)

The print result is as follows:

Because there are multiple img tags under the detailArea class, the result is again a list, so the next step is to loop over this list and pull out every src value.
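A quick aside before writing that loop: the two find_all calls above can also be collapsed into a single CSS selector. This is just an equivalent sketch (assuming the class name is exactly detailArea), not what the rest of the article uses.

pics = html.select('div.detailArea img')  # the same img tags in one step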

Use a for loop to take the value of each src attribute, prepend the site's address, and append the full URL to a new list. When the loop finishes, this new list holds the link addresses of all the images.

pic_urls = []  # collect the full image URLs
# loop over the img tags and read the src attribute of each one
for i in pics:
    pic_url = 'https://www.black-up.kr/' + i['src']
    pic_urls.append(pic_url)

The printed list looks like this:
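An optional aside: if you want the URL splicing to be more robust (for example when a src value is already absolute or starts with //, as the main picture on this site does), urllib.parse.urljoin from the standard library can handle the joining; a small sketch:

from urllib.parse import urljoin

for i in pics:
    # urljoin copes with absolute, protocol-relative (//...) and relative src values
    pic_urls.append(urljoin('https://www.black-up.kr/', i['src']))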

The last step is to download and save the images. The code is as follows:

requests.get(url, headers=headers) is used again, this time to download the content at each image URL; url is the image URL and headers are the same request headers as before. Because any individual download can fail, check whether status_code is 200, i.e. whether the request succeeded; a successful request means the image was downloaded. Then build the file name, join the save directory and the file name into a full path, and finally write the file to disk.

num = 0
for each_img_url in pic_urls:

    response = requests.get(each_img_url, headers=headers)
    # print(response.status_code)
    # exit()
    if response.status_code == 200:
        # build the file name
        file_name = pic_name + str(num) + '.jpg'
        num = num + 1

        # join the save directory and the file name into a full path
        save_path = os.path.join(save_dir, file_name)

        # save the image locally
        with open(save_path, 'wb') as file:
            file.write(response.content)
            print(f'Image saved as {file_name}')
    else:
        print('Failed to download image')

print('All images saved')
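One caveat from my own testing (an assumption, not mentioned in the original text): open() fails if the save directory does not exist yet, so you may want to create it before the download loop:

os.makedirs(save_dir, exist_ok=True)  # create the save directory if it is missing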

The running screenshot is as follows:

3. Summary

That's it. One extra note: if you want to grab something specific, such as one particular picture, look at how that picture's HTML structure differs from the others and use a few if conditions to separate it out; a short sketch of this idea follows the full code below. It's a practice-makes-perfect process. When using the code, don't forget to substitute and format your own headers. That's all I have to say; the complete code is below. If anything is still unclear or you run into trouble, you can contact me by email and we can improve together: [email protected]

import requests
from bs4 import BeautifulSoup
import os



def get_content_black(url, pic_name, save_dir, headers):

    pic_urls = []

    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        print('Request failed')
        return
    if response.status_code == 200:
        print('Request succeeded')
    # print(response)  # check whether the request succeeded


    print("**Start fetching images**")
    html = BeautifulSoup(response.content, 'lxml')
    # print(html)  # inspect the HTML


    content = html.find_all('div', class_="detailArea")  # elements with class detailArea
    # print(content)

    first_single = html.find_all('div', class_="keyImg")  # the main picture at the top of the page
    # print(first_single)
    first_single_a = first_single[0].find_all('a')  # get the a tag
    first_single_a_i = first_single_a[0].find_all('img')  # go into the img tag
    first_single_src = first_single_a_i[0]['src']  # read the src attribute
    first_single_src = 'https:' + first_single_src
    # first_single_src = first_single_src.replace("//", "", 1)  # strip the leading // from the string
    pic_urls.append(first_single_src)  # add it to the list
    # print(first_single_src)


    pics = content[0].find_all('img')  # the first (and only) matching element
    # print(pics)

    # loop over the img tags and read the src attribute of each one
    for i in pics:
        pic_url = 'https://www.black-up.kr/' + i['src']
        pic_urls.append(pic_url)
        # print(pic_url)

    # print(pic_urls)
    if len(pic_urls) == 0:
        print('Failed to fetch image URLs')
        return

    print('Image URLs fetched successfully')

    # download the images
    num = 0
    for each_img_url in pic_urls:

        response = requests.get(each_img_url, headers=headers)
        # print(response.status_code)
        # exit()
        if response.status_code == 200:
            # build the file name
            file_name = pic_name + str(num) + '.jpg'
            num = num + 1

            # join the save directory and the file name into a full path
            save_path = os.path.join(save_dir, file_name)

            # save the image locally
            with open(save_path, 'wb') as file:
                file.write(response.content)
                print(f'Image saved as {file_name}')
        else:
            print('Failed to download image')

    print('All images saved')
    return pic_urls

if __name__ == '__main__':
    # target URL to crawl
    url = "https://www.black-up.kr/product/detail.html?product_no=32474&cate_no=96&display_group=1"
    # file name prefix for the saved images
    pic_name = "download_pic_"
    # directory where the images are saved (raw string so the backslashes are not treated as escapes)
    save_dir = r"D:\lableimg1.8\csdn\spider_pic"
    # request headers
    headers = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
'Cookie': '_gcl_au=1.1.511254458.1684805008; CUK45=cuk45_skfo900815_4d9322403b6cab3d50371082910e0531; CUK2Y=cuk2y_skfo900815_4d9322403b6cab3d50371082910e0531; _fbp=fb.1.1684805009113.971249844; black-up.kr-crema_device_token=LzzlyHO6YLN0keIv2jauEH73Xy2PRdBz; CFAE_CUK1Y=CFAE_CUK1Y.skfo900815_1.PZPJ94T.1684805009865; ch-veil-id=dafe7ccc-ecc3-43cd-ade9-bfcb8e1187dd; _wp_uid=2-45d93ef5ddc801f3dc97e1ec636ad20b-s1684809260.904994|windows_10|chrome-eqdzf9; _ga=GA1.2.1044427573.1684823556; recent_plist=31429%7C31417%7C31495%7C31421%7C31458%7C31211; wcs_bt=s_47112a8eb19a:1686401209; cto_bundle=k7_3Fl9saEF4MGhSelVRU3EwVDFZMElMRXZCdHpEZUk0WnNFcG5GRG5lS3dGJTJCSkIydXM5aXlHdnglMkJ5JTJGdFJ3QkxMYTRKRjdkYnJQcHA2eVp4U1lTbUFscXlvOHU2VHduaCUyQldHQXhLR3BJNmp5dFlKVXJGcUdPejZmVTNRV24xOXlMU3JGdmhKY3BBZnFKSVMwZEIzRCUyRm11SW93JTNEJTNE',
'Sec-Ch-Ua': '"Not/A)Brand";v="99", "Microsoft Edge";v="115", "Chromium";v="115"',
'Sec-Ch-Ua-Mobile': '?0',
'Sec-Ch-Ua-Platform': '"Windows"',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
'Sec-Fetch-User': '?1',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.188'}
    get_content_black(url,pic_name,save_dir,headers)
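As a footnote to the tip in the summary about separating out specific pictures, here is a tiny sketch of what such an if condition might look like; the 'detail' substring is made up, and you would replace it with whatever actually distinguishes your target images:

for i in pics:
    src = i.get('src', '')
    # keep only the images whose src contains "detail" -- a hypothetical condition
    if 'detail' in src:
        pic_urls.append('https://www.black-up.kr/' + src)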

Origin blog.csdn.net/calmdownn/article/details/132006805