Python Crawler Beginner's Zero-to-One Tutorial, 2023 Edition (Hands-On Teaching)

This tutorial uses the simplest possible approach so that any beginner can successfully get started with Python crawlers.

I won't ramble on about what crawlers are or how they work in theory; let's get straight to the practical part. In this example we will use the Bi'an Tu Wang (彼岸图网) wallpaper site as the tutorial target.

Original website link: https://pic.netbian.com

First, open the website.

You can see plenty of good-looking pictures; there are 21 pictures on a single page.
Right-click and choose 检查 (Inspect), or simply press F12, to open the developer console.

Click the arrow icon (箭头) in the upper-left corner, or use the shortcut Ctrl+Shift+C, and then click on a picture.

Now we can see the detailed information of this picture. The link after src is the picture's URL; hover the mouse over it and a preview of the picture appears. This is what we are going to crawl.

1. Import the related library (requests)

import requests

As the name suggests, requests is used to send HTTP requests to a website.
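
A minimal sketch of what the library looks like in action (the URL is just the site from this tutorial; some sites refuse requests that lack the browser-style headers we add in the next step):

import requests

# send an HTTP GET request and check that the server answered
r = requests.get("https://pic.netbian.com")
print(r.status_code)  # 200 means the request succeeded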

2. Related parameters (url, headers)

Go back to the console, click Network at the top, press Ctrl+R to refresh, and then click on a picture.
Here we only need two simple parameters. Since this is just a basic crawler tutorial, the other parameters are ignored for now.

Parameter     Effect
Request URL   The address the request is sent to, i.e. the page where the pictures are located
user-agent    Simulates a normal browser visit so the site does not flag the request as illegal (non-browser) access


Parameter code:

url = "https://pic.netbian.com/uploads/allimg/210317/001935-16159115757f04.jpg"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
}

3. Make a request to the website

response = requests.get(url=url,headers=headers)
print(response.text) # print the page source returned by the successful request; it is the same as what you see with right-click -> View Page Source

At this point we find garbled characters! This is a headache for many beginners, but it is not hard to fix.

# after the request succeeds, use apparent_encoding to detect the page's encoding and decode the response with it
response.encoding = response.apparent_encoding
print(response.text)
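
For the curious: printing the two encoding attributes (before the reassignment above) shows why the text was garbled. requests guesses the encoding from the HTTP headers, while apparent_encoding is detected from the page content itself.

print(response.encoding)           # header-based guess, e.g. 'ISO-8859-1'
print(response.apparent_encoding)  # detected from the content, e.g. 'GB2312'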

Looking at this wall of densely packed source code may make your head spin, but in fact we only need to pick out the parts we need.

4. Matching (re library, regular expressions)

What is a regular expression? Simply put, you define a rule (a pattern), and the code then extracts the matching content from the given text according to that rule.
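
A tiny self-contained illustration of the non-greedy idea we rely on below (the sample string is made up for demonstration and is not taken from the site):

import re

sample = '<img src="/uploads/a.jpg" alt="first"><img src="/uploads/b.jpg" alt="second">'
print(re.findall('src="(.*)"', sample))   # greedy: one long, wrong match spanning both tags
print(re.findall('src="(.*?)"', sample))  # non-greedy: ['/uploads/a.jpg', '/uploads/b.jpg']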

We saw earlier what the image information looks like, so we can quickly locate what we want. The next step is to match the link and name of each image with a regular expression and store them in a list.

import re
"""
.   matches any character except a newline (\n)
*   matches the preceding character zero or more times
?   matches the preceding character zero or one time
.*? non-greedy match
"""
# the link is stored after src, and the picture's name after alt
# a plain (.*?) would also capture the links, but it would match other pictures we do not want
# from the image info above we saw that the links we want all start with /u..., so we add that restriction: (/u.*?)
parr = re.compile('src="(/u.*?)".alt="(.*?)"')
image = re.findall(parr,response.text)
for content in image:
    print(content)

Now the links and names are stored in the image list, and printing them shows the following:

image[0]: the first element of the list, i.e. the first (link, name) pair
image[0][0]: the first value of that element, i.e. the link
image[0][1]: the second value of that element, i.e. the name
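
Since every element is a (link, name) pair, you could also unpack it directly when looping; this is purely a stylistic alternative, and the rest of the tutorial keeps the indexed form:

for link, name in image:
    print(link, name)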

5. Get the pictures and save them to a folder (os library)

First create a folder with the os library (you could also just create the folder manually in the script's directory).

import os
path = "彼岸图网图片获取"
if not os.path.isdir(path):
    os.mkdir(path)
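
An optional alternative is os.makedirs with exist_ok=True, which also creates any missing parent folders and silently does nothing if the folder already exists:

import os

os.makedirs("彼岸图网图片获取", exist_ok=True)  # no error if the folder is already there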

Then iterate over the list to download the pictures.

# iterate over the list
for i in image:
    link = i[0] # the link
    name = i[1] # the name
    """
    Create an empty jpg file in the folder, opened in 'wb' binary write mode.
    res: the result of the request for the picture
    """
    with open(path+"/{}.jpg".format(name),"wb") as img:
        res = requests.get(link)
        img.write(res.content) # write the picture response content into the jpg file
        img.close() # close the file (redundant inside the with block)
    print(name+".jpg 获取成功······")

Run it and an error is reported. This is because our picture links are incomplete.
Go back to the site's homepage and click on a picture; in the address bar you can see that our matched links are missing the front part, https://pic.netbian.com. Prepend that prefix to the link when sending the picture request. Run it again, and the download completes successfully.
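
If you prefer not to hard-code the prefix by string concatenation, the standard library's urljoin can combine the site root with a relative link; a small sketch (the relative path below is a placeholder standing in for one of the matched links):

from urllib.parse import urljoin

base = "https://pic.netbian.com"
link = "/uploads/example.jpg"  # placeholder for a matched relative link
print(urljoin(base, link))     # -> https://pic.netbian.com/uploads/example.jpg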

Full code

import requests
import re
import os

url = "https://pic.netbian.com/"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
}

response = requests.get(url=url,headers=headers)
response.encoding = response.apparent_encoding

"""
. 表示除空格外任意字符(除\n外)
* 表示匹配字符零次或多次
? 表示匹配字符零次或一次
.*? 非贪婪匹配
"""
parr = re.compile('src="(/u.*?)".alt="(.*?)"') # 匹配图片链接和图片名字
image = re.findall(parr,response.text)

path = "彼岸图网图片获取"
if not os.path.isdir(path): # check whether the folder already exists
    os.mkdir(path) # create it if not
    
# iterate over the list
for i in image:
    link = i[0] # the link
    name = i[1] # the name
    """
    Create an empty jpg file in the folder, opened in 'wb' binary write mode.
    res: the result of the request for the picture
    """
    with open(path+"/{}.jpg".format(name),"wb") as img:
        res = requests.get("https://pic.netbian.com"+link)
        img.write(res.content) # write the picture response content into the jpg file
        img.close() # close the file (redundant inside the with block)
    print(name+".jpg 获取成功······")

This tutorial ends here. Do you feel unsatisfied with only one page of pictures? There is plenty more to explore once you are comfortable with these basics.

Origin blog.csdn.net/BlueSocks152/article/details/130984961