My first Python web crawler: crawling pictures

While learning to write crawlers, let's learn from and encourage each other.
This time we want to crawl pictures. First find a web page that contains pictures and view its source code. Even if we have not studied front-end development, we can still read the front-end code well enough. Start by finding the address of the first picture, for example https://qq.yh31.com/tp/Photo7/ZJBQ/20099/200909291701134159.gif
This is the address of the second picture:
https://qq.yh31.com/tp/Photo7/ZJBQ/20099/200909291701136061.gif
You will find that the picture addresses follow a pattern: the front part of the URL is the same and only the later part changes. For the part that changes, we can use Python's built-in re library, where the pattern .*? automatically matches whatever differs from one address to the next. Here .*? means "match any number of non-newline characters", taking as few characters as possible.
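To make the behaviour of .*? concrete, here is a minimal sketch that runs the same pattern on a hand-written HTML string. The sample string is an assumption modelled on the two picture addresses above, not the live page, and the greedy pattern is included only for contrast.

```python
import re

# Sample HTML modelled on the page's <img> tags (illustrative, not fetched live).
html = (
    '<img border="0" alt="" src="/tp/Photo7/ZJBQ/20099/200909291701134159.gif">'
    '<img border="0" alt="" src="/tp/Photo7/ZJBQ/20099/200909291701136061.gif">'
)

# Non-greedy: each .*? stops as early as it can, so every <img> tag
# produces its own match.
lazy_pattern = r'<img border="0" .*? src="(.*?)"'
lazy_paths = re.findall(lazy_pattern, html)

# Greedy, for contrast: .* runs as far as it can, so the two tags are
# swallowed into a single match.
greedy_pattern = r'<img border="0" .* src="(.*)"'
greedy_paths = re.findall(greedy_pattern, html)

print(len(lazy_paths))    # 2
print(len(greedy_paths))  # 1
```

The non-greedy form is what lets re.findall return one address per picture.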
Step 1: match the URLs. First import the libraries with the import statement we learned earlier. Third-party libraries must be downloaded before they can be imported. I use the Python editor PyCharm, where you can go to File > Settings, click the plus sign, search for the third-party library requests, download it, and then import it directly. After that, find the addresses of the images to be crawled.
Step 2: request and download each image. Create an empty folder, open a file in it with the with statement, and write the image data into it with the write function. Because the address we matched earlier is only a relative path, /tp/Photo7/ZJBQ/20099/200909291701134159.gif, we need to splice it onto the site address to get the complete URL.
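The splicing and writing steps can be sketched in isolation as follows. This is only an illustration under stated assumptions: the helper names full_url and save_image are hypothetical, urljoin from the standard library is used here in place of plain string concatenation, and stand-in bytes are written to a temporary folder so the example runs without a network connection.

```python
import tempfile
from pathlib import Path
from urllib.parse import urljoin

BASE_URL = 'https://qq.yh31.com'  # site root taken from the picture addresses


def full_url(relative_path):
    """Join the site root with a path captured from a src attribute."""
    return urljoin(BASE_URL, relative_path)


def save_image(data, folder, index):
    """Write raw image bytes to <folder>/<index>.gif and return the path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f'{index}.gif'
    with open(target, 'wb') as f:  # 'wb' because image data is binary
        f.write(data)
    return target


url = full_url('/tp/Photo7/ZJBQ/20099/200909291701134159.gif')
print(url)

# Stand-in bytes instead of a real download, so the sketch runs offline.
demo_folder = tempfile.mkdtemp()
saved = save_image(b'GIF89a', demo_folder, 1)
print(saved.name)
```

In a real run, the bytes passed to save_image would be the content attribute of the response returned by requests.get(url).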
The explanation is not very detailed. Anyone who has learned a little Python should be able to follow the source code below. You can also read the blogs of more experienced developers to cross-check.

# Import the third-party requests library and the built-in re module
import requests
import re
def get_urls():
    # Request the target page
    response = requests.get('https://qq.yh31.com/zjbq/2920180.html')
    # Match the differing image addresses; .*? matches any number of non-newline characters
    # e.g. <img border="0" alt="" src="/tp/Photo7/ZJBQ/20099/200909291701134159.gif"
    url_add = r'<img border="0" .*? src="(.*?)"'
    # Find the addresses of all images to crawl
    url_list = re.findall(url_add, response.text)
    #print(url_list)
    return url_list
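# Download the data from the complete (spliced) URL
def get_gif(url, name):
    # 2.1 Request the URL of the image to download
    response = requests.get(url)
    # Save the image to D:\python_yu\photo (the folder must already exist)
    with open(r'D:\python_yu\photo\%d.gif' % name, 'wb') as f:
        f.write(response.content)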


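if __name__ == '__main__':
    url_list = get_urls()
    a = 1
    for url in url_list:
        # Splice the site address and the relative path into the complete URL
        com_url = 'https://qq.yh31.com' + url
        # 2.2 Call get_gif, passing it the complete URL and the file number
        get_gif(com_url, a)
        a = a + 1
        print(com_url)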

When the crawl succeeds, the program prints the address of each crawled image. Open the folder and you can see the pictures.

Origin blog.csdn.net/Lucifer_min/article/details/104169381