A simple beginner's Python crawler example: crawling pictures of beautiful young ladies

Normally, when we see a good-looking picture we want to download it, but if there are many pictures, downloading them one by one takes a lot of time. Instead, we can use Python to crawl the pictures and save them to a local folder.

This article contains only a small amount of code, is easy to understand, and is well suited for beginners; the complete working code is given at the end.
Without further ado, let's get started~

1. The website to crawl

The homepage of the site we plan to crawl is: https://www.vmgirls.com/
This is a photo site with many pictures of young ladies. Here we pick one specific page to crawl: https://www.vmgirls.com/14236.html
Let's take this page as an example and download its pictures in batches~

Detailed crawling steps

1. Use the requests library to fetch the page at the target address

Generally speaking, the requests library is recommended for Python crawlers, because it is more convenient than urllib: requests can construct and send a GET or POST request in a single call, while with urllib.request you must build a request object first and then send it separately.
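As a rough sketch of that difference, the snippet below only builds the requests (without sending them) so it runs offline; the header value is just a placeholder:

```python
import urllib.request

import requests

url = "https://www.vmgirls.com/14236.html"  # the page used in this article
headers = {"user-agent": "Mozilla/5.0"}

# requests: one call builds the request (requests.get(url, headers=headers) would send it too)
prepared = requests.Request("GET", url, headers=headers).prepare()

# urllib.request: construct a Request object first, then send it with urllib.request.urlopen(req)
req = urllib.request.Request(url, headers=headers)

print(prepared.method, prepared.url)
print(req.get_method(), req.full_url)
```

Both objects describe the same GET request; requests simply collapses the build-then-send steps into one function call.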

	# URL of the page whose pictures we will crawl
	url = "https://www.vmgirls.com/14236.html"
	# Request headers: without a user-agent, some sites will detect that we are a Python crawler and trigger anti-crawling measures
	headers = {
	    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36 Edg/84.0.522.40"
	}
	response = requests.get(url, headers=headers)
	html = response.content.decode("utf-8")

The user-agent value can be found on the page being crawled: press F12 (Fn+F12 on some laptops) or right-click and choose "Inspect" to open the developer tools, then look at the request headers under the "Network" tab.

2. Parse the page, extract the image links, and download the images locally

Here we can create a local folder named after the page title to store the pictures that will be downloaded.
Next, inspect the HTML around the pictures. Each image link sits in the href attribute of an <a> tag, so a regular expression can capture it.
Once we have an image's link, we request it again with requests and write the returned binary data to a local file, which completes the download.

The specific code is as follows:

	# Use BeautifulSoup from the bs4 library to parse the page we just fetched
	bs = BeautifulSoup(html, "html.parser")
	# Get the page title, used to name the folder that stores the pictures
	title = bs.find("h1").text
	# Use the os library to create a folder named after the page title for the downloaded pictures
	if not os.path.isdir(title):
	    os.mkdir(title)
	# Regular expression capturing the href attribute of <a> tags; the () marks the part to capture
	findpics = re.compile(r'<a href="(.*?)" alt=".*?" title=".*?">')
	# Parse out the image addresses
	pics = re.findall(findpics, html)  # extracted addresses look like image/2020/07/2020072112422623-scaled.jpeg
	# The addresses are relative, so we join them with the domain ourselves
	for pic in pics:
	    # Full image link
	    pic = 'https://www.vmgirls.com/' + pic
	    # Image file name, e.g. 2020072112422623-scaled.jpeg
	    pic_name = pic.split('/')[-1]
	    # print(pic_name, ":", pic)
	    # Save to the local folder
	    response = requests.get(pic, headers=headers)
	    with open(os.path.join(title, pic_name), 'wb') as f:  # write in binary mode
	        f.write(response.content)
	    print(f"Downloading {pic_name}...")
	print(f"All pictures downloaded! Saved under {os.getcwd()}")
	# os.getcwd() returns the current working directory

Run the program and you can see the pictures being downloaded.
Check the local folder and you will find the pictures you just downloaded~
The final complete source code is as follows:

import requests, time, re, os
from bs4 import BeautifulSoup

# Regular expression capturing the href attribute of <a> tags
findpics = re.compile(r'<a href="(.*?)" alt=".*?" title=".*?">')
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36 Edg/84.0.522.40"
}


def main():
    url = "https://www.vmgirls.com/14236.html"
    html = askUrl(url)
    bs = BeautifulSoup(html, "html.parser")
    # Get the page title, used to name the folder that stores the pictures
    title = bs.find("h1").text
    if not os.path.isdir(title):
        os.mkdir(title)

    # Parse out the image addresses
    pics = re.findall(findpics, html)  # extracted addresses look like image/2020/07/2020072112422623-scaled.jpeg
    # The addresses are relative, so join them with the domain
    for pic in pics:
        # Full image link
        pic = 'https://www.vmgirls.com/' + pic
        # Image file name
        pic_name = pic.split('/')[-1]
        # print(pic_name, ":", pic)
        # Save to the local folder
        response = requests.get(pic, headers=headers)
        with open(os.path.join(title, pic_name), 'wb') as f:  # write in binary mode
            f.write(response.content)
        print(f"Downloading {pic_name}...")
    print(f"All pictures downloaded! Saved under {os.getcwd()}")


# Fetch the page at the given url
def askUrl(url):
    time.sleep(1)  # wait one second between requests
    response = requests.get(url, headers=headers)
    html = response.content.decode("utf-8")
    return html


if __name__ == '__main__':
    main()
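One robustness note: if you adapt this script to other pages, the <h1> title may contain characters that are invalid in Windows folder names. A minimal sanitizing sketch (the character class and the `safe_dirname` helper are my own additions, not part of the original script):

```python
import re


def safe_dirname(title: str) -> str:
    # Replace characters that Windows forbids in folder names with underscores
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()


print(safe_dirname('pretty: girls?'))  # pretty_ girls_
```

You could then call `os.mkdir(safe_dirname(title))` instead of `os.mkdir(title)`.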

The source code can be downloaded here; one version uses regular expressions to extract the image links and the other uses XPath syntax: https://github.com/zmk-c/spider/tree/master/spider_pics
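As a rough sketch of the XPath approach, assuming lxml is installed and the page structure matches what the regex above targets (the HTML fragment below is a made-up stand-in for the real page):

```python
# Extract the same href attributes with XPath via lxml instead of a regex.
from lxml import etree

html = '''
<div class="post">
  <a href="image/2020/07/a.jpeg" alt="girl" title="girl"><img src="..."></a>
  <a href="image/2020/07/b.jpeg" alt="girl" title="girl"><img src="..."></a>
</div>
'''

tree = etree.HTML(html)
# Select the href of every <a> tag that has both alt and title attributes
pics = tree.xpath('//a[@alt and @title]/@href')
print(pics)  # ['image/2020/07/a.jpeg', 'image/2020/07/b.jpeg']
```

XPath queries the parsed document tree rather than the raw text, so they tend to survive small changes in attribute order or whitespace that would break a regex.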


Origin: blog.csdn.net/qq_40169189/article/details/107786383