Crawling website images with Python (beginner crawler demo)

What the code does:

The script crawls the teacher pictures from the website, creates a png folder on the user's machine, and saves 110 pictures in total. It also writes the introduction text for each teacher into the H3.txt file.

Implementation ideas:

Open the page and press F12 to inspect its HTML source. The images sit under a ul tag, but that ul is not the only one on the page. find_all returns a list of the matching ul elements, and the first element of that list is the one we need. A for loop then walks through its li items, reads the src of each image, and writes the image into the png directory by opening a file in binary mode. The h3 text is collected the same way. Note that besides the packages imported in the code, html5lib must also be installed in the environment (pip install html5lib), because it is the parser passed to BeautifulSoup.
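
The saving step described above can be tried offline. The following is a minimal sketch, not the article's code: the helper name save_entry is hypothetical, the bytes and names are made-up stand-ins for a downloaded image and a real h3 string, and the text file is placed inside the demo folder for convenience. It only illustrates writing bytes with "wb" and appending text with "a".

```python
import os
import tempfile


def save_entry(folder, index, image_bytes, intro, text_name="H3.txt"):
    """Hypothetical helper: store one image and append its introduction text."""
    image_path = os.path.join(folder, "{}.png".format(index))
    # "wb" writes the raw bytes unchanged, which is what image data needs
    with open(image_path, "wb") as image_file:
        image_file.write(image_bytes)
    # "a" appends, so each call adds one line instead of overwriting the file
    text_path = os.path.join(folder, text_name)
    with open(text_path, "a", encoding="utf-8") as text_file:
        text_file.write(intro + "\n")
    return image_path, text_path


# Made-up data standing in for response.content and the h3 text
demo_dir = tempfile.mkdtemp()
save_entry(demo_dir, 1, b"\x89PNG fake bytes", "Teacher A")
image_path, text_path = save_entry(demo_dir, 2, b"\x89PNG more bytes", "Teacher B")

with open(text_path, encoding="utf-8") as f:
    print(f.read().splitlines())
print(sorted(os.listdir(demo_dir)))
```

Using with blocks closes each file as soon as the write finishes, so nothing stays open between loop iterations.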

Code:

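import os
import shutil
import sys

import requests
from bs4 import BeautifulSoup

# Fetch the teacher listing page; the timeout keeps the script from hanging forever
response = requests.get(url="http://www.mobiletrain.org/teacher/", timeout=10)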

def get_resource_path(relative_path):  # Resolve a resource path against the base directory
    if getattr(sys, "frozen", False):
        base_path = sys._MEIPASS  # Temporary resource directory of a PyInstaller bundle
        print(base_path)
    else:
        base_path = os.path.abspath(".")  # Current working directory
    return os.path.join(base_path, relative_path)  # Absolute path

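if response.status_code == 200:    # Any other status (e.g. 404 or 405) means the page could not be fetched
    print("Connection succeeded!")
    # Set the encoding used to decode the returned source
    response.encoding = "UTF-8"
    html = BeautifulSoup(response.text, "html5lib")
    # Several <ul class="clear"> elements match; take the first one from the list
    ul = html.find_all("ul", attrs={"class": "clear"})[0]
    li_list = ul.find_all("li")

    i = 0
    # If the png directory already exists, delete and recreate it to avoid errors
    PNG = get_resource_path('png')
    if os.path.exists(PNG):
        shutil.rmtree(PNG)
    os.mkdir(PNG)  # os.mkdir returns None, so there is nothing to assign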

    for li in li_list:
        i += 1
        img_src = li.find("img")["src"]
        response_child = requests.get(img_src)
        # Write the image bytes in binary mode; the with block closes the file
        with open(get_resource_path(os.path.join("png", "{}.png".format(i))), "wb") as fileWriter:
            fileWriter.write(response_child.content)
        h3 = li.find("h3").text
        # Append the teacher introduction to H3.txt
        with open('H3.txt', 'a', encoding='utf-8') as text:
            text.write(h3 + '\n')
else:
    print("Connection failed!")
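
The loop passes each src value straight to requests.get, which only works when src is a full URL. Whether the page always uses absolute URLs is not stated here, so the following is a hedged sketch of one way to guard against relative paths with the standard library's urljoin. The helper name absolute_src and the sample src values are hypothetical.

```python
from urllib.parse import urljoin

PAGE_URL = "http://www.mobiletrain.org/teacher/"  # page fetched in the code above


def absolute_src(src, page_url=PAGE_URL):
    """Hypothetical helper: turn an img src into a URL that requests.get can fetch."""
    # urljoin leaves absolute URLs untouched and resolves relative ones
    return urljoin(page_url, src)


# Sample src values are made up for illustration
for src in ["http://example.com/a.png", "/images/b.png", "c.png"]:
    print(absolute_src(src))
```

If this were used in the loop, img_src would be wrapped as absolute_src(img_src) before the download.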




Origin blog.csdn.net/weixin_56115549/article/details/126653567