(1) Crawling girl pictures with Python: viewing the source of sites that disable "right-click → Inspect" and F12, and batch-renaming the crawled images


(1) For websites that disable "right-click → Inspect" and F12 so the source code cannot be viewed, type "view-source:" into the address bar, followed by the URL you want to crawl. For example:

Original website:

[screenshot of the rendered page]

View the source code of the web page:

[screenshot of the page source shown via "view-source:"]

When the crawler program runs, if no User-Agent header is set, some websites detect the request as a Python crawler and refuse access.
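The reason sites can tell is that requests announces itself in its default User-Agent string. A minimal sketch of how to see this, and how any custom value avoids that particular check:

```python
import requests

# The User-Agent that requests sends by default, e.g. "python-requests/2.28.1".
# Sites that block Python crawlers simply look for this token.
print(requests.utils.default_headers()["User-Agent"])

# Supplying any other value via headers= hides the telltale token.
headers = {"User-Agent": "hdy"}
```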

import requests

# If no User-Agent header is set when the crawler runs, some websites
# detect the request as a Python crawler and block access
# headers = {
#     "User-Agent": "hdy"
# }
# response = requests.get("https://www.vmgirls.com/15215.html", headers=headers)

response = requests.get("https://www.vmgirls.com/15215.html")

print(response.text)

If printing `response.text` displays the page's HTML, the website can be crawled:

[screenshot of the HTML output]

The complete code, annotated:

# Crawl the images from a single web page
import requests
import re   # regular expressions (the simplest scraping approach)
import os

# If no User-Agent header is set when the crawler runs, some websites
# detect the request as a Python crawler and block access
headers = {
    "User-Agent": "hdy"
}

# Request the page
print("Enter the URL of the page to crawl images from:")
urls = input()

response = requests.get(urls, headers=headers)
# print(response.request.headers)
# print(response.text)

html = response.text

# Parse the page; the matched tag looks like:
# <h1 class="post-title mb-3">空气都是甜的</h1>
dir_name = re.findall('<h1 class="post-title mb-3">(.*?)</h1>', html)[-1]   # regex
print("*", dir_name)

# Create a directory under the given path if it does not exist yet
path = "F:\\PyQt_Serial_Assistant_Drive_Detect\\Friuts_Classify\\Data\\"
if not os.path.exists(path + dir_name):
    os.makedirs(path + dir_name)

# Find all image links with a regex; a matched tag looks like:
# <a rel="nofollow" href="https://img.vm.laomishuo.com/image/2020/12/2020120109200851.jpeg" alt="空气都是甜的" title="空气都是甜的">
urls = re.findall('<a rel="nofollow" href="(.*?)" alt=".*?" title=".*?">', html)  # URLs of all images
print("**", urls)  # print every image address (as a list)

# Download each image
for url in urls:
    name = url.split('/')[-1]   # the last URL segment is the image's file name
    print(name)
    # GET request for the image itself
    response = requests.get(url, headers=headers)
    # "wb" opens the file called name inside dir_name in binary mode;
    # writing response.content saves the image
    with open(path + dir_name + '/' + name, 'wb') as f:
        f.write(response.content)
print("Download finished")


# Batch-rename the images, numbering them sequentially
data_path = path + dir_name       # the folder whose files will be renamed
class_name = ".jpg"               # file extension after renaming

all_file = os.listdir(data_path)  # all files in the folder
all_file_num = len(all_file)      # number of files
print(all_file, all_file_num)

for i in range(all_file_num):
    num = str(i + 1)
    os.rename(data_path + '/' + all_file[i], data_path + '/' + num + class_name)  # rename

file_out = os.listdir(data_path)  # list the folder again
print(file_out)
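The renaming step above can be tried in isolation, using a temporary folder with a few fake "downloaded" files instead of the hard-coded Windows path (sorting the listing first is an addition here, since os.listdir returns files in an OS-dependent order):

```python
import os
import tempfile

# A throwaway folder with a few empty stand-in image files.
data_path = tempfile.mkdtemp()
for fake in ["2020120109200851.jpeg", "img_b.jpeg", "img_c.jpeg"]:
    open(os.path.join(data_path, fake), "wb").close()

# Rename every file to 1.jpg, 2.jpg, ... exactly as the script does.
class_name = ".jpg"
all_file = sorted(os.listdir(data_path))   # sort for a deterministic order
for i, old in enumerate(all_file):
    os.rename(os.path.join(data_path, old),
              os.path.join(data_path, str(i + 1) + class_name))

print(sorted(os.listdir(data_path)))  # ['1.jpg', '2.jpg', '3.jpg']
```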



The two regexes in the code correspond to the page title and the image links. Copy the matching HTML (the parts in the red boxes below) from the page source into the program, then generalize it into a regex:

[screenshot of the title HTML]
[screenshot of the image-link HTML]
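As a self-contained illustration of those two patterns, here is a made-up HTML fragment imitating the structure of the target page:

```python
import re

# Invented fragment standing in for the real page source.
html = '''
<h1 class="post-title mb-3">空气都是甜的</h1>
<a rel="nofollow" href="https://img.vm.laomishuo.com/image/a.jpeg" alt="t" title="t">
<a rel="nofollow" href="https://img.vm.laomishuo.com/image/b.jpeg" alt="t" title="t">
'''

# (.*?) is a non-greedy capture group: it grabs the shortest text
# between the fixed parts of each tag.
dir_name = re.findall('<h1 class="post-title mb-3">(.*?)</h1>', html)[-1]
urls = re.findall('<a rel="nofollow" href="(.*?)" alt=".*?" title=".*?">', html)

print(dir_name)  # 空气都是甜的
print(urls)      # ['https://img.vm.laomishuo.com/image/a.jpeg', 'https://img.vm.laomishuo.com/image/b.jpeg']
```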

Run the program and enter the URL (the program is written for pages on this site):

[screenshot of the program run]

The batch-downloaded images are renamed with sequential numbers, making them easy to label and classify:

[screenshot of the renamed image folder]


Origin: blog.csdn.net/K_AAbb/article/details/127193324