Crawler: Baidu image crawling

General crawlers and focused crawlers
According to usage scenarios, web crawlers can be divided into general crawlers and focused crawlers.

General web crawler: an important component of search engine crawling systems (Baidu, Google, Yahoo, etc.). Its job is to collect web pages from the Internet; this page information is then used to build the search engine's index and support its queries. It determines whether the content of the whole engine is rich and whether the information is fresh, so its performance directly affects the effectiveness of the search engine. The main purpose is to download web pages from the Internet to local storage, forming a mirror backup of Internet content.


Focused crawler: a web crawler program oriented toward the needs of a specific topic. The difference from a general search engine crawler is that a focused crawler processes and filters content while crawling, trying to ensure that only pages relevant to the requirement are fetched. This is what "crawler" means in the everyday sense.
Baidu image crawling
Requirements analysis: at least two functions must be implemented: one is searching for images, the other is downloading them automatically.
1. Analyze the source of the page http://image.baidu.com/search/index?tn=baiduimage&word=cat with the help of F12 (note: unused parameters can be deleted without affecting how the page loads).
2. Write a regular expression or other parser code.
3. Store the data locally.
4. Formally write the Python crawler code.
Page analysis is very important: different needs correspond to different URLs, and different URLs produce obviously different source code, so mastering page analysis is the first step toward a successful crawler. The source analysis of this page is shown below:

(Screenshot: the page source, with the "objURL" fields that hold the image addresses.)
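
Before writing the full program, a minimal sketch like the following can confirm that the served source really contains the "objURL" fields the regex will target. The browser-like User-Agent header is an assumption (some servers return different or reduced markup to clients that do not look like a browser), and the cat query is only an example:

import re

import requests

# Pretend to be a browser; servers may serve reduced markup otherwise (assumption).
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'http://image.baidu.com/search/index?tn=baiduimage&word=cat'
response = requests.get(url, headers=headers)
matches = re.findall(r'"objURL":"(.*?)"', response.text)
print('%d image URLs found' % len(matches))
print(matches[:3])  # preview the first few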

import os
import re

import requests
from colorama import Fore


def download_image(url, keyword):
    """
    Download images.
    :param url: Baidu image search URL
    :param keyword: search keyword, also used as the save directory
    """
    # 1. Send an HTTP request to the server.
    response = requests.get(url)
    # 2. Read the server's response.
    #    Useful attributes: status_code, text, url
    data = response.text  # page source as text
    # 3. Write a regular expression to extract the image URLs.
    # data = ...[{"ObjURL":"http:\/\/images.freeimages.com\/images\/large-previews\/3bc\/calico-cat-outside-1561133.jpg",....}]...
    # What we need is: http:\/\/images.freeimages.com\/images\/large-previews\/3bc\/calico-cat-outside-1561133.jpg
    # Regex syntax: . matches any character except \n, * means the previous
    # character repeats zero or more times, ? makes the match non-greedy.
    pattern = r'"objURL":"(.*?)"'
    # 4. Find all image URLs matching the pattern.
    image_urls = re.findall(pattern, data)
    # 5. Download each image to the local disk.
    index = 1
    for image_url in image_urls:
        print(image_url)  # e.g. 'xxxx.jpg', 'xxxx.png'
        # response.text returns the body as unicode text;
        # response.content returns it as bytes.
        try:
            response = requests.get(image_url)  # request each image URL
        except Exception:
            print(Fore.RED + "[-] Download failed: %s" % image_url)
        else:
            old_image_filename = image_url.split('/')[-1]
            if old_image_filename:
                # Get the image extension.
                image_format = old_image_filename.split('.')[-1]
                # Handle URLs that end with query parameters,
                # e.g. ...jpeg?imageview&thumbnail=550x0
                if '?' in image_format:
                    image_format = image_format.split('?')[0]
            else:
                image_format = 'jpg'

            # Create the storage directory for the images.
            keyword = keyword.replace(' ', '-')
            if not os.path.exists(keyword):
                os.mkdir(keyword)
            image_filename = os.path.join(keyword, str(index) + '.' + image_format)
            # Save the image.
            with open(image_filename, 'wb') as f:
                f.write(response.content)
                print(Fore.BLUE + "[+] Saved image %s" % image_filename)
                index += 1


if __name__ == '__main__':
    keyword = input("Enter a keyword for batch image download: ")
    url = 'http://image.baidu.com/search/index?tn=baiduimage&word=' + keyword
    print(Fore.BLUE + '[+] Requesting URL: %s' % url)
    download_image(url, keyword)

Implementation results: the downloaded images are saved to a directory named after the keyword, numbered sequentially (1.jpg, 2.jpg, and so on).
Common problems:

Why are there only 30 images, even though the page shows more than 30?
The Baidu image page loads responsively: scrolling down keeps loading new images, which means what you see in the browser is the result of JavaScript processing data fetched via Ajax. Ajax crawling is not explained in detail here, but a sketch of the idea follows.
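
The endpoint and parameter names in this hedged sketch come from watching the browser's network tab while scrolling; they are assumptions taken from inspection, not a documented API:

import requests

# Assumed Ajax endpoint observed in the browser's network tab (not a documented API).
AJAX_URL = 'https://image.baidu.com/search/acjson'


def fetch_batch(keyword, page, batch_size=30):
    """Fetch one batch of image metadata as JSON (a sketch; may need tweaking)."""
    params = {
        'tn': 'resultjson_com',   # ask for a JSON result (assumption)
        'ipn': 'rj',
        'word': keyword,          # the search keyword
        'pn': page * batch_size,  # offset of the first result in this batch
        'rn': batch_size,         # number of results per batch
    }
    response = requests.get(AJAX_URL, params=params)
    # The image addresses sit in fields similar to the objURL values
    # matched by the regex above.
    return response.json()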
The URL of a single image opened from the search page does not match the objURL obtained by the program.
This is likely the result of Baidu's caching. Each image is essentially hosted externally rather than by Baidu, so the program sends its HTTP request to the URL where the image is actually stored.
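
Since those objURL targets are external hosts of varying reliability, a cheap defensive step (not part of the original program) is to check that the target actually answers with an image before writing anything to disk:

import requests


def looks_like_image(url, timeout=10):
    """Return True if the URL responds OK with an image Content-Type."""
    try:
        response = requests.head(url, allow_redirects=True, timeout=timeout)
    except requests.RequestException:
        return False
    content_type = response.headers.get('Content-Type', '')
    return response.ok and content_type.startswith('image/')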



Origin: blog.csdn.net/weixin_45734982/article/details/105700489