Crawling images by category and by keyword under big data: a detailed walkthrough (Part 2)

Part 1 of this blog series, which covers crawling images by category, can be found here:

https://blog.csdn.net/qq_41479464/article/details/94393390 

 

 

You can enter any image categories you want in the name.txt file.

(2) Crawling images by keyword: the categories can be entered directly in name.txt, one per line — any keyword you want to crawl, such as "beauty", "dinosaur", or "Yang Mi". The file is read line by line by default.
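As a minimal sketch of that line-by-line reading (the file name matches the post, but the sample keywords are just examples):

```python
# Write a sample keyword list; in practice name.txt is prepared by hand.
with open('name.txt', 'w', encoding='utf-8') as f:
    f.write('dinosaur\nYang Mi\n')

# Read the categories back, one keyword per line, skipping blank lines.
with open('name.txt', encoding='utf-8') as f:
    keywords = [line.strip() for line in f if line.strip()]

print(keywords)  # ['dinosaur', 'Yang Mi']
```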

The keyword-based crawling process:

First, I search for the keyword on the photo site I want to crawl and watch which page the site jumps to; the first task is to analyze the relationship between the resulting URLs.

The starting point is the Baidu image site, http://image.baidu.com/, shown in Figure I:

 

                                                 Figure I 

 

Next, I enter a keyword on the site, for example "Tyrannosaurus rex". The page then shows related-search recommendations and jumps to the image gallery shown in Figure II. Analyzing the resulting URL, we find a pattern: the front part of the URL stays the same across searches, only the keyword differs, and a pn parameter is appended at the end.

http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=Tyrannosaurus rex&pn=0

This is the URL of the first result page; setting pn = 10 or pn = 30 in the same URL performs the page-turn operation.
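The page-turn rule above can be sketched as a small URL builder (the helper name `page_url` and the step of 10 items per page are assumptions based on the observed pattern):

```python
from urllib.parse import quote

# The fixed front part of the URL observed above; only word and pn change.
base = 'http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word='

def page_url(keyword, page_index):
    # Each result page holds 10 items, so pn advances in steps of 10.
    return base + quote(keyword) + '&pn=' + str(page_index * 10)

print(page_url('Tyrannosaurus rex', 0))
print(page_url('Tyrannosaurus rex', 3))
```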

As shown in Figure II:

 

                                                       Figure II

 

Having worked out the URL pattern and the keyword substitution, we move on to extracting the URL of each individual image from a single page. As before, we use the Chrome browser together with an XPath helper extension to test; the result is shown in Figure III:

You can clearly see the image URLs appear on the right side. At this point the only remaining question is how to page through the results, and as noted above, paging is done by setting pn = 10, 20, 30, ... at the end of the URL, so that problem is solved as well.

 

                                               Figure III

Although XPath can locate the image URLs, there is another way to do it. Since the category crawler in the previous post was already written with XPath, this time I use a regular expression instead. The debugging code is shown in Figures IV and V, where you can see the regex captures every image URL on the first page.

 

                                                            Figure IV

 

                                                            Figure V
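The regex extraction debugged in Figures IV and V can be reproduced on a hypothetical page fragment like this (the sample HTML is made up; real pages embed many more fields per entry):

```python
import re

# Hypothetical fragment of the result page source containing objURL entries.
html = '''
"objURL":"http://example.com/a.jpg","fromURL":"...",
"objURL":"http://example.com/b.jpg","fromURL":"...",
'''

# re.S lets '.' also match '\n', so entries spanning lines are still caught.
pic_urls = re.findall('"objURL":"(.*?)",', html, re.S)
print(pic_urls)  # ['http://example.com/a.jpg', 'http://example.com/b.jpg']
```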

 

The page also carries related-search recommendations. We print them and write them to a file, which reveals the correlation between images, and between images and text.
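A stdlib-only sketch of pulling those recommendations out of the page (the sample snippet and the parser class name are hypothetical; the full code below uses BeautifulSoup for the same job):

```python
from html.parser import HTMLParser

# Hypothetical snippet: Baidu puts related searches in <div id="topRS">.
html = '<div id="topRS"><a href="#">velociraptor</a><a href="#">triceratops</a></div>'

class TopRSParser(HTMLParser):
    """Collects the text of every <a> inside div#topRS."""
    def __init__(self):
        super().__init__()
        self.in_div = False
        self.in_a = False
        self.terms = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div' and ('id', 'topRS') in attrs:
            self.in_div = True
        elif tag == 'a' and self.in_div:
            self.in_a = True

    def handle_endtag(self, tag):
        if tag == 'div':
            self.in_div = False
        elif tag == 'a':
            self.in_a = False

    def handle_data(self, data):
        if self.in_a:
            self.terms.append(data)

p = TopRSParser()
p.feed(html)
print(p.terms)  # ['velociraptor', 'triceratops']
```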

Below is the complete code for crawling images by keyword:

 

import re
import os  # for creating the per-keyword output folders
import requests
from bs4 import BeautifulSoup

number = 0          # running count of downloaded images
numberPicture = 0   # download limit for the current keyword
file = ''           # output folder for the current keyword
List = []

# Count how many image URLs the site offers for this keyword.
def Find(url):
    global List
    print('Counting the available images, please wait...')
    t = 0
    s = 0
    while t < 10000:
        # Turn the page: pn advances in steps of 10.
        Url = url + str(t)
        try:
            Result = requests.get(Url, timeout=70)
        except requests.exceptions.RequestException:
            t = t + 10
            continue
        else:
            result = Result.text
            # By default '.' matches any character except '\n', so matching
            # happens line by line ("lines" being separated by the invisible
            # '\n' at the end of each line). Without re.S, a match cannot
            # cross a line boundary. With re.S, the regex treats the whole
            # string as one unit and '\n' is matched like any ordinary
            # character, so entries spanning lines are still captured.
            pic_url = re.findall('"objURL":"(.*?)",', result, re.S)
            # Accumulate how many image URLs were found.
            s += len(pic_url)
            # No URLs on this page: we have reached the end of the results.
            if len(pic_url) == 0:
                break
            else:
                List.append(pic_url)
                # Advance pn by 10 for the next result page.
                t = t + 10
    return s

# Collect the related-search recommendations for this keyword.
def recommend(url):
    Re = []
    try:
        html = requests.get(url)
    except requests.exceptions.RequestException:
        return
    else:
        # Set the document encoding before parsing.
        html.encoding = 'utf-8'
        # Parse the text content of the document.
        bsObj = BeautifulSoup(html.text, 'html.parser')
        div = bsObj.find('div', id='topRS')
        # Each <a> node inside div#topRS is one recommendation.
        if div is not None:
            listA = div.findAll('a')
            for i in listA:
                if i is not None:
                    # Print each related recommendation.
                    print(i.get_text())
                    Re.append(i.get_text())
        return Re

# Download every image found on one result page.
def downloadPicture(html, keyword):
    global number
    # Extract the image URLs from the page source with the regex.
    pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
    print('Found images for keyword "' + keyword + '", starting download...')
    for each in pic_url:
        print('Downloading image ' + str(number + 1) + ', url: ' + str(each))
        try:
            if each is not None:
                pic = requests.get(each, timeout=7)
            else:
                continue
        except BaseException:
            print('error')
            continue
        else:
            # Name the file after the keyword plus a running index.
            path = os.path.join(file, keyword + '_' + str(number) + '.jpg')
            with open(path, 'wb') as fp:
                fp.write(pic.content)
            number += 1
        if number >= numberPicture:
            return

# Program entry point.
if __name__ == '__main__':
    tm = int(input('How many images to download per category? '))
    numberPicture = tm
    line_list = []
    # Open name.txt, which lists the wanted categories one per line.
    with open('./name.txt', encoding='utf-8') as f:
        # strip() removes the trailing newline and surrounding whitespace.
        line_list = [k.strip() for k in f.readlines()]
    for word in line_list:
        url = 'http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=' + word + '&pn='
        # Count the available images for this keyword.
        tot = Find(url)
        # Record the related recommendations.
        Recommend = recommend(url)
        print('Category %s has %d images in total' % (word, tot))
        # Name the output folder after the keyword.
        file = word + '_folder'
        if os.path.exists(file):
            print('The folder already exists; downloading into it')
        else:
            # Create the folder.
            os.mkdir(file)
        t = 0
        tmp = url
        while t < numberPicture:
            try:
                url = tmp + str(t)
                result = requests.get(url, timeout=100)
                print(url)
            except requests.exceptions.RequestException:
                print('Network error, please check your connection and retry')
                t = t + 10
            else:
                downloadPicture(result.text, word)
                t = t + 10
        # Raise the threshold so the next keyword gets another tm images.
        numberPicture = numberPicture + tm

    print('Search finished, thank you for using this tool')


Origin: blog.csdn.net/qq_41479464/article/details/94393777