How to automatically download Baidu images with a Python crawler

GitHub: https://github.com/nnngu/LearningNotes


Steps to make a crawler

Building a crawler generally involves the following steps:

  • Analyze the requirements
  • Inspect the page source with the browser's developer tools
  • Write regular expressions or XPath expressions
  • Write the Python crawler code

Preview

The result of running the crawler:

The folder where the downloaded images are saved:

Requirement analysis

Our crawler needs to implement at least two functions: searching for images and downloading them automatically.

Searching for images: the most obvious approach is to crawl the results of Baidu Image Search, so let's take a look at it:

Search for a few keywords and you will see that many images come back:

Analyze web pages

Right-click the page and view the source code:

Opening the source reveals a huge wall of markup, and it is hard to locate the resources we want in it.

This is where the developer tools come in. Go back to the page and open the developer tools; we will use the element picker in the upper left corner (mouse-follow inspection).

Then click the part of the page you are interested in, and the code pane below automatically jumps to the corresponding position. As shown below:

Copy this address and search for it in the page source to find where it appears. But here we run into another puzzle: each picture has several addresses, so which one should we use? We can see four fields: thumbURL, middleURL, hoverURL, and objURL.

A bit of analysis shows that the first two are scaled-down versions, hoverURL is the version displayed on mouse hover, and objURL should be the one we want. Open each of these URLs and you will find that objURL points to the largest and clearest image.

Having found the image address, we go back to the source and check whether every objURL is actually an image.

It turns out that all of them end in the .jpg format.

Write regular expressions

pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
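As a quick sanity check, here is the same expression applied to a tiny made-up snippet that mimics the JSON-like data embedded in Baidu's page (the URLs below are invented for illustration):

```python
import re

# Sample text imitating the embedded data; the non-greedy (.*?) stops
# at the first '",' after each "objURL":" marker.
html = ('"thumbURL":"http://a/t.jpg","objURL":"http://a/full_1.jpg",'
        '"fromURL":"x","objURL":"http://b/full_2.jpg",')
pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
print(pic_url)  # ['http://a/full_1.jpg', 'http://b/full_2.jpg']
```

The re.S flag lets `.` match newlines as well, which matters because the real page source spans many lines.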

Write crawler code

We use two packages here: the built-in re module for regular expressions and the requests package for HTTP.

# -*- coding:utf-8 -*-
import re
import requests

Copy the link of a Baidu Image Search results page, fetch it with requests, and then apply the regular expression:

url = 'https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=%E6%A0%97%E5%B1%B1%E6%9C%AA%E6%9D%A5%E5%A4%B4%E5%83%8F&ct=201326592&ic=0&lm=-1&width=&height=&v=index'

html = requests.get(url).text
pic_url = re.findall('"objURL":"(.*?)",', html, re.S)

Because there are many images, we need a loop. We print each URL as we go and fetch it with requests; since some image URLs may fail to open, we add a 10-second timeout.

pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
i = 1
for each in pic_url:
    print(each)
    try:
        pic = requests.get(each, timeout=10)
    except requests.exceptions.ConnectionError:
        print('[Error] The current picture cannot be downloaded')
        continue

The next step is to save the images. We create an images directory in advance, save all the images into it, and name them with sequential numbers.

        path = '../images/' + keyword + '_' + str(i) + '.jpg'
        with open(path, 'wb') as fp:
            fp.write(pic.content)
        i += 1
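The snippet above assumes the ../images directory already exists. A small sketch of a helper that creates it on demand could look like this (the helper name and layout are my own, not part of the original code):

```python
import os

def ensure_image_path(base_dir, keyword, index):
    """Create base_dir if needed and return the save path for one image.

    Illustrative helper: os.makedirs with exist_ok=True is a no-op
    when the directory is already there.
    """
    os.makedirs(base_dir, exist_ok=True)
    return os.path.join(base_dir, keyword + '_' + str(index) + '.jpg')
```

Calling `ensure_image_path('../images', keyword, i)` at the top of the save step would then replace the manual string concatenation.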

Complete code

# -*- coding:utf-8 -*-
import re
import requests


def download_pic(html, keyword):
    pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
    i = 1
    print('Found images for keyword: ' + keyword + ', starting download...')
    for each in pic_url:
        print('Downloading image ' + str(i) + ', URL: ' + str(each))
        try:
            pic = requests.get(each, timeout=10)
        except requests.exceptions.ConnectionError:
            print('[Error] The current picture cannot be downloaded')
            continue

        path = '../images/' + keyword + '_' + str(i) + '.jpg'
        with open(path, 'wb') as fp:
            fp.write(pic.content)
        i += 1


if __name__ == '__main__':
    word = input("Input key word: ")
    url = 'http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=' + word + '&ct=201326592&v=flip'
    result = requests.get(url)
    download_pic(result.text, word)

You may notice that some images fail to display, and opening their URLs confirms they are indeed gone.

This is because some images are cached on Baidu's servers: we can still see them on Baidu, but the original link has expired.
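One hedge against such dead links is to check the response before saving it. The helper below is a sketch (the function name and heuristic are mine, not from the original post); it works on any response-like object exposing a status_code and a headers mapping, such as requests.Response:

```python
def looks_like_image(resp):
    """Return True if the response plausibly carries an image.

    Heuristic check: HTTP 200 and an image/* Content-Type header.
    An expired link often returns HTML or an error status instead.
    """
    ctype = resp.headers.get('Content-Type', '')
    return resp.status_code == 200 and ctype.startswith('image/')
```

In the download loop you could then skip a URL when `looks_like_image(pic)` is False, instead of writing an HTML error page to disk with a .jpg name.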

Summary

Enjoy your first image-download crawler! And of course it is not limited to Baidu: the same pattern can be adapted to other sites.

The complete code is on GitHub: https://github.com/nnngu/BaiduImageDownload
