Crawling Baidu Images with Python (HD Originals)

Today's goal is just what the title says: crawl Baidu Images (https://image.baidu.com/) and download the high-definition originals, not the thumbnails.

Baidu's image site has plenty of pitfalls. On seeing that the page is dynamic, most people go straight to the Network tab to look for the data, only to find that it contains nothing but thumbnails. What then? Dig through the JS, set breakpoints, debug, and so on... That does work, but what if we don't know JS at all? Is there another way? Of course there is, and today I will share one of the simpler ones. If you have worked through the previous two articles as well as this one, you should be able to grab roughly 90% of the pictures you come across; the remaining 10% are those that require a login or VIP membership.

Okay, enough talk, let's get started. Look at the two screenshots of the search results page: did you notice the difference between them?

Yes, the URLs are different. The first URL is too long to take in at a glance, while the second is a normal length, yet the two pages display exactly the same content. From this you can see that the first URL carries a lot of unnecessary parameters. They make no difference to the user, but they make the page harder for a crawler to work with.

Now that you have seen the difference, how do you tell which parts of the URL are needed and which are not?

Looking at the screenshot, the parts marked with boxes are clearly not needed, so they can be removed. Keep removing unneeded parts in the same way, step by step, checking each time that the page still shows the same results. Whatever remains is the real URL we are looking for.
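
The same trimming can be done in code rather than by hand. The sketch below is only an illustration: it assumes that the four parameters in the short URL used later in this article (tn, ipn, word, pn) are the ones worth keeping, and both the long URL and the helper name trim_search_url are made up for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these four parameters are the ones the page actually needs.
KEEP = ("tn", "ipn", "word", "pn")


def trim_search_url(long_url, keep=KEEP):
    """Return long_url with every query parameter not listed in keep removed."""
    parts = urlsplit(long_url)
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in keep
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(pairs), ""))


# Hypothetical long URL, similar in shape to what the address bar shows.
long_url = (
    "https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592"
    "&cl=2&lm=-1&st=-1&fm=result&word=wallpaper&pn=0&ie=utf-8"
)
short_url = trim_search_url(long_url)
print(short_url)
```

Whether a given parameter is really optional still has to be confirmed in the browser; the code only automates the removal.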

Now that we have the URL, how do we find the pictures? Open the page source and it turns out to be all JS code. What now? We have never worked with JS.
In fact, this is another pitfall Baidu has set: the image links we need are right there in the page source. So how do we find them? It's simple: note the format of a picture's link (its file extension, for example), then search the page source for it.
The same picture appears in quite a few sizes, so you can pick whichever you want. I chose the original.
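
To see how one picture can sit under several keys, here is a small sketch run against a made-up fragment of page source. Only objURL is confirmed by this article; the other key names (thumbURL, middleURL) and all the example.com links are assumptions for illustration, so check the real page source for the names actually present.

```python
import re

# Made-up fragment imitating the page source; not real Baidu data.
sample = (
    '{"thumbURL":"https://example.com/t/cat_small.jpg",'
    '"middleURL":"https://example.com/m/cat_medium.jpg",'
    '"objURL":"https://example.com/full/cat.jpg"}'
)


def urls_by_key(text):
    """Map every key ending in 'URL' to the list of links stored under it."""
    found = {}
    for key, link in re.findall(r'"(\w*URL)":"(.*?)"', text):
        found.setdefault(key, []).append(link)
    return found


sizes = urls_by_key(sample)
for key, links in sizes.items():
    print(key, links)

# Pick the key that holds the size you want; here, the original picture.
original = sizes["objURL"][0]
```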

With the image links located, the site analysis is done, so here is the code:

import re
import requests

# Search URL trimmed down to the needed parameters; word is the search keyword
url = 'https://image.baidu.com/search/index?tn=baiduimage&ipn=r&word=高清壁纸&pn=0'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3760.400 QQBrowser/10.5.4083.400',
}

# Fetch the page source and pull out every original-image link
res = requests.get(url, headers=headers).text
for img in re.findall(r'"objURL":"(.*?)",', res):
    print(img)

Output:
http://img.pconline.com.cn/images/upload/upc/tx/wallpaper/1306/21/c1/22386490_1371808534385.jpg
http://b-ssl.duitang.com/uploads/item/201312/27/20131227233312_feEjH.jpeg
http://up.enterdesk.com/edpic/8c/d2/d9/8cd2d9421559855d153e872faf514137.jpg
http://01.minipic.eastday.com/20171011/20171011095832_49d23dd458b7446249d84fda3d4ea1c1_2.jpeg
http://up.enterdesk.com/edpic/f1/63/4d/f1634dc19bcaae62e769b3d9315cf194.jpg
http://a.hiphotos.baidu.com/zhidao/pic/item/e824b899a9014c08be3151a4087b02087bf4f4ad.jpg
http://up.enterdesk.com/edpic/2d/a3/18/2da318335152ebe82061e55afa883be5.jpg
http://up.enterdesk.com/edpic/58/bf/e9/58bfe913ea48cdb2b4174432cd103583.jpg
http://b.hiphotos.baidu.com/zhidao/pic/item/63d0f703918fa0ece9221cfe279759ee3c6ddb58.jpg
http://b.zol-img.com.cn/desk/bizhi/start/3/1379385428221.jpg
...

Because the links sit inside JS text in the page source rather than in ordinary HTML tags, a regular expression (the re module) is the easiest way to match them.
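
To make the pattern concrete, here is a small sketch on a made-up string (the example.com links are placeholders, not Baidu output). It shows why the non-greedy (.*?) matters: a greedy (.*) would run on to the last ", in the text and swallow several entries at once.

```python
import re

# Made-up text with two entries, imitating the shape of the page source.
text = (
    '"objURL":"https://example.com/a.jpg","width":1920,'
    '"objURL":"https://example.com/b.png","width":1280,'
)

# Non-greedy: stop at the first '",' after each "objURL":"
lazy = re.findall(r'"objURL":"(.*?)",', text)

# Greedy: stretch to the last '",' in the whole text
greedy = re.findall(r'"objURL":"(.*)",', text)

print(lazy)         # one clean link per entry
print(len(greedy))  # the greedy pattern merges both entries into one match
```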

One small issue remains. Baidu images come in many formats (jpg, jpeg, png, and so on), so to keep the correct suffix we name each file after the last segment of its link. The full code:

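import os
import re
import requests


def get_url(word):
    # pn is the paging offset: 30 images per page, so 0, 30, 60, 90, ...
    url = f'https://image.baidu.com/search/index?tn=baiduimage&ipn=r&word={word}&pn=0'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3760.400 QQBrowser/10.5.4083.400',
    }

    # Create the output folder if it does not exist yet
    os.makedirs('./images', exist_ok=True)

    res = requests.get(url, headers=headers).text
    for img in re.findall(r'"objURL":"(.*?)",', res):
        print(img)

        try:
            image = requests.get(img, headers=headers, timeout=10)
        except requests.RequestException:
            continue  # skip links that fail to download

        # Use the last segment of the link (suffix included) as the file name,
        # dropping any query string so the name is valid on disk
        file_name = os.path.basename(img.split('?')[0])
        if not file_name:
            continue
        with open(os.path.join('./images', file_name), 'wb') as f:
            f.write(image.content)


if __name__ == "__main__":
    word = input("Enter the keyword for the images you want to collect: ")
    get_url(word)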
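
The code above only fetches the first page (pn=0). Since its comment notes that pn advances in steps of 30, a minimal sketch of building the URLs for several pages might look like this. The helper name page_urls and its pages parameter are hypothetical, and the step of 30 is taken from that comment rather than verified against the site.

```python
from urllib.parse import quote

BASE = "https://image.baidu.com/search/index?tn=baiduimage&ipn=r&word={word}&pn={pn}"
PAGE_SIZE = 30  # assumption taken from the comment: 30 images per page


def page_urls(word, pages):
    """Build the search URL for each of the first `pages` pages."""
    encoded = quote(word)  # percent-encode non-ASCII keywords
    return [BASE.format(word=encoded, pn=i * PAGE_SIZE) for i in range(pages)]


urls = page_urls("wallpaper", 3)
for u in urls:
    print(u)
```

Each of these URLs can then go through the same request and regex steps as the first page.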


Done! If anything is unclear, leave a comment below and let's discuss it together!


Origin blog.csdn.net/weixin_51211600/article/details/108991396