Crawling dynamically loaded website images with Python

The first article covered crawling static web pages and static image sites, which is, to be honest, the easiest case. Today we will crawl a dynamic website.
Without further ado, the target is Duitang.com:
https://www.duitang.com/search/?kw=%E6%A0%A1%E8%8A%B1&type=feed

Let's analyze the website first. The URL does not change as you browse: there is no paging at all. Instead, every time you scroll down past a certain number of pictures, a new batch is loaded. This is what is called a dynamic website. With a dynamic website, the most important part is not writing the code but analyzing the site to find where the data really comes from.

Now that we know this is a dynamic website, open the developer tools (right-click and choose Inspect, or press F12 directly), click Network, and then click XHR.

You will find that the list is empty. What went wrong? Nothing much: refresh the page (F5).
Still nothing after refreshing? Scroll down the page and the requests appear.

This is the real URL that we need to crawl. If you keep scrolling down, many similar URLs appear. Try to spot the differences between them.
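
As a rough sketch of what those differences look like, the snippet below builds a few of the request URLs. It assumes that only the start parameter changes between requests and that it grows by 24 each time, which matches the endpoint and step used in the code later in this post; build_page_url is a helper name made up for this illustration.

```python
from urllib.parse import urlencode

# Endpoint taken from the request URL used later in this post.
BASE_URL = "https://www.duitang.com/napi/blog/list/by_search/"


def build_page_url(keyword, start):
    """Return the search URL for the batch that begins at offset `start`."""
    # urlencode percent-encodes the keyword the same way the browser does.
    query = urlencode({"kw": keyword, "type": "feed", "start": start})
    return f"{BASE_URL}?{query}"


# Assumption: each scroll requests the next 24 items.
for start in range(0, 72, 24):
    print(build_page_url("校花", start))
```

Comparing the printed URLs side by side makes it clear that the offset is the only part that moves.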

Now let's analyze this request. Following the steps above, we can see where the pictures are actually stored, and that the response is in JSON format. How do we parse it? Here is a free online JSON parser: https://www.json.cn/

First copy this URL and open it in a new window, then copy the content it returns and paste it into the site above. After this simple copy and paste, you should be able to see the image addresses.
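
If you would rather not paste the response into a website, Python's built-in json module can format it locally. The sketch below uses a heavily trimmed, hypothetical sample of the response; only the data, object_list, photo and path keys come from the real structure used in this post, and the real response carries many more fields.

```python
import json

# Trimmed, hypothetical sample of the response body (one item only).
raw = (
    '{"data": {"object_list": [{"photo": {"path": '
    '"https://c-ssl.duitang.com/uploads/item/201511/17/20151117121559_3w8u4.jpeg"'
    '}}]}}'
)

parsed = json.loads(raw)  # JSON text -> Python dicts and lists
# indent=2 spreads the nesting over several lines so the keys are easy to follow.
pretty = json.dumps(parsed, indent=2, ensure_ascii=False)
print(pretty)
```

Reading the indented output, the path to each image address is data, then object_list, then photo, then path.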

With the analysis complete, let's start on the code:

import requests

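# The XHR endpoint found in the Network panel; start=0 requests the first batch
url = "https://www.duitang.com/napi/blog/list/by_search/?kw=校花&type=feed&start=0"

# Send a browser-like User-Agent with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3760.400 QQBrowser/10.5.4083.400'
}

# The response body is JSON, so decode it straight into a dict
res = requests.get(url, headers=headers, timeout=10).json()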
obj_list = res['data']["object_list"]
for img in obj_list:
    print(img["photo"]["path"])

Output:
https://c-ssl.duitang.com/uploads/item/201511/17/20151117121559_3w8u4.jpeg
https://c-ssl.duitang.com/uploads/item/201509/18/20150918195433_fA4wF.jpeg
https://c-ssl.duitang.com/uploads/item/201611/13/20161113191506_Qhxcw.jpeg
https://c-ssl.duitang.com/uploads/item/201312/17/20131217203949_RziBx.jpeg
https://c-ssl.duitang.com/uploads/item/201509/18/20150918195644_yUaTx.jpeg
https://c-ssl.duitang.com/uploads/item/201504/19/20150419H0813_FWMv4.jpeg
https://c-ssl.duitang.com/uploads/item/201504/19/20150419H0549_cGkMQ.jpeg
...
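
The loop above indexes straight into the response, so it would raise a KeyError if an item had no photo. That may never happen on this site, but as a precaution, here is a minimal sketch of a more forgiving extraction. extract_image_urls is a hypothetical helper, and the sample payload is made up for the demonstration.

```python
def extract_image_urls(payload):
    """Collect every photo path from a parsed response, skipping incomplete items."""
    data = payload.get("data") or {}
    urls = []
    for item in data.get("object_list") or []:
        photo = item.get("photo") or {}
        path = photo.get("path")
        if path:  # ignore items without a usable address
            urls.append(path)
    return urls


# Made-up payload: one complete item, one without a photo, one with an empty photo.
sample = {
    "data": {
        "object_list": [
            {"photo": {"path": "https://c-ssl.duitang.com/uploads/item/201511/17/20151117121559_3w8u4.jpeg"}},
            {"id": 1},
            {"photo": None},
        ]
    }
}
print(extract_image_urls(sample))
```

With a real response, you would pass the result of requests.get(...).json() instead of sample.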

A few lines of code are enough to get the picture URLs. It's that simple. The next step is to download the pictures locally. The complete code:

import os
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3760.400 QQBrowser/10.5.4083.400'
}

os.makedirs("./images", exist_ok=True)   # create the output folder if it does not exist yet

for page in range(0, 240, 24):        # crawl 10 pages of images (start = 0, 24, ..., 216)
    url = f"https://www.duitang.com/napi/blog/list/by_search/?kw=校花&type=feed&start={page}"

    res = requests.get(url, headers=headers, timeout=10).json()
    obj_list = res['data']["object_list"]
    for img in obj_list:
        img_url = img["photo"]["path"]

        filename = img_url.split("/")[-1]   # use the last part of the URL as the file name

        image = requests.get(img_url, headers=headers, timeout=10)
        with open("./images/" + filename, "wb") as f:
            f.write(image.content)
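
The script takes each file name from img_url.split("/")[-1], which works for the addresses shown earlier. If an address ever carried a query string, that suffix would end up in the file name. The sketch below is one possible way to guard against it; local_path_for is a hypothetical helper and the images folder name simply mirrors the script.

```python
import os
import posixpath
from urllib.parse import urlsplit


def local_path_for(img_url, folder="images"):
    """Map an image URL to a local file path, ignoring any query string."""
    # urlsplit separates the path from "?..." so only the real file name is kept.
    name = posixpath.basename(urlsplit(img_url).path)
    if not name:
        raise ValueError(f"no file name in URL: {img_url}")
    return os.path.join(folder, name)


print(local_path_for("https://c-ssl.duitang.com/uploads/item/201511/17/20151117121559_3w8u4.jpeg"))
```

The returned path could replace the string concatenation passed to open() in the script.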

When the script finishes, the downloaded pictures are in the images folder.
All of this takes about 20 lines of code. For a dynamically loaded website, analysis is the most important part. As long as the analysis is correct, writing the code is a matter of minutes. That's it for today. If you have any suggestions, feel free to leave me a message.


Origin blog.csdn.net/weixin_51211600/article/details/108919470