How to crawl 1,000+ Baidu images? A Python crawler tutorial with code

How do you use Python to crawl Baidu images?

Experimental environment: Python 3.x
Third-party library: Requests 2.14.2

1. First, open Baidu Images and enter a keyword to search for the pictures you need (here we use electronic scales, 电子秤, as the example).


As you scroll down, more images keep loading as the page slides: this is a dynamically loaded page, which makes things a little trickier. If you view the page source you will not find the image URLs there. Don't worry; you just need to understand how dynamic loading works. The page runs JavaScript that inserts the image data into the HTML tags at runtime, so the image information is not visible in the static source. But since the images do show up in the browser, the page must be requesting data packets for them, and once you find the file that carries that data you can find the image URLs.
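A quick way to see this for yourself is the minimal sketch below: it downloads the static page source and counts how many thumbnail URLs appear in it. The search URL format used here is taken from the browser address bar and is an assumption, not part of the original code.

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'}
# Static search page for the keyword 电子秤 (electronic scale); URL format is an assumption
html = requests.get('https://image.baidu.com/search/index?tn=baiduimage&word=电子秤', headers=headers).text
print(len(html))               # size of the static source
print(html.count('thumbURL'))  # thumbnail URLs embedded in the static source, if any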

2. Next, inspect the data requests and find the image URLs.

Continuing with the electronic-scale search as the example: open the browser's developer tools, switch to the Network tab, and scroll down the image page. As more images load, more and more acjson?tn=resultjson&ipn=... request files appear. Click one of these files and open its Preview, and you will see a piece of JSON data; expand it and you will find around 30 entries, and expanding an entry shows that each one contains the detailed information of a single picture.
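If you prefer to confirm this from code rather than the browser, a minimal sketch is to request one of the captured acjson URLs and look at the shape of the JSON it returns. The URL below is the first of the four listed a little further down; its logid and timestamp come from that capture and may eventually expire.

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'}
# First captured acjson request (pn=30); see the full list of URLs below
api_url = 'https://image.baidu.com/search/acjson?tn=resultjson_com&logid=8615903434039220370&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E7%94%B5%E5%AD%90%E7%A7%A4&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=&z=&ic=&hd=&latest=&copyright=&word=%E7%94%B5%E5%AD%90%E7%A7%A4&s=&se=&tab=&width=&height=&face=&istype=&qc=&nc=1&fr=&expermode=&force=&pn=30&rn=30&gsm=1e&1614774604107='
data = requests.get(api_url, headers=headers).json()
print(len(data.get('data', [])))        # number of entries in this packet (about 30)
print(data['data'][0].get('thumbURL'))  # each entry carries the picture's thumbnail URL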

At this point you will notice that each file contains only 30 entries, so how do you get 1,000+? Look for the differences between requests. After scrolling, four acjson?tn=resultjson&ipn=... files have appeared in this example; each request returns 30 entries, and each request returns different data, so the requested URLs must differ as well. Putting the URLs of these four files side by side gives the following:

https://image.baidu.com/search/acjson?tn=resultjson_com&logid=8615903434039220370&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E7%94%B5%E5%AD%90%E7%A7%A4&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=&z=&ic=&hd=&latest=&copyright=&word=%E7%94%B5%E5%AD%90%E7%A7%A4&s=&se=&tab=&width=&height=&face=&istype=&qc=&nc=1&fr=&expermode=&force=&pn=30&rn=30&gsm=1e&1614774604107=
https://image.baidu.com/search/acjson?tn=resultjson_com&logid=8615903434039220370&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E7%94%B5%E5%AD%90%E7%A7%A4&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=&z=&ic=&hd=&latest=&copyright=&word=%E7%94%B5%E5%AD%90%E7%A7%A4&s=&se=&tab=&width=&height=&face=&istype=&qc=&nc=1&fr=&expermode=&force=&pn=60&rn=30&gsm=3c&1614774604251=
https://image.baidu.com/search/acjson?tn=resultjson_com&logid=8615903434039220370&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E7%94%B5%E5%AD%90%E7%A7%A4&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=&z=&ic=&hd=&latest=&copyright=&word=%E7%94%B5%E5%AD%90%E7%A7%A4&s=&se=&tab=&width=&height=&face=&istype=&qc=&nc=1&fr=&expermode=&force=&pn=90&rn=30&gsm=5a&1614774716612=
https://image.baidu.com/search/acjson?tn=resultjson_com&logid=8615903434039220370&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E7%94%B5%E5%AD%90%E7%A7%A4&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=&z=&ic=&hd=&latest=&copyright=&word=%E7%94%B5%E5%AD%90%E7%A7%A4&s=&se=&tab=&width=&height=&face=&istype=&qc=&nc=1&fr=&expermode=&force=&pn=120&rn=30&gsm=78&1614774716744=

It is not difficult to see that everything is identical except for pn, gsm, and the trailing timestamp parameter. pn increases in steps of 30, and gsm is simply pn written in hexadecimal (30 → 1e, 60 → 3c, 90 → 5a, 120 → 78).
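Since gsm can be derived from pn, you only ever need to vary pn when building the request URLs. A small sketch of that relationship:

# pn grows in steps of 30; gsm is just pn written in hexadecimal
for page in range(1, 5):
    pn = page * 30
    gsm = format(pn, 'x')  # 30 -> '1e', 60 -> '3c', 90 -> '5a', 120 -> '78'
    print(pn, gsm)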

3. The experimental code

Import the libraries

import requests
import time

Simulate a browser

# Request headers: pretend to be a browser
headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'
}
keyword = '电子秤' # search keyword (electronic scale)
max_page = 34 # number of result pages to request (34 pages x 30 images = 1020)
i=1 # counts the downloaded images

Create a folder named 电子秤图片 in the same directory as the script (or create it from Python as sketched below), then crawl and download the images.
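If you prefer not to create the folder by hand, a small optional addition (not part of the original script) creates it from Python before the download loop runs:

import os

# Optional: create the output folder next to the script if it does not exist yet
os.makedirs('./电子秤图片', exist_ok=True)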

for page in range(1,max_page+1):
    page = page*30
    # Request URL: only pn (and the keyword) changes from page to page
    url = 'https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord='\
            +keyword+'&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&word='\
            +keyword+'&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&cg=wallpaper&pn='\
            +str(page)+'&rn=30&gsm=1e&1596899786625='
    # Send the request and receive the response
    response = requests.get(url=url,headers=headers)
    # Parse the response body as JSON
    json = response.json()
    if json.get('data'):
        for item in json.get('data')[:30]:
            # Thumbnail URL of the picture
            img_url = item.get('thumbURL')
            # Fetch the picture
            image = requests.get(url=img_url)
            # Save the picture into the folder created above
            with open('./电子秤图片/%d.jpg' %i,'wb') as f:
                f.write(image.content) # binary image data
            time.sleep(1) # wait 1 second between downloads
            print('Image %d of "%s" downloaded...'%(i,keyword))
            i+=1
print('End!')
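One caveat: the occasional entry has no thumbURL, and a single failed download will stop the whole run. If that happens, a hedged variant of the inner download step (not in the original code) simply skips bad entries:

# Variant of the inner loop: skip entries without a usable thumbnail URL
# and ignore single failed downloads instead of crashing
for item in json.get('data')[:30]:
    img_url = item.get('thumbURL')
    if not img_url:
        continue                         # entry has no thumbnail URL
    try:
        image = requests.get(url=img_url, timeout=10)
    except requests.RequestException:
        continue                         # skip this picture if the request fails
    with open('./电子秤图片/%d.jpg' % i, 'wb') as f:
        f.write(image.content)
    time.sleep(1)
    print('Image %d of "%s" downloaded...' % (i, keyword))
    i += 1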

4. The crawl results

More than 1,000 images are successfully crawled into the specified folder.
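To double-check the count, a one-line sketch (assuming the folder name used in the code above) is enough:

import os

# Number of files that ended up in the output folder
print(len(os.listdir('./电子秤图片')))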

Data reference
Jianshu author: Hidden Ink Left Blank
Link: https://www.jianshu.com/p/e7031f06307c

Origin: https://blog.csdn.net/weixin_44763047/article/details/114340943