Python Baidu image crawling

Still stuck downloading pictures one by one? This time we'll crawl plenty of them in one go!

The code itself is fairly conventional, but getting the links takes a little skill. Let's go straight to the code and the explanation.

"""这次从逻辑上层到逻辑底层讲解"""
if __name__=='__main__':
    headers={
    
    
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) \
    Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362',
        'Referer':'https://www.baidu.com/?tn=18029102_3_dg'
        }#设置必不可少的请求头
    name=input('请输入你要爬取图片的相关名:')
    num=int(input('请输入你要爬取的页数:'))
    print('爬取图片中……')
    pictureSpider(name,num)#爬虫入口,等会你就知道为啥要传这两个参数了
    print('程序执行完毕')
"""有必要说的就是程序可以美观但是人机交互还是要有的,这是个好习惯,没有提示的话那连作为作者的我们也会抓瞎,不知道错从何来"""      

The only modules imported this time are requests and json. A few tips on packet capture, or call it a habit: data generally travels in packets, and you can't get everything from a page's HTML source. JSON is a common data-transport format, and this crawl skips HTML parsing entirely; what would have been page-source analysis turns into practice in processing JSON data.
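Before writing the crawler, it helps to dump one of these packets and look at its structure. A minimal sketch, assuming 'cat' as an arbitrary example keyword and the same acjson endpoint used in the code below (the 'data' and 'thumbURL' fields are the ones the crawler relies on):

import json
import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # any ordinary browser UA works for a quick probe
url = 'http://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&is=&fp=result&word=cat&pn=0&rn=30'
response = requests.get(url, headers=headers)
data = json.loads(response.text, strict=False)
print(list(data.keys()))  # top-level keys of the packet
print(json.dumps(data['data'][0], ensure_ascii=False, indent=2))  # first record; note the thumbURL field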

"""代码很精简,也就表明,这里面调用的函数代码会比较长"""
def pictureSpider(name,num):
    pic_urls=get_detail_url(name,num) #这是用来获取百度图片的json数据包内容
    print('图片下载中……')
    for url in pic_urls:#逐一下载返回的图片链接内容
        downloads(url)

Now let's look at how the two functions get_detail_url and downloads are implemented.

"""我们先从短的开始吧,都是些常规操作,要是你不熟悉的话可以把这个作为模板"""
def downloads(url):
    name=url.split('/')[-1]#截取图片链接最后的字段作为图片名
    response =requests.get(url,headers=headers)
    #最好想清楚图片要存在哪里,自行修改
    with open('E:/Photo/{}'.format(name),'wb')as f:
        f.write(response.content)
    print(url+'下载成功')
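One caveat worth adding: open() will fail if the E:/Photo folder doesn't exist yet. A small guard you could run once before downloading (a sketch; pick whatever path suits you):

import os

save_dir = 'E:/Photo'
os.makedirs(save_dir, exist_ok=True)  # create the save folder if it's missing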

Are you ready for the wall of code?

def get_detail_url(name, num):
    pic_urls = []  # holds the image links
    for i in range(num):
        # Don't make the page count too large; you may simply run out of results.
        # My suggestion is 30 or fewer, which already gives close to a thousand
        # images (work out for yourself how many images each JSON packet holds).
        index = i * 30  # just an offset; you only discover this by comparing requests, so crawling is learned hands-on
        if index == 180:  # this offset kept erroring when I crawled my idol, so skip it; adjust for your own case
            continue
        url = 'http://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&is=&fp=result&word={}&pn={}&rn=30'.format(name, index)
        try:
            response = requests.get(url, headers=headers)
            # filter out invalid responses
            if len(response.text) > 5000:  # rough check that this is the JSON we actually need
                # strict=False lets json accept control characters inside strings; JSON handling is worth studying
                data = json.loads(response.text, strict=False)
                thumb_urls = []
                # the Chrome extension JSON Viewer makes analyzing these packets much easier; look it up if you don't know it
                for j in range(30):  # as you guessed, each JSON packet holds roughly 30 image links
                    thumb_url = data['data'][j]['thumbURL']
                    thumb_urls.append(thumb_url)
                pic_urls.extend(thumb_urls)
                print(url + ' fetched successfully')
            else:
                continue
        except (requests.RequestException, json.JSONDecodeError, KeyError):
            continue
    print('Image data crawled')
    return pic_urls
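As a quick sanity check on the pagination math: each request asks for rn=30 results at offset pn=i*30, so num pages yield at most num*30 links. A hypothetical call ('cat' is just an example keyword):

pic_urls = get_detail_url('cat', 2)  # requests pn=0 and pn=30
print(len(pic_urls))  # at most 2 * 30 = 60 links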

In fact, before writing this article my code was hard-coded to crawl portraits of my own idol; once I realized it could be generalized to fetch images for any input keyword, this article became possible.

The original URL looked like this:
http://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E8%A5%BF%E9%87%8E%E4%B8%83%E6%BF%91&cl=2&lm=-1&ie=utf-8&oe=utf8&adpicid=&st=&z=&ic=&hd=&latest=&copyright=&word=%E8%A5%BF%E9%87%8E%E4%B8%83%E6%BF%91&s=&se=&tab=&width=&height=&face=&istype=&qc=&nc=1&fr=&expermode=&force=&pn={}&rn=30

Not nearly as clean as the one used in the code above, is it? That shorter URL is my trimmed-down version.
Not every part of a URL is key information; we can streamline it by hand.

Friends familiar with encodings will recognize that these %XX strings are just Chinese characters in another form (UTF-8 percent-encoding). For example:
"Hu Ge" (胡歌) encodes to %E8%83%A1%E6%AD%8C

http://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&is=&fp=result&word={}&pn={}&rn=30
The most important parameter in this URL is word: the keyword for the images you want to search. pn is the result offset and rn the number of results per request.
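Equivalently, you could let requests assemble the query string instead of formatting it by hand. A sketch of the same request, with name, index, and headers as in the code above:

params = {'tn': 'resultjson_com', 'ipn': 'rj', 'is': '', 'fp': 'result',
          'word': name, 'pn': index, 'rn': 30}
response = requests.get('http://image.baidu.com/search/acjson',
                        params=params, headers=headers)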

As for how to streamline the URL: just delete parts of it yourself and check that the request still works; that's the quickest way. Apparently some websites even offer this kind of service.
Haha, which also confirms the saying "three parts are destined, seven parts depend on hard work". Effort matters, but sometimes you need a bit of luck too. Take a moment to appreciate that.
Finally, here's a picture of the results!
[screenshot of the crawl results]
If you like it, please follow!
