Crawler in practice: crawling images in bulk via Baidu keyword search

Target address: http://image.baidu.com/ — then enter the keyword 美女 (beauty).

Analyzing the URL

The original URL is shown in the figure.

Pasted out, it looks as follows. (Notice that the browser's address bar clearly shows Chinese, but once you copy the URL into a notepad or into code, it turns into this:)

https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=result&fr=&sf=1&fmq=1610771025434_R&pv=&ic=&nc=1&z=&hd=&latest=&copyright=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&sid=&word=%E7%BE%8E%E5%A5%B3

Many websites encode GET parameters or keywords in their URLs, so problems appear when we copy the URLs out.

URL encoding and decoding

import urllib
from urllib import parse
import urllib.request

data = {'word': '美女'}

# In Python 3, urlencode lives in urllib.parse; note that urlencode() takes a dict
print(urllib.parse.urlencode(data))
# urllib.parse.unquote() converts a URL-encoded string back into the original string
print(urllib.parse.unquote('word=%E7%BE%8E%E5%A5%B3'))

Analyzing the page source
Press F12, or right-click the page and choose "Inspect element". Once the panel is open, locate the image element.

Copy the following URL, and note the escape characters:

imgurl="https:\/\/ss0.bdstatic.com\/70cFvHSh_Q1YnxGkpoWK1HF6hhy\/it\/u=2718853745,1288801299&fm=214&gp=0.jpg" 
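The page source escapes every forward slash as `\/`, so the copied address is not directly usable. A minimal sketch of restoring it (using the URL copied above):

```python
# The objURL copied from the page source escapes "/" as "\/"
raw = r"https:\/\/ss0.bdstatic.com\/70cFvHSh_Q1YnxGkpoWK1HF6hhy\/it\/u=2718853745,1288801299&fm=214&gp=0.jpg"

# Strip the backslash before each slash to get a normal URL
clean = raw.replace("\\/", "/")
print(clean)
```

After the replacement, `clean` is an ordinary `https://...` address that can be opened in a browser or passed to a downloader.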

Then right-click a blank area of the current page, choose "View page source", and use Ctrl+F to search. (I entered gp=0.jpg here, locating the image by the last few characters of its URL.)

Why does one picture have so many addresses, and which one should we use? You can see thumbURL, objURL, and so on.
Analysis shows that the first two are reduced versions, hoverURL is the version displayed when the mouse hovers over the image, and the address under objURL is the one we need. If you don't believe it, open these URLs yourself and you will find that the obj one is the largest and clearest.

Write a regular expression or XPath expression

pic_url = re.findall('"objURL":"(.*?)",', html, re.S)

This matches everything after "objURL":" non-greedily, up to the closing quote and comma, and returns all matches.
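As a quick sanity check, the expression can be tried on a hand-made fragment that mimics the page source (the URLs below are made up, not real response data):

```python
import re

# Fabricated fragment in the same shape as Baidu's page source
html = '"thumbURL":"https://img.example/t.jpg","objURL":"https://img.example/full.jpg",'

# Non-greedy capture of everything between "objURL":" and the closing ",
pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
print(pic_url)
```

Only the objURL value is captured; the thumbURL entry is ignored because the pattern anchors on the literal key name.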


Sometimes, crawling with our local machine's default headers, we cannot fetch certain pages and get a 403 error, because those pages have anti-crawler measures to prevent malicious harvesting of their data.
We can set some header fields to masquerade as a browser when visiting these sites, which solves the problem.
First, click something on the Baidu page so that an action happens; a lot of requests appear in the lower (Network) panel, as shown in the figure.

Now click www.baidu.com in the figure, and the view shown in the figure appears.

Under Headers, scroll down to find the User-Agent string. This is the information used below to masquerade as a browser; copy it out.

Full code
language python

from urllib.parse import quote
import string
import re
from urllib import request
import urllib.request

word = input('Keyword: ')
url = 'http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=' + word + '&ct=201326592&v=flip'
url = quote(url, safe=string.printable)  # fixes the ascii-codec encoding error; comment out if no error occurs

# Masquerade as a browser
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
# Install the opener globally
urllib.request.install_opener(opener)

# Fetch the page
url_request = request.Request(url)
url_response = request.urlopen(url_request, timeout=10)  # request the data (could be merged with the line above); a single HTTP request may take at most 10 s, after which it is aborted
html = url_response.read().decode('utf-8')  # decoding matters: read() returns bytes, so convert to a string

jpglist = re.findall('"thumbURL":"(.*?)",', html, re.S)  # re.S matches across the whole string; thumbURL also matches images of other sizes
print(len(jpglist))
n = 1
for each in jpglist:
    print(each)
    try:
        request.urlretrieve(each, 'D:\\deeplearn\\xuexicaogao\\图片\\%s.jpg' % n)  # save downloaded images into a folder created in advance
    except Exception as e:
        print(e)
    finally:
        print('Download finished.')
    n += 1
    if n == 90:
        break
print('Done')

Code analysis

The crawler raises: UnicodeEncodeError: 'ascii' codec can't encode characters in position 45-47: ordinal not in range(128).
Reason: URLs sent over HTTP must contain only ASCII characters. When a non-ASCII keyword (here, the Chinese search term) appears in the URL, the request fails with this error, so the non-ASCII parts must be percent-encoded, generally as UTF-8.
Use urllib.parse.quote to convert.
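A minimal reproduction of the fix (the Chinese keyword below is just an example):

```python
import string
from urllib.parse import quote

# A URL containing a non-ASCII keyword, as built by the crawler above
url = 'http://image.baidu.com/search/flip?word=美女'

# Leave all printable ASCII untouched; percent-encode only the non-ASCII bytes (UTF-8)
safe_url = quote(url, safe=string.printable)
print(safe_url)
```

With `safe=string.printable`, the scheme, slashes, and query syntax survive intact, and only the Chinese characters are converted to their %-escaped UTF-8 bytes.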

Result folder

Code version 2
language python

import urllib
import urllib.request
from urllib.parse import quote
import re
import os

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36",
    "referer": "https://image.baidu.com"
}
print("****************************************************************************************")
keyword = input("Enter the images to download: ")
dir = "C://Users//Shineion//Desktop//爬虫图//" + keyword
if os.path.exists(dir):
    print("The folder already exists")
else:
    os.makedirs(dir)  # also creates the parent folder if it is missing
    print(dir + " created successfully")
keyword1 = quote(keyword, encoding="utf-8")
url = 'http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=' + keyword1 + '&ct=201326592&v=flip'
req = urllib.request.Request(url, headers=headers)
f = urllib.request.urlopen(req).read().decode("utf-8")
key = r'thumbURL":"(.+?)"'
key1 = re.compile(key)
num = 0
for string in re.findall(key1, f):
    print("Downloading " + string)
    f_req = urllib.request.Request(string, headers=headers)
    f_url = urllib.request.urlopen(f_req).read()
    fs = open(dir + "/" + keyword + str(num) + ".jpg", "wb+")
    fs.write(f_url)
    fs.close()
    num += 1
    print(string + " downloaded successfully")
input("Press any key to exit: ")

Note a problem: this code easily gets stuck while fetching a single picture.
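One hedged workaround for the hang (a sketch, assuming the stall happens during a slow network read; the `fetch` helper is illustrative, not part of the original code):

```python
import socket
import urllib.request

# Abort any network read that stalls for more than 10 seconds,
# so one slow image server cannot hang the whole download loop
socket.setdefaulttimeout(10)

def fetch(url, headers):
    """Download one image's bytes, returning None instead of hanging or crashing."""
    try:
        req = urllib.request.Request(url, headers=headers)
        return urllib.request.urlopen(req).read()
    except (socket.timeout, OSError) as e:
        print("skipped", url, "-", e)
        return None
```

In the loop of code version 2, replacing the bare `urlopen(...).read()` with a call to such a helper lets the crawler skip a stuck image and continue with the next one.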
Author: Electrical - Yu Dengwu

Origin blog.csdn.net/kobeyu652453/article/details/112699472