Simple steps and implementation code for a Python crawler that downloads pictures

table of Contents

1. How to get web page information

1). Read directly from the network

2). Save the source code of the webpage to the local first, then read

2. Analyze the obtained web page information and extract the required information (picture address)

3. Use requests to save the pictures locally, and some problems you may encounter

1). Obtain the picture data and save it to a local file

2). Timeout processing

3). Read and write timeout

4). Timeout retry

4. Use urllib to save the pictures locally, and some problems you may encounter

1). Use urllib 

2). Timeout processing

3). Download again

4). Show download progress

5. Adding headers after urllib and requests set a timeout

1). Requests settings

2). Urllib settings

6. Summary


I have only learned a little Python; my main language is Java. Recently, for personal reasons, I needed a crawler to grab some pictures, so I have been programming while searching Baidu as I go.

To write a crawler, you first need to get the web page information, then find the picture information (the addresses) among the tags on the page, and finally download each picture from its URL and save it.
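
As an overview of those three steps, here is a minimal end-to-end sketch, assuming (as in the examples later in this article) that the pictures sit in img tags inside a div whose class is beautiful; the page address and save path are placeholders.

import requests
from bs4 import BeautifulSoup

url = "http://www.baidu.com"                 # placeholder page address
path = "C:/Users/tj/Desktop/"                # placeholder save directory
headers = {'User-Agent': 'Mozilla/5.0'}

# Step 1: get the web page information
web_data = requests.get(url, headers=headers, timeout=10)
soup = BeautifulSoup(web_data.text, 'lxml')

# Step 2: extract the picture addresses from the img tags
li = [img.attrs['src'] for img in soup.find('div', class_='beautiful').find_all('img')]

# Step 3: download each picture and save it under its index
for i, v in enumerate(li, start=1):
    image = requests.get(v, timeout=10)
    with open(path + str(i) + '.jpg', 'wb') as file:
        file.write(image.content)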

 

1. How to get web page information

1). Read directly from the network

from bs4 import BeautifulSoup
import requests

url = "http://www.baidu.com" 
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
}

# url is the address of the web page
web_data = requests.get(url, headers=headers)    
soup = BeautifulSoup(web_data.text, 'lxml')  


The lines that actually read the page are:

web_data = requests.get(url, headers=headers)    
soup = BeautifulSoup(web_data.text, 'lxml') 

2). Save the source code of the webpage to the local first, then read

from bs4 import BeautifulSoup

# Read the page source that was saved to a local file beforehand
with open('C:/Users/tj/Desktop/test.html', 'r', encoding='utf-8') as file:
    soup = BeautifulSoup(file, 'lxml')
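
The first half of this heading, saving the page source to a local file, is not shown above; a minimal sketch of that step might look like this (the URL and path are only examples):

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
web_data = requests.get("http://www.baidu.com", headers=headers)

# Save the page source locally so it can be parsed later without downloading it again
with open('C:/Users/tj/Desktop/test.html', 'w', encoding='utf-8') as f:
    f.write(web_data.text)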

2. Analyze the obtained web page information and extract the required information (picture address)

It is assumed here that all the pictures are in img tags, that all the img tags sit inside a single div whose class attribute is beautiful, and that the address of each picture is in the src attribute of its img tag.

from bs4 import BeautifulSoup
import requests


# soup has many methods, such as find() and find_all(); search Baidu for the details

img_list = soup.find('div', class_="beautiful").find_all('img')

# Traverse img_list and collect the src of every picture into a list
li = []
for x in range(len(img_list)):
    print(x + 1, ":      ", img_list[x].attrs["src"])   
    li.append(img_list[x].attrs["src"])

3. Use requests to save the pictures locally, and some problems you may encounter

1). Obtain the picture data and save it to a local file

"""
描述
enumerate() 函数用于将一个可遍历的数据对象(如列表、元组或字符串)组合为一个索引序列,同时列出数据和数据下标,一般用在 for 循环当中。

Python 2.3. 以上版本可用,2.6 添加 start 参数。

语法
以下是 enumerate() 方法的语法:
enumerate(sequence, [start=0])

参数
sequence -- 一个序列、迭代器或其他支持迭代对象。
start -- 下标起始位置。
"""
from bs4 import BeautifulSoup
import requests

path="C:/Users/tj/Desktop/"

# i表示下标(从1开始), v表示数组的内容
for i,v in enumerate(li,start=1): 
    # 将 图片地址(即v) 再次放入request中
    image = requests.get(v, timeout=10) 
    """ 
        存取图片过程中,出现不能存储int类型,故而,我们对他进行类型转换str()。
        w:读写方式打开,b:二进制进行读写。图片一般用到的都是二进制。
    """
    with open( path + str(i)+'.jpg', 'wb') as file:
        # content:图片转换成二进制,进行保存。
        file.write(image.content)

  
    # 也可以使用如下方式保存到本地(和上面的保存到本地的方式其实一样)
    dir = path + str(i)+'.jpg'
    fp = open(dir, 'wb')
    fp.write(image.content)
    fp.close()
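
If the pictures are large, a streamed download avoids holding the whole response in memory. This is only a sketch of that variant (it is not from the original article), reusing the li list built in section 2:

import requests

path = "C:/Users/tj/Desktop/"

for i, v in enumerate(li, start=1):
    # stream=True fetches the body lazily instead of loading it all at once
    with requests.get(v, timeout=10, stream=True) as image:
        with open(path + str(i) + '.jpg', 'wb') as file:
            # Write the picture out in 8 KB chunks
            for chunk in image.iter_content(chunk_size=8192):
                file.write(chunk)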

2). Timeout processing

Some picture URLs may fail to open, so timeout handling should be added:

from bs4 import BeautifulSoup
import requests

path = "C:/Users/tj/Desktop/"

# i is the index (starting from 1), v is the element of the list (the picture address)
for i, v in enumerate(li, start=1): 

    try:
        # Put the picture address (v) into a request again
        image = requests.get(v, timeout=10) 
    except requests.exceptions.ConnectionError:
        print('[Error] the current picture cannot be downloaded')
        continue

    with open(path + str(i) + '.jpg', 'wb') as file:
        # content is the binary body of the response; write it out to save the picture
        file.write(image.content) 

Network requests will inevitably run into timeouts. In requests, if you do not set one, your program may wait for a response forever. The timeout can be divided into a connection timeout and a read timeout.

The connection timeout is the number of seconds requests waits while your client connects to the remote machine's port (it corresponds to the underlying connect() call). Even if it is not set, there is a default connection timeout (said to be 21 seconds).

3). Read and write timeout

The read timeout is the time the client waits for the server to send a response. (Specifically, it is the time the client waits between bytes sent from the server; in 99.9% of cases it is the time before the server sends the first byte.)

Simply put, the connection timeout is the maximum time allowed between initiating the request and the connection being established, and the read timeout is the maximum time to wait between the connection succeeding and the server returning a response.

If you set a single value as the timeout, as follows:

r = requests.get('https://github.com', timeout=5)

This timeout value will be used for both the connect and the read timeout. If you want to specify them separately, pass in a tuple: 

r = requests.get('https://github.com', timeout=(3.05, 27))

There is no default value for the read timeout. If it is not set, the program may wait forever; this is why a crawler sometimes hangs with no error message at all.
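
If you want to see which of the two limits was hit, the two timeouts can be caught separately; a small sketch (the URL is only an example):

import requests

try:
    # 3.05 seconds to establish the connection, 27 seconds to wait for the response
    r = requests.get('https://github.com', timeout=(3.05, 27))
except requests.exceptions.ConnectTimeout:
    print('The connection could not be established in time')
except requests.exceptions.ReadTimeout:
    print('The server took too long to send a response')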

4). Timeout retry

Generally we do not give up immediately after a timeout, but set up a retry mechanism, for example three attempts:

def gethtml(url):
    i = 0
    while i < 3:
        try:
            html = requests.get(url, timeout=5).text
            return html
        except requests.exceptions.RequestException:
            # Try again, up to 3 attempts in total; returns None if all of them fail
            i += 1

Actually, requests already wraps this retry mechanism up for us. (The referenced code needed a few changes to run, though...)

import time
import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
s.mount('http://', HTTPAdapter(max_retries=3))
s.mount('https://', HTTPAdapter(max_retries=3))

print(time.strftime('%Y-%m-%d %H:%M:%S'))
try:
    r = s.get('http://www.google.com.hk', timeout=5)
    print(r.text)
except requests.exceptions.RequestException as e:
    print(e)
print(time.strftime('%Y-%m-%d %H:%M:%S'))

max_retries is the maximum number of retries. With 3 retries plus the initial request there are 4 attempts in total, so when every attempt times out after 5 seconds the code above takes about 20 seconds rather than 15.
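
Applied to the picture download loop from section 3, the mounted adapter could be used like this (a sketch only, reusing the li list and path from the earlier snippets; note that max_retries only retries connection-level failures):

import requests
from requests.adapters import HTTPAdapter

path = "C:/Users/tj/Desktop/"

s = requests.Session()
# Retry connection failures up to 3 times for every picture
s.mount('http://', HTTPAdapter(max_retries=3))
s.mount('https://', HTTPAdapter(max_retries=3))

for i, v in enumerate(li, start=1):
    try:
        image = s.get(v, timeout=(3.05, 27))
    except requests.exceptions.RequestException as e:
        print('[Error] the current picture cannot be downloaded:', e)
        continue
    with open(path + str(i) + '.jpg', 'wb') as file:
        file.write(image.content)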

Note: the content from the timeout retry section up to here refers to https://www.cnblogs.com/gl1573/p/10129382.html ; thanks to the author.

4. Use urllib to save the pictures locally, and some problems you may encounter

1). Use urllib 

urllib2 was changed to urllib.request in Python 3.x.

import urllib.request  

# i is the index, starting from 1; v is the element of the list (the picture address)
for i, v in enumerate(li, start=1):   
    urllib.request.urlretrieve(v, path + str(i) + '.jpg')   

2). Timeout processing

Some picture URLs may fail to open, so timeout handling is needed; with urllib it is set like this:

import urllib.request  
import socket 

# Set the timeout to 30 s
socket.setdefaulttimeout(30)

# i is the index, starting from 1; v is the element of the list (the picture address)
for i, v in enumerate(li, start=1):   
    urllib.request.urlretrieve(v, path + str(i) + '.jpg')   

3). Download again

You can also use recursion to download the file again after a failure:

Tip: the newly downloaded file will overwrite the incompletely downloaded one.

import urllib.request  
import urllib.error
import socket 

# Set the timeout to 30 s
socket.setdefaulttimeout(30)

def auto_down(url, filename):
    try:
        urllib.request.urlretrieve(url, filename)
    except urllib.error.ContentTooShortError:
        print('Network conditions are not good. Reloading.')
        auto_down(url, filename)

# i is the index, starting from 1; v is the element of the list (the picture address)
for i, v in enumerate(li, start=1):   
    auto_down(v, path + str(i) + '.jpg') 


However, the download may be retried several times, or even a dozen times, and the recursion can occasionally fall into an endless loop. We need to avoid that and bound the number of retries:

import urllib.request  
import socket 

# Set the timeout to 30 s
socket.setdefaulttimeout(30)

# i is the index, starting from 1; url is the element of the list (the picture address)
for i, url in enumerate(li, start=1):   
    try:
        urllib.request.urlretrieve(url, path + str(i) + '.jpg')
    except socket.timeout:
        count = 1
        while count <= 5:
            try:
                urllib.request.urlretrieve(url, path + str(i) + '.jpg')                                                
                break
            except socket.timeout:
                err_info = 'Reloading for %d time' % count if count == 1 else 'Reloading for %d times' % count
                print(err_info)
                count += 1
        if count > 5:
            print("downloading picture failed!")

Note: the content from the recursive re-download up to here refers to https://www.jianshu.com/p/a31745fef1d8 ; thanks to the author.

4). Show download progress

The reporthook parameter of urllib.request.urlretrieve() can also be used to show the download progress:

import socket
from urllib.request import urlretrieve

# Work around incomplete urlretrieve downloads while avoiding an endless loop when the download takes too long
def auto_down(url, filename):
    try:
        urlretrieve(url, filename, jindu)
    except socket.timeout:
        count = 1
        while count <= 15:
            try:
                urlretrieve(url, filename, jindu)
                break
            except socket.timeout:
                err_info = 'Reloading for %d time' % count if count == 1 else 'Reloading for %d times' % count
                print(err_info)
                count += 1
        if count > 15:
            print("Download failed")

""" 
Callback function (reporthook) for urlretrieve(); it shows the current download progress.
    a is the number of data blocks downloaded so far
    b is the size of a data block
    c is the size of the remote file
"""

myper = 0
def jindu(a, b, c):
    if not a:
        print("Connection opened")
    if c < 0:
        print("The size of the file to download is 0")
    else:
        global myper
        per = 100 * a * b / c
 
        if per > 100:
            per = 100
        myper = per
        print("Current download progress: " + '%.2f%%' % per)
        if per == 100:
            return True 
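
A usage sketch, reusing the li list, the path and the 30-second default socket timeout from the earlier snippets:

import socket

socket.setdefaulttimeout(30)

# i is the index, starting from 1; v is the picture address
for i, v in enumerate(li, start=1):
    auto_down(v, path + str(i) + '.jpg')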

Note: the above refers to https://blog.csdn.net/HW140701/article/details/78254826 ; thanks to the author.

5. Adding headers after urllib and requests set a timeout

1). Requests settings

With requests, headers can be added to the requests made through the object returned by requests.session(), as follows.

I did not verify this myself, so I am not sure whether it is right.

import requests


headers = { 
    "User-Agent" : "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.6) ", 
}

conn = requests.session()  # create a session
resp = conn.post('https://www.baidu.com/s?wd=findspace', headers=headers)

# Print the headers of the request
print(resp.request.headers)
print(resp.cookies)

# Visit once more:
resp = conn.get('https://www.baidu.com/s?wd=findspace', headers=headers)
print(resp.request.headers)
print(resp.cookies)
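
To connect this back to the timeout question in the section title, here is a sketch (my own, also unverified against the original article) that passes both headers and a timeout on the same session request:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64)",
}

conn = requests.session()
# headers and timeout can simply be passed together on each call made through the session
resp = conn.get('https://www.baidu.com/s?wd=findspace', headers=headers, timeout=(3.05, 27))
print(resp.request.headers)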

2). Urllib settings

urllib is set up as follows.

I did not verify this myself, so I am not sure whether it is right.

Note: here the headers are not a dictionary but a list of tuples.

import urllib.request

opener = urllib.request.build_opener()

opener.addheaders = [
    ('User-agent', 
     'Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10')
]

urllib.request.install_opener(opener)
urllib.request.urlretrieve(URL, path)   # path is the local save path
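
As an alternative (my own sketch, not from the article): since urlretrieve() itself does not accept headers, you can build a urllib.request.Request with a headers dictionary and write the bytes out manually. img_url and save_path are placeholders.

import urllib.request

# img_url and save_path are placeholders for a picture address and a local file path
req = urllib.request.Request(img_url, headers={'User-Agent': 'Mozilla/5.0'})
with urllib.request.urlopen(req, timeout=30) as resp:
    with open(save_path, 'wb') as f:
        f.write(resp.read())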
 

6. Summary

In fact it is all quite simple (perhaps because I did not run into real difficulties...). The hardest part is analyzing the structure of the web page; the rest is routine.

Origin: blog.csdn.net/weixin_42845682/article/details/102756027