Python 3: this little gem can increase your crawling speed by 10 times.

1. Introduction

Xiao Diaosi : Is there any way to increase the crawling rate?
Xiaoyu : Well, there are many ways to improve it, such as multi-threading and decorators.
Xiao Diaosi : Well, is there any other method?
Xiaoyu : Uh, let me think about it...
Xiaoyu : Hey~ I've thought of one!
Xiao Diaosi : What is that?
Xiaoyu : requests_cache

You can imagine Xiao Diaosi's expression at this point.

2. requests_cache

2.1 Introduction

requests_cache is an extension package for the requests library. With it, we can easily cache requests and get previously fetched results back directly from the cache.

2.2 Installation

As usual, install it with pip:

pip install requests-cache

Other ways to install:

" Python3, choose Python to automatically install third-party libraries, and say goodbye to pip! ! "
" Python3: I only use one line of code to import all Python libraries! !

After installation, let's take a look at its usage.

2.3 Code Examples

2.3.1 CachedSession method

1. A plain requests request
To show the speed difference, we first write a plain requests version as a baseline.

Code:

# -*- coding:utf-8 -*-
# @Time   : 2022-03-18
# @Author : carl_DJ

'''
Request with plain requests
'''
import requests
import time

# start time
start = time.time()
# session
session = requests.session()

# crawl in a loop, 10 times
for i in range(10):
    session.get('http://httpbin.org/delay/2')
    print(f'Finish{i + 1} requests')

end = time.time()
print('Cost time', end - start)

Running result:

Finish1 requests
Finish2 requests
Finish3 requests
Finish4 requests
Finish5 requests
Finish6 requests
Finish7 requests
Finish8 requests
Finish9 requests
Finish10 requests
Cost time 24.35784935951233

Process finished with exit code 0

We can see that it took over 24 seconds in total.

So, let's use the CachedSession method and see if we can speed things up.

2. CachedSession method

Code:

# -*- coding:utf-8 -*-
# @Time   : 2022-03-18
# @Author : carl_DJ

import requests_cache
import time

start = time.time()
# The CachedSession method generates demo_cache.sqlite locally
session = requests_cache.CachedSession('demo_cache')

for i in range(10):
    session.get('http://httpbin.org/delay/2')
    print(f'Finish{i + 1} requests')

end = time.time()
print('Cost time', end - start)

Running result:

Finish1 requests
Finish2 requests
Finish3 requests
Finish4 requests
Finish5 requests
Finish6 requests
Finish7 requests
Finish8 requests
Finish9 requests
Finish10 requests
Cost time 8.624990701675415

Process finished with exit code 0

A demo_cache.sqlite database file is generated locally. Looking at its contents, we can see that the key in each key-value record is a hash value, the value is a Blob object, and its content is the result of the Response.

As you can guess, a key is generated for each request, and requests-cache stores the corresponding result in the SQLite database. Subsequent requests to the same URL produce the same key, so requests 2 through 10 are answered immediately from the cache.

Yes, using this mechanism, we can skip a lot of repeated requests, which greatly saves crawling time.
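
By the way, if you want to verify this mechanism yourself, every response returned by requests-cache carries a from_cache attribute. A minimal sketch, reusing the demo_cache name from above (whether the first call is a cache hit depends on whether demo_cache.sqlite already exists):

import requests_cache

session = requests_cache.CachedSession('demo_cache')

# The first call hits the network (unless an earlier run already cached this URL),
# the second call is answered straight from demo_cache.sqlite
r1 = session.get('http://httpbin.org/delay/2')
r2 = session.get('http://httpbin.org/delay/2')
print(r1.from_cache, r2.from_cache)  # typically: False True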

2.3.2 install_cache method

1. Patch-style usage
Of course, there is another approach: without modifying the original request code at all, we add one extra call to improve the crawling rate.

Code:

# -*- coding:utf-8 -*-
# @Time   : 2022-03-18
# @Author : carl_DJ

import requests
import requests_cache
import time

# Call the requests_cache.install_cache method
requests_cache.install_cache('demo_path_cache')

start = time.time()
session = requests.session()
for i in range(10):
    session.get('http://httpbin.org/delay/2')
    print(f'Finish{i + 1} requests')

end = time.time()
print('Cost time', end - start)

Running result:

Finish1 requests
Finish2 requests
Finish3 requests
Finish4 requests
Finish5 requests
Finish6 requests
Finish7 requests
Finish8 requests
Finish9 requests
Finish10 requests
Cost time 7.516860723495483

Process finished with exit code 0

A demo_path_cache.sqlite cache file is generated locally.

2. Modifying the configuration
In the first two demos, requests-cache used SQLite as the cache backend by default. This time, we use the filesystem as the cache backend.

Code:

# -*- coding:utf-8 -*-
# @Time   : 2022-03-18
# @Author : carl_DJ

import requests
import requests_cache
import time

# Use the filesystem as the cache backend
requests_cache.install_cache('demo_file_cache', backend='filesystem')

start = time.time()
session = requests.session()
for i in range(10):
    session.get('http://httpbin.org/delay/2')
    print(f'Finish{i + 1} requests')

end = time.time()
print('Cost time', end - start)

Running result:

(The timing is similar to the previous examples; this time a demo_file_cache directory of cached files is generated locally instead of a SQLite file.)
The supported backends are:
['dynamodb', 'filesystem', 'gridfs', 'memory', 'mongodb', 'redis', 'sqlite']

Let's take a look at the differences between them:

Backend      Class           Alias        Dependencies
SQLite       SQLiteCache     sqlite
Redis        RedisCache      redis        redis-py
MongoDB      MongoCache      mongodb      pymongo
GridFS       GridFSCache     gridfs       pymongo
DynamoDB     DynamoDbCache   dynamodb     boto3
Filesystem   FileCache       filesystem
Memory       BaseCache       memory
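
The alias strings in the table can be passed directly as the backend argument, just like 'filesystem' above. A minimal sketch using the in-memory backend, which needs no extra dependencies (the cache then only lives for the lifetime of the process):

import requests_cache

# Cache responses in memory only; nothing is written to disk
requests_cache.install_cache('demo_memory_cache', backend='memory')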

If you want to use Redis:

backend = requests_cache.RedisCache(host='localhost', port=6379)
requests_cache.install_cache('demo_redis_cache', backend=backend)

3. Caching only one type of request

# -*- coding:utf-8 -*-
# @Time   : 2022-03-18
# @Author : carl_DJ

import time
import requests
import requests_cache

# With allowable_methods, only POST requests are cached
requests_cache.install_cache('demo_post_cache', allowable_methods=['POST'])

start = time.time()
session = requests.Session()
for i in range(10):
    session.get('http://httpbin.org/delay/2')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time for get', end - start)
start = time.time()

for i in range(10):
    session.post('http://httpbin.org/delay/2')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time for post', end - start)

Running result:

Finished 1 requests
Finished 2 requests
Finished 3 requests
Finished 4 requests
Finished 5 requests
Finished 6 requests
Finished 7 requests
Finished 8 requests
Finished 9 requests
Finished 10 requests
Cost time for get 29.42441463470459
Finished 1 requests
Finished 2 requests
Finished 3 requests
Finished 4 requests
Finished 5 requests
Finished 6 requests
Finished 7 requests
Finished 8 requests
Finished 9 requests
Finished 10 requests
Cost time for post 2.611323595046997

Process finished with exit code 0

This time, the GET requests took over 29 seconds because they were not cached, while the POST requests finished in a little over 2 seconds thanks to the cache.
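
If you want to confirm that only the POST requests were cached, you can check from_cache again. A small sketch, assuming the demo_post_cache setup from above:

import requests
import requests_cache

requests_cache.install_cache('demo_post_cache', allowable_methods=['POST'])

session = requests.Session()
get_resp = session.get('http://httpbin.org/delay/2')
post_resp = session.post('http://httpbin.org/delay/2')
# GET is never cached with this configuration; POST reads True once it has been cached
print(get_resp.from_cache, post_resp.from_cache)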

2.3.3 Cache Headers method

In addition to the custom cache settings above, requests-cache also supports parsing HTTP Request/Response Headers and deciding whether to cache based on their content.

Code:

# -*- coding:utf-8 -*-
# @Time   : 2022-03-18
# @Author : carl_DJ

import time
import requests
import requests_cache

requests_cache.install_cache('demo_headers_cache')

start = time.time()
session = requests.Session()
for i in range(10):
    # Add Cache-Control: no-store to the Request Headers
    session.get('http://httpbin.org',
                headers={
                    'Cache-Control': 'no-store'
                })
    print(f'Finished {i + 1} requests')

end = time.time()
print('Cost time for get', end - start)

Because Cache-Control is set to no-store in the Request Headers, the responses are not cached even though we installed a cache, so every request still goes out to the server.
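
For contrast, sending the same request without that header lets the cache kick in as usual. A rough sketch:

import time
import requests
import requests_cache

requests_cache.install_cache('demo_headers_cache')
session = requests.Session()

start = time.time()
for i in range(10):
    # No Cache-Control header, so requests after the first are served from the cache
    session.get('http://httpbin.org')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time for get', end - start)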

3. Summary

That's about it for today's sharing.
In practical applications, requests_cache really is a good tool when you crawl the same URLs in a loop:
it saves time and improves efficiency.
The key is that you can use the time saved to go take a bath.


Origin blog.csdn.net/wuyoudeyuer/article/details/123571369