Quickly importing large amounts of data into Redis

1. Problem description

There is a large amount of data stored as lines of the form key\tvalue value value … (a key, a tab character, then space-separated values). Two example records:

0000    0475_48070 0477_7204 0477_7556 0480_33825 0481_206660 0482_76734 0436_33682 0484_13757 0538_217492 0727_83721 0910_39874 0436_82813 0421_24138 0433_113233 0425_67342 0475_56710 0438_83702 0421_14436 0451_15490 0456_51031 0475_126541 0459_64108 0475_28238 0475_73706 0425_67481 0481_70285 0482_40188 0482_95188 0484_13346 0484_13940 0538_164341 0538_183629 0545_163319 0545_165272 0546_30724 0730_32196 0910_96866 0427_12847 0425_23173 0424_25451 0475_114926 0428_44669 0421_14377 0422_27895 0428_79517 0454_26686 0477_76526 0481_51805 0539_22388 0545_86479 0546_23459 0450_30062 0546_31676 0437_820 0740_6902 0740_9053 0436_75434 0427_5396 0425_65534 0433_113207 0479_42501 0450_41529 0456_63985 0457_503 0458_20159 0470_30141
0001    0481_206403 0732_17231 0730_5902 0425_21756 0437_32744 0450_30493 0457_1339 0475_21591 0475_43849 0475_48123 0481_129436

About ten gigabytes of such data need to be imported into Redis. With only a simple loop, the code looks like this:

import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True, db=1)

with open('data/news.txt', 'r', encoding='utf-8') as f:
    data = f.read().strip().split('\n')

for line in data:
    fields = line.split('\t')         # fields[0] = key, fields[1] = space-separated values
    for index in fields[1].split(' '):
        r.sadd(fields[0], index)      # one SADD command (and round trip) per value

This takes a long time (several hours; the job ran in the background, so the exact duration was not recorded).

We need a faster way to import the data into Redis.

2. Import method tests (three methods)

import redis
from timeUtil import TimeDuration

'''
127.0.0.1:6379> flushall
OK
127.0.0.1:6379> info keyspace
# Keyspace
127.0.0.1:6379>
'''

t = TimeDuration()

r = redis.Redis(host='localhost', port=6379, decode_responses=True, db=1)


with open('data/news.txt', 'r', encoding='utf-8') as f:
    data = f.read().strip().split('\n')

t.start()
# Method 1: plain loop, one SADD command (and round trip) per value
for line in data[:1000]:
    fields = line.split('\t')
    for index in fields[1].split(' '):
        r.sadd(fields[0], index)

t.stop()
# 0:00:01.974389


r = redis.Redis(host='localhost', port=6379, decode_responses=True, db=2)

t.start()
# Method 2: pipeline, queue all commands and send them to the server in one batch
with r.pipeline(transaction=False) as p:
    for line in data[:1000]:
        fields = line.split('\t')
        for index in fields[1].split(' '):
            p.sadd(fields[0], index)
    p.execute()
t.stop()
# 0:00:00.285798


r = redis.Redis(host='localhost', port=6379, decode_responses=True, db=3)

t.start()
# Method 3: one variadic SADD per key, carrying all of its values in a single command
for line in data[:1000]:
    fields = line.split('\t')
    r.sadd(fields[0], *fields[1].split(' '))

t.stop()
# 0:00:00.166977

# 127.0.0.1:6379> info keyspace
# # Keyspace
# db1:keys=1000,expires=0,avg_ttl=0
# db2:keys=1000,expires=0,avg_ttl=0
# db3:keys=1000,expires=0,avg_ttl=0
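
The same keyspace check can also be done from redis-py instead of redis-cli; a small sketch (info('keyspace') returns the per-database statistics as a dict):

import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

# info('keyspace') returns e.g. {'db1': {'keys': 1000, 'expires': 0, 'avg_ttl': 0}, ...}
for db, stats in r.info('keyspace').items():
    print(db, stats['keys'])
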
# timeUtil.py
# Timing helper class
# Reference: https://blog.csdn.net/a_flying_bird/article/details/46431061

import datetime
import time


class TimeDuration(object):
    
    def __init__(self):
        self.sTime = None  # set by start(); stop() uses this to detect misuse
        self.eTime = None

    def start(self):
        self.sTime = datetime.datetime.now()
        self.eTime = None

    def stop(self):
        if self.sTime is None:
            print("ERROR: start() must be called before stop().")
            return
        self.eTime = datetime.datetime.now()
        delta = self.eTime - self.sTime
        print(delta)

Output of the three timed runs:

0:00:01.974389
0:00:00.285798
0:00:00.166977

The third method is clearly the fastest. In actual use, however, it raised the following error:

redis.exceptions.ResponseError: MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-writes-on-bgsave-error option). Please check the Redis logs for details about the RDB error.

One analysis: to save memory, Redis first stores small collections in a compressed encoding (ziplist), and as sadd keeps adding values the encoding converts to a hashtable, which supposedly triggers the error. This explanation is doubtful, though: the error message itself says that RDB snapshotting (BGSAVE) is failing and that the stop-writes-on-bgsave-error option is rejecting further writes as a result.
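
If the underlying snapshot problem cannot be fixed right away (often a fork or disk issue during BGSAVE), one workaround is to relax stop-writes-on-bgsave-error so the import can continue. A sketch using redis-py's config_set; note that this silences a safety check, so snapshot failures will no longer block writes:

import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

# Equivalent to running "CONFIG SET stop-writes-on-bgsave-error no" in redis-cli.
# Writes then succeed even while RDB snapshots are failing, at the risk of
# losing data written since the last successful snapshot.
r.config_set('stop-writes-on-bgsave-error', 'no')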

3. Principle analysis

If a client wants to get or set multiple keys, it can send multiple GET and SET commands. But each command transmitted from the client to the Redis server costs one network round trip, so sending n commands costs n round trips (the first import method).

Client waiting time = n × network round-trip time + n × command time (queuing plus execution)

With mget or mset (or, in this case, a variadic sadd: the third import method), all of a key's values are sent in a single command.

Client waiting time = 1 × network round-trip time + n × command time (queuing plus execution)

With a pipeline, multiple commands can be packed and sent to the server in a single batch (the second import method).

Client waiting time = 1 × network round-trip time + n × command time (queuing plus execution)
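
One caveat with the second method: a pipeline buffers every queued command on the client until execute() is called, which is risky for ten gigabytes of input. A common pattern is to stream the file and execute the pipeline in chunks; a sketch under the same data format (the chunk size of 10000 commands is an arbitrary choice):

import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True, db=2)

CHUNK = 10000  # arbitrary; tune to available client memory

with open('data/news.txt', 'r', encoding='utf-8') as f, \
        r.pipeline(transaction=False) as p:
    pending = 0
    for line in f:  # stream the file instead of reading 10 GB at once
        key, _, values = line.rstrip('\n').partition('\t')
        for index in values.split(' '):
            p.sadd(key, index)
            pending += 1
        if pending >= CHUNK:
            p.execute()  # flush this batch to the server
            pending = 0
    if pending:
        p.execute()      # flush the final partial batch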

Reference link: https://blog.csdn.net/weixin_39975366/article/details/113963503

Question: why do the second and third methods take different amounts of time? A plausible reason: the pipeline still sends one SADD per value, so the server must parse and execute n separate commands, while the third method sends a single SADD per key, leaving far fewer commands to process.
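
A natural follow-up is to combine the second and third methods: one variadic SADD per key, batched through a pipeline. A minimal sketch (db=4 and the batch size of 1000 keys are arbitrary choices):

import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True, db=4)

with open('data/news.txt', 'r', encoding='utf-8') as f, \
        r.pipeline(transaction=False) as p:
    for i, line in enumerate(f, 1):
        key, _, values = line.rstrip('\n').partition('\t')
        p.sadd(key, *values.split(' '))  # one command per key
        if i % 1000 == 0:
            p.execute()                  # flush every 1000 keys
    p.execute()                          # flush the remainder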

Original article: https://blog.csdn.net/MaoziYa/article/details/114269129