Redis redo from the library Times Connection with master lost error

Background: When my colleague initialized the redis instance, I gave the monitor command to rename. Today, there is a problem with this group of instances. I need to use the monitor command to help analyze the problem. It turns out that the monitor command cannot be used, because the redis command renaming does not support dynamic modification, so I just want to restart the redis slave library after deleting the monitor command. After the restart, it was found that the master-slave replication state was always down, so I began to troubleshoot the problem.

First look at the log of the redis slave library:

43037:S 20 Apr 06:12:38.047 MASTER <-> SLAVE sync started
43037:S 20 Apr 06:12:38.047
Non blocking connect for SYNC fired the event.
43037:S 20 Apr 06:12:38.048 Master replied to PING, replication can continue...
43037:S 20 Apr 06:12:38.048
Partial resynchronization not possible (no cached master)
43037:S 20 Apr 06:12:39.112 Full resync from master: 96f2ae75d50e1f8b69737509d5b32b2da660e7c0:885061114038
43037:S 20 Apr 06:19:16.258
MASTER <-> SLAVE sync: receiving 20819568576 bytes from master
43037:S 20 Apr 06:20:08.203 MASTER <-> SLAVE sync: Flushing old data
43037:S 20 Apr 06:24:36.036
MASTER <-> SLAVE sync: Loading DB in memory
43037:S 20 Apr 06:28:18.538 MASTER <-> SLAVE sync: Finished with success
43037:S 20 Apr 06:28:19.782
Background append only file rewriting started by pid 173002
43037:S 20 Apr 06:28:19.982 # Connection with master lost.
43037:S 20 Apr 06:28:19.982 Caching the disconnected master state.
43037:S 20 Apr 06:28:20.984
Connecting to MASTER 10.93.157.52:6385
43037:S 20 Apr 06:28:20.985 MASTER <-> SLAVE sync started
43037:S 20 Apr 06:28:20.985
Non blocking connect for SYNC fired the event.
43037:S 20 Apr 06:28:20.985 Master replied to PING, replication can continue...
43037:S 20 Apr 06:28:20.986
Trying a partial resynchronization (request 96f2ae75d50e1f8b69737509d5b32b2da660e7c0:885062375607).
43037:S 20 Apr 06:28:22.073 Full resync from master: 96f2ae75d50e1f8b69737509d5b32b2da660e7c0:885240485270
43037:S 20 Apr 06:28:22.073
Discarding previously cached master state.
43037:S 20 Apr 06:33:30.800 # Timeout receiving bulk data from MASTER... If the problem persists try to set the 'repl-timeout' parameter in redis.conf to a larger value.
43037:S 20 Apr 06:33:30.801 Connecting to MASTER 10.93.157.52:6385
43037:S 20 Apr 06:33:30.802
MASTER <-> SLAVE sync started
43037:S 20 Apr 06:33:30.802 Non blocking connect for SYNC fired the event.
43037:S 20 Apr 06:33:30.802
Master replied to PING, replication can continue...
43037:S 20 Apr 06:33:30.803 Partial resynchronization not possible (no cached master)
43037:S 20 Apr 06:34:27.458
AOF rewrite child asks to stop sending diffs.
173002:C 20 Apr 06:34:27.470 Parent agreed to stop sending diffs. Finalizing AOF...
173002:C 20 Apr 06:34:27.470
Concatenating 1.19 MB of AOF diff received from parent.
173002:C 20 Apr 06:34:27.477 SYNC append only file rewrite performed
173002:C 20 Apr 06:34:28.757
AOF rewrite: 119 MB of memory used by copy-on-write
43037:S 20 Apr 06:34:29.961 Background AOF rewrite terminated with success
43037:S 20 Apr 06:34:29.961
Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
43037:S 20 Apr 06:34:29.961 Background AOF rewrite finished successfully
43037:S 20 Apr 06:35:08.471
Full resync from master: 96f2ae75d50e1f8b69737509d5b32b2da660e7c0:885393456775
43037:S 20 Apr 06:41:49.489 MASTER <-> SLAVE sync: receiving 20821399746 bytes from master
43037:S 20 Apr 06:42:28.911 # I/O error trying to sync with MASTER: connection lost
43037:S 20 Apr 06:42:32.642
Connecting to MASTER 10.93.157.52:6385
43037:S 20 Apr 06:42:32.646 MASTER <-> SLAVE sync started
43037:S 20 Apr 06:42:32.646
Non blocking connect for SYNC fired the event.
43037:S 20 Apr 06:42:32.647 Master replied to PING, replication can continue...
43037:S 20 Apr 06:42:32.647
Partial resynchronization not possible (no cached master)
43037:S 20 Apr 06:42:33.755 Full resync from master: 96f2ae75d50e1f8b69737509d5b32b2da660e7c0:885541422071
43037:S 20 Apr 06:49:15.956
MASTER <-> SLAVE sync: receiving 20821403571 bytes from master
43037:S 20 Apr 06:50:16.781 MASTER <-> SLAVE sync: Flushing old data
43037:S 20 Apr 06:54:23.078
MASTER <-> SLAVE sync: Loading DB in memory
43037:S 20 Apr 06:58:10.123 MASTER <-> SLAVE sync: Finished with success
43037:S 20 Apr 06:58:11.317
Background append only file rewriting started by pid 223387
43037:S 20 Apr 06:58:11.536 # Connection with master lost.
43037:S 20 Apr 06:58:11.536 * Caching the disconnected master state.

Let's take a look at the log of the redis main library:

304369:M 20 Apr 05:13:00.197 Slave 10.93.157.16:6383 asks for synchronization
304369:M 20 Apr 05:13:00.197
Full resync requested by slave 10.93.157.16:6383
304369:M 20 Apr 05:13:00.197 Starting BGSAVE for SYNC with target: disk
304369:M 20 Apr 05:13:00.902
Background saving started by pid 366254
366254:C 20 Apr 05:19:14.460 DB saved on disk
366254:C 20 Apr 05:19:14.961
RDB: 4613 MB of memory used by copy-on-write
304369:M 20 Apr 05:19:15.579 Background saving terminated with success
304369:M 20 Apr 05:20:18.303
Synchronization with slave 10.93.157.16:6383 succeeded
304369:M 20 Apr 05:22:32.216 # Client id=1768461 addr=10.93.157.16:26912 fd=10 name= age=572 idle=0 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=11450 omem=278945640 events=rw cmd=psync scheduled to
be closed ASAP for overcoming of output buffer limits.
304369:M 20 Apr 05:22:32.216 # Connection with slave 10.93.157.16:6383 lost.
304369:M 20 Apr 05:28:27.651 Slave 10.93.157.16:6383 asks for synchronization
304369:M 20 Apr 05:28:27.651
Unable to partial resync with slave 10.93.157.16:6383 for lack of backlog (Slave request was: 884116305579).
304369:M 20 Apr 05:28:27.651 Starting BGSAVE for SYNC with target: disk
304369:M 20 Apr 05:28:28.356
Background saving started by pid 396084
304369:M 20 Apr 05:32:37.471 # Client id=1768945 addr=10.93.157.16:21854 fd=22 name= age=250 idle=250 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16366 oll=11126 omem=271085384 events=r cmd=psync scheduled to
be closed ASAP for overcoming of output buffer limits.
304369:M 20 Apr 05:32:37.538 # Connection with slave 10.93.157.16:6383 lost.

From the log, we can see that when the redis master library receives a request from the slave library to resynchronize data, it first generates an rdb file, and then performs partial synchronization through psync. We can see that the problem lies in the partial synchronization, because the client The -output-buffer-limit value is set too small, causing the derivative to fail to be sent. After the failure, the slave library continues to initiate data synchronization requests, but every time it fails, the main redis library repeatedly generates rdb files. Although there is a fork sub-process, it has a certain impact on the throughput of the main redis library.

Now let's solve this problem, first look at the current size of client-output-buffer-limit:
127.0.0.1:6385> config get client-output-buffer-limit
1) "client-output-buffer-limit"
2) " normal 0 0 0 slave 268435456 67108864 60 pubsub 33554432 8388608 60"
As you can see, the current limit is 256M and no more than 64M within 60s. From our log information above, we can see that the psync data is obviously larger than 256M. We made the following settings to increase the limit:
config set client-output-buffer-limit 'slave 1073741824 268435456 60'
, and then observed it after adjusting, and found that the replication status of the slave library soon became up.
Our adjustment above is to increase the copy output buffer, and another way is to close the limit of the copy output buffer:
config set client-output-buffer-limit 'slave 0 0 0'

Guess you like

Origin http://43.154.161.224:23101/article/api/json?id=324656855&siteId=291194637