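Redis Replication (replica)

1. What is a replica

Scenario 1: a network fluctuation makes Redis unreachable, so the service is unavailable.

Scenario 2: the Redis process crashes and restarts; the service is unavailable during the restart.

Scenario 3: the machine hosting the Redis instance goes down, so the service is unavailable.

...

In all of these cases Redis cannot serve requests. Replication exists to improve availability in exactly these situations.

"Replica" means a copy. Here it refers to Redis master-replica replication: the primary Redis instance is called the master, and the copy is called the replica. The replica runs on another machine and holds a near-real-time copy of the master's data.

Although Redis data structures are efficient, some operations are still expensive, such as set operations on large collections. Redis executes commands sequentially on a single thread, so a large number of such expensive commands degrades performance. Replicas can therefore also be used for read/write splitting.

Replication thus serves three main purposes:

Data redundancy (hot backup)
Read/write splitting, to improve performance
A foundation for HA (high availability)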

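2. How replication works

Replica synchronization has two phases:

Copy all of the master's existing data
Synchronize the write commands the master processes afterwards

[Figure: full synchronization flow between replica and master]

The overall flow, as shown in the figure above:

First the replica and the master go through a series of handshake exchanges, such as authentication
The replica sends the PSYNC command to the master; on a first connection this results in a full synchronization
On receiving the request, the master performs RDB persistence, writing its in-memory data to an RDB file
Once the RDB file is complete, the master reads it and sends it to the replica
The replica receives the RDB data and writes it to a local RDB file
When the transfer finishes, the replica loads the RDB file into memory; after loading, its data matches the master's snapshot
From then on, the master forwards each new write command it processes to the replica
The replica executes these commands to stay consistent with the master

The overall state machine is as follows:

[Figure: replica replication state machine]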

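Question:

How does an instance become a replica? What is the default?

A Redis instance is a master by default. There are two ways to make it a replica:

Through the configuration file

replicaof <masterip> <masterport>

Through a command (since Redis 5, REPLICAOF is the preferred name and SLAVEOF remains as an alias)

slaveof <masterip> <masterport>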

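3. How to reduce disk I/O

As the figure above shows, the master has to generate an RDB file, and the replica also writes an RDB file first. Both are disk operations. Can they be reduced?

On the master side:

Enable the following option. The master then no longer writes an RDB file; it encodes the in-memory data and sends it straight to the replica.

repl-diskless-sync yes

[Figure: diskless synchronization on the master]

With this option enabled, the child process that generates the RDB no longer writes to disk. It sends the data through a pipe to the master process, which forwards it to the replica without reading an RDB file from disk.

Question:

When several replicas request a full synchronization but their requests do not arrive at the same time, the master has to fork several times to generate the RDB content, because once a transfer has started, later requests can only wait for it to finish and then trigger a new one.

To let more full-sync requests share the same RDB content and reduce the number of forks, the following option makes the master wait 5 seconds before starting the transfer, in the hope that more requests arrive in that window.

repl-diskless-sync-delay 5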
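
The waiting rule can be sketched as a small decision function. This is a simplified illustration modeled on the check in replicationCron(); the name should_start_bgsave and its parameters are hypothetical and are not Redis identifiers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: decide whether the master should fork now.
 * replicas_waiting: replicas currently waiting for a full sync
 * diskless_sync:    value of repl-diskless-sync
 * max_idle:         seconds the longest-waiting replica has waited
 * delay:            value of repl-diskless-sync-delay */
bool should_start_bgsave(int replicas_waiting, bool diskless_sync,
                         long max_idle, long delay) {
    assert(replicas_waiting >= 0);
    if (replicas_waiting == 0) return false;   /* nobody to serve */
    /* Disk-based sync starts right away; diskless sync waits for the delay
     * so that late replicas can share the same transfer. */
    return !diskless_sync || max_idle > delay;
}
```

With a delay of 5, a diskless master that has had one replica waiting for 3 seconds keeps waiting, and starts the transfer once the wait exceeds 5 seconds.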

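Question:

Because of this delay, the master and replica stay inconsistent for longer.

Diskless replication is mainly useful when disk I/O is slow and network bandwidth is good. If strong data consistency matters, take this delay into account.

On the replica side:

The replica side is more involved than the master side, because it has to switch from the old dataset to the new one. The relevant option is shown below. The default is disabled: the replica writes an RDB file, then flushes its memory and loads the file.

With on-empty-db, the replica parses the data directly from the socket only when its in-memory dataset is empty; otherwise it still writes the RDB file to disk.

With swapdb, the replica loads the data directly into memory without writing a file, and keeps the previous dataset while doing so. Two copies of the data are in memory at once, so insufficient memory can lead to an OOM kill.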

# "disabled"    - Don't use diskless load (store the rdb file to the disk first)
# "on-empty-db" - Use diskless load only when it is completely safe.
# "swapdb"      - Keep a copy of the current db contents in RAM while parsing
#                 the data directly from the socket. note that this requires
#                 sufficient memory, if you don't have it, you risk an OOM kill.
repl-diskless-load disabled
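
The three settings boil down to one yes/no decision on the replica. The sketch below is an illustration under stated assumptions: the enum and the function use_diskless_load are hypothetical names, and the real server applies further safety checks that are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for the three repl-diskless-load settings. */
typedef enum {
    LOAD_DISABLED,     /* always store the RDB file on disk first */
    LOAD_ON_EMPTY_DB,  /* parse from the socket only if no keys exist */
    LOAD_SWAPDB        /* parse from the socket, keep the old data in RAM */
} diskless_load_mode;

/* Returns true when the replica may parse the RDB straight from the socket. */
bool use_diskless_load(diskless_load_mode mode, size_t key_count) {
    switch (mode) {
    case LOAD_SWAPDB:      return true;
    case LOAD_ON_EMPTY_DB: return key_count == 0;
    case LOAD_DISABLED:    return false;
    }
    assert(!"unknown repl-diskless-load mode");
    return false;
}
```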

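[Figures: RDB loading on the replica side]

Question:

Can a replica still serve reads while a synchronization is in progress, or after its connection to the master has dropped?

By default (replica-serve-stale-data yes) it can: the replica keeps answering requests, possibly with stale data. If the option is set to no, the replica replies with an error to most commands until it is back in sync with the master.

replica-serve-stale-data yes

Question:

Is a replica read-only, or can it accept writes?

By default a replica is read-only, which protects the data from accidental writes. If it really must be writable, set the following option to no.

replica-read-only no

Question:

What does the master do when the connection to a replica drops? Does it do nothing?

By default the master does nothing when a replica disconnects; a replica is only a copy, and the master does not depend on it. The following options make the master care.

min-replicas-to-write 3
min-replicas-max-lag 10

These settings require at least 3 connected replicas whose lag is no more than 10 seconds. Otherwise the master refuses client writes and is effectively read-only until the condition holds again. This limits how far replicas can fall behind the master, but it also affects the master's normal write traffic.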
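
The rule can be sketched in a few lines. This is a minimal illustration, not the server's code: count_good_replicas, master_accepts_writes, and the last_ack array (the time each replica last acknowledged the master) are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Count replicas whose last acknowledgement is at most max_lag seconds old. */
size_t count_good_replicas(const long *last_ack, size_t n, long now, long max_lag) {
    assert(last_ack != NULL || n == 0);
    size_t good = 0;
    for (size_t i = 0; i < n; i++) {
        if (now - last_ack[i] <= max_lag) good++;
    }
    return good;
}

/* min_to_write models min-replicas-to-write, max_lag models
 * min-replicas-max-lag. Setting either one to 0 disables the check. */
bool master_accepts_writes(const long *last_ack, size_t n, long now,
                           size_t min_to_write, long max_lag) {
    if (min_to_write == 0 || max_lag == 0) return true;
    return count_good_replicas(last_ack, n, now, max_lag) >= min_to_write;
}
```

For example, with three replicas of which one last acknowledged 20 seconds ago, only two count as healthy, so a master configured with min-replicas-to-write 3 refuses writes.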

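Question:

When a network fluctuation drops the connection between master and replica, is another full synchronization, expensive in both time and bandwidth, the only option?

No. There is also partial synchronization.

4. What is partial synchronization

Partial synchronization was introduced to reduce the number of full synchronizations and avoid unnecessary ones. Only the commands generated while the master and replica were disconnected are transferred; once the replica has applied them, it is consistent with the master again.

5. How partial synchronization works

To be able to replay commands, the master keeps a buffer of recent commands, and the replica keeps an offset that marks how far it has synchronized. Synchronization can later resume from that offset.

The master's memory is finite, so it cannot buffer commands indefinitely. The buffer has a fixed size and is written circularly. If the master and replica stay disconnected for too long, the data at the replica's offset has been overwritten, partial synchronization is no longer possible, and a full synchronization is required.

[Figure: partial synchronization flow]

First the replica connects to the master
They go through the handshake and authentication exchanges
The replica sends the sync command; this time the PSYNC arguments are the replication ID and the offset
The master uses the received replication ID and offset to decide whether a partial synchronization is possible
If it is, the master replies with +CONTINUE and then sends the missing command data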
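
The master's decision in the fourth step can be sketched as follows. This is a simplified model under stated assumptions: the master_state struct and can_partial_resync are hypothetical names, and the sketch ignores the secondary replication ID that PSYNC2 also accepts.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified view of the master's replication state. */
typedef struct {
    const char *replid;     /* replication ID of the master's history */
    bool has_backlog;       /* whether a backlog buffer is allocated */
    long long backlog_off;  /* replication offset of the oldest buffered byte */
    long long backlog_len;  /* number of valid bytes in the backlog */
} master_state;

/* A partial resync is possible only if the replica follows the same history
 * and the requested offset is still covered by the backlog. */
bool can_partial_resync(const master_state *m, const char *replid, long long offset) {
    assert(m != NULL && replid != NULL);
    if (strcmp(m->replid, replid) != 0) return false;
    if (!m->has_backlog) return false;
    return offset >= m->backlog_off &&
           offset <= m->backlog_off + m->backlog_len;
}
```

An offset equal to backlog_off + backlog_len is still accepted: the replica is fully caught up and simply has nothing to receive yet.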

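6. How the circular buffer works

The circular buffer is only read during a partial synchronization. In normal operation the master sends the command stream directly to the replica and also writes a copy into the buffer.

master_repl_offset is the total byte length of all commands generated so far, repl_backlog_histlen is the length of the valid data in the buffer, repl_backlog_idx is the current write position in the buffer, repl_backlog_off is the replication offset of the first valid byte, repl_backlog is the buffer itself, and repl_backlog_size is its size.

The figure below shows an empty buffer

[Figure: empty backlog buffer]

After len bytes of commands have been written

[Figure: backlog after writing len bytes]

When more data is written and the buffer fills up, writing wraps around

[Figure: backlog after wrap-around]

The red part is data that has been overwritten. The cyan part is virtual: conceptually the writer keeps moving forward, but in practice, once the buffer is full, writing restarts from the beginning, hence the name circular buffer. Once the buffer has been filled, repl_backlog_histlen stays equal to the buffer size.

When a replica sends an offset, the master checks whether that offset is still in the buffer

[Figure: locating a replica's offset in the backlog]

If the offset falls within the valid data, the master computes skip, the number of bytes to skip, then the index j of the oldest valid byte, advances j by skip, and starts sending from there.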
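
The bookkeeping described in this section can be tried out on a toy buffer. The sketch below is an illustration only: the backlog struct, BACKLOG_SIZE (8 bytes, tiny on purpose), and the three functions are hypothetical names that mirror the roles of the repl_backlog_* fields.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BACKLOG_SIZE 8  /* hypothetical toy size */

typedef struct {
    char buf[BACKLOG_SIZE];
    long long master_offset; /* total bytes ever written (master_repl_offset) */
    long long histlen;       /* valid bytes in buf (repl_backlog_histlen) */
    long long idx;           /* next write position (repl_backlog_idx) */
    long long first_off;     /* offset of the oldest valid byte (repl_backlog_off) */
} backlog;

void backlog_init(backlog *b) {
    assert(b != NULL);
    memset(b, 0, sizeof(*b));
    b->first_off = 1;  /* the first byte ever generated has offset 1 */
}

/* Append len bytes, wrapping around when the end of buf is reached. */
void backlog_feed(backlog *b, const char *p, size_t len) {
    assert(b != NULL && p != NULL);
    b->master_offset += (long long)len;
    while (len > 0) {
        size_t chunk = (size_t)(BACKLOG_SIZE - b->idx);
        if (chunk > len) chunk = len;
        memcpy(b->buf + b->idx, p, chunk);
        b->idx = (b->idx + (long long)chunk) % BACKLOG_SIZE;
        p += chunk;
        len -= chunk;
        b->histlen += (long long)chunk;
    }
    if (b->histlen > BACKLOG_SIZE) b->histlen = BACKLOG_SIZE;
    b->first_off = b->master_offset - b->histlen + 1;
}

/* Copy everything from replication offset `offset` onward into out, which
 * must hold BACKLOG_SIZE bytes. Returns the byte count, or -1 when the
 * offset is not covered by the backlog (a full sync would be needed). */
long long backlog_read_from(const backlog *b, long long offset, char *out) {
    assert(b != NULL && out != NULL);
    if (offset < b->first_off || offset > b->first_off + b->histlen) return -1;
    long long skip = offset - b->first_off;
    long long j = (b->idx + (BACKLOG_SIZE - b->histlen)) % BACKLOG_SIZE;
    j = (j + skip) % BACKLOG_SIZE;           /* index of the requested byte */
    long long len = b->histlen - skip;
    long long copied = 0;
    while (copied < len) {
        long long chunk = BACKLOG_SIZE - j;  /* bytes until the end of buf */
        if (chunk > len - copied) chunk = len - copied;
        memcpy(out + copied, b->buf + j, (size_t)chunk);
        copied += chunk;
        j = 0;                               /* continue from the start */
    }
    return len;
}
```

Feeding "abcdef" and then "ghij" into the 8-byte buffer overwrites "ab", so the oldest available offset becomes 3. A replica asking for offset 7 receives "ghij", while one asking for offset 2 gets -1.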

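Question:

How large is the circular buffer?

The default is 1 MB. It can be changed with the following option.

repl-backlog-size 1mb

Question:

Is a bigger buffer always better?

Not necessarily. Suppose the buffer is 10 GB while the actual dataset is only 100 MB. If the replica has been disconnected long enough and then reconnects with a valid offset, the master may end up transferring up to 10 GB of commands, whereas a full synchronization would only need to transfer about 100 MB. So the buffer should not be oversized.

Question:

If no replica is connected to the master, does the master still need to allocate a buffer? Writing to a buffer nobody uses is wasteful.

The buffer has a lifetime, set by the following option: once no replica has been connected for this many seconds, the buffer is released.

repl-backlog-ttl 3600

Question:

How do the replica and the master detect that the connection has dropped?

They use a heartbeat: the master periodically sends PING to its replicas.

Question:

How often is the heartbeat sent?

This is set by the following option; here PING is sent every 10 seconds.

repl-ping-replica-period 10

Question:

Is there any timeout detection besides the heartbeat?

There is also an overall replication timeout:

repl-timeout 60

Question:

Can repl-timeout be smaller than or equal to repl-ping-replica-period?

No. If it were, a brief network fluctuation could trigger a timeout, followed by a cycle of reconnects and further timeouts.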
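
A small calculation shows why. The sketch below assumes an otherwise idle link on which PINGs are the only traffic; link_timed_out mirrors the shape of the comparison used in replicationCron(), while idle_link_times_out and its delay parameter are hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* The link counts as dead when nothing has been received for more than
 * repl_timeout seconds. */
bool link_timed_out(long now, long last_interaction, long repl_timeout) {
    assert(now >= last_interaction);
    return (now - last_interaction) > repl_timeout;
}

/* Hypothetical worst case on an idle link: the previous PING arrived at
 * t = 0 and the next one arrives `delay` seconds late. Does the replica
 * declare a timeout just before that PING is processed? */
bool idle_link_times_out(long ping_period, long delay, long repl_timeout) {
    return link_timed_out(ping_period + delay, 0, repl_timeout);
}
```

With a period of 10 and a timeout of 60, a PING that is 2 seconds late is harmless. With a timeout of 10, a delay of a single second already triggers a timeout, and with a timeout of 5 the link times out even when every PING is on time.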

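Question:

Which other settings help reduce inconsistency between master and replica?

The repl-disable-tcp-nodelay option. When it is set to yes, Redis disables TCP_NODELAY on the replica socket, so the kernel batches small writes into fewer packets; with a default Linux configuration this can add a delay of up to about 40 milliseconds, but it uses less bandwidth.

When it is set to no, which is the default, data is sent as soon as it is available, even in small packets. This reduces replication delay at the cost of more bandwidth.

repl-disable-tcp-nodelay no

Question:

When keys expire, the master deletes them automatically. What does a replica do with expired keys?

A replica does not delete expired keys on its own. It waits for the master to delete the key and propagate the resulting DEL command. If a client reads a key on the replica that has already expired, the replica returns nil.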
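
The difference between the two roles on the read path can be sketched like this. It is a minimal model, not the server's code: the entry struct and lookup_for_read are hypothetical, and the deleted flag stands in for actually removing the key.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, minimal key entry; expire_at == 0 means "no TTL". */
typedef struct {
    const char *value;
    long long expire_at;  /* absolute expiry time in milliseconds */
    bool deleted;
} entry;

/* Read a key. The master deletes an expired key on access (and would
 * propagate a DEL); a replica only hides it and waits for that DEL. */
const char *lookup_for_read(entry *e, long long now_ms, bool is_replica) {
    assert(e != NULL);
    if (e->deleted) return NULL;
    bool expired = e->expire_at != 0 && now_ms > e->expire_at;
    if (!expired) return e->value;
    if (!is_replica) e->deleted = true;
    return NULL;  /* the client sees nil in both roles */
}
```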

7. Code implementation

Configuration

# replicaof <masterip> <masterport>

# masterauth <master-password>
#
# masteruser <username>
#replica-serve-stale-data yes
replica-read-only yes

repl-diskless-sync no
repl-diskless-sync-delay 5
repl-diskless-load disabled

# repl-ping-replica-period 10
# repl-timeout 60
repl-disable-tcp-nodelay no
# repl-backlog-size 1mb

# repl-backlog-ttl 3600
replica-priority 100
# min-replicas-to-write 3
# min-replicas-max-lag 10
# replica-announce-ip 5.5.5.5
# replica-announce-port 1234

Initialization

void initServerConfig(void) {
    ...
    // Generate the replication ID
    changeReplicationId();
    ...
    // Initialize the replication-related fields with their default values
    server.masterauth = NULL;
    server.masterhost = NULL;
    server.masterport = 6379;
    server.master = NULL;
    server.cached_master = NULL;
    server.master_initial_offset = -1;
    server.repl_state = REPL_STATE_NONE;   // the initial state is NONE
    server.repl_transfer_tmpfile = NULL;
    server.repl_transfer_fd = -1;
    server.repl_transfer_s = NULL;
    server.repl_syncio_timeout = CONFIG_REPL_SYNCIO_TIMEOUT;
    server.repl_down_since = 0; /* Never connected, repl is down since EVER. */
    server.master_repl_offset = 0;

    /* Replication partial resync backlog */
    server.repl_backlog = NULL;
    server.repl_backlog_histlen = 0;
    server.repl_backlog_idx = 0;
    server.repl_backlog_off = 0;
    server.repl_no_slaves_since = time(NULL);
    ...
}

The master information is read from the configuration. Once a master is configured, this instance is a replica and its state is set to REPL_STATE_CONNECT.

loadServerConfig()
{
    ...
    loadServerConfigFromString();
    ...
}

void loadServerConfigFromString(char *config) {
    ...
    else if ((!strcasecmp(argv[0],"slaveof") ||
              !strcasecmp(argv[0],"replicaof")) && argc == 3) {
        slaveof_linenum = linenum;
        server.masterhost = sdsnew(argv[1]);
        server.masterport = atoi(argv[2]);
        // Replication is configured, so the state is set to CONNECT
        server.repl_state = REPL_STATE_CONNECT;
    }
    ...
}

Periodic task that checks the state

serverCron() {
    ...
    // Called once per second
    run_with_period(1000) replicationCron();
    ...
}

void replicationCron(void) {
    ...
    if (server.repl_state == REPL_STATE_CONNECT) {
        if (connectWithMaster() == C_OK) {
            serverLog(LL_NOTICE,"MASTER <-> REPLICA sync started");
        }
    }
    ...
}

The replica actively connects to the master

// Open a socket connection to the master; on success the state becomes
// REPL_STATE_CONNECTING
int connectWithMaster(void) {
    server.repl_transfer_s = server.tls_replication ? connCreateTLS() : connCreateSocket();
    if (connConnect(server.repl_transfer_s, server.masterhost, server.masterport,
                NET_FIRST_BIND_ADDR, syncWithMaster) == C_ERR) {
        serverLog(LL_WARNING,"Unable to connect to MASTER: %s",
                connGetLastError(server.repl_transfer_s));
        connClose(server.repl_transfer_s);
        server.repl_transfer_s = NULL;
        return C_ERR;
    }
    server.repl_transfer_lastio = server.unixtime;
    server.repl_state = REPL_STATE_CONNECTING;
    return C_OK;
}

Entering the state machine

void syncWithMaster(connection *conn) {
    char tmpfile[256], *err = NULL;
    int dfd = -1, maxtries = 5;
    int psync_result;

    // The instance was switched from replica to master: nothing more to do
    /* If this event fired after the user turned the instance into a master
     * with SLAVEOF NO ONE we must just return ASAP. */
    if (server.repl_state == REPL_STATE_NONE) {
        connClose(conn);
        return;
    }

    // An error occurred while establishing the socket connection
    /* Check for errors in the socket: after a non blocking connect() we
     * may find that the socket is in error state. */
    if (connGetState(conn) != CONN_STATE_CONNECTED) {
        serverLog(LL_WARNING,"Error condition on socket for SYNC: %s",
                connGetLastError(conn));
        goto error;
    }

    // The socket to the master is connected: send a PING
    /* Send a PING to check the master is able to reply without errors. */
    if (server.repl_state == REPL_STATE_CONNECTING) {
        serverLog(LL_NOTICE,"Non blocking connect for SYNC fired the event.");
        /* Delete the writable event so that the readable event remains
         * registered and we can wait for the PONG reply. */
        connSetReadHandler(conn, syncWithMaster);
        connSetWriteHandler(conn, NULL);
        server.repl_state = REPL_STATE_RECEIVE_PONG;
        /* Send the PING, don't check for errors at all, we have the timeout
         * that will take care about this. */
        err = sendSynchronousCommand(SYNC_CMD_WRITE,conn,"PING",NULL);
        if (err) goto write_error;
        return;
    }

    // Read the PING reply to check whether the master is ready
    /* Receive the PONG command. */
    if (server.repl_state == REPL_STATE_RECEIVE_PONG) {
        err = sendSynchronousCommand(SYNC_CMD_READ,conn,NULL);

        /* We accept only two replies as valid, a positive +PONG reply
         * (we just check for "+") or an authentication error.
         * Note that older versions of Redis replied with "operation not
         * permitted" instead of using a proper error code, so we test
         * both. */
        if (err[0] != '+' &&
            strncmp(err,"-NOAUTH",7) != 0 &&
            strncmp(err,"-ERR operation not permitted",28) != 0)
        {
            serverLog(LL_WARNING,"Error reply to PING from master: '%s'",err);
            sdsfree(err);
            goto error;
        } else {
            serverLog(LL_NOTICE,
                "Master replied to PING, replication can continue...");
        }
        sdsfree(err);
        server.repl_state = REPL_STATE_SEND_AUTH;
    }

    // If the master requires a password, authenticate
    /* AUTH with the master if required. */
    if (server.repl_state == REPL_STATE_SEND_AUTH) {
        if (server.masteruser && server.masterauth) {
            err = sendSynchronousCommand(SYNC_CMD_WRITE,conn,"AUTH",
                                         server.masteruser,server.masterauth,NULL);
            if (err) goto write_error;
            server.repl_state = REPL_STATE_RECEIVE_AUTH;
            return;
        } else if (server.masterauth) {
            err = sendSynchronousCommand(SYNC_CMD_WRITE,conn,"AUTH",server.masterauth,NULL);
            if (err) goto write_error;
            server.repl_state = REPL_STATE_RECEIVE_AUTH;
            return;
        } else {
            server.repl_state = REPL_STATE_SEND_PORT;
        }
    }

    // Read the authentication reply
    /* Receive AUTH reply. */
    if (server.repl_state == REPL_STATE_RECEIVE_AUTH) {
        err = sendSynchronousCommand(SYNC_CMD_READ,conn,NULL);
        if (err[0] == '-') {
            serverLog(LL_WARNING,"Unable to AUTH to MASTER: %s",err);
            sdsfree(err);
            goto error;
        }
        sdsfree(err);
        server.repl_state = REPL_STATE_SEND_PORT;
    }

    // Send the replica's own listening port to the master
    /* Set the slave port, so that Master's INFO command can list the
     * slave listening port correctly. */
    if (server.repl_state == REPL_STATE_SEND_PORT) {
        int port;
        if (server.slave_announce_port) port = server.slave_announce_port;
        else if (server.tls_replication && server.tls_port) port = server.tls_port;
        else port = server.port;
        sds portstr = sdsfromlonglong(port);
        err = sendSynchronousCommand(SYNC_CMD_WRITE,conn,"REPLCONF",
                "listening-port",portstr, NULL);
        sdsfree(portstr);
        if (err) goto write_error;
        sdsfree(err);
        server.repl_state = REPL_STATE_RECEIVE_PORT;
        return;
    }

    // Read the master's reply
    /* Receive REPLCONF listening-port reply. */
    if (server.repl_state == REPL_STATE_RECEIVE_PORT) {
        err = sendSynchronousCommand(SYNC_CMD_READ,conn,NULL);
        /* Ignore the error if any, not all the Redis versions support
         * REPLCONF listening-port. */
        if (err[0] == '-') {
            serverLog(LL_NOTICE,"(Non critical) Master does not understand "
                                "REPLCONF listening-port: %s", err);
        }
        sdsfree(err);
        server.repl_state = REPL_STATE_SEND_IP;
    }

    // No announce IP configured: skip ahead to sending the replica's capabilities
    /* Skip REPLCONF ip-address if there is no slave-announce-ip option set. */
    if (server.repl_state == REPL_STATE_SEND_IP &&
        server.slave_announce_ip == NULL)
    {
        server.repl_state = REPL_STATE_SEND_CAPA;
    }

    // If an announce IP is configured, send it to the master
    /* Set the slave ip, so that Master's INFO command can list the
     * slave IP address port correctly in case of port forwarding or NAT. */
    if (server.repl_state == REPL_STATE_SEND_IP) {
        err = sendSynchronousCommand(SYNC_CMD_WRITE,conn,"REPLCONF",
                "ip-address",server.slave_announce_ip, NULL);
        if (err) goto write_error;
        sdsfree(err);
        server.repl_state = REPL_STATE_RECEIVE_IP;
        return;
    }

    // Read the master's reply
    /* Receive REPLCONF ip-address reply. */
    if (server.repl_state == REPL_STATE_RECEIVE_IP) {
        err = sendSynchronousCommand(SYNC_CMD_READ,conn,NULL);
        /* Ignore the error if any, not all the Redis versions support
         * REPLCONF listening-port. */
        if (err[0] == '-') {
            serverLog(LL_NOTICE,"(Non critical) Master does not understand "
                                "REPLCONF ip-address: %s", err);
        }
        sdsfree(err);
        server.repl_state = REPL_STATE_SEND_CAPA;
    }

    // Send the replica's capabilities to the master
    /* Inform the master of our (slave) capabilities.
     *
     * EOF: supports EOF-style RDB transfer for diskless replication.
     * PSYNC2: supports PSYNC v2, so understands +CONTINUE <new repl ID>.
     *
     * The master will ignore capabilities it does not understand. */
    if (server.repl_state == REPL_STATE_SEND_CAPA) {
        err = sendSynchronousCommand(SYNC_CMD_WRITE,conn,"REPLCONF",
                "capa","eof","capa","psync2",NULL);
        if (err) goto write_error;
        sdsfree(err);
        server.repl_state = REPL_STATE_RECEIVE_CAPA;
        return;
    }

    // Read the master's reply
    /* Receive CAPA reply. */
    if (server.repl_state == REPL_STATE_RECEIVE_CAPA) {
        err = sendSynchronousCommand(SYNC_CMD_READ,conn,NULL);
        /* Ignore the error if any, not all the Redis versions support
         * REPLCONF capa. */
        if (err[0] == '-') {
            serverLog(LL_NOTICE,"(Non critical) Master does not understand "
                                "REPLCONF capa: %s", err);
        }
        sdsfree(err);
        server.repl_state = REPL_STATE_SEND_PSYNC;
    }

    // Send the sync command to the master
    /* Try a partial resynchonization. If we don't have a cached master
     * slaveTryPartialResynchronization() will at least try to use PSYNC
     * to start a full resynchronization so that we get the master run id
     * and the global offset, to try a partial resync at the next
     * reconnection attempt. */
    if (server.repl_state == REPL_STATE_SEND_PSYNC) {
        if (slaveTryPartialResynchronization(conn,0) == PSYNC_WRITE_ERROR) {
            err = sdsnew("Write error sending the PSYNC command.");
            goto write_error;
        }
        server.repl_state = REPL_STATE_RECEIVE_PSYNC;
        return;
    }

    /* If reached this point, we should be in REPL_STATE_RECEIVE_PSYNC. */
    if (server.repl_state != REPL_STATE_RECEIVE_PSYNC) {
        serverLog(LL_WARNING,"syncWithMaster(): state machine error, "
                             "state should be RECEIVE_PSYNC but is %d",
                             server.repl_state);
        goto error;
    }

    // Read the master's reply
    psync_result = slaveTryPartialResynchronization(conn,1);
    if (psync_result == PSYNC_WAIT_REPLY) return; /* Try again later... */

    /* If the master is in an transient error, we should try to PSYNC
     * from scratch later, so go to the error path. This happens when
     * the server is loading the dataset or is not connected with its
     * master and so forth. */
    if (psync_result == PSYNC_TRY_LATER) goto error;

    /* Note: if PSYNC does not return WAIT_REPLY, it will take care of
     * uninstalling the read handler from the file descriptor. */

    // Partial synchronization
    if (psync_result == PSYNC_CONTINUE) {
        serverLog(LL_NOTICE, "MASTER <-> REPLICA sync: Master accepted a Partial Resynchronization.");
        if (server.supervised_mode == SUPERVISED_SYSTEMD) {
            redisCommunicateSystemd("STATUS=MASTER <-> REPLICA sync: Partial Resynchronization accepted. Ready to accept connections.\n");
            redisCommunicateSystemd("READY=1\n");
        }
        return;
    }

    /* PSYNC failed or is not supported: we want our slaves to resync with us
     * as well, if we have any sub-slaves. The master may transfer us an
     * entirely different data set and we have no way to incrementally feed
     * our slaves after that. */
    disconnectSlaves(); /* Force our slaves to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained slaves to PSYNC. */

    // PSYNC is not supported: fall back to the SYNC command
    /* Fall back to SYNC if needed. Otherwise psync_result == PSYNC_FULLRESYNC
     * and the server.master_replid and master_initial_offset are
     * already populated. */
    if (psync_result == PSYNC_NOT_SUPPORTED) {
        serverLog(LL_NOTICE,"Retrying with SYNC...");
        if (connSyncWrite(conn,"SYNC\r\n",6,server.repl_syncio_timeout*1000) == -1) {
            serverLog(LL_WARNING,"I/O error writing to MASTER: %s",
                strerror(errno));
            goto error;
        }
    }

    // Prepare for the full synchronization
    /* Prepare a suitable temp file for bulk transfer */
    // When loading through a file, create the temp file
    if (!useDisklessLoad()) {
        while(maxtries--) {
            snprintf(tmpfile,256,
                "temp-%d.%ld.rdb",(int)server.unixtime,(long int)getpid());
            dfd = open(tmpfile,O_CREAT|O_WRONLY|O_EXCL,0644);
            if (dfd != -1) break;
            sleep(1);
        }
        if (dfd == -1) {
            serverLog(LL_WARNING,"Opening the temp file needed for MASTER <-> REPLICA synchronization: %s",strerror(errno));
            goto error;
        }
        server.repl_transfer_tmpfile = zstrdup(tmpfile);
        server.repl_transfer_fd = dfd;
    }

    // Install the callback that receives the bulk data asynchronously
    /* Setup the non blocking download of the bulk file. */
    if (connSetReadHandler(conn, readSyncBulkPayload)
            == C_ERR)
    {
        char conninfo[CONN_INFO_LEN];
        serverLog(LL_WARNING,
            "Can't create readable event for SYNC: %s (%s)",
            strerror(errno), connGetInfo(conn, conninfo, sizeof(conninfo)));
        goto error;
    }

    // Switch the state to REPL_STATE_TRANSFER
    server.repl_state = REPL_STATE_TRANSFER;
    server.repl_transfer_size = -1;
    server.repl_transfer_read = 0;
    server.repl_transfer_last_fsync_off = 0;
    server.repl_transfer_lastio = server.unixtime;
    return;

error:
    if (dfd != -1) close(dfd);
    connClose(conn);
    server.repl_transfer_s = NULL;
    if (server.repl_transfer_fd != -1)
        close(server.repl_transfer_fd);
    if (server.repl_transfer_tmpfile)
        zfree(server.repl_transfer_tmpfile);
    server.repl_transfer_tmpfile = NULL;
    server.repl_transfer_fd = -1;
    server.repl_state = REPL_STATE_CONNECT;
    return;

write_error:
    /* Handle sendSynchronousCommand(SYNC_CMD_WRITE) errors. */
    serverLog(LL_WARNING,"Sending command to master in replication handshake: %s", err);
    sdsfree(err);
    goto error;
}

Timeout detection and reconnection

void replicationCron(void) {
    static long long replication_cron_loops = 0;

    /* Non blocking connection timeout? */
    // Connection, handshake and authentication phase
    if (server.masterhost &&
        (server.repl_state == REPL_STATE_CONNECTING ||
         slaveIsInHandshakeState()) &&
         (time(NULL)-server.repl_transfer_lastio) > server.repl_timeout)
    {
        serverLog(LL_WARNING,"Timeout connecting to the MASTER...");
        cancelReplicationHandshake();
    }

    /* Bulk transfer I/O timeout? */
    // Full synchronization phase
    if (server.masterhost && server.repl_state == REPL_STATE_TRANSFER &&
        (time(NULL)-server.repl_transfer_lastio) > server.repl_timeout)
    {
        serverLog(LL_WARNING,"Timeout receiving bulk data from MASTER... If the problem persists try to set the 'repl-timeout' parameter in redis.conf to a larger value.");
        cancelReplicationHandshake();
    }

    /* Timed out master when we are an already connected slave? */
    // Normal command propagation phase
    if (server.masterhost && server.repl_state == REPL_STATE_CONNECTED &&
        (time(NULL)-server.master->lastinteraction) > server.repl_timeout)
    {
        serverLog(LL_WARNING,"MASTER timeout: no data nor PING received...");
        freeClient(server.master);
    }

    /* Check if we should connect to a MASTER */
    // Reconnect
    if (server.repl_state == REPL_STATE_CONNECT) {
        serverLog(LL_NOTICE,"Connecting to MASTER %s:%d",
            server.masterhost, server.masterport);
        if (connectWithMaster() == C_OK) {
            serverLog(LL_NOTICE,"MASTER <-> REPLICA sync started");
        }
    }

    ...
    // Ping the replicas
    /* First, send PING according to ping_slave_period. */
    if ((replication_cron_loops % server.repl_ping_slave_period) == 0 &&
        listLength(server.slaves))
    {
        /* Note that we don't send the PING if the clients are paused during
         * a Redis Cluster manual failover: the PING we send will otherwise
         * alter the replication offsets of master and slave, and will no longer
         * match the one stored into 'mf_master_offset' state. */
        int manual_failover_in_progress =
            server.cluster_enabled &&
            server.cluster->mf_end &&
            clientsArePaused();

        if (!manual_failover_in_progress) {
            ping_argv[0] = createStringObject("PING",4);
            replicationFeedSlaves(server.slaves, server.slaveseldb,
                ping_argv, 1);
            decrRefCount(ping_argv[0]);
        }
    }

    ...
    // Release the circular backlog buffer
    if (listLength(server.slaves) == 0 && server.repl_backlog_time_limit &&
        server.repl_backlog && server.masterhost == NULL)
    {
        time_t idle = server.unixtime - server.repl_no_slaves_since;

        if (idle > server.repl_backlog_time_limit) {
            ...
            freeReplicationBacklog();
            ...
        }
    }

    // Some replicas are still waiting for a full synchronization
    /* Start a BGSAVE good for replication if we have slaves in
     * WAIT_BGSAVE_START state.
     *
     * In case of diskless replication, we make sure to wait the specified
     * number of seconds (according to configuration) so that other slaves
     * have the time to arrive before we start streaming. */
    if (!hasActiveChildProcess()) {
        time_t idle, max_idle = 0;
        int slaves_waiting = 0;
        int mincapa = -1;
        listNode *ln;
        listIter li;

        listRewind(server.slaves,&li);
        while((ln = listNext(&li))) {
            client *slave = ln->value;
            if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_START) {
                idle = server.unixtime - slave->lastinteraction;
                if (idle > max_idle) max_idle = idle;
                slaves_waiting++;
                mincapa = (mincapa == -1) ? slave->slave_capa :
                                            (mincapa & slave->slave_capa);
            }
        }

        if (slaves_waiting &&
            (!server.repl_diskless_sync ||
             max_idle > server.repl_diskless_sync_delay))
        {
            /* Start the BGSAVE. The called function may start a
             * BGSAVE with socket target or disk target depending on the
             * configuration and slaves capabilities. */
            startBgsaveForReplication(mincapa);
        }
    }

    /* Remove the RDB file used for replication if Redis is not running
     * with any persistence. */
    removeRDBUsedToSyncReplicas();

    /* Refresh the number of slaves with lag <= min-slaves-max-lag. */
    refreshGoodSlavesCount();
    replication_cron_loops++; /* Incremented with frequency 1 HZ. */
}

Finding the resume offset after a disconnect or timeout

void freeClient(client *c) {
    ...
    /* If it is our master that's being disconnected we should make sure
     * to cache the state to try a partial resynchronization later.
     *
     * Note that before doing this we make sure that the client is not in
     * some unexpected state, by checking its flags. */
    if (server.master && c->flags & CLIENT_MASTER) {
        serverLog(LL_WARNING,"Connection with master lost.");
        if (!(c->flags & (CLIENT_PROTOCOL_ERROR|CLIENT_BLOCKED))) {
            c->flags &= ~(CLIENT_CLOSE_ASAP|CLIENT_CLOSE_AFTER_REPLY);
            replicationCacheMaster(c);
            return;
        }
    }
    ...
}

void replicationCacheMaster(client *c) {
    ...
    server.master->read_reploff = server.master->reploff;
    ...
    resetClient(c);

    /* Save the master. Server.master will be set to null later by
     * replicationHandleMasterDisconnection(). */
    server.cached_master = server.master;

    replicationHandleMasterDisconnection();
}

void replicationHandleMasterDisconnection(void) {
    ...
    server.master = NULL;
    server.repl_state = REPL_STATE_CONNECT;
    server.repl_down_since = server.unixtime;
    ...
}

Sending the cached resume point to the master

int slaveTryPartialResynchronization(connection *conn, int read_reply) {
    ...
        if (server.cached_master) {
            psync_replid = server.cached_master->replid;
            snprintf(psync_offset,sizeof(psync_offset),"%lld", server.cached_master->reploff+1);
            serverLog(LL_NOTICE,"Trying a partial resynchronization (request %s:%s).", psync_replid, psync_offset);
        } else {
            serverLog(LL_NOTICE,"Partial resynchronization not possible (no cached master)");
            psync_replid = "?";
            memcpy(psync_offset,"-1",3);
        }

        /* Issue the PSYNC command */
        reply = sendSynchronousCommand(SYNC_CMD_WRITE,conn,"PSYNC",psync_replid,psync_offset,NULL);
        ...
        return PSYNC_WAIT_REPLY;
    }
    ...
}

How a restarted replica recovers its resume point

When a replica performs RDB persistence, it writes the resume point into the RDB file.

int rdbSave(char *filename, rdbSaveInfo *rsi) {
    ...
    if (rdbSaveRio(&rdb,&error,RDBFLAGS_NONE,rsi) == C_ERR) {
        errno = error;
        goto werr;
    }
    ...
}

int rdbSaveRio(rio *rdb, int *error, int rdbflags, rdbSaveInfo *rsi) {
    ...
    if (rdbSaveInfoAuxFields(rdb,rdbflags,rsi) == -1) goto werr;
    ...
    for (j = 0; j < server.dbnum; j++) {
        ...
    }
    ...
}

int rdbSaveInfoAuxFields(rio *rdb, int rdbflags, rdbSaveInfo *rsi) {
    ...
    /* Add a few fields about the state when the RDB was created. */
    if (rdbSaveAuxFieldStrStr(rdb,"redis-ver",REDIS_VERSION) == -1) return -1;
    if (rdbSaveAuxFieldStrInt(rdb,"redis-bits",redis_bits) == -1) return -1;
    if (rdbSaveAuxFieldStrInt(rdb,"ctime",time(NULL)) == -1) return -1;
    if (rdbSaveAuxFieldStrInt(rdb,"used-mem",zmalloc_used_memory()) == -1) return -1;

    /* Handle saving options that generate aux fields. */
    if (rsi) {
        if (rdbSaveAuxFieldStrInt(rdb,"repl-stream-db",rsi->repl_stream_db)
            == -1) return -1;
        if (rdbSaveAuxFieldStrStr(rdb,"repl-id",server.replid)
            == -1) return -1;
        if (rdbSaveAuxFieldStrInt(rdb,"repl-offset",server.master_repl_offset)
            == -1) return -1;
    }
    ...
}

When the RDB file is loaded at startup, the corresponding offset is parsed from it.

void loadDataFromDisk(void) {
    ...
    if (rdbLoad(server.rdb_filename,&rsi,RDBFLAGS_NONE) == C_OK) {
        ...
        /* Restore the replication ID / offset from the RDB file. */
        if ((server.masterhost ||
            (server.cluster_enabled &&
            nodeIsSlave(server.cluster->myself))) &&
            rsi.repl_id_is_set &&
            rsi.repl_offset != -1 &&
            /* Note that older implementations may save a repl_stream_db
             * of -1 inside the RDB file in a wrong way, see more
             * information in function rdbPopulateSaveInfo. */
            rsi.repl_stream_db != -1)
        {
            memcpy(server.replid,rsi.repl_id,sizeof(server.replid));
            server.master_repl_offset = rsi.repl_offset;
            /* If we are a slave, create a cached master from this
             * information, in order to allow partial resynchronizations
             * with masters. */
            replicationCacheMasterUsingMyself();
            selectDb(server.cached_master,rsi.repl_stream_db);
        }
    }
    ...
}

void replicationCacheMasterUsingMyself(void) {
    ...
    server.master_initial_offset = server.master_repl_offset;

    /* The master client we create can be set to any DBID, because
     * the new master will start its replication stream with SELECT. */
    replicationCreateMasterClient(NULL,-1);

    /* Use our own ID / offset. */
    memcpy(server.master->replid, server.replid, sizeof(server.replid));

    /* Set as cached master. */
    unlinkClient(server.master);
    server.cached_master = server.master;
    server.master = NULL;
}

Writing to the circular buffer

void createReplicationBacklog(void) {
    serverAssert(server.repl_backlog == NULL);
    server.repl_backlog = zmalloc(server.repl_backlog_size);
    server.repl_backlog_histlen = 0;
    server.repl_backlog_idx = 0;

    /* We don't have any data inside our buffer, but virtually the first
     * byte we have is the next byte that will be generated for the
     * replication stream. */
    server.repl_backlog_off = server.master_repl_offset+1;
}

void feedReplicationBacklog(void *ptr, size_t len) {
    unsigned char *p = ptr;

    server.master_repl_offset += len;

    /* This is a circular buffer, so write as much data we can at every
     * iteration and rewind the "idx" index if we reach the limit. */
    while(len) {
        size_t thislen = server.repl_backlog_size - server.repl_backlog_idx;
        if (thislen > len) thislen = len;
        memcpy(server.repl_backlog+server.repl_backlog_idx,p,thislen);
        server.repl_backlog_idx += thislen;
        if (server.repl_backlog_idx == server.repl_backlog_size)
            server.repl_backlog_idx = 0;
        len -= thislen;
        p += thislen;
        server.repl_backlog_histlen += thislen;
    }
    if (server.repl_backlog_histlen > server.repl_backlog_size)
        server.repl_backlog_histlen = server.repl_backlog_size;
    /* Set the offset of the first byte we have in the backlog. */
    server.repl_backlog_off = server.master_repl_offset -
                              server.repl_backlog_histlen + 1;
}

Partial synchronization from the circular buffer

long long addReplyReplicationBacklog(client *c, long long offset) {
    long long j, skip, len;

    serverLog(LL_DEBUG, "[PSYNC] Replica request offset: %lld", offset);

    if (server.repl_backlog_histlen == 0) {
        serverLog(LL_DEBUG, "[PSYNC] Backlog history len is zero");
        return 0;
    }

    serverLog(LL_DEBUG, "[PSYNC] Backlog size: %lld",
             server.repl_backlog_size);
    serverLog(LL_DEBUG, "[PSYNC] First byte: %lld",
             server.repl_backlog_off);
    serverLog(LL_DEBUG, "[PSYNC] History len: %lld",
             server.repl_backlog_histlen);
    serverLog(LL_DEBUG, "[PSYNC] Current index: %lld",
             server.repl_backlog_idx);

    /* Compute the amount of bytes we need to discard. */
    skip = offset - server.repl_backlog_off;
    serverLog(LL_DEBUG, "[PSYNC] Skipping: %lld", skip);

    /* Point j to the oldest byte, that is actually our
     * server.repl_backlog_off byte. */
    j = (server.repl_backlog_idx +
        (server.repl_backlog_size-server.repl_backlog_histlen)) %
        server.repl_backlog_size;
    serverLog(LL_DEBUG, "[PSYNC] Index of first byte: %lld", j);
    /* Discard the amount of data to seek to the specified 'offset'. */
    j = (j + skip) % server.repl_backlog_size;

    /* Feed slave with data. Since it is a circular buffer we have to
     * split the reply in two parts if we are cross-boundary. */
    len = server.repl_backlog_histlen - skip;
    serverLog(LL_DEBUG, "[PSYNC] Reply total length: %lld", len);
    while(len) {
        long long thislen =
            ((server.repl_backlog_size - j) < len) ?
            (server.repl_backlog_size - j) : len;

        serverLog(LL_DEBUG, "[PSYNC] addReply() length: %lld", thislen);
        addReplySds(c,sdsnewlen(server.repl_backlog + j, thislen));
        len -= thislen;
        j = 0;
    }
    return server.repl_backlog_histlen - skip;
}
