"Redis Design and Implementation" - 17: Cluster Resharding (reshard)

1. Resharding

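A Redis Cluster resharding operation can take any number of slots currently assigned to one node (the source node) and reassign them to another node (the target node). The key-value pairs that belong to those slots are moved from the source node to the target node as well.
Resharding can be done online: the cluster does not need to go offline during the process, and both the source node and the target node keep serving command requests.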

2. How resharding works

redis-trib, the Redis cluster management tool, performs resharding by sending commands to the source node and the target node.

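The tool ships in the src directory of the Redis source tree as redis-trib.rb and is written in Ruby. Fortunately, its comments are enough to follow the overall flow.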

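redis-trib.rb provides the following commands:

1. create: create a cluster
2. check: check a cluster
3. info: show cluster information
4. fix: fix a cluster
5. reshard: migrate slots online
6. rebalance: balance the number of slots across cluster nodes
7. add-node: add a new node to the cluster
8. del-node: remove a node from the cluster
9. set-timeout: set the timeout of the heartbeat connections between cluster nodes (cluster-node-timeout)
10. call: run a command on every node in the cluster
11. import: import data from an external Redis instance into the cluster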

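Here we only look at the reshard code. redis-trib.rb has two main classes, ClusterNode and RedisTrib. ClusterNode holds the information about a single node, while RedisTrib implements the individual redis-trib.rb commands.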

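2.1 The ClusterNode object

We start with the ClusterNode source. ClusterNode has the following instance variables (in Ruby, instance variables start with @):

@r: the client object used to run Redis commands.
@info: detailed information about the node, made up of the node's own line in the cluster nodes output and the output of cluster info.
@dirty: whether the node information needs to be written back. If true, the changes held in memory still have to be applied to the node.
@friends: the info of the other nodes in the cluster, taken from the remaining lines of the cluster nodes output.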

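ClusterNode has the following methods:

initialize: the constructor; takes the node's address.
friends: returns the @friends object.
slots: returns the slots this node is responsible for.
has_flag?: checks whether the flags in the node's info include the given flag.
to_s: like Java's toString; returns the node's address.
connect: connects to the Redis node.
assert_cluster: asserts that the node has cluster mode enabled.
assert_empty: asserts that the node has not yet shaken hands with any other node and that its own db is empty.
load_info: loads the node information via cluster info and cluster nodes.
add_slots: assigns slots to the node. The change is made only in memory and dirty is set to true; flush_node_config later applies the in-memory data to the node.
set_as_replica: sets the master that this slave replicates, and sets dirty to true.
flush_node_config: applies the changes held in memory to the cluster node.
info_string: a short info summary.
get_config_signature: returns a signature of the node's configuration, used to verify that the cluster nodes information is consistent across nodes.
info: returns the @info object, which contains the detailed info.
is_dirty?: returns @dirty.
r: returns the client object used to run Redis commands.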
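
The deferred-write pattern behind add_slots, is_dirty? and flush_node_config can be sketched roughly as follows. This is a simplified model, not the redis-trib.rb code: RecordingClient and SketchNode are hypothetical names, and the client only records the commands it would send instead of talking to a server.

```ruby
# Hypothetical stand-in for the Redis client held in @r; it only records
# the cluster commands it is asked to run.
class RecordingClient
  attr_reader :calls

  def initialize
    @calls = []
  end

  def cluster(*args)
    @calls << args
    "OK"
  end
end

# Simplified model of ClusterNode's "change in memory, flush later" behaviour.
class SketchNode
  attr_reader :info

  def initialize(client)
    @r = client
    @info = { :slots => {} }
    @dirty = false
  end

  # Record the slots in memory only and mark the node as dirty.
  def add_slots(slots)
    slots.each { |s| @info[:slots][s] = :new }
    @dirty = true
  end

  def is_dirty?
    @dirty
  end

  # Send the pending slot assignments to the node, then clear the dirty flag.
  def flush_node_config
    return unless @dirty
    new_slots = @info[:slots].select { |_, state| state == :new }.keys
    @r.cluster("addslots", *new_slots) unless new_slots.empty?
    new_slots.each { |s| @info[:slots][s] = true }
    @dirty = false
  end
end

client = RecordingClient.new
node = SketchNode.new(client)
node.add_slots([0, 1, 2])
p client.calls        # still empty: nothing has been sent yet
node.flush_node_config
p client.calls        # the recorded addslots call
```

Keeping the change in memory first lets the tool build up a complete plan and only then touch the nodes.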

2.2 reshard command parameters

reshard         host:port
                --from <arg>
                --to <arg>
                --slots <arg>
                --yes
                --timeout <arg>
                --pipeline <arg>

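host:port: required. This node is used to fetch the information for the whole cluster, so it acts as the entry point for cluster information.
--from <arg>: the source nodes to migrate slots from. Several source nodes can be given as a comma-separated list of node IDs. You can also pass --from all, in which case the sources are all master nodes in the cluster other than the target. If the parameter is omitted, the user is prompted for it during the migration.
--to <arg>: the node ID of the target node that receives the slots. Only one target node can be given. If the parameter is omitted, the user is prompted for it during the migration.
--slots <arg>: the number of slots to migrate. If the parameter is omitted, the user is prompted for it during the migration.
--yes: when set, the reshard plan is printed and then executed immediately. Without it, the user is asked to type yes to confirm the plan before the reshard runs.
--timeout <arg>: the timeout of the migrate command.
--pipeline <arg>: the number of keys fetched by each cluster getkeysinslot call. The default is 10 if the parameter is not given.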

2.3 How reshard runs

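1. Load the cluster information with the load_cluster_info_from_node method.

2. Run check_cluster to verify that the cluster is healthy. Only a healthy cluster can be resharded.

3. Get the number of slots to migrate. If the user did not pass --slots, prompt for it.

4. Get the target node. If the user did not pass --to, prompt for it. The target node must be a master.

5. Get the source nodes. If the user did not pass --from, prompt for them. Every source node must be a master. With --from all, the sources are all master nodes other than the target. To keep the slot distribution across the cluster even, passing --from all is recommended.

6. Run compute_reshard_table to work out how the requested number of slots is divided among the source nodes.

7. Print the reshard plan. If the user did not pass --yes, ask the user to confirm the plan.

8. Following the reshard plan, migrate the slots to the target node one at a time with the move_slot method. Several commands share this method; its flow is described further down. reshard calls move_slot with dots set to true and with the pipeline size.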
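
Step 6 is where the plan is built. The source of compute_reshard_table is not quoted in this article, so the following is only a sketch of one plausible approach: each source gives up slots in proportion to how many it owns, and the largest source absorbs the rounding. SketchSource and sketch_reshard_table are hypothetical names, and the sketch assumes the sources own at least one slot between them.

```ruby
# Hypothetical stand-in for a master node: a name plus the slots it owns.
SketchSource = Struct.new(:name, :slots)

# Split `numslots` across the sources in proportion to the slots each owns.
def sketch_reshard_table(sources, numslots)
  # Largest source first, so the rounded-up share comes from the biggest node.
  sorted = sources.sort_by { |s| -s.slots.length }
  total = sorted.inject(0) { |sum, s| sum + s.slots.length }
  moved = []
  sorted.each_with_index do |s, i|
    share = numslots.to_f / total * s.slots.length
    n = i.zero? ? share.ceil : share.floor
    # Take the lowest-numbered slots of this source, never exceeding numslots.
    s.slots.sort.first(n).each do |slot|
      moved << { :source => s.name, :slot => slot } if moved.length < numslots
    end
  end
  moved
end

a = SketchSource.new("A", (0..5460).to_a)
b = SketchSource.new("B", (5461..10922).to_a)
plan = sketch_reshard_table([a, b], 10)
plan.each { |e| puts "move slot #{e[:slot]} from #{e[:source]}" }
```

Each entry of the resulting table names one slot and the node it leaves, which is the shape the reshard loop needs when it calls move_slot once per entry.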

The source of reshard_cluster_cmd is shown below; the comments make it easier to follow.

    def reshard_cluster_cmd(argv,opt)
        opt = {'pipeline' => MigrateDefaultPipeline}.merge(opt)

        load_cluster_info_from_node(argv[0])
        check_cluster
        if @errors.length != 0
            puts "*** Please fix your cluster problems before resharding"
            exit 1
        end

        @timeout = opt['timeout'].to_i if opt['timeout'].to_i

        # Get number of slots
        if opt['slots']
            numslots = opt['slots'].to_i
        else
            numslots = 0
            while numslots <= 0 or numslots > ClusterHashSlots
                print "How many slots do you want to move (from 1 to #{ClusterHashSlots})? "
                numslots = STDIN.gets.to_i
            end
        end

        # Get the target instance
        if opt['to']
            target = get_node_by_name(opt['to'])
            if !target || target.has_flag?("slave")
                xputs "*** The specified node is not known or not a master, please retry."
                exit 1
            end
        else
            target = nil
            while not target
                print "What is the receiving node ID? "
                target = get_node_by_name(STDIN.gets.chop)
                if !target || target.has_flag?("slave")
                    xputs "*** The specified node is not known or not a master, please retry."
                    target = nil
                end
            end
        end

        # Get the source instances
        sources = []
        if opt['from']
            opt['from'].split(',').each{|node_id|
                if node_id == "all"
                    sources = "all"
                    break
                end
                src = get_node_by_name(node_id)
                if !src || src.has_flag?("slave")
                    xputs "*** The specified node is not known or is not a master, please retry."
                    exit 1
                end
                sources << src
            }
        else
            xputs "Please enter all the source node IDs."
            xputs "  Type 'all' to use all the nodes as source nodes for the hash slots."
            xputs "  Type 'done' once you entered all the source nodes IDs."
            while true
                print "Source node ##{sources.length+1}:"
                line = STDIN.gets.chop
                src = get_node_by_name(line)
                if line == "done"
                    break
                elsif line == "all"
                    sources = "all"
                    break
                elsif !src || src.has_flag?("slave")
                    xputs "*** The specified node is not known or is not a master, please retry."
                elsif src.info[:name] == target.info[:name]
                    xputs "*** It is not possible to use the target node as source node."
                else
                    sources << src
                end
            end
        end

        if sources.length == 0
            puts "*** No source nodes given, operation aborted"
            exit 1
        end

        # Handle sources == all.
        if sources == "all"
            sources = []
            @nodes.each{|n|
                next if n.info[:name] == target.info[:name]
                next if n.has_flag?("slave")
                sources << n
            }
        end

        # Check if the destination node is the same of any source nodes.
        if sources.index(target)
            xputs "*** Target node is also listed among the source nodes!"
            exit 1
        end

        puts "\nReady to move #{numslots} slots."
        puts "  Source nodes:"
        sources.each{|s| puts "    "+s.info_string}
        puts "  Destination node:"
        puts "    #{target.info_string}"
        reshard_table = compute_reshard_table(sources,numslots)
        puts "  Resharding plan:"
        show_reshard_table(reshard_table)
        if !opt['yes']
            print "Do you want to proceed with the proposed reshard plan (yes/no)? "
            yesno = STDIN.gets.chop
            exit(1) if (yesno != "yes")
        end
        reshard_table.each{|e|
            move_slot(e[:source],target,e[:slot],
                :dots=>true,
                :pipeline=>opt['pipeline'])
        }
    end

    # This is an helper function for create_cluster_cmd that verifies if
    # the number of nodes and the specified replicas have a valid configuration
    # where there are at least three master nodes and enough replicas per node.
    def check_create_parameters
        masters = @nodes.length/(@replicas+1)
        if masters < 3
            puts "*** ERROR: Invalid configuration for cluster creation."
            puts "*** Redis Cluster requires at least 3 master nodes."
            puts "*** This is not possible with #{@nodes.length} nodes and #{@replicas} replicas per node."
            puts "*** At least #{3*(@replicas+1)} nodes are required."
            exit 1
        end
    end

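The move_slot method migrates all the data of one slot from the source node to the target node online. fix, reshard and rebalance all call it to migrate slots.

The move_slot flow is as follows:

1. If cold is not set, run cluster setslot importing on the target node.

2. Then run cluster setslot migrating on the source node. (During fix, importing and migrating may already have been set, so cold is used in that scenario.)

3. Use cluster getkeysinslot to fetch up to pipeline keys of the slot being migrated from the source node.

4. Run migrate on those keys to move the data from the source node to the target node.

5. Repeat steps 3 and 4 until no keys are returned, then leave the loop.

6. If cold is not set, run cluster setslot node on every master node to assign the slot to the target node.

7. If update is set, update the slot information held in memory for the source and target nodes. This completes the slot migration.

The book includes a diagram of this flow.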
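
The steps above can be sketched without a running cluster. The following simulation is only an illustration: FakeNode and sketch_move_slot are hypothetical, the nodes are plain in-memory objects, and the cold and update options are left out.

```ruby
# Hypothetical in-memory stand-in for a cluster node; no Redis server involved.
class FakeNode
  attr_reader :name, :keys_by_slot, :slot_state

  def initialize(name, keys_by_slot = {})
    @name = name
    @keys_by_slot = Hash.new { |h, k| h[k] = [] }
    keys_by_slot.each { |slot, keys| @keys_by_slot[slot] = keys.dup }
    @slot_state = {}
  end

  # Mimics CLUSTER GETKEYSINSLOT: up to `count` keys stored in the slot.
  def getkeysinslot(slot, count)
    @keys_by_slot[slot].first(count)
  end

  # Mimics MIGRATE with several keys: they leave this node and land on the target.
  def migrate(target, slot, keys)
    target.keys_by_slot[slot].concat(keys)
    @keys_by_slot[slot] -= keys
  end
end

def sketch_move_slot(source, target, slot, all_nodes, pipeline = 10)
  # Steps 1-2: open the slot on both sides, target first.
  target.slot_state[slot] = [:importing, source.name]
  source.slot_state[slot] = [:migrating, target.name]
  batches = 0
  # Steps 3-5: drain the slot in batches of at most `pipeline` keys.
  loop do
    keys = source.getkeysinslot(slot, pipeline)
    break if keys.empty?
    source.migrate(target, slot, keys)
    batches += 1
  end
  # Step 6: record the new owner on every node, which also closes the slot.
  all_nodes.each { |n| n.slot_state[slot] = [:node, target.name] }
  batches
end

src = FakeNode.new("src", { 100 => (1..25).map { |i| "key#{i}" } })
dst = FakeNode.new("dst")
puts sketch_move_slot(src, dst, 100, [src, dst])  # number of migrate batches
puts dst.keys_by_slot[100].length                 # keys now on the target
```

Because each batch is removed from the source as it is migrated, asking for the first pipeline keys again always returns keys that have not been moved yet, so the loop ends exactly when the slot is empty.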

    # Move slots between source and target nodes using MIGRATE.
    #
    # Options:
    # :dots    -- Print a dot for every moved key.
    # :fix     -- We are moving in the context of a fix. Use REPLACE.
    # :cold    -- Move keys without opening slots / reconfiguring the nodes.
    # :update  -- Update nodes.info[:slots] for source/target nodes.
    # :quiet   -- Don't print info messages.
    def move_slot(source,target,slot,o={})
        o = {:pipeline => MigrateDefaultPipeline}.merge(o)

        # We start marking the slot as importing in the destination node,
        # and the slot as migrating in the source node. Note that the order of
        # the operations is important, as otherwise a client may be redirected
        # to the target node that does not yet know it is importing this slot.
        if !o[:quiet]
            print "Moving slot #{slot} from #{source} to #{target}: "
            STDOUT.flush
        end

        if !o[:cold]
            target.r.cluster("setslot",slot,"importing",source.info[:name])
            source.r.cluster("setslot",slot,"migrating",target.info[:name])
        end
        # Migrate all the keys from source to target using the MIGRATE command
        while true
            keys = source.r.cluster("getkeysinslot",slot,o[:pipeline])
            break if keys.length == 0
            begin
                source.r.client.call(["migrate",target.info[:host],target.info[:port],"",0,@timeout,:keys,*keys])
            rescue => e
                if o[:fix] && e.to_s =~ /BUSYKEY/
                    xputs "*** Target key exists. Replacing it for FIX."
                    source.r.client.call(["migrate",target.info[:host],target.info[:port],"",0,@timeout,:replace,:keys,*keys])
                else
                    puts ""
                    xputs "[ERR] Calling MIGRATE: #{e}"
                    exit 1
                end
            end
            print "."*keys.length if o[:dots]
            STDOUT.flush
        end

        puts if !o[:quiet]
        # Set the new node as the owner of the slot in all the known nodes.
        if !o[:cold]
            @nodes.each{|n|
                next if n.has_flag?("slave")
                n.r.cluster("setslot",slot,"node",target.info[:name])
            }
        end

        # Update the node logical config
        if o[:update] then
            source.info[:slots].delete(slot)
            target.info[:slots][slot] = true
        end
    end

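This function simply issues Redis commands: cluster setslot importing, cluster setslot migrating, cluster getkeysinslot, migrate and cluster setslot node. These five commands are examined below:

(1) cluster setslot $slot importing $sourceid

This command is sent to the target node. On receiving it, the target node records that $slot is being imported from $sourceid. From then on it serves requests for $slot, but only when the client has sent ASKING immediately before the request.

(2) cluster setslot $slot migrating $targetid

This command is sent to the source node. On receiving it, the source node records that $slot is being migrated to $targetid. While handling a request for that slot, if it finds that the key is not present, it replies with a redirection telling the client to retry on $targetid. Why redirect? During a slot migration, new keys have to be written to $targetid. Because the new keys of the slot live on $targetid, reads of those keys must be redirected there as well to find the data.

(3) cluster getkeysinslot $slot $count

This command is sent to the source node. It returns up to $count keys stored in $slot.

(4) migrate $host $port $key $destination-db $timeout

This command is sent to the source node. It atomically moves $key from the source node to the target node. redis-trib passes an empty $key together with the keys option so that a whole batch is moved in one call. Repeating (3) and (4) moves all the keys of $slot to the target node.

(5) cluster setslot $slot node $targetid

This command is sent to every master node in the cluster so that each one records $targetid as the owner of $slot and clears the importing and migrating states. Because the ownership of a hash slot changes, the target node bumps its configEpoch, which makes the other nodes adopt the slot configuration that carries this configEpoch.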
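
How the importing and migrating states from (1) and (2) shape a node's answer to a single-key request can be modelled in a few lines. This is a rough sketch of the rules described above, not Redis source code: SketchNodeState and sketch_route are hypothetical, and the addresses are made-up examples.

```ruby
# Hypothetical model of one node's view of the cluster during resharding.
# owned:     slots this node currently owns
# migrating: slot => address of the node the slot is moving to
# importing: slot => address of the node the slot is coming from
# keys:      keys stored locally
# owners:    slot => address of the owner, for slots this node does not own
SketchNodeState = Struct.new(:owned, :migrating, :importing, :keys, :owners)

def sketch_route(node, slot, key, asking = false)
  if node.owned.include?(slot)
    target = node.migrating[slot]
    # Slot not migrating, or the key has not been moved yet: answer locally.
    return :serve if target.nil? || node.keys.include?(key)
    # Key is missing from a migrating slot: it may already be on the target.
    "ASK #{slot} #{target}"
  elsif node.importing.key?(slot) && asking
    # An importing node serves the slot only for clients that sent ASKING.
    :serve
  else
    "MOVED #{slot} #{node.owners[slot]}"
  end
end

source = SketchNodeState.new([42], { 42 => "127.0.0.1:7001" }, {}, ["old"], {})
target = SketchNodeState.new([], {}, { 42 => "127.0.0.1:7000" }, ["new"],
                             { 42 => "127.0.0.1:7000" })

p sketch_route(source, 42, "old")        # key still on the source
p sketch_route(source, 42, "new")        # key absent on the source
p sketch_route(target, 42, "new")        # target asked without ASKING
p sketch_route(target, 42, "new", true)  # target asked after ASKING
```

Until command (5) runs, the source remains the owner of the slot, which is why the target sends clients that did not use ASKING back to the source.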

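ASK errors

During resharding, the keys of a slot may be split between the source node and the target node.
If the source node finds that a key is no longer stored locally, it returns an ASK error to the client, directing the client to the target node that is importing the slot.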
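
On the client side, handling an ASK error comes down to parsing the redirection and retrying on the indicated node with ASKING sent first. The sketch below assumes the error text has the form "ASK <slot> <host>:<port>" with the leading "-" of the protocol already stripped by the client library. parse_redirect and commands_after_ask are hypothetical helpers, not part of any client library.

```ruby
# Parse the text of a redirection error, e.g. "ASK 3999 127.0.0.1:6381".
# Returns nil when the message is not a redirection.
def parse_redirect(message)
  m = /\A(ASK|MOVED) (\d+) (\S+):(\d+)\z/.match(message)
  return nil unless m
  { :type => m[1].downcase.to_sym, :slot => m[2].to_i,
    :host => m[3], :port => m[4].to_i }
end

# Commands to send to the node named in an ASK reply: ASKING first, then the
# original command. The client's slot map is left alone, because ASK is a
# one-off redirection rather than a change of ownership.
def commands_after_ask(original_command)
  [["ASKING"], original_command]
end

redirect = parse_redirect("ASK 3999 127.0.0.1:6381")
p redirect
p commands_after_ask(["GET", "foo"]) if redirect && redirect[:type] == :ask
```

A MOVED reply parses the same way, but there the client would normally update its slot map and send later requests for that slot straight to the new node.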

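Summary:

Migrating slots with redis-trib does not require Redis Cluster to stop serving clients. As long as clients handle redirections correctly, they need not notice the slot migration at all.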
