ceph/crush/mapper.c source code walkthrough

(1) The crush_find_rule function

int crush_find_rule(const struct crush_map *map, int ruleset, int type, int size)

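crush_find_rule looks up the id of the crush_rule in the crush_map that matches the given ruleset, type, and size.
Parameters:
map: the crush_map
ruleset: id of the storage ruleset (user-defined)
type: type of the storage ruleset (user-defined)
size: size of the output set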
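
The lookup described above can be sketched as a linear scan over the rule table. This is a minimal illustration and not the actual Ceph code: the toy_rule and toy_map structs are hypothetical stand-ins for the real structures, and matching size against a [min_size, max_size] range is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-ins for the real rule and map structs. */
struct toy_rule {
    int ruleset;
    int type;
    int min_size; /* assumed: smallest output-set size the rule accepts */
    int max_size; /* assumed: largest output-set size the rule accepts */
};

struct toy_map {
    struct toy_rule **rules; /* rule table; unused slots are NULL */
    int max_rules;
};

/* Return the index (rule id) of the first matching rule, or -1 if none. */
int toy_find_rule(const struct toy_map *map, int ruleset, int type, int size)
{
    assert(map != NULL);
    for (int i = 0; i < map->max_rules; i++) {
        const struct toy_rule *r = map->rules[i];
        if (r != NULL &&
            r->ruleset == ruleset &&
            r->type == type &&
            r->min_size <= size && size <= r->max_size)
            return i;
    }
    return -1;
}
```

The returned index is the kind of value that would then be handed to crush_do_rule as ruleno.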

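(2) bucket_perm_choose

static int bucket_perm_choose(struct crush_bucket *bucket, int x, int r)

Chooses based on a random permutation of the bucket. Given a CRUSH input x and a replica position r (usually the position in the output set), it produces an item from the bucket.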

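(3) bucket_uniform_choose

static int bucket_uniform_choose(const struct crush_bucket_uniform *bucket, struct crush_work_bucket *work, int x, int r)

The uniform type suits buckets in which every item has the same weight and items are rarely added or removed, so the item count stays fairly constant. It selects using a pseudo-random permutation.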
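
To make the permutation idea concrete, here is a small sketch, assuming a Fisher-Yates shuffle whose swaps depend only on x. The names toy_hash3 and toy_perm_choose are hypothetical, the hash is a stand-in rather than the one CRUSH uses, and the real code likely differs in detail (this version rebuilds the whole permutation on every call).

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_ITEMS 64

/* Hypothetical stand-in for CRUSH's hash; any well-mixed hash would do. */
static uint32_t toy_hash3(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t h = 2166136261u;
    h = (h ^ a) * 16777619u;
    h = (h ^ b) * 16777619u;
    h = (h ^ c) * 16777619u;
    h ^= h >> 15;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    return h;
}

/* Pick the item for replica position r from `size` equally weighted items. */
int toy_perm_choose(const int *items, int size, int x, int r)
{
    int perm[TOY_MAX_ITEMS];

    assert(size > 0 && size <= TOY_MAX_ITEMS && r >= 0);

    for (int i = 0; i < size; i++)
        perm[i] = i;

    /* Fisher-Yates shuffle driven only by x: same x, same permutation. */
    for (int i = 0; i < size - 1; i++) {
        uint32_t span = (uint32_t)(size - i);
        int j = i + (int)(toy_hash3((uint32_t)x, (uint32_t)i, 0) % span);
        int tmp = perm[i];
        perm[i] = perm[j];
        perm[j] = tmp;
    }

    /* Position r walks through the permutation, wrapping around. */
    return items[perm[r % size]];
}
```

Because consecutive values of r index consecutive slots of one permutation, the first size replicas for a given x never collide.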

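(4) bucket_list_choose

static int bucket_list_choose(const struct crush_bucket_list *bucket, int x, int r)

In a list bucket, the child items are kept in memory as a linked list, and the items may carry arbitrary weights. When the cluster grows, a new device is added at the head of the list and little data migrates. Removing a device, however, causes a lot of data movement. The lookup works as follows:
1) Start at the head item of the list bucket. Take the head item's weight Wh and the total weight Ws of the head item plus every item after it in the list.
2) Use Hash(x, r, i) to obtain a value v in [0, 1). If v falls in [0, Wh/Ws), select the head item and return its id.
3) Otherwise, move on to the next item and repeat the same test on the rest of the list.
Lookup complexity is O(n).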
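
The steps above can be sketched as follows. This is an illustration under assumptions, not the Ceph implementation: toy_hash3 and toy_list_choose are hypothetical names, the hash is a stand-in, the list is stored as an array whose last element is the head, and v is represented as a 16-bit fixed-point fraction so that the comparison v < Wh/Ws can be done in integers.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_ITEMS 64

/* Hypothetical stand-in for CRUSH's hash; any well-mixed hash would do. */
static uint32_t toy_hash3(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t h = 2166136261u;
    h = (h ^ a) * 16777619u;
    h = (h ^ b) * 16777619u;
    h = (h ^ c) * 16777619u;
    h ^= h >> 15;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    return h;
}

/*
 * items[size - 1] is the head (most recently added); items[0] is the tail.
 * Returns the id of the chosen item.
 */
int toy_list_choose(const int *items, const uint32_t *weights, int size,
                    int x, int r)
{
    uint64_t sum[TOY_MAX_ITEMS];

    assert(size > 0 && size <= TOY_MAX_ITEMS);

    /* sum[i] = weights[0] + ... + weights[i]: item i plus all items after it. */
    sum[0] = weights[0];
    for (int i = 1; i < size; i++)
        sum[i] = sum[i - 1] + weights[i];

    for (int i = size - 1; i >= 0; i--) {
        /* v is a 16-bit fraction; scaled = v * Ws lies in [0, Ws). */
        uint64_t v = toy_hash3((uint32_t)x, (uint32_t)r,
                               (uint32_t)items[i]) & 0xffff;
        uint64_t scaled = (v * sum[i]) >> 16;

        if (scaled < weights[i]) /* same test as v < Wh / Ws */
            return items[i];
    }
    return items[0]; /* only reached when weights[0] is 0 */
}
```

An item with weight 0 can never pass the test, and the tail item always passes it when its weight is positive, so the walk terminates with a valid choice.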

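(5) bucket_tree_choose

static int bucket_tree_choose(const struct crush_bucket_tree *bucket, int x, int r)

In a tree bucket, the items are organized as a tree: each item is a leaf of a decision tree. The root and the interior nodes are virtual nodes whose weight equals the sum of the weights of their left and right subtrees. Because the items sit at the leaves, every selection has to walk all the way down to a leaf before it yields an item. The lookup works as follows:
1) Start at the root (a virtual node) of the tree bucket.
2) Take the weight Wl of the node's left subtree and the weight Wn of the node itself, then use Hash(x, r, i) to obtain a value v in [0, 1):
a) If v falls in [0, Wl/Wn), continue in the left subtree.
b) Otherwise, continue in the right subtree.
3) Repeat step 2 on the chosen subtree until a leaf is reached. The item at that leaf is the result.

As this shows, every selection in a tree bucket walks down to a leaf node, so lookup complexity is O(log n).
When a bucket holds a large number of items, this is more efficient than a list bucket.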
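
The descent can be sketched like this. It is a simplified illustration, not the Ceph code: the heap-style node layout (root at index 1, children of k at 2k and 2k+1), the fixed leaf count, and the names toy_tree, toy_tree_build, toy_tree_choose, and toy_hash3 are all assumptions made for the example, and the real bucket numbers its nodes differently.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_LEAVES 8 /* number of leaf slots; must be a power of two */

/* Hypothetical stand-in for CRUSH's hash; any well-mixed hash would do. */
static uint32_t toy_hash3(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t h = 2166136261u;
    h = (h ^ a) * 16777619u;
    h = (h ^ b) * 16777619u;
    h = (h ^ c) * 16777619u;
    h ^= h >> 15;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    return h;
}

struct toy_tree {
    int items[TOY_LEAVES];
    uint32_t node_w[2 * TOY_LEAVES]; /* [1] is the root, leaves start at TOY_LEAVES */
};

/* Fill the leaves, then give each virtual node the sum of its two children. */
void toy_tree_build(struct toy_tree *t, const int *items,
                    const uint32_t *weights, int n)
{
    assert(n > 0 && n <= TOY_LEAVES);
    for (int i = 0; i < TOY_LEAVES; i++) {
        t->items[i] = i < n ? items[i] : -1;
        t->node_w[TOY_LEAVES + i] = i < n ? weights[i] : 0;
    }
    for (int k = TOY_LEAVES - 1; k >= 1; k--)
        t->node_w[k] = t->node_w[2 * k] + t->node_w[2 * k + 1];
    t->node_w[0] = 0; /* unused slot */
}

/* Walk from the root to a leaf, going left when v < Wl / Wn. */
int toy_tree_choose(const struct toy_tree *t, int x, int r)
{
    int k = 1;

    assert(t->node_w[1] > 0); /* at least one item must have weight */
    while (k < TOY_LEAVES) {
        /* Scale a 32-bit hash into [0, Wn) and compare against Wl. */
        uint64_t v = toy_hash3((uint32_t)x, (uint32_t)r, (uint32_t)k);
        v = (v * t->node_w[k]) >> 32;
        k = (v < t->node_w[2 * k]) ? 2 * k : 2 * k + 1;
    }
    return t->items[k - TOY_LEAVES];
}
```

Each step halves the remaining candidates, which is where the O(log n) bound comes from. The walk only ever enters a subtree with positive weight, so empty or zero-weight leaves are never returned.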

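(6) bucket_straw_choose

static int bucket_straw_choose(struct crush_bucket_straw *bucket, int x, int r)

bucket_straw_choose performs selection for straw buckets. The input x is the pgid and r is the replica position.

Straw is the default bucket type. Every item in the bucket competes on equal terms, and its chance of being selected is governed by its weight. It works as follows:
1) f(Wi) is a function of the item's weight Wi and determines how likely each item is to be selected.
2) A length is computed for every item as length = f(Wi) * hash(x, r, i).
3) The item with the largest length is the one selected.

List buckets and tree buckets are structured so that only a limited number of hash values need to be computed and compared against weights to pick an item from the bucket. To achieve this they use a divide-and-conquer approach that either gives certain items precedence (for example, those at the beginning of the list) or removes the need to consider entire subtrees at all. This makes replica placement faster, but it leads to suboptimal reorganization when the contents of a bucket change because an item is added, removed, or re-weighted.

The straw bucket type lets all items "compete" fairly against each other through a process similar to drawing straws. To place a replica, every item in the bucket is assigned a straw of random length, and the item with the longest straw wins (is selected). Each straw length starts out as a value in a fixed range, computed from a hash of the CRUSH input x, the replica number r, and the bucket item i. Each straw length is then multiplied by a factor f(wi) derived from the item's weight, so that heavily weighted items are more likely to win. Although placement in a straw bucket is slower than in list and tree buckets, straw buckets give optimal data movement between nested items when the bucket is modified (the reorganization process).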
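
The draw can be sketched as a single pass that keeps the longest straw. This is a minimal illustration: toy_hash3 and toy_straw_choose are hypothetical names, the hash is a stand-in, and the scaling factors f(Wi) are assumed to be precomputed by the caller and passed in as the straws array, since how f is derived from the weights is not covered here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for CRUSH's hash; any well-mixed hash would do. */
static uint32_t toy_hash3(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t h = 2166136261u;
    h = (h ^ a) * 16777619u;
    h = (h ^ b) * 16777619u;
    h = (h ^ c) * 16777619u;
    h ^= h >> 15;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    return h;
}

/*
 * straws[i] plays the role of f(Wi) for items[i] (assumed precomputed).
 * Returns the id of the item with the longest straw.
 */
int toy_straw_choose(const int *items, const uint32_t *straws, int size,
                     int x, int r)
{
    int high = 0;
    uint64_t high_draw = 0;

    assert(size > 0);
    for (int i = 0; i < size; i++) {
        /* Base length in a fixed range [0, 65536), from hash(x, r, i). */
        uint64_t draw = toy_hash3((uint32_t)x, (uint32_t)r,
                                  (uint32_t)items[i]) & 0xffff;

        draw *= straws[i]; /* length = f(Wi) * hash(x, r, i) */
        if (i == 0 || draw > high_draw) {
            high = i;
            high_draw = draw;
        }
    }
    return items[high];
}
```

Every item is examined on every call, which is why the lookup is O(n) and slower than the list and tree variants. In exchange, an item's straw depends only on its own factor and hash, not on its position in the bucket.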

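(7) bucket_straw2_choose

static int bucket_straw2_choose(struct crush_bucket_straw2 *bucket, int x, int r)

An improvement on the straw bucket that reduces the amount of data migrated. For example, after a device is added under item C and its weight changes, or after item C is removed, data only moves onto C or off C to other places. No data moves between the other items in the bucket.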

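(8) crush_bucket_choose

static int crush_bucket_choose(struct crush_bucket *in, int x, int r)

crush_bucket_choose picks the selection algorithm that matches the bucket's type and uses it to choose an item from the bucket.
crush_bucket_choose is the most important function in CRUSH. Because the default bucket type is straw, buckets are straw buckets in the common case, and the call then goes into bucket_straw_choose.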
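
The dispatch amounts to a switch on the bucket's algorithm. The sketch below only shows that shape: the toy_alg enum, the toy_bucket struct, and the per-type choosers are hypothetical, and the choosers are placeholders that return a fixed slot so the dispatch is observable. Falling back to the first item for an unknown type is also an assumption of this example.

```c
#include <assert.h>

/* Illustrative algorithm ids for the bucket types covered above. */
enum toy_alg { TOY_UNIFORM = 1, TOY_LIST, TOY_TREE, TOY_STRAW, TOY_STRAW2 };

struct toy_bucket {
    int alg;          /* one of enum toy_alg */
    const int *items; /* at least 5 entries in this sketch */
    int size;
};

/* Placeholder choosers: each stands in for the real per-type algorithm. */
static int toy_uniform_choose(const struct toy_bucket *b, int x, int r)
{ (void)x; (void)r; return b->items[0]; }
static int toy_list_choose(const struct toy_bucket *b, int x, int r)
{ (void)x; (void)r; return b->items[1]; }
static int toy_tree_choose(const struct toy_bucket *b, int x, int r)
{ (void)x; (void)r; return b->items[2]; }
static int toy_straw_choose(const struct toy_bucket *b, int x, int r)
{ (void)x; (void)r; return b->items[3]; }
static int toy_straw2_choose(const struct toy_bucket *b, int x, int r)
{ (void)x; (void)r; return b->items[4]; }

/* Route the request to the chooser that matches the bucket type. */
int toy_bucket_choose(const struct toy_bucket *in, int x, int r)
{
    assert(in != 0 && in->size >= 5);
    switch (in->alg) {
    case TOY_UNIFORM: return toy_uniform_choose(in, x, r);
    case TOY_LIST:    return toy_list_choose(in, x, r);
    case TOY_TREE:    return toy_tree_choose(in, x, r);
    case TOY_STRAW:   return toy_straw_choose(in, x, r);
    case TOY_STRAW2:  return toy_straw2_choose(in, x, r);
    default:          return in->items[0]; /* unknown type: assumed fallback */
    }
}
```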

(9) crush_choose_firstn

static int crush_choose_firstn(const struct crush_map *map,
			       struct crush_bucket *bucket,
			       const __u32 *weight, int weight_max,
			       int x, int numrep, int type,
			       int *out, int outpos,
			       int out_size,
			       unsigned int tries,
			       unsigned int recurse_tries,
			       unsigned int local_retries,
			       unsigned int local_fallback_retries,
			       int recurse_to_leaf,
			       unsigned int vary_r,
			       unsigned int stable,
			       int *out2,
			       int parent_r)

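This function implements depth-first selection.
It calls crush_bucket_choose to pick the required number of replicas and checks each selected OSD for conflicts. If the OSD collides with an earlier choice, has failed, or is overloaded, it goes on to pick a new OSD.
The function recursively chooses buckets or devices of a specific type and handles collisions and failures.
During a choose step, it selects directly by calling crush_bucket_choose.
During a chooseleaf step, it recurses until it reaches a leaf node.
Parameters:
map: the crush_map
bucket: the bucket we are choosing an item from
x: the CRUSH input value
numrep: the number of items to choose
type: the type of item to choose
out: pointer to the output vector
outpos: our position in that vector
out_size: size of the out vector
tries: number of attempts to make
recurse_tries: number of attempts for the recursive chooseleaf
local_retries: localized retries
local_fallback_retries: localized fallback retries
recurse_to_leaf: true if we want one device under each item of the given type (chooseleaf instead of choose)
stable: stable mode starts rep=0 in the recursive call for all replicas
vary_r: pass r to the recursive calls
out2: second output vector for leaf items (if recurse_to_leaf)
parent_r: r value passed from the parent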
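
The retry behaviour can be illustrated on a single flat bucket of devices. This is a heavily simplified sketch, not the real function: there is no recursion, no chooseleaf, and no local retries. The names toy_hash3, toy_pick, and toy_choose_firstn are hypothetical, the chooser is a stand-in with equal weights, a weight of 0 is assumed to mean the device is out, and a rejected pick is retried with r = rep + ftotal, where ftotal counts the failures so far for that replica.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for CRUSH's hash; any well-mixed hash would do. */
static uint32_t toy_hash3(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t h = 2166136261u;
    h = (h ^ a) * 16777619u;
    h = (h ^ b) * 16777619u;
    h = (h ^ c) * 16777619u;
    h ^= h >> 15;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    return h;
}

/* Stand-in for crush_bucket_choose: longest draw wins, equal weights. */
static int toy_pick(const int *items, int size, int x, int r)
{
    int best = 0;
    uint32_t best_draw = 0;

    for (int i = 0; i < size; i++) {
        uint32_t draw = toy_hash3((uint32_t)x, (uint32_t)r, (uint32_t)items[i]);
        if (i == 0 || draw > best_draw) {
            best = i;
            best_draw = draw;
        }
    }
    return items[best];
}

/* Choose up to numrep distinct, in-service devices; returns how many were placed. */
int toy_choose_firstn(const int *items, int size,
                      const uint32_t *weight, int weight_max,
                      int x, int numrep, int *out, int tries)
{
    int outpos = 0;

    assert(size > 0 && numrep >= 0 && tries > 0);
    for (int rep = 0; rep < numrep; rep++) {
        for (int ftotal = 0; ftotal < tries; ftotal++) {
            int r = rep + ftotal; /* each failure shifts r to get a new pick */
            int item = toy_pick(items, size, x, r);
            int reject = 0;

            /* Reject devices that are out (assumed: weight 0 or unknown id). */
            if (item < 0 || item >= weight_max || weight[item] == 0)
                reject = 1;
            /* Reject collisions with items already in the output vector. */
            for (int i = 0; i < outpos && !reject; i++)
                if (out[i] == item)
                    reject = 1;

            if (!reject) {
                out[outpos++] = item;
                break;
            }
        }
        /* If every try failed, this replica is skipped and the result is short. */
    }
    return outpos;
}
```

Because later picks fill whatever slot is next, a skipped replica shifts the following entries forward in out, which is acceptable when the order of replicas does not matter.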

(10) crush_choose_indep

static void crush_choose_indep(const struct crush_map *map,
			       struct crush_bucket *bucket,
			       const __u32 *weight, int weight_max,
			       int x, int left, int numrep, int type,
			       int *out, int outpos,
			       unsigned int tries,
			       unsigned int recurse_tries,
			       int recurse_to_leaf,
			       int *out2,
			       int parent_r)

Used for erasure-coded placement.
This function implements breadth-first selection.

(11) crush_do_rule

int crush_do_rule(const struct crush_map *map,
		  int ruleno, int x, int *result, int result_max,
		  const __u32 *weight, int weight_max,
		  int *scratch)

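crush_do_rule loops over the steps of the rule and calls the matching function to choose buckets: crush_choose_firstn for depth-first selection, crush_choose_indep for breadth-first selection.
Parameters:
map: the crush map structure
ruleno: the number of the rule to apply
x: the input, usually the pg id
result: the output list of OSDs
result_max: the size of the output OSD list
weight: the weights of all OSDs, used to decide whether an OSD is out
weight_max: the number of OSDs (the length of the weight vector)
scratch: a scratch vector for private use; must be >= 3 * result_max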
