Distributed Systems Fundamentals

1. Common Metrics

Send 1 MB over a 1 Gbps network: 10 ms

Round trip within a data center: 0.5 ms

Disk seek: 8~10 ms

Read 1 MB sequentially from disk: 20~25 ms (a rate of about 40~50 MB/s)

Read 1 MB sequentially from memory: 0.25 ms (a rate of about 4 GB/s)
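As a quick sanity check, the sketch below derives transfer time and throughput from the latencies listed above. The helper names are hypothetical, and a decimal megabyte (10^6 bytes) is assumed because the list does not say which unit it uses.

```python
# Back-of-envelope check of the latency figures listed above.
# Helper names are illustrative; only the latency values come from the list.

MB = 10**6  # bytes; a decimal megabyte is an assumption


def transfer_ms(num_bytes: int, bits_per_second: float) -> float:
    """Ideal wire time in milliseconds, ignoring protocol overhead."""
    return num_bytes * 8 / bits_per_second * 1000


def throughput_mb_per_s(num_bytes: int, elapsed_ms: float) -> float:
    """Throughput implied by moving num_bytes in elapsed_ms."""
    return (num_bytes / MB) / (elapsed_ms / 1000)


# 1 MB over 1 Gbps: the ideal wire time is about 8 ms, listed as 10 ms.
network_ms = transfer_ms(1 * MB, 1e9)

# 1 MB in 25 ms from disk is about 40 MB/s.
disk_rate = throughput_mb_per_s(1 * MB, 25)

# 1 MB in 0.25 ms from memory is about 4000 MB/s.
memory_rate = throughput_mb_per_s(1 * MB, 0.25)

print(network_ms, disk_rate, memory_rate)
```

The gap between the ideal 8 ms and the listed 10 ms is plausibly protocol and framing overhead, which this sketch does not model.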

2. Performance Estimation

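An estimate has to start from assumptions about the execution environment: the cluster size, the machine configuration, and the share of resources taken by other services running on the cluster.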

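Common estimation scenarios:

1) In-memory sort time: sort time = comparison time (dominated by branch mispredictions) + memory access time.

2) MapReduce job processing time: Map time + shuffle and sort time + Reduce time (shuffle, Map processing, and sorting can partly run in parallel, but the estimate does not need to account for that); Map time = input read time + Map function time + intermediate output write time; Reduce time = Reduce function time + final output write time.

3) Bigtable design performance analysis: single-disk read time = disk seek time + read time; achievable theoretical throughput = n disks * 1 / single-disk read time.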
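The three formulas above can be written down as small functions. This is only a sketch: the function and parameter names are hypothetical, the sort model assumes roughly n * log2(n) comparisons and log2(n) passes over the data, and the example inputs (half of all comparisons mispredicted, 5 ns per misprediction, 4 GB/s memory bandwidth, 2 ms per disk read) are assumptions chosen for illustration, not measurements.

```python
import math


def sort_time_s(n_items, item_bytes, mispredict_rate,
                mispredict_penalty_s, mem_bytes_per_s):
    """Sort time = comparison time (branch mispredictions) + memory access time."""
    passes = math.log2(n_items)            # assumed: about log2(n) passes over the data
    comparisons = n_items * passes         # assumed: about n * log2(n) comparisons
    compare_s = comparisons * mispredict_rate * mispredict_penalty_s
    memory_s = passes * n_items * item_bytes / mem_bytes_per_s
    return compare_s + memory_s


def mapreduce_time_s(input_read_s, map_fn_s, intermediate_write_s,
                     shuffle_sort_s, reduce_fn_s, output_write_s):
    """Map time + shuffle/sort time + Reduce time, with no overlap assumed."""
    map_s = input_read_s + map_fn_s + intermediate_write_s
    reduce_s = reduce_fn_s + output_write_s
    return map_s + shuffle_sort_s + reduce_s


def disk_reads_per_s(n_disks, seek_s, read_s):
    """Theoretical read rate: n disks * 1 / (seek time + read time)."""
    return n_disks * 1 / (seek_s + read_s)


# Sorting 2**28 four-byte integers under the assumed inputs: roughly 26 seconds.
sort_s = sort_time_s(2**28, 4, 0.5, 5e-9, 4e9)

# Ten disks, 10 ms seek (from the metrics list), assumed 2 ms read per request.
reads = disk_reads_per_s(10, 0.010, 0.002)

print(round(sort_s, 1), round(reads))
```

Because each term is a separate parameter, it is easy to see which one dominates: in the sort example the misprediction cost outweighs the memory traffic, and in the disk example the seek time dominates the read time.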

3. CAP

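Consistency: any read always returns the result of the writes that completed before it;

Availability: every operation always returns within a bounded time;

Partition tolerance (Tolerance of network Partition): when a network partition occurs, the system can still satisfy consistency and availability;

The CAP theorem states that the three cannot all hold at once. The argument runs as follows: suppose a network partition splits the system into two parts, G1 and G2, and a write W1 is followed by a read R2, where W1 writes to G1 and R2 reads from G2. Because G1 and G2 cannot communicate, if R2 terminates (availability), it cannot return the result of W1 (so consistency is violated).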
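The argument above can be replayed with a toy model. The following is a minimal sketch, not a real storage system: the class `PartitionedStore` and its flag `prefer_availability` are hypothetical names, and replication is modeled as a direct copy from G1 to G2 that a partition blocks.

```python
class PartitionedStore:
    """Two replicas, G1 and G2; a partition blocks replication between them."""

    def __init__(self, prefer_availability: bool):
        self.g1_value = None
        self.g2_value = None
        self.partitioned = False
        self.prefer_availability = prefer_availability

    def write_g1(self, value):
        """W1: write to G1 and replicate to G2 only if the network allows it."""
        self.g1_value = value
        if not self.partitioned:
            self.g2_value = value

    def read_g2(self):
        """R2: read from G2, which cannot see writes made during a partition."""
        if self.partitioned and not self.prefer_availability:
            # Keep consistency by giving up availability: refuse to answer.
            raise TimeoutError("G2 cannot confirm the latest write")
        # Keep availability by giving up consistency: answer, possibly stale.
        return self.g2_value


ap = PartitionedStore(prefer_availability=True)
ap.write_g1("v0")
ap.partitioned = True
ap.write_g1("v1")
stale = ap.read_g2()   # terminates, but returns "v0" instead of "v1"

cp = PartitionedStore(prefer_availability=False)
cp.partitioned = True
cp.write_g1("v1")
try:
    cp.read_g2()
    refused = False
except TimeoutError:
    refused = True     # stays consistent, but the read does not complete
```

Either branch of `read_g2` gives up one property during the partition, which is exactly the choice the proof describes.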

4. Consistency Models

1) Strong consistency: once a write completes, every subsequent read returns the written value.

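2) Weak consistency: an "inconsistency window" exists, and subsequent reads are not guaranteed to return the latest value.

3) Eventual consistency: a special form of weak consistency in which, if no new updates are made, all reads eventually return the last written value. If no failures occur, the size of the "inconsistency window" depends on the following factors: communication delay, system load, and the number of replicas used by the replication scheme.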

5. NoSQL and SQL

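Common models:

KV model: supports only the simplest operations on <key, value> pairs;

Models with a simple table schema, such as the Bigtable model

NoSQL systems share some design principles:

Assume that failures will inevitably happen

Restrict the application patterns; the supported interface will never be as rich as SQL

Support scaling out by multiples (for example, doubling capacity); a commonly used algorithm is consistent hashing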
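To illustrate the last principle, here is a minimal consistent-hashing sketch. The class name `HashRing`, the use of virtual nodes, and the choice of SHA-256 as the hash are assumptions made for this example; real systems differ in all three.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring; each node owns several virtual points on the ring."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._points = []   # sorted hash positions on the ring
        self._owner = {}    # hash position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key: str) -> int:
        # First 8 bytes of SHA-256, read as an unsigned 64-bit integer.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add(self, node: str) -> None:
        """Place the node's virtual points on the ring."""
        for i in range(self.vnodes):
            point = self._hash(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = node

    def lookup(self, key: str) -> str:
        """Return the node owning the first point clockwise from the key."""
        i = bisect.bisect_right(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[i]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"key{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}

ring.add("node-d")  # scale out by one node
moved = [k for k in keys if ring.lookup(k) != before[k]]
print(len(moved), "of", len(keys), "keys moved")
```

The point of the design is that adding a node only moves keys onto the new node; keys never shuffle between the existing nodes, so the amount of data to migrate stays proportional to the new node's share.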

6. Two-Phase Commit

For an explanation, see: http://www.nosqlnotes.net/archives/62#more-62

The two-phase commit algorithm (from "Distributed Systems: Principles and Paradigms"):
Coordinator:
write START_2PC to local log;
multicast VOTE_REQUEST to all participants;
while not all votes have been collected {
    wait for any incoming vote;
    if timeout {
        write GLOBAL_ABORT to local log;
        multicast GLOBAL_ABORT to all participants;
        exit;
    }
    record vote;
}
if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
    write GLOBAL_COMMIT to local log;
    multicast GLOBAL_COMMIT to all participants;
} else {
    write GLOBAL_ABORT to local log;
    multicast GLOBAL_ABORT to all participants;
}
Participant:
write INIT to local log;
wait for VOTE_REQUEST from coordinator;
if timeout {
    write VOTE_ABORT to local log;
    exit;
}
if participant votes COMMIT {
    write VOTE_COMMIT to local log;
    send VOTE_COMMIT to coordinator;
    wait for DECISION from coordinator;
    if timeout {
        multicast DECISION_REQUEST to other participants;
        wait until DECISION is received; /* remain blocked */
        write DECISION to local log;
    }
    if DECISION == GLOBAL_COMMIT
        write GLOBAL_COMMIT to local log;
    else if DECISION == GLOBAL_ABORT
        write GLOBAL_ABORT to local log;
} else {
    write VOTE_ABORT to local log;
    send VOTE_ABORT to coordinator;
}

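In addition, each participant runs a dedicated thread that handles DECISION_REQUEST messages from other participants. The handler thread works as follows: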
while true {
    wait until any incoming DECISION_REQUEST is received;
    read most recently recorded STATE from the local log;
    if STATE == GLOBAL_COMMIT
        send GLOBAL_COMMIT to requesting participant;
    else if STATE == INIT or STATE == GLOBAL_ABORT
        send GLOBAL_ABORT to requesting participant;
    else
        skip; /* participant remains blocked */
}
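The message flow in the pseudocode above can be exercised with a small in-process simulation. This is a sketch under simplifying assumptions, not the textbook's implementation: messages are plain method calls, an unreachable participant stands in for a timeout, the coordinator itself is assumed to vote COMMIT, and the DECISION_REQUEST recovery path is left out. The names `Participant`, `Vote`, and `run_two_phase_commit` are hypothetical.

```python
from enum import Enum


class Vote(Enum):
    COMMIT = "VOTE_COMMIT"
    ABORT = "VOTE_ABORT"


class Participant:
    def __init__(self, name, will_commit=True, reachable=True):
        self.name = name
        self.will_commit = will_commit
        self.reachable = reachable   # False models a timeout at the coordinator
        self.log = ["INIT"]

    def on_vote_request(self):
        """Phase 1: log the vote before sending it; None models no reply."""
        if not self.reachable:
            return None
        vote = Vote.COMMIT if self.will_commit else Vote.ABORT
        self.log.append(vote.value)
        return vote

    def on_decision(self, decision):
        """Phase 2: only a participant that voted COMMIT waits for the decision."""
        if self.reachable and self.log[-1] == Vote.COMMIT.value:
            self.log.append(decision)


def run_two_phase_commit(participants, coordinator_log):
    coordinator_log.append("START_2PC")
    votes = [p.on_vote_request() for p in participants]   # multicast VOTE_REQUEST
    all_commit = all(v is Vote.COMMIT for v in votes)      # a missing vote aborts
    decision = "GLOBAL_COMMIT" if all_commit else "GLOBAL_ABORT"
    coordinator_log.append(decision)                       # log before multicasting
    for p in participants:
        p.on_decision(decision)
    return decision


coord_log = []
group = [Participant("p1"), Participant("p2"), Participant("p3", will_commit=False)]
outcome = run_two_phase_commit(group, coord_log)
print(outcome, [p.log for p in group])
```

A single VOTE_ABORT, or a single missing vote, is enough to abort the whole transaction, while a commit needs every participant to have logged VOTE_COMMIT first.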

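If the coordinator suffers a permanent failure, such as a dead disk, the transaction becomes a permanently abandoned orphan. One workable solution is to have a standby coordinator take over when the current coordinator goes down. Because only one coordinator may exist at any moment, the two must go through an election, and this is where the Paxos protocol is needed.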

7. Paxos

The Paxos election process is as follows:
 Phase 1
(a) A proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors.
(b) If an acceptor receives a prepare request with number n greater than that of any prepare request to which it has already responded, then it responds to the request with a promise not to accept any more proposals numbered less than n and with the highest-numbered proposal (if any) that it has accepted.
 Phase 2
(a) If the proposer receives a response to its prepare requests (numbered n) from a majority of acceptors, then it sends an accept request to each of those acceptors for a proposal numbered n with a value v, where v is the value of the highest-numbered proposal among the responses, or is any value if the responses reported no proposals.
(b) If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request having a number greater than n.
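The two phases above can be sketched for a single decision as follows. This is a simplified, single-process illustration: there is no networking, no failure handling, and no persistence, and the names `Acceptor` and `propose` are hypothetical. Proposal numbers are assumed to be positive integers, and the proposer contacts all acceptors rather than just a majority.

```python
class Acceptor:
    def __init__(self):
        self.promised_n = 0     # highest prepare number responded to so far
        self.accepted = None    # (n, value) of the highest-numbered accepted proposal

    def on_prepare(self, n):
        """Phase 1(b): promise to ignore lower numbers; report any accepted proposal."""
        if n > self.promised_n:
            self.promised_n = n
            return True, self.accepted
        return False, None

    def on_accept(self, n, value):
        """Phase 2(b): accept unless a higher-numbered prepare was already answered."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, value):
    """Run both phases for proposal number n; return the chosen value or None."""
    majority = len(acceptors) // 2 + 1

    # Phase 1(a): send prepare(n) and keep the acceptors that promised.
    promised = []
    for acceptor in acceptors:
        ok, prior = acceptor.on_prepare(n)
        if ok:
            promised.append((acceptor, prior))
    if len(promised) < majority:
        return None

    # Phase 2(a): adopt the value of the highest-numbered reported proposal, if any.
    reported = [prior for _, prior in promised if prior is not None]
    if reported:
        value = max(reported, key=lambda proposal: proposal[0])[1]

    accepted = sum(acceptor.on_accept(n, value) for acceptor, _ in promised)
    return value if accepted >= majority else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, 1, "A")    # nothing accepted yet, so "A" is chosen
second = propose(acceptors, 2, "B")   # must adopt the already accepted "A"
stale = propose(acceptors, 1, "C")    # number too low, no promises are given
print(first, second, stale)
```

The second call shows the key safety property: once a value has been accepted by a majority, a later proposer with a higher number ends up proposing that same value instead of its own.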
