How do we estimate the reliability of a distributed storage system?

This article is published by NetEase Cloud

Under normal circumstances, we generally use multi-replica techniques to improve the reliability of a storage system, whether it is a structured database (such as the typical MySQL), a document-oriented NoSQL database (such as MongoDB), or a conventional blob storage system (such as GFS or Hadoop).

Because data can almost be regarded as the lifeblood of an enterprise, ensuring the reliability of the data storage system is not a trivial matter for any company.

Data loss and copysets (replication groups)

"In a 3-copy storage system consisting of 999 disks, what is the probability of data loss when three disks fail at the same time?" The answer is closely related to the design of the storage system. Let us first consider two extreme designs.

Design 1: Group the 999 disks into 333 disjoint groups of three disks each.

With this design, data is lost only when the three failed disks happen to be exactly one of these groups.

In this design, the probability of data loss is 333/C(999,3) ≈ 2.01×10⁻⁶.

Design 2: Data is randomly scattered across all 999 disks.

In the extreme case, the replicas of the logical data held on any given disk are scattered across all of the other 998 disks in the cluster. Under this design, the probability of data loss is C(999,3)/C(999,3) = 1; that is, any three simultaneous disk failures are certain to lose some data.
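The two extreme designs can be checked numerically. A minimal sketch in Python, where 999 and 3 are the cluster size and replica count from the example above:

```python
from math import comb

N, R = 999, 3

# Design 1: 333 fixed, disjoint groups of 3 disks -> 333 copysets
p_design1 = 333 / comb(N, R)

# Design 2: replicas scattered at random -> every 3-disk subset is a copyset
p_design2 = comb(N, R) / comb(N, R)

print(p_design1)  # ≈ 2.01e-06
print(p_design2)  # 1.0
```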

Through these two extreme examples, we can see that the probability of data loss is closely related to how widely the data is dispersed. For convenience in the rest of the article, we introduce a new concept: the copyset (replication group).

CopySet: the combination of devices that holds all replicas of a piece of data. For example, if a piece of data is written to disks 1, 2, and 3, then {1, 2, 3} is a copyset.

In a cluster of 9 disks with 3 replicas, the minimum number of copysets is 3: copysets = {1,2,3}, {4,5,6}, {7,8,9}. That is, each write may choose only one of these replication groups, and data is lost only if {1,2,3}, {4,5,6}, or {7,8,9} fails as a whole at the same time. In general, the minimum number of copysets is N/R.

The maximum number of copysets in the system is C(N,R), where R is the number of replicas and N is the number of disks. If the nodes holding replica data are chosen completely at random on each write, the number of copysets in the system approaches the maximum C(N,R). That is, for any R disks chosen arbitrarily, some piece of data has all R of its replicas on exactly those disks.

In a storage system with N disks and R replicas, the number of copysets S satisfies N/R ≤ S ≤ C(N,R).
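As a quick sanity check of these bounds, using the 9-disk, 3-replica example above:

```python
from math import comb

N, R = 9, 3
s_min = N // R       # disjoint groups, e.g. {1,2,3}, {4,5,6}, {7,8,9}
s_max = comb(N, R)   # fully random replica placement
print(s_min, s_max)  # 3 84
```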

Disk Failure and Storage System Reliability Estimation


1. Disk failures and the Poisson distribution

Before formally estimating the relevant probabilities, we need to introduce a basic probability distribution: the Poisson distribution. The Poisson distribution describes the probability that a random event occurs a given number of times in a system, such as the probability that the number of passengers waiting at a bus stop takes a certain value, or the probability that N babies are born in a hospital within one hour. For a more intuitive introduction, see Ruan Yifeng's "Poisson and exponential distributions: a 10-minute tutorial".


The Poisson distribution is given by:

P(N(t) = n) = (λt)^n · e^(−λt) / n!

where P denotes a probability, N(t) is the number of events occurring within time t, n is the event count, and λ is the rate at which events occur.

For example, the probability that 10 of 1000 disks fail within one year is P(N(365) = 10) (note: t is measured in days). λ is the number of disk failures per day across the 1000 disks; according to Google's statistics, the annual failure rate is about 8%, so λ = 1000 × 8% / 365.
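This example can be evaluated directly. A small sketch, assuming the 8% annual failure rate quoted above:

```python
import math

def poisson_pmf(n, lam, t):
    """P(N(t) = n): probability of exactly n events in time t
    for a Poisson process with rate lam per unit time."""
    mu = lam * t
    return math.exp(-mu) * mu**n / math.factorial(n)

# 1000 disks, 8% annual failure rate (Google's statistic), t in days
lam = 1000 * 0.08 / 365           # expected disk failures per day
p10 = poisson_pmf(10, lam, 365)   # P(exactly 10 failures in one year)
print(p10)
```

With an expected 80 failures per year, the probability of seeing exactly 10 is vanishingly small, as the computation confirms.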

The above only gives the probability that N disks fail. How do we use this formula to approximate the reliability of a distributed system, i.e. the probability of data loss?

2. Estimating the loss probability in a distributed storage system


2.1 Failure probability within time T

To estimate the annual failure rate of a distributed storage system, let us first assume a simple scenario: T is 1 year, the system is full of data, and failed disks are never repaired. Under these conditions we compute the annual probability of data loss.

First, let us define some quantities:

N: number of disks
T: observation period
K: number of failed disks
S: number of copysets (replication groups) in the system
R: number of replicas

To compute the probability of data loss within T (1 year), from a probabilistic standpoint we must account for every event within T that could cause data loss. In a system with N disks and R-way replication, the events that can cause data loss within T are those in which at least R disks fail: R, R+1, R+2, …, N failures (i.e., all events with K ∈ [R, N]). When one of these random events occurs, under what condition is data actually lost? Exactly when the failed disks hit a replication group.

Given K failed disks (i.e., K disks chosen at random), the probability of hitting a replication group is:

p = X/C(N,K), where X is the number of ways of choosing K disks that hit at least one replication group

Then the probability that K disk failures cause data loss is:

Pa(T,K) = p * P(N(T)=K)

Finally, the probability of data loss within time T is the sum of the probabilities of all events that can cause data loss:

Pb(T) = Σ Pa(T,K) ; K∈[R,N]

2.2 Measuring the annual failure rate of a distributed system

Above we assumed that no recovery is performed on any hardware failure during the year, so substituting one year for t yields the annual failure rate under that assumption. In a large-scale storage system, however, a recovery procedure is usually launched when data is at risk, and once recovery completes the system is, in theory, back to its initial state as far as future random events are concerned. Taking this factor into account makes the reliability calculation considerably more complex.

In theory, disk failures and recoveries in a large-scale storage system form an extremely complex continuous process. Here we simplify the probability model into discrete, independent events within separate unit-time windows T. As long as the probability of events spanning two adjacent windows is negligible, and the vast majority of disk failures within a window T can be recovered in that window, then each new window starts from a fresh state and the estimate remains approximately correct. With T measured in hours, a year divides into 365×24/T windows, and the annual failure rate can be understood as 100% minus the probability that no window experiences data loss.



That is, the overall probability of data loss in the system is:

Pc = 1 − (1 − Pb(T))^(365×24/T)
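Putting the pieces together, here is a hedged sketch of the whole estimate. The article does not give a closed form for X; this sketch assumes the union upper bound X ≈ S·C(N−R, K−R), which slightly overcounts when copysets overlap, so the result is a conservative approximation:

```python
from math import comb, exp

def annual_loss_probability(N, R, S, afr, T_hours):
    """Sketch of the article's estimate. Assumes failed disks are
    fully re-replicated within each window of T_hours, and uses the
    union upper bound X ~= S * C(N-R, K-R) for the number of K-disk
    combinations that hit at least one copyset."""
    lam = N * afr / (365 * 24)      # expected disk failures per hour
    mu = lam * T_hours              # expected failures per window T
    pmf = exp(-mu)                  # P(N(T) = 0)
    pb = 0.0                        # Pb(T): loss probability in one window
    for K in range(1, N + 1):
        pmf *= mu / K               # P(N(T) = K) from P(N(T) = K-1)
        if K < R:
            continue                # fewer than R failures cannot lose data
        x = min(S * comb(N - R, K - R), comb(N, K))
        pb += (x / comb(N, K)) * pmf        # Pa(T, K) = p * P(N(T) = K)
    windows = 365 * 24 // T_hours
    return 1 - (1 - pb) ** windows          # Pc

# 999 disks, 3 replicas, 8% AFR, 24-hour recovery window
p_min_cs = annual_loss_probability(999, 3, 333, 0.08, 24)          # minimal copysets
p_max_cs = annual_loss_probability(999, 3, comb(999, 3), 0.08, 24) # random placement
print(p_min_cs, p_max_cs)
```

As expected from the copyset discussion, the design with only 333 copysets yields a far lower annual loss probability than fully random placement.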


NetEase Cloud Object Storage Service

NetEase Object Storage (NOS) is a high-performance, highly available, and highly reliable cloud storage service. NOS supports a standard RESTful API and provides rich online data-processing services, a one-stop solution to the problem of managing unstructured data in the Internet era.

NetEase Cloud uses a multi-replica mechanism to protect user files: if any server or disk fails, data recovery begins immediately, ensuring that data remains safe. Users are welcome to try the service.

Finally, readers who wish to study the subject of this article (estimating the reliability of distributed storage systems) further may refer to another article by the author: work-jlsun.github.io/20

References:

Google's Disk Failure Experience

Poisson distribution

Poisson and exponential distributions: a 10-minute tutorial

Probability theory: the binomial and Poisson distributions


Learn more about NetEase Cloud

NetEase Cloud website: https://www.163yun.com/

New-user gift package: https://www.163yun.com/gift

NetEase Cloud community: https://sq.163yun.com/


