A guess at Qiniu's storage algorithm

Like many people, I have the habit of bookmarking pages as I browse the web. While sorting through old bookmarks recently, I found I had accumulated a number of pages about the storage strategy of Qiniu, so I decided to compile them into an article as a memory aid. I hope it will be useful to others as well.

Because Qiniu's storage strategy is mainly based on erasure codes (EC), the discussion below starts from erasure codes.

Introduction: What is an erasure code?

The explosive growth of data has driven storage systems to ever larger scales, but the reliability of storage devices has not improved accordingly (SSD reliability has steadily declined from SLC to MLC to TLC, and hard-disk reliability cannot keep up as areal density grows), which poses a huge challenge for persistent data storage. In addition, as storage systems grow, cold data accumulates far faster than hot data. How to store cold data safely and retrieve it when needed has also become an important challenge for storage systems.

The copy strategy and the encoding strategy are the two main methods of providing data redundancy. When part of the original data is lost, either strategy can still recover the data correctly. The copy strategy stores one or more full copies of the original data, while the encoding strategy divides the original data into blocks and encodes them to generate redundant blocks, guaranteeing that the original data can still be recovered when up to a certain number of blocks are lost.

The performance comparison of the two strategies is as follows:

| Strategy | Storage efficiency | Computational overhead | Repair efficiency |
| --- | --- | --- | --- |
| Copy strategy | Low | None | High |
| Coding strategy | High | Yes | Low |

Although the encoding strategy has higher computational overhead and lower repair efficiency than the copy strategy (discussed later), its much lower storage overhead still wins it a large role. In practice, storage systems use both strategies: distributed storage systems (HDFS, GFS, TFS) often keep hot data with a copy strategy while encoding cold data to save storage space, and ECC encoding is likewise used in ECC memory and SSD pages to provide data redundancy.

From the perspective of information theory and coding, erasure codes belong to linear block codes. The encoding process can be expressed as the multiplication (dot product) of a coding matrix GM with the blocked data; in other words, the coding matrix GM defines how the data is encoded into redundant data.

                                           

[Figure: coding matrix GM]

C0 through C5 are the redundant data blocks. Together they can be written as the product GM × D, where D = {D0, D1, D2, D3}. Each redundant block Ci is the dot product of the corresponding row of GM with the data blocks (highlighted in yellow in the figure). Each element of the coding matrix GM is the multiplication coefficient applied to the corresponding original data block.
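The encoding step C = GM × D can be sketched in code. As a simplifying assumption, this hypothetical example takes the matrix entries from GF(2), so "multiplying" a row by the data blocks reduces to selecting blocks and "adding" reduces to XOR; a real erasure code would draw coefficients from a larger GF(2^w).

```python
# Sketch of C = GM x D with coefficients restricted to GF(2): each row of
# the coding matrix selects which data blocks are XORed into one
# redundant block. (Illustrative only; GM and D here are made up.)
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def encode(gm, data_blocks):
    """Compute one redundant block per row of the coding matrix gm."""
    return [xor_blocks([d for coef, d in zip(row, data_blocks) if coef])
            for row in gm]

D = [b"\x01\x02", b"\x03\x04", b"\x05\x06", b"\x07\x08"]  # D0..D3
GM = [[1, 1, 1, 1],   # C0 = D0 + D1 + D2 + D3
      [1, 0, 1, 0]]   # C1 = D0 + D2
C = encode(GM, D)     # two redundant blocks
```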

The above is excerpted from: Erasure Codes in Storage Systems—Overview (1)

The basic field of operations in erasure coding: finite fields

  • What is a finite field?

In mathematics, a set of symbols together with operations defined on them constitutes a "field". For example, the rational numbers with their arithmetic operations (addition, subtraction, multiplication and division) form a field. If the number of elements in the field is finite, it is called a finite field; otherwise it is an infinite field. The rational numbers form an infinite field, while the familiar Boolean domain {0, 1} is a finite field. The size of a finite field is always a prime number (denoted p) or a power of a prime (p^q).

  • Operational properties of fields
    • Closure of addition and multiplication: the results of addition and multiplication remain in the field
    • Commutativity of addition and multiplication: a + b = b + a and a · b = b · a
    • Associativity of addition and multiplication: a + (b + c) = (a + b) + c and a · (b · c) = (a · b) · c
    • Distributivity of multiplication over addition: a · (b + c) = (a · b) + (a · c)
    • Existence of additive and multiplicative identity elements: a + 0 = a and a · 1 = a (here 0 and 1 are not the natural numbers 0 and 1; they denote the field's additive and multiplicative identity elements)
    • Every element of the field has an additive inverse, and every nonzero element a multiplicative inverse: a + (−a) = 0 and a · a^-1 = 1
  • Computer arithmetic and finite fields

As everyone knows, every computer language, whether static or dynamic, has its basic data types, and every data type has a bounded range of values, so arithmetic on them is necessarily arithmetic over a finite domain. Finite fields were first described by the mathematician Évariste Galois, and one construction of finite fields is due to him, so they are also called Galois Fields, written GF(p) and GF(p^q). In computers p is 2, because data is represented in binary and operations are binary operations, giving GF(2^w), where w can be 1, 2, 4, 8, 16, and so on. For w = 1, addition in GF(2) is the XOR operation and multiplication is the AND operation. For w greater than 1, addition is still XOR, while multiplication is somewhat different and will not be detailed here.
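A sketch of this arithmetic in code: addition in GF(2^w) is plain XOR, while multiplication multiplies the operands as binary polynomials and reduces modulo an irreducible polynomial. The reduction polynomial 0x11b used below is the one AES uses for GF(2^8), chosen here purely for illustration; the text does not say which polynomial any given storage system uses.

```python
# GF(2^8) arithmetic: addition is XOR; multiplication is carry-less
# polynomial multiplication reduced modulo an irreducible polynomial
# (0x11b, i.e. x^8 + x^4 + x^3 + x + 1, is the polynomial AES uses).

def gf256_add(a, b):
    return a ^ b  # in GF(2^w), addition and subtraction are both XOR

def gf256_mul(a, b, poly=0x11b):
    """Russian-peasant multiplication in GF(2^8)."""
    result = 0
    while b:
        if b & 1:              # lowest bit of b set: add current a
            result ^= a
        a <<= 1                # multiply a by x
        if a & 0x100:          # degree reached 8: reduce modulo poly
            a ^= poly
        b >>= 1
    return result

# For w = 1 the same rules degenerate to XOR and AND on single bits:
assert gf256_add(1, 1) == 0 and gf256_mul(1, 1) == 1
```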

Qiniu's storage scheme

Technically, the core demands on a storage system are cost and reliability, and the two are in tension: to reduce the risk of losing data you must increase the number of copies of each piece of data, but increasing the number of copies necessarily increases cost. Qiniu uses an EC redundancy algorithm to balance this tension. The algorithm splits a piece of data into M parts, feeds these M parts into a system of multivariate linear equations to compute N parity parts, and then stores all M + N parts. If any one or more of the stored M + N parts is damaged, the damaged parts can be recomputed through the system of linear equations. From this principle it is easy to conclude that a system using the EC redundancy algorithm can tolerate at most N damaged parts without losing data. Qiniu uses some techniques of its own to make both M and N fairly large, with M far greater than N, so that the system's replication factor, (M + N) / M with M much larger than N, is very low, while reliability is very high: up to N parts (N itself being fairly large) may be damaged simultaneously. As mentioned above, Qiniu's EC redundancy algorithm achieves extremely high data reliability, and on top of it Qiniu introduces mutual backup across two data centers to survive a catastrophic single-machine-room disaster. Through these efforts, Qiniu achieves zero loss of its customers' data.
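The M-data/N-parity idea above can be made concrete with a toy example. This hypothetical sketch uses exact rational arithmetic (Python's Fraction) and Vandermonde-style parity equations instead of the finite-field arithmetic a production system would use, purely so the linear algebra is easy to follow; recovering lost parts is literally solving the surviving linear equations:

```python
# Toy illustration of the M + N scheme described above. A real system
# works over GF(2^w); here exact rational arithmetic lets plain Gaussian
# elimination recover the data. (Hypothetical example, not Qiniu's
# actual construction.)
from fractions import Fraction

M, N = 4, 2
data = [Fraction(v) for v in (7, 1, 8, 2)]

# Rows 0..M-1 reproduce the data itself; rows M..M+N-1 are parity
# equations built from Vandermonde-style coefficients x^j.
rows = [[Fraction(int(i == j)) for j in range(M)] for i in range(M)]
rows += [[Fraction(x) ** j for j in range(M)] for x in range(2, 2 + N)]
stored = [sum(c * d for c, d in zip(row, data)) for row in rows]  # M+N shards

def recover(surviving):
    """Rebuild the data from any M surviving (row_index, value) pairs
    by Gaussian elimination on the corresponding linear system."""
    A = [rows[i][:] + [v] for i, v in surviving[:M]]   # augmented matrix
    for col in range(M):
        piv = next(r for r in range(col, M) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        A[col] = [a / A[col][col] for a in A[col]]     # normalize pivot row
        for r in range(M):
            if r != col and A[r][col] != 0:
                A[r] = [a - A[r][col] * b for a, b in zip(A[r], A[col])]
    return [A[r][M] for r in range(M)]

# Lose shards 0 and 2 (up to N = 2 losses tolerated); rebuild from the rest:
surviving = [(i, stored[i]) for i in (1, 3, 4, 5)]
assert recover(surviving) == data
```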

Field algebra follows the addition, subtraction, multiplication and division rules of ordinary algebra, but values are confined to a finite range: no matter how you compute, the result stays within the domain 0 to 255, hence the name field algebra. A stored file can be viewed as a sequence of values from 0 to 255. For example, a 100K file can be split into 10 parts of 10K each, stored in 10 different places, yet the file is still a single file. Using addition in the field (which in a computer is just the XOR operation), one parity part is derived from these 10 parts, making 11 parts in total, for a redundancy factor of 1.1. This is a parity-based storage scheme: it is cheap, yet its effect is similar to keeping two copies, because if any one of the parts is lost, it can be recovered.
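The 10-plus-1 parity scheme described in the passage above can be sketched as follows (a minimal illustration; the shard size and contents are made up):

```python
# The single-parity scheme from the passage above: split the data into 10
# shards, derive one parity shard with XOR (field addition), and rebuild
# any single lost shard from the surviving 10.
from functools import reduce

def xor_shards(shards):
    """Byte-wise XOR of equal-length shards."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), shards)

data = bytes(range(100))                     # stand-in for the 100K file
size = len(data) // 10
shards = [data[i * size:(i + 1) * size] for i in range(10)]
parity = xor_shards(shards)                  # the 11th shard, 1.1x redundancy

# Simulate losing shard 3 and recovering it from the other 10 shards:
survivors = shards[:3] + shards[4:] + [parity]
rebuilt = xor_shards(survivors)
assert rebuilt == shards[3]
```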

  • «Those things about storage systems: thoughts on the launch of Qiniu's new storage (V2), Xu Shiwei, June 5, 2014»

The first highlight of the new storage is the introduction of an arithmetic redundancy scheme, erasure coding (EC), instead of the classic 3-replica redundancy scheme. Our EC configuration is 28+4: a file is split into 28 parts, 4 redundant parts are computed from those 28, and the resulting 32 parts are stored on 32 different machines. This is both cheaper and improves reliability and availability. On cost: to store the same 1 PB of data, the storage servers needed are only 36.5% of what 3-replica storage requires, which is very good economics. On reliability: 3 replicas only tolerate 2 simultaneously failed disks, while we now tolerate 4 simultaneously failed disks, intuitively a large improvement. On availability: previously 2 servers could be offline at once; now 4 servers can be offline at once. The second highlight of the new storage is repair speed: we reduced single-disk repair time from three hours to under 30 minutes, and repair time matters greatly for reliability. The 28+4 (M=4) scheme bounds the probability of data loss at 1e-16, i.e. 16 nines of reliability.

What is debatable in the paragraph above is the 36.5% cost figure: I do not know how Mr. Xu computed it. I believe it should be 38.1% [(28 + 4) / (28 × 3) = 38.1%].
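The arithmetic can be checked directly:

```python
# Checking the storage-cost arithmetic disputed above: a 28+4 EC scheme
# stores 32 shards for every 28 data shards, versus 3 full copies under
# 3-replica storage.
ec_bytes_per_data_byte = (28 + 4) / 28               # ~1.143
replica_bytes_per_data_byte = 3
ratio = ec_bytes_per_data_byte / replica_bytes_per_data_byte
print(f"EC cost is {ratio:.2%} of 3-replica cost")   # 38.10%, not 36.5%
```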

The three quoted passages above are fragments about Qiniu's storage scheme that I collected from the Internet. What can be confirmed is that the computation must be an erasure-coding algorithm over a finite field. What cannot be confirmed is the erasure-coding algorithm itself: in the 28+4 (M=4) scheme [which, following Qiniu CTO Han Tuo's terminology, should really be M=28, N=4], how are the 4 redundant parts computed?

Broadly speaking, erasure-coding algorithms fall into two classes, XOR codes and RS codes. XOR codes are based on the finite field GF(2), so the additions in their encoding and decoding can be completed quickly by the computer as bit-wise exclusive-OR operations. Common XOR codes include:

  • Low-Density Parity-Check codes (LDPC)
  • Cauchy Reed-Solomon codes (CRS)
  • RAID codes (e.g. RAID 5, RAID 6)
  • EVENODD codes
  • X-code

From the second quoted passage, what can be inferred is that Qiniu's storage algorithm should be based on the finite field GF(2^8), and its erasure code should be a kind of XOR code. Its GM should be a 4-row, 7-column matrix (each matrix element can itself be viewed as a matrix block, i.e. one data sub-block), D should be a 7-row, 1-column matrix, and the parity matrix C should be a 4-row, 1-column matrix (each element being one parity data block).

My own deduction can only go this far. Perhaps as Qiniu discloses more material in the future, I will continue to extend this article.

The analysis above may well contain many errors; I invite the experts to offer corrections so this article can keep improving and we can all learn together.


Origin: blog.csdn.net/menggucaoyuan/article/details/41913555