Data location in distributed storage and the corresponding algorithms

Reposted from: http://blog.chinaunix.net/uid-20593827-id-4046681.html

 

In a distributed storage solution, when a client needs to access a piece of data (identified by a path and offset, an id, or a file object), the first step is to work out which servers the data is stored on. There are two approaches: one is to designate one or several servers to manage a separate store of location mapping data; the other is to compute the location directly from the path and offset, id, or file object through a carefully designed algorithm.
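To make the contrast concrete, here is a minimal Python sketch of the two lookup styles; the server names and table contents are invented for illustration. The first style asks a metadata service for the location, the second computes it from the object name alone.

import hashlib

# Approach 1: a dedicated metadata service keeps an explicit location table
# (a hypothetical in-memory stand-in for something like an HDFS namenode).
location_table = {"photos/cat.jpg": ["server-1", "server-3"]}

def locate_via_metadata(name):
    return location_table.get(name)

# Approach 2: compute the location directly from the name with a fixed algorithm,
# so there is no central table to consult or keep consistent.
SERVERS = ["server-0", "server-1", "server-2", "server-3"]

def locate_via_algorithm(name):
    digest = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return SERVERS[digest % len(SERVERS)]

print(locate_via_metadata("photos/cat.jpg"))
print(locate_via_algorithm("photos/cat.jpg"))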

The first method is simple and easy to implement (for example the HDFS namenode or the Swift proxy node), but its drawback is also obvious: it is a single point of failure, and HA or load balancers must be used to provide adequate protection and to offload requests.

To keep real data safe there must be backups, usually three copies for very important data; ordinary data can be configured with two copies. To save space without losing safety, EC (Erasure Coding) can also be adopted, but EC involves a fair amount of computation, so consider it if the storage nodes' CPUs are otherwise idle. Also for safety, there is a common replication rule: different backups of the same data are usually kept on different racks, and when EC is used, the EC chunks should likewise be distributed across different racks.
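As a rough illustration of the space trade-off (the parameters below are assumptions, not values from any particular system): N-way replication stores N times the raw data, while an erasure code with k data chunks and m parity chunks stores (k+m)/k times the raw data and can tolerate the loss of any m chunks.

def replication_overhead(copies):
    # Raw storage used per byte of user data with plain replication.
    return copies

def ec_overhead(k, m):
    # Raw storage used per byte of user data with erasure coding EC(k, m).
    return (k + m) / k

print(replication_overhead(3))   # 3x raw size, tolerates loss of 2 copies
print(ec_overhead(4, 2))         # 1.5x raw size, tolerates loss of any 2 chunks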

Also very important is the horizontal scalability of distributed storage, i.e. adding or removing nodes. There are two approaches: one is to leave the original data untouched and only place newly written data on the new nodes; the other is to redistribute the existing data, which inevitably changes the mapping from data to physical storage. To keep disk usage balanced, current storage solutions tend to redistribute existing data; if the implementation is not good enough, this may cause heavy data traffic between storage nodes, deterioration of consistency and of externally visible performance, and may even make the system unavailable.
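The redistribution cost is easy to see with the most naive placement scheme, hashing modulo the node count: when the number of nodes changes, most objects map to a different node and would have to move. A small sketch (the key names are arbitrary):

import hashlib

def node_for(key, n_nodes):
    # Naive placement: hash the key and take it modulo the node count.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n_nodes

keys = [f"object-{i}" for i in range(100000)]
moved = sum(node_for(k, 4) != node_for(k, 5) for k in keys)
print(moved / len(keys))   # roughly 0.8: most objects change nodes when going from 4 to 5 nodes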


Swift's Consistent Hashing Algorithm
Swift provides an object interface. The location algorithm for object data is as follows (a minimal sketch of the ring follows the list):
1. Hash the names of all objects; the hash values fall into the range 0 to 2^n, which is joined end to end to form a hash ring.
2. Each storage device is also assigned a different hash value, and these are distributed evenly over the ring. (Later, to reduce the data migration caused by imbalance when devices change, each device is assigned multiple virtual hash values, and the virtual values of different devices are interleaved around the ring.)
3. When an object is written, its hash value is computed; searching forward on the ring from that position, the first device found is the storage location of the object.
4. When a device is removed, the hash values of the remaining devices stay unchanged, but the data on the removed device must migrate to the adjacent device to keep it accessible.
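A minimal sketch of such a ring with virtual points (this is not Swift's actual implementation; the device names and the number of virtual points per device are made up):

import bisect
import hashlib

def ring_hash(s):
    # Map a string onto the ring, here a 32-bit hash space.
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

devices = ["dev-a", "dev-b", "dev-c"]
VIRTUAL_POINTS = 128   # virtual hash values per device, to even out the distribution

ring = sorted((ring_hash(f"{dev}-{i}"), dev)
              for dev in devices for i in range(VIRTUAL_POINTS))
positions = [pos for pos, _ in ring]

def locate(object_name):
    # Search forward from the object's hash; the first device point found wins.
    idx = bisect.bisect(positions, ring_hash(object_name)) % len(ring)
    return ring[idx][1]

print(locate("photos/cat.jpg"))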

In the implementation, the hash span between two devices is called a partition. Because the hash algorithm is fixed, an object keeps the same partition id for its whole life cycle, and the devices are found through a two-dimensional array (_replica2part2dev_id). The devices corresponding to partition 9527 are _replica2part2dev_id[0][9527], _replica2part2dev_id[1][9527] and _replica2part2dev_id[2][9527].
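A sketch of that lookup (the part power and the table contents are invented here, and the real ring also mixes a per-cluster salt into the hash; in Swift the table is built and rebalanced by the ring builder):

import hashlib

PART_POWER = 16                  # 2**16 partitions (an assumed value)
PART_SHIFT = 32 - PART_POWER
REPLICAS = 3

# _replica2part2dev_id[replica][partition] -> device id (all zeros here as a placeholder)
_replica2part2dev_id = [[0] * 2 ** PART_POWER for _ in range(REPLICAS)]

def partition_for(name):
    # The partition comes from the top bits of the (32-bit truncated) hash of the object path.
    h = int(hashlib.md5(name.encode()).hexdigest()[:8], 16)
    return h >> PART_SHIFT

def devices_for(name):
    part = partition_for(name)
    return [_replica2part2dev_id[r][part] for r in range(REPLICAS)]

print(devices_for("AUTH_test/container/object"))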

Thus the key part of the Swift ring is the mapping from data hashes to devices; when devices change, the mapping changes as well. Since Swift uses proxy nodes to manage access to the data, the mapping must be kept synchronized across the proxy nodes.

Ceph's CRUSH Algorithm

Ceph provides multiple storage interfaces, but the bottom layer still stores objects.


Between objects and devices Ceph has two concepts, the Pool and the Placement Group (PG). For each object, the corresponding Pool is determined first, then the corresponding PG is calculated; from the PG the locations of the object's multiple replicas can be obtained. With three replicas, the first one is the Primary and the rest are called replicas.
Suppose an object foo lives in the pool bar; its devices are calculated as follows (the arithmetic is reproduced in the sketch after the list):
1. Hash foo's name, giving 0x3F4AE323.
2. Compute the ID of pool bar, which is 3.
3. The number of PGs in pool bar is 256; 0x3F4AE323 mod 256 = 0x23, so the PG id is 3.23.
4. Look up the PG-to-OSD mapping table to find the OSDs for this PG, e.g. [24, 3, 12], where osd.24 is the primary and osd.3 and osd.12 are the replicas.
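A simplified sketch of steps 1 to 3 (real Ceph uses its own rjenkins hash and a stable modulo, not the literal arithmetic below):

pool_id = 3
pg_num = 256
obj_hash = 0x3F4AE323            # hash of the object name "foo" from the example

pg_seed = obj_hash % pg_num      # 0x3F4AE323 % 256 = 0x23
pg_id = f"{pool_id}.{pg_seed:x}"
print(pg_id)                     # "3.23"

# Step 4 then consults the PG -> OSD mapping produced by CRUSH,
# e.g. something like pg_to_osds["3.23"] == [24, 3, 12] in the example above.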

The fourth step is the core of the CRUSH algorithm. CRUSH calculates the distribution of data objects according to the weight of each device. The distribution of objects is determined by the cluster map and the data distribution policy. The cluster map describes the available storage resources and their hierarchy (such as how many racks there are, how many servers per rack, and how many disks per server). The data distribution policy is composed of placement rules. Each rule determines how many replicas a data object has and the constraints on placing those replicas (for example, 3 replicas on different racks).

Given the cluster map, a rule and the pg id x, CRUSH computes a set of OSDs (OSD: Object Storage Device):

(osd0, osd1, osd2 … osdn) = CRUSH(cluster, rule, pgid)

CRUSH uses a multi-parameter hash function whose parameters include x, so that the mapping from x to the set of OSDs is deterministic and independent: CRUSH uses only the cluster map, the placement rules and x. CRUSH is a pseudorandom algorithm, so similar inputs produce uncorrelated results.
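A toy sketch of a weighted, deterministic, pseudorandom choice in the spirit of CRUSH's straw buckets (this is a simplification for illustration, not the real algorithm; the item names and weights are made up):

import hashlib

def draw(x, item, r):
    # Deterministic pseudorandom value in [0, 1) derived from the input x, the item and the replica rank.
    return int(hashlib.md5(f"{x}-{item}-{r}".encode()).hexdigest(), 16) / 2 ** 128

def choose(x, weights, replicas):
    chosen = []
    for r in range(replicas):
        # Every item draws a weighted "straw"; the longest straw wins this replica slot.
        best = max((w * draw(x, item, r), item)
                   for item, w in weights.items() if item not in chosen)
        chosen.append(best[1])
    return chosen

# Same input always yields the same OSD set; changing x gives an unrelated set.
print(choose("3.23", {"osd.3": 1.0, "osd.12": 1.0, "osd.24": 2.0}, 3))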


The difference between CRUSH and Swift's consistent hashing is that Swift's hash ring distributes the devices linearly, while CRUSH uses a hierarchical structure on which the user can define detailed placement policies; as a result Ceph's configuration is more complex. The following is an example crush map:
# begin crush map

# devices
device 10 device10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 21 osd.21
device 22 osd.22
device 23 osd.23

# buckets
host ceph1 {
        id -2
        item osd.11 weight 1.000
        item osd.12 weight 1.000
        item osd.13 weight 1.000
}
host ceph2 {
        id -4
        item osd.21 weight 1.000
        item osd.22 weight 1.000
        item osd.23 weight 1.000
}
rack unknownrack {
        id -3
        alg straw
        hash 0  # rjenkins1 hash algorithm
        item ceph1 weight 3.000
        item ceph2 weight 3.000
}
root default {
        id -1
        item unknownrack weight 24.000
}

# rules
rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

# end crush map

 

GlusterFS's Elastic Hashing Algorithm
GlusterFS provides an ordinary file system interface; when accessing a file, the client provides only its pathname.
GlusterFS stores data in a decentralized way, and locating a file is completed entirely on the client.
GlusterFS creates the same front-end directory structure on the storage nodes as the one the client sees; that is, any folder created on the client corresponds to a real folder on each storage node, but which node a specific file lives on has to be computed. A file is located on the storage nodes as follows (see the sketch after this list):
1. All files are hashed into a 32-bit integer space.
2. For each folder, extended attributes on each storage node record the hash range of that folder on that node; files whose hash falls into the range are stored on that node.
3. When a client connects to the servers, it fetches the hash ranges of the folders it needs to access from all storage nodes.
4. When accessing a file, the client computes the file's hash value and then accesses the corresponding node directly.
5. When a new node is added, the extended attributes of existing folders are not changed, so a file newly created in an existing folder is still placed on the old nodes.
6. To balance disk utilization, a link can be created on the old node that points to the real file created on the new node. (This causes a large amount of data transfer between nodes.)
7. The mapping between files and storage nodes can be re-established with a manual rebalance.
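A minimal sketch of this per-directory range lookup (the brick names and ranges are invented, and md5 stands in for GlusterFS's own hash function; in reality the ranges live in each directory's extended attributes on each brick):

import hashlib

# Per-directory layout: each storage node (brick) owns a slice of the 32-bit hash space.
# These ranges stand in for what GlusterFS keeps in the directory's extended attributes.
layout = {
    "brick-1": (0x00000000, 0x7FFFFFFF),
    "brick-2": (0x80000000, 0xFFFFFFFF),
}

def file_hash(name):
    # 32-bit hash of the file name.
    return int(hashlib.md5(name.encode()).hexdigest()[:8], 16)

def brick_for(name):
    h = file_hash(name)
    for brick, (low, high) in layout.items():
        if low <= h <= high:
            return brick

print(brick_for("report.pdf"))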

In comparison, GlusterFS's file location is relatively simple, but operations when expanding the cluster are more troublesome. Which approach to choose in a particular case also depends on which one better fits the characteristics of the service.
