Distributed Algorithms

Reposted from: https://blog.51cto.com/alanwu/1431397

 

An important issue faced by distributed storage is how to distribute data across multiple storage nodes. Anyone who has studied GFS knows that a file system can use a metadata server (MS) to decide how data blocks are mapped onto storage nodes. A metadata server cleanly separates data from metadata: when the file system is accessed, the namespace mapping information for a file can be obtained directly from the metadata server. The MS-based distributed storage architecture is shown below:

 

[Figure: metadata-server (MS) based distributed storage architecture]

 

The metadata-server-based approach is the classic architecture for distributed storage. Although it looks perfect on the surface, it still suffers from two major problems:

1. Scalability is limited by the capacity of the metadata server. All metadata is centralized on the metadata server, so any client that needs metadata must access that server. The overall load the system can carry (the number of clients) is therefore bounded by the metadata server's capability, making it a potential choke point for the whole distributed system. In particular, when clients access large numbers of small files, the volume of metadata requests explodes and the metadata server becomes the performance bottleneck.

2. The metadata server is a single point of failure in the distributed system. Once the metadata server fails, the whole distributed storage system stops working, so the reliability of the metadata server is particularly important.

To sum up, the biggest problems with metadata-server-based distributed storage are the scalability and reliability of the metadata server, and the root of both issues lies in the metadata server itself. Many optimization techniques target them. For example, the scalability limit can be eased by distributing metadata across multiple metadata servers, though this in turn introduces synchronization and mutual-exclusion (locking) problems between the distributed metadata servers. For the single point of failure, HA techniques can be used to improve reliability; many vendors have done a great deal of metadata-server HA work for the Hadoop distributed file system.

But no matter how it is optimized, distributed storage built on a metadata server cannot scale linearly; the scalability of these approaches basically follows a logarithmic curve. To achieve linear scalability, the industry began to look at removing the metadata server altogether, that is, at decentralization. Out of this effort came the HASH algorithm, the consistent HASH algorithm, the elastic HASH algorithm, and the CRUSH algorithm. Here we focus on the consistent HASH algorithm.

Before discussing consistent HASH, we first need to look at the plain HASH algorithm. It is very simple in distributed storage and can be described as follows:

 

[Figure: data placement with the simple HASH-mod-N algorithm]

 

When a client needs to write a file to storage, the file path can be used as the key to compute a HASH value; the HASH function needs to distribute keys well. The HASH value is then taken modulo the number of storage nodes N, giving a result between 0 and N-1, which is the ID of the storage node to access. With this method, the mapping between a file and its storage node needs no metadata server to intervene: the placement is determined entirely by the HASH function and can be computed by any client.
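A minimal sketch of this placement rule in Python (the hash function, node count, and path are illustrative assumptions, not from the original post):

```python
import hashlib

def node_for(path: str, num_nodes: int) -> int:
    """Map a file path to a storage node ID: HASH(path) mod N."""
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

# Every client computes the same placement, no metadata server needed.
print(node_for("/data/logs/app.log", 5))  # a node ID in 0..4
```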

The HASH algorithm looks perfect, but it has a problem: if a node is added dynamically later, N changes and the existing data mapping is destroyed. Re-establishing the mapping would require a large amount of data migration, which cannot be allowed to happen in large-scale distributed storage. The consistent HASH algorithm was introduced to solve this problem.
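A quick way to see the damage, continuing the sketch above with an illustrative key set: when the cluster grows from 5 to 6 nodes, most keys land on a different node.

```python
import hashlib

def node_for(path: str, num_nodes: int) -> int:
    return int(hashlib.md5(path.encode("utf-8")).hexdigest(), 16) % num_nodes

# Growing the cluster from 5 to 6 nodes: count how many keys remap.
keys = [f"/data/file-{i}" for i in range(1000)]
moved = sum(node_for(k, 5) != node_for(k, 6) for k in keys)
print(f"{moved} of 1000 keys change nodes")  # roughly 5/6 of them
```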

The core idea of consistent HASH is to treat the HASH result as a value space, and to assign every storage node a token whose value falls inside that same HASH value space. This relationship is usually described as a hash ring: the value space forms the ring, and every storage node is a point on it. It can be pictured as follows:

 

[Figure: storage nodes placed as points on the consistent HASH ring]

 

When a client needs to write a file to storage, the same file path is passed as the parameter to the HASH function to obtain a HASH value. The resulting value necessarily falls inside the HASH value space, which means it corresponds to a point somewhere on the ring; suppose, for example, that this point lies between SN1 and SN2. By convention, the nearest node clockwise from the point is selected to store the data, so the newly written file is stored on SN2.
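A minimal consistent-hash ring sketch (one token per node for clarity; real systems usually place many virtual tokens per node, and the node names here just follow the article's example):

```python
import bisect
import hashlib

def h(key: str) -> int:
    """Map a key into the HASH value space (here, 128-bit MD5)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # Sorted (token, node) pairs form the ring.
        self.tokens = sorted((h(n), n) for n in nodes)

    def node_for(self, path: str) -> str:
        # Walk clockwise from the key's point to the first node token,
        # wrapping around past the last token (it is a ring).
        i = bisect.bisect(self.tokens, (h(path), ""))
        return self.tokens[i % len(self.tokens)][1]

ring = HashRing(["SN1", "SN2", "SN3", "SN4", "SN5", "SN6", "SN7"])
print(ring.node_for("/data/file-42"))
```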

The biggest advantage of the consistent HASH algorithm is that adding a storage node no longer causes large-scale data migration. In the earlier example, if a node SN8 is later added between SN1 and SN2, only the part of the data previously stored on SN2 needs to migrate to SN8; none of the remaining nodes needs any data migration at all.
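Reusing the HashRing sketch above (the key set and exact count are illustrative), we can check that only the keys claimed by the new node move:

```python
before = HashRing(["SN1", "SN2", "SN3", "SN4", "SN5", "SN6", "SN7"])
after = HashRing(["SN1", "SN2", "SN3", "SN4", "SN5", "SN6", "SN7", "SN8"])

keys = [f"/data/file-{i}" for i in range(1000)]
moved = sum(before.node_for(k) != after.node_for(k) for k in keys)
# Every moved key migrates to SN8; all other nodes keep their data.
print(f"{moved} of 1000 keys migrate to SN8")
```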

 

[Figure: after SN8 joins the ring between SN1 and SN2, only part of SN2's data migrates to SN8]

 

Obviously, this method greatly reduces the amount of data migration, and it also neatly avoids the problems caused by a metadata server. For these reasons, the consistent HASH algorithm has been widely applied in CDN systems, the Swift object storage system, and Amazon's Dynamo storage system.
