Application of Consistent Hashing in DynamoDB

Consistent hashing is defined on Wikipedia as follows:

Consistent hashing is a special kind of hashing such that when the number of slots (the size) of the hash table changes, only K/n keys need to be remapped on average, where K is the number of keys and n is the number of slots. In a traditional hash table, by contrast, adding or removing a slot requires remapping almost all keys.

This article uses this definition as a starting point to study how Dynamo applies consistent hashing, and to become familiar with its usage scenarios and underlying principles.

 

1. Introduction to the features of Dynamo

The name Dynamo means generator, suggesting that, like a generator, it provides a steady stream of service. It is a distributed key/value NoSQL database offered by Amazon, fully managed in the cloud, supporting both document and key-value data models.


Its main features are as follows:

Flexible data model: Schema-free, which any NoSQL store must support.
Efficiency (speed): Data is stored on SSDs, and average server-side latency is typically under ten milliseconds. As data volumes grow, DynamoDB uses automatic partitioning and SSD technology to meet throughput demands and keep latency low for databases of any size.
High availability: To achieve high availability and durability and to prevent data loss from node failure, Dynamo stores multiple copies of the same data on the coordinator node and its subsequent N-1 nodes.
High scalability: Dynamo does not limit the size of the data; it distributes the data across machines. You only specify a target utilization, and capacity automatically expands or shrinks as application request volume rises and falls, so database scaling is not your concern.
Fully managed: There is no need to handle database administration such as cluster management, software and hardware configuration and provisioning, monitoring, and deployment, freeing developers from deploying, monitoring, and maintaining the database environment.
Decentralization: Node symmetry and decentralization; the system uses a P2P architecture in which every node is equal and carries the same responsibilities.
ACID properties: To gain a more flexible, horizontally scalable data model, NoSQL databases usually give up some of the ACID properties of traditional relational database management systems (RDBMS), which tend to have poor availability when strict ACID guarantees are enforced. Dynamo targets high availability with weak consistency (the "C" in ACID).


To me, the most attractive aspects of Dynamo are its high scalability and fully managed operation, which save developers a great deal of operational work.
The downside is that its consistency guarantees are relatively weak: roughly 99.94% of requests see a consistent value, and any inconsistencies that do occur are pushed up to the application to resolve, much like a git merge. If an application has strict consistency requirements this can be quite troublesome; ultimately it depends on the application's requirements, and this is covered in more detail later.

Dynamo's high scalability is achieved through consistent hashing. Let's focus on how it accomplishes this.

 

2. Data distribution

When creating a table in Dynamo, you must specify a partition key. Its value is supplied by the user and hashed to decide where each item is stored; when the partition key alone serves as the primary key, its values must be unique.
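As a concrete illustration, here is a minimal sketch of creating such a table with boto3; the table name, attribute name, and billing mode are arbitrary example choices, not anything prescribed by Dynamo. DynamoDB hashes the partition key value internally; the application never sees the hash.

import boto3

# Create a table whose primary key is a single partition (HASH) key.
# DynamoDB hashes the partition key value to decide which partition stores the item.
dynamodb = boto3.resource("dynamodb")

table = dynamodb.create_table(
    TableName="Users",                                    # hypothetical table name
    KeySchema=[
        {"AttributeName": "user_id", "KeyType": "HASH"}   # partition key
    ],
    AttributeDefinitions=[
        {"AttributeName": "user_id", "AttributeType": "S"}
    ],
    BillingMode="PAY_PER_REQUEST",
)
table.wait_until_exists()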

When you are new to Dynamo, it is not obvious how partition keys are actually used. So why is a partition key needed?

 

Let's review how consistent hashing works. Consistent hashing maps data evenly onto a ring (a circular hash space) to keep the distribution uniform and to preserve monotonicity as membership changes. To avoid moving too many data items when there are only a few nodes, virtual nodes are introduced, as follows:

 

After introducing virtual nodes, the mapping is converted from [object ---> node] into [object ---> virtual node ---> node]. The mapping used to find the node where an object resides is shown in the figure below.

The virtual nodes above can also be called partitions. When a new node (server) is added, the virtual nodes do not change and the hash values of the data stay fixed, so only the assignment of virtual nodes to nodes needs to be adjusted. The impact on data migration is therefore minimal.

Let's look at a code implementation:

We assume there are 256 nodes (2^8) and 4096 partitions (2^12). We take the first 32 bits of the MD5 digest as the hash, then shift it right by PARTITION_SHIFT (equal to 32 - PARTITION_POWER) to map the data item's hash into the 2^12 partition space. This is where the partition power comes in.

from hashlib import md5
from struct import unpack_from
from bisect import bisect_left

ITEMS = 10000000
NODES = 256
PARTITION_POWER = 12
PARTITION_BITS = 32
PARTITION_SHIFT = PARTITION_BITS - PARTITION_POWER  # 32 - 12 = 20

node_stat = [0 for i in range(NODES)]

# Get a 32-bit hash value from the first 4 bytes of the MD5 digest
def _hash(value):
    k = md5(str(value).encode()).digest()
    ha = unpack_from(">I", k)[0]
    return ha

ring = []
part2node = {}

# Mapping between virtual nodes (partitions) and physical nodes
for part in range(2 ** PARTITION_POWER):
    ring.append(part)
    part2node[part] = part % NODES

for item in range(ITEMS):
    # Map the 32-bit hash value into the 12-bit partition space
    h = _hash(item) >> PARTITION_SHIFT
    # Find the partition responsible for this hash
    partition = bisect_left(ring, h)
    n = part2node[partition]
    node_stat[n] += 1

_ave = ITEMS // NODES
_max = max(node_stat)
_min = min(node_stat)
print("ave: %d, max: %d, min: %d" % (_ave, _max, _min))


This is why Dynamo requires a partition key when creating a table: to keep the data highly scalable, it must spread the data across different node servers.
With partitions, a table's data can be distributed across different nodes. And when capacity is expanded by adding nodes, the partition an item hashes to does not change; only the mapping from partitions to nodes changes, so the migration impact is minimal (see the sketch below).
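To make the migration point concrete, here is a minimal sketch continuing the toy model from the code above. The rebalancing policy here (handing a random, even share of partitions to the new node) is only an illustration, not Dynamo's actual strategy:

import random

PARTITION_POWER = 12
NODES = 256
PARTITIONS = 2 ** PARTITION_POWER

# Initial even assignment of partitions to nodes.
part2node = {part: part % NODES for part in range(PARTITIONS)}

# Add one node: give it an even share of partitions (about PARTITIONS / (NODES + 1))
# taken from the existing nodes. Items themselves are not rehashed; only the
# partition-to-node mapping changes.
new_node = NODES
to_move = random.sample(range(PARTITIONS), PARTITIONS // (NODES + 1))
new_map = dict(part2node)
for part in to_move:
    new_map[part] = new_node

moved = sum(1 for p in range(PARTITIONS) if part2node[p] != new_map[p])
print("partitions reassigned: %d of %d (~1/%d of the data)" % (moved, PARTITIONS, NODES + 1))

Because items hash to partitions rather than directly to nodes, only the handful of reassigned partitions need to move; the rest of the data stays where it is.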

 

3. Data redundancy

The model above solves the data-scalability problem, but not high availability: each piece of data lives on a single node. If that node fails, what happens to its data?


To achieve high availability and durability and to prevent data loss when nodes go down or fail, Dynamo replicates each piece of data into N copies stored on multiple hosts.
Besides the coordinator node storing each key in its range locally, the data is also replicated to the subsequent N-1 nodes on the ring. N is configurable, 3 by default, and the theoretical basis for these settings is the NWR strategy.

NWR is a strategy for controlling the consistency level in distributed storage systems, and it is how Amazon's Dynamo controls consistency. The letters mean the following:

 

      N: the number of copies (replicas) kept of the same piece of data

      W: the number of copies that must be successfully written when updating a data object

      R: the minimum number of copies that must be read when reading a piece of data

 

As long as W + R > N, every read quorum overlaps every write quorum, so a read is guaranteed to see at least one up-to-date copy even if a machine fails. An application that prizes read efficiency can set W = N, R = 1; an application that wants to balance reads and writes typically uses N = 3, W = 2, R = 2, the (3, 2, 2) combination Dynamo recommends. A small check of this overlap property is sketched below.
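As a quick sanity check (a minimal sketch, not Dynamo code), the following enumerates every possible read and write quorum for N = 3, W = 2, R = 2 and verifies that each pair overlaps, which is exactly what W + R > N guarantees:

from itertools import combinations

N, W, R = 3, 2, 2
replicas = set(range(N))

# Every possible write quorum must share at least one replica with every
# possible read quorum whenever W + R > N.
overlap_everywhere = all(
    set(write_q) & set(read_q)
    for write_q in combinations(replicas, W)
    for read_q in combinations(replicas, R)
)
print("W + R > N:", W + R > N, "-> quorums always overlap:", overlap_everywhere)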

 

Each copy is called a replica. In a distributed system, a single point of data must not exist; having only one healthy replica online is very dangerous.


If that last replica also fails, the data may be permanently lost. With N = 2, losing one storage node already leaves a single point, so N must be greater than 2. The higher N is, the higher the maintenance and overall cost of the system, so the industry usually sets N to 3.

For example, in the figure above, values in the yellow range are stored on the three nodes A, B, and C, while the blue range is handled by nodes D, E, and F. The list of nodes responsible for storing a particular key-value pair is called the preference list. Because virtual nodes are scattered around the ring, all nodes in the preference list must be distinct physical nodes to ensure that availability and durability are not affected by a single node failure.

Let's see how to implement this:

We add replicas to the code above and set the replica count to 3. node_ids records the node IDs holding the 3 replicas, and part2node[part] looks up the node ID for a given partition ID, i.e. the partition-to-node mapping.

from hashlib import md5
from struct import unpack_from
from bisect import bisect_left

ITEMS = 10000000
NODES = 100
PARTITION_POWER = 12
PARTITION_BITS = 32
PARTITION_SHIFT = PARTITION_BITS - PARTITION_POWER  # 32 - 12 = 20
PARTITION_MAX = 2 ** PARTITION_POWER - 1
REPLICAS = 3

node_stat = [0 for i in range(NODES)]

# Get a 32-bit hash value from the first 4 bytes of the MD5 digest
def _hash(value):
    k = md5(str(value).encode()).digest()
    ha = unpack_from(">I", k)[0]
    return ha

ring = []
part2node = {}

# Mapping between virtual nodes (partitions) and physical nodes
for part in range(2 ** PARTITION_POWER):
    ring.append(part)
    part2node[part] = part % NODES

for item in range(ITEMS):
    # Map the 32-bit hash value into the 12-bit partition space
    h = _hash(item) >> PARTITION_SHIFT
    part = bisect_left(ring, h)
    node_ids = [part2node[part]]
    node_stat[node_ids[0]] += 1

    # Replica placement: walk forward around the ring until the data
    # has been placed on REPLICAS distinct physical nodes
    for replica in range(1, REPLICAS):
        while part2node[part] in node_ids:
            part += 1
            if part > PARTITION_MAX:
                part = 0
        node_ids.append(part2node[part])
        node_stat[node_ids[-1]] += 1

_ave = ITEMS * REPLICAS // NODES
_max = max(node_stat)
_min = min(node_stat)
print("ave: %d, max: %d, min: %d" % (_ave, _max, _min))

 

4. Data Consistency

1. Conflict

With replicas added, especially under the (3, 2, 2) NWR setting, a read must wait for 2 nodes to return results before the request is considered complete, which means request latency is determined by the slowest of those nodes; the same is true for writes. The one difference is that if conflicting data is found on the nodes, an attempt is made to resolve the conflict and write the result back to the corresponding nodes.

Dynamo's consistency requirements are not strict, so data inconsistencies can occur. In most cases they do not, and even when they do, Dynamo guarantees that conflicting data is never lost; however, data that has already been deleted may reappear.

To handle this, Dynamo uses vector clocks. Each object version carries one or more vector clocks, and each update advances the clock.
If every entry in the incoming data's vector clock is no smaller than the corresponding entry in the local vector clock, there is no conflict and the new value can be accepted. Otherwise, on the next client request Dynamo returns the conflicting versions; since it cannot resolve them automatically from the vector clocks, it requires the client to merge the different versions. Just like git's merge operation, the problem is handed back to the caller. A small sketch of this comparison is shown below.
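Here is a minimal sketch of that vector-clock comparison, using plain {node: counter} dicts; it is an illustration only, not Dynamo's actual representation (which also records timestamps for clock truncation):

def compare(vc_a, vc_b):
    """Compare two vector clocks held as {node: counter} dicts.

    Returns "descends" if vc_a already includes everything in vc_b,
    "behind" if vc_a is strictly older, and "conflict" if the two
    clocks are concurrent and must be merged by the client.
    """
    nodes = set(vc_a) | set(vc_b)
    a_ge_b = all(vc_a.get(n, 0) >= vc_b.get(n, 0) for n in nodes)
    b_ge_a = all(vc_b.get(n, 0) >= vc_a.get(n, 0) for n in nodes)
    if a_ge_b:
        return "descends"
    if b_ge_a:
        return "behind"
    return "conflict"

# Two writes accepted by different coordinators diverge and must be merged.
print(compare({"A": 2, "B": 1}, {"A": 1, "B": 1}))  # descends
print(compare({"A": 1, "C": 1}, {"A": 1, "B": 1}))  # conflict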

 

2. Failure

When a node fails temporarily, writes destined for it automatically go to the next node in the list instead and are marked as hinted-handoff data. If the failed machine comes back within the specified time, the other machines discover this through the Gossip protocol and send the staged data back to it. A rough sketch of this flow is shown below.
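A rough sketch of hinted handoff in this toy model; the data structures and function names are illustrative, not Dynamo's:

# hints[home_node] collects (key, value) pairs staged on behalf of a
# temporarily failed node.
hints = {}

def store(node, key, value):
    print("store %r=%r on node %s" % (key, value, node))

def write(key, value, preference_list, alive):
    """Write to the home node if it is up; otherwise stage a hinted write."""
    home = preference_list[0]
    if home in alive:
        store(home, key, value)
        return
    # Home node is down: hand the write to the next alive node in the list
    # and remember which node it really belongs to.
    for fallback in preference_list[1:]:
        if fallback in alive:
            store(fallback, key, value)
            hints.setdefault(home, []).append((key, value))
            return

def on_node_recovered(node):
    """Called when gossip reports the node is back: replay its hinted writes."""
    for key, value in hints.pop(node, []):
        store(node, key, value)

# Node A is temporarily down, so the write lands on B with a hint for A;
# once gossip reports A is back, the staged data is sent back to it.
write("user:42", "alice", preference_list=["A", "B", "C"], alive={"B", "C"})
on_node_recovered("A")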

 

If the machine is still down after the timeout, it is considered permanently offline, and its data must be re-synchronized from other replicas. To detect inconsistencies between replicas quickly and reduce the amount of data transferred, Dynamo uses Merkle trees. A Merkle tree is a hash tree: its leaf nodes are the hashes of individual keys, and each parent node higher in the tree is the hash of its children. A tree built this way cannot be tampered with undetected, and any change is discovered immediately, so detection is fast and little data needs to be transmitted. During synchronization, only the keys along root-to-leaf paths whose hashes differ need to be synchronized. A small sketch of such a tree comparison follows.
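Here is a minimal sketch of building such a hash tree and comparing two replicas' trees (an illustration only; Dynamo maintains one tree per key range and exchanges tree nodes level by level):

from hashlib import md5

def build_merkle(leaf_hashes):
    """Build a hash tree bottom-up; returns a list of levels, root level last."""
    levels = [leaf_hashes]
    while len(levels[-1]) > 1:
        level = levels[-1]
        parents = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            parents.append(md5(b"".join(pair)).digest())
        levels.append(parents)
    return levels

def leaf(kv):
    key, value = kv
    return md5(("%s=%s" % (key, value)).encode()).digest()

replica_a = [("k1", "v1"), ("k2", "v2"), ("k3", "v3"), ("k4", "v4")]
replica_b = [("k1", "v1"), ("k2", "v2-stale"), ("k3", "v3"), ("k4", "v4")]

tree_a = build_merkle([leaf(kv) for kv in replica_a])
tree_b = build_merkle([leaf(kv) for kv in replica_b])

# Comparing root hashes shows in one step whether the replicas differ at all;
# walking down only the differing branches pinpoints the stale keys.
print("roots equal:", tree_a[-1] == tree_b[-1])
diff = [i for i, (a, b) in enumerate(zip(tree_a[0], tree_b[0])) if a != b]
print("differing leaves:", [replica_a[i][0] for i in diff])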

References:

https://draveness.me/dynamo

http://www.cnblogs.com/yuxc/archive/2012/06/22/2558312.html

end

 


