Blockchain: Hash Algorithms and Consistent Hash Algorithms

This article introduces the hash algorithms commonly used in blockchain, together with the consistent hashing algorithm.

1 Hash algorithm

1.1 Definition and characteristics

A hash algorithm uses a hash function to convert input data of any length (files, messages, numbers, and so on) into a fixed-length hash value.
In a blockchain, hash algorithms are used for block verification and security assurance. To be secure, a hash algorithm must satisfy the following three properties:

Collision resistance: It must be computationally infeasible to find two different inputs that produce the same output. Collisions necessarily exist, because an unbounded input space is mapped to a fixed-length output space, but they should be impractical to find. To achieve this, hash functions are designed to spread inputs as evenly as possible over the output space, to make the output look random, and to make the search for a collision computationally expensive. A well-designed hash algorithm greatly reduces the probability of hash collisions and so protects data security and integrity.
Hiding: Given the output of the hash function, it must be infeasible to recover the input.
Puzzle friendliness: Any small change to the input changes the output hash value in an unpredictable way, so an attacker cannot steer the output toward a chosen value by modifying part of the input. The only practical way to find an input whose hash has a given form is to try candidates one by one.
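
The fixed output length and the sensitivity to small input changes can be observed directly. The following is a minimal sketch using SHA-256 from Python's hashlib. The helper names (`sha256_hex`, `bit_difference`, `find_nonce`), the sample strings, and the three-zero prefix are illustrative assumptions, not part of any particular blockchain protocol.

```python
import hashlib

def sha256_hex(text):
    """Return the SHA-256 digest of a string as hex characters."""
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

def bit_difference(hex_a, hex_b):
    """Count the bits that differ between two equal-length hex digests."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count('1')

def find_nonce(data, prefix='000'):
    """Try nonces 0, 1, 2, ... until the digest of data + nonce starts with prefix."""
    nonce = 0
    while not sha256_hex(f"{data}{nonce}").startswith(prefix):
        nonce += 1
    return nonce

# Fixed length: a 1-character input and a 10000-character input
short_digest = sha256_hex('a')
long_digest = sha256_hex('a' * 10000)
print(len(short_digest), len(long_digest))

# Two inputs that differ in a single character
d1 = sha256_hex('transfer 10 coins to Alice')
d2 = sha256_hex('transfer 11 coins to Alice')
print(bit_difference(d1, d2), 'of 256 bits differ')

# Puzzle: candidates are simply tried one by one until one fits
nonce = find_nonce('block-data')
print(nonce, sha256_hex(f"block-data{nonce}"))
```

A longer prefix makes the search in `find_nonce` exponentially more expensive, which is the idea behind hash-based puzzles.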

1.2 Commonly used hash algorithms

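Commonly used hash algorithms in cryptography include MD5, SHA-1, the SHA-2 family (for example SHA-256 and SHA-512), SHA-3, and RIPEMD-160. Here we take only MD5 and SHA-1 as examples. Practical collisions are known for both, so they serve here purely as illustrations. The Python code is as follows: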

import hashlib

# The message is hashed as written. In English it reads:
# "Today is July 13, 2023, and the weather is very hot this year!"
message = '今天是2023年7月13号，今年天气很热！'

# MD5
md5 = hashlib.md5()
md5.update(message.encode('utf-8'))
md5_mess = md5.hexdigest()
print("MD5 digest: {}".format(md5_mess))

# SHA1
sha1 = hashlib.sha1()
sha1.update(message.encode('utf-8'))
sha1_mess = sha1.hexdigest()
print("SHA1 digest: {}".format(sha1_mess))

The result is as follows:

MD5 digest: 77815d973ce48613428d52956a1f6979
SHA1 digest: 572379a7baa62fd26e39bb4bbaf511497dbf838c
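
The same hashlib interface covers the stronger functions from the list above. As a hedged sketch, the code below hashes a short test message with SHA-256 and SHA3-256, and also applies SHA-256 twice, which is how Bitcoin hashes block headers. The helper name `double_sha256` and the test message are illustrative assumptions.

```python
import hashlib

def double_sha256(data: bytes) -> bytes:
    """Apply SHA-256 twice (hash of the hash) and return the raw 32-byte digest."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

message = b'abc'

sha256_mess = hashlib.sha256(message).hexdigest()
sha3_mess = hashlib.sha3_256(message).hexdigest()

print("SHA-256 digest: {}".format(sha256_mess))
print("SHA3-256 digest: {}".format(sha3_mess))
print("Double SHA-256 digest: {}".format(double_sha256(message).hex()))
```

SHA-256 and SHA3-256 both produce 256-bit digests, but they are built on different internal constructions, so their outputs for the same message are unrelated.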

2 Consistent Hash Algorithm

Consistent hashing (CH) is an extension of the hash algorithm, designed mainly to solve data distribution and load balancing problems in distributed systems.
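
To see the data distribution problem concretely, consider a naive scheme that stores each key on node number hash(key) mod N. This scheme is an assumption introduced here for comparison only. The sketch below counts how many keys change node when N grows from 4 to 5; the names `stable_hash` and `modulo_node` are illustrative.

```python
import hashlib

def stable_hash(key):
    """Hash a string to a non-negative integer with SHA-1 (same result on every run)."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest(), byteorder='big')

def modulo_node(key, node_count):
    """Naive placement: node index = hash(key) mod number of nodes."""
    return stable_hash(key) % node_count

keys = [f"data{i}" for i in range(1000)]

# Compare each key's node before (4 nodes) and after (5 nodes) the cluster grows
moved = sum(1 for k in keys if modulo_node(k, 4) != modulo_node(k, 5))
print(f"{moved} of {len(keys)} keys change node when growing from 4 to 5 nodes")
```

A key keeps its node only when its hash gives the same remainder modulo 4 and modulo 5, so most keys have to be moved. Consistent hashing is designed to avoid this mass relocation.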

2.1 Algorithm principle

The consistent hashing algorithm works as follows:

Define an address space ranging from $0$ to $2^{32}-1$ and join its two ends to form the hash ring;
Compute a hash value from a characteristic value of each node (generally the node's IP address), and map it to a point on the hash ring;
Use the same hash function to compute the hash value of the data key, and map it to the hash ring as well;
Search clockwise from the position of the data key; the first storage node encountered is the node that stores the key.
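
The four steps above can be sketched as follows. This is a minimal illustration, assuming SHA-1 reduced modulo $2^{32}$ as the hash function; the IP addresses and the names `ring_position` and `locate` are hypothetical.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # step 1: addresses 0 .. 2^32 - 1

def ring_position(value):
    """Map a string to a point on the hash ring."""
    digest = hashlib.sha1(value.encode()).digest()
    return int.from_bytes(digest, byteorder='big') % RING_SIZE

# Step 2: place each node on the ring by hashing its IP address
node_ips = ['192.168.0.1', '192.168.0.2', '192.168.0.3']
ring = sorted((ring_position(ip), ip) for ip in node_ips)
positions = [pos for pos, _ in ring]

def locate(data_key):
    """Steps 3 and 4: hash the key, then walk clockwise to the first node."""
    pos = ring_position(data_key)
    index = bisect.bisect_left(positions, pos)
    # Past the last node, wrap around to the first node on the ring
    return ring[index % len(ring)][1]

for data_key in ['data1', 'data2', 'data3']:
    print(data_key, '->', locate(data_key))
```

Keeping the node positions sorted lets the clockwise search use binary search instead of scanning the whole ring.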

However, when there are only a few storage nodes on the hash ring, their positions tend to be unevenly spread, which leads to uneven load across the storage nodes. Virtual nodes are introduced to avoid this. The idea is to place multiple copies of each node at different positions on the hash ring. Each copy is a virtual node, and its position on the ring is obtained by hashing a characteristic value derived from the real node it represents. Once virtual nodes are added, mapping a key takes two steps:

Find the virtual node that the key maps to;
Find the real storage node for the key from the mapping between virtual nodes and storage nodes.
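
The two-step lookup can be sketched as follows. This is an illustration under stated assumptions: virtual nodes are named by appending `#index` to the real node name, three virtual nodes are used per real node, and the function names are hypothetical.

```python
import bisect
import hashlib

def ring_position(value):
    """Map a string to a point on a ring of size 2^32."""
    digest = hashlib.sha1(value.encode()).digest()
    return int.from_bytes(digest, byteorder='big') % (2 ** 32)

real_nodes = ['node1', 'node2']
VIRTUAL_PER_NODE = 3  # illustrative value

# Mapping from virtual node name to real node, e.g. 'node1#0' -> 'node1'
virtual_to_real = {
    f"{node}#{i}": node
    for node in real_nodes
    for i in range(VIRTUAL_PER_NODE)
}

# Virtual nodes sorted by their position on the ring
virtual_ring = sorted((ring_position(v), v) for v in virtual_to_real)
positions = [pos for pos, _ in virtual_ring]

def find_virtual_node(data_key):
    """Step 1: map the key to the first virtual node clockwise from it."""
    index = bisect.bisect_left(positions, ring_position(data_key))
    return virtual_ring[index % len(virtual_ring)][1]

def find_real_node(data_key):
    """Step 2: map that virtual node back to its real storage node."""
    return virtual_to_real[find_virtual_node(data_key)]

for data_key in ['data1', 'data2', 'data3']:
    print(data_key, '->', find_virtual_node(data_key), '->', find_real_node(data_key))
```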

2.2 Case

The following Python code simulates using the consistent hashing algorithm to distribute data across the nodes of a distributed system:

import hashlib

class ConsistentHashing:
    def __init__(self, nodes=None, replica_count=10):
        self.replica_count = replica_count  # number of virtual nodes per real node
        self.ring = {}  # maps each hash ring address to its real node
        self.sorted_keys = []
        if nodes:
            for node in nodes:
                self.add_node(node)
                
    # Add a new node
    def add_node(self, node):
        for i in range(self.replica_count):
            key = self.get_key(node, i)
            self.ring[key] = node
            self.sorted_keys.append(key)
 
        self.sorted_keys.sort()
        
    # Remove a node
    def remove_node(self, node):
        for i in range(self.replica_count):
            key = self.get_key(node, i)
            del self.ring[key]
            self.sorted_keys.remove(key)
            
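    # Look up the node that stores a data key
    def get_node(self, data_key):
        if not self.ring:
            return None

        hashed_key = self.hash_key(data_key)
        # Walk clockwise: the first ring address at or after the key's hash owns the key
        for key in self.sorted_keys:
            if hashed_key <= key:
                return self.ring[key]

        # Past the last address, wrap around to the first one on the ring
        return self.ring[self.sorted_keys[0]]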
    
    # Build the virtual node's characteristic value from the real node,
    # then hash it to get the virtual node's address on the hash ring
    def get_key(self, node, replica_index):
        return self.hash_key(f"{node}#{replica_index}")

    # The hash function used here is SHA1; the full 160-bit digest is used as the ring address
    def hash_key(self, key):
        hashed_key = hashlib.sha1(key.encode()).digest()
        return int.from_bytes(hashed_key, byteorder='big')


nodes = ['node1', 'node2', 'node3', 'node4']
CH = ConsistentHashing(nodes=nodes, replica_count=5)

data_keys = ['data1', 'data2', 'data3', 'data4']
for key in data_keys:
    node = CH.get_node(key)
    print(f"'{key}' belongs to '{node}'")

One possible result of running the code is as follows:

'data1' belongs to 'node1'
'data2' belongs to 'node3'
'data3' belongs to 'node1'
'data4' belongs to 'node2'
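
The benefit of this scheme shows when a node leaves: only the keys stored on that node need to move. The sketch below checks this with a compact, function-based version of the same scheme (SHA-1 of `node#index`, five virtual nodes per real node). The names `build_ring` and `get_node` and the 1000 generated keys are assumptions made for this illustration.

```python
import bisect
import hashlib

def hash_key(key):
    """Hash a string to an integer ring address using the full SHA-1 digest."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest(), byteorder='big')

def build_ring(nodes, replica_count=5):
    """Return a sorted list of (ring address, real node) pairs, one per virtual node."""
    return sorted(
        (hash_key(f"{node}#{i}"), node)
        for node in nodes
        for i in range(replica_count)
    )

def get_node(ring, data_key):
    """Return the real node of the first virtual node clockwise from the key."""
    addresses = [address for address, _ in ring]
    index = bisect.bisect_left(addresses, hash_key(data_key))
    return ring[index % len(ring)][1]

data_keys = [f"data{i}" for i in range(1000)]
before = build_ring(['node1', 'node2', 'node3', 'node4'])
after = build_ring(['node1', 'node2', 'node3'])  # node4 has left

moved = [k for k in data_keys if get_node(before, k) != get_node(after, k)]
print(f"{len(moved)} of {len(data_keys)} keys moved after node4 left")

# Check that every key that moved was stored on node4 before it left
print(all(get_node(before, k) == 'node4' for k in moved))
```

Keys owned by the remaining nodes keep their position on the ring and their clockwise successor, so they are not affected by the removal.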
