[Data structure] Hash underlying structure

Table of contents

1. Hash concept

2. Hash implementation

1. Closed hashing

1.1. Linear probing

1.2. Quadratic probing

2. Open hashing

2.1. The concept of open hashing

2.2. Open hashing structure

2.3. Open hashing lookup

2.4. Open hashing insertion

2.5. Open hashing deletion

3. Performance analysis


1. Hash concept

 In sequential structures and balanced trees, there is no direct relationship between an element's key and its storage location, so finding an element requires multiple key comparisons. Sequential search has time complexity O(N), and search in a balanced tree is proportional to the height of the tree, O(logN). In both cases, search efficiency depends on the number of comparisons performed during the search.

 The ideal search method finds the element in a single access, without any comparison. If we construct a storage structure in which a function (hashFunc) establishes a one-to-one mapping between an element's storage location and its key, then the element can be located quickly through this function.

This structure works as follows:

  • Insert an element:
    Compute the element's storage location from its key using the hash function, and store the element at that location.
  • Search for an element:
    Apply the same function to the key of the element, treat the resulting value as the element's storage location, and compare the element at that position in the structure. If the keys are equal, the search succeeds.

 This method is called hashing. The conversion function used is called a hash function, and the structure constructed this way is called a hash table.

For example: given the data set {1, 7, 6, 4, 5, 9},
the hash function is set to: hash(key) = key % capacity, where capacity is the total size of the underlying storage space.

 Searching with this method does not require comparing the key multiple times, so the search is relatively fast. However, it may cause hash collisions.
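
As a quick sketch of this example (assuming capacity is 10, which is also the initial table size used by the code later in this article), each key's slot can be computed directly; note how a key such as 44 would land on the same slot as 4, which is exactly a hash collision:

#include <iostream>

int main()
{
	const int capacity = 10; // assumed capacity for this illustration
	int keys[] = { 1, 7, 6, 4, 5, 9, 44 };
	for (int key : keys)
	{
		// hash(key) = key % capacity; 4 and 44 both map to slot 4
		std::cout << "hash(" << key << ") = " << key % capacity << std::endl;
	}
	return 0;
}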

Hash overall code structure:

enum State
{
	EMPTY,  // the slot has never held a value
	EXIST,  // the slot currently holds a value
	DELETE  // the slot held a value that was pseudo-deleted
};

template<class K, class V>
struct HashData
{
	pair<K, V> _kv;
	State _state = EMPTY;
};

template<class K>
struct HashFunc
{
	// key itself can be implicitly converted to an integer
	size_t operator()(const K& key)
	{
		return key;
	}
};

template<>
struct HashFunc<string>
{
	// BKDR hash
	size_t operator()(const string& s)
	{
		size_t hash = 0;
		for (auto ch : s)
		{
			hash += ch;
			hash += ch;
			hash *= 31; // reduces collisions caused by different strings having the same sum of ASCII values
						// the multiplier can be 31, 131, 1313, 131313, etc.
		}
		return hash;
	}
};

template<class K, class V, class Hash = HashFunc<K>>
class HashTable
{
public:
	bool Insert(const pair<K, V>& kv)
	{}

private:
	vector<HashData<K, V>> _tables;
	size_t _n = 0; // number of stored elements
};

 Among the template parameters, Hash is a functor that converts the key into an integer. When the key is a string, a template specialization converts it to an integer using the BKDR method.
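
A brief usage sketch, assuming the Insert implemented later in this section has been filled in: the table can be instantiated for both integer and string keys, and a string key automatically selects the BKDR specialization.

#include <string>
#include <utility>
using namespace std;

int main()
{
	// integer keys go through the primary HashFunc<K> template
	HashTable<int, int> intTable;
	intTable.Insert(make_pair(1, 1));

	// string keys pick the HashFunc<string> specialization automatically
	HashTable<string, int> strTable;
	strTable.Insert(make_pair(string("apple"), 5));
	return 0;
}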

2. Hash implementation

 For two data elements with keys k_i and k_j (i != j), it can happen that k_i != k_j but Hash(k_i) == Hash(k_j): different keys produce the same hash address through the same hash function. This phenomenon is called a hash collision (or hash conflict). Data elements that have different keys but the same hash address are called "synonyms".

 One cause of frequent hash collisions is a hash function that is not well designed. Hash function design principles:

  • The domain of the hash function must include all keys that need to be stored, and if the hash table allows m addresses, its range must fall between 0 and m-1.
  • The addresses computed by the hash function should be distributed evenly across the entire space.
  • The hash function should be relatively simple.

Common hash functions:

  1. Direct addressing method--(commonly used)
    Take a linear function of the key as the hash address: Hash(Key) = A*Key + B.
    Advantages: simple and uniform.
    Disadvantage: the distribution of the keys must be known in advance.
    Usage scenario: suitable when the keys are relatively small and continuous.
  2. Remainder method--(commonly used)
     Let m be the number of addresses allowed in the hash table. Take a prime number p that is not greater than m but closest to (or equal to) m as the divisor, and convert the key into a hash address according to the hash function: Hash(key) = key % p (p <= m). A small sketch follows this list.
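
A minimal sketch of the remainder method under an assumed, hand-written prime list (the list values below are illustrative, not from the original article):

#include <cstddef>

// Pick the largest prime <= m from a small illustrative list (falling back to
// the smallest listed prime if m is tiny), then map the key by division.
size_t GetPrimeDivisor(size_t m)
{
	const size_t primes[] = { 7, 13, 31, 61, 127, 251, 509, 1021 };
	size_t p = primes[0];
	for (size_t prime : primes)
	{
		if (prime <= m)
			p = prime;
	}
	return p;
}

size_t DivisionHash(size_t key, size_t m)
{
	return key % GetPrimeDivisor(m); // Hash(key) = key % p, p <= m
}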

There are two main methods for resolving hash collisions: closed hashing and open hashing.

1. Closed hashing

 Closed hashing, also known as open addressing: when a hash collision occurs, if the hash table is not full, there must be an empty slot somewhere in the table, so the key can be stored in the next empty slot after the conflicting position. So how do we find that next empty position?

1.1. Linear probing

Linear probing: starting from the position where the collision occurs, probe the subsequent positions one by one until the next empty slot is found.

Insert:

  1. Compute the position of the element to be inserted in the hash table using the hash function.
  2. If there is no element at that position, insert the new element directly. If the position is occupied, a hash collision has occurred; use linear probing to find the next empty slot and insert the new element there.

Insert code:

bool Insert(const pair<K, V>& kv)
{
	if (Find(kv.first)) // the key already exists: refuse duplicate insertion
	{
		return false;
	}

    Hash hash;
	
    size_t hashi = hash(kv.first) % _tables.size(); // size(), not capacity(),
                                          // because operator[] cannot access slots beyond size()
	// linear probing
	size_t i = 1;
	size_t index = hashi;
	while (_tables[index]._state == EXIST)
	{
		index = hashi + i;
		index %= _tables.size();
		++i;
	}

	_tables[index]._kv = kv;
	_tables[index]._state = EXIST;
	_n++;
    return true;
}

Because the size may be 0, or the load factor may become too high, an expansion operation is required:

// expand when size is 0 or the load factor exceeds 0.7
if (_tables.size() == 0 || _n * 10 / _tables.size() >= 7)
{
	size_t newsize = _tables.size() == 0 ? 10 : _tables.size() * 2;
	vector<HashData<K, V>> newtables(newsize); // create a new vector
	// traverse the old table and remap every element into the new table
	for (auto& data : _tables)
	{
		if (data._state == EXIST)
		{
			// recompute the position in the new table
			size_t hashi = hash(data._kv.first) % newtables.size();
			size_t i = 1;
			size_t index = hashi;
			while (newtables[index]._state == EXIST)
			{
				index = hashi + i;
				index %= newtables.size();
				++i;
			}

			newtables[index]._kv = data._kv;
			newtables[index]._state = EXIST;
		}
	}
	_tables.swap(newtables);
}

 Note that expansion requires opening a new vector and re-inserting all the data; the original vector cannot simply be enlarged in place. After expansion the mapping changes: values that did not conflict before may now conflict, and values that conflicted before may no longer conflict.
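
A quick numeric illustration of why the mapping changes, assuming the table grows from size 10 to size 20: keys 4 and 14 collide before the expansion but not after, while 4 and 24 collide in both.

#include <iostream>

int main()
{
	int keys[] = { 4, 14, 24 };
	for (int key : keys)
	{
		// slot before expansion (size 10) -> slot after expansion (size 20)
		std::cout << key << ": " << key % 10 << " -> " << key % 20 << std::endl;
	}
	return 0;
}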

 Because the above version contains redundant code, it can be simplified as follows:

bool Insert(const pair<K, V>& kv)
{
	if (Find(kv.first))
	{
		return false;
	}

    Hash hash;

	// expand when size is 0 or the load factor exceeds 0.7
	if (_tables.size() == 0 || _n * 10 / _tables.size() >= 7)
	{
		size_t newsize = _tables.size() == 0 ? 10 : _tables.size() * 2;
		HashTable<K, V, Hash> newht; // create a new hash table
		newht._tables.resize(newsize); // expand
		// traverse the old table and re-insert every element into the new table
		for (auto& data : _tables)
		{
			if (data._state == EXIST)
			{
				newht.Insert(data._kv);
			}
		}
		_tables.swap(newht._tables);
	}

	size_t hashi = hash(kv.first) % _tables.size(); // size(), not capacity()

	// linear probing
	size_t i = 1;
	size_t index = hashi;
	while (_tables[index]._state == EXIST)
	{
		index = hashi + i;
		index %= _tables.size();
		++i;
	}

	_tables[index]._kv = kv;
	_tables[index]._state = EXIST;
	_n++;
    return true;
}

Delete:

 When closed hashing is used to handle collisions, an existing element in the hash table cannot simply be deleted physically, because doing so would break the search for other elements. For example, 4 and 44 both map to slot 4; if 4 is deleted directly, the probe sequence is cut short and the lookup of 44 fails. So linear probing deletes an element with a marked pseudo-deletion (the DELETE state).

Delete code:

HashData<K, V>* Find(const K& key)
{
	if (_tables.size() == 0)
	{
		return nullptr;
	}

    Hash hash;

	size_t hashi = hash(key) % _tables.size();

	size_t i = 1;
	size_t index = hashi;
	while (_tables[index]._state != EMPTY)
	{
		if (_tables[index]._state == EXIST && _tables[index]._kv.first == key)
		{
			return &_tables[index];
		}

		index = hashi + i;
		index %= _tables.size();
		++i;

		// if we have probed a full circle, every slot is EXIST or DELETE
		if (index == hashi)
		{
			break;
		}
	}
	return nullptr;
}

bool Erase(const K& key)
{
	HashData<K, V>* ret = Find(key);
	if (ret)
	{
		ret->_state = DELETE;
		--_n;
		return true;
	}
	else
	{
		return false;
	}
}

 Note that, to prevent an infinite loop when every slot in the table is in the EXIST or DELETE state, the Find function adds a check that stops the search after probing a full circle.

 Advantage of linear probing: the implementation is very simple.
 Disadvantage of linear probing: once hash collisions occur, the colliding elements pile up in consecutive slots, which easily causes data "clustering": different keys occupy each other's available slots, so locating a given key may require many comparisons, reducing search efficiency.

1.2. Quadratic probing

 The defect of linear probing is that conflicting data pile up in adjacent slots, which is a consequence of how the next empty position is found: slot by slot. To avoid this problem, quadratic probing computes the next empty position as: H_i = (H_0 + i^2) % m, or: H_i = (H_0 - i^2) % m, where i = 1, 2, 3, ..., H_0 is the position obtained from the element's key through the hash function Hash(x), and m is the size of the table.
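
A minimal standalone sketch of this probing step, reusing the State enum from the closed hashing code above (and assuming a load factor of at most 0.5, so an empty slot always exists):

#include <vector>
using namespace std;

// Returns the index of the first slot, probing from h0 by i^2 steps,
// whose state is not EXIST.
size_t QuadraticProbe(const vector<State>& states, size_t h0)
{
	size_t i = 1;
	size_t index = h0;
	while (states[index] == EXIST)
	{
		index = (h0 + i * i) % states.size(); // H_i = (H_0 + i^2) % m
		++i;
	}
	return index;
}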

 Research shows that when the length of the table is a prime number and the load factor a does not exceed 0.5, a new entry can always be inserted, and no position is probed twice. Therefore, as long as half of the positions in the table are empty, the table never becomes "full". Fullness can be ignored when searching, but when inserting, the load factor a must be kept at or below 0.5; if it would exceed that, the table must be expanded. Hence the biggest drawback of closed hashing: its space utilization is relatively low.

2. Open hashing

2.1. The concept of open hashing

 The open hashing method is also called the chained address method (open chaining). First, the hash function is applied to the key set to compute hash addresses. Keys with the same address belong to the same subset, and each subset is called a bucket. The elements in a bucket are linked through a singly linked list, and the head node of each list is stored in the hash table.

 Each bucket in the open hash table therefore contains exactly the elements that collide at that hash address.

2.2. Open hashing structure

template<class K, class V>
struct HashNode
{
	HashNode<K, V>* _next;
	pair<K, V> _kv;

	HashNode(const pair<K, V>& kv)
		:_next(nullptr)
		,_kv(kv)
	{}
};

template<class K>
struct HashFunc
{
	// key itself can be implicitly converted to an integer
	size_t operator()(const K& key)
	{
		return key;
	}
};

template<>
struct HashFunc<string>
{
	// BKDR hash
	size_t operator()(const string& s)
	{
		size_t hash = 0;
		for (auto ch : s)
		{
			hash += ch;
			hash *= 31; // reduces collisions caused by different strings having the same sum of ASCII values
						// the multiplier can be 31, 131, 1313, 131313, etc.
		}
		return hash;
	}
};

template<class K, class V, class Hash = HashFunc<K>>
class HashTable
{
	typedef HashNode<K, V> Node;
public:
	// the destructor frees every node in every bucket
    ~HashTable()
	{
		for (auto& cur : _tables)
		{
			while (cur)
			{
				Node* next = cur->_next;
				delete cur;
				cur = next;
			}
			cur = nullptr;
		}
	}
	bool Insert(const pair<K, V>& kv)
	{}

private:
	vector<Node*> _tables;
	size_t _n = 0; // number of stored elements
};

2.3. Open hashing lookup

Node* Find(const K& key)
{
	if (_tables.size() == 0)
		return nullptr;

    Hash hash;

	size_t hashi = hash(key) % _tables.size();
	Node* cur = _tables[hashi];
	while (cur)
	{
		if (cur->_kv.first == key)
			return cur;
		cur = cur->_next;
	}
	return nullptr;
}

2.4. Open hashing insertion

bool Insert(const pair<K, V>& kv)
{
	if (Find(kv.first))
		return false;

    Hash hash;

    size_t hashi = hash(kv.first) % _tables.size();
	// head insertion
	Node* newnode = new Node(kv);
	newnode->_next = _tables[hashi];
	_tables[hashi] = newnode;
	++_n;

	return true;
}

 Because the size of the hash table may be 0, or the load factor may be reached, the table needs to be expanded. The larger the load factor, the higher the probability of collision and the lower the search efficiency, but the higher the space utilization.

 Because the nodes in the original table are individually allocated and are not destroyed automatically when the old vector is discarded, there is no need to copy them: we only have to recompute each node's position and relink it into the new table.

bool Insert(const pair<K, V>& kv)
{
	if (Find(kv.first))
		return false;

    Hash hash;

	// expand when the load factor reaches 1
	if (_n == _tables.size())
	{
		size_t newsize = _tables.size() == 0 ? 10 : _tables.size() * 2;
		vector<Node*> newtables(newsize, nullptr);
		for (Node*& cur : _tables)
		{
			while (cur)
			{
				Node* next = cur->_next;
				size_t hashi = hash(cur->_kv.first) % newtables.size();
				// head-insert into the new table
				cur->_next = newtables[hashi];
				newtables[hashi] = cur;

				cur = next;
			}
		}
		_tables.swap(newtables);
	}

	size_t hashi = hash(kv.first) % _tables.size();
	// head insertion
	Node* newnode = new Node(kv);
	newnode->_next = _tables[hashi];
	_tables[hashi] = newnode;
	++_n;
	return true;
}

2.5. Open hashing deletion

bool Erase(const K& key)
{
	if (_tables.size() == 0)
		return false;

    Hash hash;

	size_t hashi = hash(key) % _tables.size();
	Node* prev = nullptr;
	Node* cur = _tables[hashi];
	while (cur)
	{
		if (cur->_kv.first == key)
		{
			if (prev == nullptr)
			{
				_tables[hashi] = cur->_next;
			}
			else
			{
				prev->_next = cur->_next;
			}
			delete cur;
			--_n;
			return true;
		}
		else
		{
			prev = cur;
			cur = cur->_next;
		}
	}
	return false;
}
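
A quick usage sketch of the chained table, assuming the Insert, Find, and Erase above are assembled into the open hashing HashTable class: unlike closed hashing, a physical delete does not disturb the lookup of other colliding keys.

#include <iostream>
#include <utility>
using namespace std;

int main()
{
	HashTable<int, int> ht;
	int keys[] = { 1, 7, 6, 4, 5, 9, 44 }; // 4 and 44 share bucket 4
	for (int key : keys)
		ht.Insert(make_pair(key, key));

	ht.Erase(4); // physically unlinks node 4 from its bucket
	cout << (ht.Find(44) != nullptr) << endl; // 44 is still found: prints 1
	return 0;
}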

 Handling overflow with the chained address method requires an extra link pointer per node, which seems to increase storage overhead. In fact, since open addressing must keep a large amount of free space to guarantee search efficiency (the closed hashing implementation above keeps the load factor at or below 0.7, and quadratic probing requires a <= 0.5), and a table entry usually occupies much more space than a pointer, the chained address method actually saves storage space compared with open addressing.

3. Performance analysis

 For an open hash table, the average time complexity of insertion, deletion, lookup, and update is O(1). In the worst case (all values hash into the same bucket) the complexity is O(N), but because of the expansion operation this worst case almost never occurs.

 If an extreme situation does occur and a single bucket grows beyond a certain length, the bucket can be converted to a red-black tree: make the slot a structure that holds the linked list pointer, the length of the bucket, and a pointer to a tree. When the bucket length exceeds the specified value, the tree pointer is used; otherwise, the list pointer is used.
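
A structural sketch of that idea; the field names and threshold below are illustrative assumptions, and RBTreeNode stands for a red-black tree node type that would be defined elsewhere:

// Illustrative only: a bucket slot that holds either a singly linked list
// or a red-black tree, switching once the bucket grows past a threshold.
template<class K, class V>
struct BucketSlot
{
	static const size_t kTreeifyThreshold = 8; // assumed threshold

	size_t _len = 0;                    // current length of this bucket
	HashNode<K, V>* _list = nullptr;    // used while _len <= kTreeifyThreshold
	RBTreeNode<K, V>* _tree = nullptr;  // used after the bucket is converted
};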


That's all for the content of the hash underlying structure. I hope it helps; if anything is wrong, please point it out, thank you!
