[C++] Implementation of a hash table

What is a hash table

A hash table (also called a hash map) is a data structure that is accessed directly based on key values. That is, it accesses records by mapping keys to locations in a table to speed up lookups. The mapping function is called a hash function, and the array storing the records is called a hash table.

The idea behind a hash table is actually very simple. A fixed algorithm, the so-called hash function, converts the key into an integer; that integer is then taken modulo the length of the array, and the remainder is used as an array subscript. The value is stored in the array slot at that subscript.

When querying a hash table, the hash function is applied to the key again to compute the corresponding array subscript, which locates the slot holding the value. In this way, the random-access performance of the array can be fully exploited.

The efficiency of a search depends on the number of element comparisons performed during the search. The time complexity of searching a sequential structure is therefore O(N), and the time complexity of searching a balanced tree is the height of the tree, O(logN).

The ideal search retrieves the element from the table directly in one step, without any comparisons; that is, the search has time complexity O(1).

If you can construct a storage structure that establishes, through some function, a one-to-one mapping between an element's storage location and its key, then the element can be found quickly through that function.

Understanding hashes

A hash function maps an input (often from a large domain) to a fixed-size output. An important property of a hash function is that its output length is fixed regardless of the size of the input; if it were not, the data distribution would be uneven and the bucket sizes unclear, hurting table performance. This fixed-size output is what makes hash functions so useful: they can quickly map large amounts of data into a smaller range.

Hash-based containers

Both unordered_map and unordered_set are containers in the C++ standard library used to store collections of elements. Their underlying implementations are based on hash tables.

unordered_map is an associative container that stores key-value pairs. Each key is unique and associated with a value. unordered_map uses a hash function to map keys to specific buckets and stores the values in those buckets. Because it is implemented with a hash table, unordered_map's insertion, search, and deletion operations all have constant average time complexity.

unordered_set is a set container that stores unique elements. unordered_set uses a hash function to map each element to a specific bucket and stores the element there. Like unordered_map, unordered_set's insertion, search, and deletion operations also have constant average time complexity.

The difference between unordered_map and unordered_set is that unordered_map stores key-value pairs, while unordered_set only stores values. Therefore, unordered_map can be used to quickly find the value associated with a key, while unordered_set can be used to quickly determine whether a value exists in the set.
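As a minimal sketch of these two use cases (the helper names `frequencies` and `contains` are my own, written only for this example):

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Build a word -> count table with unordered_map.
std::unordered_map<std::string, int> frequencies(const std::vector<std::string>& words)
{
    std::unordered_map<std::string, int> freq;
    for (const auto& w : words)
        ++freq[w]; // operator[] inserts a zero-initialized count on first access
    return freq;
}

// Average O(1) membership test with unordered_set.
bool contains(const std::unordered_set<std::string>& s, const std::string& w)
{
    return s.count(w) > 0;
}
```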

Advantages:

  • Fast query speed: Due to the use of hash tables, the query operations of unordered_map and unordered_set have constant average time complexity.
  • Efficient insertion and deletion: The operations of inserting and deleting elements also have constant average time complexity.
  • Flexibility: keys and values of different types can be stored.

Disadvantages:

  • Large memory consumption: Due to the need to maintain the hash table, unordered_map and unordered_set usually consume more memory space.
  • Unordered: the storage positions of elements in the container are unordered, so no element order can be guaranteed.

Methods for computing the hash address

There are several common methods for computing the hash address from a key:

  1. Direct addressing method (commonly used):
    take a linear function of the key as the hash address: Hash(Key) = A * Key + B
  • Idea: use the key itself (or a linear transform of it) directly as the address, i.e. H(key) = key. It is suitable when the keys are distributed fairly evenly.
  • Difference: direct addressing needs no hash computation and uses the key itself as the address, so lookups are very fast.
  • Advantages: the average time complexity of a lookup is O(1), i.e. constant time.
  • Disadvantages: when the keys are distributed unevenly, collisions occur (multiple keys map to the same address) and a collision-resolution method is required.
  2. Digit analysis method:
  • Idea: analyze the keys and select representative digits as the hash address. For example, when hashing ID card numbers, the year digits can be selected as the address.
  • Difference: digit analysis requires examining the keys and choosing suitable digits based on the results, so it fits certain specific data sets.
  • Advantages: for data sets that follow a particular pattern, digit analysis can achieve a good spread.
  • Disadvantages: for data sets without an obvious pattern, digit analysis may not hash well.
  3. Mid-square method:
  • Idea: square the key and take the middle digits as the hash address. That is, after squaring the key, take its middle m digits as H(key).
  • Difference: the mid-square method transforms the key by squaring it and then takes the middle digits as the address.
  • Advantages: compared with direct addressing and digit analysis, the mid-square method spreads keys more evenly and reduces the probability of collisions.
  • Disadvantages: the mid-square method requires extra squaring and digit-extraction operations, so it costs more computation than direct addressing or digit analysis.
  4. Division with remainder method (commonly used):
  • Idea: divide the key by a number no larger than the table size and take the remainder as the hash address. That is, H(key) = key % p, where p is a prime no larger than the size of the hash table.
  • Difference: the division-with-remainder method computes the hash address through division and remainder operations.
  • Advantages: simple and fast to compute.
  • Disadvantages: if the chosen prime p correlates with patterns in the keys, more collisions may occur.
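The two commonly used methods above can be sketched as follows. The function names are my own, and the shift and mask widths in the mid-square sketch are arbitrary illustrative choices:

```cpp
#include <cassert>
#include <cstddef>

// Division with remainder: H(key) = key % p, with p a prime no larger
// than the table size.
std::size_t hash_mod(std::size_t key, std::size_t p)
{
    return key % p;
}

// Mid-square (illustrative): square the key and keep some middle bits.
std::size_t hash_mid_square(std::size_t key)
{
    std::size_t sq = key * key;
    return (sq >> 8) & 0x3FF; // keep 10 "middle" bits -> address in [0, 1024)
}
```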

Insertion and lookup in a hash table

Data set {1, 7, 6, 4, 5, 9};

The hash function is set to: hash(key) = key % capacity, where capacity is the total size of the underlying storage.


When we then insert the value 11, we find that inserting it the same way conflicts with the element at subscript 1. At this point we have to solve the conflict problem, which is addressed below.
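A small sketch of this mapping (the helper `slots` is hypothetical, written just for this example):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Map each key to its slot with hash(key) = key % capacity.
std::vector<std::size_t> slots(const std::vector<int>& keys, std::size_t capacity)
{
    std::vector<std::size_t> out;
    for (int k : keys)
        out.push_back(static_cast<std::size_t>(k) % capacity);
    return out;
}
```

With capacity 10, the data set {1, 7, 6, 4, 5, 9} maps to those same subscripts, and 11 maps to subscript 1, colliding with key 1.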

  • Insert an element.
    The hash function (introduced below) computes the storage location from the key of the element to be inserted, and the element is stored at that location.
//Two ways to insert
		//bool Insert(const T& data)
		pair<iterator, bool> Insert(const T& data)
		{
			KeyOfT kot; // extracts the key from the stored object
			iterator it = Find(kot(data));
			if (it != end())
			{
				return make_pair(it, false);
			}
			Hash hash;
			// expand when the load factor reaches 1
			if (_n == _tables.size())
			{
				// move the nodes themselves so no destructors run on the data
				size_t newsize = _tables.size() == 0 ? 10 : _tables.size() * 2;
				vector<Node*> newtables(newsize, nullptr);
				//for (Node*& cur : _tables)
				for (auto& cur : _tables)
				{
					while (cur)
					{
						Node* next = cur->_next;
						size_t hashi = hash(kot(cur->_data)) % newtables.size();

						// head-insert into the new table
						cur->_next = newtables[hashi];
						newtables[hashi] = cur;
						cur = next;
					}
				}
				_tables.swap(newtables);
			}
			size_t hashi = hash(kot(data)) % _tables.size();
			// head insertion
			Node* newnode = new Node(data);
			newnode->_next = _tables[hashi];
			_tables[hashi] = newnode;
			_n++;
			return make_pair(iterator(newnode, this), true);
		}
  • Find the element.
    Apply the same calculation to the element's key and use the resulting value as its storage location. Starting from that location, compare keys; if the keys are equal, the search succeeds.
//Lookup
		iterator Find(const K& key)
		{
			if (_tables.size() == 0)
			{
				return end();
			}
			Hash hash;   // converts the key type into an integer hash value
			KeyOfT kot;  // extracts the key from the stored object
			size_t hashi = hash(key) % _tables.size();
			Node* cur = _tables[hashi];
			while (cur)
			{
				if (kot(cur->_data) == key)
				{
					return iterator(cur, this);
				}
				else
				{
					cur = cur->_next;
				}
			}
			return end();
		}

Resolve hash collisions

Data set {1, 7, 6, 4, 5, 9};

The hash function is set to: hash(key) = key % capacity, where capacity is the total size of the underlying storage.


When we then insert the value 11, inserting it the same way conflicts with the element at subscript 1, so the conflict must be resolved.

There are two common ways to resolve hash collisions: closed hashing and open hashing.

Closed hashing is also called the open addressing method.

Open addressing: store key-value pairs directly in the table's slots. When a collision occurs, the next available slot is found through certain probing rules. Common probing rules include linear probing and quadratic probing.

When a hash collision occurs, if the hash table is not full, there must still be an empty position in the table, so the key can be stored in the "next" empty position after the conflicting one.

The open addressing method does not require extra memory for pointers and is more cache-friendly. However, when the load factor is high, it increases the probability of consecutive collisions, hurting performance.
Linear probing
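The i-th linear probe position can be sketched as a small function (the name `linear_probe` is mine, not from the implementation below):

```cpp
#include <cassert>
#include <cstddef>

// Linear probing: the i-th probe position after home slot h0 is (h0 + i) % m.
std::size_t linear_probe(std::size_t h0, std::size_t i, std::size_t m)
{
    return (h0 + i) % m;
}
```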

The problem of deleting elements
When closed hashing is used to handle collisions, you cannot physically remove an existing element from the table: directly deleting an element would break the search for other elements. For example, if you delete element 4 directly, the search for 44 may be affected. Therefore linear probing deletes an element by marking it as pseudo-deleted.

Quadratic probing
The flaw of linear probing is that conflicting data piles up, which is tied to how the next empty position is found: slots are checked one by one. To avoid this problem, quadratic probing finds the next empty position as follows:

H_i = (H_0 + i^2) % m  (i = 1, 2, 3, ...)
H_0: the position computed by the hash function from the element's key.
H_i: the storage location obtained for the conflicting element after the i-th probe.
m: the size of the table.
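The formula above can be sketched directly (the helper name is mine):

```cpp
#include <cassert>
#include <cstddef>

// Quadratic probing: the i-th probe position after home slot h0 is (h0 + i*i) % m.
std::size_t quadratic_probe(std::size_t h0, std::size_t i, std::size_t m)
{
    return (h0 + i * i) % m;
}
```

For example, with h0 = 4 and m = 10 the probe sequence visits 5, 8, 3, ... — spreading out faster than linear probing's 5, 6, 7, ...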

Quadratic probing finds the next location for data that causes hash collisions. Compared with linear probing, elements in a quadratically probed table are distributed more sparsely, making data pile-up less likely.

As with linear probing, you still need to pay attention to the table's load factor when using quadratic probing. For example, inserting the same data with quadratic probing into a hash table of length 20 also reduces the number of collisions.

Open hashing

Open hashing (separate chaining) uses a linked list or another data structure to store conflicting key-value pairs in each hash bucket. When a collision occurs, the new key-value pair is simply added to the list. This method is simple to implement and suits situations where collisions are frequent. Its disadvantage is that extra memory is needed to store the list pointers, and traversing the list may be slow when there are many collisions.

For example, using the division-with-remainder method, we insert the sequence {1, 6, 15, 60, 88, 7, 40, 5, 10} into a hash table of length 10. When a hash collision occurs, open hashing links elements with the same hash address into the same hash bucket.
Closed hashing resolves collisions in a vengeful way: "My position is occupied, so I'll just go take another spot." Open hashing takes an optimistic approach: "Although my position is occupied, it doesn't matter; I can 'hang' under this position."

Unlike closed hashing, this approach — linking elements with the same hash address into a singly linked list and storing the list's head node in the table — does not affect the efficiency of insertion, deletion, lookup, or update for elements with different hash addresses, so the load factor of open hashing can be somewhat larger than that of closed hashing.

  • For the open addressing method of closed hashing, the load factor cannot exceed 1, and it is generally recommended to control it between [0.0, 0.7].
  • For open hash buckets, the load factor can exceed 1, and it is generally recommended to control it between [0.0, 1.0].
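The two load-factor thresholds can be sketched as predicates (function names are mine; the 0.7 check uses integer arithmetic to avoid floating point):

```cpp
#include <cassert>
#include <cstddef>

// Closed hashing: expand once load factor n/size reaches 0.7 (or the table is empty).
bool needs_expand_closed(std::size_t n, std::size_t size)
{
    return size == 0 || n * 10 / size >= 7;
}

// Open hashing (buckets): expand once load factor reaches 1.
bool needs_expand_open(std::size_t n, std::size_t size)
{
    return size == 0 || n >= size;
}
```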

In practice, the hash bucket structure of open hashing is more practical than closed hashing for two main reasons.

  • The load factor of the hash bucket can be larger and the space utilization is high.
  • Hash buckets also have solutions available in extreme cases.

The extreme case for a hash bucket is that all elements collide and end up in the same bucket; the efficiency of insertion, deletion, lookup, and update then degrades to O(N).


Closed hashing implementation

Closed hash structure

We use an enumeration to represent the current state of each position:

//Enumeration: marks the state of each position
enum State
{
	EMPTY,  // the slot is empty
	EXIST,  // the slot holds a value
	DELETE  // the slot's value was deleted
};

Reasons for tracking the current state:

  • When we insert, search, and delete, we can determine the status of the current position quickly and reliably.
    • When inserting, we keep probing until we reach an empty or deleted position, and store the element there.
    • When searching, we stop as soon as we hit an empty position: an empty slot means the value cannot appear later in the probe sequence, so it is not in the table.
    • After deleting data, we set the position to the deleted state so that subsequent searches and insertions still behave correctly.

Therefore, when we define the structure, each position must be initialized to the empty state, and we also record the number of valid elements so the table does not become overloaded.

//Structure stored at each position of the hash table
template<class K, class V>
struct HashData
{
	pair<K, V> _kv;
	State _state = EMPTY; // state of this slot
};

Create a class to hold the various operations:

//Hash table
template<class K, class V>
class HashTable
{
public:
	//...
private:
	vector<HashData<K, V>> _tables; // the hash table
	size_t _n = 0; // number of valid elements in the table
};
Closed hash structure insertion

When inserting, we must pay attention to expansion. Here the load factor is generally kept between 0 and 0.7; beyond that range, the table must be expanded.
We do not expand in place: we first create a new table of twice the original size, remap all the data into it, and finally swap the two tables.

bool Insert(const pair<K, V>& kv)
		{
			if (Find(kv.first))
				return false;

			// expand when the load factor exceeds 0.7
			//if ((double)_n / (double)_tables.size() >= 0.7)
			if (_tables.size() == 0 || _n * 10 / _tables.size() >= 7)
			{
				size_t newsize = _tables.size() == 0 ? 10 : _tables.size() * 2;
				HashTable<K, V> newht;
				newht._tables.resize(newsize);

				// traverse the old table and remap every element into the new one
				for (auto& data : _tables)
				{
					if (data._state == EXIST)
					{
						newht.Insert(data._kv);
					}
				}

				_tables.swap(newht._tables);
			}

			size_t hashi = kv.first % _tables.size();

			// linear probing
			size_t i = 1;
			size_t index = hashi;
			while (_tables[index]._state == EXIST)
			{
				index = hashi + i;
				index %= _tables.size();
				++i;
			}

			_tables[index]._kv = kv;
			_tables[index]._state = EXIST;
			_n++;

			return true;
		}
Closed hash search

To search, we first use the hash function to compute the slot for the key, then traverse from that slot comparing elements.
We keep probing while the current state is EXIST or DELETE; an EMPTY slot means the element we are looking for does not exist.

HashData<K, V>* Find(const K& key)
		{
			if (_tables.size() == 0)
			{
				return nullptr;
			}

			size_t hashi = key % _tables.size();

			// linear probing
			size_t i = 1;
			size_t index = hashi;
			while (_tables[index]._state != EMPTY)
			{
				if (_tables[index]._state == EXIST
					&& _tables[index]._kv.first == key)
				{
					return &_tables[index];
				}

				index = hashi + i;
				index %= _tables.size();
				++i;

				// if we have gone all the way around, every slot is EXIST or DELETE
				if (index == hashi)
				{
					break;
				}
			}

			return nullptr;
		}
Closed hash delete

Deleting an element from the hash table is very simple: we only perform a pseudo-deletion, that is, set the state of the element to be deleted to DELETE.

The steps to delete data in the hash table are as follows:

  • Check whether the key-value pair of this key exists in the hash table. If it does not exist, the deletion will fail.
  • If it exists, just change the status of the key-value pair to DELETE.
  • The number of valid elements in the hash table is reduced by one.

Note: although deleting an element does not clear the data at that position but only sets the element's state to DELETE, this does not waste space: when inserting, we may insert into a position whose state is DELETE, and the new data simply overwrites the old.

bool Erase(const K& key)
		{
			HashData<K, V>* ret = Find(key);
			if (ret)
			{
				ret->_state = DELETE;
				--_n;
				return true;
			}
			else
			{
				return false;
			}
		}

Open hashing implementation (linked list)

Open hash structure

In an open-hashed table, each position actually stores the head node of a singly linked list; that is, each hash bucket holds nodes. In addition to the data itself, each node stores a pointer to the next node.

//Node stored in each hash bucket
template<class K, class V>
struct HashNode
{
	pair<K, V> _kv;
	HashNode<K, V>* _next;

	//constructor
	HashNode(const pair<K, V>& kv)
		:_kv(kv)
		, _next(nullptr)
	{}
};

Unlike the closed-hash table, the open-hash table does not need a state field for each position: elements with the same hash address go into the same bucket, and there is no need to search for a "next position".

The open-hash implementation must also decide, based on the load factor, whether to expand when inserting data, so we again track the number of valid elements in the whole table; when the load factor grows too large, we expand the hash table.

//Hash table
template<class K, class V>
class HashTable
{
	typedef HashNode<K, V> Node;
public:
	//...
private:
	vector<Node*> _tables; // the hash table
	size_t _n = 0; // number of valid elements in the table
};
Open hash structure insertion
  • If the load factor of the hash table has already reached 1, first create a new hash table of twice the original size, then traverse the original table and insert its data into the new table, and finally swap the original table with the new one.
  • Important: when moving the original table's data into the new table, do not reuse the insertion function, because that would create duplicate nodes and then require releasing the nodes of the original table afterwards, which is unnecessary.

In fact, we only need to traverse each bucket of the original table and, using the hash function, relink each node into its position in the new table; no node needs to be created or released.

bool Insert(const pair<K, V>& kv)
		{
			if (Find(kv.first))
			{
				return false;
			}

			Hash hash;

			// expand when the load factor reaches 1
			if (_n == _tables.size())
			{
				size_t newsize = _tables.size() == 0 ? 10 : _tables.size() * 2;
				vector<Node*> newtables(newsize, nullptr);
				//for (Node*& cur : _tables)
				for (auto& cur : _tables)
				{
					while (cur)
					{
						Node* next = cur->_next;

						size_t hashi = hash(cur->_kv.first) % newtables.size();

						// head-insert into the new table
						cur->_next = newtables[hashi];
						newtables[hashi] = cur;

						cur = next;
					}
				}

				_tables.swap(newtables);
			}

			size_t hashi = hash(kv.first) % _tables.size();
			// head insertion
			Node* newnode = new Node(kv);
			newnode->_next = _tables[hashi];
			_tables[hashi] = newnode;

			++_n;
			return true;
		}
Open hash structure search

The steps to find data in a hash table are as follows:

  • First determine whether the size of the hash table is 0. If it is 0, the search fails.
  • The corresponding hash address is calculated through the hash function.
  • Find the singly linked list in the corresponding hash bucket through the hash address, and then traverse the singly linked list to search.
		Node* Find(const K& key)
		{
			if (_tables.size() == 0)
				return nullptr;

			Hash hash;
			size_t hashi = hash(key) % _tables.size();
			Node* cur = _tables[hashi];
			while (cur)
			{
				if (cur->_kv.first == key)
				{
					return cur;
				}

				cur = cur->_next;
			}

			return nullptr;
		}
Open hash structure delete

The steps to delete data in the hash table are as follows:

  • The corresponding hash bucket number is calculated through the hash function.
  • Traverse the corresponding hash bucket to find the node to be deleted.
  • If the node to be deleted is found, the node is removed from the singly linked list and released.
  • After deleting the node, reduce the number of valid elements in the hash table by one.
		bool Erase(const K& key)
		{
			if (_tables.size() == 0)
				return false;

			Hash hash;
			size_t hashi = hash(key) % _tables.size();
			Node* prev = nullptr;
			Node* cur = _tables[hashi];
			while (cur)
			{
				if (cur->_kv.first == key)
				{
					if (prev == nullptr)
					{
						_tables[hashi] = cur->_next;
					}
					else
					{
						prev->_next = cur->_next;
					}
					delete cur;
					--_n;

					return true;
				}
				else
				{
					prev = cur;
					cur = cur->_next;
				}
			}

			return false;
		}


Origin blog.csdn.net/wh9109/article/details/132894280