Hash table (how to build an industrial-grade hash table)

Table of contents

  • Hash ideas
  • Hash function
  • Hash collision
      1. Open addressing method
      2. Linked list method
  • Solving the problem of an overly large load factor
  • Choosing an Appropriate Hash Collision Resolution Method


Hash ideas


A hash table (hashtable) is an extension of the array: it evolves from the array, and the underlying array is what gives it fast access to elements by index. In other words, without arrays there would be no hash tables. Let's explain with an example.

Suppose 69 players take part in the school sports meet. To make it easy to record results, each contestant has an entry number pinned to their chest, and the 69 players are numbered 1 to 69 in turn. Now we want to implement this functionality in code: quickly find a player's information by entry number. How can we do it?
We can put the information of these 69 players into an array. The information of the player numbered 1 goes into the array position with subscript 1, the information of the player numbered 2 goes into the position with subscript 2, and so on: the information of the player numbered k goes into the position with subscript k.
The entry number and the array subscript correspond one to one. To query the information of the player with entry number x, we simply take out the array element with subscript x. The time complexity is O(1), which is very efficient.
In fact, this example already uses the hash idea. Here the entry number is a natural number and forms a one-to-one mapping with the array subscript, so by exploiting the fact that accessing an array element by subscript takes O(1) time, we can find a player's information quickly.
However, the hash idea in this example is not obvious enough, so let's modify the example slightly.
Suppose the principal asks for richer entry numbers that also encode grade and class, so we change the numbering rule to 6 digits, for example 011239: the first two digits (01) are the grade, the middle two digits (12) are the class, and the last two digits are still the original number from 1 to 69. Now how should we store player information so that we can still look it up quickly by number?

The solution is similar to the previous one. Although the entry number can no longer be used directly as the array subscript, we can take its last two digits as the subscript. When querying player information by entry number, we apply the same rule: take the last two digits of the number as the array subscript and fetch the player information stored at that subscript.
This is typical hash thinking. The contestant's number is called the key (or keyword); we use the key to identify a player. The mapping that converts an entry number into an array subscript is called the hash function, and the value it computes is called the hash value.

A hash table exploits the fact that accessing an array element by subscript takes O(1) time. We map an element's key to a subscript through a hash function and store the element at that subscript in the array. When querying by key, we apply the same hash function to turn the key into a subscript and fetch the data from that position in the array.
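
As a concrete illustration, here is a minimal sketch of that idea, assuming integer entry numbers whose last two digits range over 1~69 (the struct and field names here are made up for the example, not from the original code):

#include <string>
#include <vector>

struct PlayerInfo
{
	std::string name;
	int grade;
	int cls;
};

// Hash function: map a 6-digit entry number such as 011239 to an array
// subscript by keeping only its last two digits.
size_t Hash(int entry_number)
{
	return entry_number % 100;
}

int main()
{
	std::vector<PlayerInfo> table(100);          // slots 0..99
	table[Hash(11239)] = { "Zhang San", 1, 12 }; // store by hashed key
	PlayerInfo p = table[Hash(11239)];           // O(1) lookup with the same hash
	(void)p;                                     // p.name == "Zhang San"
	return 0;
}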

Hash function

Hash functions play a key role in hash tables. A hash function is, first of all, a function: we can write it as hash(key), where key is the key of an element and hash(key) is the hash value computed for that key.

In the example above, the hash function is simple and the code is easy to write. But if a contestant's number were a randomly generated 6-digit number, or a string of characters a~z, how would we construct the hash function? Below are three basic requirements for hash function design:

  • 1) the hash value computed by the hash function is a non-negative integer;
  • 2) if key1 == key2, then hash(key1) == hash(key2);
  • 3) if key1 ≠ key2, then hash(key1) ≠ hash(key2).

The first requirement is easy to understand: array subscripts start from 0, so the hash value produced by the hash function must be a non-negative integer. The second is also natural: the same key must always hash to the same value. The third requirement is harder. It sounds reasonable, but in practice it is almost impossible to find a hash function that never collides, that is, one where different keys always produce different hash values. Even well-known hash algorithms such as MD5, SHA, and CRC cannot completely avoid hash collisions. Moreover, because the array's storage space is limited, the probability of hash collisions increases further.

For example, take the data set {11, 8, 6, 14, 5, 9}.

Hash function: hash(key) = key % capacity, where capacity is the total size of the underlying storage array.
If we store this set in a hash table with capacity 10, each element lands in the slot given by key % 10: 11 → slot 1, 8 → slot 8, 6 → slot 6, 14 → slot 4, 5 → slot 5, 9 → slot 9.

With this storage scheme, a lookup only needs to apply the hash function to locate the candidate slot; there is no need to compare keys many times, so searching is fast.
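
Here is a tiny sketch of that placement, written as throwaway C++ (the table layout is purely illustrative):

#include <cstdio>
#include <vector>

int main()
{
	const int capacity = 10;
	std::vector<int> table(capacity, -1);   // -1 marks an empty slot

	for (int key : {11, 8, 6, 14, 5, 9})
	{
		int index = key % capacity;         // hash(key) = key % capacity
		table[index] = key;                 // 11->1, 8->8, 6->6, 14->4, 5->5, 9->9
	}

	for (int i = 0; i < capacity; ++i)
		printf("slot %d: %d\n", i, table[i]);

	return 0;
}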

Following the hash method above, what problem occurs if we insert element 44 into the set?

Hash collision

Even the best hash function cannot avoid hash collisions. So how do we resolve them? Two methods are commonly used: the open addressing method and the linked list (chaining) method.

1. Open addressing method

The core idea of the open addressing method: once a hash collision occurs, resolve it by probing for a new position. How do we probe for a new position?
The simplest probing method is linear probing.

For example, in the scenario above, suppose we now need to insert element 44. We first compute the hash address: hashAddr is 4, so 44 should in theory be placed at subscript 4. But element 14 already occupies that position, so a hash collision occurs.

Insertion: when inserting data into the hash table, if the storage position computed by the hash function is already occupied, we start from that position and search forward through the array until we find a free slot.

Deletion: hash tables support not only insertion and lookup but also deletion. For a hash table that resolves collisions with linear probing, deletion is a bit special: we cannot simply set the deleted element's slot back to empty. When searching, linear probing stops as soon as it reaches an empty slot and concludes that the key is not in the table. If a slot along a probe path is later emptied by deletion, the original search algorithm breaks: data that is actually in the table may be judged absent. For example, if we delete element 14 by simply clearing its slot, a later lookup of 44 would stop at that empty slot and fail. So linear probing deletes elements with a marked "pseudo-deletion": the slot of the deleted element is specially marked as "deleted". During a linear-probing search we do not stop at a slot marked "deleted"; we keep probing.

// Each slot in the hash table carries a state marker:
// EMPTY means the slot is empty, EXIST means it holds an element,
// and DELETE means its element has been (pseudo-)deleted.
enum State
{
	EMPTY,
	EXIST,
	DELETE
};

With linear probing, as more and more data is inserted into the hash table, there are fewer and fewer free slots, the probability of collisions rises, and probing takes longer and longer. In the extreme case the entire table may have to be probed to find a free slot, so the worst-case insertion time is O(n). Likewise, deletion and lookup may also end up probing the whole table.

Besides linear probing, open addressing has two other classical probing methods: quadratic probing and double hashing.

Quadratic probing is very similar to linear probing. Linear probing uses a step of 1, so its probe sequence is hash(key)+0, hash(key)+1, hash(key)+2, ... Quadratic probing squares the step, so its probe sequence is hash(key)+0^2, hash(key)+1^2, hash(key)+2^2, ...
Double hashing uses multiple hash functions hash1(key), hash2(key), hash3(key), ... If the position computed by the first hash function is already occupied, the second hash function is used to compute a new position, and so on, until a free slot is found.
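
To make the probe sequences concrete, here is a small sketch of how the i-th probe position could be computed for each strategy. The helpers hash1 and hash2 are illustrative assumptions, and the double-hashing variant shown uses the common hash1(key) + i * hash2(key) formulation, which is one way of realizing the idea described above (keys are assumed non-negative):

#include <cstddef>

size_t hash1(int key, size_t cap) { return key % cap; }
size_t hash2(int key, size_t cap) { return 1 + key % (cap - 1); }   // step must be non-zero

// i-th probe position for each strategy, reduced modulo the table size.
size_t LinearProbe(int key, size_t i, size_t cap)     { return (hash1(key, cap) + i) % cap; }
size_t QuadraticProbe(int key, size_t i, size_t cap)  { return (hash1(key, cap) + i * i) % cap; }
size_t DoubleHashProbe(int key, size_t i, size_t cap) { return (hash1(key, cap) + i * hash2(key, cap)) % cap; }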

It can be shown that when the table length is a prime number and the load factor α (load factor = number of elements in the hash table / number of slots) does not exceed 0.5, a new entry can always be inserted without probing any position twice. Therefore, as long as at least half of the slots are empty, the table never "fills up". The fullness of the table can be ignored during lookups, but during insertion we must ensure that α stays at or below 0.5; if it would exceed 0.5, the table must be enlarged.

Therefore, the biggest drawback of the open addressing method is its relatively low space utilization, which is also a typical drawback of hashing.

code:

// Open-addressing (linear probing) hash table. This version assumes the key
// type K supports the modulo operator, i.e. integral keys.
#include <vector>
#include <utility>
using namespace std;

enum State
	{
		EMPTY,
		EXIST,
		DELETE
	};

	template<class K, class V>
	struct HashData
	{
		pair<K, V> _kv;
		State _state = EMPTY;
	};

	template<class K, class V>
	class HashTable
	{
	public:
		bool Insert(const pair<K, V>& kv)
		{
			if (Find(kv.first))
				return false;

			// Expand when the load factor exceeds 0.7
			if (_tables.size() == 0 || _n * 10 / _tables.size() >= 7)
			{
				size_t newsize = _tables.size() == 0 ? 10 : _tables.size() * 2;
				HashTable<K, V> newht;
				newht._tables.resize(newsize);

				// Walk the old table and re-map each element into the new table
				for (auto& data : _tables)
				{
					if (data._state == EXIST)
					{
						newht.Insert(data._kv);
					}
				}

				_tables.swap(newht._tables);
			}

			size_t hashi = kv.first % _tables.size();

			// Linear probing
			size_t i = 1;
			size_t index = hashi;
			while (_tables[index]._state == EXIST)
			{
				index = hashi + i;
				index %= _tables.size();
				++i;
			}

			_tables[index]._kv = kv;
			_tables[index]._state = EXIST;
			_n++;

			return true;
		}

		HashData<K, V>* Find(const K& key)
		{
			if (_tables.size() == 0)
			{
				return nullptr;
			}

			size_t hashi = key % _tables.size();

			// Linear probing
			size_t i = 1;
			size_t index = hashi;
			while (_tables[index]._state != EMPTY)
			{
				if (_tables[index]._state == EXIST
					&& _tables[index]._kv.first == key)
				{
					return &_tables[index];
				}

				index = hashi + i;
				index %= _tables.size();
				++i;

				// If we have probed a full circle, every slot is EXIST or DELETE
				if (index == hashi)
				{
					break;
				}
			}

			return nullptr;
		}

		bool Erase(const K& key)
		{
			HashData<K, V>* ret = Find(key);
			if (ret)
			{
				ret->_state = DELETE;
				--_n;
				return true;
			}
			else
			{
				return false;
			}
		}

	private:
		vector<HashData<K, V>> _tables;
		size_t _n = 0; // number of stored elements

	};
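
A possible way to exercise the table above (a hypothetical main function, assuming the class is visible in the same file):

#include <iostream>

int main()
{
	HashTable<int, int> ht;
	int keys[] = { 11, 8, 6, 14, 5, 9, 44 };    // 14 and 44 collide; 44 is probed forward
	for (int k : keys)
		ht.Insert({ k, k * 10 });

	if (HashData<int, int>* ret = ht.Find(44))
		std::cout << "44 -> " << ret->_kv.second << std::endl;   // prints 440

	ht.Erase(14);                               // marks the slot DELETE; probing passes over it
	std::cout << (ht.Find(14) ? "14 found" : "14 gone") << std::endl;
	return 0;
}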

2. Linked list method

The linked list method (chaining) is the more commonly used way to resolve hash collisions, and it is much simpler than open addressing. In the hash table, each "bucket" or "slot" corresponds to a linked list, and all elements with the same hash value go into the linked list of the same slot.
When inserting data, we compute the corresponding slot with the hash function and insert the data into that slot's linked list, so insertion takes O(1) time. To find or delete data, we likewise compute the slot with the hash function and then traverse its linked list to find or delete the element.

For a hash table that resolves collisions with chaining, the time complexity of lookup and deletion is proportional to the length k of the linked list, i.e. O(k). For a hash function that distributes keys fairly uniformly, in theory k = n/m, where n is the number of elements in the hash table and m is the number of slots. When k is a small constant, we can roughly treat lookup and deletion in the hash table as O(1).

We call this k the load factor. Expressed as a formula: load factor = number of elements in the hash table / length of the hash table (number of slots). The larger the load factor, the longer the linked lists, and the worse the hash table performs.

Solving the problem of an overly large load factor

The larger the load factor, the more elements in the hash table, the fewer free positions, the higher the probability of collisions, and the worse insertion, deletion, and lookup perform. For a static data set with no frequent insertions or deletions, the data is known in advance, so we can design a good, low-collision hash function based on its characteristics. But for a dynamic data set with frequent insertions and deletions, we cannot allocate a sufficiently large hash table up front, because the amount of data cannot be predicted. As more and more data is added, the load factor grows; once it becomes large enough, the flood of hash collisions makes the table's performance drop sharply. What do we do then?
For a hash table, when the load factor gets too large we can expand dynamically: allocate a larger hash table and move the data from the original table into it. Suppose every expansion allocates a new table twice the size of the old one. If the old table's load factor is 0.8, then after expansion the new table's load factor drops to half of that, 0.4. For arrays, stacks, and queues, expansion only needs a simple data copy. For a hash table, however, moving the data is much more involved: because the table size changes, the storage position of every element changes too, so we must recompute each element's position in the new table with the hash function. In most cases inserting new data does not trigger expansion, so the best-case insertion time is O(1). If the load factor exceeds the preset threshold, inserting new data triggers expansion: memory must be reallocated, the hash of every element recomputed, and the data moved from the old table to the new one, so the worst-case insertion time is O(n).

For a hash table that supports dynamic expansion, insertion is very fast in the vast majority of cases. But in the special case where the load factor has just reached the threshold, the table must be expanded before the data can be inserted, and that insertion becomes very slow, possibly unacceptably so.
Consider an extreme example. If the hash table currently holds 1 GB of data, starting an expansion means recomputing the hash of that 1 GB of data and moving it to the new table, which clearly takes a long time. If our code serves users directly and response-time requirements are strict, then even though most insertions are fast, the rare extremely slow insertion will still make users "freeze". In that situation this all-at-once expansion mechanism is not appropriate.
To avoid the cost of all-at-once expansion, we can interleave the expansion with subsequent insert operations and complete it in batches. When the load factor reaches the threshold, we only allocate the new hash table; we do not move any of the old table's data yet.
When new data arrives, besides inserting it into the new hash table, we also move one piece of data from the old table to the new one. We repeat this on every subsequent insertion. After many insertions, the data in the old table has been moved to the new table bit by bit. By spreading the data movement over many insert operations, we avoid any single large-scale move, and every insertion stays fast.

With this approach, the cost of expansion is amortized over many insertions, avoiding the problem of a single very slow expansion. Under this scheme, the time complexity of insertion is O(1) in every case.
However, until all the data in the old hash table has been moved to the new one, the memory occupied by the old table cannot be freed, so memory usage is higher. Also, queries have to cope with both tables: we look the key up in the new table and, if it is not there, search the old one as well.
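
Below is a minimal sketch of this batched migration, using integer keys and chained buckets; the class and member names are made up for illustration (Redis's dictionary uses a similar incremental-rehashing idea):

#include <list>
#include <vector>

class IncrementalHashTable
{
public:
	IncrementalHashTable() : _old(8) {}

	void Insert(int key)
	{
		// When the load factor crosses the threshold, only allocate the new
		// table; do not move everything at once.
		if (_new.empty() && _count >= _old.size())
			_new.resize(_old.size() * 2);

		Active()[key % Active().size()].push_back(key);
		++_count;
		MoveOneBucket();                        // migrate one old bucket per insertion
	}

	bool Find(int key) const
	{
		// While the migration is in progress, a key may live in either table.
		for (const auto* t : { &_new, &_old })
		{
			if (t->empty())
				continue;
			for (int k : (*t)[key % t->size()])
				if (k == key)
					return true;
		}
		return false;
	}

private:
	std::vector<std::list<int>>& Active() { return _new.empty() ? _old : _new; }

	void MoveOneBucket()
	{
		if (_new.empty())
			return;
		if (_moved < _old.size())
		{
			for (int k : _old[_moved])
				_new[k % _new.size()].push_back(k);
			_old[_moved].clear();
			++_moved;
		}
		if (_moved == _old.size())              // migration finished: retire the old table
		{
			_old.swap(_new);
			_new.clear();
			_moved = 0;
		}
	}

	std::vector<std::list<int>> _old;           // table being migrated away from
	std::vector<std::list<int>> _new;           // stays empty until an expansion starts
	size_t _moved = 0;                          // next old bucket to migrate
	size_t _count = 0;                          // number of stored keys
};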

code:

// Chaining (linked list) hash table.
#include <vector>
#include <string>
#include <utility>
using namespace std;

template<class K, class V>
	struct HashNode
	{
		HashNode<K, V>* _next;
		pair<K, V> _kv;

		HashNode(const pair<K, V>& kv)
			:_next(nullptr)
			, _kv(kv)
		{}
	};

	template<class K>
	struct HashFunc
	{
		size_t operator()(const K& key)
		{
			return key;
		}
	};

	// Specialization of HashFunc for string keys
	template<>
	struct HashFunc<string>
	{
		// BKDR string hash
		size_t operator()(const string& s)
		{
			size_t hash = 0;
			for (auto ch : s)
			{
				hash += ch;
				hash *= 31;
			}

			return hash;
		}
	};

	template<class K, class V, class Hash = HashFunc<K>>
	class HashTable
	{
		typedef HashNode<K, V> Node;
	public:
		~HashTable()
		{
			for (auto& cur : _tables)
			{
				while (cur)
				{
					Node* next = cur->_next;
					delete cur;
					cur = next;
				}

				cur = nullptr;
			}
		}

		Node* Find(const K& key)
		{
			if (_tables.size() == 0)
				return nullptr;

			Hash hash;
			size_t hashi = hash(key) % _tables.size();
			Node* cur = _tables[hashi];
			while (cur)
			{
				if (cur->_kv.first == key)
				{
					return cur;
				}

				cur = cur->_next;
			}

			return nullptr;
		}

		bool Erase(const K& key)
		{
			Hash hash;
			size_t hashi = hash(key) % _tables.size();
			Node* prev = nullptr;
			Node* cur = _tables[hashi];
			while (cur)
			{
				if (cur->_kv.first == key)
				{
					if (prev == nullptr)
					{
						_tables[hashi] = cur->_next;
					}
					else
					{
						prev->_next = cur->_next;
					}
					delete cur;
					--_n;

					return true;
				}
				else
				{
					prev = cur;
					cur = cur->_next;
				}
			}

			return false;
		}
		size_t GetNextPrime(size_t prime)
		{
			// SGI
			static const int __stl_num_primes = 28;
			static const unsigned long __stl_prime_list[__stl_num_primes] =
			{
				53, 97, 193, 389, 769,
				1543, 3079, 6151, 12289, 24593,
				49157, 98317, 196613, 393241, 786433,
				1572869, 3145739, 6291469, 12582917, 25165843,
				50331653, 100663319, 201326611, 402653189, 805306457,
				1610612741, 3221225473, 4294967291
			};

			size_t i = 0;
			for (; i < __stl_num_primes; ++i)
			{
				if (__stl_prime_list[i] > prime)
					return __stl_prime_list[i];
			}

			// prime already exceeds the largest entry; return the largest prime
			return __stl_prime_list[__stl_num_primes - 1];
		}

		bool Insert(const pair<K, V>& kv)
		{
			if (Find(kv.first))
			{
				return false;
			}

			Hash hash;

			// Expand when the load factor reaches 1
			if (_n == _tables.size())
			{
				size_t newsize = GetNextPrime(_tables.size());
				vector<Node*> newtables(newsize, nullptr);
				for (auto& cur : _tables)
				{
					while (cur)
					{
						Node* next = cur->_next;

						size_t hashi = hash(cur->_kv.first) % newtables.size();

						// Head-insert into the new table
						cur->_next = newtables[hashi];
						newtables[hashi] = cur;

						cur = next;
					}
				}

				_tables.swap(newtables);
			}

			size_t hashi = hash(kv.first) % _tables.size();
		// Head insertion into the bucket
			Node* newnode = new Node(kv);
			newnode->_next = _tables[hashi];
			_tables[hashi] = newnode;

			++_n;
			return true;
		}

		size_t MaxBucketSize()
		{
			size_t max = 0;
			for (size_t i = 0; i < _tables.size(); ++i)
			{
				auto cur = _tables[i];
				size_t size = 0;
				while (cur)
				{
					++size;
					cur = cur->_next;
				}

				if (size > max)
				{
					max = size;
				}
			}

			return max;
		}
	private:
		vector<Node*> _tables; // array of bucket head pointers
		size_t _n = 0; // number of stored elements
	};
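
A possible way to exercise the chaining table above with string keys (a hypothetical main function, assuming the class is visible in the same file):

#include <iostream>
#include <string>

int main()
{
	HashTable<std::string, int> dict;   // picks up the HashFunc<string> (BKDR) specialization
	dict.Insert({ "apple", 1 });
	dict.Insert({ "banana", 2 });
	dict.Insert({ "apple", 3 });        // duplicate key: Insert returns false

	if (auto* node = dict.Find("banana"))
		std::cout << "banana -> " << node->_kv.second << std::endl;

	dict.Erase("apple");
	std::cout << (dict.Find("apple") ? "apple found" : "apple gone") << std::endl;
	return 0;
}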

Choosing an Appropriate Hash Collision Resolution Method

Open addressing method
Advantages:
In a hash table that resolves collisions with open addressing, all data lives in a single array, which makes effective use of the CPU cache and speeds up queries. Compared with a chaining-based hash table, it involves no linked lists and no pointers, which also makes serialization easier.
Disadvantages:
Deleting data from an open-addressing hash table is more troublesome, because deleted entries must be specially marked. Moreover, since all data is stored in one array, the probability of collisions is higher than with chaining, so the load factor cannot be too large and must stay below 1, whereas with chaining the load factor can exceed 1. As a result, to store the same amount of data, open addressing needs more memory than chaining.
To sum up, open addressing is suitable when the amount of data is small and the load factor is low.
Linked list method
In a hash table that resolves collisions with chaining, data is stored in linked lists; with open addressing, data is stored in the array. Linked list nodes can be created on demand, while the array must be allocated in advance, so chaining uses memory more efficiently than open addressing.
Chaining also tolerates large load factors better than open addressing. Open addressing only works when the load factor is below 1; as it approaches 1, the many collisions cause long probe sequences, rehashing, and a sharp drop in performance. With chaining, as long as the hash function distributes keys reasonably uniformly, even a load factor of 10 only makes the lists somewhat longer, and performance degrades only slightly. However, each list node must store a next pointer, which costs extra memory; for small objects this can even double the memory consumption. Moreover, list nodes are scattered across memory rather than contiguous, which is unfriendly to the CPU cache and also hurts hash table performance somewhat.
Of course, if we store large objects, that is, objects much larger than a pointer (4 or 8 bytes), the memory overhead of the pointers in the list is negligible.
In fact, we can replace the linked list in the chaining method with a more efficient data structure, such as a red-black tree. Then even in the extreme case where all data hashes into the same "bucket", the hash table only degenerates into a red-black tree, and query efficiency is still O(log n). This effectively blunts the impact of hash collisions.
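
A rough sketch of such "tree-ified" buckets, using std::map as a stand-in for a hand-written red-black tree (the class name and layout are illustrative only, not a real library API):

#include <functional>
#include <map>
#include <vector>

template<class K, class V, class Hash = std::hash<K>>
class TreeBucketHashTable
{
public:
	explicit TreeBucketHashTable(size_t buckets = 53) : _buckets(buckets) {}

	void Insert(const K& key, const V& value)
	{
		_buckets[Hash{}(key) % _buckets.size()][key] = value;
	}

	const V* Find(const K& key) const
	{
		const auto& bucket = _buckets[Hash{}(key) % _buckets.size()];
		auto it = bucket.find(key);
		return it == bucket.end() ? nullptr : &it->second;
	}

private:
	// Each bucket is a balanced search tree, so even a fully colliding
	// bucket still answers lookups in O(log n).
	std::vector<std::map<K, V>> _buckets;
};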

The query efficiency of a hash table cannot simply be assumed to be O(1), because it depends on the hash function, the load factor, and hash collisions. If the hash function is poorly designed or the load factor is too high, the probability of collisions rises and query efficiency drops.
In the extreme case, a malicious attacker can craft data so that all of it hashes into the same "slot". If collisions are resolved with chaining, the hash table then degenerates into a linked list, and query time degrades drastically from O(1) to O(n).
If the hash table holds 100,000 entries, each query on the degenerated table takes 100,000 times as long as before. For example, 100 queries that used to take 0.1 s now take 10,000 s. Such queries can consume so much CPU or so many threads that the system can no longer respond to other requests, which lets the attacker achieve a denial-of-service (DoS). This is the basic principle of a hash table collision attack.

When we design a hash table, the hash table should have the following characteristics:

  • Support fast query, insert, and delete operations;
  • Use memory reasonably, without wasting too much space;
  • Perform stably, so that even in extreme cases performance does not degrade to an unacceptable level.

To achieve this, we need to design a good hash function, set a reasonable load factor threshold, design a dynamic expansion strategy, and choose an appropriate hash collision resolution method.

Origin blog.csdn.net/m0_55752775/article/details/129434945