[C++] The underlying implementation of the "strongest search" hash table

The time complexity of hash table lookup is O(1)~

Article directory

  • Foreword
  • 1. Hash collisions and hash functions
  • 2. The underlying implementation of the hash table
    • 1. Open address method
    • 2. Chain address method
  • Summary


Foreword

Hash concept:

In sequential structures and balanced trees, there is no mapping between an element's key and its storage location, so finding an element requires multiple comparisons of keys. Sequential search has O(N) time complexity, and in a balanced tree the search is bounded by the height of the tree, O(logN); in both cases, search efficiency depends on the number of comparisons performed.

The ideal search method would obtain the element directly from the table in one step, without any comparisons. If we construct a storage structure in which some function (hashFunc) establishes a one-to-one mapping between an element's storage location and its key, then the element can be found quickly through this function. With such a structure:

Insert: compute the element's storage location from its key using the function, and store the element at that location.
Search: perform the same computation on the element's key, treat the resulting value as the element's storage location, and compare the element at that position in the structure; if the keys are equal, the search succeeds.

This method is called the hash method; the conversion function used in it is called the hash function, and the structure built this way is called a hash table.
For example, given the data set {1, 7, 6, 4, 5, 9}, let the hash function be hash(key) = key % capacity, where capacity is the total size of the underlying storage space.

Searching with this method does not require comparing keys multiple times, so the search is relatively fast.


1. Hash collisions and hash functions

Hash collision:

Given the numbers inserted above, what happens if we insert 44? 44 % 10 == 4, but position 4 is already occupied: this is a hash collision.

When different keys produce the same hash address through the same hash function, the phenomenon is called a hash collision (also called a hash conflict).

Data elements with different keys but the same hash address are called "synonyms".

Hash function:

One cause of hash collisions is that the hash function is not designed well enough.
Hash function design principles:
1. The domain of the hash function must include all keys that need to be stored, and if the hash table allows m addresses, its range must lie between 0 and m-1.
2. The addresses computed by the hash function should be distributed evenly across the entire space.
3. The hash function should be relatively simple.
Common hash functions:
1. Direct addressing method -- (commonly used)
Take a linear function of the key as the hash address: Hash(Key) = A*Key + B
Advantage: simple and uniform.
Disadvantage: the distribution of keys must be known in advance.
Usage scenario: suitable when the range of keys is relatively small and continuous.
2. Remainder (division) method -- (commonly used)
Let m be the number of addresses allowed in the hash table, and take a prime number p that is no greater than m but closest to m as the divisor. The hash function Hash(key) = key % p (p <= m) converts the key into a hash address.
3. Mid-square method -- (understand)
Suppose the key is 1234; its square is 1522756, and the middle 3 digits 227 are extracted as the hash address.
As another example, if the key is 4321, its square is 18671041, and the middle 3 digits 671 (or 710) are extracted as the hash address.
The mid-square method is suitable when the distribution of keys is unknown and the number of digits is not very large.
4. Folding method -- (understand)
The folding method splits the key, from left to right, into several parts with an equal number of digits (the last part may be shorter), sums these parts, and, according to the length of the hash table, takes the last few digits of the sum as the hash address.
The folding method does not require knowing the distribution of keys in advance and is suitable when keys have many digits.
5. Random number method -- (understand)
Choose a random function and take its value on the key as the hash address, i.e. H(key) = random(key), where random is a random-number function.
This method is usually used when the lengths of the keys are unequal.
6. Digit analysis method -- (understand)
Suppose there are n d-digit numbers, where each digit position may take one of r different symbols. The frequencies with which these r symbols appear are not necessarily the same at every position: on some positions the symbols are evenly distributed, each appearing with equal probability, while on other positions the distribution is uneven and certain symbols appear more frequently. Depending on the size of the hash table, we can select the positions on which the symbols are evenly distributed to form the hash address. For example:
Suppose we want to store a company's employee registration form, using the mobile phone number as the key. The first 7 digits are very likely to be the same, so we can choose the last 4 digits as the hash address. If such an extraction still tends to produce conflicts, the extracted digits can be further transformed: reversed (for example, 1234 becomes 4321), rotated right (for example, 1234 becomes 4123), rotated left, or folded by superimposing the first two digits on the last two (for example, 1234 becomes 12+34=46), among other methods.

The digit analysis method is usually suitable when the number of keys is relatively large, the distribution of the keys is known in advance, and certain digit positions of the keys are evenly distributed.

Note: the more refined the design of the hash function, the lower the probability of hash collisions, but hash collisions cannot be avoided entirely.

Resolution of hash collisions:

Two common ways to resolve hash collisions are closed hashing and open hashing.
1. Closed hashing
Closed hashing, also known as the open addressing method: when a hash collision occurs, if the hash table is not full, there must be an empty position somewhere in the table, so the key can be stored in the "next" empty position after the conflicting one. How do we find that next empty slot?
1. Linear probing: starting from the position where the conflict occurs, probe backwards position by position until the next empty slot is found.

Deletion

When closed hashing is used to handle hash conflicts, an existing element in the table cannot simply be physically deleted: removing it directly would affect the search for other elements. For example, if element 4 is deleted directly, the search for 44 may be affected. Therefore, linear probing deletes an element by marking it (pseudo-deletion).

Linear probing advantages: very simple to implement.

Disadvantages of linear probing: once a hash conflict occurs, all the conflicting entries are connected together, which easily produces data "clustering": different keys occupy the empty positions that would otherwise be available, so many comparisons are needed to find the position of a given key, and search efficiency drops. How can this be alleviated? By using quadratic probing:

The defect of linear probing is that conflicting data accumulates together, which is related to how the next empty position is found: slots are probed one by one, adjacently. To avoid this problem, quadratic probing finds the next empty position as hashi = hash0 + i*i (or hashi = hash0 - i*i), where i = 1, 2, 3, … and hash0 is the position computed from the key by the hash function.
For the example above, inserting 44 causes a conflict, and quadratic probing resolves it as shown.
Research shows that when the length of the table is a prime number and the table's load factor a does not exceed 0.5, a new entry can always be inserted, and no position is probed twice. Therefore, as long as half of the positions in the table are empty, the table never fills up. The fullness of the table can be ignored when searching, but when inserting we must ensure that the load factor a does not exceed 0.5; if it does, the capacity must be increased.
Therefore, the biggest drawback of closed hashing is its relatively low space utilization, which is its inherent defect.
2. Open hashing
1. The concept of open hashing
The open hashing method is also called the chain address method (open chaining). First, a hash function is applied to the key set to compute hash addresses. Keys with the same address belong to the same sub-collection; each sub-collection is called a bucket, the elements in each bucket are linked through a singly linked list, and the head node of each list is stored in the hash table.

As can be seen from the figure above, each bucket in open hashing contains exactly the elements that collided at that address.

Comparison of open and closed hashing:

Using the chain address method to handle overflow requires adding a link pointer, which seems to increase storage overhead. In fact, since the open addressing method must maintain a large amount of free space to guarantee search efficiency (for example, quadratic probing requires a load factor a <= 0.7), and an entry occupies much more space than a pointer, the chain address method actually saves storage space compared with open addressing.

2. The underlying implementation of the hash table

1. Open address method

First, we put the code in a namespace to prevent naming conflicts later, and then use a struct to describe the data stored at each position. Here we take a kv structure as an example:

enum State
{
	EMPTY,
	DELETE,
	EXIST
};

template <class K, class V>
struct HashDate
{
	pair<K, V> _kv;
	State _state = EMPTY;
};

The enumeration we defined has three states: empty, deleted, and exists. Why use a state flag instead of actually deleting the data from the hash table? Because the open addressing method resolves conflicts by probing forward: if a position already holds a value, later insertions continue searching behind it, and if we physically removed a value from its position, searches for the numbers probed past it would break. When a HashDate is created, each position starts in the EMPTY state, because insertion and deletion will be driven by the state later.

template <class K, class V>
class HashTable
{
public:

private:
	vector<HashDate<K, V>> _tables;
	size_t _n = 0;   // records how many elements have been inserted
};

We use a vector as the body of the hash table, because vector's functionality is complete and implementing it ourselves would be more troublesome. The vector stores data of type HashDate (remember to supply the template parameters), and we use a separate variable to record how much data has been inserted into the table. We cannot use the vector's size() directly here, because slots in the DELETE state would also be counted.

bool insert(const pair<K, V>& kv)
{
	size_t hashi = kv.first % _tables.size();
	size_t i = 1;
	size_t index = hashi;
	while (_tables[index]._state == EXIST)
	{
		index = hashi + i;
		index %= _tables.size();
		++i;
	}
	_tables[index]._kv = kv;
	_tables[index]._state = EXIST;
	++_n;
	return true;
}

The above is the insertion code for the hash table; let's ignore expansion for now. Note that we must not compute the element's mapped position with % capacity(). Let's draw a picture as an example:

To use vector we must access elements with the [] operator, but this operator may only access indices below size(); exceeding size() triggers an error. For example, with an array where size() = 10 and capacity() = 20, we can access [5] but not [15]. So the mapped position must be computed with % size(). Next we check whether the mapped position already holds an element; if so, we probe backwards for an empty slot. The purpose of using a separate index variable is that switching to quadratic probing later becomes very simple. While probing backwards, to keep index from going out of bounds we take % of the table's actual size each time. After finding a position, we insert the key-value pair, change the state to EXIST, and increment the counter. Next, let's consider expansion. Before expanding, we need to know a concept:

bool insert(const pair<K, V>& kv)
{
	if (_tables.size() == 0 || _n * 10 / _tables.size() >= 7)
	{
		// expand
		size_t newsize = _tables.size() == 0 ? 10 : 2 * _tables.size();
		HashTable<K, V> newtable;
		newtable._tables.resize(newsize);
		for (auto& e : _tables)
		{
			if (e._state == EXIST)
			{
				newtable.insert(e._kv);
			}
		}
		_tables.swap(newtable._tables);
	}
	size_t hashi = kv.first % _tables.size();
	size_t i = 1;
	size_t index = hashi;
	while (_tables[index]._state == EXIST)
	{
		index = hashi + i;
		index %= _tables.size();
		++i;
	}
	_tables[index]._kv = kv;
	_tables[index]._state = EXIST;
	++_n;
	return true;
}

That is, we need to look at the load factor. The load factor is the number of elements actually inserted divided by the actual size of the table (remember, the actual size is size()). Because dividing two integers in the computer never produces decimals, we multiply the numerator by 10 first, so _n * 10 / size() >= 7 tests whether the load factor has reached 0.7; a cast to double would also work. To avoid dividing by 0, we check whether the table size is 0 or whether the load factor has reached 0.7. Each expansion doubles the space. Could we simply expand the original array in place? No, because the mapped positions change after expansion. For example, with size() = 10 the number 11 is placed at position 1, but after expanding to size() = 20, 11 should be placed at position 11. To handle this we create a new hash table object and resize the table inside it to the new size. Note that only resize() changes size(); reserve() only changes capacity(), and since we index with size(), size() must change. After allocating the space, we traverse the old table and insert every EXIST element into the new table (calling insert here will not trigger another expansion, because it is called on the new table, whose space is already large enough). Once the reinsertion is complete, we simply swap the original vector with the vector in the new table.

HashDate<K, V>* Find(const K& key)
{
	if (_tables.size() == 0)
	{
		return nullptr;
	}
	size_t hashi = key % _tables.size();
	size_t index = hashi;
	size_t i = 1;
	while (_tables[index]._state != EMPTY)
	{
		if (_tables[index]._state == EXIST
			&& _tables[index]._kv.first == key)
		{
			return &_tables[index];
		}
		index = hashi + i;
		index %= _tables.size();
		if (index == hashi)
		{
			break;
		}
		++i;
	}
	return nullptr;
}

The Find interface is relatively simple to implement. When the table is empty, we just return null. Otherwise we compute the mapped position and start looking for the element there. Note that the search continues as long as the position is not EMPTY, because the position may be in the DELETE state and the target may still lie behind a deleted slot, so the loop condition is "not empty". Inside the loop we check that the current slot is in the EXIST state and that its key equals the key we are looking for; only when both hold do we return the data at this position (the return value is a pointer into the table, &_tables[index], so the caller sees the stored element itself). If probing wraps all the way around back to the original mapped position, the element is definitely absent, so we break out of the loop and return null.

bool eraser(const K& key)
{
	HashDate<K, V>* tmp = Find(key);
	if (tmp)
	{
		tmp->_state = DELETE;
		--_n;
		return true;
	}
	else
	{
		return false;
	}
}

The deletion interface simply calls the Find function. If the key is found, the state of that position is set to DELETE, the counter is decremented, and true is returned. We can also use Find in insert: if the value to be inserted already exists, we skip the insertion.

The above are the three important interfaces of the open addressing method. Let's test them below:

void TeshHashTable1()
{
	int a[] = { 3,33,2,13,5,12,102 };
	HashTable<int, int> ht;
	for (auto& e : a)
	{
		ht.insert(make_pair(e, e));
	}
	ht.insert(make_pair(16, 16));
	auto t = ht.Find(13);
	if (t)
	{
		cout << "13 is in the table" << endl;
	}
	else
	{
		cout << "13 is not in the table" << endl;
	}
	ht.eraser(13);
	t = ht.Find(13);
	if (t)
	{
		cout << "13 is in the table" << endl;
	}
	else
	{
		cout << "13 is not in the table" << endl;
	}
}

 No problem, let's see if the mapping is successful during expansion:

 The running result is fine. Next, we implement the chain address method.

2. Chain address method (hash bucket)

Similarly, we put the code in the namespace, and then use a struct to implement the node that will later be hung at a position of the hash table.

template <class K, class V>
struct HashNode
{
	HashNode<K, V>* _next;
	pair<K, V> _kv;

	HashNode(const pair<K, V>& kv)
		: _next(nullptr)
		, _kv(kv)
	{}
};

The node only needs a next pointer to other nodes and a key-value pair. Since nodes will be allocated on the heap, we write a constructor that builds a node from a pair.

template <class K, class V>
class HashTable
{
	typedef HashNode<K, V> Node;
public:

private:
	vector<Node*> _tables;
	size_t _n = 0;
};

The main body also uses a vector, which stores pointers to nodes, and also needs a counter to record how many elements are inserted.

bool insert(const pair<K, V>& kv)
{
	size_t hashi = kv.first % _tables.size();
	Node* newnode = new Node(kv);
	newnode->_next = _tables[hashi];
	_tables[hashi] = newnode;
	++_n;
	return true;
}

Similarly, ignoring expansion for now, we compute the mapped position, create a new node, and head-insert it: the new node's next links to the head node currently in the table, and then the new node becomes the head stored at the mapped position. After the head insertion, we increment the counter. Now let's consider expansion:

bool insert(const pair<K, V>& kv)
{
	if (_n == _tables.size())
	{
		// expand
		size_t newsize = _tables.size() == 0 ? 10 : 2 * _tables.size();
		vector<Node*> newtable(newsize, nullptr);
		for (auto& cur : _tables)
		{
			while (cur)
			{
				Node* next = cur->_next;
				size_t hashi = cur->_kv.first % newtable.size();
				cur->_next = newtable[hashi];
				newtable[hashi] = cur;
				cur = next;
			}
		}
		_tables.swap(newtable);
	}
	size_t hashi = kv.first % _tables.size();
	Node* newnode = new Node(kv);
	newnode->_next = _tables[hashi];
	_tables[hashi] = newnode;
	++_n;
	return true;
}

A hash bucket only needs to expand when every bucket, on average, holds an element; this keeps the amount of data in each bucket similar. When the number of inserted elements divided by the table size, that is, the load factor, reaches 1, we expand. We could expand the hash bucket the way the open addressing method above does, by creating a whole new hash table, but that is too inefficient: it would reinsert copies of the nodes linked in each bucket and then release the old ones one by one after the insertion succeeds. Instead we open a new vector directly and remap the nodes of the old hash table into it one by one; this way no node space needs to be released after the mapping is complete, because we reuse the old nodes and never allocate new ones. Remapping is also very simple: traverse the old hash table, and while the node at a position is not null, save its next node, compute the node's new mapped position (the computation must use the new size(), which is what "remapping" means), link the current node in front of the head node at the mapped position, and make the current node the new head. After the insertion is complete, the vectors can be swapped.

Node* Find(const K& key)
{
	if (_tables.size() == 0)
	{
		return nullptr;
	}
	size_t hashi = key % _tables.size();
	Node* cur = _tables[hashi];
	while (cur)
	{
		if (cur->_kv.first == key)
		{
			return cur;
		}
		cur = cur->_next;
	}
	return nullptr;
}

The lookup function also first checks whether the table is empty, returning a null pointer if so. Then we compute the mapped position, take the head node there, and traverse from it; if we find the element we are looking for, we return the current node, and if the loop ends without finding it, we return a null pointer.

bool eraser(const K& key)
{
	if (_tables.size() == 0)   // guard against modulo by zero on an empty table
	{
		return false;
	}
	size_t hashi = key % _tables.size();
	Node* cur = _tables[hashi];
	Node* prev = nullptr;
	while (cur)
	{
		if (cur->_kv.first == key)
		{
			if (prev == nullptr)
			{
				_tables[hashi] = cur->_next;
			}
			else
			{
				prev->_next = cur->_next;
			}
			delete cur;
			--_n;              // keep the element counter consistent
			return true;
		}
		else
		{
			prev = cur;
			cur = cur->_next;
		}
	}
	return false;
}

The deletion interface first computes the mapped position, takes the head node there, and uses a prev variable to remember the previous node. While the current node is not null, if it is not the node to delete we record it in prev and advance. When we find the node to delete, we check whether it is the head node: if so, its next becomes the new head stored in the table and the original head is released; otherwise we link the previous node's next to the deleted node's next, and then release the node.

Let's test the code below:

void TeshHashTable2()
{
	int a[] = { 3,33,2,13,5,12,1002 };
	HashTable<int, int> ht;
	for (auto& e : a)
	{
		ht.insert(make_pair(e, e));
	}
	ht.insert(make_pair(16, 16));
	ht.insert(make_pair(14, 14));
	ht.insert(make_pair(15, 15));
	ht.insert(make_pair(17, 17));
	auto t = ht.Find(13);
	if (t)
	{
		cout << "13 is in the table" << endl;
	}
	else
	{
		cout << "13 is not in the table" << endl;
	}
	ht.eraser(13);
	t = ht.Find(13);
	if (t)
	{
		cout << "13 is in the table" << endl;
	}
	else
	{
		cout << "13 is not in the table" << endl;
	}
}

The interfaces work correctly. Now let's look at expansion:

The above is the hash table before expansion; let's see what it looks like afterwards:

We can see that all the values have been remapped after the expansion. Next, let's implement the destructor: when the program ends, the vector only releases its own space, and the linked list hanging at each position is not released, so we need to release those nodes manually:

~HashTable()
{
	for (auto& cur : _tables)
	{
		while (cur)
		{
			Node* next = cur->_next;
			delete cur;
			cur = next;
		}
		cur = nullptr;
	}
}

In the destructor we traverse the table directly: while the head node at a position is not null, we save its next node, release the current node, let cur become the saved node, and repeat. When all the data in a bucket has been released, we set that bucket's pointer to null.


Summarize

The above is the underlying implementation of the hash table. In the next article, I will encapsulate the hash bucket to serve as the underlying layer of unordered_map and unordered_set. We encapsulated the red-black tree before; this encapsulation is not much different from that, just a little more troublesome.

Origin blog.csdn.net/Sxy_wspsby/article/details/130790030