[Advanced Data Structures] Hashing (hash table, hash bucket)



1. Hash concept

In sequential structures and balanced trees, there is no mapping between an element's key and its storage location, so searching for an element requires multiple key comparisons. Sequential search has time complexity O(N); in a balanced tree it is the height of the tree, i.e. O(log₂N). In both cases, the efficiency of a search depends on the number of comparisons made along the way.

Ideal search method: retrieve the element directly from the table in one step, without any comparisons; that is, search with time complexity O(1).

If we can construct a storage structure that establishes a one-to-one mapping between an element's key and its storage location through some function (hashFunc), then during a search the element can be found quickly through that function.

When working with this structure:

  • Insert an element

Apply the function to the key of the element to be inserted to compute the element's storage location, and store the element at that location.

  • Search for an element

Apply the same computation to the element's key, treat the resulting function value as the element's storage location, and compare the element at that position in the structure; if the keys are equal, the search succeeds.

This method is the hash (hashing) method. The conversion function used in the hash method is called the hash function, and the structure built this way is called a hash table (Hash Table).

(Figure: keys mapped to storage locations through hashFunc.)

2. Hash collision

Different keys computing the same hash address through the same hash function: this phenomenon is called a hash conflict or hash collision.

A very simple example, as shown in the figure below: if the number 45 is added to the data, then after modulo 10 it maps to 5 and clashes with the existing 5; the two are mapped to the same position. This causes a hash conflict/hash collision.
(Figure: 45 and 5 both mapping to slot 5 after modulo 10.)
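
A minimal sketch of this collision; the table capacity of 10 is taken from the example:

#include <iostream>

int main()
{
	const int capacity = 10; // the table size used in the example
	for (int key : { 5, 45 })
		std::cout << key << " -> slot " << key % capacity << '\n';
	// both lines print "slot 5": a hash collision
	return 0;
}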

3. Hash function

One cause of hash conflicts may be a hash function that is not designed well enough.

Hash function design principles

  1. The domain of the hash function must include all keys that need to be stored; if the hash table allows m addresses, the function's range must fall within 0 to m-1
  2. The addresses calculated by the hash function can be evenly distributed throughout the space
  3. The hash function should be relatively simple

Common hash functions are as follows:

  1. Direct addressing method (commonly used)
    Take a linear function of the key as the hash address: Hash(Key) = A*Key + B
    Advantages: simple and uniform
    Disadvantages: the distribution of the keys must be known in advance
    Usage scenario: suitable for small and relatively continuous sets of keys

  2. Division with remainder method (commonly used)
    Suppose the hash table allows m addresses. Choose a prime p that is no larger than m but closest (or equal) to m as the divisor, and convert the key into a hash address according to the hash function: Hash(key) = key % p (p <= m).

  3. Mid-square method
    Suppose the key is 1234; its square is 1522756, and the middle three digits 227 are extracted as the hash address. For another example, the key 4321 squares to 18671041, and the middle three digits 671 (or 710) are extracted as the hash address.
    The mid-square method is suitable when the distribution of the keys is unknown and the number of digits is not very large

  4. Folding method
    The folding method splits the key from left to right into several parts with an equal number of digits (the last part may be shorter), adds these parts together, and, according to the length of the hash table, takes the last few digits of the sum as the hash address.
    The folding method is suitable when the distribution of the keys need not be known in advance and the keys have many digits

  5. Random number method
    Choose a random function and take the random function value of the key as its hash address, i.e. H(key) = random(key), where random is a random-number function.
    This method is usually used when the keys have different lengths

  6. Digit analysis method
    Suppose there are n d-digit keys, where each digit can be one of r different symbols. These r symbols do not necessarily appear with the same frequency in every digit position: in some positions the symbols are distributed evenly, each with an equal chance of appearing, while in other positions the distribution is uneven and only certain symbols appear frequently. Depending on the size of the hash table, the positions in which the symbols are evenly distributed can be selected to form the hash address. For example:

(Figure: a list of employee mobile numbers sharing the same leading digits.)
Suppose we want to store a company's employee registry. If we use mobile phone numbers as keys, it is very likely that the first 7 digits are identical, so we can choose the last four digits as the hash address. If this extraction still produces conflicts easily, we can also reverse the extracted digits (e.g. 1234 becomes 4321), right-cyclic-shift them (e.g. 1234 becomes 4123), left-cyclic-shift them, superimpose the first two and last two digits (e.g. 1234 becomes 12+34=46), and so on.

The digit analysis method is usually suitable when the keys have a relatively large number of digits, their distribution is known in advance, and the distribution in several of the digit positions is fairly even. A small code sketch of several of these methods follows.
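
To make these methods concrete, here is a small sketch; the table lengths, digit choices and values are illustrative assumptions, not fixed by the text:

#include <cstdint>
#include <cstddef>
#include <string>

// Division with remainder: p = 11 is a prime no larger than an assumed
// table length m = 12.
std::size_t DivisionHash(std::size_t key)
{
	const std::size_t p = 11;
	return key % p;
}

// Mid-square: square the key and keep the middle three decimal digits,
// e.g. 1234 * 1234 = 1522756 -> 227 (for the 8-digit square of 4321 this
// picks 710, the second option mentioned above).
std::uint64_t MidSquareHash(std::uint64_t key)
{
	std::uint64_t sq = key * key;
	return (sq / 100) % 1000;
}

// Folding: split the decimal key into 3-digit groups from the right,
// add the groups, and keep the low three digits (table length 1000 assumed).
std::uint64_t FoldHash(std::uint64_t key)
{
	std::uint64_t sum = 0;
	while (key > 0)
	{
		sum += key % 1000;
		key /= 1000;
	}
	return sum % 1000;
}

// Digit analysis for the phone-number example: use only the evenly
// distributed last four digits (assumes at least 4 digits in the string).
std::size_t PhoneHash(const std::string& phone)
{
	return std::stoul(phone.substr(phone.size() - 4));
}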

4. Hash conflict resolution

Two common ways to resolve hash collisions are closed hashing and open hashing.

1. Closed hashing – open addressing method

Closed hashing is also called open addressing. When a hash conflict occurs, if the hash table is not full, there must still be an empty position somewhere in the table, so the key can be stored in the "next" empty position after the conflicting one.

(1) Linear probing

Starting from the position where the conflict occurs, probe backwards one slot at a time until the next empty position is found.

Let's insert the value 41 as shown below; linear probing then finds an empty position for it to sit in.
(Figure: 41 probing forward from its hash address to the next empty slot.)

Hi = (H0 + i) % m  (i = 1, 2, 3, …)

Hi: the empty position found for the conflicting element through linear probing.
H0: the position computed by the hash function from the element's key.
m: the capacity of the hash table
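
A quick sketch of this probe sequence, using the table capacity m = 10 and the conflicting key 41 from the example above:

#include <cstddef>
#include <iostream>

int main()
{
	const std::size_t m = 10;   // table capacity from the modulo-10 examples
	std::size_t H0 = 41 % m;    // hash address of the conflicting key 41
	for (std::size_t i = 0; i < 4; ++i)
		std::cout << "H" << i << " = " << (H0 + i) % m << '\n';
	// prints slots 1, 2, 3, 4: each probe moves one slot forward
	return 0;
}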

The problem is easy to see: if we insert a lot of data, or many keys that map to the same remainder, won't linear probing need many steps, and might the capacity even run short? Let's look at it with a picture:

(Figure: 1000, 101 and 40 each probing past occupied slots.)

As shown above, 1000 conflicted twice, 101 conflicted twice, and 40 conflicted 4 times. Linear probing has too many occupied slots to walk past, so:

We insert data into a limited space: the more elements the space already holds, the greater the probability of a conflict on the next insertion, and after many conflicts the search efficiency drops as well. For this reason, the load factor is introduced for hash tables:

Load factor = number of valid elements in the hash table / total capacity

(Figure: the table before expansion, with a high load factor.)

(Figure: the same data after expansion.)
The hash conflicts after expansion in the figure above are noticeably fewer, but a lot of space is wasted. This is the load-factor trade-off discussed above; the load factor should be kept around 0.7-0.8 at most.
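
A minimal sketch of that check, assuming n valid elements in a table of the given capacity (the same integer comparison reappears in the Insert code later):

#include <cstddef>

// a sketch: does the table need expanding, given n valid elements?
bool NeedExpand(std::size_t n, std::size_t capacity)
{
	// n / capacity >= 0.7, written with integer arithmetic
	return n * 10 >= capacity * 7;
}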

To summarize:
Advantages of linear probing: simple and easy to understand.
Disadvantages of linear probing: once a hash conflict occurs, the conflicts chain together, which easily leads to data "piling up": many different keys accumulate next to each other, empty positions must be found further and further away, and many comparisons are needed along the way. Finally, controlling the load factor is also a difficulty.

(2) Quadratic probing

The flaw of linear probing is that conflicting data piles up, and that is tied to how the next empty position is found: one slot at a time. Quadratic probing avoids this problem; its way of finding the next empty position is:

Add i² to the computed position:

Hi = (H0 + i²) % m  (i = 1, 2, 3, …)

H0 is the position obtained by running the element's key through the hash function Hash(x).
m is the size of the table.
Hi is the storage location of the conflicting element obtained by quadratic probing
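
A quick sketch of the quadratic probe sequence under the same assumptions (m = 10, conflicting key 41):

#include <cstddef>
#include <iostream>

int main()
{
	const std::size_t m = 10;
	std::size_t H0 = 41 % m;
	for (std::size_t i = 1; i <= 4; ++i)
		std::cout << "H" << i << " = " << (H0 + i * i) % m << '\n';
	// prints 2, 5, 0, 7: the step grows as i^2, spreading elements apart
	return 0;
}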

(Figure: the same elements placed by quadratic probing in a table of 10 slots.)

Quadratic probing finds the next location for data that produced a hash conflict. Compared with linear probing, elements in a table using quadratic probing are spread out more sparsely, which makes pile-ups less likely. But 10 slots are indeed a bit cramped and we still made many comparisons; let's expand to 20 slots and take a look:

(Figure: the same data in a table of 20 slots.)

Research shows that when the length of the table is a prime number and the load factor a does not exceed 0.5, a new entry can always be inserted and no position is probed twice. Therefore, as long as half of the positions in the table are empty, the table can never become full. A search need not consider overflow, but an insertion must keep the load factor a at or below 0.5; beyond that, expansion must be considered.

2. Open hashing – chain address method

Open hashing is also called the chain address method (open chaining). First, a hash function is applied to the set of keys to compute their hash addresses; keys with the same address belong to the same sub-collection, and each sub-collection is called a bucket. The elements in a bucket are linked through a singly linked list, and the head node of each list is stored in the hash table.

(Figure: hash buckets, each slot holding the head of a singly linked list.)

Of course, a bucket hash table can also be expanded. When every slot of the allocated space already has a head node and several values hang under each one, almost every newly inserted value causes a hash conflict and has to be hung below as well. That still works correctly, but hash performance degrades, so we can expand the capacity so that some newly inserted values land in the newly added buckets instead.

3. The difference between closed hashing and open hashing

For the open addressing method of closed hashing, the load factor cannot exceed 1, and it is generally recommended to keep it within [0.0, 0.7].
For the hash buckets of open hashing, the load factor can exceed 1; it is generally recommended to keep it within [0.0, 1.0].

So in practice hash buckets with open hashing are used more, because the load factor of an open-hash bucket table can be very large, i.e. it can hold more values, and even the extreme case (all inserted data landing in one bucket) has a solution:

(Figure: the extreme case, with all data in a single bucket.)

In the situation above everything is in one bucket, so the search efficiency is very low, O(N). The approach, then, is to store these values as a red-black tree instead of a linked list:

(Figure: the overlong bucket reorganized as a red-black tree.)

Then even with 1 billion items, we only need about 30 comparisons to find a value!
So some libraries convert a hash bucket when too much data lands in it: for example, in newer versions of Java's HashMap, a bucket holding more than 8 entries becomes a red-black tree, and with 8 or fewer it remains a singly linked list.

5. Closed hashing implementation of hash table

1. The structure of the hash table

In a closed-hash table, besides storing the given data, each position should also record its current state. The possible states of each position in the hash table are:

EMPTY (0): the position holds no data; it is vacant.
EXITS (1): the position holds data; it is occupied.
DELETE (2): the position once held a value that has since been deleted.

(1) A doubt – why states? Why three states?

So a question arises: why do we need these three states? Can't we just directly check whether a position is empty? Why not just two states?

First, why states at all

A reminder: what if we want to find whether an element exists in the table? For example, in the hash table below, suppose we want to find the position of element 40.

(Figure: the hash table used in this search example.)

Using linear probing, element 40's hash address in this table is 0 by the division-with-remainder method. We search backwards starting from subscript 0: if 40 is found, it exists; if an empty position is reached first, there is no 40 in this table.

This works because of how hashing inserts: after a conflict on the remainder, the element is placed backwards into an empty slot! Since linear probing searches backwards in order when placing conflicting elements, once an empty position is reached, no element whose probe started at subscript 0 can appear beyond it.

For another example, suppose we now look for the number 80. We search from address 0 backwards and find that the fifth position is empty, so we immediately conclude that 80 does not exist: both 80 and 40 would have been probed forward from address 0, so once the first empty position is reached, the value cannot be in the table!

So how should state be represented? Use 0 for absent and 1 for present, stored in the slot itself? That looks fine at first glance, but there is a very serious problem: what if the number I insert is itself 1 or 0? Wouldn't that be a big mistake? That is why we use a separate state field!

Second, why three states

Next, let's use the delete operation to remove the value 1000 and see what the hash table looks like:

(Figure: the table after deleting 1000.)

Now let's check whether the value 40 still exists. Suppose the only two states were EMPTY and EXITS. Starting from address 0 per the remainder method and searching backwards, we find that position 2 is empty, marked EMPTY; by the rule above, meeting an empty position means the value does not exist, so we would stop. Wouldn't that be a big mistake? There is clearly a 40 further along! So we introduce DELETE here, meaning a value existed before but was deleted: skip this hole and continue searching past it.

(Figure: the deleted slot marked DELETE instead of EMPTY.)

In this way, when searching the hash table, if the element at the current position does not match the element we are looking for but the position's state is EXITS or DELETE, we continue searching; and when inserting, we may place the element into any position whose state is EMPTY or DELETE.

(2) State and load factor

So, our states are:

// state markers
enum State
{
	EMPTY,  // vacant
	EXITS,  // occupied
	DELETE  // deleted
};

Hash data:

// HashData: the record stored in each slot
template<class K, class V>
struct HashData
{
	pair<K, V> _kv;
	State _state = EMPTY;
};

Hash table:

// HashTable
// The third template parameter is the functor that converts a key to size_t;
// DefaultHashTable is shown in (3) below and must precede this class in the
// actual source file.
template<class K, class V, class HashFunc = DefaultHashTable<K>>
class HashTable
{
public:

private:
	vector<HashData<K, V>> _table; // the hash table
	size_t _n = 0; // number of valid elements (used to control the load factor)
};

(3) Preliminary work

// functor that casts kv.first to size_t
template<class K>
struct DefaultHashTable
{
	size_t operator()(const K& key)
	{
		return (size_t)key;
	}
};


2. Insertion into hash table

Insertion has the following four steps:

  1. Check whether a key-value pair with this key already exists in the hash table; if it does, the insertion fails.
  2. If the load factor is too large, the size of the hash table must be adjusted.
  3. Insert key-value pairs into hash table
  4. Valid elements in the hash table +1

Adjust hash table

Because 0.7 is a fraction, we multiply both sides by 10 and compare with integers. When the load factor of the hash table exceeds 0.7, we create a new table twice the size of the original (a simple expansion), insert the data of the original table into its proper positions in the new table, and finally swap the original table with the new one.

How is the key-value pair inserted into the hash table?

First compute the hash address through the hash function (the remainder method), then check for a hash conflict. If there is one, probe forward from the hash address until an EMPTY or DELETE position is found and insert there (note: Find has already been called, so we know the key is not in the table); finally store the value at that position and change its state to EXITS.

Note: when a hash conflict occurs and we probe backwards, a suitable position will always be found, because the load factor of the hash table is kept below 0.7, which means the table is never full.

The code is as follows:

	HashTable()
	{
		// construct 10 slots up front
		_table.resize(10);
	}

	bool Insert(const pair<K, V>& kv)
	{
		// 1. check whether this key already exists in the table; if so, fail
		// 2. if the load factor is too large, resize the hash table
		// 3. insert the key-value pair into the hash table
		// 4. add 1 to the number of valid elements

		// 1. does a pair with this key already exist?
		HashData<K, V>* key_ = Find(kv.first);
		// use Find's result to decide whether the key is present
		if (key_)
		{
			return false; // the key already exists
		}

		// 2. the load factor is too large (greater than 0.7): resize
		if (_n * 10 / _table.size() >= 7)
		{
			// expansion
			// a. create a new hash table with twice the original capacity
			size_t newSize = _table.size() * 2;
			HashTable<K, V, HashFunc> NewHashTable;
			NewHashTable._table.resize(newSize);

			// b. migrate the data of the original table into the new table
			for (size_t i = 0; i < _table.size(); i++)
			{
				if (_table[i]._state == EXITS)
				{
					NewHashTable.Insert(_table[i]._kv);
				}
			}
			// c. swap the original table and the new table
			_table.swap(NewHashTable._table);
		}

		// 3. insert the key-value pair into the hash table
		// linear probing
		// a. compute the starting hash address by taking the remainder
		HashFunc hf;
		size_t hashi = hf(kv.first) % _table.size();
		// b. insert at a DELETE or EMPTY slot; skip over EXITS slots
		while (_table[hashi]._state == EXITS)
		{
			++hashi;
			hashi %= _table.size(); // wrap around the end of the table
		}
		// c. store the pair and update the slot's state
		_table[hashi]._kv = kv;
		_table[hashi]._state = EXITS;
		// 4. add 1 to the number of valid elements
		++_n;

		return true;
	}

3. Search in hash table

The method has two steps:

  1. Calculate the hash address through the hash function
  2. Starting from the hash address, use linear probing to search backwards through the data until either the element is found (the search succeeds) or a position with state EMPTY is found (the search fails).

Note: during the search, only a position whose state is EXITS and whose key matches counts as success. If the key matches but the position's current state is DELETE, the search must continue, because the element at that position has been deleted.

	// search
	HashData<K, V>* Find(const K& key)
	{
		// 1. compute the hash address via the hash function
		// 2. from that address, probe forward linearly until the element is
		//    found (success) or a slot with state EMPTY is met (failure)
		HashFunc hf;
		size_t hashi = hf(key) % _table.size();
		while (_table[hashi]._state != EMPTY)
		{
			if (_table[hashi]._state == EXITS
				&& _table[hashi]._kv.first == key)
			{
				return (HashData<K, V>*)&_table[hashi];
			}
			++hashi;
			hashi %= _table.size();
		}

		// not found
		return nullptr;
	}

4. Deletion of hash table

Pseudo-deletion: there is no need to actually remove the data; we only change the state of the position to DELETE.

  1. Check whether the key-value pair of this key exists in the hash table. If it does not exist, the deletion will fail.
  2. If it exists, just change the status of the key-value pair to DELETE.
  3. The number of valid elements in the hash table is reduced by 1

Note that we have not completely removed the data here; we only changed its state to DELETE. The actual data is still in the slot, but when new data is inserted later, it may overwrite the old data.

	// erase
	bool Erase(const K& key)
	{
		// 1. check whether a pair with this key exists; if not, fail
		// 2. if it exists, set the state of its slot to DELETE
		// 3. subtract 1 from the number of valid elements
		HashData<K, V>* key_ = Find(key);
		if (key_)
		{
			key_->_state = DELETE;
			--_n;
			return true;
		}
		return false;
	}
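
A minimal usage sketch, assuming the HashTable, Insert, Find and Erase from this section are assembled in one source file:

#include <iostream>

int main()
{
	HashTable<int, int> ht;
	ht.Insert({ 5, 5 });
	ht.Insert({ 45, 45 }); // collides with 5; linear probing finds the next slot
	ht.Erase(5);           // the slot is only marked DELETE
	if (ht.Find(45))
		std::cout << "45 is still reachable after deleting 5" << std::endl;
	return 0;
}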

6. Open hash implementation of hash table (hash bucket)

1. The structure of the hash table

In an open-hash table, every storage location in the table holds the same pointer type; in short, each slot is the head node of a singly linked list. Besides storing the given data, the node type also needs a next pointer to the following node.

	template<class K, class V>
	struct HashNode
	{
		HashNode(const pair<K, V>& kv)
			:_kv(kv)
			,_next(nullptr)
		{}

		pair<K, V> _kv;
		HashNode<K, V>* _next;
	};

Recall that in closed hashing we had to keep a state (EXITS, EMPTY, DELETE) for every position of the hash table. An open-hash table does not need this, because each position stores the head node of a list and newly inserted nodes are simply linked below it. In an open-hash table, elements with the same hash address are placed in the same hash bucket, so there is no need to look for the so-called "next position".

So is expansion still necessary? The answer is of course yes. When the load factor grows too large and nearly fills the whole table, almost every newly added value causes a hash conflict and must be linked into an already long list, which is still costly. It is better to expand the capacity and bring the load factor down; this keeps the whole bucket table efficient, so why not?

	// the hash table
	// HashFunc converts a key to size_t; DefaultHashTable is shown in
	// "2. Preliminary preparation" below and must precede this class in the
	// actual source file.
	template<class K, class V, class HashFunc = DefaultHashTable<K>>
	class HashTable
	{
		typedef HashNode<K, V> Node;
	public:

	private:
		vector<Node*> _table; // the hash table
		size_t _n = 0; // number of valid elements
	};

2. Preliminary preparation

	// functor that casts kv.first to size_t
	template<class K>
	struct DefaultHashTable
	{
		size_t operator()(const K& key)
		{
			return (size_t)key;
		}
	};

	template<class K, class V>
	struct HashNode
	{
		HashNode(const pair<K, V>& kv)
			:_kv(kv)
			, _next(nullptr)
		{}

		pair<K, V> _kv;
		HashNode<K, V>* _next;
	};


3. Insertion of hash table (bucket)

The insertion of our hash table (bucket) mainly has the following four steps:

  1. Check whether the key-value pair of the key value exists in the hash table. If it already exists, the insertion fails.
  2. Determine whether the load factor needs to be adjusted. If the load factor is too large, the size of the hash table needs to be adjusted.
  3. Insert key-value pairs into hash table
  4. The number of valid elements in the hash table is increased by 1

Because the load factor has reached 1, we need to expand and bring it down. The idea is to expand to twice the capacity: traverse the original hash table and move the existing nodes directly into the new table, then swap the new table with the old one. This avoids releasing the old nodes and allocating new ones, which is more convenient and faster. Now let's look at how the inserted elements are distributed after expansion:

Explanation: we only need to traverse each hash bucket of the original table and, for each node, use the hash function to find its position in the new table and link it in. No nodes need to be created or released.

Methods to deal with hash conflicts:
1. Calculate the corresponding hash address through the hash function.
2. If a hash conflict occurs, just insert the node into the corresponding singly linked list.

The following is an expansion example. Filling all 10 positions would take too many steps to draw, so we simulate with 7 nodes.
(Figure: the 7-node table before and after expansion.)

The code is as follows:

		HashTable()
		{
			_table.resize(10, nullptr);
		}

		// insert
		bool Insert(const pair<K, V>& kv)
		{
			// 1. check whether this key already exists; if so, fail
			// 2. check the load factor; if it is too large, resize the table
			// 3. insert the key-value pair into the hash table
			// 4. add 1 to the number of valid elements

			HashFunc hf;
			// 1. look the key up via Find
			Node* key_ = Find(kv.first);
			if (key_) // the key already exists, so it cannot be inserted
			{
				return false;
			}

			// 2. check the load factor; expand when it reaches 1
			if (_n == _table.size()) // the whole table is "full"
			{
				// expansion
				// a. create a new table with twice the original capacity
				size_t HashiNewSize = _table.size() * 2;
				vector<Node*> NewHashiTable;
				NewHashiTable.resize(HashiNewSize, nullptr);

				// b. traverse the old table and head-insert its nodes into the new one
				for (size_t i = 0; i < _table.size(); ++i)
				{
					if (_table[i]) // this bucket holds a list
					{
						Node* cur = _table[i]; // record the head node
						while (cur)
						{
							Node* next = cur->_next; // remember the next node
							// recompute the slot in the NEW table from this node's own key
							size_t hashi = hf(cur->_kv.first) % NewHashiTable.size();
							// head-insert into the new table
							cur->_next = NewHashiTable[hashi];
							NewHashiTable[hashi] = cur;

							cur = next; // move to the next node of this bucket
						}
						_table[i] = nullptr;
					}
				}
				// c. swap the two tables
				_table.swap(NewHashiTable);
			}

			// 3. insert the key-value pair into the hash table
			size_t hashii = hf(kv.first) % _table.size();
			Node* newnode = new Node(kv);
			// head insertion
			newnode->_next = _table[hashii];
			_table[hashii] = newnode;

			// 4. ++_n
			++_n;
			return true;
		}

4. Search in hash table (bucket)

  1. Calculate the corresponding hash address through the hash function
  2. Find the singly linked list in the corresponding hash bucket through the hash address and traverse the singly linked list to search.
		// search
		HashNode<K, V>* Find(const K& key)
		{
			// 1. compute the hash address via the hash function
			// 2. locate the singly linked list of the corresponding bucket
			//    and traverse it
			HashFunc hf;
			size_t hashi = hf(key) % _table.size();

			Node* cur = _table[hashi]; // the head of this bucket
			while (cur)
			{
				// found a match
				if (cur->_kv.first == key)
				{
					return (HashNode<K, V>*)cur;
				}
				cur = cur->_next;
			}
			return nullptr;
		}

5. Deletion of hash table (bucket)

  1. Calculate the corresponding hash bucket number through the hash function
  2. Traverse the corresponding hash bucket to find the node to be deleted
  3. If the node to be deleted is found, remove the node from the singly linked list and release it
  4. After deleting the node, reduce the number of valid elements in the hash table by 1
		// erase
		bool Erase(const K& key)
		{
			// 1. compute the bucket number via the hash function
			// 2. traverse the bucket looking for the node to delete
			// 3. if found, unlink it from the singly linked list and release it
			// 4. after deleting, subtract 1 from the number of valid elements
			HashFunc hf;
			size_t hashi = hf(key) % _table.size();
			// prev records the previous node, since the node to delete may be
			// the first node of the bucket
			// cur records the current node, i.e. the deletion candidate
			Node* prev = nullptr;
			Node* cur = _table[hashi];

			while (cur) // walk to the end of the list
			{
				if (cur->_kv.first == key) // found the value
				{
					// case 1: the node to delete is the head of the bucket
					if (prev == nullptr)
					{
						_table[hashi] = cur->_next;
					}
					// case 2: the node to delete hangs below the head
					else // (prev != nullptr)
					{
						prev->_next = cur->_next;
					}
					delete cur; // release the node
					_n--;
					return true;
				}
				prev = cur;
				cur = cur->_next;
			}
			return false;
		}

6. Printing of hash table (bucket)

Two loops: an inner loop nested inside an outer loop.

		// print the table
		void Print()
		{
			// an outer loop over the buckets, an inner loop over each list
			for (size_t i = 0; i < _table.size(); ++i)
			{
				printf("[%zu]->", i); // %zu matches size_t
				Node* cur = _table[i];
				// inner loop
				while (cur)
				{
					cout << cur->_kv.first << ":" << cur->_kv.second << "->";
					cur = cur->_next;
				}
				printf("NULL\n");
			}
		}

7. Insert a string type value into the hash table

We have not yet handled keys of string type; we can solve this with a specialization of the functor template. The code is as follows:

131 is a prime number; some other primes work as well. Its main purpose is to make it unlikely that two different strings compute equal hash values.

	// template specialization (for string keys)
	template<>
	struct DefaultHashTable<string>
	{
		size_t operator()(const string& str)
		{
			// BKDR hash
			size_t hash = 0;
			for (auto ch : str)
			{
				// 131 is a prime; other primes also work. It mainly keeps two
				// different strings from producing equal hash values.
				hash *= 131;
				hash += ch;
			}

			return hash;
		}
	};
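
A usage sketch with string keys, assuming the open-hash HashTable and the specialization above are in scope (the original code appears to rely on using namespace std):

#include <iostream>
#include <string>

int main()
{
	HashTable<string, int> dict;
	dict.Insert({ "sort", 1 });
	dict.Insert({ "hash", 2 });
	if (auto* node = dict.Find("hash"))
		cout << node->_kv.first << ":" << node->_kv.second << endl;
	return 0;
}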

7. Do an experiment – why it is better to use prime numbers for the hash table length

Let’s use composite numbers and prime numbers to illustrate:

The composite number 10 and the prime number 11.

  1. Factors of the composite number 10: 1 2 5 10
  2. Factors of the prime number 11: 1 11

We select 5 sequences for experiments based on the above factors:

The interval between numbers is 1:
s1 = {1 2 3 4 5 6 7 8 9 10}

The interval between numbers is 2:
s2={2 4 6 8 10 12 14 16 18 20}

The interval between numbers is 5:
s3={5 10 15 20 25 30 35 40 45 50}

The interval between numbers is 10:
s4={10 20 30 40 50 60 70 80 90 100}

The interval between numbers is 11:
s5 = {11 22 33 44 55 66 77 88 99 110}

Experiment 1:
Insert the sequence of numbers with an interval of 1 into the hash buckets with a table length of 10 and a table length of 11 respectively:

(Figures: s1 inserted into tables of length 10 and of length 11.)

Experiment 2:
Insert the sequence with a number interval of 2 into the hash bucket with a table length of 10 and a table length of 11 respectively:

(Figures: s2 inserted into tables of length 10 and of length 11.)

Experiment 3:
Insert the sequence of numbers separated by 5 into hash buckets with a table length of 10 and a table length of 11 respectively:

(Figures: s3 inserted into tables of length 10 and of length 11.)

Experiment 4:
Insert the sequence of numbers with an interval of 10 into the hash buckets with a table length of 10 and a table length of 11 respectively:

(Figures: s4 inserted into tables of length 10 and of length 11.)

Experiment 5:
Insert the sequence of numbers with an interval of 11 into the hash buckets with a table length of 10 and a table length of 11 respectively:

(Figures: s5 inserted into tables of length 10 and of length 11.)

Conclusions:

1. When the interval is 1, the distribution is even regardless of the table length.
2. When the interval between numbers equals the hash table length or a multiple of it, hash conflicts are inevitable: starting from the second number, every insertion collides.
3. Whenever the interval between numbers is a factor of the table length, the probability of a hash conflict becomes very large; within just two or three numbers, collisions start to pile up.

We find that prime numbers have the fewest factors, only 1 and themselves! So prime numbers are better suited as table lengths!
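
A quick sketch of experiment 3, assuming nothing beyond the sequences above: insert s3 into tables of length 10 and 11 and count how many buckets end up occupied (fewer occupied buckets means more collisions piled into the same slots):

#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
	// insert s3 = {5, 10, ..., 50} into tables of length 10 and 11
	for (std::size_t len : { std::size_t(10), std::size_t(11) })
	{
		std::vector<int> bucket(len, 0);
		for (int k = 5; k <= 50; k += 5)
			++bucket[k % len];
		std::size_t used = 0;
		for (int c : bucket)
			if (c > 0) ++used;
		// length 10: only 2 buckets used; length 11: all 10 keys spread out
		std::cout << "table length " << len << ": " << used
		          << " of " << len << " buckets occupied\n";
	}
	return 0;
}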

Origin: blog.csdn.net/m0_70088010/article/details/133483476