Hash table + encapsulating map and set

Do you really understand hash tables?


In C++98, the STL provides a series of associative containers whose underlying structure is a red-black tree. Query efficiency reaches O(log_2 N); that is, in the worst case the number of comparisons equals the height of the red-black tree. When the tree holds many nodes, query efficiency is not ideal: the best lookup finds an element with as few comparisons as possible. Therefore, in C++11 the STL added four unordered associative containers. These four containers are used in much the same way as the red-black-tree-based associative containers, but their underlying structures differ. This article introduces only unordered_map and unordered_set.

unordered_map and unordered_set

Properties

  1. unordered_map is an associative container that stores <key, value> pairs. The key uniquely identifies an element, and the mapped value is an object whose content is associated with that key.
  2. Internally, the key-value pairs are not sorted in any particular order. To find the value corresponding to a key in constant time on average, the container places key-value pairs with the same hash value into the same bucket.
  3. unordered_map accesses individual elements by key faster than map, but is generally less efficient for traversal.
  4. It overloads operator[], allowing direct access to a value through its key.
  5. Its iterators are forward (one-way) iterators.

The underlying structure of unordered_set is a hash table, while that of set is a red-black tree. unordered_set is used in much the same way as set, so it is not introduced separately here.
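
As a quick illustration, here is a minimal sketch of basic usage of the standard-library containers (my own example, not code from the original article):

#include <iostream>
#include <string>
#include <unordered_map>
#include <unordered_set>
using namespace std;

int main()
{
    // unordered_map: key-value pairs; operator[] inserts or accesses by key
    unordered_map<string, int> counts;
    counts["apple"]++;           // inserts ("apple", 0), then increments
    counts["banana"] = 2;

    // unordered_set: keys only, no duplicates
    unordered_set<int> us{ 1, 2, 3 };
    us.insert(2);                // no effect, 2 is already present

    for (const auto& kv : counts)   // iteration order is unspecified
        cout << kv.first << ": " << kv.second << endl;
    cout << us.count(2) << endl;    // prints 1
    return 0;
}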

Performance comparison

The comparison here uses set and unordered_set rather than map and unordered_map.

set's underlying structure is a red-black tree, while unordered_set's is a hash table; the test below compares the performance of the two underlying structures.

#include <iostream>
#include <set>
#include <unordered_set>
#include <vector>
#include <cstdlib>
#include <ctime>
using namespace std;

int main()
{
    const size_t N = 1000000;
    unordered_set<int> us;
    set<int> s;
    vector<int> v;
    v.reserve(N);
    srand(time(0));
    for (size_t i = 0; i < N; i++)
    {
        // when most of the data is distinct, set gains the advantage
        //v.push_back(rand());   // random numbers, but rand() yields only ~32768 distinct values, so most repeat; insert: us < s
        v.push_back(rand() + i); // random numbers with few duplicates; insert: about the same
        //v.push_back(i);        // ordered numbers; insert: s < us
    }

    size_t begin1 = clock();
    for (auto e : v)
    {
        us.insert(e);
    }
    size_t end1 = clock();

    size_t begin2 = clock();
    for (auto e : v)
    {
        s.insert(e);
    }
    size_t end2 = clock();

    size_t begin3 = clock();
    for (auto e : v)
    {
        us.find(e);
    }
    size_t end3 = clock();

    size_t begin4 = clock();
    for (auto e : v)
    {
        s.find(e);
    }
    size_t end4 = clock();
    cout << us.size() << endl;  // number of distinct values actually inserted
    cout << s.size() << endl;

    size_t begin5 = clock();
    for (auto e : v)
    {
        us.erase(e);
    }
    size_t end5 = clock();

    size_t begin6 = clock();
    for (auto e : v)
    {
        s.erase(e);
    }
    size_t end6 = clock();

    cout << "unordered_set insert:" << " " << end1 - begin1 << endl;
    cout << "set insert:" << " " << end2 - begin2 << endl;
    cout << "unordered_set find:" << " " << end3 - begin3 << endl;
    cout << "set find:" << " " << end4 - begin4 << endl;
    cout << "unordered_set erase:" << " " << end5 - begin5 << endl;
    cout << "set erase:" << " " << end6 - begin6 << endl;

    return 0;
}

Comparing the insert performance of set and unordered_set (run under Release):

  1. With 1,000,000 values containing many duplicates (rand() generates at most 32768 distinct values), unordered_set's insert has a clear advantage over set's.

[screenshot: insert timings, many duplicates]

  2. With 1,000,000 values and few duplicates (this run produced more than 600,000 distinct random values), unordered_set's insert performs about the same as set's.

[screenshot: insert timings, few duplicates]

  3. With 1,000,000 ordered values (no duplicates), unordered_set's insert is slightly worse than set's. For ordered insertions the underlying red-black tree only needs simple left or right rotations; this is the red-black tree at its strongest!

[screenshot: insert timings, ordered values]

Comparing the find performance of set and unordered_set (run under Debug):

Under Release the compiler optimizes the find loops away, so both set and unordered_set report 0 and no conclusion can be drawn. The test therefore has to run under Debug.

[screenshot: find timings under Release, both 0]
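
A common workaround, my own suggestion rather than part of the original test, is to accumulate something from each lookup so the optimizer cannot delete the loop. Plugged into the benchmark's find section above, it would look like this:

size_t found = 0;
for (auto e : v)
{
    // counting the hits creates a side effect the optimizer must keep
    if (us.find(e) != us.end())
        ++found;
}
cout << found << endl;  // using the result prevents dead-code elimination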

  1. With 1,000,000 values containing many duplicates (rand() generates at most 32768 distinct values), unordered_set's find is faster than set's.

[screenshot: find timings, many duplicates]

  2. With 1,000,000 values and few duplicates (more than 600,000 distinct random values), unordered_set's find is again faster than set's.

[screenshot: find timings, few duplicates]

  3. With 1,000,000 ordered values (no duplicates), unordered_set's find is still faster. In short, for find, even the red-black tree at its strongest loses to unordered_set.

[screenshot: find timings, ordered values]

Comparing the erase performance of set and unordered_set (run under Release):

  1. With 1,000,000 values containing many duplicates (rand() generates at most 32768 distinct values), unordered_set's erase still has an advantage over set's.

[screenshot: erase timings, many duplicates]

  2. With 1,000,000 values and few duplicates (more than 600,000 distinct random values), unordered_set's erase is much faster than set's.

[screenshot: erase timings, few duplicates]

  3. With 1,000,000 ordered values (no duplicates), unordered_set's erase is slightly worse than set's.

[screenshot: erase timings, ordered values]

Having seen the performance comparison, the short version is that the underlying hash table queries faster than the red-black tree. So let's introduce the hash table.

Hash table

Concept

In sequential structures and balanced trees there is no mapping between an element's key and its storage location, so finding an element requires multiple key comparisons: sequential search costs O(N), and in a balanced tree the cost is the tree height, O(log_2 N). Search efficiency depends on how many comparisons are made during the search.

So, is there a structure that can locate an element with no comparisons, or only a very small number of them?

There is: the hash table. Through a hash function, each key is mapped one-to-one to a storage location, so looking up an element is very fast.

When inserting an element into a hash table:

Based on the key of the inserted element, the hash function computes the location where the element should be stored, and the element is placed there.

When searching for an element in a hash table:

The same hash function is applied to the element's key, the result is taken as the element's storage location, and the key at that position is compared. If the keys are equal, the search succeeds.

A hash function should follow these rules:

  1. The domain of the hash function must cover all keys to be stored; if the table has m addresses, its range must fall within 0 to m-1.
  2. The addresses it computes should be distributed evenly across the whole space.
  3. The function should be relatively simple.

Common hash functions:

  1. Direct addressing method (described below)

  2. Linear-function method: take a linear function of the key as the hash address: Hash(Key) = A*Key + B

     Advantages: simple and uniform

     Disadvantage: the distribution of the keys must be known in advance

     Scenario: suitable when the range of keys is small and contiguous

  3. Remainder method (described below)

  4. Mid-square method (for reference; see the sketch after this list)
     Suppose the key is 1234: its square is 1522756, and the middle 3 digits, 227, are taken as the hash address. Likewise, for the key 4321 the square is 18671041, and the middle digits 671 (or 710) are taken as the hash address. The mid-square method suits cases where the distribution of keys is unknown and the number of digits is not very large.

  5. Folding method (for reference; see the sketch after this list)
     Split the key from left to right into parts with an equal number of digits (the last part may be shorter), sum those parts, and, according to the table length, take the last few digits of the sum as the hash address. The folding method does not require knowing the key distribution in advance and suits keys with relatively many digits.

  6. Random-number method (for reference)
     Choose a random function and take the random-function value of the key as its hash address, i.e. H(key) = random(key), where random is a random-number function. This method is usually used when key lengths are unequal.

  7. Numerical-analysis method, among others.
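
The following is a minimal sketch of the mid-square and folding methods described above (my own illustration; the chunk size and digit extraction are one possible choice, matching the 1234 example):

#include <cstddef>
#include <cstdint>

// mid-square: square the key and extract the middle 3 decimal digits
size_t mid_square_hash(uint64_t key)
{
    uint64_t sq = key * key;   // 1234 * 1234 = 1522756
    return (sq / 100) % 1000;  // middle 3 digits: 227
}

// folding: split the decimal digits into 3-digit chunks and sum them
size_t folding_hash(uint64_t key, size_t table_len)
{
    uint64_t sum = 0;
    while (key > 0)
    {
        sum += key % 1000;
        key /= 1000;
    }
    return sum % table_len;    // keep only what the table length needs
}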

The more carefully the hash functions above are designed, the less likely hash collisions become, but collisions can never be avoided entirely. **So what is a hash collision?**

When the position a key maps to is already occupied, that is, when different elements map to the same position, we call it a hash collision (or hash conflict).

There are two common approaches to resolving hash collisions: **closed hashing and open hashing**, which correspond to the two main hash table structures. Closed hashing is also known as open addressing: when a collision occurs, as long as the table still has room, the element is placed in the next empty slot in the table. Open hashing is also known as the chained-address method: the hash function first computes each inserted element's position; when a collision occurs, keys with the same address are put into the same subset and linked together as a singly linked list whose head node sits in the hash table. Such a set of colliding elements is called a hash bucket. Both methods are covered in detail below.

Closed hashing

Direct addressing

Given that objects can be converted to integers and the integer range is known, open up a contiguous space of a certain size (such as a vector) and map the data one-to-one to the slots of that space by subscript. This is generally used for integer data whose range is relatively concentrated.

As shown in the figure below, the data is stored in a vector and mapped one-to-one by subscript (the subscript acting as the hash). In the second case, inserting just 1 and 99999, only two values, requires allocating nearly 100,000 ints, a huge waste of space. And if the data is not integral, then after converting it through a hash function, the size of the resulting integer is uncontrollable, and the range of the data becomes even harder to control.

[figure: direct addressing, concentrated vs. sparse ranges]

In summary, the direct addressing method is only applicable when the data is integral and its range is relatively concentrated.
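
Here is a minimal sketch of direct addressing (my own illustration): counting the lowercase letters of a string, where the key converts to an integer in a small, concentrated range and the subscript itself is the hash address.

#include <string>
#include <vector>

std::vector<int> count_letters(const std::string& s)
{
    std::vector<int> table(26, 0);  // one slot per possible key 'a'..'z'
    for (char c : s)
        ++table[c - 'a'];           // direct mapping: key -> subscript
    return table;
}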

Remainder method

Compute hash = key % _table.size(); the resulting hash is the mapped position.

For example, 17 % 10 = 7, so 17 maps to position 7. When 27 arrives, 27 % 10 = 7 as well, but position 7 already holds data, so we look further back and store it at subscript 8. When 8 and 28 arrive in turn and the slots toward the back of the table are occupied, probing wraps around to the front of the table to find a slot.

[animation: linear probing with the remainder method]

As the animation shows, when the position hashi maps to is already occupied, that is, when different values map to the same position, a hash collision occurs; we then probe forward from hashi, slot by slot, for an empty position. This way of probing for free space is called linear probing.

[animation: remainder method, continued]
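
A minimal sketch of the linear probing step (my own illustration): from the home position, probe start, start+1, start+2, ..., wrapping around modulo the table size.

#include <cstddef>

size_t linear_probe(size_t start, size_t i, size_t table_size)
{
    return (start + i) % table_size;  // i-th probe from the home slot
}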

Implementation

Functor

Since the hash table must convert an object into an integer to map it one-to-one to a position, an object that already has an integral type can simply be cast to one unified integer type (size_t). But when the object is not integral, it needs an algorithm that converts it into an integer. A functor is used here to convert the key of the passed-in object into an integer; I call it HashFunc.

Suppose the key is a string: the ASCII values of its characters could simply be summed into an integer, but then strings like "abc", "bac", and "cba" would collide easily. Following a well-known approach, I use the BKDR hash function.

template<class K>
struct HashFunc
{
    size_t operator()(const K& key)
    {
        return (size_t)key;  // objects convertible to an integer are cast to an unsigned integer
    }
};

template<>
struct HashFunc<string>
{
    size_t operator()(const string& s)
    {
        size_t ch = 0;
        for (auto e : s)
        {
            ch *= 131;  // BKDR hash
            ch += e;
        }
        return ch;
    }
};

Hash node

Each slot here is in one of three states: occupied (EXIST), empty (EMPTY), or deleted (DELETE).

Since the underlying storage of this hash table is a vector, physically erasing elements would be costly, so pseudo-deletion is used: the slot of a deleted element is simply marked DELETE.

Each node stores a key-value pair object and a state.

enum State  // slot state
{
    EMPTY,
    EXIST,
    DELETE,
};
template<class K, class V>
struct HashNode
{
    pair<K, V> _kv;
    State _state = EMPTY;
};

Find function

Find returns the address of the element's node.

The caller passes in the key, and the table is searched for it. While the probed slot's state is not EMPTY, the search continues; when a slot's key equals key and its state is EXIST, the element is considered found and the address of its node is returned. If a slot with state EMPTY is reached, the key is not in the table, and null is returned.

node* Find(const K& key)
{
    HashF hf;
    size_t hashi = hf(key) % _table.size();
    size_t starti = hashi;
    while (_table[hashi]._state != EMPTY)
    {
        if (_table[hashi]._state == EXIST && _table[hashi]._kv.first == key)
        {
            return &_table[hashi];
        }
        else
        {
            hashi++;
            hashi %= _table.size();
            // guard against the extreme case: if every slot is either DELETE or
            // EXIST there is no empty slot and the loop would never exit, so
            // stop after one full pass around the table
            if (starti == hashi)
                break;
        }
    }
    return nullptr;
}

Insert function

Since the underlying storage is a vector, expansion is unavoidable. When is the best time to expand? The more elements inserted into a given space, the higher the probability of a hash collision. This introduces the concept of a load factor: load factor a = number of elements in the table / length of the hash table. The larger a is, the more elements the table already holds and the more likely a new insertion collides; conversely, the smaller a is, the emptier the table and the less likely a collision.

According to published measurements, the load factor should be strictly limited to about 0.7-0.8; beyond 0.8, the CPU cache miss rate during lookups rises roughly exponentially. For example, the load factor in the Java library is 0.75, and the hash table is resized once it is exceeded.

Here I fix the load factor at a = 0.7; once it exceeds 0.7, the table expands.
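
Since _n and the table size are integers, the 0.7 threshold can be checked without floating point, which is exactly what the Insert function below does. A minimal sketch of that check:

#include <cstddef>

bool needs_expansion(size_t n, size_t table_size)
{
    return n * 10 / table_size > 7;  // effectively: load factor > 0.7
}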

Expansion itself can be implemented in different ways. Here I create a new table: the new table is first resized to twice the old table's size, then the old table is traversed and every element whose state is EXIST is re-inserted by reusing the hash table's own Insert function; finally the new table's storage is swapped with the old table's.

bool Insert(const pair<K, V>& kv)
{
    auto cur = Find(kv.first);
    if (cur)
    {
        // not null: the key is already present
        return false;
    }
    HashF hf;
    if (_n * 10 / _table.size() > 7)  // load factor above 0.7: expand
    {
        HashTable<K, V> newHT;
        newHT._table.resize(2 * _table.size());  // double the size
        for (auto& e : _table)
        {
            if (e._state == EXIST)
                newHT.Insert(e._kv);
        }
        _table.swap(newHT._table);
    }
    size_t hashi = hf(kv.first) % _table.size();
    while (_table[hashi]._state == EXIST)
    {
        hashi++;
        hashi %= _table.size();
    }
    _table[hashi]._kv = kv;
    _table[hashi]._state = EXIST;
    _n++;
    return true;
}

Overall code

Since the underlying storage here is a vector, copy construction, destruction, and so on need not be written by hand; just remember to initialize the built-in member _n.


enum State  // slot state
{
    EMPTY,
    EXIST,
    DELETE,
};

template<class K>
struct HashFunc
{
    size_t operator()(const K& key)
    {
        return (size_t)key;
    }
};

template<>
struct HashFunc<string>
{
    size_t operator()(const string& s)
    {
        size_t ch = 0;
        for (auto e : s)
        {
            ch *= 131;  // BKDR hash
            ch += e;
        }
        return ch;
    }
};

template<class K, class V>
struct HashNode
{
    pair<K, V> _kv;
    State _state = EMPTY;
};

template<class K, class V, class HashF = HashFunc<K>>
class HashTable
{
    typedef HashNode<K, V> node;
public:
    HashTable()
        :_n(0)
    {
        _table.resize(10);
    }

    bool Insert(const pair<K, V>& kv)
    {
        auto cur = Find(kv.first);
        if (cur)
        {
            // not null: the key is already present
            return false;
        }
        HashF hf;
        if (_n * 10 / _table.size() > 7)  // load factor above 0.7: expand
        {
            HashTable<K, V> newHT;
            newHT._table.resize(2 * _table.size());  // double the size
            for (auto& e : _table)
            {
                if (e._state == EXIST)
                    newHT.Insert(e._kv);
            }
            _table.swap(newHT._table);
        }
        size_t hashi = hf(kv.first) % _table.size();
        while (_table[hashi]._state == EXIST)
        {
            hashi++;
            hashi %= _table.size();
        }
        _table[hashi]._kv = kv;
        _table[hashi]._state = EXIST;
        _n++;
        return true;
    }

    node* Find(const K& key)
    {
        HashF hf;
        size_t hashi = hf(key) % _table.size();
        size_t starti = hashi;
        while (_table[hashi]._state != EMPTY)
        {
            if (_table[hashi]._state == EXIST && _table[hashi]._kv.first == key)
            {
                return &_table[hashi];
            }
            else
            {
                hashi++;
                hashi %= _table.size();
                if (starti == hashi)  // full pass with no empty slot: stop
                    break;
            }
        }
        return nullptr;
    }

    bool Erase(const K& key)  // pseudo-deletion: mark the element's state as DELETE
    {
        node* cur = Find(key);
        if (cur)
        {
            cur->_state = DELETE;
            _n--;
            return true;
        }
        else
        {
            return false;
        }
    }

private:
    vector<node> _table;
    size_t _n = 0;
};
Quadratic probing

The probing used above is linear: it probes sequentially from start + i, which easily causes data to pile up in one region of the table, and once collisions happen they tend to cluster together. **Different keys occupy each other's free slots, so locating a given key takes many comparisons and search efficiency drops.** Slightly better in this respect is quadratic probing, which probes from start + i^2, as shown below.

[animation: quadratic probing]

Research shows that when the table length is a prime number and the load factor a does not exceed 0.5, a new entry can always be inserted and no slot is probed twice. Therefore, as long as the table is at least half empty, it can never overflow. Fullness can be ignored when searching, but when inserting, the load factor a must be kept at or below 0.5; if it would be exceeded, expansion must be considered.
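
A minimal sketch of the quadratic probing step (my own illustration, parallel to the linear_probe sketch above):

#include <cstddef>

size_t quadratic_probe(size_t start, size_t i, size_t table_size)
{
    return (start + i * i) % table_size;  // probe start, start+1, start+4, start+9, ...
}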

Even so, the inherent defect of closed hashing cannot be avoided: its space utilization is relatively low! **This leads to the other hash table structure: open hashing.**

Open hashing

As defined earlier:

The open hashing method, also called the chained-address method (open chaining), first computes a hash address for each key in the key set; keys with the same address belong to the same subset, and each subset is called a bucket. The elements in a bucket are linked through a singly linked list, and the head node of each list is stored in the hash table.

Where the slots of a closed hash table store objects, the slots of an open hash table store pointers: the table can be viewed as an array of pointers, with unused positions initialized to null.

Functor

The functor here is the same as in closed hashing: the object must be converted into an integer through the functor before it can be mapped.

template<class K>
struct HashFunc
{
    size_t operator()(const K& key)
    {
        return (size_t)key;  // objects convertible to an integer are cast to an unsigned integer
    }
};

template<>
struct HashFunc<string>
{
    size_t operator()(const string& s)
    {
        size_t ch = 0;
        for (auto e : s)
        {
            ch *= 131;  // BKDR hash
            ch += e;
        }
        return ch;
    }
};

Node

Unlike the closed-hash node, no state marker is needed here; a deleted node can simply be removed. Instead there is an extra next pointer, used to link together the elements that hash-collide.

template<class K, class V>
struct HashNode
{
    pair<K, V> _kv;
    HashNode<K, V>* _next;

    HashNode(const pair<K, V>& kv)
        :_kv(kv)
        , _next(nullptr)
    {}
};

Find function

The Find function here must walk the bucket hanging at the key's slot in the table.

node* Find(const K& key)
{
    size_t hashi = HashF()(key) % _tables.size();
    node* cur = _tables[hashi];
    while (cur)
    {
        if (cur->_kv.first == key)  // found: return the node's address
        {
            return cur;
        }
        else
        {
            cur = cur->_next;
        }
    }
    return nullptr;  // not found: return null
}

Insert function

When inserting an element, the hash function computes the address for the element's key, and the node is inserted at the head of that bucket. When a hash collision occurs, elements keep being inserted at the same slot, with the colliding elements linked through the singly linked list.

As shown in the animation:

[animation: hash bucket insertion]

The code is as follows.

When implementing the closed hash table, the load factor was set to 0.7 to keep the collision probability down. An open hash table can tolerate collisions, since colliding elements are simply chained into the bucket's singly linked list, so the load factor here is set to 1.

Once the load factor reaches 1, the next insertion triggers expansion. The expansion here works the same way as in the closed hash table: a new table is created, and brand-new nodes are built from the old table's node values, which is quite inefficient.

bool Insert(const pair<K, V>& kv)
{
    if (Find(kv.first))  // already present
        return false;

    // not found: insert
    if (_n == _tables.size())  // load factor reached 1: table is full, expand
    {
        // old expansion method
        BucketHash<K, V> newHT;
        newHT._tables.resize(2 * _tables.size());  // double the size
        for (auto cur : _tables)
        {
            while (cur)  // bucket not empty
            {
                newHT.Insert(cur->_kv);
                cur = cur->_next;
            }
        }
        _tables.swap(newHT._tables);
    }
    size_t hashi = HashF()(kv.first) % _tables.size();
    node* newnode = new node(kv);
    newnode->_next = _tables[hashi];
    _tables[hashi] = newnode;
    ++_n;
    return true;
}

Here the values 17, 27, 8, 28, 22, 23, 24, 25, 35, 45, 11, 11, 28, 5 are inserted in sequence and observed in the debugger.

Before expansion, the node holding 45 is at 0x0134f108, the node holding 35 at 0x0134f140, and the node holding 25 at 0x0134f648.

[screenshot: node addresses before expansion]

After expansion, the node holding 45 is at 0x0134f450, the node holding 35 at 0x0134f220, and the node holding 25 at 0x0134f098. The node addresses changed across the expansion, showing that brand-new nodes were created in the new table.

[screenshot: node addresses after expansion]

A new expansion method: move the old table's nodes directly to their new mapped positions in the new table, saving the cost of creating new nodes and destroying old ones.

bool Insert(const pair<K, V>& kv)
{
    if (Find(kv.first))  // already present
        return false;

    // not found: insert
    if (_n == _tables.size())  // load factor reached 1: table is full, expand
    {
        // old method
        /*BucketHash<K, V> newHT;
        newHT._tables.resize(2 * _tables.size());  // double the size
        for (auto cur : _tables)
        {
            while (cur)  // bucket not empty
            {
                newHT.Insert(cur->_kv);
                cur = cur->_next;
            }
        }
        _tables.swap(newHT._tables);*/

        // new method: relink the old nodes into the new table
        vector<node*> newtables;
        newtables.resize(2 * _tables.size());
        for (size_t i = 0; i < _tables.size(); i++)
        {
            node* cur = _tables[i];
            while (cur)
            {
                node* next = cur->_next;
                size_t hashi = HashF()(cur->_kv.first) % newtables.size();
                cur->_next = newtables[hashi];
                newtables[hashi] = cur;
                cur = next;
            }
            _tables[i] = nullptr;
        }
        _tables.swap(newtables);
    }
    size_t hashi = HashF()(kv.first) % _tables.size();
    node* newnode = new node(kv);
    newnode->_next = _tables[hashi];
    _tables[hashi] = newnode;
    ++_n;
    return true;
}

The same values 17, 27, 8, 28, 22, 23, 24, 25, 35, 45, 11, 11, 28, 5 are inserted in sequence and observed in the debugger.

Before expansion, the node holding 45 is at 0x0181f6e8, the node holding 35 at 0x0181f800, and the node holding 25 at 0x0181f410.

[screenshot: node addresses before expansion]

After expansion, the node holding 45 is still at 0x0181f6e8, 35 at 0x0181f800, and 25 at 0x0181f410. The nodes (addresses) were not recreated; they were simply moved from the old table into the new one.

[screenshot: node addresses after expansion]

Erase function

The Erase function must also walk the bucket at the key's slot in the table, and delete the node once it is found.

bool Erase(const K& key)
{
    size_t hashi = HashF()(key) % _tables.size();
    node* cur = _tables[hashi];
    node* prev = nullptr;
    while (cur)
    {
        if (cur->_kv.first == key)
        {
            if (cur == _tables[hashi])  // the match is the first node in the bucket
            {
                _tables[hashi] = cur->_next;
            }
            else
            {
                prev->_next = cur->_next;
            }
            --_n;
            delete cur;
            return true;
        }
        else
        {
            prev = cur;
            cur = cur->_next;
        }
    }
    return false;
}

Destructor

Open and closed hash tables share the same underlying storage, a vector, but the closed table stores objects, so the vector's own destructor is enough. The open table stores pointers: the default destructor destroys the pointers in the table but never walks the hash buckets to free the nodes, so the destructor here must be written by hand.

~BucketHash()
{
    for (size_t i = 0; i < _tables.size(); i++)
    {
        node* cur = _tables[i];
        while (cur)
        {
            node* prev = cur;
            cur = cur->_next;
            delete prev;
        }
        _tables[i] = nullptr;
    }
}

More on expansion

When the load factor in the Insert function above reaches 1, the table must expand; the implementation above simply doubles the space. But the classic design of the remainder method says it is best to take the modulus by a prime number (choosing the nearest prime at each expansion). How can a prime roughly double the previous one be obtained quickly each time? The SGI version of the STL keeps a prime table for exactly this purpose (ul stands for unsigned long):

size_t GetNextPrime(size_t prime)
{
    const int PRIMECOUNT = 28;
    static const size_t primeList[PRIMECOUNT] =
    {
        53ul, 97ul, 193ul, 389ul, 769ul,
        1543ul, 3079ul, 6151ul, 12289ul, 24593ul,
        49157ul, 98317ul, 196613ul, 393241ul, 786433ul,
        1572869ul, 3145739ul, 6291469ul, 12582917ul, 25165843ul,
        50331653ul, 100663319ul, 201326611ul, 402653189ul, 805306457ul,
        1610612741ul, 3221225473ul, 4294967291ul
    };
    size_t i = 0;
    for (; i < PRIMECOUNT; ++i)
    {
        if (primeList[i] > prime)
            return primeList[i];
    }
    return primeList[PRIMECOUNT - 1];  // already past the largest prime: stay there
}

The modified Insert function


bool Insert(const pair<K, V>& kv)
{
    if (Find(kv.first))  // already present
        return false;

    // not found: insert
    if (_n == _tables.size())  // load factor reached 1: table is full, expand
    {
        // old method, now resizing to the next prime
        BucketHash<K, V> newHT;
        newHT._tables.resize(__stl_next_prime(_n));  // expand to the next prime
        for (auto cur : _tables)
        {
            while (cur)  // bucket not empty
            {
                newHT.Insert(cur->_kv);
                cur = cur->_next;
            }
        }
        _tables.swap(newHT._tables);
        /*vector<node*> newtables;
        newtables.resize(2 * _tables.size());
        for (size_t i = 0; i < _tables.size(); i++)
        {
            node* cur = _tables[i];
            while (cur)
            {
                node* next = cur->_next;
                size_t hashi = HashF()(cur->_kv.first) % newtables.size();
                cur->_next = newtables[hashi];
                newtables[hashi] = cur;
                cur = next;
            }
            _tables[i] = nullptr;
        }
        _tables.swap(newtables);*/
    }
    size_t hashi = HashF()(kv.first) % _tables.size();
    node* newnode = new node(kv);
    newnode->_next = _tables[hashi];
    _tables[hashi] = newnode;
    ++_n;
    return true;
}

// prime table for expansion, exposed as an inline function
inline unsigned long __stl_next_prime(unsigned long n)
{
    static const int __stl_num_primes = 28;
    static const unsigned long __stl_prime_list[__stl_num_primes] =
    {
        53, 97, 193, 389, 769,
        1543, 3079, 6151, 12289, 24593,
        49157, 98317, 196613, 393241, 786433,
        1572869, 3145739, 6291469, 12582917, 25165843,
        50331653, 100663319, 201326611, 402653189, 805306457,
        1610612741, 3221225473, 4294967291
    };

    for (int i = 0; i < __stl_num_primes; ++i)
    {
        if (__stl_prime_list[i] > n)
        {
            return __stl_prime_list[i];
        }
    }

    return __stl_prime_list[__stl_num_primes - 1];
}

Overall code

template<class K, class V>
struct HashNode
{
    pair<K, V> _kv;
    HashNode<K, V>* _next;

    HashNode(const pair<K, V>& kv)
        :_kv(kv)
        , _next(nullptr)
    {}
};

template<class K, class V, class HashF = HashFunc<K>>
class BucketHash
{
    typedef HashNode<K, V> node;

public:
    BucketHash()
        :_n(0)
    {
        _tables.resize(__stl_next_prime(0));
    }

    ~BucketHash()
    {
        for (size_t i = 0; i < _tables.size(); i++)
        {
            node* cur = _tables[i];
            while (cur)
            {
                node* prev = cur;
                cur = cur->_next;
                delete prev;
            }
            _tables[i] = nullptr;
        }
    }

    inline unsigned long __stl_next_prime(unsigned long n)
    {
        static const int __stl_num_primes = 28;
        static const unsigned long __stl_prime_list[__stl_num_primes] =
        {
            53, 97, 193, 389, 769,
            1543, 3079, 6151, 12289, 24593,
            49157, 98317, 196613, 393241, 786433,
            1572869, 3145739, 6291469, 12582917, 25165843,
            50331653, 100663319, 201326611, 402653189, 805306457,
            1610612741, 3221225473, 4294967291
        };

        for (int i = 0; i < __stl_num_primes; ++i)
        {
            if (__stl_prime_list[i] > n)
            {
                return __stl_prime_list[i];
            }
        }

        return __stl_prime_list[__stl_num_primes - 1];
    }

    bool Insert(const pair<K, V>& kv)
    {
        if (Find(kv.first))  // already present
            return false;

        // not found: insert
        if (_n == _tables.size())  // load factor reached 1: table is full, expand
        {
            // old method
            /*BucketHash<K, V> newHT;
            newHT._tables.resize(__stl_next_prime(_n));
            for (auto cur : _tables)
            {
                while (cur)  // bucket not empty
                {
                    newHT.Insert(cur->_kv);
                    cur = cur->_next;
                }
            }
            _tables.swap(newHT._tables);*/

            // new method: relink the old nodes into the new table
            vector<node*> newtables;
            newtables.resize(__stl_next_prime(_n));
            for (size_t i = 0; i < _tables.size(); i++)
            {
                node* cur = _tables[i];
                while (cur)
                {
                    node* next = cur->_next;
                    size_t hashi = HashF()(cur->_kv.first) % newtables.size();
                    cur->_next = newtables[hashi];
                    newtables[hashi] = cur;
                    cur = next;
                }
                _tables[i] = nullptr;
            }
            _tables.swap(newtables);
        }
        size_t hashi = HashF()(kv.first) % _tables.size();
        node* newnode = new node(kv);
        newnode->_next = _tables[hashi];
        _tables[hashi] = newnode;
        ++_n;
        return true;
    }

    node* Find(const K& key)
    {
        size_t hashi = HashF()(key) % _tables.size();
        node* cur = _tables[hashi];
        while (cur)
        {
            if (cur->_kv.first == key)  // found: return the node's address
            {
                return cur;
            }
            else
            {
                cur = cur->_next;
            }
        }
        return nullptr;  // not found: return null
    }

    bool Erase(const K& key)
    {
        size_t hashi = HashF()(key) % _tables.size();
        node* cur = _tables[hashi];
        node* prev = nullptr;
        while (cur)
        {
            if (cur->_kv.first == key)
            {
                if (cur == _tables[hashi])  // the match is the first node in the bucket
                {
                    _tables[hashi] = cur->_next;
                }
                else
                {
                    prev->_next = cur->_next;
                }
                --_n;
                delete cur;
                return true;
            }
            else
            {
                prev = cur;
                cur = cur->_next;
            }
        }
        return false;
    }

private:
    size_t _n = 0;
    vector<node*> _tables;
};

Encapsulating Unordered_Map and Unordered_Set with the hash table

Extracting the key

Since Unordered_Map's stored value is the key-value pair<const K, V>, its key is the K object inside the pair, while Unordered_Set's stored value is K itself, so its key is the K object directly. Because the two differ, the upper-layer Unordered_Map and Unordered_Set each need to pass in their own key-extraction functor through a template parameter.

Unordered_Map's key-extraction functor:

struct KeyofMap
{
    const K& operator()(const pair<const K, V>& kv)
    {
        return kv.first;
    }
};

Unordered_Set's key-extraction functor:

struct KeyofSet
{
    const K& operator()(const K& key)
    {
        return key;
    }
};

Iterator

The iterator needs to be implemented as a class of its own.

Framework

template<class K, class T, class HashF, class KeyofT>
class BucketHash;  // forward declaration: the iterator needs access to the hash table's members
template<class K, class T, class HashF, class KeyofT>
struct __HTIterator
{
    typedef __HTIterator<K, T, HashF, KeyofT> Self;  // the iterator
    typedef HashNode<T> node;                        // the node
    typedef BucketHash<K, T, HashF, KeyofT> HT;      // the hash table
    HT* _ht;
    node* _node;
    __HTIterator(node* pnode, HT* ht)
        // constructed from a node pointer and a pointer to the hash table
        :_node(pnode)
        , _ht(ht)
    {}
};

operator++

First traverse the current bucket: if the current node's next is not null, step to it. When the current bucket is exhausted, compute which slot the current bucket occupies in the hash table, then walk forward through the table looking for the next non-empty bucket.

Self& operator++()
{
    // current bucket not yet exhausted
    if (_node->_next)
    {
        _node = _node->_next;
    }
    else  // current bucket done: move on to the buckets after it
    {
        KeyofT kot;
        size_t hashi = HashF()(kot(_node->_Data)) % _ht->_tables.size();  // slot of the current bucket
        hashi++;
        while (hashi < _ht->_tables.size())
        {
            if (_ht->_tables[hashi])
            {
                _node = _ht->_tables[hashi];
                break;
            }
            else
            {
                hashi++;
            }
        }
        if (hashi == _ht->_tables.size())
            _node = nullptr;
    }
    return *this;
}

Full iterator code

template<class K, class T, class HashF, class KeyofT>
class BucketHash;  // forward declaration: the iterator needs access to the hash table's members
template<class K, class T, class HashF, class KeyofT>
struct __HTIterator
{
    typedef __HTIterator<K, T, HashF, KeyofT> Self;
    typedef HashNode<T> node;
    typedef BucketHash<K, T, HashF, KeyofT> HT;
    HT* _ht;
    node* _node;
    __HTIterator(node* pnode, HT* ht)
        :_node(pnode)
        , _ht(ht)
    {}

    T& operator*()const
    {
        return _node->_Data;
    }

    T* operator->()const
    {
        return &_node->_Data;
    }

    Self& operator++()
    {
        // current bucket not yet exhausted
        if (_node->_next)
        {
            _node = _node->_next;
        }
        else  // current bucket done: move on to the buckets after it
        {
            KeyofT kot;
            size_t hashi = HashF()(kot(_node->_Data)) % _ht->_tables.size();  // slot of the current bucket
            hashi++;
            while (hashi < _ht->_tables.size())
            {
                if (_ht->_tables[hashi])
                {
                    _node = _ht->_tables[hashi];
                    break;
                }
                else
                {
                    hashi++;
                }
            }
            if (hashi == _ht->_tables.size())
                _node = nullptr;
        }
        return *this;
    }

    bool operator!=(const Self& s)const
    {
        return _node != s._node;
    }

    bool operator==(const Self& s)const
    {
        return _node == s._node;
    }
};

begin

begin returns an iterator to the first non-empty slot found while scanning the table from front to back; if the whole table is empty, it returns the null iterator (the same as end()).

iterator begin()
{
    for (size_t i = 0; i < _tables.size(); i++)
    {
        node* cur = _tables[i];
        if (cur)
        {
            return iterator(cur, this);
        }
    }
    return iterator(nullptr, this);
}

end

end is constructed from a null node pointer.


iterator end()
{
    return iterator(nullptr, this);
}

To support the encapsulation, the hash table itself also had to change.

Main changes

Converting the key:

[screenshot: key-conversion changes]

Extracting the key (including the conversion to an integer):

[screenshot: key-extraction changes]

Hash table overall code

template<class T>
struct HashNode
{
    T _Data;
    HashNode<T>* _next;

    HashNode(const T& Data)
        :_Data(Data)
        , _next(nullptr)
    {}
};

template<class K, class T, class HashF, class KeyofT>
class BucketHash;  // forward declaration: the iterator needs access to the hash table's members
template<class K, class T, class HashF, class KeyofT>
struct __HTIterator
{
    typedef __HTIterator<K, T, HashF, KeyofT> Self;
    typedef HashNode<T> node;
    typedef BucketHash<K, T, HashF, KeyofT> HT;
    HT* _ht;
    node* _node;
    __HTIterator(node* pnode, HT* ht)
        :_node(pnode)
        , _ht(ht)
    {}

    T& operator*()const
    {
        return _node->_Data;
    }

    T* operator->()const
    {
        return &_node->_Data;
    }

    Self& operator++()
    {
        // current bucket not yet exhausted
        if (_node->_next)
        {
            _node = _node->_next;
        }
        else  // current bucket done: move on to the buckets after it
        {
            KeyofT kot;
            size_t hashi = HashF()(kot(_node->_Data)) % _ht->_tables.size();  // slot of the current bucket
            hashi++;
            while (hashi < _ht->_tables.size())
            {
                if (_ht->_tables[hashi])
                {
                    _node = _ht->_tables[hashi];
                    break;
                }
                else
                {
                    hashi++;
                }
            }
            if (hashi == _ht->_tables.size())
                _node = nullptr;
        }
        return *this;
    }

    bool operator!=(const Self& s)const
    {
        return _node != s._node;
    }

    bool operator==(const Self& s)const
    {
        return _node == s._node;
    }
};

template<class K, class T, class HashF, class KeyofT>
class BucketHash
{
    typedef HashNode<T> node;

    // parameter names differ from the class's own so they don't shadow them
    template<class K1, class T1, class HashF1, class KeyofT1>
    friend struct __HTIterator;

public:
    typedef __HTIterator<K, T, HashF, KeyofT> iterator;

    iterator begin()
    {
        for (size_t i = 0; i < _tables.size(); i++)
        {
            node* cur = _tables[i];
            if (cur)
            {
                return iterator(cur, this);
            }
        }
        return iterator(nullptr, this);
    }

    iterator end()
    {
        return iterator(nullptr, this);
    }

    BucketHash()
        :_n(0)
    {
        _tables.resize(__stl_next_prime(0));
    }

    ~BucketHash()
    {
        for (size_t i = 0; i < _tables.size(); i++)
        {
            node* cur = _tables[i];
            while (cur)
            {
                node* prev = cur;
                cur = cur->_next;
                delete prev;
            }
            _tables[i] = nullptr;
        }
    }

    inline unsigned long __stl_next_prime(unsigned long n)
    {
        static const int __stl_num_primes = 28;
        static const unsigned long __stl_prime_list[__stl_num_primes] =
        {
            53, 97, 193, 389, 769,
            1543, 3079, 6151, 12289, 24593,
            49157, 98317, 196613, 393241, 786433,
            1572869, 3145739, 6291469, 12582917, 25165843,
            50331653, 100663319, 201326611, 402653189, 805306457,
            1610612741, 3221225473, 4294967291
        };

        for (int i = 0; i < __stl_num_primes; ++i)
        {
            if (__stl_prime_list[i] > n)
            {
                return __stl_prime_list[i];
            }
        }

        return __stl_prime_list[__stl_num_primes - 1];
    }

    pair<iterator, bool> Insert(const T& Data)
    {
        KeyofT kot;
        iterator cur = Find(kot(Data));
        if (cur != end())
        {
            return make_pair(cur, false);
        }

        // not found: insert
        if (_n == _tables.size())  // load factor reached 1: table is full, expand
        {
            // old method
            /*BucketHash<K, T, HashF, KeyofT> newHT;
            newHT._tables.resize(__stl_next_prime(_n));
            for (auto cur : _tables)
            {
                while (cur)  // bucket not empty
                {
                    newHT.Insert(cur->_Data);
                    cur = cur->_next;
                }
            }
            _tables.swap(newHT._tables);*/

            // new method: relink the old nodes into the new table
            vector<node*> newtables;
            newtables.resize(__stl_next_prime(_n));
            for (size_t i = 0; i < _tables.size(); i++)
            {
                node* cur = _tables[i];
                while (cur)
                {
                    node* next = cur->_next;
                    size_t hashi = HashF()(kot(cur->_Data)) % newtables.size();
                    cur->_next = newtables[hashi];
                    newtables[hashi] = cur;
                    cur = next;
                }
                _tables[i] = nullptr;
            }
            _tables.swap(newtables);
        }
        size_t hashi = HashF()(kot(Data)) % _tables.size();
        node* newnode = new node(Data);
        newnode->_next = _tables[hashi];
        _tables[hashi] = newnode;
        ++_n;
        return make_pair(iterator(_tables[hashi], this), true);
    }

    iterator Find(const K& key)
    {
        KeyofT kot;
        size_t hashi = HashF()(key) % _tables.size();
        node* cur = _tables[hashi];
        while (cur)
        {
            if (kot(cur->_Data) == key)  // found: return an iterator to the node
            {
                return iterator(cur, this);
            }
            else
            {
                cur = cur->_next;
            }
        }
        return iterator(nullptr, this);  // not found: return the null iterator
    }

    bool Erase(const K& key)
    {
        KeyofT kot;
        size_t hashi = HashF()(key) % _tables.size();
        node* cur = _tables[hashi];
        node* prev = nullptr;
        while (cur)
        {
            if (kot(cur->_Data) == key)
            {
                if (cur == _tables[hashi])  // the match is the first node in the bucket
                {
                    _tables[hashi] = cur->_next;
                }
                else
                {
                    prev->_next = cur->_next;
                }
                --_n;
                delete cur;
                return true;
            }
            else
            {
                prev = cur;
                cur = cur->_next;
            }
        }
        return false;
    }

private:
    size_t _n = 0;
    vector<node*> _tables;
};

The encapsulated hash table's const iterator does not reuse the ordinary iterator

After the encapsulation you can see that only the ordinary iterator is implemented here; a const iterator is not obtained by reusing the ordinary one. Compare the SGI STL source:

[screenshot: SGI STL hashtable iterator definition]

[screenshot: SGI STL hashtable const_iterator definition]
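
For reference, here is a sketch of what a separate const iterator would look like in this style (my own paraphrase of the SGI approach, not code from this article): a distinct class whose members point to const, mirroring the ordinary iterator's logic.

template<class K, class T, class HashF, class KeyofT>
struct __HTConstIterator
{
    typedef HashNode<T> node;
    typedef BucketHash<K, T, HashF, KeyofT> HT;
    const HT* _ht;      // points to a const table
    const node* _node;  // points to a const node

    __HTConstIterator(const node* pnode, const HT* ht)
        :_node(pnode)
        , _ht(ht)
    {}

    const T& operator*()const { return _node->_Data; }
    const T* operator->()const { return &_node->_Data; }
    // operator++, operator== and operator!= mirror the ordinary iterator
};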

Unordered_Map overall code

template<class K, class V, class HashF = HashFunc<K>>
class Unordered_Map
{
    struct KeyofMap
    {
        const K& operator()(const pair<const K, V>& kv)
        {
            return kv.first;
        }
    };

public:
    typedef typename BUCKET::BucketHash<K, pair<const K, V>, HashF, KeyofMap>::iterator iterator;

    iterator begin()
    {
        return _mp.begin();
    }

    iterator end()
    {
        return _mp.end();
    }

    pair<iterator, bool> Insert(const pair<K, V>& kv)
    {
        return _mp.Insert(kv);
    }

    iterator Find(const K& key)
    {
        return _mp.Find(key);
    }

    bool Erase(const K& key)
    {
        return _mp.Erase(key);
    }

    V& operator[](const K& key)
    {
        // insert (key, V()) if absent; either way, return a reference to the mapped value
        pair<iterator, bool> ret = _mp.Insert(make_pair(key, V()));
        return ret.first->second;
    }

private:
    BUCKET::BucketHash<K, pair<const K, V>, HashF, KeyofMap> _mp;
};

Unordered_Set overall code

template<class K, class HashF = HashFunc<K>>
class Unordered_Set
{
    struct KeyofSet
    {
        const K& operator()(const K& key)
        {
            return key;
        }
    };

public:
    typedef typename BUCKET::BucketHash<K, K, HashF, KeyofSet>::iterator iterator;

    iterator begin()
    {
        return _st.begin();
    }

    iterator end()
    {
        return _st.end();
    }

    pair<iterator, bool> Insert(const K& key)
    {
        return _st.Insert(key);
    }

    iterator Find(const K& key)
    {
        return _st.Find(key);
    }

    bool Erase(const K& key)
    {
        return _st.Erase(key);
    }

private:
    BUCKET::BucketHash<K, K, HashF, KeyofSet> _st;
};

Origin: blog.csdn.net/m0_71841506/article/details/130254900