Data structure: Hash table explained

1. Hash concept

Hash: a mapping idea, also called hashing — a key establishes an association with another value. Note that the association can take many forms: given a key, you might determine whether a value exists, or obtain other information through the mapping relationship; the other value does not necessarily have to be stored.

Hash table: also called a scatter table, it embodies the hashing idea by associating each key with a storage location. The relationship here is quite concrete: usually key-value pairs are stored in the hash table, and the storage location of a pair is found through its key, allowing the value to be looked up quickly.

Hash tables are mainly used to improve search efficiency. Here is a comparison:

  • Sequence table: time complexity is O(N), brute-force search.
  • Balanced search tree: time complexity is O(log₂N); efficiency is stable and relatively fast.
  • Hash table: average time complexity is O(1), constant-level search (this is the average complexity; a hash table can degrade in extreme cases, analyzed later).
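As a concrete illustration of the gap, here is a minimal sketch contrasting a sequential scan with a hash lookup using the standard containers (the helper names `seq_contains` and `hash_contains` are ours, for illustration only):

```cpp
#include <vector>
#include <unordered_set>
#include <algorithm>

// Sequence table: a linear scan costs O(N) per lookup.
bool seq_contains(const std::vector<int>& v, int key)
{
	return std::find(v.begin(), v.end(), key) != v.end();
}

// Hash table: on average O(1) per lookup.
bool hash_contains(const std::unordered_set<int>& s, int key)
{
	return s.count(key) > 0;
}
```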



2. Determine the storage location through the key code

2.1 Hash method

We usually convert the key to determine the storage location. This conversion method is the hashing method, and the conversion function it uses is called the hash function (the method is a guideline; hash function designs can differ).


⭐How the hash function relates to the two common hash table operations:

  1. Insert element
    From the key of the element to be inserted, the hash function calculates the element's storage location, and the element is stored at that location.
  2. Find element
    Perform the same calculation on the element's key, treat the resulting function value as the element's storage location, and compare at that position in the structure. If the keys are equal, the search succeeds.

This article mainly talks about two hashing methods:

  • the direct addressing method
  • the division-with-remainder method

2.2 Direct addressing method

This method's hash function: hashi = a * key + b (where a and b are custom constants, a != 0).
Concept: key and position establish a unique relationship.

Applicable scenarios: the keys are concentrated (for example, counting letter occurrences: the keys are letters, all concentrated in a small interval). Disadvantage: when the keys are scattered, it causes serious waste of space.
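The letter-counting example above can be sketched directly (the `count_letters` helper is ours for illustration; here a = 1 and b = -'a', so each lowercase letter maps straight to its own slot):

```cpp
#include <array>
#include <string>

// Direct addressing: hashi = a * key + b with a = 1, b = -'a'.
// Each lowercase letter gets a unique slot, so there are no collisions.
std::array<int, 26> count_letters(const std::string& s)
{
	std::array<int, 26> cnt{};   // zero-initialized counts
	for (char ch : s)
		if (ch >= 'a' && ch <= 'z')
			++cnt[ch - 'a'];     // key -> unique position
	return cnt;
}
```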






2.3 Division with remainder method

This method's hash function: hashi = key % len (where hashi is the storage subscript, key is the key, and len is the length of the hash table).
Concept: by converting the key, the storage location always falls inside the hash table's space.


Applicable scenarios: this method can be used whether the keys are concentrated or scattered; after calculation by the hash function, the storage location falls within a fixed space.

Disadvantages: different keys may produce the same storage location through the hash function, causing a conflict. This phenomenon is called a hash collision, and resolving hash collisions is the core of what follows.
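A quick sketch of the mapping and of the collision it can produce (the `slot` helper is ours for illustration):

```cpp
#include <cstddef>

// Division with remainder: the computed slot always lies in [0, len).
size_t slot(size_t key, size_t len)
{
	return key % len;
}
```

Note that 3 and 33 land in the same slot of a length-10 table — that is exactly the hash collision described above.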



3. Hash collision concept

Concept: different keys computing the same hash address through the same hash function — this phenomenon is called a hash collision (or hash conflict).


The occurrence of hash collisions is related to the hash function: the more reasonable its design, the fewer the collisions. Here are several hashing methods:

  • Direct addressing method (commonly used), analyzed above, not repeated here.
  • Division-with-remainder method (commonly used), analyzed above, not repeated here.
  • Mid-square method (for understanding)
    Suppose the keyword is 1234; its square is 1522756, and the middle three digits 227 are taken as the hash address.
    Likewise for keyword 4321: its square is 18671041, and the middle three digits 671 (or 710) are taken as the hash address.
    The mid-square method suits cases where the distribution of the keywords is unknown and the number of digits is not very large.
  • Folding method (for understanding)
    Split the keyword from left to right into parts with an equal number of digits (the last part may be shorter), sum these parts, and, based on the hash table length, take the last few digits as the hash address.
    The folding method suits cases where the keyword distribution need not be known in advance and the keywords have many digits.
  • Random number method (for understanding)
    Choose a random function and take its value on the keyword as the hash address, i.e. H(key) = random(key), where random is a random-number function.
    This method is usually used when keyword lengths differ.
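The mid-square examples above can be checked with a small sketch (the `mid_square` helper and its rule for picking the middle digits are assumptions for illustration; real implementations vary):

```cpp
#include <cstddef>
#include <string>

// Mid-square method (sketch): square the key, write the square in decimal,
// and take `digits` digits from the middle as the hash address.
// Assumes digits <= number of digits in the square.
size_t mid_square(size_t key, size_t digits)
{
	unsigned long long sq = (unsigned long long)key * key;
	std::string s = std::to_string(sq);
	size_t start = (s.size() - digits) / 2;   // left edge of the middle window
	return std::stoull(s.substr(start, digits));
}
```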

Summary:

  1. The direct addressing method has no hash collisions, but its applicable scenarios are limited.
  2. The other methods can all collide. Among them, the division-with-remainder method applies most widely; it is the hashing method used in this article.



4. Resolve hash conflicts

Two common ways to resolve hash collisions are closed hashing and open hashing.

4.1 Closed hashing

4.1.1 Concept

Closed hashing: also known as open addressing — when the current location is occupied (a hash collision), find an unoccupied location elsewhere in the open space according to certain rules and store the element there.
As for how to find an unoccupied position, here are two methods:

  • Linear probing
    Starting from the position where the collision occurred, probe backwards one slot at a time until the next empty position is found.
    That is, hashi = hashi + i (i >= 0); repeat this process.
  • Quadratic probing
    Only the probing formula changes.
    That is, hashi = hashi + i^2; repeat this process until an empty position is found.

⭐One detail:

When hashi exceeds the hash table length n during probing, take a remainder to correct the subscript: hashi = hashi % n. Don't worry about failing to find an empty position — the hash table is expanded before it fills up (covered next).
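The two probe sequences, with the wrap-around correction applied, can be sketched as (the helper names are ours for illustration):

```cpp
#include <cstddef>

// Probe position i steps from starting slot `start` in a table of length n.
// Linear probing: start, start+1, start+2, ...
size_t linear_probe(size_t start, size_t i, size_t n)
{
	return (start + i) % n;       // wrap around past the end of the table
}

// Quadratic probing: start, start+1, start+4, start+9, ...
size_t quadratic_probe(size_t start, size_t i, size_t n)
{
	return (start + i * i) % n;   // same wrap-around correction
}
```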


4.1.2 Hash table expansion

Load factor: _n / _tables.size(), where the numerator is the number of inserted elements and the denominator is the table's space size. To keep the number of probes during search small, the load factor is generally kept below 0.7; once it exceeds 0.7, the table is expanded.
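The expansion trigger can be sketched as (a hypothetical `needs_expand` helper, mirroring the check used in the Insert code later):

```cpp
#include <cstddef>

// Load factor check: expand once inserted-count / table-size reaches 0.7,
// the threshold used throughout this article.
bool needs_expand(size_t n, size_t table_size)
{
	return (double)n / (double)table_size >= 0.7;
}
```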


Key point of hash table expansion:
You cannot simply allocate new space and copy the slots over, because storage locations were originally determined by hashi = key % len (len being the table length). Once len changes, the mapping between values and storage locations changes, so the mapping must be re-established.


⭐The essence of hash table expansion:
When collisions are frequent, expanding and re-establishing the mapping effectively reduces them, which is why severe degradation of hash table lookup efficiency is very rare.


4.1.3 Status of storage location

Recording the status of each storage location is necessary here: during insertion we must not overwrite another element, and deciding whether the current position conflicts requires knowing its status. There are other reasons as well, discussed in detail later.


Three states are introduced here:

  1. EMPTY: the position is empty.
  2. EXIST: the position is occupied.
  3. DELETE: the position once held data, but it has since been deleted.

  When deleting, you only need to change the corresponding status; the data need not actually be removed.

Here the status and the key-value pair are combined into a structure:

enum Status  // status of the corresponding position
{
	EMPTY,
	EXIST,
	DELETE
};

template<class K, class V>  // element stored at each position; initial status defaults to EMPTY
struct HashData
{
	pair<K, V> _kv;
	Status _s = EMPTY;
};

The meaning of each status (this part is harder to grasp):

  1. EMPTY and EXIST are straightforward: they mark whether the current position is occupied.
  2. The DELETE status mainly serves search.
    For search we again convert the key into storage-location information. Suppose a hash collision occurred when x was inserted, so x may sit somewhat past its home slot. Probing forward from that slot, there are three situations:
    (1) The current position is EXIST but does not hold the value we want: the value may be further back, so keep searching.
    (2) The current position is DELETE: the originally conflicting value has been deleted, but x may still be further back, so keep searching.
    (3) The current position is EMPTY: stop — x definitely does not exist (either x was never inserted or it has been deleted; otherwise x could have been inserted here).
  3. Without the DELETE status, deletion would have to mark slots EMPTY, and a search could then only be made correct by traversing the whole table — O(N) — which defeats the purpose of a hash table.
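The three-state search logic above can be demonstrated with a deliberately tiny, int-only closed hash (a simplified sketch, not the full class: fixed size 10, no expansion, keys only):

```cpp
#include <vector>
#include <cstddef>

enum Status { EMPTY, EXIST, DELETE };

// Minimal closed hash with linear probing and tombstones, for demonstration.
// The table never expands, so do not insert more than a few keys.
struct MiniHash
{
	struct Slot { int key = 0; Status s = EMPTY; };
	std::vector<Slot> t = std::vector<Slot>(10);

	bool insert(int key)
	{
		if (find(key)) return false;
		size_t h = (size_t)key % t.size();
		while (t[h].s == EXIST) h = (h + 1) % t.size();  // linear probing
		t[h].key = key;
		t[h].s = EXIST;
		return true;
	}
	bool find(int key)
	{
		size_t h = (size_t)key % t.size();
		while (t[h].s != EMPTY)            // stop only at EMPTY: DELETE keeps probing
		{
			if (t[h].s == EXIST && t[h].key == key) return true;
			h = (h + 1) % t.size();
		}
		return false;
	}
	bool erase(int key)
	{
		size_t h = (size_t)key % t.size();
		while (t[h].s != EMPTY)
		{
			if (t[h].s == EXIST && t[h].key == key) { t[h].s = DELETE; return true; }
			h = (h + 1) % t.size();
		}
		return false;
	}
};
```

After inserting 3 and 13 (which collide at slot 3) and then erasing 3, the tombstone at slot 3 is what lets a search for 13 keep probing instead of stopping early.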

4.1.4 About key value types

In practice the key is not necessarily a numeric type; it may be some other type, a typical example being string. Therefore a template parameter is usually added to convert non-numeric types into integers; C++ uses a functor here. This design is very flexible: users can write their own functors to fit actual needs.

Code:

template < class Key,                                     // unordered_map::key_type
	class T,                                       // unordered_map::mapped_type
	class Hash = hash<Key>,                        // unordered_map::hasher
	class Pred = equal_to<Key>,                    // unordered_map::key_equal
	class Alloc = allocator< pair<const Key, T> >  // unordered_map::allocator_type
> class unordered_map;
// the Hash parameter of unordered_map is the functor type discussed here

// the default version: usable for any key type convertible to an integer
template<class T>
struct HashFunc
{
	size_t operator()(const T& key)
	{
		return (size_t)key;
	}
};

// string keys are very common, so the library provides a specialization too
// string hash algorithms are not covered in depth here; BKDR is used
template<>
struct HashFunc<string>
{
	size_t operator()(const string& key)
	{
		size_t hashi = 0;
		for (auto ch : key)
		{
			hashi = hashi * 31 + ch;
		}
		return hashi;
	}
};
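As a quick sanity check, here is a standalone copy of the same BKDR accumulation (hashi = hashi * 31 + ch over every character), written as a free function so it can be tested in isolation:

```cpp
#include <cstddef>
#include <string>

// BKDR string hash: accumulate hashi = hashi * 31 + ch over the characters.
size_t bkdr(const std::string& key)
{
	size_t hashi = 0;
	for (char ch : key)
		hashi = hashi * 31 + ch;
	return hashi;
}
```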

4.1.5 Code implementation

Once the ideas above are understood, the code is fairly simple; comments are included, so it should be easy to follow.

// the default version: usable for any key type convertible to an integer
template<class T>
struct HashFunc
{
	size_t operator()(const T& key)
	{
		return (size_t)key;
	}
};

// string keys are very common, so the library provides a specialization too
// string hash algorithms are not covered in depth here; BKDR is used
template<>
struct HashFunc<string>
{
	size_t operator()(const string& key)
	{
		size_t hashi = 0;
		for (auto ch : key)
		{
			hashi = hashi * 31 + ch;
		}
		return hashi;
	}
};

// closed hashing
namespace closed_address
{
	enum Status  // status of the corresponding position
	{
		EMPTY,
		EXIST,
		DELETE
	};

	template<class K, class V>  // element stored at each position; initial status defaults to EMPTY
	struct HashData
	{
		pair<K, V> _kv;
		Status _s = EMPTY;
	};

	template<class K, class V, class Hash = HashFunc<K>>
	class HashTable
	{
	public:
		HashTable()
		{
			// start with 10 slots by default
			_tables.resize(10);
		}

		// Insert
		bool Insert(const pair<K, V>& kv)
		{
			if (Find(kv.first))  // already present: each key-value pair occupies one slot
			{
				return false;
			}
			Hash hf;   // converts non-numeric key types to integers

			// check whether expansion is needed
			if ((double)_n / _tables.size() >= 0.7)
			{
				// open a new table and reuse Insert to re-establish the mapping
				size_t newsize = _tables.size() * 2;
				HashTable<K, V, Hash> newHT;
				newHT._tables.resize(newsize);
				// traverse the old table
				for (size_t i = 0; i < _tables.size(); i++)
				{
					if (_tables[i]._s == EXIST)
					{
						newHT.Insert(_tables[i]._kv);
					}
				}
				// swap the two tables
				newHT._tables.swap(_tables);
			}

			size_t hashi = hf(kv.first) % _tables.size();
			// linear probing for an empty slot
			while (_tables[hashi]._s == EXIST)
			{
				hashi++;
				// wrap around when past the end of the table
				hashi %= _tables.size();
			}
			// insert
			_tables[hashi]._kv = kv;
			_tables[hashi]._s = EXIST;
			_n++;  // update the element count
			return true;
		}

		// Find
		HashData<K, V>* Find(const K& key)
		{
			Hash hf;
			size_t hashi = hf(key) % _tables.size();
			while (_tables[hashi]._s != EMPTY)  // reaching EMPTY means the key is absent
			{
				// found: the slot is occupied and the key matches, return a pointer to it
				if (_tables[hashi]._kv.first == key && _tables[hashi]._s == EXIST)
				{
					return &_tables[hashi];
				}
				// keep probing
				hashi++;
				// wrap around when past the end of the table
				hashi %= _tables.size();
			}
			return nullptr;
		}

		// Erase
		bool Erase(const K& key)
		{
			// a non-null result means the key was found
			HashData<K, V>* ret = Find(key);
			if (ret)
			{
				// just mark the slot DELETE and decrement the element count
				ret->_s = DELETE;
				_n--;
				return true;
			}
			else
			{
				return false;
			}
		}

		// the remaining interfaces are less important
		size_t Size() const
		{
			return _n;
		}

		bool Empty() const
		{
			return _n == 0;
		}

		void Swap(HashTable<K, V>& ht)
		{
			swap(_n, ht._n);
			_tables.swap(ht._tables);
		}

	private:
		vector<HashData<K, V>> _tables;
		size_t _n = 0;
	};
}



4.2 Open hashing

4.2.1 Concept

Open hashing: also known as the zipper method or hash bucket. When a collision occurs, it is absorbed internally in the form of a linked list: conflicting elements are placed in the same linked list without affecting other positions.



Node definition:

template<class K, class V>
struct HashNode
{
	HashNode* _next;
	pair<K, V> _kv;

	HashNode(const pair<K, V>& kv)
		:_next(nullptr)
		, _kv(kv)
	{}
};
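The same chaining idea can be sketched with the standard library, using a std::list per slot instead of hand-rolled nodes (an illustrative simplification, not the article's implementation: int keys only, fixed size 10, no expansion):

```cpp
#include <vector>
#include <list>
#include <cstddef>

// Minimal separate-chaining hash: each slot holds a list of colliding keys.
struct MiniBucket
{
	std::vector<std::list<int>> t = std::vector<std::list<int>>(10);

	bool insert(int key)
	{
		if (find(key)) return false;
		t[(size_t)key % t.size()].push_front(key);   // head insert, O(1)
		return true;
	}
	bool find(int key)
	{
		for (int k : t[(size_t)key % t.size()])      // traverse only this bucket
			if (k == key) return true;
		return false;
	}
	bool erase(int key)
	{
		auto& bucket = t[(size_t)key % t.size()];
		for (auto it = bucket.begin(); it != bucket.end(); ++it)
			if (*it == key) { bucket.erase(it); return true; }
		return false;
	}
};
```

Note that 3 and 33 share a bucket, yet erasing 3 touches nothing outside that one list — the key property of open hashing.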

4.2.2 Hash table expansion

Open hash expansion:

  1. As before, expansion changes the original mapping relationship, which must be re-established.
  2. The first method: open a new table and reuse Insert. This is simpler but more expensive, because node space must be re-allocated and re-initialized.
  3. The second method: open a new table, compute each node's new storage location, and move the node directly into the new table — no new allocations needed.

The Insert code:

bool Insert(const pair<K, V>& kv)
{
	if (Find(kv.first))  // already present: cannot insert again
		return false;

	Hash hf;  // converts non-numeric key types to integers

	// open-hash expansion
	if (_n == _tables.size())
	{
		vector<Node*> newTables;
		newTables.resize(_tables.size() * 2, nullptr);
		// traverse the old table
		for (size_t i = 0; i < _tables.size(); i++)
		{
			Node* cur = _tables[i];
			while (cur)
			{
				// record the next node first so the chain isn't lost
				Node* next = cur->_next;

				// move the node to its mapped slot in the new table (head insert)
				size_t hashi = hf(cur->_kv.first) % newTables.size();
				cur->_next = newTables[hashi];
				newTables[hashi] = cur;

				cur = next;
			}

			_tables[i] = nullptr;
		}

		_tables.swap(newTables);
	}

	size_t hashi = hf(kv.first) % _tables.size();
	Node* newnode = new Node(kv);

	// head-insert the new node
	newnode->_next = _tables[hashi];
	_tables[hashi] = newnode;
	++_n;

	return true;
}

4.2.3 Code implementation

// the default version: usable for any key type convertible to an integer
template<class T>
struct HashFunc
{
	size_t operator()(const T& key)
	{
		return (size_t)key;
	}
};

// string keys are very common, so the library provides a specialization too
// string hash algorithms are not covered in depth here; BKDR is used
template<>
struct HashFunc<string>
{
	size_t operator()(const string& key)
	{
		size_t hashi = 0;
		for (auto ch : key)
		{
			hashi = hashi * 31 + ch;
		}
		return hashi;
	}
};

namespace hash_bucket
{
	template<class K, class V>
	struct HashNode
	{
		HashNode* _next;
		pair<K, V> _kv;

		HashNode(const pair<K, V>& kv)
			:_next(nullptr)
			, _kv(kv)
		{}
	};

	template<class K, class V, class Hash = HashFunc<K>>
	class HashTable
	{
		typedef HashNode<K, V> Node;
	public:
		HashTable()
		{
			_tables.resize(10);
		}

		// the nodes are new'ed by us, so a destructor is needed: traverse and delete
		~HashTable()
		{
			for (size_t i = 0; i < _tables.size(); i++)
			{
				Node* cur = _tables[i];
				while (cur)
				{
					Node* next = cur->_next;
					delete cur;
					cur = next;
				}
				_tables[i] = nullptr;
			}
		}

		// Insert
		bool Insert(const pair<K, V>& kv)
		{
			if (Find(kv.first))  // already present: cannot insert again
				return false;

			Hash hf;  // converts non-numeric key types to integers

			// open-hash expansion
			if (_n == _tables.size())
			{
				vector<Node*> newTables;
				newTables.resize(_tables.size() * 2, nullptr);
				// traverse the old table
				for (size_t i = 0; i < _tables.size(); i++)
				{
					Node* cur = _tables[i];
					while (cur)
					{
						// record the next node first so the chain isn't lost
						Node* next = cur->_next;

						// move the node to its mapped slot in the new table (head insert)
						size_t hashi = hf(cur->_kv.first) % newTables.size();
						cur->_next = newTables[hashi];
						newTables[hashi] = cur;

						cur = next;
					}

					_tables[i] = nullptr;
				}

				_tables.swap(newTables);
			}

			size_t hashi = hf(kv.first) % _tables.size();
			Node* newnode = new Node(kv);

			// head-insert the new node
			newnode->_next = _tables[hashi];
			_tables[hashi] = newnode;
			++_n;

			return true;
		}

		// Find
		Node* Find(const K& key)
		{
			Hash hf;
			// locate the bucket and traverse it
			size_t hashi = hf(key) % _tables.size();
			Node* cur = _tables[hashi];
			while (cur)
			{
				if (cur->_kv.first == key)
				{
					return cur;
				}

				cur = cur->_next;
			}

			return nullptr;
		}

		// Erase
		bool Erase(const K& key)
		{
			Hash hf;
			// locate the bucket, then do an ordinary linked-list erase,
			// tracking the previous node while traversing
			size_t hashi = hf(key) % _tables.size();
			Node* prev = nullptr;
			Node* cur = _tables[hashi];
			while (cur)
			{
				if (cur->_kv.first == key)
				{
					if (prev == nullptr)
					{
						_tables[hashi] = cur->_next;
					}
					else
					{
						prev->_next = cur->_next;
					}
					delete cur;
					--_n;  // keep the element count in sync

					return true;
				}

				prev = cur;
				cur = cur->_next;
			}

			return false;
		}

		// test interface: insert lots of random data and check the average bucket
		// length -- it should be around 1-2
		void Some()
		{
			size_t bucketSize = 0;
			size_t maxBucketLen = 0;
			size_t sum = 0;
			double averageBucketLen = 0;

			for (size_t i = 0; i < _tables.size(); i++)
			{
				Node* cur = _tables[i];
				if (cur)
				{
					++bucketSize;
				}

				size_t bucketLen = 0;
				while (cur)
				{
					++bucketLen;
					cur = cur->_next;
				}

				sum += bucketLen;

				if (bucketLen > maxBucketLen)
				{
					maxBucketLen = bucketLen;
				}
			}

			averageBucketLen = (double)sum / (double)bucketSize;

			printf("all bucketSize:%zu\n", _tables.size());
			printf("bucketSize:%zu\n", bucketSize);
			printf("maxBucketLen:%zu\n", maxBucketLen);
			printf("averageBucketLen:%lf\n\n", averageBucketLen);
		}

	private:
		vector<Node*> _tables;
		size_t _n = 0;
	};
}



4.3 Comparison of open and closed hashes


First, the conclusion: in practice, open hashing is usually used; the unordered_map and unordered_set in the C++ STL are implemented with open hashing underneath.
Reason: open hashing resolves collisions with chaining, which does not interfere with other locations, so both insertion and lookup stay efficient. Taking closed hashing with linear probing as a counterexample: insert 3, 33, 333, and 4 into a table of size 10, and 4 gets pushed to subscript 6 by the collisions, so searching for 4 takes several extra probes. Open hashing has no such problem.
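The 3/33/333/4 displacement example can be reproduced with a small sketch (the `final_slots` helper is ours for illustration):

```cpp
#include <cstddef>
#include <vector>

// Insert the keys in order into a closed hash of size n with linear probing,
// and return the slot each key finally lands in.
std::vector<size_t> final_slots(const std::vector<int>& keys, size_t n)
{
	std::vector<bool> used(n, false);
	std::vector<size_t> slots;
	for (int k : keys)
	{
		size_t h = (size_t)k % n;
		while (used[h]) h = (h + 1) % n;   // probe past occupied slots
		used[h] = true;
		slots.push_back(h);
	}
	return slots;
}
```

For {3, 33, 333, 4} in a size-10 table, the keys land in slots 3, 4, 5, and 6: key 4 is displaced to subscript 6 even though it never collided with anything by itself.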



Origin blog.csdn.net/2301_76269963/article/details/134587078