[C++] Hashing: Closed Hashing

1. The concept of hashing

After studying the binary search tree, AVL tree, and red-black tree, we know that in sequential structures and balanced trees there is no direct relationship between an element's key and its storage location, so finding an element requires multiple key comparisons. Sequential search has time complexity O(N), and in a balanced tree the search cost is proportional to the tree height, O(log N). In both cases, search efficiency depends on the number of comparisons performed during the search.

The ideal search method: obtain the element from the table directly, in one step, without any comparisons.

If we can construct a storage structure in which some function (hashFunc) establishes a one-to-one mapping between an element's storage location and its key, then the element can be found quickly through that function when searching.

When using this structure:

  • Insert an element
    Compute the element's storage position from its key with the hash function, and store it at that position.

  • Search for an element
    Apply the same function to the key, treat the result as the element's storage position, and compare the element at that position in the structure; if the keys are equal, the search succeeds.

This approach is called hashing, the conversion function used is called the hash function, and the resulting structure is called a hash table.


With the method above, elements are placed directly by the mapping, and no sequence of key comparisons is needed when searching, so lookup is very fast.

1. Hash Collision

For two data elements whose keys satisfy key1 != key2 but key1 % p == key2 % p, key1 and key2 will be mapped to the same position in the hash table.

Suppose we store a number n at index n % 10. Now we want to store 20 in the table: the computed index is 0, but that slot already holds data, so a hash collision occurs.

At this point, different keys are mapped by the same hash function to the same hash address; this phenomenon is called a hash collision (or hash conflict).

Data elements that have different keys but the same hash address are called "synonyms".
So how do we deal with hash collisions?

2. Hash function

One cause of hash collisions is a hash function that is not designed well enough.

Hash function design principles:

  • The domain of the hash function must include all keys that need to be stored, and if the hash table allows m addresses, its range of values must lie between 0 and m-1.
  • The addresses computed by the hash function should be distributed as evenly as possible across the whole space.
  • The hash function should be simple.

Two common hash functions

① Direct addressing method

Take a linear function of the key as the hash address: hash(key) = A * key + B

Advantages: simple and uniform

Disadvantage: the distribution of the keys must be known in advance

Usage scenario: suitable when the keys are relatively small and continuous.

Example problem: 387. First Unique Character in a String (a sketch is given below)
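As a rough illustration of direct addressing, here is a minimal sketch for that problem, assuming (per the problem statement) the string contains only lowercase letters; the function name firstUniqChar is just the conventional one and not part of this article's code:

#include <string>
using namespace std;

// Direct addressing: hash(ch) = ch - 'a' maps each letter straight to its own
// counting slot, so no collisions are possible.
int firstUniqChar(const string& s)
{
	int count[26] = { 0 };              // one slot per lowercase letter
	for (char ch : s)
		++count[ch - 'a'];
	for (size_t i = 0; i < s.size(); ++i)
		if (count[s[i] - 'a'] == 1)
			return (int)i;              // first character that appears exactly once
	return -1;
}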

② Division with remainder method (commonly used)

Let the number of addresses allowed in the hash table be m. Take a prime number p that is closest to, but not greater than, m as the divisor, and compute the hash address from the key according to the hash function: hash(key) = key % p (p <= m).

For example:

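Here is a small sketch of the mapping, assuming a table of m = 10 slots and, for simplicity, using p = 10 as the divisor (the examples later in this article do the same; a textbook implementation would pick a prime p <= m):

#include <iostream>
using namespace std;

int main()
{
	int keys[] = { 1, 11, 4, 15, 26, 7 };
	const size_t p = 10;
	for (int key : keys)
		cout << key << " -> index " << key % p << endl;
	// 1 and 11 both map to index 1: exactly the hash collision discussed above,
	// which closed hashing resolves by probing for the next empty slot.
	return 0;
}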

3. Load factor

When inserting into the hash table, if the mapped position is occupied we have to keep probing until an empty slot is found. If only one empty slot is left in the whole table, inserting a single element can degrade to O(N), so we need to expand the table before this happens.

Under what circumstances should the table be expanded, and how large should it become? And if the division-with-remainder method is used, what should the prime p be?

The load factor of a hash table is defined as: α = number of elements stored in the table / length of the hash table


The smaller the load factor, the lower the probability of collision, but the lower the space utilization.
The larger the load factor, the higher the probability of collision, but the higher the space utilization.

The usual practice in library implementations is to expand the table once the load factor exceeds 0.7.

2. Closed hashing

Closed hashing is also called open addressing. When a hash collision occurs, as long as the hash table is not full there must be an empty slot somewhere in it, so the key can be stored in the next empty position after the one where the collision happened.

Linear probing and quadratic probing are two ways of finding that next empty position. Next we will combine the theory with an implementation.

1. Linear probing

Linear probing: starting from the position where the collision occurred, probe the following slots one by one until the next empty position is found.

Structure definition:

First we define the structure. Each position has one of three states: {EXIST, EMPTY, DELETE}. A position either already holds data or it does not, so why is a DELETE state needed?

Consider this situation: 20 collides and is inserted into the slot right after 3 by linear probing, and then 3 is deleted. If deleting 3 simply marked its slot empty, a later search for 20 would stop at that empty slot and 20 could no longer be found. We add a DELETE state to prevent exactly this.


We use an array for storage, where each slot is a HashDate object that records the key-value pair and the state of that slot. A separate _size member records the number of valid elements currently stored. The structure is as follows:

enum State { EMPTY, EXIST, DELETE };   // each slot has one of three states

template<class K, class V>
struct HashDate
{
	pair<K, V> _kv;
	State _state;
};

template<class K, class V>
class HashTable
{
public:
	// member functions:
private:
	vector<HashDate<K, V>> _table;
	size_t _size = 0;
};

2. Insert

  • Compute the position of the element to be inserted in the hash table from its key with the hash function (division with remainder).
  • If that position holds no element, insert the new element directly. If it already holds an element, a hash collision has occurred: use linear probing to find the next empty position and insert the new element there.


Precautions:

  1. When taking the modulus, should we use the vector's size() or its capacity()?
    Use size(): capacity() is the total space the vector has allocated, but the part beyond size() cannot be accessed directly; only the slots within size() are usable, and size() is changed by inserting data or calling resize(). In short, the region between size() and capacity() has been allocated but must not be accessed directly.

  2. What do we do when linear probing keeps advancing and the index exceeds _table.size()?
    If probing runs past the end of the array, it should wrap around to the beginning. So after each increment we take the modulus again: once the index reaches size() it automatically wraps back to index 0. (Equivalently, an if statement can check whether the index has reached size() and reset it to 0.)

  3. What do we do when the load factor exceeds 0.7?
    That is, when _size * 10 / _table.size() >= 7 we need to expand: create a new, larger hash table and copy the data from the old table into it. We can reuse Insert for the copy, because the new table will not need to expand again during it, so Insert's insertion logic re-maps every element into the new table. Finally we swap the new table's vector with the old one; the local table then holds the old storage, and its destructor releases it automatically. This solves the expansion problem.

bool Insert(const pair<K, V>& kv)
{
	// expand if size == 0 or the load factor is >= 0.7
	if (_table.size() == 0 || 10 * _size / _table.size() >= 7)
	{
		size_t newSize = _table.size() == 0 ? 10 : _table.size() * 2;
		HashTable<K, V> newHash;
		newHash._table.resize(newSize);
		// copy the data from the old table into the new one --- reuse Insert
		for (auto e : _table)
		{
			if (e._state == EXIST)
			{
				newHash.Insert(e._kv);
			}
		}
		// swap the tables; newHash then releases the old storage in its destructor
		_table.swap(newHash._table);
	}
	size_t hashi = kv.first % _table.size();
	while (_table[hashi]._state == EXIST)   // keep probing while the slot is occupied
	{
		hashi++;
		// if hashi goes past size(), wrap around to the start of the array
		hashi %= _table.size();
	}
	// found an empty position: insert the data
	_table[hashi]._kv = kv;
	_table[hashi]._state = EXIST;
	++_size;
	return true;
}

Note that if a negative number is inserted, an integral conversion takes place: the int key is converted to size_t before the modulo, so the negative number still yields a legal (non-negative) mapping position and can be found again later.
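A quick sketch of what that conversion does, assuming a typical 64-bit platform where size_t is 64 bits wide:

#include <iostream>
using namespace std;

int main()
{
	int key = -5;
	size_t table_size = 10;
	// The int is converted to size_t before the modulo, so the result is a valid
	// index in [0, table_size). On a 64-bit platform, (size_t)-5 is 2^64 - 5, and
	// (2^64 - 5) % 10 == 1, so the key -5 maps to slot 1.
	size_t hashi = key % table_size;
	cout << hashi << endl;   // prints 1 on such a platform
	return 0;
}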

One more question remains: what about duplicate data? That is, if the value being inserted already exists in the table, how should it be handled?

3. Find

We can write a Find function and call it before inserting: if the key already exists, we do not insert it again.

HashDate<K, V>* Find(const K& key)
{
	// handle an empty table
	if (_table.size() == 0)
		return nullptr;
	size_t hashi = key % _table.size();
	while (_table[hashi]._state != EMPTY)
	{
		// found the key, and the slot has not been marked DELETE
		if (_table[hashi]._kv.first == key && _table[hashi]._state != DELETE)
		{
			return &_table[hashi];
		}
		hashi++;
		// if we run past the end of the table, wrap back to the beginning
		hashi %= _table.size();
	}
	return nullptr;
}

There is still one problem: the loop condition is _state != EMPTY. If every slot in the table is EXIST or DELETE (for example after many insertions and deletions), probing will wrap around past the starting point and never terminate. So we should also remember the starting position and return nullptr when probing comes back around to it. (This is an extreme case, but it can occur.)
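A minimal sketch of that guard, modifying the Find above (only the bookkeeping for the starting position is new):

HashDate<K, V>* Find(const K& key)
{
	if (_table.size() == 0)
		return nullptr;

	size_t start = key % _table.size();   // remember the starting position
	size_t hashi = start;
	while (_table[hashi]._state != EMPTY)
	{
		if (_table[hashi]._state != DELETE && _table[hashi]._kv.first == key)
			return &_table[hashi];

		hashi++;
		hashi %= _table.size();
		if (hashi == start)               // wrapped all the way around: key is not here
			return nullptr;
	}
	return nullptr;
}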

4. Erase

The idea of deletion is very simple: if Find locates the value, just change the state of its slot to DELETE.

bool Erase(const K& key)
{
	HashDate<K, V>* ret = Find(key);
	if (ret)
	{
		ret->_state = DELETE;
		--_size;
		return true;
	}
	return false;
}

5. Insert complex types

Now, what if we want a hash table that counts occurrences and the key is of type string? A string (or another complex, user-defined type) cannot be taken modulo directly. How do we handle such keys?

Let's look at how the STL handles this: std::unordered_map takes a hash functor as a template parameter (std::hash<Key> by default), and that functor converts the key into an integer that can then be used in the modulo.

So we will also write a default hash functor and pass it as a defaulted template parameter, as sketched below.

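A sketch of that change (this matches the full source shown later in the article): a default HashFunc<K> that simply converts the key to size_t, passed as a defaulted third template parameter.

template<class K>
struct HashFunc
{
	size_t operator()(const K& key)
	{
		return (size_t)key;          // works for integer-like keys
	}
};

// The table takes the functor as a defaulted template parameter; every place
// that computed key % _table.size() now computes hash(key) % _table.size().
template<class K, class V, class Hash = HashFunc<K>>
class HashTable
{
	// ... Insert / Find / Erase:  Hash hash;  size_t hashi = hash(key) % _table.size();
};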

However, the unordered_map in the library does not require us to pass a functor for string keys, because string is a very common key type: the library uses template specialization to handle it specially. We will do the same and provide a specialization.

template<>
struct HashFunc<string>
{
	size_t operator()(const string& key)
	{
		size_t val = 0;
		for (auto ch : key)
			val = val * 131 + ch;
		return val;
	}
};

6. Quadratic probing

Quadratic probing does not mean probing twice. Instead of stepping through slots one by one, the i-th probe looks at an offset of i squared from the original hash position (hash + 1^2, hash + 2^2, hash + 3^2, ...), which spreads colliding keys out instead of piling them into contiguous runs.

Inserting the same set of data with linear probing and with quadratic probing produces different layouts: linear probing clusters colliding keys together, while quadratic probing spreads them further apart.


To switch over, only the probing step of Insert needs to change, roughly as sketched below:

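A rough sketch of that change inside Insert, using the same member names as before (note that guaranteeing the probe sequence always finds an empty slot generally also requires a lower load factor and a prime table size; that detail is omitted here):

	Hash hash;
	size_t start = hash(kv.first) % _table.size();
	size_t hashi = start;
	size_t i = 1;
	while (_table[hashi]._state == EXIST)
	{
		// probe at start + 1^2, start + 2^2, start + 3^2, ... wrapping around
		hashi = (start + i * i) % _table.size();
		++i;
	}
	// found a position, insert as before
	_table[hashi]._kv = kv;
	_table[hashi]._state = EXIST;
	++_size;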

3. Source code and test cases

1. Hash table

#include <iostream>
#include <vector>
#include <string>
using namespace std;

enum State { EMPTY, EXIST, DELETE };   // each slot has one of three states

template<class K, class V>
struct HashDate
{
	pair<K, V> _kv;
	State _state = EMPTY;
};

template<class K>
struct HashFunc
{
	size_t operator()(const K& key)
	{
		return (size_t)key;
	}
};

template<>
struct HashFunc<string>
{
	size_t operator()(const string& key)
	{
		size_t val = 0;
		for (auto ch : key)
			val = val * 131 + ch;
		return val;
	}
};

template<class K, class V, class Hash = HashFunc<K>>
class HashTable
{
public:
	bool Insert(const pair<K, V>& kv)
	{
		// if the data already exists in the table, do not insert it again
		if (Find(kv.first))	return false;
		// expand if size == 0 or the load factor is >= 0.7
		if (_table.size() == 0 || 10 * _size / _table.size() >= 7)
		{
			size_t newSize = _table.size() == 0 ? 10 : _table.size() * 2;
			HashTable<K, V, Hash> newHash;
			newHash._table.resize(newSize);
			// copy the data from the old table into the new one --- reuse Insert
			for (auto e : _table)
			{
				if (e._state == EXIST)
				{
					newHash.Insert(e._kv);
				}
			}
			// swap the tables; newHash releases the old storage in its destructor
			_table.swap(newHash._table);
		}

		Hash hash;
		size_t hashi = hash(kv.first) % _table.size();

		while (_table[hashi]._state == EXIST)   // keep probing while the slot is occupied
		{
			hashi++;
			// if hashi goes past size(), wrap around to the start of the array
			hashi %= _table.size();
		}
		// found an empty position: insert the data
		_table[hashi]._kv = kv;
		_table[hashi]._state = EXIST;
		++_size;
		return true;
	}

	HashDate<K, V>* Find(const K& key)
	{
		// handle an empty table
		if (_table.size() == 0)
			return nullptr;

		Hash hash;
		size_t hashi = hash(key) % _table.size();
		while (_table[hashi]._state != EMPTY)
		{
			// found the key, and the slot has not been marked DELETE
			if (_table[hashi]._kv.first == key && _table[hashi]._state != DELETE)
			{
				return &_table[hashi];
			}
			hashi++;
			// if we run past the end of the table, wrap back to the beginning
			hashi %= _table.size();
		}
		return nullptr;
	}

	bool Erase(const K& key)
	{
		HashDate<K, V>* ret = Find(key);
		if (ret)
		{
			ret->_state = DELETE;
			--_size;
			return true;
		}
		return false;
	}

	void Print()
	{
		for (size_t i = 0; i < _table.size(); i++)
		{
			if (_table[i]._state == EXIST)
				cout << "i:" << i << " [" << _table[i]._kv.first << " " << _table[i]._kv.second << "]" << endl;
		}
	}

private:
	vector<HashDate<K, V>> _table;
	size_t _size = 0;
};

2. Test cases

void test_hash01()
{
	HashTable<int, int> Hash;
	int a[] = { 1,11,4,15,26,7 };
	for (auto e : a)
	{
		Hash.Insert(make_pair(e, e));
	}
	Hash.Print();
	cout << endl;
}

void test_hash02()
{
	HashTable<int, int> Hash;
	int a[] = { 1,11,4,15,26,7,13,5,34,9 };
	for (auto e : a)
	{
		Hash.Insert(make_pair(e, e));
	}
	Hash.Print();
	cout << endl;
}

void test_hash03()
{
	HashTable<int, int> Hash;
	int a[] = { 1,11,4,15,26,7,13,5,34,9 };
	for (auto e : a)
	{
		Hash.Insert(make_pair(e, e));
	}
	Hash.Print();
	cout << endl << "find:" << endl;

	cout << (Hash.Find(11)->_kv).first << endl;
	cout << (Hash.Find(4)->_kv).first << endl;
	cout << (Hash.Find(5)->_kv).first << endl;
	cout << (Hash.Find(34)->_kv).first << endl;
	cout << "Erase:" << endl;
	Hash.Erase(11);
	cout << Hash.Find(11) << endl;
}

void test_hash04_string()
{
	string arr[] = { "苹果","西瓜","菠萝","草莓","菠萝","草莓","菠萝","草莓",
		"西瓜", "菠萝", "草莓", "西瓜", "菠萝", "草莓", "苹果" };
	HashTable<string, int> countHT;
	for (auto& str : arr)
	{
		auto ptr = countHT.Find(str);
		if (ptr)
			ptr->_kv.second++;
		else
			countHT.Insert({ str, 1 });
	}
	countHT.Print();
}

void test_hash05_string()
{
	HashFunc<string> hash;
	cout << hash({ "abc" }) << endl;
	cout << hash({ "bac" }) << endl;
	cout << hash({ "cba" }) << endl;
	cout << hash({ "bbb" }) << endl;
}
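A minimal main to run these tests, assuming the hash table source and the test functions above are in the same translation unit:

int main()
{
	test_hash01();
	test_hash02();
	test_hash03();
	test_hash04_string();
	test_hash05_string();
	return 0;
}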

This is the end of this article. Writing code and articles is not easy, so please show your support!!!

Origin blog.csdn.net/weixin_67401157/article/details/132120087