Algorithm knowledge - hash algorithm (hash)

Hash algorithm

A hash algorithm is not one specific algorithm but a general term for a class of algorithms, also known as hash functions. Generally speaking, it satisfies the relationship f(data)=key: it takes input data of any length and, after processing, outputs key, a piece of data of fixed length.
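The two most important properties of a hash algorithm are irreversibility (the input cannot practically be recovered from the output) and collision resistance (it is hard to find two inputs with the same output).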
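
As a small illustration of f(data)=key, the sketch below uses SHA-256 from Python's standard hashlib module. SHA-256 is only an assumed stand-in for "a hash algorithm"; the name f is taken from the formula above, and the inputs are made up.

```python
import hashlib

def f(data: bytes) -> str:
    """Map input of any length to a fixed-length key (a SHA-256 hex digest)."""
    return hashlib.sha256(data).hexdigest()

short_key = f(b"a")              # a 1-byte input
long_key = f(b"a" * 5_000_000)   # a roughly 5 MB input

# Both keys have the same length although the inputs differ enormously in size.
print(len(short_key), len(long_key))
```

The key has 64 hexadecimal characters in both cases, and there is no practical way to rebuild the input from it.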
If data is a whole data set, the set obtained after processing it with the hash algorithm is keys. Mapping keys to the original data one by one gives a hash table. Generally speaking, a hash table M has the form M[key]=data.
The advantage of a hash table is that even when the original data is large, the hash algorithm yields a fixed-length hash value key, so the key is much smaller than the original data. We can use this smaller data set as an index to make lookups fast.
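
The form M[key]=data can be sketched as follows. This is only an illustration: SHA-256 is an assumed choice of hash algorithm, the records are invented, and build_table and contains are hypothetical helper names. (Python's dict is itself implemented as a hash table; here it simply holds the mapping.)

```python
import hashlib

def f(data: bytes) -> str:
    """Assumed hash algorithm for this sketch: SHA-256 hex digest."""
    return hashlib.sha256(data).hexdigest()

def build_table(items):
    """Return M such that M[f(data)] == data for every item."""
    return {f(data): data for data in items}

def contains(table, candidate: bytes) -> bool:
    """Look up the small fixed-length key instead of scanning the large data."""
    return f(candidate) in table

M = build_table([b"first record", b"second record", b"third record"])
print(contains(M, b"second record"), contains(M, b"unknown record"))
```

The lookup only ever handles the short key, however large the stored records are.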
If you think about it for a moment, you will notice that the input data has no fixed length while the output hash value does. The set of hash values is therefore finite, while the set of possible inputs is infinite, so a one-to-one relationship between them is impossible. "Collisions" (different input data corresponding to the same hash value) are therefore bound to occur, and a mature hash algorithm is one with good collision resistance. Collisions also have to be handled when implementing a hash table.
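
The counting argument can be made concrete with a deliberately weak hash. In the sketch below, tiny_hash is a made-up function that keeps only the first byte of a SHA-256 digest, so it has just 256 possible values; feeding it 257 different inputs must therefore produce at least one collision.

```python
import hashlib

def tiny_hash(data: bytes) -> int:
    """A deliberately weak hash: only the first byte of SHA-256 (256 possible values)."""
    return hashlib.sha256(data).digest()[0]

def find_collision(limit: int = 257):
    """Hash the inputs b"0", b"1", ... until two different inputs share a hash value."""
    seen = {}  # hash value -> first input that produced it
    for i in range(limit):
        data = str(i).encode()
        h = tiny_hash(data)
        if h in seen:
            return seen[h], data, h
        seen[h] = data
    return None

print(find_collision())
```

With 257 inputs and 256 possible outputs, find_collision always returns a pair. A real hash algorithm has far more possible outputs, which makes collisions rare but never impossible.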

Example:
Suppose there are 10,000 songs that need to be stored in some way. Later, someone gives you a new song (named X) and asks you to confirm whether it is among the 10,000 songs.

Undoubtedly, comparing against 10,000 songs one by one is very slow. But if there is a way to condense the data of each of the 10,000 songs into a number (called a hash code), giving 10,000 numbers, then you can use the same algorithm to calculate the code of the new song X and check whether that code is among the 10,000 numbers. That tells you whether song X is among the 10,000 songs.

The algorithm that condenses a song's 5 MB of data into a single number is the hash algorithm.
The table obtained by arranging the 10,000 songs by their hash codes, from smallest to largest, is a hash table.

Obviously, because information is lost, several songs may have the same hash code. A good hash algorithm minimizes such collisions, so that different songs almost always get different hash codes. The worst hash algorithm is, naturally, one that gives every song the same hash code.

As an example, if you want to organize those 10,000 songs, a simple hash algorithm is to use the number of bytes the song occupies on disk as its hash code. You can then sort the 10,000 songs by size, and when a new song arrives, just check whether its byte count matches that of one of the existing 10,000 songs. If no byte count matches, the new song is not among them; if one matches, the new song may be among them.

For a collection of 10,000 songs, this algorithm works quite well, since two songs are unlikely to have exactly the same number of bytes. Even in the rare case that different songs share a hash code, only a few songs are involved, and you can compare those one by one.
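
The size-as-hash-code idea, including the one-by-one comparison when sizes collide, might look like the sketch below. The songs are tiny made-up byte strings rather than real audio, and size_hash, build_index, and contains are hypothetical names.

```python
from collections import defaultdict

def size_hash(song: bytes) -> int:
    """The simple hash from the text: the number of bytes the song occupies."""
    return len(song)

def build_index(songs):
    """Group songs by hash code; songs with equal size land in the same bucket."""
    index = defaultdict(list)
    for song in songs:
        index[size_hash(song)].append(song)
    return index

def contains(index, song: bytes) -> bool:
    """Compare byte-for-byte only against the few songs with the same hash code."""
    candidates = index.get(size_hash(song), [])
    return any(song == existing for existing in candidates)

library = build_index([b"aaaa", b"bbbb", b"cc"])
print(contains(library, b"bbbb"), contains(library, b"dddd"), contains(library, b"e"))
```

Here b"dddd" has the same size as two stored songs, so its hash code collides, but the follow-up comparison correctly reports that it is not in the library.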

Application: message digests

What is a message digest? Here is an analogy. You send a letter to a distant friend, but you worry that the letter will be swapped along the way, so you tell your friend through another channel: my letter has 134 words, 34 punctuation marks, and 72 occurrences of the letter A. This makes it easier for your friend to tell a genuine letter from a fake.
———————————————————————————————————————————————————
On a computer, this method is used to verify that large files were transferred correctly. Say you download a one-gigabyte file; because of transmission errors, a few bytes may well be wrong. The traditional approach is to send the original file a second time and compare the two copies. If they differ, the file is transferred again, until two copies match.

But this method is extremely clumsy. A smarter way is to attach a short note to the large file, for example: the binary form of this file contains an odd number of 1s. If the file is corrupted at random, counting the number of 1s in the received file is enough to check for an error, and if there is one, the file is retransmitted.
This method can only detect about 1/2 of the errors, but the hash algorithms used in practice condense a file into a short string of letters and digits. The probability that the file changes while this string stays the same is extremely low.
This detects the vast majority of errors.
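
To compare the two checks, the sketch below corrupts a made-up byte string in two ways. The parity function counts 1 bits as described above; SHA-256 stands in, as an assumption, for the hash algorithm used in practice.

```python
import hashlib

def parity(data: bytes) -> int:
    """Return 1 if the data contains an odd number of 1 bits, else 0."""
    return sum(bin(byte).count("1") for byte in data) % 2

def digest(data: bytes) -> str:
    """Assumed practical hash for this sketch: SHA-256 hex digest."""
    return hashlib.sha256(data).hexdigest()

original = b"large file contents"
# Simulate transmission errors by flipping bits in the first byte.
one_bit_flipped = bytes([original[0] ^ 0b00000001]) + original[1:]
two_bits_flipped = bytes([original[0] ^ 0b00000011]) + original[1:]

print(parity(original) == parity(one_bit_flipped))   # one flip changes the parity
print(parity(original) == parity(two_bits_flipped))  # two flips cancel out
print(digest(original) == digest(two_bits_flipped))  # the digest still differs
```

A single flipped bit changes the parity, but two flipped bits leave it unchanged and slip through, while the digest differs in both cases.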
———————————————————————————————————————————————————
The applications of message digests go far beyond this. On a computer, their most widespread use is in table lookup.
Take Jinshan Kuaipan, for example: people upload files to Kuaipan, but many of those files are duplicates, such as MP3s, which are largely identical from user to user. There is no need for the server to store the same data repeatedly.
A reasonable approach is to compute a hash code for the file when a user uploads it. When another user uploads the same file, the server is first checked for that hash code. If it is already there, the user does not need to upload the file at all. This is the so-called "instant upload" (magic transfer) technology, which is why files of hundreds of megabytes sometimes appear to upload in an instant.
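
The check-before-upload flow could be sketched like this. It is not how Kuaipan is actually implemented: the Server class, its methods, and client_upload are hypothetical, the server is just an in-memory dict, and SHA-256 is an assumed choice of hash code.

```python
import hashlib

class Server:
    """Hypothetical storage server that keeps one copy per hash code."""

    def __init__(self):
        self.store = {}  # hash code -> file contents

    def has(self, code: str) -> bool:
        return code in self.store

    def upload(self, code: str, data: bytes) -> None:
        self.store[code] = data

def client_upload(server: Server, data: bytes) -> str:
    """Send only the hash code first; transfer the file only if the server lacks it."""
    code = hashlib.sha256(data).hexdigest()
    if server.has(code):
        return "skipped"   # the server already has this content
    server.upload(code, data)
    return "uploaded"

server = Server()
print(client_upload(server, b"popular song"))  # first user: real transfer
print(client_upload(server, b"popular song"))  # second user: nothing to send
```

The second call returns "skipped" because only the short hash code had to travel to the server, which is what makes a large upload look instantaneous.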
——————————————————————————————————————————
Note that a file name cannot stand in for a hash code: the same file name often refers to two different files, such as two MP3s with the same name but different sound quality. File name + file size has the same problem. The most reliable choice is a hash code.
———————————————————————
One more point: message digests are often used for passwords, because a message digest does not record the password itself. This way, even the administrator cannot learn the password.

Suppose your password is A and its message digest is sfsg. When you log in, the browser first computes the message digest of A and sends it to the server. The server has the correct message digest on record and lets you log in after comparing the two. Even if the administrator inspects the server, he will only find the code sfsg and cannot deduce the password from it.
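
The login flow described above can be sketched as follows. The Server class and message_digest are hypothetical names, and a bare SHA-256 digest is used only to keep the illustration short; real systems generally add a per-user salt and use a deliberately slow password-hashing function.

```python
import hashlib
import hmac

def message_digest(password: str) -> str:
    """Digest computed on the client side (assumed here: SHA-256 hex digest)."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()

class Server:
    """Hypothetical server that records digests only, never the passwords."""

    def __init__(self):
        self.digests = {}  # user name -> stored message digest

    def register(self, user: str, digest: str) -> None:
        self.digests[user] = digest

    def login(self, user: str, digest: str) -> bool:
        stored = self.digests.get(user)
        # compare_digest compares in constant time to avoid leaking timing information
        return stored is not None and hmac.compare_digest(stored, digest)

server = Server()
server.register("alice", message_digest("A"))
print(server.login("alice", message_digest("A")))  # correct password
print(server.login("alice", message_digest("B")))  # wrong password
```

Anyone reading server.digests sees only the digest strings, not the password A.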
——————————————————————————————————————————
It should be added that hash tables are very important in table lookup operations: searching for a short string is much faster than searching through a large amount of data.

Many fast computations are in fact table lookups: a batch of answers is calculated in advance, and when an answer is actually needed, it is simply looked up.
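
One small, made-up instance of "pre-calculate the answers, then look them up" is counting the 1 bits in a piece of data. The table below is a plain list indexed by byte value rather than a hash table, and BIT_COUNT and count_ones are hypothetical names.

```python
# Precompute the number of 1 bits for every possible byte value (0..255).
BIT_COUNT = [bin(value).count("1") for value in range(256)]

def count_ones(data: bytes) -> int:
    """Count the 1 bits in data using one table lookup per byte."""
    return sum(BIT_COUNT[byte] for byte in data)

print(count_ones(b"\xff\x01"))
```

The 256 answers are computed once; afterwards each byte costs a single lookup instead of a fresh bit-by-bit calculation.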

A brief addition: a hash table, to put it bluntly, trades space for time. It uses more memory than the stored data actually needs in exchange for O(1) read speed.

Expanding on this, there are many more details worth discussing, such as how to improve the space utilization of a hash table, when it is best to re-hash, and so on.
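
To give a feel for the space/time trade-off and for re-hashing, here is a toy hash table with separate chaining. It is only a sketch under assumed parameters: the class name, the initial capacity of 8, and the 0.75 load-factor threshold are illustrative choices, not recommendations from the text.

```python
class ChainedHashTable:
    """Toy hash table: a list of buckets, each bucket a list of (key, value) pairs."""

    def __init__(self, capacity: int = 8, max_load: float = 0.75):
        self.buckets = [[] for _ in range(capacity)]
        self.size = 0
        self.max_load = max_load  # re-hash once size / bucket count exceeds this

    def _index(self, key) -> int:
        return hash(key) % len(self.buckets)

    def put(self, key, value) -> None:
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:              # key already present: overwrite its value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))   # colliding keys share a bucket
        self.size += 1
        if self.size / len(self.buckets) > self.max_load:
            self._rehash()

    def get(self, key, default=None):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return default

    def _rehash(self) -> None:
        """Double the bucket count and redistribute every stored pair."""
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]
        for bucket in old:
            for k, v in bucket:
                self.buckets[self._index(k)].append((k, v))

table = ChainedHashTable()
for i in range(100):
    table.put(f"song-{i}", i)
print(table.get("song-42"), len(table.buckets))
```

Keeping the load factor below the threshold means more buckets than entries (extra space) but short chains (fast reads). A lower threshold wastes more memory, while a higher one lets the chains grow longer.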
