Hash function

1. Basic concepts
2. MD5 algorithm
- 2.1 Algorithm structure
- 2.2 Compression function
3. SHA1 algorithm
- 3.1 Algorithm structure
- 3.2 Compression function
4. Hash function attack
5. Message authentication
- 5.1 Message Authentication Code
- 5.2 HMAC

1. Basic concepts

1.1 The concept of Hash function

A hash function, also called a hashing function or digest function, is an irreversible mapping from the message space to the image space: it transforms an input of "arbitrary" length into an output of fixed length. It is a one-way cryptographic primitive, that is, there is only a forward (hashing) process and no inverse (decryption) process.

The one-wayness and fixed output length of a hash function make it possible to generate a "digital fingerprint" of a message, also known as the message digest (MD) or hash value. Hash functions are mainly used in message authentication, digital signatures, secure transmission and storage of passwords, and file integrity verification.

The generation of the hash value is expressed as $h = H(M)$, where

$M$ is a message of any length
$H$ is the hash function
$h$ is the fixed-length hash value
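
As a quick illustration of $h = H(M)$, the sketch below hashes messages of very different lengths with Python's standard `hashlib` module and shows that the digest length does not change. The example messages are arbitrary choices made for this illustration.

```python
import hashlib

# Messages of very different lengths (arbitrary example inputs).
messages = [b"", b"abc", b"a" * 1_000_000]

for m in messages:
    h_md5 = hashlib.md5(m).hexdigest()    # 128-bit digest -> 32 hex characters
    h_sha1 = hashlib.sha1(m).hexdigest()  # 160-bit digest -> 40 hex characters
    print(len(m), h_md5, h_sha1)
```

Whatever the input length, the MD5 digest is always 16 bytes and the SHA-1 digest always 20 bytes.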

1.2 Properties of the Hash function

(1) The input message is of any finite length, and the output hash value is of fixed length.
(2) Easy to compute: for any given message $M$, it is easy to compute its hash value $h = H(M)$.
(3) One-way: also known as preimage resistance. For any given hash value $h$, finding a message $M$ such that $H(M) = h$ is computationally infeasible.
(4) Weak collision resistance: also known as second preimage resistance. For any given message $M$, finding a message $M'$ with $M' \neq M$ and $H(M) = H(M')$ is computationally infeasible.
(5) Strong collision resistance: finding any pair $(M, M')$ with $M \neq M'$ and $H(M) = H(M')$ is computationally infeasible.

In addition, a hash function should exhibit the avalanche effect: when a single bit of the input message changes, about half of the output bits change on average.
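
The avalanche effect can be observed directly. The following sketch flips one input bit and counts how many output bits of SHA-1 differ; the message and the helper name `bit_difference` are assumptions made for this illustration.

```python
import hashlib

def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

m1 = b"hash function"
m2 = bytes([m1[0] ^ 0x01]) + m1[1:]   # same message with a single input bit flipped

d1 = hashlib.sha1(m1).digest()
d2 = hashlib.sha1(m2).digest()
changed = bit_difference(d1, d2)
print(f"{changed} of {len(d1) * 8} output bits changed")
```

For a well-designed hash function the printed count is typically close to half of the 160 output bits.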

1.3 Structure of Hash function

The general structure of a hash function is the iterated hash structure, proposed independently by Merkle and Damgård. The hash function divides the input message into $L$ fixed-length blocks of $b$ bits each. The last block contains the total length of the input message; if the last block is shorter than $b$ bits, it is padded to $b$ bits.

The hash algorithm repeatedly applies a compression function $f$, which is the core of the algorithm. It takes two inputs: the $n$-bit output of the previous iteration, called the link variable (also known as the chaining variable), and a $b$-bit message block; it produces an $n$-bit output ($n < b$). The link variable fed into the first iteration is called the initial value and is fixed by the algorithm; the output of the last iteration is the hash value.
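
The iterated structure can be sketched in a few lines. The compression function `f` below is a toy stand-in (a truncated SHA-256 call) chosen only so that the example runs; it is not the compression function of any real algorithm. The block size, output size, and padding layout are likewise assumptions for this illustration.

```python
import hashlib

B = 64   # block length b in bytes (512 bits)
N = 16   # link-variable length n in bytes (128 bits), n < b

def pad(message: bytes) -> bytes:
    """Append 0x80, zero bytes, and the 64-bit message length in bits."""
    length_bits = (8 * len(message)) % 2**64
    padded = message + b"\x80"
    padded += b"\x00" * ((B - 8 - len(padded)) % B)
    return padded + length_bits.to_bytes(8, "big")

def f(cv: bytes, block: bytes) -> bytes:
    """Toy compression function: maps n + b bits to n bits (NOT a real design)."""
    return hashlib.sha256(cv + block).digest()[:N]

def iterated_hash(message: bytes, iv: bytes = bytes(N)) -> bytes:
    cv = iv                                  # initial value
    padded = pad(message)
    for i in range(0, len(padded), B):       # L blocks of b bits each
        cv = f(cv, padded[i:i + B])          # CV_{i+1} = f(CV_i, Y_i)
    return cv                                # the last link variable is the hash value

print(iterated_hash(b"abc").hex())
```

The loop makes the data flow explicit: each block is combined with the previous link variable, and only the final link variable is returned.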


2. MD5 algorithm

The MD5 algorithm was designed by Ronald Rivest, a well-known cryptographer at the Massachusetts Institute of Technology, who described it in detail in RFC 1321, submitted to the IETF in 1992. MD5 was developed on the basis of MD2, MD3, and MD4. Because it adds "safety belts" to MD4, MD5 is also called "MD4 with safety belts".

2.1 Algorithm structure

The input of the MD5 algorithm is a message of length less than $2^{64}$ bits, processed in 512-bit blocks; the output is a 128-bit message digest. (RFC 1321 also accepts longer messages, in which case only the low-order 64 bits of the length are recorded in the padding.)

In the overall structure of the algorithm:

The input message has length $N$; $Y_i$ ($i = 0, 1, \ldots, L-1$) are the message blocks, where $L$ is the number of blocks after padding

$IV$ is the initial link variable, held in four 32-bit registers $A$, $B$, $C$, and $D$

$CV_i$ is the link variable: the output of one block-processing unit and the input of the next

$CV_L$ is the output of the last unit, that is, the hash value of the message

(1) Additional padding bits

Append a single "1" bit followed by as many "0" bits as needed so that the message length is congruent to $448$ modulo $512$ (padding is always added, even if the length is already congruent to $448$). Then append the original message length, represented as a $64$-bit value, so that the total length is exactly an integer multiple of $512$ bits, namely $512 \times L$ bits.
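
The padding rule can be written as a short function. This is a minimal sketch for byte-oriented messages; the function name `md5_pad` is an assumption for this illustration, and the little-endian encoding of the length follows RFC 1321.

```python
def md5_pad(message: bytes) -> bytes:
    bit_length = (8 * len(message)) % 2**64   # only the low-order 64 bits are kept
    padded = message + b"\x80"                # a "1" bit followed by seven "0" bits
    while len(padded) % 64 != 56:             # 56 bytes = 448 bits (mod 512)
        padded += b"\x00"
    # MD5 stores the 64-bit length in little-endian byte order.
    return padded + bit_length.to_bytes(8, "little")

for n in (0, 3, 55, 56, 64):
    print(n, len(md5_pad(b"a" * n)))
```

A 55-byte message still fits into one block, while a 56-byte message needs a second block because the "1" bit and the length field no longer fit.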

(2) Packet processing (iterative compression)

The block processing (compression function) of MD5 consists of 4 rounds. The 512-bit message block $M_i$ is divided equally into 16 sub-blocks of 32 bits each, and every round uses all 16 sub-blocks, one per step, in its 16 steps. The input of each step is the four 32-bit link variables and one 32-bit message sub-block, and the output is a 32-bit value. After 4 rounds, 64 steps in total, the 4 resulting register values are added modulo $2^{32}$ to the corresponding input link variables, which gives the intermediate hash value after the current message block.


2.2 Compression function

The step function of MD5, the basic operation of the compression function, first applies a non-linear function to the last three variables of the vector $(A, B, C, D)$, then adds the result to the first variable $A$, the message sub-block $M[j]$, and the constant $T[i]$. The sum is rotated left by $s$ bits and added to the second variable $B$, and the new value is finally assigned to the first variable of the vector.

Here $M[j]$ is the $j$-th ($0 \leq j \leq 15$) 32-bit sub-block of the message block $M_i$.


(1) Pseudo-random constant

$T[i] = \lfloor 2^{32} \times |\sin(i)| \rfloor$ ($i$ in radians, $1 \leq i \leq 64$) is used to eliminate regularity in the input data. For example, for $i = 28$: $\lfloor 4294967296 \times |\sin(28)| \rfloor = \lfloor 1163531501.0793967247 \rfloor = 1163531501$

Converting $1163531501$ to hexadecimal gives 0x455A14ED.
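
The whole constant table can be generated from the formula. In this sketch the list `T` is indexed from 1 so that it matches the numbering in the text (index 0 is an unused placeholder); the example value 1163531501 corresponds to $i = 28$.

```python
import math

# T[i] for i = 1..64; T[0] is a dummy entry so that indices match the text.
T = [0] + [int(2**32 * abs(math.sin(i))) for i in range(1, 65)]

print(T[28], hex(T[28]))   # the worked example from the text
print(hex(T[1]))           # first constant of round 1
```

Because every value is positive, `int()` truncation is the same as taking the floor.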

(2) Circular left shift

$<<< s$ denotes a circular left shift (rotation) by $s$ bits. There are 16 shift amounts in total, four per round, used cyclically within the round:

Round 1: 7, 12, 17, 22
Round 2: 5, 9, 14, 20
Round 3: 4, 11, 16, 23
Round 4: 6, 10, 15, 21

(3) Nonlinear function

The 4 rounds of MD5 use 4 different non-linear functions (16 steps in each round use the same function): $F, G, H, I$ are defined as follows:

Round 1: $F(x,y,z)=(x\land y)\lor (\lnot x\land z)$
Round 2: $G(x,y,z)=(x\land z)\lor (y\land \lnot z)$
Round 3: $H(x,y,z)=x\oplus y \oplus z$
Round 4: $I(x,y,z)=y\oplus (x\lor \lnot z)$

where $x$, $y$, and $z$ are three 32-bit input variables and the output is a 32-bit variable; $\land, \lor, \lnot, \oplus$ denote the bitwise logical operations AND, OR, NOT, and XOR, respectively.

For example, in the first round, $FF(a, b, c, d, M[j], s, T[i])$ denotes $a = b + ((a + F(b, c, d) + M[j] + T[i]) <<< s)$, with all additions taken modulo $2^{32}$. In general $1\leq i \leq 64$, and round 1 uses $T[1]$ to $T[16]$. The 16 steps are as follows:

FF(A,B,C,D,M[0],7,T[1])	FF(D,A,B,C,M[1],12,T[2])	FF(C,D,A,B,M[2],17,T[3])	FF(B,C,D,A,M[3],22,T[4])
FF(A,B,C,D,M[4],7,T[5])	FF(D,A,B,C,M[5],12,T[6])	FF(C,D,A,B,M[6],17,T[7])	FF(B,C,D,A,M[7],22,T[8])
FF(A,B,C,D,M[8],7,T[9])	FF(D,A,B,C,M[9],12,T[10])	FF(C,D,A,B,M[10],17,T[11])	FF(B,C,D,A,M[11],22,T[12])
FF(A,B,C,D,M[12],7,T[13])	FF(D,A,B,C,M[13],12,T[14])	FF(C,D,A,B,M[14],17,T[15])	FF(B,C,D,A,M[15],22,T[16])

After the last step of round 4, the following is computed: $A \equiv (A+AA)\bmod 2^{32}$, $B \equiv (B+BB)\bmod 2^{32}$, $C \equiv (C+CC)\bmod 2^{32}$, $D \equiv (D+DD)\bmod 2^{32}$, where $AA$, $BB$, $CC$, $DD$ are the register values saved before the current block was processed. The resulting values of $A, B, C, D$ serve as the initial values of the next iteration. After the last message block has been processed, the output $(A \| B \| C \| D)$ is the 128-bit hash value.
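
Putting padding, constants, shift amounts, non-linear functions, and the final addition together gives a compact MD5 sketch. It follows RFC 1321 but is written for readability only; the variable names and the sub-block index formulas for rounds 2 to 4 (taken from the RFC) are not spelled out in the text above, and the result is checked against `hashlib`.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF
S = [7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 + [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4
T = [int(2**32 * abs(math.sin(i))) for i in range(1, 65)]   # T[0] here is T[1] in the text

def rotl(x: int, s: int) -> int:
    return ((x << s) | (x >> (32 - s))) & MASK

def md5(message: bytes) -> bytes:
    # Padding: 0x80, zeros up to 448 mod 512 bits, 64-bit little-endian length.
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    padded += struct.pack("<Q", (8 * len(message)) % 2**64)

    a0, b0, c0, d0 = 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476   # IV
    for off in range(0, len(padded), 64):
        M = struct.unpack("<16I", padded[off:off + 64])   # 16 sub-blocks of 32 bits
        a, b, c, d = a0, b0, c0, d0                       # AA, BB, CC, DD are a0..d0
        for i in range(64):
            if i < 16:
                f, j = (b & c) | (~b & d), i                  # F
            elif i < 32:
                f, j = (b & d) | (c & ~d), (5 * i + 1) % 16   # G
            elif i < 48:
                f, j = b ^ c ^ d, (3 * i + 5) % 16            # H
            else:
                f, j = c ^ (b | ~d), (7 * i) % 16             # I
            new_b = (b + rotl((a + f + M[j] + T[i]) & MASK, S[i])) & MASK
            a, b, c, d = d, new_b, b, c                       # rotate the registers
        a0, b0 = (a0 + a) & MASK, (b0 + b) & MASK
        c0, d0 = (c0 + c) & MASK, (d0 + d) & MASK
    return struct.pack("<4I", a0, b0, c0, d0)

print(md5(b"abc").hex(), md5(b"abc") == hashlib.md5(b"abc").digest())
```

Instead of writing out the 64 calls `FF(...)`, `GG(...)`, and so on, the sketch rotates the roles of the four registers after every step, which is equivalent.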

3. SHA1 algorithm

In 1993, the National Institute of Standards and Technology (NIST) published the Secure Hash Algorithm (SHA) standard, whose algorithm is now referred to as SHA-0. A revised version, called SHA-1, was published on April 17, 1995; it is the hash algorithm required by the Digital Signature Standard.

In 2002, NIST released FIPS 180-2 on the basis of FIPS 180-1. In addition to SHA1, three new hash algorithm standards, SHA256, SHA384 and SHA512, were added to this standard. Their message digest lengths are 256 bit, 384 bit, and 512 bit, respectively, in order to match the use of AES.

Comparison of the four hash algorithms (unit: bit, except for the step count):

Attribute	SHA1	SHA256	SHA384	SHA512
Message digest length	160	256	384	512
Maximum message length	$< 2^{64}$	$< 2^{64}$	$< 2^{128}$	$< 2^{128}$
Block length	512	512	1024	1024
Word length	32	32	64	64
Step count	80	64	80	80

3.1 Algorithm structure

The input of the SHA1 algorithm is a message of length less than $2^{64}$ bits, processed in 512-bit blocks; the output is a $160$-bit message digest, so SHA1 is more resistant to exhaustive (brute-force) search than the 128-bit MD5.

The design of SHA-1 is based on MD4. It uses five 32-bit registers. Message blocking and padding follow the same scheme as MD5, except that SHA-1 uses big-endian byte order. The main loop also has 4 rounds, but each round performs 20 steps. The non-linear operations, shifts, and additions are similar to those of MD5, but the non-linear functions, additive constants, and circular left shifts are designed differently.

(1) Additional padding bits

(2) Packet processing (iterative compression)

SHA1 processes the message in 512-bit blocks. The core of the algorithm is a module of 4 rounds of 20 steps each. All rounds use the same step function structure, but the step functions of different rounds use different non-linear functions (Ch, Parity, Maj, Parity).

The inputs differ from step to step: in addition to the registers $A, B, C, D$, and $E$, each step takes an additive constant $K_t$ and a word $W[t]$ derived from the message block, where $t$ ($0 \leq t \leq 79$) is the step number.

Each iteration takes the currently processed $512$-bit block $Y_q$ and the $160$-bit buffer value $A, B, C, D, E$ as input and updates the buffer contents. The output of the last step is added modulo $2^{32}$, word by word, to the input $CV_q$ to produce $CV_{q+1}$. After all $512$-bit blocks have been processed, the output is the $160$-bit hash value.

3.2 Compression function

The step function of SHA1 has the following form in every step of the compression function, where $t$ ($0 \leq t \leq 79$) is the step number.

$A' = (ROTL^5(A)+f_t(B,C,D)+E+W_t+K_t)\bmod 2^{32}$
$B' = A$
$C' = ROTL^{30}(B)$
$D' = C$
$E' = D$

The primed values are the register contents after the step; all five assignments use the register values from before the step.
(1) Constant $K_t$

The 4 values of $K_t$ are obtained by taking the square roots of $2$, $3$, $5$, and $10$, multiplying each by $2^{30} = 1073741824$, and writing the integer part of the result in hexadecimal.

Steps $t$	Value of $K_t$
$0\leq t \leq 19$	0x5A827999
$20\leq t \leq 39$	0x6ED9EBA1
$40\leq t \leq 59$	0x8F1BBCDC
$60\leq t \leq 79$	0xCA62C1D6

Taking $K_t$ ($60\leq t \leq 79$) as an example: $\lfloor \sqrt{10}\times 2^{30} \rfloor = 3395469782$

Converting $3395469782$ to hexadecimal gives 0xCA62C1D6.
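
The four constants can be recomputed exactly with integer arithmetic. The sketch uses the identity $\lfloor \sqrt{x} \times 2^{30} \rfloor = \lfloor \sqrt{x \times 2^{60}} \rfloor$ so that no floating-point rounding is involved; the helper name `sha1_constant` is an assumption for this illustration, and `math.isqrt` requires Python 3.8 or later.

```python
import math

def sha1_constant(x: int) -> int:
    """Return floor(sqrt(x) * 2**30), computed exactly with integers."""
    return math.isqrt(x * 2**60)

K = [sha1_constant(x) for x in (2, 3, 5, 10)]
print([f"0x{k:08X}" for k in K])
```

The last entry is the value 3395469782 from the worked example.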

(2) Circular left shift

$ROTL^n(x) = (x << n) \lor (x >> (32-n))$ denotes the circular left shift (rotation) of the 32-bit word $x$ by $n$ bits.

(3) Generate word $W_t$

The 32-bit words $W_t$ are derived from the 512-bit message block. In the first 16 steps, $W_t$ equals the corresponding word of the message block:

$W_t=M^{i}_t, 0\leq t \leq 15$

In the remaining 64 steps, $W_t$ is obtained by XORing four of the earlier values and rotating the result left by one bit:

$W_t=ROTL^1(W_{t-3}\oplus W_{t-8} \oplus W_{t-14} \oplus W_{t-16}), \quad 16\leq t \leq 79$
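
The word expansion translates directly into code. This sketch assumes a single 64-byte block given as `bytes` and big-endian 32-bit words, as specified for SHA-1; the function names are chosen for this illustration.

```python
import struct

def rotl(x: int, n: int) -> int:
    """Rotate the 32-bit word x left by n bits."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def message_schedule(block: bytes) -> list:
    """Expand one 512-bit block into the 80 words W_0 .. W_79."""
    assert len(block) == 64
    w = list(struct.unpack(">16I", block))          # W_0 .. W_15: the block itself
    for t in range(16, 80):                         # W_16 .. W_79: XOR and rotate
        w.append(rotl(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    return w

w = message_schedule(b"\x80" + bytes(63))
print(len(w), hex(w[0]), hex(w[16]))
```

In the usage line the only non-zero input word is $W_0$, and it already influences $W_{16}$ through the $W_{t-16}$ term.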

These operations add redundancy and interdependence to the words being compressed, so for a given message block it is very difficult to find a different block that yields the same compression result.
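
The pieces of section 3 can be combined into a compact SHA-1 sketch. It follows the published specification (FIPS 180) but is written for readability, not speed; the initial register values come from the standard rather than from the text above, and the result is checked against `hashlib`.

```python
import hashlib
import struct

MASK = 0xFFFFFFFF

def rotl(x: int, n: int) -> int:
    return ((x << n) | (x >> (32 - n))) & MASK

def sha1(message: bytes) -> bytes:
    # Padding: 0x80, zeros up to 448 mod 512 bits, 64-bit big-endian length.
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    padded += struct.pack(">Q", (8 * len(message)) % 2**64)

    h = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]   # IV
    for off in range(0, len(padded), 64):
        w = list(struct.unpack(">16I", padded[off:off + 64]))
        for t in range(16, 80):
            w.append(rotl(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
        a, b, c, d, e = h
        for t in range(80):
            if t < 20:
                f, k = (b & c) | (~b & d), 0x5A827999            # Ch
            elif t < 40:
                f, k = b ^ c ^ d, 0x6ED9EBA1                     # Parity
            elif t < 60:
                f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC   # Maj
            else:
                f, k = b ^ c ^ d, 0xCA62C1D6                     # Parity
            # All five registers are updated from the values before the step.
            a, b, c, d, e = (rotl(a, 5) + f + e + w[t] + k) & MASK, a, rotl(b, 30), c, d
        h = [(x + y) & MASK for x, y in zip(h, (a, b, c, d, e))]   # CV_{q+1}
    return struct.pack(">5I", *h)

print(sha1(b"abc").hex(), sha1(b"abc") == hashlib.sha1(b"abc").digest())
```

The tuple assignment evaluates its right-hand side first, which is what makes the simultaneous register update correct.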


4. Hash function attack

The security of a hash function rests mainly on its one-wayness and on how effectively it avoids collisions. Because hashing is a contracting transformation, when the message is much longer than the hash value, the hash value alone does not carry enough information to recover the message. Recovering the message from the hash value alone is therefore harder than a ciphertext-only attack on a block cipher with the same block length.

The main goal of an attack on a hash function is not to recover the original message, but to forge or deceive with an illegitimate message that has the same hash value, which requires the hash function to resist collision attacks.

For a hash function with a $128$-bit output and a given message $M$, the probability that an arbitrary message $M'$ satisfies $H(M) = H(M')$ is $2^{-128}$.

Then the probability that $H(M) \neq H(M')$ is $1-2^{-128}$. If $k$ arbitrary messages are tried, the probability that none of them satisfies $H(M) = H(M')$ is $(1-2^{-128})^k$, and the probability that at least one $M'$ satisfies $H(M) = H(M')$ is $1-(1-2^{-128})^k$.

By the binomial theorem, $1-(1-2^{-128})^k \approx k \cdot 2^{-128}$ to first order, so the attacker must try on the order of $2^{127}$ messages before the probability of a successful forgery reaches about $0.5$. Existing computing power cannot exhaustively search a space of size $2^{127}$, so a $128$-bit hash function seems to be safe. In fact, the attacker can find collisions by other methods, such as the birthday attack.

Against currently known generic attack methods, hash functions with an output length above $160$ bits are generally regarded as safe.

4.1 The birthday paradox

The birthday paradox problem: assume that every birthday is equally likely and that a year has 365 days. What is the minimum value of $k$ for which the probability that at least two of $k$ people share a birthday is greater than $1/2$?

Treat each person's birthday as a random variable uniformly distributed on $[1, 365]$. The probability that the birthdays of $k$ people are all different is $p_k=\frac{P_{365}^{k}}{365^k}=\frac{365\times 364\times \cdots \times (365-k+1)}{365^k}$. When $k = 23$, $p_k\approx 0.4927$, so the probability that at least two of the $23$ people share a birthday is $1-p_k\approx 0.5073$.

When $k = 100$, $1-p_k \approx 0.9999997$, so among $100$ people a shared birthday is almost certain. This result contradicts most people's intuition, hence the name birthday paradox.
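
The numbers quoted above are easy to reproduce. The sketch evaluates the product formula for $p_k$ directly; the function name `no_repeat_probability` is an assumption for this illustration.

```python
def no_repeat_probability(k: int, days: int = 365) -> float:
    """p_k: probability that k uniformly random birthdays are all different."""
    p = 1.0
    for i in range(k):
        p *= (days - i) / days
    return p

for k in (23, 100):
    print(k, round(1 - no_repeat_probability(k), 7))

# Smallest k for which a shared birthday is more likely than not.
smallest = next(k for k in range(1, 366) if 1 - no_repeat_probability(k) > 0.5)
print(smallest)
```

The search in the last lines answers the question posed in the problem statement: the threshold is reached at $k = 23$.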

In fact, if one particular person is picked from the $k$ people, the probability that another given person shares that person's birthday is only $\frac{1}{365}$. But if we only ask for any two people with the same birthday (that is, without fixing a specific date), the probability is much greater.

Searching for collisions of a hash function with a $128$-bit output is similar. The probability of finding another message with the same hash value as one particular message is very small, but it is much easier to find two messages with the same hash value across two sets of messages (that is, without fixing the hash value in advance).

4.2 Set intersection problem

Let $X=\{x_1,x_2,\ldots,x_k\}$ and $Y=\{y_1,y_2,\ldots,y_k\}$ be two $k$-element sets, where $x_i, y_j$ ($1 \leq i,j \leq k$) are random variables uniformly distributed on $\{1, 2, \ldots, n\}$.

For a given $x_i$, if $y_j=x_i$, we say that $y_j$ matches $x_i$. For fixed $i, j$, the probability that $y_j$ matches $x_i$ is $\frac{1}{n}$.

Then the probability that $y_j \neq x_i$ is $1-\frac{1}{n}$.

The probability that none of the $k$ random variables in $Y$ equals $x_i$ is $(1-\frac{1}{n})^k$.

If the $k$ random variables in $X$ are pairwise distinct, the probability that there is no match between $X$ and $Y$ is $(1-\frac{1}{n})^{k^2}$.

Therefore the probability of at least one match between $X$ and $Y$ is $p=1-(1-\frac{1}{n})^{k^2}$.

Since $1-x \le e^{-x}$ for all $x \ge 0$, we have $p=1-(1-\frac{1}{n})^{k^2} > 1-(e^{-\frac{1}{n}})^{k^2}$.

To obtain $p > 0.5$, set $1-(e^{-\frac{1}{n}})^{k^2}=0.5$, which gives $k=\sqrt{n\ln 2} \approx 0.83 \sqrt{n} \approx \sqrt{n}$.
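
The bound can be checked numerically. The sketch evaluates $p = 1-(1-\frac{1}{n})^{k^2}$ in log space (plain floating-point powers lose precision for large $n$) and picks $k = \lceil \sqrt{n \ln 2} \rceil$; the function names and the sample values of $n$ are assumptions for this illustration.

```python
import math

def match_probability(n: int, k: int) -> float:
    """p = 1 - (1 - 1/n)^(k^2), computed as -expm1(k^2 * log1p(-1/n))."""
    return -math.expm1(k * k * math.log1p(-1.0 / n))

def k_for_half(n: int) -> int:
    """Smallest integer k with k >= sqrt(n * ln 2)."""
    return math.ceil(math.sqrt(n * math.log(2)))

for n in (365, 2**32):
    k = k_for_half(n)
    print(n, k, round(k / math.sqrt(n), 4), match_probability(n, k))
```

For both sample values of $n$ the ratio $k/\sqrt{n}$ is close to $0.83$ and the match probability is just above $0.5$.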

4.3 Birthday attack

Suppose the hash function $H$ has an output length of $m$ bits, so there are $2^m$ possible outputs. Let $X$ be the set of outputs of $H$ on $k$ random inputs, and $Y$ the set of outputs on another $k$ random inputs.

By the set intersection problem, when $k=2^{m/2}$, the probability that $X$ and $Y$ contain at least one matching pair (that is, the hash function produces a collision) is greater than $0.5$. Therefore $2^{m/2}$ determines the collision resistance strength of a hash function $H$ with output length $m$.

Birthday attacks are also known as square root attacks. The principle is as follows:

The attacker first generates a legitimate message and, by adding spaces or otherwise changing the wording or format while keeping the meaning unchanged, produces $2^{m/2}$ variants of it, that is, a group of legitimate messages.
The attacker then produces a group of variants of the illegitimate message whose signature is to be forged.
The attacker computes the hash values of both groups of messages.
The attacker looks for a pair of messages, one from each group, with the same hash value. If none is found, the number of variants in each group is increased until a pair is found.

According to the birthday paradox, the probability of success is very high, so that the attacker can find an illegal message with the same hash value as the legitimate message, that is, find a hash collision.
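
The attack can be tried out on a deliberately weakened hash. In this sketch the $m = 24$ bit "hash" is simply the top 24 bits of SHA-1, and message variants are produced by appending 32 trailing spaces or tabs; both choices, like the two example messages, are assumptions made so that the search finishes within a few thousand attempts.

```python
import hashlib

M_BITS = 24   # tiny output length m, so about 2^(m/2) = 4096 variants per group suffice

def weak_hash(message: bytes) -> int:
    """Toy m-bit hash: the top M_BITS bits of SHA-1."""
    return int.from_bytes(hashlib.sha1(message).digest(), "big") >> (160 - M_BITS)

def variant(base: bytes, i: int) -> bytes:
    """Hypothetical meaning-preserving change: 32 trailing spaces/tabs encoding i."""
    return base + bytes(0x20 if (i >> j) & 1 else 0x09 for j in range(32))

def birthday_attack(legit: bytes, fraud: bytes):
    seen_legit, seen_fraud = {}, {}            # hash value -> message variant
    i = 0
    while True:                                # grow both groups until they intersect
        a, b = variant(legit, i), variant(fraud, i)
        ha, hb = weak_hash(a), weak_hash(b)
        seen_legit[ha] = a
        seen_fraud[hb] = b
        if ha in seen_fraud:
            return a, seen_fraud[ha], i + 1
        if hb in seen_legit:
            return seen_legit[hb], b, i + 1
        i += 1

good, bad, tries = birthday_attack(b"Pay Bob 10 dollars", b"Pay Eve 10000 dollars")
print(tries, hex(weak_hash(good)), hex(weak_hash(bad)))
```

The number of variants needed per group is on the order of $2^{12}$, far fewer than the $2^{24}$ possible hash values.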

At present, the most effective attack on hash functions is the modular differential method, also known as the "bit tracking method", which was first proposed by Wang Xiaoyun et al. in their analysis of the MD4 family of hash functions. The modular differential is a new kind of difference that combines the integer modular difference with the XOR difference; combining the two expresses more information than either difference alone.

5. Message authentication

On the one hand, information security must realize the confidential transmission of messages, so that it can resist passive attacks, such as eavesdropping attacks; on the other hand, it must also prevent attackers from actively attacking the system, such as forging or tampering with messages.

Authentication is the main defense against active attacks. It is divided into entity authentication and message authentication:

Entity authentication: verifying the identity of an entity
Message authentication: verifying the authenticity of a message
- Verifying the authenticity of the source of the message, generally known as source authentication
- Verifying the integrity of the message, that is, verifying that it has not been tampered with or forged during transmission and storage

5.1 Message Authentication Code

Message authentication is based on generating a message authentication code (MAC), which is used to check whether the message has been maliciously modified.

The authentication code is different from the error detection code in communication:

Error detection codes are special codes used to detect errors in messages due to communication defects
Authentication codes are used to prevent attackers from maliciously tampering or forging messages

A message authentication code is a short, fixed-length data block generated by an authentication function from the message and a key shared by both parties; the data block is appended to the message.

5.2 HMAC

Cipher block chaining mode (CBC) using symmetric block cipher systems such as DES and AES has always been the most common method for constructing MAC, such as CBC-MAC defined in FIPS PUB 113.

Because hash functions such as MD5 and SHA-1 run faster in software than symmetric block ciphers such as DES, many message authentication algorithms based on hash functions have been proposed. Among them, HMAC (RFC 2104) has been published as the FIPS 198 standard and is used in SSL for message authentication.

The HMAC structure is as follows:


where

$K$ is the key. It may have any length; the recommended minimum length is $n$ bits, because a key shorter than $n$ bits significantly reduces the security of the function, while a key longer than $n$ bits does not significantly increase it
$M$ is the message input to HMAC
$L$ is the number of blocks in the message $M$
$Y_i$ is the $i$-th block of the message $M$
$b$ is the number of bits in each block
$n$ is the length of the hash value produced by the embedded hash function
$IV$ is the initial link variable
ipad is the byte 0x36 repeated $b/8$ times
opad is the byte 0x5C repeated $b/8$ times

HMAC can be described as: $HMAC(K,M)=H[(K^+ \oplus opad) \,\|\, H[(K^+ \oplus ipad) \,\|\, M]]$

The operation process is as follows:

(1) Pad the key $K$ with zeros to produce a $b$-bit string $K^+$; RFC 2104 appends the zero bytes after the key (for example, if $K$ is $160$ bits long and $b = 512$, then $44$ zero bytes 0x00 are appended).
(2) XOR $K^+$ bit by bit with ipad to produce the $b$-bit block $S_1$
(3) Append the message $M$ to $S_1$
(4) Apply the hash function $H$ to the result of step (3) to produce a message digest
(5) XOR $K^+$ bit by bit with opad to produce the $b$-bit block $S_0$
(6) Append the message digest produced in step (4) to $S_0$
(7) Apply the hash function $H$ to the result of step (6) to produce a message digest, which is output as the final result
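
The seven steps can be followed line by line in code. This is a minimal sketch with SHA-1 as the embedded hash function ($b = 512$, $n = 160$); the function name `hmac_sha1` is an assumption for this illustration. Following RFC 2104, the zero bytes are appended after the key, and keys longer than one block are hashed first, a rule the steps above do not mention. The result is compared with Python's standard `hmac` module.

```python
import hashlib
import hmac

def hmac_sha1(key: bytes, message: bytes) -> bytes:
    b = 64                                          # block length in bytes (512 bits)
    if len(key) > b:                                # RFC 2104: hash over-long keys first
        key = hashlib.sha1(key).digest()
    k_plus = key + b"\x00" * (b - len(key))         # step 1: K+ (zero bytes appended)
    s1 = bytes(x ^ 0x36 for x in k_plus)            # step 2: S1 = K+ XOR ipad
    inner = hashlib.sha1(s1 + message).digest()     # steps 3-4: H(S1 || M)
    s0 = bytes(x ^ 0x5C for x in k_plus)            # step 5: S0 = K+ XOR opad
    return hashlib.sha1(s0 + inner).digest()        # steps 6-7: H(S0 || inner digest)

tag = hmac_sha1(b"key", b"The quick brown fox jumps over the lazy dog")
print(tag.hex())
print(hmac.compare_digest(tag, hmac.new(b"key", b"The quick brown fox jumps over the lazy dog", hashlib.sha1).digest()))
```

Note that the hash function is applied twice: once to the inner string and once to the outer string that contains the inner digest.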

A more efficient implementation of HMAC precomputes the two values $f(IV,(K^+ \oplus ipad))$ and $f(IV,(K^+ \oplus opad))$, where $f$ is the compression function of the hash function: its inputs are an $n$-bit link variable and a $b$-bit block, and its output is an $n$-bit link variable. These values need to be computed only at initialization or when the key changes; they then replace the initial value $IV$ of the hash function. This implementation is particularly worthwhile when the messages input to HMAC are mostly short.
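
`hashlib` does not expose the compression function $f$ directly, but the same idea can be approximated by absorbing the first block once and copying the resulting hash state for every message. The class below is a sketch under that assumption, with SHA-1 as the embedded hash function; the class name `PrecomputedHmacSha1` is hypothetical.

```python
import hashlib
import hmac

class PrecomputedHmacSha1:
    """HMAC-SHA1 that processes the two key-dependent blocks only once per key."""

    def __init__(self, key: bytes):
        b = 64                                        # block length in bytes
        if len(key) > b:                              # RFC 2104 rule for long keys
            key = hashlib.sha1(key).digest()
        k_plus = key.ljust(b, b"\x00")
        # State after absorbing K+ XOR ipad and K+ XOR opad; these play the
        # role of the two precomputed link variables that replace IV.
        self._inner = hashlib.sha1(bytes(x ^ 0x36 for x in k_plus))
        self._outer = hashlib.sha1(bytes(x ^ 0x5C for x in k_plus))

    def mac(self, message: bytes) -> bytes:
        inner = self._inner.copy()                    # reuse the precomputed state
        inner.update(message)
        outer = self._outer.copy()
        outer.update(inner.digest())
        return outer.digest()

signer = PrecomputedHmacSha1(b"shared secret key")
for msg in (b"short 1", b"short 2"):
    print(signer.mac(msg).hex())
```

For short messages this saves two of the roughly four block computations per MAC, which is where the benefit described above comes from.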

