## First, what is a hash collision?

A hash maps each input to a fixed-length value (the "hash value") that is ideally unique to that input. Hashing is one of the most common operations in software.

If two different inputs produce the same hash value, a "hash collision" has occurred.

For example, many network services use a hash function to generate a token that identifies a user's identity and permissions.

`AFGG2piXh0ht6dmXUxqv4nA1PU120r0yMAQhuc13i8`

The string above is a hash value. If two different users receive the same token, a hash collision has occurred. The server will treat the two users as the same person, which means user B can read and change user A's information. That is clearly a serious security risk.

One hacking technique is to deliberately manufacture a hash collision, then use it to break into the system and steal information.

## Second, how to prevent hash collisions?

The most effective way to prevent hash collisions is to enlarge the value space of the hash.

With a 16-bit hash value, the chance that two given inputs collide is 1 in 65,536. Put another way, with 65,537 users a collision is certain. If the hash value is extended to 32 bits, that chance drops to 1 in 4,294,967,296.
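The claim that 65,537 inputs must collide in a 16-bit space can be checked with a small experiment. The sketch below is only an illustration: `hash16` is a made-up toy hash (FNV-1a folded down to 16 bits), and the `user-N` names are hypothetical inputs, not anything a real service would use.

```javascript
// Toy 16-bit hash for illustration only (FNV-1a folded to 16 bits).
// It is an assumption of this sketch and is not suitable for security.
const hash16 = (text) => {
  let h = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // multiply by the FNV prime, keep 32 bits
  }
  return (h ^ (h >>> 16)) & 0xffff; // fold 32 bits down to 16
};

// Hash distinct user names until two of them share a hash value.
const firstCollision = () => {
  const seen = new Map();
  for (let i = 0; i <= 65536; i++) { // 65,537 distinct inputs
    const name = `user-${i}`;
    const value = hash16(name);
    if (seen.has(value)) {
      return { first: seen.get(value), second: name, value };
    }
    seen.set(value, name);
  }
  return null; // unreachable: 65,537 inputs cannot fit in 65,536 values
};

console.log(firstCollision());
```

The loop is guaranteed to find a collision within 65,537 inputs. With a hash that spreads values evenly, it typically finds one far sooner.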

A longer hash value takes more storage space and more computation, which affects performance and cost. Developers have to make a trade-off and find a balance between security and cost.

The rest of this article discusses how to find the shortest hash length that still meets the security requirements.

## Third, the birthday attack

The probability of a hash collision depends on two factors (assuming a reliable hash function in which every value is equally likely to be generated).

- The size of the value space (that is, the length of the hash value)
- The number of hash values computed over the whole life cycle

This problem has a long-standing mathematical prototype called the "birthday problem": how many students can a class have while still guaranteeing that every student has a different birthday?

The answer is surprising. If the probability that at least two classmates share a birthday is to stay at about 5%, the class can have only seven people. In fact, a class of 23 has a 50% probability that at least two classmates share a birthday; a class of 50 has a 97% probability, and a class of 70 has a 99.9% probability (see the calculation below).

This means that if the hash value space has a size of 365, computing just 23 hash values already gives a 50% chance of a collision. In other words, hash collisions are far more likely than one would expect. There is in fact an approximate formula for this.

$$
\sqrt{\frac{\pi}{2} N}
$$

This formula estimates the number of hash computations needed for the collision probability to reach roughly 50%, where N is the size of the hash value space. For the birthday problem N is 365, which gives 23.9. The formula shows that the number of computations needed to produce a hash collision is on the order of the square root of the size of the value space.
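The square-root estimate can be evaluated directly. The sketch below assumes the estimate has the form sqrt(π/2 · N), which reproduces the 23.9 quoted for N = 365; the function name `collisionThreshold` is illustrative.

```javascript
// Rough number of hashes computed before a collision becomes likely.
// Assumption: the estimate is sqrt(pi/2 * N), with N the size of the value space.
const collisionThreshold = (spaceSize) => Math.sqrt((Math.PI / 2) * spaceSize);

console.log(collisionThreshold(365).toFixed(1));      // "23.9" (birthday problem)
console.log(Math.round(collisionThreshold(2 ** 16))); // 321 (16-bit hash)
console.log(Math.round(collisionThreshold(2 ** 32))); // 82137 (32-bit hash)
```

Under this estimate, a 32-bit hash has more than four billion possible values, yet collisions become likely after only about eighty thousand computations.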

An attack that exploits a hash space that is not large enough in order to manufacture collisions is called a birthday attack.

## Fourth, the mathematical derivation

This section presents the mathematical derivation of the birthday attack.

To find the probability that at least two people share a birthday, first calculate the probability that everyone's birthday is different, then subtract that probability from 1.

Imagine that everyone lines up and enters a room one at a time. When the first person enters, the probability that their birthday differs from that of everyone already in the room (0 people) is `365/365`; when the second person enters, the probability of a unique birthday is `364/365`; for the third person it is `363/365`, and so on.

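So the probability that everyone's birthday is different is given by the following formula.

$$
\bar{p}(n) = \frac{365}{365} \times \frac{364}{365} \times \frac{363}{365} \times \cdots \times \frac{365 - n + 1}{365}
$$

In this formula, n is the number of people who have entered the room. As it shows, the more people enter the room, the smaller the probability that all birthdays are different.

The formula can be rewritten in the following form.

$$
\bar{p}(n) = \frac{365!}{365^{n} \, (365 - n)!}
$$

The probability that at least two people share a birthday is then 1 minus that expression.

$$
p(n) = 1 - \frac{365!}{365^{n} \, (365 - n)!}
$$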
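The room-by-room argument translates directly into a loop. As a minimal sketch, the function below multiplies the per-person probabilities and subtracts the product from 1; the name `exactCollisionProbability` and the generalization from 365 to an arbitrary space size `d` are assumptions made for illustration.

```javascript
// Exact probability that at least two of n values collide in a space of
// d equally likely values (d = 365 for birthdays).
const exactCollisionProbability = (d, n) => {
  if (n > d) return 1; // more values than slots: a collision is certain
  let allDistinct = 1;
  for (let k = 0; k < n; k++) {
    allDistinct *= (d - k) / d; // the (k + 1)-th person avoids k taken birthdays
  }
  return 1 - allDistinct;
};

console.log(exactCollisionProbability(365, 23).toFixed(3)); // "0.507"
console.log(exactCollisionProbability(365, 50).toFixed(3)); // "0.970"
console.log(exactCollisionProbability(365, 70).toFixed(3)); // "0.999"
```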

## Fifth, the hash collision formula

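The formula above can be approximated further to obtain a general form that is easy to compute. The starting point is the probability that all n birthdays are different, written as a product.

$$
\bar{p}(n) = 1 \times \left(1 - \frac{1}{365}\right) \times \left(1 - \frac{2}{365}\right) \times \cdots \times \left(1 - \frac{n - 1}{365}\right)
$$

According to Taylor's formula, the exponential function $e^{x}$ can be expanded as a polynomial.

$$
e^{x} = 1 + x + \frac{x^{2}}{2!} + \frac{x^{3}}{3!} + \cdots
$$

If x is very small, the expansion is approximately equal to the following form.

$$
e^{x} \approx 1 + x
$$

Now substitute $x = -\frac{1}{365}$ from the birthday problem.

$$
e^{-\frac{1}{365}} \approx 1 - \frac{1}{365}
$$

Applying the same approximation to every factor, and using $1 + 2 + \cdots + (n - 1) = \frac{n(n - 1)}{2}$, the probability formula for the birthday problem becomes the following.

$$
p(n) = 1 - \bar{p}(n) \approx 1 - e^{-\frac{n(n - 1)}{2 \times 365}}
$$

Letting d denote the size of the value space (365 in the birthday problem) gives a general formula.

$$
p(n, d) \approx 1 - e^{-\frac{n(n - 1)}{2d}}
$$

This is the formula for the probability of a hash collision.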
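How good the first-order approximation of the exponential is can be checked numerically. The sketch below compares `Math.exp(x)` with `1 + x` for a few small negative values of x; the helper name `approximationError` and the chosen sample values are illustrative.

```javascript
// Absolute error of the first-order approximation e^x ≈ 1 + x.
const approximationError = (x) => Math.abs(Math.exp(x) - (1 + x));

// x = -k/365 is the exponent that replaces the factor (1 - k/365).
for (const k of [1, 10, 100]) {
  const x = -k / 365;
  console.log(`k = ${k}, x = ${x.toFixed(4)}, error = ${approximationError(x)}`);
}
```

The error grows as x moves away from zero, so the approximation works best when the number of computed values is small relative to the size of the value space.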

## Sixth, the application

The formula above can be written as a function.

```javascript
// d: size of the value space; n: number of hash values computed
const calculate = (d, n) => {
  const exponent = (-n * (n - 1)) / (2 * d);
  return 1 - Math.E ** exponent;
};

calculate(365, 23); // 0.5000017521827107
calculate(365, 50); // 0.9651312540863107
calculate(365, 70); // 0.9986618113807388
```

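Generally, a hash value is made up of uppercase letters, lowercase letters, and Arabic digits, 62 characters in total (10 + 26 + 26). If the hash value is only three characters long (for example `abc`), the value space is `62 ^ 3 = 238,328`, and the probability of a hash collision after 10,000 computations is effectively 100%.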

`calculate(62 ** 3, 10000) // 1`

If the length of the hash value is increased to 5 characters (for example `abcde`), the probability of a collision drops to 5.3%.

`calculate(62 ** 5, 10000) // 0.05310946204730993`

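Now consider a company whose API receives 1 million requests per second, each of which generates a hash value, and assume the API will be in use for 10 years. In total, about 300 trillion hashes will be computed. The acceptable collision probability is one in 100 billion (that is, about one hash collision per day). What is the minimum number of characters the hash string needs?

Working backwards from the formula above shows that the shortest hash length is about 22 characters (for example `BwQ1W6soXkA1PU120r0yMA`); the calculation is omitted here.

A 22-character hash value keeps the probability of a collision across 300 trillion computations to roughly one in 100 billion. The commonly used SHA256 hash function produces a 64-character hash value in which each character ranges over 0-9 and a-f, so its probability of a collision is far lower still.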
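The omitted back-calculation can be sketched as a search over hash lengths. This is a rough illustration rather than the original author's calculation: the names `collisionProbability` and `minHashLength` are hypothetical, the workload is rounded to 3e14 hashes, and `Math.expm1` replaces `1 - Math.E ** exponent` so that extremely small probabilities are not rounded to zero.

```javascript
// Collision probability p = 1 - e^(-n(n-1)/(2d)), computed with Math.expm1
// so that very small probabilities keep their precision.
const collisionProbability = (d, n) => -Math.expm1((-n * (n - 1)) / (2 * d));

// Smallest length whose value space (alphabetSize ** length) keeps the
// collision probability at or below maxProbability for n computations.
const minHashLength = (alphabetSize, n, maxProbability) => {
  let length = 1;
  while (collisionProbability(alphabetSize ** length, n) > maxProbability) {
    length += 1;
  }
  return length;
};

const n = 3e14; // assumed workload: about 300 trillion hashes

console.log(collisionProbability(62 ** 22, n)); // 22 characters from a 62-character alphabet
console.log(minHashLength(62, n, 1e-11));       // strict one-in-100-billion threshold
console.log(collisionProbability(16 ** 64, n)); // 64 hex characters, as with SHA256
```

Under these assumptions, 22 characters give a collision probability of about `1.7e-11`, the same order of magnitude as the one-in-100-billion target. Holding the probability strictly at or below `1e-11` takes 23 characters in this sketch.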

## Seventh, reference links

- How Long Should I Make My API Key?, by Sam Corcos
- Birthday problem, by Wikipedia
- Birthday attack, by Wikipedia

(End)