## 1. Introduction

A few days ago a more experienced colleague and I discussed a few questions. One of them, about ID generation, led to the mathematics of hash collisions and to a related and interesting topic, the birthday paradox.

At the time I did not understand either topic well, so I spent the weekend studying them carefully and found them interesting enough to write up. Today's topic is hash collisions and the birthday paradox.

In this article you will learn the following:

- Hash compression mapping and collisions
- The birthday paradox
- Collision analysis of CRC32

## 2. Hash compression mapping and collisions

Hashing is mathematical in nature. Put simply, a hash function takes input of arbitrary length and, through a publicly known computation, produces a fixed-length string. The process is one-way: in general, the original input cannot be recovered from the hash string.

This sounds remarkable and useful, like a **universal capsule** that packs many different things into a small space: whether a file is 1 GB or 10 MB, hashing maps it to a string of the same fixed length, so the degree of compression is very high.

But this raises another problem: the set of possible inputs is unbounded while the hash string has a fixed length, so different inputs will inevitably be mapped to the same hash string. Mapping an infinite set onto a finite set causes what is called a hash conflict, or hash collision.

For example:

Suppose the hash is a fixed-length 10-bit binary string. It can then take at most 2 ^ 10 = 1024 distinct values. Leaving the infinite input set aside, with just 2000 inputs there must be at least 2000 - 1024 = 976 collisions, even if the hash function is perfectly uniform.
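The pigeonhole argument above can be checked with a small experiment. The sketch below is only an illustration: it assumes SHA-256 truncated to its top 10 bits as a stand-in for a uniform 10-bit hash, and the helper names `hash10` and `count_collisions` and the `input-i` keys are made up for this example.

```python
import hashlib

BITS = 10
SPACE = 2 ** BITS  # 1024 possible hash values


def hash10(data: bytes) -> int:
    """Illustrative 10-bit hash: the top 10 bits of a SHA-256 digest."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:2], "big") >> (16 - BITS)


def count_collisions(n: int) -> int:
    """Count inputs whose hash value was already taken by an earlier input."""
    seen = set()
    collisions = 0
    for i in range(n):
        h = hash10(f"input-{i}".encode())
        if h in seen:
            collisions += 1
        else:
            seen.add(h)
    return collisions


if __name__ == "__main__":
    n = 2000
    # The pigeonhole principle guarantees at least n - SPACE collisions.
    print(count_collisions(n), ">=", n - SPACE)
```

Whatever hash is plugged in, the count cannot fall below 976 for 2000 inputs, because at most 1024 inputs can receive a value nobody has used yet.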

At this point someone will surely say: then don't use hashing. But the compression and irreversibility that hashing offers are very attractive, so giving it up is not a good idea.

We know how fast powers grow, so consider extending the number of bits: 32 bits -> 64 bits -> 128 bits -> 256 bits. 2 ^ 32 is about 4.29 billion, 2 ^ 64 is about 1.8 × 10 ^ 19, and 2 ^ 128 and 2 ^ 256 are far larger still. This looks like a way out: every added bit doubles the output space, which effectively reduces the collision probability. This is the current mainstream approach.

**Summary:**

Many hash functions rest on quite complex mathematics, and a uniform distribution of outputs is one of their important properties. My own knowledge is limited, so this article does not go deep into the mathematical principles behind hash functions.

**Compressing an unbounded input set into a finite set makes collisions inevitable.** An effective way to reduce collisions is to enlarge the output space. Understanding this much is enough for the rest of the article.

## 3. The magical birthday paradox

Having discussed the inevitability of hash collisions, a natural question follows: how many bits should a hash function have in order to avoid collisions?

My first thought on this question was: a 32-bit hash function has a space of about 4.2 billion values, so the chance of a hash collision should be about one in 4.2 billion, and it looks like we can sit back and relax. But first, consider an interesting problem:


The birthday paradox states that among 23 or more people, the probability that at least two share a birthday is greater than 50%. For example, in a primary school class of 30, the probability that two pupils share a birthday is about 70%. For a large class of 60, the probability is greater than 99%. It is not a "paradox" in the sense of a logical contradiction; it is called one because this mathematical fact is so counter-intuitive.

It does feel counter-intuitive: with 365 days in a year, how can a shared birthday be so likely? Yet it is a fact.

**Let's actually work it out:**

Suppose there are n people in a class. Ignoring special factors such as leap years and twins, and assuming birthdays are uniformly distributed over the 365 days of the year, what is the probability that at least two people share a birthday?

**Calculation and analysis process:**

We work with the complementary event. Let p(n) denote the probability that all n people have different birthdays. For the boundary case n > 365, the pigeonhole principle gives p(n) = 0. For n ≤ 365, the probability is:

$$p(n) = \frac{365}{365} \cdot \frac{364}{365} \cdots \frac{365-n+1}{365} = \frac{365!}{365^{n}\,(365-n)!}$$

The probability that at least two people share a birthday is then 1 - p(n).

Wikipedia also gives a general formula:

[Figure: general formula, from Wikipedia]

Here are the probability curve and the values for specific n:

[Figure: probability curve, from Wikipedia]

[Figure: probabilities for specific values of n, from Wikipedia]

**Conclusion:**

With n = 23 people the probability that at least two share a birthday is 50.7%; with n = 30 it is about 70%, and with n = 50 about 97%. It feels counter-intuitive, but it is a fact.
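The figures above can be reproduced with a few lines of Python. This is a minimal sketch under the same assumptions as the analysis (365 equally likely birthdays, no leap years); the function name `shared_birthday_probability` is chosen only for this example.

```python
def shared_birthday_probability(n: int, days: int = 365) -> float:
    """Probability that at least two of n people share a birthday."""
    if n > days:
        return 1.0  # pigeonhole principle: a shared birthday is certain
    p_all_different = 1.0
    for i in range(n):
        # The (i + 1)-th person must avoid the i birthdays already taken.
        p_all_different *= (days - i) / days
    return 1.0 - p_all_different


if __name__ == "__main__":
    for n in (23, 30, 50, 60):
        print(n, round(shared_birthday_probability(n), 3))
```

For n = 23 this gives about 0.507, and for n = 30 about 0.706, in line with the numbers quoted above.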

## 4. CRC32 collision probability

Let us take the 32-bit CRC32 algorithm as an example to illustrate how likely hash collisions are.

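Naively, the CRC32 output space has about 4.2 billion values. That does not mean collisions only appear once there are around 4 billion inputs, though. In fact, collisions show up at far smaller data sizes.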
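A quick experiment makes this concrete. The sketch below uses `zlib.crc32` from the Python standard library on random 16-byte inputs; the input count of 500,000, the seed, and the function name `crc32_collisions` are arbitrary choices for this illustration, not values from the test report cited later.

```python
import random
import zlib


def crc32_collisions(n: int, seed: int = 0) -> int:
    """Hash n random 16-byte inputs with CRC32 and count the collisions."""
    rng = random.Random(seed)
    # A set removes the (extremely unlikely) duplicate random inputs.
    inputs = {rng.getrandbits(128).to_bytes(16, "big") for _ in range(n)}
    checksums = {zlib.crc32(data) for data in inputs}
    # Every input that failed to add a new checksum collided with another one.
    return len(inputs) - len(checksums)


if __name__ == "__main__":
    print(crc32_collisions(500_000))
```

With 500,000 inputs, only about 0.01% of the 2 ^ 32 space is used, yet the birthday estimate n^2 / (2S) predicts roughly 29 collisions.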

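In essence, CRC32 collisions can be analyzed from the angle of the birthday paradox: we want the probability of a collision given k inputs. Suppose there are k inputs and let S denote the 4.2 billion possible values. The probability that no collision occurs is the product of the following factors:

- 1st input: S/S = 1
- 2nd input: (S-1)/S
- 3rd input: (S-2)/S
- ......
- k-th input: (S-k+1)/S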
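The product of the factors listed above can be evaluated numerically. The following is a sketch that assumes an ideal, uniformly distributed 32-bit hash with S = 2 ^ 32; the function name `collision_probability` and the sample values of k are chosen for illustration. Summing logarithms instead of multiplying the factors directly keeps the result accurate when k is large.

```python
import math

S = 2 ** 32  # size of the CRC32 output space


def collision_probability(k: int, space: int = S) -> float:
    """Probability of at least one collision among k uniformly hashed inputs."""
    if k > space:
        return 1.0  # pigeonhole principle
    # log of (S/S) * ((S-1)/S) * ... * ((S-k+1)/S)
    log_no_collision = sum(math.log1p(-i / space) for i in range(k))
    return -math.expm1(log_no_collision)  # 1 - exp(log_no_collision)


if __name__ == "__main__":
    for k in (1_000, 10_000, 77_163, 200_000):
        print(k, round(collision_probability(k), 4))
```

Under these assumptions the probability reaches about 50% at roughly 77,000 inputs, a tiny fraction of the 4.2 billion possible values. Passing `space=365` turns the same function into the birthday calculation from the previous section.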

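This calculation is essentially the same as the one for the birthday paradox, and the probability of no collision drops very quickly as k grows. I found a collision test report online covering CRC16 through CRC64; take a look: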

Source: http://www.backplane.com/matt/crc64.html

| Output bits | Count / total inputs |
| --- | --- |
| 16 | 18134464/18200000 |
| 17 | 18068928/18200000 |
| 18 | 17937856/18200000 |
| 19 | 17675712/18200000 |
| 20 | 17151424/18200000 |
| 21 | 16103198/18200000 |
| 22 | 14061250/18200000 |
| 23 | 10770169/18200000 |
| 24 | 7092360/18200000 |
| 25 | 4153742/18200000 |
| 26 | 2259269/18200000 |
| 27 | 1179721/18200000 |
| 28 | 603421/18200000 |
| 29 | 305089/18200000 |
| 30 | 153722/18200000 |
| 31 | 77254/18200000 |
| 32 | 38638/18200000 |
| 33 | 19232/18200000 |
| 34 | 9652/18200000 |
| 35 | 4914/18200000 |
| 36 | 2343/18200000 |
| 37 | 1204/18200000 |
| 38 | 637/18200000 |
| 39 | 302/18200000 |
| 40 | 152/18200000 |
| 41 | 75/18200000 |
| 42 | 52/18200000 |
| 43 | 21/18200000 |
| 44 | 13/18200000 |
| 45 | 7/18200000 |
| 46 | 1/18200000 |
| 47 | 1/18200000 |
| 48 | 1/18200000 |
| 49 | 0/18200000 |
| 50 | 0/18200000 |
| 51 | 0/18200000 |
| 52 | 0/18200000 |
| 53 | 0/18200000 |
| 54 | 0/18200000 |
| 55 | 0/18200000 |
| 56 | 0/18200000 |
| 57 | 0/18200000 |
| 58 | 0/18200000 |
| 59 | 0/18200000 |
| 60 | 0/18200000 |
| 61 | 0/18200000 |
| 62 | 0/18200000 |
| 63 | 0/18200000 |
| 64 | 0/18200000 |

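The input was 18.2 million random values, and the data above gives the number of collisions those 18.2 million inputs produced at each CRC width. CRC32 produced 38,638 collisions, and the count only drops to zero at CRC49, so the collision rate is still quite high.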
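The reported counts can be compared against what the birthday-style analysis predicts. The sketch below assumes an ideal uniform hash and treats the count as the number of inputs minus the number of distinct hash values; that reading is an assumption about the report, although it reproduces the 16-bit row exactly. The function name `expected_collisions` is chosen for this example.

```python
import math

N_INPUTS = 18_200_000  # number of random inputs in the report


def expected_collisions(n: int, bits: int) -> float:
    """Expected (inputs - distinct hash values) for an ideal uniform hash."""
    space = 2 ** bits
    # Expected distinct values: S * (1 - (1 - 1/S)^n), computed with
    # log1p/expm1 so it stays accurate when 1/S is very small.
    expected_distinct = -space * math.expm1(n * math.log1p(-1.0 / space))
    return n - expected_distinct


if __name__ == "__main__":
    for bits in (16, 24, 32, 40, 49, 64):
        print(bits, round(expected_collisions(N_INPUTS, bits), 2))
```

For 32 bits the estimate is roughly 38,500, close to the 38,638 in the report, and at 49 bits it falls below 1, which fits the zero counts from CRC49 onward.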

## 5. References

- Wikipedia: Birthday paradox
- CRC collision test report