Hash collision probability calculation and Python simulation

Table of contents

1. Introduction

2. Birthday question

3. Hash collision problem

4. Simple Python simulation

5. Look at the hash collision probability from another perspective


1. Introduction

        A Hash function is not a basic concept in the theory of computation; the underlying theoretical notion is the one-way function. The strict definition of a one-way function is involved, and books on complexity theory or cryptography should be consulted for it. In plain language: a function is one-way if, given an input, its result is easy to compute, but given a result, it is hard to find an input that produces it. Encryption functions of all kinds can be regarded as approximations of one-way functions. A Hash function (also written hash function) can likewise be regarded as an approximation of a one-way function; that is, it comes close to satisfying the definition.
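        The asymmetry described above can be sketched in a few lines. This is only an illustration: SHA-256 from Python's standard hashlib module is an assumption chosen here (the article does not name a specific function), and the helper names digest and brute_force_preimage are hypothetical.

```python
import hashlib

def digest(message: bytes) -> str:
    """Forward direction: cheap to compute for any input."""
    return hashlib.sha256(message).hexdigest()

def brute_force_preimage(target: str, candidates):
    """Reverse direction: no shortcut is assumed, so candidates are tried one by one."""
    for candidate in candidates:
        if digest(candidate) == target:
            return candidate
    return None

target = digest(b"42")
# The search succeeds only because the candidate space here is tiny ("0" .. "99").
found = brute_force_preimage(target, (str(i).encode() for i in range(100)))
print(found)
```

        With a realistic input space the forward call stays just as cheap, while the exhaustive search quickly becomes impractical.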

        There is also a more common way to understand the Hash function: it is a compressing mapping of data. In practice, a Hash function maps a large range onto a small range, often to save space and make data easy to store. In addition, Hash functions are often used for lookups. Therefore, before using a Hash function, you need to understand its limitations:

        (1) The main principle of Hash is to map a large range to a small range; therefore, the number of distinct values you actually input should be no larger than the size of the small range. Otherwise there will be many conflicts.

        (2) Since Hash is close to a one-way function, you can use it to protect data; note that, unlike encryption in the usual sense, a hash value is not meant to be decrypted.

        (3) Different applications place different requirements on the Hash function; for example, a Hash function used for encryption is judged mainly by how closely it approximates a one-way function, while a Hash function used for searching is judged mainly by its collision rate when mapping to the small range.
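        Limitation (1) can be illustrated with a small sketch. It assumes Python's built-in hash() followed by a modulo as the mapping (in CPython, small non-negative integers hash to themselves, which makes the run deterministic); the helper names are hypothetical.

```python
from collections import Counter

def bucket_counts(keys, num_buckets):
    """Map each key into range(num_buckets) and count the keys per bucket."""
    return Counter(hash(key) % num_buckets for key in keys)

def count_collisions(keys, num_buckets):
    """Number of keys that landed in a bucket already used by an earlier key."""
    counts = bucket_counts(keys, num_buckets)
    return sum(c - 1 for c in counts.values())

keys = list(range(1000))  # 1000 distinct integer keys

# Target range larger than the number of keys: collisions can be avoided.
print(count_collisions(keys, 2000))
# Target range smaller than the number of keys: at least 1000 - 100 keys must collide.
print(count_collisions(keys, 100))
```

        The second case is the pigeonhole principle at work: no choice of Hash function can avoid those collisions.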

        The input of a Hash function is typically an array (for example, a string), and its output is generally an int.

        Generally speaking, Hash functions can be simply divided into the following categories:
        1. Addition Hash;
        2. Bit operation Hash;
        3. Multiplication Hash;
        4. Division Hash;
        5. Table lookup Hash;
        6. Mixed Hash;
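        As a rough illustration of the first and third categories, here is a minimal sketch. The article does not define the categories precisely, so the two functions below, their names, and the multiplier 31 are assumptions made for the example.

```python
def additive_hash(text: str, table_size: int) -> int:
    """Addition Hash: sum the character codes, then reduce to the table size."""
    total = 0
    for ch in text:
        total += ord(ch)
    return total % table_size

def multiplicative_hash(text: str, table_size: int, multiplier: int = 31) -> int:
    """Multiplication Hash: multiply the running value before adding each character."""
    value = 0
    for ch in text:
        value = (value * multiplier + ord(ch)) & 0xFFFFFFFF  # keep 32 bits
    return value % table_size

# The additive version ignores character order, so "ab" and "ba" collide.
print(additive_hash("ab", 1024), additive_hash("ba", 1024))
# The multiplicative version is order-sensitive and separates them.
print(multiplicative_hash("ab", 1024), multiplicative_hash("ba", 1024))
```

        The comparison hints at why the choice of category affects the collision rate even for the same target range.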
 

        This article mainly discusses the calculation of the hash collision probability and a simple Python simulation of it.

2. Birthday question

        Mathematically speaking, the hash collision probability problem is a generalization of the well-known "birthday problem".

        Birthday problem: assume that birthdays are uniformly distributed over the 365 days of a year (in other words, statistically the same number of people are born on each of the 365 days). In a party of k people, what is the probability that at least 2 people have the same birthday? Further, what is the smallest k for which this probability exceeds 50%?

        The result of this question is somewhat counter-intuitive, so it is difficult to guess an answer even roughly close without rigorous calculations. How to calculate this probability is discussed below.

        What we want to calculate is the probability that at least two people have a birthday conflict, but it is not easy to calculate this directly. As a common technique in probability calculations, we consider the probability of the complement of the event "at least two people have a birthday conflict"—that is, no birthday conflict between any two people.

        In the following description, birthday conflict is used to indicate that two people have the same birthday.

        Consider the first person: clearly they cannot conflict with anyone yet, so the probability of no conflict is $\frac{365}{365} = 1$.

        Consider the second person: the probability that they do not have a birthday conflict with the first person is $1-\frac{1}{365} = \frac{365-1}{365}$.

        Consider the third person: the probability that they do not have a birthday conflict with either of the first two people is $1-\frac{2}{365} = \frac{365-2}{365}$.

        ...

        Consider the kth person: the probability that they do not have a birthday conflict with any of the previous (k-1) people is $\frac{365-(k-1)}{365}$.

        Therefore, the probability that k individuals do not have birthday conflicts is:

                $$\frac{365}{365}\cdot\frac{364}{365}\cdots\frac{365-(k-1)}{365} = \prod_{l=0}^{k-1} \frac{365-l}{365}$$

        Therefore, the probability that at least two people have the same birthday is: $P_{collision} = 1 - \prod_{l=0}^{k-1} \frac{365-l}{365}$
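        As a quick sanity check of the formula, consider a party of only k = 3 people. The product then has three factors, and the collision probability is still below 1%:

```latex
P_{collision}\big|_{k=3}
  = 1 - \frac{365}{365}\cdot\frac{364}{365}\cdot\frac{363}{365}
  \approx 1 - 0.9918
  = 0.0082
```

        Each additional person multiplies in a smaller factor than the previous one, which is why the probability grows faster than intuition suggests.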

3. Hash collision problem

        Assume that the hash function produces uniformly distributed outputs when it compresses data from a large space (the input space) into a small space (the target space). Then the hash collision probability problem (the probability that any two of the input data map to the same value in the target space) is a generalization of the birthday conflict problem above. The only difference is that the size of the target space changes from 365 in the birthday problem to a general N.

        That is, for k inputs and a target space of size N, the generalized hash collision probability is:

                $$P_{collision, N} = 1 - \prod_{l=0}^{k-1} \frac{N-l}{N}$$

        When N (and with it k) is very large, evaluating the product on the right-hand side of the above formula becomes slow. Fortunately, the formula can be well approximated as follows:

                $$P_{collision, N} = 1 - \prod_{l=0}^{k-1} \frac{N-l}{N} \approx 1 - e^{-\frac{k(k-1)}{2N}}$$
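        The exponential form follows from the standard first-order approximation $1 - x \approx e^{-x}$, which holds for small x and therefore requires k to be much smaller than N. A sketch of the steps, in the same notation as above:

```latex
\prod_{l=0}^{k-1} \frac{N-l}{N}
  = \prod_{l=0}^{k-1} \left(1 - \frac{l}{N}\right)
  \approx \prod_{l=0}^{k-1} e^{-\frac{l}{N}}
  = e^{-\frac{1}{N}\sum_{l=0}^{k-1} l}
  = e^{-\frac{k(k-1)}{2N}}
```

        The last step uses $\sum_{l=0}^{k-1} l = \frac{k(k-1)}{2}$.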

4. Simple Python simulation

 

# -*- coding: utf-8 -*-
"""
Created on Mon Nov 21 13:44:55 2022

@author: chenxy

ref: https://preshing.com/20110504/hash-collision-probabilities/
"""
import math
import numpy as np
import matplotlib.pyplot as plt
import time
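def probCollision(N,k):
    """Exact probability that k uniform samples from N values contain a collision."""
    # Accumulate the probability of NO collision, then take the complement.
    # (A separate name avoids shadowing the function itself.)
    probNoCollision = 1.0
    for j in range(k):
        probNoCollision = probNoCollision * (N - j) / N
    return 1 - probNoCollision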

def probCollisionApproximation(N,k):
    # return 1 - math.exp(-0.5 * k * (k - 1) / N)
    return 1 - np.exp(-0.5 * k * (k - 1) / N)

if __name__ == '__main__':
    
    tstart=time.time()   
    Pcollision = [0]
    for k in range(1,100):
        Pcollision.append(probCollision(365, k))
        print('k = {0}, Pcollision[k] = {1}'.format(k,Pcollision[k]))
    tstop=time.time()
    print('Total elapsed time = {0} seconds'.format(tstop-tstart))
    
    tstart=time.time() 
    Pcollision2 = [0]    
    for k in range(1,100):
        Pcollision2.append(probCollisionApproximation(365, k))
        print('k = {0}, Pcollision2[k] = {1}'.format(k,Pcollision2[k]))
    tstop=time.time()
    
    print('Total elapsed time = {0} seconds'.format(tstop-tstart))

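    # Overlay the exact and the approximate curves for comparison
    plt.plot(Pcollision, label='exact')
    plt.plot(Pcollision2, label='approximation')
    plt.xlabel('k')
    plt.ylabel('collision probability')
    plt.legend()
    plt.show()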

The output is as follows (excerpt):

...
k = 17, Pcollision2[k] = 0.31106113346867104
k = 18, Pcollision2[k] = 0.34241291970846444
k = 19, Pcollision2[k] = 0.37405523755741676
k = 20, Pcollision2[k] = 0.40580512747932584
k = 21, Pcollision2[k] = 0.4374878053458634
k = 22, Pcollision2[k] = 0.4689381107801478
k = 23, Pcollision2[k] = 0.5000017521827107
k = 24, Pcollision2[k] = 0.5305363394090516
k = 25, Pcollision2[k] = 0.5604121995072768

...

From the above results, it can be seen that:

(1) The approximation from the previous section is very accurate; in the figure, the curves produced by the two methods almost coincide.

(2) The probability of a birthday conflict exceeds 50% when there are 23 people! In other words, in a party of 23 people there is a better than even chance that at least two of them share a birthday, even though a year has 365 days. Surprising, isn't it?
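        The script above evaluates formulas rather than drawing random samples. As a complementary check, the following is a minimal Monte Carlo sketch that estimates the same probability by actually drawing birthdays; the function name, the trial count, and the fixed seed are arbitrary choices made for this example.

```python
import random

def simulate_collision_prob(N, k, trials=20000, seed=0):
    """Estimate the probability that k uniform draws from range(N) contain a repeat."""
    rng = random.Random(seed)  # fixed seed so that the run is repeatable
    hits = 0
    for _ in range(trials):
        seen = set()
        for _ in range(k):
            value = rng.randrange(N)
            if value in seen:
                hits += 1  # first repeat found; this trial counts as a collision
                break
            seen.add(value)
    return hits / trials

estimate = simulate_collision_prob(365, 23)
print('Estimated collision probability for k = 23: {0:.3f}'.format(estimate))
```

        The estimate should land close to the exact value for k = 23 (about 0.507), with a statistical error of a few tenths of a percent at this trial count.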

        Further, by repeating the calculation for other values of N, we find that the conflict probability curve has the same shape for any N. By the approximation in Section 3, the probability is roughly $1 - e^{-k^2/(2N)}$ when k is large, so it can be expressed as a function of the normalized number $k/\sqrt{N}$ alone, rather than of the specific k and N.
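        How the curve scales with N can be checked numerically. The sketch below uses the exponential approximation from Section 3 to find the smallest k at which the collision probability reaches 50% and compares it with the square root of N; the helper names are hypothetical.

```python
import math

def collision_prob_approx(N, k):
    """Exponential approximation of the collision probability."""
    return 1.0 - math.exp(-0.5 * k * (k - 1) / N)

def smallest_k_for_half(N):
    """Smallest k whose approximate collision probability is at least 50%."""
    k = 1
    while collision_prob_approx(N, k) < 0.5:
        k += 1
    return k

for N in (365, 10_000, 1_000_000):
    k_half = smallest_k_for_half(N)
    # The ratio k / sqrt(N) stays nearly constant as N grows.
    print(N, k_half, round(k_half / math.sqrt(N), 3))
```

        The ratio settles near $\sqrt{2\ln 2} \approx 1.18$, so the number of inputs needed for a 50% collision chance grows with the square root of the target space size, not with N itself.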

 

5. Look at the hash collision probability from another perspective

        This section examines hash collision probability from another perspective.

        Given a target space of size N, suppose we randomly sample data from the input space and map it to the target space. How many input data are needed to fill the target space? And what is the relationship between the filling rate of the target space and the collision probability?

        The following is a Monte Carlo simulation for this problem. The code is as follows:

# -*- coding: utf-8 -*-
"""
Created on Sat Nov 26 10:04:08 2022

@author: chenxy
"""


# generate random 160 bits key

import numpy as np
import random
from collections import defaultdict
import time
import matplotlib.pyplot as plt

def key160_gen() -> int:
    """
    Generate one random 160 bits key

    Returns
    -------
    int
        160 bits key.

    """
    return random.randint(0,2**160-1)

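def hash_sim(cam_depth):
    """Insert random keys into a table of cam_depth slots until every slot is used.

    Returns the number of keys generated and the number of collisions.
    """
    # dtype=object keeps the 160-bit keys as exact Python ints;
    # a float array would round them. The value 0 marks an empty slot.
    hcam = np.zeros((cam_depth,), dtype=object)
    key_cnt       = 0
    query_ok_cnt  = 0
    collision_cnt = 0
    camfill_cnt   = 0
        
    while True: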
        key_cnt += 1
        key = key160_gen()    
        key_hash = hash(key) %(cam_depth)
        # print('key = {0}, key_hash = {1}'.format(key,key_hash))
        
        if key == hcam[key_hash]:
            query_ok_cnt += 1
        else:
            if hcam[key_hash]==0:
                camfill_cnt += 1
            else:
                collision_cnt += 1
            hcam[key_hash] = key
    
        # if key_cnt %4096 == 0:
        #     print('key_cnt = {0}, camfill_cnt = {1}'.format(key_cnt,camfill_cnt))
    
        if camfill_cnt == cam_depth:
            # print('CAM has been filled to full, with {0} key operation'.format(key_cnt))
            break        
        
    return key_cnt, collision_cnt    

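rslt = []
for k in range(10,20):
    tStart = time.time()        
    cam_depth = 2**k
    key_cnt,collision_cnt = hash_sim(cam_depth)
    tElapsed = time.time() - tStart            
    # collision_prob is the fraction of all insertions that hit an occupied slot
    print('cam_depth={0}, key_cnt={1}, collision_prob={2:4.3f}, tCost={3:3.2f}(sec)'.format(cam_depth,key_cnt,collision_cnt/key_cnt,tElapsed))
    
    rslt.append([cam_depth,key_cnt])

# Plot the number of keys needed to fill the table against the table size
rslt = np.array(rslt)    
plt.plot(rslt[:,0],rslt[:,1])
plt.xlabel('cam_depth')
plt.ylabel('key_cnt')
plt.show()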

The output is as follows:

cam_depth=1024, key_cnt=6010, collision_prob=0.830, tCost=0.07(sec)
cam_depth=2048, key_cnt=16034, collision_prob=0.872, tCost=0.17(sec)
cam_depth=4096, key_cnt=30434, collision_prob=0.865, tCost=0.30(sec)
cam_depth=8192, key_cnt=89687, collision_prob=0.909, tCost=0.87(sec)
cam_depth=16384, key_cnt=149980, collision_prob=0.891, tCost=1.15(sec)
cam_depth=32768, key_cnt=314527, collision_prob=0.896, tCost=2.38(sec)
cam_depth=65536, key_cnt=866673, collision_prob=0.924, tCost=6.48(sec)
cam_depth=131072, key_cnt=1518369, collision_prob=0.914, tCost=11.08(sec)
cam_depth=262144, key_cnt=3657451, collision_prob=0.928, tCost=26.70(sec)
cam_depth=524288, key_cnt=6648966, collision_prob=0.921, tCost=48.48(sec)

        The above simulation results show that if the hash table is to be filled completely, about 90% of the insertions collide; that is, roughly 9 out of every 10 put operations hit a slot that is already occupied! (Exactly cam_depth insertions land in an empty slot, so the reported collision_prob is essentially 1 - cam_depth/key_cnt.) The most important problem in applications based on a hash table is therefore how to resolve hash collisions.
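        The key counts above can be compared with a known result. Under the assumption that hash values are uniformly distributed, filling all N slots is the classic coupon collector problem, whose expected number of draws is N(1 + 1/2 + ... + 1/N). The sketch below evaluates that expectation for the same table sizes; the function name is hypothetical, and the last printed column is only a rough guide, since it divides N by the expected key count rather than averaging the ratio over runs.

```python
def expected_keys_to_fill(N):
    """Coupon collector expectation: N * (1 + 1/2 + ... + 1/N)."""
    return N * sum(1.0 / i for i in range(1, N + 1))

for k in range(10, 20):
    N = 2 ** k
    expected = expected_keys_to_fill(N)
    # Columns: table size, expected key count, rough fraction of colliding insertions
    print(N, round(expected), round(1 - N / expected, 3))
```

        Each simulated key_cnt is a single run, so it scatters around the expectation. The slow growth of the harmonic sum also explains why the collision fraction creeps upward as the table gets larger.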

References:

[1] Hash Collision Probabilities, https://preshing.com/20110504/hash-collision-probabilities/

Origin blog.csdn.net/chenxy_bwave/article/details/128402156