Weighted Random Sampling without Replacement

Origin

Some time ago, I helped my classmates write code to generate ER (Erdős–Rényi) and BA (Barabási–Albert) networks.
In the BA network, each node should be attached to with probability proportional to its degree over the total degree of the whole graph.
So I wrote a function, pick, that randomly draws numbers according to given probabilities.

Idea

The idea was simple: divide the number axis into segments, one per weight, and then draw a point uniformly at random.

For example, for the array [1,2,3], generate a random integer in the interval [1,6].
If the random number is 5, the third number is considered selected, since 5 falls into the third segment (4 to 6).
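
A minimal sketch of this single-draw idea (my own illustration, with made-up names; this is not the final code):

import random
from typing import List

def pick_one(ps: List[int]) -> int:
    '''Draw one index with probability proportional to ps.'''
    # Build prefix sums: [1, 2, 3] -> [1, 3, 6]
    prefix, total = [], 0
    for w in ps:
        total += w
        prefix.append(total)
    x = random.randint(1, total)  # uniform point on [1, total]
    # The first prefix sum >= x identifies the segment the point fell into
    for i, s in enumerate(prefix):
        if s >= x:
            return i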

The problem

The method above works well for drawing a single number, but drawing n numbers this way performs very badly: the more numbers you have already drawn, the higher the probability of hitting a duplicate that has to be rejected and redrawn.

At first I used a shuffle-style approach: when drawing the i-th number, I swap it with the number at position i, then draw from the following numbers according to their probabilities. This avoids repeated draws, but the efficiency is still poor, because the prefix sums have to be recomputed for every draw. A sketch of this approach follows.
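
Roughly, the swap-based version might look like this (a reconstruction under my own naming; each draw rescans the remaining suffix, hence O(n) per draw):

import random
from typing import List

def pick_by_swapping(ps: List[int], n: int) -> List[int]:
    '''Draw n distinct indices; already-drawn items are swapped to the front.'''
    items = list(enumerate(ps))  # (original index, weight) pairs
    res = []
    for i in range(n):
        # Recompute the cumulative sums over the remaining suffix: O(len) per draw
        total = sum(w for _, w in items[i:])
        x = random.randint(1, total)
        acc = 0
        for j in range(i, len(items)):
            acc += items[j][1]
            if acc >= x:
                items[i], items[j] = items[j], items[i]  # move the pick out of the pool
                res.append(items[i][0])
                break
    return res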

Later I went looking at reservoir sampling to improve on this, but at the time it seemed there was no variant with unequal probabilities.
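
(In fact a weighted variant does exist: the Efraimidis–Spirakis A-Res algorithm assigns each item the key u^(1/w), with u uniform on [0,1), and keeps the k items with the largest keys. A minimal sketch, my addition rather than part of the original post:)

import heapq
import random
from typing import List

def weighted_reservoir(weights: List[float], k: int) -> List[int]:
    '''Efraimidis-Spirakis A-Res: weighted sampling of k indices without replacement.'''
    heap = []  # min-heap of (key, index); holds the k largest keys seen so far
    for i, w in enumerate(weights):
        key = random.random() ** (1.0 / w)
        if len(heap) < k:
            heapq.heappush(heap, (key, i))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, i))
    return [i for _, i in heap]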

Final algorithm

In the end I adopted the approach of setting a drawn number's probability to 0. This both avoids repeated draws and makes the prefix sums easy to update.
The prefix sums are maintained in a binary indexed tree (Fenwick tree), so a query or update costs $O(\log n)$.
A binary search decides which segment the random number falls into; the search itself takes $O(\log n)$ steps, and each step queries a prefix sum, so the total cost per draw is $O((\log n)^2)$.

An example to make the point concrete:
say the probability ratios are [1,2,3], so the prefix sums are [1,3,6].
If the second number is drawn first, the ratios become [1,0,3] and the prefix sums [1,1,4]. Binary-searching the prefix sums for the first position greater than or equal to the random number can never land on a zero-weight entry, since its prefix sum does not increase past its neighbor's; this guarantees no repetition while preserving the time complexity. (A trace of this example appears after the code below.)

The code

import random
from typing import List


class TreeArray:
    '''Binary indexed tree (Fenwick tree) over an integer array.'''

    def __init__(self, arr: List[int]) -> None:
        self.len = len(arr) + 1          # internal array is 1-indexed
        self.arr = [0] * self.len
        for i in range(self.len - 1):
            self.add(i, arr[i])

    def add(self, i: int, d: int) -> None:
        '''Add d at position i (0-indexed).'''
        i += 1
        while i < self.len:
            self.arr[i] += d
            i += i & (-i)

    def sum(self, i: int) -> int:
        '''Return the sum of the first i numbers.'''
        res = 0
        while i > 0:
            res += self.arr[i]
            i -= i & (-i)
        return res

def bsearch(arr: TreeArray, k: int) -> int:
    '''Return the smallest index whose prefix sum is >= k.'''
    l, r = -1, arr.len - 1
    while l + 1 < r:
        m = (l + r) // 2
        if arr.sum(m + 1) < k:  # prefix sum of the first m+1 numbers
            l = m
        else:
            r = m
    return r


def pick(ps: List[int], n: int) -> List[int]:
    '''Randomly draw n distinct indices with probabilities proportional to ps.'''
    section, res = TreeArray(ps), [None] * n
    cur_sum = section.sum(len(ps))  # total remaining weight
    for i in range(n):
        x = random.randint(1, cur_sum)  # uniform point on [1, cur_sum]
        j = bsearch(section, x)         # segment the point falls into
        res[i], cur_sum = j, cur_sum - ps[j]
        section.add(j, -ps[j])          # zero out j's weight so it cannot repeat
    return res
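
A quick trace of the worked example above, plus a call to pick (the printed draw is just one possible outcome):

t = TreeArray([1, 2, 3])
print([t.sum(i) for i in (1, 2, 3)])  # [1, 3, 6]
t.add(1, -2)                          # "draw" index 1: zero out its weight
print([t.sum(i) for i in (1, 2, 3)])  # [1, 1, 4]

print(pick([1, 2, 3], 3))             # e.g. [2, 0, 1]; no index repeats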

Test

from scipy.stats import ks_2samp
import numpy as np

n = 500
s = 5
p = np.array([5, 3, 2, 5, 2, 1, 8, 3, 2, 1])
p1 = p / sum(p)
ids = list(range(len(p)))

cnt = [0] * len(p)
cnt1 = [0] * len(p)
for i in range(n):
    ans = pick(p, s)
    for j in range(s):
        cnt[ans[j]] += 1
    ans = np.random.choice(ids, size=s, replace=False, p=p1)
    for j in range(s):
        cnt1[ans[j]] += 1
print(ks_2samp(cnt1, cnt))

Here I run 500 rounds, drawing 5 numbers per round, and use the KS test to judge whether the probability distribution is matched.
The baseline is the numpy library's np.random.choice.
As you can see, the pvalue is very high.
According to the comments in scipy's source code, the higher the pvalue (or the lower the statistic), the more similar the two distributions:

If the KS statistic is small or the p-value is high, then we cannot
reject the hypothesis that the distributions of the two samples are
the same.

Interlude

Do you think this is the end? Actually not.
Strange things happened when I increased the data size.
You would expect that the more data there is, the more similar the two distributions become, but what I measured was the opposite, and the trend is not even monotonic:
with n = 500: KstestResult(statistic=0.2, pvalue=0.9944575548290717)
with n = 5000: KstestResult(statistic=0.1, pvalue=1.0)
with n = 50000: KstestResult(statistic=0.3, pvalue=0.7869297884777761)

The agreement looks best at n = 5000.
Looking into it later, I found that the KS test is not suitable for testing discrete distributions.
For the specific reasons, see here.
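
Since both samplers here produce categorical counts, a test better suited to this discrete case is a chi-square homogeneity test, for example scipy.stats.chi2_contingency on the two count vectors (my own suggestion, not from the original post):

from scipy.stats import chi2_contingency
import numpy as np

# Reuse cnt and cnt1 from the test above: ask whether the two count
# vectors could come from the same categorical distribution.
# A high p-value means no detectable difference.
chi2, pvalue, dof, expected = chi2_contingency(np.array([cnt, cnt1]))
print(chi2, pvalue)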
