Reservoir sampling

Cat and Peach go out for conveyor-belt sushi. Peach asks the cat to take k dishes off the conveyor belt at random. What should the cat do?

Given:

1. The cat and Peach are sitting in the most upstream seats, so the conveyor belt coming out of the kitchen brings the chef's sushi to them first.

2. Any sushi that passes the cat without being taken will be eaten by the rabbits sitting downstream and will not come around again (each dish passes only once).

3. The table holds only k dishes.

4. A dish that has been taken but not yet eaten can be put back.

5. Although the chef has stopped making new sushi, the part of the conveyor belt inside the kitchen is so long that the cat has no idea how many dishes were made. (For convenience, assume there are N dishes in total, with N≥k.)

6. The cat can generate random numbers in its head.

 

OK, the story above is just for fun. The actual problem is this:

Randomly select k samples from a set S of N items, where N is large and unknown, so the N items cannot all be held in main memory. The items may be traversed only once.

 

Jeffrey Scott Vitter addresses this problem in his paper [1] and gives a neat algorithm:

1. Empty the table and leave k positions numbered 1~k to store sushi.

2. Put the first k (1~k) sushi in positions 1~k respectively.

3. When the jth sushi passes by (k+1≤j≤N), the cat generates a uniformly random integer r between 1 and j (inclusive).

4. If r≤k, replace the sushi at position r on the table with the current (jth) sushi; otherwise, do nothing.

5. Repeat steps 3 and 4 until all the sushi has passed in front of the cat.
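The five steps above can be sketched in Python as follows. This is a minimal illustration rather than code from the paper: the function name `reservoir_sample`, the variable `table`, and the optional `rng` parameter are all assumptions made for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Pick k items from a stream of unknown length in a single pass.

    Hypothetical helper for illustration; `rng` lets the caller pass a
    seeded generator for reproducible runs.
    """
    rng = rng or random.Random()
    table: List[T] = []  # step 1: an empty table with room for k dishes
    for j, sushi in enumerate(stream, start=1):
        if j <= k:
            # Step 2: the first k dishes go straight onto the table.
            table.append(sushi)
        else:
            # Step 3: random integer r with 1 <= r <= j (both ends inclusive).
            r = rng.randint(1, j)
            # Step 4: replace position r if r <= k, otherwise do nothing.
            if r <= k:
                table[r - 1] = sushi
    # Step 5: the loop ends once the stream is exhausted.
    return table
```

For example, `reservoir_sample(iter(range(1, 101)), k=5)` returns 5 of the 100 numbers while holding only 5 of them at any time. If the stream has fewer than k items, this sketch simply returns all of them.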

 

The canonical description of this algorithm and its proof [2][3] are not complicated. We can prove the algorithm correct by calculating the probability P(i) that the ith sushi on the conveyor belt ends up in the final selection, and showing that P(i)=k/N:

If i≤k, it is placed on the table from the beginning, and the sushi that "threaten" it are the (k+1)th through Nth. The jth sushi (k+1≤j≤N) replaces it only when r equals its position on the table, which happens with probability 1/j, so the probability of it staying "safe" at step j is (j-1)/j. Multiplying these probabilities over all j gives: P(i)=[k/(k+1)]×[(k+1)/(k+2)]×...×[(N-1)/N]= k/N.

For the case of i>k, interested readers can complete this part of the proof by themselves.
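For readers who want to check their answer, here is one way to finish the argument, using the same reasoning as the i≤k case. The ith sushi first has to get onto the table, which happens when r≤k, i.e. with probability k/i. After that it must stay safe from each later sushi j (i+1≤j≤N), with probability (j-1)/j each:

```latex
P(i) = \frac{k}{i} \times \prod_{j=i+1}^{N} \frac{j-1}{j}
     = \frac{k}{i} \times \frac{i}{i+1} \times \frac{i+1}{i+2} \times \cdots \times \frac{N-1}{N}
     = \frac{k}{N}, \qquad k < i \le N .
```

When i=N the product is empty and P(N)=k/N follows directly.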

 

The beauty of this algorithm is that it runs in O(N) time and needs only O(k) space.

 

A meaningless answer to this question is given on Nutshell [4], and interested readers can go and have a look.

