(Excerpt) Reservoir Sampling

Problem Statement

Reservoir Sampling is an algorithm for sampling elements from a stream of data. Imagine you are given a really large stream of data elements (Google search queries in May, products bought at Walmart during the Christmas season, names in a phone book, whatever). Your goal is to efficiently return a random sample of 1,000 elements, drawn uniformly from the original stream. How would you do it?

If you know the size of the stream N and can index into it, the right answer is simple: generate 1,000 random integers between 0 and N-1, retrieve the elements at those indices, and you have your answer. (Update: reader Martin astutely points out that this is sampling with replacement. To make it sampling without replacement, simply note whether your sample has already pulled that random number and, if so, choose a new one. This can make the algorithm pretty expensive if the sample size is very close to N, though.)
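A minimal sketch of this indexed approach, assuming the data sits in an indexable list (the function name and parameters are mine, for illustration):

```python
import random

def sample_known_n(data, k):
    """Sample k distinct elements from an indexable sequence of known size.

    Draws random indices and re-draws on duplicates, i.e. the sampling
    without replacement described above. Re-drawing gets expensive as
    k approaches len(data).
    """
    n = len(data)
    chosen = set()
    while len(chosen) < k:
        i = random.randrange(n)  # random integer in [0, n-1]
        chosen.add(i)            # a duplicate index is simply re-drawn
    return [data[i] for i in chosen]
```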

So, let me make the problem harder: you don't know N (the size of the stream) in advance, and you can't index directly into it. You could count the elements first, but that requires making two passes over the data, and you can do better. There are heuristics you might try, such as guessing the length and hoping to undershoot, but any such guess will either fail to finish in one pass or will not produce a uniform sample.

Simple Solution

A relatively easy and correct solution is to assign a random number to every element as you see it in the stream, and to keep, at all times, the 1,000 elements with the highest random numbers. This is similar to how MySQL handles "ORDER BY RAND()" queries. This strategy works well, and only requires additionally storing the randomly generated number for each retained element. A sketch of it follows.
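Here is a minimal sketch of that strategy, assuming a min-heap to track the top-tagged elements (the heap is my choice of data structure; the post only says to keep the top 1,000):

```python
import heapq
import random

def sample_stream(stream, k=1000):
    """Keep the k elements with the largest random tags seen so far.

    A min-heap of (tag, index, element) triples lets us evict the
    smallest tag in O(log k) per element; the index breaks ties so
    elements themselves are never compared.
    """
    heap = []  # min-heap ordered by random tag
    for i, element in enumerate(stream):
        tag = random.random()
        if len(heap) < k:
            heapq.heappush(heap, (tag, i, element))
        elif tag > heap[0][0]:
            # New tag beats the smallest retained tag: swap it in.
            heapq.heapreplace(heap, (tag, i, element))
    return [element for _, _, element in heap]
```

Each incoming element costs at most O(log k), and memory holds only the k retained elements plus their tags, so this runs in one pass without knowing N.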

Reposted from jays1235.iteye.com/blog/1113417