1 Sublinear Space Algorithms for Big Data
Scenario: storing a number N in binary requires log(N) bits.
Question: what should I do if N is very large and there are many such numbers?
Idea: give up a little accuracy and thus save much more space.
Solution: use an approximate counting algorithm; each number then needs only $O(\log\log N)$ space.
1.1 The counting problem in the streaming model
Problem definition

Given a data stream $\langle a_i \rangle$, $i \in [1,m]$, $a_i \in [1,n]$, with frequency vector $\langle f_i \rangle$, $i \in [1,n]$, $f_i \in [1,m]$.

Design an algorithm with $O(\log\log N)$ space complexity that counts how many times $a_i$ appears.
Morris algorithm

- Initialize $X$ to 0
- Loop: each time $a_i$ appears, increase $X$ by 1 with probability $1/2^X$
- Return $\hat{f} = 2^X - 1$
Proof, using Chebyshev's inequality $P[|X−\mu| \geq \epsilon] \leq \sigma^2/\epsilon^2$:

Expectation: $E[2^{X_N} − 1] = N$

Variance: $var[2^{X_N} − 1] = \frac{1}{2}N^2 − \frac{1}{2}N$

Let $Y = 2^{X_N} − 1$ in Chebyshev's inequality.

Finally, $P[|Y−N| \geq \epsilon] \leq \frac{N^2−N}{2\epsilon^2}$.
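A minimal Python sketch of the Morris counter above; the built-in uniform random source stands in for the biased coin flips.

```python
import random

def morris_count(n_events, rng):
    """Morris approximate counter: stores only X, which stays near
    log2(count), so the counter itself needs O(log log N) bits."""
    x = 0
    for _ in range(n_events):
        if rng.random() < 1.0 / (2 ** x):
            x += 1
    return 2 ** x - 1  # unbiased estimate of n_events

rng = random.Random(42)
# A single counter has huge variance; averaging several independent
# counters tames it.
estimates = [morris_count(10_000, rng) for _ in range(200)]
avg = sum(estimates) / len(estimates)
```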
Morris+ algorithm

- Run the Morris algorithm $k$ times
- Record the results $(X_1,...,X_k)$
- Return $\gamma = \frac{1}{k}\sum^k_{i=1}(2^{X_i}-1)$

Proof

$E[\gamma] = N$

$D[\gamma] = \frac{N^2−N}{2k}$

$P[|\gamma−N| \geq \epsilon] \leq \frac{N^2−N}{2k\epsilon^2}$
Morris++ algorithm

- Repeat the Morris+ algorithm $m = O(\log(\frac{1}{\delta}))$ times
- Take the median of the $m$ results

Proof

Let $X_i = 1$ if and only if the $i$-th run is within the error bound; then $E[X_i] = 0.9$

$\mu = 0.9m$

$P[\sum X_i < 0.5m] < \delta$
1.2 Number of distinct elements

Problem definition

Given a data stream $\langle a_i \rangle$, $i \in [1,m]$, $a_i \in [1,n]$, with frequency vector $\langle f_i \rangle$, $i \in [1,n]$, $f_i \in [1,m]$. Count the number of distinct elements, i.e. the number of indices $i$ with $f_i \neq 0$.
FM algorithm

- Randomly select a hash function $h:[n] \mapsto [0,1]$ (uniform on $[0,1]$)
- $z = 1$
- When an element $i$ appears: $z = \min\{z, h(i)\}$
- Return $\frac{1}{z} - 1$
Proof

Expectation: $E[z] = \frac{1}{d+1}$

$var[z] \leq \frac{2}{(d+1)(d+2)} < \frac{2}{(d+1)(d+1)}$

$P[|z − \frac{1}{d+1}| > \epsilon\frac{1}{d+1}] < \frac{var[z]}{(\frac{\epsilon}{d+1})^2} < \frac{2}{\epsilon^2}$
FM+ algorithm

- Run the FM algorithm $q$ times in total
- For each run $j$, randomly pick a hash function $h_j:[n] \mapsto [0,1]$ and initialize $z_j = 1$
- Counting: whenever $i$ occurs, update $z_j = \min(z_j, h_j(i))$
- $Z = \frac{1}{q}\sum_{j=1}^q z_j$
- Return $\frac{1}{Z} - 1$

Proof

$E[Z] = \frac{1}{d+1}$

$var[Z] \leq \frac{2}{(d+1)(d+2)}\frac{1}{q} < \frac{2}{(d+1)(d+1)}\frac{1}{q}$

$P[|X−d| > \epsilon'd] < \frac{2}{q}(\frac{2}{\epsilon'}+1)^2$

The cost is reduced to $O(\frac{1}{\epsilon^2}\log\frac{1}{\delta})$.
FM'+ algorithm

- Randomly select a hash function $h:[n] \mapsto [0,1]$
- $(z_1, z_2, ..., z_k) = 1$, i.e. the initial value of every $z$ is set to 1
- Maintain the $k$ smallest hash values seen so far
- Return $\frac{k}{z_k}$

Proof

$P[|\frac{k}{z_k}−d| > \epsilon d] = P[\frac{k}{(1+\epsilon)d} > z_k] + P[\frac{k}{(1−\epsilon)d} < z_k]$

$P < \frac{2}{\epsilon^2 k}$
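A minimal Python sketch of the FM'+ estimator above. The memoized uniform draw stands in for a real hash function $h:[n] \mapsto [0,1]$ (an assumption of this demo; a true sketch would store only the $k$ smallest values).

```python
import random

def kmv_distinct(stream, k, rng):
    """Keep the k smallest hash values seen; the k-th smallest z_k
    gives the distinct-count estimate k / z_k."""
    memo = {}
    def h(x):  # simulated uniform hash, consistent per element
        if x not in memo:
            memo[x] = rng.random()
        return memo[x]
    smallest = set()
    for item in stream:
        smallest.add(h(item))
        if len(smallest) > k:
            smallest.discard(max(smallest))
    if len(smallest) < k:
        return float(len(smallest))  # fewer than k distinct: exact
    return k / max(smallest)

rng = random.Random(11)
est = kmv_distinct(list(range(1000)) * 2, 100, rng)  # d = 1000
```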
Preliminaries for the PracticalFM and BJKST algorithms

If we cannot store real numbers, we use the PracticalFM algorithm and the BJKST algorithm instead.

A hash function from $[a]$ to $[b]$ is $k$-wise independent if $\forall j_1,...,j_k \in [b]$, $\forall i_1,...,i_k \in [a]$: $P[h(i_1) = j_1 \wedge ... \wedge h(i_k) = j_k] = \frac{1}{b^k}$

$zeros(p) = \max(i : p \% 2^i = 0)$, i.e. the number of trailing zeros in the binary expansion. For example, 8 is 1000 in binary, so $zeros(8) = 3$.
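The $zeros$ function above can be written directly:

```python
def zeros(p):
    """Trailing-zero count: the largest i with p % 2**i == 0 (p > 0).
    E.g. 8 = 0b1000 has three trailing zeros."""
    i = 0
    while p % 2 == 0:
        p //= 2
        i += 1
    return i
```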
PracticalFM algorithm

- Randomly select $h:[n] \mapsto [n]$ from a 2-wise independent hash function family
- $z = 0$
- For each element $j$: if $zeros(h(j)) > z$, set $z = zeros(h(j))$
- Return $\hat{d} = 2^{z+\frac{1}{2}}$
Algorithm explanation

With probability $1 - \frac{2\sqrt{2}}{C}$, the estimate satisfies $d/C \leq \hat{d} \leq Cd$.

Proof

$E[Y_r] = \frac{d}{2^r}$

$var[Y_r] \leq \frac{d}{2^r}$

The final success probability is at least $1 - \frac{2\sqrt{2}}{C}$.
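A minimal Python sketch of the PracticalFM algorithm above; the linear hash $(ax+b) \bmod p$ stands in for a 2-wise independent family (an assumption of this demo).

```python
import random

def practical_fm(stream, rng):
    """Track the maximum number of trailing zero bits among hashed
    items; return 2^(z + 1/2) as the distinct-count estimate."""
    p = 2**31 - 1  # Mersenne prime used as the hash modulus
    a, b = rng.randrange(1, p), rng.randrange(p)
    def zeros(v):
        i = 0
        while v > 0 and v % 2 == 0:
            v //= 2
            i += 1
        return i
    z = 0
    for j in stream:
        z = max(z, zeros((a * j + b) % p))
    return 2 ** (z + 0.5)

rng = random.Random(2)
est = practical_fm(range(1, 1001), rng)  # d = 1000 distinct items
```

A single run is only guaranteed within a constant factor $C$; the estimate lands in the right order of magnitude.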
BJKST algorithm

- Randomly select a 2-wise independent hash function $h:[n] \rightarrow [n]$
- Randomly select a 2-wise independent hash function $g:[n] \rightarrow [b\epsilon^{-4}\log^2 n]$
- $z = 0$, $B = \emptyset$
- For each element $j$, if $zeros(h(j)) \geq z$:
  - $B = B \cup \{(g(j), zeros(h(j)))\}$
  - while $|B| > \frac{c}{\epsilon^2}$:
    - $z = z + 1$
    - remove from $B$ every pair $(\alpha, \beta)$ with $\beta < z$
- Return $\hat{d} = |B| \cdot 2^z$
Algorithm explanation

To understand this algorithm, only two points need to be clear:

- When a new element $j$ appears and $zeros(h(j))$ passes the test against the current level $z$, insert the corresponding pair into $B$.
- If the number of elements in $B$ exceeds the threshold, increase $z$ by 1 and delete the pairs whose second component is less than $z$.
Proof

The algorithm guarantees a $(1+\epsilon)$ approximation with probability at least 2/3.

$E[Y_r] = \frac{d}{2^r}$

$var[Y_r] \leq \frac{d}{2^r}$

Finally $P[\text{FAIL}] < 1/6$; adding the error probability from the algorithm's hash assumptions, the total error probability is within 1/3.
Evaluation

The space complexity is $O(\log n + \frac{1}{\epsilon^2}(\log\frac{1}{\epsilon} + \log\log n))$.
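A minimal Python sketch of the BJKST algorithm above. The linear hash family and the constant $c = 12$ are assumptions of this demo, and the compressing hash $g$ is omitted (raw ids are stored) for clarity.

```python
import random

def bjkst(stream, eps, rng):
    """Distinct-element estimate |B| * 2^z, keeping only items whose
    hashed value has at least z trailing zeros."""
    p = 2**31 - 1
    a, b = rng.randrange(1, p), rng.randrange(p)
    def zeros(v):
        i = 0
        while v > 0 and v % 2 == 0:
            v //= 2
            i += 1
        return i
    cap = int(12 / eps**2)  # |B| threshold c/eps^2, with c = 12 assumed
    z, B = 0, {}
    for j in stream:
        zj = zeros((a * j + b) % p)
        if zj >= z:
            B[j] = zj
            while len(B) > cap:
                z += 1
                B = {u: v for u, v in B.items() if v >= z}
    return len(B) * 2**z

rng = random.Random(4)
est = bjkst(list(range(1, 1001)) * 2, 0.3, rng)  # d = 1000
```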
1.3 Point query

Problem definition

Given a data stream $\langle a_i \rangle$, $i \in [1,m]$, $a_i \in [1,n]$, with frequency vector $\langle f_i \rangle$, $i \in [1,n]$, $f_i \in [1,m]$. Estimate the number of occurrences of any element in the stream.
Preliminaries

- Norm: $l_p = \|x\|_p = (\sum_i |x_i|^p)^{\frac{1}{p}}$
- $l_p$ point query (frequency estimation): for a given data stream $\sigma$ and element $a_i$, output $\hat{f_i}$ satisfying $\hat{f_i} = f_i \pm \epsilon\|\textbf{f}\|_p$
- $\|x\|_1 \geq \|x\|_2 \geq ... \geq \|x\|_\infty$; the larger $p$ is, the more accurate the estimate
- $\|x\|_0$ is the number of distinct elements
- $\|x\|_1$ is the length of the stream
- $\|x\|_\infty$ is the maximum frequency
Misra-Gries algorithm

Maintain a set $A$ whose elements are pairs $(i, \hat{f_i})$:

- $A \leftarrow \emptyset$
- For each element $e$ in the data stream:
  - if $e \in A$: $(e, \hat{f_e}) \rightarrow (e, \hat{f_e}+1)$
  - else if $|A| < \frac{1}{\epsilon}$: insert $(e,1)$ into $A$
  - else:
    - decrement all counts in $A$ by 1
    - if $\hat{f_j} = 0$: remove $(j,0)$ from $A$
- For a query $i$: if $i \in A$, return $\hat{f_i}$; otherwise return 0
Proof goal

For any query $i$, the returned $\hat{f_i}$ satisfies $f_i - \epsilon m \leq \hat{f_i} \leq f_i$.

Proof

There are two cases:

- If no decrement ever occurs, then $\hat{f_i} = f_i$.
- If decrements occur, then $\hat{f_i} < f_i$. Suppose $c$ rounds of decrementing occur; each round removes $\frac{1}{\epsilon}$ from the total count, so $\frac{c}{\epsilon} \leq m$, i.e. $c \leq \epsilon m$. Each counter decreases by at most $c$, hence $\hat{f_i} \geq f_i - c \geq f_i - \epsilon m$.

The space cost of the algorithm is $O(\epsilon^{-1}\log n)$.
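A minimal Python version of the Misra-Gries algorithm above; the toy stream and the choice $k = 1/\epsilon = 4$ are assumptions of this demo.

```python
def misra_gries(stream, k):
    """Misra-Gries summary with k = 1/eps counters: returns
    underestimates satisfying f_i - m/k <= est <= f_i."""
    counters = {}
    for e in stream:
        if e in counters:
            counters[e] += 1
        elif len(counters) < k:
            counters[e] = 1
        else:
            # decrement every counter; drop those that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a"] * 50 + ["b"] * 30 + list("cdefghij") * 2  # m = 96
summary = misra_gries(stream, k=4)
```

Querying an absent key returns 0, matching the last step of the algorithm.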
Metwally algorithm

Maintain a set $A$ whose elements are pairs $(i, \hat{f_i})$:

- $A \leftarrow \emptyset$
- For each element $e$ in the data stream:
  - if $e \in A$: $(e, \hat{f_e}) \leftarrow (e, \hat{f_e}+1)$
  - else if $|A| < \frac{1}{\epsilon}$: insert $(e,1)$ into $A$
  - else: insert $(e, MIN+1)$ into $A$ and delete one element satisfying $\hat{f_j} = MIN$, where $MIN$ is the smallest count in $A$
- For a query $i$: if $i \in A$, return $\hat{f_i}$; otherwise return $MIN$
Proof goal

For any query $i$, the returned value satisfies $f_i \leq \hat{f_i} \leq f_i + \epsilon m$.

Proof

① If no deletion occurs, then $\hat{f_i} = f_i$.

② If deletion occurs, the true count of a deleted element never exceeds the value of $MIN$ at deletion time, so $\hat{f_i} \geq f_i$. The counts in $A$ always sum to $m$, so $MIN \cdot \frac{1}{\epsilon} \leq m \Rightarrow MIN \leq \epsilon m$; each count exceeds the true value by at most $MIN$, hence $\hat{f_i} \leq f_i + \epsilon m$.

The space cost of the algorithm is $O(\epsilon^{-1}\log n)$.
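A minimal Python version of Metwally's algorithm above; the toy stream and $k = 1/\epsilon = 4$ are assumptions of this demo.

```python
def metwally(stream, k):
    """Counter-based summary with k = 1/eps counters: returns
    overestimates satisfying f_i <= est <= f_i + m/k."""
    counters = {}
    for e in stream:
        if e in counters:
            counters[e] += 1
        elif len(counters) < k:
            counters[e] = 1
        else:
            # evict an element holding the minimum count; the newcomer
            # inherits MIN + 1
            victim = min(counters, key=counters.get)
            mn = counters.pop(victim)
            counters[e] = mn + 1
    return counters

stream = ["a"] * 50 + ["b"] * 30 + list("cdefghij") * 2  # m = 96
summary = metwally(stream, k=4)
```

Unlike Misra-Gries, the counts here never undershoot the true frequency.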
New definitions

Sketch

A data structure $DS(\sigma)$ defined over a data stream $\sigma$ is a Sketch if there exists a space-efficient merge algorithm COMB such that $COMB(DS(\sigma_1), DS(\sigma_2)) = DS(\sigma_1 \circ \sigma_2)$, where $\circ$ denotes stream concatenation.

Linear Sketch

A sketching output $sk(\sigma)$ over a data stream $\sigma$ on $[n]$ is a Linear Sketch if $sk(\sigma)$ is a vector of dimension $l = l(n)$ and is a linear function of $f(\sigma)$; $l$ is the dimension of the sketch.
Count-Min algorithm

- $C[1...t][1...k] \leftarrow \textbf{0}$, $k = \frac{2}{\epsilon}$, $t = \lceil \log\frac{1}{\delta} \rceil$
- Randomly select $t$ 2-wise independent hash functions $h_i:[n] \rightarrow [k]$
- For each update $(j,c)$:
  - for $i = 1$ to $t$: $C[i][h_i(j)] = C[i][h_i(j)] + c$
- For a query $a$, return $\hat{f_a} = \min_{1 \leq i \leq t} C[i][h_i(a)]$
Algorithm explanation

At the start, construct an empty array of $t$ rows and $k$ columns. Each row can be regarded as independent, so the algorithm effectively maintains $t$ such arrays simultaneously. For each stream item, every row is updated once, with the column index given by the item's hash value.

Collisions may occur during execution: two different items may share a hash value, which biases the result upward. Because there are $t$ independent repetitions, taking the minimum partially compensates for this.

Proof

With probability $1-\delta$, the algorithm gives a $(1+\epsilon)$ approximation for the $l_1$ point query problem.

Evaluation

The space cost is $O(\frac{1}{\epsilon}\log\frac{1}{\delta}(\log n + \log m))$.
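A minimal Python version of the Count-Min sketch above; the linear hash family $(ax+b) \bmod p \bmod k$ is an assumption of this demo.

```python
import math
import random

class CountMin:
    """t x k counter array; query returns the minimum over rows,
    which never underestimates the true frequency."""
    def __init__(self, eps, delta, rng):
        self.k = math.ceil(2 / eps)
        self.t = math.ceil(math.log(1 / delta, 2))
        self.p = 2**31 - 1
        self.hashes = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(self.t)]
        self.C = [[0] * self.k for _ in range(self.t)]
    def _h(self, i, x):
        a, b = self.hashes[i]
        return ((a * x + b) % self.p) % self.k
    def update(self, j, c=1):
        for i in range(self.t):
            self.C[i][self._h(i, j)] += c
    def query(self, a):
        return min(self.C[i][self._h(i, a)] for i in range(self.t))

cm = CountMin(eps=0.05, delta=0.01, rng=random.Random(0))
for j in [7] * 100 + list(range(1000)):  # 7 appears 101 times in total
    cm.update(j)
```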
Count-Median algorithm

- $C[1...t][1...k] \leftarrow \textbf{0}$, $k = \frac{2}{\epsilon}$, $t = \lceil \log\frac{1}{\delta} \rceil$
- Randomly select $t$ 2-wise independent hash functions $h_i:[n] \rightarrow [k]$
- For each update $(j,c)$:
  - for $i = 1$ to $t$: $C[i][h_i(j)] = C[i][h_i(j)] + c$
- For a query $a$, let $|C[x][h_x(a)]| = median_{1 \leq i \leq t}|C[i][h_i(a)]|$
- Return $\hat{f_a} = |C[x][h_x(a)]|$

Algorithm explanation

The counting procedure is exactly the same as Count-Min; the difference lies in how the return value is obtained: it returns the value corresponding to the median of the absolute values over all $t$ rows.
Count Sketch algorithm

- $C[1...k] \leftarrow \textbf{0}$, $k = \frac{3}{\epsilon^2}$
- Randomly select one 2-wise independent hash function $h:[n] \rightarrow [k]$
- Randomly select one 2-wise independent hash function $g:[n] \rightarrow \{-1,1\}$
- For each update $(j,c)$: $C[h(j)] = C[h(j)] + c \cdot g(j)$
- For a query $a$, return $\hat{f} = g(a) \cdot C[h(a)]$
Count Sketch+ algorithm

- $C[1...t][1...k] \leftarrow \textbf{0}$, $k = \frac{3}{\epsilon^2}$, $t = O(\log\frac{1}{\delta})$
- Randomly select $t$ 2-wise independent hash functions $h_i:[n] \rightarrow [k]$
- Randomly select $t$ 2-wise independent hash functions $g_i:[n] \rightarrow \{-1,1\}$
- For each update $(j,c)$:
  - for $i = 1$ to $t$: $C[i][h_i(j)] = C[i][h_i(j)] + c \cdot g_i(j)$
- Return $\hat{f} = median_{1 \leq i \leq t}\ g_i(a)C[i][h_i(a)]$

Algorithm explanation

This is equivalent to running the Count Sketch algorithm $t$ times and taking the median. It can be handled with the Chernoff bound: let $X_i = 1 \Leftrightarrow$ the $i$-th run succeeds; the success probability is 2/3, and it suffices that more than half of the runs succeed. Ultimately $\delta$ is controlled through $t$.
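A minimal Python version of the Count Sketch+ algorithm above; the linear hash family (and its parity trick for the $\pm1$ hash $g_i$) is an assumption of this demo.

```python
import random
from statistics import median

class CountSketch:
    """t rows, each with a bucket hash h_i and a sign hash g_i;
    the query takes the median of the per-row signed estimates."""
    def __init__(self, k, t, rng):
        self.k, self.t, self.p = k, t, 2**31 - 1
        self.hh = [(rng.randrange(1, self.p), rng.randrange(self.p))
                   for _ in range(t)]
        self.gg = [(rng.randrange(1, self.p), rng.randrange(self.p))
                   for _ in range(t)]
        self.C = [[0] * k for _ in range(t)]
    def _h(self, i, x):
        a, b = self.hh[i]
        return ((a * x + b) % self.p) % self.k
    def _g(self, i, x):
        a, b = self.gg[i]
        return 1 if ((a * x + b) % self.p) % 2 == 0 else -1
    def update(self, j, c=1):
        for i in range(self.t):
            self.C[i][self._h(i, j)] += c * self._g(i, j)
    def query(self, a):
        return median(self._g(i, a) * self.C[i][self._h(i, a)]
                      for i in range(self.t))

cs = CountSketch(k=300, t=9, rng=random.Random(1))
for _ in range(200):
    cs.update(5)               # heavy item, f_5 = 200
for j in range(1000, 2000):
    cs.update(j)               # 1000 light items
```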
1.4 Frequency moment estimation

Basic AMS algorithm

- $(m,r,a) \leftarrow (0,0,0)$
- For each update $j$:
  - $m \leftarrow m+1$
  - $\beta \leftarrow$ random bit with $P[\beta = 1] = \frac{1}{m}$
  - if $\beta == 1$: $a=j$, $r=0$
  - if $j == a$: $r \leftarrow r+1$
- Return $X = m(r^k - (r-1)^k)$
Analysis

$E[X] = F_k$

$Var[X] \leq kn^{1-\frac{1}{k}}F_k^2$

Checking the result with Chebyshev's inequality: $P[|X - E[X]| > \epsilon E[X]] < \frac{kn^{1-\frac{1}{k}}}{\epsilon^2}$

Evaluation

- Storing $m$ and $r$ requires $\log m$ bits
- Storing $a$ requires $\log n$ bits

The variance of the algorithm is too large and needs further improvement, but its storage cost is only $O(\log m + \log n)$.
Final AMS algorithm

- Apply the median-of-means technique to the Basic AMS algorithm
- Compute $t = c\log\frac{1}{\delta}$ averages, each of which is the mean of $r = \frac{3k}{\epsilon^2}n^{1-\frac{1}{k}}$ calls
- Return the median of the $t$ values

Proof

Let $\{X_{ij}\}_{i \in [t], j \in [r]}$ be a set of random variables, independent and identically distributed with $X$.

$Y_i = \frac{1}{r}\sum_{j=1}^{r}X_{ij}$

$Z = median_{i \in [t]}\ Y_i$

Compute $E[Y_i] = E[X]$, $Var[Y_i] = \frac{Var[X]}{r}$

By Chebyshev's inequality, $P[|Y_i - E[Y_i]| \geq \epsilon E[Y_i]] \leq \frac{Var[X]}{r\epsilon^2 E[X]^2}$

Taking $r = \frac{3Var[X]}{\epsilon^2 E[X]^2}$ and substituting the expectation and variance yields the value of $r$ given in the algorithm.

Now $P[|Y_i - E[Y_i]| \geq \epsilon E[Y_i]] \leq \frac{1}{3}$

Finally, apply the median technique.

Evaluation

The space cost is $O(\frac{1}{\epsilon^2}\log\frac{1}{\delta}kn^{1-\frac{1}{k}}(\log m + \log n))$; when $k \geq 2$, the cost is too high.
Basic F2 AMS algorithm

- Randomly select a 4-wise independent hash function $h:[n] \rightarrow \{-1,1\}$
- $x \leftarrow 0$
- For each update $(j,c)$: $x \leftarrow x + c \cdot h(j)$
- Return $x^2$

Analysis

This algorithm alone does not give a good guarantee, but after optimization with the median-of-means technique an $(\epsilon,\delta)$ algorithm is obtained. So the proof here only needs to compute the expectation and variance of the result.

Proof

Let $Y_j = h(j)$, $j \in [1,n]$, and $X = \sum_{j=1}^{n}f_jY_j$

$E[X^2] = E[\sum_{i=1}^{n}f_iY_i \times \sum_{j=1}^{n}f_jY_j] = \sum_{j=1}^{n}f_j^2 = F_2$, since $E[Y_iY_j] = 0$ for $i \neq j$

$Var[X^2] = E[X^4] - E[X^2]^2$

Evaluation

The space cost is $O(\log m + \log n)$. Below, the mean is first used to limit the error probability to 1/3, and then the median technique optimizes the result.
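A minimal Python sketch of the Basic F2 AMS (tug-of-war) estimator above, wrapped in median-of-means. For brevity the $\pm1$ hash is simulated with fresh random signs per run instead of a 4-wise independent family (an assumption of this demo).

```python
import random
from statistics import median

def f2_estimate(freqs, t, r, rng):
    """Median of t means, each averaging r runs of the x^2 estimator."""
    def one_run():
        x = 0
        for f in freqs:
            sign = 1 if rng.random() < 0.5 else -1  # h(j) in {-1, +1}
            x += f * sign
        return x * x  # unbiased estimate of F2 = sum of f_j^2
    means = [sum(one_run() for _ in range(r)) / r for _ in range(t)]
    return median(means)

rng = random.Random(7)
freqs = [5, 3, 2, 1, 1]  # F2 = 25 + 9 + 4 + 1 + 1 = 40
est = f2_estimate(freqs, t=9, r=50, rng=rng)
```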
1.5 Fixed-size sampling

Reservoir sampling algorithm

- $m \leftarrow 0$
- Initialize the sample array $A[1,...,s]$ with the first $s$ elements of the stream, $m \leftarrow s$
- For each subsequent element $x$:
  - with probability $\frac{s}{m+1}$, $x$ randomly replaces one element of $A$
  - $m{+}{+}$

Proof

Suppose $n$ elements have flowed past and the reservoir size is $s$.

Consider the typical case: the $j$-th element enters the reservoir and is never replaced afterwards; then after the $n$-th element has passed, the probability that this element is still in the reservoir is $s/n$.

Calculation: the probability of being selected is $s/j$. To remain in the reservoir at step $j+1$, one of two things must happen: the new element is not selected, or it is selected but does not replace this element. Together these have probability $(1 - \frac{s}{j+1}) + \frac{s}{j+1}\cdot\frac{s-1}{s} = \frac{j}{j+1}$. Telescoping over all later steps gives the final result $s/n$.
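A minimal Python version of reservoir sampling as described above; the empirical uniformity check is an assumption of this demo.

```python
import random

def reservoir_sample(stream, s, rng):
    """Every element ends up in the sample with equal probability s/n."""
    A = []
    for m, x in enumerate(stream, start=1):
        if m <= s:
            A.append(x)                 # fill the reservoir
        elif rng.random() < s / m:      # keep x with probability s/m
            A[rng.randrange(s)] = x     # replace a uniform slot
    return A

rng = random.Random(3)
# Empirically: element 0 of range(100) should land in a size-10 sample
# about 10% of the time.
hits = sum(0 in reservoir_sample(range(100), 10, rng) for _ in range(2000))
```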
1.6 Bloom Filter

Given a data set U and a subset S drawn from it, and given a query q ∈ U, decide whether q ∈ S.

Approximate hashing method

- Let H be a family of universal hash functions $[U] \rightarrow [m]$, $m = \frac{n}{\delta}$
- Randomly select h ∈ H and maintain an array A[m]; the size of S is n
- For each i ∈ S: $A[h(i)] = 1$
- Given a query q, return yes if and only if $A[h(q)] = 1$

Proof

If q ∈ S, yes is returned. If q ∉ S, no should be returned, but yes is returned with a certain probability — this is the error case: the element is not in S, but its hash value collides with that of some element of S. $\sum_{j \in S}P[h(q) = h(j)] \leq \frac{n}{m} = \delta$. This determines the value of m and solves the approximate membership problem.
Bloom Filter method

- Let H be a family of independent ideal hash functions $[U] \rightarrow [m]$
- Randomly select $h_1,...,h_d \in H$ and maintain the array A[m]
- For every i ∈ S, for each j ∈ [1,d]: $A[h_j(i)] = 1$
- Given a query q, return yes if and only if $\forall j \in [d], A[h_j(q)] = 1$

Proof

The failure probability is $P \leq (\frac{n}{m})^d = \delta$, so the final cost is $m = O(n\log\frac{1}{\delta})$
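A minimal Python Bloom filter following the method above. The $d$ hash functions are derived from salted MD5 digests — a stand-in for an ideal independent hash family (an assumption of this demo).

```python
import hashlib
import random

class BloomFilter:
    """m-bit array with d salted hash functions; no false negatives,
    false positives with small probability."""
    def __init__(self, m, d, rng):
        self.m = m
        self.salts = [rng.randrange(2**30) for _ in range(d)]
        self.A = [0] * m
    def _idx(self, x, salt):
        digest = hashlib.md5(f"{salt}:{x}".encode()).hexdigest()
        return int(digest, 16) % self.m
    def add(self, x):
        for s in self.salts:
            self.A[self._idx(x, s)] = 1
    def query(self, x):
        return all(self.A[self._idx(x, s)] for s in self.salts)

bf = BloomFilter(m=1000, d=5, rng=random.Random(9))
for word in ["apple", "banana", "cherry"]:
    bf.add(word)
```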
2 Sublinear Time Algorithms for Big Data

2.1 Computing the average degree of a graph, algorithm I

Definition

Known: $G=(V,E)$

Find: the average degree $\bar{d} = \frac{\sum_{u\in V}d(u)}{n}$

Assumption: $G$ is a simple graph, with no parallel edges or self-loops
Analysis

Group nodes with similar or identical degrees, then estimate the average degree of each group.

First put all nodes into $t$ buckets, where bucket $i$ contains the node set $B_i=\{v \mid (1+\beta)^{i-1} < d(v) < (1+\beta)^{i}\}$, $0 < i \leq t-1$, and $\beta$ is a hyperparameter.

The total degree of the nodes in $B_i$ is then bounded as: $(1+\beta)^{i-1}|B_i| < d(B_i) < (1+\beta)^{i}|B_i|$

Further, the total degree of $G$ can be bounded as: $\sum_{i=0}^{t-1}(1+\beta)^{i-1}|B_i| < \sum_{u\in V}d(u) < \sum_{i=0}^{t-1}(1+\beta)^{i}|B_i|$

Hence: $\frac{\sum_{i=0}^{t-1}(1+\beta)^{i-1}|B_i|}{n} < \bar{d} < \frac{\sum_{i=0}^{t-1}(1+\beta)^{i}|B_i|}{n}$

So the problem is reduced to estimating $\frac{|B_i|}{n}$.
Algorithm

- Take a sample set $S$ from $V$
- $S_i \gets S \cap B_i$
- $\rho_i \gets \frac{|S_i|}{|S|}$
- Return $\hat{\bar{d}} = \sum_{i=0}^{t-1}\rho_i(1+\beta)^{i}$

The idea of the algorithm is simple: interpret $\frac{|B_i|}{n}$ as a probability — the probability that a uniformly random node belongs to $B_i$ — which makes the algorithm easy to understand.

Evaluation

The idea and computation are simple, and the bucketing transformation is clever, but this algorithm is still problematic.
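A minimal Python sketch of the bucket-based estimator above; the degree sequence used in the demo is an assumption.

```python
import math
import random

def estimate_avg_degree(deg, sample_size, beta, rng):
    """Sample nodes, measure the fraction rho_i landing in each bucket
    B_i = {v : (1+beta)^(i-1) < d(v) <= (1+beta)^i}, and return
    sum of rho_i * (1+beta)^i."""
    n = len(deg)
    t = math.ceil(math.log(n, 1 + beta)) + 1
    counts = [0] * t
    for _ in range(sample_size):
        v = rng.randrange(n)
        if deg[v] > 0:
            i = math.ceil(math.log(deg[v], 1 + beta))  # bucket index
            counts[min(i, t - 1)] += 1
    return sum(c / sample_size * (1 + beta)**i
               for i, c in enumerate(counts))

rng = random.Random(0)
deg = [4] * 1000  # every node has degree 4, so d-bar = 4
est = estimate_avg_degree(deg, 2000, 0.05, rng)
```

With every node in one bucket, the estimate is $(1+\beta)^i$ for that bucket, close to 4.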
2.2 Computing the average degree of a graph, algorithm II

Improved algorithm

For buckets whose sampled fraction $\rho_i$ is too small, set $\rho_i$ to zero; assume a lower bound $\alpha$ on $\bar{d}$.

- Draw a sample $S$ from $V$
- $S_i \gets S \cap B_i$
- for $i \in \{0,\dots,t-1\}$ do
  - if $|S_i| \geq \theta_\rho$ then
    - $\rho_i \gets \frac{|S_i|}{|S|}$
  - else
    - $\rho_i \gets 0$
- return $\hat{\bar{d}} = \sum_{i=0}^{t-1}\rho_i(1+\beta)^{i}$

We adjust the guarantee of the algorithm to: with probability 2/3, $(0.5-\epsilon)\bar{d} < \hat{\bar{d}} < (1+\epsilon)\bar{d}$.

Here is a set of parameters that achieves the above result:

- $\beta = \frac{\epsilon}{4}$
- $|S| = \Theta(\sqrt{\frac{n}{\alpha}}\cdot poly(\log n, 1/\epsilon))$
- $t = \lceil \log_{(1+\beta)}n \rceil + 1$
- $\theta_\rho = \frac{1}{t}\sqrt{\frac{3}{8}\cdot\frac{\epsilon\alpha}{n}}\cdot|S|$
2.3 Computing the average degree of a graph, algorithm III

Idea for improving the algorithm

We attribute the error of the algorithm to edges; let us see which edges cause it. Divide the nodes into two parts, $U$ and $V \setminus U$, where $U$ contains the nodes of small degree and $V \setminus U$ the nodes of large degree, and let $E(U, V \setminus U)$ denote the set of edges connecting the two sets. We assert that the error arises because each edge in $E(U, V \setminus U)$ is counted only once; recalling the earlier example makes this easy to see. So at each sampling step we only need to estimate the proportion of these edges.

Improved algorithm

Using a $(1+\epsilon)$ estimate of $E[\Delta_i]$, we can obtain $\Delta_i\rho_i(1+\beta)^i$ as a $(1+\epsilon)(1+\beta)$ estimate of $\frac{T_i}{n}$. The modified algorithm is as follows:

- Draw a sample $S$ from $V$, with $|S| = \tilde{O}(\frac{L}{\rho\epsilon^2})$, $L = poly(\frac{\log n}{\epsilon})$, $\rho = \frac{1}{t}\sqrt{\frac{\epsilon}{4}\cdot\frac{\alpha}{n}}$
- $S_i \gets S \cap B_i$
- for $i \in \{0,\dots,t-1\}$ do
  - if $|S_i| \geq \theta_\rho$ then
    - $\rho_i \gets \frac{|S_i|}{|S|}$
    - estimate $\Delta_i$
  - else
    - $\rho_i \gets 0$
- return $\hat{\bar{d}} = \sum_{i=0}^{t-1}(1+\Delta_i)\rho_i(1+\beta)^{i}$
2.4 Computing the average degree of a graph, algorithm IV

Alg III

Algorithm III from the previous section serves as the subroutine $AlgIII_{\sim\alpha}$ below, invoked with the current guess $\alpha$ for the lower bound on the average degree.

Alg IV

- $\alpha \gets n$
- $\hat{\bar{d}} \gets -\infty$
- while $\hat{\bar{d}} < \alpha$ do
  - $\alpha \gets \alpha/2$
  - if $\alpha < \frac{1}{n}$ then
    - return 0
  - $\hat{\bar{d}} \gets AlgIII_{\sim \alpha}$
- return $\hat{\bar{d}}$
Algorithm metrics

Approximation ratio: $(1+\epsilon)$

Running time: $\tilde{O}(\sqrt{n/\bar{d}})\cdot poly(\epsilon^{-1}\log n)$
2.5 Approximate minimum spanning tree

Definition

Known: $G=(V,E)$, $\epsilon$, $d = deg(G)$; the weight of edge $(u,v)$ is $w_{uv} \in \{1,2,\dots,w\} \cup \{\infty\}$

Find: $\hat{M}$ satisfying $(1-\epsilon)M \leq \hat{M} \leq (1+\epsilon)M$, where $M = \min_{T\ spans\ G}W(T)$

Analysis

Definitions:

- The subgraph of G: $G^{(i)}=(V,E^{(i)})$, where $E^{(i)}=\{(u,v) \mid w_{uv} \leq i\}$
- The number of connected components of $G^{(i)}$ is $C^{(i)}$

Theorem: expressed via the numbers of connected components of all such subgraphs, $M = n - w + \sum_{i=1}^{w-1}C^{(i)}$

Proof: Let a minimum spanning tree of $G$ be $MST=(V,E')$, and define its subgraph $MST^{(i)} = (V', E'^{(i)})$ with $V' = V$, $E'^{(i)}=\{(u,v) \mid (u,v) \in E^{(i)} \And (u,v) \in E'\}$

Let $\alpha_i$ be the number of edges of weight $i$ in the MST.

$\sum_{i>l}\alpha_i$ is the number of MST edges with weight greater than $l$. Adding 1 to this value gives the number of connected components of $MST^{(l)}$, which is obtained by removing those edges from the MST.

The number of connected components of $MST^{(l)}$ equals that of $G^{(l)}$, so we get: $\sum_{i>l}\alpha_i = C^{(l)} - 1$.

We can then compute $M$ as follows:

$$\begin{align*} M &= \sum_{i=1}^{w}{i \cdot \alpha_i}=\sum_{i=1}^{w}{\alpha_i} + \sum_{i=2}^{w}{\alpha_i} + \dots + \sum_{i=w}^{w}{\alpha_i}\\ &= C^{(0)}-1 + C^{(1)}-1 + \dots + C^{(w-1)}-1\\ &= n-1+C^{(1)}-1 + \dots + C^{(w-1)}-1\\ &= n-w+\sum_{i=1}^{w-1}{C^{(i)}} \end{align*}$$
Time complexity

Running the connected-components estimator $w$ times costs: $w\cdot O(d/\epsilon^3)=O(dw^4/\epsilon'^3)$
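The identity $M = n - w + \sum_{i=1}^{w-1}C^{(i)}$ can be checked directly. Here the components are counted exactly with union-find; the sublinear algorithm would only estimate each $C^{(i)}$.

```python
def mst_weight_via_components(n, edges, w):
    """MST weight via the component-counting identity above.
    edges is a list of (u, v, weight) with weights in {1, ..., w}."""
    def components(max_w):
        parent = list(range(n))
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        comp = n
        for u, v, wt in edges:
            if wt <= max_w:
                ru, rv = find(u), find(v)
                if ru != rv:
                    parent[ru] = rv
                    comp -= 1
        return comp
    return n - w + sum(components(i) for i in range(1, w))

# Path graph 0-1-2-3 with edge weights 1, 2, 2: MST weight is 5.
edges = [(0, 1, 1), (1, 2, 2), (2, 3, 2)]
```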
2.6 Finding the diameter of a point set

Definition:

Given $m$ points with pairwise distances stored in an adjacency matrix, $D_{ij}$ is the distance from point $i$ to point $j$; $D$ is a symmetric matrix and satisfies the triangle inequality $D_{ij} \leq D_{ik} + D_{kj}$.

Find: the pair $(i,j)$ maximizing $D_{ij}$; this $D_{ij}$ is the diameter of the set of $m$ points.

Algorithm: Indyk's Algorithm

- Pick any $k \in [1,m]$
- Select $l$ such that $\forall i, D_{ki} \leq D_{kl}$
- Return $(k,l), D_{kl}$

Analysis: approximation ratio

Denote the optimal solution $opt$, attained by the pair $(i,j)$.

Then $\frac{opt}{2} \leq D_{kl} \leq opt$.

Proof of the inequality

The right side of the inequality is immediate; the proof of the left side is as follows:

$$\begin{align*} opt = D_{ij} &\leq D_{ik} + D_{kj}\\ &\leq D_{kl} + D_{kl}\\ &= 2D_{kl} \end{align*}$$

Algorithm evaluation

The time complexity of the algorithm is $O(m) = O(\sqrt{n})$, where $n = m^2$ is the size of the distance matrix.
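A minimal Python version of Indyk's 2-approximation above; the points-on-a-line demo data is an assumption.

```python
def approx_diameter(D, k=0):
    """From an arbitrary anchor point k, return the farthest point l;
    opt/2 <= D[k][l] <= opt, in O(m) time versus O(m^2) exactly."""
    row = D[k]
    l = max(range(len(row)), key=lambda i: row[i])
    return (k, l), row[l]

# Points on a line at coordinates 0, 1, 4, 9; distance is |x - y|,
# so the true diameter is 9.
pts = [0, 1, 4, 9]
D = [[abs(x - y) for y in pts] for x in pts]
(k, l), est = approx_diameter(D, k=1)  # anchor at coordinate 1
```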
2.7 Find the number of connected components
definition
known:G = ( V , E ) , ϵ , d = deg ( G ) G = (V,E),\epsilon,d = deg(G)G=(V,E),ϵ ,d=d e g ( G ) , the graph G is represented by an adjacency list, where d represents the degree of the node with the largest degree among all nodes,∣ V ∣ = n , ∣ E ∣ = m ≤ d ⋅ n |V| = n,|E | = m \leq d\cdot n∣V∣=n,∣E∣=m≤d⋅n
求出:一个y,make C − ϵ ⋅ n ≤ y ≤ C + ϵ ⋅ n C - \epsilon \cdot n \leq y \leq C + \epsilon \cdot nC−ϵ⋅n≤y≤C+ϵ⋅n , where C is the standard solution obtained by using the linear algorithm
problem analysis
Record the number of nodes in the connected component to which vertex v belongs as nv n_vnv, A∈V is a point set of connected components, then there is the following equation relationship: ∑ u ∈ A 1 nu = ∑ u ∈ A 1 ∣ A ∣ = 1 \sum_{u \in A}{\frac{1} {n_u} } = \sum_{u \in A}{\frac{1}{|A|} } = 1∑u∈Anu1=∑u∈A∣A∣1=1 , this is easy to prove, because the nv n_vof each point in the same connected componentnvIt is the same, both are the reciprocal of the number of midpoints in the components. In this way, the final result C can be expressed as ∑ u ∈ V 1 nu \sum_{u\in V}{\frac{1}{n_u} }∑u∈Vnu1. So further, the estimation of C can be transformed into 1 nu \frac{1}{n_u}nu1estimate.
problem-solving thinking
n u n_u nuIt is very large, and it is difficult to calculate accurately, but at this time 1 nu \frac{1}{n_u}nu1Very small, can be replaced by a small constant 1 nu \frac{1}{n_u}nu1(0 value 2 \frac{\epsilon}{2}2ϵ), if we set ^ = min { nu , 2 ϵ } \hat{n_u} = min\{n_u,\frac{2}{\epsilon}\}nu^=my { nu,ϵ2},则 C ^ = ∑ u ∈ V 1 n u ^ \hat{C} = \sum_{u \in V}{\frac{1}{\hat{n_u} } } C^=∑u∈Vnu^1, using C ^ \hat{C}C^ to estimate C can get very good results.
Algorithm - Calculate nu ^ \hat{n_u}nu^
The idea is very simple, it is a small search, if the number of searched points is less than 2 ϵ \frac{2}{\epsilon}ϵ2Just continue to search, otherwise return directly to 2 ϵ \frac{2}{\epsilon}ϵ2。
The time complexity is O ( d ⋅ 1 ϵ ) O(d\cdot \frac{1}{\epsilon})O(d⋅ϵ1) , the larger d is, the longer it takes,1 ϵ \frac{1}{\epsilon}ϵ1The bigger it is, the longer it takes.
Algorithm - Calculate $\hat{C}$
Randomly select $r = b/\epsilon^2$ nodes from the node set to form a sample U and apply the previous algorithm to each of them, so the final estimate is $\hat{C} = \frac{n}{r} \sum_{u \in U}{\frac{1}{\hat{n_u}}}$. The time complexity is $O(d/\epsilon^3)$.
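The estimator can be sketched in a few lines of Python (a minimal in-memory sketch; `n_hat`, `estimate_components`, and the constant `b` are illustrative names, not from the source):

```python
import random

def n_hat(adj, u, cap):
    """Bounded search from u: stop as soon as `cap` nodes have been reached."""
    seen, frontier = {u}, [u]
    while frontier:
        v = frontier.pop()
        for nxt in adj[v]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
                if len(seen) >= cap:
                    return cap          # n_u is large, so 1/n_u is negligibly small
    return len(seen)                    # exact component size n_u

def estimate_components(adj, eps, b=4, rng=random):
    """Estimate the number of connected components of an adjacency-list graph."""
    n = len(adj)
    r = min(n, max(1, int(b / eps ** 2)))   # sample size r = b / eps^2
    cap = int(2 / eps)                      # cap n_u at 2/eps
    sample = [rng.randrange(n) for _ in range(r)]
    return n / r * sum(1 / n_hat(adj, u, cap) for u in sample)
```

For a graph of isolated vertices every sampled term is exactly 1, so the estimate equals n regardless of which nodes are sampled.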
3 Parallel Computing Algorithms
3.1 Basic Questions (1)
MapReduce
There are several points for calculations using the MR model:
- The calculation is carried out in rounds, and the input and output data of each round are in the form of <key, value>
- Each round is divided into: Map, Shuffle, Reduce
- Map: each record is processed by a `map` function, which outputs a new data set, i.e. transformed <key, value> pairs
- Shuffle: transparent to the programmer; all pairs output by Map are grouped by `key`, and pairs with the same `key` are delivered to the same `reducer`
- Reduce: input <k, v1, v2, …>, output a new data set
- For such a computing framework, our design goals are: fewer rounds, less memory, and greater parallelism
problem instance
Let's look at some concrete problem examples that can be parallelized. Of course, each has more than one possible implementation; below is just one method for each.
Build an inverted index
Definition : Given a set of documents, count which documents each word appears in
- Map function: $<docID,content> \rightarrow <word,docID>$; the map function splits the content and, for each separated word, outputs the word together with the corresponding docID.
- Reduce function: $<word,docID> \rightarrow <word,list\ of\ docID>$; after the map phase ends, shuffle automatically delivers the key-value pairs sharing a key to one machine, where we can directly collect this batch of docIDs together.
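The two functions can be simulated directly in Python (a single-machine sketch; the `docs` sample data and function names are illustrative, not from the source):

```python
from collections import defaultdict

def map_fn(doc_id, content):
    # <docID, content> -> one <word, docID> pair per word
    return [(word, doc_id) for word in content.split()]

def shuffle(pairs):
    # group all <key, value> pairs by key, as the framework would
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(word, doc_ids):
    # <word, docID, docID, ...> -> <word, sorted list of unique docIDs>
    return word, sorted(set(doc_ids))

docs = {1: "big data algorithms", 2: "data streams"}
pairs = [p for d, c in docs.items() for p in map_fn(d, c)]
index = dict(reduce_fn(w, ids) for w, ids in shuffle(pairs).items())
# index["data"] == [1, 2]
```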
word count
Definition : Given a set of documents, count the number of occurrences of each word
- Map function: <docID, content> → <word, 1>
- Reduce function: <word,1>→<word,count>
retrieve
Definition : Given line numbers and the corresponding document content, find the positions (line numbers) where a specified word appears
- Map function: omitted
- Reduce function: omitted
3.2 Basic Questions (2)
Introduction
Problem Example: Matrix Multiplication
It is plain matrix multiplication. The time complexity of ordinary matrix multiplication is O(m⋅n⋅d) when multiplying an m×d matrix by a d×n matrix. Through the parallel computing framework, we can greatly speed up this process. The two concrete methods are as follows.
Matrix multiplication 1
Definition : matrices A and B; (A,i,j) refers to the element in row i, column j of matrix A, but it is not the real element, rather an index; $a_{ij}$ denotes that element.
- Map:
  - $((A,i,j),a_{ij}) \rightarrow (j,(A,i,a_{ij}))$
  - $((B,j,k),b_{jk}) \rightarrow (j,(B,k,b_{jk}))$
- Reduce: $(j,(A,i,a_{ij})),(j,(B,k,b_{jk})) \rightarrow ((i,k),a_{ij}\cdot b_{jk})$
- Map: nothing (identity)
- Reduce: $((i,k),(v_1,v_2,\ldots)) \rightarrow ((i,k),\sum v_i)$
Analysis of ideas : Each element of the target matrix is obtained by accumulating d intermediate products ($a_{ij}\cdot b_{jk}$). We first produce all such products and then add them up; this holds for every element of the target matrix.
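Method 1 can be simulated in one Python function (an in-memory sketch of the two rounds; `matmul_mr` is an illustrative name):

```python
from collections import defaultdict

def matmul_mr(A, B):
    """Simulate the two MapReduce rounds of method 1 for C = A x B."""
    m, d, n = len(A), len(B), len(B[0])
    # Round 1 Map: key every entry of A and B by the shared index j.
    by_j = defaultdict(lambda: ([], []))
    for i in range(m):
        for j in range(d):
            by_j[j][0].append((i, A[i][j]))
    for j in range(d):
        for k in range(n):
            by_j[j][1].append((k, B[j][k]))
    # Round 1 Reduce: emit one partial product ((i,k), a_ij * b_jk) per pair.
    partials = [((i, k), a * b)
                for rows, cols in by_j.values()
                for i, a in rows for k, b in cols]
    # Round 2: identity Map, then Reduce sums the d partials per key (i,k).
    C = [[0] * n for _ in range(m)]
    for (i, k), v in partials:
        C[i][k] += v
    return C
```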
Matrix multiplication 2
- Map function:
  - $((A,i,j),a_{ij}) \rightarrow ((i,x),(A,j,a_{ij}))$ for all $x\in[1,n]$
  - $((B,j,k),b_{jk}) \rightarrow ((y,k),(B,j,b_{jk}))$ for all $y\in[1,m]$
- Reduce function: $((i,k),(A,j,a_{ij})) \wedge ((i,k),(B,j,b_{jk})) \rightarrow ((i,k),\sum_j a_{ij}\cdot b_{jk})$
Analysis of ideas : Compared with the first method, method 2 does not produce the intermediate products first; instead it gathers in one place the 2d basic elements needed for each target entry, and then performs the dot product there.
3.3 Sorting algorithm
Introduction to the problem
algorithm
Using p processors, the input is $<i, A[i]>$.
- Map: $<i,A[i]> \rightarrow <j,((i,A[i]),y)>$
  - Output $<i\%p,((i,A[i]),0)>$
  - With probability T/n, output $<j,((i,A[i]),1)>$ for all $j \in [0, p-1]$; otherwise output $<j,((i,A[i]),0)>$
- Reduce:
  - Collect the data with y=1 as S and sort it
  - Construct splitters $(s_1,s_2,...,s_{p-1})$, where $s_k$ is the $k\left\lceil \frac{|S|}{p} \right\rceil$-th element of S
  - Collect the data with y=0 as D
  - For each $(i,x)\in D$ satisfying $s_k < x \leq s_{k+1}$, output $<k,(i,x)>$
- Map: nothing (identity)
- Reduce: $<j, ((i, A[i]), \ldots)>$
  - Sort all received $(i, A[i])$ by $A[i]$ and output
Analysis of ideas : The overall idea is to divide the data into segments with a strict ordering between segments, so that the data inside each segment are close to one another; each segment can then be sorted with an efficient in-memory method. One key to the problem is therefore whether the data is divided well. It can be proved that, with high probability, the division is balanced. But this is not absolute: it is possible that one computing node receives almost no data while another receives far too much; of course, this is an extremely low-probability event.
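The whole pipeline can be simulated sequentially (a sketch; the sampling probability and random seed are arbitrary choices for illustration):

```python
import bisect
import random

def sample_sort(A, p, sample_prob=0.3, rng=random.Random(0)):
    """Simulate the splitter-based parallel sort: sample, split, sort buckets."""
    # "Map": with probability T/n (here sample_prob) an element is broadcast
    # as a flagged sample; all elements are also routed normally.
    S = sorted(x for x in A if rng.random() < sample_prob)
    # First "Reduce": pick p-1 evenly spaced splitters from the sorted sample.
    step = max(1, -(-len(S) // p))                      # ceil(|S| / p)
    splitters = [S[k * step] for k in range(1, p) if k * step < len(S)]
    # Route each element to the bucket (machine) its value falls into.
    buckets = [[] for _ in range(len(splitters) + 1)]
    for x in A:
        buckets[bisect.bisect_left(splitters, x)].append(x)
    # Second "Reduce": each machine sorts its own bucket; the concatenation
    # is globally sorted because the bucket ranges do not overlap.
    return [x for bucket in buckets for x in sorted(bucket)]
```

The concatenation is correct no matter how the splitters fall; the splitters only control how evenly the buckets are loaded.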
3.4 Calculate the minimum spanning tree (MST)
Introduction to the problem
Known graph G=(V, E), where V and E are the point set and edge set of the graph. A subgraph $T^*=(V,E')$ of G is called a minimum spanning tree if and only if $T^*$ is connected and $\sum_{(u,v)\in T^*}{w(u,v)} = \min_{T}\{\sum_{(u,v)\in T}{w(u,v)}\}$.
In the small data range, the Kruskal or Prim algorithm is usually used to solve it. When the scale of the graph becomes larger, due to the limitation of the complexity of these two algorithms, it will not be able to solve it within a limited time. At this time, MapReduce can be used to speed up the calculation.
Algorithm main idea
Using the graph partition algorithm, the graph G is divided into k subgraphs, and the minimum spanning tree is calculated in each subgraph, as follows.
- Divide the nodes into k parts; for each $(i,j)\in [k]^2$, let $G_{ij} = (V_i\cup V_j,E_{ij})$ be the subgraph induced by the node set $V_i\cup V_j$
- Solve $M_{ij}=MSF(G_{ij})$ separately on each $G_{ij}$
- Let $H = \cup_{i,j}M_{ij}$ and calculate $M=MST(H)$
Note: MSF refers to the minimum spanning forest. To illustrate the concept: a minimum spanning tree has n−1 edges and can be understood as a minimum spanning forest containing a single tree; similarly, the same graph also has minimum spanning forests with two trees, three trees, and so on, as long as the forest attains $\min\{\sum_{(u,v)\in Forest}w(u,v)\}$.
In essence, the algorithm first computes spanning trees locally, then uses the surviving edges connecting these spanning trees to form a new graph, and takes the minimum spanning tree of this new graph as the overall result.
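A small in-memory sketch of the partition scheme, using Kruskal as the MSF/MST subroutine (function names and the node-splitting rule are illustrative; edge weights are assumed distinct):

```python
import itertools

def kruskal(nodes, edges):
    """Minimum spanning forest via Kruskal with union-find.
    Edges are (weight, u, v) tuples."""
    parent = {v: v for v in nodes}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v
    forest = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((w, u, v))
    return forest

def partitioned_mst(nodes, edges, k):
    """MST via k-way node partition: MSF on every induced G_ij, then MST(H)."""
    parts = [set(nodes[i::k]) for i in range(k)]        # split nodes into k parts
    H = set()
    for Vi, Vj in itertools.combinations_with_replacement(parts, 2):
        Vij = Vi | Vj
        Eij = [e for e in edges if e[1] in Vij and e[2] in Vij]
        H.update(kruskal(Vij, Eij))                     # local MSF M_ij
    return kruskal(nodes, sorted(H))                    # MST of the union H
```

Every edge has both endpoints inside some $V_i \cup V_j$, so an edge dropped by every local MSF can never belong to the global MST, which is why MST(H) = MST(G).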
MR algorithm
- Map: input $<(u,v),NULL>$
  - Transform to $<(h(u),h(v));(u,v)>$
  - For the transformed data, if h(u)=h(v), then for all j∈[1,k] output $<(h(u),j);(u,v)>$
- Reduce: input $<(i,j);E_{ij}>$
  - Let $M_{ij}=MSF(G_{ij})$
  - Output $<NULL;(u,v)>$ for each edge e=(u,v) in $M_{ij}$
- Map: nothing (identity)
- Reduce: M=MST(H)
Here, h is a hash function with uniformly distributed values, which can be realized by a random algorithm (ps: use the random algorithm to generate the hash table once, rather than drawing fresh random values on every run).
Of course, the load balancing problem of each computing node should also be considered when dividing.
4 External memory model algorithm
4.1 External storage model
External memory model
So far, according to the storage model, the algorithm models we have studied divide into two types: one is the RAM model, the design model for our commonly used algorithms; the other is the I/O model, in which the memory is smaller than the amount of data and the external memory is unlimited.
There are some differences between external memory access and memory access:
- RAM is faster than external storage
- Continuous access to external memory is less costly than random access, that is to say: access in units of blocks
Fundamental Problems Based on the External Storage Model
In the I/O model, the size of the memory is M, the size of the external memory is unlimited, and data moves between them in pages (blocks) of size B.
Continuous reading of N data on the external storage requires O(N/B) I/O times
How to calculate matrix multiplication?
Input two matrices X and Y of size N×N
- Divide each matrix into blocks of size $\sqrt{M}/2\times\sqrt{M}/2$
- Considering each block of the product X×Y, there are clearly $O((\frac{N}{\sqrt{M}})^2)$ blocks to output in total
- Each output block needs to scan $\frac{N}{\sqrt{M}}$ pairs of input blocks
- Each in-memory computation requires O(M/B) I/Os
- In total $O((\frac{N}{\sqrt{M}})^3\cdot M/B)$ I/Os
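The tiling scheme can be sketched directly, counting how many input tiles are brought into "memory" (a simulation; the tile size `s` plays the role of $\sqrt{M}/2$, and `loads` counts tile transfers rather than real I/Os):

```python
def blocked_matmul(X, Y, s):
    """Multiply two N x N matrices in s x s tiles, counting tile loads."""
    N = len(X)
    Z = [[0] * N for _ in range(N)]
    loads = 0
    for i0 in range(0, N, s):              # each output tile of Z
        for k0 in range(0, N, s):
            for j0 in range(0, N, s):      # scan N/s pairs of input tiles
                loads += 2                 # one tile of X, one tile of Y
                for i in range(i0, min(i0 + s, N)):
                    for k in range(k0, min(k0 + s, N)):
                        Z[i][k] += sum(X[i][j] * Y[j][k]
                                       for j in range(j0, min(j0 + s, N)))
    return Z, loads
```

With s=1 and N=2 this loads $(N/s)^3 = 8$ tile pairs (16 tiles), matching the $O((N/\sqrt{M})^3)$ count above, with each load costing O(M/B) real I/Os.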
linked list
Perform three operations: insert(x,p), remove(p), traverse(p,k)
The time complexity of each operation under the memory model: update O(1), traverse O(k)
Under the external memory model, the consecutive elements in a linked list are placed in a block of size B. Also, let each block be at least B/2 in size:
- remove: If a block holds fewer than B/2 elements after deletion, it is merged with an adjacent block; if the merged block exceeds B, it is split evenly
- insert: If a block exceeds B after insertion, it is split evenly
- traverse:O(2k/B)
search structure
Perform three operations: insert(x), remove(x), query(x)
$(a,b)\text{-tree}$: $2 \leq a \leq (b+1)/2$
Similar to a binary search tree; each internal node stores $(p_0,k_1,...,k_c,p_c)$
The root node has 0 or ≥ 2 children; except the root, the number of children of each non-leaf node ∈ [a,b]:
- remove: Find the corresponding leaf node. If it has fewer than a children after deletion, merge it with an adjacent node; if the merged node has more than b children, split it evenly; then recursively delete nodes or adjust key values at the level above
- insert: Find the corresponding leaf node; if it has more than b children after insertion, split it evenly and propagate the insertion recursively to the level above
- query: $O(\log_a(N/a))$
4.2 External memory sorting
External memory sorting problem
When considering the external memory sorting algorithm, it should be closely combined with the external memory model.
algorithm
- Given N data, divide it into groups of size O(M)
- Each set of data can be sorted in memory
- Reading each set of data from external storage requires O(M/B) I/O times
- Perform the above operations on all groups, so each group contains sorted data
- Perform a multi-way merge sort on these sorted groups
- O(M/B) groups can be merged each time
process explanation
The first thing to understand is that when transferring data from external storage to memory, only B data items can move per I/O; therefore filling the whole memory once costs O(M/B) I/Os. Next, how many groups can a multi-way merge handle at once? One page is read from each group and then merged, so the limit has nothing to do with the size of each group, only with the size of the memory; hence it is O(M/B).
Illustration
evaluate
The time complexity is divided into two parts, one is intra-group sorting, and the other is inter-group merge sorting.
- For intra-group sorting, you only need to read the data of each group into memory once; the corresponding time cost of this part is $O(N/B)$.
- For merge sorting, the cost is the sum of the costs of the merge passes; each pass imports all the data into memory once, at a cost of $O(N/B)$, and the number of passes is $O(\log_{M/B}\frac{N}{B})$. To sum up, the total time cost is $O(N/B \cdot \log_{M/B}\frac{N}{B})$.
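The two phases can be simulated with Python lists standing in for external runs (`heapq.merge` plays the role of the multi-way merge; M and B are the memory and page sizes from the model):

```python
import heapq

def external_sort(data, M, B):
    """Sort runs of size M 'in memory', then merge at most M/B runs per pass."""
    runs = [sorted(data[i:i + M]) for i in range(0, len(data), M)]
    fan_in = max(2, M // B)            # at most M/B runs fit one page each
    while len(runs) > 1:               # repeated multi-way merge passes
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []
```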
4.3 List Ranking
List Ranking
definition
[List Ranking Problem] : Given a linked list L of size N, stored in an array (contiguous external storage space), compute the rank (position in the list) of each node.
[General List Ranking Problem] : Given a linked list L of size N, stored in an array (contiguous external storage space), where each node v stores a weight $w_v$, compute the rank of each node v (the sum of the weights from the head node to v).
analyze
- If a contiguous subsequence of the list is merged into a single node (e.g. a page is merged internally and the whole page is regarded as one node) whose weight is the sum of the merged weights, the rank values of the nodes before and after it are not affected.
- If the size of the linked list is at most M, then this problem can be solved by using O(M/B) I/O times, that is, all the data is read into the memory, and the memory algorithm is used to solve this problem.
Let's get an intuitive picture of this problem, as shown in the figure below, with N=10, B=2, M=4. In the worst case, following the pointers naively costs O(N) I/Os.
algorithm
Input the external memory linked list L of size N
- Find an independent set of vertices X in L
- "Skip" the nodes in X to build a new, smaller external memory linked list L'
- Solve L' recursively
- "Backfill" the nodes in X, and construct the rank of L according to the rank of L'
Steps 1-4 can each be solved with $O(sort) = O(\frac{N}{B}\log_{\frac{M}{B}}{\frac{N}{B}})$ I/Os, giving the recursion:
$T(N) = T((1-\alpha)N) + O(\frac{N}{B}\log_{\frac{M}{B}}{\frac{N}{B}})$, whose solution is $T(N) = O(\frac{N}{B}\log_{\frac{M}{B}}{\frac{N}{B}})$.
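The recursion itself (find an independent set, skip, recurse, backfill) can be sketched in memory; the real algorithm performs each step with external sorts, but the logic is the same (here `succ` maps every node to its successor, with None at the tail; names are illustrative):

```python
def list_rank(succ, w, head):
    """Rank(v) = total weight from head through v, via independent-set contraction."""
    if len(w) <= 2:                         # small list: just walk it
        ranks, v, total = {}, head, 0
        while v is not None:
            total += w[v]
            ranks[v] = total
            v = succ[v]
        return ranks
    pred = {s: v for v, s in succ.items() if s is not None}
    # Step 1: greedily pick an independent set X (head excluded, no two adjacent).
    X, blocked = [], set()
    for v in w:
        if v != head and v not in blocked:
            X.append(v)
            blocked.update(x for x in (v, pred.get(v), succ[v]) if x is not None)
    # Step 2: "skip" the nodes of X, pushing their weight onto their successors.
    succ2, w2 = dict(succ), dict(w)
    for v in X:
        p, s = pred[v], succ[v]
        succ2[p] = s
        if s is not None:
            w2[s] += w[v]
        del succ2[v], w2[v]
    # Step 3: solve the smaller list recursively.
    ranks = list_rank(succ2, w2, head)
    # Step 4: backfill the skipped nodes from their predecessors' ranks.
    for v in X:
        ranks[v] = ranks[pred[v]] + w[v]
    return ranks
```

Independence of X matters in step 2: since no two skipped nodes are adjacent, each predecessor and successor used there still exists in the contracted list.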
step~1
According to the ID order of the nodes, the list is split into forward (f) runs and backward (b) runs; the f runs are colored alternately red and blue, and the b runs alternately green and blue. Note that an f run or b run here is a contiguous segment of the list whose order of node IDs either fully agrees with their order in storage or is exactly the opposite. Obviously, a list can contain multiple f runs and b runs.
This step only requires one external memory sort, so its time cost is $O(\frac{N}{B}\log_{\frac{M}{B}}{\frac{N}{B}})$.
step~2,4
- L ~ = c o p y ( L ) \tilde{L} = copy(L) L~=copy(L)
- Sort the linked list L according to the address of the successor node, and perform the following operations while sorting
- In step~2, rewrite the node pointer and weight of the successor node belonging to X
- In step~4, rewrite the node pointer and weight of the successor nodes belonging to X, and write back the weights of the nodes belonging to X
- Reorder L by address