Big Data Algorithms: Key Points

1 Sublinear Space Algorithms for Big Data

Scenario: storing a number N in binary requires O(log N) bits of space.

Question: what should we do if N is very large and there are many such numbers to store?

Idea: give up some accuracy in order to save space.

Solution: use an approximate counting algorithm, so that storing each number only needs $O(\log\log N)$ space.

1.1 The Counting Problem in the Streaming Model

Problem definition

Given a data stream $\langle a_i \rangle$, $i \in [1,m]$, $a_i \in [1,n]$, with frequency vector $\langle f_i \rangle$, $i \in [1,n]$, $f_i \in [1,m]$.

Design an algorithm with $O(\log\log N)$ space complexity that counts how many elements $a_i$ have appeared (the length of the stream).

Morris algorithm

  1. Initialize X to 0
  2. Loop: each time an element $a_i$ appears, increase X by 1 with probability $1/2^X$
  3. Return $\hat{f} = 2^X - 1$
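A minimal Python sketch of the Morris counter described above (the class name and interface are illustrative, not from the source):

```python
import random

class MorrisCounter:
    """Approximate counter: stores only X ~ log2(count), i.e. O(log log N) bits."""

    def __init__(self):
        self.x = 0

    def update(self):
        # Increment X with probability 1 / 2^X.
        if random.random() < 1.0 / (2 ** self.x):
            self.x += 1

    def estimate(self):
        # Estimate of the number of updates seen so far.
        return (2 ** self.x) - 1
```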

Proof, using Chebyshev's inequality $P[|X - \mu| \geq \epsilon] \leq \sigma^2/\epsilon^2$:

Expectation: $E[2^{X_N} - 1] = N$

Variance: $var[2^{X_N} - 1] = \frac{1}{2}N^2 - \frac{1}{2}N$

Let $Y = 2^{X_N} - 1$ in Chebyshev's inequality.

Finally, $P[|Y - N| \geq \epsilon] \leq \frac{N^2 - N}{2\epsilon^2}$

Morris+ algorithm

  1. Run the Morris algorithm k times independently
  2. Record the results $(X_1, ..., X_k)$
  3. Return $\gamma = \frac{1}{k}\sum_{i=1}^{k}(2^{X_i} - 1)$

Proof

$E[\gamma] = N$

$var[\gamma] = \frac{N^2 - N}{2k}$

$P[|\gamma - N| \geq \epsilon] \leq \frac{N^2 - N}{2k\epsilon^2}$

Morris++ algorithm

  1. Repeat the Morris+ algorithm $m = O(\log(\frac{1}{\delta}))$ times
  2. Take the median of the m results

Proof

Let $X_i$ be the indicator that the i-th Morris+ run succeeds; choose k so that each run succeeds with probability 0.9, so $E[X_i] = 0.9$
$\mu = 0.9m$
$P[\sum X_i < 0.5m] < \delta$
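A sketch of the Morris+ / Morris++ boosting steps, building on the MorrisCounter sketch above (the parameters k and m are chosen by the caller; the function names are illustrative):

```python
import random
from statistics import median

def morris_plus(stream_len, k):
    """Average k independent Morris counters to reduce variance."""
    counters = [MorrisCounter() for _ in range(k)]
    for _ in range(stream_len):
        for c in counters:
            c.update()
    return sum(c.estimate() for c in counters) / k

def morris_plus_plus(stream_len, k, m):
    """Take the median of m independent Morris+ estimates to boost confidence."""
    return median(morris_plus(stream_len, k) for _ in range(m))
```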

1.2 Number of Distinct Elements

Problem definition

Given a data stream $\langle a_i \rangle$, $i \in [1,m]$, $a_i \in [1,n]$, with frequency vector $\langle f_i \rangle$, $i \in [1,n]$, $f_i \in [1,m]$. Compute the number of distinct elements, i.e., the number of indices i with $f_i \neq 0$.

FM algorithm

  1. Randomly select a hash function $h: [n] \mapsto [0,1]$ (uniform on [0,1])
  2. $z = 1$
  3. When an element i appears: $z = \min\{z, h(i)\}$
  4. Return $\frac{1}{z} - 1$

Proof

Expectation: $E[z] = \frac{1}{d+1}$, where d is the number of distinct elements

$var[z] \leq \frac{2}{(d+1)(d+2)} < \frac{2}{(d+1)^2}$

$P[|z - \frac{1}{d+1}| > \epsilon \frac{1}{d+1}] < \frac{var[z]}{(\frac{\epsilon}{d+1})^2} < \frac{2}{\epsilon^2}$

FM+ algorithm

  1. Run the FM algorithm q times in total
  2. For each run j, randomly pick a hash function $h_j: [n] \mapsto [0,1]$
  3. Initialize $z_j = 1$
  4. When an element i appears, update $z_j = \min(z_j, h_j(i))$
  5. $Z = \frac{1}{q}\sum_{j=1}^{q} z_j$
  6. Return $\frac{1}{Z} - 1$

Proof

$E[Z] = \frac{1}{d+1}$

$var[Z] \leq \frac{2}{(d+1)(d+2)} \cdot \frac{1}{q} < \frac{2}{(d+1)^2} \cdot \frac{1}{q}$

$P[|X - d| > \epsilon' d] < \frac{2}{q}(\frac{2}{\epsilon'} + 1)^2$

The cost is reduced to $O(\frac{1}{\epsilon^2}\log\frac{1}{\delta})$


FM'+ algorithm

  1. Randomly select a hash function $h: [n] \mapsto [0,1]$
  2. $(z_1, z_2, ..., z_k) = 1$, i.e., all z values are initialized to 1
  3. Maintain the k smallest hash values seen so far
  4. Return $\frac{k}{z_k}$

Proof

$P[|\frac{k}{z_k} - d| > \epsilon d] = P[\frac{k}{(1+\epsilon)d} > z_k] + P[\frac{k}{(1-\epsilon)d} < z_k]$

$P < \frac{2}{\epsilon^2 k}$

Preliminaries for the PracticalFM and BJKST algorithms

If we cannot store real numbers, we use the PracticalFM and BJKST algorithms instead.

A hash function mapping [a] to [b] is k-wise independent if for all $i_1, ..., i_k \in [a]$ and all $j_1, ..., j_k \in [b]$: $P[h(i_1) = j_1 \wedge ... \wedge h(i_k) = j_k] = \frac{1}{b^k}$

$zeros(h(j)) = \max\{i : h(j) \bmod 2^i = 0\}$, i.e., the number of trailing zeros of the binary expansion. For example, 8 is 1000 in binary, so zeros(8) = 3.

PracticalFM algorithm

  1. Randomly select h from a 2-wise independent hash function family, $h: [n] \to [n]$

  2. $z = 0$

  3. For each arriving element j, if $zeros(h(j)) > z$:

    $z = zeros(h(j))$

  4. Return $\hat{d} = 2^{z + \frac{1}{2}}$
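A small Python sketch of PracticalFM, using a random linear hash modulo a prime as a 2-wise independent family (the prime and the hash construction are assumptions for illustration, not from the source):

```python
import random

PRIME = (1 << 61) - 1  # a Mersenne prime, assumed larger than the universe size n

class PracticalFM:
    def __init__(self, n):
        self.n = n
        # h(x) = ((a*x + b) mod PRIME) mod n is (approximately) 2-wise independent.
        self.a = random.randrange(1, PRIME)
        self.b = random.randrange(PRIME)
        self.z = 0

    def _hash(self, x):
        return ((self.a * x + self.b) % PRIME) % self.n

    @staticmethod
    def _zeros(v):
        # Number of trailing zero bits of v (convention: zeros(0) = 0 here).
        c = 0
        while v > 0 and v % 2 == 0:
            v //= 2
            c += 1
        return c

    def update(self, j):
        z_j = self._zeros(self._hash(j))
        if z_j > self.z:
            self.z = z_j

    def estimate(self):
        return 2 ** (self.z + 0.5)
```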

Algorithm explanation

With probability at least $1 - \frac{2\sqrt{2}}{C}$, the estimate satisfies $d/C \leq \hat{d} \leq Cd$

Proof

Let $Y_r$ be the number of distinct elements j with $zeros(h(j)) \geq r$; then

$E[Y_r] = \frac{d}{2^r}$

$var[Y_r] \leq \frac{d}{2^r}$

The final success probability is at least $1 - \frac{2\sqrt{2}}{C}$

BJKST algorithm

  1. Randomly select a 2-wise independent hash function $h: [n] \to [n]$

  2. Randomly select a 2-wise independent hash function $g: [n] \to [b\epsilon^{-4}\log^2 n]$

  3. $z = 0$, $B = \emptyset$

  4. For each arriving element j, if $zeros(h(j)) > z$:

    1. $B = B \cup \{(g(j), zeros(h(j)))\}$
    2. If $|B| > \frac{c}{\epsilon^2}$:
      1. $z = z + 1$
      2. Remove from B every pair $(\alpha, \beta)$ with $\beta < z$
  5. Return $\hat{d} = |B| \cdot 2^z$

Algorithm explanation

To understand this algorithm, only two points need to be clear.

  • When a new element j appears:
    • If $zeros(h(j)) > z$:
      • Insert the corresponding pair into B;
      • If the number of elements in B exceeds the threshold:
        • Increase z by 1
        • Delete the pairs whose second component is less than z.
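A rough Python sketch of BJKST under the same 2-wise independent hashing assumption as above (the constant c and the width of g's range are illustrative choices, not values fixed by the source):

```python
import random

PRIME = (1 << 61) - 1  # assumed larger than the universe size n

def _zeros(v):
    """Number of trailing zero bits of v (convention: zeros(0) = 0)."""
    c = 0
    while v > 0 and v % 2 == 0:
        v //= 2
        c += 1
    return c

class BJKST:
    def __init__(self, n, eps, c=576):
        self.n = n
        self.cap = c / (eps ** 2)                     # size threshold for B
        self.g_range = max(1, int(eps ** -4) * (n.bit_length() ** 2))
        self.ha, self.hb = random.randrange(1, PRIME), random.randrange(PRIME)
        self.ga, self.gb = random.randrange(1, PRIME), random.randrange(PRIME)
        self.z = 0
        self.B = {}                                   # g(j) -> zeros(h(j))

    def _h(self, x):
        return ((self.ha * x + self.hb) % PRIME) % self.n

    def _g(self, x):
        return ((self.ga * x + self.gb) % PRIME) % self.g_range

    def update(self, j):
        zj = _zeros(self._h(j))
        if zj > self.z:
            self.B[self._g(j)] = zj
            while len(self.B) > self.cap:
                self.z += 1
                self.B = {k: v for k, v in self.B.items() if v >= self.z}

    def estimate(self):
        return len(self.B) * (2 ** self.z)
```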

Proof

The algorithm guarantees a (1+ϵ) approximation with probability at least 2/3.

$E[Y_r] = \frac{d}{2^r}$

$var[Y_r] \leq \frac{d}{2^r}$

Finally $P[\mathrm{FAIL}] < 1/6$

Combined with the error probability caused by the earlier algorithmic assumptions, the total error probability is within 1/3.

Evaluation

The space complexity is $O(\log n + \frac{1}{\epsilon^2}(\log\frac{1}{\epsilon} + \log\log n))$

1.3 Point Query

Problem definition

Given a data stream $\langle a_i \rangle$, $i \in [1,m]$, $a_i \in [1,n]$, with frequency vector $\langle f_i \rangle$, $i \in [1,n]$, $f_i \in [1,m]$. Estimate the number of occurrences of every element of the stream.

Preliminaries

  • Norm: $l_p = \| x \|_p = (\sum_{i}|x_i|^p)^{\frac{1}{p}}$
  • $l_p$ point query (frequency estimation):
    For a given data stream σ and query element $a_i$, output $\hat{f_i}$ satisfying $\hat{f_i} = f_i \pm \epsilon \|\mathbf{f}\|_p$
    $\|x\|_1 \geq \|x\|_2 \geq ... \geq \|x\|_{\infty}$; the larger p is, the more accurate the estimate
    $\|x\|_0$ is the number of distinct elements
    $\|x\|_1$ is the length of the stream
    $\|x\|_{\infty}$ is the maximum frequency

Misra-Gries algorithm

Maintain a set A whose elements are pairs $(i, \hat{f_i})$

  1. $A \gets \emptyset$

  2. For each element e in the data stream:

    if $e \in A$: $(e, \hat{f_e}) \to (e, \hat{f_e} + 1)$

    else if $|A| < \frac{1}{\epsilon}$: insert (e, 1) into A

    else:

    1. Decrement all counts in A by 1
    2. If $\hat{f_j} = 0$: remove (j, 0) from A
  3. For a query i: if $i \in A$, return $\hat{f_i}$; otherwise return 0
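A compact Python sketch of Misra-Gries with k = ceil(1/ϵ) counters (the function name and dictionary interface are illustrative):

```python
import math

def misra_gries(stream, eps):
    """Return a dict of estimates f_hat with f_i - eps*m <= f_hat[i] <= f_i."""
    k = math.ceil(1.0 / eps)          # at most k counters are kept
    counters = {}
    for e in stream:
        if e in counters:
            counters[e] += 1
        elif len(counters) < k:
            counters[e] = 1
        else:
            # Decrement every counter; drop the ones that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters                    # query i: counters.get(i, 0)
```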

Proof goal

For any query i, the returned $\hat{f_i}$ satisfies $f_i - \epsilon m \leq \hat{f_i} \leq f_i$

Proof

Following the algorithm, there are two cases.

If no decrement ever occurs, then $\hat{f_i} = f_i$.
If decrements occur, then $\hat{f_i} < f_i$.
Suppose decrements happen c times. Each decrement round reduces the total count by $\frac{1}{\epsilon}$, so $\frac{c}{\epsilon} \leq m$; each counter decreases by at most c, hence $\hat{f_i} \geq f_i - c \geq f_i - \epsilon m$.

The space cost of the algorithm is $O(\epsilon^{-1}\log n)$

Metwally algorithm

Maintain a set A whose elements are pairs $(i, \hat{f_i})$

  1. $A \gets \emptyset$
  2. For each element e in the data stream:
    1. if $e \in A$: $(e, \hat{f_e}) \gets (e, \hat{f_e} + 1)$
    2. else if $|A| < \frac{1}{\epsilon}$: insert (e, 1) into A
    3. else: insert (e, MIN + 1) into A and delete one pair with $\hat{f_e} = MIN$, where MIN is the smallest count in A
  3. For a query i: if $i \in A$, return $\hat{f_i}$; otherwise return MIN
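A minimal Python sketch of the Metwally (space-saving) counters; as with the Misra-Gries sketch, the interface is illustrative:

```python
import math

def metwally(stream, eps):
    """Return (counters, MIN): f_i <= estimate <= f_i + eps*m for every element."""
    k = math.ceil(1.0 / eps)
    counters = {}
    for e in stream:
        if e in counters:
            counters[e] += 1
        elif len(counters) < k:
            counters[e] = 1
        else:
            # Evict one element with the minimum count; the new element inherits MIN+1.
            victim = min(counters, key=counters.get)
            low = counters.pop(victim)
            counters[e] = low + 1
    low = min(counters.values()) if counters else 0
    return counters, low               # query i: counters.get(i, low)
```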

Proof goal

For any query i, the returned estimate satisfies $f_i \leq \hat{f_i} \leq f_i + \epsilon m$

Proof

① If no deletion occurs, then $\hat{f_i} = f_i$.

② If deletions occur, the true count of a deleted element is never larger than MIN at the time of deletion, so the estimates never undercount and $\hat{f_i} \geq f_i$. The counts in A always sum to m, so $MIN \cdot \frac{1}{\epsilon} \leq m \Rightarrow MIN \leq \epsilon m$; each counter exceeds the true value by at most MIN, hence $\hat{f_i} \leq f_i + \epsilon m$.

The space cost of the algorithm is $O(\epsilon^{-1}\log n)$


New definitions

Sketch

A data structure DS(σ) defined on a data stream σ is a Sketch if there exists a space-efficient merge algorithm COMB such that $COMB(DS(\sigma_1), DS(\sigma_2)) = DS(\sigma_1 \circ \sigma_2)$, where ∘ is the concatenation of data streams.

Linear Sketch

A sketching algorithm on data streams over [n] with output sk(σ) is a Linear Sketch if sk(σ) is a vector of dimension l = l(n) and is a linear function of f(σ); l is the dimension of the sketch.

Count-Min algorithm

  1. $C[1...t][1...k] \leftarrow \mathbf{0}$, $k = \frac{2}{\epsilon}$, $t = \lceil \log\frac{1}{\delta} \rceil$

  2. Randomly select t 2-wise independent hash functions $h_i: [n] \to [k]$

  3. For each update (j, c), do the following:

    for i = 1 to t:

    $C[i][h_i(j)] = C[i][h_i(j)] + c$

  4. For a query a, return $\hat{f_a} = \min_{1 \leq i \leq t} C[i][h_i(a)]$
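A small Python sketch of Count-Min under the same linear-hashing assumption used earlier (width k and depth t are derived from ϵ and δ as in the steps above):

```python
import math
import random

PRIME = (1 << 61) - 1

class CountMin:
    def __init__(self, eps, delta):
        self.k = math.ceil(2.0 / eps)                 # width
        self.t = math.ceil(math.log2(1.0 / delta))    # depth
        self.C = [[0] * self.k for _ in range(self.t)]
        self.hashes = [(random.randrange(1, PRIME), random.randrange(PRIME))
                       for _ in range(self.t)]

    def _h(self, i, x):
        a, b = self.hashes[i]
        return ((a * x + b) % PRIME) % self.k

    def update(self, j, c=1):
        for i in range(self.t):
            self.C[i][self._h(i, j)] += c

    def query(self, a):
        # Each row overestimates; the minimum is the tightest estimate.
        return min(self.C[i][self._h(i, a)] for i in range(self.t))
```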

Algorithm explanation

At the start, the algorithm constructs an empty t × k array. Each row can be viewed as independent, so the algorithm effectively maintains t such arrays at the same time. For each stream item, every row is updated once, with the item's hash value used as the column index.

Collisions may occur while the algorithm runs, i.e., two different stream items may have the same hash value, which biases the result upwards. Because there are effectively t independent repetitions, taking the minimum over the rows compensates for this.

Proof

With probability 1 − δ, the algorithm gives a (1 + ϵ) approximation for the $l_1$ point query problem.

Evaluation

The space cost of the algorithm is $O(\frac{1}{\epsilon}\log\frac{1}{\delta}(\log n + \log m))$

Count-Median algorithm

  1. $C[1...t][1...k] \leftarrow \mathbf{0}$, $k = \frac{2}{\epsilon}$, $t = \lceil \log\frac{1}{\delta} \rceil$

  2. Randomly select t 2-wise independent hash functions $h_i: [n] \to [k]$

  3. For each update (j, c), do the following:

    for i = 1 to t:

    $C[i][h_i(j)] = C[i][h_i(j)] + c$

  4. For a query a, let $|C[x][h_x(a)]| = \mathrm{median}_{1 \leq i \leq t}|C[i][h_i(a)]|$

  5. Return $\hat{f_a} = |C[x][h_x(a)]|$

Algorithm explanation

The counting procedure is exactly the same as in the Count-Min sketch; the difference lies in how the return value is obtained: it returns the original cell value corresponding to the median of the absolute values of the t array cells.

Count Sketch algorithm

  1. $C[1...k] \leftarrow 0$, $k = \frac{3}{\epsilon^2}$

  2. Randomly select one 2-wise independent hash function $h: [n] \to [k]$

  3. Randomly select one 2-wise independent hash function $g: [n] \to \{-1, 1\}$

  4. For each update (j, c):

    $C[h(j)] = C[h(j)] + c \cdot g(j)$

  5. For a query a, return $\hat{f} = g(a) \cdot C[h(a)]$
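A brief Python sketch of the single-row Count Sketch; the sign hash g is derived from the same linear hash family, mapped to ±1 (an illustrative construction, not prescribed by the source):

```python
import math
import random

PRIME = (1 << 61) - 1

class CountSketch:
    def __init__(self, eps):
        self.k = math.ceil(3.0 / eps ** 2)
        self.C = [0] * self.k
        self.ha, self.hb = random.randrange(1, PRIME), random.randrange(PRIME)
        self.ga, self.gb = random.randrange(1, PRIME), random.randrange(PRIME)

    def _h(self, x):
        return ((self.ha * x + self.hb) % PRIME) % self.k

    def _g(self, x):
        # Map the hash value to a sign in {-1, +1}.
        return 1 if ((self.ga * x + self.gb) % PRIME) % 2 == 0 else -1

    def update(self, j, c=1):
        self.C[self._h(j)] += c * self._g(j)

    def query(self, a):
        return self._g(a) * self.C[self._h(a)]
```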

Count Sketch+ algorithm

  1. $C[1...t][1...k] \leftarrow \mathbf{0}$, $k = \frac{3}{\epsilon^2}$, $t = O(\log\frac{1}{\delta})$

  2. Randomly select t 2-wise independent hash functions $h_i: [n] \to [k]$

  3. Randomly select t 2-wise independent hash functions $g_i: [n] \to \{-1, 1\}$

  4. For each update (j, c):

    for i = 1 to t:

    $C[i][h_i(j)] = C[i][h_i(j)] + c \cdot g_i(j)$

  5. Return $\hat{f} = \mathrm{median}_{1 \leq i \leq t}\ g_i(a) \cdot C[i][h_i(a)]$

Algorithm explanation

This is equivalent to running the Count Sketch algorithm t times and taking the median. The guarantee follows from the Chernoff bound: let $X_i = 1$ iff the i-th run succeeds; each run succeeds with probability 2/3, and it suffices that more than half of the runs succeed. δ is ultimately controlled through t.


1.4 Frequency Moment Estimation

Basic AMS algorithm

  1. $(m, r, a) \leftarrow (0, 0, 0)$
  2. For each update j:
    1. $m \leftarrow m + 1$
    2. $\beta \leftarrow$ a random bit with $P[\beta = 1] = \frac{1}{m}$
    3. if β == 1: a = j, r = 0
    4. if j == a: $r \leftarrow r + 1$
  3. Return $X = m(r^k - (r-1)^k)$

Analysis of the algorithm

$E[X] = F_k$

$Var[X] \leq k n^{1 - \frac{1}{k}} F_k^2$

Checking the result with Chebyshev's inequality: $P[|X - E[X]| > \epsilon E[X]] < \frac{k n^{1 - \frac{1}{k}}}{\epsilon^2}$

Evaluation

Storing m and r requires O(log m) bits; storing a requires O(log n) bits.
The variance of the algorithm is too large and needs further improvement, but its storage cost is $O(\log m + \log n)$

Final AMS algorithm

  1. Apply the median-of-means technique to the Basic AMS algorithm:
  2. Compute $t = c\log\frac{1}{\delta}$ averages, each the average of $r = \frac{3k}{\epsilon^2}n^{1 - \frac{1}{k}}$ independent calls
  3. Return the median of the t averages

Proof

$\{X_{ij}\}_{i \in [t], j \in [r]}$ is a set of random variables, independent and identically distributed as X.

$Y_i = \frac{1}{r}\sum_{j=1}^{r}X_{ij}$

$Z = \sum_{i=1}^{t}Y_i$

Compute $E[Y_i] = E[X]$, $Var[Y_i] = \frac{Var[X]}{r}$

By Chebyshev's inequality, $P[|Y_i - E[Y_i]| \geq \epsilon E[Y_i]] \leq \frac{Var[X]}{r\epsilon^2 E[X]^2}$

Taking $r = \frac{3Var[X]}{\epsilon^2 E[X]^2}$ and substituting the expectation and variance gives the value of r stated in the algorithm.

Now $P[|Y_i - E[Y_i]| \geq \epsilon E[Y_i]] \leq \frac{1}{3}$

Finally, apply the median technique.

Evaluation

The space cost of the algorithm is $O(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\, k\, n^{1 - \frac{1}{k}}(\log m + \log n))$; when k ≥ 2, this cost is too high.

Basic F2 AMS algorithm

  1. Randomly select a 4-wise independent hash function $h: [n] \to \{-1, 1\}$

  2. x ← 0

  3. For each update (j, c):

    $x \leftarrow x + c \cdot h(j)$

  4. Return $x^2$
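A short Python sketch of this basic F2 (tug-of-war) estimator, using a random degree-3 polynomial modulo a prime as a 4-wise independent sign function (an assumed construction for illustration):

```python
import random

PRIME = (1 << 61) - 1

class BasicF2AMS:
    def __init__(self):
        # A random degree-3 polynomial mod a prime gives a 4-wise independent family.
        self.coeffs = [random.randrange(PRIME) for _ in range(4)]
        self.x = 0

    def _sign(self, j):
        v = 0
        for c in self.coeffs:          # Horner evaluation of the polynomial at j
            v = (v * j + c) % PRIME
        return 1 if v % 2 == 0 else -1

    def update(self, j, c=1):
        self.x += c * self._sign(j)

    def estimate(self):
        return self.x ** 2
```

Averaging several independent copies and taking a median of the averages, as in the median-of-means discussion above, turns this into an (ϵ, δ) estimator.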

Analysis

This algorithm alone does not give a good guarantee, but after the median-of-means optimization an (ϵ, δ) algorithm is obtained. So the proof here only computes the expectation and variance of the estimator.

Proof

$Y_j = h(j)$, $j \in [1, n]$, $X = \sum_{j=1}^{n} f_j Y_j$

$E[X^2] = E[\sum_{i=1}^{n} f_i Y_i \cdot \sum_{j=1}^{n} f_j Y_j] = \sum_{j=1}^{n} f_j^2 = F_2$

$Var[X^2] = E[X^4] - E[X^2]^2$

Evaluation

The space cost of the algorithm is O(log m + log n). Averaging (the mean step) first bounds the error probability by 1/3, and the median technique then boosts the result.

1.5 Fixed-Size Sampling

Reservoir sampling algorithm

  1. m ← 0
  2. Initialize the sample array with the first s elements of the stream:
    $A[1,...,s]$, $m \leftarrow s$
  3. For each new element x:
    1. With probability $\frac{s}{m+1}$, x replaces a uniformly random element of A
    2. m++
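A straightforward Python sketch of reservoir sampling (the function name is illustrative):

```python
import random

def reservoir_sample(stream, s):
    """Return a uniform random sample of s elements from a stream of unknown length."""
    reservoir = []
    for m, x in enumerate(stream, start=1):
        if m <= s:
            reservoir.append(x)                   # fill the reservoir with the first s items
        elif random.random() < s / m:
            reservoir[random.randrange(s)] = x    # replace a uniformly random slot
    return reservoir
```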

Proof

Suppose n elements have streamed by and the reservoir size is s.

Consider the typical case: the j-th element entered the reservoir and was never replaced afterwards. Then, after the n-th element has passed, the probability that this element is still in the reservoir is s/n.

Computation: the probability of being selected into the reservoir is s/j. For the element to stay when a later element (say the (j+1)-th) arrives, there are two cases: the new element is not selected, or it is selected but does not replace this element. These two cases give $(1 - \frac{s}{j+1}) + \frac{s}{j+1} \cdot \frac{s-1}{s} = \frac{j}{j+1}$. Telescoping this over all subsequent elements gives the final result s/n.

1.6 Bloom Filter

Given a universe U and a subset S ⊆ U, and a query q ∈ U, decide whether q ∈ S.

Approximate hashing method

  1. Let H be a family of universal hash functions $[U] \to [m]$, $m = \frac{n}{\delta}$
  2. Randomly select h ∈ H and maintain an array A[m]; the size of S is n
  3. For each i ∈ S, set $A[h(i)] = 1$
  4. Given a query q, return yes if and only if $A[h(q)] = 1$

Proof

If q ∈ S, yes is returned. If q ∉ S, no should be returned, but yes is returned with some probability; this is the error case: the element is not in S, but its hash value is the same as the hash value of some element of S. $\sum_{j \in S}P[h(q) = h(j)] \leq \frac{n}{m} = \delta$, which determines the required value of m and yields the approximation guarantee.

Bloom Filter method

  1. Let H be a family of independent ideal hash functions [U] → [m]

  2. Randomly select $h_1, ..., h_d \in H$ and maintain the array A[m]

  3. For each i ∈ S:

    For each j ∈ [1, d]:

    $A[h_j(i)] = 1$

  4. Given a query q, return yes if and only if $\forall j \in [d], A[h_j(q)] = 1$

Proof

The probability of failure (a false positive) is $P \leq (\frac{n}{m})^d = \delta$, so the final cost is $m = O(n\log\frac{1}{\delta})$
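A simple Python sketch of a Bloom filter; the double-hashing trick used to derive d hash positions from two seeded hashes is an implementation convenience, not part of the source:

```python
import math
import random

class BloomFilter:
    def __init__(self, n, delta):
        self.d = max(1, math.ceil(math.log2(1.0 / delta)))   # number of hash functions
        self.m = max(1, n * self.d)                          # bit array of size ~ n log(1/delta)
        self.bits = [0] * self.m
        self.seed1 = random.randrange(1 << 30)
        self.seed2 = random.randrange(1 << 30)

    def _positions(self, item):
        h1 = hash((self.seed1, item))
        h2 = hash((self.seed2, item))
        return [(h1 + j * h2) % self.m for j in range(self.d)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def query(self, item):
        # "yes" may be a false positive; "no" is always correct.
        return all(self.bits[p] for p in self._positions(item))
```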

2 Sublinear Time Algorithms for Big Data

2.1 Computing the Average Degree of a Graph: Algorithm 1

Definition

Known: $G = (V, E)$
Find: the average degree $\bar{d} = \frac{\sum_{u \in V}d(u)}{n}$
Assumption: G is a simple graph, with no parallel edges and no self-loops

Analysis

Group vertices with similar or equal degrees, then estimate the average degree of each group.

First place all vertices into t buckets; the i-th bucket contains the vertex set $B_i = \{v \mid (1+\beta)^{i-1} < d(v) < (1+\beta)^{i}\}$, $0 < i \leq t-1$, where β is a parameter.

The total degree of the vertices in $B_i$ is then bounded as follows: $(1+\beta)^{i-1}|B_i| < d(B_i) < (1+\beta)^{i}|B_i|$

Summing over buckets, the total degree of G satisfies: $\sum_{i=0}^{t-1}(1+\beta)^{i-1}|B_i| < \sum_{u \in V}d(u) < \sum_{i=0}^{t-1}(1+\beta)^{i}|B_i|$

Dividing by n, we obtain: $\frac{\sum_{i=0}^{t-1}(1+\beta)^{i-1}|B_i|}{n} < \bar{d} < \frac{\sum_{i=0}^{t-1}(1+\beta)^{i}|B_i|}{n}$

So the problem is transformed into estimating $\frac{|B_i|}{n}$

Algorithm

  1. Take a sample set S from V
  2. $S_i \gets S \cap B_i$
  3. $\rho_i \gets \frac{|S_i|}{|S|}$
  4. Return $\hat{\bar{d}} = \sum_{i=0}^{t-1}\rho_i(1+\beta)^{i}$

The idea of the algorithm is very simple: interpret $\frac{|B_i|}{n}$ as a probability, namely the probability that a uniformly random vertex belongs to $B_i$; the algorithm estimates this probability by sampling. A sketch is shown below.
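A minimal Python sketch of this sampling estimator; the graph is assumed to be given as an adjacency-list dict, and the sample size is left to the caller (both assumptions for illustration):

```python
import math
import random

def estimate_avg_degree(adj, beta, sample_size):
    """adj: dict mapping each vertex to its list of neighbours."""
    vertices = list(adj)
    n = len(vertices)
    t = math.ceil(math.log(max(n, 2), 1 + beta)) + 1
    counts = [0] * t                              # samples falling into each bucket B_i
    sample = random.choices(vertices, k=sample_size)
    for v in sample:
        deg = len(adj[v])
        if deg > 0:
            i = min(t - 1, math.ceil(math.log(deg, 1 + beta)))
            counts[i] += 1
    return sum((counts[i] / sample_size) * (1 + beta) ** i for i in range(t))
```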

Evaluation

The idea and computation of the algorithm are very simple, and it uses a clever transformation, but this basic algorithm is still problematic: buckets containing few vertices may receive too few samples, which the following algorithms address.

2.2 Computing the Average Degree of a Graph: Algorithm 2

Improved algorithm

For buckets with too few samples, set $\rho_i$ to 0; assume a lower bound $\alpha$ on $\bar{d}$ is known.

  1. Draw a sample S from V
  2. $S_i \gets S \cap B_i$
  3. $\boldsymbol{for}\ i \in \{0,\dots,t-1\}\ \boldsymbol{do}$
    1. $\boldsymbol{if}\ |S_i| \geq \theta_\rho\ \boldsymbol{then}$
      1. $\rho_i \gets \frac{|S_i|}{|S|}$
    2. $\boldsymbol{else}$
      1. $\rho_i \gets 0$
  4. $\boldsymbol{return}\ \hat{\bar{d}} = \sum_{i=0}^{t-1}\rho_i(1+\beta)^{i}$

We adjust the guarantee to: with probability 2/3, $(0.5-\epsilon)\bar{d} < \hat{\bar{d}} < (1+\epsilon)\bar{d}$

One set of parameters that achieves this guarantee:

  1. $\beta = \frac{\epsilon}{4}$
  2. $|S| = \Theta(\sqrt{\frac{n}{\alpha}} \cdot poly(\log n, 1/\epsilon))$
  3. $t = \lceil \log_{(1+\beta)}n \rceil + 1$
  4. $\theta_{\rho} = \frac{1}{t}\sqrt{\frac{3}{8} \cdot \frac{\epsilon\alpha}{n}}\,|S|$

2.3 Computing the Average Degree of a Graph: Algorithm 3

Ideas for improving the algorithm

We attribute the error of the algorithm to certain edges; let us see which edges cause it. Divide the vertices into two parts U and V∖U, where U contains the vertices of small degree and V∖U the vertices of large degree, and let E(U, V∖U) denote the set of edges connecting the two parts. We assert that the error arises because every edge in E(U, V∖U) is counted only once; recalling the earlier example makes this point easy to see. So at each sampling step we only need to additionally estimate the proportion of such edges.

Improved algorithm

Using a $(1+\epsilon)$ estimate $E[\Delta_i]$, we can obtain $\Delta_i\rho_i(1+\beta)^{i}$ as a $(1+\epsilon)(1+\beta)$ estimate of $\frac{T_i}{n}$. The modified algorithm is as follows:

  1. Draw a sample S from V, $|S| = \tilde{O}(\frac{L}{\rho\epsilon^2})$, $L = poly(\frac{\log n}{\epsilon})$, $\rho = \frac{1}{t}\sqrt{\frac{\epsilon}{4} \cdot \frac{\alpha}{n}}$
  2. $S_i \gets S \cap B_i$
  3. $\boldsymbol{for}\ i \in \{0,\dots,t-1\}\ \boldsymbol{do}$
    1. $\boldsymbol{if}\ |S_i| \geq \theta_\rho\ \boldsymbol{then}$
      1. $\rho_i \gets \frac{|S_i|}{|S|}$
      2. $estimate\ \Delta_i$
    2. $\boldsymbol{else}$
      1. $\rho_i \gets 0$
  4. $\boldsymbol{return}\ \hat{\bar{d}} = \sum_{i=0}^{t-1}(1+\Delta_i)\rho_i(1+\beta)^{i}$

2.4 Computing the Average Degree of a Graph: Algorithm 4

Alg III

Alg III is the improved algorithm from Section 2.3, which takes the degree lower bound α as a parameter; it is used as a subroutine below.

Alg IV

  1. $\alpha \gets n$
  2. $\hat{\bar{d}} \gets -\infty$
  3. $\boldsymbol{while}\ \hat{\bar{d}} < \alpha\ \boldsymbol{do}$
    1. $\alpha \gets \alpha/2$
    2. $\boldsymbol{if}\ \alpha < \frac{1}{n}\ \boldsymbol{then}$
      1. $\boldsymbol{return}\ 0;$
    3. $\hat{\bar{d}} \gets AlgIII_{\sim \alpha}$ (run Alg III with the lower-bound guess α)
  4. $\boldsymbol{return}\ \hat{\bar{d}}$

Algorithm metrics

Approximation ratio: $(1+\epsilon)$

Running time: $\tilde{O}(\sqrt{n/\bar{d}}) \cdot poly(\epsilon^{-1}\log n)$

2.5 Approximate Minimum Spanning Tree

Definition

Known: $G = (V, E)$, ϵ, $d = \deg(G)$; the weight of edge (u, v) is $w_{uv} \in \{1, 2, \dots, w\} \cup \{\infty\}$

Find: $\hat{M}$ satisfying $(1-\epsilon)M \leq \hat{M} \leq (1+\epsilon)M$, where $M = \min_{T\,spans\,G}W(T)$

Analysis

Definitions

Subgraph of G: $G^{(i)} = (V, E^{(i)})$
$E^{(i)} = \{(u,v) \mid w_{uv} \leq i\}$
Let $C^{(i)}$ be the number of connected components of $G^{(i)}$

Theorem: M can be expressed through the numbers of connected components of these subgraphs: $M = n - w + \sum_{i=1}^{w-1}C^{(i)}$

Proof: Let a minimum spanning tree of graph G be $MST = (V, E')$, and define its subgraphs $MST^{(i)} = (V, E'^{(i)})$, $E'^{(i)} = \{(u,v) \mid (u,v) \in E^{(i)} \wedge (u,v) \in E'\}$

Let $\alpha_i$ be the number of edges of weight i in the MST.

$\sum_{i>l}\alpha_i$ is the number of MST edges with weight greater than l. Adding 1 to this value gives the number of connected components of $MST^{(l)}$, i.e., of what remains of the MST after removing those edges.

The number of connected components of $MST^{(l)}$ is the same as that of $G^{(l)}$, so we obtain the formula: $\sum_{i>l}\alpha_i = C^{(l)} - 1$

Then M can be computed as follows:
$$\begin{align*} M &= \sum_{i=1}^{w}i \cdot \alpha_i = \sum_{i=1}^{w}\alpha_i + \sum_{i=2}^{w}\alpha_i + \dots + \sum_{i=w}^{w}\alpha_i\\ &= C^{(0)} - 1 + C^{(1)} - 1 + \dots + C^{(w-1)} - 1\\ &= n - 1 + C^{(1)} - 1 + \dots + C^{(w-1)} - 1\\ &= n - w + \sum_{i=1}^{w-1}C^{(i)} \end{align*}$$

Time complexity

Running the connected-components estimator w times costs $w \cdot O(d/\epsilon^3) = O(dw^4/\epsilon'^3)$ (taking the per-run accuracy ϵ proportional to ϵ'/w).

2.6 Finding the Diameter of a Point Set

Definition:

There are m points, with pairwise distances given by an adjacency matrix: $D_{ij}$ is the distance from point i to point j, D is a symmetric matrix, and the triangle inequality $D_{ij} \leq D_{ik} + D_{kj}$ holds.

Find: the pair (i, j) that maximizes $D_{ij}$; this $D_{ij}$ is the diameter of the set of m points.

Solving algorithm: Indyk's Algorithm

  1. Pick any $k \in [1, m]$
  2. Select l such that $\forall i, D_{ki} \leq D_{kl}$
  3. Return $(k, l), D_{kl}$
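A tiny Python sketch of this 2-approximation (D is assumed to be an m×m list-of-lists distance matrix):

```python
def approx_diameter(D):
    """Return (k, l, D[k][l]) with D[k][l] >= opt/2, scanning only one row of D."""
    k = 0                                   # any fixed point works; take the first one
    row = D[k]
    l = max(range(len(row)), key=lambda i: row[i])
    return k, l, row[l]
```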

Algorithm analysis: approximation ratio

Denote the optimal solution by opt, attained by some pair of points i and j.

Then $\frac{opt}{2} \leq D_{kl} \leq opt$

Proof of the approximation ratio

The right side of the inequality is immediate; the left side is proved as follows:
$$\begin{align*} opt = &D_{ij}\\ \leq &D_{ik} + D_{kj}\\ \leq &D_{kl} + D_{kl}\\ = &2D_{kl} \end{align*}$$

Algorithm evaluation

The time complexity of the algorithm is $O(m) = O(\sqrt{n})$, where n = m² is the size of the distance matrix.

2.7 Counting Connected Components

Definition

Known: $G = (V, E)$, ϵ, $d = \deg(G)$; the graph G is represented by an adjacency list, d is the maximum degree over all vertices, $|V| = n$, $|E| = m \leq d \cdot n$

Find: a value y such that $C - \epsilon \cdot n \leq y \leq C + \epsilon \cdot n$, where C is the exact number of connected components (as computed by a linear-time algorithm)

Problem analysis

Let $n_v$ denote the number of vertices in the connected component containing v. If A ⊆ V is the vertex set of one connected component, then $\sum_{u \in A}\frac{1}{n_u} = \sum_{u \in A}\frac{1}{|A|} = 1$. This is easy to see, because every vertex in the same component has the same $n_v$, namely the number of vertices in that component. The final answer can therefore be written as $C = \sum_{u \in V}\frac{1}{n_u}$, so estimating C reduces to estimating $\frac{1}{n_u}$.

Solution idea

When $n_u$ is very large, it is hard to compute exactly, but then $\frac{1}{n_u}$ is very small and can be replaced by a small constant (introducing error at most $\frac{\epsilon}{2}$ per vertex). If we set $\hat{n_u} = \min\{n_u, \frac{2}{\epsilon}\}$ and $\hat{C} = \sum_{u \in V}\frac{1}{\hat{n_u}}$, then $\hat{C}$ is a good estimate of C.

Algorithm: computing $\hat{n_u}$

The idea is very simple: run a small search (BFS) from u; if the number of visited vertices reaches $\frac{2}{\epsilon}$, stop and return $\frac{2}{\epsilon}$, otherwise continue and return the component size.

The time complexity is $O(d \cdot \frac{1}{\epsilon})$: the larger d is, the longer it takes, and the larger $\frac{1}{\epsilon}$ is, the longer it takes.

Algorithm: estimating C

Randomly select $r = b/\epsilon^2$ vertices to form a set U, and apply the previous procedure to each of them; the final estimate is $\hat{C} = \frac{n}{r}\sum_{u \in U}\frac{1}{\hat{n_u}}$, with time complexity $O(d/\epsilon^3)$
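A Python sketch of the whole estimator under the adjacency-list assumption used earlier (b is an illustrative constant controlling the sample size):

```python
import random
from collections import deque

def capped_component_size(adj, u, cap):
    """BFS from u, stopping once `cap` vertices have been seen."""
    seen = {u}
    queue = deque([u])
    while queue and len(seen) < cap:
        v = queue.popleft()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
                if len(seen) >= cap:
                    break
    return min(len(seen), cap)

def estimate_components(adj, eps, b=10):
    vertices = list(adj)
    n = len(vertices)
    cap = int(2.0 / eps)
    r = max(1, int(b / eps ** 2))
    sample = random.choices(vertices, k=r)
    return (n / r) * sum(1.0 / capped_component_size(adj, u, cap) for u in sample)
```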

3 Parallel Computing Algorithms

3.1 Basic Questions (1)

MapReduce

Computation with the MR model has the following characteristics:

  • The computation proceeds in rounds; the input and output data of each round are in <key, value> form
  • Each round is divided into: Map, Shuffle, Reduce
  • Map: each record is processed by the map function, which outputs a new data set, i.e., new <key, value> pairs
  • Shuffle: transparent to the programmer; all data output by the Map phase is grouped by key, and data with the same key is sent to the same reducer
  • Reduce: input <k, v1, v2, …>, output a new data set
  • For such a computing framework, the design goals are: fewer rounds, less memory, and more parallelism

Problem examples

Let us look at some concrete problem examples that can be parallelized. Of course, there is more than one way to implement each of them; only one method is given here.

Build an inverted index

Definition: given a set of documents, record for each word which documents it appears in.

  • Map function: $<docID, content> \rightarrow <word, docID>$; the map function splits the content and, for each word obtained, outputs the word together with the corresponding docID.
  • Reduce function: $<word, docID> \rightarrow <word, list\ of\ docID>$; after the map phase ends, shuffle automatically sends the key-value pairs with the same key to one machine, and we simply collect this batch of data together.

Word count

Definition: given a set of documents, count the number of occurrences of each word

  • Map function: <docID, content> → <word, 1>
  • Reduce function: <word, 1> → <word, count>

A minimal sketch of these two functions is shown below.
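A toy simulation of the map and reduce functions in plain Python (no actual MapReduce framework is assumed; the shuffle step is emulated with a dictionary):

```python
from collections import defaultdict

def map_fn(doc_id, content):
    # <docID, content> -> list of <word, 1>
    return [(word, 1) for word in content.split()]

def reduce_fn(word, counts):
    # <word, [1, 1, ...]> -> <word, count>
    return word, sum(counts)

def word_count(docs):
    # Map phase
    mapped = [pair for doc_id, content in docs.items() for pair in map_fn(doc_id, content)]
    # Shuffle phase: group values by key
    groups = defaultdict(list)
    for word, one in mapped:
        groups[word].append(one)
    # Reduce phase
    return dict(reduce_fn(w, vs) for w, vs in groups.items())

print(word_count({"d1": "big data big model", "d2": "data stream"}))
# {'big': 2, 'data': 2, 'model': 1, 'stream': 1}
```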

Retrieval

Definition: given line numbers and the corresponding document content, find the positions (line numbers) where a specified word appears

  • Map function: omitted
  • Reduce function: omitted

3.2 Basic Questions (2)

Introduction

Problem example: matrix multiplication

This is plain matrix multiplication. The time complexity of the ordinary algorithm is O(m·n·d) for multiplying an m×d matrix by a d×n matrix. With a parallel computing framework we can greatly speed up this process. Two concrete methods follow.

Matrix multiplication 1

Definition: for matrices A and B, (A, i, j) refers to the element in row i, column j of A; it is not the element itself but an index, and $a_{ij}$ denotes the element.

  • Map:
    • $((A,i,j), a_{ij}) \rightarrow (j, (A,i,a_{ij}))$
    • $((B,j,k), b_{jk}) \rightarrow (j, (B,k,b_{jk}))$
  • Reduce: $(j,(A,i,a_{ij})), (j,(B,k,b_{jk})) \rightarrow ((i,k), a_{ij} \cdot b_{jk})$
  • Map: nothing (identity)
  • Reduce: $((i,k),(v_1,v_2,\dots)) \rightarrow ((i,k), \sum v_i)$

Idea: each element of the target matrix is obtained by accumulating d intermediate products $a_{ij} \cdot b_{jk}$. We first produce all such products and then add them up; this holds for every element of the target matrix.

Matrix multiplication 2

  • Map function:

    • $((A,i,j), a_{ij}) \rightarrow ((i,x), (A,j,a_{ij}))$ for all x ∈ [1, n]
    • $((B,j,k), b_{jk}) \rightarrow ((y,k), (B,j,b_{jk}))$ for all y ∈ [1, m]
  • Reduce function: $((i,k),(A,j,a_{ij})) \wedge ((i,k),(B,j,b_{jk})) \rightarrow ((i,k), \sum_j a_{ij} \cdot b_{jk})$

Idea: compared with the first method, method 2 does not compute the intermediate products first; instead, it brings together the 2d basic elements needed for each target entry and then performs the dot product.

3.3 Sorting algorithm

Introduction to the problem

algorithm

Using p processors, input $<i, A[i]>$

  • Map: $<i, A[i]> \rightarrow <j, ((i, A[i]), y)>$

    1. Output $<i\%p, ((i, A[i]), 0)>$

    2. With probability T/n, output $<j, ((i, A[i]), 1)>$ for all $j \in [0, p-1]$

      otherwise output $<j, ((i, A[i]), 0)>$

  • Reduce:

    • Collect the data with y = 1 as S and sort it
    • Construct splitters $(s_1, s_2, ..., s_{p-1})$, where $s_k$ is the $k\lceil \frac{|S|}{p} \rceil$-th element of S
    • Collect the data with y = 0 as D
    • For each (i, x) ∈ D satisfying $s_k < x \leq s_{k+1}$, output <k, (i, x)>
  • Map: nothing (identity)

  • Reduce: input $<j, ((i, A[i]), \dots)>$

    • Sort all received $(i, A[i])$ pairs by $A[i]$ and output them

Analysis of ideas: the overall idea is to divide the whole data set into segments so that there is a strict order between segments; the data inside each segment are close to each other, and each segment can then be sorted with an efficient in-memory method. One key to the problem is whether the data is divided well; it can be proved that, with high probability, the division is balanced. But this is not absolute: it is possible that one computing node receives almost no data while another receives far too much, although this is an extremely low-probability event.

3.4 Computing the Minimum Spanning Tree

Introduction to the problem

Given a graph G = (V, E), where V and E are the vertex set and edge set of the graph. A subgraph $T^* = (V, E')$ of G is called a minimum spanning tree if and only if $T^*$ is connected and $\sum_{(u,v)\in T^*}w(u,v) = \min_{T}\{\sum_{(u,v)\in T}w(u,v)\}$

For small data, the Kruskal or Prim algorithm is usually used. When the graph becomes large, the complexity of these two algorithms prevents solving the problem within a reasonable time; MapReduce can then be used to speed up the computation.

Main idea of the algorithm

Using a graph-partitioning approach, the graph G is divided into k parts, and a minimum spanning forest is computed within each pair of parts, as follows.

  1. Divide the vertices into k parts; for each $(i,j) \in [k]^2$, let $G_{ij} = (V_i \cup V_j, E_{ij})$ be the subgraph induced by the vertices $V_i \cup V_j$
  2. On each $G_{ij}$, compute $M_{ij} = MSF(G_{ij})$
  3. $H = \cup_{i,j}M_{ij}$; compute $M = MST(H)$

Note: MSF stands for minimum spanning forest. To illustrate the concept: a minimum spanning tree has n−1 edges and can be viewed as a minimum spanning forest containing a single tree; similarly, the same graph also has minimum spanning forests with two trees, three trees, and so on, as long as the forest attains $\min\{\sum_{(u,v)\in Forest}w(u,v)\}$.

In essence, the algorithm first computes spanning trees (forests) locally, then uses the remaining edges connecting them to form a new graph, and finally takes the minimum spanning tree of this new graph as the overall result.

MR algorithm

  • Map: input <(u,v), NULL>
    • Transform it into <(h(u), h(v)); (u,v)>
    • For the transformed pair, if h(u) = h(v), then for all j ∈ [1, k] output <(h(u), j); (u,v)>
  • Reduce: input $<(i,j); E_{ij}>$
    • Let $M_{ij} = MSF(G_{ij})$
    • Output <NULL; (u,v)> for each edge e = (u,v) in $M_{ij}$
  • Map: nothing (identity)
  • Reduce: M = MST(H)

Here h is a hash function, which can be realized by a random algorithm with uniform values (note: generate the hash once with a random algorithm rather than drawing new random values on every use).

Of course, load balancing across the computing nodes should also be considered when partitioning.

4 External memory model algorithm

4.1 External storage model

External memory model

So far, in terms of the storage model, the algorithmic models we have seen fall into two types: the RAM model, which is the design model for ordinary algorithms, and the I/O model, in which memory is smaller than the amount of data and external memory is unlimited.

There are some differences between external memory access and memory access:

  • RAM is faster than external storage
  • Continuous access to external memory is less costly than random access, that is to say: access in units of blocks

Fundamental Problems Based on the External Storage Model

In the I/O model, the memory has size M, the external memory is unlimited, and data is transferred in pages (blocks) of size B.

Continuous reading of N data on the external storage requires O(N/B) I/O times

How is matrix multiplication computed?

Input: two N×N matrices X and Y

  1. Divide the matrices into blocks of size $\sqrt{M}/2 \times \sqrt{M}/2$
  2. Considering the blocks of the product X×Y, there are clearly $O((\frac{N}{\sqrt{M}})^2)$ blocks to output
  3. Producing each output block requires scanning $\frac{N}{\sqrt{M}}$ pairs of input blocks
  4. Each in-memory computation requires O(M/B) I/Os
  5. In total, $O((\frac{N}{\sqrt{M}})^3 \cdot M/B)$ I/Os

Linked list

Supports three operations: insert(x, p), remove(p), traverse(p, k)

The time complexity of each operation under the RAM model: update O(1), traverse O(k)

Under the external memory model, consecutive elements of the linked list are placed in blocks of size B, and each block is kept at least B/2 full:

  • remove: if a block falls below B/2 after deletion, merge it with an adjacent block; if the merged block exceeds B, split it evenly
  • insert: if a block exceeds B after insertion, split it evenly
  • traverse: O(2k/B)

Search structure

Supports three operations: insert(x), remove(x), query(x)

$(a,b)$-tree: $2 \leq a \leq (b+1)/2$

Similar to a binary search tree; each internal node stores $(p_0, k_1, ..., k_c, p_c)$

The root has 0 or ≥ 2 children; except for the root, the number of children of each non-leaf node is in [a, b]:

  • remove: find the corresponding leaf node; if it falls below a after deletion, merge it with an adjacent node, and if the merged node exceeds b, split it evenly; then recursively delete the node or adjust the key values one level up
  • insert: find the corresponding leaf node; if it exceeds b after insertion, split it evenly, and recursively insert at the level above
  • query: $O(\log_a(N/a))$

4.2 External memory sorting

External memory sorting problem

An external memory sorting algorithm must be designed together with the external memory (I/O) model.

algorithm

  • Given N data items, divide them into groups of size O(M)
  • Each group of data can be sorted in memory
  • Reading one group of data from external storage requires O(M/B) I/Os
  • Perform the above operations on all groups, so that each group contains sorted data
  • Perform a multi-way merge sort on these sorted groups
  • O(M/B) groups can be merged in each pass

Process explanation

First, note that when transferring data from external storage to memory, only B items can be moved per I/O; therefore filling the memory once costs O(M/B) I/Os. Second, consider how many groups can be merged at once during the multi-way merge: one page is read from each group and then merged, so the limit depends only on the memory size, not on the size of each group, and is therefore O(M/B).


Evaluation

The time complexity has two parts: intra-group sorting and the merge sort across groups.

  1. For intra-group sorting, each group only needs to be read into memory once, so this part costs $O(N/B)$ I/Os
  2. For the merge sort, each merge pass reads all the data once, costing $O(N/B)$ I/Os, and the number of passes is $O(\log_{M/B}\frac{N}{B})$. In total, the cost is $O(N/B \cdot \log_{M/B}\frac{N}{B})$

4.3 List Ranking

List Ranking

Definition

[List ranking problem]: given an adjacency linked list L of size N, stored in an array (contiguous external storage space), compute the rank (position in the list) of every node.

[General list ranking problem]: given an adjacency linked list L of size N, stored in an array (contiguous external storage space), where each node v stores a weight $w_v$, compute the rank of every node v (the sum of the weights from the head node to v).

Analysis

  • If a sub-sequence of the list is merged into a single node (i.e., merging within a page and treating the page as one node) and the sum of the weights is used as the weight of the new node, the rank values of the preceding and following nodes are unaffected.
  • If the list has size at most M, the problem can be solved with O(M/B) I/Os: read all the data into memory and use an in-memory algorithm.

For intuition, consider the example N = 10, B = 2, M = 4 (figure omitted): in the worst case, a naive traversal costs O(N) I/Os.

Algorithm

Input: an external-memory linked list L of size N

  1. Find an independent set of vertices X in L
  2. "Skip over" the nodes in X to build a new, smaller external-memory linked list L'
  3. Solve L' recursively
  4. "Backfill" the nodes in X, constructing the ranks of L from the ranks of L'

Steps 1, 2, and 4 can each be done with $O(sort) = O(\frac{N}{B}\log_{\frac{M}{B}}\frac{N}{B})$ I/Os, which gives the following recurrence:

$T(N) = T((1-\alpha)N) + O(\frac{N}{B}\log_{\frac{M}{B}}\frac{N}{B})$, whose solution is $T(N) = O(\frac{N}{B}\log_{\frac{M}{B}}\frac{N}{B})$

Step 1

Order the nodes by ID and split the list into forward (f) segments and backward (b) segments; color the f segments alternately red and blue, and the b segments alternately green and blue. Here an f segment or b segment refers to a maximal contiguous piece of the list in which the ID order either fully agrees with the placement in memory or is exactly reversed; a list generally contains several f segments and b segments.

This step only requires one external-memory sort, so its cost is $O(\frac{N}{B}\log_{\frac{M}{B}}\frac{N}{B})$

Steps 2 and 4

  1. $\tilde{L} = copy(L)$
  2. Sort the linked list L by the address of the successor node, and while sorting perform the following operations:
    1. In step 2, rewrite the pointer and weight of each successor node that belongs to X
    2. In step 4, rewrite the pointer and weight of each successor node that belongs to X, and rewrite the weights of the nodes that belong to X
  3. Re-sort L by address
