Big data algorithm self-test

1 Sublinear Space Algorithms for Big Data

1.1 The counting problem in the streaming model

Problem definition? Which algorithm is used? Algorithm steps? (Hint: three levels of refinement)

Chebyshev's inequality? How is correctness proved? Expectation, variance, space complexity?

Problem: count a huge number of events using very limited space (far fewer bits than it takes to write the count down).

Morris, Morris+, Morris++

Morris: maintain a counter $X$; on each event, increment $X$ with probability $1/2^X$; estimate $\hat{f} = 2^X - 1$.

Morris+ averages $k$ independent Morris counters; Morris++ takes the median of several Morris+ estimates.

$E[\hat{f}] = N$

$Var[\hat{f}] = \frac{N^2 - N}{2k}$

$P[|\hat{f} - N| \ge \epsilon] \le \frac{N^2 - N}{2k\epsilon^2}$ (Chebyshev)
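Below is a minimal Python sketch of Morris and Morris+ as described above (class names, parameters, and the toy driver are my own; it illustrates the update rule and is not a tuned implementation):

```python
import random

class Morris:
    """Single Morris counter: store X, increment with probability 1/2^X, estimate 2^X - 1."""
    def __init__(self):
        self.x = 0

    def update(self):
        # one stream event arrives
        if random.random() < 1.0 / (1 << self.x):
            self.x += 1

    def estimate(self):
        return (1 << self.x) - 1

class MorrisPlus:
    """Morris+: average k independent Morris counters, shrinking the variance to (N^2 - N)/(2k).
    (Morris++ would additionally take the median of several Morris+ estimates.)"""
    def __init__(self, k):
        self.counters = [Morris() for _ in range(k)]

    def update(self):
        for c in self.counters:
            c.update()

    def estimate(self):
        return sum(c.estimate() for c in self.counters) / len(self.counters)

# toy usage: count 10_000 events with k = 50 copies
if __name__ == "__main__":
    mp = MorrisPlus(k=50)
    for _ in range(10_000):
        mp.update()
    print(mp.estimate())   # close to 10000 with high probability
```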

1.2 Number of distinct elements

Problem definition? Which algorithms are used? Algorithm steps?
(Hint: if real numbers can be stored: three levels of refinement; if not: 1 + 1)

How to prove correctness? Expectation, variance, space complexity?

Count the number of unique elements in a data stream

  1. FM, FM+

    Pick a random hash $h : [n] \mapsto [0,1]$; maintain $z = \min\{z, h(i)\}$ over the stream; return $\hat{d} = \frac{1}{z} - 1$. (A runnable sketch of this idealized algorithm appears after this list.)

    $E[Z] = \frac{1}{d+1}$

    FM+ averages $q$ independent copies of $Z$ before inverting:

    $var[Z] \le \frac{2}{(d+1)(d+2)} \cdot \frac{1}{q} < \frac{2}{(d+1)(d+1)} \cdot \frac{1}{q}$

    $P[|X - d| > \epsilon' d] < \frac{2}{q}\left(\frac{2}{\epsilon'} + 1\right)^2$

  2. FM’+

    Maintain the $k$ smallest hash values seen so far; return $\frac{k}{z_k}$, where $z_k$ is the $k$-th smallest hash value.

  3. PracticalFM

    1. If $zeros(h(j)) > z$: set $z = zeros(h(j))$

    2. Return $\hat{d} = 2^{z + \frac{1}{2}}$

    $E[Y_r] = \frac{d}{2^r}$

    $var[Y_r] \le \frac{d}{2^r}$

    The final success probability is greater than $1 - \frac{2\sqrt{2}}{C}$

  4. BJKST

    1. If $zeros(h(j)) \ge z$:

      1. $B = B \cup \{(g(j), zeros(h(j)))\}$
      2. While $|B| > \frac{c}{\epsilon^2}$:
        1. $z = z + 1$
        2. Remove every pair $(\alpha, \beta)$ with $\beta < z$ from $B$
    2. Return $\hat{d} = |B| \cdot 2^z$

    (A runnable sketch of BJKST appears after this list.)
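A minimal Python sketch of the idealized FM / FM+ algorithm from item 1 above. The dictionary that memoizes hash values is only there to make the toy self-contained (so equal items get equal hash values); a real implementation would use a hash function rather than a table, and all names here are my own:

```python
import random

class FM:
    """Idealized FM: hash items to uniform reals in [0,1], keep the minimum z, estimate 1/z - 1."""
    def __init__(self):
        self.h = {}          # stand-in for a random hash h : [n] -> [0,1]
        self.z = 1.0

    def update(self, item):
        if item not in self.h:
            self.h[item] = random.random()
        self.z = min(self.z, self.h[item])

class FMPlus:
    """FM+: average the z-values of q independent FM sketches before inverting."""
    def __init__(self, q):
        self.sketches = [FM() for _ in range(q)]

    def update(self, item):
        for s in self.sketches:
            s.update(item)

    def estimate(self):
        z_bar = sum(s.z for s in self.sketches) / len(self.sketches)
        return 1.0 / z_bar - 1.0

# toy usage: a stream with 1000 distinct elements, each repeated 5 times
fmp = FMPlus(q=100)
for x in list(range(1000)) * 5:
    fmp.update(x)
print(fmp.estimate())   # roughly 1000
```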
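And a sketch of BJKST (item 4). The multiplicative hashes below are toy stand-ins for the 2-wise independent functions $h$ and $g$, and the constant in the buffer size $c/\epsilon^2$ is arbitrary here:

```python
import random

class BJKST:
    """BJKST: keep a buffer B of (fingerprint g(j), zeros(h(j))) pairs for items with
    zeros(h(j)) >= z, raising z whenever |B| exceeds c/eps^2; estimate |B| * 2^z."""
    def __init__(self, eps=0.1, c=16, seed=0):
        rng = random.Random(seed)
        self.a = rng.randrange(1, 1 << 61, 2)   # toy stand-ins for 2-wise independent hashes
        self.b = rng.randrange(1, 1 << 61, 2)
        self.cap = int(c / eps ** 2)
        self.z = 0
        self.B = {}                              # fingerprint -> zeros value

    def _h(self, j):
        return (self.a * hash(j)) & 0xFFFFFFFF

    def _g(self, j):
        return (self.b * hash(j)) & 0xFFFFFFFF

    @staticmethod
    def _zeros(x):
        return (x & -x).bit_length() - 1 if x else 32   # trailing zeros

    def update(self, j):
        zj = self._zeros(self._h(j))
        if zj >= self.z:
            self.B[self._g(j)] = zj
            while len(self.B) > self.cap:
                self.z += 1
                self.B = {g: b for g, b in self.B.items() if b >= self.z}

    def estimate(self):
        return len(self.B) * (2 ** self.z)

sk = BJKST(eps=0.1)
for x in list(range(10_000)) * 3:    # 10k distinct items, each seen 3 times
    sk.update(x)
print(sk.estimate())
```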

1.3 Point query

problem definition? What algorithm to use? Algorithmic steps?

Space complexity?

Estimate the number of occurrences of each element in the stream (answer point queries about element frequencies)

  1. Misra-Gries

    Maintain a set $A$ whose elements are pairs $(i, \hat{f_i})$. (A runnable sketch appears after this list.)

    1. $A \gets \emptyset$

    2. For each element $e$ in the data stream:

      if $e \in A$: set $(e, \hat{f_e}) \gets (e, \hat{f_e} + 1)$

      else if $|A| < \frac{1}{\epsilon}$: insert $(e, 1)$ into $A$

      else:

      1. Decrement all counts in $A$ by 1
      2. If $\hat{f_j} = 0$: remove $(j, 0)$ from $A$
    3. For a query $i$: if $i \in A$, return $\hat{f_i}$; otherwise return 0

    The space cost is $O(\epsilon^{-1} \log n)$

  2. Metwally (Space-Saving)

    1. For each element $e$ in the data stream:
      1. if $e \in A$: set $(e, \hat{f_e}) \gets (e, \hat{f_e} + 1)$
      2. else if $|A| < \frac{1}{\epsilon}$: insert $(e, 1)$ into $A$
      3. else: insert $(e, MIN + 1)$ into $A$ and delete one pair satisfying $\hat{f_j} = MIN$, where $MIN$ is the current minimum count in $A$
    2. For a query $i$: if $i \in A$, return $\hat{f_i}$; otherwise return $MIN$

    The space cost is $O(\epsilon^{-1} \log n)$

  3. Count-Min

    1. Randomly select $t$ 2-wise independent hash functions $h_i : [n] \to [k]$

    2. For each update $(j, c)$ in the stream:

      for $i = 1$ to $t$:

      $C[i][h_i(j)] = C[i][h_i(j)] + c$

    3. For a query $a$, return $\hat{f_a} = \min_{1 \le i \le t} C[i][h_i(a)]$

    (A runnable sketch appears after this list.)

  4. Count-Median (Count-Min with the min in the query replaced by the median)

  5. Count Sketch

    1. Randomly select a 2-wise independent hash function $h : [n] \to [k]$

    2. Randomly select a 2-wise independent hash function $g : [n] \to \{-1, 1\}$

    3. For each update $(j, c)$:

      $C[h(j)] = C[h(j)] + c \cdot g(j)$

    4. For a query $a$, return $\hat{f_a} = g(a) \cdot C[h(a)]$

  6. Count Sketch+
    (equivalent to running the Count Sketch algorithm t times, and finally taking the median value)

    1. Randomly select $t$ 2-wise independent hash functions $h_i : [n] \to [k]$

    2. Randomly select $t$ 2-wise independent hash functions $g_i : [n] \to \{-1, 1\}$

      For each update $(j, c)$:

      for $i = 1$ to $t$:

      $C[i][h_i(j)] = C[i][h_i(j)] + c \cdot g_i(j)$

    3. Return $\hat{f_a} = \mathrm{median}_{1 \le i \le t}\ g_i(a) \cdot C[i][h_i(a)]$

    (A runnable sketch appears after this list.)
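Runnable sketches of the main point-query algorithms above follow. First Misra-Gries (the function name and the toy stream are mine; swapping the decrement branch for "evict a minimum-count entry and start the new item at MIN + 1" would give Metwally's variant from item 2):

```python
def misra_gries(stream, eps):
    """Misra-Gries with k = 1/eps counters: every estimate f_hat_i satisfies
    f_i - eps*m <= f_hat_i <= f_i, where m is the stream length."""
    k = int(1 / eps)
    A = {}
    for e in stream:
        if e in A:
            A[e] += 1
        elif len(A) < k:
            A[e] = 1
        else:
            for j in list(A):          # decrement every counter; drop the ones that hit 0
                A[j] -= 1
                if A[j] == 0:
                    del A[j]
    return A                            # point query for i: A.get(i, 0)

counts = misra_gries(["a", "b", "a", "c", "a", "b", "a"], eps=0.5)
print(counts.get("a", 0), counts.get("c", 0))
```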
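Count-Min (item 3), with toy (a·hash(x)+b mod p) functions standing in for the 2-wise independent family; k and t are chosen by the caller, typically k = ⌈e/ε⌉ and t = ⌈ln(1/δ)⌉:

```python
import random

class CountMin:
    """Count-Min sketch: t rows of k counters; a point query returns the minimum counter,
    which never underestimates and overestimates by at most eps*||f||_1 w.h.p."""
    def __init__(self, k, t, seed=0):
        rng = random.Random(seed)
        self.k, self.t = k, t
        self.p = (1 << 61) - 1                              # a Mersenne prime for the toy hashes
        self.hashes = [(rng.randrange(1, self.p), rng.randrange(self.p)) for _ in range(t)]
        self.C = [[0] * k for _ in range(t)]

    def _h(self, i, x):
        a, b = self.hashes[i]
        return ((a * hash(x) + b) % self.p) % self.k

    def update(self, j, c=1):
        for i in range(self.t):
            self.C[i][self._h(i, j)] += c

    def query(self, a):
        return min(self.C[i][self._h(i, a)] for i in range(self.t))

cm = CountMin(k=200, t=5)
for x in ["x"] * 100 + ["y"] * 10:
    cm.update(x)
print(cm.query("x"), cm.query("y"), cm.query("zzz"))
```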
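Count Sketch with t rows and a median over rows (items 5 and 6), using the same style of toy hashes; the sign hashes g_i map into {-1, +1}:

```python
import random
import statistics

class CountSketch:
    """Count Sketch: each row adds c * g_i(j) into bucket h_i(j); a point query returns
    the median over rows of g_i(a) * C[i][h_i(a)]."""
    def __init__(self, k, t, seed=0):
        rng = random.Random(seed)
        self.k, self.t = k, t
        self.p = (1 << 61) - 1
        self.hp = [(rng.randrange(1, self.p), rng.randrange(self.p)) for _ in range(t)]
        self.gp = [(rng.randrange(1, self.p), rng.randrange(self.p)) for _ in range(t)]
        self.C = [[0] * k for _ in range(t)]

    def _h(self, i, x):
        a, b = self.hp[i]
        return ((a * hash(x) + b) % self.p) % self.k

    def _g(self, i, x):
        a, b = self.gp[i]
        return 1 if ((a * hash(x) + b) % self.p) % 2 == 0 else -1

    def update(self, j, c=1):
        for i in range(self.t):
            self.C[i][self._h(i, j)] += c * self._g(i, j)

    def query(self, a):
        return statistics.median(self._g(i, a) * self.C[i][self._h(i, a)]
                                 for i in range(self.t))

cs = CountSketch(k=256, t=7)
for x in ["x"] * 100 + ["y"] * 10:
    cs.update(x)
print(cs.query("x"), cs.query("y"))
```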

1.4 Frequency moment estimation

problem definition? What algorithm to use? Algorithmic steps?

Expectation, variance, space complexity?

(Omitted.)

1.5 Fixed-Size Sampling

problem definition? What algorithm to use? Algorithmic steps?

Expectation, variance, space complexity?

Reservoir Sampling Algorithm

  1. Initialize the sample array $A[1, ..., s]$ with the first $s$ elements of the data stream; set $m \gets s$
  2. For each subsequent update $x$:
    1. With probability $\frac{s}{m+1}$, replace a uniformly random element of $A$ with $x$
    2. $m \gets m + 1$
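A short Python sketch of the reservoir sampling procedure above (the function name and toy stream are mine):

```python
import random

def reservoir_sample(stream, s):
    """Keep the first s items; when the (m+1)-th item arrives, it replaces a uniformly
    random slot of A with probability s/(m+1), so every item is kept with probability s/N."""
    A = []
    m = 0
    for x in stream:
        if m < s:
            A.append(x)
        elif random.random() < s / (m + 1):
            A[random.randrange(s)] = x
        m += 1
    return A

print(reservoir_sample(range(100_000), s=10))
```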

1.6 Bloom Filter

problem definition? What algorithm to use? Algorithmic steps?

Given a set S drawn from a large universe, answer membership queries: given an element, decide (approximately) whether it belongs to S

  1. Approximate hash

    1. Let H be a family of universal hash functions $[U] \to [m]$, with $m = \frac{n}{\delta}$
    2. Randomly select $h \in H$ and maintain a bit array $A[m]$; the size of $S$ is $n$
    3. For each $i \in S$: set $A[h(i)] = 1$
    4. Given a query $q$, return yes if and only if $A[h(q)] = 1$
  2. Bloom Filter

    1. Let H be a family of independent ideal hash functions: [U]→[m]

    2. Randomly select $h_1, ..., h_d \in H$ and maintain the array $A[m]$

    3. For every $i \in S$:

      For each $j \in [1, d]$:

      $A[h_j(i)] = 1$

    4. Given a query $q$, return yes if and only if $\forall j \in [d],\ A[h_j(q)] = 1$
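A small Bloom filter sketch matching the steps above; the hash family is again a toy (a·hash(x)+b mod p mod m) construction rather than the ideal hash functions the analysis assumes:

```python
import random

class BloomFilter:
    """Bloom filter with d hash functions over a bit array of length m:
    no false negatives; false-positive rate roughly (1 - e^{-dn/m})^d after n insertions."""
    def __init__(self, m, d, seed=0):
        rng = random.Random(seed)
        self.m, self.d = m, d
        self.p = (1 << 61) - 1
        self.hashes = [(rng.randrange(1, self.p), rng.randrange(self.p)) for _ in range(d)]
        self.A = bytearray(m)

    def _h(self, j, x):
        a, b = self.hashes[j]
        return ((a * hash(x) + b) % self.p) % self.m

    def insert(self, i):
        for j in range(self.d):
            self.A[self._h(j, i)] = 1

    def query(self, q):
        # yes iff every probed bit is set (may be a false positive, never a false negative)
        return all(self.A[self._h(j, q)] for j in range(self.d))

bf = BloomFilter(m=1 << 16, d=5)
for w in ["alice", "bob"]:
    bf.insert(w)
print(bf.query("alice"), bf.query("carol"))   # True, almost certainly False
```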

2 Sublinear Time Algorithms for Big Data

2.1 Find the number of connected components

problem definition? What algorithm to use? Algorithmic steps?

Calculation formula? time complexity?

To estimate $\hat{n}_u$, the size of the connected component containing a node $u$: run a BFS from $u$; if fewer than $\frac{2}{\epsilon}$ vertices have been reached, keep searching; otherwise stop and return $\frac{2}{\epsilon}$.

Randomly select $r = b/\epsilon^2$ nodes from the vertex set to form a set $U$, and apply this procedure to each node in $U$.

Finally $\hat{C} = \frac{n}{r} \sum_{u \in U} \frac{1}{\hat{n_u}}$; the time complexity is $O(d/\epsilon^3)$.
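An in-memory Python sketch of this estimator (graph representation, parameter names, and the constant b are my own choices; a real sublinear implementation would query adjacency lists rather than hold the whole graph):

```python
import random
from collections import deque

def estimate_components(adj, eps, b=4, seed=0):
    """From each of r = b/eps^2 sampled vertices, run a BFS that stops once 2/eps
    vertices are reached; use n_hat_u = min(|component of u|, 2/eps) and average n/n_hat_u."""
    rng = random.Random(seed)
    nodes = list(adj)
    n = len(nodes)
    r = max(1, int(b / eps ** 2))
    cap = 2 / eps
    total = 0.0
    for _ in range(r):
        u = rng.choice(nodes)
        seen, q = {u}, deque([u])
        while q and len(seen) < cap:
            v = q.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    q.append(w)
                    if len(seen) >= cap:
                        break
        total += 1.0 / min(len(seen), cap)
    return n * total / r

# toy graph: two triangles plus an isolated vertex -> about 3 components
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4], 6: []}
print(estimate_components(adj, eps=0.5))
```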

2.2 Approximate minimum spanning tree

problem definition? What algorithm to use? Algorithmic steps?

Calculation formula? time complexity?

Define $G^{(i)} = (V, E^{(i)})$ with $E^{(i)} = \{(u, v) \mid w_{uv} \le i\}$, i.e. the subgraph keeping only the edges of weight at most $i$; let $C^{(i)}$ denote its number of connected components.

The MST weight $M$ equals $n - w + \sum_{i=1}^{w-1} C^{(i)}$, where $w$ is the maximum edge weight. Letting $\alpha_i$ be the number of MST edges of weight exactly $i$, and using $\sum_{i=j}^{w} \alpha_i = C^{(j-1)} - 1$:

$$
\begin{aligned}
M &= \sum_{i=1}^{w} i \cdot \alpha_i = \sum_{i=1}^{w}\alpha_i + \sum_{i=2}^{w}\alpha_i + \dots + \sum_{i=w}^{w}\alpha_i \\
  &= C^{(0)} - 1 + C^{(1)} - 1 + \dots + C^{(w-1)} - 1 \\
  &= n - 1 + C^{(1)} - 1 + \dots + C^{(w-1)} - 1 \\
  &= n - w + \sum_{i=1}^{w-1} C^{(i)}
\end{aligned}
$$
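The formula above turns MST weight estimation into w − 1 connected-component estimations. The sketch below wires that up; it builds the threshold subgraphs explicitly for clarity (a sublinear implementation would filter edges on the fly), and the component counter passed in is assumed to be the estimator sketched in 2.1 (an exact counter is plugged in here just to check the formula):

```python
def approx_mst_weight(adj, w_max, count_components, eps=0.5):
    """M = n - w + sum_{i=1..w-1} C^{(i)}, where C^{(i)} counts components of the
    subgraph G^{(i)} keeping only edges of weight <= i."""
    n = len(adj)
    total = n - w_max
    for i in range(1, w_max):
        adj_i = {u: [v for (v, w) in nbrs if w <= i] for u, nbrs in adj.items()}
        total += count_components(adj_i, eps)
    return total

def exact_components(adj, eps=None):
    """Exact component count by DFS, used here as a stand-in for the sublinear estimator."""
    seen, c = set(), 0
    for u in adj:
        if u not in seen:
            c += 1
            stack = [u]
            while stack:
                v = stack.pop()
                if v not in seen:
                    seen.add(v)
                    stack.extend(adj[v])
    return c

# toy graph: path 0 - 1 - 2 with edge weights 1 and 2, so the MST weight is 3
adj = {0: [(1, 1)], 1: [(0, 1), (2, 2)], 2: [(1, 2)]}
print(approx_mst_weight(adj, w_max=2, count_components=exact_components))   # 3
```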

2.3 Find the diameter of the point set

problem definition? What algorithm to use? Algorithmic steps?

Calculation formula? time complexity?

Indyk's Algorithm

  1. Pick an arbitrary point $k \in [1, m]$
  2. Select $l$ such that $\forall i,\ D_{ki} \le D_{kl}$ (the point farthest from $k$)
  3. Return $(k, l)$ and $D_{kl}$ (by the triangle inequality this is a 2-approximation of the diameter); the running time is $O(m)$ distance computations
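A few lines suffice for Indyk's algorithm; here D is assumed to be a full distance matrix of a metric, and any starting index works:

```python
def indyk_diameter(D):
    """Pick an arbitrary point k and return the farthest point l and D[k][l].
    By the triangle inequality D[k][l] >= diameter / 2, so this is a 2-approximation
    computed with O(m) distance lookups."""
    k = 0
    l = max(range(len(D)), key=lambda i: D[k][i])
    return (k, l), D[k][l]

D = [[0, 2, 9],
     [2, 0, 8],
     [9, 8, 0]]
print(indyk_diameter(D))   # ((0, 2), 9)
```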

2.4 Estimating the average degree of a graph

Alg III

  1. Sample a uniformly random subset $S$ of $V$ with $|S| = \tilde{O}(\frac{L}{\rho\epsilon^2})$, where $L = poly(\frac{\log n}{\epsilon})$ and $\rho = \frac{1}{t}\sqrt{\frac{\epsilon}{4} \cdot \frac{\alpha}{n}}$
  2. $S_i \gets S \cap B_i$
  3. **for** $i \in \{0, \dots, t-1\}$ **do**
    1. **if** $|S_i| \ge \theta_\rho$ **then**
      1. $\rho_i \gets \frac{|S_i|}{|S|}$
      2. estimate $\Delta_i$
    2. **else**
      1. $\rho_i \gets 0$
  4. **return** $\hat{\bar{d}} = \sum_{i=0}^{t-1}(1 + \Delta_i)\,\rho_i\,(1+\beta)^i$

Alg IV

  1. $\alpha \gets n$
  2. $\hat{\bar{d}} \gets -\infty$
  3. **while** $\hat{\bar{d}} < \alpha$ **do**
    1. $\alpha \gets \alpha / 2$
    2. **if** $\alpha < \frac{1}{n}$ **then**
      1. **return** 0
    3. $\hat{\bar{d}} \gets$ Alg III run with the guess $\alpha$
  4. **return** $\hat{\bar{d}}$

Algorithm guarantees

Approximation ratio: $(1 + \epsilon)$

Running time: $\tilde{O}(\sqrt{n/\bar{d}}) \cdot poly(\epsilon^{-1}\log n)$

3 Parallel Computing Algorithms

3.1 Building an inverted index

problem definition? What does the Map function do? What does the Reduce function do?

Given a set of documents, count which documents each word appears in

map: <docID, content> → <word, docID>
reduce: <word, docID> → <word, list of docID>
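A tiny in-memory imitation of this MapReduce job (the driver simply simulates the shuffle by grouping map outputs on their key; all names are mine). The same driver handles Word Count in 3.2 by emitting (word, 1) pairs and summing in the reducer:

```python
from collections import defaultdict

def map_fn(doc_id, content):
    # Map: <docID, content> -> one <word, docID> pair per word
    for word in content.split():
        yield word, doc_id

def reduce_fn(word, doc_ids):
    # Reduce: <word, list of docID> -> <word, sorted list of distinct docIDs>
    return word, sorted(set(doc_ids))

def run_job(docs):
    groups = defaultdict(list)                 # "shuffle": group map outputs by key
    for doc_id, content in docs.items():
        for word, value in map_fn(doc_id, content):
            groups[word].append(value)
    return dict(reduce_fn(w, vals) for w, vals in groups.items())

docs = {1: "big data algorithms", 2: "data streams and sketches"}
print(run_job(docs))   # {'big': [1], 'data': [1, 2], ...}
```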

3.2 Word Count

problem definition? What does the Map function do? What does the Reduce function do?

Given a set of documents, count the number of occurrences of each word

  • Map function: <docID, content> → <word, 1>
  • Reduce function: <word,1>→<word,count>

3.3 Search

problem definition?

Given line numbers and the corresponding document content, find the lines in which a specified word occurs

  • Map function: <lineID, word> → <word, lineID>
  • Reduce function: <word, lineID> → <word, list of lineID>

3.4 Matrix multiplication

problem definition?

Two algorithms: What does the Map function do? What does the Reduce function do?

  1. Matrix multiplication 1
    • Map:
      • ((A, i, j), a_ij) → (j, (A, i, a_ij))
      • ((B, j, k), b_jk) → (j, (B, k, b_jk))
    • Reduce: (j, (A, i, a_ij)), (j, (B, k, b_jk)) → ((i, k), a_ij * b_jk)
    • Map: nothing (identity)
    • Reduce: ((i, k), (v_1, v_2, …)) → ((i, k), Σ v_i)
  2. Matrix multiplication 2
    • Map function:

      • ((A, i, j), a_ij) → ((i, x), (A, j, a_ij)) for all x ∈ [1, n]
      • ((B, j, k), b_jk) → ((y, k), (B, j, b_jk)) for all y ∈ [1, m]
    • Reduce function: ((i, k), (A, j, a_ij)) ∧ ((i, k), (B, j, b_jk)) → ((i, k), Σ_j a_ij * b_jk)
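A compact simulation of the first (two-round) scheme on sparse dict-of-entries matrices: round 1 joins A- and B-entries on the shared index j, round 2 sums the partial products per output cell (all function names are mine):

```python
from collections import defaultdict

def matmul_two_rounds(A, B):
    """A and B are dicts {(row, col): value}. Returns C = A @ B in the same format."""
    # round 1 map: key every entry by the shared index j
    by_j = defaultdict(lambda: ([], []))
    for (i, j), a in A.items():
        by_j[j][0].append((i, a))
    for (j, k), b in B.items():
        by_j[j][1].append((k, b))
    # round 1 reduce: emit ((i, k), a_ij * b_jk) for every matched pair
    partials = []
    for j, (a_side, b_side) in by_j.items():
        for i, a in a_side:
            for k, b in b_side:
                partials.append(((i, k), a * b))
    # round 2: identity map, reduce = sum per (i, k)
    C = defaultdict(int)
    for (i, k), v in partials:
        C[(i, k)] += v
    return dict(C)

A = {(0, 0): 1, (0, 1): 2, (1, 0): 3}
B = {(0, 0): 4, (1, 0): 5}
print(matmul_two_rounds(A, B))   # {(0, 0): 14, (1, 0): 12}
```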

3.5 Sorting algorithm

What does the Map function do? What does the Reduce function do? A key to solving the problem?

Using p processors; the input is <i, A[i]>

  • Map: <i, A[i]> → <j, ((i, A[i]), y)>

    1. Output <i % p, ((i, A[i]), 0)>

    2. With probability T/n, additionally output <j, ((i, A[i]), 1)> for all j ∈ [0, p − 1] (broadcast a small random sample to every reducer)

  • Reduce:

    • Collect the records with y = 1 as S and sort them
    • Construct splitters (s_1, s_2, ..., s_{p−1}), where s_k is the (k·⌈|S|/p⌉)-th element of S
    • Collect the records with y = 0 as D
    • For every (i, x) ∈ D satisfying s_k < x ≤ s_{k+1}, output <k, (i, x)>
  • Map: nothing (identity)

  • Reduce: <j, ((i, A[i]), ...)>

    • Sort all received (i, A[i]) pairs by A[i] and output them

3.6 Computing the minimum spanning tree

The main idea? What does the Map function do? What does the Reduce function do?

Using the graph partition algorithm, the graph G is divided into k subgraphs, and the minimum spanning tree is calculated in each subgraph

The essence of the algorithm is to calculate the spanning tree locally first, then use the remaining edges connecting these spanning trees to form a new graph, and find the minimum spanning tree of this new graph as the total result

  • Map: input <(u, v), NULL>
    • Transform it to <(h(u), h(v)); (u, v)>
    • If h(u) = h(v), then for all j ∈ [1, k], output <(h(u), j); (u, v)>
  • Reduce: input <(i, j); E_ij>
    • Let M_ij = MSF(E_ij), the minimum spanning forest of the edge set E_ij
    • Output <NULL; (u, v)> for each edge e = (u, v) in M_ij
  • Map: nothing (identity)
  • Reduce: M = MST(H), where H is the graph formed by all edges output in the first round

4 External memory model algorithm

4.1 External storage model

In the I/O model, the memory size is ___, the page size is ___, and the external storage size is ___. How many I/Os are required to continuously read N data from the external memory?

M, B, unlimited, O(N/B)

4.2 Computing matrix multiplication

Input two matrices X and Y of size N×N

  1. Divide the matrix into blocks of size ___
  2. Considering each block in the X×Y matrix, there are obviously ___ blocks to output
  3. Each block needs to scan ___ pairs of input blocks
  4. Each in-memory calculation requires ___ I/O
  5. Total___times I/O

$\sqrt{M}/2 \times \sqrt{M}/2$

$O\left(\left(\frac{N}{\sqrt{M}}\right)^2\right)$

$\frac{N}{\sqrt{M}}$

$O(M/B)$

$O\left(\left(\frac{N}{\sqrt{M}}\right)^3 \cdot M/B\right)$

4.3 Data structure

4.3.1 External storage stack

The memory maintains an array of size ___ that implements an in-memory stack; the rest of the data is stored in external memory

How to push the stack (push)?
How to pop the stack (pop)?

I/O cost analysis:
▷ Worst case cost: O(1) times I/O
▷ Amortized analysis: ___, optimal

2B

Push: if the in-memory array is not full, push onto it; if it is full, write B of its elements out to external memory as one block, then push

Pop: if the in-memory array is not empty, pop from it; if it is empty, read one block of B elements back from external memory, then pop

O(1/B)
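A toy simulation of the external-memory stack; a Python list of blocks stands in for the external memory, and the I/O counter only counts block transfers, which is the quantity the amortized O(1/B) bound refers to:

```python
class ExternalStack:
    """Keep up to 2B elements in memory; flush the bottom B as one block when full,
    and read one block back when a pop finds the buffer empty."""
    def __init__(self, B):
        self.B = B
        self.buf = []        # in-memory array, size <= 2B
        self.disk = []       # list of blocks of size B ("external memory")
        self.io = 0          # number of block reads/writes

    def push(self, x):
        if len(self.buf) == 2 * self.B:
            self.disk.append(self.buf[:self.B])     # write out the bottom half
            self.buf = self.buf[self.B:]
            self.io += 1
        self.buf.append(x)

    def pop(self):
        if not self.buf:
            self.buf = self.disk.pop()              # read one block back
            self.io += 1
        return self.buf.pop()

s = ExternalStack(B=4)
for i in range(20):
    s.push(i)
print([s.pop() for _ in range(20)], "block I/Os:", s.io)
```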

4.3.2 External memory linked list

Queue (Queue)
▷ The memory maintains two arrays A and B of size B, one for dequeue and one for enqueue
▷ What do A and B store, respectively?
▷ The rest of the data is stored in external memory
How to handle queue operations?
▷ Insert?
▷ Remove?
I/O cost analysis:
▷ Worst case cost: O(1) I/O
▷ Amortized analysis: ___, optimal

A stores the k elements currently at the head of the queue; B stores the k′ elements most recently enqueued at the tail

Insert: if the tail array B is not full, append to it in memory; if it is full, write it out to external memory as one block and then append

Remove: if the head array A is not empty, pop from it; when A is empty, read the next block from external memory into A (or take the tail array's contents if nothing is left on disk), then pop

O(1/B)
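The matching queue sketch, with a head buffer for dequeues and a tail buffer for enqueues; again a list of blocks stands in for external memory and only block transfers are counted:

```python
from collections import deque

class ExternalQueue:
    """External-memory FIFO queue: `head` serves dequeues, `tail` collects enqueues;
    full tail blocks are flushed in order and read back when the head runs dry."""
    def __init__(self, B):
        self.B = B
        self.head = deque()      # up to B elements at the front of the queue
        self.tail = []           # up to B most recently enqueued elements
        self.disk = deque()      # FIFO list of full blocks ("external memory")
        self.io = 0

    def enqueue(self, x):
        if len(self.tail) == self.B:          # tail buffer full: write it out as one block
            self.disk.append(self.tail)
            self.tail = []
            self.io += 1
        self.tail.append(x)

    def dequeue(self):
        if not self.head:
            if self.disk:                     # read the oldest block back from disk
                self.head = deque(self.disk.popleft())
                self.io += 1
            else:                             # nothing on disk: take the tail buffer directly
                self.head = deque(self.tail)
                self.tail = []
        return self.head.popleft()

q = ExternalQueue(B=4)
for i in range(10):
    q.enqueue(i)
print([q.dequeue() for _ in range(10)], "block I/Os:", q.io)
```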

4.3.3 Linked list

Support three operations: insert(x, p), remove(p), traverse(p, k)

  • Idea 2: block "half full" ⇒ data at least B/2;

    Under the external memory model, the consecutive elements in a linked list are placed in a block of size B. At the same time, make each block size at least B/2:
    ▷ remove: When to merge? Under what circumstances should it be equally divided?
    ▷ insert: under what circumstances should it be equally divided?
    ▷ traverse: ___, the worst case cost of insert and remove is O(1)

    ▷ Amortized cost: N consecutive insertions ___, consecutive deletions ___

    remove: if a block falls below B/2 elements, merge it with a neighboring block; if the merged block then exceeds B, split it evenly

    insert: if a block exceeds B elements, split it evenly

    O(2k/B)

    N consecutive insertions cost O(2N/B); N consecutive deletions cost $O(\log_2 B \cdot N/B)$

  • Idea 3: Two consecutive blocks contain at least 2B/3 data;

    Two consecutive blocks contain at least 2B/3 data; the memory maintenance size is B buffer
    ▷ remove: when to merge? Under what circumstances?
    ▷ insert: When to merge? Under what circumstances is it equally divided?
    ▷ traverse: ___

    ▷ Amortized cost: N consecutive insertions ___, continuous deletions ___
    ▷ Amortized cost: N consecutive updates ___

remove: after a deletion, check whether the block together with an adjacent block holds ≤ 2B/3 elements; if so, merge them

insert: if the current block is full, move an element into a neighboring block; if the neighbors are also full, split the current block evenly

O(3k/B)

O(2N/B),O(3N/B)

O(12N/B)

4.4 Search structure

Perform three operations: insert(x), remove(x), query(x)

(a, b)-tree: what is the relationship between a and b?

Similar to a binary search tree ⇒ each node is $(p_0, k_1, p_1, k_2, p_2, ..., k_c, p_c)$

The root node has __ children; the number of children of each non-leaf node is __:

  • remove: how to operate?
  • insert: how to operate?
  • query: time complexity?

2 ≤ a ≤ (b + 1)/2

The root node has 0 or ≥ 2 children, and the number of children of other non-leaf nodes ∈ [a, b]

remove: if a node has fewer than a children after the deletion, merge it with an adjacent node; if the merged node has more than b children, split it evenly

insert: if a node has more than b children after the insertion, split it evenly

$O(\log_a(N/a))$


insert operation

Assuming the inserted key value is K, first find the corresponding leaf node L

  • If there is free space in L, then insert directly, end;

  • Otherwise, split the leaf node into two nodes, and divide the keys into these two new nodes, so that the number of keys meets the minimum requirement;

    • When splitting leaf node N:

      ▷ Create a new node M as the right sibling of N; sort the keys; the first ⌈(n+1)/2⌉ key-pointer pairs stay in N, and the remaining key-pointer pairs are put into M.

    • When splitting non-leaf node N:
      ▷ Sort the key-pointers; the first ⌈(n+2)/2⌉ pointers stay in N, and the remaining ⌊(n+2)/2⌋ pointers go into M.
      ▷ The first ⌈n/2⌉ keys stay in N, the last ⌊n/2⌋ keys are put into M, and the middle key is pushed up into the parent node, where its pointer points to M.

delete operation

Assuming that the deleted key value is K, first find the corresponding leaf node L.

  • If L still has the minimum number of keys after deleting K, stop

  • Otherwise, you need to do the following:

    ▷ Try to merge L with one of its adjacent siblings (possible when all remaining keys still fit in a single node). The merge removes a key value from the parent node, which is then handled recursively;

    ▷ Otherwise, try to borrow from one of L's adjacent siblings:

    • If a sibling can give L a key-pointer pair and still meet the minimum-key requirement after removing it, then L borrows that key-pointer pair and the corresponding key value is updated;
    • If neither sibling can spare a key-pointer pair, then L is below the minimum while its sibling M has exactly the minimum number of keys, so the two nodes can be merged.

B+ tree – performance

Maximum number of pointers in a node: n; number of records: N
B+-tree insertion: $O(\log_{\lceil n/2 \rceil} N)$
B+-tree deletion: $O(\log_{\lceil n/2 \rceil} N)$

4.5 External memory sorting

▷ Given __ data
▷ Divided into groups of size __, each group can be sorted in memory, requiring __ I/O times
▷ Merge the sorted groups
▷ ___ groups can be merged at a time
▷ I/O cost:
▷ Draw a picture to understand the merge passes

N

O(M),O(M/B)

O(M/B)

$O(\frac{N}{B} \cdot \log_{M/B}\frac{N}{B})$ or $O(\frac{N}{B} \cdot \log_{M/B}\frac{N}{M})$ (the notes are unsure which)
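An in-memory sketch of the two phases (run formation, then repeated M/B-way merges); the lists of runs stand in for files on disk, so only the structure of the passes is illustrated, not the actual I/O:

```python
import heapq
import random

def external_sort(data, M, B):
    """Phase 1: sort runs of size M "in memory" (O(M/B) I/Os each).
    Phase 2: merge up to M/B runs per pass; about log_{M/B}(N/M) passes over the data."""
    runs = [sorted(data[i:i + M]) for i in range(0, len(data), M)]
    fan_in = max(2, M // B)
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []

xs = [random.randrange(1000) for _ in range(100)]
print(external_sort(xs, M=16, B=4) == sorted(xs))   # True
```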

4.6 List Ranking

problem definition? (two questions)

algorithm? (four steps)

Given a linked list L of size N stored in an array (a contiguous region of external memory, with nodes in arbitrary order), compute the rank (position in the list) of every node

Input the external memory linked list L of size N

  1. Find an independent set of vertices X in L
  2. "Skip" the nodes in X to build a new, smaller external memory linked list L'
  3. Solve L' recursively
  4. "Backfill" the nodes in X, and construct the rank of L according to the rank of L'
