1 Sublinear Space Algorithms for Big Data
1.1 Counting in the streaming model
problem definition? What algorithm to use? Algorithmic steps? (Hint: three layers of progression)
Chebyshev's inequality? how to prove? Expectation, variance, space complexity?
Very limited space to store huge numbers
Morris, Morris+, Morris++
Increment $X$ with probability $1/2^X$ => return $\hat{f} = 2^X - 1$
$E[\gamma] = N$
$D[\gamma] = \frac{N^2 - N}{2k}$
$P[|\gamma - N| \geq \epsilon] \leq \frac{N^2 - N}{2k\epsilon^2}$
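A minimal sketch of the Morris counter and its averaged variant (Morris+), assuming Python's `random.Random` as the source of randomness. The names `MorrisCounter` and `morris_plus` are illustrative only.

```python
import random


class MorrisCounter:
    """Approximate counter that stores only X, roughly log2 of the count."""

    def __init__(self, rng):
        self.x = 0
        self.rng = rng

    def update(self):
        # Increment X with probability 1 / 2^X.
        if self.rng.random() < 1.0 / (1 << self.x):
            self.x += 1

    def estimate(self):
        # The estimator 2^X - 1 has expectation equal to the true count.
        return (1 << self.x) - 1


def morris_plus(n_events, k, seed=0):
    """Morris+: average k independent counters, which divides the variance by k."""
    rng = random.Random(seed)
    counters = [MorrisCounter(rng) for _ in range(k)]
    for _ in range(n_events):
        for c in counters:
            c.update()
    return sum(c.estimate() for c in counters) / k
```

Morris++ would run several independent copies of `morris_plus` and take their median to boost the success probability.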
1.2 Counting distinct elements
problem definition? What algorithm to use? Algorithmic steps?
(Hint: when real numbers can be stored: three levels of progression; when they cannot: 1+1)
how to prove? Expectation, variance, space complexity?
Count the number of unique elements in a data stream
FM, FM+
$h:[n] \mapsto [0,1]$ => $z = \min\{z, h(i)\}$ => return $\frac{1}{z} - 1$
$E[Z] = \frac{1}{d+1}$
$var[Z] \leq \frac{2}{(d+1)(d+2)} \cdot \frac{1}{q} < \frac{2}{(d+1)(d+1)} \cdot \frac{1}{q}$
$P[|X - d| > \epsilon' d] < \frac{2}{q}\left(\frac{2}{\epsilon'} + 1\right)^2$
FM'+
Maintain the smallest $k$ hash values seen so far; return $\frac{k}{z_k}$
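The FM'+ idea (keep the $k$ smallest hash values, return $k/z_k$) can be sketched as follows. The hash `h01` is a hypothetical stand-in for an ideal hash into $[0,1)$, built here from SHA-256. Returning the exact count when fewer than $k$ distinct values have been seen is an added convenience, not part of the notes.

```python
import hashlib
import heapq


def h01(item):
    """Hypothetical stand-in for an ideal hash [n] -> [0, 1)."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(stream, k):
    # Max-heap (values negated) holding the k smallest distinct hash values.
    heap, members = [], set()
    for item in stream:
        z = h01(item)
        if z in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -z)
            members.add(z)
        elif z < -heap[0]:
            # Replace the current k-th smallest value with the new one.
            evicted = -heapq.heappushpop(heap, -z)
            members.discard(evicted)
            members.add(z)
    if len(heap) < k:
        return len(heap)        # fewer than k distinct values: count is exact
    return k / (-heap[0])       # z_k is the k-th smallest hash value
```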
PracticalFM
If $zeros(h(j)) > z$: $z = zeros(h(j))$
return $\hat{d} = 2^{z + \frac{1}{2}}$
$E[Y_r] = \frac{d}{2^r}$
$var[Y_r] \leq \frac{d}{2^r}$
The final success probability is greater than $1 - \frac{2\sqrt{2}}{C}$.
BJKST
If $zeros(h(j)) \geq z$
- $B = B \cup \{(g(j), zeros(h(j)))\}$
- While $|B| > \frac{c}{\epsilon^2}$
  - $z = z + 1$
  - remove every $(\alpha, \beta)$ with $\beta < z$ from $B$
return $\hat{d} = |B| \cdot 2^z$
1.3 Point query
problem definition? What algorithm to use? Algorithmic steps?
Space complexity?
Count the number of occurrences of all elements in the stream
Misra-Gries
Maintain a set $A$ whose elements are $(i, \hat{f_i})$
$A \leftarrow \emptyset$
For each element $e$ in the data stream
if $e \in A$: set $(e, \hat{f_e}) \rightarrow (e, \hat{f_e} + 1)$
else if $|A| < \frac{1}{\epsilon}$: insert $(e, 1)$ into $A$
else
- Decrement all counts in $A$ by 1
- if $\hat{f_j} = 0$: remove $(j, 0)$ from $A$
For a query $i$: if $i \in A$, return $\hat{f_i}$; otherwise return 0
The space cost is $O(\epsilon^{-1} \log n)$
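The Misra-Gries steps can be sketched with a plain dictionary standing in for the set $A$. The function names are illustrative.

```python
def misra_gries(stream, eps):
    """Return {item: estimated count}, keeping at most 1/eps counters."""
    counts = {}
    for e in stream:
        if e in counts:
            counts[e] += 1
        elif len(counts) < 1 / eps:
            counts[e] = 1
        else:
            # Decrement every counter and drop those that reach zero.
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    return counts


def query(counts, i):
    # Items that are not tracked are reported with frequency 0.
    return counts.get(i, 0)
```

For a stream of length $m$, each estimate satisfies $f_i - \epsilon m \leq \hat{f_i} \leq f_i$, since every decrement step removes at least $1/\epsilon + 1$ stream elements from consideration.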
Metwally
- For each element $e$ in the data stream
  - if $e \in A$: set $(e, \hat{f_e}) \leftarrow (e, \hat{f_e} + 1)$
  - else if $|A| < \frac{1}{\epsilon}$: insert $(e, 1)$ into $A$
  - else: insert $(e, MIN + 1)$ into $A$ and delete one entry whose count equals $MIN$, where $MIN$ is the smallest count in $A$
- For a query $i$: if $i \in A$, return $\hat{f_i}$; otherwise return $MIN$
The space cost is $O(\epsilon^{-1} \log n)$
Count-Min
Randomly select $t$ 2-wise independent hash functions $h_i:[n] \rightarrow [k]$
For each update $(j, c)$, do the following
for $i = 1$ to $t$
$C[i][h_i(j)] = C[i][h_i(j)] + c$
For a query $a$, return $\hat{f_a} = \min_{1 \leq i \leq t} C[i][h_i(a)]$
Count-Median (same as Count-Min, with the minimum replaced by the median)
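A minimal Count-Min sketch, assuming integer keys and the common construction $h(x) = ((ax + b) \bmod P) \bmod k$ for the hash family. The prime `P` and the class name are choices made for this illustration.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, assumed larger than any key


class CountMin:
    def __init__(self, t, k, seed=0):
        rng = random.Random(seed)
        self.k = k
        # One (a, b) pair per row defines h_i(x) = ((a*x + b) mod P) mod k.
        self.ab = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]
        self.C = [[0] * k for _ in range(t)]

    def _h(self, i, j):
        a, b = self.ab[i]
        return ((a * j + b) % P) % self.k

    def update(self, j, c=1):
        for i in range(len(self.C)):
            self.C[i][self._h(i, j)] += c

    def query(self, a):
        # With non-negative updates every row overestimates, so take the min.
        return min(self.C[i][self._h(i, a)] for i in range(len(self.C)))
```

Replacing `min` with a median in `query` gives Count-Median, which also tolerates negative updates.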
Count Sketch
Randomly select a 2-wise independent hash function $h:[n] \rightarrow [k]$
Randomly select a 2-wise independent hash function $g:[n] \rightarrow \{-1, 1\}$
For each update $(j, c)$
$C[h(j)] = C[h(j)] + c \cdot g(j)$
For a query $a$, return $\hat{f} = g(a) \cdot C[h(a)]$
Count Sketch+
(equivalent to running the Count Sketch algorithm $t$ times and taking the median of the results)
Randomly select $t$ 2-wise independent hash functions $h_i:[n] \rightarrow [k]$
Randomly select $t$ 2-wise independent hash functions $g_i:[n] \rightarrow \{-1, 1\}$
For each update $(j, c)$
For $i: 1 \rightarrow t$
$C[i][h_i(j)] = C[i][h_i(j)] + c \cdot g_i(j)$
Return $\hat{f} = median_{1 \leq i \leq t}\, g_i(a)\, C[i][h_i(a)]$
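A sketch of Count Sketch+ with $t$ rows, assuming integer keys. Both hash families use the $((ax + b) \bmod P)$ construction; reducing it modulo 2 for the sign is only approximately uniform, which is accepted here for brevity.

```python
import random
from statistics import median

P = (1 << 61) - 1  # a Mersenne prime, assumed larger than any key


class CountSketch:
    def __init__(self, t, k, seed=0):
        rng = random.Random(seed)
        self.k = k
        self.h = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]
        self.g = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]
        self.C = [[0] * k for _ in range(t)]

    def _bucket(self, i, j):
        a, b = self.h[i]
        return ((a * j + b) % P) % self.k

    def _sign(self, i, j):
        a, b = self.g[i]
        return 1 if ((a * j + b) % P) % 2 else -1

    def update(self, j, c=1):
        for i in range(len(self.C)):
            self.C[i][self._bucket(i, j)] += c * self._sign(i, j)

    def query(self, a):
        # Each row gives an unbiased estimate; the median damps outliers.
        return median(self._sign(i, a) * self.C[i][self._bucket(i, a)]
                      for i in range(len(self.C)))
```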
1.4 Frequency moment estimation
problem definition? What algorithm to use? Algorithmic steps?
Expectation, variance, space complexity?
(Omitted.)
1.5 Fixed-Size Sampling
problem definition? What algorithm to use? Algorithmic steps?
Expectation, variance, space complexity?
Reservoir Sampling Algorithm
- Initialize the sampling array $A[1, \dots, s]$ using the first $s$ elements of the data stream; $m \leftarrow s$
- For each update $x$
  - With probability $\frac{s}{m + 1}$, $x$ replaces a uniformly random element of $A$
  - m++
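A short sketch of reservoir sampling following these steps. The optional `rng` parameter is an addition that makes runs reproducible.

```python
import random


def reservoir_sample(stream, s, rng=None):
    rng = rng or random.Random()
    A, m = [], 0
    for x in stream:
        if m < s:
            A.append(x)                    # the first s elements fill A
        elif rng.random() < s / (m + 1):
            A[rng.randrange(s)] = x        # evict a uniformly random slot
        m += 1
    return A
```

After $m$ elements, each one is in `A` with probability $s/m$, which can be checked by induction on $m$.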
1.6 Bloom Filter
problem definition? What algorithm to use? Algorithmic steps?
Given a small set $S$ drawn from a large universe $U$ and a query $q$, decide whether $q$ belongs to $S$
Approximate hashing
- Let $H$ be a family of universal hash functions $[U] \rightarrow [m]$, with $m = \frac{n}{\delta}$
- Randomly select $h \in H$ and maintain the array $A[m]$; the size of $S$ is $n$
- For each $i \in S$: $A[h(i)] = 1$
- Given a query $q$, return yes if and only if $A[h(q)] = 1$
Bloom Filter
Let $H$ be a family of independent ideal hash functions $[U] \rightarrow [m]$
Randomly select $h_1, \dots, h_d \in H$ and maintain the array $A[m]$
For each $i \in S$
For each $j \in [1, d]$
$A[h_j(i)] = 1$
Given a query $q$, return yes if and only if $\forall j \in [d],\ A[h_j(q)] = 1$
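A minimal Bloom filter sketch. The $d$ ideal hash functions are simulated by salting SHA-256 with the index $j$, which is an assumption of this example rather than something the notes specify.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, d):
        self.m, self.d = m, d
        self.A = bytearray(m)  # one byte per bit, for clarity

    def _positions(self, item):
        # h_j(item) for j = 0..d-1, simulated with a salted SHA-256.
        for j in range(self.d):
            digest = hashlib.sha256(f"{j}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.A[pos] = 1

    def __contains__(self, item):
        # "Yes" only if every probed cell is set; false positives are possible.
        return all(self.A[pos] for pos in self._positions(item))
```

A query for an inserted element always answers yes; only elements outside $S$ can be misclassified.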
2 Sublinear Time Algorithms for Big Data
2.1 Find the number of connected components
problem definition? What algorithm to use? Algorithmic steps?
Calculation formula? time complexity?
Search from a node $u$: while fewer than $\frac{2}{\epsilon}$ nodes have been visited, keep searching; otherwise stop and return $\frac{2}{\epsilon}$. If the search exhausts the component first, return its size. The result is $\hat{n_u}$.
Randomly select $r = b/\epsilon^2$ nodes from the node set to form a set $U$, and apply this procedure to each node in $U$
Finally, $\hat{C} = \frac{n}{r} \sum_{u \in U} \frac{1}{\hat{n_u}}$; the time complexity is $O(d/\epsilon^3)$
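The estimator can be sketched as below, assuming the graph is given as an adjacency list `adj` over vertices `0..n-1` and using BFS as the search. The constant `b = 4` is an arbitrary choice for illustration.

```python
import random
from collections import deque


def truncated_component_size(adj, u, cap):
    """BFS from u, stopping once `cap` vertices have been found."""
    seen, queue = {u}, deque([u])
    while queue and len(seen) < cap:
        v = queue.popleft()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
                if len(seen) >= cap:
                    break
    return min(len(seen), cap)


def estimate_components(adj, eps, b=4, rng=None):
    rng = rng or random.Random()
    n = len(adj)
    cap = max(1, int(2 / eps))        # truncation threshold 2 / eps
    r = max(1, int(b / eps ** 2))     # number of sampled vertices
    total = sum(1 / truncated_component_size(adj, rng.randrange(n), cap)
                for _ in range(r))
    return n / r * total
```

The estimator rests on the identity $C = \sum_{u} 1/n_u$: the vertices of one component each contribute $1/n_u$, which sums to 1 per component.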
2.2 Approximate minimum spanning tree
problem definition? What algorithm to use? Algorithmic steps?
Calculation formula? time complexity?
Notation: $G^{(i)} = (V, E^{(i)})$, $E^{(i)} = \{(u,v) \mid w_{uv} \leq i\}$; the number of connected components of $G^{(i)}$ is $C^{(i)}$
The MST weight $M$ can be written in terms of the component counts of these subgraphs: $M = n - w + \sum_{i=1}^{w-1} C^{(i)}$
With $\alpha_i$ denoting the number of MST edges of weight $i$ (edge weights are integers in $[1, w]$):
\begin{align*}
M &= \sum_{i=1}^{w} i \cdot \alpha_i = \sum_{i=1}^{w} \alpha_i + \sum_{i=2}^{w} \alpha_i + \dots + \sum_{i=w}^{w} \alpha_i \\
&= C^{(0)} - 1 + C^{(1)} - 1 + \dots + C^{(w-1)} - 1 \\
&= n - 1 + C^{(1)} - 1 + \dots + C^{(w-1)} - 1 \\
&= n - w + \sum_{i=1}^{w-1} C^{(i)}
\end{align*}
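As a sanity check of the identity, the sketch below computes the exact component counts $C^{(i)}$ with a union-find and compares the formula with Kruskal's algorithm. It assumes a connected graph with integer weights in $[1, w]$. The sublinear algorithm would replace the exact counts with estimates.

```python
class DSU:
    def __init__(self, n):
        self.parent = list(range(n))
        self.count = n  # current number of components

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, u, v):
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return False
        self.parent[ru] = rv
        self.count -= 1
        return True


def mst_weight_by_components(n, w, edges):
    """M = n - w + sum_{i=1}^{w-1} C^(i), with edges given as (u, v, weight)."""
    total = n - w
    for i in range(1, w):
        dsu = DSU(n)
        for u, v, wt in edges:
            if wt <= i:
                dsu.union(u, v)
        total += dsu.count
    return total


def mst_weight_kruskal(n, edges):
    dsu = DSU(n)
    return sum(wt for u, v, wt in sorted(edges, key=lambda e: e[2])
               if dsu.union(u, v))
```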
2.3 Find the diameter of the point set
problem definition? What algorithm to use? Algorithmic steps?
Calculation formula? time complexity?
Indyk's algorithm
- Pick an arbitrary $k \in [1, m]$
- Select $l$ such that $\forall i,\ D_{ki} \leq D_{kl}$
- Return $(k, l), D_{kl}$
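A sketch of the three steps, assuming the points live in a metric space and `dist` is a caller-supplied distance function (both names are illustrative). It reads only one row of the distance matrix, so it takes $O(m)$ time.

```python
import random


def indyk_diameter(points, dist, rng=None):
    rng = rng or random.Random()
    k = rng.randrange(len(points))                     # arbitrary start point
    l = max(range(len(points)),
            key=lambda i: dist(points[k], points[i]))  # farthest point from k
    return (k, l), dist(points[k], points[l])
```

By the triangle inequality, any pair $(i, j)$ satisfies $D_{ij} \leq D_{ik} + D_{kj} \leq 2 D_{kl}$, so the returned value is at least half the true diameter.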
2.4 Estimating the average degree of a graph
Alg III
- Sample a set $S$ of vertices independently and uniformly at random from $V$, with $|S| = \tilde{O}(\frac{L}{\rho\epsilon^2})$, $L = poly(\frac{\log n}{\epsilon})$, $\rho = \frac{1}{t}\sqrt{\frac{\epsilon}{4} \cdot \frac{\alpha}{n}}$
- $S_i \gets S \cap B_i$
- for $i \in \{0, \dots, t-1\}$ do
  - if $|S_i| \geq \theta_\rho$ then
    - $\rho_i \gets \frac{|S_i|}{|S|}$
    - estimate $\Delta_i$
  - else
    - $\rho_i \gets 0$
- return $\hat{\bar{d}} = \sum_{i=0}^{t-1} (1 + \Delta_i)\, \rho_i\, (1 + \beta)^i$
Alg IV
- $\alpha \gets n$
- $\hat{\bar{d}} \gets -\infty$
- while $\hat{\bar{d}} < \alpha$ do
  - $\alpha \gets \alpha / 2$
  - if $\alpha < \frac{1}{n}$ then
    - return 0
  - $\hat{\bar{d}} \gets AlgIII_{\sim\alpha}$
- return $\hat{\bar{d}}$
Algorithm guarantees
Approximation ratio: $(1 + \epsilon)$
Running time: $\tilde{O}(\sqrt{n}) \cdot poly(\epsilon^{-1} \log n) \sqrt{n/\bar{d}}$
3 Parallel Computing Algorithms
3.1 Building an inverted index
problem definition? What does the Map function do? What does the Reduce function do?
Given a set of documents, count which documents each word appears in
map: <docID, content> → <word, docID>
reduce: <word, docID> → <word, list of docID>
3.2 Word Count
problem definition? What does the Map function do? What does the Reduce function do?
Given a set of documents, count the number of occurrences of each word
- Map function: <docID, content> → <word, 1>
- Reduce function: <word,1>→<word,count>
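The Map and Reduce functions for word count and for the inverted index can be simulated in a single process. `run_mapreduce` is a hypothetical driver that plays the role of the shuffle phase; it is not a real framework API.

```python
from collections import defaultdict


def map_word_count(doc_id, content):
    for word in content.split():
        yield word, 1


def reduce_word_count(word, values):
    return word, sum(values)


def map_inverted_index(doc_id, content):
    for word in content.split():
        yield word, doc_id


def reduce_inverted_index(word, doc_ids):
    return word, sorted(set(doc_ids))


def run_mapreduce(records, mapper, reducer):
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)   # shuffle: group by key
    return dict(reducer(k, vs) for k, vs in groups.items())
```

For example, `run_mapreduce([(1, "a b a"), (2, "b c")], map_word_count, reduce_word_count)` groups the emitted pairs by word before summing them.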
3.3 Search
problem definition?
Given a line number and corresponding document content, count the occurrences of a specified word
- Map function: <lineID, word> → <word, lineID>
- Reduce function: <word, lineID> → <word, list of lineID>
3.4 Matrix multiplication
problem definition?
Two algorithms: What does the Map function do? What does the Reduce function do?
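- Matrix multiplication 1 (two rounds)
- Round 1, Map:
  - $((A,i,j), a_{ij}) \rightarrow (j, (A,i,a_{ij}))$
  - $((B,j,k), b_{jk}) \rightarrow (j, (B,k,b_{jk}))$
- Round 1, Reduce: $(j, (A,i,a_{ij})), (j, (B,k,b_{jk})) \rightarrow ((i,k), a_{ij} \cdot b_{jk})$
- Round 2, Map: nothing (identity)
- Round 2, Reduce: $((i,k), (v_1, v_2, \dots)) \rightarrow ((i,k), \sum_l v_l)$
- Matrix multiplication 2 (one round)
- Map:
  - $((A,i,j), a_{ij}) \rightarrow ((i,x), (A,j,a_{ij}))$ for all $x \in [1,n]$
  - $((B,j,k), b_{jk}) \rightarrow ((y,k), (B,j,b_{jk}))$ for all $y \in [1,m]$
- Reduce: $((i,k), (A,j,a_{ij})) \wedge ((i,k), (B,j,b_{jk})) \rightarrow ((i,k), \sum_j a_{ij} \cdot b_{jk})$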
3.5 Sorting algorithm
What does the Map function do? What does the Reduce function do? A key to solving the problem?
Using $p$ processors, input <i, A[i]>
Map: <i, A[i]> → <j, ((i, A[i]), y)>
- Output <i % p, ((i, A[i]), 0)>
- With probability T/n, output <j, ((i, A[i]), 1)> for all $j \in [0, p-1]$
- Otherwise output <j, ((i, A[i]), 0)>
Reduce:
- Collect the records with y = 1 as $S$ and sort them
- Construct $(s_1, s_2, \dots, s_{p-1})$, where $s_k$ is the $k\lceil \frac{|S|}{p} \rceil$-th element of $S$
- Collect the records with y = 0 as $D$
- For each $(i, x) \in D$ with $s_k < x \leq s_{k+1}$, output <k, (i, x)>
Map: nothing (identity)
Reduce: <j, ((i, A[i]), ...)>
- Sort all received (i, A[i]) by A[i] and output
3.6 Computing the minimum spanning tree
The main idea? What does the Map function do? What does the Reduce function do?
Using the graph partition algorithm, the graph G is divided into k subgraphs, and the minimum spanning tree is calculated in each subgraph
The essence of the algorithm is to calculate the spanning tree locally first, then use the remaining edges connecting these spanning trees to form a new graph, and find the minimum spanning tree of this new graph as the total result
- Map: input <(u,v), NULL>
  - Transform it to <(h(u),h(v)); (u,v)>
  - For the transformed record, if h(u) = h(v), then output <(h(u),j); (u,v)> for all j ∈ [1,k]
- Reduce: input <(i,j); $E_{ij}$>
  - Let $M_{ij} = MSF(E_{ij})$
  - Output <NULL; (u,v)> for each edge e = (u,v) in $M_{ij}$
- Map: nothing (identity)
- Reduce: M = MST(H), where H is the graph formed by the edges output in the first round
4 External memory model algorithm
4.1 External storage model
In the I/O model, the memory size is ___, the page size is ___, and the external storage size is ___. How many I/Os are required to continuously read N data from the external memory?
M, B, Unlimited, N/B
4.2 Computing matrix multiplication
Input two matrices X and Y of size N×N
- Divide the matrix into blocks of size ___
- Considering each block in the X×Y matrix, there are obviously ___ blocks to output
- Each block needs to scan ___ pairs of input blocks
- Each in-memory calculation requires ___ I/O
- Total___times I/O
$\sqrt{M}/2 \times \sqrt{M}/2$
$O((\frac{N}{\sqrt{M}})^2)$
$\frac{N}{\sqrt{M}}$
$O(M/B)$
$O((\frac{N}{\sqrt{M}})^3 \cdot M/B)$
4.3 Data structure
4.3.1 External storage stack
The memory maintains an array of size ___, realizes the memory stack structure, and stores the rest of the data in the external memory
How to push the stack (push)?
How to pop the stack (pop)?
I/O cost analysis:
▷ Worst case cost: O(1) times I/O
▷ Amortized analysis: ___, optimal
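2B
Push: if the in-memory array is not full, push directly; if it is full, write a block out to external memory and then push
Pop: if the in-memory array is not empty, pop directly; if it is empty, read a block in from external memory and then pop
O(1/B)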
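A small simulation of the external memory stack, assuming an in-memory buffer of capacity 2B and a Python list of blocks standing in for the disk. The `io` counter is an illustrative addition that tracks block transfers.

```python
class ExternalStack:
    def __init__(self, B):
        self.B = B
        self.buf = []    # in-memory array, at most 2B items
        self.disk = []   # simulated external memory: blocks of B items
        self.io = 0      # number of block transfers so far

    def push(self, x):
        if len(self.buf) == 2 * self.B:
            # Full: write the oldest B items out as one block (1 I/O).
            self.disk.append(self.buf[:self.B])
            self.buf = self.buf[self.B:]
            self.io += 1
        self.buf.append(x)

    def pop(self):
        if not self.buf:
            if not self.disk:
                raise IndexError("pop from empty stack")
            self.buf = self.disk.pop()   # read one block back (1 I/O)
            self.io += 1
        return self.buf.pop()
```

After any I/O the buffer holds about B items, so at least B further operations pass before the next I/O. This gives the amortized O(1/B) cost per operation.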
4.3.2 External memory queue
▷ The memory maintains two arrays A and B of size B, one for dequeue and one for enqueue
▷ What do A and B each store?
▷ The rest of the data is stored in external memory
How to handle queue operations?
▷ Insert?
▷ Remove?
I/O cost analysis:
▷ Worst case cost: O(1) I/O
▷ Amortized analysis: ___, optimal
A stores the k items at the head of the queue; B stores the k′ items at the tail
Insert: if B is not full, append to B in memory; if it is full, write B out to external memory and then append
Remove: if A is not empty, remove from A; if it is empty, read a block in from external memory and then remove
O(1/B)
4.3.3 Linked list
Perform three operations: insert(x,p), remove(p), traverse(p,k)
Idea 2: keep every block at least "half full" ⇒ each block holds at least B/2 items
Under the external memory model, consecutive elements of the linked list are placed in blocks of size B, and every block is kept at least B/2 full:
▷ remove: when to merge? When to split evenly?
▷ insert: when to split evenly?
▷ traverse: ___; the worst-case cost of insert and remove is O(1)
▷ Amortized cost: N consecutive insertions ___, N consecutive deletions ___
remove: if a block drops below B/2, merge it with a neighbor block; if the merged block exceeds B, split it evenly
insert: if a block exceeds B, split it evenly
O(2k/B)
N consecutive insertions $O(2N/B)$, consecutive deletions $O(\log_2 B \cdot N/B)$
Idea 3: any two consecutive blocks together hold at least 2B/3 items; memory maintains a buffer of size B
▷ remove: under what circumstances to merge?
▷ insert: when to use a neighbor? Under what circumstances to split evenly?
▷ traverse: ___
▷ Amortized cost: N consecutive insertions ___, N consecutive deletions ___
▷ Amortized cost: N consecutive updates ___
remove: after deleting, check whether some pair of adjacent blocks now holds ≤ 2B/3 items in total; if so, merge them
insert: if the current block is full, insert into a neighbor; if the neighbors are also full, split the current block evenly
O(3k/B)
O(2N/B), O(3N/B)
O(12N/B)
4.4 Search structure
Perform three operations: insert(x), remove(x), query(x)
$(a,b)$-tree: what is the relationship between $a$ and $b$?
Similar to a binary search tree ⇒ each node stores $(p_0, k_1, p_1, k_2, p_2, \dots, k_c, p_c)$
The root node has __ children; the number of children of each non-leaf node is __:
- remove: how to operate?
- insert: how to operate?
- query: time complexity?
2 ≤ a ≤ (b + 1)/2
The root node has 0 or ≥ 2 children; every other non-leaf node has between a and b children
remove: if a node has fewer than a children after the deletion, merge it with an adjacent node; if the merged node has more than b, split it evenly
insert: if a node has more than b children after the insertion, split it evenly
query: $O(\log_a(N/a))$
Insert operation
Assume the inserted key is K. First find the corresponding leaf node L.
- If L has free space, insert directly and stop.
- Otherwise, split the leaf into two nodes and distribute the keys between them so that each meets the minimum key requirement.
- When splitting a leaf node N:
  ▷ Create a new node M as the right sibling of N. Sort the keys: the first $\lceil (n+1)/2 \rceil$ key-pointers stay in N, and the rest go into M.
- When splitting a non-leaf node N:
  ▷ Sort the key-pointers: the first $\lceil (n+2)/2 \rceil$ pointers stay in N, and the remaining $\lfloor (n+2)/2 \rfloor$ pointers go into M.
  ▷ The first $\lceil n/2 \rceil$ keys stay in N and the last $\lfloor n/2 \rfloor$ keys go into M; the middle key is inserted into the parent node, with its pointer pointing to M.
Delete operation
Assume the deleted key is K. First find the corresponding leaf node L.
- If L still has the minimum number of keys after deleting K, stop.
- Otherwise:
  ▷ Try to merge L with one of its adjacent siblings (possible only if the merged contents fit in a single node). A merge removes one key from the parent node, which is then handled recursively.
  ▷ Otherwise, consider L's adjacent siblings:
    - If one of them can give L a key-pointer and still meet the minimum key requirement afterwards, L borrows a key-pointer pair from that sibling and the corresponding key is updated.
    - If neither sibling can give up a key-pointer, then L has fewer than the minimum number of keys and its sibling M has exactly the minimum, so the two nodes can be merged.
B+ tree – performance
Maximum number of pointers in a node: n, number of records: N
B+ tree insertion: $O(\log_{\lceil n/2 \rceil}(N))$
B+ tree deletion: $O(\log_{\lceil n/2 \rceil}(N))$
4.5 External memory sorting
▷ Given __ data items
▷ Divide them into groups of size __; each group can be sorted in memory, requiring __ I/Os
▷ Merge the sorted groups
▷ Each pass can merge __ groups
▷ I/O cost:
▷ Draw a diagram to understand it
N
O(M),O(M/B)
O(M/B)
$O(N/B \cdot \log_{M/B} \frac{N}{B})$ or $O(N/B \cdot \log_{M/B} \frac{N}{M})$ (unsure which)
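An in-memory simulation of external merge sort that only tracks the number of merge passes. It assumes a fan-in of M/B runs per merge; a real implementation would reserve one block for output. `heapq.merge` stands in for the block-wise multiway merge.

```python
import heapq


def external_merge_sort(data, M, B):
    # Run formation: sort chunks of M items, each of which fits in memory.
    runs = [sorted(data[i:i + M]) for i in range(0, len(data), M)]
    fan_in = max(2, M // B)   # number of runs merged at once
    passes = 0
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
        passes += 1
    return (runs[0] if runs else []), passes
```

Each pass reads and writes all N items once, costing O(N/B) I/Os, and the number of passes is about $\log_{M/B}(N/M)$ after run formation.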
4.6 List Ranking
problem definition? (two questions)
algorithm? (four steps)
Given an adjacency linked list L of size N, L is stored in an array (contiguous external storage space), and the rank (serial number in the linked list) of each node is calculated
Input the external memory linked list L of size N
- Find an independent set of vertices X in L
- "Skip" the nodes in X to build a new, smaller external memory linked list L'
- Solve L' recursively
- "Backfill" the nodes in X, and construct the rank of L according to the rank of L'