Algorithms2-week4-hashtable

Hash Table: Supported Operations
Purpose: maintain a (possibly evolving) set of stuff.
(transactions, people+associated data, IP address, etc)
Insert: add new record.
Delete: delete existing record.
Lookup: check for a particular record (a “dictionary”)
Applications:
1. Application: De-Duplication
Given: a “stream” of objects.
(Linear scan through a huge file. Or objects arriving in real time)
Goal: remove duplicates (keep track of unique objects)
report unique visitors to web site
avoid duplicates in search results.
Solution: when new object x arrives
lookup x in hash table H
if not found, Insert x into H.
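A minimal sketch of the de-duplication loop above, assuming Python's built-in set plays the role of the hash table H. The name deduplicate and the sample visitor list are made up for illustration.

```python
def deduplicate(stream):
    """Yield each distinct object from `stream` once, in first-seen order."""
    seen = set()  # Python's built-in set is backed by a hash table
    for x in stream:
        if x not in seen:   # Lookup x in H
            seen.add(x)     # not found: Insert x into H
            yield x


# Hypothetical stream of visitor IP addresses with repeats.
visitors = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.2"]
unique_visitors = list(deduplicate(visitors))
```

Each arriving object costs one Lookup and at most one Insert, so the whole stream is processed in time linear in its length, assuming constant-time hash table operations.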
2. The 2-SUM Problem
Input: unsorted array A of n integers. Target sum t.
Goal: determine whether or not there are two numbers x, y in A with
x + y = t
Naive Solution: $\Theta(n^2)$ time via exhaustive search
Better:
1.) sort A ($\Theta(n \log n)$ time)
2.) for each x in A, look for t-x in A via binary search.
Amazing:
1.) insert elements of A into hash table H.
2.) for each x in A, Lookup t-x in H. $\Theta(n)$ time overall.
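A sketch of the hash-table solution to 2-SUM. It is a slight variant of the two steps above: it does the Lookup of t-x before inserting x, which assumes x and y must come from two different positions of A (the notes do not say either way). The name has_two_sum is illustrative.

```python
def has_two_sum(A, t):
    """Return True if two entries at different positions of A sum to t."""
    seen = set()              # hash table H of the elements scanned so far
    for x in A:
        if t - x in seen:     # Lookup t-x
            return True
        seen.add(x)           # Insert x
    return False
```

For example, has_two_sum([3, 8, 1, 5], 9) finds 8 + 1, while a target of 10 fails because 5 appears only once.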
3. Further Immediate Applications
Historical application : symbol tables in compilers.
Blocking network traffic.
Search algorithms (game tree exploration)
Use hash table to avoid exploring any configuration
(arrangement of chess pieces ) more than once.
4. High-Level Idea.
Setup: universe U [all IP addresses, all names, all chessboard configurations, etc.] [generally really big]
Goal: want to maintain an evolving set $S \subseteq U$
[generally, of reasonable size].
Solution:
1.) pick n = number of buckets.
2.) choose a hash function $h : U \to \{0, 1, 2, \dots, n-1\}$: it takes a key as input and returns a position between 0 and n-1.
3.) use array A of length n, store x in A[h(x)].
About naive solutions:
1. Array-based solution [indexed by u]
$O(1)$ operations but $\Theta(|U|)$ space.
2. List-based solution. $\Theta(|S|)$ space but $\Theta(|S|)$ Lookup.
5. Resolving Collisions.
Collision: distinct $x, y \in U$ such that $h(x) = h(y)$, i.e., the hash function returns the same position for different keys.
1.) Solution #1: (separate) chaining,
keep linked list in each bucket.
given a key/object x, perform Insert/Delete/Lookup in the list in A[h(x)]. (h(x): bucket for x; A[h(x)]: linked list for x).
2.) Solution #2: open addressing. (only one object per bucket)
Hash function now specifies probe sequence $h_1(x), h_2(x), \dots$
Examples: linear probing (look consecutively: 17, then 18, 19, ...).
Double hashing (the first hash function specifies the initial bucket that you probe, the second one specifies the offset for each subsequent probe).
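The two collision-resolution strategies can be sketched side by side. This is a simplified illustration, not a production design: Python's built-in hash() stands in for the hash function h, Python lists stand in for linked lists, and the class names are made up. Deletion is left out of the open-addressing sketch because it needs extra care (removing a key can break another key's probe sequence).

```python
class ChainedHashTable:
    """Separate chaining: each bucket holds a list of keys."""

    def __init__(self, n=8):
        self.n = n
        self.buckets = [[] for _ in range(n)]

    def _h(self, x):
        # Assumption: built-in hash() reduced mod n plays the role of h.
        return hash(x) % self.n

    def insert(self, x):
        chain = self.buckets[self._h(x)]
        if x not in chain:
            chain.append(x)

    def lookup(self, x):
        return x in self.buckets[self._h(x)]

    def delete(self, x):
        chain = self.buckets[self._h(x)]
        if x in chain:
            chain.remove(x)


_EMPTY = object()  # sentinel marking an unused bucket


class LinearProbingTable:
    """Open addressing with linear probing (insert and lookup only)."""

    def __init__(self, n=8):
        self.n = n
        self.slots = [_EMPTY] * n

    def _probe(self, x):
        # Probe sequence: h(x), h(x)+1, h(x)+2, ... wrapping around.
        start = hash(x) % self.n
        for i in range(self.n):
            yield (start + i) % self.n

    def insert(self, x):
        for pos in self._probe(x):
            if self.slots[pos] is _EMPTY or self.slots[pos] == x:
                self.slots[pos] = x
                return
        raise RuntimeError("table is full")

    def lookup(self, x):
        for pos in self._probe(x):
            if self.slots[pos] is _EMPTY:
                return False   # hit a gap: x was never inserted
            if self.slots[pos] == x:
                return True
        return False


chained = ChainedHashTable()
probing = LinearProbingTable()
for key in (17, 25, 33):       # all three collide mod 8
    chained.insert(key)
    probing.insert(key)
```

With 8 buckets the keys 17, 25 and 33 all hash to bucket 1: chaining stores them in one list, while linear probing spills them into buckets 1, 2 and 3.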
Definition: the load factor of a hash table is:
$\alpha = \frac{\text{\# of objects in hash table}}{\text{\# of buckets of hash table}}$
Note:
1.) $\alpha = O(1)$ is a necessary condition for operations to run in constant time.
2.) with open addressing, need $\alpha \ll 1$. (only one object per bucket)
6. Pathological Data Sets
Upshot #2: for good hash table performance, need a good hash function.
Ideal: use a super-clever hash function guaranteed to spread every data set out evenly.
Problem: DOES NOT EXIST! (for every hash function, there is a pathological data set)
Reason: fix a hash function $h : U \to \{0, 1, \dots, n-1\}$.
By the Pigeonhole Principle, there exists a bucket i such that at least $|U|/n$ elements of U hash to i under h.
If the data set is drawn only from these, everything collides!
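To make the pigeonhole argument concrete, here is a small sketch with a deliberately simple hash function, h(x) = x mod n. That function is chosen only for illustration and does not come from the notes. Once h is known, it is easy to build a data set that lands entirely in one bucket.

```python
from collections import Counter

n = 10  # number of buckets


def h(x):
    # Hypothetical, overly simple hash function: key modulo bucket count.
    return x % n


# Adversarial data set: 1000 distinct keys, all multiples of n.
bad_keys = [n * j for j in range(1000)]
bad_load = Counter(h(x) for x in bad_keys)

# Benign data set of the same size: consecutive integers.
ok_keys = list(range(1000))
ok_load = Counter(h(x) for x in ok_keys)
```

Here bad_load puts all 1000 keys in bucket 0, so a chained table degenerates into one long list, while ok_load spreads the same number of keys evenly at 100 per bucket.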
7. Pathological Data in the Real World.
Main Point: can paralyze several real-world systems by exploiting badly designed hash functions.
open source.
overly simplistic hash function.
(easy to reverse engineer a pathological data set)
Solutions
1. Use a cryptographic hash function (e.g., SHA-2)
infeasible to reverse engineer a pathological data set.
2. Use randomization.
design a family H of hash functions such that for all data sets S, “almost all” functions $h \in H$ spread S out “pretty evenly”.
Universal Hash Functions
Definition: Let H be a set of hash functions from U to
$\{0, 1, 2, \dots, n-1\}$.
H is universal if and only if:
for all x, y in U (with $x \neq y$)
$\Pr_{h \in H}[x, y \text{ collide}] \le \frac{1}{n}$ (collide: $h(x) = h(y)$),
when h is chosen uniformly at random from H.
I.e., collision probability as small as with the "gold standard" of perfectly random hashing.
Example: Hashing IP Addresses.
Let U = IP addresses (of the form $(x_1, x_2, x_3, x_4)$), with each $x_i \in \{0, 1, 2, \dots, 255\}$.
Let n = a prime (small multiple of # of objects in HT).
Construction: Define one hash function $h_a$ per 4-tuple $a = (a_1, a_2, a_3, a_4)$ with each $a_i \in \{0, 1, 2, \dots, n-1\}$.
Define: $h_a$ : IP addrs $\to$ buckets by
$h_a(x_1, x_2, x_3, x_4) = (a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4) \bmod n$
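The construction above can be sketched as a function that draws the coefficients a at random and returns the resulting hash function. The helper name make_ip_hash, the choice n = 997, and the dotted-string input format are assumptions made for this example.

```python
import random


def make_ip_hash(n, rng=random):
    """Draw one hash function h_a at random; n is assumed to be prime."""
    a = [rng.randrange(n) for _ in range(4)]  # each a_i uniform in {0..n-1}

    def h(ip):
        # Split "x1.x2.x3.x4" into its four parts, each in 0..255.
        x = [int(part) for part in ip.split(".")]
        return sum(ai * xi for ai, xi in zip(a, x)) % n

    return h


n = 997                                   # a prime bucket count (illustrative)
h = make_ip_hash(n, random.Random(0))     # seeded only for reproducibility
bucket = h("192.168.0.1")
```

The random choice happens once, when the table is created. After that h is an ordinary deterministic function, so the same address always maps to the same bucket.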
A Universal Hash Function
Define: $H = \{ h_a \mid a_1, a_2, a_3, a_4 \in \{0, 1, 2, \dots, n-1\} \}$
$h_a(x_1, x_2, x_3, x_4) = (a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4) \bmod n$
Theorem: This family is universal.
Proof:(Part 1)
Consider distinct IP addresses $(x_1, x_2, x_3, x_4)$, $(y_1, y_2, y_3, y_4)$.
Assume: $x_4 \neq y_4$
Note: collision
$\iff a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4 \equiv a_1 y_1 + a_2 y_2 + a_3 y_3 + a_4 y_4 \pmod{n}$
$\iff a_4 (x_4 - y_4) \equiv \sum_{i=1}^{3} a_i (y_i - x_i) \pmod{n}$
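Proof (Part II)
The story so far: with $a_1, a_2, a_3$ fixed arbitrarily, how many choices of $a_4$ satisfy
$a_4 (x_4 - y_4) \equiv \sum_{i=1}^{3} a_i (y_i - x_i) \pmod{n}$?
Key Claim: left-hand side equally likely to be any of {0,1,2,…,n-1}.
Reason: $x_4 \neq y_4$, so $x_4 - y_4 \not\equiv 0 \pmod{n}$ (given n > 255); since n is prime and $a_4$ is uniform at random, $a_4 (x_4 - y_4) \bmod n$ is uniform as well. Hence exactly one choice of $a_4$ causes a collision, and the collision probability is $\frac{1}{n}$.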
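The key claim can be checked by brute force on a small case. The sketch below fixes two sample addresses that differ in the last coordinate, draws $a_1, a_2, a_3$ at random, and counts how many values of $a_4$ produce a collision. The addresses and n = 257 (a prime larger than 255) are arbitrary choices for this check.

```python
import random

n = 257  # prime and larger than 255, so x4 - y4 is never 0 mod n


def h(a, x):
    # h_a(x) = (a1*x1 + a2*x2 + a3*x3 + a4*x4) mod n
    return sum(ai * xi for ai, xi in zip(a, x)) % n


x = (192, 168, 0, 1)
y = (10, 0, 0, 7)      # differs from x in the last coordinate

rng = random.Random(1)  # seeded only for reproducibility
counts = []
for _ in range(100):
    a123 = [rng.randrange(n) for _ in range(3)]   # a1, a2, a3 fixed arbitrarily
    colliding = [a4 for a4 in range(n)
                 if h(a123 + [a4], x) == h(a123 + [a4], y)]
    counts.append(len(colliding))
```

Every entry of counts is 1: whatever $a_1, a_2, a_3$ are, exactly one of the n possible values of $a_4$ collides, which matches the 1/n bound.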
Bloom Filter: Supported Operations.
Fast Inserts and Lookups.
Comparison to Hash Tables.
Pros: more space efficient
Cons:
1) can’t store an associated object.
2) No deletions.
3) Small false positive probability.
(might say x has been inserted even though it hasn't been)
Applications:
Original: early spellcheckers.
Canonical: list of forbidden passwords.
Modern: network routers,
Limited memory, need to be super-fast.
Bloom Filter: Under the Hood:
Ingredients:
1) array of n bits.
(So $n / |S|$ = # of bits per object in the data set S)
2) k hash functions $h_1, \dots, h_k$ (k = small constant)
Insert(x) :
for i = 1, 2, …, k
set A[$h_i(x)$] = 1
Lookup(x): return TRUE if and only if
A[$h_i(x)$] = 1 for every i = 1,2,…,k.
Note: no false negatives:
(if x was inserted, Lookup(x) guaranteed to succeed).
But: false positive if all k $h_i(x)$'s already set to 1 by other insertions.
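A minimal Bloom filter sketch following the Insert and Lookup rules above. Two simplifications are assumptions of this example rather than part of the notes: the k hash functions are simulated by prefixing the key with the index i and running SHA-256, and each bit is stored in a whole byte for readability, which wastes the space a real Bloom filter is meant to save.

```python
import hashlib


class BloomFilter:
    def __init__(self, n, k):
        self.n = n                 # number of bits in the array A
        self.k = k                 # number of hash functions
        self.bits = bytearray(n)   # one byte per bit, all initially 0

    def _positions(self, x):
        # Assumption: h_i(x) is simulated as SHA-256 of "i:x", reduced mod n.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def insert(self, x):
        for pos in self._positions(x):
            self.bits[pos] = 1

    def lookup(self, x):
        # TRUE only if every one of the k bits is set.
        return all(self.bits[pos] == 1 for pos in self._positions(x))


# Hypothetical list of forbidden passwords.
bf = BloomFilter(n=8000, k=5)
forbidden = ["password", "123456", "qwerty"]
for word in forbidden:
    bf.insert(word)
```

Every inserted word is reported as present, since Insert set exactly the bits that Lookup checks. A word that was never inserted is usually reported as absent, but it can be a false positive if its k bits happen to have been set by other insertions.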
Heuristic Analysis
Intuition: should be a trade-off between space and error (false positive)
probability.
Assume: all $h_i(x)$'s uniformly random and independent.
Setup: n bits, insert data set S into bloom filter.
Note: for each bit of A, the probability it’s been set to 1 is (under above assumption):
$1 - (1 - \frac{1}{n})^{k|S|} \approx 1 - e^{-k|S|/n} = 1 - e^{-k/b}$
b = # of bits per object (n/|S|)

Story so far: probability a given bit is 1 is $\approx 1 - e^{-k/b}$
So: under assumption, for x not in S, false positive probability is
$\approx [1 - e^{-k/b}]^k$ (the error rate $\epsilon$)
where b = # of bits per object.
How to set k?: for fixed b, $\epsilon$ is minimized by setting
$k \approx (\ln 2) \cdot b$
Plugging back in:
$\epsilon \approx (\frac{1}{2})^{(\ln 2) b}$ or $b \approx 1.44 \log_2 \frac{1}{\epsilon}$
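The formulas above are easy to evaluate numerically. The sketch below plugs in b = 8 bits per object and a 1% target error rate; both are example values, not figures taken from the notes.

```python
import math


def false_positive_rate(b, k):
    """Heuristic estimate (1 - e^(-k/b))^k for b bits per object."""
    return (1.0 - math.exp(-k / b)) ** k


b = 8
k_opt = math.log(2) * b                    # about 5.5; round to 5 or 6 in practice
eps_opt = false_positive_rate(b, k_opt)    # equals (1/2)^((ln 2) * b), about 2%

# Bits per object needed for a 1% error rate: b = log2(1/eps) / ln 2,
# i.e. roughly 1.44 * log2(1/eps).
bits_for_1_percent = math.log2(1 / 0.01) / math.log(2)
```

With 8 bits per object the estimated false positive rate is a little over 2%, and getting down to 1% takes just under 10 bits per object. Rounding k to 5 or 6 gives an error rate only slightly above the optimum.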


Reposted from blog.csdn.net/qq_31805127/article/details/80324046