A Summary of Ten Techniques for Massive Data Processing



One, Bloom filter
Scope: can be used to implement a data dictionary, to deduplicate data, or to compute set intersections
Basic principles and key points:
The principle is very simple: a bit array plus k independent hash functions. To insert a key, set the bits at the positions given by the hash functions; to look up a key, check whether all the bits its hash functions point to are set, and if so the key is presumed present. Clearly this process does not guarantee that the lookup result is 100% correct (false positives are possible). Nor does it support deleting an inserted key, because clearing that key's bits would affect other keys. A simple improvement is the counting Bloom filter, which replaces the bit array with an array of counters and can therefore support deletion.
Another important issue is how to determine the size m of the bit array from the number of input elements n and the number of hash functions k. The error rate is minimized when the number of hash functions is k = (ln 2) * (m / n). For an error rate no greater than E, m must be at least n * lg(1/E) to represent any set of n elements. But m should be larger still, because at least half of the bits in the array should remain 0; this gives m >= n * lg(1/E) * lg(e), roughly 1.44 times n * lg(1/E) (lg denotes the base-2 logarithm).
As an example, assuming an error rate of 0.01, m should then be about 13 times n, and k is about 8.
Note that m and n have different units: m is measured in bits, while n is measured in number of elements (more precisely, number of distinct elements). Since a single element is usually many bits long, using a Bloom filter usually saves memory.
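As an illustration, here is a minimal Bloom filter sketch in Python. The class name, its parameters, and the use of double hashing over an md5 digest to derive the k hash functions are all illustrative assumptions, not a reference implementation:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: an m-bit array plus k hash functions."""

    def __init__(self, m, k):
        self.m = m                          # number of bits
        self.k = k                          # number of hash functions
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, key):
        # Derive k hash values from one md5 digest via double hashing:
        # h_i = h1 + i * h2 (an illustrative stand-in for k independent hashes).
        digest = hashlib.md5(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # All k bits set => "probably present"; any bit clear => definitely absent.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

With m about 13 times n bits and k = 8, this matches the sizing worked out above for a roughly 1% error rate.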
Extension:
A Bloom filter maps the elements of a set into a bit array, using k bits (k is the number of hash functions) to indicate whether an element is in the set. A counting Bloom filter (CBF) expands each bit of the array into a counter, thereby supporting deletion of elements. A spectral Bloom filter (SBF) further associates the counters with the occurrence counts of set elements; SBF uses the minimum counter value to approximate an element's frequency.
Example problem: you are given two files, A and B, each storing 5 billion URLs; each URL occupies 64 bytes and the memory limit is 4 GB. Find the URLs common to files A and B. What if there are three or even n files?
Next we calculate the memory usage: 4 GB = 2^32 bytes, which is about 34 billion bits. If n = 5 billion and we aim for an error rate of 0.01, about 65 billion bits are needed. The 34 billion available is not far off, so the error rate may rise somewhat. Also, if the URLs correspond one-to-one with IPs, they can be converted to IPs, which is much simpler.

Two, Hashing
Scope: fast lookup and deletion; a basic data structure, typically used when the total amount of data fits into memory
Basic principles and key points:
Choice of hash function: there are specific hash methods for strings, integers, permutations, and so on.
Collision handling: one approach is open hashing, also known as the chaining (zipper) method; the other is closed hashing, also known as open addressing.
Extension:
The "d" in d-left hashing means "multiple"; let us simplify the issue and look at 2-left hashing. 2-left hashing splits a hash table into two halves of equal length, called T1 and T2, and gives T1 and T2 each their own hash function, h1 and h2. When storing a new key, both hash functions are used to compute two addresses, h1[key] and h2[key]. We then check position h1[key] in T1 and position h2[key] in T2 to see which location already stores more (colliding) keys, and store the new key in the less loaded location. If both sides are equally loaded, for example both positions are empty or both store one key, the new key is stored in the left subtable T1; this is also where the "2-left" comes from. When looking up a key, both hashes must be computed and both positions searched at the same time.
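A rough Python sketch of the 2-left scheme described above (the bucket lists and the tuple-based hash functions are simplifying assumptions; a real implementation would use fixed-size buckets):

```python
class TwoLeftHash:
    """Sketch of 2-left hashing: two half-tables T1/T2, each with its own
    hash function; a new key goes to the less loaded bucket, ties go left."""

    def __init__(self, nbuckets):
        self.n = nbuckets
        self.t1 = [[] for _ in range(nbuckets)]  # left half, hashed by h1
        self.t2 = [[] for _ in range(nbuckets)]  # right half, hashed by h2

    def _h1(self, key):
        return hash(("left", key)) % self.n      # illustrative hash choice

    def _h2(self, key):
        return hash(("right", key)) % self.n

    def insert(self, key):
        b1 = self.t1[self._h1(key)]
        b2 = self.t2[self._h2(key)]
        # Place in the bucket holding fewer keys; on a tie, use the left table.
        (b1 if len(b1) <= len(b2) else b2).append(key)

    def lookup(self, key):
        # Both positions must be probed.
        return key in self.t1[self._h1(key)] or key in self.t2[self._h2(key)]
```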

Examples of questions:
1) From a massive log, extract the IP that visited Baidu the most times on a given day.
The number of distinct IPs is still limited, at most 2^32, so consider hashing the IPs directly into memory and then counting.
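A sketch of this idea in Python, using a hash map to count occurrences (the log format, with the IP as the first whitespace-separated field, is an assumption for illustration):

```python
from collections import Counter

def top_ip(log_lines):
    """Count IP occurrences with a hash map; return the most frequent IP."""
    counts = Counter()
    for line in log_lines:
        ip = line.split()[0]   # assumption: the IP is the first field
        counts[ip] += 1
    return counts.most_common(1)[0]   # (ip, count) pair
```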

Three, bit-map
Scope: fast lookup, deduplication, and deletion of data; generally the data range is within 10 times that of int
Basic principles and key points: use a bit array to record whether certain elements exist, for example the set of 8-digit phone numbers
Extension: a Bloom filter can be seen as an extension of the bit-map
Examples of questions:
1) A file is known to contain some phone numbers, each 8 digits long; count the number of distinct numbers.
8 digits means at most 99,999,999, which requires about 99 M bits, i.e. roughly a dozen MB of memory.
2) Among 250 million integers, find the integers that are not repeated; memory is insufficient to hold all 250 million integers.
Extend the bit-map: use 2 bits per number, where 00 means not seen, 01 means seen once, and 10 means seen twice or more. Instead of a dedicated 2-bit map, we can also simulate it with two ordinary bit-maps.
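A small Python sketch of the 2-bit map (function names are illustrative; a real solution for 250 million integers would allocate 2^32 * 2 bits, about 1 GB):

```python
def unique_numbers(nums, max_value):
    """2-bit map: 00 = unseen, 01 = seen once, 10 = seen twice or more.
    Returns the values in [0, max_value) that appear exactly once."""
    bits = bytearray((max_value * 2 + 7) // 8)

    def get(i):
        byte, off = (2 * i) // 8, (2 * i) % 8
        return (bits[byte] >> off) & 0b11

    def set_(i, v):
        byte, off = (2 * i) // 8, (2 * i) % 8
        bits[byte] = (bits[byte] & ~(0b11 << off)) | (v << off)

    for n in nums:
        state = get(n)
        if state == 0:
            set_(n, 1)   # first occurrence
        elif state == 1:
            set_(n, 2)   # second occurrence: mark as repeated
    return [i for i in range(max_value) if get(i) == 1]
```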
Four, heap
Scope: finding the top n of massive data, where n is relatively small and the heap fits in memory
Basic principles and key points: use a max-heap to find the smallest n, and a min-heap to find the largest n. For example, to find the smallest n, compare the current element with the largest element of the max-heap; if it is smaller than that maximum, it replaces it. The n elements left at the end are the smallest n. This is suitable for large data volumes with relatively small n, and it is efficient because a single scan yields all of the top n elements.
Extension: a double heap, i.e. a max-heap combined with a min-heap, can be used to maintain the median.
Examples of questions:
1) Find the largest 100 numbers among 1,000,000 numbers.
Use a min-heap of 100 elements.
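A sketch using Python's heapq: keep a min-heap of size k whose root is the smallest of the current top k, and replace the root whenever a larger number arrives:

```python
import heapq

def top_k(stream, k):
    """Return the k largest numbers of a stream, largest first,
    using a min-heap of size k."""
    heap = []
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            # x beats the smallest of the current top k: replace the root.
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)
```

For the problem above, call it with k = 100 on the stream of 1,000,000 numbers.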

Five, double-bucket division - essentially the idea of [divide and conquer], with the emphasis on the technique of "dividing"!
Scope: k-th largest element, median, non-repeated or repeated numbers
Basic principles and key points: because the range of the elements is too large for a direct addressing table, the range is narrowed through multiple rounds of division until it finally becomes acceptable. The reduction factor per round can vary; "double" (dividing twice) is just one example.
Extension:
Example problems:
1) Among 250 million integers, find the integers that are not repeated; memory is insufficient to hold all 250 million integers.
This is a bit like the pigeonhole principle. There are 2^32 possible integers, so we can divide these 2^32 numbers into 2^8 regions (for example, one file per region), distribute the data into the different regions, and then solve each region directly with a bitmap. In other words, given enough disk space, the problem is easily solved.
2) Find the median of 500 million ints.
This example is more obvious than the one above. First divide the int range into 2^16 regions, then read the data and count how many numbers fall into each region. From these counts we can determine which region the median falls into, and which rank within that region the median occupies. In a second scan we then only count the numbers falling into that region.
In fact, if the numbers were int64 rather than int, three rounds of this division would reduce the range to an acceptable level: first divide int64 into 2^24 regions and determine which region contains the number of the desired rank; then divide that region into 2^20 sub-regions and determine which sub-region contains it; since a sub-region holds only 2^20 distinct values, a direct addressing table can then count them directly.
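A two-pass sketch of the bucket-counting median for non-negative 32-bit ints (in practice each pass re-reads the data from disk; here `numbers` is simply a list that is iterated twice):

```python
def median_two_pass(numbers, bucket_bits=16, value_bits=32):
    """Pass 1: count values per high-bits bucket to locate the median's
    bucket and its rank inside it. Pass 2: collect only that bucket."""
    shift = value_bits - bucket_bits
    counts = [0] * (1 << bucket_bits)
    total = 0
    for x in numbers:                 # pass 1
        counts[x >> shift] += 1
        total += 1
    rank = (total - 1) // 2           # lower median, 0-indexed
    bucket = 0
    while rank >= counts[bucket]:
        rank -= counts[bucket]
        bucket += 1
    # Pass 2: only the median's bucket needs to be materialized and sorted.
    in_bucket = sorted(x for x in numbers if x >> shift == bucket)
    return in_bucket[rank]
```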

Six, database index
Scope: create, read, update, and delete (CRUD) operations on large data sets
Basic principles and key points: use the design and implementation methods of databases to handle insertion, deletion, update, and search over massive data.


Seven, inverted index (Inverted index)
Scope: search engines, keyword queries
Basic principles and key points: why is it called an inverted index? It is an indexing method used to store, for full-text search, a mapping from a word to its storage locations in a document or a set of documents.
Taking English as an example, here is the text to be indexed:
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
We obtain the following inverted file index:
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
The retrieval condition "what", "is" and "it" corresponds to the intersection of the sets.
A forward index was developed to store a list of words for each document. Forward-index queries suit ordered, frequent full-text queries over each document and the verification of each word within a validating document. In a forward index, the document occupies the central position: each document points to the sequence of index entries (the words) it contains. An inverted index instead has each word point to the documents that contain it; the inverse relationship is easy to see.
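A minimal Python sketch of building an inverted index and answering an AND query by intersecting posting sets, using the T0/T1/T2 example above:

```python
def build_inverted_index(docs):
    """Map each word to the set of document ids containing it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

def query_and(index, words):
    """Documents containing all query words = intersection of posting sets."""
    sets = [index.get(w, set()) for w in words]
    return set.intersection(*sets) if sets else set()
```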
Extension:
Example problem: a document retrieval system; query which files contain certain words, for example a common keyword search over academic papers.

Eight, external sorting
Scope: sorting and deduplicating big data
Basic principles and key points: the merge method of external sorting, the loser tree for replacement selection, the optimal merge tree
Extension:
Example problems:
1) There is a 1 GB file in which every line is a word; no word exceeds 16 bytes, and the memory limit is 1 MB. Return the 100 words with the highest frequency.
This data has an obvious characteristic: words are at most 16 bytes, but 1 MB of memory is not enough for hashing, so sorting can be used instead, with the memory serving as the input buffer.
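A sketch of merge-based external sorting in Python: sort chunks that fit in memory, spill each sorted run to a temporary file, then k-way merge the runs with heapq.merge (here `chunk_size` stands in for the memory limit):

```python
import heapq
import itertools
import tempfile

def external_sort(lines, chunk_size=1000):
    """Sort an arbitrarily long stream of strings using bounded memory:
    in-memory sorted runs spilled to temp files, then a k-way merge."""
    runs = []
    it = iter(lines)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        chunk.sort()
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(w + "\n" for w in chunk)
        f.seek(0)
        runs.append(f)
    # heapq.merge streams the runs, holding only one line per run in memory.
    merged = [line.rstrip("\n") for line in heapq.merge(*runs)]
    for f in runs:
        f.close()
    return merged
```

Once sorted, equal words are adjacent, so their frequencies can be counted in a single streaming pass to pick the top 100.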

Nine, trie tree
Scope: large data volume with many repetitions, but the number of distinct items is small enough to fit in memory
Basic principles and key points: the implementation, and the representation of child nodes
Extension: compressed implementations.
Examples of questions:
1) There are 10 files of 1 GB each; each line of each file stores a user query, and queries may be repeated across files. Sort the queries by frequency.
2) 10 million strings, some of which are duplicates; remove every duplicated string, keeping only the strings without repetition. How would you design and implement this?
3) Find the hottest queries: query strings have a relatively high repetition rate; although the total is 10 million, after deduplication there are no more than 3 million, each no longer than 255 bytes.
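A minimal trie sketch in Python using the dict-of-children node representation mentioned above, counting how often each inserted string occurs (which covers the frequency and deduplication questions above):

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}  # child-node representation: char -> TrieNode
        self.count = 0      # times this exact string was inserted

class Trie:
    """Minimal trie for counting repeated query strings."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def count(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.count
```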

Ten, distributed processing: MapReduce
Scope: large data volume, but the number of distinct data types is small enough to fit in memory
Basic principles and key points: partition the data and hand it to different machines for processing, then reduce (merge) the results.
Extension:
Example problems:
1) The canonical example application of MapReduce is a process to count the appearances of each different word in a set of documents.
2) A mass of data is distributed across 100 computers; find an efficient way to compute the TOP 10 of this batch of data.
3) There are N machines in total, and each machine holds N numbers. Each machine can store at most O(N) numbers and operate on them. How do you find the median of the N^2 numbers?
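A single-machine simulation of the canonical MapReduce word count, making the map, shuffle, and reduce phases explicit (the function names are illustrative; a real framework distributes these phases across machines):

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in a document."""
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts emitted for each word."""
    return {word: sum(vals) for word, vals in groups.items()}

def word_count(docs):
    return reduce_phase(shuffle(chain.from_iterable(map(map_phase, docs))))
```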

Origin blog.csdn.net/u014156013/article/details/103944112