Bitmap, Trie, Database Index, Inverted Index, External Sorting, MapReduce

Bitmap

Problem: Given 4 billion distinct unsigned int values in no particular order, and then one more number, how can you quickly determine whether that number is among the 4 billion?
Solution: Use a bitmap. Allocate 512 MB of memory (2^32 bits), with one bit representing one unsigned int value. Read the 4 billion numbers and set the corresponding bit for each one. Then read the number to be queried and check its bit: 1 means the number is present, 0 means it is not.
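The set-then-test idea can be sketched in a few lines of Python. This is only a minimal illustration: the class name `Bitmap` and its methods are made up for this example, and the demo uses a small range rather than the full 2^32 values.

```python
class Bitmap:
    """Fixed-size bit set: one bit per integer in [0, size). Hypothetical helper."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.bits = bytearray((size + 7) // 8)  # every bit starts at 0

    def set(self, n: int) -> None:
        # Byte index is n // 8, bit position inside the byte is n % 8.
        self.bits[n >> 3] |= 1 << (n & 7)

    def test(self, n: int) -> bool:
        return bool(self.bits[n >> 3] & (1 << (n & 7)))


# Small demo range; the full problem would use size = 2**32 (512 MB).
bitmap = Bitmap(1 << 16)
for value in (3, 42, 65535):
    bitmap.set(value)

print(bitmap.test(42))  # True
print(bitmap.test(43))  # False
```

Both `set` and `test` touch a single byte, so each query costs constant time once the bitmap has been built.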

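The idea can also be extended to a 2-bit bitmap, which uses two bits per value and can therefore record, for example, whether a value is absent, has been seen once, or has been seen more than once.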
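One possible reading of the 2-bit variant is sketched below. The encoding (0 = absent, 1 = seen once, 2 = seen more than once) and the class name `TwoBitMap` are assumptions chosen for illustration; the source does not specify them.

```python
class TwoBitMap:
    """Two bits per value. Assumed encoding: 0 absent, 1 seen once, 2 seen 2+ times."""

    def __init__(self, size: int) -> None:
        self.cells = bytearray((size + 3) // 4)  # four 2-bit cells per byte

    def get(self, n: int) -> int:
        shift = (n & 3) * 2
        return (self.cells[n >> 2] >> shift) & 0b11

    def add(self, n: int) -> None:
        state = self.get(n)
        if state < 2:  # saturate at 2 so the cell never overflows
            shift = (n & 3) * 2
            cleared = self.cells[n >> 2] & ~(0b11 << shift) & 0xFF
            self.cells[n >> 2] = cleared | ((state + 1) << shift)


counts = TwoBitMap(100)
for value in [5, 8, 5, 13, 8, 5]:
    counts.add(value)

# Values that occurred exactly once.
print([n for n in range(100) if counts.get(n) == 1])  # [13]
```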

Trie

Question: There are 10 files of 1 GB each. Every line of every file stores one user query, and queries may be repeated within each file. Sort the queries by frequency.

Solution: Use a trie to count how many times each query appears. The time complexity is O(n * le), where le is the average query length; in other words, it is proportional to the total number of characters. (Inserting a string into a trie takes time proportional to its length, and a lookup takes time bounded by the height of the tree.)
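A minimal in-memory sketch of trie-based counting follows. The names `TrieNode`, `Trie`, and `sort_by_frequency` are made up for this example, and it assumes the distinct queries fit in memory; for 10 GB of input the data would likely need to be partitioned first.

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self) -> None:
        self.children = {}  # character -> TrieNode
        self.count = 0      # number of queries that end at this node


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        # Walk (and create) one node per character: O(len(word)).
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.count += 1

    def items(self):
        """Yield (query, count) for every stored query."""
        stack = [(self.root, "")]
        while stack:
            node, prefix = stack.pop()
            if node.count:
                yield prefix, node.count
            for ch, child in node.children.items():
                stack.append((child, prefix + ch))


def sort_by_frequency(queries):
    trie = Trie()
    for query in queries:
        trie.insert(query)
    # Most frequent first; ties broken alphabetically.
    return sorted(trie.items(), key=lambda kv: (-kv[1], kv[0]))


queries = ["cat", "car", "cat", "dog", "cat", "car"]
print(sort_by_frequency(queries))
# [('cat', 3), ('car', 2), ('dog', 1)]
```

Because shared prefixes such as "ca" are stored once, the trie also deduplicates the input as a side effect of counting.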

A trie can also be used for string deduplication and for top-K frequency statistics.

Database Index

See the separate article on database indexes.

Inverted index

Scope: search engines, keyword queries

Basic principles and key points: What is an inverted index? It is an indexing method that stores a mapping from each word to the documents in which it appears, so that the documents containing a given word can be found quickly.

Taking English as an example, suppose the following texts are to be indexed:

T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"

We get the following inverted file index:

"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}

To search for "what is it", compute the intersection of the sets for "what", "is", and "it": {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}.
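The three-text example can be reproduced with a short Python sketch. The function names `build_inverted_index` and `search` are chosen for illustration only, and tokenization is simplified to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the set of document numbers that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())  # set intersection
    return result


docs = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(docs)
print(sorted(search(index, "what is it")))  # [0, 1]
print(sorted(search(index, "a banana")))    # [2]
```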

External sorting

Question: How do you sort a file stored on disk?

Description: You are given a file containing at most n distinct positive integers (it may contain fewer than n), each less than or equal to n, where n = 10^7.

Output: A list of all the input integers in ascending order.
Constraints: About 1 MB of memory is available at most, but disk space is plentiful. The running time should be under 5 minutes; 10 seconds is the best result.

Option I: External sorting

One example of external sorting is the external merge sort. It reads as much data as fits in memory, sorts it there, and writes it out as a run (a temporary file whose contents are sorted). After all the data has been processed this way, the runs are merged. For example, to sort 900 MB of data on a machine with only 100 MB of available memory, external merge sort proceeds as follows:

  1. Read 100 MB of data into memory and sort it there with a conventional method (e.g., quicksort or heapsort).
  2. Write the sorted data to disk.
  3. Repeat steps 1 and 2 until all the data has been written out as sorted 100 MB chunks (temporary files). With 900 MB of data and 100 MB per temporary file, this produces nine temporary files.
  4. Read the first 10 MB (= 100 MB / (9 chunks + 1)) of each temporary file (run) into its own input buffer in memory, and use the remaining 10 MB as the output buffer. (In practice, making the input buffers somewhat smaller and the output buffer somewhat larger gives better results.)
  5. Perform a nine-way merge and write the result to the output buffer. Whenever the output buffer is full, write its contents to the target file and empty the buffer. Whenever one of the nine input buffers becomes empty, refill it with the next 10 MB from its associated file, unless that file has been read completely. This is the key step that lets external merge sort work outside main memory: the merge algorithm makes only one sequential pass over each chunk, so no chunk ever has to be loaded into main memory in full.
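The steps above can be sketched in Python as follows. This is a simplified illustration rather than a tuned implementation: the function names are made up, the input is assumed to hold one integer per line, `chunk_size` counts values instead of megabytes, and Python's own file buffering stands in for the explicit 10 MB input and output buffers.

```python
import heapq
import os
import tempfile


def write_run(sorted_values, tmp_dir):
    """Write one sorted run to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".run", text=True)
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{v}\n" for v in sorted_values)
    return path


def read_run(path):
    """Lazily yield the integers of one run, in order."""
    with open(path) as f:
        for line in f:
            yield int(line)


def external_sort(input_path, output_path, chunk_size, tmp_dir):
    # Steps 1-3: sort chunk_size values at a time and write each chunk as a run.
    run_paths, chunk = [], []
    with open(input_path) as src:
        for line in src:
            chunk.append(int(line))
            if len(chunk) >= chunk_size:
                run_paths.append(write_run(sorted(chunk), tmp_dir))
                chunk = []
    if chunk:
        run_paths.append(write_run(sorted(chunk), tmp_dir))

    # Steps 4-5: k-way merge. heapq.merge reads each run sequentially and
    # keeps only one pending value per run in its heap.
    with open(output_path, "w") as dst:
        runs = [read_run(p) for p in run_paths]
        dst.writelines(f"{v}\n" for v in heapq.merge(*runs))

    for p in run_paths:
        os.remove(p)
    return len(run_paths)


with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "input.txt")
    dst_path = os.path.join(tmp, "sorted.txt")
    data = [9, 4, 7, 1, 8, 2, 6, 3, 5]
    with open(src_path, "w") as f:
        f.writelines(f"{v}\n" for v in data)
    n_runs = external_sort(src_path, dst_path, chunk_size=4, tmp_dir=tmp)
    with open(dst_path) as f:
        result = [int(line) for line in f]

print(n_runs, result)  # 3 [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With nine values and a chunk size of four, the first phase produces three runs, which the second phase merges in a single pass.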

Option II: Bitmap

Values up to 10^7 need 10^7 bits (about 1.25 MB) to record whether each one has appeared (in effect, a bool vis[1e7 + 5] array packed to one bit per entry).

Solving the problem with a bitmap takes three steps:

  • Step 1: set all bits to 0, so that the set starts out empty.
  • Step 2: build the set by reading each integer in the file and setting its corresponding bit to 1.
  • Step 3: scan the bits in order and, for every bit that is 1, output the corresponding integer.

After these three steps, the output file is sorted.
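The three steps map directly onto a small Python generator. The name `bitmap_sort` and its parameters are illustrative; the sketch assumes distinct non-negative integers no larger than `max_value`, as in the problem statement.

```python
def bitmap_sort(numbers, max_value):
    """Yield distinct integers in [0, max_value] in ascending order."""
    # Step 1: all bits start at 0 (the empty set).
    bits = bytearray(max_value // 8 + 1)
    # Step 2: set the bit that corresponds to each integer read.
    for n in numbers:
        bits[n >> 3] |= 1 << (n & 7)
    # Step 3: scan the bits in ascending order and emit every set position.
    for n in range(max_value + 1):
        if bits[n >> 3] & (1 << (n & 7)):
            yield n


print(list(bitmap_sort([7, 3, 10, 1, 9], max_value=10)))  # [1, 3, 7, 9, 10]
```

The running time is linear in the input size plus `max_value`, and no comparisons between elements are needed.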

MapReduce distributed processing

MapReduce is a computing model. Put simply, it splits a large job (data) into pieces that are processed separately (Map), and then merges the partial results into a final result (Reduce). The advantage is that, once the task has been split up, the pieces can be computed in parallel on a large number of machines, which reduces the total running time. In plainer terms, the principle behind MapReduce is essentially that of a merge sort.

Take the inverted index described above as an example.

Inverted index: The Map function parses each input document and emits a list of (word, document number) pairs. The Reduce function receives all the pairs for a given word, sorts the document numbers, and emits (word, list(document number)). The set of all outputs forms a simple inverted index, and the same computation can easily be extended to track the positions of words within documents.
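The Map and Reduce roles described above can be imitated in a single Python process. This is only a sketch of the data flow, not a distributed implementation: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, and the grouping step that a real framework performs across machines is done here with a dictionary.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: emit one (word, document number) pair per distinct word."""
    for word in set(text.lower().split()):
        yield word, doc_id


def reduce_fn(word, doc_ids):
    """Reduce: sort the document numbers collected for one word."""
    return word, sorted(doc_ids)


def run_mapreduce(documents):
    # Shuffle: group the intermediate pairs by key (the word).
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for word, d in map_fn(doc_id, text):
            groups[word].append(d)
    # Each group could be reduced independently, hence in parallel.
    return dict(reduce_fn(word, ids) for word, ids in sorted(groups.items()))


documents = {0: "it is what it is", 1: "what is it", 2: "it is a banana"}
inverted = run_mapreduce(documents)
print(inverted["what"])    # [0, 1]
print(inverted["banana"])  # [2]
```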

 

 

 


Origin www.cnblogs.com/lfri/p/12422962.html