External sorting: sorting large files and extremely large amounts of data (ideas for sorting 5 million integers)

External sorting

**External sorting** refers to sorting algorithms that can process extremely large amounts of data. Generally speaking, external sorting handles data that cannot be loaded into memory all at once and therefore has to be read from and written to slower external storage (typically a hard disk). External sorting usually follows a "sort-merge" strategy. In the sort phase, a portion of the data small enough to fit in memory is read in, sorted, and written out to a temporary file; this is repeated until the data to be sorted has been organized into a number of sorted temporary files. The merge phase then combines these temporary files into one large sorted file, which is the result of the sort.

External merge sort

One example of external sorting is external merge sort. It reads in as much data as fits in memory, sorts it in memory, and writes it out as a run (i.e., a temporary file whose contents are sorted); once all the data has been processed this way, the runs are merged. For example, to sort 900 MB of data on a machine with only 100 MB of available memory, external merge sort proceeds as follows:

  1. Read 100 MB of data into memory and sort it there with some conventional method (e.g., quick sort, heap sort, merge sort).
  2. Write the sorted data to disk.
  3. Repeat steps 1 and 2 until all the data is stored in separate 100 MB sorted blocks (temporary files). In this example there are 900 MB of data and each temporary file is 100 MB, so nine temporary files are produced.
  4. Read the first 10 MB (= 100 MB / (9 + 1) blocks) of each temporary file (run) into its own input buffer in memory, and use the remaining 10 MB as the output buffer. (In practice, making the input buffers somewhat smaller and the output buffer somewhat larger gives better results.)
  5. Perform a 9-way merge (see the multiway merging discussion later in this post) and write the result to the output buffer. Whenever the output buffer fills up, write it to the target file and empty it. Whenever one of the 9 input buffers becomes empty, read the next 10 MB from its associated file, unless that file has been fully read. This is the key step that lets external merge sort work outside main memory: because the merge algorithm makes only one sequential pass over each chunk being merged, no chunk ever has to be held entirely in main memory.
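The five steps above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not a tuned implementation: the input is assumed to be a text file with one integer per line, the function names (`external_sort`, `write_run`, `read_run`) and the `chunk_size` parameter are made up for this example, and the buffering that Python file objects already do stands in for the explicit 10 MB input and output buffers. The merge itself uses the standard library's `heapq.merge`.

```python
import heapq
import os
import tempfile


def write_run(sorted_values, directory):
    """Write one sorted run to a temporary file, one integer per line."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run", text=True)
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{v}\n" for v in sorted_values)
    return path


def read_run(path):
    """Stream integers back from a run file without loading it whole."""
    with open(path) as f:
        for line in f:
            yield int(line)


def external_sort(input_path, output_path, chunk_size):
    """Sort a file of integers, holding at most chunk_size values per run.

    Returns the number of runs that were created.
    """
    with tempfile.TemporaryDirectory() as tmp:
        runs, chunk = [], []
        # Sort phase (steps 1-3): sort one chunk at a time and spill it to disk.
        with open(input_path) as src:
            for line in src:
                chunk.append(int(line))
                if len(chunk) == chunk_size:
                    runs.append(write_run(sorted(chunk), tmp))
                    chunk = []
        if chunk:
            runs.append(write_run(sorted(chunk), tmp))
        # Merge phase (steps 4-5): one sequential pass over every run.
        with open(output_path, "w") as dst:
            merged = heapq.merge(*(read_run(p) for p in runs))
            dst.writelines(f"{v}\n" for v in merged)
    return len(runs)
```

With 95 input values and `chunk_size=10`, this sketch produces ten runs (nine full ones and one of five values) before merging them.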

To make each sorted temporary file longer, you can use replacement selection sort (see the section on it later in this post). It can generate runs that are larger than the memory size. The method sorts with a min-heap held in memory; assume the min-heap has size M (the 100 MB of memory in the example above). The algorithm is as follows:

  1. Read the beginning of the input file into memory and build a min-heap.
  2. Write the element at the top of the heap to the output buffer. Then read the next record:
    1. If its key is not less than the key that was just output, place it at the top of the heap and sift it down so that the heap property holds again;
    2. Otherwise, store the new element in the last position of the heap and reduce the heap size by 1. (I don't know how to implement what the wiki describes here; I think the element should instead be inserted into a second min-heap that is used to generate the next run.)
  3. Repeat step 2 until the heap size becomes 0.
  4. At this point one run has been generated. Rebuild a heap from all the elements held in the heap's storage (here I think you can simply use the second min-heap from my note in step 2.2) and start generating the next run.
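As a rough sketch of the heap-based procedure above, the Python below tags every heap entry with the number of the run it belongs to, so that entries deferred to the next run sort after everything in the current run. This is a swapped-in variant: it is neither the wiki's shrinking-heap layout nor literally a second heap, though it has the same effect as keeping a second min-heap for the next run. The function name `replacement_selection` and the parameter `memory_size` are assumptions made for this example.

```python
import heapq
from itertools import islice

_END = object()  # sentinel meaning "the input is exhausted"


def replacement_selection(values, memory_size):
    """Yield sorted runs (as lists), holding at most memory_size items."""
    it = iter(values)
    # Heap entries are (run_number, key); tuples compare by run_number first.
    heap = [(0, v) for v in islice(it, memory_size)]
    heapq.heapify(heap)
    current_run, run = 0, []
    while heap:
        run_no, v = heapq.heappop(heap)
        if run_no != current_run:
            # Only deferred items are left, so the current run is complete.
            yield run
            run, current_run = [], run_no
        run.append(v)
        nxt = next(it, _END)
        if nxt is not _END:
            # A key smaller than the one just output must wait for the next run.
            target = current_run + 1 if nxt < v else current_run
            heapq.heappush(heap, (target, nxt))
    if run:
        yield run
```

For the input `4 1 0 3 2 -1 7 3 2 1 5 3` with `memory_size=4`, this yields the two runs `[0, 1, 2, 3, 3, 4, 7]` and `[-1, 1, 2, 3, 5]`, and an already sorted input comes out as a single run regardless of the memory size.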

This method generates runs with an average length of 2M (equivalent to 200 MB files in the example above), which further reduces the number of external memory accesses (on average there are only about half as many runs to merge as before), saving time and improving the efficiency of the algorithm.


Note: my own understanding may be biased or wrong; I hope more experienced readers will correct me. Thanks.


Replacement Selection Sort

I have not seen this algorithm described in detail before, so my analysis is based on what I learned from reading this handout.

As I understand it, the main difference between selection sort and replacement selection sort is that selection sort is designed to sort a complete sequence held in main memory, whereas replacement selection sort is designed to turn an unsorted sequence that is too big to fit in main memory into a series of sorted "runs" that can be stored in external memory. These external runs can then be merged together to form the overall sorted sequence. Despite the similar names and one or two similar key steps, the two algorithms are designed to solve fundamentally different problems.

Selection sort
There are a lot of good selection sort tutorials online, so I will not spend too much time discussing it. Intuitively, the algorithm works as follows:

Find the smallest element and swap it into position 0 of the array.
Find the second smallest element and swap it into position 1 of the array.
Find the third smallest element and swap it into position 2 of the array.
...
Find the n-th smallest element and swap it into position n-1 of the array.
This assumes that the array fits entirely in memory; if it does, the algorithm runs in Θ(n^2) time. That is not very fast, so it is not recommended for large data sets.
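For concreteness, the steps above might look like this in Python. It is a plain textbook sketch for an in-memory list; the function name `selection_sort` is just chosen for this example.

```python
def selection_sort(a):
    """Sort the list a in place and return it."""
    n = len(a)
    for i in range(n - 1):
        # Find the smallest element in the unsorted part a[i:].
        smallest = i
        for j in range(i + 1, n):
            if a[j] < a[smallest]:
                smallest = j
        # Swap it into position i.
        a[i], a[smallest] = a[smallest], a[i]
    return a
```

The nested loops make the Θ(n^2) running time visible: the inner scan runs n-1, n-2, ..., 1 times.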

Replacement selection sort
The algorithm was described by Donald Knuth in 1965, so it was designed for a computing environment completely different from the one we use today. Computers had very little memory (usually some fixed number of registers) but had access to a large external drive. It was common to build algorithms that loaded some values into registers, processed them there, and then flushed them straight back to external storage. (Interestingly, this is similar to how processors operate today, except with main memory in place of external storage.)

Suppose we have enough memory for two arrays: a first array Values of size n that can hold a set of values, and a second array Active of size n that holds boolean values. Given that we only have enough memory for the Active and Values arrays plus a few extra variables, we will take in a large stream of unsorted input values and do our best to sort it.

The idea behind the algorithm is as follows. First, load n values from the external source containing the unsorted sequence directly into the Values array. Then set all of the Active values to true. For example, if n = 4, we might have the following setup:

Values: 4 1 0 3
Active: Yes Yes Yes Yes
Replacement selection sort works by repeatedly finding the smallest value in the Values array and writing it to the output stream. In this case, we first find the value 0 and write it to the stream. This gives

Values: 4 1 _ 3
Active: Yes Yes Yes Yes

Output: 0
Now we have an empty slot in the Values array (shown as _), so we can pull another value from the external source to fill it. Suppose we get 2. In this case, we have the following setup:

Values: 4 1 2 3
Active: Yes Yes Yes Yes

Output: 0
Note that since 2 > 0, and 0 was the smallest element here, we are guaranteed that writing 0 to the output did not put it out of order with respect to 2. That's good. So we continue to the next step of the algorithm and once again find the smallest element here. That is 1, so we send it to the output device:

Values: 4 _ 2 3
Active: Yes Yes Yes Yes

Output: 0 1
Now, read another value from the external source into the empty slot:

Values: 4 -1 2 3
Active: Yes Yes Yes Yes

Output: 0 1
Now we have a problem. This new value (-1) is less than 1, which means that if we really wanted the output to be in sorted order, it would have to come before the 1. However, we do not have enough memory to re-read the output device and fix it up. Instead, we do the following. For now, we keep -1 in memory. We do our best to sort the remaining elements, and when that is done, we make a second pass that generates another sorted sequence and put -1 into that sequence. In other words, instead of producing one sorted sequence, we will produce two.

To indicate in memory that we do not want to write out -1 yet, we mark the slot holding -1 as inactive, as shown here:

Values: 4 -1 2 3
Active: Yes NO Yes Yes

Output: 0 1
From now on, we will pretend that -1 does not exist.

Let's move on. We now find the smallest value in memory that is still active (2) and write it to the device:

Values: 4 -1 _ 3
Active: Yes NO Yes Yes

Output: 0 1 2
Now, we pull the next value from the input device. Suppose it is 7:

Values: 4 -1 7 3
Active: Yes NO Yes Yes

Output: 0 1 2
Since 7 > 2, it comes after 2 in the output, so we do not need to do anything.

On the next iteration, we find the smallest active value (3) and write it out:

Values: 4 -1 7 _
Active: Yes NO Yes Yes

Output: 0 1 2 3
We pull the next value from the input device. Suppose it is also 3. In this case, we know that 3 is the smallest value, so we can write it straight to the output stream: 3 is the minimum of everything active here, which saves us an iteration:

Values: 4 -1 7 _
Active: Yes NO Yes Yes

Output: 0 1 2 3 3
Now, we pull the next value from the input device. Suppose it is 2. In this case, as before, we know that 2 should come before 3. As with the earlier -1, this means that we have to keep 2 in memory for now; we will write it out later. Our setup now looks like this:

Values: 4 -1 7 2
Active: Yes NO Yes NO

Output: 0 1 2 3 3
Now, we find the smallest active value (4) and write it to the output device:

Values: _ -1 7 2
Active: Yes NO Yes NO

Output: 0 1 2 3 3 4
Suppose we read 1 as the next input. We put it into Values, but mark it inactive:

Values: 1 -1 7 2
Active: NO NO Yes NO

Output: 0 1 2 3 3 4
There is only one active value, namely 7, so we write it out:

Values: 1 -1 _ 2
Active: NO NO Yes NO

Output: 0 1 2 3 3 4 7
Suppose we now read a 5. In this case, as before, we store it but mark the slot inactive:

Values: 1 -1 5 2
Active: NO NO NO NO

Output: 0 1 2 3 3 4 7
Note that all of the values are now inactive. This means that we have flushed from memory every value that could go into the current output run. Now we need to write out all the values we were holding for later. To do this, we mark all the values as active and then repeat the same process as before:

Values: 1 -1 5 2
Active: Yes Yes Yes Yes

Output: 0 1 2 3 3 4 7
-1 is the smallest value, so we output it:

Values: 1 _ 5 2
Active: Yes Yes Yes Yes

Output: 0 1 2 3 3 4 7 -1
Suppose we read 3 next. Since -1 < 3, we load it into the Values array as an active value.

Values: 1 3 5 2
Active: Yes Yes Yes Yes

Output: 0 1 2 3 3 4 7 -1
1 is the smallest value here, so we remove it:

Values: _ 3 5 2
Active: Yes Yes Yes Yes

Output: 0 1 2 3 3 4 7 -1 1
Suppose that we are now out of input values. We mark this slot as finished:

Values: — 3 5 2
Active: Yes Yes Yes Yes

Output: 0 1 2 3 3 4 7 -1 1
Next comes 2:

Values: — 3 5 —
Active: Yes Yes Yes Yes

Output: 0 1 2 3 3 4 7 -1 1 2
Then 3:

Values: — — 5 —
Active: Yes Yes Yes Yes

Output: 0 1 2 3 3 4 7 -1 1 2 3
Finally, 5:

Values: — — — —
Active: Yes Yes Yes Yes

Output: 0 1 2 3 3 4 7 -1 1 2 3 5
And we are done! Note that the result is not a sorted sequence, but it is much better than before: it now consists of two runs that are each in sorted order. Merging them together (the same way we merge in mergesort) produces the sorted array. The algorithm could have produced many more runs, but because our sample input was small, we ended up with only two.
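To make that last merge step concrete, here is a small check in Python using the standard library's `heapq.merge` on the two runs from the walk-through. The variable names are only for illustration.

```python
import heapq

# The two sorted runs produced by the walk-through above.
run1 = [0, 1, 2, 3, 3, 4, 7]
run2 = [-1, 1, 2, 3, 5]

# heapq.merge lazily merges already-sorted inputs in a single pass.
merged = list(heapq.merge(run1, run2))
print(merged)  # [-1, 0, 1, 1, 2, 2, 3, 3, 3, 4, 5, 7]
```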

So how fast is this? Each iteration of the loop does at most n comparisons (in memory), one read, and one write. So if the stream contains N values in total, the algorithm performs O(nN) comparisons and O(N) storage operations. If storage operations are expensive, this is not too bad, although a second pass is needed at the end to merge everything together.

In pseudo-code, the algorithm is as follows:

Make Values an array of n elements.
Make Active an array of n booleans, all initially true.

Read n values from the input device into Values.
Until no values are left to process:
    Find the smallest value that is still active.
    Write it to the output device.
    Read from the input device into the slot where the old element was.
    If the new value is smaller than the old element, mark that slot inactive.
    If all slots are inactive, mark them all active and start a new run.
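The pseudo-code can be turned into a short Python sketch. The names `replacement_selection_runs` and `_DONE` are assumptions made for this example, runs are collected in lists rather than written to an output device, and the handling of an exhausted input (marking the slot as finished) follows the walk-through rather than the pseudo-code, which does not spell it out.

```python
from itertools import islice

_DONE = object()  # marks a slot that will never be refilled


def replacement_selection_runs(stream, n):
    """Split stream into sorted runs using n value slots and n active flags."""
    it = iter(stream)
    values = list(islice(it, n))
    active = [True] * len(values)
    remaining = len(values)  # number of slots that still hold a value
    runs, run = [], []
    while remaining:
        if not any(active):
            # Every held value was deferred: close this run, reactivate them.
            runs.append(run)
            run = []
            active = [v is not _DONE for v in values]
        # Linear scan for the smallest active value (up to n comparisons).
        i = min((k for k in range(len(values)) if active[k]),
                key=lambda k: values[k])
        out = values[i]
        run.append(out)
        nxt = next(it, _DONE)
        values[i] = nxt
        if nxt is _DONE:
            active[i] = False  # input exhausted: this slot is finished
            remaining -= 1
        elif nxt < out:
            active[i] = False  # too small for this run: hold it for the next
    if run:
        runs.append(run)
    return runs
```

Feeding it the input from the walk-through, `4 1 0 3 2 -1 7 3 2 1 5 3` with `n = 4`, gives the same two runs: `[0, 1, 2, 3, 3, 4, 7]` and `[-1, 1, 2, 3, 5]`.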

I would be shocked if there were any reason to code this algorithm up today. It made sense decades ago, when memory was really, really small. Today there are much better external sorting algorithms available (the first method described above), and they would almost certainly outperform this algorithm. (The Chinese wiki says that replacement selection sort is faster because it reduces the number of disk accesses, while this author says that the plain external merge sort is faster; I still feel that, because it reduces disk accesses, replacement selection sort may be a little faster.)

Multiway merge sort

Definition

k-way

Brute force

Each time, select the minimum among the head elements of the k sorted runs and remove it; this takes k-1 calls to min(a, b).
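A minimal Python sketch of this brute-force approach, assuming the k runs are in-memory lists (the function name `kway_merge_bruteforce` is made up for this example). Each output element costs a scan over up to k heads, i.e. up to k-1 comparisons.

```python
def kway_merge_bruteforce(runs):
    """Merge k sorted lists by scanning all k heads for every output element."""
    pos = [0] * len(runs)  # index of the current head of each run
    out = []
    total = sum(len(r) for r in runs)
    for _ in range(total):
        best = None
        for i, r in enumerate(runs):
            if pos[i] == len(r):
                continue  # this run is exhausted
            if best is None or r[pos[i]] < runs[best][pos[best]]:
                best = i
        out.append(runs[best][pos[best]])
        pos[best] += 1
    return out
```

For N values in total this does O(kN) comparisons, which is what the loser tree in the next section is meant to improve on.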

Optimal merge tree + loser tree

  1. When the original file is first divided into m initial runs (merge segments), m should be as small as possible; the replacement selection algorithm can be used to split the whole file into a smaller number of initial runs of unequal length.
  2. When merging the initial runs into the complete sorted file, the number of external memory reads and writes should be as small as possible; construct an optimal merge tree to decide how the initial runs are merged, and carry out each merge with a loser tree.
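The two ideas above can be sketched in Python. This is a hedged illustration, not taken from the post: `optimal_merge_cost` builds a k-ary Huffman-style merge tree over the run lengths (padding with zero-length dummy runs so that every merge is a full k-way merge) and returns the total number of records written across all merges, and `loser_tree_merge` follows the classic textbook loser-tree layout. Both function names are made up, runs are in-memory lists, and the keys are assumed to be finite numbers so that infinity can serve as a sentinel.

```python
import heapq
import math


def optimal_merge_cost(run_lengths, k):
    """Total records written when always merging the k shortest runs."""
    if k < 2:
        raise ValueError("k must be at least 2")
    heap = list(run_lengths)
    # Pad with empty dummy runs so the tree is a full k-ary tree.
    while len(heap) > 1 and (len(heap) - 1) % (k - 1) != 0:
        heap.append(0)
    heapq.heapify(heap)
    cost = 0
    while len(heap) > 1:
        merged = sum(heapq.heappop(heap) for _ in range(k))
        cost += merged
        heapq.heappush(heap, merged)
    return cost


def loser_tree_merge(runs):
    """Merge k sorted lists; each output costs about log2(k) comparisons."""
    k = len(runs)
    if k == 0:
        return []
    iters = [iter(r) for r in runs]
    # keys[i] is the current head of run i; keys[k] is a minimal sentinel
    # used only while the tree is being built.
    keys = [next(it, math.inf) for it in iters] + [-math.inf]
    tree = [k] * k  # tree[0] holds the winner, tree[1:] hold the losers

    def adjust(s):
        # Replay the matches on the path from leaf s up to the root.
        t = (s + k) // 2
        while t > 0:
            if keys[s] > keys[tree[t]]:
                s, tree[t] = tree[t], s  # s loses and stays at this node
            t //= 2
        tree[0] = s

    for i in range(k - 1, -1, -1):
        adjust(i)
    out = []
    while keys[tree[0]] != math.inf:
        w = tree[0]
        out.append(keys[w])
        keys[w] = next(iters[w], math.inf)  # refill from the winning run
        adjust(w)
    return out
```

As an example of the first function, nine runs of lengths 2, 3, 6, 9, 12, 17, 18, 24 and 30 merged 3-way cost 223 records written in total under this scheme.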

The links below can be used as light reference.

Reference links

https://en.wikipedia.org/wiki/External_sorting
https://zh.wikipedia.org/wiki/%E5%A4%96%E6%8E%92%E5%BA%8F
https://stackoverflow.com/questions/16326689/replacement-selection-sort-v-selection-sort

Origin blog.csdn.net/neve_give_up_dan/article/details/104401236