Massive data sorting: how to sort when the amount of data is larger than memory

Approach

Traditional sorting algorithms are internal sorting algorithms: they assume all the data can be loaded into memory at once. When the data is too large to fit in memory in one go, an external sorting method is required.
External sorting takes a divide-and-conquer approach. First the data is split into blocks, and each block is sorted with an efficient internal sorting algorithm. Then the idea of merge sort is applied across the blocks to produce a single ordered sequence of all the data.
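
To make the two phases concrete, here is a minimal sketch in Python. It assumes one integer per line in a text file; the chunk size and function names are illustrative, not from the original post:

```python
import heapq
import os
import tempfile

def external_sort(in_path, out_path, chunk_lines=1_000_000):
    # Phase 1: cut the input into runs that fit in memory, sort each run,
    # and write each sorted run to a temporary file on disk.
    run_paths = []
    with open(in_path) as f:
        while True:
            chunk = [int(line) for _, line in zip(range(chunk_lines), f)]
            if not chunk:
                break
            chunk.sort()                          # efficient in-memory sort
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(f"{x}\n" for x in chunk)
            run_paths.append(path)

    # Phase 2: merge-sort idea across runs; heapq.merge is a lazy k-way merge.
    run_files = [open(p) for p in run_paths]
    with open(out_path, "w") as out:
        out.writelines(heapq.merge(*run_files, key=int))
    for rf in run_files:
        rf.close()
        os.remove(rf.name)
```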

Example 1

For example, consider sorting a 1 GB file with 100 MB of available memory. First split the file into ten 100 MB blocks, load each into memory, sort it, and write the result back to the hard disk. This yields ten individually sorted files.
Then give each file a 9 MB input buffer and allocate a 10 MB output buffer (10 × 9 MB + 10 MB = 100 MB). Merge-sort the data in the input buffers; whenever the output buffer fills, write it to the hard disk and clear it for the next batch. Whenever the 9 MB from one block is used up, load that block's next 9 MB, until all the data from all ten blocks has been loaded and processed. The end result is a sorted 1 GB file stored on the hard disk.
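
Here is a sketch of that buffered merge, again assuming one integer per line; the 9 MB and 10 MB sizes mirror the example, and the helper names are made up for illustration:

```python
import heapq

INPUT_BUF = 9 * 1024 * 1024      # ~9 MB fetched per refill, as in the example
OUTPUT_BUF = 10 * 1024 * 1024    # ~10 MB output buffer

def buffered_lines(path, buf_bytes=INPUT_BUF):
    # Yield lines from one sorted block, refilling ~buf_bytes from disk
    # whenever the previous batch is used up.
    with open(path) as f:
        while True:
            batch = f.readlines(buf_bytes)   # size hint: stop after ~buf_bytes
            if not batch:
                return
            yield from batch

def merge_blocks(block_paths, out_path):
    streams = [buffered_lines(p) for p in block_paths]
    pending, size = [], 0
    with open(out_path, "w") as out:
        for line in heapq.merge(*streams, key=int):
            pending.append(line)
            size += len(line)
            if size >= OUTPUT_BUF:           # output buffer full: flush it
                out.writelines(pending)
                pending, size = [], 0
        out.writelines(pending)              # flush the final partial buffer
```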

Example 2

How to sort 1 TB of data using 32 GB of memory:
Divide the 1 TB of data on disk into 40 chunks of 25 GB each. (Note: leave some memory for the system!)
Read each 25 GB chunk into memory in turn and sort it with quicksort.
Write the sorted chunk (still 25 GB) back to disk.
Repeat 40 times; now all 40 chunks are individually sorted, as sketched below. (The remaining work is to merge them!)
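
A sketch of this chunk-and-sort phase, assuming the input is a flat binary file of 8-byte integers (an assumption, not stated in the original) and that run_dir already exists; the 25 GB default is illustrative:

```python
import os
from array import array

def make_sorted_runs(in_path, run_dir, chunk_bytes=25 * 2**30):
    # Phase 1: read one chunk at a time, sort it in memory, write it back.
    # Python's built-in sort is Timsort; the post suggests quicksort, and
    # any O(n log n) in-memory sort works here.
    paths = []
    with open(in_path, "rb") as f:
        i = 0
        while True:
            data = f.read(chunk_bytes)       # one 25 GB chunk (illustrative)
            if not data:
                break
            chunk = array("q")               # "q" = signed 8-byte integers;
            chunk.frombytes(data)            # assumes record-aligned chunks
            run = array("q", sorted(chunk))
            path = os.path.join(run_dir, f"run{i:02d}.bin")
            with open(path, "wb") as out:
                run.tofile(out)
            paths.append(path)
            i += 1
    return paths                             # 40 sorted runs for a 1 TB input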

Read 25 GB / 40 = 0.625 GB from each of the 40 chunks into memory (40 input buffers).
Perform a 40-way merge, staging the merged result in a 2 GB in-memory output buffer. When the output buffer fills its 2 GB, append it to the final file on the hard disk and clear it; whenever one of the 40 input buffers is exhausted, load the next 0.625 GB from the chunk that buffer corresponds to, until everything has been processed.
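
And a sketch of the 40-way merge with the refill behavior just described; the buffer sizes mirror the example, and the 8-byte record format is again an assumption:

```python
import heapq
from array import array

REFILL = 640 * 2**20        # 0.625 GB loaded per input buffer refill
OUT_BUF = 2 * 2**30         # 2 GB output buffer

def run_values(path):
    # Yield integers from one sorted run, refilling REFILL bytes at a time,
    # i.e. the "load the next 0.625 GB when a buffer is exhausted" step.
    with open(path, "rb") as f:
        while True:
            data = f.read(REFILL)
            if not data:
                return
            buf = array("q")
            buf.frombytes(data)
            yield from buf

def merge_40_way(run_paths, out_path):
    out_buf = array("q")
    with open(out_path, "wb") as out:
        for v in heapq.merge(*(run_values(p) for p in run_paths)):
            out_buf.append(v)
            if len(out_buf) * out_buf.itemsize >= OUT_BUF:
                out_buf.tofile(out)          # output buffer full: flush it
                out_buf = array("q")
        out_buf.tofile(out)                  # flush the final partial buffer
```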

Example 3

Given 1 GB of memory and 10 GB of data, split the 10 GB into 10 parts and sort each 1 GB part in memory, writing it back to disk afterwards. This yields 10 sorted runs.
Merge: the multi-way merge can use a min-heap.
Build a min-heap of size 10 in memory, plus an output buffer (smaller than 1 GB, but not too small).
Push the first element of each of the 10 sorted runs into the min-heap; the smallest number is now at the top of the heap. Pop the top of the heap and write it to the buffer, then push the next element from the run the popped element belonged to; pop the new top into the buffer again, and so on. When the buffer is full, write it back to disk, clear it, and keep feeding the min-heap...
...until all 10 runs have been consumed; finally the remaining heap elements are written back to disk in order (see the sketch below).
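
Here the min-heap is made explicit rather than hidden behind heapq.merge; runs are plain sorted iterables and flush_to_disk stands in for the disk write, both illustrative:

```python
import heapq

def heap_merge(runs, flush_to_disk, buf_cap=100_000):
    # `runs`: the 10 sorted pieces; `flush_to_disk(buf)`: writes a buffer out.
    iters = [iter(r) for r in runs]
    heap = []
    for i, it in enumerate(iters):
        first = next(it, None)           # seed with the head of each run
        if first is not None:
            heap.append((first, i))
    heapq.heapify(heap)                  # size-10 heap: one element per run

    buf = []
    while heap:
        value, i = heapq.heappop(heap)   # the global minimum sits at the root
        buf.append(value)
        if len(buf) >= buf_cap:          # buffer full: write back to disk
            flush_to_disk(buf)
            buf = []
        nxt = next(iters[i], None)       # successor from the same run
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))
    if buf:
        flush_to_disk(buf)               # remaining elements, in order
```

For example, heap_merge([[1, 4, 7], [2, 5], [3, 6]], print, buf_cap=4) prints the merged output in flushed batches: [1, 2, 3, 4], then [5, 6, 7].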

Sources:
https://blog.csdn.net/FX677588/article/details/72471357
https://blog.csdn.net/fengyuyeguirenenen/article/details/125095520

Origin: blog.csdn.net/yzx3105/article/details/130008662