Written-test interview question: finding the median of massive data

 

     The median is the middle value of a sequence after sorting. It is a frequent visitor in written interview tests: this problem came up twice N years ago, once in T company's intern recruitment and once in its campus recruitment.

 

 

The median of non-massive data

      Refer to the previous article: top-K problem and random selection algorithm. When the data fits in memory, we can use any of the following:

      1. Quicksort

          Sort all the data, then read the middle element directly.

 

      2. Direct selection algorithm

          Repeatedly select the smallest remaining element, and stop as soon as the median has been selected.

 

      3. Heap selection algorithm

          Use a heap to select elements in order until the median is reached.

 

      4. Random selection algorithm

          Partition around a randomly chosen pivot and continue only in the part that contains the median.
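
      As a rough illustration of methods 1, 3 and 4, here is a minimal Python sketch. The function names are hypothetical, the input is assumed to be a non-empty in-memory list, and for an even number of elements the sketch returns the lower of the two middle values (an assumption, since the text above does not say which one to take).

```python
import heapq
import random


def median_by_sort(values):
    """Method 1: sort a copy, then read the middle element."""
    ordered = sorted(values)
    return ordered[(len(ordered) - 1) // 2]


def median_by_heap(values):
    """Method 3: heap selection of the k+1 smallest elements."""
    k = (len(values) - 1) // 2
    # nsmallest returns the k+1 smallest values in ascending order,
    # so the last one is the element of rank k (0-based).
    return heapq.nsmallest(k + 1, values)[-1]


def median_by_random_selection(values):
    """Method 4: random selection (quickselect-style partitioning)."""
    items = list(values)
    k = (len(items) - 1) // 2  # 0-based rank we are looking for
    while True:
        pivot = random.choice(items)
        smaller = [x for x in items if x < pivot]
        equal_count = sum(1 for x in items if x == pivot)
        if k < len(smaller):
            items = smaller  # the target lies left of the pivot
        elif k < len(smaller) + equal_count:
            return pivot  # the target equals the pivot
        else:
            # The target lies right of the pivot: shift the rank accordingly.
            k -= len(smaller) + equal_count
            items = [x for x in items if x > pivot]


if __name__ == "__main__":
    data = [7, 1, 5, 3, 9, 3, 8]
    print(median_by_sort(data), median_by_heap(data), median_by_random_selection(data))
```

      Sorting costs O(n log n), while random selection runs in O(n) expected time because each round discards the part that cannot contain the median.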

 

 

The median of massive data

      Finding the median of massive data is harder. The methods above cannot be applied directly, because the data in a huge file cannot all be loaded into memory. What should we do?

     Refer to the previous article: Huashan on the Sword: Bucket Sorting. Note that we only need the median, not a full sort. The specific steps are as follows:

     Step 1: Create multiple small bucket files and give each bucket a value range. Then distribute the elements of the massive data set into the corresponding buckets, recording the number of elements in each bucket.

     Step 2: From the per-bucket counts, work out which bucket contains the median and the median's rank within that bucket. Then sort only that bucket to obtain the median of the whole data set.
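
     The two steps can be sketched in Python as follows. This is only a sketch under stated assumptions: the input is a text file with one integer per line, the value range `[lo, hi]` is known in advance, the selected bucket fits in memory, and the lower middle value is returned for an even count. The name `external_median` and its parameters are hypothetical.

```python
import os
import tempfile


def external_median(path, lo, hi, num_buckets=16):
    """Lower median of the integers in `path` (one per line), all within [lo, hi]."""
    width = (hi - lo) // num_buckets + 1  # size of the value range per bucket
    counts = [0] * num_buckets
    with tempfile.TemporaryDirectory() as tmp:
        names = [os.path.join(tmp, f"bucket_{i}.txt") for i in range(num_buckets)]
        buckets = [open(name, "w") for name in names]
        try:
            # Step 1: stream the big file once and route each value to its bucket.
            with open(path) as src:
                for line in src:
                    if not line.strip():
                        continue  # skip blank lines
                    value = int(line)
                    idx = (value - lo) // width
                    buckets[idx].write(f"{value}\n")
                    counts[idx] += 1
        finally:
            for bucket in buckets:
                bucket.close()

        total = sum(counts)
        if total == 0:
            raise ValueError("input file contains no numbers")

        # Step 2: walk the counts to find the bucket that holds the median.
        k = (total - 1) // 2  # 0-based global rank of the lower median
        for idx, count in enumerate(counts):
            if k < count:
                break
            k -= count  # skip this bucket; k becomes a rank within later buckets

        # Only the selected bucket is loaded into memory and sorted.
        with open(names[idx]) as chosen:
            ordered = sorted(int(line) for line in chosen)
        return ordered[k]
```

     Only the per-bucket counts and one bucket are ever held in memory. If the selected bucket were still too large, the same procedure could be applied again to that bucket with a narrower value range.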

 

     [Figure: schematic diagram of the bucket-based median search]

     The median is just a special order statistic. Whether the data set is small or massive, the median can be found efficiently.

 

 


Origin blog.csdn.net/stpeace/article/details/108921752