Top-K problem and random selection algorithm

     I have discussed the top-K problem before, in:

     Written interview question: find the top-K (using a heap)

 

     Some readers pointed out that when memory can hold all N elements, using a heap is not the best algorithm for the top-K problem. Indeed. In this article, let's work through the ideas for handling top-K step by step.

 

1. Quick sort algorithm

     Sorting is the most direct approach, and also the one most easily thought of. Using quick sort, the time complexity is O(N*logN). Taking descending order as an example, the top-K is a[0] to a[K-1].

     However, the top-K problem itself does not require sorting all N elements, so quick sort obviously does a lot of unnecessary work. Let's look at the following improvements.

 

2. Direct selection algorithm

    Look at the problem from another angle and simply select the winners one by one (the idea behind selection sort):

     a. First select the largest value from N numbers;

     b. Then select the largest value from the remaining N-1 numbers;

     c. Then continue to select the largest value from the remaining N-2 numbers;

      ......

     By analogy, after selecting K times, top-K is selected. Obviously, the time complexity is O(N*K).

     However, notice that during the selection process these K numbers still come out in sorted order, which is again more work than the problem requires. Let's look at the following improvements.

 

3. Heap selection algorithm

     Use the heap-based selection method, described in:

     Written interview question: find the top-K (using a heap)

    Using a heap avoids keeping the K numbers sorted, and the time complexity drops to O(N*logK). The algorithm's performance is further improved, but this is still not the best algorithm. Let's continue.

 

4. Random selection algorithm

     Let's first look at this problem: find the i-th smallest value in the array a[N] (finding the i-th largest is symmetric).

     The random selection algorithm program is:

#include <algorithm>
#include <cstdlib>
#include <iostream>
using namespace std;

int partition(int a[], int low, int high) // partition around a[low]
{
    int pivotKey = a[low];
    while(low < high)
    {
        while(low < high && a[high] >= pivotKey)
        {
            high--;
        }
        a[low] = a[high];

        while(low < high && a[low] <= pivotKey)
        {
            low++;
        }
        a[high] = a[low];
    }

    a[low] = pivotKey; // restore the pivot to its final position

    return low;
}

// Find the i-th smallest value (i is 1-based)
int randomSelect(int *a, int low, int high, int i)
{
    if(low == high)
    {
        return a[low];
    }

    // Randomize the pivot: swap a random element into a[low] so the
    // expected running time does not depend on the input order.
    int r = low + rand() % (high - low + 1);
    swap(a[low], a[r]);

    int pivot = partition(a, low, high);
    int k = pivot - low + 1; // rank of the pivot within a[low..high]

    if(k == i) // the pivot happens to be exactly the i-th smallest
    {
        return a[pivot];
    }

    if(i < k) // narrow the range and recurse into the left part
    {
        return randomSelect(a, low, pivot - 1, i);
    }

    return randomSelect(a, pivot + 1, high, i - k); // narrow the range and recurse into the right part
}

int main()
{
    int a[] = {2, 5, 3, 1, 4, 111, 55};
    int n = sizeof(a) / sizeof(a[0]);

    int i = 6;
    cout << randomSelect(a, 0, n - 1, i) << endl; // the i-th smallest value

    return 0;
}


     The random selection algorithm borrows the partition step from quick sort and applies recursive divide and conquer. Its average time complexity is O(N), and its worst-case time complexity is O(N^2).

     Why is the average O(N)? Each partition costs time linear in the current range, and on average the recursion keeps only about half of the elements. So the expected total work is roughly N + N/2 + N/4 + ... < 2N, which is O(N). Unlike quick sort, we recurse into only one side of the partition, which is why the logN factor disappears.

     What if, in an extreme case, the running time degenerates to O(N^2)? Interested readers can refer to the worst-case linear-time selection algorithm in "Introduction to Algorithms" (the median-of-medians SELECT algorithm), which guarantees O(N) even in the worst case.

 

     Since we can find the i-th smallest (or largest) value in an array in O(N) time, we can naturally handle the top-K problem in O(N) time as well: select the K-th largest element, and one more partition pass around it gathers the top-K.

     This article revisited the top-K problem and introduced the random selection algorithm. The most important thing is to understand the idea behind each algorithm.

Origin: blog.csdn.net/stpeace/article/details/108921559