Top-K problem and random selection algorithm

     I have discussed the top-K problem before, in:

     Written interview question: find the top-K (using a heap)

 

     Some readers pointed out that when memory can hold all N elements, using a heap is not the best algorithm for the top-K problem. Indeed. In this article, let's work through the ideas for handling top-K step by step.

 

1. Quick sort algorithm

     Sorting is the most direct approach, and also the one most easily thought of. Using quick sort, the time complexity is O(N*logN). Taking descending order as an example, the top-K is a[0] to a[K-1].

     However, the top-K problem itself does not require sorting all N elements, so quick sort obviously does a lot of unnecessary work. Let's look at the following improvements.

 

2. Direct selection algorithm

    Look at the problem from another angle and simply select the winners one by one (the idea behind selection sort):

     a. First select the largest value from N numbers;

     b. Then select the largest value from the remaining N-1 numbers;

     c. Then continue to select the largest value from the remaining N-2 numbers;

      ......

     By analogy, after selecting K times, top-K is selected. Obviously, the time complexity is O(N*K).

     However, notice that during the selection process these K numbers still come out in sorted order, which is again more work than the problem requires. Let's look at the following improvements.

 

3. Heap selection algorithm

     Use the heap-based selection method, described in:

     Written interview question: find the top-K (using a heap)

    Using a heap avoids keeping the K numbers sorted, and the time complexity drops to O(N*logK). The algorithm's performance is further improved, but this is still not the best algorithm. Let's continue.

 

4. Random selection algorithm

     Let's first look at this problem: find the i-th smallest value in the array a[N] (finding the i-th largest is symmetric).

     The random selection algorithm program is:

#include <algorithm>
#include <cstdlib>
#include <iostream>
using namespace std;

int partition(int a[], int low, int high) // partition around a[low]
{
    int pivotKey = a[low];
    while(low < high)
    {
        while(low < high && a[high] >= pivotKey)
        {
            high--;
        }
        a[low] = a[high];

        while(low < high && a[low] <= pivotKey)
        {
            low++;
        }
        a[high] = a[low];
    }

    a[low] = pivotKey; // restore the pivot to its final position

    return low;
}

// Find the i-th smallest value (i is 1-based)
int randomSelect(int *a, int low, int high, int i)
{
    if(low == high)
    {
        return a[low];
    }

    // Randomize the pivot: swap a random element into a[low] so the
    // expected running time does not depend on the input order.
    int r = low + rand() % (high - low + 1);
    swap(a[low], a[r]);

    int pivot = partition(a, low, high);
    int k = pivot - low + 1; // rank of the pivot within a[low..high]

    if(k == i) // the pivot happens to be exactly the i-th smallest
    {
        return a[pivot];
    }

    if(i < k) // narrow the range and recurse into the left part
    {
        return randomSelect(a, low, pivot - 1, i);
    }

    return randomSelect(a, pivot + 1, high, i - k); // narrow the range and recurse into the right part
}

int main()
{
    int a[] = {2, 5, 3, 1, 4, 111, 55};
    int n = sizeof(a) / sizeof(a[0]);

    int i = 6;
    cout << randomSelect(a, 0, n - 1, i) << endl; // the i-th smallest value

    return 0;
}


     The random selection algorithm borrows the partition step from quick sort and applies recursive divide and conquer. Its average time complexity is O(N), and its worst-case time complexity is O(N^2).

     Why is the average O(N)? Each partition costs time linear in the current range, and on average the recursion keeps only about half of the elements. So the expected total work is roughly N + N/2 + N/4 + ... < 2N, which is O(N). Unlike quick sort, we recurse into only one side of the partition, which is why the logN factor disappears.

     What if, in an extreme case, the running time degenerates to O(N^2)? Interested readers can refer to the worst-case linear-time selection algorithm in "Introduction to Algorithms" (the median-of-medians SELECT algorithm), which guarantees O(N) even in the worst case.

 

     Since we can find the i-th smallest (or largest) value in an array in O(N) time, we can naturally handle the top-K problem in O(N) time as well: select the K-th largest element, and one more partition pass around it gathers the top-K.

     This article revisited the top-K problem and introduced the random selection algorithm. The most important thing is to understand the idea behind each algorithm.

Origin: blog.csdn.net/stpeace/article/details/108921559