Discussion of the Top K Problem in Massive Data Processing (Java Implementations and Applicable Scope of Three Methods)

Problem:
CVTE written test https://www.1024do.com/?p=3949
A search engine records every query string that users submit in its log files; each query string is 1 to 255 bytes long.
Suppose there are currently 10 million records. The query strings are highly repetitive: although the total is 10 million, there are no more than 3 million distinct strings once duplicates are removed. The more often a query string is repeated, the more users searched for it and the more popular it is. Find the 10 most popular query strings, using no more than 1 GB of memory.
 
Idea: The problem can be solved in two steps: 1. Count the number of occurrences of each query string (hereinafter "query"). 2. Find the top 10 from the counts.
 
1. Count the number of occurrences of each query:
Using hashing, maintain a hash table whose key is the query string and whose value is the number of times that query has occurred. Each time a query is read, if it is not yet in the table, add it with a value of 1; if it is already in the table, increment its count.
Because a hash table lookup takes close to O(1) time, counting N records takes O(N) time overall, that is, linear time. At most 3 million distinct keys of at most 255 bytes each come to roughly 765 MB of raw string data, which is why the table is expected to fit within the 1 GB limit (ignoring per-entry overhead).
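
The counting step can be sketched in Java as follows. This is a minimal illustration, not the original author's code: the class name QueryCounter and the method countQueries are hypothetical, and a real log would be streamed from files rather than passed in as a list.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class QueryCounter {
    /** Counts how many times each query string occurs (hypothetical helper). */
    static Map<String, Integer> countQueries(Iterable<String> queries) {
        Map<String, Integer> counts = new HashMap<>();
        for (String query : queries) {
            // merge() stores 1 for a new key, otherwise adds 1 to the stored count
            counts.merge(query, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("java", "heap", "java", "top k", "heap", "java");
        Map<String, Integer> counts = countQueries(log);
        System.out.println(counts.get("java")); // 3
        System.out.println(counts.get("heap")); // 2
    }
}
```

Each record costs one hash lookup and one update, which is where the O(N) total comes from.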
 
2. Find the top K from the counts:
A heap lets us find the extreme element and re-adjust after a change in logarithmic time.
Specific method: maintain a min-heap (small root heap) of size K (10 in this problem), then traverse the 3 million distinct queries and compare each one's count with the root. Because we want the 10 largest counts, a min-heap is used: each element only needs to be compared with the smallest element in the heap, the root. If it is not larger than the root, it cannot be in the top K; if it is larger than the root, it replaces the root (the smallest of the current candidates) and the heap is re-adjusted.
 
The K elements left in the heap at the end are the top K.
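
As a rough sketch of this second step, the code below keeps a min-heap of at most K entries while scanning a count table. It uses java.util.PriorityQueue instead of a hand-written heap; the class name TopQueries and the method topK are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopQueries {
    /** Returns the k most frequent queries, most popular first (hypothetical helper). */
    static List<String> topK(Map<String, Integer> counts, int k) {
        List<String> result = new ArrayList<>();
        if (k <= 0) {
            return result;
        }
        // Min-heap ordered by count: the root is the least popular current candidate.
        Comparator<Map.Entry<String, Integer>> byCount =
                (x, y) -> Integer.compare(x.getValue(), y.getValue());
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(k, byCount);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (heap.size() < k) {
                heap.offer(entry);
            } else if (entry.getValue() > heap.peek().getValue()) {
                heap.poll();        // evict the smallest candidate
                heap.offer(entry);  // and keep the more popular query
            }
        }
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey()); // comes out in ascending order of count
        }
        Collections.reverse(result);
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("java", 3);
        counts.put("heap", 2);
        counts.put("top k", 1);
        System.out.println(topK(counts, 2)); // [java, heap]
    }
}
```

With N' distinct queries, the scan costs O(N' log K), and the heap never holds more than K entries.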
 
Top K problem

Discussion of the Top K problem (Java implementations and applicable scope of three methods)

Top K is a favorite topic in written tests and interviews. Below are three implementations and the situations each one suits, based on my own experience.

  1. Merge

    This method suits the case where Top K must be found across several arrays that are each already sorted. The time complexity is O(k*m), where m is the number of arrays. The implementation is as follows:

import java.util.Arrays;
import java.util.List;

/**
 * Given m arrays, each sorted in decreasing order, find the k largest values overall.
 * Uses the merge approach; time complexity O(k*m).
 */
public class TopKByMerge {
    public int[] getTopK(List<List<Integer>> input, int k) {
        int[] index = new int[input.size()]; // next unread position in each list
        int[] result = new int[k];
        for (int i = 0; i < k; i++) {
            int maxIndex = -1; // list whose current head is the largest seen so far
            for (int j = 0; j < input.size(); j++) {
                if (index[j] < input.get(j).size()
                        && (maxIndex < 0
                            || input.get(j).get(index[j]) > input.get(maxIndex).get(index[maxIndex]))) {
                    maxIndex = j;
                }
            }
            if (maxIndex < 0) {
                // Every list is exhausted: fewer than k values exist, so return only those found.
                return Arrays.copyOf(result, i);
            }
            result[i] = input.get(maxIndex).get(index[maxIndex]);
            index[maxIndex]++;
        }
        return result;
    }
}
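
When m is large, scanning all m list heads for every output value becomes the dominant cost. A possible variant, which is not part of the original method, keeps the current head of each list in a max-heap so that each of the k steps costs O(log m) instead of O(m). The class name TopKByHeapMerge is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class TopKByHeapMerge {
    /** Returns up to k largest values from lists that are each sorted in decreasing order. */
    static List<Integer> getTopK(List<List<Integer>> input, int k) {
        // Each heap entry is {listIndex, positionInList}; the entry with the largest value is on top.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                (x, y) -> Integer.compare(input.get(y[0]).get(y[1]), input.get(x[0]).get(x[1])));
        for (int j = 0; j < input.size(); j++) {
            if (!input.get(j).isEmpty()) {
                heap.offer(new int[] {j, 0});
            }
        }
        List<Integer> result = new ArrayList<>();
        while (result.size() < k && !heap.isEmpty()) {
            int[] top = heap.poll();
            List<Integer> source = input.get(top[0]);
            result.add(source.get(top[1]));
            if (top[1] + 1 < source.size()) {
                heap.offer(new int[] {top[0], top[1] + 1}); // advance within the same list
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> input = Arrays.asList(
                Arrays.asList(9, 5, 1),
                Arrays.asList(8, 7, 2),
                Arrays.asList(6, 4, 3));
        System.out.println(getTopK(input, 4)); // [9, 8, 7, 6]
    }
}
```

For a handful of arrays the linear scan above is simpler and just as fast; the heap only pays off as m grows.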
  2. Quickselect (the partition step of quick sort)

    This method reuses the partition step of quick sort to find Top K. The average time complexity is O(n). It suits a single unsorted array. The idea and a Java implementation are as follows:

The goal of Quick Select is to find the kth largest element:

Select a pivot element and partition the array into two subarrays, with the elements larger than the pivot on the left and the rest on the right.

  • If the length of the left subarray is >= k, the kth largest element must be in the left subarray;
  • If the length of the left subarray = k-1, the kth largest element is the pivot;
  • If neither condition holds, the kth largest element must be in the right subarray.

The code below applies the same idea in the mirrored direction: it places the smaller elements on the left, so the k smallest numbers end up at the front of the array.
/*
 * Uses the partition step of quick sort to move the k smallest numbers
 * to the front of the array (a[0..k-1], in no particular order).
 */
public class TopK {
    // Lomuto partition around a[end]; returns the pivot's final index.
    int partition(int[] a, int first, int end) {
        int i = first;
        int pivot = a[end];
        for (int j = first; j < end; j++) {
            if (a[j] < pivot) {
                int temp = a[j];
                a[j] = a[i];
                a[i] = temp;
                i++;
            }
        }
        a[end] = a[i];
        a[i] = pivot;
        return i;
    }

    void getTopKMinBySort(int[] a, int first, int end, int k) {
        if (first < end) {
            int partitionIndex = partition(a, first, end);
            if (partitionIndex == k - 1) {
                return; // exactly k-1 smaller elements sit before the pivot
            } else if (partitionIndex > k - 1) {
                getTopKMinBySort(a, first, partitionIndex - 1, k);
            } else {
                getTopKMinBySort(a, partitionIndex + 1, end, k);
            }
        }
    }

    public static void main(String[] args) {
        int[] a = {2, 20, 3, 7, 9, 1, 17, 18, 0, 4};
        int k = 6;
        new TopK().getTopKMinBySort(a, 0, a.length - 1, k);
        for (int i = 0; i < k; i++) {
            System.out.print(a[i] + " ");
        }
    }
}
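
For comparison, the kth-largest formulation given in the bullet list can be sketched directly. This is an illustrative variant under my own naming (KthLargest, kthLargest, partitionDescending are hypothetical): elements greater than the pivot go to the left, and the loop narrows the search range according to the length of that left part.

```java
public class KthLargest {
    /** Returns the kth largest element (k is 1-based). Reorders the array in place. */
    static int kthLargest(int[] a, int k) {
        if (k < 1 || k > a.length) {
            throw new IllegalArgumentException("k out of range");
        }
        int first = 0;
        int end = a.length - 1;
        while (true) {
            int p = partitionDescending(a, first, end);
            int leftLength = p - first;   // number of elements greater than the pivot
            if (leftLength == k - 1) {
                return a[p];              // the pivot is the kth largest
            } else if (leftLength >= k) {
                end = p - 1;              // the answer is in the left subarray
            } else {
                k -= leftLength + 1;      // skip the left subarray and the pivot
                first = p + 1;            // the answer is in the right subarray
            }
        }
    }

    /** Partitions a[first..end] around a[end], placing larger elements on the left. */
    static int partitionDescending(int[] a, int first, int end) {
        int pivot = a[end];
        int i = first;
        for (int j = first; j < end; j++) {
            if (a[j] > pivot) {
                int temp = a[j];
                a[j] = a[i];
                a[i] = temp;
                i++;
            }
        }
        a[end] = a[i];
        a[i] = pivot;
        return i;
    }

    public static void main(String[] args) {
        int[] a = {2, 20, 3, 7, 9, 1, 17, 18, 0, 4};
        System.out.println(kthLargest(a, 3)); // 17
    }
}
```

Because the last element is always chosen as the pivot, an already sorted input degrades to O(n^2); picking the pivot at random is a common way to make that case unlikely.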
  3. Use a min-heap or a max-heap

   To find the largest K, use a min-heap (small root heap); to find the smallest K, use a max-heap (large root heap).

  Steps for finding the largest K:

  1.     Build a min-heap of K nodes from the first K data items.
  2.     Scan the remaining N-K data items:
  • If an item is larger than the root of the min-heap, overwrite the root with it and re-adjust the heap so that it is a min-heap again.
  • If an item is less than or equal to the root, the heap is unchanged.

 Finding the smallest K is analogous. The time complexity is O(n log K), where n is the number of data items, and only K items need to be held in memory at once, which makes this method especially suitable for Top K over big data.
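
The mirrored case, the smallest K with a max-heap, might look like the sketch below. It relies on java.util.PriorityQueue with a reversed comparator rather than a hand-written heap, and the class name SmallestK is an assumption made for this example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.PriorityQueue;

public class SmallestK {
    /** Returns the k smallest values in ascending order (fewer if input is shorter than k). */
    static int[] smallestK(int[] input, int k) {
        if (k <= 0) {
            return new int[0];
        }
        // Max-heap: the root is the largest of the k smallest values seen so far.
        PriorityQueue<Integer> heap = new PriorityQueue<>(k, Collections.reverseOrder());
        for (int value : input) {
            if (heap.size() < k) {
                heap.offer(value);
            } else if (value < heap.peek()) {
                heap.poll();
                heap.offer(value);
            }
        }
        int[] result = new int[heap.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = heap.poll(); // the heap yields largest first, so fill from the back
        }
        return result;
    }

    public static void main(String[] args) {
        int[] a = {4, 3, 5, 1, 2, 8, 9, 10};
        System.out.println(Arrays.toString(smallestK(a, 3))); // [1, 2, 3]
    }
}
```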

/**
 * Find the K largest values with a min-heap. The heap is the preferred method when the
 * amount of data is large, especially when it does not fit in memory.
 */
public class TopK {
    /**
     * Builds a min-heap from the first k elements of a.
     *
     * @param a input data; must contain at least k elements
     * @param k number of nodes in the heap
     * @return an array of length k arranged as a min-heap
     */
    int[] createHeap(int[] a, int k) {
        int[] result = new int[k];
        for (int i = 0; i < k; i++) {
            result[i] = a[i];
        }
        // Sift each element up until its parent is no larger than it.
        for (int i = 1; i < k; i++) {
            int child = i;
            int temp = result[i];
            while (child > 0 && result[(child - 1) / 2] > temp) {
                result[child] = result[(child - 1) / 2];
                child = (child - 1) / 2;
            }
            result[child] = temp;
        }
        return result;
    }

    /** Overwrites the root (the smallest element) with value, then sifts it down. */
    void insert(int[] a, int value) {
        a[0] = value;
        int parent = 0;
        while (parent < a.length) {
            int lchild = 2 * parent + 1;
            int rchild = 2 * parent + 2;
            int minIndex = parent;
            if (lchild < a.length && a[parent] > a[lchild]) {
                minIndex = lchild;
            }
            if (rchild < a.length && a[minIndex] > a[rchild]) {
                minIndex = rchild;
            }
            if (minIndex == parent) {
                break; // heap property restored
            }
            int temp = a[parent];
            a[parent] = a[minIndex];
            a[minIndex] = temp;
            parent = minIndex;
        }
    }

    int[] getTopKByHeap(int[] input, int k) {
        int[] heap = this.createHeap(input, k);
        for (int i = k; i < input.length; i++) {
            // Only a value larger than the current minimum can enter the top K.
            if (input[i] > heap[0]) {
                this.insert(heap, input[i]);
            }
        }
        return heap;
    }

    public static void main(String[] args) {
        int[] a = {4, 3, 5, 1, 2, 8, 9, 10};
        int[] result = new TopK().getTopKByHeap(a, 3);
        // Prints 8, 10, 9: the result is in heap order, not sorted.
        for (int temp : result) {
            System.out.println(temp);
        }
    }
}
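
The same procedure can also be written with the standard library's java.util.PriorityQueue, which is a min-heap under natural ordering. The sketch below is an alternative to the hand-written heap, not a replacement the original prescribes; the class name TopKWithPriorityQueue is hypothetical.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

public class TopKWithPriorityQueue {
    /** Returns the k largest values, largest first (fewer if input is shorter than k). */
    static int[] topK(int[] input, int k) {
        if (k <= 0) {
            return new int[0];
        }
        PriorityQueue<Integer> heap = new PriorityQueue<>(k); // natural ordering: min-heap
        for (int value : input) {
            if (heap.size() < k) {
                heap.offer(value);
            } else if (value > heap.peek()) {
                heap.poll();       // drop the smallest of the current top K
                heap.offer(value);
            }
        }
        int[] result = new int[heap.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = heap.poll(); // the heap yields smallest first, so fill from the back
        }
        return result;
    }

    public static void main(String[] args) {
        int[] a = {4, 3, 5, 1, 2, 8, 9, 10};
        System.out.println(Arrays.toString(topK(a, 3))); // [10, 9, 8]
    }
}
```

Unlike the array-based version, this one also tolerates inputs shorter than k, at the cost of boxing each value into an Integer.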
