Algorithm analysis of the mode

  Copyright Statement: This article is an original work by Colin Cai. Reposting is welcome, but you must cite the original URL 

  http://www.cnblogs.com/Colin-Cai/p/12664044.html 

  Author: windows 

  QQ / WeChat: 6679072 

  E-mail: [email protected]

  The so-called mode comes from the following problem: given an array of length len in which some number appears more than len / 2 times, how do we find that number?

  

  Sort-based

 

  Sorting is the first idea that comes to mind: sort the array, then traverse it once to get the result.

  In C, it is basically written as follows:

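/* sort() and traverse() are placeholders: any O(n log n) sort, then a linear scan. */
int find(int *a, int len)
{
    sort(a, len);            /* sort the array in place */
    return traverse(a, len); /* find the value whose run is longer than len / 2 */
}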

  Sorting can be done in O(n log n) time, for example with merge sort or heap sort (quick sort is O(n log n) on average), and traversing the sorted array to get the result takes linear time, O(n). So the time complexity of the whole algorithm is O(n log n).
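  As a minimal Python sketch of the sort-based idea (the name find_by_sorting is illustrative and not from the original): after sorting, equal values sit next to each other, so one linear scan over the runs finds the value whose run is longer than len / 2.

```python
def find_by_sorting(a):
    """Return the value that occurs more than len(a) / 2 times, or None."""
    s = sorted(a)                      # O(n log n)
    run_start = 0
    for i in range(1, len(s) + 1):
        # A run of equal values ends at the end of the list
        # or where the value changes.
        if i == len(s) or s[i] != s[run_start]:
            if 2 * (i - run_start) > len(s):
                return s[run_start]
            run_start = i
    return None


print(find_by_sorting([2, 1, 2, 3, 2]))   # prints 2
```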

 

  Find a better algorithm

 

  The above algorithm is very simple, but the first idea that comes to mind is often not the best one.

  Can we find a linear-time algorithm, that is, a Θ(n) algorithm? Θ denotes a bound that is both an upper and a lower bound. In fact, it is easy to prove that no algorithm can do better than linear time, that is, run in o(n) time. Little o differs from big O: it means a strictly lower order of growth. The proof is roughly as follows:

  Suppose an algorithm solves the problem in o(n) time. Because its running time grows strictly slower than n, there must be an array of length N on which the algorithm examines fewer than N / 2 elements. Let a be the result the algorithm returns for this array. Now replace every element that the algorithm did not examine with the same number b, where b ≠ a. Elements that are never examined cannot affect the algorithm, so on the new array it still returns a; but more than N / 2 of the elements are now b, so the correct answer is b. This is a contradiction, so no o(n)-time algorithm exists for this problem.

  

  We can now start thinking about something deeper.

  We first observe that if we remove two different numbers from the array, the new array still has the same mode as the old one. This is easy to prove:

  Suppose the array is a, its length is len, the mode is x, and x occurs t times. Of course, t > len / 2. Suppose we remove two numbers y and z with y ≠ z; the remaining array has length len-2. If neither number equals the mode, that is, x ≠ y and x ≠ z, then x still occurs t times in the new array, and t > len / 2 > (len-2) / 2, so x is still the mode of the new array. If one of the two numbers is x, it can only be one of them (because y ≠ z), so x occurs t-1 times in the remaining array, and t-1 > len / 2 - 1 = (len-2) / 2, so x is still the mode of the new array.
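  The proposition can also be checked by brute force on a small array. The following Python sketch (helper names are hypothetical) removes every possible pair of unequal elements and confirms that the mode is unchanged; it only illustrates the claim and is not part of the algorithm.

```python
from itertools import combinations


def majority(a):
    """Return the value occurring more than len(a) / 2 times, or None."""
    for v in set(a):
        if 2 * a.count(v) > len(a):
            return v
    return None


def survives_every_removal(a):
    """Check that removing any two unequal elements keeps the same mode."""
    x = majority(a)
    for i, j in combinations(range(len(a)), 2):
        if a[i] != a[j]:
            rest = [v for k, v in enumerate(a) if k not in (i, j)]
            if majority(rest) != x:
                return False
    return True


print(survives_every_removal([1, 2, 1, 3, 1]))   # prints True
```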

  

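  With this idea in hand, we need a way to find such pairs of different numbers.

  We can record a number num and the number of times it repeats, times, and traverse the array once: if times is 0, set num to the current element and times to 1; otherwise add 1 to times if the current element equals num, and subtract 1 if it does not.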

 

 

  num / times always record a number and its repetitions. Whether times is increased or decreased by 1 depends on whether the new element of the array equals num. The decrease by 1 relies on the proposition proved above: having found a pair of different numbers, we remove the two, and the mode of the remaining array does not change.

  The point is to prove that the final result is the required mode. If the final result were not the mode, then every occurrence of the mode must have been offset by a non-mode number, so the number of non-mode elements in the array would be no less than the number of occurrences of the mode, which is impossible. So the algorithm is correct, and it has linear time complexity O(n) and constant space complexity O(1).

  The C language code is basically as follows:

int find(int *a, int len)
{
    int i, num = 0, times = 0;

    for (i = 0; i < len; i++) {
        if (times > 0) {
            if (num == a[i])
                times++;
            else
                times--;
        } else {
            num = a[i];
            times = 1;
        }
    }
    return num;
}

  Written in Scheme, the program can be as concise as the following:

(define (find s)
 (car
  (fold-right
   (lambda (n r)
    (if (zero? (cdr r))
     (cons n 1)
     (cons (car r) ((if (eqv? n (car r)) + -) (cdr r) 1))))
   '(() . 0) s)))
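
  Both listings assume that a mode really exists. If the input does not guarantee this, a second pass can confirm the candidate. The following Python sketch shows one way to do that; the verification step and the name find_majority are additions for illustration, not part of the original code.

```python
def find_majority(a):
    """Voting pass followed by a verification pass; None if no mode exists."""
    num, times = None, 0
    for v in a:
        if times == 0:
            num, times = v, 1      # start counting a new candidate
        elif v == num:
            times += 1
        else:
            times -= 1             # v and one copy of num cancel each other
    # Verification pass (an addition, not in the original listings).
    if a and 2 * a.count(num) > len(a):
        return num
    return None


print(find_majority([1, 2, 1, 3, 1]))   # prints 1
print(find_majority([1, 2, 3]))         # prints None
```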

 

  The upgraded problem

 

  The mode above appears more than 1/2 of the array length. If we change 1/2 to 1/3, how do we find it?

  For example, if the array is [1, 1, 2, 3, 4], then the mode to be found is 1.

  Let's generalize further: if the threshold is 1/m, where m is a parameter, how do we find the mode? This question is more complicated than the previous one. In addition, we must realize that after the upgrade there may be more than one mode. For example, [1, 1, 2, 2, 3] has length 5, and 1 and 2 both appear more than 5/3 times. There are at most m-1 modes.
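  To make the definition concrete, here is a straightforward Python reference that simply counts every value with collections.Counter. It uses extra memory proportional to the number of distinct values, so it is only a baseline to compare against, and the name modes_by_counting is made up for this sketch. It also shows why there are at most m-1 modes: m values that each occur more than len / m times would need more than len elements in total.

```python
from collections import Counter


def modes_by_counting(a, m):
    """Return every value that occurs more than len(a) / m times."""
    return sorted(v for v, t in Counter(a).items() if t * m > len(a))


print(modes_by_counting([1, 1, 2, 3, 4], 3))   # prints [1]
print(modes_by_counting([1, 1, 2, 2, 3], 3))   # prints [1, 2]
```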

 

  Ideas

 

  Sorting and then traversing still works, but the time complexity is still O(n log n), and we still want an algorithm with linear time complexity.

  For the first problem, the key fact was that removing two different numbers from the array leaves the mode unchanged. Does a similar result hold after the upgrade? Now that the threshold changes from more than 1/2 to more than 1/m, we look at removing m mutually different numbers from an array a of length len. The proof is as follows:

  Similarly, assume a has a mode x that occurs t times, so t > len / m, and check whether x is still a mode after m mutually different numbers are removed. The new array has length len-m. If x is not among the removed m numbers, x still occurs t times in the remaining array, and t > len / m > (len-m) / m, so x is still a mode. If x is among the removed m numbers, then because the m numbers are mutually different exactly one x is removed, so x occurs t-1 times in the remaining array, and t-1 > len / m - 1 = (len-m) / m, so x is still a mode of the remaining array. This holds for every mode of the array. The converse is only partly true: replacing > with ≤ in the second case shows that a removed number that was not a mode is still not a mode, but a non-mode that is not among the removed numbers can become a mode of the shorter array. For example, with m = 3, removing 2, 3, 4 from [1, 2, 3, 4] leaves [1], whose mode is 1.

  With the above understanding, we can follow the previous algorithm, but the single num / times record is replaced by a linked list of such entries with a length of at most m-1. For example, take the array [1, 2, 1, 3] with m = 3: the mode 1 appears more than 1/3 of the array length 4, and the linked list holds at most 2 entries. The process is as follows:

  Initially, the linked list is empty: []

  Read the first element, 1. There is no entry with num = 1 in the linked list, and the length of the linked list has not reached 2, so insert it and get [(num = 1, times = 1)]

  Read the second element, 2. There is no entry with num = 2 in the linked list, and the length of the linked list has not reached 2, so insert it and get [(num = 1, times = 1), (num = 2, times = 1)]

  Read the third element, 1. An entry with num = 1 already exists in the linked list, so add 1 to its times and get [(num = 1, times = 2), (num = 2, times = 1)]

  Read the fourth element, 3. There is no entry with num = 3 in the linked list, and the length of the linked list has reached its maximum, 2, so perform an elimination: subtract 1 from the times of every entry, remove the entries that drop to 0, and get [(num = 1, times = 1)]

  The above is the process, and finally the mode is 1.
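
  The steps above can be reproduced with a short Python sketch that prints the list after each element. A plain Python list of [num, times] pairs stands in for the linked list, and the function name candidates and its trace flag are hypothetical.

```python
def candidates(a, m, trace=False):
    """One pass that keeps at most m - 1 [num, times] entries."""
    entries = []
    for v in a:
        for e in entries:
            if e[0] == v:
                e[1] += 1                  # v is already recorded
                break
        else:
            if len(entries) < m - 1:
                entries.append([v, 1])     # room left: record v
            else:
                # List is full: cancel v against one copy of every entry
                # and drop the entries whose times reaches 0.
                entries = [[n, t - 1] for n, t in entries if t > 1]
        if trace:
            print(v, entries)
    return entries


candidates([1, 2, 1, 3], 3, trace=True)
# 1 [[1, 1]]
# 2 [[1, 1], [2, 1]]
# 1 [[1, 2], [2, 1]]
# 3 [[1, 1]]
```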

  The linked list obtained by the above process does contain all the modes. This is easy to prove: each elimination removes m elements, so there are at most len / m eliminations, each of them cancels at most one occurrence of a given number, and a mode occurs more than len / m times, so its times can never be cancelled completely. However, the process does not guarantee that everything in the final list is a mode. For example, [1, 1, 2, 3, 4] ends with [(num = 1, times = 1), (num = 4, times = 1)], but 4 is not a mode.

  So after getting the linked list we need to traverse the array again, and count how many times each number in the linked list actually occurs.

  In Python we can use the higher-order functions map / reduce in place of procedural loops. Even so, the algorithm takes this much code:

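from functools import reduce
def find(a, m):
    # Return the position of the first item of arr that satisfies test, or -1.
    def find_index(arr, test):
        for i in range(len(arr)):
            if test(arr[i]):
                return i
        return -1
    # First pass: r is the list of [num, times] entries, at most m-1 long.
    def check(r, n):
        index = find_index(r, lambda x : x[0]==n)
        if index >= 0:
            r[index][1] += 1
            return r
        if len(r) < m-1:
            return r+[[n,1]]
        # The list is full: subtract 1 from every entry and drop those that reach 0.
        return reduce(lambda arr,x : arr if x[1]==1 else arr+[[x[0],x[1]-1]], r, [])
    # Second pass: count the real occurrences of each candidate.
    def count(r, n):
        index = find_index(r, lambda x : x[0]==n)
        if index < 0:
            return r
        r[index][1] += 1
        return r
    # Reset the candidates' counts to 0, recount, and keep only the true modes.
    return reduce(lambda r,x : r+[x[0]] if x[1]>len(a)//m else r, \
        reduce(count, a, \
            list(map(lambda x : [x[0],0], reduce(check, a, [])))), [])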

 

  Written in C the code would be longer, but there is no need for a linked list: a fixed-length array of m-1 entries is much more efficient, with times = 0 marking an entry as unoccupied. It is not implemented here and is left to interested readers.
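
  As a rough sketch of that fixed-length idea, written in Python rather than C and not the author's code, the following uses two preallocated lists of m-1 slots in which times = 0 marks a free slot. The names find_modes_fixed, nums, times and counts are assumptions made for this illustration.

```python
def find_modes_fixed(a, m):
    """Values occurring more than len(a) / m times, using m - 1 fixed slots."""
    k = m - 1
    nums = [None] * k      # candidate values
    times = [0] * k        # times[i] == 0 means slot i is free

    for v in a:
        free = -1
        for i in range(k):
            if times[i] > 0 and nums[i] == v:
                times[i] += 1          # v already has a slot
                break
            if times[i] == 0 and free < 0:
                free = i               # remember the first free slot
        else:
            if free >= 0:
                nums[free], times[free] = v, 1
            else:
                for i in range(k):     # all slots busy: subtract 1 from each
                    times[i] -= 1

    # Second pass: recount the surviving candidates exactly.
    counts = [0] * k
    for v in a:
        for i in range(k):
            if times[i] > 0 and nums[i] == v:
                counts[i] += 1
                break
    return [nums[i] for i in range(k)
            if times[i] > 0 and counts[i] * m > len(a)]


print(find_modes_fixed([1, 1, 2, 3, 4], 3))   # prints [1]
```

  Each element scans at most m-1 slots, so the two passes take O(n·m) time and O(m) extra space, which is linear in n for a fixed m.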
