Finding the median

How do we find a median?

Given an array, the simplest way to find its median is to sort the array first.

If the length of the array is odd, the middle element is the median.

If the array length is even, the average of the middle two elements is used as the median.
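The sort-based rule above can be sketched as follows. This is a minimal illustration, not code from the original article: the class name `SortMedian` is hypothetical, and the input is assumed to be non-empty.

```java
import java.util.Arrays;

class SortMedian {
    // Returns the median of a non-empty array; the caller's array is not modified.
    static double median(int[] nums) {
        int[] sorted = nums.clone();   // sort a copy
        Arrays.sort(sorted);
        int n = sorted.length;
        if (n % 2 == 1) {
            return sorted[n / 2];      // odd length: the single middle element
        }
        // even length: average of the two middle elements (long avoids int overflow)
        return ((long) sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }
}
```

Sorting costs O(NlogN) per call, which is why this only suits data that fits in memory and does not change often.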

But if the data set is very large and sorting it is not realistic, you can use a probabilistic approach instead: randomly sample part of the data, sort the sample, and use the sample's median as an estimate of the median of all the data.
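A rough sketch of the sampling idea is shown below. The class name `SampledMedian`, the choice of sampling with replacement, and the `sampleSize` parameter are all assumptions made for illustration; the result is only an estimate, and its accuracy depends on the sample size.

```java
import java.util.Arrays;
import java.util.Random;

class SampledMedian {
    // Estimates the median of data from sampleSize values drawn uniformly with replacement.
    // Assumes data is non-empty and sampleSize >= 1.
    static double estimate(int[] data, int sampleSize, Random rng) {
        int[] sample = new int[sampleSize];
        for (int i = 0; i < sampleSize; i++) {
            sample[i] = data[rng.nextInt(data.length)];
        }
        Arrays.sort(sample);           // only the small sample is sorted
        int n = sample.length;
        if (n % 2 == 1) {
            return sample[n / 2];
        }
        return ((long) sample[n / 2 - 1] + sample[n / 2]) / 2.0;
    }
}
```

Passing a seeded `Random` makes the estimate reproducible, which is convenient when experimenting with different sample sizes.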

295. Find Median from Data Stream

This problem asks us to design a class like this:

class MedianFinder {

    // Add a number
    public void addNum(int num) {
        // to be implemented
    }

    // Compute the median of all the numbers added so far
    public double findMedian() {
        // to be implemented
    }
}

General idea:

The conventional approach is:

Use an array to record all the numbers added by addNum, and keep the array sorted with insertion-sort logic. When findMedian is called, the median can be computed directly from the array indices.

But the problem with using an array as the underlying container is also obvious:

addNum can use binary search to find the insertion position, but the insertion itself has to shift elements, so the worst-case time complexity is O(N).
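The sorted-array approach might look like the sketch below. The class name `SortedListMedianFinder` is hypothetical, and an `ArrayList` stands in for the array; `Collections.binarySearch` finds the position in O(logN), but `add(index, value)` still shifts elements.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SortedListMedianFinder {
    private final List<Integer> sorted = new ArrayList<>();

    public void addNum(int num) {
        // binarySearch returns the index of num, or -(insertionPoint) - 1 if num is absent
        int pos = Collections.binarySearch(sorted, num);
        if (pos < 0) {
            pos = -pos - 1;
        }
        sorted.add(pos, num); // shifts every later element: O(N) in the worst case
    }

    // Assumes at least one number has been added
    public double findMedian() {
        int n = sorted.size();
        if (n % 2 == 1) {
            return sorted.get(n / 2);
        }
        int lo = sorted.get(n / 2 - 1);
        int hi = sorted.get(n / 2);
        return ((long) lo + hi) / 2.0;
    }
}
```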

Because insertion into an array is slow, consider using a linked list.

Inserting into a linked list is fast, but finding the insertion position requires a linear traversal, so the worst-case time complexity is O(N).

Moreover, findMedian also has to traverse the list to reach the middle position, which is again O(N) in the worst case.

Because lookups in a linked list are slow, consider using a balanced binary tree.

Insertion, deletion, lookup, and update in a balanced binary tree all take O(logN) time.

For example, Java's TreeSet container is backed by a red-black tree. addNum could insert into it directly, and findMedian could derive the rank of the median element from the current number of elements.

However, TreeSet is a Set, so it does not store duplicate elements. Our data stream may contain duplicates, and the median calculation has to count them.

Not only that, but TreeSet doesn't provide an API for quickly selecting an element by rank. That is to say, if we want to find the fifth-largest element in a TreeSet, we have to implement that ourselves.
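Both limitations can be seen in a small sketch. The class `TreeSetLimits` and its helper methods are hypothetical names used only for this illustration.

```java
import java.util.TreeSet;

class TreeSetLimits {
    // Collects the stream into a TreeSet; add() returns false and ignores duplicates.
    static TreeSet<Integer> fromStream(int... nums) {
        TreeSet<Integer> set = new TreeSet<>();
        for (int n : nums) {
            set.add(n);
        }
        return set;
    }

    // TreeSet has no select-by-rank method, so we walk it in ascending order: O(rank).
    static int elementAtRank(TreeSet<Integer> set, int rank) {
        int i = 0;
        for (int value : set) {
            if (i == rank) {
                return value;
            }
            i++;
        }
        throw new IndexOutOfBoundsException("rank " + rank);
    }
}
```

For the stream 1, 2, 2, 2, 3 the set keeps only three elements, so any median computed from it would ignore two of the 2s, and reaching a given rank takes a linear walk rather than O(logN).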

A balanced binary tree doesn't work well either, so what about a **priority queue (binary heap)**?

A priority queue is a restricted data structure: elements can be added at any time, but only the element at the top of the heap can be read or removed. addNum can insert elements into it, but findMedian needs to read from the middle of the data, and a priority queue provides no way to do that.
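To make the restriction concrete, here is a sketch of what reading a middle element from a single heap would involve. `HeapMiddleAccess` and its methods are hypothetical helpers; the point is that the only way in is through the top.

```java
import java.util.PriorityQueue;

class HeapMiddleAccess {
    static PriorityQueue<Integer> minHeapOf(int... nums) {
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        for (int n : nums) {
            heap.offer(n);
        }
        return heap;
    }

    // Returns the k-th smallest element (1-based), assuming 1 <= k <= heap.size().
    // Works on a copy so the original heap is left intact.
    static int kthSmallest(PriorityQueue<Integer> heap, int k) {
        PriorityQueue<Integer> copy = new PriorityQueue<>(heap);
        int value = 0;
        for (int i = 0; i < k; i++) {
            value = copy.poll(); // only the top can be removed: O(logN) per poll
        }
        return value;
    }
}
```

Finding the middle element this way needs about N/2 polls, so a single heap cannot answer findMedian efficiently.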

Solution:

To solve this problem, we definitely need an ordered data structure. The one used for this problem is a pair of priority queues.

The median is the middle element of the sorted array.

We can picture the sorted array as an inverted triangle (largest at the top, smallest at the bottom), where the width of each row represents the size of an element. The middle part of this inverted triangle then holds the elements used to compute the median.

Cut this inverted triangle in half through the middle, producing a small inverted triangle and a trapezoid.

The small inverted triangle can be viewed as a sorted array of the smaller elements, and the trapezoid as a sorted array of the larger elements.

They can be stored as a max-heap and a min-heap respectively, and the median is computed from their top elements.

Note the naming: the trapezoid is a min-heap, but it holds the larger elements, so we call it large; the inverted triangle is a max-heap, but it holds the smaller elements, so we call it small.

Of course, the two heaps must be maintained correctly by the algorithm so that the median can be computed from their top elements. It is easy to see that the numbers of elements in the two heaps may differ by at most 1. (This is effectively a constraint on the implementation of addNum.)

Suppose the total number of elements is n:

  • If n is even, we want the two heaps to hold the same number of elements, so we take the top elements of the two heaps and average them to get the median;
  • If n is odd, we want the two heaps to hold n/2 + 1 and n/2 elements respectively (using integer division), so that the top element of the heap with more elements is the median.

Therefore, we can write the following code:

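import java.util.PriorityQueue;

class MedianFinder {

    // Min-heap holding the larger half of the numbers
    private PriorityQueue<Integer> large;
    // Max-heap holding the smaller half of the numbers
    private PriorityQueue<Integer> small;

    public MedianFinder() {
        // Min-heap (natural ordering)
        large = new PriorityQueue<>();
        // Max-heap; Integer.compare avoids the overflow that b - a can cause
        small = new PriorityQueue<>((a, b) -> Integer.compare(b, a));
    }

    public double findMedian() {
        // If the sizes differ, the top of the heap with more elements is the median
        if (large.size() < small.size()) {
            return small.peek();
        } else if (large.size() > small.size()) {
            return large.peek();
        }
        // If the sizes are equal, the median is the average of the two top elements
        return (large.peek() + small.peek()) / 2.0;
    }

    public void addNum(int num) {
        // Implemented later in this article
    }
}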

How do we implement the addNum method?

Each time addNum is called, compare the number of elements in large and small. The new element goes into whichever heap has fewer elements; if the two heaps have the same number of elements, it goes into large by default.

// Flawed implementation
public void addNum(int num) {
    if (small.size() >= large.size()) {
        large.offer(num);
    } else {
        small.offer(num);
    }
}

But this approach still has a problem. For example:

addNum(1): both heaps hold 0 elements, so 1 is added to large by default.

addNum(2): large now has more elements than small, so 2 is added to small.

addNum(3): both heaps now hold one element, so 3 is added to large by default.

Now call findMedian: the expected result is 2, but the actual result is 1.
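The failure can be reproduced with a small self-contained sketch. The class name `FlawedMedianFinder` and the `replay` helper are hypothetical; the addNum logic is the size-only version described above.

```java
import java.util.PriorityQueue;

class FlawedMedianFinder {
    private final PriorityQueue<Integer> large = new PriorityQueue<>(); // min-heap
    private final PriorityQueue<Integer> small =
            new PriorityQueue<>((a, b) -> Integer.compare(b, a));       // max-heap

    // Flawed: balances the sizes but never looks at the values
    public void addNum(int num) {
        if (small.size() >= large.size()) {
            large.offer(num);
        } else {
            small.offer(num);
        }
    }

    public double findMedian() {
        if (large.size() < small.size()) {
            return small.peek();
        }
        if (large.size() > small.size()) {
            return large.peek();
        }
        return (large.peek() + small.peek()) / 2.0;
    }

    // Replays the sequence 1, 2, 3 and returns the reported median
    static double replay() {
        FlawedMedianFinder finder = new FlawedMedianFinder();
        finder.addNum(1);
        finder.addNum(2);
        finder.addNum(3);
        return finder.findMedian();
    }
}
```

After the three insertions, large holds {1, 3} and small holds {2}, so `replay()` returns 1.0 rather than 2: the smallest element has ended up in the heap that is supposed to hold the larger half.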

Recall that the trapezoid and the small inverted triangle were cut from the middle of the original large inverted triangle. The minimum width of the trapezoid must therefore be greater than or equal to the maximum width of the small inverted triangle, so that the two can be combined back into one big inverted triangle!

That is to say, addNum must not only keep the element counts of large and small within 1 of each other, but also keep the top element of the large heap greater than or equal to the top element of the small heap.
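These two conditions can be written down as a small check. This is only an illustrative helper, not part of the article's solution; `HeapInvariants` and its methods are hypothetical names.

```java
import java.util.PriorityQueue;

class HeapInvariants {
    static PriorityQueue<Integer> minHeapOf(int... nums) {
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        for (int n : nums) {
            heap.offer(n);
        }
        return heap;
    }

    static PriorityQueue<Integer> maxHeapOf(int... nums) {
        PriorityQueue<Integer> heap = new PriorityQueue<>((a, b) -> Integer.compare(b, a));
        for (int n : nums) {
            heap.offer(n);
        }
        return heap;
    }

    // small: max-heap of the smaller half; large: min-heap of the larger half
    static boolean holds(PriorityQueue<Integer> small, PriorityQueue<Integer> large) {
        // Condition 1: the element counts differ by at most 1
        boolean balanced = Math.abs(small.size() - large.size()) <= 1;
        // Condition 2: top of small <= top of large (trivially true if either heap is empty)
        boolean ordered = small.isEmpty() || large.isEmpty() || small.peek() <= large.peek();
        return balanced && ordered;
    }
}
```

The state produced by the flawed addNum (small = {2}, large = {1, 3}) passes the size condition but fails the ordering condition, which is exactly why its median is wrong.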

So how do we achieve this?

We can do it like this: when we want to add an element to large, we don't add it directly. Instead, we add it to small first, and then move the top element of small to large. Adding an element to small works the same way, in reverse.

What is the rationale for this?

Suppose we are going to insert an element into large:

If the inserted num is smaller than the top element of the small heap, num stays in the small heap.

To keep the element counts of the two heaps within 1 of each other, the top element of the small heap is moved to the large heap in exchange.

If the inserted num is greater than the top element of the small heap, num becomes the top element of the small heap and ends up being moved into the large heap.

Conversely, inserting an element into small follows the same reasoning. This subtly ensures that every element of the large heap is at least as large as every element of the small heap, and that the element counts of the two heaps differ by at most 1, so the median can be computed quickly from the top elements of the two heaps.
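The routing step can be isolated in a short sketch. The names `TwoHeapRouting`, `addVia`, and `tops` are hypothetical; `tops` assumes at least two numbers are inserted so that both heaps are non-empty.

```java
import java.util.PriorityQueue;

class TwoHeapRouting {
    // Grows `target` by one element: num first passes through `other`,
    // and whatever ends up on top of `other` is moved to `target`.
    static void addVia(PriorityQueue<Integer> other, PriorityQueue<Integer> target, int num) {
        other.offer(num);
        target.offer(other.poll());
    }

    // Inserts nums one by one and returns {top of small, top of large}
    static int[] tops(int... nums) {
        PriorityQueue<Integer> small = new PriorityQueue<>((a, b) -> Integer.compare(b, a)); // max-heap
        PriorityQueue<Integer> large = new PriorityQueue<>();                                // min-heap
        for (int num : nums) {
            if (small.size() >= large.size()) {
                addVia(small, large, num); // large grows by one
            } else {
                addVia(large, small, num); // small grows by one
            }
        }
        return new int[] {small.peek(), large.peek()};
    }
}
```

For the sequence 1, 2, 3 this ends with small = {1} and large = {2, 3}, so the tops are 1 and 2 and the median read from the bigger heap is 2, as expected.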

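// Correct implementation
public void addNum(int num) {
    if (small.size() >= large.size()) {
        // Route num through small, then move the top of small to large
        small.offer(num);
        large.offer(small.poll());
    } else {
        // Route num through large, then move the top of large to small
        large.offer(num);
        small.offer(large.poll());
    }
}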

The addNum method runs in O(logN) time, and the findMedian method runs in O(1) time.

The complete code is as follows:

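import java.util.PriorityQueue;

class MedianFinder {
    // Max-heap holding the smaller half of the numbers
    private PriorityQueue<Integer> small;
    // Min-heap holding the larger half of the numbers
    private PriorityQueue<Integer> large;

    public MedianFinder() {
        // Integer.compare avoids the overflow that b - a can cause
        small = new PriorityQueue<>((a, b) -> Integer.compare(b, a));
        large = new PriorityQueue<>();
    }

    public void addNum(int num) {
        // When the two heaps hold the same number of elements, the new element goes to large:
        // add it to small first, then move the top element of small to large
        if (small.size() >= large.size()) {
            small.offer(num);
            large.offer(small.poll());
        } else {
            // Otherwise the new element goes to small, passing through large first
            large.offer(num);
            small.offer(large.poll());
        }
    }

    public double findMedian() {
        // If the heap sizes differ, return the top element of the heap with more elements
        if (small.size() > large.size()) {
            return small.peek();
        } else if (small.size() < large.size()) {
            return large.peek();
        }
        // If the heap sizes are equal, return the average of the two top elements
        return (small.peek() + large.peek()) / 2.0;
    }
}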
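One way to gain confidence in the two-heap approach is to compare it against the sort-based definition on random input. The sketch below is an assumption-laden test harness, not part of the original solution: `MedianCrossCheck` is a hypothetical name, it repeats a compact copy of the two-heap logic so that it is self-contained, and the value range [-100, 100] is chosen only to make duplicates likely.

```java
import java.util.Arrays;
import java.util.PriorityQueue;
import java.util.Random;

class MedianCrossCheck {
    // Compact copy of the two-heap approach
    private final PriorityQueue<Integer> small =
            new PriorityQueue<>((a, b) -> Integer.compare(b, a)); // max-heap, smaller half
    private final PriorityQueue<Integer> large = new PriorityQueue<>(); // min-heap, larger half

    void addNum(int num) {
        if (small.size() >= large.size()) {
            small.offer(num);
            large.offer(small.poll());
        } else {
            large.offer(num);
            small.offer(large.poll());
        }
    }

    double findMedian() {
        if (small.size() > large.size()) return small.peek();
        if (small.size() < large.size()) return large.peek();
        return (small.peek() + large.peek()) / 2.0;
    }

    // Reference answer: median of the first n values, computed by sorting
    static double sortedMedian(int[] nums, int n) {
        int[] prefix = Arrays.copyOf(nums, n);
        Arrays.sort(prefix);
        if (n % 2 == 1) return prefix[n / 2];
        return (prefix[n / 2 - 1] + prefix[n / 2]) / 2.0;
    }

    // Feeds `count` random values and compares both medians after every insertion
    static boolean agreesWithSorting(long seed, int count) {
        Random rng = new Random(seed);
        int[] nums = new int[count];
        MedianCrossCheck finder = new MedianCrossCheck();
        for (int i = 0; i < count; i++) {
            nums[i] = rng.nextInt(201) - 100;
            finder.addNum(nums[i]);
            if (finder.findMedian() != sortedMedian(nums, i + 1)) {
                return false;
            }
        }
        return true;
    }
}
```

Calling `MedianCrossCheck.agreesWithSorting(1L, 200)` checks 200 successive prefixes; a mismatch at any step would indicate that one of the two invariants was broken.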

Origin blog.csdn.net/weixin_52055811/article/details/130527120