[Data Structures and Algorithms] -> Data Structure -> Heap (Part 2) -> Applications of the Heap

Ⅰ Preface

In the previous article, I explained the principle and implementation of the heap in detail, and covered the application that comes to mind most readily: heap sort. In fact, the heap has several other very important applications that we may use frequently in software development. In this article, let's take a look at them.

[Data Structures and Algorithms] -> Data Structure -> Heap (Part 1) -> The Heap & Heap Sort in Detail

Ⅱ Heap Application 1: The Priority Queue

First, let's look at the first application scenario of the heap: the priority queue.

A priority queue, as the name suggests, is first of all a queue. We know that the defining feature of a queue is first in, first out. In a priority queue, however, data is dequeued not in arrival order but by priority: the element with the highest priority is dequeued first.

[Data Structures and Algorithms] -> Data Structure -> Queue -> Applications of the Circular Array & Building a Queue Utility Library

So how do we implement a priority queue? There are many ways, but the most direct and efficient is to use a heap, because a heap and a priority queue are very similar. A heap can be regarded as a priority queue; in many cases the two are distinguished only in name. Inserting an element into the priority queue corresponds to inserting an element into the heap, and removing the highest-priority element corresponds to removing the top element of the heap.

Priority queues have many application scenarios, such as Huffman coding, shortest paths in graphs, and minimum spanning tree algorithms. What's more, many languages provide priority queue implementations, such as PriorityQueue in Java and priority_queue in C++.
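As a quick illustration, here is a minimal sketch using Java's PriorityQueue, which is a min-heap by default (the values are arbitrary):

```java
import java.util.PriorityQueue;

public class PriorityQueueDemo {
    public static void main(String[] args) {
        // Java's PriorityQueue is a min-heap by default:
        // poll() always removes and returns the smallest remaining element.
        PriorityQueue<Integer> minHeap = new PriorityQueue<>();
        minHeap.offer(5); // enqueue = insert into the heap
        minHeap.offer(1);
        minHeap.offer(3);

        while (!minHeap.isEmpty()) {
            System.out.println(minHeap.poll()); // prints 1, 3, 5
        }
    }
}
```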

Now, let's take two concrete examples to see how the priority queue is used.

1. Merging ordered small files

Suppose we have 100 small files, each about 100 MB in size, and each file stores ordered strings. We want to merge these 100 small files into one ordered large file. This is where a priority queue comes in.

The overall idea is similar to the merge function in merge sort. We take the first string from each of the 100 files and put them into an array, then compare them, write the smallest string to the merged large file, and delete it from the array.

Suppose the smallest string came from the small file 7.txt. We then take the next string from that file, put it into the array, compare again, write the smallest string to the large file, and delete it from the array. And so on, until all the data in every file has been written to the large file.

Here we used an array to store the strings taken from the small files. Every time we take the smallest string out, we have to traverse the entire array, which is obviously inefficient.

Instead, we can use a priority queue, that is, a heap. We put the strings taken from the small files into a min-heap; the element at the top of the heap, which is the element at the head of the priority queue, is the smallest string. We write this string to the large file and delete it from the heap, then take the next string from the same small file and insert it into the heap. Looping through this process writes the data from all 100 small files into the large file in order.

We know that deleting the top element of a heap and inserting an element into a heap both take O(logn) time, where n is the number of elements in the heap, here 100. That is much more efficient than the array-based approach.
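Here is a minimal sketch of this k-way merge. For simplicity it assumes each "file" is an in-memory sorted list of strings (hypothetical names; real code would stream lines from disk):

```java
import java.util.*;

public class MergeSortedFiles {
    // One heap entry: the current string plus the index of the "file" it came from.
    record Entry(String value, int fileIndex) {}

    // Merges k sorted lists into one sorted list using a min-heap.
    static List<String> merge(List<List<String>> files) {
        PriorityQueue<Entry> heap =
                new PriorityQueue<>(Comparator.comparing(Entry::value));
        int[] next = new int[files.size()]; // next index to read from each file

        // Seed the heap with the first string of every non-empty file.
        for (int i = 0; i < files.size(); i++) {
            if (!files.get(i).isEmpty()) {
                heap.offer(new Entry(files.get(i).get(0), i));
                next[i] = 1;
            }
        }

        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Entry smallest = heap.poll();  // overall smallest string
            merged.add(smallest.value());  // "write" it to the big file
            int i = smallest.fileIndex();
            if (next[i] < files.get(i).size()) {
                // Refill the heap from the same file the winner came from.
                heap.offer(new Entry(files.get(i).get(next[i]++), i));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> files = List.of(
                List.of("apple", "melon"),
                List.of("banana", "pear"),
                List.of("cherry"));
        System.out.println(merge(files)); // [apple, banana, cherry, melon, pear]
    }
}
```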

2. A high-performance timer

Suppose we have a timer that maintains many scheduled tasks, each of which is set to trigger at a specific point in time. Every small unit of time, say 1 second, the timer scans the tasks to see whether any task has reached its scheduled execution time. If one has, it is taken out and executed.

Scheduled tasks:

2020-09-11 16:42  Task 1
2020-09-11 17:52  Task 2
2020-09-11 18:01  Task 3
2020-09-11 13:15  Task 4

However, scanning the task list every second like this is inefficient, for two reasons:

  1. A task's scheduled execution time may be far from the current time, so most of the earlier scans accomplish nothing;
  2. Each scan traverses the entire task list; if the list is large, this is inevitably time-consuming.

We can solve these problems with a priority queue. We store the tasks in the priority queue ordered by their scheduled execution times, so the head of the queue (that is, the top of the min-heap) holds the task that should execute first.

This way, the timer does not need to scan the task list every second. It takes the execution time of the task at the head of the queue and subtracts the current time from it, obtaining a time interval T.

This interval T is how long we have to wait, starting from now, before the first task needs to be executed. The timer can therefore be set to fire after T seconds; from the current moment until then, the timer does not need to do anything.

After T seconds have elapsed, the timer executes the task at the head of the priority queue, then computes the difference between the execution time of the new head task and the current time, and uses that value as the waiting time before the next task.

This way, the timer neither polls every second nor traverses the entire task list, and performance improves.
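A minimal sketch of this idea, with a hypothetical Task type (a production timer would also have to wake up early if a task earlier than the current head is scheduled while waiting, which this sketch ignores):

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class SimpleTimer {
    record Task(long executeAtMillis, Runnable action) {}

    private final PriorityQueue<Task> tasks =
            new PriorityQueue<>(Comparator.comparingLong(Task::executeAtMillis));

    public void schedule(long executeAtMillis, Runnable action) {
        tasks.offer(new Task(executeAtMillis, action));
    }

    // Runs until all tasks are done. Instead of polling every second,
    // we sleep exactly until the head task's execution time (the interval T).
    public void run() throws InterruptedException {
        while (!tasks.isEmpty()) {
            long waitMillis = tasks.peek().executeAtMillis() - System.currentTimeMillis();
            if (waitMillis > 0) {
                Thread.sleep(waitMillis); // do nothing until the head task is due
            }
            tasks.poll().action().run();
        }
    }
}
```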

Ⅲ Heap Application 2: Using a Heap to Find the Top K

Now let's look at another very important application scenario of the heap: the Top K problem.

We can divide Top K problems into two categories. One deals with a static data set, meaning the data set is fixed in advance and never changes. The other deals with a dynamic data set, meaning the data is not all known in advance and new data is added to the set over time.

For static data, how do we find the top K largest elements in an array of n elements? We can maintain a min-heap of size K and traverse the array sequentially, comparing each element with the top of the heap (the first K elements are simply inserted until the heap is full). If an element is larger than the heap top, we delete the heap top and insert the element into the heap; if it is smaller, we do nothing and keep traversing. After the whole array has been traversed, the heap contains the top K largest elements.

Traversing the array takes O(n) time, and one heap operation takes O(logK) time, so in the worst case, where all n elements pass through the heap, the total time complexity is O(nlogK).
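A minimal sketch of the static version:

```java
import java.util.PriorityQueue;

public class TopK {
    // Returns a min-heap containing the K largest elements of the array.
    static PriorityQueue<Integer> topK(int[] data, int k) {
        PriorityQueue<Integer> minHeap = new PriorityQueue<>(k);
        for (int x : data) {
            if (minHeap.size() < k) {
                minHeap.offer(x);        // fill the heap with the first K elements
            } else if (x > minHeap.peek()) {
                minHeap.poll();          // evict the smallest of the current top K
                minHeap.offer(x);
            }
            // otherwise x cannot be among the top K: do nothing
        }
        return minHeap;
    }

    public static void main(String[] args) {
        System.out.println(topK(new int[]{7, 2, 9, 4, 11, 5}, 3)); // contains 7, 9, 11
    }
}
```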

Finding the Top K of dynamic data means maintaining a real-time Top K. For example, suppose a data set supports two operations: adding data, and asking for the current top K largest elements.

If each time the top K is requested we recomputed it from the current data, the time complexity would be O(nlogK), where n is the current size of the data. Instead, we can keep a min-heap of size K alive at all times. Whenever data is added to the set, we compare it with the top of the heap: if it is larger than the heap top, we delete the heap top and insert the new element; if it is smaller, we do nothing.

This way, whenever the current top K is requested, we can return it immediately, as the sketch below shows.
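A minimal sketch of the dynamic version, which simply keeps the min-heap alive between insertions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class StreamingTopK {
    private final int k;
    private final PriorityQueue<Integer> minHeap = new PriorityQueue<>();

    public StreamingTopK(int k) { this.k = k; }

    // Called whenever data is added to the collection: O(logK).
    public void add(int x) {
        if (minHeap.size() < k) {
            minHeap.offer(x);
        } else if (x > minHeap.peek()) {
            minHeap.poll();   // drop the smallest of the current top K
            minHeap.offer(x);
        }
    }

    // The current top K is available immediately, with no recomputation.
    public List<Integer> topK() {
        return new ArrayList<>(minHeap);
    }
}
```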

Ⅳ Heap Application 3: Using a Heap to Find the Median

As everyone knows, the median is the number in the middle. If we arrange n data from small to large, then when n is odd, the element in position n/2 + 1 (counting from 1) is the median; when n is even, there are two elements in the middle, the n/2-th and the (n/2 + 1)-th, and we can pick either one as the median, for example the first of the two, that is, the n/2-th element.

For a static data set, the median is fixed. We can sort the data in advance, and the n/2-th element is the median; every time the median is requested, we simply return this fixed value.

For dynamic data, however, the median changes constantly. If we sorted the data first each time the median is requested, efficiency would be very low.

With the help of the heap data structure, we can implement the median operation very efficiently, without any sorting.

We maintain two heaps: a max-heap and a min-heap. The max-heap stores the first half of the data, the min-heap stores the second half, and every element in the min-heap is larger than every element in the max-heap.

In other words, with n data sorted from small to large: if n is even, the first n/2 elements are stored in the max-heap and the last n/2 in the min-heap, so the top element of the max-heap is exactly the median we want. If n is odd, the situation is similar: the max-heap stores n/2 + 1 elements and the min-heap stores n/2.

Because the data changes dynamically, we also have to adjust the heaps after inserting new data. If the newly added element is less than or equal to the top of the max-heap, we insert it into the max-heap; otherwise, we insert it into the min-heap.

At this point, the sizes of the two heaps may no longer satisfy the agreement above: if n is even, the max-heap should hold the first n/2 elements and the min-heap the last n/2; if n is odd, the max-heap should hold n/2 + 1 elements and the min-heap n/2.

We can then repeatedly move the top element from one heap to the other; through such adjustments, the data in the two heaps satisfies the agreement again.

Therefore, with two heaps, a max-heap and a min-heap, we can implement the median operation on a dynamic data set. Inserting data requires heapifying, so insertion takes O(logn) time, but finding the median only requires returning the top element of the max-heap, which takes O(1) time.
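A minimal sketch of this two-heap structure (the max-heap is allowed one extra element when the count is odd, matching the agreement above):

```java
import java.util.Collections;
import java.util.PriorityQueue;

public class RunningMedian {
    // maxHeap holds the smaller half of the data, minHeap the larger half.
    private final PriorityQueue<Integer> maxHeap =
            new PriorityQueue<>(Collections.reverseOrder());
    private final PriorityQueue<Integer> minHeap = new PriorityQueue<>();

    public void add(int x) {
        // Route the new element into the correct half.
        if (maxHeap.isEmpty() || x <= maxHeap.peek()) {
            maxHeap.offer(x);
        } else {
            minHeap.offer(x);
        }
        // Rebalance: maxHeap must hold either the same number of
        // elements as minHeap, or exactly one more.
        if (maxHeap.size() > minHeap.size() + 1) {
            minHeap.offer(maxHeap.poll());
        } else if (minHeap.size() > maxHeap.size()) {
            maxHeap.offer(minHeap.poll());
        }
    }

    // O(1): the median is always the top of the max-heap.
    public int median() {
        return maxHeap.peek();
    }
}
```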

In fact, two heaps can be used not only to find the median quickly, but also to find other percentiles quickly. For example, consider the problem of "quickly computing the 99% response time of an interface".

Let me first explain what "99% response time" is.

The median, recall, is the value in the middle after the data is arranged from small to large; it is greater than or equal to the first 50% of the data. The concept of the 99th percentile is analogous: if a set of data is arranged from small to large, the 99th percentile is the value that is greater than or equal to the first 99% of the data.

If that is still a bit vague, here is another example. Suppose there are 100 values: 1, 2, 3, ..., 100. The 99th percentile is 99, because the values less than or equal to 99 account for 99% of the data.

With this concept clear, let's look at the 99% response time. Suppose there are 100 interface requests, each with a different response time, say 55 ms, 100 ms, 23 ms, and so on. If we sort these 100 response times from small to large, the 99th value is the 99% response time, also called the 99th-percentile response time.

To summarize: given n data sorted from small to large, the 99th percentile is approximately the (n × 99%)-th value. Similarly, the 80th percentile is approximately the (n × 80%)-th value.

Next, let's look at how to compute the 99% response time.

We maintain two heaps, a max-heap and a min-heap. If the current total number of data is n, the max-heap stores n × 99% of the data and the min-heap stores n × 1%. The element at the top of the max-heap is the 99% response time we are looking for.

Every time we insert a value, we compare it with the top elements of the two heaps to decide which heap it goes into. If the new value is smaller than or equal to the top of the max-heap, we insert it into the max-heap; otherwise, we insert it into the min-heap.

However, to keep 99% of the data in the max-heap and 1% in the min-heap, after each insertion we have to check whether the sizes of the two heaps still satisfy the 99:1 ratio. If not, we move elements from one heap to the other until the ratio holds. The moving is the same as in the median case above, so I won't repeat it.

With this method, each insertion may involve a few heap operations, so insertion takes O(logn) time; each query for the 99% response time simply returns the top element of the max-heap, which takes O(1) time.
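The median sketch above generalizes from a 50:50 split to 99:1; here is a minimal sketch of the adjusted rebalancing rule (response times in milliseconds; the routing step is identical to the median case):

```java
import java.util.Collections;
import java.util.PriorityQueue;

public class Percentile99 {
    // maxHeap holds the smallest ~99% of values, minHeap the largest ~1%.
    private final PriorityQueue<Long> maxHeap =
            new PriorityQueue<>(Collections.reverseOrder());
    private final PriorityQueue<Long> minHeap = new PriorityQueue<>();

    public void add(long responseTimeMillis) {
        if (maxHeap.isEmpty() || responseTimeMillis <= maxHeap.peek()) {
            maxHeap.offer(responseTimeMillis);
        } else {
            minHeap.offer(responseTimeMillis);
        }
        // Rebalance so that maxHeap holds ceil(n * 99%) of the n elements.
        long n = maxHeap.size() + minHeap.size();
        long target = (long) Math.ceil(n * 0.99);
        while (maxHeap.size() > target) {
            minHeap.offer(maxHeap.poll());
        }
        while (maxHeap.size() < target && !minHeap.isEmpty()) {
            maxHeap.offer(minHeap.poll());
        }
    }

    // O(1): the 99th-percentile response time is the top of the max-heap.
    public long percentile99() {
        return maxHeap.peek();
    }
}
```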

The content of this article is based on "The Beauty of Data Structures and Algorithms" by Wang Zheng, on Geek Time.

Originally published at blog.csdn.net/qq_45627684/article/details/108536199