[DSA] Heap: a detailed explanation of heaps (using the max heap as an example)

Heap

[Definition]
"Heap" is a general term for a class of special data structures in computer science. A heap is usually an array object that can be viewed as a complete binary tree.

[Note]

  • The heap mentioned here is a data structure, not the concept of heap in the memory model.
  • The heap here is a logical structure.

[Properties]

  • The value of any node in the heap is never less than the values of its children (max heap), or never greater than them (min heap);
  • The heap is always a complete binary tree.

[Description]

  • A heap whose root holds the largest value is called a max heap (or large-root heap), and a heap whose root holds the smallest value is called a min heap (or small-root heap). Common heaps include binary heaps and Fibonacci heaps.
  • A heap is a non-linear data structure that can be stored in a one-dimensional array; in a binary heap, each node has at most two direct successors.

Binary heap

A binary heap is a complete binary tree (or a nearly complete binary tree). It comes in two types: the max heap and the min heap.

Complete binary tree: if the binary tree has depth h, then every level except the h-th (levels 1 to h-1) holds the maximum possible number of nodes, and the nodes on the h-th level are packed contiguously from the left.
[Figure: examples of complete binary trees]
Max heap: the key of a parent node is always greater than or equal to the key of each of its children.
Min heap: the key of a parent node is always less than or equal to the key of each of its children.
[Figure: schematic of a max heap and a min heap]

Binary heap implementation

Binary heaps are generally implemented with arrays. In an array-based binary heap, the positions of a parent node and its children are related by simple index arithmetic. Some implementations put the first element of the binary heap at array index 0, others at index 1. The two are essentially the same (both are binary heaps); they differ only slightly in implementation.

[Note] All binary heap implementations in this article place the first element of the binary heap at array index 0!

The max heap in the figure above can be stored in two ways:

  1. The first element is placed at index 0.
    In this case, the array indices of the nodes are related as follows:
  • The left child of the node at index i is at index (2 * i + 1)
  • The right child of the node at index i is at index (2 * i + 2)
  • The parent of the node at index i is at index ((i - 1) / 2), using integer division
    Intuitively:
    the node at index 0 is the root a[0] and has no parent; its left child is a[1] and its right child is a[2]
    the node at index 1 has parent a[0], left child a[3], and right child a[4]
    the node at index 2 has parent a[0], left child a[5], and right child a[6]
  2. The first element is placed at index 1.
  • The left child of the node at index i is at index (2 * i)
  • The right child of the node at index i is at index (2 * i + 1)
  • The parent of the node at index i is at index (i / 2), using integer division,
    which will not be elaborated here.
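
The index relationships for the 0-based layout can be written down directly. The following is a minimal sketch; the helper names left_child_index, right_child_index, parent_index and is_max_heap are made up for this illustration and are not part of the implementation later in the article.

```c
#include <stdbool.h>
#include <stddef.h>

/* Index arithmetic for a binary heap whose root is stored at index 0.
 * All names here are hypothetical helpers for illustration. */
size_t left_child_index(size_t i)
{
    return 2 * i + 1;
}

size_t right_child_index(size_t i)
{
    return 2 * i + 2;
}

/* Only meaningful for i > 0: the root has no parent. */
size_t parent_index(size_t i)
{
    return (i - 1) / 2;
}

/* Returns true if a[0..n-1] satisfies the max-heap property,
 * i.e. no node is greater than its parent. */
bool is_max_heap(const int *a, size_t n)
{
    for (size_t i = 1; i < n; ++i)
    {
        if (a[parent_index(i)] < a[i])
        {
            return false;
        }
    }
    return true;
}
```

A check like is_max_heap is handy in tests: run it after every insert or delete to confirm that the array still represents a valid max heap.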

Binary heap operation

The core operations on a binary heap are [add node] and [delete node]. The examples below all use the max heap.

Add node diagram

Insert node 85
Step 1:
Append the new node to the end of the array.
Step 2:
Compare the new node with its parent. Here 85 > 40, so swap it with the parent.
Step 3:
Repeat the comparison with the new parent.
As soon as the node meets a parent that is not smaller (here 85 is less than 100), it stops moving up.

Delete node diagram

Take deleting the root node as an example.
Step 1: Clear the data in the root node.
Step 2: Move the last node to the root.
Step 3: Compare the node with its two children; if the larger child is greater than the node, swap the node with that child.
Step 4: Repeat step 3 until the node is no smaller than both children or has become a leaf.

Note: Deleting a node other than the root needs extra care. After the deletion, the tree must still be a max heap and a complete binary tree; the element moved into the gap may have to move up instead of down if it is larger than its new parent.

Implementation code

#include <stdio.h>
#include <stdlib.h>

#define ARRAY_LEN(arr) ((sizeof(arr))/sizeof(arr[0]))

#define MAX_NUM (128)

typedef int Type;

static Type heap_arr[MAX_NUM];
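static int  heap_size = 0; // number of elements currently stored in the heap

/**
 * Find the index of the value data in the heap.
 * @param  data value to search for
 * @return      index of the first match, or -1 if the value is not in the heap
 */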
int get_data_index_from_heap(int data)
{
    for (int i = 0; i < heap_size; ++i)
    {
        if (data == heap_arr[i])
        {
            return i;
        }
    }

    return -1;
}

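/**
 * Sift an element down in the array-based heap until the max-heap
 * property holds again.
 * Note:
 *     In an array-based heap rooted at index 0, for the node at index i
 *     the left child is at index 2*i+1,
 *     the right child is at index 2*i+2,
 *     the parent is at index (i-1)/2.
 *
 * @param  start index to start from (usually the position of the deleted element)
 * @param  end   last valid index of the heap array
 */
static void max_heap_fixup_down(int start, int end)
{
    int curr_node_pos = start;
    int left_child = 2*start+1;
    int curr_node_data = heap_arr[curr_node_pos];

    while(left_child <= end)
    {
        // left_child is the left child; left_child+1 is the right child of the same parent
        if (left_child < end && heap_arr[left_child] < heap_arr[left_child+1])
        {
            // Pick the larger of the two children
            left_child++;
        }
        if (curr_node_data >= heap_arr[left_child])
        {
            // The value being sifted is no smaller than the larger child: done
            break;
        }
        else
        {
            // Move the larger child up and continue from its old position
            heap_arr[curr_node_pos] = heap_arr[left_child];
            curr_node_pos = left_child;
            left_child = 2*left_child+1;
        }
    }

    // Store the sifted value in its final position
    heap_arr[curr_node_pos] = curr_node_data;
}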

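static void max_heap_fixup_up(int start); // defined further down

/**
 * Delete the value data from the heap.
 * @param  data value to delete
 * @return      0 on success, -1 on failure
 */
static int max_heap_delete(int data)
{
    if (heap_size == 0)
    {
        printf("The heap is empty!\n");
        return -1;
    }
    int index = get_data_index_from_heap(data);
    if (index < 0)
    {
        printf("Delete failed: value [%d] does not exist!\n", data);
        return -1;
    }

    // Overwrite the deleted element with the last element and shrink the heap
    heap_arr[index] = heap_arr[--heap_size];

    // Restore the heap, unless the deleted element was the last one
    if (index < heap_size)
    {
        if (index > 0 && heap_arr[index] > heap_arr[(index-1)/2])
        {
            // The replacement is larger than its new parent: move it up
            max_heap_fixup_up(index);
        }
        else
        {
            max_heap_fixup_down(index, heap_size-1);
        }
    }

    return 0;
}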

/**
 * Sift an element up in the array-based heap.
 * Note:
 *     In an array-based heap rooted at index 0, for the node at index i
 *     the left child is at index 2*i+1,
 *     the right child is at index 2*i+2,
 *     the parent is at index (i-1)/2.
 *
 * @param  start index of the element to sift up (the last element after an insert)
 */
static void max_heap_fixup_up(int start)
{
    int curr_node_pos = start;
    int parent = (start-1)/2;
    int curr_node_data = heap_arr[curr_node_pos];

    // Compare upward from the starting element until index 0 is reached
    while(curr_node_pos > 0)
    {
        // Stop once the current value is no greater than its parent
        if (curr_node_data <= heap_arr[parent])
        {
            break;
        }
        else
        {
            // Swap the parent and the current node
            heap_arr[curr_node_pos] = heap_arr[parent];
            heap_arr[parent] = curr_node_data;

            curr_node_pos = parent;
            parent = (parent-1)/2;
        }
    }
}

/**
 * Insert a new value into the binary heap.
 * @param  data value to insert
 * @return      0 on success, -1 on failure
 */
int max_heap_insert(Type data)
{
    if (heap_size == MAX_NUM)
    {
        printf("The heap is full!\n");
        return -1;
    }

    heap_arr[heap_size] = data;
    // Restore the heap property
    max_heap_fixup_up(heap_size);
    heap_size++; // one more element in the heap

    return 0;
}

/**
 * Print the binary heap.
 */
void max_heap_print()
{
    for (int i = 0; i < heap_size; ++i)
    {
        printf("%d ", heap_arr[i]);
    }
}

int main(int argc, char const *argv[])
{
    Type tmp[] = {10, 40, 30, 60, 90, 70, 20, 50, 80};
    int len = ARRAY_LEN(tmp);

    printf("---> Adding elements:\n");
    for (int i = 0; i < len; ++i)
    {
        printf("%d ", tmp[i]);
        max_heap_insert(tmp[i]);
    }

    printf("\n---> Max heap: ");
    max_heap_print();

    max_heap_insert(85);
    printf("\n---> Max heap after inserting an element: ");
    max_heap_print();


    max_heap_delete(90);
    printf("\n---> Max heap after deleting an element: ");
    max_heap_print();
    printf("\n");

    return 0;
}

Heap application scenarios

Heap sort

Heap sort has two phases: building the heap and sorting. The heap can be built in place on the original array, and then the top element is removed repeatedly to produce the sorted order. Building the heap by inserting the elements one at a time takes O(n log n) in the worst case; building it bottom-up, by sifting down every non-leaf node starting from the last one, takes O(n). The sorting phase takes O(n log n). Heap sort is not a stable sorting algorithm, because the sorting phase swaps the last element of the heap with the top element, which may change the original relative order of equal elements.
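
A minimal sketch of in-place heap sort follows. It builds the max heap bottom-up (the O(n) construction) and then repeatedly swaps the top of the heap with the last element of the shrinking heap. The names sift_down and heap_sort are chosen for this example and do not come from the implementation above.

```c
#include <stddef.h>

/* Restore the max-heap property for the subtree rooted at start,
 * looking only at a[0..n-1]. */
static void sift_down(int *a, size_t start, size_t n)
{
    size_t pos = start;
    int value = a[pos];

    while (2 * pos + 1 < n)
    {
        size_t child = 2 * pos + 1;
        /* Pick the larger of the two children. */
        if (child + 1 < n && a[child] < a[child + 1])
        {
            child++;
        }
        if (value >= a[child])
        {
            break;
        }
        a[pos] = a[child];
        pos = child;
    }
    a[pos] = value;
}

/* Sort a[0..n-1] in ascending order, in place. */
void heap_sort(int *a, size_t n)
{
    if (n < 2)
    {
        return;
    }

    /* Phase 1: build the heap bottom-up, starting at the last non-leaf node. */
    for (size_t i = n / 2; i-- > 0; )
    {
        sift_down(a, i, n);
    }

    /* Phase 2: move the current maximum to the end, shrink the heap, repair it. */
    for (size_t end = n - 1; end > 0; --end)
    {
        int tmp = a[0];
        a[0] = a[end];
        a[end] = tmp;
        sift_down(a, 0, end);
    }
}
```

The swap in phase 2 is the step that makes heap sort unstable: an element from the end of the heap jumps over every other element on its way to the root.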

Heaps are commonly used to implement priority queues.

In an ordinary queue, the operating system scheduler repeatedly takes the first job in the queue and runs it. In practice, however, some short jobs would then wait a long time to finish, and some jobs that are not short but are important should also get priority. The heap is a data structure designed to solve such problems.
-Merging sorted small files

Suppose there are 100 small files of 100 MB each, each storing sorted strings, and they must be merged into one large sorted file. How can this be done?

The intuitive approach is to read the first line of each small file into an array, compare them, and write the smallest to the large file. If the smallest line came from file a, remove it from the array after writing it, read the next line of file a into the array, compare again, write the smallest line as the second line of the large file, and so on. The whole process closely resembles the merge function of merge sort. Scanning the entire array every time a line is written to the large file is clearly inefficient.

A priority queue built on a heap is much more efficient. For example, take the first line of each of the 100 files and build a min heap. If the top element of the heap came from file a, write it to the large file and delete it from the heap (this is the heap's remove-top operation; removeMin for a min heap), then read the next line of file a and insert it into the heap. Repeating this process merges the sorted small files.

Deleting the top element of the heap and inserting an element into the heap both take O(log n) time, where n is the number of elements in the heap, which is 100 here.
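
The merge can be sketched as follows. To keep the example self-contained, it assumes sorted in-memory int arrays in place of the files and their lines; MergeEntry, merge_sift_down, merge_sorted and MAX_SOURCES are hypothetical names introduced only for this illustration.

```c
#include <stddef.h>

#define MAX_SOURCES 128  /* assumed upper bound on the number of inputs */

typedef struct
{
    int value;      /* current head value of one input */
    size_t source;  /* index of the input it came from */
} MergeEntry;

/* Min-heap sift-down on h[0..n-1], ordered by value. */
static void merge_sift_down(MergeEntry *h, size_t pos, size_t n)
{
    MergeEntry item = h[pos];

    while (2 * pos + 1 < n)
    {
        size_t child = 2 * pos + 1;
        if (child + 1 < n && h[child + 1].value < h[child].value)
        {
            child++;
        }
        if (item.value <= h[child].value)
        {
            break;
        }
        h[pos] = h[child];
        pos = child;
    }
    h[pos] = item;
}

/* Merge k sorted arrays into out (which must have room for all elements).
 * Returns the number of values written, or 0 if k exceeds MAX_SOURCES. */
size_t merge_sorted(const int **inputs, const size_t *lengths, size_t k, int *out)
{
    MergeEntry heap[MAX_SOURCES];
    size_t next[MAX_SOURCES];  /* next unread position in each input */
    size_t heap_n = 0;
    size_t written = 0;

    if (k > MAX_SOURCES)
    {
        return 0;
    }

    /* Seed the heap with the first element of every non-empty input. */
    for (size_t s = 0; s < k; ++s)
    {
        next[s] = 0;
        if (lengths[s] > 0)
        {
            heap[heap_n].value = inputs[s][0];
            heap[heap_n].source = s;
            heap_n++;
            next[s] = 1;
        }
    }
    for (size_t i = heap_n / 2; i-- > 0; )
    {
        merge_sift_down(heap, i, heap_n);
    }

    while (heap_n > 0)
    {
        size_t s = heap[0].source;
        out[written++] = heap[0].value;

        if (next[s] < lengths[s])
        {
            /* Replace the top with the next value from the same input. */
            heap[0].value = inputs[s][next[s]++];
        }
        else
        {
            /* That input is exhausted: shrink the heap. */
            heap[0] = heap[--heap_n];
        }
        if (heap_n > 0)
        {
            merge_sift_down(heap, 0, heap_n);
        }
    }
    return written;
}
```

Replacing the top and sifting down once does the work of a delete followed by an insert, so each output value costs a single O(log k) heap operation.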

-High performance timer

Suppose there are many scheduled tasks. How can a high-performance timer be designed to run them? One option is to scan the task list after every small unit of time (for example, 1 second) to see whether any task has reached its scheduled execution time, and to take it out and run it if so. This is clearly wasteful, because the interval between tasks may be as long as several hours.

With a heap-based priority queue, we can design it this way: build a min heap ordered by execution time, look at the task at the top, and compute the difference between its execution time and the current time. If the difference is T seconds, the timer has nothing to do until those T seconds have elapsed. At that point the task is removed from the top of the heap and executed, and the execution time of the new top element is examined in turn.

In this way, the timer neither polls every second nor traverses the entire task list, so performance improves.
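
A rough sketch of such a timer queue is shown below. It only models the bookkeeping: times are plain second counts supplied by the caller, and TimerTask, timer_add, timer_wait_seconds, timer_pop_due and MAX_TASKS are hypothetical names invented for this example. A real timer would sleep for the returned interval and then run the task.

```c
#include <stddef.h>

#define MAX_TASKS 128  /* assumed capacity for this sketch */

typedef struct
{
    long due;  /* absolute time, in seconds, at which the task should run */
    int id;    /* hypothetical task identifier */
} TimerTask;

static TimerTask tasks[MAX_TASKS];  /* min heap ordered by due time */
static size_t task_count = 0;

/* Add a task. Returns 0 on success, -1 if the queue is full. */
int timer_add(long due, int id)
{
    if (task_count == MAX_TASKS)
    {
        return -1;
    }

    size_t pos = task_count++;
    /* Sift up: move parents with a later due time down one level. */
    while (pos > 0 && tasks[(pos - 1) / 2].due > due)
    {
        tasks[pos] = tasks[(pos - 1) / 2];
        pos = (pos - 1) / 2;
    }
    tasks[pos].due = due;
    tasks[pos].id = id;
    return 0;
}

/* Seconds the timer may stay idle before the earliest task is due:
 * 0 if it is already due, -1 if there are no tasks. */
long timer_wait_seconds(long now)
{
    if (task_count == 0)
    {
        return -1;
    }
    return tasks[0].due > now ? tasks[0].due - now : 0;
}

/* If the earliest task is due at time now, remove it from the heap,
 * store its id in *id and return 1; otherwise return 0. */
int timer_pop_due(long now, int *id)
{
    if (task_count == 0 || tasks[0].due > now)
    {
        return 0;
    }
    *id = tasks[0].id;

    /* Move the last task to the root and sift it down. */
    TimerTask last = tasks[--task_count];
    size_t pos = 0;
    while (2 * pos + 1 < task_count)
    {
        size_t child = 2 * pos + 1;
        if (child + 1 < task_count && tasks[child + 1].due < tasks[child].due)
        {
            child++;
        }
        if (last.due <= tasks[child].due)
        {
            break;
        }
        tasks[pos] = tasks[child];
        pos = child;
    }
    tasks[pos] = last;
    return 1;
}
```

Only the top of the heap is ever inspected, so finding out how long to wait costs O(1), and adding or removing a task costs O(log n).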

-Top K problem

Finding the top K elements falls into two cases. One is a static data set, where no new elements are added once the data is fixed. The other is a dynamic data set, where elements may be added at any time and the K largest elements are still wanted.

For static data, first insert the first K elements into a min heap, so that the heap has size K, then traverse the remaining data. If an element is not greater than the top of the heap, skip it and continue with the next one. If it is greater than the top of the heap, delete the top and insert the element into the heap. At the end of the traversal, the heap holds the K largest elements, and its top element is the K-th largest.

Traversing the array takes O(n) time, and one heap operation takes O(log K) time. In the worst case, all n elements enter the heap once, so the total time complexity is O(n log K).

Dynamic data can be handled in the same way, which amounts to finding the top K in real time. Recomputing the top K from scratch on every query would cost O(n log K) each time, where n is the current size of the data. Instead, we can maintain a min heap of size K at all times. When an element is added to the collection, compare it with the top of the heap: if it is larger, delete the top of the heap and insert the new element; if it is not larger, do nothing. Then, whenever the current top K elements are requested, they can be returned immediately.
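
Both cases can share one routine that offers a single value to a min heap of capacity K. The sketch below is illustrative only; topk_sift_down and topk_offer are names invented for this example, and the caller is assumed to provide an array with room for K values.

```c
#include <stddef.h>

/* Min-heap sift-down on h[0..n-1]. */
static void topk_sift_down(int *h, size_t pos, size_t n)
{
    int value = h[pos];

    while (2 * pos + 1 < n)
    {
        size_t child = 2 * pos + 1;
        if (child + 1 < n && h[child + 1] < h[child])
        {
            child++;
        }
        if (value <= h[child])
        {
            break;
        }
        h[pos] = h[child];
        pos = child;
    }
    h[pos] = value;
}

/* Offer one value to a min heap h that keeps at most k values.
 * *count is the number of values currently held. Once the heap is full,
 * h[0] is the k-th largest value seen so far. */
void topk_offer(int *h, size_t *count, size_t k, int value)
{
    if (k == 0)
    {
        return;
    }

    if (*count < k)
    {
        /* Heap not full yet: append the value and sift it up. */
        size_t pos = (*count)++;
        while (pos > 0 && h[(pos - 1) / 2] > value)
        {
            h[pos] = h[(pos - 1) / 2];
            pos = (pos - 1) / 2;
        }
        h[pos] = value;
    }
    else if (value > h[0])
    {
        /* Larger than the current k-th largest: replace the top and repair. */
        h[0] = value;
        topk_sift_down(h, 0, k);
    }
    /* Otherwise the value cannot be among the top k and is ignored. */
}
```

For a static data set, call topk_offer once per element in a loop; for a dynamic one, call it whenever a new element arrives. In both cases the heap array holds the current top K at any moment.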

Origin blog.csdn.net/jobbofhe/article/details/102555102