[Data Structure] Detailed Explanation of Heap

This chapter builds on concepts such as trees. If you are not yet familiar with them, please read this article first: Getting to Know Binary Trees for the First Time


1. The sequential structure and implementation of binary tree

1. The sequential structure of the binary tree

Ordinary binary trees are not well suited to array storage, because a lot of space may be wasted. A complete binary tree is a much better fit for sequential storage. In practice, the heap (a kind of binary tree) is usually stored in an array. Note that this heap and the heap in an operating system's virtual process address space are two different things: one is a data structure, the other is a region of memory managed by the operating system.


2. The concept and structure of the heap

Strict definition: given a key set $K = \{k_0, k_1, k_2, \dots, k_{n-1}\}$, store all of its elements in a one-dimensional array in the order of a complete binary tree. If $k_i \le k_{2i+1}$ and $k_i \le k_{2i+2}$ (or $k_i \ge k_{2i+1}$ and $k_i \ge k_{2i+2}$) for $i = 0, 1, 2, \dots$, it is called a small heap (or a large heap). The heap whose root node is the largest element is called the max heap or large root heap, and the heap whose root node is the smallest element is called the min heap or small root heap.

Non-strict definition: there are two types of heap, the large heap and the small heap, and both are complete binary trees!
Large heap: every parent in the tree is greater than or equal to its children
Small heap: every parent in the tree is less than or equal to its children
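
As a rough illustration of the non-strict definition, the sketch below checks whether an array, read as a complete binary tree, is a large heap. The name IsLargeHeap is made up for this example and is not part of the heap interface built in this article; it relies on the index relation parent = (child-1)/2 for a complete binary tree stored in an array.

```c
#include <assert.h>
#include <stdbool.h>

// Returns true if a[0..n-1], read as a complete binary tree,
// is a large heap: every parent is >= each of its children.
// IsLargeHeap is a hypothetical helper used only for illustration.
bool IsLargeHeap(const int* a, int n)
{
	assert(a || n == 0);
	for (int child = 1; child < n; ++child)
	{
		int parent = (child - 1) / 2;	// parent index of this child
		if (a[parent] < a[child])
		{
			return false;	// a child larger than its parent breaks the rule
		}
	}
	return true;
}
```

Checking the small heap works the same way with the comparison reversed.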


As noted in the previous article, the index of a left child is always odd and the index of a right child is always even, and the indices of a child and its parent are related as follows (the relation can also be found by looking for the pattern):

leftchild = parent*2+1
rightchild = parent*2+2
parent = (child-1)/2	// integer division truncates, so the formula derived from the left child also works for the right child (even index)
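
The three relations can be tried out with a few small helper functions. This is only a sketch; the names LeftChild, RightChild and Parent are assumptions made for this example and are not used elsewhere in the article.

```c
#include <assert.h>

// Index helpers for a complete binary tree stored in an array (root at index 0).
// All three names are hypothetical, chosen for this illustration.
int LeftChild(int parent)  { return parent * 2 + 1; }
int RightChild(int parent) { return parent * 2 + 2; }

// Integer division truncates, so both children map back to the same parent:
// (5 - 1) / 2 == 2 and (6 - 1) / 2 == 2.
int Parent(int child)
{
	assert(child > 0);	// the root (index 0) has no parent
	return (child - 1) / 2;
}
```

For example, the node at index 2 has its children at indices 5 and 6, and both of those indices lead back to 2.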

2. A simple implementation of the heap (taking the large heap as an example)

1. Definition of heap

Since the heap is well suited to array storage, we can define it like a sequence table (a dynamic array). Keep in mind, though, that the array is only the physical structure; logically we treat it as a complete binary tree.

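// Heap structure definition
typedef int HPDateType;
typedef struct Heap
{
	HPDateType* a;	// points to the stored data
	int size;		// number of elements currently stored
	int capacity;	// maximum number of elements the current allocation can hold
}HP;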

2. Heap initialization

When initializing the heap, we can either allocate space for the pointer a or leave it unallocated. Here we choose not to allocate.

// Heap initialization
void HeapInit(HP* php)
{
	assert(php);
	php->a = NULL;
	php->size = php->capacity = 0;
}

3. Destruction of the heap

To destroy the heap, we release the space, set the pointer to NULL, and set size and capacity to 0.

// Heap destruction
void HeapDestroy(HP*php)
{
	assert(php);
	free(php->a);	// release the space
	php->a = NULL;	// set the pointer to NULL
	php->size = php->capacity = 0;	// set size and capacity to 0
}

4. Heap printing

Since we are using an array to implement the heap, we can simply traverse the array and print them out.

// Heap printing
void HeapPrint(HP* php)
{
	assert(php);
	int i = 0;
	for (i = 0; i < php->size; ++i)
	{
		printf("%d ", php->a[i]);
	}
}

5. Heap insertion

The heap insertion is a bit more complicated. After we insert data, we must ensure that the heap after inserting the data is still a large heap, otherwise the structure of the heap will be destroyed.

Consider two cases.

Case 1: the inserted value is not larger than its parent. The heap property still holds, so nothing needs to be exchanged.

Case 2: the inserted value is larger than its parent. It must be exchanged with its parent, and possibly with further ancestors, until the heap property holds again.

To summarize these two cases:
① Before we insert data (a child node), the original heap must already satisfy the heap property. If it does not, the problem lies in earlier code and has nothing to do with the data being inserted!

② The inserted child node is compared with its parent node. If it is not larger than the parent, no exchange is needed and the insertion is complete. If it is larger than the parent, it must be adjusted upward by exchanging. The number of exchanges is not fixed: it may be one or two, but it is at most the height of the tree, $h = \log_2 N$ ($N$ is the number of nodes), so the time complexity of heap insertion is $O(\log_2 n)$.

Next we write the code for these two cases. The index relationship between child and parent is very important here: it lets us find the parent node from a child node, and the child nodes from a parent node.

// Heap insertion
void HeapPush(HP* php, HPDateType data)
{
	assert(php);
	// Check whether the array needs to grow
	if (php->capacity == php->size)
	{
		int new_capacity = php->capacity == 0 ? 4 : php->capacity * 2;
		HPDateType* tmp = (HPDateType*)realloc(php->a, sizeof(HPDateType)*new_capacity);
		// When passed a NULL pointer, realloc behaves like malloc
		if (NULL == tmp)
		{
			perror("realloc fail:");
			exit(-1);
		}
		php->a = tmp;
		php->capacity = new_capacity;
	}
	// Insert the data at the end
	php->a[php->size] = data;
	php->size++;
	// Adjust upward
	AdjustUp(php->a,php->size-1);
}

The upward adjustment algorithm is very important! (In a single source file, Swap and AdjustUp must be defined or declared before HeapPush.)

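// Swap function
void Swap(HPDateType* x, HPDateType* y)
{
	HPDateType tmp = *x;
	*x = *y;
	*y = tmp;
}
// Upward adjustment
void AdjustUp(HPDateType*a,int child)
{
	assert(a);
	int parent = (child - 1) / 2;	// find the parent of the newly inserted node
	while (child>0)		// child == 0 means the node has reached the top of the heap; nothing left to adjust
	{
		if (a[child] > a[parent])	// swap if the child is larger than the parent
		{
			Swap(&a[child], &a[parent]);
			child = parent;		// move the child index up, ready to compare with the next parent
			parent = (child - 1) / 2;	// recompute the parent index for the new child position
		}
		else
		{
			break;	// the child is not larger than the parent, so the data already forms a large heap
		}
	}
}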
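
To make the two insertion cases concrete, here is a small sketch of the same upward adjustment that also reports how many swaps it made. CountingAdjustUp is a hypothetical name used only for this illustration, and it works on plain int arrays rather than on the HP structure.

```c
#include <assert.h>

// Upward adjustment for a large heap that returns the number of swaps.
// CountingAdjustUp is a hypothetical variant written for illustration only.
int CountingAdjustUp(int* a, int child)
{
	assert(a);
	int swaps = 0;
	while (child > 0)
	{
		int parent = (child - 1) / 2;
		if (a[child] <= a[parent])
		{
			break;	// heap property already holds, stop early
		}
		int tmp = a[child];
		a[child] = a[parent];
		a[parent] = tmp;
		child = parent;	// continue from the parent's position
		++swaps;
	}
	return swaps;
}
```

Take the 10-element large heap 65,49,34,25,37,27,19,18,15,28 and place a new value at index 10. With 30, the parent at index 4 (37) is already larger, so no swap happens. With 70, the value climbs through indices 4, 1 and 0, which is three swaps, the full height of the path to the root.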

6. Obtaining the top element of the heap

In a heap, the element at the top is always the largest (or smallest) value in the heap, so a function that returns the top element is needed. Its implementation is not complicated.

// Get the top element of the heap
HPDateType HeapTop(HP*php)
{
	assert(php);
	assert(php->size > 0);
	return php->a[0];
}

7. Heap deletion

When deleting from a heap, it is generally the top element that is deleted, because deleting any other element is of little use. However, deleting the top element destroys the structure of the heap.
If we delete the top element directly and shift all the other elements forward before adjusting, a lot of work is wasted, because shifting an array forward costs $O(n)$. Deleting from the tail of an array, on the other hand, is very efficient, $O(1)$, so we should delete from the tail whenever possible.

This leads to a good algorithm: swap first, then adjust downward. First swap the top element with the last element, so that the original top element is now at the end of the array. Then delete that last element, and finally adjust the new top element downward. The core of the algorithm is the downward adjustment.

The downward adjustment algorithm
Take the larger child of the top element and compare it with the top element. If the child is larger than the parent, swap them. The parent index then becomes the index of that child, and the steps repeat: take the larger child of the new parent, compare, and swap, until the child is no larger than the parent or the child index runs past the end of the array.

// Heap deletion
void HeapPop(HP* php)
{
	assert(php);
	assert(php->size > 0);
	// Swap the top element with the last element
	Swap(&php->a[0], &php->a[php->size - 1]);
	php->size--;
	// Adjust downward
	AdjustDown(php->a,php->size,0);
}

The downward adjustment algorithm:

// Downward adjustment
void AdjustDown(HPDateType* a, int n, int parent)
{
	// Assume the left child is the larger one
	int child = parent * 2 + 1;
	while (child<n)
	{
		// Check the assumption; if a right child exists and is larger, use it instead
		if (child + 1 < n && a[child + 1] > a[child])
		{
			++child;
		}
		if (a[child] > a[parent])
		{
			Swap(&a[child],&a[parent]);
			parent = child;
			child = parent * 2 + 1;
		}
		else
		{
			break;
		}
	}
}

The downward adjustment algorithm is similar to the upward one. The number of swaps is not fixed: it may be one or two, but it is at most the height of the tree, $h = \log_2 N$ ($N$ is the number of nodes), so the time complexity of heap deletion is also $O(\log_2 n)$.

Comparing the upward and downward adjustment algorithms

  • The upward adjustment algorithm requires that the data before the inserted element already forms a heap; only then can it restore the heap.
  • The downward adjustment algorithm requires that both the left and right subtrees are already heaps; only then can it restore the heap.

8. Getting the number of heap elements

The number of elements in the heap is simply the size field of the heap structure.

// Get the number of elements in the heap
int HeapSize(HP* php)
{
	assert(php);
	return php->size;
}

9. Checking whether the heap is empty

// Check whether the heap is empty (bool requires <stdbool.h>)
bool HeapEmpty(HP* php)
{
	assert(php);
	return (php->size == 0);
}

10. Simple applications of the heap

With what we have learned so far, we can already use the heap to solve some problems.

  • TOP-K problem
    The element at the top of the heap is the largest (smallest) element. By taking the top element and then deleting it, K times in a row, we get the K largest (smallest) elements of a data set.
int main()
{
	HP hp;
	HeapInit(&hp);	// initialize the heap
	int arr[] = { 27,15,19,18,28,34,65,49,25,37 };
	for (int i = 0; i < sizeof(arr) / sizeof(int); ++i)	// build the heap by insertion
	{
		HeapPush(&hp, arr[i]);
	}
	HeapPrint(&hp);	// print the elements in the heap
	printf("\n");
	// Select the five largest elements
	int k = 5;
	for (int i = 0; i < k; i++)
	{
		printf("%d ", HeapTop(&hp));	// take the top element
		HeapPop(&hp);		// delete the top element so the next largest moves to the top
	}
	
	HeapDestroy(&hp);	// destroy the heap to avoid a memory leak
	return 0;
}


  • Sorting problem
    The sorting problem is similar to the TOP-K problem, except that K is the total number of elements.
int main()
{
	HP hp;
	HeapInit(&hp);
	int arr[] = { 27,15,19,18,28,34,65,49,25,37 };
	for (int i = 0; i < sizeof(arr) / sizeof(int); ++i)
	{
		HeapPush(&hp, arr[i]);
	}
	HeapPrint(&hp);
	printf("\n");
	// Sort
	while(!HeapEmpty(&hp))	// keep taking the top element until the heap is empty
	{
		printf("%d ", HeapTop(&hp));
		HeapPop(&hp);
	}

	HeapDestroy(&hp);
	return 0;
}


3. Heap creation

In the code above we repeatedly built a heap from an array, so it is worth writing a dedicated function for it: heap creation! So far we have built the heap by insertion, that is, by appending each value to the end of the array and adjusting it upward. This works, but it is not very efficient, and we can improve on it.

1. Building the heap by upward adjustment

  • Heap creation using heap insertion
void HeapCreat(HP* php, HPDateType* arr, int n)
{
	assert(php);
	HPDateType* tmp = (HPDateType*)malloc(sizeof(HPDateType) * n);
	if (NULL == tmp)
	{
		perror("malloc fail:");
		exit(-1);
	}
	php->a = tmp;
	php->size = 0;		// start empty; HeapPush increments size for each element
	php->capacity = n;
	for (int i = 0; i < n; ++i)
	{
		HeapPush(php, arr[i]);	// HeapPush uses AdjustUp()
	}
}

However, this heap-building algorithm is relatively inefficient. Let's work out its time complexity.
We calculate the worst case (for a complete binary tree, the worst case is a full binary tree in which every node must be adjusted as far as possible):

When building the heap by upward adjustment, the first layer needs no adjustment. Let $F$ be the total number of swaps, $h$ the height of the tree, and $N$ the number of nodes in the tree. Layer $k$ has $2^{k-1}$ nodes and each of them moves up at most $k-1$ layers, so
$$F = 2^1 \cdot 1 + 2^2 \cdot 2 + 2^3 \cdot 3 + \dots + 2^{h-2}(h-2) + 2^{h-1}(h-1)$$
Using the dislocation subtraction method (computing $2F - F$):
$$F = (h-2) \cdot 2^h + 2 \tag{1}$$
And because a full binary tree satisfies:
$$N = 2^0 + 2^1 + 2^2 + 2^3 + \dots + 2^{h-2} + 2^{h-1} = 2^h - 1 \tag{2}$$
Substituting (2) into (1) gives:
$$F = (N+1)(\log_2(N+1) - 2) + 2 \tag{3}$$
Therefore, the time complexity of building the heap by upward adjustment is $O(n\log_2 n)$.

2. Building the heap by downward adjustment

The downward adjustment algorithm has a precondition: the left and right subtrees must both be heaps. But the array we are given is unordered, so how can we guarantee that?
The answer: start from the subtree rooted at the last non-leaf node and adjust downward, then work backward one index at a time until the root has been adjusted. The whole array is then a heap.

For the array 27,15,19,18,28,34,65,49,25,37 used above, the last non-leaf node is 28. We find its index from the parent-child index relation (it is the parent of the last element) and adjust it downward, so that the subtree rooted at 28 becomes a heap. We then decrease the index by one to reach 18 and adjust downward so that its subtree becomes a heap, decrease the index again to reach 19, and so on. Once index 0 has been adjusted, the heap is built.
Code:

// Heap creation
void HeapCreat(HP* php, HPDateType* arr, int n)
{
	assert(php);
	HPDateType* tmp = (HPDateType*)malloc(sizeof(HPDateType) * n);
	if (NULL == tmp)
	{
		perror("malloc fail:");
		exit(-1);
	}
	php->a = tmp;
	php->size=php->capacity = n;
	memcpy(php->a, arr, sizeof(HPDateType)*n);	// copy the data (memcpy requires <string.h>)
	for (int i = (n - 1 - 1) / 2; i >= 0; i--)
	{
		AdjustDown(php->a, n, i);	// use the downward adjustment algorithm
	}

}

This heap-building algorithm is relatively efficient. Let's work out its time complexity.
We calculate the worst case (for a complete binary tree, the worst case is a full binary tree in which every node must be adjusted as far as possible):

Since building the heap by downward adjustment starts from the second-to-last layer, let $h$ be the height of the tree, $F$ the total number of swaps, and $N$ the number of nodes in the tree. Layer $k$ has $2^{k-1}$ nodes and each of them moves down at most $h-k$ layers, so
$$F = 2^{h-2} \cdot 1 + 2^{h-3} \cdot 2 + 2^{h-4} \cdot 3 + \dots + 2^1(h-2) + 2^0(h-1) \tag{1}$$
Using the dislocation subtraction method:
$$F = 2^h - (h+1) \tag{2}$$
And because a full binary tree satisfies:
$$N = 2^0 + 2^1 + 2^2 + 2^3 + \dots + 2^{h-2} + 2^{h-1} = 2^h - 1 \tag{3}$$
Substituting (3) into (2) gives:
$$F = N - \log_2(N+1) \tag{4}$$
Therefore, the time complexity of building the heap by downward adjustment is $O(n)$.
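
For readers who have not met the dislocation subtraction method, the step from the sum (1) to the closed form $F = 2^h - (h+1)$ can be written out as follows. This is a worked sketch of the standard technique: multiply the sum by 2, shift it by one term, and subtract the original sum.

```latex
\begin{align*}
F      &= 2^{h-2}\cdot 1 + 2^{h-3}\cdot 2 + \dots + 2^{1}(h-2) + 2^{0}(h-1) \\
2F     &= 2^{h-1}\cdot 1 + 2^{h-2}\cdot 2 + \dots + 2^{2}(h-2) + 2^{1}(h-1) \\
2F - F &= 2^{h-1} + 2^{h-2} + \dots + 2^{2} + 2^{1} - 2^{0}(h-1) \\
F      &= (2^{h} - 2) - (h-1) = 2^{h} - (h+1)
\end{align*}
```

As a quick check, for $h = 3$ the sum is $2 \cdot 1 + 1 \cdot 2 = 4$, and the closed form gives $2^3 - 4 = 4$.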

To sum up: building the heap by downward adjustment is the better algorithm.
The comparison also shows why. With upward adjustment, the layers with the most nodes need the most moves per node, whereas with downward adjustment, the layers with the most nodes need the fewest moves per node. That is why downward adjustment wins!
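
The difference can also be observed by counting swaps directly. The sketch below builds a large heap both ways on plain int arrays and returns the number of swaps each method performed. BuildUpCount, BuildDownCount and SwapInt are hypothetical names introduced for this experiment, not functions from the article.

```c
#include <assert.h>

// Hypothetical helper for this experiment.
static void SwapInt(int* x, int* y) { int t = *x; *x = *y; *y = t; }

// Builds a large heap by treating a[1..n-1] as inserted one at a time
// (upward adjustment). Returns the number of swaps performed.
long BuildUpCount(int* a, int n)
{
	long swaps = 0;
	for (int i = 1; i < n; ++i)
	{
		int child = i;
		while (child > 0 && a[child] > a[(child - 1) / 2])
		{
			SwapInt(&a[child], &a[(child - 1) / 2]);
			child = (child - 1) / 2;
			++swaps;
		}
	}
	return swaps;
}

// Builds a large heap from the last non-leaf node back to the root
// (downward adjustment). Returns the number of swaps performed.
long BuildDownCount(int* a, int n)
{
	long swaps = 0;
	for (int i = (n - 1 - 1) / 2; i >= 0; --i)
	{
		int parent = i;
		int child = parent * 2 + 1;
		while (child < n)
		{
			// pick the larger child, if a right child exists
			if (child + 1 < n && a[child + 1] > a[child]) ++child;
			if (a[child] <= a[parent]) break;
			SwapInt(&a[child], &a[parent]);
			parent = child;
			child = parent * 2 + 1;
			++swaps;
		}
	}
	return swaps;
}
```

An ascending array is a worst case for both methods. For the 15 values 1 to 15 (a full tree with $h = 4$), upward building performs 34 swaps while downward building performs 11, the latter matching $2^h - (h+1)$.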

4. Applications of the heap

We have briefly mentioned the application of the heap before: TOP-K and sorting.

1. Heap sort

In the applications above, however, we always used the heap data structure (heap insertion and deletion). In practice we are usually just handed an array to sort. Do we really have to go to the trouble of creating a separate heap structure and copying the data into it? That costs extra time and space, so we want an algorithm that sorts the array in place, without a separate heap data structure.

From what we have learned, any contiguously stored array can be viewed as a complete binary tree, so we can apply the heap-building idea above directly to the data in the array and then sort it.

But if we want the array in ascending order, should we build a large heap or a small heap?

  • Let's look at the small heap first.

Suppose we build a small heap. The first element is already in place, so we would continue from the second element. But if we treat the rest of the array, starting at the second element, as a heap, the parent-child relationships all change and the heap structure is probably destroyed. We would then have to rebuild the heap, or scan the remaining elements to select the next number, and neither is efficient. A small heap is therefore not suitable for sorting in ascending order.

  • Now let's look at the large heap.
    We swap the top element with the last element of the heap, which puts the largest element in its final position. We then adjust the new top element downward once to restore the heap (this works because the left and right subtrees are still heaps), and from now on the last element is no longer counted as part of the heap. Next we swap the top element with the last element of the remaining heap, which puts the second largest element in its final position, adjust the top element downward again, and stop counting the last two elements as part of the heap. The cycle continues until every element has been placed, and then the sort is finished. Each downward adjustment makes at most height ($\log_2 n$) swaps, so the time complexity of heap sort is $O(n\log_2 n)$.

// Heap sort: build a large heap for ascending order, a small heap for descending order!
void HeapSort(HPDateType* arr, int n)
{
	int parent = (n - 1 - 1) / 2;
	// Build the heap -- O(n)
	while(parent>=0)
	{
		AdjustDown(arr, n, parent);
		--parent;
	}
	// Sort -- O(nlogn)
	int end = n - 1;
	while (end>0)
	{
		Swap(&arr[0], &arr[end]);
		// Adjust downward to restore the heap on the first end elements
		AdjustDown(arr, end, 0);
		--end;
	}
}
int main()
{
	int arr[] = { 27,15,19,18,28,34,65,49,25,37 };
	int n = sizeof(arr) / sizeof(int);
	HeapSort(arr, n);
	for (int i = 0; i < n; ++i)
	{
		printf("%d ", arr[i]);
	}
	printf("\n");
	return 0;
}
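
The comment in HeapSort says that descending order calls for a small heap. A minimal sketch of that variant follows. AdjustDownSmall, HeapSortDesc and SwapVal are hypothetical names chosen for this example, and the only real change from the large-heap code is the direction of the two comparisons.

```c
#include <assert.h>

// Hypothetical helper for this example.
static void SwapVal(int* x, int* y) { int t = *x; *x = *y; *y = t; }

// Downward adjustment for a small heap: the smaller child moves up.
void AdjustDownSmall(int* a, int n, int parent)
{
	int child = parent * 2 + 1;
	while (child < n)
	{
		// pick the smaller child, if a right child exists
		if (child + 1 < n && a[child + 1] < a[child]) ++child;
		if (a[child] >= a[parent]) break;
		SwapVal(&a[child], &a[parent]);
		parent = child;
		child = parent * 2 + 1;
	}
}

// Sorts arr[0..n-1] in descending order using a small heap.
void HeapSortDesc(int* arr, int n)
{
	assert(arr || n == 0);
	// Build the small heap from the last non-leaf node back to the root
	for (int i = (n - 1 - 1) / 2; i >= 0; --i)
	{
		AdjustDownSmall(arr, n, i);
	}
	// Move the current minimum to the end, then shrink the heap by one
	for (int end = n - 1; end > 0; --end)
	{
		SwapVal(&arr[0], &arr[end]);
		AdjustDownSmall(arr, end, 0);
	}
}
```

Each round places the smallest remaining value at the end of the shrinking heap, so the finished array runs from largest to smallest.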


2. TOP-K problem

In the earlier simple applications we already used the heap data structure to solve the TOP-K problem: build a large heap, take and delete the top element to get the first value, adjust downward to restore the heap, take and delete the top again to get the second value, and so on until K values have been selected.

In general, however, the data set in a TOP-K problem is very large and this method may not be usable. For example, to select the 10 largest values from 10 billion integers, 40 billion bytes are needed (assuming 4-byte integers). Since 1 GB ≈ 1 billion bytes, storing these 10 billion numbers in an array takes about 40 GB of memory, which is clearly more than we have. The data therefore has to stay on disk and be read through a file, and data on disk cannot be built into a heap directly, so we need another algorithm.

We can build a small heap of K values (K is how many numbers we want to select), first filling it with the first K values from the file. We then read the remaining values one at a time and compare each with the top of the heap: if it is greater than the top, it replaces the top, and we adjust downward to restore the small heap. After all the numbers have been traversed, the K numbers we want are in the small heap!
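
The same idea can be sketched on an in-memory array before moving to files. The names TopK and SiftDownMin are assumptions made for this illustration. SiftDownMin is a small-heap version of downward adjustment, obtained by flipping the comparisons of the large-heap AdjustDown.

```c
#include <assert.h>

// Small-heap downward adjustment (hypothetical helper for this sketch).
static void SiftDownMin(int* a, int n, int parent)
{
	int child = parent * 2 + 1;
	while (child < n)
	{
		// pick the smaller child, if a right child exists
		if (child + 1 < n && a[child + 1] < a[child]) ++child;
		if (a[child] >= a[parent]) break;
		int tmp = a[child];
		a[child] = a[parent];
		a[parent] = tmp;
		parent = child;
		child = parent * 2 + 1;
	}
}

// Leaves the k largest values of data[0..n-1] in out[0..k-1],
// arranged as a small heap (out[0] is the smallest of the k).
// Requires 0 < k <= n.
void TopK(const int* data, int n, int* out, int k)
{
	assert(data && out && k > 0 && k <= n);
	// Fill the heap with the first k values and heapify
	for (int i = 0; i < k; ++i) out[i] = data[i];
	for (int i = (k - 1 - 1) / 2; i >= 0; --i) SiftDownMin(out, k, i);
	// Every later value that beats the current minimum replaces it
	for (int i = k; i < n; ++i)
	{
		if (data[i] > out[0])
		{
			out[0] = data[i];
			SiftDownMin(out, k, 0);
		}
	}
}
```

The heap never holds more than K values, so memory use is $O(K)$ no matter how long the input is, and each input value costs at most one downward adjustment of a K-element heap.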

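Code example

// Requires <stdio.h>, <stdlib.h> and <time.h>.
// Note: AdjustDown here must be the small-heap version, i.e. the large-heap
// AdjustDown above with both comparisons flipped from > to <.
int main()
{
	// Select the 5 largest numbers
	int randK = 5;
	// Open the file for writing
	FILE* pfin = fopen("data.txt", "w");
	if (NULL == pfin)
	{
		perror("fopen fail:");
		return 1;
	}
	// Seed the random number generator
	srand((unsigned int)time(NULL));
	int val = 0;
	for (int i = 0; i < 500; i++)
	{
		// Insert 5 obviously larger values so the TOP-K result is easy to verify
		if (i != 0 && i % 7 == 0 && randK > 0)
		{
			fprintf(pfin, "%d ", 1000 + i);
			randK--;
			continue;
		}
		// The remaining values are random numbers below 1000 (500 values in total)
		val = rand() % 1000;
		fprintf(pfin, "%d ", val);
	}
	// Close the file
	fclose(pfin);
	// Reopen the file for reading
	FILE* pfout = fopen("data.txt", "r");
	if (NULL == pfout)
	{
		perror("fopen fail:");
		return 1;
	}
	// Build a small heap from the first 5 numbers
	int min_heap[5];
	for (int i = 0; i < 5; i++)
	{
		fscanf(pfout, "%d", &min_heap[i]);
	}
	for (int i = (5 - 1 - 1) / 2; i >= 0; --i)
	{
		AdjustDown(min_heap, 5, i);
	}
	// Read the remaining numbers; stop when fscanf no longer reads a value
	while (fscanf(pfout, "%d", &val) == 1)
	{
		if (val > min_heap[0])
		{
			min_heap[0] = val;
			AdjustDown(min_heap, 5, 0);
		}
	}
	fclose(pfout);
	// Print the data in the heap
	for (int i = 0; i < 5; i++)
	{
		printf("%d ", min_heap[i]);
	}
	return 0;
}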


Origin blog.csdn.net/qq_65207641/article/details/129181710