"Data Structure Cultivation Manual"----Heap Sort and TOP-K Problem

1. Implementation of heap sort

1.1 Simple implementation of heap sort

A simple implementation of heap sort (reusing the functions of the heap implemented previously)

Idea: Create a new heap and push every element of the array to be sorted into it. The top of the heap is always the smallest element (small heap) or the largest element (large heap). Using this property, we repeatedly take the top element, write it back into the original array, and pop it from the heap. Repeating this until the heap is empty leaves the array in ascending or descending order.

Ascending order: small heap (take the smallest element from the top of the heap each time)

Descending order: large heap (take the largest element from the top of the heap each time)

Time complexity: $O(N \log_2 N)$

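#include <stdio.h> // printf
// HP, HeapInit, HeapPush, HeapTop, HeapPop, HeapEmpty and HeapDestory come from the heap implemented previously

void HeapSort(int* a, int size) // simple heap sort built on the heap interface
{
	HP hp; // create a heap
	HeapInit(&hp); // initialize the heap
	for (int i = 0; i < size; i++) // push every element of the array into the heap
	{
		HeapPush(&hp, a[i]);
	}
	size_t j = 0; // next array index to write
	while (!HeapEmpty(&hp))
	{
		a[j++] = HeapTop(&hp); // overwrite the array with the current top of the heap
		HeapPop(&hp); // remove the top; the downward adjustment inside HeapPop brings the next smallest (or largest) element to the top
	}
	HeapDestory(&hp); // destroy the heap
}
int main()
{
	int a[] = { 1,3,5,2,3,7,5,4 }; // the array to sort
	int size = sizeof(a) / sizeof(int); // number of elements in the array
	HeapSort(a, size);
	for (int i = 0; i < size; i++) // print the sorted array
	{
		printf("%d ", a[i]);
	}
	return 0;
}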
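
The code above depends on a heap interface that is not reproduced in this article. As a rough sketch of what such an interface might look like for a small heap, the block below fills in the functions by name. The struct layout, the growth policy, and the helper SwapInt are assumptions made for illustration; only the function names are taken from the calls above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical layout: the real HP type is not shown in this article. */
typedef struct Heap
{
	int* a;       /* backing array */
	int size;     /* number of stored elements */
	int capacity; /* number of allocated slots */
} HP;

/* Hypothetical helper: exchange two ints. */
void SwapInt(int* x, int* y) { int t = *x; *x = *y; *y = t; }

void HeapInit(HP* hp) { hp->a = NULL; hp->size = 0; hp->capacity = 0; }

void HeapDestory(HP* hp) { free(hp->a); hp->a = NULL; hp->size = 0; hp->capacity = 0; }

bool HeapEmpty(const HP* hp) { return hp->size == 0; }

int HeapTop(const HP* hp) { assert(hp->size > 0); return hp->a[0]; }

/* Append x, then adjust it upward until its parent is not larger. */
void HeapPush(HP* hp, int x)
{
	if (hp->size == hp->capacity)
	{
		int newCap = hp->capacity == 0 ? 4 : hp->capacity * 2;
		int* tmp = (int*)realloc(hp->a, sizeof(int) * newCap);
		assert(tmp);
		hp->a = tmp;
		hp->capacity = newCap;
	}
	hp->a[hp->size++] = x;
	int child = hp->size - 1;
	while (child > 0)
	{
		int parent = (child - 1) / 2;
		if (hp->a[child] >= hp->a[parent]) break;
		SwapInt(&hp->a[child], &hp->a[parent]);
		child = parent;
	}
}

/* Move the last element to the top, then adjust it downward. */
void HeapPop(HP* hp)
{
	assert(hp->size > 0);
	hp->a[0] = hp->a[--hp->size];
	int parent = 0;
	int child = 1;
	while (child < hp->size)
	{
		/* pick the smaller of the two children */
		if (child + 1 < hp->size && hp->a[child + 1] < hp->a[child]) ++child;
		if (hp->a[parent] <= hp->a[child]) break;
		SwapInt(&hp->a[parent], &hp->a[child]);
		parent = child;
		child = parent * 2 + 1;
	}
}
```

With a small heap like this one, the HeapSort function above produces ascending order; flipping the two comparisons would give a large heap and descending order.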

Disadvantages: This approach is inconvenient because you first have to implement a complete heap yourself, and the auxiliary heap costs O(N) extra space. In practice heap sort is not written this way. The optimized version below sorts in place, which reduces the space complexity to O(1).

1.2 Optimized Implementation of Heap Sort

Note: The essence of heap sort is selection sort.

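void HeapSort(int* a, int size) // optimized (in-place) heap sort
{
	// 1. Build the heap. There are two ways: (1) upward adjustment, as if inserting the elements one by one; (2) downward adjustment
	// Method 1: build by upward adjustment (precondition: everything before the node is already a large heap or a small heap)
	// Idea: starting from the second layer, adjust every node upward so that all nodes before it always form a valid heap
	/*for (int i = 1; i < size; i++)
	{
		AdjustUp(a, i);
	}*/
	// Q: Why start from 1? Adjusting index 0 is pointless: a single node is already a heap
	
	// Method 2: build by downward adjustment (precondition: the left and right subtrees are both large heaps or both small heaps)
	// Idea: start from the last non-leaf node and walk backward, adjusting each node downward, so that every subtree already visited is a heap
	for (int i = (size - 1 - 1) / 2; i >= 0; --i)
	{
		AdjustDown(a, size, i);
	}
	// The last node has index size-1; subtracting 1 and dividing by 2 gives its parent
	// Q: Why not start from the last node? The last layer consists of leaves, so adjusting them does nothing
	
	// 2. Sort
	// Q: Why not build a small heap for ascending order?
	// A: Once the smallest number is fixed in the first position, the parent/child relationships among the remaining numbers are broken, so the heap has to be rebuilt.
	//    Rebuilding costs at least O(N) each time, which makes the whole function O(N^2). That is no better than scanning for the minimum directly.
	// Conclusion: build a large heap for ascending order and a small heap for descending order
	
	int end = size - 1; // index of the last element of the current heap (int, so that size == 0 does not wrap around)
	while (end > 0)
	{
		Swap(&a[0], &a[end]);
		AdjustDown(a, end, 0);
		--end;
	}
	// Idea: swap the top a[0] with the last element a[end], so the largest number lands in the last position, then shrink the heap by one.
	// Adjusting a[0] downward over the remaining elements brings the next largest number to a[0]; swap it with the second-to-last position, and so on.
}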
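
The HeapSort code above calls Swap, AdjustUp and AdjustDown, which come from the earlier heap implementation and are not shown here. The following is a minimal sketch of what large-heap versions of these helpers might look like. The signatures are inferred from the call sites, and HeapSortAscending is a hypothetical name used only so the sketch can be compiled and checked on its own.

```c
#include <assert.h>

/* Exchange two ints through pointers. */
void Swap(int* x, int* y) { int t = *x; *x = *y; *y = t; }

/* Large heap: move a[child] up while it is larger than its parent.
   Precondition: a[0..child) is already a large heap. */
void AdjustUp(int* a, int child)
{
	while (child > 0)
	{
		int parent = (child - 1) / 2;
		if (a[child] <= a[parent]) break;
		Swap(&a[child], &a[parent]);
		child = parent;
	}
}

/* Large heap: move a[parent] down within a[0..size).
   Precondition: both subtrees of parent are already large heaps. */
void AdjustDown(int* a, int size, int parent)
{
	int child = parent * 2 + 1;
	while (child < size)
	{
		/* pick the larger of the two children */
		if (child + 1 < size && a[child + 1] > a[child]) ++child;
		if (a[child] <= a[parent]) break;
		Swap(&a[child], &a[parent]);
		parent = child;
		child = parent * 2 + 1;
	}
}

/* In-place ascending sort: build a large heap, then repeatedly
   move the top to the end of the shrinking heap. */
void HeapSortAscending(int* a, int size)
{
	assert(size == 0 || a);
	for (int i = (size - 1 - 1) / 2; i >= 0; --i)
	{
		AdjustDown(a, size, i);
	}
	for (int end = size - 1; end > 0; --end)
	{
		Swap(&a[0], &a[end]);
		AdjustDown(a, end, 0);
	}
}
```

For descending order, the same structure works with a small heap: reverse the two comparisons in AdjustDown (and in AdjustUp, if it is used to build the heap).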

Sorting diagram:

[Figure: image-20220408172136221]

1.3 Time complexity of heap building

As shown in the figure below, assume the heap has N nodes in total and h layers. (A full binary tree, in which every layer is completely filled, is used as the example.)

image-20220408111503230

N and h are related by $h = \log_2(N+1)$, that is, $N = 2^h - 1$.

1.3.1 Building the heap by upward adjustment

for (int i = 1; i < size; i++)
{
	AdjustUp(a, i);
}

Note: The number of adjustments is the number of swaps a node goes through. We consider the worst case, in which every comparison leads to a swap.

Layer 1 has $2^0$ nodes, each adjusted 0 times, for a total of $2^0 \cdot 0$ adjustments. (This term is 0 and is omitted below.)

Layer 2 has $2^1$ nodes, each adjusted 1 time, for a total of $2^1 \cdot 1$ adjustments.

Layer 3 has $2^2$ nodes, each adjusted 2 times, for a total of $2^2 \cdot 2$ adjustments.

……

Layer h-1 has $2^{h-2}$ nodes, each adjusted h-2 times, for a total of $2^{h-2}(h-2)$ adjustments.

Layer h has $2^{h-1}$ nodes, each adjusted h-1 times, for a total of $2^{h-1}(h-1)$ adjustments.

Let T be the total number of swaps. (This is the sum of an arithmetic sequence multiplied term by term by a geometric sequence; the standard technique is to multiply by the common ratio, shift, and subtract.)

$T(h) = 2^1 \cdot 1 + 2^2 \cdot 2 + 2^3 \cdot 3 + \cdots + 2^{h-2}(h-2) + 2^{h-1}(h-1)$ (1)

$2T(h) = 2^2 \cdot 1 + 2^3 \cdot 2 + \cdots + 2^{h-2}(h-3) + 2^{h-1}(h-2) + 2^h(h-1)$ (2)

Subtracting (2) from (1) gives

$-T(h) = 2^1 + 2^2 + 2^3 + \cdots + 2^{h-2} + 2^{h-1} - 2^h(h-1)$

$-T(h) = 2^0 + 2^1 + 2^2 + \cdots + 2^{h-2} + 2^{h-1} - 2^h(h-1) - 1$ (here $2^0$ is added at the front and 1 is subtracted at the end)

$-T(h) = 2^h - 1 - h \cdot 2^h + 2^h - 1$

$-T(h) = 2^h(2 - h) - 2$

$T(h) = 2^h(h-2) + 2$

Substituting $h = \log_2(N+1)$:

$T(N) = (N + 1)(\log_2(N+1) - 2) + 2$

So the time complexity of building the heap by upward adjustment is $O(N \log_2 N)$.
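
As a sanity check on the closed form, the sketch below counts the swaps actually performed when a large heap is built by upward adjustment on an ascending input, which is a worst case because every new element is larger than everything before it and rises all the way to the top. The function names AdjustUpCount, CountUpBuildSwaps and UpBuildFormula are hypothetical, and the tree is assumed to be full, with $N = 2^h - 1$ nodes.

```c
#include <assert.h>
#include <stdlib.h>

/* Large heap: move a[child] up and return the number of swaps performed. */
long AdjustUpCount(int* a, int child)
{
	long swaps = 0;
	while (child > 0)
	{
		int parent = (child - 1) / 2;
		if (a[child] <= a[parent]) break;
		int t = a[child]; a[child] = a[parent]; a[parent] = t;
		child = parent;
		++swaps;
	}
	return swaps;
}

/* Build a large heap over the ascending input 0, 1, ..., N-1
   (N = 2^h - 1) by upward adjustment and return the total swap count. */
long CountUpBuildSwaps(int h)
{
	int n = (1 << h) - 1;
	int* a = (int*)malloc(sizeof(int) * n);
	assert(a);
	for (int i = 0; i < n; i++) a[i] = i;
	long total = 0;
	for (int i = 1; i < n; i++) total += AdjustUpCount(a, i);
	free(a);
	return total;
}

/* Closed form derived above: T(h) = 2^h * (h - 2) + 2. */
long UpBuildFormula(int h) { return (1L << h) * (h - 2) + 2; }
```

For h = 4 (N = 15) both functions give 34, which matches $2^1 \cdot 1 + 2^2 \cdot 2 + 2^3 \cdot 3 = 2 + 8 + 24$.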

1.3.2 Building the heap by downward adjustment

for (int i = (size - 1 - 1) / 2; i >= 0; --i)
{
	AdjustDown(a, size, i);
}

Again in the worst case:

Layer 1 has $2^0$ nodes, each adjusted h-1 times, for a total of $2^0(h-1)$ adjustments.

Layer 2 has $2^1$ nodes, each adjusted h-2 times, for a total of $2^1(h-2)$ adjustments.

Layer 3 has $2^2$ nodes, each adjusted h-3 times, for a total of $2^2(h-3)$ adjustments.

……

Layer h-1 has $2^{h-2}$ nodes, each adjusted 1 time, for a total of $2^{h-2} \cdot 1$ adjustments.

Layer h has $2^{h-1}$ nodes, each adjusted 0 times, for a total of $2^{h-1} \cdot 0$ adjustments. (This term is 0 and is omitted below.)

$T(h) = 2^0(h-1) + 2^1(h-2) + 2^2(h-3) + \cdots + 2^{h-2} \cdot 1 + 2^{h-1} \cdot 0$ (1)

$2T(h) = 2^1(h-1) + 2^2(h-2) + \cdots + 2^{h-2} \cdot 2 + 2^{h-1} \cdot 1 + 2^h \cdot 0$ (2)

Subtracting (1) from (2) gives

$T(h) = 1 - h + 2^1 + 2^2 + 2^3 + \cdots + 2^{h-2} + 2^{h-1}$

$T(h) = 2^0 + 2^1 + 2^2 + 2^3 + \cdots + 2^{h-2} + 2^{h-1} - h$

$T(h) = 2^h - 1 - h$

$T(N) = N - \log_2(N+1) \approx N$

So the time complexity of building the heap by downward adjustment is $O(N)$.
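
The same kind of check can be made for downward adjustment. The sketch below builds a large heap over an ascending input, where every node is smaller than everything in its subtree and therefore sinks all the way to a leaf, and counts the swaps. The names AdjustDownCount, CountDownBuildSwaps and DownBuildFormula are hypothetical, and the tree is again assumed to be full, with $N = 2^h - 1$ nodes.

```c
#include <assert.h>
#include <stdlib.h>

/* Large heap: move a[parent] down within a[0..size) and return the number of swaps. */
long AdjustDownCount(int* a, int size, int parent)
{
	long swaps = 0;
	int child = parent * 2 + 1;
	while (child < size)
	{
		/* pick the larger of the two children */
		if (child + 1 < size && a[child + 1] > a[child]) ++child;
		if (a[child] <= a[parent]) break;
		int t = a[child]; a[child] = a[parent]; a[parent] = t;
		parent = child;
		child = parent * 2 + 1;
		++swaps;
	}
	return swaps;
}

/* Build a large heap over the ascending input 0, 1, ..., N-1
   (N = 2^h - 1) by downward adjustment and return the total swap count. */
long CountDownBuildSwaps(int h)
{
	int n = (1 << h) - 1;
	int* a = (int*)malloc(sizeof(int) * n);
	assert(a);
	for (int i = 0; i < n; i++) a[i] = i;
	long total = 0;
	for (int i = (n - 1 - 1) / 2; i >= 0; --i) total += AdjustDownCount(a, n, i);
	free(a);
	return total;
}

/* Closed form derived above: T(h) = 2^h - 1 - h. */
long DownBuildFormula(int h) { return (1L << h) - 1 - h; }
```

For h = 4 (N = 15) both functions give 11, that is $1 \cdot 3 + 2 \cdot 2 + 4 \cdot 1$, compared with 34 swaps for upward adjustment on the same input.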

2. TOP-K problem

TOP-K problem: find the K largest or the K smallest elements in a data set. Usually the amount of data is large.

For example: the top 10 in a profession, the world's top 500, a rich list, the 100 most active players in a game, and so on.

The simplest and most direct approach to the Top-K problem is sorting. However, if the amount of data is very large, sorting is impractical (the data may not even fit in memory all at once).

  1. Sorting: time complexity $O(N \log_2 N)$, space complexity O(1) for an in-place sort such as heap sort.
  2. Build a large heap of all N numbers and pop K times to obtain the K largest: time complexity $O(N + K \log_2 N)$, space complexity O(1).
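
Approach 2 in the list above can be sketched as follows: build a large heap over the whole array in O(N), then pop the top K times. SiftDownMax and TopKByFullHeap are hypothetical names, and writing the result into a caller-supplied out array is an assumption made to keep the sketch easy to check.

```c
#include <assert.h>

/* Large heap: move a[parent] down within a[0..size). */
void SiftDownMax(int* a, int size, int parent)
{
	int child = parent * 2 + 1;
	while (child < size)
	{
		/* pick the larger of the two children */
		if (child + 1 < size && a[child + 1] > a[child]) ++child;
		if (a[child] <= a[parent]) break;
		int t = a[child]; a[child] = a[parent]; a[parent] = t;
		parent = child;
		child = parent * 2 + 1;
	}
}

/* Write the k largest values of a[0..n) into out[0..k), largest first.
   The contents of a are reordered in the process. */
void TopKByFullHeap(int* a, int n, int k, int* out)
{
	assert(a && out && k >= 0 && k <= n);
	/* build the heap over all n elements: O(N) */
	for (int i = (n - 1 - 1) / 2; i >= 0; --i)
	{
		SiftDownMax(a, n, i);
	}
	/* pop k times: O(K * log N) */
	int size = n;
	for (int j = 0; j < k; j++)
	{
		out[j] = a[0];
		a[0] = a[size - 1];
		--size;
		SiftDownMax(a, size, 0);
	}
}
```

This still needs all N elements in memory at once, which is the limitation the K-element heap avoids.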

The best approach is to use a heap that holds only K elements. The basic idea is as follows:

  1. Build a heap from the first K elements of the data set

    • To find the K largest elements, build a small heap
    • To find the K smallest elements, build a large heap
  2. Compare each of the remaining N - K elements with the top of the heap in turn. If the element belongs in the result (it is larger than the top of the small heap, or smaller than the top of the large heap), replace the top with it and adjust downward

Time complexity: $O(K + (N - K) \log_2 K)$

Function implementation:

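// requires <assert.h>, <stdio.h> and <stdlib.h>
void PrintTopK(int* a, int n, int k)
{
	// 1. Build the heap from the first k elements of a
	int* kminHeap = (int*)malloc(sizeof(int)*k);
	assert(kminHeap);
	for (int i = 0; i < k; i++) // copy the first k elements
	{
		kminHeap[i] = a[i];
	}
	// build a small heap by downward adjustment (AdjustDown must be the small-heap version here)
	for (int j = (k - 1 - 1)/2; j >= 0; --j)
	{
		AdjustDown(kminHeap, k, j);
	}
	// 2. Compare each of the remaining n-k elements with the top of the heap; replace the top when the element is larger
	for (int i = k; i < n; i++)
	{
		if (a[i] > kminHeap[0])
		{
			kminHeap[0] = a[i];
			AdjustDown(kminHeap, k, 0);
		}
	}
	for (int j = 0; j < k; j++) // print the k largest elements (in heap order, not sorted)
	{
		printf("%d ", kminHeap[j]);
	}
	free(kminHeap); // release the temporary heap
}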

Example of use:

// requires <assert.h>, <stdlib.h> and <time.h>; PrintTopK is the function implemented above
void TestTopk()
{
	int n = 10000;
	int* a = (int*)malloc(sizeof(int) * n);
	assert(a);
	srand((unsigned int)time(0));
	for (int i = 0; i < n; i++)
	{
		a[i] = rand() % 10000; // every random value is below 10000
	}
	// plant ten values above 10000; they must be the ten largest
	a[5] = 10001;
	a[1231] = 10002;
	a[12] = 10003;
	a[100] = 10004;
	a[107] = 10005;
	a[9] = 10006;
	a[1087] = 10007;
	a[1079] = 10008;
	a[17] = 10009;
	a[102] = 10010;
	PrintTopK(a, n, 10);
	free(a); // release the test data
}

At the end, kminHeap holds the ten values 10001 through 10010.
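
For readers who want to run the K-element heap idea without the rest of the heap code, here is a self-contained sketch. AdjustDownSmall and TopKSmallHeap are hypothetical names; unlike PrintTopK, this version writes the result into a caller-supplied array instead of printing it, which is an assumption made so the result can be checked.

```c
#include <assert.h>

/* Small heap: move a[parent] down within a[0..size). */
void AdjustDownSmall(int* a, int size, int parent)
{
	int child = parent * 2 + 1;
	while (child < size)
	{
		/* pick the smaller of the two children */
		if (child + 1 < size && a[child + 1] < a[child]) ++child;
		if (a[child] >= a[parent]) break;
		int t = a[child]; a[child] = a[parent]; a[parent] = t;
		parent = child;
		child = parent * 2 + 1;
	}
}

/* Fill heap[0..k) with the k largest values of a[0..n).
   On return heap is a small heap, so heap[0] is the smallest of the k largest. */
void TopKSmallHeap(const int* a, int n, int k, int* heap)
{
	assert(a && heap && k > 0 && k <= n);
	/* 1. build a small heap from the first k elements: O(K) */
	for (int i = 0; i < k; i++) heap[i] = a[i];
	for (int j = (k - 1 - 1) / 2; j >= 0; --j)
	{
		AdjustDownSmall(heap, k, j);
	}
	/* 2. an element larger than the top replaces it: O((N - K) * log K) */
	for (int i = k; i < n; i++)
	{
		if (a[i] > heap[0])
		{
			heap[0] = a[i];
			AdjustDownSmall(heap, k, 0);
		}
	}
}
```

Only the K-element heap has to stay in memory, so the input could just as well be read one value at a time from a file.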

Origin blog.csdn.net/m0_57304511/article/details/124294233