Practical applications of the heap (the topk problem and heap sort)

Table of contents

Foreword:

One: Solve the topk problem

Two: heap sort

[1] The first method (rarely used)

[2] The second method (very practical)


Foreword:

Last time we gave a preliminary introduction to binary trees and implemented the basic operations of the heap. But a heap is not just a place to store data: it can be used to solve the topk problem (finding the k largest or k smallest values in a set of data) and to sort data.

Link to the previous post: http://t.csdn.cn/pMOia

One: Solve the topk problem

Before discussing the topk problem, let's review the properties of the heap:

(1) In a large heap (max-heap), a parent node is never smaller than its child nodes.

(2) In a small heap (min-heap), a parent node is never larger than its child nodes.

From this we can draw a conclusion:

The root node must be the maximum or minimum value of the heap (the maximum for a large heap, the minimum for a small heap).
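To make the two properties concrete, here is a minimal sketch of a checker for the large-heap property. The name IsLargeHeap is hypothetical and is not part of the heap interface from the previous post; it only assumes the usual array layout, in which the parent of index i sits at (i - 1) / 2.

```c
#include <assert.h>
#include <stdbool.h>

// Hypothetical helper (not part of the original heap interface):
// returns true if a[0..n) satisfies the large-heap property,
// i.e. no child is greater than its parent.
bool IsLargeHeap(const int* a, int n)
{
	assert(a);
	// Index 0 is the root and has no parent, so start from 1
	for (int child = 1; child < n; child++)
	{
		int parent = (child - 1) / 2;
		if (a[child] > a[parent])
		{
			return false;
		}
	}
	return true;
}
```

Flipping the comparison to `a[child] < a[parent]` would give the matching check for a small heap.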

[1] We can store all the data in a heap, take the root, delete it, adjust to restore the heap structure, and repeat this operation until the top k numbers have been found.

Code (I build a large heap here, to find the largest numbers):

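#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

// Heap type from the previous post, repeated here (as inferred from
// how it is used below) so that the listing compiles on its own
typedef int HPDataType;
typedef struct Heap
{
	HPDataType* a;	// underlying array
	int size;	// number of valid elements
	int capacity;	// allocated capacity
} HP;

// Initialization
void HeapInit(HP* hp)
{
	// Assert: a null struct pointer must not be passed
	assert(hp);
	hp->a = NULL;
	// size and capacity both start at 0
	hp->size = hp->capacity = 0;
}
// Swap function
void HeapSwap(int* p1, int* p2)
{
	int tmp = *p1;
	*p1 = *p2;
	*p2 = tmp;
}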
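// Adjust upward
void AdjustUp(HPDataType* a, int child)
{
	// Assert: a null pointer must not be passed
	assert(a);
	// Find the index of the parent node
	int parent = (child - 1) / 2;
	// Loop until child reaches the root
	while (child > 0)
	{
		// If the parent is smaller than the child, swap and move up
		if (a[child] > a[parent])
		{
			HeapSwap(&a[child], &a[parent]);
			child = parent;
			parent = (child - 1) / 2;
		}
		// Otherwise the heap property holds, so leave the loop
		else
		{
			break;
		}
	}
}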
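// Adjust downward
void AdjustDown(HPDataType* a, int n, int parent)
{
	// Assume the left child is the larger one to begin with
	int child = parent * 2 + 1;
	// Stop once the child index runs past the end of the array
	while (child < n)
	{
		// Pick the larger of the two children,
		// taking into account that the right child may not exist
		if (child + 1 < n && a[child + 1] > a[child])
		{
			// The right child is larger, so move child to it
			child++;
		}
		// If the parent is smaller than the larger child, swap; otherwise stop
		if (a[child] > a[parent])
		{
			HeapSwap(&a[child], &a[parent]);
			// Continue one level down
			parent = child;
			child = parent * 2 + 1;
		}
		else
		{
			break;
		}
	}
}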

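// Insert data
void HeapPush(HP* hp, HPDataType x)
{
	// Assert: a null struct pointer must not be passed
	assert(hp);
	if (hp->size == hp->capacity)
	{
		// Decide the new capacity: 4 at first, then double each time
		int newcapacity = hp->capacity == 0 ? 4 : hp->capacity * 2;
		// Grow the array
		HPDataType* tmp =
			(HPDataType*)realloc(hp->a, sizeof(HPDataType) * newcapacity);
		// Stop if the allocation failed
		if (tmp == NULL)
		{
			printf("realloc error\n");
			exit(-1);
		}
		// Update
		hp->capacity = newcapacity;
		hp->a = tmp;
	}
	// Store the data
	hp->a[hp->size] = x;
	hp->size++;
	// Adjust upward from the new element
	AdjustUp(hp->a, hp->size - 1);
}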

// Print the data
void HeapPrint(HP* hp)
{
	// Assert: a null struct pointer must not be passed
	assert(hp);
	int i = 0;
	for (i = 0; i < hp->size; i++)
	{
		printf("%d ", hp->a[i]);
	}
	printf("\n");
}

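// Emptiness check used by HeapPop (part of the interface from the
// previous post; a minimal version is repeated here)
int HeapEmpty(HP* hp)
{
	assert(hp);
	return hp->size == 0;
}
// Delete data
void HeapPop(HP* hp)
{
	// Assert: a null struct pointer must not be passed
	assert(hp);
	// An empty heap cannot be popped; this avoids going out of bounds
	assert(!HeapEmpty(hp));
	// Swap the root with the last leaf, then reduce size by 1
	HeapSwap(&hp->a[0], &hp->a[hp->size - 1]);
	hp->size--;
	AdjustDown(hp->a, hp->size, 0);
}
// Get the root data
HPDataType HeapTop(HP* hp)
{
	// The heap must exist and must not be empty
	assert(hp);
	assert(!HeapEmpty(hp));
	return hp->a[0];
}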
int main()
{
    HP hp;
    HeapInit(&hp);
    int arr[20] = { 4,5,6,1,2,44,33,25,69,78,3,0,11,22,77,55,88,75,14,8 };
    // Find the k largest numbers
    int k = 5;
    for (int i = 0; i < 20; i++)
    {
        HeapPush(&hp, arr[i]);
    }
    for (int i = 0; i < k; i++)
    {
        printf("%d ", HeapTop(&hp));
        HeapPop(&hp);
    }
    printf("\n");
    // Release the heap's array
    free(hp.a);
    return 0;
}

Shortcomings:

(1) A heap holding all of the data has to be created, which consumes O(N) extra space.

(2) If there is a very large amount of data, it may not fit in memory at all.

[2] There is another, more commonly used method, which only needs a heap that holds k elements.

Here are the conclusions first:

(1) To find the k largest values, build a small heap.

(2) To find the k smallest values, build a large heap.

Seeing these two conclusions, you may be a little confused: in the previous method we built a large heap to find the largest values, and would build a small heap to find the smallest. Why is it reversed here? Don't worry; let's first explain why a large heap is needed to find the smallest values.

Let's look at the following set of data:

10  20  58  97  55  66  44

Suppose we want the 4 smallest values in this data. We first store the first 4 values in a large heap, which puts 97 at the root.

Then we start traversing from 55 (index k, here 4). If a value is smaller than the root, we delete the top of the heap, put this value into the heap, and adjust to maintain the large-heap structure.

The root of a large heap is the largest value in the heap, so this traversal keeps eliminating the larger values in the data. The values still in the heap when the traversal ends are the k smallest.

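We traverse and replace, starting at 55:

55 is smaller than the root 97, so 97 is deleted and 55 is put into the heap; the root is now 58.

The remaining values are handled the same way: 66 is larger than the root 58, so it is skipped; 44 is smaller than 58, so 58 is deleted and 44 is put into the heap.

The elements in the final heap are 55, 44, 10, and 20, which is what we wanted.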
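The walkthrough above can be sketched as a small self-contained function. This is only an illustration under a few assumptions: the names SiftDownLarge and KSmallest are hypothetical, the caller supplies a buffer of k ints, and instead of a separate pop and push the root is overwritten and adjusted downward once, which leaves the same set of values in the heap.

```c
#include <assert.h>

// Adjust the element at index parent downward in a large heap of n elements
static void SiftDownLarge(int* a, int n, int parent)
{
	int child = parent * 2 + 1;
	while (child < n)
	{
		// Pick the larger child (the right child may not exist)
		if (child + 1 < n && a[child + 1] > a[child])
		{
			child++;
		}
		// The parent is already at least as large: nothing left to do
		if (a[child] <= a[parent])
		{
			break;
		}
		int tmp = a[child];
		a[child] = a[parent];
		a[parent] = tmp;
		parent = child;
		child = parent * 2 + 1;
	}
}

// Hypothetical helper: leaves the k smallest values of data[0..n) in heap[0..k)
void KSmallest(const int* data, int n, int k, int* heap)
{
	assert(data && heap && k > 0 && k <= n);
	// Step 1: copy the first k values and turn them into a large heap
	for (int i = 0; i < k; i++)
	{
		heap[i] = data[i];
	}
	for (int i = (k - 2) / 2; i >= 0; i--)
	{
		SiftDownLarge(heap, k, i);
	}
	// Step 2: every later value smaller than the root replaces the root
	for (int i = k; i < n; i++)
	{
		if (data[i] < heap[0])
		{
			heap[0] = data[i];
			SiftDownLarge(heap, k, 0);
		}
	}
}
```

Called with the seven numbers above and k = 4, the heap ends up holding 10, 20, 44 and 55 with 55 at the root. The order of the other three inside the array depends on how the heap was built, so it may differ from the order listed in the text.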

Conversely, if the k largest numbers are required, we build a small heap from the first k elements and traverse the rest. If a value is larger than the root, we delete the top of the heap, put this value into the heap, and adjust to maintain the small-heap structure.

The smaller values are gradually eliminated by this traversal and adjustment, and the values left in the heap at the end are the k largest.
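The mirrored case can be sketched in the same way. Again this is a minimal illustration, not the heap interface from the previous post: SiftDownSmall and KLargest are hypothetical names, and the caller is assumed to provide a buffer of k ints.

```c
#include <assert.h>

// Adjust the element at index parent downward in a small heap of n elements
static void SiftDownSmall(int* a, int n, int parent)
{
	int child = parent * 2 + 1;
	while (child < n)
	{
		// Pick the smaller child (the right child may not exist)
		if (child + 1 < n && a[child + 1] < a[child])
		{
			child++;
		}
		// The parent is already no larger than the child: stop
		if (a[child] >= a[parent])
		{
			break;
		}
		int tmp = a[child];
		a[child] = a[parent];
		a[parent] = tmp;
		parent = child;
		child = parent * 2 + 1;
	}
}

// Hypothetical helper: leaves the k largest values of data[0..n) in heap[0..k)
void KLargest(const int* data, int n, int k, int* heap)
{
	assert(data && heap && k > 0 && k <= n);
	// Build a small heap from the first k values
	for (int i = 0; i < k; i++)
	{
		heap[i] = data[i];
	}
	for (int i = (k - 2) / 2; i >= 0; i--)
	{
		SiftDownSmall(heap, k, i);
	}
	// A value larger than the root (the smallest kept value) replaces it
	for (int i = k; i < n; i++)
	{
		if (data[i] > heap[0])
		{
			heap[0] = data[i];
			SiftDownSmall(heap, k, 0);
		}
	}
}
```

With the 20-element array from the first method and k = 5, the heap keeps 69, 75, 77, 78 and 88, and the root is 69, the smallest of the five. The root acts as the entry threshold: a new value only gets in if it beats the weakest value currently kept, which is why the heap type is the opposite of what one might first expect.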

For a better test, we randomly generate a large amount of data and find the k smallest numbers.

Code:

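#include <time.h>	// time(), used to seed rand()

void topk()
{
	HP hp;
	// Initialize the heap
	HeapInit(&hp);
	// Randomly generate 10,000 numbers
	int n = 10000;
	// Find the 5 smallest numbers
	int k = 5;
	int* a = (int*)malloc(sizeof(int) * n);
	if (a == NULL)
	{
		printf("malloc error\n");
		exit(-1);
	}
	srand((unsigned int)time(NULL));
	for (int i = 0; i < n; i++)
	{
		// Random values in the range 500 to 999
		a[i] = rand() % 500 + 500;
	}
	for (int i = 0; i < k; i++)
	{
		// Use the first k elements of the array to build the heap
		HeapPush(&hp, a[i]);
	}
	// To make the result easy to check, plant 5 values below 500
	a[100] = 423;
	a[888] = 55;
	a[999] = 450;
	a[887] = 478;
	a[56] = 256;
	for (int i = k; i < n; i++)
	{
		// If a[i] is smaller than the top of the heap, pop the top and push a[i]
		if (a[i] < HeapTop(&hp))
		{
			HeapPop(&hp);
			HeapPush(&hp, a[i]);
		}
	}
	// After the traversal, the heap holds the 5 smallest values
	HeapPrint(&hp);
	// Release the heap's array and the data array
	free(hp.a);
	free(a);
}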

Advantages of this approach:

(1) Only a heap of k elements has to be created, so the space consumption is small.

(2) Because the data is traversed one value at a time, it does not all have to be in memory: it can stay on disk and be read from disk as the traversal proceeds.
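Advantage (2) can be illustrated with a sketch that never holds more than k values in memory. It assumes the data arrives as whitespace-separated integers on a FILE* stream; that format, and the names KSmallestFromStream and SiftDownLarge, are assumptions made for this example only.

```c
#include <assert.h>
#include <stdio.h>

static void Swap(int* p1, int* p2)
{
	int tmp = *p1;
	*p1 = *p2;
	*p2 = tmp;
}

// Adjust the element at index parent downward in a large heap of n elements
static void SiftDownLarge(int* a, int n, int parent)
{
	int child = parent * 2 + 1;
	while (child < n)
	{
		if (child + 1 < n && a[child + 1] > a[child])
		{
			child++;
		}
		if (a[child] <= a[parent])
		{
			break;
		}
		Swap(&a[child], &a[parent]);
		parent = child;
		child = parent * 2 + 1;
	}
}

// Hypothetical helper: reads integers from in one at a time and keeps the
// k smallest in heap[0..k). Returns how many values ended up in the heap
// (fewer than k if the stream was short).
int KSmallestFromStream(FILE* in, int k, int* heap)
{
	assert(in && heap && k > 0);
	int size = 0;
	int x;
	while (fscanf(in, "%d", &x) == 1)
	{
		if (size < k)
		{
			// Still filling: append, then adjust upward
			heap[size] = x;
			int child = size++;
			while (child > 0 && heap[child] > heap[(child - 1) / 2])
			{
				Swap(&heap[child], &heap[(child - 1) / 2]);
				child = (child - 1) / 2;
			}
		}
		else if (x < heap[0])
		{
			// Smaller than the largest kept value: replace the root
			heap[0] = x;
			SiftDownLarge(heap, k, 0);
		}
	}
	return size;
}
```

The function could be called with a file opened by fopen or with stdin. Memory use stays at k ints plus one scratch value no matter how long the input is.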

Two: heap sort

[1] The first method (rarely used)

(1) We can create a heap and store the data in it.

(2) The physical structure of the heap is an array. We exchange the root node (the largest value) with the last leaf node and reduce size (the number of valid elements in the heap) by 1, so the largest value now sits at the end of the array, outside the heap.

(3) Adjust downward from the root to restore the heap structure, then repeat steps (2) and (3) until size becomes 1.

A large heap is used as the example here, which sorts the data in ascending order; see the previous post for how the adjustment works.

Code:

void HeapSort1(int* arr, int n)
{
	// Build the heap
	HP hp;
	HeapInit(&hp);
	for (int i = 0; i < n; i++)
	{
		HeapPush(&hp, arr[i]);
	}
	// Sort
	while (hp.size > 1)
	{
		HeapSwap(&hp.a[0], &hp.a[hp.size - 1]);
		hp.size--;
		AdjustDown(hp.a, hp.size, 0);
	}
	// The heap's array is now in ascending order; print it
	for (int i = 0; i < n; i++)
	{
		printf("%d ", hp.a[i]);
	}
	printf("\n");
	// Release the heap's array
	free(hp.a);
}

int main()
{
	int arr[20] = { 78,5,8,9,7,44,55,66,99,458,41,20,0,777,458,994,2,57,7789,956 };
	HeapSort1(arr, sizeof(arr) / sizeof(arr[0]));
	return 0;
}

Shortcomings:

(1) Building the heap consumes additional space.

(2) This method relies on the full heap interface, so a lot of functions have to be written before it can be used.

[2] The second method (very practical)

(1) Adjust the array into a heap in place

There are two ways to do the adjustment:

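① Adjustment from bottom to top (using a large heap as an example)

Start from the last non-leaf node (the parent of the last element) and adjust each node downward in turn, moving back toward the root. By the time a node is adjusted, both of its subtrees are already heaps.

Time complexity analysis: building the heap this way takes O(N).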
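The O(N) bound for bottom-to-top adjustment can be derived by counting, level by level, how far each node can move down. The following is a sketch under the assumption of a full binary tree of height h, so that N = 2^h - 1, with levels numbered from 1 at the root. The symbols h, i and T(h) are introduced here for illustration. Level i has 2^(i-1) nodes, and each of them moves down at most h - i levels.

```latex
\begin{aligned}
T(h)   &= \sum_{i=1}^{h-1} 2^{\,i-1}\,(h-i) \\
2\,T(h) &= \sum_{i=1}^{h-1} 2^{\,i}\,(h-i) \\
T(h)   &= 2\,T(h) - T(h) = \sum_{i=1}^{h-1} 2^{\,i} - (h-1) = 2^{h} - 1 - h \\
       &= N - \log_2(N+1) = O(N)
\end{aligned}
```

The many nodes near the bottom move only a little and the few nodes near the top move a lot, which is why the total stays linear.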

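② Adjustment from top to bottom (using a large heap as an example)

Treat the first element as a heap of size 1, then walk through the rest of the array from index 1 to the end, adjusting each element upward as if it were being inserted into the heap formed by the elements before it.

Time complexity analysis: building the heap this way takes O(N*log2(N)).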
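A minimal sketch of top-to-bottom building, for comparison with the bottom-to-top approach. The names SiftUpLarge and BuildHeapTopDown are hypothetical. The cost is higher here because the last level holds about half of all nodes and each of them may climb about log2(N) levels.

```c
#include <assert.h>

// Move a[child] up until its parent is no smaller (large heap)
static void SiftUpLarge(int* a, int child)
{
	while (child > 0)
	{
		int parent = (child - 1) / 2;
		if (a[child] <= a[parent])
		{
			break;
		}
		int tmp = a[child];
		a[child] = a[parent];
		a[parent] = tmp;
		child = parent;
	}
}

// Hypothetical helper: turns a[0..n) into a large heap in place,
// walking the array from top (index 1) to bottom (index n - 1)
void BuildHeapTopDown(int* a, int n)
{
	assert(a);
	// a[0] alone is already a heap, so start "inserting" from index 1
	for (int i = 1; i < n; i++)
	{
		SiftUpLarge(a, i);
	}
}
```

After the call, a[0] holds the largest value and every parent is at least as large as its children, so the sorting phase can proceed exactly as with a heap built from the bottom up.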

(2) Sorting (the time complexity of sorting is the same as that of top-to-bottom adjustment, O(N*log2(N)), and it is calculated in the same way)

The idea of sorting is the same as in the first method:

① Exchange the root node (the largest value) with the last leaf node and reduce n (the number of valid elements in the heap) by 1.

② Adjust downward from the root to restore the heap structure, and repeat until n becomes 1.

Code:

// Heap sort
void HeapSort2(int* a, int n)
{
	// Build the heap in place, starting from the parent of the last element
	int parent = (n - 1 - 1) / 2;
	while (parent >= 0)
	{
		AdjustDown(a, n, parent);
		parent--;
	}
	// Sort: repeatedly move the root to the end and shrink the heap
	while (n > 1)
	{
		HeapSwap(&a[0], &a[n - 1]);
		n--;
		AdjustDown(a, n, 0);
	}
}

int main()
{
	int arr[] = {300,578,65,78,5,8,9,7,44,55,66,99,458,41,20,0,777,458,994,2,57,7789,956 };
	int n = sizeof(arr) / sizeof(arr[0]);
	HeapSort2(arr, n);
	for (int i = 0; i < n; i++)
	{
		printf("%d ", arr[i]);
	}
	printf("\n");
	return 0;
}

Advantages of this method:

(1) The adjustment is done in place, so no additional space is consumed.

(2) Only the adjustment function is needed, so the amount of code is greatly reduced.

Time complexity analysis of heap sort

(1) If the heap is built from top to bottom, building costs O(N*log2(N)) and sorting costs O(N*log2(N)), so the overall time complexity is O(2*N*log2(N)) = O(N*log2(N)).

(2) If the heap is built from bottom to top, building costs O(N) and sorting costs O(N*log2(N)), so the overall time complexity is O(N*log2(N)+N) = O(N*log2(N)).

To sum up, heap sort is an algorithm with a time complexity of O(N*log2(N)).

This is a big improvement over bubble sort, whose time complexity is O(N^2).
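One way to see the difference from bubble sort is to count element comparisons. The sketch below instruments a heap sort and a plain bubble sort (no early exit); the function names and the counting scheme are assumptions made for this illustration, not code from the previous post.

```c
#include <assert.h>

static void Swap(int* p1, int* p2)
{
	int tmp = *p1;
	*p1 = *p2;
	*p2 = tmp;
}

// Adjust downward in a large heap, adding each element comparison to *count
static void SiftDownCounted(int* a, int n, int parent, long* count)
{
	int child = parent * 2 + 1;
	while (child < n)
	{
		if (child + 1 < n)
		{
			(*count)++;
			if (a[child + 1] > a[child])
			{
				child++;
			}
		}
		(*count)++;
		if (a[child] <= a[parent])
		{
			break;
		}
		Swap(&a[child], &a[parent]);
		parent = child;
		child = parent * 2 + 1;
	}
}

// Heap sort (ascending); returns the number of element comparisons
long HeapSortCounted(int* a, int n)
{
	assert(a);
	long count = 0;
	// Build the heap from bottom to top
	for (int i = (n - 2) / 2; i >= 0; i--)
	{
		SiftDownCounted(a, n, i, &count);
	}
	// Move the root to the end and shrink the heap
	for (int end = n - 1; end > 0; end--)
	{
		Swap(&a[0], &a[end]);
		SiftDownCounted(a, end, 0, &count);
	}
	return count;
}

// Plain bubble sort (ascending); returns the number of element comparisons
long BubbleSortCounted(int* a, int n)
{
	assert(a);
	long count = 0;
	for (int i = 0; i < n - 1; i++)
	{
		for (int j = 0; j < n - 1 - i; j++)
		{
			count++;
			if (a[j] > a[j + 1])
			{
				Swap(&a[j], &a[j + 1]);
			}
		}
	}
	return count;
}
```

For N = 1000 this bubble sort always makes N*(N-1)/2 = 499,500 comparisons, while the heap sort count stays on the order of 2*N*log2(N), roughly twenty thousand. The gap widens quickly as N grows.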

 

Origin blog.csdn.net/2301_76269963/article/details/130157994