[Elementary Data Structure] Heap Sort and the Top-K Problem


This article continues from the previous one, which introduced binary trees and heaps.

========================================================================= 

Table of contents

Preface

Build a heap

Building a heap by inserting data with the upward adjustment algorithm

Building a heap in place with the upward adjustment algorithm

Building a heap with the downward adjustment algorithm, starting from level H-1 (the last non-leaf node)

Heap sort

TOP-K problem


Preface

The previous article explained the heap in detail. When we ran the complete code at the end, we found that deleting the data from the heap one element at a time produced it in sorted order. This is no accident. In this article we explain how to use a heap to sort, and how to use a heap to solve the Top-K problem.

Build a heap

Building a heap by inserting data with the upward adjustment algorithm

In the previous article we already implemented this step. We created an array in the main function and inserted each element into dynamically allocated space using the insert function, which calls the upward adjustment function. Each inserted element is treated as a child and compared with its parent; the two are swapped according to their sizes, which eventually produces a large or small heap.

Insert function

void HPPush(HP* php, HPDatatype x)
{
	assert(php);
	
	if (php->size == php->capacity)
	{
		int newcapacity = php->capacity == 0 ? 4 : php->capacity * 2;
		HPDatatype* tmp = (HPDatatype*)realloc(php->a, sizeof(HPDatatype)*newcapacity);
		if (tmp == NULL)
		{
			perror("realloc failed");
			exit(-1);
		}
		php->a = tmp;
		php->capacity = newcapacity;
	}
	php->a[php->size] = x;
	php->size++;
	HPadjustUp(php->a, php->size-1);
}

On entry, the function first checks whether there is enough space. If not, it uses the realloc library function to grow the array. If the allocation fails, the program exits; if it succeeds, the new pointer and capacity are recorded. The new value is then stored at the end of the array, size is incremented, and the value is adjusted upward.

Upward adjustment and swap functions

void HPadjustUp(HPDatatype* a, int child)
{
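	// Find the parent
	int parent = (child - 1) / 2;
	// The root is at index 0; once the value has been swapped into the root, child is 0
	while (child > 0)
	{
		// Swap when the child is smaller than its parent to build a small heap
		// Swap when the child is larger than its parent to build a large heap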
		if (a[parent] > a[child])
		{
			swap(&a[parent], &a[child]);
			child = parent;
			parent = (child - 1) / 2;
		}
		else
		{
			break;
		}
	}
}

The upward adjustment function uses the index relationship between parent and child from the previous article: parent = (child - 1) / 2. Depending on whether we want a large or a small heap, it compares the child with its parent and, using the swap function, keeps moving the child up until the heap property holds or the child reaches the root.

void swap(HPDatatype* x, HPDatatype* y)
{
	HPDatatype tmp = *x;
	*x = *y;
	*y = tmp;
}

C passes arguments by value, so a swap function that received plain values would only exchange its own local copies, which are destroyed when the function returns. That is why the swap function takes pointers.
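To see the difference, here is a minimal sketch with two hypothetical helpers, swap_by_value and swap_by_pointer. These names are not from the article; they only illustrate why the article's swap takes pointers.

```c
#include <assert.h>

/* Hypothetical helper: receives copies, so the caller's variables are untouched. */
void swap_by_value(int x, int y)
{
	int tmp = x;
	x = y;
	y = tmp;
}

/* Hypothetical helper: receives addresses, so the caller's variables are exchanged. */
void swap_by_pointer(int* x, int* y)
{
	assert(x && y);
	int tmp = *x;
	*x = *y;
	*y = tmp;
}
```

Calling swap_by_value(a, b) leaves a and b as they were, while swap_by_pointer(&a, &b) exchanges them.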

Disadvantages of this approach:

1. It needs dynamically allocated extra space to hold a copy of the data, which wastes space.

2. It requires a complete heap implementation, which is a lot of code for this purpose, so it is not recommended.

We can use the following method to improve on the approach above.


Building a heap in place with the upward adjustment algorithm

As the name suggests, this method does not need to dynamically allocate extra space. It only walks through the array by index and calls the upward adjustment function on each element.

Implementation code

	for (int i = 1; i < n; i++)
	{
		HPadjustUp(a, i);
	}

Here we treat the first element of the array as a heap of size one. Moving the index forward, each following element is treated as a newly inserted child and is adjusted upward against its parents among the earlier elements, so the heap grows by one element at each step.

After many moves like this, our heap is formed.  
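The loop above can be tried on a plain int array with the following sketch. It assumes int data and a small heap; adjust_up mirrors HPadjustUp, while swap_int, build_heap_up and is_min_heap are hypothetical names introduced only for this illustration.

```c
#include <assert.h>

/* Hypothetical helper: exchange two ints through pointers. */
void swap_int(int* x, int* y)
{
	int tmp = *x;
	*x = *y;
	*y = tmp;
}

/* Move a[child] up until its parent is no larger (small heap). */
void adjust_up(int* a, int child)
{
	assert(a);
	while (child > 0)
	{
		int parent = (child - 1) / 2;
		if (a[parent] <= a[child])
			break;
		swap_int(&a[parent], &a[child]);
		child = parent;
	}
}

/* a[0..i-1] is already a heap when a[i] is adjusted upward. */
void build_heap_up(int* a, int n)
{
	for (int i = 1; i < n; i++)
		adjust_up(a, i);
}

/* Returns 1 if every parent is <= each of its children, otherwise 0. */
int is_min_heap(const int* a, int n)
{
	for (int i = 1; i < n; i++)
		if (a[(i - 1) / 2] > a[i])
			return 0;
	return 1;
}
```

After build_heap_up(a, n), is_min_heap(a, n) should return 1 and a[0] holds the smallest value.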


Building a heap with the downward adjustment algorithm, starting from level H-1 (the last non-leaf node)

In the previous article we introduced the downward adjustment algorithm. It has a precondition: the left and right subtrees of the node being adjusted must already be heaps. Here we are given an unordered array and view it as a complete binary tree. The left and right subtrees of the root are probably not heaps, so we cannot simply adjust downward from the root. Instead, we start from the last non-leaf node, which is the parent of the last node. Because the array is contiguous, we adjust that node downward, then move the index backward one node at a time and adjust each node downward in turn until we reach the root. In this way the big tree is broken into small subtrees that are turned into heaps from the bottom up, and once the root has been adjusted the whole array is a heap.

Implementation code

	// n - 1 is the index of the last node; (n - 1 - 1) / 2 is its parent
	for (int i = (n - 1 - 1) / 2; i >= 0; i--)
	{
		AdjustDown(a, n, i);
	}

After many downward adjustments like this, the heap can finally be achieved. 
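AdjustDown itself comes from the previous article and is not shown here. The following is a minimal sketch of what it might look like for int data and a small heap; adjust_down, swap_int and build_heap_down are hypothetical names, and the author's actual AdjustDown may differ.

```c
#include <assert.h>

/* Hypothetical helper: exchange two ints through pointers. */
void swap_int(int* x, int* y)
{
	int tmp = *x;
	*x = *y;
	*y = tmp;
}

/* Move a[parent] down within a[0..n-1] (small heap).
   Assumes both subtrees of parent are already heaps. */
void adjust_down(int* a, int n, int parent)
{
	assert(a);
	int child = parent * 2 + 1; /* left child */
	while (child < n)
	{
		/* pick the smaller of the two children, if a right child exists */
		if (child + 1 < n && a[child + 1] < a[child])
			child++;
		if (a[child] >= a[parent])
			break;
		swap_int(&a[child], &a[parent]);
		parent = child;
		child = parent * 2 + 1;
	}
}

/* Adjust every non-leaf node downward, from the last one back to the root. */
void build_heap_down(int* a, int n)
{
	for (int i = (n - 1 - 1) / 2; i >= 0; i--)
		adjust_down(a, n, i);
}
```

Changing the two comparisons to look for the larger child instead gives the large-heap version.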


Heap sort

Before implementing heap sort, let us first think about which kind of heap should be built for ascending order and which for descending order.

We give the answer directly here:

Ascending order: build a large heap

Descending order: build a small heap

Why is this the case?

The heap deletion in the previous article already hinted at the answer. If we deleted the top of the heap by shifting the remaining data forward, the result would no longer be a large or small heap. So instead we swap the top element with the last element. The two subtrees of the root are still heaps, so we adjust downward from the root over the remaining elements, and the next largest (or smallest) element moves to the top. If we skip the actual deletion and simply stop counting the last position as part of the heap, the array ends up sorted. With a large heap, the largest value goes to the end each time, which gives ascending order.

Implementation code

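	// The array a must already be a heap at this point
	int end = n - 1;
	while (end > 0)
	{
		// Move the current top to its final position at the end
		swap(&a[end], &a[0]);
		// Restore the heap over the first end elements
		AdjustDown(a, end, 0);
		end--;
	}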
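Putting the heap-building loop and the sorting loop together, a self-contained sketch of ascending heap sort on an int array might look like this. adjust_down_max is a hypothetical large-heap version of the downward adjustment; swap_int and heap_sort_ascending are also names chosen only for this example.

```c
#include <assert.h>

/* Hypothetical helper: exchange two ints through pointers. */
void swap_int(int* x, int* y)
{
	int tmp = *x;
	*x = *y;
	*y = tmp;
}

/* Large-heap downward adjustment over a[0..n-1]. */
void adjust_down_max(int* a, int n, int parent)
{
	int child = parent * 2 + 1;
	while (child < n)
	{
		/* pick the larger of the two children */
		if (child + 1 < n && a[child + 1] > a[child])
			child++;
		if (a[child] <= a[parent])
			break;
		swap_int(&a[child], &a[parent]);
		parent = child;
		child = parent * 2 + 1;
	}
}

/* Ascending order: build a large heap, then repeatedly move the top to the end. */
void heap_sort_ascending(int* a, int n)
{
	assert(a || n == 0);
	for (int i = (n - 1 - 1) / 2; i >= 0; i--)
		adjust_down_max(a, n, i);

	int end = n - 1;
	while (end > 0)
	{
		swap_int(&a[end], &a[0]);
		adjust_down_max(a, end, 0);
		end--;
	}
}
```

For descending order, the same structure works with a small-heap downward adjustment.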

TOP-K problem

TOP-K problem: find the K largest or K smallest elements in a data set. Generally, the amount of data is relatively large.
For example: the top 10 in a field, the Fortune 500, a rich list, the 100 most active players in a game, etc.
For the Top-K problem, the simplest and most direct approach is sorting. However, if the amount of data is very large, sorting is not advisable (the data may not all fit in memory at once). A better way is to use a heap. The basic idea is as follows:
1. Build a heap from the first K elements of the data set.
   For the K largest elements, build a small heap.
   For the K smallest elements, build a large heap.
2. Compare each of the remaining N-K elements with the top of the heap in turn. If an element beats the top (it is larger than the top of the small heap, or smaller than the top of the large heap), replace the top with it and adjust downward. After all N-K elements have been compared, the K elements left in the heap are the K largest or K smallest elements.
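Before reading the file-based version, the two steps above can be sketched on an in-memory array. This sketch assumes int data and finds the K largest values with a small heap; top_k_largest, adjust_down and swap_int are hypothetical names for illustration.

```c
#include <assert.h>

/* Hypothetical helper: exchange two ints through pointers. */
void swap_int(int* x, int* y)
{
	int tmp = *x;
	*x = *y;
	*y = tmp;
}

/* Small-heap downward adjustment over a[0..n-1]. */
void adjust_down(int* a, int n, int parent)
{
	int child = parent * 2 + 1;
	while (child < n)
	{
		if (child + 1 < n && a[child + 1] < a[child])
			child++;
		if (a[child] >= a[parent])
			break;
		swap_int(&a[child], &a[parent]);
		parent = child;
		child = parent * 2 + 1;
	}
}

/* Leave the k largest values of src[0..n-1] in heap[0..k-1]; requires 0 < k <= n. */
void top_k_largest(const int* src, int n, int* heap, int k)
{
	assert(src && heap && k > 0 && k <= n);

	/* Step 1: build a small heap from the first k elements. */
	for (int i = 0; i < k; i++)
		heap[i] = src[i];
	for (int i = (k - 2) / 2; i >= 0; i--)
		adjust_down(heap, k, i);

	/* Step 2: any later element larger than the top replaces the top. */
	for (int i = k; i < n; i++)
	{
		if (src[i] > heap[0])
		{
			heap[0] = src[i];
			adjust_down(heap, k, 0);
		}
	}
}
```

The result is left in heap order rather than sorted order: heap[0] is the smallest of the K largest values.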

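#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// AdjustDown is the downward adjustment function from the previous article (small-heap version)

void PrintTopK(const char* filename, int k)
{
	// 1. Build a heap from the first k values in the file
	FILE* fout = fopen(filename, "r");
	if (fout == NULL)
	{
		perror("fopen fail");
		return;
	}

	int* minheap = (int*)malloc(sizeof(int) * k);
	if (minheap == NULL)
	{
		perror("malloc fail");
		fclose(fout);
		return;
	}

	for (int i = 0; i < k; i++)
	{
		// Stop if the file holds fewer than k values
		if (fscanf(fout, "%d", &minheap[i]) != 1)
		{
			fprintf(stderr, "fewer than %d values in %s\n", k, filename);
			free(minheap);
			fclose(fout);
			return;
		}
	}

	// Build a small heap from the first k values
	for (int i = (k-2)/2; i >=0 ; --i)
	{
		AdjustDown(minheap, k, i);
	}


	// 2. Compare each of the remaining n-k values with the heap top; replace the top when the value is larger
	int x = 0;
	while (fscanf(fout, "%d", &x) == 1)
	{
		if (x > minheap[0])
		{
			// Replace the top so that x enters the heap
			minheap[0] = x;
			// Downward adjustment restores the small heap
			AdjustDown(minheap, k, 0);
		}
	}

	for (int i = 0; i < k; i++)
	{
		printf("%d ", minheap[i]);
	}
	printf("\n");

	free(minheap);
	fclose(fout);
}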

// fprintf  fscanf

void CreateNDate()
{
	// Generate test data
	int n = 10000000;
	srand(time(0));
	const char* file = "data.txt";
	FILE* fin = fopen(file, "w");
	if (fin == NULL)
	{
		perror("fopen error");
		return;
	}

	for (int i = 0; i < n; ++i)
	{
		int x = (rand() + i) % 10000000;
		fprintf(fin, "%d\n", x);
	}

	fclose(fin);
}

int main()
{
	//CreateNDate();
	PrintTopK("data.txt", 5);

	return 0;
}

If you have forgotten how the file operations used here work (fopen, fscanf, fprintf, fclose), review them before reading this code.

Today's content ends here, thank you all for reading! Feel free to discuss in the comment section and point out my mistakes!

The next article will explain the implementation of a complete binary tree! Please stay tuned!


Origin blog.csdn.net/qq_55119554/article/details/133090375