Data structures: handling big-data problems with hashing and heap sort

Problem statement


Generate 100,000 random numbers and write them to a file. The values range from 0 to 32767. Implement the following, given that the longest array that may be used is 10,000 elements (the memory used by individual scalar variables is ignored):

  1. Find the number with the most repetitions (if there is more than one, any one will do)
  2. Find the 100 most repeated numbers

Algorithm ideas


First question: find the number with the most repetitions (if there is more than one, any one will do)

  1. Generate the 100,000 random numbers and write them to a file
  2. Traverse the 100,000 numbers, using an array as a counter. The problem limits arrays to 10,000 elements while the value range is 0~32767, so the data must be split into groups covering no more than 10,000 possible values each. The grouping must not separate occurrences of the same number, so we cannot simply cut the range blindly: instead we hash (take each number's remainder modulo 4) and write the numbers into four hash files. The first hash file holds the numbers whose remainder modulo 4 is 0, the second those with remainder 1, and so on
  3. Find the most frequent number in each hash file, then take the overall champion among the four

Second question: find the top 100 most repeated numbers

  1. Heap-sort the counters of each hash file to get the 100 most repeated numbers in that file
  2. Merge these 400 candidates and take the top 100 by count

Implementation


Macro definitions and the Pair type

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define MAX_NUM 100000  // one hundred thousand random numbers
#define ITEM_NUM 10000	// counter size: ten thousand
typedef struct Pair // a (number, count) pair
{
	int num;   // the number itself
	int times; // how many times it occurs
} Pair;

The first question: find the number with the most repetitions

1. Generate one hundred thousand random numbers and write them to a file

void CreateBigFile(const char* path) // write MAX_NUM random numbers to the file
{
	FILE *fw = fopen(path, "wb");
	assert(fw != NULL);

	int tmp;
	for (int i = 0; i < MAX_NUM; i++)
	{
		tmp = rand() % 32768; // keep values in 0~32767 (on MSVC, RAND_MAX is already 32767)
		fwrite(&tmp, sizeof(int), 1, fw);
	}
	fclose(fw);
}

Auxiliary function: display the numbers in the file at path

void Show(const char *path) // print the numbers contained in the file at path
{
	FILE *fr = fopen(path, "rb");
	assert(fr != NULL);
	int tmp;
	int i = 0;
	while (fread(&tmp, sizeof(int), 1, fr) > 0)
	{
		printf("%d  ", tmp);
		i++;
		if (i % 10 == 0)	printf("\n");
	}
	fclose(fr);
}

2. Find the most frequent number in a single hash file

Pair HashFile(const char * path)
{
	int *arr = (int *)calloc(ITEM_NUM, sizeof(int));
	FILE *fr = fopen(path, "rb");
	assert(arr != NULL && fr != NULL);
	int tmp;
	// count how many times each number occurs in this hash file
	// file 0: 0,4,8 -> indices 0,1,2   file 1: 1,5,9 -> 0,1,2   file 2: 2,6,10 -> 0,1,2   file 3: 3,7,11 -> 0,1,2   hash function y = x/4
	while (fread(&tmp, sizeof(int), 1, fr) > 0) // (0,1,2,3) -> 0: in each of the four files the smallest number maps to counter index 0
	{
		arr[tmp / 4]++; // hash function y = x/4
	}
	// find the most frequent number and its count
	Pair pa = { 0 };
	for (int i = 0; i < ITEM_NUM; i++)
	{
		if (pa.times < arr[i])
		{
			pa.num = i * 4 + tmp % 4; // invert the hash: index i maps back to i*4 plus this file's remainder modulo 4 (tmp still holds the last number read, and every number in this file has the same remainder)
			pa.times = arr[i];
		}
	}
	fclose(fr);
	free(arr);
	return pa;
}

3. Find the number that appears most frequently in the whole file

Pair MaxTimes(const char * path)
{
	FILE *fr = fopen(path, "rb");
	assert(fr != NULL);
	int tmp;
	// generate four distinct file names: 0.txt  1.txt  2.txt  3.txt
	char pathArr[4][20];
	for (int i = 0; i < 4; i++)
	{
		sprintf(pathArr[i], "%d.txt", i);
	}
	// open the four hash files
	FILE *fw[4];
	for (int i = 0; i < 4; i++)
	{
		fw[i] = fopen(pathArr[i], "wb");
	}
	// scatter the original data into the four hash files
	while (fread(&tmp, sizeof(int), 1, fr) > 0)
	{
		fwrite(&tmp, sizeof(int), 1, fw[tmp % 4]);
	}
	fclose(fr);
	for (int i = 0; i < 4; i++)
	{
		fclose(fw[i]);
	}
	// find the most frequent number in each hash file
	Pair paArr[4];
	for (int i = 0; i < 4; i++)
	{
		paArr[i] = HashFile(pathArr[i]);
	}
	// pick the one with the largest count among the four
	int index = 0; // index of the pair with the largest count
	for (int i = 0; i < 4; i++)
	{
		if (paArr[index].times < paArr[i].times)
		{
			index = i;
		}
	}
	return paArr[index];
}

4. Main function test

int main()
{
	const char *path = "big.txt";
	CreateBigFile(path);

	Pair pa = MaxTimes(path);
	printf("Most repeated among the 100,000 numbers:\n\n number=%d, count=%d\n\n", pa.num, pa.times);
	//Show(path);
	return 0;
}


5. Hash the generated file

(Screenshot omitted: running the program produces the four hash files 0.txt, 1.txt, 2.txt, and 3.txt alongside big.txt.)

Second question: Find the top 100 with the most repetitions

1. Find the 100 most frequent numbers in each hash file

Note: in the counter array, a number is identified by its index (the hash maps each number to a position). Sorting swaps elements and destroys that index-to-number relationship, so each element must carry the number itself along with its count. The counter array's element type is therefore changed to the Pair struct.

// find the 100 most frequent numbers in a hash file; the counter is limited to ITEM_NUM elements
Pair *HashFile2(const char *path)
{
	// previously arr was an int array of length ITEM_NUM and a number's identity lived in its index;
	// here arr becomes an array of Pair so each element carries the number itself
	FILE *fr = fopen(path, "rb");
	Pair *arr = (Pair*)calloc(ITEM_NUM, sizeof(Pair));
	assert(fr != NULL && arr != NULL);
	int tmp;
	// count how many times each number occurs in this hash file
	// file 0: 0,4,8 -> indices 0,1,2   file 1: 1,5,9 -> 0,1,2   file 2: 2,6,10 -> 0,1,2   file 3: 3,7,11 -> 0,1,2   hash function y = x/4
	while (fread(&tmp, sizeof(int), 1, fr) > 0) // (0,1,2,3) -> 0: in each of the four files the smallest number maps to counter index 0
	{
		arr[tmp / 4].num = tmp;
		arr[tmp / 4].times++; // hash function y = x/4
	}
	// sort arr by times in descending order with heap sort; only the first 100 entries are needed afterwards
	HeapSort(arr, ITEM_NUM);
	fclose(fr);
	return arr; // caller is responsible for freeing arr
}

2. Find the top 100 among the 400 screened candidates

// among the 400 screened candidates, find the 100 with the highest counts
Pair * MaxTimes2(const char* path)
{
	FILE *fr = fopen(path, "rb");
	assert(fr != NULL);
	int tmp;
	// generate four distinct file names: 0.txt  1.txt  2.txt  3.txt
	char pathArr[4][20];
	for (int i = 0; i < 4; i++)
	{
		sprintf(pathArr[i], "%d.txt", i);
	}
	// open the four hash files
	FILE *fw[4];
	for (int i = 0; i < 4; i++)
	{
		fw[i] = fopen(pathArr[i], "wb");
	}
	// scatter the original data into the four hash files
	while (fread(&tmp, sizeof(int), 1, fr) > 0)
	{
		fwrite(&tmp, sizeof(int), 1, fw[tmp % 4]);
	}
	fclose(fr);
	for (int i = 0; i < 4; i++)
	{
		fclose(fw[i]);
	}
	// take the top 100 of each hash file
	Pair *arr[4];
	for (int i = 0; i < 4; i++)
	{
		arr[i] = HashFile2(pathArr[i]);
	}
	// to pick the top 100 of the 400, first gather the 400 pairs together
	Pair *fourHundred = (Pair*)malloc(sizeof(Pair) * 400);
	assert(fourHundred != NULL);
	int index = 0;
	for (int i = 0; i < 4; i++)
	{
		for (int j = 0; j < 100; j++)
		{
			fourHundred[index++] = arr[i][j];
		}
		free(arr[i]); // done with this file's counters
	}
	// heap-sort the gathered 400 pairs in descending order
	HeapSort(fourHundred, 400);
	return fourHundred; // the first 100 entries are the answer
}

3. Appendix: descending heap sort (for the Pair type)

/// descending heap sort ///
// one sift-down adjustment
void HeapAdjust(Pair *arr, int start, int end) // start and end are inclusive indices; O(log n) time, O(1) space
{
	Pair tmp = arr[start];
	int parent = start; // tracks the parent index

	for (int i = 2 * start + 1; i <= end; i = 2 * i + 1) // each round, i moves to its left child
	{
		// pick the smaller of the two children (this is a min-heap, which is what makes the sort descending)
		if (i + 1 <= end && arr[i].times > arr[i + 1].times)
		{
			i++;
		} // i is now the index of the smaller child
		if (arr[i].times < tmp.times)
		{
			//arr[(i - 1) / 2] = arr[i]; // equivalent: place at i's parent
			arr[parent] = arr[i];
		}
		else
		{
			break;
		}
		parent = i; // the parent for the next round
	}
	arr[parent] = tmp;
}
void HeapSort(Pair *arr, int len) // O(n log n) time, O(1) space, unstable (parent and child swap across non-adjacent indices)
{
	// build a min-heap keyed on times, bottom-up
	for (int i = (len - 1 - 1) / 2; i >= 0; i--) // len-1 is the last index; ((len-1)-1)/2 is its parent; adjust every internal node from back to front
	{
		HeapAdjust(arr, i, len - 1); // end is always len-1; nodes that lack children there are unaffected
	}

	// repeatedly swap the root (the current minimum) with the last unsorted element, then re-adjust; O(n log n)
	Pair tmp;
	for (int i = 0; i < len - 1; i++)
	{
		tmp = arr[0];
		arr[0] = arr[len - 1 - i];
		arr[len - 1 - i] = tmp;

		HeapAdjust(arr, 0, len - 2 - i);
	}
}
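
A quick sanity check helps confirm the direction of the sort. The helper below is a hypothetical addition (not part of the original post); it sorts a small Pair array and prints it, expecting descending order of times.

// Hypothetical sanity check: verify HeapSort orders Pairs by times, descending.
void TestHeapSort(void)
{
	Pair data[] = { {1, 5}, {2, 9}, {3, 1}, {4, 7} };
	HeapSort(data, 4);
	for (int i = 0; i < 4; i++)
		printf("(%d %d)  ", data[i].num, data[i].times);
	printf("\n"); // expected: (2 9)  (4 7)  (1 5)  (3 1)
}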

4. Main function test:

int main()
{
	const char *path = "big.txt";
	CreateBigFile(path);

	// first question
	Pair pa = MaxTimes(path);
	printf("Most repeated among the 100,000 numbers:\n\n number=%d, count=%d\n\n", pa.num, pa.times);
	//Show(path);

	// second question
	Pair *pa2 = MaxTimes2(path);
	printf("Top 100 most repeated numbers:\n\n");
	for (int i = 0; i < 100; i++)
	{
		printf("(%d %d)  ", pa2[i].num, pa2[i].times);
		if ((i + 1) % 10 == 0) // newline after every ten pairs
			printf("\n");
	}
	free(pa2);
	return 0;
}



Summary


For the problem above, three things were done:

  1. Hash the large file into several hash files
  2. Find the most frequent number (or top 100) within each hash file
  3. Combine the per-file results into the answer for the whole file

For similar big-data problems, with lots of data but little available memory, the core idea is: hash first, then heap-sort.


Massive data problems


Massive data processing means storing, processing, and operating on very large data sets. The difficulty is that the volume of data is so large that either it cannot be processed quickly enough, or it cannot be loaded into memory all at once.

Basic ideas for handling massive-data problems

  1. Divide and conquer / hash mapping + hash statistics + heap/quick/merge sort
  2. Two-level bucket partitioning
  3. Bloom filter / bitmap (see the sketch after this list)
  4. Trie / database / inverted index
  5. External sorting
  6. Distributed processing with Hadoop/MapReduce
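
As an illustration of idea 3, here is a minimal bitmap sketch (the type and function names are mine, not from the original): each possible value is marked by a single bit, so all 2^32 32-bit values fit in 512 MB, which is why bitmaps suit duplicate detection over huge value ranges.

#include <stdint.h>
#include <stdlib.h>

typedef struct { uint32_t *bits; } Bitmap; // one bit per possible value

Bitmap BitmapCreate(size_t nbits) // all bits start cleared
{
	Bitmap b = { (uint32_t *)calloc((nbits + 31) / 32, sizeof(uint32_t)) };
	return b;
}
void BitmapSet(Bitmap b, size_t i)  { b.bits[i / 32] |= 1u << (i % 32); }
int  BitmapTest(Bitmap b, size_t i) { return (b.bits[i / 32] >> (i % 32)) & 1; }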

Example 1: from massive log data, extract the IP address that visited Baidu the most on a given day

Algorithm idea: divide and conquer + hash

  1. There are at most 2^32 ≈ 4G distinct IP addresses, too many to load into memory at once
  2. Adopt divide and conquer: compute Hash(IP) % 1024 and scatter the massive IP log into 1024 small files, so each small file contains at most 4M distinct IP addresses
  3. For each small file, build a hash map with the IP as key and its number of occurrences as value, and record the IP that currently occurs most often
  4. This yields the most frequent IP of each of the 1024 small files; a conventional comparison among those 1024 winners then gives the overall most frequent IP (a per-file counting sketch follows)
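
Below is a minimal sketch of step 3, under assumptions not in the original: the IPs of one small file have already been parsed into binary uint32_t records, and a fixed-size open-addressing table stands in for the hash map. All names here are illustrative.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define TABLE_SIZE (1u << 23) // > 4M slots, so the table cannot fill up completely

typedef struct { uint32_t ip; int times; } IpCount;

// Most frequent IP in one small file (assumes binary uint32_t records).
IpCount MaxIpInFile(const char *path)
{
	IpCount *table = (IpCount *)calloc(TABLE_SIZE, sizeof(IpCount));
	FILE *fr = fopen(path, "rb");
	IpCount best = { 0, 0 };
	uint32_t ip;
	while (fr && fread(&ip, sizeof(ip), 1, fr) > 0)
	{
		size_t h = (ip * 2654435761u) & (TABLE_SIZE - 1); // multiplicative hash
		while (table[h].times != 0 && table[h].ip != ip)  // linear probing
			h = (h + 1) & (TABLE_SIZE - 1);
		table[h].ip = ip;
		table[h].times++;
		if (table[h].times > best.times)
			best = table[h]; // track the running maximum
	}
	if (fr) fclose(fr);
	free(table);
	return best;
}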

Example 2: a search engine logs every query string a user searches; each string is 1~255 bytes long

Suppose there are 10 million records. The repetition rate is high: although the total is 10 million, there are no more than 3 million distinct query strings. (The more often a query is repeated, the more users searched it, hence the more popular it is.) Find the 10 most popular query strings, using no more than 1 GB of memory.

Basic idea: hash + heap

  • First preprocess the data: complete the counting with a hash table in O(N) time
  • Then use a heap to find the top K, with time complexity O(N' log K)

That is, the heap lets us find and adjust in logarithmic time. So maintain a min-heap of size K (K = 10 here), traverse the 3 million distinct queries, and compare each against the heap root. The final time complexity is O(N) + O(N' log K), where N is 10 million and N' is 3 million. A Top-K sketch follows.
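
Here is a minimal Top-K sketch, reusing the Pair type and the min-heap sift-down scheme from above (the TopK and SiftDown names are mine; it assumes the hash-statistics pass has already produced an array of distinct items with counts, and that n >= K):

#define K 10

static void SiftDown(Pair *heap, int start, int end) // min-heap on times, same scheme as HeapAdjust
{
	Pair tmp = heap[start];
	int parent = start;
	for (int i = 2 * start + 1; i <= end; i = 2 * i + 1)
	{
		if (i + 1 <= end && heap[i].times > heap[i + 1].times) i++; // smaller child
		if (heap[i].times < tmp.times) heap[parent] = heap[i];
		else break;
		parent = i;
	}
	heap[parent] = tmp;
}

// Fill topk[0..K-1] with the K most frequent pairs of arr[0..n-1]; assumes n >= K.
void TopK(const Pair *arr, int n, Pair *topk)
{
	for (int i = 0; i < K; i++) topk[i] = arr[i]; // seed the heap with the first K items
	for (int i = (K - 2) / 2; i >= 0; i--)        // heapify
		SiftDown(topk, i, K - 1);
	for (int i = K; i < n; i++)                   // O((n-K) log K)
	{
		if (arr[i].times > topk[0].times)         // beats the current minimum?
		{
			topk[0] = arr[i];
			SiftDown(topk, 0, K - 1);
		}
	}
}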

Example 3: a 1 GB file contains one word per line, each word at most 16 bytes; the memory limit is 1 MB. Return the 100 most frequent words

Basic idea: hash + heap sort + merge

  • Read the file sequentially; for each word x compute hash(x) % 5000 and append x to the corresponding one of 5000 small files (call them x0, x1, ..., x4999). Each file then averages about 200 KB
  • If some files still exceed 1 MB, keep splitting them by the same method until no small file exceeds 1 MB
  • For each small file, count each word's frequency (a trie or a hash map works), take the 100 words with the highest frequencies (a min-heap of 100 nodes works), and write those 100 words with their frequencies to a new file, yielding 5000 result files
  • Merge these 5000 result files to get the overall top 100 words (a per-file word-counting sketch follows)
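
A minimal sketch of the per-file counting step, under assumptions not stated in the original: words are whitespace-separated and at most 16 bytes, and a fixed-size open-addressing table (zeroed by the caller) plays the role of the hash map. All names here are illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define WTABLE (1 << 16) // slot count; plenty for one ~200 KB small file

typedef struct { char word[17]; int times; } WordCount; // 16 bytes + terminating NUL

static size_t HashStr(const char *s) // djb2 string hash
{
	size_t h = 5381;
	while (*s) h = h * 33 + (unsigned char)*s++;
	return h & (WTABLE - 1);
}

// Count word frequencies in one small file; table must have WTABLE zeroed slots.
void CountWords(const char *path, WordCount *table)
{
	FILE *fr = fopen(path, "r");
	char word[17];
	while (fr && fscanf(fr, "%16s", word) == 1)
	{
		size_t h = HashStr(word);
		while (table[h].times != 0 && strcmp(table[h].word, word) != 0)
			h = (h + 1) & (WTABLE - 1); // linear probing
		strcpy(table[h].word, word);
		table[h].times++;
	}
	if (fr) fclose(fr);
}

Feeding every slot with times > 0 through a size-100 min-heap (the TopK pattern above, adapted to a word-holding pair type) then gives that file's top 100.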

Summary


The recipe for these big-data problems: hash, then heap sort.

Source: blog.csdn.net/huifaguangdemao/article/details/109079871