Big data analysis problem

1. Topic and content

Big data analysis problem

[Problem Description]

A search company receives a massive volume of user search terms in a single day (on the order of tens of billions of entries). Design a feasible method to find the 100 hottest (most frequent) search terms of each day.

[Basic Requirements]

(1) Mass data is randomly generated and stored in a file; data is read in from the file for processing.
(2) Display the results of each processing step on the data file.

2. Thinking process

(1) Generate a large amount of random data, and make sure the data contains overlapping (repeated) entries so that hot words actually exist.

(2) Read the large file sequentially. For each line, i.e. each word x, compute hash(x) % n and store the line in one of n small files (numbered 1, 2, 3, 4, ..., n) according to that value. This splits an MB- or GB-scale file into KB-scale chunks. If any small file still exceeds 1 MB, it can be split further in the same way until every small file is no larger than 1 MB.

(3) Read in each small file and use a hash_map to count each word and its frequency. Build a dynamic array of node structures holding data (the key) and num (the frequency), sort it with a max-heap, and take out the 100 most frequent words. Note that the result is not necessarily exactly 100 words: if entries after position 100 have the same frequency as the 100th entry, they must be kept as well. Write the top words and their frequencies to a file in the form [key, num], which again yields n files.

(4) Merge the n result files into a single file all.txt, build a dynamic array of node structures holding data (key) and num (frequency), sort it with a max-heap, and take out the 100 most frequent words. That completes the approach.

3. Program flow chart and annotated program

1. Generate big data files

I generated the data with Python because I wanted to produce it directly on the server and reduce the load on my local machine. The final data set contains 15,503,190 entries, roughly 15 million, and takes up about 81 MB. Although this is far from tens of billions of entries, the same idea applies at that scale in theory. I used English letters instead of Chinese characters because my first attempt with random Chinese characters produced very little overlap, making hot words hard to obtain.

Process:
Loop (about 15 million times) == set the random source -> concatenate a random number of random characters into a word -> append it to the file

Python code:

import random

def create():
    filepath = 'data.txt'
    dataFrom = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']  # random source characters
    for i in range(10000000):
        if i % 10000 == 0:
            print(i)  # progress output every 10000 words

        data = ''

        for j in range(random.randint(2, 5)):
            data += random.choice(dataFrom)  # build a word of length 2 to 5

        with open(filepath, 'a') as f:  # append the word to the data file
            f.write(data + '\n')

PS: After writing the report I realized that the with open(...) does not need to run on every iteration; the file handle could be opened once outside the loop instead.

2. Read large files sequentially

C++ is used here. For each line, that is, for each word x, take hash(x) % n and store the line into one of n small files (numbered 1, 2, 3, 4, ..., n) according to that value. In this way the MB- or GB-scale large file is split into KB-scale chunks; the largest small file I got was 609 KB.

Process:
Set the number of buckets num --> read the large file data.txt -->
loop (while lines remain) == compute hash of each line % num = bucket index -> append the line to the corresponding bucket file

#include <iostream>
#include <fstream>
#include <string>
#include <typeinfo>
#include <iomanip>

using namespace std;

int spit(int num = 500) {

    hash<string> str_hash; // hash function

    ifstream file; // the big data source
    file.open("C:\\Users\\lenovo\\Desktop\\大二上作业\\数据结构\\data.txt", ios::in);

    if (!file.is_open())
        return 0;

    std::string strLine;
    int i = 0;

    ofstream *ofile = new ofstream[num + 1]; // pre-open the num small bucket files; num is configurable, 500 here

    for (i = 1; i <= num; i++) { // open each file only once instead of on every write, which would waste resources
        string filename = to_string(i) + ".txt"; // bucket file name
        ofile[i].open("C:\\Users\\lenovo\\Desktop\\大二上作业\\数据结构\\ton\\" + filename);
    }

    i = 0;

    while (getline(file, strLine)) // read the big data file line by line
    {
        i++;

        if (strLine.empty()) // skip empty lines
            continue;

        int ton = str_hash(strLine) % num + 1; // hash to a bucket index in 1..num, matching the files opened above

        ofile[ton] << strLine << endl; // write the line to its bucket

        if (i % 10000 == 0) {
            cout << i << endl; // progress output every 10000 lines
        }
    }

    file.close();
    for (i = 1; i <= num; i++) {
        ofile[i].close();
    }
    delete[] ofile;
    return 0;
}

3. Read in every small file

Use a hash_map to count the words that appear in each file and their frequencies, build a dynamic array of node structures storing data (the key) and num (the frequency), sort it with a max-heap, and take out the 100 words with the highest frequency. Note that the result is not necessarily exactly 100 words: if entries after position 100 have the same frequency as the 100th entry, they must be kept as well. The top words and their frequencies are written to a file as [key, num], which again yields n files.

This is the core of the solution.

Process:
loop (500 times) ==
read a small file -> use hash_map to count frequencies ->
copy the hash_map entries into an array of node structures ->
sort the node array with a max-heap ->
write the top 100 entries to a file

int get100() {

    for (int q = 1; q <= 500; q++) {
        hash_map<string, int> hm; // hash_map with the word as key and its frequency as value
        hash_map<string, int>::iterator it;

        ifstream file;  // input: one small bucket file
        ofstream ofile; // output: this bucket's top-100 file

        string filename = to_string(q) + ".txt"; // source and destination file names
        file.open("C:\\Users\\lenovo\\Desktop\\大二上作业\\数据结构\\ton\\" + filename, ios::in);
        ofile.open("C:\\Users\\lenovo\\Desktop\\大二上作业\\数据结构\\100\\" + filename);

        if (!file.is_open())
            break;

        string strLine;
        int allNum = 0; // line count of the current small file, an upper bound for sizing the dynamic node array
        while (getline(file, strLine)) {
            if (strLine.empty())
                continue;
            hm[strLine] += 1; // count the frequency in the hash_map
            allNum++;
        }
        cout << "*****" << endl;

        it = hm.begin();

        node* tree = new node[allNum]; // dynamic node array used for sorting
        int startNum = 0;              // number of distinct words actually stored

        while (it != hm.end()) // walk the hash_map
        {
            tree[startNum].num = it->second;  // frequency
            tree[startNum].data = it->first;  // word
            startNum++;
            // cout << "pair it_map.key = " << it->first << " pair it_map.value = " << it->second << endl;
            ++it;
        }

        Heap_sort(tree, startNum); // max-heap sort of the filled entries, most frequent first
        cout << "***" << endl;

        int i = 0;
        while (i < startNum) // output the top 100 entries
        {
            cout << (i + 1) << "   " << tree[i].num << "  " << tree[i].data << endl;
            ofile << tree[i].data << "," << tree[i].num << endl;
            i++;

            if (i >= 100 && (i >= startNum || tree[i].num != tree[i - 1].num))
                break; // keep entries beyond 100 only while they tie the last one written
        }
        delete[] tree; // free the dynamic array
        file.close();
        ofile.close();
    }

    return 0;
}
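The Heap_sort function called throughout the report is not shown in the post, and neither is the node struct. As a minimal stand-in sketch (an assumption, not the author's original code), the following arranges a node array so that the most frequent word comes first, which is the order the surrounding loops consume; it assumes node has a string data field and an int num field, matching how the code accesses it.

#include <algorithm>
#include <string>

struct node {          // assumed layout, matching the fields used in the report's code
    std::string data;  // the word (key)
    int num = 0;       // its frequency
};

// Max-heap based sort: after this call, tree[0] holds the most frequent word.
void Heap_sort(node* tree, int n) {
    auto byNum = [](const node& a, const node& b) { return a.num < b.num; };
    std::make_heap(tree, tree + n, byNum);  // arrange the array as a max-heap keyed on num
    std::sort_heap(tree, tree + n, byNum);  // repeatedly pop the max: ascending order of num
    std::reverse(tree, tree + n);           // flip to descending order: most frequent first
}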

4. Read and merge n files into one file all.txt.

Process:
loop (500 times) == append each bucket's top-100 file to the same all.txt -> count the number of lines and return it

int getALl(int &num) {

    num = 0; // line counter for the merged file (the caller passes an uninitialized int)

    ofstream ofile;
    ofile.open("C:\\Users\\lenovo\\Desktop\\大二上作业\\数据结构\\all.txt"); // destination: the merged file

    for (int q = 1; q <= 500; q++) {
        ifstream file;

        string filename = to_string(q) + ".txt"; // source: bucket q's top-100 file
        file.open("C:\\Users\\lenovo\\Desktop\\大二上作业\\数据结构\\100\\" + filename, ios::in);

        if (!file.is_open())
            break;

        string strLine;

        while (getline(file, strLine)) {
            if (strLine.empty())
                continue;
            ofile << strLine << endl;
            num += 1;
        }

        file.close();
    }

    ofile.close();
    return 0;
}

5. Create the dynamic node array for the merged data

Store data (the key) and num (the frequency), sort with a max-heap, and take out the 100 words with the highest frequency, writing the top 100 entries to a file. That completes the pipeline.

int main() {

    int dataLine = 0;
    getALl(dataLine); // merge the 500 per-bucket result files into all.txt and count its lines

    cout << dataLine;

    ifstream file;
    file.open("C:\\Users\\lenovo\\Desktop\\大二上作业\\数据结构\\all.txt", ios::in);
    ofstream ofile;
    ofile.open("C:\\Users\\lenovo\\Desktop\\大二上作业\\数据结构\\last.txt");

    node* tree = new node[dataLine]; // dynamic node array
    int startNum = 0;

    string strLine;

    while (getline(file, strLine)) {
        if (strLine.empty())
            continue;

        vector<string> v = split(strLine, ","); // each line has the form "word,frequency"
        tree[startNum].num = to_int(v[1]);
        tree[startNum].data = v[0];
        startNum++;
    }

    Heap_sort(tree, startNum); // max-heap sort of the filled entries, most frequent first
    cout << "***" << endl;

    int i = 0;
    while (i < startNum) // output the global top 100
    {
        cout << (i + 1) << "   " << tree[i].num << "  " << tree[i].data << endl;
        ofile << (i + 1) << "   " << tree[i].num << "  " << tree[i].data << endl;
        i++;

        if (i >= 100 && (i >= startNum || tree[i].num != tree[i - 1].num))
            break; // keep entries beyond 100 only while they tie the last one written
    }
    delete[] tree;
    file.close();
    ofile.close();
}
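The split and to_int helpers used in main() are also not shown in the post. A minimal sketch of what they could look like (assumed, not the original implementation), where each line of all.txt has the form "word,frequency":

#include <string>
#include <vector>

// Split a string on a delimiter into its fields.
std::vector<std::string> split(const std::string& s, const std::string& delim) {
    std::vector<std::string> parts;
    std::size_t start = 0, pos;
    while ((pos = s.find(delim, start)) != std::string::npos) {
        parts.push_back(s.substr(start, pos - start));
        start = pos + delim.size();
    }
    parts.push_back(s.substr(start)); // last field after the final delimiter
    return parts;
}

// Parse the frequency field into an int.
int to_int(const std::string& s) {
    return std::stoi(s);
}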

4. Program execution: initial values and results printed at run time

1. Generate the big data with Python on the server


Generate file:

2. Split into 500 bucket files

The source file is 85 MB, and the splitting program's memory footprint is about 4 MB.


3. Read in each small file, use hash_map to count the words appearing in each file and their frequencies, build the dynamic node array storing data (key) and num (frequency), sort with a max-heap, and take out the 100 most frequent words.


4. Finally, the top 100 results are generated.

5. Analysis of experimental results, experimental gains and experience

1. This kind of big data question is actually very important. If a 10 GB file had to be loaded into 1 GB of memory, the memory requirement could not be met directly; the key idea is divide and conquer, processing the data in blocks. For example, splitting a 10 GB file into 500 hash buckets gives chunks of roughly 20 MB each, which fit in memory easily.

2. At first I created the node arrays without ever calling delete, which caused the memory footprint to soar to about 1 GB; after adding delete[], it stays under 4 MB. This shows how much the code's handling of resources matters and how important it is to deal with resource usage properly.

3. I used hash_map because its lookups are very fast (the time complexity is very low), and its underlying idea is also divide and conquer, similar to the bucket-partitioning principle used here.
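hash_map comes from an old pre-standard header; on a modern compiler the same counting step can be written with std::unordered_map. A minimal sketch with made-up sample words (not the report's original code):

#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    std::vector<std::string> words = {"ab", "cde", "ab", "fg", "ab"};  // stand-in for one bucket file
    std::unordered_map<std::string, int> freq;                         // word -> count, expected O(1) per lookup
    for (const auto& w : words)
        ++freq[w];
    for (const auto& kv : freq)
        std::cout << kv.first << "," << kv.second << "\n";             // e.g. "ab,3"
}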

6. Suggestions for improving the experiment

The experiment here is based on the idea of divide and conquer, but it does not have to run on a single local machine. Once the data reaches a certain scale, distributed computing can be used: divide the data among different worker machines, have each worker process its share and send the result to a central machine, and after the central machine synchronizes the aggregated counts, divide the work out again, looping until the statistics are complete. A rough sketch of the central aggregation step follows below.
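The post only describes this distributed idea in prose. As one possible illustration (an assumption, not part of the original program), the central machine's aggregation step could merge the partial word counts returned by the workers like this:

#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Merge the partial word->count maps produced by the sub-machines into one total map.
std::unordered_map<std::string, long long>
merge_partial_counts(const std::vector<std::unordered_map<std::string, long long>>& partials) {
    std::unordered_map<std::string, long long> total;
    for (const auto& part : partials)     // one map per sub-machine
        for (const auto& kv : part)
            total[kv.first] += kv.second; // counts for the same word add up
    return total;
}

int main() {
    std::vector<std::unordered_map<std::string, long long>> partials = {
        {{"ab", 3}, {"cd", 1}},  // counts reported by sub-machine 1
        {{"ab", 2}, {"ef", 5}}   // counts reported by sub-machine 2
    };
    for (const auto& kv : merge_partial_counts(partials))
        std::cout << kv.first << "," << kv.second << "\n";  // prints ab,5 cd,1 ef,5 in some order
}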

Also, because the amount of data here is still not very large, 500 buckets were enough; with more data, more buckets can be used.

Furthermore, processing the 15 million entries end to end takes about 290 seconds. My machine is an i5 running at 3.7 GHz with 20 GB of memory; other machines may take a different amount of time, and most of the time is spent reading lines of data. I considered multi-threaded processing, memory mapping, and so on, but since the point of the exercise is processing big data with little memory, multi-threading on a single machine adds little; the distributed processing described above would be the better direction.

Origin blog.csdn.net/Touale/article/details/112562590