Data Structures Course Project: DBLP Scientific Literature Management System (2) – Author Sorting, Hotspot Analysis, and Fuzzy Search (Bucket Sort, Trie)

        The sorting computation is carried out while the database is being built, and the results are written to a file, so the front end does not need to do any additional work.

Author statistics function

        After the author statistics function is called, the program reads the author names one by one while maintaining a dictionary tree (trie). Each tree node stores a weight that counts how many times the author name ending at that node appears. Finally, the trie is traversed to obtain the exact occurrence count of every author. Because the largest occurrence count is much smaller than the total number of authors, bucket sorting gives the best efficiency here: the sort finishes in O(n) time and uses O(n) space. After sorting, the results are written to the file in order.

        The dictionary tree (also called a word search tree or trie) is a variant of the hash tree. It is typically used to count, sort, and store large numbers of strings (though it is not limited to strings), so search engine systems often use it for word-frequency statistics. Its advantages: it exploits the common prefixes of strings to reduce query time and minimize unnecessary string comparisons, giving higher query efficiency than a hash tree.

#include <vector>
#include <string>
using std::vector;
using std::string;
typedef unsigned long DWORD;   // stands in for the Windows DWORD typedef (windows.h)

// Trie node: one character per node, a weight counting how many names end here,
// and a list of child nodes.
class node
{
public:
	static const node* Not_found;
	char str;                  // character stored at this node
	DWORD weight;              // number of author names ending at this node
	vector<node*> child;       // children, one per distinct next character
	node(char Strings = '\0', DWORD weights = 0) {
		str = Strings;
		weight = weights;
		child.clear();
	}
	// Create a new child node holding the given character and return it.
	node* add_child(char String, DWORD weights = 0) {
		node* ans = new node(String, weights);
		child.push_back(ans);
		return ans;
	}
	// Linear scan of the children for the given character; NULL if absent.
	node* find_child(char String) {
		for (vector<node*>::iterator now = child.begin(); now != child.end(); now++) {
			if ((*now)->str == String)
				return *now;
		}
		return NULL;
	}
};
const node* node::Not_found = NULL;
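
        A minimal sketch (not part of the original code) of how an author name could be inserted into this trie: walk down with find_child, create missing nodes with add_child, and increase the weight at the final node. The helper name insert_name is illustrative only.

// Sketch only: insert one author name into the trie rooted at `root`
// and increment the count stored at its final node.
void insert_name(node* root, const string& name)
{
	node* now = root;
	for (size_t i = 0; i < name.length(); i++) {
		node* next = now->find_child(name[i]);
		if (next == NULL)                  // character not present yet
			next = now->add_child(name[i]);
		now = next;
	}
	now->weight++;                         // one more occurrence of this name
}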

// A flattened (name, occurrence count) record produced by traversing the trie;
// the comparison operators order records by their weight only.
class authors
{
private:
public:
	string name;
	DWORD weight;
	authors(const string& names, const DWORD& weights) {
		name = names;
		weight = weights;
	}
	// Pre-increment: record one more occurrence of this author.
	DWORD operator++() {
		return ++weight;
	}
	// All comparisons are by occurrence count only.
	bool operator < (const authors& tar) {
		return this->weight < tar.weight;
	}
	bool operator > (const authors& tar) {
		return this->weight > tar.weight;
	}
	bool operator >= (const authors& tar) {
		return this->weight >= tar.weight;
	}
	bool operator <= (const authors& tar) {
		return this->weight <= tar.weight;
	}
	authors& operator =(const authors& tar) {
		name = tar.name;
		weight = tar.weight;
		return *this;
	}
};
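
        The bucket sort step itself is not shown in the original post; the following is a minimal sketch of it, assuming the trie has already been flattened into a vector<authors>. Because the largest occurrence count is far smaller than the number of authors, one bucket per count value suffices and the whole pass runs in O(n). The helper name bucket_sort_by_weight is illustrative.

// Sketch only: bucket-sort authors by occurrence count, highest count first.
vector<authors> bucket_sort_by_weight(const vector<authors>& input)
{
	DWORD max_weight = 0;
	for (size_t i = 0; i < input.size(); i++)          // find the largest count
		if (input[i].weight > max_weight) max_weight = input[i].weight;

	vector< vector<authors> > buckets(max_weight + 1); // one bucket per count value
	for (size_t i = 0; i < input.size(); i++)
		buckets[input[i].weight].push_back(input[i]);

	vector<authors> sorted;                            // collect from high count to low
	for (DWORD w = max_weight + 1; w-- > 0; )
		for (size_t j = 0; j < buckets[w].size(); j++)
			sorted.push_back(buckets[w][j]);
	return sorted;
}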

Hotspot analysis function

        Similar to the author statistics function, after the hotspot analysis function is called the program reads every title word by word while maintaining a dictionary tree, recording a weight at each tree node that counts how often the word ending at that node occurs. Finally, the trie is traversed to obtain the exact occurrence count of every word. Since there are large numbers of identical words, bucket sorting is again used for efficiency: the sort finishes in O(n + m) time and uses O(n + m) space, and the results are then written to the file in order. In testing, bucket sorting remained faster than the other sorting methods tried (quick sort, cocktail sort, heap sort, merge sort, and Shell sort); given the characteristics and scale of the data, the running time of every method except bucket sort was unacceptable.

         (Common stop words such as "is", "an", "are", and so on are filtered out here.)
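
        A minimal sketch of that filtering step, assuming titles are whitespace-separated and reusing the insert_name sketch shown earlier; the stop-word list and the helper name count_title_words are illustrative, not taken from the original code.

#include <set>
#include <sstream>

// Sketch only: split a title into words, skip common stop words,
// and insert the remaining words into the word-frequency trie.
void count_title_words(node* root, const string& title)
{
	static const char* stop[] = { "a", "an", "is", "are", "the", "of", "and" };
	static const std::set<string> stop_words(stop, stop + 7);

	std::istringstream iss(title);
	string word;
	while (iss >> word) {                  // whitespace-separated words
		if (stop_words.count(word) == 0)   // keep only non-stop words
			insert_name(root, word);       // same trie insertion as for author names
	}
}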

 

Partial match search function (fuzzy search)

        To make the search fast, a dictionary tree must be built in memory before any fuzzy search is performed (this takes roughly 700 MB of memory, which has to be released again afterwards).

// Launch a multithreaded fuzzy search for `tar` and write the matches to a log
// file under `_saveUrl`. Returns the number of matches (0 on failure).
// is_initial, bags, answers, finished_flag, change_fuzzy_string, fuzzy_target,
// max_thread_num, kill_me, calculate_thread, check_finish and Hash4 are members
// of the surrounding class defined elsewhere; the file also needs <thread>,
// <fstream>, <iostream> and windows.h (for Sleep), with using namespace std.
int fuzzy_search(char* _saveUrl, DWORD thread_num, char* tar)
{
	if (!is_initial) {
		cout << "not initialized" << endl;
		return false;
	}
	if (thread_num > 256)thread_num = 256;
	max_thread_num = thread_num;
	kill_me = false;
	DWORD numb = bags.size() / thread_num;      // records per worker thread
	fuzzy_target = tar;
	for (int i = 0; i < max_thread_num; i++) {  // reset per-thread state
		answers[i].clear();
		finished_flag[i] = false;
		change_fuzzy_string[i] = true;
	}
	// Split the records evenly: each worker scans its own range of `bags`.
	for (int i = 0; i < thread_num - 1; i++) {
		thread* tmp = new thread(calculate_thread, _saveUrl, numb * i, numb * (i + 1), i);
		tmp->detach();
	}
	thread* tmp = new thread(calculate_thread, _saveUrl, numb * (thread_num - 1), bags.size() - 1, thread_num - 1);
	tmp->detach();
	while (!check_finish()) { Sleep(100); }     // wait for all workers to finish
	if (kill_me)return false;                   // search was cancelled
	kill_me = true;
	// Make sure the search-log directory exists, then write the results to a
	// file named after the hash of the search string.
	string paths = "mkdir " + (string)_saveUrl + "database\\searchlog  2>nul";
	char* pathss = new char[paths.length() + 1];
	for (int i = 0; i < paths.length(); i++) {
		pathss[i] = paths[i];
	}
	pathss[paths.length()] = '\0';
	system(pathss);
	fstream outfile((string)_saveUrl + (string)"database\\searchlog\\" + Hash4(fuzzy_target) + ".db", ios::out);
	int tot = 0;
	for (int i = 0; i < thread_num; i++) {      // merge each thread's matches
		for (int j = 0; j < answers[i].size(); j++) {
			outfile << answers[i][j] << endl;
			tot++;
		}
		answers[i].clear();
	}
	outfile.close();
	return tot;
}

        There are two kinds of fuzzy search: whole-word matching and complete fuzzy matching. The difference is that complete fuzzy matching also matches parts of a word, while whole-word matching does not. The search procedure used by the two modes is exactly the same, so only complete fuzzy matching is described here, as shown in the figure.
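
        As an illustration of that difference only (a sketch, not the project's code): complete fuzzy matching accepts any substring hit, while whole-word matching additionally requires the hit to be bounded by non-letter characters. The helper name matches is made up for this example; the project performs the substring search with KMP.

#include <cctype>

// Sketch only: substring match versus whole-word match.
bool matches(const string& text, const string& pattern, bool whole_word)
{
	size_t pos = text.find(pattern);           // substring search (KMP in the project)
	while (pos != string::npos) {
		if (!whole_word) return true;          // complete fuzzy match: any hit counts
		bool left_ok  = (pos == 0) || !isalpha((unsigned char)text[pos - 1]);
		bool right_ok = (pos + pattern.length() == text.length())
		                || !isalpha((unsigned char)text[pos + pattern.length()]);
		if (left_ok && right_ok) return true;  // whole-word match: bounded by non-letters
		pos = text.find(pattern, pos + 1);     // otherwise keep scanning
	}
	return false;
}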

         After initialization, when the fuzzy search function is called a large number of threads are created and the work is divided evenly among them. Using the KMP substring matching algorithm, each thread stores its matches in its own vector container. Once all threads have finished, the main thread computes the hash value of the search string and writes the combined results to the corresponding file. The matching itself completes within 100 ms in every case tested; however, because the results must be written to a file, I/O becomes the bottleneck: with few results the search returns within 1 s, while with many results it can take 5 s or more.
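
        A minimal sketch of the KMP substring matching each worker thread performs on its share of the records; kmp_find below is an illustrative stand-alone version, not the project's calculate_thread.

// Sketch only: standard KMP search, true if `pattern` occurs in `text`.
bool kmp_find(const string& text, const string& pattern)
{
	if (pattern.empty()) return true;
	// Build the failure (partial-match) table for the pattern.
	vector<int> fail(pattern.length(), 0);
	for (size_t i = 1, k = 0; i < pattern.length(); i++) {
		while (k > 0 && pattern[i] != pattern[k]) k = fail[k - 1];
		if (pattern[i] == pattern[k]) k++;
		fail[i] = (int)k;
	}
	// Scan the text, reusing previous partial matches instead of backtracking.
	for (size_t i = 0, k = 0; i < text.length(); i++) {
		while (k > 0 && text[i] != pattern[k]) k = fail[k - 1];
		if (text[i] == pattern[k]) k++;
		if (k == pattern.length()) return true;
	}
	return false;
}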

Origin blog.csdn.net/m0_51776409/article/details/124762788