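Data Structures Course Project: DBLP Scientific Literature Management System (Part 2): Author Ranking, Hot-Topic Analysis, and Fuzzy Search (Bucket Sort, Trie)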

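All sorting is done while the database is being built, and the results are written to files. The front end does no additional computation.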

Author statistics

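When the author-statistics function is called, the program reads the author names in one at a time while maintaining a trie. Each node of the trie carries a weight that counts how many times the author name ending at that node has appeared. Once all names are read, the trie is traversed to obtain the exact count for every author. Because the counts are far smaller than the total number of authors, bucket sort (one bucket per count value) gives the best sorting efficiency here: the sort finishes in O(n) time and uses O(n) space. The sorted result is then written to a file in order.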

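A trie (also called a prefix tree or word-lookup tree) is sometimes described as a variant of the hash tree. It is typically used to count, sort, and store large numbers of strings (though it is not limited to strings), which is why search engines often use it for word-frequency statistics. Its advantage is that strings sharing a prefix share a path in the tree, so unnecessary string comparisons are kept to a minimum and a lookup costs time proportional to the length of the key rather than to the number of stored strings.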

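	// Excerpt: assumes <windows.h> (for DWORD), <vector>, and `using namespace std;`.
	// One trie node. Note that child nodes are allocated with new and never freed here.
	class node
	{
	public:
		static const node* Not_found;	// sentinel for "no such node"; always NULL
		char str;	// character stored at this node
		DWORD weight;	// how many times the string ending at this node has appeared
		vector<node*> child;	// child nodes, searched linearly
		node(char Strings = '\0', DWORD weights = 0) : str(Strings), weight(weights) {}
		// Append a new child holding the given character and return it.
		node* add_child(char String, DWORD weights = 0) {
			node* ans = new node(String, weights);
			child.push_back(ans);
			return ans;
		}
		// Return the child holding the given character, or NULL if there is none.
		node* find_child(char String) {
			for (vector<node*>::iterator now = child.begin(); now != child.end(); now++) {
				if ((*now)->str == String)
					return *now;
			}
			return NULL;
		}
	};
	const node* node::Not_found = NULL;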
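
The counting step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: `TrieNode`, `trie_insert`, and `trie_collect` are hypothetical names, and the children are kept in a `std::map` owned through `std::unique_ptr` (an assumption made so that the tree frees itself) rather than in the `vector<node*>` used by the `node` class.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the article's `node` class.
struct TrieNode {
    std::uint32_t weight = 0;  // how many times the string ending here was inserted
    std::map<char, std::unique_ptr<TrieNode>> child;
};

// Walk (and create) one node per character, then bump the counter at the end.
void trie_insert(TrieNode& root, const std::string& name) {
    TrieNode* cur = &root;
    for (char c : name) {
        std::unique_ptr<TrieNode>& next = cur->child[c];
        if (!next) next = std::make_unique<TrieNode>();
        cur = next.get();
    }
    ++cur->weight;
}

// Depth-first traversal that rebuilds each stored string from the path
// and reports it together with its count.
void trie_collect(const TrieNode& cur, std::string& prefix,
                  std::vector<std::pair<std::string, std::uint32_t>>& out) {
    if (cur.weight > 0) out.emplace_back(prefix, cur.weight);
    for (const auto& kv : cur.child) {
        prefix.push_back(kv.first);
        trie_collect(*kv.second, prefix, out);
        prefix.pop_back();
    }
}
```

Inserting every author name with `trie_insert` and then calling `trie_collect` once on the root yields one (name, count) pair per distinct author, which is the input the sorting step needs.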

	// An author name together with its occurrence count.
	class authors
	{
	public:
		string name;
		DWORD weight;	// occurrence count
		authors(const string& names, const DWORD& weights) : name(names), weight(weights) {}
		// Pre-increment bumps the occurrence count.
		DWORD operator++() {
			return ++weight;
		}
		// The comparisons order authors by occurrence count only.
		bool operator < (const authors& tar) const {
			return this->weight < tar.weight;
		}
		bool operator > (const authors& tar) const {
			return this->weight > tar.weight;
		}
		bool operator >= (const authors& tar) const {
			return this->weight >= tar.weight;
		}
		bool operator <= (const authors& tar) const {
			return this->weight <= tar.weight;
		}
		authors& operator =(const authors& tar) {
			name = tar.name;
			weight = tar.weight;
			return *this;
		}
	};
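
The bucket sort itself is not shown in the excerpts, so here is a minimal sketch of the idea under some assumptions: records are plain (name, count) pairs (the hypothetical `Entry` type), there is one bucket per count value, and the output is in descending order of count, which the article does not state explicitly.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record type: (name, occurrence count).
using Entry = std::pair<std::string, std::uint32_t>;

// Bucket sort keyed on the count, with one bucket per possible count value.
// Runs in O(n + m) time, where n is the number of entries and m the largest count.
std::vector<Entry> bucket_sort_desc(const std::vector<Entry>& items) {
    std::uint32_t max_count = 0;
    for (const Entry& e : items)
        if (e.second > max_count) max_count = e.second;

    std::vector<std::vector<Entry>> buckets(static_cast<std::size_t>(max_count) + 1);
    for (const Entry& e : items) buckets[e.second].push_back(e);

    std::vector<Entry> sorted;
    sorted.reserve(items.size());
    // Read the buckets from the largest count down to get a descending order.
    for (std::size_t c = buckets.size(); c-- > 0;)
        for (const Entry& e : buckets[c]) sorted.push_back(e);
    return sorted;
}
```

When the largest count m is much smaller than n, the O(n + m) bound reduces to the O(n) quoted for the author statistics. Entries with equal counts keep their input order.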

Hot-topic analysis

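This works much like the author statistics. When the hot-topic analysis function is called, the program reads every title word by word while maintaining a trie, and each node carries a weight that counts how often the word ending at that node appears. The trie is then traversed to obtain the exact count for every word. Because a great many words share the same count, bucket sort is again used for its efficiency: the sort finishes in O(n+m) time and uses O(n+m) space (presumably with n the number of distinct words and m the number of buckets). The sorted result is then written to a file in order. In testing, bucket sort was still faster than the alternatives: quicksort, cocktail shaker sort, heapsort, merge sort, and Shell sort were all tried, and given the characteristics and size of the data, the running time of every method other than bucket sort was unacceptable.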

(In practice some common words, such as "is", "an", and "are", are filtered out here.)
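
The word-splitting and filtering step might look like the sketch below. Everything in it is an assumption for illustration: the function name `title_keywords`, the choice to lowercase words and to split on any non-letter, and the stop list, which contains only the three words the article mentions.

```cpp
#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

// Split a title into lowercase alphabetic words and drop stop words.
std::vector<std::string> title_keywords(const std::string& title) {
    // Illustrative stop list only; the article names just "is", "an", "are".
    static const std::unordered_set<std::string> stop_words = {"is", "an", "are"};
    std::vector<std::string> words;
    std::string cur;
    // Emit the word collected so far unless it is empty or a stop word.
    auto flush = [&]() {
        if (!cur.empty() && stop_words.count(cur) == 0) words.push_back(cur);
        cur.clear();
    };
    for (unsigned char ch : title) {
        if (std::isalpha(ch)) cur.push_back(static_cast<char>(std::tolower(ch)));
        else flush();  // any non-letter ends the current word
    }
    flush();
    return words;
}
```

Each returned word would then be inserted into the trie exactly as an author name is.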

Partial-match search (fuzzy search)

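To make searching fast, a trie must be built in memory before a fuzzy search is run (this takes about 700 MB of memory, and there is a corresponding routine to release the data).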

	// Search all records for `tar` using `thread_num` worker threads and write the
	// matches to <_saveUrl>database\searchlog\<hash of tar>.db.
	// Returns the number of matches, or 0 if not initialized or cancelled.
	int fuzzy_search(char* _saveUrl, DWORD thread_num, char* tar)
	{
		if (!is_initial) {
			cout << "not initialized" << endl;
			return 0;
		}
		if (thread_num == 0)thread_num = 1;	// avoid dividing by zero below
		if (thread_num > 256)thread_num = 256;
		max_thread_num = thread_num;
		kill_me = false;
		DWORD numb = bags.size() / thread_num;	// records per thread
		fuzzy_target = tar;
		for (int i = 0; i < max_thread_num; i++) {
			answers[i].clear();
			finished_flag[i] = false;
			change_fuzzy_string[i] = true;
		}
		// Each worker gets a contiguous range; the last one also takes the remainder.
		// NOTE: the last range ends at bags.size() - 1 while the others end at
		// numb * (i + 1); whether the end is inclusive depends on calculate_thread
		// (not shown), so one of the two forms may be off by one.
		for (int i = 0; i + 1 < (int)thread_num; i++) {
			thread(calculate_thread, _saveUrl, numb * i, numb * (i + 1), i).detach();
		}
		thread(calculate_thread, _saveUrl, numb * (thread_num - 1), bags.size() - 1, thread_num - 1).detach();
		while (!check_finish()) { Sleep(100); }	// poll until every worker has finished
		if (kill_me)return 0;
		kill_me = true;
		// Create the output directory; errors (e.g. it already exists) are discarded.
		string paths = "mkdir " + (string)_saveUrl + "database\\searchlog  2>nul";
		system(paths.c_str());
		fstream outfile((string)_saveUrl + (string)"database\\searchlog\\" + Hash4(fuzzy_target) + ".db", ios::out);
		int tot = 0;
		// Merge the per-thread results into one file, in thread order.
		for (int i = 0; i < (int)thread_num; i++) {
			for (size_t j = 0; j < answers[i].size(); j++) {
				outfile << answers[i][j] << endl;
				tot++;
			}
			answers[i].clear();
		}
		outfile.close();
		return tot;
	}
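
The way the work is divided among threads can be sketched in isolation. This is a simplified illustration rather than the project's code: `split_ranges` and `parallel_filter` are hypothetical names, the ranges are half-open so that no record is skipped or scanned twice, and the threads are joined instead of being detached and polled, which is a substitution made here to keep the sketch self-contained.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Half-open ranges [begin, end) that together cover 0..total exactly once.
std::vector<std::pair<std::size_t, std::size_t>> split_ranges(std::size_t total,
                                                              std::size_t parts) {
    if (parts == 0) parts = 1;
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    const std::size_t chunk = total / parts;
    for (std::size_t t = 0; t < parts; ++t) {
        const std::size_t begin = chunk * t;
        // The last range also takes the remainder of the division.
        const std::size_t end = (t + 1 == parts) ? total : begin + chunk;
        ranges.emplace_back(begin, end);
    }
    return ranges;
}

// Scan each range on its own thread, collecting matches into a per-thread
// vector, then merge the vectors in thread order.
std::vector<std::string> parallel_filter(
        const std::vector<std::string>& items, std::size_t thread_num,
        const std::function<bool(const std::string&)>& matches) {
    const auto ranges = split_ranges(items.size(), thread_num);
    std::vector<std::vector<std::string>> answers(ranges.size());
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < ranges.size(); ++t) {
        const std::pair<std::size_t, std::size_t> r = ranges[t];
        workers.emplace_back([&items, &answers, &matches, r, t] {
            for (std::size_t i = r.first; i < r.second; ++i)
                if (matches(items[i])) answers[t].push_back(items[i]);
        });
    }
    for (std::thread& w : workers) w.join();  // wait for all workers, no polling

    std::vector<std::string> merged;
    for (const auto& part : answers)
        merged.insert(merged.end(), part.begin(), part.end());
    return merged;
}
```

Because every thread writes only to its own `answers[t]`, no locking is needed until the merge, which happens after all workers have been joined.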

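Fuzzy search comes in two modes: whole-word matching and fully fuzzy matching. The difference is that fully fuzzy matching also accepts a match against part of a word, whereas whole-word matching does not. Both modes search in exactly the same way, so only fully fuzzy matching is described here, as shown in the figure: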

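After initialization, a call to the fuzzy search function creates a large number of threads (the count is chosen by the caller and capped at 256) and divides the records evenly among them. Each thread runs the KMP substring-matching algorithm over its share and stores every match in its own vector. When all threads have finished, the main thread computes a hash of the search string and writes the results to the file named after that hash. In every case the program completes the matching itself within 100 ms. The results still have to be written to a file, however, so I/O is the limiting factor: with few results the call returns within 1 s, while with many results it can take 5 s or more.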
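
For reference, here is a minimal sketch of KMP matching with an optional whole-word check. It is a textbook formulation rather than the project's `calculate_thread`, which is not shown; the names `build_failure` and `kmp_contains` are hypothetical, and "whole word" is assumed to mean that the match is not directly preceded or followed by a letter or digit.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Failure table: fail[i] is the length of the longest proper prefix of
// pattern[0..i] that is also a suffix of it.
std::vector<std::size_t> build_failure(const std::string& pattern) {
    std::vector<std::size_t> fail(pattern.size(), 0);
    for (std::size_t i = 1, k = 0; i < pattern.size(); ++i) {
        while (k > 0 && pattern[i] != pattern[k]) k = fail[k - 1];
        if (pattern[i] == pattern[k]) ++k;
        fail[i] = k;
    }
    return fail;
}

// Returns true if `pattern` occurs in `text`. With whole_word set, an
// occurrence only counts when it is not flanked by letters or digits.
bool kmp_contains(const std::string& text, const std::string& pattern,
                  bool whole_word = false) {
    if (pattern.empty()) return true;
    const std::vector<std::size_t> fail = build_failure(pattern);
    auto is_word_char = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0;
    };
    for (std::size_t i = 0, k = 0; i < text.size(); ++i) {
        while (k > 0 && text[i] != pattern[k]) k = fail[k - 1];
        if (text[i] == pattern[k]) ++k;
        if (k == pattern.size()) {
            const std::size_t start = i + 1 - k;
            const bool left_ok = start == 0 || !is_word_char(text[start - 1]);
            const bool right_ok = i + 1 == text.size() || !is_word_char(text[i + 1]);
            if (!whole_word || (left_ok && right_ok)) return true;
            k = fail[k - 1];  // rejected occurrence: keep looking for a later one
        }
    }
    return false;
}
```

With these definitions, `kmp_contains("data structures", "struct")` accepts the title in fully fuzzy mode, while the same call with `whole_word` set to `true` rejects it because "struct" is only part of the word "structures". The scan is linear in the combined length of text and pattern.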
