Overall architecture

Let's go straight to the code. The scripts run in the following order:

#python get_novel_info_from_feed_monitor.py ./data/novel_info.txt
#python get_video_info_from_video_film.py ./data/video_info.txt
#python get_star_info_from_video_film.py ./data/star_info.txt

#python stat_searchquery_times.py ./data/mid_searchquerys_20190331_31 ./data/searchquery_times

##############
#python analyse_searchquery.py ./data/novel_info.txt ./data/video_info.txt ./data/game_info.txt ./data/qingse_keyword.txt ./data/searchquery_times ./data/searchquery_times_analyse

#python stat_entity_searchquerynumber_searchquerytimes.py ./data/searchquery_times_analyse ./data/entity_searchquerynumber_searchquerytimes

#python cal_mid_entity_info.py ./data/searchquery_times_analyse ./data/mid_searchquerys_20190331_31 ./data/mid_searchquerys_entitys

Step-by-step walkthrough

Fetching novel_info.txt, video_info.txt and star_info.txt
novel_info.txt: exported from MySQL, fields title + hot

video_info.txt: exported from MySQL, fields dockey + doctype + hit_count + name + alias_name + serial + alais_serial

star_info.txt: exported from MySQL, fields star_id + name + alias_name + hit_count
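The export scripts themselves are not shown in this post. As a rough sketch of what such an export could look like (Python 3), the helper below runs a query through any DB-API connection and writes the rows as tab-separated lines. The function name `dump_query_to_tsv`, the table `novel` and its columns are hypothetical, and `sqlite3` only stands in for a MySQL driver so that the sketch runs on its own.

```python
import csv
import os
import sqlite3  # stand-in for a MySQL driver in this sketch
import tempfile


def dump_query_to_tsv(conn, sql, out_path):
    """Run sql on a DB-API connection and write each row as a tab-separated line."""
    cursor = conn.cursor()
    cursor.execute(sql)
    rows_written = 0
    with open(out_path, "w", encoding="utf-8", newline="") as fw:
        writer = csv.writer(fw, delimiter="\t", lineterminator="\n")
        for row in cursor.fetchall():
            writer.writerow(row)
            rows_written += 1
    return rows_written


if __name__ == "__main__":
    # Hypothetical table and data, only to make the sketch runnable.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE novel (title TEXT, hot INTEGER)")
    conn.executemany("INSERT INTO novel VALUES (?, ?)",
                     [("novel_a", 120), ("novel_b", 45)])
    out_path = os.path.join(tempfile.mkdtemp(), "novel_info_demo.txt")
    print(dump_query_to_tsv(conn, "SELECT title, hot FROM novel", out_path))
```

With a real MySQL driver only the connection line would change, since the helper relies on nothing beyond `cursor()`, `execute()` and `fetchall()`.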
stat_searchquery_times.py counts how many times each search query was issued and sorts the result. Each input record consists of mid + searchquery + times.

# accumulate the number of searches per query
searchquery = items3[0]
times = int(items3[2])
searchquery_times_dict[searchquery] = searchquery_times_dict.get(searchquery, 0) + times
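To make the counting step concrete, here is a minimal self-contained sketch in Python 3. It assumes each input line holds exactly three tab-separated columns (mid, searchquery, times); the real file layout is not shown in the post and may well differ, as the `items3` indexing above suggests. The function name `count_searchquery_times` is made up for illustration.

```python
from collections import Counter


def count_searchquery_times(lines):
    """Sum the times column per search query.

    Returns (searchquery, total) pairs sorted by total, descending.
    """
    totals = Counter()
    for line in lines:
        items = line.rstrip("\n").split("\t")
        if len(items) != 3:
            continue  # skip malformed records
        _mid, searchquery, times = items
        totals[searchquery] += int(times)
    # sort by total descending, then by query text for a reproducible order
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    sample = ["u1\tmovie a\t2\n", "u2\tmovie a\t3\n", "u1\tgame b\t1\n"]
    for searchquery, times in count_searchquery_times(sample):
        print(searchquery + "\t" + str(times))
```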

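The external files are game_info.txt and qingse_keyword.txt.

analyse_searchquery.py analyses search behaviour using the files produced so far.
Each input file is turned into a keyword index for lookup. See the earlier blog post "Sensitive word matching: implementing an Aho-Corasick automaton in Python with esmre". The example here uses the adult-content (qingse) keyword list: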

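import sys

import esm
import tqdm

def gen_qingse_index(file_path):
	# build an esmre keyword index from a file with one keyword per line
	qingse_index = esm.Index()
	# first pass: count the lines so the progress bar knows its total
	with open(file_path, "r") as f:
		line_num = sum(1 for _ in f)
	with tqdm.tqdm(total=line_num) as progress:
		valid_num = 0
		with open(file_path, "r") as f:
			for line in f:
				progress.update(1)
				qingse_index.enter(line.strip())
				valid_num += 1
	print(valid_num)
	# fix() finalises the index; it must be called before query()
	qingse_index.fix()
	return qingse_index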

def get_match_entity(index, searchquery):
	# query() returns ((start, end), keyword) for every keyword found in searchquery
	index_result = index.query(searchquery)
	match_entity_dict = {}
	for (st_end, match_entity) in index_result:
		# keep only matches that start at an even offset
		if st_end[0] % 2 == 0:
			match_entity_dict[match_entity] = True
	# comma-separated distinct matches, or '' when nothing matched
	return ','.join(match_entity_dict.keys())

qingse_index = gen_qingse_index(sys.argv[4])
qingse_result = get_match_entity(qingse_index, searchquery)

if len(qingse_result) > 0:
	output += 'qingse'

fw.write(searchquery + '\t' + str(times) + '\t' + output + '\n')
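For readers without esmre installed, the matching step can be reproduced with a small pure-Python Aho-Corasick automaton. This is only a sketch in Python 3, not esmre's implementation: the class name `KeywordIndex` is made up, and it merely mirrors the enter / fix / query call order used above. Offsets here are character positions in a `str`, so the even-offset filter from `get_match_entity` is deliberately left out.

```python
from collections import deque


class KeywordIndex:
    """Minimal Aho-Corasick automaton: enter() keywords, fix(), then query()."""

    def __init__(self):
        self.goto = [{}]   # goto[state][char] -> next state
        self.fail = [0]    # failure link per state
        self.out = [[]]    # keywords recognised at each state
        self.fixed = False

    def enter(self, keyword):
        if self.fixed:
            raise RuntimeError("index is already fixed")
        if not keyword:
            return  # ignore empty keywords
        state = 0
        for ch in keyword:
            nxt = self.goto[state].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = nxt
            state = nxt
        if keyword not in self.out[state]:
            self.out[state].append(keyword)

    def fix(self):
        # breadth-first pass that fills in failure links and merges outputs
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]
        self.fixed = True

    def query(self, text):
        """Return [((start, end), keyword), ...] for every occurrence in text."""
        if not self.fixed:
            raise RuntimeError("call fix() before query()")
        state, results = 0, []
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for keyword in self.out[state]:
                results.append(((i + 1 - len(keyword), i + 1), keyword))
        return results


if __name__ == "__main__":
    index = KeywordIndex()
    for word in ["he", "she", "hers"]:
        index.enter(word)
    index.fix()
    print(index.query("ushers"))
```

The automaton scans the query once regardless of how many keywords are indexed, which is why this approach suits large keyword lists such as the novel, video and game title files.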

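stat_entity_searchquerynumber_searchquerytimes.py computes, for each category, how many distinct search queries fall into it, how many times those queries were searched, and what share of the overall totals both figures represent.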

fw.write(entity + '\t' + str(searchquerynumber) + '\t' + str(searchquerytimes) + '\t' + str(searchquerynumber*1.0/total_searchquerynumber) + '\t' + str(searchquerytimes*1.0/total_searchquerytimes) + '\n')
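The aggregation behind that output line can be sketched as follows (Python 3). The record layout (searchquery, times, entity) and the function name `stat_entities` are assumptions for illustration; the sketch also assumes each query appears once per entity, as it would after the counting step.

```python
from collections import defaultdict


def stat_entities(records):
    """records: iterable of (searchquery, times, entity).

    Returns rows of (entity, query_count, total_times, query_share, times_share).
    """
    query_count = defaultdict(int)
    total_times = defaultdict(int)
    for _searchquery, times, entity in records:
        query_count[entity] += 1
        total_times[entity] += times
    all_queries = sum(query_count.values())
    all_times = sum(total_times.values())
    rows = []
    for entity in sorted(query_count):
        rows.append((
            entity,
            query_count[entity],
            total_times[entity],
            query_count[entity] / all_queries,
            # guard against an all-zero times column
            total_times[entity] / all_times if all_times else 0.0,
        ))
    return rows


if __name__ == "__main__":
    sample = [("q1", 3, "video"), ("q2", 1, "video"), ("q3", 4, "qingse")]
    for row in stat_entities(sample):
        print("\t".join(str(field) for field in row))
```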

cal_mid_entity_info.py computes the entity information for each user.
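The post shows no code for this step. One plausible reading, given the two input files on the command line, is a join between each user's queries and the labels assigned to those queries. The sketch below (Python 3) follows that reading; the function signature, the record layout and the shape of the result are all assumptions.

```python
from collections import defaultdict


def cal_mid_entity_info(mid_records, query_to_entities):
    """Aggregate search counts per user and entity.

    mid_records: iterable of (mid, searchquery, times).
    query_to_entities: dict mapping searchquery -> list of entity labels.
    Returns {mid: {entity: total_times}}.
    """
    result = defaultdict(lambda: defaultdict(int))
    for mid, searchquery, times in mid_records:
        # queries without any label contribute nothing
        for entity in query_to_entities.get(searchquery, []):
            result[mid][entity] += times
    return {mid: dict(entities) for mid, entities in result.items()}


if __name__ == "__main__":
    records = [("u1", "q1", 2), ("u1", "q2", 1), ("u2", "q1", 5)]
    labels = {"q1": ["video"], "q2": ["video", "novel"]}
    print(cal_mid_entity_info(records, labels))
```

The per-user totals can then serve as tags, for example by keeping the entities with the highest counts for each mid.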

iwtbs_kevin
