Tagging users by analyzing their search behavior


GitHub repository: https://github.com/iwtbs/user_searchquery_analyse

Overall architecture

Let's go straight to the code

#python get_novel_info_from_feed_monitor.py ./data/novel_info.txt
#python get_video_info_from_video_film.py ./data/video_info.txt
#python get_star_info_from_video_film.py ./data/star_info.txt

#python stat_searchquery_times.py ./data/mid_searchquerys_20190331_31 ./data/searchquery_times

##############
#python analyse_searchquery.py ./data/novel_info.txt ./data/video_info.txt ./data/game_info.txt ./data/qingse_keyword.txt ./data/searchquery_times ./data/searchquery_times_analyse

#python stat_entity_searchquerynumber_searchquerytimes.py ./data/searchquery_times_analyse ./data/entity_searchquerynumber_searchquerytimes

#python cal_mid_entity_info.py ./data/searchquery_times_analyse ./data/mid_searchquerys_20190331_31 ./data/mid_searchquerys_entitys

Step-by-step walkthrough

  1. Obtain novel_info.txt, video_info.txt, and star_info.txt
    novel_info.txt: fetched from MySQL; fields: title + hot
    video_info.txt: fetched from MySQL; fields: dockey + doctype + hit_count + name + alias_name + serial + alais_serial
    star_info.txt: fetched from MySQL; fields: star_id + name + alias_name + hit_count
  2. stat_searchquery_times.py counts how many times each search query was searched and sorts the result. Input file fields: mid + searchquery + times
searchquery = items3[0]
times = int(items3[2])
searchquery_times_dict[searchquery] = searchquery_times_dict.get(searchquery, 0) + times
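  3. The external files are game_info.txt and qingse_keyword.txt.
    analyse_searchquery.py combines the files above to analyze search behavior.
    Each category is handled the same way, by building a keyword index and querying it; see the earlier blog post "Sensitive-word matching: implementing an AC automaton in Python with esmre". The code below uses the erotic-content (qingse) keywords as the example.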
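import sys
import esm
import tqdm

# Python 2 code. Build an esm index with one keyword per line of file_path.
def gen_qingse_index(file_path):
	qingse_index = esm.Index()
	# Count the lines first so the progress bar knows its total.
	with open(file_path, "r") as f:
		line_num = sum(1 for _ in f)
	valid_num = 0
	with tqdm.tqdm(total=line_num) as progress:
		with open(file_path, "r") as f:
			for line in f:
				progress.update(1)
				keyword = line.strip()
				if not keyword:
					continue  # skip blank lines so no empty keyword is indexed
				qingse_index.enter(keyword)
				valid_num += 1
	print valid_num
	qingse_index.fix()  # the index must be fixed before it can be queried
	return qingse_index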

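# Return the distinct keywords found in searchquery, joined by commas ('' if none).
def get_match_entity(index, searchquery):
	# query() yields ((start, end), keyword) pairs with offsets into searchquery.
	index_result = index.query(searchquery)
	match_entity_dict = {}
	for (st_end, match_entity) in index_result:
		# Keep only matches that start at an even offset. Presumably this guards
		# against matches that straddle two characters in a 2-byte encoding.
		if st_end[0] % 2 == 0:
			match_entity_dict[match_entity] = True
	# Joining an empty key list already yields ''.
	return ','.join(match_entity_dict.keys())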

qingse_index = gen_qingse_index(sys.argv[4])
qingse_result = get_match_entity(qingse_index, searchquery)

if len(qingse_result) > 0:
	output += 'qingse'

fw.write(searchquery + '\t' + str(times) + '\t' + output + '\n')
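To make the tagging flow concrete without requiring esmre, here is a rough stand-in that uses plain substring checks in place of the AC automaton; it is slower on large keyword lists but yields the same kind of result. The category names, the sample keywords, the helper names, and the comma-joined tag column are all assumptions for illustration; the original script only shows 'qingse' being appended to output.

```python
def match_keywords(keywords, searchquery):
    """Return the keywords that occur in searchquery, in sorted order.

    Stand-in for querying an esm.Index: a plain substring scan,
    not an AC automaton.
    """
    return sorted(k for k in keywords if k and k in searchquery)


def tag_searchquery(category_keywords, searchquery):
    """Return the names of all categories with at least one keyword hit."""
    tags = []
    for category in sorted(category_keywords):
        if match_keywords(category_keywords[category], searchquery):
            tags.append(category)
    return tags


if __name__ == "__main__":
    # Hypothetical keyword lists; the real ones come from novel_info.txt,
    # video_info.txt, game_info.txt and qingse_keyword.txt.
    category_keywords = {
        "novel": {"three body"},
        "game": {"chess", "tetris"},
    }
    searchquery, times = "three body chess edition", 7
    tags = tag_searchquery(category_keywords, searchquery)
    # Three tab-separated columns: query, times, tags.
    print(searchquery + "\t" + str(times) + "\t" + ",".join(tags))
```

An AC automaton becomes worthwhile once the keyword lists are large, because it scans each query once regardless of how many keywords are indexed.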
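  4. stat_entity_searchquerynumber_searchquerytimes.py counts, for each category, how many distinct search queries it contains, how many times they were searched in total, and what share of the overall totals those two figures represent.
fw.write(entity + '\t' + str(searchquerynumber) + '\t' + str(searchquerytimes) + '\t' + str(searchquerynumber*1.0/total_searchquerynumber) + '\t' + str(searchquerytimes*1.0/total_searchquerytimes) + '\n')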
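The per-category aggregation might look like the sketch below. It is an assumption-laden illustration, not the repository's code: it takes already-parsed rows of (searchquery, times, tags), uses all rows (tagged or not) as the denominators, and the function name stat_entities is hypothetical. It targets Python 3, where / is true division.

```python
from collections import defaultdict


def stat_entities(rows):
    """Aggregate per-category statistics.

    rows: iterable of (searchquery, times, tags), where tags is a list of
    category names (an assumed layout).
    Returns {category: (query_count, times_sum, query_share, times_share)}.
    """
    rows = list(rows)
    total_number = len(rows)
    total_times = sum(times for _query, times, _tags in rows)
    number = defaultdict(int)
    times_sum = defaultdict(int)
    for _query, times, tags in rows:
        for tag in set(tags):  # count a query once per category
            number[tag] += 1
            times_sum[tag] += times
    result = {}
    for tag in number:
        result[tag] = (
            number[tag],
            times_sum[tag],
            number[tag] / total_number,
            times_sum[tag] / total_times if total_times else 0.0,
        )
    return result


if __name__ == "__main__":
    rows = [
        ("chess", 6, ["game"]),
        ("three body game", 3, ["game", "novel"]),
        ("weather", 1, []),
    ]
    for entity, stats in sorted(stat_entities(rows).items()):
        print("\t".join([entity] + [str(value) for value in stats]))
```

A query tagged with several categories contributes to each of them, so the shares across categories can add up to more than 1.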
  5. cal_mid_entity_info.py computes the entity information for each user (mid).
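The post gives no details for this step, so the following is only one plausible reading: join each user's queries against the tagged queries and sum the search counts per category. The input shapes, the output shape, and the function name cal_mid_entities are all assumptions.

```python
from collections import defaultdict


def cal_mid_entities(query_tags, mid_rows):
    """Sum, for every user id (mid), the search counts per category.

    query_tags: {searchquery: [category, ...]} (assumed to come from the
        tagged output of analyse_searchquery.py)
    mid_rows: iterable of (mid, searchquery, times)
    Returns {mid: {category: times_sum}}; users with no tagged query
    are omitted.
    """
    profile = defaultdict(lambda: defaultdict(int))
    for mid, searchquery, times in mid_rows:
        for tag in query_tags.get(searchquery, ()):
            profile[mid][tag] += times
    return {mid: dict(tags) for mid, tags in profile.items()}


if __name__ == "__main__":
    query_tags = {"chess": ["game"], "three body": ["novel"]}
    mid_rows = [
        ("u1", "chess", 4),
        ("u1", "three body", 1),
        ("u2", "chess", 2),
        ("u2", "weather", 9),
    ]
    for mid, tags in sorted(cal_mid_entities(query_tags, mid_rows).items()):
        print(mid + "\t" + ",".join(k + ":" + str(v) for k, v in sorted(tags.items())))
```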
Reposted from blog.csdn.net/qq_34219959/article/details/104245236