Tagging and analyzing user search behavior


GitHub: https://github.com/iwtbs/user_searchquery_analyse

Overall structure

Let's look at the code directly.

#python get_novel_info_from_feed_monitor.py ./data/novel_info.txt
#python get_video_info_from_video_film.py ./data/video_info.txt
#python get_star_info_from_video_film.py ./data/star_info.txt

#python stat_searchquery_times.py ./data/mid_searchquerys_20190331_31 ./data/searchquery_times

##############
#python analyse_searchquery.py ./data/novel_info.txt ./data/video_info.txt ./data/game_info.txt ./data/qingse_keyword.txt ./data/searchquery_times ./data/searchquery_times_analyse

#python stat_entity_searchquerynumber_searchquerytimes.py ./data/searchquery_times_analyse ./data/entity_searchquerynumber_searchquerytimes

#python cal_mid_entity_info.py ./data/searchquery_times_analyse ./data/mid_searchquerys_20190331_31 ./data/mid_searchquerys_entitys

Step-by-step breakdown

  1. Get novel_info.txt, video_info.txt, and star_info.txt
    novel_info.txt: fetched from MySQL; fields: title + hot
    video_info.txt: fetched from MySQL; fields: dockey + doctype + name + alias_name + hit_count + serial + alais_serial
    star_info.txt: fetched from MySQL; fields: star_id + name + alias_name + hit_count
  2. stat_searchquery_times.py sums the number of searches for each search query and sorts the result. The input file has the fields mid + searchquery + times.
searchquery = items3[0]
times = int(items3[2])
searchquery_times_dict[searchquery] = searchquery_times_dict.get(searchquery, 0) + times
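The counting step above can be sketched end to end as follows. This is a minimal illustration, not the repository's script: it assumes one tab-separated `mid<TAB>searchquery<TAB>times` record per line (the real input file may pack several queries into one user line), and the function name `count_query_times` is hypothetical.

```python
from collections import defaultdict


def count_query_times(lines):
    """Sum search counts per query across all users.

    Assumes each line is 'mid<TAB>searchquery<TAB>times'; this layout
    is an assumption made for the sketch.
    """
    totals = defaultdict(int)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed lines
        _mid, searchquery, times = parts
        totals[searchquery] += int(times)
    # Most-searched queries first; ties are broken by query text.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


sample = [
    "u1\tharry potter\t3",
    "u2\tharry potter\t2",
    "u2\tminecraft\t4",
]
for query, times in count_query_times(sample):
    print(query + "\t" + str(times))
```

Sorting by descending count puts the head queries at the top of the output file, which is convenient when inspecting the tagging results in the next step.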
  3. External files, including game_info.txt and qingse_keyword.txt
    analyse_searchquery.py combines the files from the previous steps to analyze search behavior.
    The idea is to build a keyword index and match each search query against it. For background, see the earlier blog post on sensitive-word matching, which uses esmre to implement an Aho-Corasick automaton in Python. Taking the erotic-content (qingse) keywords as an example:
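import esm
import tqdm

def gen_qingse_index(file_path):
	# Build an esmre keyword index from a file with one keyword per line.
	qingse_index = esm.Index()
	with open(file_path, "r") as f:
		keywords = [line.strip() for line in f]
	valid_num = 0
	with tqdm.tqdm(total=len(keywords)) as progress:
		for keyword in keywords:
			progress.update(1)
			if not keyword:
				continue  # skip blank lines
			qingse_index.enter(keyword)
			valid_num += 1
	print(valid_num)
	# fix() finalizes the index; no keywords can be entered afterwards.
	qingse_index.fix()
	return qingse_index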

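def get_match_entity(index, searchquery):
	# index.query() returns ((start, end), keyword) pairs, one per match.
	index_result = index.query(searchquery)
	match_entity_dict = {}
	for (st_end, match_entity) in index_result:
		# Keep only matches that start at an even offset. This presumably aligns
		# matches with character boundaries in a 2-byte encoding; the post does
		# not explain it, so treat that reading as an assumption.
		if st_end[0] % 2 == 0:
			match_entity_dict[match_entity] = True
	# Join the distinct matched keywords; the result is '' when nothing matched.
	return ','.join(match_entity_dict.keys())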

qingse_index = gen_qingse_index(sys.argv[4])
qingse_result = get_match_entity(qingse_index, searchquery)

if len(qingse_result) > 0:
	output += 'qingse'

fw.write(searchquery + '\t' + str(times) + '\t' + output + '\n')
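To show the tagging logic without the esmre dependency, the sketch below swaps the Aho-Corasick index for plain substring search, which gives the same answers on small keyword lists but scales far worse. The function names `load_keywords` and `tag_query`, the category labels, and the sample keywords are all hypothetical, and the even-offset filter from the code above is not reproduced.

```python
def load_keywords(lines):
    """Return the set of non-empty, stripped keywords."""
    return {line.strip() for line in lines if line.strip()}


def tag_query(searchquery, category_keywords):
    """Return the sorted category labels whose keywords occur in the query.

    category_keywords maps a label (e.g. 'novel') to a set of keywords.
    Plain substring search stands in for the esmre index here.
    """
    tags = []
    for label, keywords in category_keywords.items():
        if any(keyword in searchquery for keyword in keywords):
            tags.append(label)
    return sorted(tags)


# Illustrative keyword lists; the real ones come from the info files.
categories = {
    "novel": load_keywords(["three body\n", "\n"]),
    "game": load_keywords(["minecraft\n"]),
}
print(tag_query("three body audiobook", categories))
print(tag_query("weather today", categories))
```

For keyword files with many thousands of entries, an Aho-Corasick index such as the one esmre builds matches all keywords in a single pass over the query, which is why the original code uses it.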
  4. stat_entity_searchquerynumber_searchquerytimes.py computes, for each category, the number of distinct search queries, the total number of searches, and the share of the overall total that each represents.
fw.write(entity + '\t' + str(searchquerynumber) + '\t' + str(searchquerytimes) + '\t' + str(searchquerynumber*1.0/total_searchquerynumber) + '\t' + str(searchquerytimes*1.0/total_searchquerytimes) + '\n')
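The per-category statistics written above can be computed roughly as follows. This is a sketch under assumptions: each row is taken to be an already-parsed `(searchquery, times, tags)` tuple, and the totals used as denominators cover every row, tagged or not. The post does not say how the script defines its totals, and the name `entity_stats` is hypothetical.

```python
from collections import defaultdict


def entity_stats(rows):
    """rows: iterable of (searchquery, times, tags) tuples.

    Returns {entity: (query_count, times_sum, query_share, times_share)}.
    Shares are relative to all rows, tagged or not (an assumption).
    """
    query_count = defaultdict(int)
    times_sum = defaultdict(int)
    total_queries = 0
    total_times = 0
    for _searchquery, times, tags in rows:
        total_queries += 1
        total_times += times
        for entity in tags:
            query_count[entity] += 1
            times_sum[entity] += times
    stats = {}
    for entity in query_count:
        stats[entity] = (
            query_count[entity],
            times_sum[entity],
            query_count[entity] / total_queries,
            # Guard against a file in which every count is zero.
            times_sum[entity] / total_times if total_times else 0.0,
        )
    return stats


rows = [
    ("three body audiobook", 6, ["novel"]),
    ("three body game", 2, ["novel", "game"]),
    ("weather today", 2, []),
]
for entity, values in sorted(entity_stats(rows).items()):
    print(entity + "\t" + "\t".join(str(v) for v in values))
```

Because one query can carry several tags, the per-category shares do not have to sum to 1.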
  5. cal_mid_entity_info.py computes entity statistics for each user.
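The post gives no code for this last step, so the following is only a guess at its shape: join each user's queries with the tags produced by the analysis step, then sum search counts per user and category. The function name `user_entity_times`, the tuple layout, and the sample data are all hypothetical.

```python
from collections import defaultdict


def user_entity_times(user_rows, query_tags):
    """Aggregate search counts per user and category.

    user_rows: iterable of (mid, searchquery, times) tuples.
    query_tags: dict mapping a searchquery to its list of category tags.
    Returns {mid: {entity: total_times}}; users with no tagged
    queries are omitted.
    """
    profile = defaultdict(lambda: defaultdict(int))
    for mid, searchquery, times in user_rows:
        for entity in query_tags.get(searchquery, []):
            profile[mid][entity] += times
    # Convert nested defaultdicts to plain dicts for the caller.
    return {mid: dict(entities) for mid, entities in profile.items()}


user_rows = [
    ("u1", "three body audiobook", 3),
    ("u1", "minecraft mods", 1),
    ("u2", "weather today", 5),
]
query_tags = {
    "three body audiobook": ["novel"],
    "minecraft mods": ["game"],
}
print(user_entity_times(user_rows, query_tags))
```

The resulting per-user dictionary can serve as a simple interest profile, for example by picking the category with the highest total for each user.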

Origin blog.csdn.net/qq_34219959/article/details/104245236