[The 10th "Teddy Cup" Data Mining Challenge] Question C (Map Analysis of Surrounding-Tour Demand During the Epidemic): Question 2 Solution and Python Implementation


Related Links

(1) Question 1 solution and implementation (blog post)

(2) Question 2 solution and implementation (blog post)

(3) Question 3 solution and implementation: to be uploaded (pre-sale, to be released on the 22nd)

Code Download

(1) Question 1 complete code download

(2) Question 2 complete code and solution download

Note: the code below is incomplete; please download the complete code.

(3) Question 3 code and solution: pre-sale, to be released on the 22nd

1 Problem Statement for Question 2

For the full problem statement, see the first article:

[The 10th "Teddy Cup" Data Mining Challenge] Question C (Map Analysis of Surrounding-Tour Demand During the Epidemic): Question 1 Solution and Python Implementation

Question 2: Popularity analysis of local tourism products

Extract instances of tourism products, including scenic spots, hotels, Internet-famous attractions, homestays, specialty restaurants, rural tourism, cultural and creative products, etc., together with other useful information, and save them as the file "result2-1.csv" in the required format. Build a multi-dimensional popularity evaluation model for tourism products, analyze the popularity of the extracted products year by year, rank them, and save the results in the format of Table 3 as the file "result2-2.csv".

2 Approach

The approach computes product popularity from two dimensions: the frequency with which the product name appears, and a sentiment score.

Product names are extracted as keyword phrases with a Chinese implementation of the TextRank algorithm. Sentiment is scored with the cnsenti package: a positive text scores 1, a negative text -1, and a neutral text 0. The total popularity score is a weighted sum of the two dimensions.

The model can be further improved by scoring additional dimensions and combining them with appropriate weights.
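Concretely, with the weights used later in Section 4.3, every record gets 产品热度总分 = 2 × 出现频次 + 情感得分, where 出现频次 is the number of times the product name appears in that year and 情感得分 is the record's sentiment score. The per-year 产品热度 is this total divided by the sum of all totals of the same year, so the popularity values within each year sum to 1. For example, a record whose product is mentioned 10 times in 2019 and whose text is scored positive (+1) gets a total of 2 × 10 + 1 = 21 before normalization.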

4 Python implementation

from tqdm import tqdm
import pandas as pd
tqdm.pandas()


Hotel_reviews1 = pd.read_excel(
    './data/2018-2019茂名(含自媒体).xlsx', sheet_name=0)   # Hotel reviews
Scenic_reviews1 = pd.read_excel(
    './data/2018-2019茂名(含自媒体).xlsx', sheet_name=1)  # Scenic-spot reviews
Travel_tips1 = pd.read_excel(
    './data/2018-2019茂名(含自媒体).xlsx', sheet_name=2)   # Travel guides
Dining_reviews1 = pd.read_excel(
    './data/2018-2019茂名(含自媒体).xlsx', sheet_name=3)  # Dining reviews
Wechat_article1 = pd.read_excel(
    './data/2018-2019茂名(含自媒体).xlsx', sheet_name=4)  # WeChat official-account articles


Hotel_reviews2 = pd.read_excel(
    './data/2020-2021茂名(含自媒体).xlsx', sheet_name=0)   # Hotel reviews
Scenic_reviews2 = pd.read_excel(
    './data/2020-2021茂名(含自媒体).xlsx', sheet_name=1)  # Scenic-spot reviews
Travel_tips2 = pd.read_excel(
    './data/2020-2021茂名(含自媒体).xlsx', sheet_name=2)   # Travel guides
Dining_reviews2 = pd.read_excel(
    './data/2020-2021茂名(含自媒体).xlsx', sheet_name=3)  # Dining reviews
Wechat_article2 = pd.read_excel(
    './data/2020-2021茂名(含自媒体).xlsx', sheet_name=4)  # WeChat official-account articles

# Merge the 2018-2019 and 2020-2021 data for each source
Hotel_reviews = pd.concat([Hotel_reviews1, Hotel_reviews2], axis=0)      # Hotel reviews
Scenic_reviews = pd.concat([Scenic_reviews1, Scenic_reviews2], axis=0)   # Scenic-spot reviews
Travel_tips = pd.concat([Travel_tips1, Travel_tips2], axis=0)            # Travel guides
Dining_reviews = pd.concat([Dining_reviews1, Dining_reviews2], axis=0)   # Dining reviews
Wechat_article = pd.concat([Wechat_article1, Wechat_article2], axis=0)   # WeChat official-account articles
'''
A tourism product (tourism service product) consists of goods and services. It includes
itinerary-type products (attractions, transport, accommodation, dining and entertainment
facilities plus the corresponding services that travel agencies bundle and sell to tourists)
as well as activity-type products offered by individual businesses such as scenic areas and hotels.
'''

4.1 Extract tourism products

4.1.1 Data Preparation

For hotel reviews, scenic-spot reviews, and dining reviews, the tourism product (the hotel, scenic spot, or restaurant name) is given directly, so these records can be collected as-is. WeChat official-account articles and travel guides, however, do not name a specific tourism product, so a model is needed to extract the product names from the text.

Scenic_reviews.head(10)


def addstr(s):
    # Prefix the review ID to build the corpus ID (语料ID)
    return '景区评论-' + str(s)

# Unified fields for scenic-spot reviews: corpus ID, text, product name, year
Scenic_reviews['语料ID'] = Scenic_reviews['景区评论ID'].progress_apply(addstr)
Scenic_reviews['文本'] = Scenic_reviews['评论内容']
Scenic_reviews['产品名称'] = Scenic_reviews['景区名称']
Scenic_reviews['年份'] = pd.to_datetime(Scenic_reviews['评论日期']).dt.year
Hotel_reviews.head(10)


def addstr(s):
    return '酒店评论-'+str(s)

Hotel_reviews['语料ID'] = Hotel_reviews['酒店评论ID'].progress_apply(addstr)
Hotel_reviews['文本'] = Hotel_reviews['评论内容']
Hotel_reviews['产品名称'] = Hotel_reviews['酒店名称']
Hotel_reviews['年份'] = pd.to_datetime(Hotel_reviews['评论日期']).dt.year
Dining_reviews.head(10)


def addstr(s):
    return '餐饮评论-'+str(s)


Dining_reviews['语料ID'] = Dining_reviews['餐饮评论ID'].progress_apply(addstr)
Dining_reviews['文本'] = Dining_reviews['评论内容'] + '\n'+Dining_reviews['标题']
Dining_reviews['产品名称'] = Dining_reviews['餐饮名称']
Dining_reviews['年份'] = pd.to_datetime(Dining_reviews['评论日期']).dt.year

4.1.2 Extract tourism products from official-account articles and travel guides

# Extract keyword phrases with the TextRank algorithm
... (omitted; please download the complete code)
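The full keyword-extraction code is not reproduced here. As a rough illustration only, a minimal sketch of Chinese TextRank keyword extraction with jieba might look like the snippet below; the helper name extract_products and the POS filter are assumptions for illustration, not the exact logic of the complete code.

import jieba.analyse

def extract_products(text, topk=5):
    # Assumption: treat the top-k TextRank keywords (nouns, place names,
    # organization and other proper nouns) as candidate tourism-product names.
    return jieba.analyse.textrank(str(text), topK=topk,
                                  allowPOS=('ns', 'n', 'nt', 'nz'))

# Example call on a generic sentence
print(extract_products('周末去了海边的景区，晚上住在市区的一家海景酒店，还吃了当地的特色餐饮'))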

4.1.3 Save the result2-1 table


... (omitted; please download the complete code)
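The aggregation code is omitted above. As a sketch only, assuming the three review sources prepared in 4.1.1 are stacked on the unified columns (the complete code also appends the products extracted from 游记攻略 and 微信公众号文章), the summary table could be built like this:

cols = ['语料ID', '产品名称', '文本', '年份']
all_df = pd.concat(
    [Scenic_reviews[cols], Hotel_reviews[cols], Dining_reviews[cols]],
    axis=0).reset_index(drop=True)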


product_id = ['ID'+str(i+1) for i in range(len(all_df))]
all_df['产品ID'] = product_id
result2 = all_df[['语料ID','产品ID','产品名称']]
result2


result2.to_csv('./data/result2-1.csv', index=False)
all_df.to_csv('./data/问题二所有数据汇总.csv', index=False)

4.2 Popularity Analysis

4.2.1 Read data

import warnings
import numpy as np
import pandas as pd
from tqdm import tqdm

tqdm.pandas()
warnings.filterwarnings('ignore')

all_df = pd.read_csv('./data/问题二所有数据汇总.csv')
all_df


4.2.2 Compute sentiment scores

... (omitted; please download the complete code)
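The sentiment-scoring code is omitted above. Based on the description in Section 2 (cnsenti, scores of 1 / -1 / 0), a minimal sketch could look like the following; the helper name sentiment_score is an assumption for illustration.

from cnsenti import Sentiment

senti = Sentiment()

def sentiment_score(text):
    # Count positive and negative emotion words with cnsenti, then map to
    # 1 (positive), -1 (negative) or 0 (neutral).
    result = senti.sentiment_count(str(text))
    if result['pos'] > result['neg']:
        return 1
    if result['pos'] < result['neg']:
        return -1
    return 0

all_df['情感得分'] = all_df['文本'].progress_apply(sentiment_score)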


4.2.3 Count the occurrences of each tourism product by year

# Split the summary table by year
year_2018 = all_df[all_df['年份'] == 2018].copy()
year_2019 = all_df[all_df['年份'] == 2019].copy()
year_2020 = all_df[all_df['年份'] == 2020].copy()
year_2021 = all_df[all_df['年份'] == 2021].copy()

# Count how many times each product name appears within its year
for year_df in (year_2018, year_2019, year_2020, year_2021):
    year_df['出现频次'] = year_df['产品名称'].map(year_df['产品名称'].value_counts())

4.3 Calculate product popularity score

The sentiment score and the occurrence frequency are combined with fixed weights (frequency is weighted twice as heavily as sentiment).

# Compute the combined popularity score (weight 2 on frequency, 1 on sentiment)
year_2018['产品热度总分'] = 2*year_2018['出现频次']+year_2018['情感得分']
year_2019['产品热度总分'] = 2*year_2019['出现频次']+year_2019['情感得分']
year_2020['产品热度总分'] = 2*year_2020['出现频次']+year_2020['情感得分']
year_2021['产品热度总分'] = 2*year_2021['出现频次']+year_2021['情感得分']

# Normalize so that the popularity values within each year sum to 1
year_2018['产品热度'] = year_2018['产品热度总分'].div(np.sum(year_2018['产品热度总分']), axis=0)
year_2019['产品热度'] = year_2019['产品热度总分'].div(np.sum(year_2019['产品热度总分']), axis=0)
year_2020['产品热度'] = year_2020['产品热度总分'].div(np.sum(year_2020['产品热度总分']), axis=0)
year_2021['产品热度'] = year_2021['产品热度总分'].div(np.sum(year_2021['产品热度总分']), axis=0)

year_2018 = year_2018.sort_values(by="产品热度", ascending=False).reset_index(drop=True)
year_2019 = year_2019.sort_values(by="产品热度", ascending=False).reset_index(drop=True)
year_2020 = year_2020.sort_values(by="产品热度", ascending=False).reset_index(drop=True)
year_2021 = year_2021.sort_values(by="产品热度", ascending=False).reset_index(drop=True)

product_hot_score = pd.concat([year_2018, year_2019, year_2020, year_2021], axis=0)
product_hot_score


4.4 Determine the product type

# Tokenization and stop-word removal
import re
import jieba

stopword_list = [k.strip() for k in open(
    'stop/cn_stopwords.txt', encoding='utf8').readlines() if k.strip() != '']

def clearTxt(line):
    line = str(line).strip()
    if line == '':
        return ''
    # Keep only Chinese characters (drop letters, digits and punctuation)
    line = re.sub(r"[^\u4e00-\u9fa5]", "", line)
    # Tokenize with jieba, then drop stop words
    segList = jieba.cut(line, cut_all=False)
    sentence = ''
    for word in segList:
        word = word.strip()
        if word and word != '\t' and word not in stopword_list:
            sentence += word + " "
    return sentence.strip()


product_hot_score['文本'] = product_hot_score['文本'].progress_apply(clearTxt)
product_hot_score
# Product types: 景区 (scenic spot), 酒店 (hotel), 网红景点 (Internet-famous attraction),
# 民宿 (homestay), 特色餐饮 (specialty dining), 乡村旅游 (rural tourism), 文创 (cultural and creative)
def get_product_type(s):
    ... (omitted; please download the complete code)
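The body of get_product_type is omitted above. As a hedged stand-in only, a simple keyword-matching rule over the corpus-ID prefix and the segmented text (covering the categories listed in the comment above) could look like this; the keyword list and the fallback category are assumptions, not the logic of the complete code.

def get_product_type(s):
    # Assumption: rule-based matching; the complete code may use richer rules.
    if '网红' in s:
        return '网红景点'
    if '民宿' in s:
        return '民宿'
    if '乡村' in s or '农家' in s:
        return '乡村旅游'
    if '文创' in s:
        return '文创'
    if s.startswith('景区评论'):
        return '景区'
    if s.startswith('酒店评论'):
        return '酒店'
    if s.startswith('餐饮评论'):
        return '特色餐饮'
    return '景区'  # fallback category (assumption)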

product_hot_score['产品类型判断文本'] = product_hot_score['语料ID'] +' '+product_hot_score['文本']

product_hot_score['产品类型'] = product_hot_score['产品类型判断文本'].progress_apply(get_product_type)


# Remove duplicate products (keep the first record for each product name)
product_hot_score2 = product_hot_score.drop_duplicates(['产品名称'])
product_hot_score2


4.5 Save result2-2.csv

# result2-2 columns: 产品ID, 产品类型, 产品名称, 产品热度, 年份

result2_2 = product_hot_score2[['产品ID', '产品类型', '产品名称', '产品热度', '年份']]
result2_2.to_csv('./data/result2-2.csv', index=False)



Original post: blog.csdn.net/weixin_43935696/article/details/124305017