[The 10th "Teddy Cup" Data Mining Challenge] Question C: Graph Analysis of Surrounding-Tour Demand in the Context of the Epidemic, Question 3 Scheme and Python Implementation


Related Links

(1) Question 1 scheme and implementation: blog post

(2) Question 2 scheme and implementation: blog post

(3) Question 3 scheme and implementation: blog post

(4) Question 4 code and scheme: released on April 26

Code Download

(1) Question 1: complete code download

(2) Question 2: complete code and scheme download

(3) Question 3: complete scheme and code download

The code below is incomplete; please download the complete version.

(4) Question 4 code and scheme: released on April 26

1 Problem

For the full problem statement, see the first article:

[The 10th "Teddy Cup" Data Mining Challenge] Question C: Graph Analysis of Surrounding-Tour Demand in the Context of the Epidemic, Question 1 Scheme and Python Implementation

Question 3: Construction and Analysis of a Local Tourism Graph

Based on the provided OTA and UGC data, perform an association analysis on the tourism products extracted in Question 2, identify strong association patterns centered on scenic spots, hotels, restaurants, etc., and save the results as a table in the file "result3.csv". On this basis, construct a local tourism graph and choose an appropriate method for visual analysis. Teams are encouraged to discover and explain implicit association patterns between tourism products.

2 Approach

The weighted sum of support, confidence, and lift is taken as the relevance score. The three measures are defined below; a small sketch of the weighted score follows the list.

  • Here, I denotes the full transaction set, and num() denotes the number of transactions in which a given itemset occurs;
  • for example, num(I) is the total number of transactions;
  • num(X∪Y) is the number of transactions containing both X and Y (this count is also called the frequency).
  • 1. Support
    • Support is the probability that the itemset {X, Y} appears in the full transaction set:
    • Support(X→Y) = P(X∪Y) / P(I) = num(X∪Y) / num(I)
  • 2. Confidence
    • Confidence is the probability that Y occurs given that the antecedent X of the rule "X→Y" occurs, i.e., the likelihood that a transaction containing X also contains Y:
    • Confidence(X→Y) = P(Y|X) = P(X∪Y) / P(X) = num(X∪Y) / num(X)
  • 3. Lift
    • Lift is the ratio of the probability of Y given X to the probability of Y without the condition X:
    • Lift(X→Y) = P(Y|X) / P(Y) = Confidence(X→Y) / Support(Y)
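As a concrete illustration of the weighted score, here is a minimal sketch. It is my own illustration: the original does not specify the weights, so the values below are assumptions.

# Minimal sketch: weighted sum of support, confidence, and lift as the
# relevance score. The weights w_s, w_c, w_l are illustrative assumptions.
def relevance_score(support, confidence, lift, w_s=0.4, w_c=0.4, w_l=0.2):
    return w_s * support + w_c * confidence + w_l * lift

# Example: relevance_score(0.05, 0.6, 1.8) == 0.02 + 0.24 + 0.36 = 0.62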

Improvement points:

What I provide here is only a baseline: a pipeline from data to tables to visualization. It is a reference answer, not the final answer.

There are many directions for improvement. For example, in Question 2 the tourism products could be extracted with NER; in Question 3 the solution could be upgraded with knowledge-graph techniques such as KRL (knowledge representation learning), ERL (entity recognition and linking), ERE (entity relation extraction), EDE (entity detection and extraction), KSQ (knowledge storage and query), and KR (knowledge reasoning). A more sophisticated pipeline raises the probability of winning.

3 Python Implementation

import pandas as pd
import numpy as np
from collections import defaultdict

# This CSV comes from the Question 2 code; please download the Question 2 code and data: https://mianbaoduo.com/o/bread/YpmYm5xy
data = pd.read_csv('./data/问题二所有数据汇总.csv')

3.1 One-Hot Encoding of the Tourism Products in the Sample Set

In the feature dictionary, the key is the tourism product name and the value is its column index:

{'周黑鸭(东汇城店)': 0, '茂名文华酒店': 1, '旅游度假区': 2, '古郡水城': 3, '南三岛': 4, '红树林公园': 5, '信宜酒店': 6, '菠斯蒂蛋糕': 7}

Then, for each sample, every tourism product that appears in the sentence has its corresponding column set to 1:

    [[0 1 0 0 0 1 0 1]
     [0 1 0 1 1 0 0 0]
     [0 0 0 0 1 0 0 1]
     [0 0 1 0 1 1 1 0]
     [1 1 0 0 0 1 0 1]
     [0 1 1 0 0 1 0 1]]
# One-hot encode the products in each sample.
# There are 438 products in total, so initialize 438 zero-valued columns; if a sample
# contains the two products "丰年面包店" and "功夫鸡排", the two corresponding columns are set to 1.
# Returns the one-hot array and the product-to-column-index dictionary.
def create_one_hot(data):
    """Convert the entity data into 0/1 vectors, similar to a bag-of-words model.
    """
    # ... omitted; please download the complete code
    return out_array, feature_dict
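Since the function body is elided above, the following is a minimal sketch of what such an encoder could look like. It is an assumption, not the original code, and it presumes each row of `data` carries the products found in that sample as a list in a hypothetical column named '产品列表':

def create_one_hot_sketch(data, product_col='产品列表'):
    """Sketch: build a 0/1 matrix like a bag-of-words model (assumed data layout)."""
    feature_dict = {}  # product name -> column index
    for products in data[product_col]:
        for p in products:
            feature_dict.setdefault(p, len(feature_dict))
    out_array = np.zeros((len(data), len(feature_dict)), dtype=int)
    for row, products in enumerate(data[product_col]):
        for p in products:
            out_array[row, feature_dict[p]] = 1
    return out_array, feature_dict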

3.2 Calculate Support, Confidence, and Lift

# Compute support as the relevance measure
def calculate(data_vector):
    """Compute support, confidence, and lift.
    """
    print('=' * 50)
    print('Calculating...')

    n_samples, n_features = data_vector.shape
    print('Number of features: ', n_features)
    print('Number of samples: ', n_samples)

    support_dict = defaultdict(float)
    confidence_dict = defaultdict(float)
    lift_dict = defaultdict(float)

    # together_appear_dict: {(0, 1): 3, (0, 3): 2, (0, 4): 1, (1, 0): 3, (1, 3): 2, ...}
    # each key is a pair of feature indices; the value is how many times the two features co-occur
    together_appear_dict = defaultdict(int)

    # feature_num_dict: {0: 3, 1: 4, 2: 3, ...}
    # each key is a feature index; the value is the total number of occurrences of that feature
    feature_num_dict = defaultdict(int)

    # ... omitted; please download the complete code

    return support_dict, confidence_dict, lift_dict
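The elided counting and scoring steps could look like the sketch below. This is my reconstruction from the formulas in Section 2, not the original code: it counts single and pairwise occurrences over the one-hot matrix, then applies the Support, Confidence, and Lift definitions.

def calculate_sketch(data_vector):
    """Sketch: support, confidence, and lift for every ordered feature pair."""
    n_samples, _ = data_vector.shape
    feature_num_dict = defaultdict(int)
    together_appear_dict = defaultdict(int)
    for row in data_vector:
        idx = [int(i) for i in np.nonzero(row)[0]]  # features present in this sample
        for i in idx:
            feature_num_dict[i] += 1
        for i in idx:
            for j in idx:
                if i != j:
                    together_appear_dict[(i, j)] += 1
    support_dict, confidence_dict, lift_dict = {}, {}, {}
    for (x, y), n_xy in together_appear_dict.items():
        support_dict[(x, y)] = n_xy / n_samples                 # num(X∪Y) / num(I)
        confidence_dict[(x, y)] = n_xy / feature_num_dict[x]    # num(X∪Y) / num(X)
        lift_dict[(x, y)] = confidence_dict[(x, y)] / (feature_num_dict[y] / n_samples)
    return support_dict, confidence_dict, lift_dict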

3.3 Convert Indices Back to Product Names


def convert_to_sample(feature_dict, s, c, l):
    """Convert features represented by indices 0, 1, 2, 3, ... back to entity names.
    """
    print('=' * 50)
    print('Start converting to the required sample format...')
    # print(feature_dict)
    feature_mirror_dict = dict()
    for k, v in feature_dict.items():
        feature_mirror_dict[v] = k
    # print(feature_mirror_dict)

    # ... omitted; please download the complete code

data_array, feature_di = create_one_hot(data)
support_di, confidence_di, lift_di = calculate(data_array)

support = sorted(support_di.items(), key=lambda x: x[1], reverse=True)
confidence = sorted(confidence_di.items(),
                    key=lambda x: x[1], reverse=True)
lift = sorted(lift_di.items(), key=lambda x: x[1], reverse=True)

support_li, confidence_li, lift_li = convert_to_sample(feature_di, support, confidence, lift)
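For reference, the elided conversion could look like the sketch below. It is an assumption, not the original code: it mirrors the feature dictionary and maps each ((col_i, col_j), score) item back to a name triple.

def convert_to_sample_sketch(feature_dict, s, c, l):
    """Sketch: map ((col_i, col_j), score) items back to (name, name, score) triples."""
    mirror = {v: k for k, v in feature_dict.items()}
    def convert(pairs):
        return [(mirror[i], mirror[j], score) for (i, j), score in pairs]
    return convert(s), convert(c), convert(l)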


3.4 Compute the Relevance Score

support_df = pd.DataFrame(support_li, columns=['产品名称1', '产品名称2', '支持度'])
confidence_df = pd.DataFrame(confidence_li, columns=['产品名称1', '产品名称2', '置信度'])
lift_df = pd.DataFrame(lift_li, columns=['产品名称1', '产品名称2', '提升度'])

# ... omitted; please download the complete code
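# Sketch of the elided step (an assumption, not the original code): join the
# three score tables on the product pair and take a weighted sum (equal
# weights here) as the relevance score '关联度'.
submit_3 = (support_df
            .merge(confidence_df, on=['产品名称1', '产品名称2'])
            .merge(lift_df, on=['产品名称1', '产品名称2']))
submit_3['关联度'] = (submit_3['支持度'] + submit_3['置信度'] + submit_3['提升度']) / 3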
del submit_3['支持度']
submit_3


3.5 Generate result3.csv

# Map each product name to an ID of the form 'ID1', 'ID2', ...
map_dict = {}
for i, d in enumerate(feature_di):
    map_dict[d] = 'ID' + str(feature_di[d] + 1)
map_dict


# Convert product names to IDs
submit_3['产品1'] = submit_3['产品名称1'].map(map_dict)
submit_3['产品2'] = submit_3['产品名称2'].map(map_dict)
result3 = submit_3[['产品1', '产品2', '关联度']]
result3


# Read the product-type table from Question 2; it is needed to generate the association type for Table 3.
# This CSV comes from the Question 2 code; please download the Question 2 code and data: https://mianbaoduo.com/o/bread/YpmYm5xy
result2_2 = pd.read_csv('./data/result2-2.csv')
p_k = result2_2['产品ID']
p_v = result2_2['产品类型']
p_type_dict = dict(zip(p_k, p_v))
p_type_dict

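The step that builds `p_type` is not shown in the original. Here is a sketch under the assumption that the association type of a pair is the two product types joined with a hyphen, looked up from `p_type_dict`:

# Sketch (assumption): association type of a pair = 'type1-type2'
p_type = [str(p_type_dict.get(a)) + '-' + str(p_type_dict.get(b))
          for a, b in zip(result3['产品1'], result3['产品2'])]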

result3['关联类型'] = p_type
result3.to_csv('./data/result3.csv',index=False)


3.6 Visualizing the Associations

import matplotlib.pyplot as plt
import networkx as nx

# Build a directed multigraph from the product pairs with a positive relevance score
G = nx.from_pandas_edgelist(submit_3[submit_3['关联度'] > 0], "产品名称1", "产品名称2",
                            edge_attr=True, create_using=nx.MultiDiGraph())

plt.figure(figsize=(12, 12))
pos = nx.spring_layout(G, k=0.5)  # k regulates the distance between nodes
nx.draw(G, with_labels=True, node_color='skyblue',
        node_size=1500, edge_cmap=plt.cm.Blues, pos=pos)
plt.show()
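Optionally (my addition, not in the original), the edge width can be scaled by the relevance score so that stronger associations stand out:

# Edge attributes were stored via edge_attr=True, so '关联度' is available per edge
plt.figure(figsize=(12, 12))
weights = [d['关联度'] for _, _, d in G.edges(data=True)]
nx.draw(G, pos=pos, with_labels=True, node_color='skyblue',
        node_size=1500, width=[w * 10 for w in weights])
plt.show()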



Origin: blog.csdn.net/weixin_43935696/article/details/124361365