2. Python data analysis project - ticket price prediction for tourist attractions

1. Summary

Process: Concrete operations
Basic viewing: check missing values, check column data types
Preprocessing: handle missing values (fill), split fields to extract the required values, unify data formats
Data analysis: groupby to find the largest/smallest values, seaborn visualization
Prediction (RandomForestRegressor): split the dataset, build the model, train the model, predict, evaluate the model

Comparing quantities: bar chart
Viewing proportions: pie chart
Viewing how data is distributed across ranges: probability density plot
Viewing correlation: bar chart
Distribution analysis: count plot (countplot), distribution plot (displot, which replaces the deprecated distplot)

2. Basic viewing of data

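import pandas as pd

df = pd.read_csv('data.csv')
# Findings from inspecting the first rows:
#   1. level: extract the numeric grade on its own and fill missing values with 0
#   2. area: split the region string into several columns
#   3. hot: extract the popularity value and convert it to a uniform numeric format
df.head()

# Descriptive statistics for the numeric columns
df.describe()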
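
The summary lists checking missing values and data types as part of this step, but the code above only calls head() and describe(). A minimal sketch of those two checks follows, using a small hypothetical DataFrame (the rows and column values are made up for illustration and are not taken from data.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic a few of the raw columns used in this post
sample = pd.DataFrame({
    "name": ["A", "B", "C"],
    "level": ["5A景区", np.nan, "4A景区"],
    "price": [60.0, 35.0, 120.0],
    "num": [1200, 300, 45],
})

missing_per_column = sample.isnull().sum()  # number of missing values in each column
column_types = sample.dtypes                # data type of each column

print(missing_per_column)
print(column_types)
```

On the real data, the same two calls on df would show which columns need filling and which ones are still stored as strings.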

3. Data preprocessing

The inspection in step 2 determined what needs to be processed: the level, hot, and area columns.

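# Process the level column
# 1. Fill missing values: replace NaN levels with 0
df['level'] = df['level'].fillna(0)
# 2. Keep only the numeric grade (the first character of the level string)
df['level'] = df['level'].apply(lambda x: 0 if x == 0 else int(x[0]))

# Process the hot (popularity) column: keep only the number
# 1. Take the last space-separated token, convert it to float, and round to two decimals
df['hot'] = df['hot'].apply(lambda x: round(float(x.split(" ")[-1]), 2))

# Process the area column
# 1. Extract the province, city, and district into new columns
df['province'] = df['area'].apply(lambda x: x.split('·')[0])    # new province column
df['city'] = df['area'].apply(lambda x: x.split('·')[1])        # new city column
df['mini_city'] = df['area'].apply(lambda x: x.split('·')[-1])  # new district column
# 2. Drop the original area column
del df['area']
df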
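
The apply-based version above raises an IndexError if an area string has no '·' separator, and int(x[0]) fails if a level string does not start with a digit. As a hedged alternative, the same three cleanups can be sketched with pandas' vectorized .str accessor, which yields NaN instead of raising. The sample rows below are hypothetical, and the real data.csv may be formatted differently:

```python
import pandas as pd

# Hypothetical raw rows, including a missing level and short area strings
raw = pd.DataFrame({
    "level": ["5A景区", None, "4A景区"],
    "hot": ["热度 0.987", "热度 0.5", "热度 0.7234"],
    "area": ["北京·北京·海淀区", "浙江·杭州", "四川"],
})

# level: extract the leading digit; rows without one become 0
level_digit = raw["level"].str.extract(r"^(\d)", expand=False)
raw["level"] = pd.to_numeric(level_digit).fillna(0).astype(int)

# hot: keep the last space-separated token as a float rounded to two decimals
raw["hot"] = raw["hot"].str.split(" ").str[-1].astype(float).round(2)

# area: split once, then pick the parts; str.get returns NaN when a part is absent
parts = raw["area"].str.split("·")
raw["province"] = parts.str[0]
raw["city"] = parts.str.get(1)
raw["mini_city"] = parts.str[-1]
raw = raw.drop(columns="area")

print(raw)
```

Whether a missing city should stay NaN or fall back to the province is a modeling decision that this sketch leaves open.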

4. Data analysis

4.1 Top 10 attractions by sales

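import matplotlib.pyplot as plt
import seaborn as sns

# 1. Sort by sales in descending order
num_top = df.sort_values(by='num', ascending=False)
# 2. Reset the index, discarding the old one
num_top = num_top.reset_index(drop=True)
# 3. Plot
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']  # set the font
sns.set(font='SimHei')  # use a font that can render the Chinese attraction names
fig = plt.figure()
# Bar chart of the top 10; recent seaborn versions require x= and y= as keywords
sns.barplot(x=num_top['name'][:10], y=num_top['num'][:10])
plt.show()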
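
The sort, reset, and slice sequence above can also be written with DataFrame.nlargest, which returns the top rows directly. A small sketch on made-up sales figures (the names and numbers are hypothetical) shows that both routes select the same rows:

```python
import pandas as pd

# Hypothetical sales figures
sales = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "num": [30, 500, 120, 7, 260],
})

# Route 1: sort descending, reset the index, slice the first rows (as in the post)
top3_sorted = sales.sort_values(by="num", ascending=False).reset_index(drop=True)[:3]

# Route 2: nlargest picks the top rows without sorting the whole table explicitly
top3_nlargest = sales.nlargest(3, "num").reset_index(drop=True)

print(top3_nlargest)
```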

4.2 Relationship between scenic spot ratings and provinces

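# 1. Add a level_sum column initialized to 1; it is summed later to count rows
df['level_sum'] = 1
# 2. Group by province and level, then sum level_sum within each group
var = df.groupby(['province','level'])['level_sum'].sum()
# 3. Convert the Series to a DataFrame (province becomes the row index, level the columns)
var.unstack()
# 4. Plot a bar chart
var.unstack().plot(kind='bar')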
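
Summing a helper column of ones and unstacking is one way to build a province-by-level count table. pandas also offers pd.crosstab for the same purpose. The sketch below, on hypothetical rows, compares the two; the main visible difference is that crosstab writes 0 for combinations that never occur, while the groupby route leaves NaN:

```python
import pandas as pd

# Hypothetical scenic spots: province and numeric level only
spots = pd.DataFrame({
    "province": ["Beijing", "Beijing", "Zhejiang", "Zhejiang", "Zhejiang"],
    "level": [5, 4, 5, 5, 0],
})

# Route used in the post: helper column of ones, group, sum, unstack
spots["level_sum"] = 1
by_sum = spots.groupby(["province", "level"])["level_sum"].sum().unstack()

# Alternative: crosstab counts the combinations directly
by_crosstab = pd.crosstab(spots["province"], spots["level"])

print(by_sum)
print(by_crosstab)
```

Either table can be passed to .plot(kind='bar') in the same way.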

4.3 Top 10 5A-level scenic spots by sales (number of visitors)

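# Select the 5A scenic spots, sort by sales in descending order,
# reset the index (discarding the old one), and take the first 10
top_5A = df[df['level'] == 5].sort_values(by='num',ascending=False).reset_index(drop=True)[:10]
# Bar chart of attraction name versus sales (number of visitors)
sns.barplot(x=top_5A['name'], y=top_5A['num'])
plt.title('Most-visited 5A-level scenic spots')
plt.xticks(rotation=90)  # rotate the x-axis labels by 90 degrees
plt.show()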

4.4 Data distribution analysis - rating, popularity, price, sales

# 1. Distribution of scenic spot levels
plt.figure(figsize=(20,1))
sns.countplot(y='level', data=df)

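# 2. Distribution of popularity (hot)
# Keep minus signs on the axes from rendering as garbled characters
plt.rcParams['axes.unicode_minus'] = False
# Plot the popularity distribution
sns.displot(df['hot'])
plt.xticks(rotation=25)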

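# 3. Price distribution
# describe() shows unrealistic maximum prices: domestic ticket prices rarely
# exceed 300 yuan, so rows above that are dropped and excluded from the analysis
df.describe()
df = df.drop(df[df['price']>300].index)
df.describe()
sns.displot(df['price'])
plt.xticks(rotation=25)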

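# 4. Sales distribution
# The values are too concentrated for a raw distribution plot to be informative,
# so bin them into 10 intervals first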
df['num_cut'] = pd.cut(df['num'],10)
plt.figure()
sns.countplot(y='num_cut', data=df)
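
pd.cut with an integer argument creates equal-width bins, so heavily skewed data still ends up mostly in the first bin. For comparison, pd.qcut creates bins with roughly equal counts. The sketch below uses made-up sales numbers to show the difference; it illustrates the two functions and is not what the original analysis used for its chart:

```python
import pandas as pd

# Hypothetical, strongly right-skewed sales figures
num = pd.Series([1, 2, 3, 5, 8, 13, 40, 100, 1000])

# Equal-width bins: the range 1..1000 is divided into 10 intervals of equal length
equal_width = pd.cut(num, 10)

# Equal-frequency bins: the interval edges are placed at quantiles of the data
equal_count = pd.qcut(num, 3)

print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

With equal-width bins almost every value falls into the first interval, whereas the quantile bins each hold three values.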

5. Data preprocessing before modeling

# 1. Drop the columns that are not needed for modeling
df.drop(['level_sum', 'num_cut'],axis=1,inplace=True)
df.head()

# 2. One-hot encode the categorical columns
one_hot_df = pd.get_dummies(df[['province', 'city', 'mini_city']])
# 3. Merge the encoded columns with the numeric columns of the original table
df = df[['level','hot','price','num']]
df = pd.merge(df, one_hot_df, left_index=True, right_index=True)
df.head()
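
To make the effect of get_dummies concrete, here is a minimal sketch on three hypothetical rows with a single categorical column. It uses pd.concat along the columns, which is equivalent to the index-based merge above when both frames share the same index:

```python
import pandas as pd

# Hypothetical rows with one categorical column
spots = pd.DataFrame({
    "level": [5, 4, 5],
    "price": [60.0, 35.0, 120.0],
    "province": ["Beijing", "Zhejiang", "Beijing"],
})

# Each distinct province becomes its own indicator column
one_hot = pd.get_dummies(spots[["province"]])

# Put the numeric columns and the indicator columns side by side
features = pd.concat([spots[["level", "price"]], one_hot], axis=1)

print(features)
```

Because every distinct province, city, and district value adds a column, the real table can become very wide; that is worth keeping in mind when interpreting the model.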

6. Modeling - Predicting the ticket price

Use RandomForestRegressor (random forest) to build a regression model that predicts ticket prices

# 1. Imports
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

# 2. Build the dataset
X = df[df.columns.difference(['price'])].values  # feature matrix: every column except price
y = df['price'].values                           # target: ticket price

# 3. Split into training and test sets (30% held out for testing)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=456)

# 4. Build the model
rf = RandomForestRegressor(n_estimators=20, max_depth=7)

# 5. Fit the model on the training set
rf.fit(X_train, y_train)

# 6. Predict on the test set
pred = rf.predict(X_test)

# 7. Evaluate the model
print("MSE: ", mean_squared_error(y_test, pred))
print("MAE: ", mean_absolute_error(y_test, pred))
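
To help read the two numbers printed above, the sketch below computes MSE and MAE by hand with NumPy on made-up prices, and compares them against a naive baseline that always predicts the mean price. The arrays are hypothetical; on the real data, y_test and pred would take their place:

```python
import numpy as np

# Hypothetical true and predicted ticket prices (yuan)
y_true = np.array([50.0, 80.0, 120.0, 30.0])
y_pred = np.array([60.0, 70.0, 100.0, 30.0])

errors = y_pred - y_true
mse = np.mean(errors ** 2)      # mean squared error, in yuan squared
mae = np.mean(np.abs(errors))   # mean absolute error, in yuan
rmse = np.sqrt(mse)             # root of MSE, back in yuan

# Naive baseline: always predict the mean of the true prices
baseline_pred = np.full_like(y_true, y_true.mean())
baseline_mae = np.mean(np.abs(baseline_pred - y_true))

print("MSE:", mse, "MAE:", mae, "RMSE:", rmse)
print("Baseline MAE:", baseline_mae)
```

A model is only useful if its MAE is clearly below such a baseline. MAE and RMSE are in the same unit as the price, which makes them easier to interpret than MSE.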

Origin blog.csdn.net/m0_63953077/article/details/129168081