1. Summary
Process | Concrete operation |
---|---|
Basic view | Check missing values, inspect column data types |
Preprocessing | Handle missing values (filling), split fields (extract the required values), unify data formats |
Data analysis | groupby aggregation to find extreme values, seaborn visualization |
Prediction (RandomForestRegressor) | Split dataset, build model, train model, predict, evaluate model |
Quantity view: bar chart
Proportion view: pie chart
Distribution across data ranges: probability density plot
Correlation view: bar chart
Distribution analysis: per-category count plot (countplot), distribution plot (displot)
2. Basic viewing of data
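import pandas as pd

df = pd.read_csv('data.csv')
'''
Findings from the preview:
1. level: extract the numeric grade and handle missing values (fill with 0);
2. area: split the region data into multiple columns;
3. hot: tidy the popularity values into a uniform numeric format.
'''
df.head()
# Descriptive statistics for the numeric columns
df.describe()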
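The summary lists checking missing values and column types as part of the basic view. A minimal sketch of those two checks follows. The rows are hypothetical stand-ins for `data.csv`, and the `'5A'`-style level strings are an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows standing in for data.csv
sample = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'level': ['5A', np.nan, '4A'],   # assumed format of the level strings
    'price': [120.0, 35.5, 60.0],
})

# Number of missing values in each column
missing_counts = sample.isnull().sum()
print(missing_counts)

# Data type of each column
print(sample.dtypes)
```

On the real data, the same two calls on `df` show which columns need filling and which need type conversion.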
3. Data preprocessing
Step 2 identified what needs cleaning: the level, hot, and area columns.
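# Process the level column
# 1. Fill missing values: replace NaN levels with 0
df['level'] = df['level'].fillna(0)
# 2. Keep only the numeric grade (the first character of the level string)
df['level'] = df['level'].apply(lambda x: 0 if x == 0 else int(x[0]))
# Process the hot column: keep only the number
# 1. Take the numeric string after the last space, convert it to float, and round to two decimals
df['hot'] = df['hot'].apply(lambda x: round(float(x.split(" ")[-1]), 2))
# Process the area column
# 1. Extract province, city, and district into new columns
df['province'] = df['area'].apply(lambda x: x.split('·')[0])  # new province column
df['city'] = df['area'].apply(lambda x: x.split('·')[1])  # new city column
df['mini_city'] = df['area'].apply(lambda x: x.split('·')[-1])  # new district column
# 2. Drop the original area column
del df['area']
df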
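The same cleaning steps can also be written with pandas string methods instead of `apply`. The sketch below is one possible alternative, not the method used above. The sample rows and their string formats are hypothetical, inferred from the code. It also behaves differently for rows with no district: `mini_city` is left missing rather than repeating the city.

```python
import pandas as pd

# Hypothetical rows; the real column formats are assumed from the code above
sample = pd.DataFrame({
    'level': ['5A', None, '4A'],
    'hot': ['hot 0.85', 'hot 0.7', 'hot 0.123'],
    'area': ['Beijing·Beijing·Dongcheng', 'Sichuan·Chengdu', 'Shanghai·Shanghai·Huangpu'],
})

# Leading digit of the level string; missing levels become 0
sample['level'] = (
    sample['level'].str.extract(r'^(\d)', expand=False).fillna('0').astype(int)
)

# Number after the last space, rounded to two decimals
sample['hot'] = sample['hot'].str.split(' ').str[-1].astype(float).round(2)

# Split area into up to three columns; absent parts stay missing
parts = sample['area'].str.split('·', expand=True).reindex(columns=range(3))
sample['province'] = parts[0]
sample['city'] = parts[1]
sample['mini_city'] = parts[2]
sample = sample.drop(columns='area')
print(sample)
```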
4. Data analysis
4.1 Top 10 attractions by sales
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Sort by sales in descending order
num_top = df.sort_values(by='num', ascending=False)
# 2. Reset the index, discarding the old one
num_top = num_top.reset_index(drop=True)
# 3. Plot
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']  # set the font
sns.set(font='SimHei')  # use a font that can render Chinese labels
fig = plt.figure()
# Recent seaborn versions require x and y as keyword arguments
sns.barplot(x=num_top['name'][:10], y=num_top['num'][:10])  # bar chart of the top 10
plt.show()
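Sorting the whole table and slicing the first rows works, but pandas also provides `DataFrame.nlargest` for picking the top rows directly. A small sketch on made-up data (the names and numbers are hypothetical):

```python
import pandas as pd

# Hypothetical attractions and sales figures
sample = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'num': [120, 950, 40, 300],
})

# Top 2 rows by 'num', with a fresh 0-based index
top = sample.nlargest(2, 'num').reset_index(drop=True)
print(top)
```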
4.2 Relationship between scenic spot ratings and provinces
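# 1. Add a level_sum column initialized to 1; it is summed later to count rows
df['level_sum'] = 1
# 2. Group by province and level, then sum the counter within each group
var = df.groupby(['province','level'])['level_sum'].sum()
# 3. Convert the Series to a DataFrame (province becomes the rows, level the columns)
var.unstack()
# 4. Plot a bar chart
var.unstack().plot(kind='bar')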
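The counter column plus `groupby` and `unstack` builds a province-by-level count table. `pd.crosstab` produces the same kind of table in one call, without the helper column. A minimal sketch with hypothetical rows follows. One difference is that `crosstab` fills absent combinations with 0, whereas `unstack()` leaves them as NaN.

```python
import pandas as pd

# Hypothetical provinces and levels
sample = pd.DataFrame({
    'province': ['Beijing', 'Beijing', 'Sichuan', 'Sichuan', 'Sichuan'],
    'level': [5, 4, 5, 5, 0],
})

# Rows: province, columns: level, values: number of scenic spots
counts = pd.crosstab(sample['province'], sample['level'])
print(counts)
```

Calling `counts.plot(kind='bar')` would then draw the same grouped bar chart.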
4.3 Top 10 5A-level scenic spots by sales
# Select the 5A-level scenic spots, sort by sales in descending order,
# reset the index (discarding the old one), and take the first 10
top_5A = df[df['level'] == 5].sort_values(by='num',ascending=False).reset_index(drop=True)[:10]
# Bar chart of scenic spot name versus sales (number of visitors)
sns.barplot(x=top_5A['name'], y=top_5A['num'])
plt.title('5A-level scenic spots with the most visitors')
plt.xticks(rotation=90) # rotate the x-axis labels by 90 degrees
plt.show()
4.4 Data distribution analysis - rating, popularity, price, sales
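# 1. Distribution of scenic spot levels
plt.figure(figsize=(20,1))
sns.countplot(y='level', data=df)
# 2. Distribution of popularity
# Fix garbled minus signs in plots
plt.rcParams['axes.unicode_minus'] = False
# Plot the popularity distribution
sns.displot(df['hot'])
plt.xticks(rotation=25)
# 3. Price distribution
# describe() reveals unrealistic prices: domestic ticket prices rarely exceed 300 yuan,
# so drop those rows and exclude them from the analysis
df.describe()
df = df.drop(df[df['price']>300].index)
df.describe()
sns.displot(df['price'])
plt.xticks(rotation=25)
# 4. Sales distribution
# The values are too concentrated for a direct distribution plot to be useful, so bin them first
df['num_cut'] = pd.cut(df['num'],10)
plt.figure()
sns.countplot(y='num_cut', data=df)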
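`pd.cut` makes equal-width bins, so heavily concentrated data still lands mostly in one bin. `pd.qcut` is a possible alternative that makes bins with roughly equal counts. The sketch below compares the two on a made-up, skewed series (the values are hypothetical).

```python
import pandas as pd

# Hypothetical sales values: most are small, one is very large
num = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

# Equal-width bins: nearly everything falls into the first bin
equal_width = pd.cut(num, 2).value_counts().sort_index()
print(equal_width)

# Equal-frequency bins: each bin holds about the same number of rows
equal_freq = pd.qcut(num, 2).value_counts().sort_index()
print(equal_freq)
```

Which one to use depends on the question: `cut` shows how sales spread over the value range, while `qcut` shows where the quantile boundaries fall.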
5. Data preprocessing before modeling
# 1. Drop the columns that are not needed for modeling
df.drop(['level_sum', 'num_cut'],axis=1,inplace=True)
df.head()
# 2. One-hot encode the selected columns (recommended for categorical columns)
one_hot_df = pd.get_dummies(df[['province', 'city', 'mini_city']])
# 3. Merge with the numeric columns of the original table on the index
df = df[['level','hot','price','num']]
df = pd.merge(df, one_hot_df, left_index=True, right_index=True)
df.head()
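To make the effect of one-hot encoding concrete, the following sketch applies `pd.get_dummies` to a tiny hypothetical table. Each distinct value becomes its own indicator column, so the number of new columns equals the number of distinct values across the encoded columns.

```python
import pandas as pd

# Hypothetical rows with two categorical columns and one numeric column
sample = pd.DataFrame({
    'province': ['Beijing', 'Sichuan', 'Sichuan'],
    'city': ['Beijing', 'Chengdu', 'Leshan'],
    'price': [120.0, 60.0, 80.0],
})

# One indicator column per distinct value, prefixed with the source column name
encoded = pd.get_dummies(sample[['province', 'city']])
print(list(encoded.columns))

# Attach the indicators to the numeric column by index
result = sample[['price']].join(encoded)
print(result.shape)
```

With many distinct cities and districts, the real table can become very wide, which is worth keeping in mind when reading feature importances or training time.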
6. Modeling - Predicting the ticket price
Use RandomForestRegressor (random forest) to build a regression model that predicts ticket prices from the level, popularity, sales, and one-hot encoded region columns.
# 1. Imports
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
# 2. Build the dataset
X = df[df.columns.difference(['price'])].values # feature matrix (every column except price)
y = df['price'].values # target values (price)
# 3. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=456)
# 4. Build the model
rf = RandomForestRegressor(n_estimators=20, max_depth=7)
# 5. Fit the model on the training set
rf.fit(X_train, y_train)
# 6. Predict on the test set
pred = rf.predict(X_test)
# 7. Evaluate the model
print("MSE: ", mean_squared_error(y_test, pred))
print("MAE: ", mean_absolute_error(y_test, pred))
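MSE is in squared price units, which makes it hard to read next to MAE. A common way to interpret both is to also report RMSE (the square root of MSE) and to compare against a trivial baseline that always predicts the mean price. The sketch below computes these by hand with NumPy on hypothetical values; the numbers are not results from the model above.

```python
import numpy as np

# Hypothetical actual prices and model predictions (not real results)
y_true = np.array([100.0, 50.0, 80.0, 30.0])
y_pred = np.array([90.0, 60.0, 70.0, 40.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)       # mean squared error, in squared price units
mae = np.mean(np.abs(errors))    # mean absolute error, in price units
rmse = np.sqrt(mse)              # root mean squared error, in price units

# Baseline: always predict the mean price
# (computed on the same toy values here for brevity)
baseline = np.full_like(y_true, y_true.mean())
baseline_mae = np.mean(np.abs(y_true - baseline))

print("MSE:", mse, "MAE:", mae, "RMSE:", rmse, "baseline MAE:", baseline_mae)
```

If the model's MAE is not clearly below the baseline MAE, the features are adding little predictive value.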