Predicting mobile phone prices with a regression tree, based on Jingdong (JD.com) sales data

Today's recommendation is a hands-on data analysis and mining project: "Predicting mobile phone prices with a regression tree, based on Jingdong sales data". The project runs a series of analyses on mobile phone sales data from Jingdong, then uses a regression tree to predict prices from nothing but a phone's external characteristics.

The project comes from Ted_Wei, a student in the fifth cohort of the Shiyanlou course "Lou+ Data Analysis and Mining in Practice".

Data collection

Because prices and comment counts on Jingdong are rendered dynamically by JavaScript, they are invisible to a crawler that simply uses the requests module. My solution was to use selenium's webdriver module to drive the Chrome browser and first crawl each phone's id, price, and number of comments. I then used the crawled ids to find each phone's product page and crawled the key parameters of each handset.
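The two-stage approach described above might be sketched roughly as follows. This is only an illustration and is not taken from parsing_code.py: the helper names, the product URL pattern, and the fixed wait time are all assumptions, and the real crawler may work quite differently.

```python
import re
import time

def item_url(sku_id):
    """Build a product page URL from an item id (the URL pattern is an assumption)."""
    return 'https://item.jd.com/{}.html'.format(sku_id)

def sku_from_url(url):
    """Extract the numeric item id from a product page URL, or return None."""
    match = re.search(r'/(\d+)\.html', url)
    return match.group(1) if match else None

def fetch_rendered_html(url, wait_seconds=3):
    """Load a page in Chrome so that its JavaScript runs, then return the rendered HTML."""
    # Imported here so the two helpers above can be used without selenium installed.
    from selenium import webdriver
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # crude fixed wait for dynamic content (hypothetical value)
        return driver.page_source
    finally:
        driver.quit()

print(item_url('100000000001'))  # hypothetical item id
```

The rendered HTML returned by fetch_rendered_html would still need to be parsed for the price and comment count; that part depends on the page layout and is left out here.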

The crawler code, parsing_code.py, is available from this page: https://www.kaggle.com/ted0001/dm05-998494/data

Data cleaning

The acquired data contains 1199 rows and 21 columns. Each row represents a phone on sale, and each column holds one parameter of the phone (such as price, memory size, or pixel count). Most of the acquired data is natural language rather than numerical information, which is very unfriendly for analysis. The focus of this step is therefore to find the values that are actually missing (for example, 'no' or 'refer to the official data' can be treated as the missing value NaN), and to use the re module to parse the natural language and extract the useful values from it.
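As a rough illustration of this kind of cleaning, the sketch below treats placeholder strings as missing and pulls the first number out of a free-text field with the re module. The placeholder list, the sample strings, and the helper name parse_number are assumptions made up for this example; they are not taken from data_clean.py.

```python
import math
import re

# Placeholder strings treated as missing values (hypothetical examples).
MISSING_MARKERS = {'', 'no', 'none', 'refer to the official data'}

# Matches an integer or a decimal number such as 6, 4000 or 6.39.
NUMBER_PATTERN = re.compile(r'\d+(?:\.\d+)?')

def parse_number(raw):
    """Return the first number found in raw as a float, or NaN if it is missing."""
    if raw is None:
        return math.nan
    text = str(raw).strip()
    if text.lower() in MISSING_MARKERS:
        return math.nan
    match = NUMBER_PATTERN.search(text)
    return float(match.group()) if match else math.nan

samples = ['6GB', '4000mAh (typical)', '6.39 inches', 'no', 'refer to the official data']
cleaned = [parse_number(s) for s in samples]
print(cleaned)
```

Real fields are messier than these samples (ranges, several numbers in one string, mixed units), which is why the actual cleaning code is much longer.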

The data cleaning step is actually quite complicated, and posting the code here would take up too much space. The cleaning code, data_clean.py, is available from this page: https://www.kaggle.com/ted0001/dm05-998494/data

Data analysis

Now to the main topic. The goals are to estimate and compare sales across phone brands, to explore the factors that determine a phone's price, and to try predicting phone prices with machine learning.

First, import the required modules:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error as mae

Read the cleaned data, which can be obtained from https://www.kaggle.com/ted0001/dm05-998494/data

data=pd.read_csv('../input/data-acquisitioncleaning/cleaned_data.csv')
data=data.set_index(['Unnamed: 0']) # after a DataFrame is saved to a csv file, its original index becomes a column, so the index has to be set again
data.shape

Output:

(1199,21)

Estimating and comparing phone sales by brand

The first idea that came to mind after cleaning the data was to compare the sales of the various phone brands. Since Jingdong shows only the number of comments for each phone rather than its actual sales, we have to assume that the number of comments is proportional to sales, and use it to estimate each brand's share of total sales.

pie_plt=data.groupby(['brand'])['comments'].sum().sort_values(ascending=False) # total number of comments per brand, used as our estimate of sales
pie_plt

Output:

brand
HUAWEI        11320592.0
Apple          9797100.0
XIAOMI         7995236.0
NOKIA          1324000.0
Philips        1227100.0
OPPO           1205300.0
vivo           1101330.0
K-Touch         793300.0
MEIZU           574900.0
SAMSUNG         539800.0
smartisan       364000.0
lenovo          275500.0
realme          157000.0
Meitu           109200.0
nubia            86000.0
chilli           58000.0
360              31000.0
ZTE              17000.0
Coolpad          12000.0
BlackBerry       10000.0
WE                  90.0
Name: comments, dtype: float64

A pie chart shows these figures more clearly.

# Draw a pie chart of each phone brand's share of estimated sales
fig,axes=plt.subplots(figsize=(12,12))
comment_sum=pie_plt.values.sum()
percen=[np.round(each/comment_sum*100,2) for each in pie_plt.values]
axes.pie(pie_plt.values,labels=pie_plt.index,labeldistance=1.2,autopct = '%3.1f%%')
axes.legend([pie_plt.index[i]+': '+str(percen[i])+"%" for i in range(len(percen))],loc='upper right',bbox_to_anchor=(1, 0, 1, 1))
axes.set_title('Estimated Handphone Market Share in China')
plt.show()

Output:

From the pie chart above, we estimate that the three best-selling phone brands in China are Huawei, Apple, and Xiaomi, accounting for 30.6%, 26.5%, and 21.6% of total sales respectively; every other brand sells far less than these three. To check these estimates, I looked up each brand's share of the Chinese mobile phone market in 2019. Huawei, Apple, and Xiaomi held 34%, 9%, and 12% respectively, so apart from Huawei, the estimates for Xiaomi and Apple differ considerably from the published figures. vivo and OPPO held 19% and 18% respectively, an even larger gap from our data. If the number of Jingdong comments broadly reflects online sales, I think the main reason for the discrepancy is that our data ignores offline sales in the Chinese market. From everyday experience, Apple and Xiaomi have far fewer offline stores (especially in small cities) than OPPO and vivo. The phone market as a whole includes both online and offline channels, so the difference between my figures and the authoritative statistics probably arises because I did not include offline sales. If offline sales data could be obtained, this reasoning could be checked further.
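The percentages quoted above can be reproduced directly from the comment totals in the groupby output. The short sketch below redoes that arithmetic in plain Python and sets the estimates beside the 2019 market shares mentioned in the text; the variable names are made up for this example.

```python
# Comment totals per brand, copied from the groupby output shown earlier.
comment_totals = {
    'HUAWEI': 11320592, 'Apple': 9797100, 'XIAOMI': 7995236, 'NOKIA': 1324000,
    'Philips': 1227100, 'OPPO': 1205300, 'vivo': 1101330, 'K-Touch': 793300,
    'MEIZU': 574900, 'SAMSUNG': 539800, 'smartisan': 364000, 'lenovo': 275500,
    'realme': 157000, 'Meitu': 109200, 'nubia': 86000, 'chilli': 58000,
    '360': 31000, 'ZTE': 17000, 'Coolpad': 12000, 'BlackBerry': 10000, 'WE': 90,
}
total = sum(comment_totals.values())

# Estimated share of sales in percent, assuming comments are proportional to sales.
shares = {brand: round(n / total * 100, 1) for brand, n in comment_totals.items()}

# 2019 market shares quoted in the text, in percent.
reported = {'HUAWEI': 34, 'Apple': 9, 'XIAOMI': 12, 'vivo': 19, 'OPPO': 18}

for brand, official in reported.items():
    gap = round(shares[brand] - official, 1)
    print(brand, 'estimated:', shares[brand], 'reported:', official, 'gap:', gap)
```

A positive gap means the comment-based estimate is higher than the reported share, which is the case for Apple and Xiaomi and the reverse of what happens for vivo and OPPO.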

Another surprise was that Nokia and Philips phones each account for more than 3% of estimated sales (even more than vivo and OPPO), so I wanted to see what price range these two brands sell in:

data[(data['brand']=='NOKIA')|(data['brand']=='Philips')]['price'].median() # median price of Nokia and Philips phones

Output:

208.5

data[(data['brand']=='NOKIA')|(data['brand']=='Philips')]['price'].mean() # mean price of Nokia and Philips phones

Output:

336.1190476190476

It appears that most Nokia and Philips phones are priced at around 200-300 yuan. Visiting the Jingdong pages for these item ids confirmed that the two brands mostly sell feature phones (along with some Nokia smartphones). Even in an era when smartphones are ubiquitous, feature phones still hold their own niche and have not been pushed out of the market entirely (especially in online sales channels). This may be because some elderly people are still more used to feature phones, and because the long standby time and loud ringer of feature phones meet the needs of specific groups of users.

Exploring the factors that determine phone prices

Another question worth studying is the relationship between a handset's price and its configuration parameters. To get an overall picture, we first draw a correlation matrix of the numerical data, and then explore the effect of non-numerical (categorical) data, such as brand and screen material, on prices.

Drawing and analyzing the correlation matrix

Because iPhones differ considerably from Android phones in price and configuration, and most configuration parameters for Apple phones are missing from our data, we consider only Android phones for this question. (The phone priced at 9999 is Huawei's Honor X9, due for release in July 2019. Because its price is abnormal, it is also excluded from our analysis.)

correlation=data[(data['brand']!='Apple')&(data['price']!=9999)].corr()
# Draw a heatmap of the correlation matrix
fig,axes=plt.subplots(figsize=(8,8))
cax=sns.heatmap(correlation,vmin=-0.25, vmax=1,square=True,annot=True)
axes.set_xticklabels(['RAM', 'ROM', 'battery', 'comments', 'price', 'rear camera',
       'resolution', 'screen size', 'weight'])
axes.set_yticklabels(['RAM', 'ROM', 'battery', 'comments', 'price', 'rear camera',
       'resolution', 'screen size', 'weight'])
axes.set_title('Heatmap of Correlation Matrix of numerical data')
plt.show()

Output:

The figure shows that price is most strongly correlated with storage (ROM) and memory (RAM), at 0.71 and 0.68 respectively. Next come battery capacity (battery), the number of rear cameras (rear camera), and screen size (screen size), each with a correlation of 0.42. This matches common sense. We can also clearly see that the comments row and column are dark purple throughout, meaning that the number of comments is only weakly correlated with the other numerical parameters. (Because our data set has many missing values, the values in the correlation matrix above would change somewhat if rows with missing values were dropped.)
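The remark about missing values is easier to follow with the computation written out. pandas' DataFrame.corr uses the Pearson coefficient by default and skips missing values pair by pair, so each cell of the matrix can be based on a different set of rows. The sketch below is a minimal plain-Python version of that idea for two columns; the toy numbers are invented and are not taken from the data set.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists, skipping pairs with a missing value."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy columns; None marks a missing value.
rom = [32, 64, 128, None, 256]
price = [999, 1499, 2999, 1999, None]

# Only the first three rows have both values, so only they contribute.
r = pearson(rom, price)
print(round(r, 3))
```

Dropping every row that has any missing value before calling corr would instead force all cells to use the same, smaller set of rows, which is why the numbers in the heatmap would shift.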

Exploring the effect of brand on phone prices

Of course, some non-numerical (categorical) data also plays a crucial role in determining phone prices. The first one that comes to mind is the brand.

data.groupby(['brand'])['price'].median().sort_values(ascending=False).values.std() # standard deviation of the per-brand median prices

Output:

1409.0576123336064

The per-brand median prices can be shown more intuitively with a bar chart.

bar_plt=data.groupby(['brand'])['price'].median()

fig,axes=plt.subplots(figsize=(20,8))
axes.bar(bar_plt.index,bar_plt.values)
axes.set_title('Median price of handphones of various brands')

Output:

Text(0.5, 1.0, 'Median price of handphones of various brands')

[Figure: bar chart of the median phone price for each brand]

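We can see that the median prices of the various brands vary widely. This matches common sense, since brands differ considerably in their positioning and target customers.

Exploring the effect of screen material on phone prices

Another key factor is the phone's screen material. We can use the same method to compare how different screen materials affect the price.

data.groupby(['screen material'])['price'].median().sort_values(ascending=False).values.std() # standard deviation of the median prices per screen material

Output: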

1523.0026019740856

The median price of phones for each screen material is shown below:

data.groupby(['screen material'])['price'].median().sort_values(ascending=False)

Output:

screen material
Dynamic AMOLED    5999.0
OLED曲面屏           5488.0
OLED              4288.0
Super AMOLED      2998.0
AMOLED曲面屏         2908.0
LCD               2399.0
AMOLED            2299.0
TFT LCD(IPS)      1999.0
LTPS              1999.0
IPS               1399.0
TFT               1199.0
Name: price, dtype: float64

As a bar chart:

bar_plt2=data.groupby(['screen material'])['price'].median()

fig,axes=plt.subplots(figsize=(18,8))
axes.bar(bar_plt2.index,bar_plt2.values)
axes.set_title('Median price of handphones of various screen materials')

Output:

Text(0.5, 1.0, 'Median price of handphones of various screen materials')

Note that all of the prices above are over a thousand yuan, yet our data set also contains very cheap feature phones. What are their screens made of?

data[(data['brand']=='NOKIA')|(data['brand']=='Philips')]['screen material'].value_counts()

Output:

TFT             27
IPS              3
TFT LCD(IPS)     1
Name: screen material, dtype: int64

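We can see that the screens of Nokia and Philips phones (mostly feature phones) are mostly TFT and IPS.

We can also look at it the other way round: which price ranges of phones use TFT and IPS?

# Plot the price distribution of phones whose screen material is IPS or TFT
hist_plot=data[(data['screen material']=='IPS')|(data['screen material']=='TFT')]['price'] # prices of all phones whose screen material is IPS or TFT
sns.distplot(hist_plot) # distplot is deprecated in newer seaborn releases; sns.histplot(hist_plot, kde=True) is the replacement there
plt.title('Price Distribution Plot of Handphones Whose Screen Material is TFT or IPS ')

Output: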

Text(0.5, 1.0, 'Price Distribution Plot of Handphones Whose Screen Material is TFT or IPS ')

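Looking at the distribution plot above, and at the phones in data whose screen material is TFT or IPS, shows that IPS is mainly used in Huawei's low-end and mid-range phones, priced below 1000 yuan or at around 1300 yuan. All feature phones priced below 200 yuan use TFT screens. To my surprise, some high-end phones also use IPS or TFT: Huawei's Honor V20 and Apple's iPhone 8 use IPS, while the Huawei Mate 20 and Huawei P20 both use TFT, and all of these phones cost more than 3500 yuan. This exploration tells us that high-end smartphones and low-end feature phones may well use the same screen material. Of course, we cannot rule out that panels sharing the TFT or IPS label still differ internally.

Trying to predict phone prices with machine learning

In the previous sections we explored the main factors that determine phone prices: storage (ROM), memory (RAM), brand, and screen material are all key. In this section I use a regression decision tree to predict a phone's price from its external characteristics alone. The tree uses only the brand (brand), the number of rear cameras (rear camera), and the weight (weight) as features; the target is of course the price (price). There are two reasons for this choice. First, ROM and RAM have too many missing values, so using them as features would cost us too much training data. Second, I wanted to see whether the price can be predicted reasonably accurately without knowing the exact configuration, just by observing and measuring the outside of the phone. The figure below shows that if we chose ROM, RAM, and brand as features, only about 31% of the original data set would be usable for training and testing.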

data.dropna(subset=['ROM','RAM','brand','price']).shape[0]/data.shape[0]

Output:

0.30692243536280234

Number of missing values in each column:

data.isnull().sum().sort_values(ascending=False)

Output:

CPU freq             998
CPU cores            824
front camera         773
RAM                  720
rear camera specs    710
ROM                  667
CPU model            479
screen material      381
brand                347
rear camera          249
battery              177
resolution           134
month                115
charging port         99
screen size           85
model                 70
weight                66
SIM cards             56
year                   6
price                  3
comments               0
dtype: int64

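Brand has a very clear effect on price, so although it has many missing values, we must include this feature.

Since brand is non-numerical data, we choose the regression decision tree algorithm for the machine learning model.

Extract the data we need from the original data set: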

df=data.loc[:,['price','rear camera','brand','weight']].dropna()

Because the regression decision tree accepts only numerical data, we need to one-hot encode brand.

to_model=pd.get_dummies(df) # one-hot encode the non-numerical data
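To make the encoding step concrete, the sketch below does by hand what one-hot encoding does to the brand column: it replaces it with one 0/1 indicator column per brand, named brand_<value> in the same style that the later code relies on. The function one_hot and the three sample rows are invented for this illustration; in the project itself pd.get_dummies does this work.

```python
def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one indicator per category; exactly one of them is 1 for each row.
        for category in categories:
            new_row['{}_{}'.format(column, category)] = int(row[column] == category)
        encoded.append(new_row)
    return encoded

# Hypothetical sample rows with the same columns as df.
phones = [
    {'price': 2999.0, 'rear camera': 3, 'weight': 180.0, 'brand': 'HUAWEI'},
    {'price': 199.0, 'rear camera': 1, 'weight': 90.0, 'brand': 'NOKIA'},
    {'price': 1999.0, 'rear camera': 2, 'weight': 175.0, 'brand': 'XIAOMI'},
]

encoded = one_hot(phones, 'brand')
for row in encoded:
    print(row)
```

Each brand gets its own column rather than an integer code, so the tree never treats one brand as numerically larger than another.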

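Extract the feature values and the target values. (Because the number of models per brand is quite limited, and some brands have very little data, we do not split the data into training and test sets here.)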

x=to_model.iloc[:,1:].values
y=to_model.iloc[:,0].values

Train the regression decision tree model:

model=DecisionTreeRegressor()
model.fit(x,y)

Output:

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

Check how accurately our model predicts prices for each brand.

error_list=[]
for each in df['brand'].value_counts().index:   
    to_fill='brand_{}'.format(each)
    x_data=to_model[to_model[to_fill]==1].iloc[:,1:].values
    y_data=to_model[to_model[to_fill]==1].iloc[:,0].values

    test_result=model.predict(x_data)
    merror=mae(y_data.reshape(len(y_data),1),test_result.flatten())
    error=(np.abs(test_result-y_data)/y_data).mean()
    print(each,end=' : ') 
    print(np.round(merror,2),end=', ')
    print(str(np.round(error*100,3))+'%')
    error_list.append([each,merror,error])

Output:

HUAWEI : 238.55, 15.16%
XIAOMI : 202.0, 12.277%
Apple : 663.28, 8.087%
OPPO : 177.65, 9.582%
vivo : 134.78, 8.747%
Philips : 7.01, 2.841%
MEIZU : 51.79, 3.009%
SAMSUNG : 269.2, 3.24%
K-Touch : 7.23, 4.144%
NOKIA : 12.5, 1.321%
lenovo : 33.33, 3.374%
Meitu : 120.0, 6.141%
smartisan : 0.0, 0.0%
realme : 0.0, 0.0%
nubia : 0.0, 0.0%
360 : 0.0, 0.0%
BlackBerry : 0.0, 0.0%
Coolpad : 0.0, 0.0%
ZTE : 0.0, 0.0%
chilli : 0.0, 0.0%

error_df=pd.DataFrame(error_list,columns=['brand','mean_absolute_error','mean_proportional_error'])
error_df

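Output:

The DataFrame error_df above shows how accurately the decision tree predicts prices for each brand. The mean proportional error is at most about 15% for every brand (Huawei is the highest at 15.16%), so the model looks fairly accurate. In fact, the key to this model is that it uses the phone's weight: every model of phone differs slightly in weight, enough for a reasonably precise electronic scale to tell them apart, so the decision tree has done little more than memorize the data. I suspect the remaining prediction error comes mostly from different sellers listing the same model at different prices.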
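The memorization effect is easy to see on a held-out split. The sketch below does not use the project's data: it builds a small synthetic data set (the weights, camera counts, and price formula are all invented) and compares the error of an unpruned DecisionTreeRegressor on the rows it was trained on with its error on rows it has never seen.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data: every phone has a distinct weight, like distinct models.
weight = rng.choice(np.arange(100, 400), size=n, replace=False)
cameras = rng.integers(1, 5, size=n)
# Invented price formula plus noise, standing in for seller-to-seller price differences.
price = 10.0 * weight + 300.0 * cameras + rng.normal(0.0, 200.0, size=n)

X = np.column_stack([weight, cameras]).astype(float)
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.25, random_state=0)

# Default settings grow the tree until every training row sits in its own leaf.
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)

train_mae = mean_absolute_error(y_train, model.predict(X_train))
test_mae = mean_absolute_error(y_test, model.predict(X_test))
print('train MAE:', round(train_mae, 2))
print('held-out MAE:', round(test_mae, 2))
```

Because each training row can be isolated by its weight, the training error is essentially zero, while the held-out error is not. Errors measured on the training rows, as in the per-brand table above, therefore say little about how the model would do on phones it has not seen.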

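Project summary

Although the data collection and data cleaning steps are not presented in detail, they took the most time. Jingdong's pages are quite friendly to novice crawler writers, but crawling JavaScript-rendered prices and comment counts for the first time was still challenging. The main difficulty in data cleaning was that most of the data is natural language: the values that are actually missing had to be found, and natural language had to be converted into numbers (for example the number of comments, comments, and the number of rear cameras, rear camera). Excluding the time spent writing this Kaggle kernel, these two steps took about 70% of the total time. Analyzing the collected data was also not as easy as I had imagined. To dig out deeper information, every result returned by a pandas function had to be examined carefully, asking why it came out that way. All in all, I learned a lot from this project, and it was also the first time I completed the whole process of data collection, cleaning, and analysis on my own.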

Origin www.cnblogs.com/shiyanlou/p/11423682.html