Table of contents
Introduction to the competition
1. View data
2. Data processing
3. K-means clustering (elbow method, visualization of clustering results, silhouette coefficient to judge the k value)
4. Analyze the clustering results
Introduction to the competition
This teaching competition, Cluster Analysis of Automotive Products, is the third in the data analysis series initiated by data scientist Dr. Chen.
Background
The topic centers on competitive product analysis: cars are grouped by data clustering, so that for any specified car model its competing models can be found through cluster analysis. Through this competition, learners are encouraged to use the model data to build vehicle profiles and to support data-driven decisions for product positioning and competitor analysis.
Question data
Data source: car_price.csv; the dataset contains 205 cars described by 26 fields.
1. View data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

car_price = pd.read_csv("./car_price.csv")
car_price.head()
car_price.info()
# car_price.duplicated().sum()  # optional: check for duplicate rows
Data characteristics can be divided into three categories:
The first category: car ID attributes (2)
1 car_ID car number
3 CarName car name
The second category: categorical variables (10)
2 symboling insurance risk rating
4 fueltype fuel type
5 aspiration engine aspiration
6 doornumber number of doors
7 carbody body type
8 drivewheel drive wheels
9 enginelocation engine location
15 enginetype engine type
16 cylindernumber number of cylinders
18 fuelsystem fuel system
The third category: continuous numerical variables (14)
10 wheelbase wheelbase
11 carlength car length
12 carwidth car width
13 carheight car height
14 curbweight curb weight (vehicle net weight)
17 enginesize engine size
19 boreratio cylinder cross-sectional area to stroke ratio
20 stroke engine stroke
21 compressionratio compression ratio
22 horsepower horsepower
23 peakrpm rpm at peak power
24 citympg city fuel economy (miles per gallon)
25 highwaympg highway fuel economy (miles per gallon)
26 price price (dependent variable)
View categorical variables
# Extract the column names of the categorical variables
cate_columns=['symboling','fueltype','aspiration','doornumber','carbody','drivewheel','enginelocation','enginetype','fuelsystem','cylindernumber']
# Print the distinct values of each categorical variable
for i in cate_columns:
    print(i)
    print(set(car_price[i]))
symboling {0, 1, 2, 3, -2, -1}
fueltype {'gas', 'diesel'}
aspiration {'std', 'turbo'}
doornumber {'two', 'four'}
carbody {'convertible', 'hatchback', 'wagon', 'sedan', 'hardtop'}
drivewheel {'4wd', 'fwd', 'rwd'}
enginelocation {'rear', 'front'}
enginetype {'ohcv', 'ohcf', 'dohc', 'ohc', 'l', 'rotor', 'dohcv'}
fuelsystem {'idi', 'mfi', '4bbl', '2bbl', 'mpfi', 'spfi', '1bbl', 'spdi'}
cylindernumber {'eight', 'six', 'five', 'two', 'four', 'three', 'twelve'}
View numeric variables
# Extract the continuous numeric feature data (drop 'car_ID' and 'CarName')
car_df=car_price.drop(['car_ID','CarName'],axis=1)
# Inspect the continuous numeric columns and check for outliers
# Descriptive statistics of the data
car_df.describe()
# Box plots of the dataset, to inspect outliers
# Extract the column names of the continuous numeric features
num_cols=car_df.columns.drop(cate_columns)
print(num_cols)
# Draw a box plot for each continuous numeric column
import seaborn as sns
fig=plt.figure(figsize=(12,8))
i=1
for col in num_cols:
    ax=fig.add_subplot(3,5,i)
    sns.boxplot(data=car_df[col],ax=ax)
    i=i+1
    plt.title(col)
plt.subplots_adjust(wspace=0.4,hspace=0.3)
plt.show()
# Correlation coefficients between the numeric features and price
df_corr=car_df.corr()
df_corr['price'].sort_values(ascending = False)
price               1.000000
enginesize          0.874145
curbweight          0.835305
horsepower          0.808139
carwidth            0.759325
carlength           0.682920
wheelbase           0.577816
boreratio           0.553173
carheight           0.119336
stroke              0.079443
compressionratio    0.067984
symboling          -0.079978
peakrpm            -0.085267
citympg            -0.685751
highwaympg         -0.697599
Name: price, dtype: float64
f , ax = plt.subplots(figsize = (7, 7))
plt.title('Correlation of Numeric Features with Price',y=1,size=16)
sns.heatmap(df_corr,square = True, vmax=0.8)
2. Data processing
cylindernumber cylinder number
Convert the English number words to integers
car_price['cylindernumber'] = car_price.cylindernumber.replace({'two':2,'three':3,'four':4,'five':5,'six':6,'eight':8,'twelve':12})  # 'two' appears in the raw data (see the value sets above), so it is included in the mapping
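An optional quick check (not in the original) that the column now contains only numbers:
# Verify the mapping: the column should now contain integers only
print(sorted(car_price['cylindernumber'].unique()))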
CarName car name
# Deduplicate and view CarName
print(car_price['CarName'].drop_duplicates())
# The first word of the car name is the brand; extract it as carBrand
carBrand = car_price['CarName'].str.split(expand=True)[0]
print(set(carBrand))
car_price['carBrand']=carBrand  # keep the brand in the dataframe; it is used in the group-by analysis later
Construct new feature carSize from carlength
# From the descriptive statistics above, car length ranges from 141.1 to 208.1 inches and can be divided into 6 classes
bins=[min(car_df.carlength)-0.01,145.67,169.29,181.10,192.91,200.79,max(car_df.carlength)+0.01]
label=['A00','A0','A','B','C','D']
carSize=pd.cut(car_df.carlength,bins,labels=label)
print(carSize)
# Add the size class to the datasets
car_price['carSize']=carSize
car_df['carSize']=carSize
# Drop carlength, which carSize now encodes
features=car_df.drop(['carlength'],axis=1)
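An optional sanity check on the binning (not in the original): count the models per size class.
# Count how many models fall into each size class
print(carSize.value_counts())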
Handling categorical features
For categorical features whose values carry an inherent order, convert them to a numeric mapping; for those without order (different values simply denote different categories), apply one-hot encoding.
LabelEncoder
# Convert ordinal categorical variables to a numeric mapping
features1=features.copy()
# Use LabelEncoder to encode carSize
from sklearn.preprocessing import LabelEncoder
carSize1=LabelEncoder().fit_transform(features1['carSize'])
# note: LabelEncoder sorts labels alphabetically ('A'<'A0'<'A00'<'B'<...), so the codes do not follow the intended size order exactly
features1['carSize']=carSize1
one-hot
# For nominal categorical features, whose values have no order, use one-hot encoding
cate=features1.select_dtypes(include='object').columns
print(cate)
features1=features1.join(pd.get_dummies(features1[cate])).drop(cate,axis=1)
features1.head()
feature normalization
The raw features must be normalized, each feature separately. For example, suppose feature A ranges over [-1000,1000] and feature B over [-1,1]. In a linear model such as logistic regression, w1*x1 + w2*x2, the value of x1 is so large that x2 contributes almost nothing.
Therefore the features must be normalized, one feature at a time.
- Continuous feature normalization:
1. Mean normalization (mean 0, variance 1)
2. Min-max normalization (scale to [0,1])
3. x' = (2x - max - min)/(max - min), a linear scaling to [-1,1]
- Discrete (categorical) features:
After one-hot encoding, each dimension of the encoded features can be treated as a continuous feature and normalized in the same way, e.g. scaled to [-1,1] or standardized to mean 0 and variance 1.
Because the categorical features were already label-encoded and one-hot encoded, they can be treated as continuous, so all features are normalized uniformly. The sketch below illustrates the three methods.
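A minimal sketch (not part of the original pipeline) of the three normalization methods on a hypothetical toy column 'demo'; note that MinMaxScaler with feature_range=(-1,1) is exactly the linear scaling of method 3:
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
demo = np.array([[-1000.0],[0.0],[500.0],[1000.0]])  # hypothetical single-feature column
print(StandardScaler().fit_transform(demo))                    # 1. mean 0, variance 1
print(MinMaxScaler().fit_transform(demo))                      # 2. scaled to [0,1]
print(MinMaxScaler(feature_range=(-1,1)).fit_transform(demo))  # 3. scaled to [-1,1]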
# Normalize the features
from sklearn import preprocessing
features1=preprocessing.MinMaxScaler().fit_transform(features1)
features1=pd.DataFrame(features1)
features1.head()
PCA dimensionality reduction
# PCA dimensionality reduction on the dataset (retain 99.99% of the variance)
from sklearn.decomposition import PCA
pca=PCA(n_components=0.9999)  # a float in (0,1) keeps just enough components to explain that fraction of the variance, e.g. 0.9 for 90%
features2=pca.fit_transform(features1)
# Explained variance ratio of each principal component (how much information each PC carries)
ratio=pca.explained_variance_ratio_
print('Explained variance ratio of each principal component:',ratio)
# Number of components after reduction
print('Number of components after reduction:',len(ratio))
# Cumulative explained variance ratio
cum_ratio=np.cumsum(ratio)  # cumsum computes the running cumulative sum of an array
print('Cumulative explained variance ratio:',cum_ratio)
Explained variance ratio of each principal component:
[2.34835648e-01 1.89291914e-01 1.11193502e-01 6.41024136e-02 5.90453139e-02
 4.54763783e-02 4.21689429e-02 3.65477617e-02 2.97528000e-02 2.24095237e-02
 1.98458305e-02 1.95803021e-02 1.70780800e-02 1.47611074e-02 1.32208566e-02
 1.19093756e-02 9.01434709e-03 8.74908243e-03 7.28321292e-03 6.65001057e-03
 5.68867886e-03 4.89870846e-03 4.50894857e-03 3.81422315e-03 3.45197486e-03
 2.23759951e-03 2.14676779e-03 1.84529725e-03 1.56025958e-03 1.22067828e-03
 1.12126257e-03 1.03278716e-03 8.30359553e-04 6.87972243e-04 5.63679041e-04
 4.64609849e-04 3.33065301e-04 2.76366954e-04 1.67241531e-04 1.07861538e-04
 7.49681455e-05]
Number of components after reduction: 41
Cumulative explained variance ratio:
[0.23483565 0.42412756 0.53532106 0.59942348 0.65846879 0.70394517
 0.74611411 0.78266187 0.81241467 0.8348242  0.85467003 0.87425033
 0.89132841 0.90608952 0.91931037 0.93121975 0.9402341  0.94898318
 0.95626639 0.9629164  0.96860508 0.97350379 0.97801274 0.98182696
 0.98527894 0.98751654 0.9896633  0.9915086  0.99306886 0.99428954
 0.9954108  0.99644359 0.99727395 0.99796192 0.9985256  0.99899021
 0.99932327 0.99959964 0.99976688 0.99987474 0.99994971]
# Bar chart of each component's variance ratio and line plot of the cumulative ratio
plt.figure(figsize=(8,6))
X=range(1,len(ratio)+1)
Y=ratio
plt.bar(X,Y,edgecolor='black')
plt.plot(X,Y,'r.-')
plt.plot(X,cum_ratio,'b.-')
plt.ylabel('explained_variance_ratio')
plt.xlabel('PCA')
plt.show()
# Keep 8 principal components
pca=PCA(n_components=8)
features3=pca.fit_transform(features1)
# Cumulative explained variance of the retained components (how much information the PCs carry)
print(sum(pca.explained_variance_ratio_))  # 0.7826618733273734
features3
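As an optional check (not in the original; 'n_pc' is a hypothetical name), the number of components needed for any variance target can be read off the cumulative ratio computed earlier:
# Sketch: smallest number of PCs whose cumulative explained variance reaches a target
target = 0.90
n_pc = int(np.argmax(cum_ratio >= target)) + 1  # argmax returns the index of the first True
print(n_pc)  # from the cumulative ratios printed above, 14 components reach 90%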
3. K-means clustering
Elbow method to choose the k value
## Elbow method: within-cluster sum of squared errors (SSE) versus k
# Run K-means for each k, record the corresponding SSE, then plot k against SSE
from sklearn.cluster import KMeans
sse=[]
for i in range(1,15):
    km=KMeans(n_clusters=i,init='k-means++',n_init=10,max_iter=300,random_state=0)
    km.fit(features3)
    sse.append(km.inertia_)
plt.plot(range(1,15),sse,marker='*')
plt.xlabel('n_clusters')
plt.ylabel('distortions')
plt.title("The Elbow Method")
plt.show()
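The elbow is usually read off the plot by eye. As a rough optional aid (not in the original; 'second_diff' and 'elbow_k' are hypothetical names), the point of greatest slope change can be approximated with second differences of the SSE curve:
# Sketch: approximate the elbow as the k with the largest second difference
# of the SSE curve, i.e. the greatest change of slope
second_diff = np.diff(sse, n=2)            # sse[k+1] - 2*sse[k] + sse[k-1]
elbow_k = int(np.argmax(second_diff)) + 2  # offset: the diffs are centered at k=2..13 for range(1,15)
print('approximate elbow at k =', elbow_k)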
Select k=5 and run the clustering
# K-means cluster analysis
kmeans=KMeans(n_clusters=5,init='k-means++',n_init=10,max_iter=300,random_state=0)
kmeans.fit(features3)
lab=kmeans.predict(features3)
print(lab)
Visualization of clustering results
# 2-D scatter plot of the clustering result, labelling each point with its car_ID
plt.figure(figsize=(8,8))
plt.scatter(features3[:,0],features3[:,1],c=lab)
for ii in np.arange(205):
    plt.text(features3[ii,0],features3[ii,1],s=car_price.car_ID[ii])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('K-Means PCA')
plt.show()
# 3-D scatter plot of the clustering result
from mpl_toolkits.mplot3d import Axes3D
plt.figure(figsize=(8,8))
ax=plt.subplot(111,projection='3d')
ax.scatter(features3[:,0],features3[:,1],features3[:,2],c=lab)
# Rotate the view; the clusters are easier to see after rotation
ax.view_init(30,45)
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
plt.show()
Silhouette coefficient to judge the k value
# Draw a silhouette plot and a 3-D scatter plot for each candidate k
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm
for n_clusters in range(2,9):
    fig=plt.figure(figsize=(12,6))
    ax1=fig.add_subplot(121)
    ax2=fig.add_subplot(122,projection='3d')
    ax1.set_xlim([-0.1,1])
    ax1.set_ylim([0,len(features3)+(n_clusters+1)*10])
    km=KMeans(n_clusters=n_clusters,init='k-means++',n_init=10,max_iter=300,random_state=0)
    y_km=km.fit_predict(features3)
    silhouette_avg=silhouette_score(features3,y_km)
    print('n_cluster=',n_clusters,'The average silhouette_score is :',silhouette_avg)
    cluster_labels=np.unique(y_km)
    silhouette_vals=silhouette_samples(features3,y_km,metric='euclidean')
    y_ax_lower=10
    for i in range(n_clusters):
        # sorted silhouette values of the samples in cluster i
        c_silhouette_vals=silhouette_vals[y_km==i]
        c_silhouette_vals.sort()
        cluster_i=c_silhouette_vals.shape[0]
        y_ax_upper=y_ax_lower+cluster_i
        color=cm.nipy_spectral(float(i)/n_clusters)
        ax1.fill_betweenx(range(y_ax_lower,y_ax_upper),0,c_silhouette_vals,edgecolor='none',color=color)
        ax1.text(-0.05,y_ax_lower+0.5*cluster_i,str(i))
        y_ax_lower=y_ax_upper+10
    ax1.set_title('The silhouette plot for the various clusters')
    ax1.set_xlabel('The silhouette coefficient values')
    ax1.set_ylabel('Cluster label')
    ax1.axvline(x=silhouette_avg,color='red',linestyle='--')  # mark the average silhouette score
    ax1.set_yticks([])
    ax1.set_xticks([-0.1,0,0.2,0.4,0.6,0.8,1.0])
    colors=cm.nipy_spectral(y_km.astype(float)/n_clusters)
    ax2.scatter(features3[:,0],features3[:,1],features3[:,2],marker='.',s=30,lw=0,alpha=0.7,c=colors,edgecolor='k')
    # draw the cluster centers as white circles, then number them
    centers=km.cluster_centers_
    ax2.scatter(centers[:,0],centers[:,1],centers[:,2],marker='o',c='white',alpha=1,s=200,edgecolor='k')
    for i,c in enumerate(centers):
        ax2.scatter(c[0],c[1],c[2],marker='$%d$' % i,alpha=1,s=50,edgecolor='k')
    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")
    ax2.view_init(30,45)
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')
    plt.show()
Combining the silhouette plots and 3-D scatter plots: when k is too small, distinct clusters get merged; when k is too large, some clusters are split into several pieces.
When k=2, each cluster is very large and most silhouette coefficients are close to 0, indicating that most instances sit near a boundary and separate clusters have been merged; the model is poor.
When k=3, most instances of cluster '0' fall below the average silhouette score, and a few coefficients are below 0 and tend toward -1, suggesting some instances may have been assigned to the wrong cluster.
When k=4, most instances of cluster '0' fall below the average silhouette score and are close to 0, indicating they sit near a boundary; splitting that cluster in two might be more appropriate.
With k=7 or 8, some clusters are split into several pieces with very close centers, giving very poor models.
When k is 5 or 6, most instances lie beyond the dashed average line and the clusters look reasonable. Ranked by average silhouette score, k=6 comes out slightly ahead of k=5; with k=5, cluster '3' is very large, while with k=6 the clusters are more evenly sized.
In summary, k can be set to 5 or 6 with acceptable clustering quality; considering the balance of the clusters, k=6 is chosen.
# Re-run the clustering with k=6
kmeans=KMeans(n_clusters=6,init='k-means++',n_init=10,max_iter=300,random_state=0)
y_pred=kmeans.fit_predict(features3)
print(y_pred)
# Attach the cluster labels to the original feature data
car_df_km=car_price.copy()
car_df_km['km_result']=y_pred
[4 4 4 1 5 3 5 5 5 0 4 5 4 5 5 5 4 5 3 3 1 3 3 0 1 1 1 0 1 0 3 3 3 3 3 1 1 3 3 1 1 1 3 1 3 1 3 5 5 4 3 3 3 1 1 4 4 4 4 3 1 3 1 2 1 5 2 2 2 2 2 5 4 5 4 0 3 3 3 0 0 3 0 0 0 1 1 1 1 3 2 3 1 1 3 3 1 1 3 1 1 5 5 5 4 4 4 5 2 5 2 5 2 5 2 5 2 5 3 0 1 1 1 1 0 4 4 4 4 4 1 3 3 1 3 1 0 5 3 3 3 1 1 1 1 5 1 1 1 5 3 3 1 1 1 1 1 1 2 2 1 1 1 3 3 4 4 4 4 4 4 4 4 1 2 1 1 1 4 4 5 5 2 3 2 1 1 2 1 3 3 5 2 1 5 5 5 5 5 5 5 5 5 2 5]
4. Analyze the clustering results
# Count the number of models in each cluster
car_df_km.groupby('km_result')['car_ID'].count()
km_result
0    13
1    59
2    20
3    43
4    31
5    39
Name: car_ID, dtype: int64
# Show all columns
pd.set_option('display.max_columns',None)
# Show all rows
pd.set_option('display.max_rows',None)
# Count the models of each brand within each cluster
car_df_km.groupby(by=['km_result','carBrand'])['car_ID'].count()
# Count the models of each cluster within each brand
car_df_km.groupby(by=['carBrand','km_result'])['car_ID'].count()
# Look up the cluster of the model named 'vokswagen' (a misspelling of volkswagen that appears in the raw data)
df=car_df_km.loc[:,['car_ID','CarName','carBrand','km_result']]
print(df.loc[df['CarName'].str.contains("vokswagen")])
# The 'vokswagen' model is 'vokswagen rabbit', car_ID 183, assigned to cluster 2
# View the competitors of the 'vokswagen' model (all models in cluster 2)
df.loc[df['km_result']==2]
# View the volkswagen brand's competing models within each of its clusters
li = [1, 2, 3, 5]  # the volkswagen brand is distributed across clusters 1, 2, 3 and 5
df_volk=df[df['km_result'].isin(li)].sort_values(by=['km_result','carBrand'])
df_volk
Extract the competing models of the 'vokswagen' model from the full dataset
df0 = car_df_km.loc[car_df_km['km_result']==2]
df0.head()
df0_1=df0.drop(['car_ID','CarName','km_result'],axis=1)
# Plot the distribution of every feature of the cluster-2 models
fig=plt.figure(figsize=(20,20))
i=1
for c in df0_1.columns:
    ax=fig.add_subplot(7,4,i)
    if pd.api.types.is_numeric_dtype(df0_1[c]):  # numeric variable
        sns.histplot(df0_1[c],ax=ax)  # histogram
    else:
        sns.barplot(x=df0_1[c].value_counts().index,y=df0_1[c].value_counts(),ax=ax)  # bar chart
    i=i+1
    plt.xlabel('')
    plt.title(c)
plt.subplots_adjust(top=1.2)
plt.show()
Several variables take only a single value within this cluster:
fueltype: {'diesel'}; enginelocation: {'front'}; fuelsystem: {'idi'}
Since these features are shared by every model in the cluster, they can be set aside in the competitor analysis; a small sketch for finding such constant columns follows.
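These constant columns can also be found programmatically; a small optional sketch (not in the original; 'constant_cols' is a hypothetical name):
# Sketch: list the columns that take a single value within cluster 2
constant_cols = [c for c in df0_1.columns if df0_1[c].nunique() == 1]
print(constant_cols)  # per the text above, this should include fueltype, enginelocation and fuelsystem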
# Pivot on model size class, brand, body type and other categorical features
# Compare by model size class
df2=df0.pivot_table(index=['carSize','carbody','carBrand','CarName'])
df2
| carSize | carbody | carBrand | CarName | boreratio | car_ID | carheight | carlength | carwidth | citympg | compressionratio | curbweight | enginesize | highwaympg | horsepower | km_result | peakrpm | price | stroke | symboling | wheelbase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A0 | hatchback | toyota | Toyota Corolla | 3.27 | 160 | 52.8 | 166.3 | 64.4 | 38 | 22.5 | 2275 | 110 | 47 | 56 | 2 | 4500 | 7788.0 | 3.35 | 0 | 95.7 |
| | sedan | nissan | nissan gt-r | 2.99 | 91 | 54.5 | 165.3 | 63.8 | 45 | 21.9 | 2017 | 103 | 50 | 55 | 2 | 4800 | 7099.0 | 3.47 | 1 | 94.5 |
| | | toyota | toyota corona | 3.27 | 159 | 53.0 | 166.3 | 64.4 | 34 | 22.5 | 2275 | 110 | 36 | 56 | 2 | 4500 | 7898.0 | 3.35 | 0 | 95.7 |
| A | sedan | mazda | mazda glc deluxe | 3.39 | 64 | 55.5 | 177.8 | 66.5 | 36 | 22.7 | 2443 | 122 | 42 | 64 | 2 | 4650 | 10795.0 | 3.39 | 0 | 98.8 |
| | | | mazda rx-7 gs | 3.43 | 67 | 54.4 | 175.0 | 66.1 | 31 | 22.0 | 2700 | 134 | 39 | 72 | 2 | 4200 | 18344.0 | 3.64 | 0 | 104.9 |
| | | toyota | toyota celica gt | 3.27 | 175 | 54.9 | 175.6 | 66.5 | 30 | 22.5 | 2480 | 110 | 33 | 73 | 2 | 4500 | 10698.0 | 3.35 | -1 | 102.4 |
| | | volkswagen | vokswagen rabbit | 3.01 | 183 | 55.7 | 171.7 | 65.5 | 37 | 23.0 | 2261 | 97 | 46 | 52 | 2 | 4800 | 7775.0 | 3.40 | 2 | 97.3 |
| | | | Volkswagen model 111 | 3.01 | 185 | 55.7 | 171.7 | 65.5 | 37 | 23.0 | 2264 | 97 | 46 | 52 | 2 | 4800 | 7995.0 | 3.40 | 2 | 97.3 |
| | | | volkswagen rabbit custom | 3.01 | 193 | 55.1 | 180.2 | 66.9 | 33 | 23.0 | 2579 | 97 | 38 | 68 | 2 | 4500 | 13845.0 | 3.40 | 0 | 100.4 |
| | | | volkswagen super beetle | 3.01 | 188 | 55.7 | 171.7 | 65.5 | 37 | 23.0 | 2319 | 97 | 42 | 68 | 2 | 4500 | 9495.0 | 3.40 | 2 | 97.3 |
| B | hardtop | buick | buick century | 3.58 | 70 | 54.9 | 187.5 | 70.3 | 22 | 21.5 | 3495 | 183 | 25 | 123 | 2 | 4350 | 28176.0 | 3.64 | 0 | 106.7 |
| | sedan | buick | buick electra 225 custom | 3.58 | 68 | 56.5 | 190.9 | 70.3 | 22 | 21.5 | 3515 | 183 | 25 | 123 | 2 | 4350 | 25552.0 | 3.64 | -1 | 110.0 |
| | | peugeot | peugeot 304 | 3.70 | 109 | 56.7 | 186.7 | 68.4 | 28 | 21.0 | 3197 | 152 | 33 | 95 | 2 | 4150 | 13200.0 | 3.52 | 0 | 107.9 |
| | | | peugeot 504 | 3.70 | 117 | 56.7 | 186.7 | 68.4 | 28 | 21.0 | 3252 | 152 | 33 | 95 | 2 | 4150 | 17950.0 | 3.52 | 0 | 107.9 |
| | | | peugeot 604sl | 3.70 | 113 | 56.7 | 186.7 | 68.4 | 28 | 21.0 | 3252 | 152 | 33 | 95 | 2 | 4150 | 16900.0 | 3.52 | 0 | 107.9 |
| | | volvo | volvo 246 | 3.01 | 204 | 55.5 | 188.8 | 68.9 | 26 | 23.0 | 3217 | 145 | 27 | 106 | 2 | 4800 | 22470.0 | 3.40 | -1 | 109.1 |
| | wagon | buick | buick century luxus (sw) | 3.58 | 69 | 58.7 | 190.9 | 70.3 | 22 | 21.5 | 3750 | 183 | 25 | 123 | 2 | 4350 | 28248.0 | 3.64 | -1 | 110.0 |
| C | wagon | peugeot | peugeot 504 | 3.70 | 111 | 58.7 | 198.9 | 68.4 | 25 | 21.0 | 3430 | 152 | 25 | 95 | 2 | 4150 | 13860.0 | 3.52 | 0 | 114.2 |
| | | | peugeot 505s turbo diesel | 3.70 | 115 | 58.7 | 198.9 | 68.4 | 25 | 21.0 | 3485 | 152 | 25 | 95 | 2 | 4150 | 17075.0 | 3.52 | 0 | 114.2 |
| D | sedan | buick | buick skyhawk | 3.58 | 71 | 56.3 | 202.6 | 71.7 | 22 | 21.5 | 3770 | 183 | 25 | 123 | 2 | 4350 | 31600.0 | 3.64 | -1 | 115.6 |
The size classes present in cluster 2 are: A0 (small car), A (compact car), B (mid-size car), C (mid-to-large car) and D (luxury car).
The vokswagen rabbit (car_ID 183) is an A-class compact, so its most direct competitors are the other A-class cars in the cluster; there are 7 A-class models in total, including the rabbit itself.
# Extract the A-class cars in cluster 2
df0_A=df0.loc[df0['carSize']=='A']
df0_A
# View the categorical variables of the A-class models in cluster 2
cate_col=df0_A.select_dtypes(include='object').columns
df3=df0_A[cate_col]
df3
# Pivot the features of the A-class cars in cluster 2
df4=df0_A.pivot_table(index=['carBrand','CarName','doornumber','aspiration','drivewheel'])
df4
All 7 A-class cars, including the 'vokswagen rabbit', have 4 cylinders; stroke ranges over 3.4-3.64, rpm at peak power over 4500-4800, compression ratio over 22.5-23.0, body width over 66.1-66.9, body height over 54.4-55.7, and boreratio over 3.01-3.43. On these dimensions the cars are quite similar.
Car buyers generally focus on: model size class (carSize), brand (carBrand), power (horsepower), quality and safety (symboling), fuel economy (citympg, highwaympg), interior space (wheelbase), body (carbody, curbweight), and so on.
Let's extract some of the other key features to compare the 'vokswagen rabbit' against its competitors (a radar-chart sketch follows the list):
Basic information: 'carBrand', 'doornumber', 'curbweight'
Fuel economy: 'highwaympg', 'citympg'
Safety: 'symboling'
Chassis and braking: 'drivewheel'
Power: 'aspiration', 'enginesize', 'horsepower'
Space: 'wheelbase'
Price: 'price'
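As an optional visual summary (a sketch, not part of the original analysis; 'radar_cols' is a hypothetical name), the numeric features above can be min-max scaled within the A-class group and compared on a single radar chart, one polygon per car:
# Sketch: radar chart of the scaled numeric key features of the A-class cars
radar_cols=['curbweight','highwaympg','citympg','symboling','enginesize','horsepower','wheelbase','price']
vals=df0_A[radar_cols].astype(float)
scaled=(vals-vals.min())/(vals.max()-vals.min())  # min-max scale each feature within the group
angles=np.linspace(0,2*np.pi,len(radar_cols),endpoint=False)
angles=np.concatenate([angles,angles[:1]])  # repeat the first angle to close each polygon
ax=plt.subplot(111,polar=True)
for name,row in zip(df0_A['CarName'],scaled.values):
    ax.plot(angles,np.concatenate([row,row[:1]]),label=name)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(radar_cols)
plt.legend(bbox_to_anchor=(1.35,1.05),fontsize=8)
plt.show()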
# Fuel-economy analysis ('citympg','highwaympg')
lab=df0_A['CarName']
fig,ax=plt.subplots(figsize=(10,8))
ax.barh(range(len(lab)),df0_A['highwaympg'],tick_label=lab,color='red')
ax.barh(range(len(lab)),df0_A['citympg'],tick_label=lab,color='blue')
# Annotate the values on the horizontal bars
for i,(highway,city) in enumerate(zip(df0_A['highwaympg'],df0_A['citympg'])):
    ax.text(highway,i,highway,ha='right')
    ax.text(city,i,city,ha='right')
plt.legend(('highwaympg','citympg'), loc='upper right')
plt.title('miles per gallon')
plt.show()
# Analysis of the other 6 features
colors=['yellow', 'blue', 'green','red', 'gray','tan','darkviolet']
col2=['symboling','wheelbase','enginesize','horsepower','curbweight','price']
data=df0_A[col2]
fig=plt.figure(figsize=(10,8))
i=1
for c in data.columns:
    ax=fig.add_subplot(3,2,i)
    plt.barh(range(len(lab)),data[c],tick_label=lab,color=colors)
    for y,x in enumerate(data[c].values):
        plt.text(x,y,"%s" %x)
    i=i+1
    plt.xlabel('')
    plt.title(c)
plt.subplots_adjust(top=1.2,wspace=0.7)
plt.show()
From the bar charts above, compared with its competitors, the 'vokswagen rabbit':
Quality and safety: its insurance risk rating is 2, relatively riskier than the Mazda and Toyota models;
Body space: its wheelbase is the smallest;
Power: its engine size and horsepower are the smallest;
Weight: its curb weight is the lowest;
Price: its price is the lowest.
In summary, compared with its fellow A-class competitors in cluster 2, the 'vokswagen rabbit' has:
Weaknesses: lower quality/safety rating, smaller body space, weaker power and horsepower
Strengths: light body, low fuel consumption, low price (excellent value for money among similar configurations)
Design features: two-door, three-box sedan
Product positioning: "an economical and practical A-class compact sedan for city commuting"
Recommendation: sales promotion can emphasize: (1) outstanding value for money among similarly configured models; (2) low fuel consumption, saving fuel and money for city commuting; (3) compact body, easy to park; (4) distinctive two-door design.