[Data Analysis] Data Analysis Expert Competition 3: Cluster Analysis of Automobile Products

Table of contents

Introduction to the competition

Background

Question data

1. View data

 View categorical variables

View numeric variables 

 2. Data processing

Handling categorical features

LabelEncoder

one-hot

feature normalization

PCA dimensionality reduction 

3. K-means clustering

Elbow method to choose the k value 

Visualization of clustering results 

Silhouette coefficient to judge the k value 

4. Analyze the clustering results 


Introduction to the competition

This teaching competition is the third in the data analysis series initiated by data scientist Dr. Chen: Cluster Analysis of Automobile Products.

Background

The competition is framed as a competing-product analysis: cars are grouped through data clustering, and for a specified car model, its competing models can be found within its cluster. Through this question, learners are encouraged to use model data to build model profiles and to support data-driven decisions for product positioning and competitive analysis.

Question data

Data source: car_price.csv. The dataset contains 205 cars described by 26 fields.

1. View data

import pandas as pd
import numpy as np   # used later for cumsum, arange, diff
import matplotlib.pyplot as plt


car_price = pd.read_csv("./car_price.csv")
car_price.head()

car_price.info()
# car_price.duplicated().sum()  # optional: check for duplicate rows

Data characteristics can be divided into three categories:

The first category: ID attributes

1 car_ID car number

3 CarName car name

The second category: categorical variables (10)

2 symboling insurance risk rating

4 fueltype fuel type

5 aspiration engine suction form

6 doornumber car door number

7 carbody body type

8 drivewheel drive wheel

9 enginelocation engine location

15 enginetype engine type

16 cylindernumber cylinder number

18 fuelsystem fuel system

The third category: continuous numerical variables (14)

10 wheelbase

11 carlength car length

12 carwidth car width

13 carheight

14 curbweight curb weight (vehicle net weight)

17 enginesize engine size

19 boreratio ratio of cylinder cross-sectional area to stroke

20 stroke engine stroke

21 compressionratio compression ratio

22 horsepower

23 peakrpm engine speed at maximum power

24 citympg city fuel economy (miles per gallon)

25 highwaympg highway fuel economy (miles per gallon)

26 price price (dependent variable)

 View categorical variables

# Extract the column names of the categorical variables
cate_columns=['symboling','fueltype','aspiration','doornumber','carbody','drivewheel','enginelocation','enginetype','fuelsystem','cylindernumber']


# Print the distinct values of each categorical variable
for i in cate_columns:
    print(i)
    print(set(car_price[i]))
symboling
{0, 1, 2, 3, -2, -1}
fueltype
{'gas', 'diesel'}
aspiration
{'std', 'turbo'}
doornumber
{'two', 'four'}
carbody
{'convertible', 'hatchback', 'wagon', 'sedan', 'hardtop'}
drivewheel
{'4wd', 'fwd', 'rwd'}
enginelocation
{'rear', 'front'}
enginetype
{'ohcv', 'ohcf', 'dohc', 'ohc', 'l', 'rotor', 'dohcv'}
fuelsystem
{'idi', 'mfi', '4bbl', '2bbl', 'mpfi', 'spfi', '1bbl', 'spdi'}
cylindernumber
{'eight', 'six', 'five', 'two', 'four', 'three', 'twelve'}
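
Beyond the distinct values, it can help to see how the categories are distributed before encoding them. A small optional check (not part of the original pipeline):

# Frequency of each category, not just the set of distinct values
for i in cate_columns:
    print(car_price[i].value_counts(), '\n')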

View numeric variables 

# Extract the continuous numeric features (drop 'car_ID' and 'CarName')
car_df=car_price.drop(['car_ID','CarName'],axis=1)
# Review the continuous numeric variables and check for outliers
# Descriptive statistics
car_df.describe()

# Draw box plots of the dataset to check for outliers

# Extract the column names of the continuous numeric features
num_cols=car_df.columns.drop(cate_columns)
print(num_cols)

# Box plots of the continuous numeric features, to check for outliers
import seaborn as sns

fig=plt.figure(figsize=(12,8))
i=1
for col in num_cols:
    ax=fig.add_subplot(3,5,i)
    sns.boxplot(data=car_df[col],ax=ax)
    i=i+1
    plt.title(col)

plt.subplots_adjust(wspace=0.4,hspace=0.3)
plt.show()

# Correlation of the numeric features with price (restricted to the numeric columns)
df_corr=car_df[num_cols].corr()
df_corr['price'].sort_values(ascending = False)

price               1.000000
enginesize          0.874145
curbweight          0.835305
horsepower          0.808139
carwidth            0.759325
carlength           0.682920
wheelbase           0.577816
boreratio           0.553173
carheight           0.119336
stroke              0.079443
compressionratio    0.067984
symboling          -0.079978
peakrpm            -0.085267
citympg            -0.685751
highwaympg         -0.697599
Name: price, dtype: float64
f, ax = plt.subplots(figsize=(7, 7))

plt.title('Correlation of Numeric Features with Price',y=1,size=16)

sns.heatmap(df_corr,square = True,  vmax=0.8)

2. Data processing

cylindernumber cylinder number

The values are English number words; map them to integers. Note that 'two' also appears in the data and must be included in the mapping:

car_price['cylindernumber'] = car_price.cylindernumber.replace({'two':2,'three':3,'four':4,'five':5,'six':6,'eight':8,'twelve':12})
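
A quick sanity check that the mapping covered every value (the set should now contain only integers):

# All cylinder counts should now be numeric
print(set(car_price['cylindernumber']))  # expected: {2, 3, 4, 5, 6, 8, 12}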

CarName car name

# Deduplicate and view CarName
print(car_price['CarName'].drop_duplicates())

carBrand = car_price['CarName'].str.split(expand=True)[0]  # the first word of the car name is the brand
car_price['carBrand'] = carBrand  # keep the brand as a column; it is used in the cluster analysis below
print(set(carBrand))
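
Note that the car names in this dataset contain spelling variants; the misspelled 'vokswagen' appears later in this analysis. A hypothetical cleanup step is sketched below, but it is deliberately not applied here so that the car names stay traceable to the raw data:

# Optional (not applied): normalize brand spelling variants,
# e.g. the 'vokswagen' typo that is searched for later in this analysis
# carBrand = carBrand.str.lower().replace({'vokswagen': 'volkswagen'})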

Construct a new feature carSize from carlength

# From the descriptive statistics above, car length ranges from 141.1 to 208.1 inches and can be divided into 6 classes
bins=[min(car_df.carlength)-0.01,145.67,169.29,181.10,192.91,200.79,max(car_df.carlength)+0.01]
label=['A00','A0','A','B','C','D']
carSize=pd.cut(car_df.carlength,bins,labels=label)
print(carSize)

# Add the size class to both datasets
car_price['carSize']=carSize
car_df['carSize']=carSize

# Drop carlength, which is now encoded by carSize
features=car_df.drop(['carlength'],axis=1)

Handling categorical features

Categorical features whose values carry an ordinal (size) meaning are mapped to integers; features whose values carry no order (different values simply denote different categories) are one-hot encoded. 

LabelEncoder

# Convert categorical variables whose values have an ordinal meaning into numeric mappings
features1=features.copy()

# Use LabelEncoder to encode the ordinal carSize labels
from sklearn.preprocessing import LabelEncoder
carSize1=LabelEncoder().fit_transform(features1['carSize'])
features1['carSize']=carSize1
carSize1
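
One caveat: LabelEncoder assigns codes in lexicographic order ('A' < 'A0' < 'A00' < 'B' < 'C' < 'D'), which does not match the intended size order A00 < A0 < A < B < C < D. If a strictly ordinal encoding matters, an explicit map is safer; a minimal sketch (not applied here, to keep the results reproducible):

# Explicit ordinal map that preserves the intended size order
size_order = {'A00': 0, 'A0': 1, 'A': 2, 'B': 3, 'C': 4, 'D': 5}
# features1['carSize'] = features['carSize'].map(size_order).astype(int)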

one-hot

# Nominal categorical features, whose values carry no order, are one-hot encoded
cate=features1.select_dtypes(include='object').columns
print(cate)

features1=features1.join(pd.get_dummies(features1[cate])).drop(cate,axis=1)
features1.head()

feature normalization

The raw features must be normalized feature by feature. For example, if feature A ranges over [-1000, 1000] and feature B over [-1, 1], then in a linear model such as logistic regression (w1*x1 + w2*x2) the large values of x1 dominate and x2 contributes almost nothing. Each feature is therefore normalized separately.

  • Continuous feature normalization:

1. Mean-variance normalization (mean 0, variance 1)

2. Min-max normalization (to [0, 1])

3. x' = (2x - max - min) / (max - min), linear scaling to [-1, 1]

  • Discrete features (categorical features):

After one-hot encoding, each resulting dimension can be treated as a continuous feature and normalized in the same way as the continuous features, e.g. to [-1, 1] or to mean 0 and variance 1.

Because the categorical features have already been label- and one-hot encoded, they can now be treated as continuous features, so all features are normalized uniformly.
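
For reference, the three continuous-feature options listed above correspond directly to standard sklearn scalers; a minimal sketch (the pipeline below uses MinMaxScaler):

from sklearn.preprocessing import StandardScaler, MinMaxScaler

x_std = StandardScaler().fit_transform(features1)                     # mean 0, variance 1
x_01  = MinMaxScaler().fit_transform(features1)                       # scale to [0, 1]
x_sym = MinMaxScaler(feature_range=(-1, 1)).fit_transform(features1)  # scale to [-1, 1]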

# Normalize the features
from sklearn import preprocessing

features1=preprocessing.MinMaxScaler().fit_transform(features1)
features1=pd.DataFrame(features1)
features1.head()

PCA dimensionality reduction 

# PCA dimensionality reduction (retain 99.99% of the variance)
from sklearn.decomposition import PCA
pca=PCA(n_components=0.9999)  # a float in (0, 1) sets the fraction of variance to keep, here 99.99%
features2=pca.fit_transform(features1)

# Explained variance ratio of each principal component (how much information each PC carries)
ratio=pca.explained_variance_ratio_
print('Explained variance ratio of each principal component:',ratio)

# Number of components after reduction
print('Number of components after reduction:',len(ratio))

# Cumulative explained variance ratio
cum_ratio=np.cumsum(ratio)  # cumsum computes the running total of an array
print('Cumulative explained variance ratio:',cum_ratio)
Explained variance ratio of each principal component: [2.34835648e-01 1.89291914e-01 1.11193502e-01 6.41024136e-02
 5.90453139e-02 4.54763783e-02 4.21689429e-02 3.65477617e-02
 2.97528000e-02 2.24095237e-02 1.98458305e-02 1.95803021e-02
 1.70780800e-02 1.47611074e-02 1.32208566e-02 1.19093756e-02
 9.01434709e-03 8.74908243e-03 7.28321292e-03 6.65001057e-03
 5.68867886e-03 4.89870846e-03 4.50894857e-03 3.81422315e-03
 3.45197486e-03 2.23759951e-03 2.14676779e-03 1.84529725e-03
 1.56025958e-03 1.22067828e-03 1.12126257e-03 1.03278716e-03
 8.30359553e-04 6.87972243e-04 5.63679041e-04 4.64609849e-04
 3.33065301e-04 2.76366954e-04 1.67241531e-04 1.07861538e-04
 7.49681455e-05]
Number of components after reduction: 41
Cumulative explained variance ratio: [0.23483565 0.42412756 0.53532106 0.59942348 0.65846879 0.70394517
 0.74611411 0.78266187 0.81241467 0.8348242  0.85467003 0.87425033
 0.89132841 0.90608952 0.91931037 0.93121975 0.9402341  0.94898318
 0.95626639 0.9629164  0.96860508 0.97350379 0.97801274 0.98182696
 0.98527894 0.98751654 0.9896633  0.9915086  0.99306886 0.99428954
 0.9954108  0.99644359 0.99727395 0.99796192 0.9985256  0.99899021
 0.99932327 0.99959964 0.99976688 0.99987474 0.99994971]
# Bar chart of the per-component variance ratio and line chart of the cumulative ratio
plt.figure(figsize=(8,6))
X=range(1,len(ratio)+1)
Y=ratio
plt.bar(X,Y,edgecolor='black')
plt.plot(X,Y,'r.-')
plt.plot(X,cum_ratio,'b.-')
plt.ylabel('explained_variance_ratio')
plt.xlabel('PCA')
plt.show()

# Keep 8 principal components
pca=PCA(n_components=8)
features3=pca.fit_transform(features1)

# Cumulative variance ratio of the retained components (how much information they carry)
print(sum(pca.explained_variance_ratio_))  # 0.7826618733273734
features3
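
Keeping 8 components retains about 78% of the variance, which is a judgment call. If a specific variance target is wanted instead, the smallest sufficient number of components can be read off the cum_ratio array computed above; a small sketch:

# Smallest number of components whose cumulative explained variance reaches a target
target = 0.90
n_keep = int(np.argmax(cum_ratio >= target)) + 1
print(n_keep)  # with the ratios printed above, 14 components reach 90%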

3. K-means clustering

Elbow method to choose the k value 

## Elbow method: within-cluster sum of squared errors (SSE) for each k
# Cluster for each k, record the corresponding SSE, then plot k against SSE
from sklearn.cluster import KMeans

sse=[]
for i in range(1,15):
    km=KMeans(n_clusters=i,init='k-means++',n_init=10,max_iter=300,random_state=0)
    km.fit(features3)
    sse.append(km.inertia_)

plt.plot(range(1,15),sse,marker='*')
plt.xlabel('n_clusters')
plt.ylabel('distortions')
plt.title("The Elbow Method")
plt.show()
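
The elbow can be hard to judge by eye. One optional aid, reusing the sse list from above, is to print the relative SSE reduction gained by each additional cluster and look for where the gains flatten:

# Relative SSE reduction when going from k-1 to k clusters
drops = -np.diff(sse) / np.array(sse[:-1])
for k, d in zip(range(2, 15), drops):
    print('k=%d: SSE drop %.1f%%' % (k, d * 100))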

Based on the elbow, select k = 5 for an initial clustering

# K-means clustering with k = 5
kmeans=KMeans(n_clusters=5,init='k-means++',n_init=10,max_iter=300,random_state=0)
kmeans.fit(features3)
lab=kmeans.predict(features3)
print(lab)

Visualization of clustering results 

# 2-D scatter plot of the clustering result
plt.figure(figsize=(8,8))
plt.scatter(features3[:,0],features3[:,1],c=lab)

# Label each point with its car_ID
for ii in np.arange(205):
    plt.text(features3[ii,0],features3[ii,1],s=car_price.car_ID[ii])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('K-Means PCA')
plt.show()

# 3-D scatter plot of the clustering result
from mpl_toolkits.mplot3d import Axes3D
plt.figure(figsize=(8,8))
ax=plt.subplot(111,projection='3d')
ax.scatter(features3[:,0],features3[:,1],features3[:,2],c=lab)
# Rotate the view; the clusters are easier to see after the rotation
ax.view_init(30,45)
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
plt.show()

Silhouette coefficient to judge the k value 

# Silhouette plots and 3-D scatter plots for a range of k
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm
from mpl_toolkits.mplot3d import Axes3D

for n_clusters in range(2,9):
    fig=plt.figure(figsize=(12,6))
    ax1=fig.add_subplot(121)
    ax2=fig.add_subplot(122,projection='3d')
    
    ax1.set_xlim([-0.1,1])
    ax1.set_ylim([0,len(features3)+(n_clusters+1)*10])
    km=KMeans(n_clusters=n_clusters,init='k-means++',n_init=10,max_iter=300,random_state=0)
    y_km=km.fit_predict(features3)
    silhouette_avg=silhouette_score(features3,y_km)
    print('n_cluster=',n_clusters,'The average silhouette_score is :',silhouette_avg)

    cluster_labels=np.unique(y_km)   
    silhouette_vals=silhouette_samples(features3,y_km,metric='euclidean')
    y_ax_lower=10
    for i in range(n_clusters):
        c_silhouette_vals=silhouette_vals[y_km==i]
        c_silhouette_vals.sort()
        cluster_i=c_silhouette_vals.shape[0]
        y_ax_upper=y_ax_lower+cluster_i
        color=cm.nipy_spectral(float(i)/n_clusters)
        ax1.fill_betweenx(range(y_ax_lower,y_ax_upper),0,c_silhouette_vals,edgecolor='none',color=color)
        ax1.text(-0.05,y_ax_lower+0.5*cluster_i,str(i))
        y_ax_lower=y_ax_upper+10
    
    ax1.set_title('The silhouette plot for the various clusters')
    ax1.set_xlabel('The silhouette coefficient values')
    ax1.set_ylabel('Cluster label')

    ax1.axvline(x=silhouette_avg,color='red',linestyle='--')

    ax1.set_yticks([])
    ax1.set_xticks([-0.1,0,0.2,0.4,0.6,0.8,1.0])

    colors=cm.nipy_spectral(y_km.astype(float)/n_clusters)
    ax2.scatter(features3[:,0],features3[:,1],features3[:,2],marker='.',s=30,lw=0,alpha=0.7,c=colors,edgecolor='k')

    centers=km.cluster_centers_
    ax2.scatter(centers[:,0],centers[:,1],centers[:,2],marker='o',c='white',alpha=1,s=200,edgecolor='k')

    for i,c in enumerate(centers):
        ax2.scatter(c[0],c[1],c[2],marker='$%d$' % i,alpha=1,s=50,edgecolor='k')
        
    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")
    ax2.view_init(30,45)

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')
plt.show()


Combining the silhouette plots and 3-D scatter plots: when k is too small, separate clusters get merged; when k is too large, some clusters are split into several pieces.

When k=2, each cluster is very large and most silhouette coefficients are close to 0, indicating that most instances sit near a cluster boundary and that separate clusters have been merged; the model fits poorly.

When k=3, most instances of cluster '0' fall below the average silhouette score, and a few coefficients are negative, tending toward -1, suggesting that some instances have been assigned to the wrong cluster.

When k=4, most instances of cluster '0' fall below the average silhouette score and are close to 0, indicating that they sit near a boundary; splitting this cluster in two might be more appropriate.

With k=7 or 8, some clusters are split into pieces with very close centers, giving a very poor model.

When k is 5 or 6, most instances lie beyond the dashed line (the average score) and the clusters look reasonable. Ranked by average silhouette score, k=6 is slightly better than k=5; with k=5, cluster '3' is very large, while with k=6 the clusters are more evenly sized.

In summary, k can be chosen as 5 or 6 with an acceptable clustering model; considering cluster balance, k=6 is selected.

# Re-cluster with k = 6
kmeans=KMeans(n_clusters=6,init='k-means++',n_init=10,max_iter=300,random_state=0)
y_pred=kmeans.fit_predict(features3)
print(y_pred)

[4 4 4 1 5 3 5 5 5 0 4 5 4 5 5 5 4 5 3 3 1 3 3 0 1 1 1 0 1 0 3 3 3 3 3 1 1
 3 3 1 1 1 3 1 3 1 3 5 5 4 3 3 3 1 1 4 4 4 4 3 1 3 1 2 1 5 2 2 2 2 2 5 4 5
 4 0 3 3 3 0 0 3 0 0 0 1 1 1 1 3 2 3 1 1 3 3 1 1 3 1 1 5 5 5 4 4 4 5 2 5 2
 5 2 5 2 5 2 5 3 0 1 1 1 1 0 4 4 4 4 4 1 3 3 1 3 1 0 5 3 3 3 1 1 1 1 5 1 1
 1 5 3 3 1 1 1 1 1 1 2 2 1 1 1 3 3 4 4 4 4 4 4 4 4 1 2 1 1 1 4 4 5 5 2 3 2
 1 1 2 1 3 3 5 2 1 5 5 5 5 5 5 5 5 5 2 5]

# Append the cluster labels to the original data
car_df_km=car_price.copy()
car_df_km['km_result']=y_pred

4. Analyze the clustering results 

# Count the number of models in each cluster
car_df_km.groupby('km_result')['car_ID'].count()
km_result
0    13
1    59
2    20
3    43
4    31
5    39
Name: car_ID, dtype: int64
# Show all columns
pd.set_option('display.max_columns',None)
# Show all rows
pd.set_option('display.max_rows',None)

# Number of models per brand within each cluster
car_df_km.groupby(by=['km_result','carBrand'])['car_ID'].count()

# Number of models per cluster within each brand
car_df_km.groupby(by=['carBrand','km_result'])['car_ID'].count()
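
The same counts are easier to scan as a contingency table; an equivalent optional view:

# Brand x cluster contingency table (rows: brands, columns: cluster labels)
print(pd.crosstab(car_df_km['carBrand'], car_df_km['km_result']))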

# Find the cluster of the model named 'vokswagen' (a typo for volkswagen in the source data)
df=car_df_km.loc[:,['car_ID','CarName','carBrand','km_result']]
print(df.loc[df['CarName'].str.contains("vokswagen")])
# The 'vokswagen' model is 'vokswagen rabbit', car_ID 183, assigned to cluster 2

# View the competitors of the 'vokswagen' model (all models in cluster 2)
df.loc[df['km_result']==2]

# View the volkswagen brand's competitors within each cluster it appears in

li = [1, 2, 3, 5]  # the volkswagen brand is distributed across clusters 1, 2, 3 and 5
df_volk=df[df['km_result'].isin(li)].sort_values(by=['km_result','carBrand'])
df_volk

Extract the competing models of the 'vokswagen rabbit' from the full dataset

df0 = car_df_km.loc[car_df_km['km_result']==2]
df0.head()
df0_1=df0.drop(['car_ID','CarName','km_result'],axis=1)

# Distribution of every feature for the cluster-2 models
fig=plt.figure(figsize=(20,20))
i=1
for c in df0_1.columns:
    ax=fig.add_subplot(7,4,i)
    if df0_1[c].dtypes=='int' or df0_1[c].dtypes=='float':  # numeric variable
        sns.histplot(df0_1[c],ax=ax)  # histogram
    else:
        sns.barplot(x=df0_1[c].value_counts().index,y=df0_1[c].value_counts(),ax=ax)  # bar chart
    i=i+1
    plt.xlabel('')
    plt.title(c)
plt.subplots_adjust(top=1.2)
plt.show()

Three variables take only a single value within cluster 2:
fueltype: {'diesel'}; enginelocation: {'front'}; fuelsystem: {'idi'}

Since these features are common to every model in the cluster, they can be set aside in the competing-product analysis.
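
Such single-valued columns can also be found programmatically rather than by reading the plots; a small sketch on the cluster-2 frame df0_1:

# Columns that take exactly one value within cluster 2
n_unique = df0_1.nunique()
print(n_unique[n_unique == 1].index.tolist())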

# Pivot the data by model size class, body type and brand
# Compare by model size class
df2=df0.pivot_table(index=['carSize','carbody','carBrand','CarName'])
df2
df2 (cluster-2 pivot; index: carSize, carbody, carBrand, CarName; value columns, in order:
boreratio, car_ID, carheight, carlength, carwidth, citympg, compressionratio, curbweight,
enginesize, highwaympg, horsepower, km_result, peakrpm, price, stroke, symboling, wheelbase)

A0  hatchback  toyota      Toyota Corolla             3.27  160  52.8  166.3  64.4  38  22.5  2275  110  47   56  2  4500   7788.0  3.35   0   95.7
A0  sedan      nissan      nissan gt-r                2.99   91  54.5  165.3  63.8  45  21.9  2017  103  50   55  2  4800   7099.0  3.47   1   94.5
A0  sedan      toyota      toyota corona              3.27  159  53.0  166.3  64.4  34  22.5  2275  110  36   56  2  4500   7898.0  3.35   0   95.7
A   sedan      mazda       mazda glc deluxe           3.39   64  55.5  177.8  66.5  36  22.7  2443  122  42   64  2  4650  10795.0  3.39   0   98.8
A   sedan      mazda       mazda rx-7 gs              3.43   67  54.4  175.0  66.1  31  22.0  2700  134  39   72  2  4200  18344.0  3.64   0  104.9
A   sedan      toyota      toyota celica gt           3.27  175  54.9  175.6  66.5  30  22.5  2480  110  33   73  2  4500  10698.0  3.35  -1  102.4
A   sedan      volkswagen  vokswagen rabbit           3.01  183  55.7  171.7  65.5  37  23.0  2261   97  46   52  2  4800   7775.0  3.40   2   97.3
A   sedan      volkswagen  Volkswagen model 111       3.01  185  55.7  171.7  65.5  37  23.0  2264   97  46   52  2  4800   7995.0  3.40   2   97.3
A   sedan      volkswagen  volkswagen rabbit custom   3.01  193  55.1  180.2  66.9  33  23.0  2579   97  38   68  2  4500  13845.0  3.40   0  100.4
A   sedan      volkswagen  volkswagen super beetle    3.01  188  55.7  171.7  65.5  37  23.0  2319   97  42   68  2  4500   9495.0  3.40   2   97.3
B   hardtop    buick       buick century              3.58   70  54.9  187.5  70.3  22  21.5  3495  183  25  123  2  4350  28176.0  3.64   0  106.7
B   sedan      buick       buick electra 225 custom   3.58   68  56.5  190.9  70.3  22  21.5  3515  183  25  123  2  4350  25552.0  3.64  -1  110.0
B   sedan      peugeot     peugeot 304                3.70  109  56.7  186.7  68.4  28  21.0  3197  152  33   95  2  4150  13200.0  3.52   0  107.9
B   sedan      peugeot     peugeot 504                3.70  117  56.7  186.7  68.4  28  21.0  3252  152  33   95  2  4150  17950.0  3.52   0  107.9
B   sedan      peugeot     peugeot 604sl              3.70  113  56.7  186.7  68.4  28  21.0  3252  152  33   95  2  4150  16900.0  3.52   0  107.9
B   sedan      volvo       volvo 246                  3.01  204  55.5  188.8  68.9  26  23.0  3217  145  27  106  2  4800  22470.0  3.40  -1  109.1
B   wagon      buick       buick century luxus (sw)   3.58   69  58.7  190.9  70.3  22  21.5  3750  183  25  123  2  4350  28248.0  3.64  -1  110.0
C   wagon      peugeot     peugeot 504                3.70  111  58.7  198.9  68.4  25  21.0  3430  152  25   95  2  4150  13860.0  3.52   0  114.2
C   wagon      peugeot     peugeot 505s turbo diesel  3.70  115  58.7  198.9  68.4  25  21.0  3485  152  25   95  2  4150  17075.0  3.52   0  114.2
D   sedan      buick       buick skyhawk              3.58   71  56.3  202.6  71.7  22  21.5  3770  183  25  123  2  4350  31600.0  3.64  -1  115.6

The size classes of the models in cluster 2 span: A0 small car, A compact car, B mid-size car, C mid-to-large car, and D luxury car.
The 'vokswagen rabbit' (car_ID 183) is an A-class compact car, so its most direct competitors are the other A-class models in the cluster.

# Extract the A-class cars in cluster 2
df0_A=df0.loc[df0['carSize']=='A']
df0_A

# View the categorical variables of the A-class models in cluster 2
cate_col=df0_A.select_dtypes(include='object').columns
df3=df0_A[cate_col]
df3



# Pivot the features of the A-class cars in cluster 2
df4=df0_A.pivot_table(index=['carBrand','CarName','doornumber','aspiration','drivewheel'])
df4

The 7 A-class cars, including the 'vokswagen rabbit', all have 4 cylinders; from the pivot table, stroke ranges from 3.35 to 3.64, engine speed at maximum power from 4200 to 4800 rpm, compression ratio from 22.0 to 23.0, body width from 65.5 to 66.9, body height from 54.4 to 55.7, and bore ratio from 3.01 to 3.43. On these features the models are fairly similar.

Car buyers generally focus on: model size class (carSize), brand (carBrand), power (horsepower), quality and safety (symboling), fuel consumption (citympg, highwaympg), interior space (wheelbase), body and weight (carbody, curbweight), and so on.

Next, extract the remaining key features to examine how the 'vokswagen rabbit' differs from its competitors; they are grouped as follows (and collected into a single frame in the sketch after this list):

Basic information: 'carBrand', 'doornumber', 'curbweight'

Fuel consumption: 'highwaympg', 'citympg'

Safety: 'symboling'

Chassis and braking: 'drivewheel'

Power: 'aspiration', 'enginesize', 'horsepower'

Space: 'wheelbase'

Price: 'price'
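
As a convenience, these key features can be collected into a single comparison frame before plotting; a minimal sketch:

# Key comparison features for the A-class cars, indexed by car name
key_cols = ['carBrand', 'doornumber', 'curbweight', 'highwaympg', 'citympg',
            'symboling', 'drivewheel', 'aspiration', 'enginesize', 'horsepower',
            'wheelbase', 'price']
print(df0_A.set_index('CarName')[key_cols])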

# Fuel-consumption analysis ('citympg', 'highwaympg')
lab=df0_A['CarName']

fig,ax=plt.subplots(figsize=(10,8))
ax.barh(range(len(lab)),df0_A['highwaympg'],tick_label=lab,color='red')
ax.barh(range(len(lab)),df0_A['citympg'],tick_label=lab,color='blue')

# Annotate the horizontal bars with their values
for i,(highway,city) in enumerate(zip(df0_A['highwaympg'],df0_A['citympg'])):
    ax.text(highway,i,highway,ha='right')
    ax.text(city,i,city,ha='right')

plt.legend(('highwaympg','citympg'), loc='upper right')
plt.title('miles per gallon')
plt.show()

# Analysis of the other 6 features
colors=['yellow','blue','green','red','gray','tan','darkviolet']  # one color per A-class car
col2=['symboling','wheelbase','enginesize','horsepower','curbweight','price']
data=df0_A[col2]

fig=plt.figure(figsize=(10,8))
i=1
for c in data.columns:
    ax=fig.add_subplot(3,2,i)
    plt.barh(range(len(lab)),data[c],tick_label=lab,color=colors)
    for y,x in enumerate(data[c].values):
        plt.text(x,y,"%s" %x)
    i=i+1
    plt.xlabel('')
    plt.title(c)
plt.subplots_adjust(top=1.2,wspace=0.7)
plt.show()

From the bar charts above, compared with the other competitors the 'vokswagen rabbit':

Quality and safety: its insurance risk rating is 2, relatively riskier than the mazda and toyota models;

Body space: it has the smallest wheelbase;

Power: it has the smallest engine size and horsepower;

Weight: it has the lowest curb weight;

Price: it is the cheapest.

To sum up, compared with the other A-class competitors in cluster 2, the 'vokswagen rabbit' shows:

Disadvantages: a lower quality/safety rating, a smaller body, and weaker engine power.

Advantages: a light body, low fuel consumption and a low price (excellent value for money among similar configurations).

Design feature: a two-door sedan.

Product positioning: "an economical and practical A-class compact sedan for city commuting".

Suggestions: sales promotion can emphasize (1) outstanding value for money among similarly configured models; (2) low fuel consumption, saving money on city commutes; (3) a compact body that is easy to park; (4) the distinctive two-door design.


Origin blog.csdn.net/m0_51933492/article/details/127397390