❝This post covers the four most commonly used "Groups" plots.❞
Contents
7. Groups Plots
48. Dendrogram
A dendrogram shows the level of similarity within and between the groups formed by clustering.
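import pandas as pd
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc

# Import Data
df = pd.read_csv('./datasets/USArrests.csv')

# Ward linkage on the four numeric columns
features = df[['Murder', 'Assault', 'UrbanPop', 'Rape']]
linkage_matrix = shc.linkage(features, method='ward')

# Plot
plt.figure(figsize=(12, 8), dpi=80)
plt.title("USArrests Dendrogram", fontsize=18)
# Subtrees whose merges all fall below a distance of 200 get their own color
dend = shc.dendrogram(linkage_matrix,
                      labels=df.State.values,
                      color_threshold=200)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()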
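The dendrogram only colors the tree. To get actual group labels, the same linkage matrix can be cut at a chosen height. The following is a minimal sketch of that idea using `scipy.cluster.hierarchy.fcluster`. The synthetic points, the random seed, and the cut height of 10 are assumptions made for illustration and are not taken from USArrests.

```python
import numpy as np
import scipy.cluster.hierarchy as shc

# Synthetic stand-in data (an assumption, not USArrests): three tight,
# well-separated blobs of ten 2-D points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.3, size=(10, 2)) for c in centers])

# Ward linkage, the same method used for the dendrogram.
Z = shc.linkage(points, method='ward')

# Cut the tree at a merge distance of 10: links above that height are
# severed, and each remaining subtree becomes one flat cluster.
labels = shc.fcluster(Z, t=10.0, criterion='distance')

print(len(np.unique(labels)), "flat clusters")
```

With blobs this far apart, the cut falls between the within-blob merges and the between-blob merges, so each blob ends up as one cluster. On real data the cut height has to be read off the dendrogram itself.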
49. Cluster Plot
Points are clustered by distance, and the members of each cluster are encircled.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from scipy.spatial import ConvexHull

# Import Data
df = pd.read_csv('./datasets/USArrests.csv')

# Agglomerative Clustering
# Ward linkage always uses Euclidean distance, so no distance argument is
# passed (the `affinity` keyword was deprecated in scikit-learn 1.2 and
# later removed).
cluster = AgglomerativeClustering(n_clusters=5, linkage='ward')
labels = cluster.fit_predict(df[['Murder', 'Assault', 'UrbanPop', 'Rape']])

# Plot
plt.figure(figsize=(12, 8), dpi=80)
plt.scatter(df['Murder'], df['Assault'], c=labels, cmap='tab10')

# Encircle
def encircle(x, y, ax=None, **kw):
    """Draw the convex hull of the points (x, y) as a polygon patch."""
    if ax is None:
        ax = plt.gca()
    p = np.c_[x, y]
    hull = ConvexHull(p)
    poly = plt.Polygon(p[hull.vertices, :], **kw)
    ax.add_patch(poly)

# Draw polygon surrounding vertices, one fill color per cluster.
# alpha keeps the scatter points under each polygon visible.
colors = ["#dc2624", "#2b4750", "#649E7D", "#C89F91", "#c7cccf"]
for label, color in enumerate(colors):
    mask = labels == label
    encircle(df.loc[mask, 'Murder'], df.loc[mask, 'Assault'],
             ec="k", fc=color, alpha=0.2, linewidth=0)

# Decorations
plt.xlabel('Murder')
plt.xticks(fontsize=12)
plt.ylabel('Assault')
plt.yticks(fontsize=12)
plt.title('Agglomerative Clustering of USArrests (5 Groups)', fontsize=18)
plt.show()
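The `encircle` helper relies on `ConvexHull` to pick out the outermost points of each cluster. A small sketch of what `hull.vertices` contains may make that step clearer. The five points below are made up for illustration.

```python
import numpy as np
from scipy.spatial import ConvexHull

# Four corners of a square plus one interior point (illustrative data).
pts = np.array([[0, 0], [2, 0], [2, 2], [0, 2], [1, 1]], dtype=float)

hull = ConvexHull(pts)

# hull.vertices holds the row indices of the points on the hull boundary;
# for 2-D input they are listed in counterclockwise order.
boundary = pts[hull.vertices]

# The interior point (index 4) is not part of the boundary.
print(sorted(hull.vertices.tolist()))
```

Because the boundary indices come back in order around the hull, `p[hull.vertices, :]` can be passed straight to a polygon patch. Note that `ConvexHull` needs at least three non-collinear points in 2-D, so a cluster with only one or two members would raise an error in `encircle`.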
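50. Andrews Curve
Andrews curves show whether the features carry an inherent grouping with respect to a given grouping column. In the plot below, for example, if the columns of the dataset did not help distinguish the groups (cyl), the lines would not be well separated.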
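import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import andrews_curves
# Import
df = pd.read_csv("./datasets/mtcars.csv")
# Drop the non-numeric name columns; andrews_curves needs numeric features
df = df.drop(columns=['cars', 'carname'])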
# Plot
plt.figure(figsize=(10, 6), dpi=80)
andrews_curves(df, 'cyl', colormap='Set2_r')
# Lighten borders
plt.gca().spines["top"].set_alpha(0)
plt.gca().spines["bottom"].set_alpha(.3)
plt.gca().spines["right"].set_alpha(0)
plt.gca().spines["left"].set_alpha(.3)
plt.title('Andrews Curves of mtcars', fontsize=18)
plt.xlim(-3, 3)
plt.grid(alpha=0.3)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
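Each line in the plot is one row of the data turned into a function of t. The pandas documentation describes the curve as f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ..., evaluated for t between -π and π. The sketch below evaluates that formula by hand for a single made-up row. The function name `andrews_curve` is hypothetical, and this is a from-scratch illustration rather than pandas' own code.

```python
import numpy as np

def andrews_curve(row, t):
    """Evaluate f(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + ..."""
    row = np.asarray(row, dtype=float)
    # The first feature contributes a constant offset.
    result = np.full_like(t, row[0] / np.sqrt(2.0), dtype=float)
    for i, x in enumerate(row[1:], start=1):
        harmonic = (i + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            result += x * np.sin(harmonic * t)  # odd positions: sine terms
        else:
            result += x * np.cos(harmonic * t)  # even positions: cosine terms
    return result

t = np.linspace(-np.pi, np.pi, 200)
curve = andrews_curve([1.0, 2.0, 3.0], t)  # one illustrative row, three features
```

Rows with similar feature values produce similar curves, which is why rows from the same group tend to bundle together when the features separate the groups well.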
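51. Parallel Coordinates
Parallel coordinates show whether a feature helps separate the groups. If the groups are clearly segregated along a feature, that feature is likely to be very useful for telling the groups apart.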
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
# Import Data
df_final = pd.read_csv("./datasets/diamonds_filter.csv")
# Plot
plt.figure(figsize=(11, 7), dpi=80)
parallel_coordinates(df_final, 'cut', colormap='Set2_r')
# Lighten borders
plt.gca().spines["top"].set_alpha(0)
plt.gca().spines["bottom"].set_alpha(.3)
plt.gca().spines["right"].set_alpha(0)
plt.gca().spines["left"].set_alpha(.3)
plt.title('Parallel Coordinates of Diamonds', fontsize=18)
plt.grid(alpha=0.3)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
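Parallel coordinates put every feature on one shared y-axis, so a feature measured in large units can flatten all the others. The post does not say whether diamonds_filter.csv is already rescaled. As a hedged sketch, one common preparation step is min-max scaling of the feature columns. The toy frame, its column values, and the helper name `min_max_scale` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy frame: two numeric features on very different scales
# plus a class column (values are illustrative, not from diamonds_filter.csv).
df_toy = pd.DataFrame({
    'carat': [0.3, 0.7, 1.2, 2.0],
    'price': [400.0, 2500.0, 6000.0, 15000.0],
    'cut': ['Good', 'Ideal', 'Good', 'Ideal'],
})

def min_max_scale(df, class_column):
    """Rescale every feature column (all but class_column) to [0, 1]."""
    scaled = df.copy()
    features = scaled.columns.drop(class_column)
    lo = scaled[features].min()
    hi = scaled[features].max()
    scaled[features] = (scaled[features] - lo) / (hi - lo)
    return scaled

df_scaled = min_max_scale(df_toy, 'cut')
print(df_scaled)
```

The scaled frame can then be passed to `parallel_coordinates(df_scaled, 'cut')` in the same way as `df_final`. A column with a single constant value would divide by zero here and would need to be dropped or handled separately.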