Practical guide | Python data analysis: building a consumer user portrait

Today I will introduce a project that combines clustering and dimensionality reduction. It consists of two parts:

  • Work on the raw data directly: after preprocessing and encoding, cluster users with plain K-Means plus PCA/T-SNE.

  • Convert the data into high-dimensional embeddings with a Transformer-based pre-trained model, then apply K-Means and PCA/T-SNE to cluster users.

This article walks through the complete workflow of the first approach.

1 Project map

Map of the entire project:

picture


2 Import libraries

In [1]:

import pandas as pd 
import numpy as np 

np.random.seed(42)

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import shap

from sklearn.cluster import KMeans
from sklearn.preprocessing import PowerTransformer, OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score, silhouette_samples, accuracy_score, classification_report

from pyod.models.ecod import ECOD
from yellowbrick.cluster import KElbowVisualizer

import lightgbm as lgb
import prince

from tqdm.notebook import tqdm
from time import sleep

import warnings
warnings.filterwarnings("ignore")

3 Read data

In [2]:

df = pd.read_csv("train/train.csv",sep=";")

Start with a quick exploratory analysis to understand the basic structure of the data:

In [3]:

df.shape

Out[3]:

(45211, 17)

In [4]:

df.columns

Out[4]:

Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'y'],
      dtype='object')

In [5]:

df.dtypes

Out[5]:

age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object

In [6]:

pd.Series.value_counts(df.dtypes)

Out[6]:

object    10
int64      7
Name: count, dtype: int64

In [7]:

# Missing-value information

df.isnull().sum()

Out[7]:

age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64

The results indicate that there are no missing values in the data.

In [8]:

# Keep the first 8 features for modeling

df = df.iloc[:, 0:8]

4 Data preprocessing

This step mainly encodes the categorical features:

In [9]:

# 1 - One-hot encoding
categorical_transformer_onehot = Pipeline(
    steps = [("encoder", OneHotEncoder(handle_unknown="ignore",drop="first", sparse=False))])

# 2 - Ordinal encoding
categorical_transformer_ordinal = Pipeline(
    steps=[("encoder", OrdinalEncoder())]
)

# 3 - Numeric transformation (power transform: log-like transform plus standardization)
num = Pipeline(
    steps=[("encoder",PowerTransformer())]
)

In [10]:

df.dtypes

Out[10]:

age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
dtype: object

Set up the data preprocessor:

In [11]:

preprocessor = ColumnTransformer(transformers=[
    ("cat_onehot", categorical_transformer_onehot, ["default","housing","loan","job","marital"]),
    ("cat_ordinal", categorical_transformer_ordinal, ["education"]),
    ("num", num, ["age", "balance"])
])

5 Create pipeline

In [12]:

pipeline = Pipeline(
    steps=[("preprocessor", preprocessor)]
)

# Fit the pipeline
pipe_fit = pipeline.fit(df)

In [13]:

data = pd.DataFrame(pipe_fit.transform(df), columns=pipe_fit.get_feature_names_out().tolist())
data.shape

Out[13]:

(45211, 19)

In [14]:

data.columns

Out[14]:

Index(['cat_onehot__default_yes', 'cat_onehot__housing_yes',
       'cat_onehot__loan_yes', 'cat_onehot__job_blue-collar',
       'cat_onehot__job_entrepreneur', 'cat_onehot__job_housemaid',
       'cat_onehot__job_management', 'cat_onehot__job_retired',
       'cat_onehot__job_self-employed', 'cat_onehot__job_services',
       'cat_onehot__job_student', 'cat_onehot__job_technician',
       'cat_onehot__job_unemployed', 'cat_onehot__job_unknown',
       'cat_onehot__marital_married', 'cat_onehot__marital_single',
       'cat_ordinal__education', 'num__age', 'num__balance'],
      dtype='object')

6 Outlier detection (ECOD)

Outliers are handled with the Python Outlier Detection (PyOD) library, since K-Means is sensitive to outliers.

The method used here is ECOD (Empirical Cumulative distribution functions for Outlier Detection), which scores points based on how far they sit in the tails of each feature's empirical cumulative distribution function.
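To make the idea concrete, here is a simplified sketch of the ECOD intuition. It is not the full pyod implementation (which additionally aggregates left-tail, right-tail and skewness-corrected scores); the helper simple_ecod_scores and the 10% threshold are illustrative assumptions.

import numpy as np
import pandas as pd

def simple_ecod_scores(X: pd.DataFrame) -> np.ndarray:
    """Simplified ECOD-style score: sum of per-feature -log tail probabilities."""
    X = X.to_numpy(dtype=float)
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        col = X[:, j]
        ranks = np.argsort(np.argsort(col)) + 1        # 1..n rank of each value
        ecdf_left = ranks / n                          # P(x <= value)
        ecdf_right = 1.0 - (ranks - 1) / n             # P(x >= value)
        tail = np.minimum(ecdf_left, ecdf_right)       # probability of the rarer tail
        scores += -np.log(np.clip(tail, 1e-12, None))  # rare tails -> large scores
    return scores

# Example usage (hypothetical threshold: flag the top 10% most extreme rows):
# scores = simple_ecod_scores(data)
# outliers_simple = (scores > np.quantile(scores, 0.9)).astype(int)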

In [15]:

from pyod.models.ecod import ECOD

clf = ECOD()
clf.fit(data)

outliers = clf.predict(data)
outliers

Out[15]:

array([0, 0, 0, ..., 1, 0, 0])

In [16]:

data["outliers"] = outliers  # 添加预测结果
df["outliers"] = outliers  # 原始数据添加预测结果

In [17]:

# Keep separate versions of the data: with and without outliers

# encoded data without outliers
data_no_outliers = data[data["outliers"] == 0]
data_no_outliers = data_no_outliers.drop(["outliers"],axis=1)

# encoded data including outliers
data_with_outliers = data.copy()
data_with_outliers = data_with_outliers.drop(["outliers"],axis=1)

# original data without outliers
df_no_outliers = df[df["outliers"] == 0]
df_no_outliers = df_no_outliers.drop(["outliers"], axis = 1)

In [18]:

data_no_outliers.head()

Out[18]:

picture

Check the data volume:

In [19]:

data_no_outliers.shape

Out[19]:

(40690, 19)

In [20]:

data_with_outliers.shape

Out[20]:

(45211, 19)

7 Clustering modeling (K-Means)

7.1 Elbow plot to choose k

How do we choose the value of k for clustering? One common approach is the elbow method; see:

https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/
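For reference, here is a minimal sketch of the elbow method in plain scikit-learn and matplotlib (yellowbrick's distortion score, used below, is closely related to the K-Means inertia plotted here); it assumes data_no_outliers from the previous section:

# Fit K-Means for a range of k and record the inertia (within-cluster sum of squares);
# the "elbow" is the k where the curve starts to flatten.
inertias = []
k_values = list(range(2, 10))
for k in k_values:
    km_k = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    km_k.fit(data_no_outliers)
    inertias.append(km_k.inertia_)

plt.plot(k_values, inertias, marker="o")
plt.xlabel("k")
plt.ylabel("inertia")
plt.title("Elbow method for K-Means")
plt.show()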

In [21]:

from yellowbrick.cluster import KElbowVisualizer

km = KMeans(init="k-means++", random_state=0, n_init="auto")
visualizer = KElbowVisualizer(km, k=(2,10))
 
visualizer.fit(data_no_outliers)      
visualizer.show()  

Out[21]:

picture

<Axes: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>

We can see that k=6 is the best.

7.2 Silhouette coefficient across k values

In [22]:

from sklearn.metrics import davies_bouldin_score, silhouette_score, silhouette_samples
import matplotlib.cm as cm

def make_Silhouette_plot(X, n_clusters):
    plt.xlim([-0.1, 1])
    plt.ylim([0, len(X) + (n_clusters + 1) * 10])
    # Build the clustering model
    clusterer = KMeans(n_clusters=n_clusters, 
                       max_iter=1000, 
                       n_init=10, 
                       init="k-means++",
                       random_state=10)
    
    # Fit and predict the cluster labels
    cluster_label = clusterer.fit_predict(X)
    # Average silhouette coefficient over all samples
    silhouette_avg = silhouette_score(X, cluster_label)
    print(f"n_clusterers: {n_clusters}, silhouette_score_avg:{silhouette_avg}")
    
    # Silhouette coefficient of each individual sample
    sample_silhouette_value = silhouette_samples(X, cluster_label)
    y_lower = 10
    
    for i in range(n_clusters):
        # Silhouette values of the i-th cluster
        i_cluster_silhouette_value = sample_silhouette_value[cluster_label == i]
        # Sort them
        i_cluster_silhouette_value.sort()
        size_cluster_i = i_cluster_silhouette_value.shape[0]
        y_upper = y_lower + size_cluster_i
        # Color for this cluster
        color = cm.nipy_spectral(float(i) / n_clusters)
        
        # Filled silhouette band
        plt.fill_betweenx(
            np.arange(y_lower, y_upper),
            0,
            i_cluster_silhouette_value,
            facecolor=color,
            edgecolor=color,
            alpha=0.7
        )
        # Label the cluster
        plt.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        y_lower = y_upper + 10

    # Title, labels and a reference line at the average silhouette
    plt.title(f"The Silhouette Plot for n_cluster = {n_clusters}", fontsize=26)
    plt.xlabel("The silhouette coefficient values", fontsize=24)
    plt.ylabel("Cluster label", fontsize=24)
    plt.axvline(x=silhouette_avg, color="red", linestyle="--")
    # Tick labels for the x/y axes
    plt.xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
    plt.yticks([])

range_n_clusters = list(range(2, 10))
    
for n in range_n_clusters:
    print(f"N cluster:{n}")
    make_Silhouette_plot(data_no_outliers, n)
    plt.savefig(f"Silhouette_Plot_{n}.png")
    plt.close()
N cluster:2
n_clusterers: 2, silhouette_score_avg:0.18112038570087005
......
N cluster:9
n_clusterers: 9, silhouette_score_avg:0.1465020645956104

Comparison of silhouette coefficients under different k values:

picture (silhouette plots for k = 2 through 9)

7.3 Implementing clustering

From these results, both k = 5 and k = 6 look reasonable. We finally choose k = 5 for clustering:

In [23]:

km = KMeans(n_clusters=5,
            init="k-means++", 
            n_init=10,
            max_iter=100,
            random_state=42
           )

# Cluster the outlier-free data
clusters_predict = km.fit_predict(data_no_outliers)

7.4 Evaluate the clustering quality

How do we evaluate the clustering quality? Three commonly used metrics:

  • Davies-Bouldin index

  • Calinski-Harabasz Score

  • Silhouette Score

In [24]:

from sklearn.metrics import silhouette_score  # silhouette coefficient
from sklearn.metrics import calinski_harabasz_score  
from sklearn.metrics import davies_bouldin_score  # Davies-Bouldin index (DBI)

Davies-Bouldin index

The Davies-Bouldin index is an internal evaluation metric for clustering: the smaller the value, the better the clustering result. It measures clustering quality by comparing within-cluster distances against between-cluster distances. The calculation works as follows:

  • For each cluster, compute its centroid.

  • Compute the average distance between each cluster's points and its centroid to obtain the intra-cluster distance.

  • Compute the distances between the centroids of different clusters to obtain the inter-cluster distances.

  • For each cluster, compute its Davies-Bouldin value: the maximum, over all other clusters, of the ratio of the sum of the two clusters' intra-cluster distances to the distance between their centroids.

  • Average these values over all clusters to obtain the overall Davies-Bouldin index.

With the Davies-Bouldin index we can compare different clustering algorithms and parameter settings and pick the best solution. Clusters that are spread out and close to one another give a larger index, while compact and well-separated clusters give a smaller one, so the index distinguishes how well different clustering results separate the data.

In addition, the Davies-Bouldin index makes no prior assumptions about cluster shape or size, so it can be applied in many different clustering scenarios.
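As a sanity check, here is a small hedged sketch that computes the Davies-Bouldin index directly from the definition above; dbi_by_hand is an illustrative helper and should closely match sklearn's davies_bouldin_score on (data_no_outliers, clusters_predict):

import numpy as np
from sklearn.metrics import davies_bouldin_score

def dbi_by_hand(X, labels):
    X = np.asarray(X, dtype=float)
    ks = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in ks])
    # mean intra-cluster distance to the centroid (S_i)
    s = np.array([np.linalg.norm(X[labels == k] - centroids[i], axis=1).mean()
                  for i, k in enumerate(ks)])
    db = 0.0
    for i in range(len(ks)):
        ratios = [(s[i] + s[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(len(ks)) if j != i]
        db += max(ratios)   # worst-case similarity for cluster i
    return db / len(ks)

# dbi_by_hand(data_no_outliers, clusters_predict) should closely match
# davies_bouldin_score(data_no_outliers, clusters_predict)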

Calinski-Harabasz Score

The Calinski-Harabasz score is a metric for evaluating clustering quality, computed as the ratio of the between-cluster dispersion to the within-cluster dispersion. The larger the score, the better the clustering.

For n data points partitioned into k clusters, the score is:

CH = [ tr(B_k) / (k - 1) ] / [ tr(W_k) / (n - k) ]

where k is the number of clusters, n is the total number of data points, tr(B_k) is the between-cluster dispersion, and tr(W_k) is the within-cluster dispersion.

The trace only considers the diagonal elements of the dispersion matrices, i.e. squared Euclidean distances. The two dispersions are:

tr(B_k) = Σ_q n_q · ||c_q − c||²
tr(W_k) = Σ_q Σ_{x ∈ C_q} ||x − c_q||²

where C_q is the set of points in cluster q, c_q is the centroid of cluster q, c is the centroid of all data, and n_q is the number of points in cluster q.
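The same score can be computed directly from these formulas; ch_by_hand below is an illustrative sketch that should agree with sklearn's calinski_harabasz_score on (data_no_outliers, clusters_predict):

import numpy as np

def ch_by_hand(X, labels):
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    ks = np.unique(labels)
    k = len(ks)
    overall_mean = X.mean(axis=0)
    # between-cluster dispersion tr(B_k)
    between = sum(
        (labels == q).sum() * np.sum((X[labels == q].mean(axis=0) - overall_mean) ** 2)
        for q in ks
    )
    # within-cluster dispersion tr(W_k)
    within = sum(
        np.sum((X[labels == q] - X[labels == q].mean(axis=0)) ** 2)
        for q in ks
    )
    return (between / (k - 1)) / (within / (n - k))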

Silhouette Score

The Silhouette Score is the average silhouette coefficient over all samples.

It measures the quality of a clustering result by combining the cohesion within a cluster and the separation between clusters. For each data point, the silhouette coefficient considers two quantities:

  • a: the average distance between the data point and the other points in the same cluster (within-cluster cohesion)

  • b: the average distance from the data point to the points in the nearest other cluster (between-cluster separation)

Specifically, the silhouette coefficient of the point is:

s = (b − a) / max(a, b)

The silhouette coefficient lies between -1 and 1: the closer to 1, the better the clustering; the closer to -1, the worse.
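A tiny worked example, using a made-up 1-D toy dataset, shows the formula in action and matches sklearn's silhouette_samples:

import numpy as np
from sklearn.metrics import silhouette_samples

toy_X = np.array([[0.0], [0.2], [0.4], [5.0], [5.2]])
toy_labels = np.array([0, 0, 0, 1, 1])

# For the point 0.0: a = mean distance to its own cluster = (0.2 + 0.4) / 2 = 0.3
#                    b = mean distance to the nearest other cluster = (5.0 + 5.2) / 2 = 5.1
#                    s = (5.1 - 0.3) / 5.1 ≈ 0.941
print(silhouette_samples(toy_X, toy_labels)[0])   # ≈ 0.941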

In [25]:

print(f"Davies bouldin score: {
      
      davies_bouldin_score(data_no_outliers,clusters_predict)}")
print(f"Calinski Score: {
      
      calinski_harabasz_score(data_no_outliers,clusters_predict)}")
print(f"Silhouette Score: {
      
      silhouette_score(data_no_outliers,clusters_predict)}")
Davies bouldin score: 1.6775659296391374
Calinski Score: 6914.724747148267
Silhouette Score: 0.1672869940907178

8 Dimensionality reduction (based on Prince.PCA)

Reference: the prince library, https://github.com/MaxHalford/prince

8.1 Dimensionality reduction function

In [26]:

import prince 
import plotly.express as px


def get_pca_2d(df, predict):
    """
    Fit a PCA model that keeps 2 principal components
    """
    pca_2d_object = prince.PCA(
        n_components=2,   # keep two principal components
        n_iter=3,  # number of iterations
        rescale_with_mean=True,  # rescale using the mean and standard deviation
        rescale_with_std=True,
        copy=True,
        check_input=True,
        engine="sklearn",
        random_state=42
        )
    # Fit the model
    pca_2d_object.fit(df)
    # Transform the original data
    df_pca_2d = pca_2d_object.transform(df)
    df_pca_2d.columns = ["comp1", "comp2"]
    # Attach the cluster predictions
    df_pca_2d["cluster"] = predict
    
    return pca_2d_object, df_pca_2d


# The same helper, but keeping 3 principal components
def get_pca_3d(df, predict):
    """
    Keep 3 principal components
    """
    pca_3d_object = prince.PCA(
        n_components=3,  # keep three principal components
        n_iter=3,
        rescale_with_mean=True,
        rescale_with_std=True,
        copy=True,
        check_input=True,
        engine="sklearn",
        random_state=42
    )

    pca_3d_object.fit(df)

    df_pca_3d = pca_3d_object.transform(df)
    df_pca_3d.columns = ["comp1", "comp2", "comp3"]
    df_pca_3d["cluster"] = predict

    return pca_3d_object, df_pca_3d

8.2 Dimensionality reduction visualization

Below is a visual plotting function based on 2 principal components:

In [27]:

def plot_pca_2d(df, title="PCA Space", opacity=0.8, width_line=0.1):
    """
    2-D visualization of the reduced data
    """
    df = df.astype({"cluster": "object"})  # cast the cluster column to object
    df = df.sort_values("cluster")
    
    columns = df.columns[0:3].tolist()
    
    # Scatter plot
    fig = px.scatter(
        df,
        x=columns[0], 
        y=columns[1], 
        color="cluster",
        template="plotly",
        color_discrete_sequence=px.colors.qualitative.Vivid,
        title=title
    )
        
    # Update traces
    fig.update_traces(marker={
        "size": 8,
        "opacity": opacity,
        "line": {"width": width_line,
                 "color": "black"}
    })
    
    # Update layout
    fig.update_layout(
        width=800,  # figure size
        height=700,
        autosize=False,
        showlegend=True,
        legend=dict(title_font_family="Times New Roman", font=dict(size=20)),
        scene=dict(xaxis=dict(title="comp1", titlefont_color="black"),
                   yaxis=dict(title="comp2", titlefont_color="black")),
        font=dict(family="Gilroy", color="black", size=15))
    
    fig.show()    

The following is a visual plotting function based on 3 principal components:

In [28]:

def plot_pca_3d(df, title="PCA Space", opacity=0.8, width_line=0.1):
    """
    3-D visualization of the reduced data
    """
    df = df.astype({"cluster": "object"})
    df = df.sort_values("cluster")
    
    # Define the figure
    fig = px.scatter_3d(
        df,
        x="comp1", 
        y="comp2", 
        z="comp3",
        color="cluster",
        template="plotly",
        color_discrete_sequence=px.colors.qualitative.Vivid,
        title=title
    )
    
    # Update traces
    fig.update_traces(marker={
        "size": 4,
        "opacity": opacity,
        "line": {"width": width_line,
                 "color": "black"}
    })
    
    # Update layout
    fig.update_layout(
        width=800,  # figure size
        height=800,
        autosize=True,
        showlegend=True,
        legend=dict(title_font_family="Times New Roman", font=dict(size=20)),
        scene=dict(xaxis=dict(title="comp1",
                              titlefont_color="black"),
                   yaxis=dict(title="comp2",
                              titlefont_color="black"),
                   zaxis=dict(title="comp3",
                              titlefont_color="black")),
        font=dict(family="Gilroy", color="black", size=15))
    
    fig.show()

8.2.1 2D

The following is the effect of 2D visualization:

In [29]:

pca_2d_object, df_pca_2d = get_pca_2d(data_no_outliers, clusters_predict) 

In [30]:

plot_pca_2d(df_pca_2d, title = "PCA Space", opacity=1, width_line = 0.1)

picture

We can see that the clustering does not look great here: the clusters are not well separated in PCA space.

8.2.2 3D

The following is the effect of 3D visualization:

In [31]:

pca_3d_object, df_pca_3d = get_pca_3d(data_no_outliers, clusters_predict)

In [32]:

plot_pca_3d(df_pca_3d, title = "PCA Space", opacity=1, width_line = 0.1)

print("The variability is : ", pca_3d_object.eigenvalues_summary)

picture (3D PCA scatter plots of the clusters)

The variability is :            eigenvalue % of variance % of variance (cumulative)
component                                                    
0              2.245        11.81%                     11.81%
1              1.774         9.34%                     21.15%
2              1.298         6.83%                     27.98%

Again, the clusters are not well separated in this projection.

The first three principal components together explain only 27.98% of the variance, which is not enough to capture the structure and patterns of the original data. Next we introduce dimensionality reduction based on T-SNE, a method designed mainly for visualizing high-dimensional data:
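As a quick hedged check of this claim, we can look at the cumulative explained variance with scikit-learn's PCA on the standardized data (prince.PCA with mean/std rescaling behaves comparably); the 80% threshold is just an illustrative choice:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the encoded features, fit a full PCA and accumulate the explained variance
X_std = StandardScaler().fit_transform(data_no_outliers)
pca_full = PCA().fit(X_std)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
print("components needed for 80% variance:", int(np.argmax(cum_var >= 0.8)) + 1)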

9 Dimensionality reduction optimization (based on T-SNE)

First draw a random sample of the data (T-SNE is computationally expensive on large datasets):

In [33]:

from sklearn.manifold import TSNE

# Randomly sample the outlier-free data
sampling_data = data_no_outliers.sample(frac=0.5, replace=True, random_state=1)

# Sample the cluster labels with the same random state so rows and labels stay aligned
sampling_cluster = pd.DataFrame(clusters_predict).sample(frac=0.5, replace=True, random_state=1)[0].values
sampling_cluster

Out[33]:

array([4, 1, 1, ..., 2, 0, 4])

9.1 Implementing 2D dimensionality reduction

9.1.1 Dimensionality reduction

In [34]:

# Build the 2-D T-SNE model

tsne2 = TSNE(
    n_components=2,
    learning_rate=500, 
    init='random', 
    perplexity=200, 
    n_iter = 5000) 

In [35]:

data_tsne_2d = tsne2.fit_transform(sampling_data)

In [36]:

# Convert to a DataFrame and attach the cluster labels

df_tsne_2d = pd.DataFrame(data_tsne_2d, columns=["comp1","comp2"])
df_tsne_2d["cluster"] = sampling_cluster

9.1.2 Visualization

In [37]:

plot_pca_2d(df_tsne_2d, title = "T-SNE Space", opacity=1, width_line = 0.1)  

picture

9.2 Implement 3D dimensionality reduction

9.2.1 Dimensionality reduction

Now apply T-SNE with three components to the same sampled data:

In [38]:

# Build the 3-D T-SNE model

tsne3 = TSNE(
    n_components=3,
    learning_rate=500, 
    init='random', 
    perplexity=200, 
    n_iter = 5000 
)

In [39]:

# Fit the model and transform the data

data_tsne_3d = tsne3.fit_transform(sampling_data)

In [40]:

# Convert to a DataFrame and attach the cluster labels

df_tsne_3d = pd.DataFrame(data_tsne_3d, columns=["comp1","comp2","comp3"])
df_tsne_3d["cluster"] = sampling_cluster

9.2.2 Visualization of dimensionality reduction results

In [41]:

plot_pca_3d(df_tsne_3d, title = "T-SNE Space", opacity=1, width_line = 0.1)

picture (3D T-SNE scatter plots of the clusters)

Comparing the 2-D projections of the two dimensionality reduction methods, T-SNE clearly separates the clusters much better:

picture

10 Classification based on LGBMClassifier

Use the outlier-free original data df_no_outliers as the features X and the cluster labels clusters_predict as the target y, and fit an LGBMClassifier model:

10.1 Building a model

In [42]:

import lightgbm as lgb
import shap

clf_lgb = lgb.LGBMClassifier(colsample_by_tree=0.8)
# note: LightGBM's actual parameter name is colsample_bytree, which is why the
# "Unknown parameter" warnings appear in the log below

# Convert some columns to the category dtype
for col in ["job","marital","education","housing","loan","default"]:
    df_no_outliers[col] = df_no_outliers[col].astype("category")

    
clf_lgb.fit(X=df_no_outliers, 
            y=clusters_predict,
            feature_name = "auto", 
            categorical_feature = "auto"
           )
[LightGBM] [Warning] Unknown parameter: colsample_by_tree
[LightGBM] [Warning] Unknown parameter: colsample_by_tree
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000593 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 342
[LightGBM] [Info] Number of data points in the train set: 40690, number of used features: 8
[LightGBM] [Info] Start training from score -1.626166
[LightGBM] [Info] Start training from score -1.292930
[LightGBM] [Info] Start training from score -1.412943
[LightGBM] [Info] Start training from score -2.815215
[LightGBM] [Info] Start training from score -1.489282

Out[42]:

LGBMClassifier

LGBMClassifier(colsample_by_tree=0.8)

10.2 SHAP visualization

In [43]:

explainer = shap.TreeExplainer(clf_lgb)  # build the explainer
shap_values = explainer.shap_values(df_no_outliers)  # compute the SHAP values

shap.summary_plot(shap_values, df_no_outliers, plot_type="bar", plot_size=(15,10))

picture

As you can see from the results, the age field is the most important.

10.3 Model prediction

In [44]:

y_pred = clf_lgb.predict(df_no_outliers)  # predict on the training data
acc = accuracy_score(y_pred, clusters_predict)  # accuracy of predictions vs. cluster labels

# accuracy
print('Training-set accuracy score: {0:0.4f}'.format(acc))
[LightGBM] [Warning] Unknown parameter: colsample_by_tree
Training-set accuracy score: 1.0000

In [45]:

# Classification report
print(classification_report(clusters_predict, y_pred))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      8003
           1       1.00      1.00      1.00     11168
           2       1.00      1.00      1.00      9905
           3       1.00      1.00      1.00      2437
           4       1.00      1.00      1.00      9177

    accuracy                           1.00     40690
   macro avg       1.00      1.00      1.00     40690
weighted avg       1.00      1.00      1.00     40690

10.4 Aggregation results

In [46]:

# Outlier-free original data

df_no_outliers = df[df.outliers == 0]
df_no_outliers["cluster"] = clusters_predict  # cluster labels

Group by the cluster field and aggregate:

  • numeric fields: take the mean

  • categorical fields: take the most frequent value (the first entry of value_counts within each group)

In [47]:

df_no_outliers.groupby("cluster").agg({
    
    
    "job":lambda x: x.value_counts().index[0],
    "marital": lambda x: x.value_counts().index[0],
    "education":lambda x: x.value_counts().index[0],
    "housing":lambda x: x.value_counts().index[0],
    "loan":lambda x: x.value_counts().index[0],
    "age":"mean",
    "balance":"mean",
    "default":lambda x: x.value_counts().index[0]
}).sort_values("age").reset_index()

Out[47]:

|   | cluster | job | marital | education | housing | loan | age | balance | default |
| — | — | — | — | — | — | — | — | — | — |
| 0 | 4 | technician | single | secondary | yes | no | 32.069740 | 794.696306 | no |
| 1 | 2 | blue-collar | married | secondary | yes | no | 34.569409 | 592.025644 | no |
| 2 | 3 | management | married | secondary | yes | no | 42.183012 | 7526.310217 | no |
| 3 | 0 | management | married | tertiary | no | no | 43.773960 | 872.797951 | no |
| 4 | 1 | blue-collar | married | secondary | no | no | 50.220989 | 836.407504 | no |

Reference

Original English article: https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41

I will share the second solution (Transformer model + K-Means + PCA/T-SNE) later.

The code is ready and the write-up is in progress. As a preview, here is the effect of applying PCA to the data encoded by the Transformer model (expanded to 384 dimensions): the result is indeed much better!

picture

picture

Source: blog.csdn.net/qq_34160248/article/details/134904775