Today I will walk through a project that combines clustering and dimensionality reduction. It is divided into two parts:
- Use the original data directly: after preprocessing and encoding, implement user clustering with plain K-Means and PCA/T-SNE.
- Use high-dimensional data produced by a pre-trained Transformer-based model, and then apply K-Means and PCA/T-SNE to implement user clustering.
This article covers the complete process of the first approach.
1 Project map
Map of the entire project:
2 Import library
In [1]:
import pandas as pd
import numpy as np
np.random.seed(42)
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import shap
from sklearn.cluster import KMeans
from sklearn.preprocessing import PowerTransformer, OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score, silhouette_samples, accuracy_score, classification_report
from pyod.models.ecod import ECOD
from yellowbrick.cluster import KElbowVisualizer
import lightgbm as lgb
import prince
from tqdm.notebook import tqdm
from time import sleep
import warnings
warnings.filterwarnings("ignore")
3 Read data
In [2]:
df = pd.read_csv("train/train.csv",sep=";")
Explore the data to understand its basic structure:
In [3]:
df.shape
Out[3]:
(45211, 17)
In [4]:
df.columns
Out[4]:
Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
'previous', 'poutcome', 'y'],
dtype='object')
In [5]:
df.dtypes
Out[5]:
age int64
job object
marital object
education object
default object
balance int64
housing object
loan object
contact object
day int64
month object
duration int64
campaign int64
pdays int64
previous int64
poutcome object
y object
dtype: object
In [6]:
pd.Series.value_counts(df.dtypes)
Out[6]:
object 10
int64 7
Name: count, dtype: int64
In [7]:
# Missing-value counts per column
df.isnull().sum()
Out[7]:
age 0
job 0
marital 0
education 0
default 0
balance 0
housing 0
loan 0
contact 0
day 0
month 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
y 0
dtype: int64
The results indicate that there are no missing values in the data.
In [8]:
# Keep only the first 8 features for modeling
df = df.iloc[:, 0:8]
4 Data preprocessing
The main task is to encode the categorical columns and transform the numeric ones:
In [9]:
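# 1 - One-hot encoding
# (sparse_output replaces the older sparse argument in scikit-learn >= 1.2)
categorical_transformer_onehot = Pipeline(
    steps=[("encoder", OneHotEncoder(handle_unknown="ignore", drop="first", sparse_output=False))]
)
# 2 - Ordinal encoding
caterogorical_transformer_ordinal = Pipeline(
    steps=[("encoder", OrdinalEncoder())]
)
# 3 - Numeric transformation (power transform followed by standardization)
num = Pipeline(
    steps=[("encoder", PowerTransformer())]
)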
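To make the numeric step a little more concrete: `PowerTransformer` uses the Yeo-Johnson transform by default and estimates its parameter lambda from the data. Below is a minimal NumPy sketch of the transform itself for a fixed, hand-picked lambda. The function name `yeo_johnson`, the fixed lambda, and the sample values are assumptions for illustration only; the sketch omits both the lambda estimation and the standardization that scikit-learn applies afterwards.

```python
import numpy as np

def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform for a fixed lambda (illustrative helper, not the sklearn API)."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # Non-negative values
    if abs(lmbda) > 1e-12:
        out[pos] = ((x[pos] + 1.0) ** lmbda - 1.0) / lmbda
    else:
        out[pos] = np.log1p(x[pos])
    # Negative values
    if abs(lmbda - 2.0) > 1e-12:
        out[~pos] = -(((-x[~pos] + 1.0) ** (2.0 - lmbda)) - 1.0) / (2.0 - lmbda)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out

# Made-up, heavily skewed "balance" values
balances = np.array([-500.0, 0.0, 100.0, 10000.0])
print(yeo_johnson(balances, 0.5))
```

With lambda below 1 the large positive values are compressed, which is why a power transform is a reasonable choice for a skewed column such as `balance` before running a distance-based method like K-Means.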
In [10]:
df.dtypes
Out[10]:
age int64
job object
marital object
education object
default object
balance int64
housing object
loan object
dtype: object
Set up the data preprocessor:
In [11]:
preprocessor = ColumnTransformer(transformers=[
("cat_onehot", categorical_transformer_onehot, ["default","housing","loan","job","marital"]),
("cat_ordinal", caterogorical_transformer_ordinal, ["education"]),
("num", num, ["age", "balance"])
])
5 Create pipeline
In [12]:
pipeline = Pipeline(
steps=[("preprocessor", preprocessor)]
)
# Fit the preprocessing pipeline
pipe_fit = pipeline.fit(df)
In [13]:
data = pd.DataFrame(pipe_fit.transform(df), columns=pipe_fit.get_feature_names_out().tolist())
data.shape
Out[13]:
(45211, 19)
In [14]:
data.columns
Out[14]:
Index(['cat_onehot__default_yes', 'cat_onehot__housing_yes',
'cat_onehot__loan_yes', 'cat_onehot__job_blue-collar',
'cat_onehot__job_entrepreneur', 'cat_onehot__job_housemaid',
'cat_onehot__job_management', 'cat_onehot__job_retired',
'cat_onehot__job_self-employed', 'cat_onehot__job_services',
'cat_onehot__job_student', 'cat_onehot__job_technician',
'cat_onehot__job_unemployed', 'cat_onehot__job_unknown',
'cat_onehot__marital_married', 'cat_onehot__marital_single',
'cat_ordinal__education', 'num__age', 'num__balance'],
dtype='object')
6 Outlier handling (ECOD)
K-Means is sensitive to outliers, so we first detect them with the Python Outlier Detection library (PyOD).
The method used here is ECOD (Empirical Cumulative distribution functions for Outlier Detection), which scores each sample by how far it lies in the tails of the per-feature empirical cumulative distribution functions.
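The idea behind ECOD can be sketched in a few lines of NumPy. This is a simplified illustration, not PyOD's implementation: it only combines the left-tail and right-tail scores and leaves out the skewness-based variant that the full method also uses. The function name `ecdf_tail_scores` and the toy data are hypothetical.

```python
import numpy as np

def ecdf_tail_scores(X):
    """Simplified ECOD-style outlier score: larger means more outlying."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    score_left = np.zeros(n)
    score_right = np.zeros(n)
    for j in range(d):
        col = X[:, j]
        sorted_col = np.sort(col)
        # Empirical left-tail probability P(X_j <= x)
        left = np.searchsorted(sorted_col, col, side="right") / n
        # Empirical right-tail probability P(X_j >= x)
        right = (n - np.searchsorted(sorted_col, col, side="left")) / n
        # Rare tail values give large negative-log contributions
        score_left += -np.log(left)
        score_right += -np.log(right)
    return np.maximum(score_left, score_right)

# Toy data: 50 regular points plus one extreme point appended at index 50
x = np.linspace(-1.0, 1.0, 50)
X_demo = np.vstack([np.column_stack([x, -x]), [10.0, 10.0]])
scores = ecdf_tail_scores(X_demo)
print(int(np.argmax(scores)))
```

A sample is flagged when it is extreme in many features at once, so the method needs no distance computations and no hyperparameters other than the contamination threshold used to turn scores into labels.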
In [15]:
from pyod.models.ecod import ECOD
clf = ECOD()
clf.fit(data)
outliers = clf.predict(data)
outliers
Out[15]:
array([0, 0, 0, ..., 1, 0, 0])
In [16]:
data["outliers"] = outliers # Add the predictions to the encoded data
df["outliers"] = outliers # Add the predictions to the original data
In [17]:
# Handle the data with and without outliers separately
# Encoded data without outliers
data_no_outliers = data[data["outliers"] == 0]
data_no_outliers = data_no_outliers.drop(["outliers"],axis=1)
# Encoded data with outliers
data_with_outliers = data.copy()
data_with_outliers = data_with_outliers.drop(["outliers"],axis=1)
# Original data without outliers
df_no_outliers = df[df["outliers"] == 0]
df_no_outliers = df_no_outliers.drop(["outliers"], axis = 1)
In [18]:
data_no_outliers.head()
Out[18]:
Check the data volume:
In [19]:
data_no_outliers.shape
Out[19]:
(40690, 19)
In [20]:
data_with_outliers.shape
Out[20]:
(45211, 19)
7 Clustering modeling (K-Means)
7.1 Elbow diagram to identify k value
How should the value of k be chosen for clustering? One common approach is the elbow method; for details, see:
https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/
In [21]:
from yellowbrick.cluster import KElbowVisualizer
km = KMeans(init="k-means++", random_state=0, n_init="auto")
visualizer = KElbowVisualizer(km, k=(2,10))
visualizer.fit(data_no_outliers)
visualizer.show()
Out[21]:
<Axes: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
The elbow of the curve suggests that k=6 is the best choice.
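The distortion score on the y-axis is the sum of squared distances from each point to the center of its cluster. A minimal NumPy sketch of that quantity follows; the function name `distortion` and the four toy points are assumptions, and the sketch uses the mean of each cluster as its center, which matches the K-Means centers once the algorithm has converged.

```python
import numpy as np

def distortion(X, labels):
    """Sum of squared distances from every point to the mean of its cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total

# Four toy points forming two obvious groups
X_demo = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
print(distortion(X_demo, [0, 0, 0, 0]))  # one cluster: 104.0
print(distortion(X_demo, [0, 0, 1, 1]))  # two clusters: 4.0
```

Distortion always tends to fall as k grows, so the elbow method looks for the k after which the drop becomes small rather than for the minimum.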
7.2 Changes in silhouette coefficient
In [22]:
from sklearn.metrics import davies_bouldin_score, silhouette_score, silhouette_samples
import matplotlib.cm as cm

def make_Silhouette_plot(X, n_clusters):
    plt.xlim([-0.1, 1])
    plt.ylim([0, len(X) + (n_clusters + 1) * 10])
    # Build the clustering model
    clusterer = KMeans(n_clusters=n_clusters,
                       max_iter=1000,
                       n_init=10,
                       init="k-means++",
                       random_state=10)
    # Fit the model and predict the cluster labels
    cluster_label = clusterer.fit_predict(X)
    # Mean silhouette coefficient over all samples
    silhouette_avg = silhouette_score(X, cluster_label)
    print(f"n_clusterers: {n_clusters}, silhouette_score_avg:{silhouette_avg}")
    # Silhouette coefficient of each individual sample
    sample_silhouette_value = silhouette_samples(X, cluster_label)
    y_lower = 10
    for i in range(n_clusters):
        # Silhouette coefficients of the samples in cluster i
        i_cluster_silhouette_value = sample_silhouette_value[cluster_label == i]
        # Sort them
        i_cluster_silhouette_value.sort()
        size_cluster_i = i_cluster_silhouette_value.shape[0]
        y_upper = y_lower + size_cluster_i
        # Color for this cluster
        color = cm.nipy_spectral(float(i) / n_clusters)
        # Fill the silhouette band
        plt.fill_betweenx(
            np.arange(y_lower, y_upper),
            0,
            i_cluster_silhouette_value,
            facecolor=color,
            edgecolor=color,
            alpha=0.7
        )
        # Label the band with the cluster number
        plt.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        y_lower = y_upper + 10
    plt.title(f"The Silhouette Plot for n_cluster = {n_clusters}", fontsize=26)
    plt.xlabel("The silhouette coefficient values", fontsize=24)
    plt.ylabel("Cluster Label", fontsize=24)
    plt.axvline(x=silhouette_avg, color="red", linestyle="--")
    # Tick labels for the x and y axes
    plt.xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
    plt.yticks([])

range_n_clusters = list(range(2, 10))
for n in range_n_clusters:
    print(f"N cluster:{n}")
    make_Silhouette_plot(data_no_outliers, n)
    plt.savefig(f"Silhouette_Plot_{n}.png")
    plt.close()
N cluster:2
n_clusterers: 2, silhouette_score_avg:0.18112038570087005
......
N cluster:9
n_clusterers: 9, silhouette_score_avg:0.1465020645956104
Comparison of silhouette coefficients under different k values:
7.3 Implementing clustering
Judging from these results, both k=5 and k=6 work reasonably well. We choose k=5 for the final clustering:
In [23]:
km = KMeans(n_clusters=5,
init="k-means++",
n_init=10,
max_iter=100,
random_state=42
)
# Cluster the data without outliers
clusters_predict = km.fit_predict(data_no_outliers)
7.4 Evaluate clustering effect
How can the clustering result be evaluated? Three commonly used metrics are:
- Davies-Bouldin index
- Calinski-Harabasz Score
- Silhouette Score
In [24]:
from sklearn.metrics import silhouette_score # Silhouette coefficient
from sklearn.metrics import calinski_harabasz_score
from sklearn.metrics import davies_bouldin_score # Davies-Bouldin index (DBI)
Davies-Bouldin index
The Davies-Bouldin index is an evaluation metric for clustering algorithms. The smaller the value, the better the clustering result. It compares the scatter within clusters to the distance between cluster centers. The calculation is as follows:
- For each cluster, calculate its centroid.
- For each cluster, calculate the average distance between its points and its centroid to obtain the intra-cluster distance.
- For each pair of clusters, calculate the distance between their centroids to obtain the inter-cluster distance.
- For each pair of clusters, compute the ratio of the sum of their two intra-cluster distances to their inter-cluster distance; for each cluster, keep the largest such ratio over all other clusters.
- Average these per-cluster values over all clusters to obtain the overall Davies-Bouldin index.
With the Davies-Bouldin index, we can compare different clustering algorithms and parameter settings and select the best clustering solution. Clusters that are compact and far apart yield a small index, while clusters that overlap or lie close together yield a larger one.
In addition, the index is computed only from the data and the cluster labels, so it needs no ground-truth labels. Because it relies on centroids and Euclidean distances, it is best suited to compact, roughly convex clusters.
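As a rough illustration, here is a minimal NumPy sketch of the index as it is usually defined: for each cluster, take the worst-case ratio of summed scatter to centroid distance, then average. The function name `davies_bouldin` and the toy data are hypothetical; in practice `sklearn.metrics.davies_bouldin_score` should be preferred.

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index: mean over clusters of the worst scatter/separation ratio."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ks = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in ks])
    # Intra-cluster distance: mean distance of the members to their centroid
    scatter = np.array([
        np.linalg.norm(X[labels == k] - c, axis=1).mean()
        for k, c in zip(ks, centroids)
    ])
    worst = np.zeros(len(ks))
    for i in range(len(ks)):
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(len(ks)) if j != i
        ]
        worst[i] = max(ratios)
    return worst.mean()

# Two tight groups that are 10 units apart
X_demo = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
print(davies_bouldin(X_demo, [0, 0, 1, 1]))
```

In this toy case each cluster has a scatter of 1 and the centroids are 10 apart, so the index is (1 + 1) / 10 = 0.2, a low value that reflects well-separated clusters.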
Calinski-Harabasz Score
The Calinski-Harabasz Score is a metric for evaluating clustering quality, computed as the ratio of the between-cluster dispersion to the within-cluster dispersion. The larger the score, the better the clustering.
For a data set of $N$ points split into $k$ clusters, the score is:
$$ s = \frac{\operatorname{tr}(B_k)}{\operatorname{tr}(W_k)} \times \frac{N - k}{k - 1} $$
Here $k$ is the number of clusters, $N$ is the total number of data points, $\operatorname{tr}(B_k)$ is the between-cluster dispersion, and $\operatorname{tr}(W_k)$ is the within-cluster dispersion.
The between-cluster dispersion matrix is:
$$ B_k = \sum_{q=1}^{k} n_q (c_q - c_E)(c_q - c_E)^T $$
The trace only considers the elements on the diagonal of the matrix, so $\operatorname{tr}(W_k)$ is the sum of squared Euclidean distances from every data point to the center of its cluster.
The within-cluster dispersion matrix is:
$$ W_k = \sum_{q=1}^{k} \sum_{x \in C_q} (x - c_q)(x - c_q)^T $$
Here $C_q$ is the set of data points in cluster $q$, $c_q$ is the center of cluster $q$, $c_E$ is the center of all the data, and $n_q$ is the number of data points in cluster $q$.
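The score can be sketched directly from the traces of the two dispersion matrices. The following NumPy version is a minimal illustration under the standard definition (between-cluster dispersion divided by k - 1, over within-cluster dispersion divided by N - k); the function name `calinski_harabasz` and the toy data are assumptions, and `sklearn.metrics.calinski_harabasz_score` remains the reference implementation.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Ratio of between-cluster to within-cluster dispersion, each scaled by its degrees of freedom."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[0]
    ks = np.unique(labels)
    k = len(ks)
    overall_center = X.mean(axis=0)
    between = 0.0  # trace of the between-cluster dispersion matrix
    within = 0.0   # trace of the within-cluster dispersion matrix
    for q in ks:
        members = X[labels == q]
        center = members.mean(axis=0)
        between += len(members) * ((center - overall_center) ** 2).sum()
        within += ((members - center) ** 2).sum()
    return (between / (k - 1)) / (within / (n - k))

X_demo = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
print(calinski_harabasz(X_demo, [0, 0, 1, 1]))
```

For the toy data the between-cluster trace is 100 and the within-cluster trace is 4, giving (100 / 1) / (4 / 2) = 50.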
Silhouette Score
The Silhouette Score is also known as the silhouette coefficient.
It measures the quality of a clustering by combining the cohesion within clusters and the separation between clusters. For each data point, it uses two quantities:
- a: the average distance between the data point and the other points in the same cluster (cohesion within the cluster)
- b: the average distance from the data point to the points in the nearest other cluster (separation between clusters)
The silhouette coefficient of a single point is then:
$$ s = \frac{b - a}{\max(a, b)} $$
The Silhouette Score is the mean of $s$ over all points. Its value lies between -1 and 1: the closer to 1, the better the clustering, and the closer to -1, the worse.
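The per-point quantities a and b translate into code fairly directly. This is a minimal NumPy sketch with a hypothetical function name `silhouette_mean`; it builds the full pairwise distance matrix, so it is only suitable for small data sets, and it assigns a score of 0 to points in singleton clusters, which is the usual convention.

```python
import numpy as np

def silhouette_mean(X, labels):
    """Mean silhouette coefficient using Euclidean distances."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise distance matrix (quadratic memory, small data only)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    ks = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton cluster: score stays 0
        # a: mean distance to the other members of the same cluster
        a = dist[i, same].sum() / (n_same - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(dist[i, labels == k].mean() for k in ks if k != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

X_demo = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
print(silhouette_mean(X_demo, [0, 0, 1, 1]))
```

A score close to 1 here reflects that each point is much nearer to its own group than to the other one.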
In [25]:
print(f"Davies bouldin score: {davies_bouldin_score(data_no_outliers,clusters_predict)}")
print(f"Calinski Score: {calinski_harabasz_score(data_no_outliers,clusters_predict)}")
print(f"Silhouette Score: {silhouette_score(data_no_outliers,clusters_predict)}")
Davies bouldin score: 1.6775659296391374
Calinski Score: 6914.724747148267
Silhouette Score: 0.1672869940907178
8 Dimensionality reduction (based on Prince.PCA)
For reference, see the official repository: https://github.com/MaxHalford/prince
8.1 Dimensionality reduction function
In [26]:
import prince
import plotly.express as px

def get_pca_2d(df, predict):
    """
    Fit a PCA model that keeps 2 principal components
    """
    pca_2d_object = prince.PCA(
        n_components=2,          # Keep two principal components
        n_iter=3,                # Number of iterations
        rescale_with_mean=True,  # Rescale with the mean and standard deviation
        rescale_with_std=True,
        copy=True,
        check_input=True,
        engine="sklearn",
        random_state=42
    )
    # Fit the model
    pca_2d_object.fit(df)
    # Transform the original data
    df_pca_2d = pca_2d_object.transform(df)
    df_pca_2d.columns = ["comp1", "comp2"]
    # Add the predicted cluster labels
    df_pca_2d["cluster"] = predict
    return pca_2d_object, df_pca_2d

# The same helper, keeping 3 principal components
def get_pca_3d(df, predict):
    """
    Fit a PCA model that keeps 3 principal components
    """
    pca_3d_object = prince.PCA(
        n_components=3,          # Keep three principal components
        n_iter=3,
        rescale_with_mean=True,
        rescale_with_std=True,
        copy=True,
        check_input=True,
        engine='sklearn',
        random_state=42
    )
    pca_3d_object.fit(df)
    df_pca_3d = pca_3d_object.transform(df)
    df_pca_3d.columns = ["comp1", "comp2", "comp3"]
    df_pca_3d["cluster"] = predict
    return pca_3d_object, df_pca_3d
8.2 Dimensionality reduction visualization
Below is a visual plotting function based on 2 principal components:
In [27]:
def plot_pca_2d(df, title="PCA Space", opacity=0.8, width_line=0.1):
    """
    Scatter plot of 2 components, colored by cluster
    """
    df = df.astype({"cluster": "object"})  # Treat the cluster label as categorical
    df = df.sort_values("cluster")
    columns = df.columns[0:3].tolist()
    # Plot
    fig = px.scatter(
        df,
        x=columns[0],
        y=columns[1],
        color='cluster',
        template="plotly",
        color_discrete_sequence=px.colors.qualitative.Vivid,
        title=title
    )
    # Update the traces
    fig.update_traces(marker={
        "size": 8,
        "opacity": opacity,
        "line": {
            "width": width_line,
            "color": "black"}
    })
    # Update the layout
    fig.update_layout(
        width=800,  # Figure width and height
        height=700,
        autosize=False,
        showlegend = True,
        legend=dict(title_font_family="Times New Roman", font=dict(size= 20)),
        scene = dict(xaxis=dict(title = 'comp1', titlefont_color = 'black'),
                     yaxis=dict(title = 'comp2', titlefont_color = 'black')),
        font = dict(family = "Gilroy", color = 'black', size = 15))
    fig.show()
The following is a visual plotting function based on 3 principal components:
In [28]:
def plot_pca_3d(df, title="PCA Space", opacity=0.8, width_line=0.1):
    """
    3D scatter plot of 3 components, colored by cluster
    """
    df = df.astype({"cluster": "object"})
    df = df.sort_values("cluster")
    # Define the figure
    fig = px.scatter_3d(
        df,
        x='comp1',
        y='comp2',
        z='comp3',
        color='cluster',
        template="plotly",
        color_discrete_sequence=px.colors.qualitative.Vivid,
        title=title
    )
    # Update the traces
    fig.update_traces(marker={
        "size": 4,
        "opacity": opacity,
        "line": {
            "width": width_line,
            "color": "black"}
    })
    # Update the layout
    fig.update_layout(
        width=800,  # Figure width and height
        height=800,
        autosize=True,
        showlegend = True,
        legend=dict(title_font_family="Times New Roman", font=dict(size= 20)),
        scene = dict(xaxis=dict(title = 'comp1',
                                titlefont_color = 'black'),
                     yaxis=dict(title = 'comp2',
                                titlefont_color = 'black'),
                     zaxis=dict(title = 'comp3',
                                titlefont_color = 'black')),
        font = dict(family = "Gilroy", color = 'black', size = 15))
    fig.show()
8.2.1 2D
The following is the effect of 2D visualization:
In [29]:
pca_2d_object, df_pca_2d = get_pca_2d(data_no_outliers, clusters_predict)
In [30]:
plot_pca_2d(df_pca_2d, title = "PCA Space", opacity=1, width_line = 0.1)
The clusters are not clearly separated in the 2D PCA space, so the result is not very convincing.
8.2.2 3D
The following is the effect of 3D visualization:
In [31]:
pca_3d_object, df_pca_3d = get_pca_3d(data_no_outliers, clusters_predict)
In [32]:
plot_pca_3d(df_pca_3d, title = "PCA Space", opacity=1, width_line = 0.1)
print("The variability is : ", pca_3d_object.eigenvalues_summary)
The variability is : eigenvalue % of variance % of variance (cumulative)
component
0 2.245 11.81% 11.81%
1 1.774 9.34% 21.15%
2 1.298 6.83% 27.98%
The 3D view shows the same problem: the clusters overlap and the samples are not well separated.
The first three principal components together explain only 27.98% of the variance, which is not enough to capture the information and patterns in the original data. We therefore turn to T-SNE, a method mainly used to visualize high-dimensional data in two or three dimensions.
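As a side note on where the variance percentages come from: when the data are standardized, PCA works on the correlation matrix, whose eigenvalues sum to the number of features. Each percentage is then the eigenvalue divided by that number, which is consistent with the output above (2.245 / 19 is approximately 11.8%). The NumPy sketch below illustrates this under that assumption; the function name `explained_variance_ratio` and the toy matrix are hypothetical, and it is not how `prince` computes the decomposition internally.

```python
import numpy as np

def explained_variance_ratio(X):
    """Eigenvalues of the correlation matrix and the share of variance each one explains."""
    X = np.asarray(X, dtype=float)
    # Standardize every column (zero mean, unit standard deviation)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    corr = (Z.T @ Z) / len(Z)
    # eigvalsh returns ascending eigenvalues; reverse to get the largest first
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]
    return eigenvalues, eigenvalues / eigenvalues.sum()

# Toy data: column 2 is exactly twice column 1, column 3 is uncorrelated with both
X_demo = np.array([
    [1.0, 2.0, 1.0],
    [2.0, 4.0, -1.0],
    [3.0, 6.0, -1.0],
    [4.0, 8.0, 1.0],
])
eigenvalues, ratios = explained_variance_ratio(X_demo)
print(np.round(eigenvalues, 3), np.round(ratios, 3))
```

Two perfectly correlated columns collapse into one component that carries two thirds of the total variance. In the customer data no such strong correlations exist, so the variance is spread thinly across many components.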
9 Dimensionality reduction optimization (based on T-SNE)
Take a random sample of the data:
In [33]:
from sklearn.manifold import TSNE
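# Randomly sample the data without outliers
sampling_data = data_no_outliers.sample(frac=0.5, replace=True, random_state=1)
# Randomly sample the cluster labels with the same settings
# (same frac, replace and random_state, so both samples pick the same row positions)
sampling_cluster = pd.DataFrame(clusters_predict).sample(frac=0.5, replace=True, random_state=1)[0].values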
sampling_cluster
Out[33]:
array([4, 1, 1, ..., 2, 0, 4])
9.1 Implementing 2D dimensionality reduction
9.1.1 Dimensionality reduction
In [34]:
# Build the dimensionality reduction model
# (newer scikit-learn releases rename n_iter to max_iter)
tsne2 = TSNE(
    n_components=2,
    learning_rate=500,
    init='random',
    perplexity=200,
    n_iter = 5000)
In [35]:
data_tsne_2d = tsne2.fit_transform(sampling_data)
In [36]:
# Convert to a DataFrame and attach the original cluster labels
df_tsne_2d = pd.DataFrame(data_tsne_2d, columns=["comp1","comp2"])
df_tsne_2d["cluster"] = sampling_cluster
9.1.2 Visualization
In [37]:
plot_pca_2d(df_tsne_2d, title = "T-SNE Space", opacity=1, width_line = 0.1)
9.2 Implement 3D dimensionality reduction
9.2.1 Dimensionality reduction
Implement T-SNE dimensionality reduction on the clustered results:
In [38]:
# Build the 3D dimensionality reduction model
# (newer scikit-learn releases rename n_iter to max_iter)
tsne3 = TSNE(
    n_components=3,
    learning_rate=500,
    init='random',
    perplexity=200,
    n_iter = 5000
)
In [39]:
# Fit the model and transform the data
data_tsne_3d = tsne3.fit_transform(sampling_data)
In [40]:
# Convert to a DataFrame and attach the original cluster labels
df_tsne_3d = pd.DataFrame(data_tsne_3d, columns=["comp1","comp2","comp3"])
df_tsne_3d["cluster"] = sampling_cluster
9.2.2 Visualization of dimensionality reduction results
In [41]:
plot_pca_3d(df_tsne_3d, title = "T-SNE Space", opacity=1, width_line = 0.1)
Comparing the two-dimensional plots of the two dimensionality reduction methods, T-SNE clearly gives a much better result.
10 Classification based on LGBMClassifier
Use the original data without outliers, df_no_outliers, as the feature matrix X and the cluster labels, clusters_predict, as the target y to train an LGBMClassifier model:
10.1 Building a model
In [42]:
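import lightgbm as lgb
import shap
# Note: the parameter name is misspelled here. LightGBM expects colsample_bytree,
# so colsample_by_tree is ignored (see the warnings in the output below).
clf_lgb = lgb.LGBMClassifier(colsample_by_tree=0.8)
# Convert the categorical columns to the category dtype
for col in ["job","marital","education","housing","loan","default"]:
    df_no_outliers[col] = df_no_outliers[col].astype("category")
clf_lgb.fit(X=df_no_outliers,
            y=clusters_predict,
            feature_name = "auto",
            categorical_feature = "auto"
            )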
[LightGBM] [Warning] Unknown parameter: colsample_by_tree
[LightGBM] [Warning] Unknown parameter: colsample_by_tree
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000593 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 342
[LightGBM] [Info] Number of data points in the train set: 40690, number of used features: 8
[LightGBM] [Info] Start training from score -1.626166
[LightGBM] [Info] Start training from score -1.292930
[LightGBM] [Info] Start training from score -1.412943
[LightGBM] [Info] Start training from score -2.815215
[LightGBM] [Info] Start training from score -1.489282
Out[42]:
LGBMClassifier(colsample_by_tree=0.8)
10.2 shap visualization
In [43]:
explainer = shap.TreeExplainer(clf_lgb) # Build the explainer
shap_values = explainer.shap_values(df_no_outliers) # Compute the SHAP values
shap.summary_plot(shap_values, df_no_outliers, plot_type="bar",plot_size=(15,10))
As you can see from the results, the age field is the most important.
10.3 Model prediction
In [44]:
y_pred = clf_lgb.predict(df_no_outliers) # Predict on the training data
acc = accuracy_score(clusters_predict, y_pred) # Accuracy of the predictions against the cluster labels
print('Training-set accuracy score: {0:0.4f}'.format(acc))
[LightGBM] [Warning] Unknown parameter: colsample_by_tree
Training-set accuracy score: 1.0000
In [45]:
# Classification report
print(classification_report(clusters_predict, y_pred))
precision recall f1-score support
0 1.00 1.00 1.00 8003
1 1.00 1.00 1.00 11168
2 1.00 1.00 1.00 9905
3 1.00 1.00 1.00 2437
4 1.00 1.00 1.00 9177
accuracy 1.00 40690
macro avg 1.00 1.00 1.00 40690
weighted avg 1.00 1.00 1.00 40690
10.4 Aggregation results
In [46]:
# Original data without outliers (copy to avoid modifying a slice of df)
df_no_outliers = df[df.outliers == 0].copy()
df_no_outliers["cluster"] = clusters_predict # Cluster labels
Group by the cluster label and compute:
- the mean of each numeric field
- the most frequent value (mode) of each categorical field
In [47]:
df_no_outliers.groupby("cluster").agg({
"job":lambda x: x.value_counts().index[0],
"marital": lambda x: x.value_counts().index[0],
"education":lambda x: x.value_counts().index[0],
"housing":lambda x: x.value_counts().index[0],
"loan":lambda x: x.value_counts().index[0],
"age":"mean",
"balance":"mean",
"default":lambda x: x.value_counts().index[0]
}).sort_values("age").reset_index()
Out[47]:
|   | cluster | job | marital | education | housing | loan | age | balance | default |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 4 | technician | single | secondary | yes | no | 32.069740 | 794.696306 | no |
| 1 | 2 | blue-collar | married | secondary | yes | no | 34.569409 | 592.025644 | no |
| 2 | 3 | management | married | secondary | yes | no | 42.183012 | 7526.310217 | no |
| 3 | 0 | management | married | tertiary | no | no | 43.773960 | 872.797951 | no |
| 4 | 1 | blue-collar | married | secondary | no | no | 50.220989 | 836.407504 | no |
Reference
The original English article: https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41
In a later article I will share the second approach: Transformer model + K-Means + PCA/T-SNE.
The code is ready and the article is being written. As a preview, applying PCA to the data converted by the Transformer model (expanded to 384 dimensions) gives a much better result.