R language cluster analysis and principal component analysis final report

1. Summary of knowledge points learned this semester

In this semester's multivariate statistical analysis course, I learned two important data analysis methods: cluster analysis and principal component analysis (or factor analysis). Through this study, I gained a deeper understanding of the following topics.

  1. Cluster analysis: Cluster analysis is an unsupervised learning method used to divide the individuals in a data set into groups, or clusters, so that individuals within the same group are similar to each other and individuals in different groups are dissimilar. I learned the basic principles of clustering, including distance measures, clustering algorithms (such as K-means and hierarchical clustering), and indicators for evaluating clustering results (such as the silhouette coefficient and the Davies-Bouldin index). Through practical exercises and case analysis, I also came to understand the application scenarios of cluster analysis, such as market segmentation and social network analysis.

  2. Principal component analysis (or factor analysis): Principal component analysis is a dimensionality reduction technique that summarizes high-dimensional data with a small number of components capturing most of its variance, which makes the data easier to understand and interpret. I learned the basic principles and steps of principal component analysis, including eigenvalue decomposition, the selection and interpretation of principal components, and principal component scores. I also learned about factor analysis, a related method for exploring underlying latent factor structures. Through the analysis of real cases and data sets, I further understood how principal component analysis and factor analysis are applied to dimensionality reduction and variable interpretation.

During the learning process, I found that cluster analysis and principal component analysis (or factor analysis) involve some easily confused concepts, such as similarity and dissimilarity of data, eigenvalues and eigenvectors, and principal components and factors. To understand the differences and connections between these concepts, I read relevant textbooks and papers, worked through examples and case analyses, and discussed them with my classmates. Through these efforts, I gradually came to understand the meaning and application of these concepts.

2. Problem Research

Research Background

In the telecommunications industry, customer churn, also known as customer attrition, is an important challenge. Churn refers to customers leaving or ceasing to use a company's services. It matters to telecommunications companies because attracting new customers is generally more expensive than retaining existing ones. Therefore, understanding the causes of customer churn and predicting its patterns is of great significance for telecommunications companies.

Research Purpose

The purpose of this study is to analyze the Telco telecommunications customer churn data set by applying techniques such as cluster analysis, principal component analysis, or factor analysis to reveal the similarities between customer groups, main influencing factors, and possible patterns. Specific goals include:

  • Identify different customer groups: Through cluster analysis, customers are divided into different groups to discover potential customer segments. This helps telcos understand the differences in behavior and needs of different groups.

  • Determine the main influencing factors: Identify the variables that have the most influence on customer churn through principal component analysis or factor analysis. This helps identify key factors and develop strategies accordingly to reduce customer churn.

  • Provide decision support: Provide telecom companies with insights and decision support regarding customer churn by analyzing customer churn data sets. This can help them develop customized marketing strategies and customer retention plans for different customer segments.

Significance

This research is of great significance to the telecommunications industry because it helps companies better understand customer behavior, needs, and preferences. Through cluster analysis, companies can provide personalized products and services based on the characteristics of different customer groups, thereby increasing customer satisfaction and loyalty. Principal component analysis or factor analysis can help companies determine which factors matter most, so that customer retention measures can be targeted more precisely. In addition, by anticipating customer churn, companies can act in time to retain existing customers and reduce business risk.

3. Data set

The Telco Churn Dataset is a telecommunications industry data set used to study customer churn problems. This data set contains customer information about the Telco telecommunications company and is designed to help researchers and analysts understand the patterns and influencing factors of customer churn.

The data set contains 7043 observations (rows) and 21 variables (columns), providing multiple aspects of information about each customer, including personal information, service subscriptions, account information, and payment status.

Below is a table of variables and descriptions for the Telco churn data set:

Variable name      Description
customerID         Unique identifier of the customer
gender             Gender of the customer
SeniorCitizen      Whether the customer is a senior citizen (1 = yes, 0 = no)
Partner            Whether the customer has a partner
Dependents         Whether the customer has dependents
tenure             Number of months the customer has stayed with the company
PhoneService       Whether the customer subscribes to phone service
MultipleLines      Whether the customer subscribes to multiple lines
InternetService    Type of Internet service the customer subscribes to
OnlineSecurity     Whether the customer subscribes to online security services
OnlineBackup       Whether the customer subscribes to online backup services
DeviceProtection   Whether the customer subscribes to device protection services
TechSupport        Whether the customer subscribes to technical support services
StreamingTV        Whether the customer subscribes to streaming TV
StreamingMovies    Whether the customer subscribes to streaming movies
Contract           Contract type of the customer
PaperlessBilling   Whether the customer uses paperless billing
PaymentMethod      Payment method of the customer
MonthlyCharges     Amount charged to the customer each month
TotalCharges       Total amount charged to the customer to date
Churn              Whether the customer has churned ("Yes" = churned, "No" = not churned)

This dataset provides extensive information for studying the customer churn problem and can be used to perform various analysis and modeling tasks, such as cluster analysis, principal component analysis, factor analysis, and construction of customer churn prediction models. Analysis of this data set can help telecommunications companies identify potential customer segments, understand customer needs and behaviors, and develop corresponding strategies to reduce customer churn, improve customer satisfaction, and sustainably develop the business.

4. Principal component analysis

The idea of principal component analysis is to map the original high-dimensional data into a new low-dimensional space through a linear transformation, chosen so that the variance of the data in the new space is as large as possible. Specifically, from the covariance matrix or correlation matrix of the data set, principal component analysis finds a new set of variables (called principal components) that are linear combinations of the original variables and are uncorrelated with each other. The principal components are ordered by decreasing variance, so the first few components capture the largest shares of the variance in the data, while each subsequent component captures less.

The steps of principal component analysis are as follows:

  • Standardize the data: If the variables of the original data are measured on different scales, standardize the data so that all variables are on the same scale.
  • Calculate the covariance matrix: From the standardized data, calculate the covariance matrix (equivalently, the correlation matrix of the original variables).
  • Calculate eigenvalues and eigenvectors: Perform an eigenvalue decomposition of the covariance matrix (or a singular value decomposition of the standardized data) to obtain the eigenvalues and the corresponding eigenvectors.
  • Select principal components: Sort the eigenvalues in decreasing order and select the eigenvectors corresponding to the k largest eigenvalues as the principal components.
  • Transform the data: Project the standardized data onto the selected principal components to obtain a new low-dimensional representation.
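
The steps above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: it uses the built-in USArrests data set rather than the Telco data, and the names z, s, e, v and scores_manual are chosen here for the example. The last line cross-checks the manual result against prcomp().

```r
# Illustrative data: the built-in USArrests data set (not the Telco data)
x <- USArrests

# Step 1: standardize each variable to mean 0 and standard deviation 1
z <- scale(x)

# Step 2: covariance matrix of the standardized data
# (this equals the correlation matrix of the original variables)
s <- cov(z)

# Step 3: eigenvalue decomposition; eigen() returns eigenvalues in decreasing order
e <- eigen(s)

# Step 4: keep the eigenvectors of the k largest eigenvalues
k <- 2
v <- e$vectors[, 1:k]

# Step 5: project the standardized data onto the selected components
scores_manual <- z %*% v

# Proportion of variance explained by each component
prop_var <- e$values / sum(e$values)

# Cross-check against prcomp(); the sign of a component is arbitrary,
# so absolute values are compared
p <- prcomp(x, center = TRUE, scale. = TRUE)
all.equal(abs(unname(scores_manual)), abs(unname(p$x[, 1:k])))
```

Comparing absolute values is a deliberate choice: an eigenvector and its negative describe the same direction, so two correct implementations can return scores that differ only in sign.
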
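library(tidyverse)
library(corrplot)

# Center plot titles in all ggplot2 figures
theme_update(plot.title = element_text(hjust = 0.5))

# Read the data; character columns are read as factors
data <- read.csv("WA_Fn-UseC_-Telco-Customer-Churn.csv", stringsAsFactors = TRUE)

# Drop the ID column and the rows with missing values
data <- data %>% select(-customerID) %>% drop_na()

# Convert factor variables to numeric codes (levels are numbered alphabetically)
df <- data %>% mutate_if(is.factor, as.numeric)

# Compute the correlation matrix
cor_matrix <- cor(df)

# Plot the correlation matrix
corrplot(cor_matrix, method = "circle", tl.cex = 0.7)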

Based on the correlation coefficient between each variable in the correlation coefficient matrix and "Churn" (customer churn), the following conclusions can be drawn:

  • The correlation coefficient between gender and customer churn is close to zero, indicating that gender has a small impact on customer churn, that is, there is no obvious linear relationship between gender and customer churn.

  • The correlation coefficient between SeniorCitizen and customer churn is positive, indicating that older customers are more likely to churn.

  • The correlation coefficients of Partner and Dependents with customer churn are both negative, indicating that customers without a partner or dependents are more likely to churn.

  • Among the various services, those most strongly correlated with customer churn are OnlineSecurity, OnlineBackup, DeviceProtection and TechSupport. The negative correlation coefficients indicate that customers without these services are more likely to churn.

  • The correlation coefficient between contract type (Contract) and customer churn is negative. Because the contract levels are coded in the order Month-to-month, One year, Two year, this indicates that customers on month-to-month contracts are more likely to churn, while customers on long-term contracts (One year, Two year) are more stable.

  • The correlation coefficient between paperless billing (PaperlessBilling) and customer churn is positive, indicating that customers who choose paperless billing are more likely to churn.

  • The correlation coefficients of payment method (PaymentMethod), monthly charges (MonthlyCharges) and total charges (TotalCharges) with customer churn are around 0.1, indicating only a weak linear relationship with churn. Because PaymentMethod is a nominal variable that has been given arbitrary integer codes, its coefficient should be read with caution.
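
The sign of each coefficient above depends on how as.numeric() codes the factor levels. The small sketch below shows that coding on a made-up four-customer example; the values are purely illustrative and are not taken from the Telco data.

```r
# Made-up values for illustration only (not rows from the Telco data)
contract <- factor(c("Month-to-month", "Two year", "One year", "Month-to-month"))
churn <- factor(c("Yes", "No", "No", "Yes"))

# factor() sorts the levels alphabetically
levels(contract)  # "Month-to-month" "One year" "Two year"
levels(churn)     # "No" "Yes"

# as.numeric() returns the position of each value in the level order
contract_code <- as.numeric(contract)  # 1 3 2 1
churn_code <- as.numeric(churn)        # 2 1 1 2

# A negative coefficient therefore means that higher contract codes
# (longer contracts) go together with the lower churn code ("No")
cor(contract_code, churn_code)
```

For an ordered variable such as Contract this coding happens to be meaningful. For a nominal variable with more than two levels, the alphabetical order carries no meaning, so the resulting coefficient is hard to interpret.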

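library(psych)

# Run principal component analysis
pca_result <- principal(df)

# Extract the eigenvalues from the PCA result
eigenvalues <- pca_result$values

# Compute the proportion of variance explained by each component
variance_explained <- eigenvalues / sum(eigenvalues)

# Draw the scree plot
plot(seq_along(variance_explained), variance_explained, type = "b", pch = 19,
     xlab = "Number of principal components",
     ylab = "Proportion of variance explained",
     main = "Scree plot of the principal component analysis")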

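Based on the inflection point (elbow) of the scree plot, the number of principal components was set to 3. Note that these three components together explain only about 41.8% of the total variance, well below the 80% level that is often used as a threshold, so the choice here rests on the scree plot rather than on cumulative variance.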

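# Run the PCA again, retaining 3 components
# (principal() applies a varimax rotation by default, hence the RC labels)
pca_result <- principal(df, nfactors = 3)
pca_result$loadings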

After performing principal component analysis, we can extract the results of the principal components and interpret them. The following are the extracted principal component results and their interpretation:

  • The sum of squared loadings (SS loadings) of rotated component 1 (RC1) is 3.958, the proportion of variance explained is 0.198, and the cumulative proportion of variance is 0.198.
  • The sum of squared loadings of rotated component 2 (RC2) is 2.828, the proportion of variance explained is 0.141, and the cumulative proportion of variance is 0.339.
  • The sum of squared loadings of rotated component 3 (RC3) is 1.569, the proportion of variance explained is 0.078, and the cumulative proportion of variance is 0.418.

These data tell us how much variance is explained by each principal component, and what proportion of the cumulative variance is. In this example, principal component 1 (RC1) explains about 19.8% of the variance, principal component 2 (RC2) explains about 14.1% of the variance, and principal component 3 (RC3) explains about 7.8% of the variance. The cumulative variance proportion represents the sum of the variance proportions explained by the first n principal components. For 3 principal components, the cumulative variance proportion is 41.8%.
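
The proportions quoted above follow directly from the reported sums of squared loadings. As a quick check, assuming the analysis uses the 20 columns of df (the 21 original columns minus customerID), each proportion is the sum of squared loadings divided by the number of variables:

```r
# Sums of squared loadings as reported in the text
ss_loadings <- c(RC1 = 3.958, RC2 = 2.828, RC3 = 1.569)

# Number of standardized variables in the analysis (21 columns minus customerID)
n_vars <- 20

# Each standardized variable contributes variance 1, so the total variance is n_vars
prop_var <- ss_loadings / n_vars
cum_var <- cumsum(prop_var)

round(prop_var, 3)  # RC1 0.198, RC2 0.141, RC3 0.078
round(cum_var, 3)   # RC1 0.198, RC2 0.339, RC3 0.418
```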

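# Get the component score weights
weights <- pca_result$weights

# Compute the component scores of each sample; the weights apply to
# standardized variables, so the data are scaled first
scores <- scale(df) %*% weights

# Sum the component scores of each sample into a total score
total_scores <- rowSums(scores)

# Add the total score to the data frame
data$score <- total_scores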

5. Cluster analysis

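First, the variables to be clustered are selected and the data are preprocessed: only churned customers (Churn == "Yes") are kept, factor variables are converted to numeric codes, and all variables are standardized. Then the K-means algorithm is applied to the standardized data, with the number of clusters set to 3.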

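# Keep only churned customers and drop the Churn and score columns
df1 <- data %>% filter(Churn == "Yes") %>% select(-Churn, -score)

# Convert factor variables to numeric codes
df2 <- df1 %>% mutate_if(is.factor, as.numeric)

# Standardize the variables
scaled_data <- scale(df2)

# Run K-means clustering
k <- 3  # number of clusters
set.seed(123)
kmeans_result <- kmeans(scaled_data, centers = k)

# Extract the cluster labels
cluster_labels <- kmeans_result$cluster

# Add the cluster labels to the original (unscaled) data
clustered_data <- bind_cols(df1, cluster = cluster_labels)
cluster1_data <- clustered_data %>% filter(cluster == 1)
summary(cluster1_data)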
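
The number of clusters is fixed at 3 above without a stated criterion. One common way to examine such a choice is the elbow method, sketched below in base R. This is an illustration only: it runs on the built-in USArrests data rather than the Telco data, and the range 1 to 8 and nstart = 25 are arbitrary choices made for the example.

```r
# Illustrative data: standardized built-in USArrests data (not the Telco data)
z <- scale(USArrests)

# Total within-cluster sum of squares for k = 1, ..., 8
set.seed(123)
ks <- 1:8
wss <- sapply(ks, function(k) kmeans(z, centers = k, nstart = 25)$tot.withinss)

# The "elbow" is the value of k after which the curve flattens out
plot(ks, wss, type = "b", pch = 19,
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The same loop could be run on scaled_data to see whether 3 clusters is a reasonable choice for the churned customers.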

The analysis results of cluster category 1 are as follows:

  • gender: There are 113 female and 142 male customers in cluster category 1.
  • SeniorCitizen: In cluster category 1, approximately 21.6% of customers are senior citizens.
  • Partner: There are 82 customers in cluster category 1 without a partner and 173 customers with a partner.
  • Dependents: In cluster category 1, there are 178 customers without dependents and 77 customers with dependents.
  • Tenure: The average tenure of customers in cluster category 1 is approximately 53.6 months, with a minimum of 14 months and a maximum of 72 months.

Here are the statistics for the other variables in cluster category 1:

  • PhoneService: 12 customers have no phone service and 243 customers have phone service.
  • MultipleLines: Among the customers with phone service, 58 have a single line and 185 have multiple lines.
  • InternetService: 53 customers are on DSL, 195 customers are on fiber optic, and 7 customers have no Internet service.
  • The remaining variables (OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges) are described in the summary() output.

cluster2_data <- clustered_data %>% filter(cluster == 2)
summary(cluster2_data)

The analysis results of cluster category 2 are as follows:

  • gender: There are 454 female and 447 male customers in cluster category 2.
  • SeniorCitizen: In cluster category 2, approximately 15.4% of customers are senior citizens.
  • Partner: There are 687 customers in cluster category 2 without a partner and 214 customers with a partner.
  • Dependents: In cluster category 2, there are 740 customers without dependents and 161 customers with dependents.
  • Tenure: The average tenure of customers in cluster category 2 is approximately 7.1 months, with a minimum of 1 month and a maximum of 61 months.

Here are the statistics for the other variables in cluster category 2:

  • PhoneService: 145 customers have no phone service and 756 customers have phone service.
  • MultipleLines: Among the customers with phone service, 605 have a single line and 151 have multiple lines.
  • InternetService: 378 customers are on DSL, 419 customers are on fiber optic, and 104 customers have no Internet service.
  • The remaining variables (OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges) are described in the summary() output.

cluster3_data <- clustered_data %>% filter(cluster == 3)
summary(cluster3_data)

The analysis results of cluster category 3 are as follows:

  • gender: There are 372 female and 341 male customers in cluster category 3.
  • SeniorCitizen: In cluster category 3, approximately 39.55% of customers are senior citizens.
  • Partner: There are 431 customers in cluster category 3 without a partner and 282 customers with a partner.
  • Dependents: In cluster category 3, there are 625 customers without dependents and 88 customers with dependents.
  • Tenure: The average tenure of customers in cluster category 3 is approximately 19 months, with a minimum of 1 month and a maximum of 66 months.

Here are the statistics for the other variables in cluster category 3:

  • PhoneService: 13 customers have no phone service and 700 customers have phone service.
  • MultipleLines: Among the customers with phone service, 186 have a single line and 514 have multiple lines.
  • InternetService: 28 customers are on DSL, 683 customers are on fiber optic, and 2 customers have no Internet service.
  • The remaining variables (OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges) are described in the summary() output.
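
The three per-cluster summaries can also be placed side by side, which makes differences such as the tenure gap easier to see. The sketch below shows one way to do this with base R's aggregate(). It is an illustration on the built-in USArrests data rather than the Telco data, and the names km and profile are chosen for the example.

```r
# Illustrative data: built-in USArrests data (not the Telco data)
z <- scale(USArrests)

# Cluster the standardized data into 3 groups
set.seed(123)
km <- kmeans(z, centers = 3, nstart = 25)

# Mean of every original (unscaled) variable within each cluster
profile <- aggregate(USArrests, by = list(cluster = km$cluster), FUN = mean)

# Cluster sizes, in the same cluster order as the rows of profile
profile$n <- as.vector(table(km$cluster))

profile
```

Applied to the churned customers, the same pattern would need the numeric columns of clustered_data in place of USArrests and cluster_labels in place of km$cluster.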

Origin blog.csdn.net/weixin_54707168/article/details/132661078