Telecommunications customer churn data analysis (1)

A data analysis project.

Background: I found this data set on Kaggle and decided to use it for a data analysis project, hoping to find some interesting results. Interested readers can download it from Kaggle: https://www.kaggle.com/blastchar/telco-customer-churn. The telecom customer churn data set has 7043 records and 21 fields: 20 input features and 1 target feature. The target feature indicates whether the customer has churned. The input features include customer ID, gender, whether the customer is elderly, partner, children, years of use, whether telephone service is used, whether network service is used, service contract period, payment method, bill type, monthly consumption, total consumption, and so on.

Below I propose a series of related questions and tasks to start data mining!

Task 1: Explore the data set

The first step with a new data set is to understand and become familiar with it, and to preprocess missing values and data errors.

1.1 Input data set

import pandas as pd
Telco_data=pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

1.2 Check the data set content

Telco_data.head(10)

Data preview
#Dataset size

Telco_data.shape

(7043, 21)

#Check whether the data set has missing values

Telco_data.isnull().sum()

No missing values

#Check the data type

Telco_data.dtypes

From this preliminary exploration, we can see that this is a 7043 * 21 data set that currently shows no missing values. However, one feature has the wrong data type. From experience and from inspecting the data, the TotalCharges field should be a continuous numeric field of type float, not object. The reason is that some records hold a blank string (a single space) in TotalCharges. pandas does not treat such a string as a missing value when reading the csv file, so the whole column is read as strings and isnull() reports nothing.

Check the number of records of missing values:

Telco_data[Telco_data['TotalCharges'].isin([' '])]

There are 11 records with missing values. Because this number is insignificant relative to the overall sample, I chose to delete these records.

1.3 Fix wrongly entered data set

Re-read the data set, convert blank strings to the missing value marker NaN, delete the records with missing values, and finally convert the TotalCharges field to a numeric type to obtain the correct input data set (all subsequent operations are based on the following code):

import seaborn as sns
import pandas as pd
import numpy as np
Te_data = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
Te_data.replace(to_replace=r'^\s*$',value=np.nan,regex=True,inplace=True)
Te_data.dropna(axis=0, how='any', inplace=True)
Te_data['TotalCharges'] = pd.to_numeric(Te_data['TotalCharges'])
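
A minimal sketch of why the problem goes unnoticed, and of an alternative fix, using a made-up three-row csv instead of the real file (the rows and the names toy_csv, toy, n_blank, and clean are illustrative assumptions). A blank TotalCharges entry is not counted by isnull(), but pd.to_numeric with errors="coerce" turns it into NaN so that the record can be dropped.

```python
import io
import pandas as pd

# Made-up stand-in for the real csv; the second row has a blank TotalCharges.
toy_csv = io.StringIO(
    "customerID,tenure,TotalCharges\n"
    "A,1,29.85\n"
    "B,0, \n"
    "C,2,108.15\n"
)
toy = pd.read_csv(toy_csv)

# The blank is a string, so isnull() does not count it as missing.
n_missing_before = int(toy["TotalCharges"].isnull().sum())
n_blank = int(toy["TotalCharges"].str.strip().eq("").sum())

# errors="coerce" converts anything unparseable (including " ") to NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
clean = toy.dropna(subset=["TotalCharges"])

print(n_missing_before, n_blank, len(clean))
```

Compared with the regex replacement, this only touches the one column being converted, which may be preferable when blank strings in other columns are meaningful.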

1.4 Summary statistical analysis
Field content statistics:

Te_data.describe(include='all')

Task 2: Which input features are relevant to customer churn?

2.1 Analysis approach
To solve task 2, the first question to answer is: why do customers churn? (Causality) The first reason, which I believe is also the answer most churned customers would give, is "the services provided by the telecom company do not satisfy me." This is the answer from the company's product dimension. The second reason is that customers churn because their personal circumstances change, such as their age. This is the answer from the perspective of personal attributes.
The second question to answer is: what kind of behavior shows that a customer is likely to churn? (Relevance) Behavior often reflects the customer's attitude towards the company and the product, and therefore some of the customer's label attributes, such as loyal user, active user, and so on.
Based on the above three answers, I divided all the input features into three dimensions: product, personal attributes, and user behavior, as shown in the following figure, and analyzed them separately.

2.2 Product dimension analysis
For the telecommunications company in this example, the main products are telephone service, network service, and some additional services built on these two. In this data set, 4,832 customers use both network and telephone services, 1,520 customers use telephone service only, and 680 customers use network service only. Since the sample sizes are not severely imbalanced, the following two hypotheses are made:

Hypothesis 1: Customers churn because of the company's telephone service.
Hypothesis 2: Customers churn because of problems with the company's network service.
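
The service usage counts quoted above can be tallied with a cross-tabulation. This is only a sketch on a made-up six-row frame (toy and its values are assumptions, not the real data). On the cleaned data the same calls would be applied to the PhoneService and InternetService columns of Te_data.

```python
import pandas as pd

# Toy stand-in for Te_data; the values are made up for illustration only.
toy = pd.DataFrame({
    "PhoneService":    ["Yes", "Yes", "No", "Yes", "Yes", "No"],
    "InternetService": ["DSL", "Fiber optic", "DSL", "No", "Fiber optic", "DSL"],
})

# Collapse DSL / Fiber optic into a single "has internet" flag.
has_internet = toy["InternetService"].ne("No").map(
    {True: "Internet", False: "No internet"}
)
has_phone = toy["PhoneService"].map({"Yes": "Phone", "No": "No phone"})

# Rows: phone usage, columns: internet usage, cells: number of customers.
usage = pd.crosstab(has_phone, has_internet)
print(usage)
```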

Use count plots drawn with the countplot function to check these two hypotheses:

sns.countplot(x='PhoneService',hue='Churn',data=Te_data,order=['Yes','No'],hue_order=['Yes','No'])
sns.countplot(x='InternetService',hue='Churn',data=Te_data,hue_order=['Yes','No'],order=['DSL','Fiber optic','No'])

[Figure: count plot of churn by PhoneService]
The first figure shows churn for customers with and without telephone service. The churn rates of the two groups are not much different, so Hypothesis 1 is not supported.
[Figure: count plot of churn by InternetService]
In the second figure, the churn rate of customers with network service is significantly higher than that of customers without it, especially for those on fiber optic (Fiber optic). This suggests that the service has problems and urgently needs improvement. It also supports Hypothesis 2: whether a customer uses network service is closely related to whether the customer churns.
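
Count plots show raw counts, so reading a churn rate off them means comparing bar heights within each group. A rough sketch of computing the rate directly, on made-up rows (the toy frame and the helper name churn_rate_by are hypothetical):

```python
import pandas as pd

# Made-up rows for illustration; not taken from the real data set.
toy = pd.DataFrame({
    "InternetService": ["DSL", "DSL", "Fiber optic", "Fiber optic",
                        "Fiber optic", "No", "No", "DSL"],
    "Churn":           ["No", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
})

def churn_rate_by(df, column):
    """Return the share of churned customers within each value of `column`."""
    churned = df["Churn"].eq("Yes")          # True where the customer churned
    return churned.groupby(df[column]).mean()  # mean of booleans = proportion

rates = churn_rate_by(toy, "InternetService")
print(rates)
```

The same helper could be pointed at any categorical column, for example Contract or PaymentMethod.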
For network services, we can further explore whether having additional services affects churn, again using count plots:

import matplotlib.pyplot as plt  # plt must be imported before plt.subplots is called
fig,axes = plt.subplots(2,2,figsize=(14,10))
sns.countplot(x='OnlineSecurity',hue='Churn',data=Te_data,hue_order=['Yes','No'],order=['Yes','No','No internet service'],ax=axes[0,0])
sns.countplot(x='OnlineBackup',hue='Churn',data=Te_data,hue_order=['Yes','No'],order=['Yes','No','No internet service'],ax=axes[0,1])
sns.countplot(x='DeviceProtection',hue='Churn',data=Te_data,hue_order=['Yes','No'],order=['Yes','No','No internet service'],ax=axes[1,0])
sns.countplot(x='TechSupport',hue='Churn',data=Te_data,hue_order=['Yes','No'],order=['Yes','No','No internet service'],ax=axes[1,1])

[Figure: count plots of churn by OnlineSecurity, OnlineBackup, DeviceProtection, and TechSupport]
We can see that network service customers who also pay for additional services such as network security, network backup, device protection, and technical support are less likely to churn. One message here is that although the company's basic network service has certain problems, additional services can effectively reduce the churn those problems cause.
Summary of the product dimension: whether a customer uses network services, and whether a network service customer uses additional services such as network security, network backup, device protection, and technical support, are strongly related to customer churn.

2.3 Personal attribute dimension analysis
Personal attributes include gender, whether the customer is elderly, whether the customer has a partner, and whether the customer has children to raise. Judging from experience, apart from being elderly, these indicators are not main reasons for churn. I think the elderly churn because their demand for call or network services changes with age, and the impact of death and other factors cannot be ruled out. So I make the following hypothesis:

Hypothesis 3: The elderly group is more likely to churn than the non-elderly group.

Use the count graph drawn by the countplot function to verify Hypothesis 3:

sns.countplot(x='SeniorCitizen',hue='Churn',data=Te_data,order=[0,1],hue_order=['Yes','No'])

[Figure: count plot of churn by SeniorCitizen]
Here 0 represents non-elderly customers and 1 represents elderly customers. The chart supports our hypothesis: the churn rate of the elderly group is significantly higher than that of the non-elderly group. Therefore, whether a customer is elderly is likely related to churn.
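
Whether a difference like this is more than chance can be checked with a chi-square test of independence. The sketch below computes the statistic by hand with numpy on a made-up 2x2 table (the counts are illustrative assumptions, not values from the data set) and compares it with 3.841, the 5% critical value for one degree of freedom.

```python
import numpy as np

# Made-up counts: rows = SeniorCitizen 0 / 1, columns = churned / stayed.
observed = np.array([[20.0, 80.0],
                     [40.0, 60.0]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
total = observed.sum()

# Expected counts if churn were independent of the senior flag.
expected = row_totals * col_totals / total

# Pearson chi-square statistic (no continuity correction).
chi2 = ((observed - expected) ** 2 / expected).sum()

# 3.841 is the 5% critical value of the chi-square distribution with 1 degree of freedom.
reject_independence = chi2 > 3.841
print(round(chi2, 3), reject_independence)
```

On the real data, the observed table would come from a cross-tabulation of SeniorCitizen against Churn.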

2.4 Analysis of user behavior dimension
From the perspective of user behavior, the features are no longer causes of churn but correlates of it. That is, the features in this dimension do not cause churn, but they can help predict how likely a customer is to churn. On this basis, the indicators I consider related to customer churn are: contract duration, years of use, payment method, monthly consumption, and total consumption. Among them, contract period, years of use, and total consumption reflect a customer's loyalty to the company, and loyalty is closely related to churn. The payment method and monthly consumption reflect the customer's spending habits, which also have some influence on churn.
In addition, total consumption and service life are closely related: the longer a customer has used the service, the more they have spent in total. Total consumption ≈ service life * monthly consumption. Therefore, I run a regression of total consumption on service life (order=4 makes regplot fit a fourth-order polynomial):

sns.regplot(x='tenure',y='TotalCharges',order=4,data=Te_data)

[Figure: regression plot of TotalCharges against tenure]
As the figure shows, the two have a roughly linear relationship. Next, calculate the correlation coefficient of these two features:

corr=Te_data[['tenure','TotalCharges']].corr()
corr

The result shows that their correlation coefficient is higher than 0.8, a strong linear correlation. Therefore, in the following analysis I ignore the total consumption feature: analyzing service life alone is enough to reflect how both service life and total consumption relate to churn.
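
As a small illustration of the two checks in this step, the sketch below computes the Pearson correlation and the gap between total consumption and service life * monthly consumption. The five rows are made up for the example (toy, estimate, and rel_gap are hypothetical names), so the numbers say nothing about the real data.

```python
import pandas as pd

# Made-up customers; TotalCharges is close to, but not exactly,
# tenure * MonthlyCharges, as would happen if prices changed over time.
toy = pd.DataFrame({
    "tenure":         [1, 12, 24, 48, 72],
    "MonthlyCharges": [30.0, 55.0, 80.0, 60.0, 100.0],
    "TotalCharges":   [30.0, 640.0, 1950.0, 2900.0, 7100.0],
})

# Pearson correlation (the default method of Series.corr).
r = toy["tenure"].corr(toy["TotalCharges"])

# Relative gap between the approximation and the recorded total.
estimate = toy["tenure"] * toy["MonthlyCharges"]
rel_gap = (estimate - toy["TotalCharges"]).abs() / toy["TotalCharges"]

print(round(r, 3), round(rel_gap.max(), 3))
```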
Based on the above analysis, the hypotheses for each indicator are checked below:

Hypothesis 4: The contract period is closely related to churn.
Hypothesis 5: The service life is closely related to churn.
Hypothesis 6: The payment method is closely related to churn.
Hypothesis 7: Monthly consumption is closely related to churn.

sns.countplot(x='Contract',hue='Churn',data=Te_data,hue_order=['Yes','No'],order=['Month-to-month','One year','Two year'])

[Figure: count plot of churn by Contract]
Clearly, the longer the contract period, the lower the churn rate. Hypothesis 4 holds.

sns.boxplot(x='Churn', y='tenure', data=Te_data,order=['Yes','No'])

[Figure: box plot of tenure by Churn]
As the box plot shows, the median, first quartile, and third quartile of service life for the non-churn group are all higher than those of the churn group. Customers with a service life of less than 15 months are more likely to churn, while customers with a service life of more than 30 months are more likely to stay. In general, the longer a customer's service life, the less likely the customer is to churn. Hypothesis 5 holds.
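
The quartiles that a box plot draws can also be read as numbers. A minimal sketch on made-up tenure values (the toy frame is an assumption for illustration), grouping by Churn and taking the 25%, 50%, and 75% quantiles:

```python
import pandas as pd

# Made-up tenure values in months; five churned and five retained customers.
toy = pd.DataFrame({
    "Churn":  ["Yes"] * 5 + ["No"] * 5,
    "tenure": [1, 2, 5, 10, 30, 12, 24, 40, 60, 72],
})

# One row per Churn value, one column per quantile.
quartiles = (
    toy.groupby("Churn")["tenure"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(quartiles)
```

Applied to the real data, the table would give the exact values behind the thresholds read off the plot.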

import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
sns.countplot(x='PaymentMethod',hue='Churn',data=Te_data,hue_order=['Yes','No'],order=['Electronic check','Mailed check','Bank transfer (automatic)','Credit card (automatic)'])

In terms of payment methods, the churn rate of customers who pay by electronic check is significantly higher than that of customers using other payment methods. The payment method is related to the churn rate. Hypothesis 6 holds.

sns.boxplot(x='Churn', y='MonthlyCharges', data=Te_data,order=['Yes','No'])

[Figure: box plot of MonthlyCharges by Churn]

In terms of monthly consumption, churned customers pay more per month overall than non-churn customers, and their monthly consumption varies less. Hypothesis 7 holds.
Summary of the user behavior dimension: contract duration, years of use, payment method, monthly consumption, and total consumption are related to customer churn.

Task 2 summary: after analysis and screening, among the 20 input features, the indicators currently considered most relevant to customer churn are: whether the customer uses network services, whether the customer uses network security, network backup, device protection, and technical support, whether the customer is elderly, contract period, years of use, payment method, and monthly consumption.

Next: Telecommunications customer churn data analysis (2), Task 3: How to determine whether a customer is a potential churn target?


Origin blog.csdn.net/gdben_user/article/details/105645585