Kaggle Case Study: Telecom Customer Churn Prediction, Part Two: data standardization; principal component analysis and visualization; correlation analysis and heat map; customer profiles and radar charts

This part continues from the previous post.

4 Data Preprocessing: preprocessing and an overview of the remaining variable information

4.1 Data preprocessing

# Data preprocessing
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Customer ID column
Id_col = ['customerID']
# Target column
target_col = ['Churn']
# Categorical columns: fewer than 6 distinct values
cat_cols = telcom.nunique()[telcom.nunique()<6].keys().tolist()
cat_cols = [x for x in cat_cols if x not in target_col]
# Numerical columns
num_cols = [x for x in telcom.columns if x not in cat_cols+target_col+Id_col]
# Binary columns: exactly two distinct values
bin_cols = telcom.nunique()[telcom.nunique()==2].keys().tolist()
# Multi-valued categorical columns
multi_cols = [i for i in cat_cols if i not in bin_cols]

# Label-encode the binary columns
le = LabelEncoder()
for i in bin_cols:
    telcom[i] = le.fit_transform(telcom[i])

# One-hot encode the multi-valued columns (dummy variables)
telcom = pd.get_dummies(data=telcom, columns=multi_cols)

# Standardize the numerical columns
std = StandardScaler()
scaled = std.fit_transform(telcom[num_cols])
# Reuse telcom's index so that the index-based merge below lines up row by row
scaled = pd.DataFrame(scaled, columns=num_cols, index=telcom.index)

# Keep a copy of the unscaled data, then drop the original numerical columns
# and merge the standardized ones back into the table
df_telcom_og = telcom.copy()
telcom = telcom.drop(columns=num_cols)
telcom = telcom.merge(scaled, left_index=True, right_index=True, how='left')

The code above produces no output.

  • This step splits the variables by type and standardizes the numerical columns.
  • It prepares the data for the variable overview, PCA (Principal Component Analysis), and correlation analysis that follow.
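
To make the splitting and encoding steps concrete, here is a minimal sketch on a made-up six-column frame. The toy values and the threshold of 4 (instead of 6) are assumptions for illustration only. The sketch uses pandas alone: category codes stand in for LabelEncoder, and the z-score is written out by hand using the population standard deviation (ddof=0), which is what StandardScaler computes.

```python
import pandas as pd

# Toy stand-in for the telcom frame; the values are invented for illustration.
toy = pd.DataFrame({
    "customerID": ["a1", "b2", "c3", "d4"],
    "Partner": ["Yes", "No", "Yes", "No"],
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "tenure": [1.0, 12.0, 48.0, 60.0],
    "MonthlyCharges": [29.85, 56.95, 42.30, 70.70],
    "Churn": ["Yes", "No", "No", "Yes"],
})

id_col, target_col = ["customerID"], ["Churn"]
n_unique = toy.nunique()

# Same splitting idea as in the post, with a threshold of 4 instead of 6
# because the toy frame only has four rows.
cat_cols = [c for c in n_unique[n_unique < 4].index if c not in target_col]
num_cols = [c for c in toy.columns if c not in cat_cols + target_col + id_col]
bin_cols = n_unique[n_unique == 2].index.tolist()
multi_cols = [c for c in cat_cols if c not in bin_cols]

# Binary columns: categories are sorted alphabetically, so "No" -> 0, "Yes" -> 1.
for c in bin_cols:
    toy[c] = toy[c].astype("category").cat.codes

# Multi-valued columns: one indicator column per category.
toy = pd.get_dummies(toy, columns=multi_cols)

# Numerical columns: subtract the mean, divide by the population std (ddof=0).
toy[num_cols] = (toy[num_cols] - toy[num_cols].mean()) / toy[num_cols].std(ddof=0)

print(toy.head())
```

Note that the target column Churn ends up in bin_cols, so it is encoded to 0/1 together with the other binary variables, just as in the code above.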

4.2 Overview of Variable Information

summary = (df_telcom_og[[i for i in df_telcom_og.columns if i not in Id_col]].describe().transpose().reset_index())
summary = summary.rename(columns={'index':'feature'})
summary = np.around(summary, 3)

val_list = [summary['feature'], summary['count'], summary['mean'], summary['std'],
            summary['min'], summary['25%'], summary['50%'], summary['75%'], summary['max']]
trace = go.Table(header=dict(values=summary.columns.tolist(),
                             line=dict(color=['#506784']),
                             fill=dict(color=['#119DFF'])),
                 cells=dict(values=val_list,
                            line=dict(color=['#506784']),
                            fill=dict(color=['lightgrey', '#F5F8FF'])),
                 columnwidth=[200, 60, 100, 100, 60, 60, 80, 80, 80])
layout = go.Layout(dict(title='Variable Summary'))
fig = go.Figure(data=[trace], layout=layout)
py.iplot(fig)

# Drop the tenure_group dummy variables:
# they are not needed for the plots that follow.
telcom = telcom.drop(columns=['tenure_group_Tenure_0_12', 'tenure_group_Tenure_12_24', 'tenure_group_Tenure_24_48',
                              'tenure_group_Tenure_48_60', 'tenure_group_Tenure_gt_60'])

Result output:
A table of summary statistics (count, mean, standard deviation, minimum, quartiles, maximum) for each variable:
Variable summary table

4.3 Variable correlation coefficient matrix (heat map)

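# Compute the correlation matrix (customerID is a string column, so leave it out)
corr = telcom.drop(columns=Id_col).corr()
# Extract the matrix labels
matrix_cols = corr.columns.tolist()
# Convert to a NumPy array
corr_array = np.array(corr)

# Plot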
trace = go.Heatmap(z=corr_array, x=matrix_cols, y=matrix_cols, 
                   colorscale='Viridis', colorbar=dict(
        title='Pearson Correlation Coefficient', 
        titleside='right'
                   ))
layout = go.Layout(dict(title='Correlation Matrix for variables', 
                        autosize=False, height=720, width=800, 
                        margin=dict(r=0, l=210, t=25, b=210), 
                        yaxis=dict(tickfont=dict(size=9)), 
                        xaxis=dict(tickfont=dict(size=9))))
data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

Result output:

  • For reasons I have not worked out, my own heat map rendered with an offset, so the figure from the original article is shown here:
    Variable correlation matrix heat map
  • The strongest positive correlations: total charges and tenure have a coefficient of 0.826, a very strong positive correlation; total charges and monthly charges have 0.651, a strong correlation; monthly charges and fiber-optic Internet service have 0.78, also very strong; and monthly charges correlate with streaming movies and streaming TV at slightly above 0.62, a strong correlation.
  • The most notable negative correlations: the month-to-month contract and tenure have a coefficient of -0.65, a strong negative correlation; monthly charges are also strongly negatively correlated with the indicator for having no Internet service, with a coefficient of -0.76.
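
Reading the strongest pairs off a large heat map by eye is tedious, so they can also be ranked directly from the correlation matrix. The following is a rough sketch on invented data (the column names mirror the dataset, but the numbers are random and will not reproduce the coefficients quoted above). It also writes out the Pearson formula once to show what `corr()` computes for each pair.

```python
import numpy as np
import pandas as pd

# Invented data: TotalCharges is built as tenure * MonthlyCharges so that
# it correlates with both; "noise" is unrelated to everything.
rng = np.random.default_rng(0)
n = 500
tenure = rng.uniform(1, 72, n)
monthly = rng.uniform(20, 110, n)
toy = pd.DataFrame({
    "tenure": tenure,
    "MonthlyCharges": monthly,
    "TotalCharges": tenure * monthly,
    "noise": rng.normal(size=n),
})

corr = toy.corr()  # Pearson by default

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once,
# then sort the pairs by absolute correlation.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (corr.where(mask).stack().dropna()
             .sort_values(key=np.abs, ascending=False))
print(pairs.head(3))

# Pearson coefficient written out by hand for one pair.
x, y = toy["tenure"], toy["TotalCharges"]
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
```

Applied to the real table, the same ranking would be run on `corr` from the code above instead of the toy matrix.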

4.4 PCA analysis and result visualization

from sklearn.decomposition import PCA

pca = PCA(n_components=2)

X = telcom[[i for i in telcom.columns if i not in Id_col+target_col]]
Y = telcom[target_col+Id_col]

principal_components = pca.fit_transform(X)
pca_data = pd.DataFrame(principal_components, columns=['PC1', 'PC2'])
pca_data = pca_data.merge(Y, left_index=True, right_index=True, how='left')
pca_data['Churn'] = pca_data['Churn'].replace({1:'Churn', 0:'Not Churn'})

def pca_scatter(target, color):
    tracer = go.Scatter(x=pca_data[pca_data['Churn']==target]['PC1'],
                        y=pca_data[pca_data['Churn']==target]['PC2'],
                        name=target, mode='markers',
                        marker=dict(color=color, line=dict(width=0.5), symbol='diamond-open'),
                        text='Customer Id:' + pca_data[pca_data['Churn']==target]['customerID'])
    return tracer
layout = go.Layout(dict(title='Visualising data with principal components',
                        plot_bgcolor='rgb(243,243,243)',
                        paper_bgcolor='rgb(243,243,243)',
                        xaxis=dict(gridcolor='rgb(255,255,255)',
                                   title='principal component 1',
                                   zerolinewidth=1, ticklen=5, gridwidth=2),
                        yaxis=dict(gridcolor='rgb(255,255,255)',
                                   title='principal component 2',
                                   zerolinewidth=1, ticklen=5, gridwidth=2),
                        height=600))
trace1 = pca_scatter('Churn', 'red')
trace2 = pca_scatter('Not Churn', 'royalblue')
data = [trace2, trace1]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

Result output:
Visual display of PCA results

  • PCA extracts two principal components (PC1, PC2) to summarize the data.
  • The scatter plot of the two components, colored by churn status, shows that the two groups still overlap substantially. The denser regions of each group, however, can be told apart reasonably well.
  • The code above shows how to extract and visualize the principal components. The two components are the directions of greatest variance in the features, used here to see how well churned and retained customers separate.
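
How much of the data two components actually capture can be checked from the explained variance. The sketch below does this with NumPy's SVD on random, invented data, not the telcom table, purely to illustrate the calculation. With scikit-learn, the fitted `pca` object from the code above exposes the same shares as `pca.explained_variance_ratio_`.

```python
import numpy as np

# Invented data: 200 rows, 5 features, driven mostly by two latent directions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2)) * np.array([3.0, 1.5])
X = latent @ rng.normal(size=(2, 5)) + 0.3 * rng.normal(size=(200, 5))

# PCA operates on centered data.
Xc = X - X.mean(axis=0)

# Rows of Vt are the principal directions; singular values come sorted descending.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Share of total variance carried by each component.
explained = S ** 2 / np.sum(S ** 2)

# Coordinates of every row on PC1 and PC2 (what gets plotted in the scatter).
scores = Xc @ Vt[:2].T

print("variance share of PC1 + PC2:", explained[:2].sum().round(3))
```

A low combined share would be one reason for two groups to overlap heavily in a two-component scatter plot; PCA also ignores the churn label entirely, so separation is not guaranteed even when the share is high.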

4.5 Distribution of binary variables by churn status (radar chart): useful for visualizing customer profiles

# Select the binary columns
bi_cs = telcom.nunique()[telcom.nunique()==2].keys()
dat_rad = telcom[bi_cs]

# Draw radar charts for churned and non-churned customers
def plot_radar(df, aggregate, title):
    data_frame = df[df['Churn']==aggregate]
    data_frame_x = data_frame[bi_cs].sum().reset_index()
    data_frame_x.columns = ['feature', 'yes']
    data_frame_x['no'] = data_frame.shape[0] - data_frame_x['yes']
    data_frame_x = data_frame_x[data_frame_x['feature'] != 'Churn']

    # Trace for the number of "yes" values (1's)
    trace1 = go.Scatterpolar(r=data_frame_x['yes'].values.tolist(),
                             theta=data_frame_x['feature'].tolist(),
                             fill='toself', name='count of 1\'s', mode='markers+lines',
                             marker=dict(size=5))
    # Trace for the number of "no" values (0's)
    trace2 = go.Scatterpolar(r=data_frame_x['no'].values.tolist(),
                             theta=data_frame_x['feature'].tolist(),
                             fill='toself', name='count of 0\'s', mode='markers+lines',
                             marker=dict(size=5))
    layout = go.Layout(dict(polar=dict(radialaxis=dict(visible=True,
                                                       side='counterclockwise',
                                                       showline=True,
                                                       linewidth=2,
                                                       tickwidth=2,
                                                       gridcolor='white',
                                                       gridwidth=2),
                                       angularaxis=dict(tickfont=dict(size=10),
                                                        layer='below traces'),
                                       bgcolor='rgb(243,243,243)'),
                            paper_bgcolor='rgb(243,243,243)',
                            title=title, height=700))
    data = [trace2, trace1]
    fig = go.Figure(data=data, layout=layout)
    py.iplot(fig)

# Plot
plot_radar(dat_rad, 1, 'Churn - Customers')
plot_radar(dat_rad, 0, 'Non Churn - Customers')

Code explanation

  • bi_cs holds the names of the binary columns in the telcom dataset, i.e. the columns whose values are 0 or 1 (No or Yes) after encoding.
  • Count of 0's is the number of records with No for a given binary variable; count of 1's is the number of records with Yes.
  • Customers are grouped by churn status, the 0 and 1 counts of each binary variable are computed within each group, and the results are drawn as a radar chart.
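
The counting step inside plot_radar can be tried without any plotting. Here is a small sketch on an invented six-row frame; the helper name `yes_no_counts` is hypothetical and not part of the original notebook.

```python
import pandas as pd

# Invented 0/1 data standing in for the encoded binary columns.
toy = pd.DataFrame({
    "Partner":          [1, 0, 0, 1, 1, 0],
    "PhoneService":     [1, 1, 1, 0, 1, 1],
    "PaperlessBilling": [1, 1, 0, 0, 1, 0],
    "Churn":            [1, 1, 1, 0, 0, 0],
})

def yes_no_counts(df, churn_value):
    """Count 1's and 0's of every binary column within one churn group."""
    group = df[df["Churn"] == churn_value].drop(columns="Churn")
    yes = group.sum()            # each 1 contributes one to the sum
    no = len(group) - yes        # the remaining rows in the group are 0's
    return pd.DataFrame({"yes": yes, "no": no})

churned = yes_no_counts(toy, 1)
retained = yes_no_counts(toy, 0)
print(churned)
print(retained)
```

The `yes` and `no` columns correspond to the two traces of the radar chart, one value per feature on the angular axis.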

Result output
Radar chart of binary variables among churned customers

  • Among churned customers, the larger counts are for customers who have not activated telephone service, who are not on a one- or two-year renewal contract, and who have not activated Internet service.
  • Among churned customers, more have signed month-to-month contracts, activated telephone service, and use paperless billing.
  • First, telephone service is the company's basic business, so customers generally use it whether or not they churn.
  • Second, customers on short-term contracts may only be trying the company's products, so they are more likely to churn later.
  • Customers who have not activated Internet services are more likely to churn, which suggests that improving the quality of Internet services can increase customer loyalty.
  • Customer feedback on paperless billing also deserves attention. Paperless bills may make it easy for customers to overlook charges, which can lead them to believe the company made unreasonable deductions and then to leave. This needs further investigation.
    Radar chart of binary variables among non-churned customers
  • Judging from the shaded areas for the service variables, customers who have not activated each service are in the majority.
  • More of these customers have a partner than do not. This suggests that when a service involves more than one person, the product becomes much stickier. People are, after all, lazy.

Reprint address:

Kaggle typical customer churn data analysis and prediction boutique case!

Data set download address:

Data set, more notebook browsing addresses

Origin blog.csdn.net/Haoyu_xie/article/details/108572708