45_pandas.DataFrame: calculating the correlation coefficient between columns and visualizing it with a heatmap

Use the corr() method to calculate the correlation coefficient between columns in a pandas.DataFrame.

This article covers the following topics.

  • Basic usage of pandas.DataFrame.corr()
    • Only numeric and boolean columns are used in the calculation
    • Missing values NaN are excluded from the calculation
  • Specifying how the correlation coefficient is calculated: argument method
  • Visualize correlation coefficients using heatmaps: seaborn

Take the pandas.DataFrame below as an example.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'A': range(5),
                   'B': [x**2 for x in range(5)],
                   'C': [x**3 for x in range(5)]})

print(df)
#    A   B   C
# 0  0   0   0
# 1  1   1   1
# 2  2   4   8
# 3  3   9  27
# 4  4  16  64

Basic usage of pandas.DataFrame.corr()

Call the corr() method on a pandas.DataFrame object to calculate the correlation coefficient between each pair of columns. The result is returned as a pandas.DataFrame whose rows and columns are both labeled with the original column names.

df_corr = df.corr()
print(df_corr)
print(type(df_corr))
#           A         B         C
# A  1.000000  0.958927  0.905882
# B  0.958927  1.000000  0.987130
# C  0.905882  0.987130  1.000000
# <class 'pandas.core.frame.DataFrame'>
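
When only one pair of columns is of interest, one option is Series.corr(), which returns a single number instead of a matrix. The sketch below also cross-checks the value with numpy.corrcoef(); the variable names r_series, r_matrix and r_numpy are chosen only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(5),
                   'B': [x**2 for x in range(5)],
                   'C': [x**3 for x in range(5)]})

# Correlation coefficient between two columns as a single float
r_series = df['A'].corr(df['B'])

# The same value picked out of the matrix returned by DataFrame.corr()
r_matrix = df.corr().loc['A', 'B']

# Cross-check with NumPy (corrcoef returns a 2x2 matrix for two inputs)
r_numpy = np.corrcoef(df['A'], df['B'])[0, 1]

print(round(r_series, 6))
# 0.958927
```

All three values agree up to floating-point rounding, so the choice mostly depends on whether the full matrix is needed.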

Only numeric and boolean columns are used in the calculation

To illustrate, add a string column and a boolean column.

df['D'] = list('abcde')
df['E'] = [True, False, True, True, False]
print(df)
#    A   B   C  D      E
# 0  0   0   0  a   True
# 1  1   1   1  b  False
# 2  2   4   8  c   True
# 3  3   9  27  d   True
# 4  4  16  64  e  False

print(df.dtypes)
# A     int64
# B     int64
# C     int64
# D    object
# E      bool
# dtype: object

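The corr() method calculates the correlation coefficient only between columns of numeric (int, float) and bool types; columns of data type object (such as strings) are left out. In pandas versions before 2.0 this exclusion happens by default. In pandas 2.0 and later the default is numeric_only=False and a string column raises an error, so pass numeric_only=True (available since pandas 1.5) to exclude such columns explicitly.

For bool type, True is treated as 1 and False as 0.

df_corr = df.corr(numeric_only=True)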
print(df_corr)
#           A         B         C         E
# A  1.000000  0.958927  0.905882 -0.288675
# B  0.958927  1.000000  0.987130 -0.346023
# C  0.905882  0.987130  1.000000 -0.424522
# E -0.288675 -0.346023 -0.424522  1.000000
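
As a quick check that a bool column is handled as 1 and 0, the following sketch compares the result with the bool column left as is against the result after an explicit conversion to int. The names r_bool and r_int are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': range(5),
                   'E': [True, False, True, True, False]})

# Correlation with the bool column left as is
r_bool = df.corr().loc['A', 'E']

# Correlation after explicitly converting True/False to 1/0
r_int = df.astype({'E': int}).corr().loc['A', 'E']

print(round(r_bool, 6))
# -0.288675
print(round(r_int, 6))
# -0.288675
```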

Missing values NaN are excluded from the calculation

To illustrate, prepare a pandas.DataFrame object containing missing values NaN.

df_nan = df.copy()
df_nan.iloc[[2, 3, 4], 1] = np.nan
print(df_nan)
#    A    B   C  D      E
# 0  0  0.0   0  a   True
# 1  1  1.0   1  b  False
# 2  2  NaN   8  c   True
# 3  3  NaN  27  d   True
# 4  4  NaN  64  e  False

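The corr() method excludes missing values NaN pairwise: for each pair of columns, only the rows where both values are present are used. In the result below, every coefficient involving column B is computed from the first two rows only, which is why those values are exactly 1 or -1. (As above, numeric_only=True is needed in pandas 2.0 and later because of the string column D.)

df_nan_corr = df_nan.corr(numeric_only=True)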
print(df_nan_corr)
#           A    B         C         E
# A  1.000000  1.0  0.905882 -0.288675
# B  1.000000  1.0  1.000000 -1.000000
# C  0.905882  1.0  1.000000 -0.424522
# E -0.288675 -1.0 -0.424522  1.000000
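
The pairwise behavior can be compared with two alternatives: requiring a minimum number of valid rows per pair through the min_periods argument of corr(), and dropping every row that contains NaN before the calculation. The following is a minimal sketch on a reduced version of the example data; the threshold of 3 and the variable names are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                       'B': [0, 1, np.nan, np.nan, np.nan],
                       'C': [0, 1, 8, 27, 64]})

# Default: each pair uses the rows where both columns are present
pairwise = df_nan.corr()

# Require at least 3 valid rows per pair; pairs with fewer become NaN
strict = df_nan.corr(min_periods=3)

# Listwise deletion: drop every row containing NaN before calculating
listwise = df_nan.dropna().corr()

print(round(pairwise.loc['A', 'C'], 6))  # all five rows are used
# 0.905882
print(strict.loc['A', 'B'])              # only two valid rows, below the threshold
# nan
```

With listwise deletion only the first two rows survive, so even the A-C coefficient is computed from two points and differs from the pairwise result.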

Specifying how the correlation coefficient is calculated: argument method

The calculation method can be specified with the method argument of corr().

Choose one of the following three strings.

  • 'pearson': Pearson product-moment correlation coefficient (default)
  • 'kendall': Kendall rank correlation coefficient
  • 'spearman': Spearman rank correlation coefficient
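
As an illustration of how the choice matters, the sketch below compares 'pearson' and 'spearman' on the example data. Because B and C increase strictly with A, the rank-based coefficient is 1 for every pair (up to floating-point rounding), while the Pearson coefficient is lower because the relationship is not linear. 'kendall' can be passed in the same way; it is left out here on the assumption that SciPy, which pandas relies on for that method, may not be installed.

```python
import pandas as pd

df = pd.DataFrame({'A': range(5),
                   'B': [x**2 for x in range(5)],
                   'C': [x**3 for x in range(5)]})

# Default: Pearson measures linear association
corr_pearson = df.corr(method='pearson')

# Spearman measures monotonic association using ranks
corr_spearman = df.corr(method='spearman')

print(round(corr_pearson.loc['A', 'C'], 6))
# 0.905882
print(round(corr_spearman.loc['A', 'C'], 6))
# 1.0
```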

Visualize correlation coefficients using heatmaps: seaborn

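The Python visualization library seaborn makes it easy to draw a pandas.DataFrame, such as the result of corr(), as a heatmap. Setting vmax=1, vmin=-1 and center=0 fixes the color scale to the full range of the correlation coefficient.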

sns.heatmap(df_corr, vmax=1, vmin=-1, center=0)
plt.savefig('./data/45/seaborn_heatmap_corr_example.png')
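
For a small matrix it can help to print the values in the cells and hide the redundant upper triangle. The following is a sketch using the annot, fmt, cmap and mask arguments of seaborn.heatmap(); the colormap, figure size and Agg backend are assumptions chosen for illustration, not part of the original example.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend (assumption: run as a script)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

df = pd.DataFrame({'A': range(5),
                   'B': [x**2 for x in range(5)],
                   'C': [x**3 for x in range(5)]})
corr = df.corr()

# True marks cells to hide: the upper triangle mirrors the lower one
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

fig, ax = plt.subplots(figsize=(4, 3))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='coolwarm',
            vmax=1, vmin=-1, center=0, square=True, ax=ax)
fig.tight_layout()
# fig.savefig('heatmap_annotated.png')  # hypothetical output file name
```

Passing ax explicitly keeps the heatmap on the intended axes when several figures are open.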

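As a larger example, load house price data from a CSV file. The original data has many columns (features), but as mentioned above, the correlation matrix covers only the numeric columns; columns with data type object are excluded.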

df_house = pd.read_csv('./data/45/house_prices_train.csv', index_col=0)

print(df_house.shape)
# (1460, 80)

print(df_house.dtypes.value_counts())
# object     43
# int64      34
# float64     3
# dtype: int64

# numeric_only=True is required in pandas 2.0 and later to skip object columns
df_house_corr = df_house.corr(numeric_only=True)

print(df_house_corr.shape)
# (37, 37)

Visualize with the seaborn.heatmap() function.

fig, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(df_house_corr, square=True, vmax=1, vmin=-1, center=0)
plt.savefig('./data/45/seaborn_heatmap_house_price.png')

Origin blog.csdn.net/qq_18351157/article/details/119214494