45_pandas.DataFrame: calculate the correlation coefficient between columns and visualize it with a heatmap
Use the corr() method to calculate the correlation coefficient between columns in a pandas.DataFrame.
This article covers the following topics.
- Basic usage of pandas.DataFrame.corr()
- Only columns with numeric or boolean data types are included in the calculation
- Missing values NaN are excluded from the calculation
- Specify how to calculate the correlation coefficient: the method argument
- Visualize correlation coefficients using heatmaps: seaborn
Take the pandas.DataFrame below as an example.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({
    'A': range(5),
    'B': [x**2 for x in range(5)],
    'C': [x**3 for x in range(5)]})
print(df)
#    A   B   C
# 0  0   0   0
# 1  1   1   1
# 2  2   4   8
# 3  3   9  27
# 4  4  16  64
Basic usage of pandas.DataFrame.corr()
Call the corr() method on a pandas.DataFrame object to calculate the correlation coefficient between each pair of columns. The result is returned as a pandas.DataFrame.
df_corr = df.corr()
print(df_corr)
print(type(df_corr))
#           A         B         C
# A  1.000000  0.958927  0.905882
# B  0.958927  1.000000  0.987130
# C  0.905882  0.987130  1.000000
# <class 'pandas.core.frame.DataFrame'>
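To make explicit what the default (Pearson) coefficient measures, the following minimal sketch recomputes the A-B value by hand with NumPy and compares it with the result of corr(). The names x, y, dx, dy, r_manual, and r_pandas are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': range(5),
    'B': [x**2 for x in range(5)],
    'C': [x**3 for x in range(5)]})

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from each column's mean.
x = df['A'].to_numpy(dtype=float)
y = df['B'].to_numpy(dtype=float)
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

# The same pair taken from the correlation matrix.
r_pandas = df.corr().loc['A', 'B']

print(round(r_manual, 6))
# 0.958927
print(np.isclose(r_manual, r_pandas))
# True
```

The matrix is symmetric and its diagonal is 1 because each column is perfectly correlated with itself.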
Only columns with numeric or boolean data types are included in the calculation
Add a string column and a boolean column as an example.
df['D'] = list('abcde')
df['E'] = [True, False, True, True, False]
print(df)
#    A   B   C  D      E
# 0  0   0   0  a   True
# 1  1   1   1  b  False
# 2  2   4   8  c   True
# 3  3   9  27  d   True
# 4  4  16  64  e  False
print(df.dtypes)
# A     int64
# B     int64
# C     int64
# D    object
# E      bool
# dtype: object
The corr() method calculates the correlation coefficient between columns of numeric (int, float) and bool types; columns of data type object (string) are not used. Older versions of pandas excluded object columns automatically, but since pandas 2.0 the default is numeric_only=False and a string column raises an error, so specify numeric_only=True.
For bool columns, True is treated as 1 and False as 0.
df_corr = df.corr(numeric_only=True)
print(df_corr)
#           A         B         C         E
# A  1.000000  0.958927  0.905882 -0.288675
# B  0.958927  1.000000  0.987130 -0.346023
# C  0.905882  0.987130  1.000000 -0.424522
# E -0.288675 -0.346023 -0.424522  1.000000
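The following sketch checks two points from this section: the object column is left out of the result, and casting the bool column to 0/1 integers gives the same coefficients. It assumes a pandas version that supports the numeric_only argument of corr() (added in pandas 1.5); the names corr_bool, df_int, and corr_int are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': range(5),
    'B': [x**2 for x in range(5)],
    'D': list('abcde'),
    'E': [True, False, True, True, False]})

# Restrict the calculation to numeric and bool columns explicitly.
corr_bool = df.corr(numeric_only=True)

# Convert the bool column to 0/1 integers and recompute.
df_int = df.astype({'E': int})
corr_int = df_int.corr(numeric_only=True)

print(list(corr_bool.columns))
# ['A', 'B', 'E']
print(np.allclose(corr_bool, corr_int))
# True
```

An alternative with the same effect is to select the columns first, for example with df.select_dtypes(include=['number', 'bool']), and then call corr() on the result.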
Missing values NaN are excluded from the calculation
Prepare a pandas.DataFrame object containing missing values NaN as an example.
df_nan = df.copy()
df_nan.iloc[[2, 3, 4], 1] = np.nan
print(df_nan)
#    A    B   C  D      E
# 0  0  0.0   0  a   True
# 1  1  1.0   1  b  False
# 2  2  NaN   8  c   True
# 3  3  NaN  27  d   True
# 4  4  NaN  64  e  False
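The corr() method excludes missing values NaN pairwise: for each pair of columns, only the rows in which both values are present are used. Here, every pair involving column B uses only rows 0 and 1, while the other pairs still use all five rows. As above, numeric_only=True is specified because the DataFrame contains a string column.
df_nan_corr = df_nan.corr(numeric_only=True)
print(df_nan_corr)
#           A    B         C         E
# A  1.000000  1.0  0.905882 -0.288675
# B  1.000000  1.0  1.000000 -1.000000
# C  0.905882  1.0  1.000000 -0.424522
# E -0.288675 -1.0 -0.424522  1.000000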
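A coefficient based on only two rows is always 1 or -1, so values such as the ones for column B carry little information. The following sketch, using a reduced numeric-only copy of the example data, contrasts the pairwise exclusion of corr() with dropping incomplete rows first via dropna(), and shows the min_periods argument of corr(), which returns NaN for pairs with too few valid observations. The names pairwise, listwise, and strict are illustrative.

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({
    'A': [0.0, 1.0, 2.0, 3.0, 4.0],
    'B': [0.0, 1.0, np.nan, np.nan, np.nan],
    'C': [0.0, 1.0, 8.0, 27.0, 64.0]})

# Pairwise exclusion (the behavior of corr()): A-C still uses all five rows.
pairwise = df_nan.corr()

# Listwise deletion: drop incomplete rows first, so every pair uses rows 0 and 1 only.
listwise = df_nan.dropna().corr()

# Require at least 3 valid observations per pair; otherwise the result is NaN.
strict = df_nan.corr(min_periods=3)

print(round(pairwise.loc['A', 'C'], 6))
# 0.905882
print(round(listwise.loc['A', 'C'], 6))
# 1.0
print(strict.loc['A', 'B'])
# nan
```

Which treatment is appropriate depends on the data; the sketch only illustrates that the choice changes the result.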
Specify how to calculate the correlation coefficient: the method argument
The method argument of corr() specifies how the correlation coefficient is calculated.
Choose one of the following three strings.
- 'pearson': Pearson product-moment correlation coefficient (default)
- 'kendall': Kendall rank correlation coefficient
- 'spearman': Spearman rank correlation coefficient
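As a minimal sketch of how the choice of method changes the result, the following compares 'pearson' and 'spearman' on the example data. Because B and C increase monotonically with A, the rank-based coefficient is exactly 1 even though the relationship is not linear. method='kendall' is called in the same way, but it may require SciPy to be installed, so it is left out of this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    'A': range(5),
    'B': [x**2 for x in range(5)],
    'C': [x**3 for x in range(5)]})

# Linear correlation (default).
pearson = df.corr(method='pearson')

# Rank correlation: depends only on the ordering of the values.
spearman = df.corr(method='spearman')

print(round(pearson.loc['A', 'C'], 6))
# 0.905882
print(round(spearman.loc['A', 'C'], 6))
# 1.0
```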
Visualize correlation coefficients using heatmaps: seaborn
seaborn, a Python visualization library, makes it easy to visualize a pandas.DataFrame, such as the result of corr(), as a heatmap.
sns.heatmap(df_corr, vmax=1, vmin=-1, center=0)
plt.savefig('./data/45/seaborn_heatmap_corr_example.png')
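For a small matrix it can help to print the values inside the cells. The following sketch uses the annot and fmt arguments of seaborn.heatmap() together with a diverging colormap. The 'coolwarm' colormap, the title, and the commented-out output file name are illustrative choices, and the Agg backend is selected only so that the script runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; remove when working in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    'A': range(5),
    'B': [x**2 for x in range(5)],
    'C': [x**3 for x in range(5)]})
df_corr = df.corr()

fig, ax = plt.subplots()
# annot=True writes each coefficient into its cell; fmt controls the number format.
sns.heatmap(df_corr, vmax=1, vmin=-1, center=0,
            annot=True, fmt='.2f', cmap='coolwarm', ax=ax)
ax.set_title('Correlation matrix')
# fig.savefig('heatmap_annotated.png')  # hypothetical output path
```

Fixing vmin=-1, vmax=1, and center=0 keeps the color scale comparable between different correlation matrices.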
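As a larger example, use house price data read from a CSV file. The original data has many columns (features), but as mentioned above, columns with data type object are not used in the calculation (specify numeric_only=True in pandas 2.0 or later).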
df_house = pd.read_csv('./data/45/house_prices_train.csv', index_col=0)
print(df_house.shape)
# (1460, 80)
print(df_house.dtypes.value_counts())
# object     43
# int64      34
# float64     3
# dtype: int64
df_house_corr = df_house.corr(numeric_only=True)  # required in pandas 2.0+ to skip object columns
print(df_house_corr.shape)
# (37, 37)
Visualize with the seaborn.heatmap() function.
fig, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(df_house_corr, square=True, vmax=1, vmin=-1, center=0, ax=ax)
plt.savefig('./data/45/seaborn_heatmap_house_price.png')
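Because a correlation matrix is symmetric, half of a large heatmap is redundant. The following sketch hides the upper triangle with the mask argument of seaborn.heatmap(). It uses synthetic random data as a stand-in, not the house price data; the names rng, df_demo, df_demo_corr, and mask are illustrative.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; remove when working in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for a wide numeric table (not the house price data).
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.normal(size=(100, 6)), columns=list('abcdef'))
df_demo_corr = df_demo.corr()

# True strictly above the diagonal: heatmap() leaves those cells blank.
mask = np.triu(np.ones(df_demo_corr.shape, dtype=bool), k=1)

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(df_demo_corr, mask=mask, square=True,
            vmax=1, vmin=-1, center=0, ax=ax)
```

With k=1 the diagonal stays visible; using k=0 would hide the diagonal as well.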