matplotlib plotting
foreword
One command to check whether pandas is already installed:
pip list
One command to install it:
pip install pandas
Basic use:
Marker characters (marker):
- Point marker: .
- Pixel marker: ,
- Circle marker: o
- Downward-pointing triangle marker: v
Line style characters (linestyle):
- Solid line: -
- Dashed line: --
- Dash-dot line: -.
- Dotted line: :
Color characters (color):
- Blue: b
- Red: r
- Green: g
- Cyan: c
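As a small illustration of how these characters combine, the sketch below draws one line with a format string and one with the equivalent keyword arguments. The data is made up, and the Agg backend is an assumption so that the snippet also runs without a display; in a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; omit this line in a notebook
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]      # made-up sample data
y = [0, 1, 4, 9, 16]

fig, ax = plt.subplots()
# Format string: marker '.', linestyle ':', color 'r'
(line_short,) = ax.plot(x, y, ".:r")
# The same kind of styling spelled out with keyword arguments
(line_long,) = ax.plot(x, [v + 2 for v in y],
                       marker="o", linestyle="--", color="g")

print(line_short.get_marker(), line_short.get_linestyle(), line_short.get_color())
```

The keyword form is longer but easier to read when a style is reused in several places.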
Libraries needed for plotting:
- matplotlib
- matplotlib.pyplot
- seaborn
- pandas
- numpy
Note: there are three modes for drawing with plt in a notebook
- %matplotlib inline: the default mode, outputs static pictures directly in the notebook
- %matplotlib auto: pops up a separate drawing window, the same as in PyCharm
- %matplotlib notebook: generates an interactive drawing window inside the notebook, in which the picture can be zoomed in and out
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import numpy as np
import warnings
warnings.filterwarnings("ignore")
# %matplotlib inline renders figures directly in the notebook
%matplotlib inline
Read the training set:
import pandas as pd
train = pd.read_csv('./data_set/house_price_train.csv')
print(train.shape)
print(train.dtypes)
#train.head()
#train.tail()
# Names of the non-numeric (object) feature columns:
categorical_feats = train.dtypes[train.dtypes == 'object'].index
print(type(categorical_feats))
print('*'*100)
print(categorical_feats)
# Names of the numeric feature columns:
value_feats = train.dtypes[train.dtypes != 'object'].index
print(type(value_feats))
print('*'*100)
# The Index of column names can be reused to select those columns:
print(train[value_feats].values)
(1460, 81)
Id int64
MSSubClass int64
MSZoning object
LotFrontage float64
LotArea int64
...
MoSold int64
YrSold int64
SaleType object
SaleCondition object
SalePrice int64
Length: 81, dtype: object
<class 'pandas.core.indexes.base.Index'>
****************************************************************************************************
Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
'SaleType', 'SaleCondition'],
dtype='object')
<class 'pandas.core.indexes.base.Index'>
****************************************************************************************************
[[1.00000e+00 6.00000e+01 6.50000e+01 ... 2.00000e+00 2.00800e+03
2.08500e+05]
[2.00000e+00 2.00000e+01 8.00000e+01 ... 5.00000e+00 2.00700e+03
1.81500e+05]
[3.00000e+00 6.00000e+01 6.80000e+01 ... 9.00000e+00 2.00800e+03
2.23500e+05]
...
[1.45800e+03 7.00000e+01 6.60000e+01 ... 5.00000e+00 2.01000e+03
2.66500e+05]
[1.45900e+03 2.00000e+01 6.80000e+01 ... 4.00000e+00 2.01000e+03
1.42125e+05]
[1.46000e+03 2.00000e+01 7.50000e+01 ... 6.00000e+00 2.00800e+03
1.47500e+05]]
Title and axis labels:
- xlabel: the text under the x-axis
- ylabel: the text beside the y-axis
- title: the title of the figure
plt.title("House-SalePrice", fontsize=16)  # set the title
plt.xlabel("SalePrice_Log", fontsize=15)   # x-axis label
plt.ylabel("count", fontsize=15)           # y-axis label
print(train['LotArea'].values)
plt.plot(train['LotArea'].index, train['LotArea'].values,
         linestyle=':',marker=".",color='r')
# ":" is the dotted line style, "." is the point marker, "r" makes the line red
plt.show()
[ 8450 9600 11250 ... 9042 9717 9937]
Drawing several plots in sub-areas:
- subplots(): returns the figure and an array of axes, one per sub-area
import matplotlib.pyplot as plt
# fig is the Figure object; ax is a 2x2 NumPy array of subplot Axes.
# figsize is passed to subplots() directly: a separate plt.figure() call would only create an extra, empty figure.
fig, ax = plt.subplots(2, 2, figsize=(8, 8))
print(type(ax))
# Index ax by [row][col] to address the plot in each quadrant:
ax[0][0].set_title(1)
ax[0][1].set_title(2)
ax[1][0].set_title(3)
ax[1][1].set_title(4)
print(ax)
<class 'numpy.ndarray'>
[[<AxesSubplot:title={'center':'1'}> <AxesSubplot:title={'center':'2'}>]
[<AxesSubplot:title={'center':'3'}> <AxesSubplot:title={'center':'4'}>]]
- Use a for loop to draw in each subplot:
fig, ax = plt.subplots(2, 2, figsize=(8, 8))
print(categorical_feats[:4])  # names of the first four categorical columns
for row in range(2):
    for col in range(2):
        name = categorical_feats[row*2+col]
        # Count how often each category occurs in this column
        data = train[name].value_counts()
        ax[row][col].plot(data.index, data.values)
        ax[row][col].set_title(name)
fig.tight_layout()  # keep proper spacing between the subplots automatically
plt.show()
Index(['MSZoning', 'Street', 'Alley', 'LotShape'], dtype='object')
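The same loop can be written without row and column arithmetic by iterating over axes.flat. The sketch below uses a small made-up DataFrame (the column names and values are hypothetical, not the house price data) and assumes the Agg backend so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical categorical data standing in for the training set
df = pd.DataFrame({
    "zone":   ["A", "B", "A", "C", "A", "B"],
    "street": ["paved", "paved", "gravel", "paved", "paved", "paved"],
    "shape":  ["reg", "irr", "reg", "reg", "irr", "reg"],
    "alley":  ["none", "none", "paved", "none", "gravel", "none"],
})

fig, axes = plt.subplots(2, 2, figsize=(8, 8))
# axes.flat walks the 2x2 array row by row, so zip pairs each column with one subplot
for ax, col in zip(axes.flat, df.columns):
    counts = df[col].value_counts()
    ax.bar(list(counts.index), counts.values)
    ax.set_title(col)
fig.tight_layout()
```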
Draw four types of diagrams
Histogram hist:
- Shows how often the data falls into each range of values: the horizontal axis is the value and the vertical axis is the frequency
- Only one plotting parameter is required: x
Parameter | Description |
---|---|
x | Required; an array or a sequence of arrays. |
bins | An integer gives the number of bins; an array gives the bin edges. |
range | The lower and upper limit of the bins as a (min, max) tuple; the default is None. |
density | If True, draws a probability density histogram; the default, False, draws the number of elements in each bin. |
histtype | The type of histogram to draw; the default is "bar", and the other options are barstacked (stacked bars), step (unfilled outline), and stepfilled (filled outline). |
plt.hist(x = train['SalePrice'],
bins=50,
density=False,
histtype='stepfilled')
plt.show()
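To make the bins, range, and density parameters more concrete, here is a minimal sketch on randomly generated values (not the SalePrice column). It relies on hist() returning the bin heights and the bin edges; the Agg backend is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=100, scale=15, size=1000)  # made-up data

fig, ax = plt.subplots()
# 20 equal-width bins between 40 and 160; values outside the range are ignored
counts, edges, _ = ax.hist(values, bins=20, range=(40, 160))
# density=True rescales the heights so that the total area of the bars is 1
dens, _, _ = ax.hist(values, bins=20, range=(40, 160),
                     density=True, histtype="step")

print(len(edges))                       # number of edges is bins + 1
print((dens * np.diff(edges)).sum())    # total area under the density histogram
```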
Bar chart bar:
- Two plotting parameters: x and height
Parameter | Description |
---|---|
x | A sequence of scalars giving the x coordinates of the bars. By default each x value is the center of its bar; with align='edge' it is the left edge. |
height | A scalar or sequence of scalars giving the height of the bars. |
width | Optional, scalar or array-like; the width of the bars, 0.8 by default. |
bottom | Optional, scalar or array-like; the y-coordinate of the base of the bars, 0 when not given. |
align | One of {"center", "edge"}, default 'center'; determines where the x value sits relative to the bar. |
data = train['MSZoning'].value_counts()
print(data)
print(type(data))
plt.bar(x = data.index,
height = data.values,
width=0.5,
align='center')
RL 1151
RM 218
FV 65
RH 16
C (all) 10
Name: MSZoning, dtype: int64
<class 'pandas.core.series.Series'>
<BarContainer object of 5 artists>
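The bottom parameter is what makes stacked bars possible. A minimal sketch with made-up category labels and counts follows; the Agg backend is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

labels = ["A", "B", "C"]        # hypothetical categories
first = np.array([3, 5, 2])     # made-up counts for the lower bars
second = np.array([1, 2, 4])    # made-up counts for the upper bars

fig, ax = plt.subplots()
lower = ax.bar(labels, first, width=0.5, align="center")
# bottom=first starts each upper bar where the lower bar ends
upper = ax.bar(labels, second, width=0.5, bottom=first)

print([bar.get_y() for bar in upper])
```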
Scatter plot scatter:
- Two plotting parameters: x and y together give the coordinates of each point
- Plots the data points against the horizontal and vertical axes
Parameter | Description |
---|---|
x, y | Coordinates of the points |
s | Area of the points |
c | Color of the points (default is blue, 'b'; other colors are the same as in plt.plot()) |
marker | Marker style (default is a filled circle, 'o'; other styles are the same as in plt.plot()) |
alpha | Transparency of the points (a number in [0, 1]; 0 is completely transparent, 1 is completely opaque) |
linewidths | Line width of the marker edges |
edgecolors | Color of the marker edges |
plt.scatter(x = train.TotalBsmtSF,
y = train.SalePrice,
c='b',
marker=',',
alpha=0.5
)
plt.scatter(x = train.BsmtUnfSF,
y = train.SalePrice,
c='r',
marker='.',
alpha=0.8
)
# Plot several attributes on one scatter plot
<matplotlib.collections.PathCollection at 0x216c67c0130>
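Besides a single size and color, s and c also accept one value per point. The sketch below is an illustration on made-up data: it maps the y value to a color through a colormap and scales the marker area with x. The Agg backend is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)            # made-up data
y = 2 * x + rng.normal(0, 2, 50)
sizes = 20 + 10 * x                   # marker area grows with x

fig, ax = plt.subplots()
points = ax.scatter(x, y, s=sizes, c=y, cmap="viridis",
                    alpha=0.6, edgecolors="k", linewidths=0.5)
fig.colorbar(points, ax=ax, label="y value")  # legend for the color mapping
```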
Pie chart pie:
- One plotting parameter: x
Parameter | Description |
---|---|
x | Array-like; each element sets the size of one wedge. |
explode | Highlights wedges by setting how far each one is offset from the center. |
labels | A sequence of strings giving a label for each wedge. |
colors | Colors for the wedges; by default they follow the current color cycle. |
autopct | A format string (applied as fmt % pct) that labels each wedge with its percentage, placed inside the wedge. |
plt.figure(figsize=(8,8))
plt.pie(
x = train['MSZoning'].value_counts(),
explode = (0, 0.1, 0.2, 0, 0),
autopct = '%1.2f%%'
)
plt.show()
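A short sketch of how labels, explode, and autopct fit together, on made-up counts rather than the MSZoning column. The rule of offsetting only the largest wedge is just an illustrative choice, and the Agg backend is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; omit this line in a notebook
import matplotlib.pyplot as plt

names = ["A", "B", "C", "D"]     # hypothetical categories
counts = [50, 30, 15, 5]         # made-up counts

# Offset only the largest wedge from the center
explode = [0.1 if c == max(counts) else 0 for c in counts]

fig, ax = plt.subplots(figsize=(6, 6))
wedges, texts, autotexts = ax.pie(counts, labels=names,
                                  explode=explode, autopct="%1.1f%%")
print([t.get_text() for t in autotexts])
```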
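Special techniques:
Logarithmic transform:
- When to use: the x values are spread over a very wide range and the distribution is skewed, with most of the counts concentrated in a small part of it
- Purpose: bring the distribution closer to a normal distribution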
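The effect of the logarithm can be checked numerically. The sketch below uses a made-up right-skewed sample (drawn from a log-normal distribution, not the real SalePrice column) and compares the skewness before and after np.log1p; np.expm1 undoes the transform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Made-up, right-skewed "prices"
prices = pd.Series(rng.lognormal(mean=12, sigma=0.4, size=2000))

logged = np.log1p(prices)        # log(1 + x), safe for zeros
restored = np.expm1(logged)      # inverse transform

print("skew before:", prices.skew())
print("skew after: ", logged.skew())
```

A skewness close to 0 after the transform indicates a distribution that is much closer to symmetric.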
Pandas plotting
- pandas wraps matplotlib in a simple plotting interface: calling .plot() on a DataFrame or Series draws it directly
.plot() parameters:
- kind: the chart type, such as the hist, scatter, pie, and bar charts above
Series drawing
# Plot directly
train['SalePrice'].plot()
<AxesSubplot:>
# Count the values first, then plot
train['MSZoning'].value_counts().plot()
<AxesSubplot:>
# Specify kind
train['MSZoning'].value_counts().plot(kind="bar")  # bar chart
<AxesSubplot:>
train['SalePrice'].plot.hist()
# Histogram; .plot.hist() is equivalent to kind='hist'
# train['SalePrice'].plot(kind='hist')
<AxesSubplot:ylabel='Frequency'>
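The two spellings, kind="bar" and .plot.bar(), can be compared side by side. This is a minimal sketch on a made-up Series; the ax argument is used so that each call draws on its own subplot, and the Agg backend is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

s = pd.Series([3, 1, 2], index=["a", "b", "c"])   # made-up data

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))
s.plot(kind="bar", ax=ax_left)   # chart type chosen with kind
s.plot.bar(ax=ax_right)          # equivalent accessor method

print([p.get_height() for p in ax_left.patches])
```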
DataFrame plotting
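# Plot a subset of the columns directly
train_d = train.loc[:,value_feats[1:5]]
print(train_d)
# DataFrame.hist() (without .plot) draws each column in its own subplot
train_d.hist(
    figsize=(8, 10),
    bins=50,
    xlabelsize=8,  # font size of the x-axis tick labels
    ylabelsize=8
)
plt.show()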
MSSubClass LotFrontage LotArea OverallQual
0 60 65.0 8450 7
1 20 80.0 9600 6
2 60 68.0 11250 7
3 70 60.0 9550 7
4 60 84.0 14260 8
... ... ... ... ...
1455 60 62.0 7917 6
1456 20 85.0 13175 6
1457 70 66.0 9042 7
1458 20 68.0 9717 5
1459 20 75.0 9937 5
[1460 rows x 4 columns]
# DataFrame.plot.hist() draws all the columns together on the same axes
train.loc[:,['MSSubClass','LotFrontage','OpenPorchSF']].plot.hist()
<AxesSubplot:ylabel='Frequency'>
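# Count the NaN values in each column and show them as a bar chart
missing = train.isnull().sum()
print(missing)
# missing > 0 is a boolean mask; indexing with it keeps only the columns that contain at least one NaN
missing = missing[missing > 0]
#print(missing)
missing = missing.sort_values(ascending=False)
# ascending=False sorts in descending order
# sort_values returns a new Series by default; inplace=True would modify the Series in place instead
missing.plot.bar()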
Id 0
MSSubClass 0
MSZoning 0
LotFrontage 259
LotArea 0
...
MoSold 0
YrSold 0
SaleType 0
SaleCondition 0
SalePrice 0
Length: 81, dtype: int64
<AxesSubplot:>
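The filtering and sorting steps can be followed on a tiny made-up DataFrame, where the result is easy to verify by eye. The column names below are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data: "a" has two NaN values, "b" has none, "c" has one
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1, 2, 3, 4],
    "c": [np.nan, 2.0, 3.0, 4.0],
})

missing = df.isnull().sum()                  # NaN count per column
mask = missing > 0                           # True for columns with any NaN
missing = missing[mask].sort_values(ascending=False)

print(missing)
```

Column "b" is dropped by the mask, and the remaining columns are listed from most to fewest missing values.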
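seaborn
- Introduction: a higher-level wrapper around matplotlib that simplifies plotting but does not expose all of matplotlib's functionality
Five theme styles:
- darkgrid (gray background with grid)
- whitegrid (white background with grid)
- dark (gray background, no grid)
- white (white background, no grid)
- ticks (white background with tick marks on the axes)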
Histogram displot:
import seaborn as sns
sns.set(style = "darkgrid")
sns.displot(x = train['SalePrice'])
<seaborn.axisgrid.FacetGrid at 0x216ce4078b0>
Kernel density plot kdeplot:
- Purpose: show how the data is distributed over a continuous range on the x-axis.
- Features: a variant of the histogram that draws a smooth curve instead of bars, giving a smoother picture of the distribution.
sns.kdeplot(x = train['SalePrice'])
<AxesSubplot:xlabel='SalePrice', ylabel='Density'>
Histogram with density curve distplot:
- Plots a histogram and a density curve at the same time
- distplot has been deprecated since seaborn 0.11 in favor of histplot and displot
sns.distplot(x = train["SalePrice"],bins=20, kde=True)
<AxesSubplot:ylabel='Density'>
Bar charts countplot & barplot:
- countplot: one plotting variable plus data; each bar counts the rows for one value
- barplot: two plotting variables, x and y
countplot:
ax = sns.countplot(
    x="MSSubClass",  # column whose values are counted along the x-axis
    data = train
)
barplot:
- Each rectangular bar shows the point estimate of the y variable for one x value, together with its confidence interval.
- The black line on each bar is the error bar, indicating the range of uncertainty. When the error bars are relatively long, generally either the data is widely dispersed or there are few samples.
plt.figure(figsize=(10,6))
sns.barplot(x='GarageCars',y = 'SalePrice',data=train)
plt.show()
- Use a bar chart to show the proportion of NaN values in each column
# Proportion of NaN values per column
missing = train.isnull().mean()
missing = missing.sort_values(ascending=False)[:20]
# Show the proportion of NaN values for each column as a bar chart
sns.barplot(x=missing.index,y=missing.values)
# Rotate the x tick labels by 90 degrees
plt.xticks(rotation=90)
plt.show()
Box plot boxplot:
- Advantages: shows the distribution of the data for each category, including the maximum, minimum, median, and upper and lower quartiles of a set of data
- Two plotting parameters: x & y
plt.figure(figsize=(10,5))
sns.boxplot(
x = 'GarageCars',
y = 'SalePrice',
data = train
)
plt.show()
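The numbers behind each box can be computed with pandas. The sketch below uses a made-up two-group DataFrame (the column names are hypothetical) and prints the quartiles that a box plot would draw as the edges and center line of each box. Note that with the default settings the whiskers stop at 1.5 times the interquartile range, so extreme values appear as separate points rather than as the whisker ends.

```python
import pandas as pd

# Made-up data with two categories
df = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5,
    "value": [1, 2, 3, 4, 5, 10, 20, 30, 40, 50],
})

# Lower quartile, median, and upper quartile per category
stats = df.groupby("group")["value"].quantile([0.25, 0.5, 0.75]).unstack()
stats["iqr"] = stats[0.75] - stats[0.25]   # height of the box
print(stats)
```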
Bivariate relationship graph jointplot:
- Plot a graph of two variables using bivariate and univariate plots
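# Scatter plot (seaborn) of one variable against the sale price
sns.scatterplot(
    x = train['TotalBsmtSF'],
    y = train['SalePrice']
)
# Joint plot: the scatter plot of the two variables combined with the distribution of each variable
# jointplot() creates its own figure, so its size is set with height instead of plt.figure()
g = sns.jointplot(
    x = train.TotalBsmtSF,
    y = train.SalePrice,
    height = 4
)
# Label the joint axes with the variables that are actually plotted
g.set_axis_labels('TotalBsmtSF', 'SalePrice')
plt.suptitle('Basis')
plt.show()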
Pairwise relationship plot pairplot:
- Function: draws the pairwise relationships in a dataset, showing the relationship between every two features
- Method: creates a grid of Axes in which each variable shares the y-axis across one row and the x-axis across one column
- Features: the plots on the diagonal are the histograms of each attribute, and the off-diagonal plots show the relationship between two different attributes
train_pair = train.loc[:,["LotArea", "GarageArea", "SalePrice"]]
tmp = sns.pairplot(data = train_pair)
print(type(tmp))
<class 'seaborn.axisgrid.PairGrid'>
Heat map heatmap:
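- Function: inspect the correlation between the predictor variables and the target variable
import numpy as np
sns.set(font_scale=1.1)
# Correlation matrix of the numeric columns
correlation_train = train.select_dtypes(include='number').corr()
# Boolean mask that is True on and above the diagonal (the upper triangle)
mask = np.triu(np.ones_like(correlation_train, dtype=bool))
# heatmap() leaves a cell blank wherever mask is True, so only the lower triangle is drawn
plt.figure(figsize=(20, 20))
sns.heatmap(data = correlation_train,
            annot = True,     # write the value into each cell
            fmt='.1f',        # format of the annotations
            cmap='coolwarm',  # colormap
            square=True,      # make every cell square
            mask=mask,
            # vmin, vmax: values that anchor the colormap; inferred from the data when omitted
)
plt.show()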
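What the mask argument does is easier to see on a small matrix. The sketch below builds a correlation matrix from made-up random columns and constructs an upper-triangle mask with np.ones_like and np.triu, which is one common way to build it; where the mask is True, heatmap() leaves the cell blank, which is imitated here with DataFrame.where.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Made-up numeric data with four hypothetical columns
data = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))
corr = data.corr()

# True on and above the diagonal, False below it
mask = np.triu(np.ones_like(corr, dtype=bool))
print(mask)

# Cells that a masked heatmap would still show (the lower triangle)
visible = corr.where(~mask)
print(visible.round(2))
```

Because a correlation matrix is symmetric, hiding the upper triangle removes only duplicated information.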
# Draw a heat map for a subset of the columns
plt.figure(figsize=(10, 10))
# corr_abs = train.corr().abs()
# ser_corr = corr_abs.nlargest(len(numerical_feats), "SalePrice_Log")["SalePrice_Log"]
# cols = ser_corr[ser_corr.values > 0.43].index
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars',
'GarageArea', 'TotalBsmtSF', '1stFlrSF', 'FullBath', 'YearBuilt',
'YearRemodAdd', 'TotRmsAbvGrd', 'Fireplaces']
cm = train[cols].corr()
sns.heatmap(
data = cm,
annot=True,
square=True,
fmt='.1f'
)
plt.show()
Regression plots lmplot & regplot:
- Both draw a scatter plot and, on top of it, the fitted linear relationship between the two variables
regplot():
- data can be a DataFrame
- the x and y parameters accept a variety of data types, including numpy arrays and Series
lmplot():
- the data parameter cannot be empty
- the x and y parameters must be specified as strings (column names)
The format that lmplot() requires, a DataFrame whose columns are referred to by name, is called "long-form data" or "tidy data"; the plain one-dimensional arrays that regplot() also accepts are not supported by lmplot()
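The difference in accepted inputs can be sketched on made-up data: regplot() is given two plain numpy arrays, while lmplot() is given a DataFrame plus column names. The column names x and y are hypothetical, and the Agg backend is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; omit this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": np.arange(30)})              # made-up data
df["y"] = 3 * df["x"] + rng.normal(0, 5, size=30)

# regplot: array inputs, draws on a single Axes and returns it
ax = sns.regplot(x=df["x"].to_numpy(), y=df["y"].to_numpy())

# lmplot: column names plus data, creates its own figure and returns a FacetGrid
grid = sns.lmplot(x="x", y="y", data=df)

print(type(ax).__name__, type(grid).__name__)
```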
lmplot
import seaborn as sns
import pandas
train = pandas.read_csv('./data_set/house_price_train.csv')
sns.lmplot(
x = 'OverallQual',
y = 'SalePrice',
data = train
)
<seaborn.axisgrid.FacetGrid at 0x296ec3fb310>
sns.lmplot(
x = '1stFlrSF',
y = 'SalePrice',
data = train
)
<seaborn.axisgrid.FacetGrid at 0x296ec3fa680>
regplot()
sns.regplot(
x = 'YearBuilt',
y = 'SalePrice',
data = train
)
<AxesSubplot:xlabel='YearBuilt', ylabel='SalePrice'>
# When order is greater than 1, numpy.polyfit is used to estimate a polynomial regression (a common function for curve fitting)
sns.regplot(
x = train['YearBuilt'],
y = train['SalePrice'],
order = 3
)
<AxesSubplot:xlabel='YearBuilt', ylabel='SalePrice'>
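For reference, this is what a degree-3 fit with numpy.polyfit looks like on its own. The sketch fits noise-free, made-up points generated from a known cubic, so the recovered coefficients can be compared with the ones used to generate the data; real data such as YearBuilt against SalePrice would of course not be fitted exactly.

```python
import numpy as np

# Made-up points lying exactly on y = -0.5*x^3 + 2*x + 1
x = np.linspace(-2, 2, 50)
y = 1.0 + 2.0 * x - 0.5 * x**3

# Least-squares polynomial fit; coefficients come highest power first
coeffs = np.polyfit(x, y, deg=3)
fitted = np.polyval(coeffs, x)     # evaluate the fitted polynomial

print(np.round(coeffs, 6))
```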
Drawing subplots in a for loop
import matplotlib.pyplot as plt
# Loop over the numeric columns with a for loop
fig, axes = plt.subplots(4, 3, figsize=(25, 30))
# flatten() collapses the 2-D array of Axes into a 1-D array
axes = axes.flatten()
# zip stops at the shorter input, so only the first 12 columns get a subplot
for columns, j in zip(train.select_dtypes(include=['number']).columns[:13], axes):
    print(columns,' : ', j)
    sns.regplot(
        x=columns,
        y="SalePrice",
        data=train,
        ax=j,
        order=3
    )
Id : AxesSubplot(0.125,0.712609;0.227941x0.167391)
MSSubClass : AxesSubplot(0.398529,0.712609;0.227941x0.167391)
LotFrontage : AxesSubplot(0.672059,0.712609;0.227941x0.167391)
LotArea : AxesSubplot(0.125,0.511739;0.227941x0.167391)
OverallQual : AxesSubplot(0.398529,0.511739;0.227941x0.167391)
OverallCond : AxesSubplot(0.672059,0.511739;0.227941x0.167391)
YearBuilt : AxesSubplot(0.125,0.31087;0.227941x0.167391)
YearRemodAdd : AxesSubplot(0.398529,0.31087;0.227941x0.167391)
MasVnrArea : AxesSubplot(0.672059,0.31087;0.227941x0.167391)
BsmtFinSF1 : AxesSubplot(0.125,0.11;0.227941x0.167391)
BsmtFinSF2 : AxesSubplot(0.398529,0.11;0.227941x0.167391)
BsmtUnfSF : AxesSubplot(0.672059,0.11;0.227941x0.167391)