Python - matplotlib - pyplot plotting

matplotlib


Check whether pandas is already installed:
pip list
Install it if it is missing:
pip install pandas

Basic use:

  • Marker characters (marker):

    1. Point marker: .
    2. Pixel marker: ,
    3. Circle marker: o
    4. Downward-pointing triangle marker: v
  • Line style characters (linestyle):

    1. Solid line: -
    2. Dashed line: --
    3. Dash-dot line: -.
    4. Dotted line: :
  • Color characters (color):

    1. Blue: b
    2. Red: r
    3. Green: g
    4. Cyan: c
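
The characters listed above can be passed as separate keyword arguments or combined into one format string. A minimal sketch with made-up data (the arrays and the Agg backend are assumptions for illustration, not part of the original notebook):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(10)  # made-up sample data
fig, ax = plt.subplots()

# Keyword form: dotted red line with point markers
(line_kw,) = ax.plot(x, x ** 2, linestyle=":", marker=".", color="r")

# Format-string form: green dashed line with circle markers
(line_fmt,) = ax.plot(x, x * 5, "g--o")

print(line_kw.get_linestyle(), line_kw.get_marker(), line_kw.get_color())
print(line_fmt.get_linestyle(), line_fmt.get_marker(), line_fmt.get_color())
```

The format string is shorter, while the keyword form is easier to read when the style is built from variables.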

Libraries required for plotting

  • matplotlib
  • matplotlib.pyplot
  • seaborn
  • pandas
  • numpy
  • Remarks: there are three modes for drawing with plt in a notebook

    • %matplotlib inline: the default mode; renders static images directly in the notebook
    • %matplotlib auto: opens a separate drawing window, the same as in PyCharm
    • %matplotlib notebook: creates an interactive drawing window inside the notebook, in which the figure can be zoomed
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import numpy as np 
import warnings

warnings.filterwarnings("ignore")
# %matplotlib inline renders figures directly in the notebook
%matplotlib inline

Read the training set

import pandas as pd

train = pd.read_csv('./data_set/house_price_train.csv')
print(train.shape)
print(train.dtypes)
#train.head()
#train.tail()

# Inspect the non-numeric (categorical) features:
categorical_feats = train.dtypes[train.dtypes == 'object'].index
print(type(categorical_feats))
print('*'*100)
print(categorical_feats)

# Inspect the numeric features:
value_feats = train.dtypes[train.dtypes != 'object'].index
print(type(value_feats))
print('*'*100)

# The index of feature names can be reused to select those columns:
print(train[value_feats].values)
(1460, 81)
Id                 int64
MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
                  ...   
MoSold             int64
YrSold             int64
SaleType          object
SaleCondition     object
SalePrice          int64
Length: 81, dtype: object
<class 'pandas.core.indexes.base.Index'>
****************************************************************************************************
Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
       'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
       'SaleType', 'SaleCondition'],
      dtype='object')
<class 'pandas.core.indexes.base.Index'>
****************************************************************************************************
[[1.00000e+00 6.00000e+01 6.50000e+01 ... 2.00000e+00 2.00800e+03
  2.08500e+05]
 [2.00000e+00 2.00000e+01 8.00000e+01 ... 5.00000e+00 2.00700e+03
  1.81500e+05]
 [3.00000e+00 6.00000e+01 6.80000e+01 ... 9.00000e+00 2.00800e+03
  2.23500e+05]
 ...
 [1.45800e+03 7.00000e+01 6.60000e+01 ... 5.00000e+00 2.01000e+03
  2.66500e+05]
 [1.45900e+03 2.00000e+01 6.80000e+01 ... 4.00000e+00 2.01000e+03
  1.42125e+05]
 [1.46000e+03 2.00000e+01 7.50000e+01 ... 6.00000e+00 2.00800e+03
  1.47500e+05]]

Title and axis labels:

  • xlabel: the text under the x-axis
  • ylabel: the text beside the y-axis
  • title: the title of the figure
plt.title("House-SalePrice", fontsize=16)  # set the title
plt.xlabel("SalePrice_Log", fontsize=15)   # x-axis label
plt.ylabel("count", fontsize=15)           # y-axis label
print(train['LotArea'].values)
plt.plot(train['LotArea'].index, train['LotArea'].values, 
         linestyle=':',marker=".",color='r')
# ':' selects the dotted line style, '.' the point marker, and 'r' a red line
plt.show()
[ 8450  9600 11250 ...  9042  9717  9937]


Sub-area graphics drawing method:

  • subplots(): returns the figure object and an array of axes, one per sub-area
import matplotlib.pyplot as plt

# fig is the figure object and ax is a 2x2 array of axes;
# figsize sets the overall size of the figure
fig, ax = plt.subplots(2, 2, figsize=(8, 8))
print(type(ax))
# Index ax by row and column to address each sub-area:
ax[0][0].set_title(1)
ax[0][1].set_title(2)
ax[1][0].set_title(3)
ax[1][1].set_title(4)

print(ax)
<class 'numpy.ndarray'>
[[<AxesSubplot:title={'center':'1'}> <AxesSubplot:title={'center':'2'}>]
 [<AxesSubplot:title={'center':'3'}> <AxesSubplot:title={'center':'4'}>]]

  • Drawing in each sub-area with a for loop
fig, ax = plt.subplots(2, 2, figsize=(8, 8))
print(categorical_feats[:4])  # names of the first four categorical columns

for row in range(2):
    for col in range(2):
        name = categorical_feats[row*2+col]
        data = train[name].value_counts()  # number of occurrences of each category
        ax[row][col].plot(data.index, data.values)
        ax[row][col].set_title(name)
fig.tight_layout()  # automatically keep proper spacing between the subplots
plt.show()
Index(['MSZoning', 'Street', 'Alley', 'LotShape'], dtype='object')

Draw four types of diagrams

Histogram hist

  • Shows how often the data fall into each interval of values: the horizontal axis is the value, the vertical axis is the frequency
  • Only one data parameter is required: x
parameter   description
x           Required. An array or a sequence of arrays.
bins        An integer gives the number of bins; an array gives the positions of the bin edges.
range       The lower and upper limits of the bins as a (min, max) tuple; the default is None.
density     If True, draw a probability density histogram; if False (the default), draw the number of elements in each bin.
histtype    The type of histogram to draw. The default is "bar"; the other values are "barstacked" (stacked bars), "step" (unfilled outline), and "stepfilled" (filled outline).
plt.hist(x = train['SalePrice'], 
         bins=50,
         density=False,
         histtype='stepfilled')
plt.show()
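
To make the bins and density parameters from the table concrete, here is a small sketch on made-up data (the random values, the bin edges, and the Agg backend are assumptions, not taken from the house-price data set). It relies on hist() returning the bar values and the bin edges:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=100, scale=15, size=500)  # made-up data

fig, (ax1, ax2) = plt.subplots(1, 2)

# Integer bins: the data range is split into 10 equal-width bins (11 edges)
counts, edges, _ = ax1.hist(values, bins=10)

# Array bins: the edges are given explicitly; density=True scales the bars
# so that the total area (height * width) is 1
custom_edges = [40, 70, 85, 100, 115, 130, 160]
dens, dens_edges, _ = ax2.hist(values, bins=custom_edges, density=True)

print(len(edges), counts.sum())
print((dens * np.diff(dens_edges)).sum())
```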



Bar chart bar:

  • There are two drawing parameters: x and height
parameter   description
x           A sequence of scalars giving the x coordinates of the bars. By default each x value is the center of its bar; it can also be the left edge (see align).
height      A scalar or sequence of scalars giving the heights of the bars.
width       Optional. A scalar or array-like giving the widths of the bars; the default is 0.8.
bottom      Optional. A scalar or array-like giving the y coordinates of the bottoms of the bars; the default is None.
align       One of {"center", "edge"}; the default is "center". Determines where the x value sits within the bar.
data = train['MSZoning'].value_counts() 
print(data)
print(type(data))
plt.bar(x = data.index, 
        height = data.values,
        width=0.5,
        align='center') 
RL         1151
RM          218
FV           65
RH           16
C (all)      10
Name: MSZoning, dtype: int64
<class 'pandas.core.series.Series'>





<BarContainer object of 5 artists>
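
The effect of align and bottom is easiest to see on the rectangles that bar() returns. A minimal sketch with made-up heights (the numbers and the Agg backend are assumptions for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

x = [0, 1, 2]
lower = [3, 5, 2]   # made-up bar heights
upper = [1, 2, 4]

fig, (ax1, ax2) = plt.subplots(1, 2)

# align="center": each bar is centered on its x value
centered = ax1.bar(x, lower, width=0.5, align="center")
# align="edge": the left edge of each bar sits on its x value
edged = ax1.bar(x, lower, width=0.5, align="edge")

# bottom shifts the base of the bars, which stacks one series on another
base = ax2.bar(x, lower, width=0.5)
stacked = ax2.bar(x, upper, width=0.5, bottom=lower)

print(centered[0].get_x(), edged[0].get_x())
print(stacked[0].get_y(), stacked[0].get_height())
```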


Scatter plot scatter:

  • Two drawing parameters: x and y together give the coordinates of each point
  • Plots the data points against the horizontal and vertical axes
parameter    description
x, y         Coordinates of the points
s            Marker area (in points squared)
c            Marker color (by default the next color of the color cycle, which starts with blue; color codes are the same as for plt.plot())
marker       Marker style (the default is a filled circle, 'o'; the other styles are the same as for plt.plot())
alpha        Transparency, a number in [0, 1]: 0 is completely transparent, 1 is completely opaque
linewidths   Line width of the marker edges
edgecolors   Color of the marker edges
plt.scatter(x = train.TotalBsmtSF, 
            y = train.SalePrice, 
            c='b',
            marker=',',
            alpha=0.5
            )
plt.scatter(x = train.BsmtUnfSF, 
            y = train.SalePrice, 
            c='r',
            marker='.',
            alpha=0.8
            ) 
# Plot several attributes on the same scatter plot
<matplotlib.collections.PathCollection at 0x216c67c0130>


Pie chart pie:

  • One drawing parameter: x
parameter   description
x           Array-like; each element gives the size of one wedge.
explode     Highlights wedges by setting how far each one is offset from the center.
labels      A list of strings giving a label for each wedge.
colors      The color of each wedge; by default the colors follow the color cycle.
autopct     A format string (applied as fmt % pct) that labels each wedge with its percentage, placed inside the wedge.
plt.figure(figsize=(8,8))
plt.pie(
    x = train['MSZoning'].value_counts(),
    explode = (0, 0.1, 0.2, 0, 0), 
    autopct = '%1.2f%%'
    )
plt.show()
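
How autopct turns wedge sizes into percentage labels can be checked on a tiny made-up example (the sizes, labels, and Agg backend are assumptions). When autopct is given, pie() also returns the text objects that hold the percentages:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

sizes = [50, 30, 20]        # made-up wedge sizes
labels = ["A", "B", "C"]    # hypothetical category names

fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(
    sizes,
    explode=(0, 0.1, 0),    # pull the second wedge away from the center
    labels=labels,
    autopct="%1.1f%%",      # one decimal place followed by a percent sign
)

percent_labels = [t.get_text() for t in autotexts]
print(percent_labels)
```

The explode tuple needs one entry per wedge, which is why the original call passes five values for the five MSZoning categories.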


Special skills:

  • Logarithmic transform:

    1. When to use: the values of x are spread over a very wide range and a few values of y are very high (a skewed distribution)
    2. Purpose: bring the distribution closer to a normal distribution
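
The effect of the logarithmic transform can be sketched with numpy alone. The data below are made-up, right-skewed "prices" drawn from a log-normal distribution, and skewness() is a hypothetical helper written for this sketch; neither comes from the original notebook:

```python
import numpy as np

def skewness(a):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    a = np.asarray(a, dtype=float)
    centered = a - a.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

rng = np.random.default_rng(42)
prices = rng.lognormal(mean=12, sigma=0.4, size=2000)  # made-up, right-skewed values

# log1p computes log(1 + x); expm1 is its inverse
log_prices = np.log1p(prices)

print(round(skewness(prices), 2), round(skewness(log_prices), 2))
```

A skewness close to 0 after the transform indicates a more symmetric, closer-to-normal shape. Because np.expm1 undoes np.log1p, values computed on the log scale can be mapped back to the original scale.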

Pandas plotting

  • pandas wraps matplotlib in a simple plotting interface:
    call .plot() directly on a DataFrame or Series to draw it

  • .plot() parameters:

    1. kind: the plot type, such as the hist, scatter, pie and bar described above

Series drawing

# Plot directly
train['SalePrice'].plot()
<AxesSubplot:>


# Count the occurrences of each value, then plot
train['MSZoning'].value_counts().plot()
<AxesSubplot:>


# Specify kind
train['MSZoning'].value_counts().plot(kind="bar") # bar chart
<AxesSubplot:>


train['SalePrice'].plot.hist() 
# Histogram; .plot.hist() is equivalent to kind='hist'
# train['SalePrice'].plot(kind='hist')
<AxesSubplot:ylabel='Frequency'>


DataFrame plotting

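# Plot a subset of the columns directly
train_d = train.loc[:,value_feats[1:5]]
print(train_d)

# DataFrame.hist() (without .plot) draws every column in its own subplot
train_d.hist(
    figsize=(8, 10), 
    bins=50, 
    xlabelsize=8,   # font size of the x-axis tick labels
    ylabelsize=8
             )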

plt.show()
      MSSubClass  LotFrontage  LotArea  OverallQual
0             60         65.0     8450            7
1             20         80.0     9600            6
2             60         68.0    11250            7
3             70         60.0     9550            7
4             60         84.0    14260            8
...          ...          ...      ...          ...
1455          60         62.0     7917            6
1456          20         85.0    13175            6
1457          70         66.0     9042            7
1458          20         68.0     9717            5
1459          20         75.0     9937            5

[1460 rows x 4 columns]


# With .plot, the columns are drawn together on one set of axes
train.loc[:,['MSSubClass','LotFrontage','OpenPorchSF']].plot.hist()
<AxesSubplot:ylabel='Frequency'>


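# Count the NaN values in every column and show the counts as a bar chart
missing = train.isnull().sum()
print(missing)
# missing > 0 is a boolean mask; indexing with it keeps only the columns
# that contain at least one NaN
missing = missing[missing > 0]
#print(missing)
missing = missing.sort_values(ascending=False)
# ascending=False sorts in descending order
# inplace=True would modify the Series in place instead of returning a new one
missing.plot.bar()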
Id                 0
MSSubClass         0
MSZoning           0
LotFrontage      259
LotArea            0
                ... 
MoSold             0
YrSold             0
SaleType           0
SaleCondition      0
SalePrice          0
Length: 81, dtype: int64





<AxesSubplot:>
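
The filtering step above (missing > 0) is easier to follow on a small table. A minimal sketch with a made-up DataFrame (the column names and values are hypothetical, not from the house-price data):

```python
import numpy as np
import pandas as pd

# Made-up table: column "a" is complete, the others contain NaN values
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, 2, np.nan, 4],
    "c": [np.nan, np.nan, np.nan, 4],
    "d": [1, np.nan, 3, 4],
})

missing = df.isnull().sum()   # number of NaN values per column
keep = missing > 0            # boolean Series: True where a column has any NaN
missing = missing[keep].sort_values(ascending=False)  # largest count first

print(keep.tolist())
print(missing)
```

Only the columns with at least one NaN survive the filter, ordered from the most to the fewest missing values, which is the order in which the bars are then drawn.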


seaborn

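  • Introduction: a higher-level wrapper around matplotlib that simplifies common plots but does not cover all of matplotlib's functionality

  • Five theme styles:

    1. darkgrid (gray background with grid)
    2. whitegrid (white background with grid)
    3. dark (gray background, no grid)
    4. white (white background, no grid)
    5. ticks (white background with tick marks on the axes)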
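
The five styles differ mainly in the background color and in whether a grid is drawn. The sketch below inspects the settings behind each style with sns.axes_style() and then activates one of them; the Agg backend is an assumption so that it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs without a display
import seaborn as sns

styles = ["darkgrid", "whitegrid", "dark", "white", "ticks"]

# axes_style(name) returns the matplotlib settings that the style would apply
grid_enabled = {name: bool(sns.axes_style(name)["axes.grid"]) for name in styles}
for name in styles:
    print(name, grid_enabled[name], sns.axes_style(name)["axes.facecolor"])

# Activate one style for all following plots
sns.set_style("whitegrid")
```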

Histogram displot:

import seaborn as sns
sns.set(style = "darkgrid")
sns.displot(x = train['SalePrice'])
<seaborn.axisgrid.FacetGrid at 0x216ce4078b0>



Kernel density map kdeplot:

  • Kernel density plot:
    1. Purpose: shows how continuous data are distributed along the x-axis.
    2. Features: a variant of the histogram that draws a smooth curve instead of bars, giving a smoother view of the distribution.
sns.kdeplot(x = train['SalePrice'])
<AxesSubplot:xlabel='SalePrice', ylabel='Density'>


Histogram density map distplot:

  • Plots the histogram and the density curve at the same time (distplot is deprecated since seaborn 0.11 in favor of histplot and displot)
sns.distplot(x = train["SalePrice"],bins=20, kde=True)
<AxesSubplot:ylabel='Density'>


Bar chart countplot & barplot:

  • countplot: one variable (x or y) plus data; the bar heights are the counts of each value

  • barplot: two variables, x and y

  • countplot

ax = sns.countplot(
    x="MSSubClass", # column whose values are counted along the x-axis
    data = train
    )


  • barplot:
    1. For each category, the bar shows a point estimate of the variable (the mean by default) together with its confidence interval.
    2. The black line on each bar is the error bar, which indicates the uncertainty of the estimate. Long error bars generally mean that the data are widely dispersed or that the sample is small.
plt.figure(figsize=(10,6))
sns.barplot(x='GarageCars',y = 'SalePrice',data=train)
plt.show()


  • Use a bar chart to show the proportion of NaN values in each column
# Proportion of NaN values in each column
missing = train.isnull().mean()
missing = missing.sort_values(ascending=False)[:20]

# Show the proportion of NaN values per column as a bar chart
sns.barplot(x=missing.index,y=missing.values)
# Rotate the x-axis tick labels by 90 degrees
plt.xticks(rotation=90)
plt.show()



Boxplot:

  • Advantages: shows how the data are distributed within each category,
    including the maximum, minimum, median, and upper and lower quartiles of each group
  • Two plot parameters: x & y
plt.figure(figsize=(10,5))
sns.boxplot(
    x = 'GarageCars',
    y = 'SalePrice',
    data = train
    )
plt.show()
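
The statistics that a box plot draws can be computed by hand with numpy. The sketch below uses made-up numbers and the common rule that whiskers reach at most 1.5 times the interquartile range beyond the quartiles (the default in matplotlib and seaborn), so extreme values are drawn as separate points rather than as the whisker ends:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])  # made-up sample with one extreme value

q1, median, q3 = np.percentile(data, [25, 50, 75])  # lower quartile, median, upper quartile
iqr = q3 - q1                                       # interquartile range (height of the box)

# Points outside these limits are drawn individually as outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = data[(data < lower_limit) | (data > upper_limit)]

print(q1, median, q3, iqr)
print(outliers)
```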


Bivariate relationship graph jointplot:

  • Combines a bivariate plot of two variables with the univariate distribution of each
# Scatter plot describing the relationship between one variable and the sale price
sns.scatterplot(
    x = train['TotalBsmtSF'],
    y = train['SalePrice']
    )

# Joint plot: the scatter plot of the two variables combined with the distribution of each one.
# jointplot builds its own figure, so the size and labels are set through height and the returned grid.
g = sns.jointplot(
    x = train.TotalBsmtSF, 
    y = train.SalePrice,
    height = 4
    )
g.set_axis_labels('TotalBsmtSF', 'SalePrice')
g.fig.suptitle('Basis')
plt.show()

Pairwise relationship graph pairplot:

  • Function: draws pairwise relationships, showing how every pair of features relates
  • Method: creates a grid of Axes in which each variable shares the y-axis across one row and the x-axis across one column
  • Features: the diagonal shows the histogram of each attribute; the off-diagonal cells show the relationship between two different attributes
train_pair = train.loc[:,["LotArea", "GarageArea", "SalePrice"]]
tmp = sns.pairplot(data = train_pair)
print(type(tmp))
<class 'seaborn.axisgrid.PairGrid'>


Heat map heatmap:

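  • Purpose: a way to identify how strongly the predictor variables correlate with the target variable
import numpy as np
sns.set(font_scale=1.1)
# Correlation matrix of the numeric columns
correlation_train = train[value_feats].corr()
# Boolean mask that is True on the diagonal and above it; heatmap hides the cells where
# the mask is True, so each pair of features appears once, in the lower triangle
mask = np.triu(np.ones_like(correlation_train, dtype=bool))
plt.figure(figsize=(20, 20))
sns.heatmap(data = correlation_train,
            annot = True,      # write the value into each cell
            fmt='.1f',         # format of the annotations
            cmap='coolwarm',   # colormap
            square=True,       # make every cell square
            mask=mask,
            # vmin, vmax: values that anchor the ends of the colormap; inferred from the data when omitted
           )
plt.show()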
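
What the mask does is easiest to see on a 3x3 matrix. A minimal numpy sketch with made-up correlation values; visible is a hypothetical name for the cells that the heat map would still draw:

```python
import numpy as np

# Made-up symmetric correlation matrix
corr = np.array([
    [1.0, 0.8, 0.2],
    [0.8, 1.0, -0.5],
    [0.2, -0.5, 1.0],
])

# True on the diagonal and above it, False below
mask = np.triu(np.ones_like(corr, dtype=bool))

# Cells where the mask is True are hidden; only the lower triangle remains
visible = np.where(mask, np.nan, corr)

print(mask)
print(visible)
```

Because a correlation matrix is symmetric, the hidden upper triangle only repeats the values that stay visible below the diagonal.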


# Heat map for a subset of the columns
plt.figure(figsize=(10, 10))
# corr_abs = train.corr().abs() 
# ser_corr = corr_abs.nlargest(len(numerical_feats), "SalePrice_Log")["SalePrice_Log"]
# cols = ser_corr[ser_corr.values > 0.43].index
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars',
       'GarageArea', 'TotalBsmtSF', '1stFlrSF', 'FullBath', 'YearBuilt',
       'YearRemodAdd', 'TotRmsAbvGrd', 'Fireplaces']
cm = train[cols].corr()
sns.heatmap(
    data = cm, 
    annot=True, 
    square=True, 
    fmt='.1f'
    )
plt.show()


Regression plots lmplot & regplot:

  • Draw a scatter plot together with the fitted linear relationship between two variables

  • regplot():

    1. data is optional and can be a DataFrame
    2. The x and y parameters accept a variety of data types, including numpy arrays and Series
  • lmplot():

    1. The data parameter cannot be empty
    2. The x and y parameters must be specified as strings (column names)
  • The format required by lmplot(), a DataFrame whose columns are referenced by name, is called "long-form data" or "tidy data"; one-dimensional array-like inputs are supported only by regplot()
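
The difference in accepted inputs can be sketched side by side. The data below are made up (the column names area and price are hypothetical) and the Agg backend is an assumption so that the sketch runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
area = rng.uniform(50, 200, size=100)                     # made-up predictor
price = 1000 * area + rng.normal(0, 10000, size=100)      # made-up target with noise
df = pd.DataFrame({"area": area, "price": price})

# regplot accepts plain arrays and draws on an existing Axes
fig, ax = plt.subplots()
ax_out = sns.regplot(x=area, y=price, ax=ax)

# lmplot needs a DataFrame plus column names and creates its own figure
grid = sns.lmplot(x="area", y="price", data=df)

print(type(ax_out).__name__, type(grid).__name__)
```

regplot returns the Axes it drew on, so it fits into a subplot layout, while lmplot returns a FacetGrid that manages its own figure.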

lmplot

import seaborn as sns
import pandas 
train = pandas.read_csv('./data_set/house_price_train.csv')

sns.lmplot(
    x = 'OverallQual',
    y = 'SalePrice',
    data = train
)
<seaborn.axisgrid.FacetGrid at 0x296ec3fb310>


sns.lmplot(
    x = '1stFlrSF',
    y = 'SalePrice',
    data = train
)
<seaborn.axisgrid.FacetGrid at 0x296ec3fa680>


regplot()

sns.regplot(
    x = 'YearBuilt', 
    y = 'SalePrice', 
    data = train
    )
<AxesSubplot:xlabel='YearBuilt', ylabel='SalePrice'>



# With order > 1, numpy.polyfit (a function commonly used for curve fitting) estimates a polynomial regression
sns.regplot(
    x = train['YearBuilt'], 
    y = train['SalePrice'], 
    order = 3
    )
<AxesSubplot:xlabel='YearBuilt', ylabel='SalePrice'>
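
The comment above mentions numpy.polyfit. As a rough illustration of what a third-order fit does, the sketch below fits a cubic to made-up, noise-free points; it reproduces only the curve fit, not the scatter plot or confidence band that regplot draws:

```python
import numpy as np

# Made-up points that lie exactly on y = 2x^3 - x^2 + 0.5x + 3
x = np.linspace(-2, 2, 50)
y = 2 * x ** 3 - x ** 2 + 0.5 * x + 3

# Least-squares fit of a degree-3 polynomial; coefficients come highest power first
coeffs = np.polyfit(x, y, deg=3)

# Evaluate the fitted polynomial at the same x values
fitted = np.polyval(coeffs, x)

print(np.round(coeffs, 3))
```

With real, noisy data the recovered coefficients are only estimates, and a high order can overfit.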


Drawing subplots in a for loop

import matplotlib.pyplot as plt
# Iterate over the attributes with a for loop
fig, axes = plt.subplots(4, 3, figsize=(25, 30))
# flatten turns the 2-D array of axes into a 1-D array
axes = axes.flatten()
for columns, j in zip(train.select_dtypes(include=['number']).columns[:13], axes):
    print(columns,' : ', j)
    sns.regplot(
        x=columns, 
        y="SalePrice", 
        data=train,
        ax=j, 
        order=3
        )
Id  :  AxesSubplot(0.125,0.712609;0.227941x0.167391)
MSSubClass  :  AxesSubplot(0.398529,0.712609;0.227941x0.167391)
LotFrontage  :  AxesSubplot(0.672059,0.712609;0.227941x0.167391)
LotArea  :  AxesSubplot(0.125,0.511739;0.227941x0.167391)
OverallQual  :  AxesSubplot(0.398529,0.511739;0.227941x0.167391)
OverallCond  :  AxesSubplot(0.672059,0.511739;0.227941x0.167391)
YearBuilt  :  AxesSubplot(0.125,0.31087;0.227941x0.167391)
YearRemodAdd  :  AxesSubplot(0.398529,0.31087;0.227941x0.167391)
MasVnrArea  :  AxesSubplot(0.672059,0.31087;0.227941x0.167391)
BsmtFinSF1  :  AxesSubplot(0.125,0.11;0.227941x0.167391)
BsmtFinSF2  :  AxesSubplot(0.398529,0.11;0.227941x0.167391)
BsmtUnfSF  :  AxesSubplot(0.672059,0.11;0.227941x0.167391)
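
The flatten-and-zip pattern used above stops as soon as the shorter of the two sequences runs out, so spare axes stay empty. A minimal sketch with hypothetical column names that also hides the unused axes (the Agg backend is an assumption):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

columns = ["a", "b", "c", "d"]   # hypothetical column names

fig, axes = plt.subplots(2, 3)
flat = axes.flatten()            # 2x3 grid of axes -> 1-D array of 6 axes

# zip pairs each column with one axes object and stops at the shorter sequence
for name, ax in zip(columns, flat):
    ax.set_title(name)

# Hide the axes that did not receive a plot
for ax in flat[len(columns):]:
    ax.set_visible(False)

print(flat.shape, [ax.get_title() for ax in flat])
```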


Origin blog.csdn.net/buptsd/article/details/129915089