Learning Topic 7 - Data Visualization

Lesson 1: Welcome to Data Visualization


Topic 3 already introduced some data visualization, based on seaborn. This topic mainly uses the plotting tools built into pandas, plus some further seaborn methods, for analysis.

Lesson 1 stresses the importance of visualization in data analysis. You must master some pandas basics before starting this topic; if you have not, go back to the earlier topics, including Topic 5, first.


Lesson 2: Univariate plotting with pandas


1. Bar chart and categorical data

(reviews['province'].value_counts().head(10) / len(reviews)).plot.bar()

Figure: Bar chart
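
The bar chart above divides the counts by len(reviews) to show shares instead of raw counts. A minimal sketch of the same idea on a small made-up Series (the `provinces` data here is hypothetical, not the wine dataset), comparing the manual division with the `normalize=True` option of `value_counts`:

```python
import pandas as pd

# Hypothetical stand-in for the `province` column of the wine reviews data.
provinces = pd.Series(
    ["California", "California", "Tuscany", "Bordeaux", "California", "Tuscany"],
    name="province",
)

# Manual version, as in the bar chart code: counts divided by the row count.
manual_share = provinces.value_counts() / len(provinces)

# Built-in version: normalize=True returns relative frequencies directly.
builtin_share = provinces.value_counts(normalize=True)

print(builtin_share)
```

Both Series hold the same proportions, so either one can be passed on to .plot.bar().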

2. Line Chart

reviews['points'].value_counts().sort_index().plot.line()

Figure: Line chart

- Try not to use a bar chart together with a line chart
- Learn to use percentages

3. Area chart

reviews['points'].value_counts().sort_index().plot.area()

Figure: Area chart

4. Histogram

reviews[reviews['price'] < 200]['price'].plot.hist()

Figure: Histogram

Lesson 3: Bivariate plotting with pandas


5. Scatter plot

reviews[reviews['price'] < 100].sample(100).plot.scatter(x='price', y='points')

Figure: Scatter plot

Note that in order to make effective use of this plot, we had to downsample our data, taking just 100 points from the full set. This is because naive scatter plots do not effectively treat points which map to the same place. For example, if two wines, both costing 100 dollars, get a rating of 90, then the second one is overplotted onto the first one, and we add just one point to the plot.
This isn't a problem if it happens just a few times. But with enough points the distribution starts to look like a shapeless blob, and you lose the forest for the trees.
Scatter plots work best for fairly small datasets whose variables take many distinct values.
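
To make the overplotting problem concrete, here is a small sketch that counts how many rows would land exactly on top of an earlier marker. The data is randomly generated and purely hypothetical (integer prices below 100, integer scores from 80 to 100); it only mimics the shape of the wine data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: integer prices and integer scores, so many rows coincide.
df = pd.DataFrame({
    "price": rng.integers(4, 100, size=n),
    "points": rng.integers(80, 101, size=n),
})

# A row that repeats an earlier (price, points) pair is drawn on top of
# an existing marker and adds nothing visible to the scatter plot.
overplotted_full = int(df.duplicated(subset=["price", "points"]).sum())

# Downsampling to 100 rows, as in the scatter plot code above.
sample = df.sample(100, random_state=0)
overplotted_sample = int(sample.duplicated(subset=["price", "points"]).sum())

print(overplotted_full, overplotted_sample)
```

With only 96 x 21 possible coordinates, most of the 10,000 rows are hidden in the full plot, while the 100-row sample has far fewer collisions.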

6. Hex plot

reviews[reviews['price'] < 100].plot.hexbin(x='price', y='points', gridsize=15)

Figure: Hex plot

A hex plot overcomes the overplotting problem that a scatter plot can have: the darker the color of a hexagon, the more data points fall inside it.
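
A minimal sketch of what the hexagon colors encode, assuming randomly generated stand-in data (the `df` below is hypothetical). The drawn hexagons are a matplotlib collection whose array holds one count per hexagon, and that count is what the color scale maps.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for the price/points columns.
df = pd.DataFrame({
    "price": rng.gamma(shape=3.0, scale=10.0, size=5000).clip(4, 99),
    "points": rng.normal(88, 3, size=5000).round().clip(80, 100),
})

ax = df.plot.hexbin(x="price", y="points", gridsize=15)

# The hexagons are the first collection on the axes; its array stores
# the number of points that fell into each hexagon.
hexes = ax.collections[0]
counts = np.asarray(hexes.get_array())

print("hexagons:", counts.size, "largest count:", counts.max())
```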

7. Stacked bar chart

wine_counts.plot.bar(stacked=True)

Figure: Stacked bar chart
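
The notes do not show how wine_counts is built. One plausible shape, assumed here for illustration only, is a table with one row per score and one column per variety. A sketch of building such a table with pd.crosstab from a tiny made-up frame (`reviews_demo` and `wine_counts_demo` are hypothetical names):

```python
import pandas as pd

# Hypothetical miniature of the reviews table.
reviews_demo = pd.DataFrame({
    "points": [88, 88, 89, 89, 89, 90, 90, 88],
    "variety": ["Pinot Noir", "Chardonnay", "Pinot Noir", "Pinot Noir",
                "Chardonnay", "Chardonnay", "Pinot Noir", "Pinot Noir"],
})

# Rows: score, columns: variety, cells: number of reviews.
wine_counts_demo = pd.crosstab(reviews_demo["points"], reviews_demo["variety"])

print(wine_counts_demo)
```

Calling wine_counts_demo.plot.bar(stacked=True) on a table of this shape stacks one colored segment per variety on each score.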

8. Bivariate area chart

wine_counts.plot.area()

Figure: Bivariate area chart

9. Bivariate line chart

wine_counts.plot.line()

Figure: Bivariate line chart

  • A scatter plot or hex plot is good for what two types of data?
    Scatter plots and hex plots work best with a mixture of ordinal categorical and interval data.

  • What type of data makes sense to show in a stacked bar chart, but not in a bivariate line chart?
    Nominal categorical data makes sense in a stacked bar chart, but not in a bivariate line chart.

  • What type of data makes sense to show in a bivariate line chart, but not in a stacked bar chart?
    Interval data makes sense in a bivariate line chart, but not in a stacked bar chart.

  • Suppose we create a scatter plot but find that due to the large number of points it's hard to interpret. What are two things we can do to fix this issue?
    One way to fix this issue would be to sample the points. Another way to fix it would be to use a hex plot.


Lesson 4: Customizing your plots


reviews['points'].value_counts().sort_index().plot.bar()

The original plot looks like this:

Figure: Original plot

1. Make it bigger

reviews['points'].value_counts().sort_index().plot.bar(figsize=(12, 6))

Figure: Larger plot

Change the color

reviews['points'].value_counts().sort_index().plot.bar(
    figsize=(12, 6),
    color='mediumvioletred'
)

Figure: Recolored plot

Make the label text bigger

reviews['points'].value_counts().sort_index().plot.bar(
    figsize=(12, 6),
    color='mediumvioletred',
    fontsize=16
)

Figure: Larger label text

Add a title

reviews['points'].value_counts().sort_index().plot.bar(
    figsize=(12, 6),
    color='mediumvioletred',
    fontsize=16,
    title='Rankings Given by Wine Magazine',
)

Figure: Title added

Make the title bigger

import matplotlib.pyplot as plt

ax = reviews['points'].value_counts().sort_index().plot.bar(
    figsize=(12, 6),
    color='mediumvioletred',
    fontsize=16
)
ax.set_title("Rankings Given by Wine Magazine", fontsize=20)

Figure: Larger title

Remove the black border

import matplotlib.pyplot as plt
import seaborn as sns

ax = reviews['points'].value_counts().sort_index().plot.bar(
    figsize=(12, 6),
    color='mediumvioletred',
    fontsize=16
)
ax.set_title("Rankings Given by Wine Magazine", fontsize=20)
sns.despine(bottom=True, left=True)

Figure: Border removed
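
For readers who want the same result without seaborn, the border can also be hidden through the spines of the matplotlib axes. This is a sketch of an alternative, not the method used above; the `scores` Series is hypothetical sample data.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import pandas as pd

# Hypothetical stand-in for reviews['points'].
scores = pd.Series([88, 89, 89, 90, 90, 90, 91])

ax = scores.value_counts().sort_index().plot.bar(
    figsize=(12, 6),
    color="mediumvioletred",
    fontsize=16,
)
ax.set_title("Rankings Given by Wine Magazine", fontsize=20)

# sns.despine() hides the top and right spines by default;
# bottom=True and left=True hide the remaining two as well.
for side in ("top", "right", "bottom", "left"):
    ax.spines[side].set_visible(False)
```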

Lesson 5: Subplots


fig, axarr = plt.subplots(2, 2, figsize=(12, 8))

reviews['points'].value_counts().sort_index().plot.bar(
    ax=axarr[0][0], fontsize=12, color='mediumvioletred'
)
axarr[0][0].set_title("Wine Scores", fontsize=18)

reviews['variety'].value_counts().head(20).plot.bar(
    ax=axarr[1][0], fontsize=12, color='mediumvioletred'
)
axarr[1][0].set_title("Wine Varieties", fontsize=18)

reviews['province'].value_counts().head(20).plot.bar(
    ax=axarr[1][1], fontsize=12, color='mediumvioletred'
)
axarr[1][1].set_title("Wine Origins", fontsize=18)

reviews['price'].value_counts().plot.hist(
    ax=axarr[0][1], fontsize=12, color='mediumvioletred'
)
axarr[0][1].set_title("Wine Prices", fontsize=18)

plt.subplots_adjust(hspace=.3)

import seaborn as sns
sns.despine()

Figure: Subplots

Lesson 6: Plotting with seaborn


Part of this lesson was already covered in Topic 3, so only the new material is excerpted here.

10. Count plot

# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x=reviews['points'])

Figure: Count plot

countplot is similar to barplot, but countplot cannot take both x and y at the same time.

11. Box plot, also called a box-and-whisker plot

df = reviews[reviews.variety.isin(reviews.variety.value_counts().head(5).index)]

sns.boxplot(
    x='variety',
    y='points',
    data=df
)

Figure: Box plot
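
A box plot summarizes each group by its quartiles: the box spans the first to third quartile, the line inside marks the median, and by default seaborn draws the whiskers out to at most 1.5 times the interquartile range. A minimal sketch of computing those numbers with pandas, on a small made-up frame (`demo` is hypothetical data):

```python
import pandas as pd

# Hypothetical scores for two varieties.
demo = pd.DataFrame({
    "variety": ["Pinot Noir"] * 5 + ["Chardonnay"] * 5,
    "points": [85, 87, 88, 90, 95, 84, 86, 86, 88, 89],
})

# Quartiles per variety: these are the edges and the center line of each box.
summary = (
    demo.groupby("variety")["points"]
        .quantile([0.25, 0.5, 0.75])
        .unstack()
)
summary.columns = ["q1", "median", "q3"]

# Interquartile range: the height of the box.
summary["iqr"] = summary["q3"] - summary["q1"]

print(summary)
```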

12. Violin plot

sns.violinplot(
    x='variety',
    y='points',
    data=reviews[reviews.variety.isin(reviews.variety.value_counts()[:5].index)]
)

Figure: Violin plot

Lesson 7: Faceting with seaborn


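1. FacetGrid
The basic workflow is to initialize a FacetGrid object with the dataset and the variables used to structure the grid. One or more plotting functions can then be applied to each subset by calling FacetGrid.map() or FacetGrid.map_dataframe(). Finally, the plot can be adjusted with other methods, for example to change the axis labels, use different ticks, or add a legend.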

df = footballers[footballers['Position'].isin(['ST', 'GK'])]
g = sns.FacetGrid(df, col="Position")
g.map(sns.kdeplot, "Overall")

Figure: Faceting

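Put simply, the arguments to FacetGrid describe how the plots are split into categories, which is usually shown above each plot (set through col and row). The arguments to map determine the xlabel and ylabel (sometimes only the xlabel), which are shown below and to the left of the plots.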
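
To see what the grid is doing, the col="Position" faceting can be approximated by hand with plain matplotlib: one subplot per category, the same plotting call applied to each subset, and the category shown as the title. This is only a rough sketch of the idea, not how seaborn implements it; the `players` frame is randomly generated, and a histogram stands in for sns.kdeplot.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for the footballers data.
players = pd.DataFrame({
    "Position": ["ST"] * 50 + ["GK"] * 50,
    "Overall": np.concatenate([rng.normal(70, 6, 50), rng.normal(65, 7, 50)]),
})

# One subset per category, in order of first appearance.
groups = list(players.groupby("Position", sort=False))

# One column per category, with shared axes so the panels are comparable.
fig, axes = plt.subplots(1, len(groups), figsize=(8, 3), sharex=True, sharey=True)

for ax, (position, subset) in zip(axes, groups):
    ax.hist(subset["Overall"], bins=15)       # the "mapped" plotting function
    ax.set_title(f"Position = {position}")    # category label above the panel
    ax.set_xlabel("Overall")                  # variable label below the panel
```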

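  • FacetGrid has further parameters, such as col_wrap (a fixed number of columns before wrapping), hue (a different color for each level of a variable), margin_titles (the titles of the row variable are drawn to the right of the last column), row_order (the order of the levels of the row variable), height, and aspect (aspect ratio). For details, see: zsffuture
  • The first argument to map is the plotting function, such as the bar or dist plots mentioned above. It can come from plt or from sns. The following arguments are the x and y variables, and then any parameters of that plotting function.

2. PairGrid and pairplot
PairGrid also lets you quickly draw a grid of small subplots using the same plot type to visualize the data in each one. In a PairGrid, each row and each column is assigned to a different variable, so the resulting plot shows every pairwise relationship in the dataset. This kind of plot is sometimes called a "scatter plot matrix", because that is the most common way to show each relationship, but PairGrid is not limited to scatter plots.
It is important to understand the difference between FacetGrid and PairGrid. In the former, each facet shows the same relationship, conditioned on different levels of other variables. In the latter, each plot shows a different relationship (although the upper and lower triangles are mirror images of each other).

iris = sns.load_dataset("iris")
# This dataset should be familiar to everyone, so we do not inspect it here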
g = sns.PairGrid(iris)
g.map(plt.scatter)
g = sns.PairGrid(iris, hue="species")
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter)
g.add_legend()

Figure: PairGrid

The upper triangle, the lower triangle, the diagonal, and the off-diagonal cells can each be plotted differently.

pairplot uses scatter plots and histograms by default, although a few other plot kinds are also supported.

sns.pairplot(footballers[['Overall', 'Potential', 'Value']])

Figure: pairplot

To stress it once more: FacetGrid and pairplot are both part of seaborn.


Lesson 8: Multivariate plotting


1. Multivariate scatter plot

sns.lmplot(x='Value', y='Overall', hue='Position', 
           data=footballers.loc[footballers['Position'].isin(['ST', 'RW', 'LW'])], 
           fit_reg=False)

Figure: Multivariate plot drawn with lmplot

There is nothing new here.

2. Grouped box plot

f = (footballers
         .loc[footballers['Position'].isin(['ST', 'GK'])]
         .loc[:, ['Value', 'Overall', 'Aggression', 'Position']]
    )
f = f[f["Overall"] >= 80]
f = f[f["Overall"] < 85]
f['Aggression'] = f['Aggression'].astype(float)

sns.boxplot(x="Overall", y="Aggression", hue='Position', data=f)

Figure: Grouped box plot 1

A grouped box plot can contain more than two categories. For example, if the LW position is also taken into account, the plot looks like this:

Figure: Grouped box plot 2

3. Correlation heatmap
Here a heatmap is used to draw a correlation plot.

import numpy as np

# Keep purely numeric strings, turn everything else into NaN, then correlate
f = (
    footballers.loc[:, ['Acceleration', 'Aggression', 'Agility', 'Balance', 'Ball control']]
        .applymap(lambda v: int(v) if str.isdecimal(v) else np.nan)
        .dropna()
).corr()

sns.heatmap(f, annot=True)

Figure: Correlation heatmap
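
The cleanup step before .corr() matters because the attribute columns are stored as strings, and some entries are not plain integers. As a sketch of an alternative to the applymap/lambda approach above (a swapped-in technique, not the original code), pd.to_numeric with errors="coerce" also turns unparseable entries into NaN. Note that it accepts a slightly wider set of inputs, such as decimals and negative numbers. The `raw` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical string-typed attribute columns; "65+3" is not a plain number.
raw = pd.DataFrame({
    "Acceleration": ["89", "70", "65+3", "80"],
    "Agility": ["90", "68", "70", "75"],
})

# Convert column by column; entries that cannot be parsed become NaN.
numeric = raw.apply(pd.to_numeric, errors="coerce").dropna()

# Pairwise Pearson correlation between the cleaned columns.
corr = numeric.corr()

print(corr)
```

The resulting square table is what sns.heatmap(corr, annot=True) would color and annotate.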

4. Parallel coordinates plot

from pandas.plotting import parallel_coordinates

f = (
    footballers.iloc[:, 12:17]
        .loc[footballers['Position'].isin(['ST', 'GK'])]
        .applymap(lambda v: int(v) if str.isdecimal(v) else np.nan)
        .dropna()
)
f['Position'] = footballers['Position']
f = f.sample(200)

parallel_coordinates(f, 'Position')

Figure: Parallel coordinates plot

Parallel coordinates plots have many variants and are widely used.

Reference: zsffuture

Reproduced from: https://www.jianshu.com/p/3f7a496956ee

Origin blog.csdn.net/weixin_34389926/article/details/91178176