Python Visualization 48 | The 11 Most Commonly Used Distribution Plots


This article covers the 11 most commonly used distribution plots.

Table of Contents

IV. Distribution Plots

21. Stacked Histogram for Continuous Variable

22. Stacked Histogram for Categorical Variable

23. Density Plot

24. Density Curves with Histogram

25. Joy Plot

26. Distributed Dot Plot

27. Box Plot

28. Box plot combined with dot plot (Dot + Box Plot)

29. Violin Plot

30. Population Pyramid

31. Categorical Plots


IV. Distribution Plots

21. Stacked Histogram for Continuous Variable

This plot shows the frequency distribution of a continuous variable, with each bar stacked by group.

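# Import libraries (shared by all examples in this article)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import Data
df = pd.read_csv("./datasets/mpg_ggplot2.csv")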

# Prepare data
x_var = 'displ'
groupby_var = 'class'
df_agg = df.loc[:, [x_var, groupby_var]].groupby(groupby_var)
# One list of values per group (loop variable renamed so it does not shadow df)
vals = [grp[x_var].values.tolist() for _, grp in df_agg]

# Draw
plt.figure(figsize=(10, 6), dpi=80)
colors = [plt.cm.Set1(i / float(len(vals) - 1)) for i in range(len(vals))]
n, bins, patches = plt.hist(vals,
                            30,
                            stacked=True,
                            density=False,
                            color=colors[:len(vals)])

# Decoration
# Legend labels follow the sorted group order produced by groupby
plt.legend(np.unique(df[groupby_var]).tolist())
plt.title(f"Stacked Histogram of ${x_var}$ colored by ${groupby_var}$",
          fontsize=22)
plt.xlabel(x_var)
plt.ylabel("Frequency")
#plt.ylim(0, 25)
plt.xticks(ticks=bins[::3], labels=[round(b, 1) for b in bins[::3]])
plt.show()
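
To make clear what the stacked histogram is counting, here is a minimal sketch in plain Python. It is not the matplotlib implementation: the function name stacked_counts and the data are made up for illustration, and the bins are simply equal-width intervals shared by all groups.

```python
from bisect import bisect_right


def stacked_counts(groups, n_bins):
    """Count values per bin for each group, using shared equal-width bins.

    `groups` maps a group name to a list of numbers. Returns the bin edges
    and, per group, a list of counts with one entry per bin.
    """
    all_values = [v for values in groups.values() for v in values]
    lo, hi = min(all_values), max(all_values)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]

    counts = {}
    for name, values in groups.items():
        per_bin = [0] * n_bins
        for v in values:
            # Bins are half-open [left, right); the last bin also
            # includes the maximum value
            idx = min(bisect_right(edges, v) - 1, n_bins - 1)
            per_bin[idx] += 1
        counts[name] = per_bin
    return edges, counts


# Made-up engine displacements for two hypothetical vehicle classes
groups = {"compact": [1.0, 1.5, 1.5, 2.5], "suv": [2.5, 3.5, 4.5, 5.0]}
edges, counts = stacked_counts(groups, n_bins=4)

# Height of each stacked bar = sum of the group counts in that bin
bar_heights = [sum(c[i] for c in counts.values()) for i in range(4)]
print(counts)
print(bar_heights)
```

Each group contributes one colored segment per bin, and the segments are drawn on top of each other, so the total bar height is the overall frequency in that bin.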

22. Stacked Histogram for Categorical Variable

This plot shows the frequency distribution of a categorical variable, with each bar stacked by group.

# Import Data
df = pd.read_csv("./datasets/mpg_ggplot2.csv")

# Prepare data
x_var = 'manufacturer'
groupby_var = 'class'
# Encode each category as an integer position so bars and tick labels
# stay aligned regardless of the order in which categories appear
categories = sorted(df[x_var].unique())
codes = {name: i for i, name in enumerate(categories)}
df_agg = df.loc[:, [x_var, groupby_var]].groupby(groupby_var)
vals = [grp[x_var].map(codes).tolist() for _, grp in df_agg]

# Draw: one bin per category, centered on its integer code
plt.figure(figsize=(10, 6), dpi=80)
colors = [plt.cm.Set1(i / float(len(vals) - 1)) for i in range(len(vals))]
bins = np.arange(len(categories) + 1) - 0.5
n, bins, patches = plt.hist(vals,
                            bins,
                            stacked=True,
                            density=False,
                            color=colors[:len(vals)])

# Decoration
plt.legend(np.unique(df[groupby_var]).tolist())
plt.title(f"Stacked Histogram of ${x_var}$ colored by ${groupby_var}$",
          fontsize=22)
plt.xlabel(x_var)
plt.ylabel("Frequency")
plt.ylim(0, 40)
plt.xticks(ticks=range(len(categories)),
           labels=categories,
           rotation=90)
plt.show()
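
For a categorical variable, the stacked bars are simply a cross-tabulation of two columns. The sketch below shows that counting step in plain Python; the rows are made up and only stand in for the manufacturer and class columns of the dataset.

```python
from collections import Counter

# Made-up (manufacturer, class) pairs standing in for rows of the dataset
rows = [
    ("audi", "compact"), ("audi", "compact"), ("audi", "midsize"),
    ("ford", "suv"), ("ford", "pickup"), ("ford", "suv"),
]

# Count how often each (category, group) combination occurs
pair_counts = Counter(rows)

categories = sorted({manufacturer for manufacturer, _ in rows})
groups = sorted({vehicle_class for _, vehicle_class in rows})

# One entry per category; each list item is the segment height for one group
table = {
    cat: [pair_counts[(cat, grp)] for grp in groups]
    for cat in categories
}
for cat, segment_heights in table.items():
    print(cat, dict(zip(groups, segment_heights)),
          "total:", sum(segment_heights))
```

Sorting both the categories and the groups up front gives a fixed order, which is what keeps bars, tick labels, and legend entries aligned.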

23. Density Plot

A density plot shows the distribution of a continuous variable as a smooth curve, here drawn separately for each cylinder count.

# Import Data
df = pd.read_csv("./datasets/mpg_ggplot2.csv")

# Set the style before drawing so it applies to this figure
sns.set(style="whitegrid", font_scale=1.1)

# Draw Plot: one filled density curve per cylinder count
plt.figure(figsize=(10, 8), dpi=80)
cyl_colors = {4: "#01a2d9", 5: "#dc2624", 6: "#C89F91", 8: "#649E7D"}
for cyl, color in cyl_colors.items():
    sns.kdeplot(df.loc[df['cyl'] == cyl, "cty"],
                fill=True,  # replaces the deprecated shade=True
                color=color,
                label=f"Cyl={cyl}",
                alpha=.7)

# Decoration
plt.title('Density Plot of City Mileage by n_Cylinders', fontsize=18)
plt.legend()
plt.show()
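
The curve in a density plot is a kernel density estimate: one small bump per observation, summed and normalized. The following is a minimal sketch of a Gaussian kernel estimate in plain Python, not seaborn's implementation. The data are made up, and the bandwidth of 1.5 is an arbitrary assumption (seaborn chooses a bandwidth automatically).

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples` at a point x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observation, centered on that observation
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


# Made-up city mileage values for one hypothetical group
cty = [18, 19, 21, 21, 22, 24, 26]
density = gaussian_kde(cty, bandwidth=1.5)

# Evaluate on a grid and check that the curve integrates to roughly 1
step = 0.1
grid = [10 + i * step for i in range(251)]  # 10.0 .. 35.0
area = sum(density(x) * step for x in grid)
print(round(area, 3))
```

A larger bandwidth gives a smoother, flatter curve; a smaller one follows individual observations more closely.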

24. Density Curves with Histogram

# Import Data
df = pd.read_csv("./datasets/mpg_ggplot2.csv")

# Set the style before drawing so it applies to this figure
sns.set(style="whitegrid", font_scale=1.1)

# Draw Plot: histogram plus density curve per vehicle class
# (sns.distplot is deprecated; histplot with kde=True replaces it)
plt.figure(figsize=(10, 8), dpi=80)
class_styles = {'compact': ("#01a2d9", "Compact"),
                'suv': ("#dc2624", "SUV"),
                'minivan': ("#C89F91", "Minivan")}
for vehicle_class, (color, label) in class_styles.items():
    sns.histplot(df.loc[df['class'] == vehicle_class, "cty"],
                 kde=True,
                 stat="density",
                 color=color,
                 label=label,
                 alpha=.7,
                 line_kws={'linewidth': 3})
plt.ylim(0, 0.35)

# Decoration
plt.title('Density Plot of City Mileage by Vehicle Type', fontsize=18)
plt.legend()
plt.show()

25. Joy Plot

A joy plot draws one density curve per group and lets the curves overlap, which makes it easy to compare the distributions of a large number of groups. It is often more vivid than a heatmap.

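# Install the dependency (notebook syntax; run "pip install joypy" in a shell otherwise)
!pip install joypy
# joypy draws a kernel density plot for each group; R has ggjoy for this
import joypy
# Import Data
mpg = pd.read_csv("./datasets/mpg_ggplot2.csv")

# Draw Plot (joyplot creates its own figure, so no plt.figure call is needed)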
fig, axes = joypy.joyplot(mpg,
                          column=['hwy', 'cty'],
                          by="class",
                          ylim='own',
                          colormap=plt.cm.Set1,
                          figsize=(10, 6))

# Decoration
plt.title('Joy Plot of City and Highway Mileage by Class', fontsize=18)
plt.show()

26. Distributed Dot Plot

A distributed dot plot shows the univariate distribution of points, split by group. The points are semi-transparent, so regions where many points overlap look more solid. Coloring each group's median differently makes its typical value immediately obvious.

import matplotlib.patches as mpatches

# Prepare Data
df_raw = pd.read_csv("./datasets/mpg_ggplot2.csv")
cyl_colors = {4: 'tab:red', 5: 'tab:green', 6: 'tab:blue', 8: 'tab:orange'}
df_raw['cyl_color'] = df_raw.cyl.map(cyl_colors)

# Mean and Median city mileage by make
# (select the numeric column first; averaging the string column raises an
# error in recent pandas versions)
df = df_raw.groupby('manufacturer')[['cty']].mean()
df.sort_values('cty', ascending=False, inplace=True)
df.reset_index(inplace=True)
df_median = df_raw.groupby('manufacturer')[['cty']].median()

# Draw horizontal lines
fig, ax = plt.subplots(figsize=(11, 7), dpi=80)
ax.hlines(y=df.index,
          xmin=0,
          xmax=40,
          color='#01a2d9',
          alpha=0.5,
          linewidth=.5,
          linestyles='dashdot')

# Draw the Dots
for i, make in enumerate(df.manufacturer):
    df_make = df_raw.loc[df_raw.manufacturer == make, :]
    ax.scatter(y=np.repeat(i, df_make.shape[0]),
               x='cty',
               data=df_make,
               s=75,
               edgecolors='#01a2d9',
               c='w',
               alpha=0.5)
    ax.scatter(y=i,
               x='cty',
               data=df_median.loc[df_median.index == make, :],
               s=75,
               c='#dc2624')

# Annotate
ax.text(33,
        13,
        r"$red \; dots \; are \; the \: median$",  # raw string keeps the mathtext backslashes
        fontdict={'size': 12},
        color='#dc2624')

# Decorations
red_patch = plt.plot([], [],
                     marker="o",
                     ms=10,
                     ls="",
                     mec=None,
                     color='#dc2624',
                     label="Median")
plt.legend(handles=red_patch)
ax.set_title('Distribution of City Mileage by Make', fontdict={'size': 18})
ax.set_xlabel('Miles Per Gallon (City)')
ax.set_yticks(df.index)
ax.set_yticklabels(df.manufacturer.str.title(),
                   fontdict={'horizontalalignment': 'right'})
ax.set_xlim(1, 40)
plt.gca().spines["top"].set_visible(False)
plt.gca().spines["bottom"].set_visible(False)
plt.gca().spines["right"].set_visible(False)
plt.gca().spines["left"].set_visible(False)
plt.grid(axis='both', alpha=.4, linewidth=.1)
plt.show()
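
The row order and the red markers in this plot come from two simple per-group statistics. As a rough illustration in plain Python, with made-up readings for three hypothetical manufacturers, the groups can be ordered by mean and their medians computed like this:

```python
from statistics import mean, median

# Made-up city mileage readings for three hypothetical manufacturers
cty_by_make = {
    "honda": [24, 25, 26, 28, 28],
    "ford": [11, 13, 14, 15, 18],
    "audi": [15, 17, 18, 18, 21],
}

# Order the groups by mean, highest first, as the plotting code sorts them
ordered = sorted(cty_by_make, key=lambda make: mean(cty_by_make[make]),
                 reverse=True)

for row, make in enumerate(ordered):
    values = cty_by_make[make]
    # `row` is the y position of this group; the median gets the red dot
    print(row, make, "mean:", mean(values), "median:", median(values))
```

The mean decides where a group sits on the axis, while the median is the value highlighted within that row, so the two can differ when a group is skewed.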

27. Box Plot

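A box plot summarizes a distribution compactly: the box spans the quartiles, the line inside marks the median, and points beyond the whiskers are drawn as outliers.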

# Import Data
df = pd.read_csv("./datasets/mpg_ggplot2.csv")

# Set the style before drawing so it applies to this figure
sns.set(style="whitegrid", font_scale=1.1)

# Draw Plot
plt.figure(figsize=(10, 6), dpi=80)
sns.boxplot(
    x='class',
    y='hwy',
    data=df,
    notch=False,
    palette="Set1",
)


# Add N Obs inside boxplot (optional)
def add_n_obs(df, group_col, y):
    # Look up both statistics by group label so each annotation
    # matches the box it is drawn on, whatever the axis order
    medians = df.groupby(group_col)[y].median()
    n_obs = df.groupby(group_col)[y].size()
    xticklabels = [x.get_text() for x in plt.gca().get_xticklabels()]
    for x, xticklabel in enumerate(xticklabels):
        plt.text(x,
                 medians[xticklabel] * 1.01,
                 "#obs : " + str(n_obs[xticklabel]),
                 horizontalalignment='center',
                 fontdict={'size': 12},
                 color='black')


add_n_obs(df, group_col='class', y='hwy')

# Decoration
plt.title('Box Plot of Highway Mileage by Vehicle Class', fontsize=16)
plt.ylim(10, 40)
plt.show()
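
The numbers behind a single box can be computed by hand. The sketch below uses only the standard library and made-up data; the helper name box_stats is hypothetical. It assumes the common convention of whiskers reaching at most 1.5 times the interquartile range beyond the box.

```python
from statistics import quantiles


def box_stats(values, whisker=1.5):
    """Quartiles, whisker ends, and outliers for one box."""
    data = sorted(values)
    # method="inclusive" interpolates linearly between data points
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1, "median": med, "q3": q3,
        # Whiskers stop at the most extreme points still inside the fences
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


# Made-up highway mileage values with one extreme observation
hwy = [20, 22, 23, 24, 25, 26, 27, 29, 30, 44]
stats = box_stats(hwy)
print(stats)
```

Here the value 44 lies beyond the upper fence, so it would be drawn as an individual outlier point rather than stretching the whisker.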

28. Box plot combined with dot plot (Dot + Box Plot)

This plot overlays the individual observations on the box plot, so the raw points that each box summarizes remain visible.

# Import Data
df = pd.read_csv("./datasets/mpg_ggplot2.csv")

# Draw Plot
plt.figure(figsize=(13, 10), dpi=80)
sns.boxplot(
    x='class',
    y='hwy',
    data=df,
    hue='cyl',
    palette="Set1",
)
plt.legend(loc=9)
sns.stripplot(x='class',
              y='hwy',
              data=df,
              color='#dc2624',
              size=5,
              jitter=1)

for i in range(len(df['class'].unique()) - 1):
    plt.vlines(i + .5, 10, 45, linestyles='solid', colors='gray', alpha=0.2)

# Decoration
plt.title('Box Plot of Highway Mileage by Vehicle Class', fontsize=18)

plt.show()

29. Violin Plot

A violin plot is a visually appealing alternative to the box plot, though it is less commonly used. The width of the violin at a given value reflects how much of the data lies near that value.

# Import Data
df = pd.read_csv("./datasets/mpg_ggplot2.csv")

# Draw Plot
plt.figure(figsize=(13, 10), dpi=80)
sns.violinplot(x='class',
               y='hwy',
               data=df,
               scale='width',  # named density_norm in seaborn >= 0.13
               palette='Set1',
               inner='quartile')

# Decoration
plt.title('Violin Plot of Highway Mileage by Vehicle Class', fontsize=18)
plt.show()

30. Population Pyramid

A population pyramid can be read as a sorted, grouped horizontal bar plot. It shows clearly how groups differ, and it can also visualize step-by-step filtering, such as the stages of a funnel.

# Read data
df = pd.read_csv("./datasets/email_campaign_funnel.csv")

# Draw Plot
plt.figure(figsize=(12, 8), dpi=80)
group_col = 'Gender'
order_of_bars = df.Stage.unique()[::-1]
colors = [
    plt.cm.Set1(i / float(len(df[group_col].unique()) - 1))
    for i in range(len(df[group_col].unique()))
]


for c, group in zip(colors, df[group_col].unique()):
    sns.barplot(x='Users',
                y='Stage',
                data=df.loc[df[group_col] == group, :],
                order=order_of_bars,
                color=c,
                label=group)

# Decorations
plt.xlabel("$Users$")
plt.ylabel("Stage of Purchase")
plt.yticks(fontsize=12)
plt.title("Population Pyramid of the Marketing Funnel", fontsize=18)
plt.legend()
plt.show()

31. Categorical Plots

Categorical plots show the distributions of two or more related categorical variables, such as their counts. In seaborn they are drawn as faceted plots with catplot, one panel per category.

# Load Dataset
titanic = pd.read_csv('./datasets/titanic.csv')
# Plot
g = sns.catplot(x="alive",  # pass x by keyword; recent seaborn rejects it positionally
                col="deck",
                col_wrap=4,
                data=titanic[titanic.deck.notnull()],
                kind="count",
                height=3.5,
                aspect=.8,
                palette='Set1')

plt.show()
# Plot
sns.catplot(x="age",
            y="embark_town",
            hue="sex",
            col="class",
            data=titanic[titanic.embark_town.notnull()],
            orient="h",
            height=5,
            aspect=1,
            palette="Set1",
            kind="violin",
            dodge=True,
            cut=0,
            bw=.2)

plt.show()

Origin blog.csdn.net/qq_21478261/article/details/113750415