Data Analysis: A Summary of Common Seaborn Plots | Plotting Methods for Finding Trends, Relationships, and Distributions | 20-Minute Crash Course | Kaggle Learning (Part 1)

  • Trends

    - A trend is defined as a pattern of change.

    • sns.lineplot - Line charts are best to show trends over a period of time, and multiple lines can be used to show trends in more than one group.
  • Relationship

    - There are many different chart types that you can use to understand relationships between variables in your data.

    • sns.barplot - Bar charts are useful for comparing quantities corresponding to different groups.
    • sns.heatmap - Heatmaps can be used to find color-coded patterns in tables of numbers.
    • sns.scatterplot - Scatter plots show the relationship between two continuous variables; if color-coded, we can also show the relationship with a third categorical variable.
    • sns.regplot - Including a regression line in the scatter plot makes it easier to see any linear relationship between two variables.
    • sns.lmplot - This command is useful for drawing multiple regression lines, if the scatter plot contains multiple, color-coded groups.
    • sns.swarmplot - Categorical scatter plots show the relationship between a continuous variable and a categorical variable.
  • Distribution

    - We visualize distributions to show the possible values that we can expect to see in a variable, along with how likely they are.

    • sns.distplot - Histograms show the distribution of a single numerical variable.
    • sns.kdeplot - KDE plots (or 2D KDE plots) show an estimated, smooth distribution of a single numerical variable (or two numerical variables).
    • sns.jointplot - This command is useful for simultaneously displaying a 2D KDE plot with the corresponding KDE plots for each individual variable.

1. Line Chart

import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Path of the file to read
spotify_filepath = "../input/spotify.csv"

# Read the file into a variable spotify_data
spotify_data = pd.read_csv(spotify_filepath, index_col="Date", parse_dates=True)

spotify_data.tail()
            Shape of You  Despacito  Something Just Like This    HUMBLE.  Unforgettable
Date
2018-01-05       4492978  3450315.0                 2408365.0  2685857.0      2869783.0
2018-01-06       4416476  3394284.0                 2188035.0  2559044.0      2743748.0
2018-01-07       4009104  3020789.0                 1908129.0  2350985.0      2441045.0
2018-01-08       4135505  2755266.0                 2023251.0  2523265.0      2622693.0
2018-01-09       4168506  2791601.0                 2058016.0  2727678.0      2627334.0
# Line chart showing daily global streams of each song
sns.lineplot(data=spotify_data)
<matplotlib.axes._subplots.AxesSubplot at 0x7fc8b2bb6f98>

# Set the width and height of the figure
# sets the size of the figure to 14 inches (in width) by 6 inches (in height)
plt.figure(figsize=(14, 6))

# Add title
plt.title("Daily Global Streams of Popular Songs in 2017-2018")

# Line chart showing daily global streams of each song
sns.lineplot(data=spotify_data)
<matplotlib.axes._subplots.AxesSubplot at 0x7fc8b2a74780>

Changing styles

# Seaborn has five built-in themes: (1) "darkgrid", (2) "whitegrid", (3) "dark", (4) "white", and (5) "ticks"
# Change the style of the figure to the "dark" theme
sns.set_style("dark")

# Line chart 
plt.figure(figsize=(12,6))
sns.lineplot(data=spotify_data)
<matplotlib.axes._subplots.AxesSubplot at 0x7f5faa4bc828>
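To see what separates the five themes without drawing anything, the settings each one would apply can be inspected. A small sketch using `sns.axes_style`, which returns a theme's parameters as a dictionary; the choice to look only at the `axes.grid` entry is an illustrative one.

```python
import seaborn as sns

STYLES = ["darkgrid", "whitegrid", "dark", "white", "ticks"]

# axes_style(name) returns the parameters of a theme without applying them
grid_by_style = {name: sns.axes_style(name)["axes.grid"] for name in STYLES}

for name, has_grid in grid_by_style.items():
    print(f"{name:>9}: grid lines {'on' if has_grid else 'off'}")
```

`sns.set_style` changes the theme for every later plot. For a one-off figure, `sns.axes_style(...)` can also be used in a `with` block, which limits the theme to the plots created inside it.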

Plot a subset of the data

# Set the width and height of the figure
plt.figure(figsize=(14,6))

# Add title
plt.title("Daily Global Streams of Popular Songs in 2017-2018")

# Line chart showing daily global streams of 'Shape of You'
sns.lineplot(data=spotify_data['Shape of You'], label="Shape of You")

# Line chart showing daily global streams of 'Despacito'
sns.lineplot(data=spotify_data['Despacito'], label="Despacito")

# Add label for horizontal axis
plt.xlabel("Date")

2. Bar Charts

# Print the data
flight_data
              AA         AS         B6         DL         EV         F9         HA         MQ         NK         OO         UA         US         VX         WN
Month
    1   6.955843  -0.320888   7.347281  -2.043847   8.537497  18.357238   3.512640  18.164974  11.398054  10.889894   6.352729   3.107457   1.420702   3.389466
    2   7.530204  -0.782923  18.657673   5.614745  10.417236  27.424179   6.029967  21.301627  16.474466   9.588895   7.260662   7.114455   7.784410   3.501363
    3   6.693587  -0.544731  10.741317   2.077965   6.730101  20.074855   3.468383  11.018418  10.039118   3.181693   4.892212   3.330787   5.348207   3.263341
    4   4.931778  -3.009003   2.780105   0.083343   4.821253  12.640440   0.011022   5.131228   8.766224   3.223796   4.376092   2.660290   0.995507   2.996399
    5   5.173878  -1.716398  -0.709019   0.149333   7.724290  13.007554   0.826426   5.466790  22.397347   4.141162   6.827695   0.681605   7.102021   5.680777
    6   8.191017  -0.220621   5.047155   4.419594  13.952793  19.712951   0.882786   9.639323  35.561501   8.338477  16.932663   5.766296   5.779415  10.743462
    7   3.870440   0.377408   5.841454   1.204862   6.926421  14.464543   2.001586   3.980289  14.352382   6.790333  10.262551        NaN   7.135773  10.504942
    8   3.193907   2.503899   9.280950   0.653114   5.154422   9.175737   7.448029   1.896565  20.519018   5.606689   5.014041        NaN   5.106221   5.532108
    9  -1.432732  -1.813800   3.539154  -3.703377   0.851062   0.978460   3.696915  -2.167268   8.000101   1.530896  -1.794265        NaN   0.070998  -1.336260
   10  -0.580930  -2.993617   3.676787  -5.011516   2.303760   0.082127   0.467074  -3.735054   6.810736   1.750897  -2.456542        NaN   2.254278  -0.688851
   11   0.772630  -1.916516   1.418299  -3.175414   4.415930  11.164527  -2.719894   0.220061   7.543881   4.925548   0.281064        NaN   0.116370   0.995684
   12   4.149684  -1.846681  13.839290   2.504595   6.685176   9.346221  -1.706475   0.662486  12.733123  10.947612   7.012079        NaN  13.498720   6.720893
# Set the width and height of the figure
plt.figure(figsize=(10,6))

# Add title
plt.title("Average Arrival Delay for Spirit Airlines Flights, by Month")

# Bar chart showing average arrival delay for Spirit Airlines flights by month
sns.barplot(x=flight_data.index, y=flight_data['NK'])

# Add label for vertical axis
plt.ylabel("Arrival delay (in minutes)")
Text(0, 0.5, 'Arrival delay (in minutes)')
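Since `flight_data` is never loaded in this article, here is a self-contained sketch of the same bar chart. It assumes only the twelve `NK` values copied from the table printed above; the variable names `nk_delay` and `worst_month` are made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so no notebook or display is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# NK column copied from the printed flight_data table
nk_delay = pd.Series(
    [11.398054, 16.474466, 10.039118, 8.766224, 22.397347, 35.561501,
     14.352382, 20.519018, 8.000101, 6.810736, 7.543881, 12.733123],
    index=pd.Index(range(1, 13), name="Month"),
    name="NK",
)

plt.figure(figsize=(10, 6))
plt.title("Average Arrival Delay for Spirit Airlines Flights, by Month")

# x takes the group labels (months), y takes the bar heights
ax = sns.barplot(x=nk_delay.index, y=nk_delay)
ax.set_ylabel("Arrival delay (in minutes)")

# The tallest bar can be cross-checked numerically
worst_month = nk_delay.idxmax()
print(f"Longest average delay: month {worst_month} ({nk_delay.max():.1f} min)")
```

Checking the maximum with pandas is a cheap way to confirm what the chart suggests: June stands out for this carrier.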

3. Heatmap

# Set the width and height of the figure
plt.figure(figsize=(14,7))

# Add title
plt.title("Average Arrival Delay for Each Airline, by Month")

# Heatmap showing average arrival delay for each airline by month
sns.heatmap(data=flight_data, annot=True)

# Add label for horizontal axis
plt.xlabel("Airline")
Text(0.5, 42.0, 'Airline')
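The table above contains `NaN` entries in the `US` column. The sketch below, assuming a 3x3 excerpt copied from that table (months 6 to 8, columns `NK`, `UA`, `US`), shows that `sns.heatmap` still draws such data; as far as seaborn documents it, cells with missing values are masked and left blank. The `fmt=".1f"` argument is an optional addition that shortens the annotations.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so no notebook or display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Months 6-8 of the NK, UA and US columns, copied from the printed table
delays = pd.DataFrame(
    {
        "NK": [35.561501, 14.352382, 20.519018],
        "UA": [16.932663, 10.262551, 5.014041],
        "US": [5.766296, np.nan, np.nan],
    },
    index=pd.Index([6, 7, 8], name="Month"),
)

plt.figure(figsize=(6, 3))
# annot=True writes each value into its cell; fmt controls the number format
ax = sns.heatmap(data=delays, annot=True, fmt=".1f")
ax.set_xlabel("Airline")

# Count the cells that have no value to display
missing_cells = int(delays.isna().sum().sum())
print(f"Cells without data: {missing_cells}")
```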

4. Scatter Plots

insurance_data.head()
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520
sns.scatterplot(x=insurance_data['bmi'], y=insurance_data['charges'])
<matplotlib.axes._subplots.AxesSubplot at 0x7f44f2300048>

# Add a regression line
sns.regplot(x=insurance_data['bmi'], y=insurance_data['charges'])
<matplotlib.axes._subplots.AxesSubplot at 0x7f44f222c588>
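By default the line that `sns.regplot` adds is an ordinary least-squares straight-line fit. The sketch below checks this against `np.polyfit`, assuming only the five rows shown by `insurance_data.head()`; five points are far too few to say anything about the full dataset, so the slope printed here is purely illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so no notebook or display is needed
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# bmi and charges for the five rows shown by insurance_data.head()
bmi = np.array([27.900, 33.770, 33.000, 22.705, 28.880])
charges = np.array([16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520])

# Least-squares straight line: charges is modeled as slope * bmi + intercept
slope, intercept = np.polyfit(bmi, charges, deg=1)

plt.figure()
# ci=None skips the shaded confidence band, leaving points and the fitted line
ax = sns.regplot(x=bmi, y=charges, ci=None)

# Recover the slope of the drawn line from its two end points
line_x, line_y = ax.lines[0].get_data()
drawn_slope = (line_y[-1] - line_y[0]) / (line_x[-1] - line_x[0])
print(f"polyfit slope: {slope:.1f}, regplot slope: {drawn_slope:.1f}")
```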

Color-coded scatter plots

# Color-code the points by 'smoker' and plot the other two columns ('bmi', 'charges') on the axes
sns.scatterplot(x=insurance_data['bmi'], y=insurance_data['charges'], hue=insurance_data['smoker'])
<matplotlib.axes._subplots.AxesSubplot at 0x7f44f19b49e8>

# add two regression lines, corresponding to smokers and nonsmokers
# Instead of setting x=insurance_data['bmi'] to select the 'bmi' column in insurance_data, we set x="bmi" to specify the name of the column only.
# Similarly, y="charges" and hue="smoker" also contain the names of columns.
# We specify the dataset with data=insurance_data.

sns.lmplot(x="bmi", y="charges", hue="smoker", data=insurance_data)
<seaborn.axisgrid.FacetGrid at 0x7f44f192d668>
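The column-name interface described in the comments above can be tried on synthetic data. A sketch, assuming a made-up `toy_insurance` table whose values are generated from an invented rule (charges grow with bmi, more steeply for smokers); only the column names are borrowed from `insurance_data`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so no notebook or display is needed
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for insurance_data (made-up values, same column names)
rng = np.random.default_rng(0)
n = 40
toy_insurance = pd.DataFrame(
    {
        "bmi": rng.uniform(18, 40, size=n),
        "smoker": np.where(np.arange(n) % 2 == 0, "yes", "no"),
    }
)

# Invented rule: charges rise with bmi, more steeply for smokers, plus noise
steepness = np.where(toy_insurance["smoker"] == "yes", 1200.0, 300.0)
toy_insurance["charges"] = steepness * toy_insurance["bmi"] + rng.normal(0, 2000, size=n)

# Column names are given as strings and the table is passed through data=
grid = sns.lmplot(x="bmi", y="charges", hue="smoker", data=toy_insurance)
print(type(grid).__name__)
```

As the output line above also shows, `sns.lmplot` returns a `FacetGrid` rather than an `Axes`, so follow-up tweaks go through the grid (for example `grid.set_axis_labels(...)`) or through `grid.ax` when there is a single panel.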

sns.swarmplot(x=insurance_data['smoker'],
              y=insurance_data['charges'])

5. Histograms

iris_data.head()
    Sepal Length (cm)  Sepal Width (cm)  Petal Length (cm)  Petal Width (cm)      Species
Id
1                 5.1               3.5                1.4               0.2  Iris-setosa
2                 4.9               3.0                1.4               0.2  Iris-setosa
3                 4.7               3.2                1.3               0.2  Iris-setosa
4                 4.6               3.1                1.5               0.2  Iris-setosa
5                 5.0               3.6                1.4               0.2  Iris-setosa
# 'a=' selects the column of the data to plot
# kde=False is passed whenever we want a plain histogram; leaving it out overlays a KDE curve on the bars.
sns.distplot(a=iris_data['Petal Length (cm)'], kde=False)
<matplotlib.axes._subplots.AxesSubplot at 0x7f96c5b1da20>

Color-coded plots

# Histograms for each species
sns.distplot(a=iris_set_data['Petal Length (cm)'], label="Iris-setosa", kde=False)
sns.distplot(a=iris_ver_data['Petal Length (cm)'], label="Iris-versicolor", kde=False)
sns.distplot(a=iris_vir_data['Petal Length (cm)'], label="Iris-virginica", kde=False)

# Add title
plt.title("Histogram of Petal Lengths, by Species")

# Force legend to appear
plt.legend()
<matplotlib.legend.Legend at 0x7f96c5849470>

6. Density plots

# A kernel density estimate (KDE) plot is like a smoothed histogram
# shade=True colors the area below the curve
sns.kdeplot(data=iris_data['Petal Length (cm)'], shade=True)
<matplotlib.axes._subplots.AxesSubplot at 0x7f96c5a664e0>
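What "smoothed histogram" means can be made concrete with a few lines of NumPy: a Gaussian KDE places a small bell curve on every sample and averages them. This is a sketch of the general technique, not of seaborn's internals; it assumes the five petal lengths printed in the Histograms section and picks the kernel width with Scott's rule of thumb. The helper name `gaussian_kde_1d` is made up for the example.

```python
import numpy as np

# Petal lengths of the five rows printed in the Histograms section
samples = np.array([1.4, 1.4, 1.3, 1.5, 1.4])

# Scott's rule of thumb for the kernel width (bandwidth)
bandwidth = samples.std(ddof=1) * len(samples) ** (-1 / 5)


def gaussian_kde_1d(grid, samples, bandwidth):
    """Average of Gaussian bumps, one centred on each sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)


grid = np.linspace(1.0, 1.8, 801)
density = gaussian_kde_1d(grid, samples, bandwidth)

# A density should integrate to about 1 (simple Riemann sum here)
area = density.sum() * (grid[1] - grid[0])
peak = grid[density.argmax()]
print(f"area under curve: {area:.3f}, peak near {peak:.2f} cm")
```

A larger bandwidth gives a smoother, flatter curve; a smaller one follows the individual samples more closely.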

# 2D KDE plot
sns.jointplot(x=iris_data['Petal Length (cm)'], y=iris_data['Sepal Width (cm)'], kind="kde")
<seaborn.axisgrid.JointGrid at 0x7f96c59cbef0>

The color-coding shows us how likely we are to see different combinations of sepal width and petal length, where darker parts of the figure are more likely.

In addition to the 2D KDE plot in the center,

  • the curve at the top of the figure is a KDE plot for the data on the x-axis (in this case, iris_data['Petal Length (cm)']), and
  • the curve on the right of the figure is a KDE plot for the data on the y-axis (in this case, iris_data['Sepal Width (cm)']).

Color-coded plots

# KDE plots for each species
sns.kdeplot(data=iris_set_data['Petal Length (cm)'], label="Iris-setosa", shade=True)
sns.kdeplot(data=iris_ver_data['Petal Length (cm)'], label="Iris-versicolor", shade=True)
sns.kdeplot(data=iris_vir_data['Petal Length (cm)'], label="Iris-virginica", shade=True)

# Add title
plt.title("Distribution of Petal Lengths, by Species")
Text(0.5, 1.0, 'Distribution of Petal Lengths, by Species')

Origin blog.csdn.net/SanyHo/article/details/105171231