Data Science Essentials: A Beginner's Guide to Pandas Data Visualization

Whether you are just getting to know a dataset or preparing to publish your analysis results, visualization is an essential tool. pandas, Python's popular data analysis library, provides several plotting options through **.plot()**. Even if you are at the beginning of your pandas journey, you can quickly create basic plots that yield valuable insights into your data.

Data preparation

The data on college majors comes from the American Community Survey 2010–2012 Public Use Microdata Sample.

import pandas as pd
df = pd.read_csv(
    "数据科学必备 Pandas 数据可视化初学者指南/recent-grads.csv",
)
df.head()
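
Before plotting, it can help to confirm which columns were loaded and that the numeric ones were parsed as numbers. The sketch below is only an illustration: it reads a tiny made-up CSV held in memory instead of recent-grads.csv, so the rows are hypothetical and are not values from the dataset. Only the column names match the ones used in this article.

```python
import io

import pandas as pd

# Hypothetical stand-in for recent-grads.csv: made-up rows that reuse the
# column names referenced later in this article.
csv_text = """Rank,Major,Major_category,Total,P25th,Median,P75th,Unemployment_rate
1,MAJOR A,Engineering,2000,90000,100000,120000,0.02
2,MAJOR B,Engineering,750,50000,70000,90000,0.10
3,MAJOR C,Business,8000,40000,60000,65000,0.05
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())   # first rows, as df.head() shows for the real file
print(sample.dtypes)   # income columns should be numeric, not text
print(sample.shape)    # (number of rows, number of columns)
```

The same three calls work unchanged on the df loaded from the real file.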


Create Pandas plots

The dataset contains some columns related to the earnings of graduates of each major:

  • "Median" is the median income for full-time, year-round workers.
  • "P25th" is the 25th percentile of income.
  • "P75th" is the 75th percentile of income.
  • "Rank" is the major's rank by median income.

.plot() returns a line graph with data for each row in the DataFrame. The x-axis values represent the rank of each major, and the "P25th", "Median" and "P75th" values are plotted on the y-axis.

import matplotlib.pyplot as plt
df.plot(x="Rank", y=["P25th", "Median", "P75th"])
plt.show()

Several things are apparent from the plot at a glance:

  • Median income falls as the rank goes down. That is to be expected, since the rank is determined by median income.
  • Some majors have a wide gap between the 25th and 75th percentiles. People with these degrees may earn significantly less or significantly more than the median income.
  • For other majors the gap between the 25th and 75th percentiles is very small. The salaries of people with these degrees are very close to the median income.
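
The "gap" in the list above can also be put into numbers by subtracting the two percentile columns. This is a minimal sketch on made-up values (not taken from recent-grads.csv); the column name "Spread" is a hypothetical helper introduced here for illustration.

```python
import pandas as pd

# Made-up values for illustration only; not taken from recent-grads.csv.
sample = pd.DataFrame(
    {
        "Rank": [1, 2, 3, 4],
        "P25th": [90000, 50000, 58000, 30000],
        "Median": [100000, 70000, 60000, 40000],
        "P75th": [120000, 90000, 62000, 55000],
    }
)

# Width of the middle 50% of earners for each major ("Spread" is a
# hypothetical column name, not part of the dataset).
sample["Spread"] = sample["P75th"] - sample["P25th"]

# Majors sorted from the widest to the narrowest gap.
widest_first = sample.sort_values("Spread", ascending=False)
print(widest_first[["Rank", "Spread"]])
```

Applied to the real df, the same subtraction would show which majors have the widest and narrowest income ranges.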

.plot() has several optional parameters. The kind parameter accepts 11 different string values and determines what kind of plot you will create:

  1. "area" is used for area charts.
  2. "bar" is used for vertical bar charts.
  3. "barh" is used for horizontal bar charts.
  4. "box" is used for box plots.
  5. "hexbin" is used for hexbin plots.
  6. "hist" is used for histograms.
  7. "kde" is used for kernel density estimation plots.
  8. "density" is an alias for "kde".
  9. "line" is used for line charts.
  10. "pie" is used for pie charts.
  11. "scatter" is used for scatter plots.
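
Each of these kinds is also available as a method on the .plot accessor, so kind="bar" and .plot.bar() should produce the same chart. The sketch below illustrates this on made-up values; the Agg backend is selected only so the example runs without opening a window, which is an assumption about the environment and not something the article requires.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only.
sample = pd.DataFrame({"Rank": [1, 2, 3], "Median": [100000, 70000, 60000]})

# Passing kind="bar" ...
ax_kind = sample.plot(x="Rank", y="Median", kind="bar")

# ... and calling the accessor method of the same name.
ax_method = sample.plot.bar(x="Rank", y="Median")

# Both calls return a Matplotlib Axes; count the bars drawn on each.
n_kind = len(ax_kind.patches)
n_method = len(ax_method.patches)
print(n_kind, n_method)

plt.close("all")
```

Which spelling to use is mostly a matter of taste; the kind string is convenient when the plot type is chosen at run time.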

Deep dive into Matplotlib

When you call **.plot()** on a DataFrame object, Matplotlib creates the plot behind the scenes.

To verify this, build the same plot directly with Matplotlib. First import the matplotlib.pyplot module under the name plt. Then call plt.plot(), passing the "Rank" column of the DataFrame object as the first argument and the "P75th" column as the second.

import matplotlib.pyplot as plt

plt.plot(df["Rank"], df["P75th"])
[<matplotlib.lines.Line2D at 0x7f859928fbb0>]

This draws a line graph with df["Rank"] on the x-axis and df["P75th"] on the y-axis.
You can create the exact same graph using the DataFrame object's .plot() method.

df.plot(x="Rank", y="P75th")
<AxesSubplot:xlabel='Rank'>
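
Because the object returned by .plot() is a regular Matplotlib Axes, ordinary Matplotlib methods can be used to customize it. The following is a rough sketch on made-up values; the title, the axis label and the dashed mean line are illustrative additions, not part of the original example.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only.
sample = pd.DataFrame({"Rank": [1, 2, 3], "P75th": [120000, 90000, 62000]})

# pandas draws the line and hands back the Matplotlib Axes it used.
ax = sample.plot(x="Rank", y="P75th")

# From here on, these are plain Matplotlib calls on that Axes.
ax.set_title("75th percentile of income by rank")
ax.set_ylabel("Income (USD)")
ax.axhline(sample["P75th"].mean(), linestyle="--")  # dashed line at the mean

n_lines = len(ax.lines)  # the plotted series plus the mean line
plt.close("all")
```

This is a convenient middle ground: pandas handles the data-to-plot mapping and Matplotlib handles the fine tuning.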


Description and inspection of data

Distributions and Histograms

DataFrame is not the only class in pandas with a .plot() method. Series objects provide similar functionality, and you can treat each column of a DataFrame as a Series object.

The following example creates a histogram from the "Median" column of the DataFrame built from the college majors data.

median_column = df["Median"]

type(median_column)
pandas.core.series.Series

median_column.plot(kind="hist")
<AxesSubplot:ylabel='Frequency'>

The histogram shows the data divided into 10 bins ranging from $20,000 to $120,000, each with a width of $10,000. Its shape differs from that of a normal distribution, which is a symmetrical bell with a peak in the middle.
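
The number of bins can be changed with the bins argument, and the bin edges can be inspected with NumPy. The sketch below uses made-up incomes, so its edges will not match the $20,000 to $120,000 range described above. np.histogram is used here only as a way to look at equal-width bin edges; that it mirrors what the plot draws is an assumption that holds for a single Series with default settings.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up incomes for illustration only.
median_sample = pd.Series(
    [22000, 30000, 33000, 35000, 36000, 40000, 45000, 52000, 60000, 110000],
    name="Median",
)

# Ten equal-width bins between the smallest and largest value.
counts, edges = np.histogram(median_sample, bins=10)
print(edges)   # 11 edges delimit 10 bins
print(counts)  # number of values falling into each bin

# The bins argument is passed through to the histogram; use 5 wider bins.
ax = median_sample.plot(kind="hist", bins=5)
n_bars = len(ax.patches)
plt.close("all")
```

Trying a few different bin counts is a cheap way to check whether a feature of the histogram, such as a lonely bar on the right, is real or an artifact of the binning.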

Outlier detection

Outliers are sample points whose values deviate significantly from the rest of the sample. Outlier analysis means finding these points and then examining them.

Question: a few majors may pay far more than the rest. How can such data points be detected?

Such outliers can be detected using a histogram.

Create a new DataFrame named top_5 that contains the five majors with the highest median income.

top_5 = df.sort_values(by="Median", ascending=False).head(5)

Create a bar chart that shows only the median salaries of these five majors.

top_5.plot(x="Major", y="Median", kind="bar", rot=5, fontsize=4)
<AxesSubplot:xlabel='Major'>

The median salary for petroleum engineering majors is more than $20,000 higher than that of any other major. The earnings of the majors in second to fourth place are relatively close to one another.

If one data point has a much higher or lower value than the others, further investigation may be required. For example, you can view columns that contain related data.

Select all majors with a median salary of more than $60,000 and plot all three income columns.

top_medians = df[df["Median"] > 60000].sort_values("Median")

top_medians.plot(x="Major", y=["P25th", "Median", "P75th"], kind="bar")
<AxesSubplot:xlabel='Major'>


The 25th and 75th percentiles confirm what was seen above: petroleum engineering majors are by far the highest-earning recent graduates.
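
A visual check can be complemented by a numeric rule. The sketch below uses the common 1.5 × IQR rule, which is not used in this article and is shown only as one possible approach. The values and major names are made up, and the variable names (upper_fence, high_outliers) are hypothetical.

```python
import pandas as pd

# Made-up values for illustration only; not taken from recent-grads.csv.
sample = pd.DataFrame(
    {
        "Major": ["MAJOR A", "MAJOR B", "MAJOR C", "MAJOR D", "MAJOR E", "MAJOR F"],
        "Median": [110000, 75000, 73000, 70000, 65000, 62000],
    }
)

# Interquartile range: distance between the 25th and 75th percentiles.
q1 = sample["Median"].quantile(0.25)
q3 = sample["Median"].quantile(0.75)
iqr = q3 - q1

# Values above this fence are flagged as unusually high (1.5 * IQR rule).
upper_fence = q3 + 1.5 * iqr
high_outliers = sample[sample["Median"] > upper_fence]

print(upper_fence)
print(high_outliers)
```

A flagged row is a prompt for further investigation, not proof of an error in the data.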

Check for dependencies

Often you want to see whether two columns of a dataset are related. If you choose a major with a higher median income, do you also have a lower chance of unemployment?

Create a scatterplot with "Median" and "Unemployment_rate".

df.plot(x="Median", y="Unemployment_rate", kind="scatter")

There seems to be no obvious pattern, which suggests that there is no significant correlation between income and unemployment rate.

A scatter plot is an excellent tool for getting a first impression of possible correlations, but it is not clear evidence of a connection. If you suspect a correlation between two columns, you can use .corr() to verify your hunch and measure how strong the correlation is.
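
As a rough illustration of .corr(), the sketch below computes the Pearson correlation coefficient (the default method) between two columns, and then the full correlation matrix. The values are made up, so the resulting number says nothing about the real dataset.

```python
import pandas as pd

# Made-up values for illustration only; not taken from recent-grads.csv.
sample = pd.DataFrame(
    {
        "Median": [100000, 70000, 60000, 40000, 35000],
        "Unemployment_rate": [0.02, 0.10, 0.05, 0.08, 0.06],
    }
)

# Correlation between two Series: a single number between -1 and 1.
r = sample["Median"].corr(sample["Unemployment_rate"])
print(r)

# Correlation between every pair of numeric columns in the DataFrame.
corr_matrix = sample.corr()
print(corr_matrix)
```

Values near 0 indicate little linear relationship, while values near -1 or 1 indicate a strong one.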

For details, please refer to the companion article on three methods of data correlation analysis and their visualization.

But keep in mind that even if there is a correlation between two values, it does not mean that a change in one will cause a change in the other. In other words, correlation does not imply causation.

Analyze categorical data

To process larger chunks of information, the human brain categorizes data, both consciously and unconsciously. This technique is often useful, but far from perfect. Sometimes we put things into a category that, upon further inspection, are not that similar. So you need to know some tools for checking categories and verifying whether a given classification makes sense.

Grouping

The most basic use of categories is grouping and aggregation. You can use .groupby() to determine how popular each category in the college majors dataset is.

cat_totals = df.groupby("Major_category")["Total"].sum().sort_values()
cat_totals

Major_category
Interdisciplinary                        12296.0
Agriculture & Natural Resources          75620.0
Law & Public Policy                     179107.0
Physical Sciences                       185479.0
Industrial Arts & Consumer Services     229792.0
Computers & Mathematics                 299008.0
Arts                                    357130.0
Communications & Journalism             392601.0
Biology & Life Science                  453862.0
Health                                  463230.0
Psychology & Social Work                481007.0
Social Science                          529966.0
Engineering                             537583.0
Education                               559129.0
Humanities & Liberal Arts               713468.0
Business                               1302376.0
Name: Total, dtype: float64

Plot a horizontal bar chart showing the totals for all categories in cat_totals.

cat_totals.plot(kind="barh", fontsize=4)
<AxesSubplot:ylabel='Major_category'>
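
To make the grouping step concrete, the sketch below runs the same .groupby() pattern on a four-row made-up table, and also shows .agg() for computing several aggregates at once. The numbers are hypothetical; only the column names match the dataset.

```python
import pandas as pd

# Made-up values for illustration only; not taken from recent-grads.csv.
sample = pd.DataFrame(
    {
        "Major_category": ["Engineering", "Engineering", "Business", "Arts"],
        "Total": [2000, 750, 8000, 1500],
    }
)

# Same pattern as cat_totals: one sum per category, smallest first.
totals = sample.groupby("Major_category")["Total"].sum().sort_values()
print(totals)

# Several aggregates at once: number of majors and total graduates.
summary = sample.groupby("Major_category")["Total"].agg(["count", "sum"])
print(summary)
```

Sorting before plotting, as done for cat_totals, is what puts the bars of the horizontal bar chart in order of size.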


Ratios

To see the differences between categories, vertical and horizontal bar charts are usually a good choice. If you are interested in ratios, then a pie chart is a great tool.

Combine all categories with a total of less than 100,000 into a category called "Other" and create a pie chart.

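# Use >= so a category with exactly 100,000 is not dropped from both groups.
small_cat_totals = cat_totals[cat_totals < 100_000]
big_cat_totals = cat_totals[cat_totals >= 100_000]

small_sums = pd.Series([small_cat_totals.sum()], index=["Other"])
# Series.append() was removed in pandas 2.0; pd.concat() works in old and new versions.
big_cat_totals = pd.concat([big_cat_totals, small_sums])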

big_cat_totals.plot(kind="pie", label="")
<AxesSubplot:>
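
A pie chart shows ratios visually; the ratios themselves can be computed by dividing by the sum, and printed on the wedges with Matplotlib's autopct argument, which pandas passes through. This is a minimal sketch on made-up totals, and the Agg backend is chosen only so the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up totals for illustration only; not taken from recent-grads.csv.
cat_sample = pd.Series({"Business": 500, "Education": 300, "Other": 200})

# Share of each category in the overall total.
shares = cat_sample / cat_sample.sum()
print(shares)

# autopct is forwarded to Matplotlib and labels each wedge with its percentage.
ax = cat_sample.plot(kind="pie", label="", autopct="%1.1f%%")
n_wedges = len(ax.patches)
plt.close("all")
```

Printing the percentages is helpful because wedge sizes that are close together are hard to compare by eye.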


Origin blog.csdn.net/qq_20288327/article/details/124241111