[Getting Started with Python] Pandas

[Python from Scratch] Lesson 7: Pandas

Pandas is an open source Python data analysis library created by Wes McKinney in 2008; in the years that followed it quickly became one of the most popular and influential tools in the Python data analysis community. The name "Pandas" comes from "Panel Data" and "Python Data Analysis".


What is Pandas?

Pandas is an open source Python data analysis library that provides a large number of functions for easily processing structured data. It is our go-to tool for data cleaning, transformation, and analysis.

Why choose Pandas

Features of Pandas

For developers, Pandas provides powerful, easy-to-use data structures and data analysis tools, making it the first choice for data cleaning, transformation, analysis, and visualization.

List some features of Pandas:

  • Flexible data structures: Pandas can easily handle various kinds of data, such as structured numerical tables, time series, and statistical data sets
  • Powerful data processing capabilities: including insertion, deletion, aggregation, analysis, pivoting, and other operations
  • Easy integration with other Python libraries, such as Matplotlib and NumPy

Application scenarios of Pandas

Main application scenarios of Pandas:

  • Data cleaning: such as handling missing data, filtering data, etc.
  • Data transformation: such as creating new data structures, aggregating data, etc.
  • Data analysis: such as calculating statistics, grouping and summarizing data, etc.
  • Data visualization: combine with other libraries (Matplotlib, Seaborn) to draw visual charts

Pandas under the hood

Low-level implementation:

  • Built on NumPy: Pandas data structures such as Series and DataFrame are backed by NumPy arrays. Because NumPy is heavily optimized for numerical computation, Pandas achieves high performance when processing large amounts of data
  • C language extensions: although Pandas itself is written in Python, key parts of the code are written in Cython, further improving performance

Install pandas

In a terminal (cmd), run:

pip install pandas

Or, with conda:

conda install pandas

Check whether the installation is successful:

import pandas as pd
print(pd.__version__)  # prints the installed version if the import succeeds

Series array

Pandas has two core data structures: Series and DataFrame. Let's start with Series.

What is Series?

A Series is a one-dimensional labeled array that can hold any data type (integers, strings, floating-point numbers, Python objects, etc.). A Series is similar to an ordinary Python list, but with more features and flexibility.

Series creation

Format:

pd.Series(data, index=None, dtype=None, name=None, copy=None, fastpath=False)

parameter:

  • data: array-like data
  • index: index labels, default is None
  • dtype: data type of the resulting Series, default is None (inferred)
  • name: the name of the Series, default is None
  • copy: whether to copy the input data, default is None

example:

import pandas as pd


# Create a Series from a list
list1 = [1, 2, 3]  # create a list
series1 = pd.Series(list1)  # build a Series from the list
print(series1)  # debug output

# Create a Series with a custom index
student_name = ["张三", "李四", "我是小白呀"]  # student names, used as the index
student_id = [1, 2, 3]  # student ids
series2 = pd.Series(student_id, index=student_name)  # build the Series
print(series2)  # debug output

# Create a Series from a dict
dict1 = {'a': 1, 'b': 2, 'c': 3}  # create a dict
series3 = pd.Series(dict1)  # build a Series from the dict (keys become the index)
print(series3)  # debug output

Output result:

0    1
1    2
2    3
dtype: int64
张三       1
李四       2
我是小白呀    3
dtype: int64
a    1
b    2
c    3
dtype: int64
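The dtype and name parameters from the signature above can also be set at creation time. A minimal sketch (the variable names here are illustrative):

```python
import pandas as pd

# Create a Series with an explicit dtype and a name
series4 = pd.Series([1, 2, 3], index=["a", "b", "c"], dtype="float64", name="scores")
print(series4.name)   # the Series' name
print(series4.dtype)  # the element type
print(series4["b"])   # label-based lookup
```

The name shows up as the column header when the Series later becomes a DataFrame column.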

Series array operations

Data retrieval

In a Series, we can retrieve data through its index.

example:

import pandas as pd


# Create a Series with a custom index
student_name = ["张三", "李四", "我是小白呀"]  # student names, used as the index
student_id = [1, 2, 3]  # student ids
series1 = pd.Series(student_id, index=student_name)  # build the Series
print(series1)  # debug output

# Data retrieval
zhangsan_id = series1["张三"]  # look up 张三's id by index label
lisi_id = series1["李四"]  # look up 李四's id by index label
iamarookie_id = series1["我是小白呀"]  # look up 我是小白呀's id by index label
print("张三的 id:", zhangsan_id)
print("李四的 id:", lisi_id)
print("我是小白呀的 id:", iamarookie_id)

# Multiple lookups at once
ids = series1[["张三", "李四"]]  # retrieve 张三's and 李四's ids with a list of labels
print("张三 & 李四的 id: \n{}".format(ids))  # debug output

Output result:

张三       1
李四       2
我是小白呀    3
dtype: int64
张三的 id: 1
李四的 id: 2
我是小白呀的 id: 3
张三 & 李四的 id: 
张三    1
李四    2
dtype: int64
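Besides the plain `series1["张三"]` style used above, a Series also supports the explicit .loc (label-based) and .iloc (position-based) accessors, which avoid ambiguity when the index labels are themselves integers. A small sketch:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["张三", "李四", "我是小白呀"])

# Label-based access with .loc
print(s.loc["张三"])  # value stored under the label 张三

# Position-based access with .iloc
print(s.iloc[0])      # first element, regardless of its label
```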

Data modification

You can also use the index to modify the data in a Series.

example:

import pandas as pd


# Create a Series with a custom index
student_name = ["张三", "李四", "我是小白呀"]  # student names, used as the index
student_id = [1, 2, 3]  # student ids
series1 = pd.Series(student_id, index=student_name)  # build the Series
print(series1)  # debug output

# Data modification
series1["张三"] = 123  # change the id stored under the label 张三 to 123
print(series1)  # debug output

Output result:

张三       1
李四       2
我是小白呀    3
dtype: int64
张三       123
李四         2
我是小白呀      3
dtype: int64

Filtering

A Series can be filtered using Boolean indexing.

example:

import pandas as pd


# Create a Series of student grades
student_name = ["张三", "李四", "我是小白呀"]  # student names, used as the index
student_grade = [88, 90, 55]  # student grades
series1 = pd.Series(student_grade, index=student_name)  # build the Series
print(series1)  # debug output

# Boolean filtering
result = series1[series1 < 60]  # keep only the grades below 60
print("成绩不及格的同学: \n{}".format(result))  # debug output

Output result:

张三       88
李四       90
我是小白呀    55
dtype: int64
成绩不及格的同学: 
我是小白呀    55
dtype: int64

Series arithmetic operations

import pandas as pd


# Create a Series of student grades
student_name = ["张三", "李四", "我是小白呀"]  # student names, used as the index
student_grade = [88, 90, 55]  # student grades
series1 = pd.Series(student_grade, index=student_name)  # build the Series
print("加分前: \n{}".format(series1))  # debug output

# Series arithmetic
series1 = series1 + 5  # since 小白 failed, the teacher adds 5 points for everyone
print("加分后: \n{}".format(series1))  # debug output

Output result:

加分前: 
张三       88
李四       90
我是小白呀    55
dtype: int64
加分后: 
张三       93
李四       95
我是小白呀    60
dtype: int64
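One detail worth knowing: arithmetic between two Series aligns on index labels, not on position. Labels that appear in only one operand produce NaN. A minimal sketch:

```python
import pandas as pd

s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([10, 20], index=["b", "c"])

# Addition aligns on labels: only "b" exists in both,
# so "a" and "c" become NaN in the result
result = s1 + s2
print(result)
```

This automatic alignment is one of the main differences between a Series and a plain NumPy array.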

Summary

Series in Pandas provides a flexible and powerful way to process data. Whether it is data analysis, data cleaning or data manipulation, Series is a very useful tool.

DataFrame array

What is a DataFrame?

A DataFrame is a two-dimensional labeled data structure, similar to an Excel table. Each column of a DataFrame is a Series, and all columns have the same length. The DataFrame is the most commonly used and most powerful data structure in Pandas.

DataFrame creation

Using the pd.DataFrame function we can create a DataFrame. A DataFrame can be populated from a variety of data sources, such as a dictionary, a list, or an external file.

Format:

pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)

parameter:

  • data: array-like data (e.g. a dict, list, or ndarray)
  • index: row labels, default is None
  • columns: column labels, default is None
  • dtype: data type of the resulting DataFrame, default is None
  • copy: whether to copy the input data, default is None

example:

import pandas as pd


# Create a DataFrame
data = {"名字": ["张三", "李四", "我是小白呀"], "年龄": [25, 32, 18]}  # create a dict
df = pd.DataFrame(data)  # build a DataFrame from the dict
print(df)  # debug output

Output result:

      名字  年龄
0     张三  25
1     李四  32
2  我是小白呀  18
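A DataFrame can also be built from a list of dicts, one dict per row; keys missing from a row become NaN. A small sketch (the `rows` variable is illustrative):

```python
import pandas as pd

# One dict per row; the second row has no 年龄 key
rows = [{"名字": "张三", "年龄": 25}, {"名字": "李四"}]
df2 = pd.DataFrame(rows, columns=["名字", "年龄"])
print(df2)  # 李四's 年龄 appears as NaN
```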

Data operations

In Pandas, the index is a very powerful tool that helps us access, query, and manipulate data more efficiently. Now that we understand the data structures, let's look at how to use indexes to operate on them.

Access column data

Through the column names, we can retrieve the data in the DataFrame.

data:

        名字  年龄
第一行     张三  25
第二行     李四  32
第三行  我是小白呀  18

example:

import pandas as pd


# Create a DataFrame
data = {"名字": ["张三", "李四", "我是小白呀"], "年龄": [25, 32, 18]}  # create a dict
df = pd.DataFrame(data, index=["第一行", "第二行", "第三行"])  # build a DataFrame with row labels
print(df)  # debug output

# Retrieve the 名字 column by name
name = df["名字"]  # extract the 名字 column
print("提取名字列: \n{}".format(name))  # debug output

# Slice with iloc (position-based)
name = df.iloc[:, 0]  # extract the 名字 column (all rows, first column)
print("提取名字列: \n{}".format(name))  # debug output

# Slice with loc (label-based)
name = df.loc[:, "名字"]  # extract the 名字 column
print("提取名字列: \n{}".format(name))  # debug output

Output result:

        名字  年龄
第一行     张三  25
第二行     李四  32
第三行  我是小白呀  18
提取名字列: 
第一行       张三
第二行       李四
第三行    我是小白呀
Name: 名字, dtype: object
提取名字列: 
第一行       张三
第二行       李四
第三行    我是小白呀
Name: 名字, dtype: object
提取名字列: 
第一行       张三
第二行       李四
第三行    我是小白呀
Name: 名字, dtype: object

Access row data

example:

import pandas as pd


# Create a DataFrame
data = {"名字": ["张三", "李四", "我是小白呀"], "年龄": [25, 32, 18]}  # create a dict
df = pd.DataFrame(data, index=["第一行", "第二行", "第三行"])  # build a DataFrame with row labels
print(df)  # debug output

# Retrieve the first row by position
row0 = df.iloc[0]  # extract the first row
print("提取第一行: \n{}".format(row0))  # debug output

# Retrieve the first row by label
row0 = df.loc["第一行"]  # extract the first row
print("提取第一行: \n{}".format(row0))  # debug output

Output result:

        名字  年龄
第一行     张三  25
第二行     李四  32
第三行  我是小白呀  18
提取第一行: 
名字    张三
年龄    25
Name: 第一行, dtype: object
提取第一行: 
名字    张三
年龄    25
Name: 第一行, dtype: object

loc vs iloc vs ix

In Pandas, loc, iloc, ix are all methods used to select data.

The difference between the three:

  • loc["row label", "column label"]: select data by label
    • Select rows: df.loc["row label"]
    • Select columns: df.loc[:, "column label"]
  • iloc[row position, column position]: select data by integer position
    • Select rows: df.iloc[row position]
    • Select columns: df.iloc[:, column position]
  • ix: accepted both labels and positions, roughly the functionality of loc + iloc combined (just be aware it existed)
    • For readability, use loc or iloc. ix is deprecated and has been removed from modern Pandas, so it is not recommended.
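The difference is easiest to see side by side. A small sketch reusing the 名字/年龄 data; note that a label slice with loc includes the end label, while a positional slice with iloc does not:

```python
import pandas as pd

df = pd.DataFrame({"名字": ["张三", "李四"], "年龄": [25, 32]},
                  index=["第一行", "第二行"])

# loc: label-based selection
print(df.loc["第一行", "名字"])

# iloc: position-based selection of the same cell
print(df.iloc[0, 0])

# A loc slice INCLUDES the end label...
print(df.loc["第一行":"第二行"])  # both rows

# ...while an iloc slice excludes the end position
print(df.iloc[0:1])  # first row only
```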

DataFrame operations

Filter data

example:

import pandas as pd


# Create a DataFrame
data = {"名字": ["张三", "李四", "我是小白呀"], "年龄": [25, 32, 18]}  # create a dict
df = pd.DataFrame(data)  # build a DataFrame from the dict
print(df)  # debug output

# Retrieve the 名字 column
name = df["名字"]  # extract the 名字 column
print("提取名字列: \n{}".format(name))  # debug output

Output result:

      名字  年龄
0     张三  25
1     李四  32
2  我是小白呀  18
提取名字列: 
0       张三
1       李四
2    我是小白呀
Name: 名字, dtype: object
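The example above extracts a column; to actually filter rows, combine a Boolean condition with indexing, just as with a Series. A short sketch:

```python
import pandas as pd

df = pd.DataFrame({"名字": ["张三", "李四", "我是小白呀"], "年龄": [25, 32, 18]})

# Keep only the rows where 年龄 is greater than 20
adults = df[df["年龄"] > 20]
print(adults)

# Multiple conditions use & (and) / | (or), each wrapped in parentheses
subset = df[(df["年龄"] > 20) & (df["年龄"] < 30)]
print(subset)
```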

Sorting

Format:

pd.DataFrame.sort_values(by, *, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)

parameter:

  • axis: axis to sort along, default 0 (sort rows by column values)
  • ascending: sort from low to high, default is True
  • inplace: modify the original DataFrame in place, default is False

example:

import pandas as pd


# Create a DataFrame
data = {"名字": ["张三", "李四", "我是小白呀"], "年龄": [25, 32, 18]}  # create a dict
df = pd.DataFrame(data)  # build a DataFrame from the dict
print(df)  # debug output

# Sort the DataFrame
df = df.sort_values(by="年龄")  # sort the rows by the 年龄 column
df.reset_index(inplace=True)  # rebuild the index
print("排序: \n{}".format(df))  # debug output

Note: df.reset_index(inplace=True) re-indexes the DataFrame; pass drop=True as well if you do not want the old index kept as a new column.
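Sorting in the other direction works the same way; a small sketch with ascending=False and reset_index(drop=True), which discards the old index entirely:

```python
import pandas as pd

df = pd.DataFrame({"名字": ["张三", "李四", "我是小白呀"], "年龄": [25, 32, 18]})

# Sort by 年龄 from high to low
df = df.sort_values(by="年龄", ascending=False)

# drop=True throws the old index away instead of keeping it as a column
df = df.reset_index(drop=True)
print(df)
```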

Aggregation

We can perform various aggregation operations on DataFrame columns.

example:

import pandas as pd


# Create a DataFrame
data = {"名字": ["张三", "李四", "我是小白呀"], "年龄": [25, 32, 18]}  # create a dict
df = pd.DataFrame(data)  # build a DataFrame from the dict
print(df)  # debug output

# DataFrame aggregation
mean = df["年龄"].mean()  # compute the mean of the 年龄 column
print("平均年龄:", mean)  # debug output

Output result:

      名字  年龄
0     张三  25
1     李四  32
2  我是小白呀  18
平均年龄: 25.0

Add and delete

import pandas as pd


# Create a DataFrame
data = {"名字": ["张三", "李四", "我是小白呀"], "年龄": [25, 32, 18]}  # create a dict
df = pd.DataFrame(data)  # build a DataFrame from the dict
print(df)  # debug output

# Add a column
df["成绩"] = [78, 82, 60]  # add a new 成绩 column
print(df)  # debug output

# Delete a column
del df["年龄"]  # delete the 年龄 column
print(df)

Output result:

      名字  年龄
0     张三  25
1     李四  32
2  我是小白呀  18
      名字  年龄  成绩
0     张三  25  78
1     李四  32  82
2  我是小白呀  18  60
      名字  成绩
0     张三  78
1     李四  82
2  我是小白呀  60
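Besides del, the drop() method removes rows or columns and, by default, returns a new DataFrame while leaving the original untouched. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({"名字": ["张三", "李四"], "年龄": [25, 32], "成绩": [78, 82]})

# Drop a column; the original df is unchanged
df_no_age = df.drop(columns=["年龄"])
print(df_no_age)

# Drop a row by its index label
df_no_first = df.drop(index=0)
print(df_no_first)
```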

Data loading

We often need to load data from CSV files. Pandas provides the read_csv method, which makes loading data from a CSV file very simple.

CSV file loading

Format:

pandas.read_csv(filepath_or_buffer, *, sep=_NoDefault.no_default, delimiter=None, header='infer', names=_NoDefault.no_default, index_col=None, usecols=None, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=None, infer_datetime_format=_NoDefault.no_default, keep_date_col=False, date_parser=_NoDefault.no_default, date_format=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, encoding_errors='strict', dialect=None, on_bad_lines='error', delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None, storage_options=None, dtype_backend=_NoDefault.no_default)

parameter:

  • filepath_or_buffer: file path
  • header: row number to use as the column header
  • names: column names to use

example:

import pandas as pd


# Read a txt/csv file
data = pd.read_csv("test.txt", header=None, names=["链接"])
print(data)  # debug output

Output result:

                                                  链接
0  http://melanz.phorum.pl/viewtopic.php?f=7&t=64041
1  http://www.reo14.moe.go.th/phpBB3/viewtopic.ph...
2  https://www.xroxy.com/xorum/viewtopic.php?p=30...
3  http://armasow.forumbb.ru/viewtopic.php?id=840...
4  http://telecom.liveforums.ru/viewtopic.php?id=...
5  http://www.crpsc.org.br/forum/viewtopic.php?f=...
6  http://community.getvideostream.com/topic/4803...
7  http://www.shop.minecraftcommand.science/forum...
8  https://www.moddingway.com/forums/thread-31914...
9  https://webhitlist.com/forum/topics/main-featu...
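The reverse direction, saving a DataFrame to CSV, uses to_csv; index=False keeps the row index out of the file. A small sketch writing to a temporary file (the file name is illustrative):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"名字": ["张三", "李四"], "年龄": [25, 32]})

# Write to a temporary CSV file, then read it back
path = os.path.join(tempfile.gettempdir(), "demo_students.csv")
df.to_csv(path, index=False)  # index=False omits the 0..n-1 row index
df_loaded = pd.read_csv(path)
print(df_loaded)
```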

Excel file loading

example (reading .xlsx files additionally requires an Excel engine such as openpyxl to be installed):

df = pd.read_excel('path_to_file.xlsx')

Data exploration

Pandas provides several functions for inspecting the basic structure and contents of a DataFrame.

Commonly used functions:

  • df.info(): returns basic information about the data, including data types and non-null counts
  • df.head(): shows the first 5 rows
  • df.tail(): shows the last 5 rows
  • df.describe(): shows basic statistics, such as the mean, standard deviation, minimum, the 25th, 50th (median), and 75th percentiles, and the maximum

example:

import pandas as pd


# Read the data
data = pd.read_csv("students.txt", header=None)
print(data.info())  # overview, including each column's dtype and non-null count
print(data.head())  # first 5 rows
print(data.tail())  # last 5 rows
print(data.describe())  # basic statistics

Debug output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131 entries, 0 to 130
Data columns (total 2 columns):
0    131 non-null object
1    131 non-null object
dtypes: object(2)
memory usage: 2.1+ KB
None
                      0                   1
0  c1235666 Fink-Nottle      Augustus James
1    c3456765 O'Mahoney            Geoffrey
2       c8732719 De Leo   Victoria Margaret
3     c9676814 Thompson             Sabrina
4         c4418710 Heck               Kevin
                    0                        1
126     c6060052 Long                  Marilyn
127    c2390980 Martz       Perry Tony William
128   c5456142 Wilson          Christine Mabel
129    c1036678 Bunch            Richard Frank
130  c8306065 Hartley   Marcel Jonathan Philip
                    0        1
count             131      131
unique            131      127
top     c3827371 Bush   Thomas
freq                1        2
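A few attributes complement these functions for quick inspection; a tiny sketch on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"名字": ["张三", "李四"], "年龄": [25, 32]})
print(df.shape)    # (number of rows, number of columns)
print(df.columns)  # the column labels
print(df.dtypes)   # the data type of each column
```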

Handling missing values in Pandas

In practice, data is rarely complete, so we have to deal with missing values. Pandas provides a variety of ways to handle missing data.

Identify missing values

In Pandas, missing values are usually represented as NaN (Not a Number). We can use the isnull() function or the isna() function to locate missing values in the data.

The following data is missing several values in the Cabin column:

     PassengerId  Survived  Pclass  ...      Fare        Cabin  Embarked
0              1         0       3  ...    7.2500          NaN         S
1              2         1       1  ...   71.2833          C85         C
2              3         1       3  ...    7.9250          NaN         S
3              4         1       1  ...   53.1000         C123         S

example:

import pandas as pd


# Read the data
data = pd.read_csv("train.csv")
print(data)

# Print the number of missing values in each column
print(data.isnull().sum())

Output result:

     PassengerId  Survived  Pclass  ...      Fare        Cabin  Embarked
0              1         0       3  ...    7.2500          NaN         S
1              2         1       1  ...   71.2833          C85         C
2              3         1       3  ...    7.9250          NaN         S
3              4         1       1  ...   53.1000         C123         S
4              5         0       3  ...    8.0500          NaN         S
5              6         0       3  ...    8.4583          NaN         Q
6              7         0       1  ...   51.8625          E46         S
7              8         0       3  ...   21.0750          NaN         S
8              9         1       3  ...   11.1333          NaN         S
9             10         1       2  ...   30.0708          NaN         C
10            11         1       3  ...   16.7000           G6         S
11            12         1       1  ...   26.5500         C103         S
12            13         0       3  ...    8.0500          NaN         S
13            14         0       3  ...   31.2750          NaN         S
14            15         0       3  ...    7.8542          NaN         S
15            16         1       2  ...   16.0000          NaN         S
16            17         0       3  ...   29.1250          NaN         Q
17            18         1       2  ...   13.0000          NaN         S
18            19         0       3  ...   18.0000          NaN         S
19            20         1       3  ...    7.2250          NaN         C
20            21         0       2  ...   26.0000          NaN         S
21            22         1       2  ...   13.0000          D56         S
22            23         1       3  ...    8.0292          NaN         Q
23            24         1       1  ...   35.5000           A6         S
24            25         0       3  ...   21.0750          NaN         S
25            26         1       3  ...   31.3875          NaN         S
26            27         0       3  ...    7.2250          NaN         C
27            28         0       1  ...  263.0000  C23 C25 C27         S
28            29         1       3  ...    7.8792          NaN         Q
29            30         0       3  ...    7.8958          NaN         S
..           ...       ...     ...  ...       ...          ...       ...
861          862         0       2  ...   11.5000          NaN         S
862          863         1       1  ...   25.9292          D17         S
863          864         0       3  ...   69.5500          NaN         S
864          865         0       2  ...   13.0000          NaN         S
865          866         1       2  ...   13.0000          NaN         S
866          867         1       2  ...   13.8583          NaN         C
867          868         0       1  ...   50.4958          A24         S
868          869         0       3  ...    9.5000          NaN         S
869          870         1       3  ...   11.1333          NaN         S
870          871         0       3  ...    7.8958          NaN         S
871          872         1       1  ...   52.5542          D35         S
872          873         0       1  ...    5.0000  B51 B53 B55         S
873          874         0       3  ...    9.0000          NaN         S
874          875         1       2  ...   24.0000          NaN         C
875          876         1       3  ...    7.2250          NaN         C
876          877         0       3  ...    9.8458          NaN         S
877          878         0       3  ...    7.8958          NaN         S
878          879         0       3  ...    7.8958          NaN         S
879          880         1       1  ...   83.1583          C50         C
880          881         1       2  ...   26.0000          NaN         S
881          882         0       3  ...    7.8958          NaN         S
882          883         0       3  ...   10.5167          NaN         S
883          884         0       2  ...   10.5000          NaN         S
884          885         0       3  ...    7.0500          NaN         S
885          886         0       3  ...   29.1250          NaN         Q
886          887         0       2  ...   13.0000          NaN         S
887          888         1       1  ...   30.0000          B42         S
888          889         0       3  ...   23.4500          NaN         S
889          890         1       1  ...   30.0000         C148         C
890          891         0       3  ...    7.7500          NaN         Q

[891 rows x 12 columns]
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Discarding missing values with dropna()

df.dropna() deletes every row in the data that contains a missing value.

example:

import pandas as pd
import numpy as np

# Create a mock data set
data = {
    'Product': ['Apple', 'Banana', 'Cherry', 'Date', 'Fig', 'Grape', 'Mango', 'Watermelon'],
    'Price': [1, 0.5, np.nan, 0.75, np.nan, 2.5, 1.2, np.nan],
    'Date_sold': [np.nan, '2023-01-15', '2023-01-16', '2023-01-17', '2023-01-18', '2023-01-19', np.nan, '2023-01-21']
}

df = pd.DataFrame(data)
print("原始数据:")
print(df)

# Drop every row that contains a NaN
df_dropped = df.dropna()
print("\n删除含有 NaN 后的数据:")
print(df_dropped)

Output result:

原始数据:
      Product  Price   Date_sold
0       Apple   1.00         NaN
1      Banana   0.50  2023-01-15
2      Cherry    NaN  2023-01-16
3        Date   0.75  2023-01-17
4         Fig    NaN  2023-01-18
5       Grape   2.50  2023-01-19
6       Mango   1.20         NaN
7  Watermelon    NaN  2023-01-21

删除含有 NaN 后的数据:
  Product  Price   Date_sold
1  Banana   0.50  2023-01-15
3    Date   0.75  2023-01-17
5   Grape   2.50  2023-01-19
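dropna() also takes parameters that control which rows count as droppable: subset restricts the check to the given columns, and how='all' drops a row only when every value is missing. A small sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Cherry'],
    'Price': [1, np.nan, np.nan],
    'Date_sold': [np.nan, '2023-01-15', np.nan],
})

# Drop a row only when its Price is missing
print(df.dropna(subset=['Price']))

# Drop a row only when ALL of its values are missing (none here)
print(df.dropna(how='all'))
```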

Filling missing values with fillna()

df.fillna() fills missing data with a specified value.

example:

import pandas as pd
import numpy as np

# Create a mock data set
data = {
    'Product': ['Apple', 'Banana', 'Cherry', 'Date', 'Fig', 'Grape', 'Mango', 'Watermelon'],
    'Price': [1, 0.5, np.nan, 0.75, np.nan, 2.5, 1.2, np.nan],
    'Date_sold': [np.nan, '2023-01-15', '2023-01-16', '2023-01-17', '2023-01-18', '2023-01-19', np.nan, '2023-01-21']
}

df = pd.DataFrame(data)
print("原始数据:")
print(df)

# Fill with a fixed value
df1 = df.copy()
df1['Price'] = df1['Price'].fillna(1)
print("\n使用固定值填充后的数据:")
print(df1)

# Fill with the previous value (forward fill)
df2 = df.copy()
df2['Price'] = df2['Price'].ffill()
print("\n使用前一个值填充后的数据:")
print(df2)

# Fill with the column mean
df3 = df.copy()
df3['Price'] = df3['Price'].fillna(df3['Price'].mean())
print("\n使用平均值填充后的数据:")
print(df3)

Output result:

原始数据:
      Product  Price   Date_sold
0       Apple   1.00         NaN
1      Banana   0.50  2023-01-15
2      Cherry    NaN  2023-01-16
3        Date   0.75  2023-01-17
4         Fig    NaN  2023-01-18
5       Grape   2.50  2023-01-19
6       Mango   1.20         NaN
7  Watermelon    NaN  2023-01-21

使用固定值填充后的数据:
      Product  Price   Date_sold
0       Apple   1.00         NaN
1      Banana   0.50  2023-01-15
2      Cherry   1.00  2023-01-16
3        Date   0.75  2023-01-17
4         Fig   1.00  2023-01-18
5       Grape   2.50  2023-01-19
6       Mango   1.20         NaN
7  Watermelon   1.00  2023-01-21

使用前一个值填充后的数据:
      Product  Price   Date_sold
0       Apple   1.00         NaN
1      Banana   0.50  2023-01-15
2      Cherry   0.50  2023-01-16
3        Date   0.75  2023-01-17
4         Fig   0.75  2023-01-18
5       Grape   2.50  2023-01-19
6       Mango   1.20         NaN
7  Watermelon   1.20  2023-01-21

使用平均值填充后的数据:
      Product  Price   Date_sold
0       Apple   1.00         NaN
1      Banana   0.50  2023-01-15
2      Cherry   1.19  2023-01-16
3        Date   0.75  2023-01-17
4         Fig   1.19  2023-01-18
5       Grape   2.50  2023-01-19
6       Mango   1.20         NaN
7  Watermelon   1.19  2023-01-21

Remove duplicates

Remove duplicate rows:

df.drop_duplicates()

Data conversion:

df['column_name'] = df['column_name'].astype('new_type')  # convert a column's data type
df['new_column'] = df['column1'] + df['column2']  # derive a new column from existing ones
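A short sketch of both operations in action; the toy data here is illustrative:

```python
import pandas as pd

# The first and third rows are exact duplicates; 年龄 is stored as strings
df = pd.DataFrame({"名字": ["张三", "李四", "张三"], "年龄": ["25", "32", "25"]})

# Remove fully duplicated rows
df = df.drop_duplicates()

# Convert the 年龄 column from string to integer
df["年龄"] = df["年龄"].astype("int64")
print(df)
print(df.dtypes)
```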

inplace parameter

Many Pandas functions take an inplace parameter. It controls whether the original object is modified.

The meaning of True and False in inplace:

  • inplace = True: Do not create a new object, directly modify the original object
  • inplace = False: Modify the data, create and return a new object to carry the modification results

example:

# The following two lines have the same effect
df.dropna(inplace=True)  # modify the df object directly (returns None)
df = df.dropna()         # create a modified copy and rebind it to df

The inplace parameter defaults to False, meaning a new object is created to carry the modification and the original object is left unchanged, much like the difference between copying and modifying in place.

Data merging, connection and management

Concatenation and merging

The concat() function can concatenate two or more data sets.

Format:

pd.concat(objs, axis=0, join='outer', ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          sort=False, copy=True)

parameter:

  • objs: the objects to concatenate
  • join: how to handle the other axis, 'outer' (union) or 'inner' (intersection)
  • ignore_index: if True, discard the original row indexes and relabel them 0..n-1

example:

import pandas as pd


# Initialize the DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
                    'B': ['B3', 'B4', 'B5']})
# Debug output
print(df1)
print(df2)

# Concatenate with concat
result = pd.concat([df1, df2])
print(result)

Debug output:

    A   B
0  A0  B0
1  A1  B1
2  A2  B2
    A   B
0  A3  B3
1  A4  B4
2  A5  B5
    A   B
0  A0  B0
1  A1  B1
2  A2  B2
0  A3  B3
1  A4  B4
2  A5  B5
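Note how the row labels 0, 1, 2 repeat in the result above; ignore_index=True avoids that, and axis=1 concatenates side by side instead of stacking. A small sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# Stack vertically with a fresh 0..n-1 index
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked)

# Concatenate side by side (column-wise)
side = pd.concat([df1, df2], axis=1)
print(side)
```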

The merge() function can also combine data sets, similar to SQL's JOIN.

example:

import pandas as pd


# Initialize the DataFrames
left = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                     'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                      'C': ['C0', 'C1', 'C2'],
                      'D': ['D0', 'D1', 'D2']})

# Debug output
print(left)
print(right)

# Join with merge
result = pd.merge(left, right, on='key')
print(result)

Output result:

  key   A   B
0  K0  A0  B0
1  K1  A1  B1
2  K2  A2  B2
  key   C   D
0  K0  C0  D0
1  K1  C1  D1
2  K2  C2  D2
  key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2
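Like SQL joins, merge supports a how parameter ('inner', 'left', 'right', 'outer'). A small sketch with a key that does not match on both sides:

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1'], 'A': ['A0', 'A1']})
right = pd.DataFrame({'key': ['K1', 'K2'], 'C': ['C1', 'C2']})

# Inner join (the default): only keys present in both sides survive
inner = pd.merge(left, right, on='key')
print(inner)

# Left join: keep every row of `left`; unmatched right-side values become NaN
left_join = pd.merge(left, right, on='key', how='left')
print(left_join)
```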

Grouping and aggregation

Use groupby() to group a DataFrame by column values, then apply an aggregate function to each group.

Common aggregate functions:

  • sum(): sum
  • mean(): mean
  • median(): median
  • min(): minimum
  • max(): maximum

Example 1:

import pandas as pd


# Initialize the DataFrame
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'B': [1, 2, 3, 4],
                   'C': [2.0, 3.0, 4.0, 5.0]})
print(df)

# Group by A and sum each group
grouped = df.groupby('A').sum()
print(grouped)

Debug output:

     A  B    C
0  foo  1  2.0
1  bar  2  3.0
2  foo  3  4.0
3  bar  4  5.0
     B    C
A          
bar  6  8.0
foo  4  6.0

Example 2, a data set on global employee salaries:

import pandas as pd

# The data set
data = {
    'Country': ['USA', 'India', 'UK', 'USA', 'India', 'UK', 'USA', 'India'],
    'Employee': ['Sam', 'Amit', 'John', 'Alice', 'Alok', 'Bob', 'Charlie', 'Deepak'],
    'Salary': [70000, 45000, 60000, 80000, 50000, 55000, 85000, 65000]
}

# Initialize the DataFrame
df = pd.DataFrame(data)
print(df)

# Use groupby() to compute the average salary per country
salary_by_country = df.groupby("Country")["Salary"].mean()
print("平均薪水:", salary_by_country)

# Use groupby() to count the employees per country
employee_by_country = df.groupby("Country")["Employee"].count()
print("员工数量: \n{}".format(employee_by_country))

# Multiple aggregations at once
result = df.groupby("Country")['Salary'].agg(['mean', 'median', 'sum', 'max', 'min'])
print(result)

Output result:

  Country Employee  Salary
0     USA      Sam   70000
1   India     Amit   45000
2      UK     John   60000
3     USA    Alice   80000
4   India     Alok   50000
5      UK      Bob   55000
6     USA  Charlie   85000
7   India   Deepak   65000
平均薪水: Country
India    53333.333333
UK       57500.000000
USA      78333.333333
Name: Salary, dtype: float64
员工数量: 
Country
India    3
UK       2
USA      3
Name: Employee, dtype: int64
                 mean  median     sum    max    min
Country                                            
India    53333.333333   50000  160000  65000  45000
UK       57500.000000   57500  115000  60000  55000
USA      78333.333333   80000  235000  85000  70000

Time series analysis

Pandas provides datetime-based types, such as Timestamp, DatetimeIndex, and Timedelta, for processing time data.

example:

import pandas as pd


# Time series analysis
data = pd.to_datetime(['2023-01-01', '2023-02-01'])
print(data)  # debug output

Output result:

DatetimeIndex(['2023-01-01', '2023-02-01'], dtype='datetime64[ns]', freq=None)
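Once a column has been converted with to_datetime, the .dt accessor exposes its components (year, month, day, weekday, and so on). A small sketch:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2023-01-01", "2023-02-15"]})
df["date"] = pd.to_datetime(df["date"])  # parse the strings into datetime64
df["year"] = df["date"].dt.year          # extract the year component
df["month"] = df["date"].dt.month        # extract the month component
print(df)
```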

Data visualization

We can do basic data visualization with Pandas.

In addition to data processing and analysis, Pandas also provides simple and intuitive plotting methods:

  • plot(): draw line charts, usually used for time series data
  • hist(): draw a histogram to help observe the data distribution
  • boxplot(): draw box plots, useful for observing statistics such as the median and quantiles

Use plot() to draw a line chart:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


# Generate example data
date_rng = pd.date_range(start='2020-01-01', end='2020-12-31', freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0, 100, size=(len(date_rng)))


df.set_index('date', inplace=True)
df.plot(figsize=(10, 6))
plt.title('Time Series Data Visualization')
plt.ylabel('Random Data')
plt.show()

Output: a line chart of the daily random series.

Use hist() to draw a histogram:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Generate example data
date_rng = pd.date_range(start='2020-01-01', end='2020-12-31', freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0, 100, size=(len(date_rng)))

# Draw the histogram
df['data'].hist(bins=30, figsize=(10, 6))
plt.title('Histogram Data Visualization')
plt.xlabel('Random Data')
plt.ylabel('Frequency')
plt.show()

Output: a histogram of the random data.

Use boxplot() to draw a box plot:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


# Generate example data
date_rng = pd.date_range(start='2020-01-01', end='2020-12-31', freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0, 100, size=(len(date_rng)))

# Draw the box plot
df.boxplot(column='data', figsize=(6, 10))
plt.title('Boxplot Data Visualization')
plt.ylabel('Random Data')
plt.show()

Output: a box plot of the random data.

Summary

Pandas is an indispensable library for data analysis in Python. With Pandas, data scientists and data analysts can easily read, process, analyze, and visualize data.

First, we introduced the core data structures of Pandas - Series and DataFrame, which provide rich operations and functions for one-dimensional and two-dimensional data respectively. Next, we discuss in detail how to index, select, and modify data to make data processing simple and efficient.

In the data cleaning section, we learned various data cleaning techniques such as handling missing values, duplicate data, and string operations. These are key steps in the data preprocessing process and are essential for subsequent analysis and model building.

After that, we dived into aggregating, transforming, and filtering data. By grouping and summarizing data, we can gain interesting insights and statistics about the data.

Finally, we explored advanced features of Pandas such as data merging, reshaping, pivoting, and how to handle big data and performance optimization. These tips can help us better organize and optimize our code to make it more readable and efficient.

Overall, Pandas is a powerful, flexible, and efficient tool that you should learn and master in depth whether you are a data novice or an experienced analyst. I hope this blog can provide valuable reference and guidance for your data analysis journey. Continue to explore, learn and practice, and let the charm of data help you go further!

Practice

Dataset introduction

Titanic dataset. This dataset contains information about the passengers on the Titanic and whether they survived the sinking.

Data description:

  • PassengerId: passenger number
  • Survived: whether the passenger survived (0 = No, 1 = Yes)
  • Pclass: ticket category (1 = 1st, 2 = 2nd, 3 = 3rd)
  • Name: Passenger’s name
  • Sex: gender
  • Age: age
  • SibSp: Number of siblings and spouses on board
  • Parch: Number of parents and children on board
  • Ticket: ticket number
  • Fare: fare
  • Cabin: cabin number
  • Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Data record

Download the Titanic dataset: Click here to download
Use Pandas to read the downloaded CSV file.

Initial exploration of data

Initial exploration of data:

  • Describe the size of the data set and the data types of the columns.
  • Check whether the data set has any missing values and decide how to handle them.

Data filtering and manipulation

Data filtering and manipulation:

  • Select the Pclass, Sex, and Survived columns for further analysis.
  • Find all female passengers in 1st class.
  • Create a new column IsChild to mark passengers younger than 18 years old.

Data statistics and aggregation

Data statistics and aggregation:

  • Calculate the average survival rate for female passengers.
  • Use groupby() on Pclass and calculate the average survival rate for each class.
  • Create a pivot table showing survival rates by cabin class and gender.

Data cleaning

Data cleaning:

  • Handle missing values in the Age column, for example by filling in the mean age or using another method.
  • Find possible duplicate passengers based on name.
  • Convert the Sex column to numeric type, such as: 0 is male, 1 is female.

Data visualization

Data visualization:

  • Use Pandas to draw a bar chart of survival rates for different Pclasses.
  • Draw a histogram of the Age column to observe the passenger age distribution.
  • Draw a boxplot of the Fare column to observe the distribution of fares.

Reference answer

import pandas as pd
from matplotlib import pyplot as plt


# Read the data
data = pd.read_csv("train.csv")
print(data)


# Show basic information
print(data.info())
print(data.describe())
print(data.isnull().sum())  # print missing-value counts

# Fill missing values
data["Age"] = data["Age"].fillna(data["Age"].mean())  # fill with the mean age
data['Cabin'] = data['Cabin'].fillna('Unknown')
print(data.isnull().sum())  # print missing-value counts again


# Data filtering and manipulation
selected_data = data[['Pclass', 'Sex', 'Survived']]
first_class_females = data[(data['Pclass'] == 1) & (data['Sex'] == 'female')]

# Mark passengers under 18
data['IsChild'] = data['Age'].apply(lambda x: 1 if x < 18 else 0)

# Data statistics and aggregation
female_survival_rate = data[data['Sex'] == 'female']['Survived'].mean()
print(f"Female Survival Rate: {female_survival_rate}")


class_survival_rates = data.groupby('Pclass')['Survived'].mean()
print(class_survival_rates)

pivot_table = data.pivot_table('Survived', index='Sex', columns='Pclass')
print(pivot_table)

# Data cleaning
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})  # convert the Sex column to numeric values

# Data visualization
class_survival_rates.plot(kind='bar', title='Survival Rates by Pclass')
plt.ylabel('Survival Rate')
plt.show()

data['Age'].hist(bins=30, edgecolor='black')
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Number of Passengers')
plt.show()

data['Fare'].plot(kind='box')
plt.title('Fare Distribution')
plt.ylabel('Fare')
plt.show()

Output result:

     PassengerId  Survived  Pclass  ...      Fare        Cabin  Embarked
0              1         0       3  ...    7.2500          NaN         S
1              2         1       1  ...   71.2833          C85         C
2              3         1       3  ...    7.9250          NaN         S
3              4         1       1  ...   53.1000         C123         S
4              5         0       3  ...    8.0500          NaN         S
5              6         0       3  ...    8.4583          NaN         Q
6              7         0       1  ...   51.8625          E46         S
7              8         0       3  ...   21.0750          NaN         S
8              9         1       3  ...   11.1333          NaN         S
9             10         1       2  ...   30.0708          NaN         C
10            11         1       3  ...   16.7000           G6         S
11            12         1       1  ...   26.5500         C103         S
12            13         0       3  ...    8.0500          NaN         S
13            14         0       3  ...   31.2750          NaN         S
14            15         0       3  ...    7.8542          NaN         S
15            16         1       2  ...   16.0000          NaN         S
16            17         0       3  ...   29.1250          NaN         Q
17            18         1       2  ...   13.0000          NaN         S
18            19         0       3  ...   18.0000          NaN         S
19            20         1       3  ...    7.2250          NaN         C
20            21         0       2  ...   26.0000          NaN         S
21            22         1       2  ...   13.0000          D56         S
22            23         1       3  ...    8.0292          NaN         Q
23            24         1       1  ...   35.5000           A6         S
24            25         0       3  ...   21.0750          NaN         S
25            26         1       3  ...   31.3875          NaN         S
26            27         0       3  ...    7.2250          NaN         C
27            28         0       1  ...  263.0000  C23 C25 C27         S
28            29         1       3  ...    7.8792          NaN         Q
29            30         0       3  ...    7.8958          NaN         S
..           ...       ...     ...  ...       ...          ...       ...
861          862         0       2  ...   11.5000          NaN         S
862          863         1       1  ...   25.9292          D17         S
863          864         0       3  ...   69.5500          NaN         S
864          865         0       2  ...   13.0000          NaN         S
865          866         1       2  ...   13.0000          NaN         S
866          867         1       2  ...   13.8583          NaN         C
867          868         0       1  ...   50.4958          A24         S
868          869         0       3  ...    9.5000          NaN         S
869          870         1       3  ...   11.1333          NaN         S
870          871         0       3  ...    7.8958          NaN         S
871          872         1       1  ...   52.5542          D35         S
872          873         0       1  ...    5.0000  B51 B53 B55         S
873          874         0       3  ...    9.0000          NaN         S
874          875         1       2  ...   24.0000          NaN         C
875          876         1       3  ...    7.2250          NaN         C
876          877         0       3  ...    9.8458          NaN         S
877          878         0       3  ...    7.8958          NaN         S
878          879         0       3  ...    7.8958          NaN         S
879          880         1       1  ...   83.1583          C50         C
880          881         1       2  ...   26.0000          NaN         S
881          882         0       3  ...    7.8958          NaN         S
882          883         0       3  ...   10.5167          NaN         S
883          884         0       2  ...   10.5000          NaN         S
884          885         0       3  ...    7.0500          NaN         S
885          886         0       3  ...   29.1250          NaN         Q
886          887         0       2  ...   13.0000          NaN         S
887          888         1       1  ...   30.0000          B42         S
888          889         0       3  ...   23.4500          NaN         S
889          890         1       1  ...   30.0000         C148         C
890          891         0       3  ...    7.7500          NaN         Q

[891 rows x 12 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
       PassengerId    Survived      Pclass  ...       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  ...  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642  ...    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071  ...    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000  ...    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000  ...    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000  ...    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000  ...    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000  ...    8.000000    6.000000  512.329200

[8 rows x 7 columns]
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       2
dtype: int64
Female Survival Rate: 0.7420382165605095
Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64
Pclass         1         2         3
Sex                                 
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447

(bar chart, histogram, and box plot images)

Origin blog.csdn.net/weixin_46274168/article/details/133819282