[Zero Basic Introduction to Python] Lesson 7 Pandas
- What is Pandas?
- Why choose Pandas
- Install pandas
- Series array
- Series array operations
- DataFrame array
- Data operations
- DataFrame operations
- Data loading
- Pandas missing value filling
- Data merging, connection and management
- Summary
- Practice
- Reference answer
[Zero Basic Introduction to Python] Lesson 7 Pandas
Pandas is an open source Python data analysis library created by Wes McKinney in 2008; over the following years it quickly became one of the most popular and influential tools in the Python data analysis community. The name Pandas comes from "Panel Data" and "Python Data Analysis".
What is Pandas?
Pandas is an open source Python data analysis library that provides a large number of functions to help us process structured data with ease. Pandas is our go-to assistant for data cleaning, transformation, and analysis.
Why choose Pandas
Features of pandas
For developers, Pandas provides powerful, easy-to-use data structures and data analysis tools. It is the first choice for data cleaning, transformation, analysis and visualization.
List some features of Pandas:
- Flexible data structure: Pandas can easily handle various types of data, such as structured numerical tables, time series, statistical data sets, etc.
- Powerful data processing capabilities: including data selection, insertion, deletion, aggregation, analysis, pivoting and other operations
- Easy to integrate with other Python libraries: such as Matplotlib, Numpy etc.
Application scenarios of Pandas
Main application scenarios of Pandas:
- Data cleaning: such as processing missing data, filtering data, etc.
- Data transformation: such as creating new data structures, aggregating data, etc.
- Data analysis: such as calculating statistics, grouping and summarizing data, etc.
- Data visualization: combine with other libraries (Matplotlib, Seaborn) to draw visual charts
Pandas bottom layer
Low-level implementation:
- Built on Numpy: Pandas data structures such as "Series" and "DataFrame" operate on Numpy arrays under the hood. Because Numpy is specifically optimized for numerical computation, Pandas performs well when processing large amounts of data
- C language extensions: although Pandas itself is written in Python, key parts of the code are written in Cython, further improving performance
Install pandas
Enter in cmd:
pip install pandas
Install in conda:
conda install pandas
Check whether the installation is successful:
import pandas as pd
Series array
Pandas has two core data structures: "Series" and "DataFrame". Let's talk about Series first.
What is Series?
Series is a one-dimensional labeled array, which can hold any data type (integer, string, floating point number, Python object, etc.). Series is similar to an ordinary Python list, but has more features and flexibility.
Series creation
Format:
pd.Series(data, index=None, dtype=None, name=None, copy=None, fastpath=False)
parameter:
- data: array-like data
- index: index labels, default is None
- dtype: data type of the returned Series, default is None
- name: the name of the Series, default is None
- copy: whether to copy the input data, default is None
example:
import pandas as pd
# Create a Series
list1 = [1, 2, 3] # create a list
series1 = pd.Series(list1) # create a Series from the list
print(series1) # debug output
import pandas as pd
# Create a Series with an index
student_name = ["张三", "李四", "我是小白呀"] # list of student names, used as the index
student_id = [1, 2, 3] # list of student ids
series2 = pd.Series(student_id, index=student_name) # create the Series
print(series2) # debug output
# Create a Series from a dict
dict1 = {'a':1,'b':2, 'c':3} # create a dict
series3 = pd.Series(dict1) # create a Series from the dict
print(series3) # debug output
Output result:
0 1
1 2
2 3
dtype: int64
张三 1
李四 2
我是小白呀 3
dtype: int64
a 1
b 2
c 3
dtype: int64
Series array operations
Data retrieval
In the Series array, we can achieve data retrieval through indexing.
example:
import pandas as pd
# Create a Series with an index
student_name = ["张三", "李四", "我是小白呀"] # list of student names, used as the index
student_id = [1, 2, 3] # list of student ids
series1 = pd.Series(student_id, index=student_name) # create the Series
print(series1) # debug output
# Data retrieval
zhangsan_id = series1["张三"] # look up 张三's id by index
lisi_id = series1["李四"] # look up 李四's id by index
iamarookie_id = series1["我是小白呀"] # look up 我是小白呀's id by index
print("张三的 id:", zhangsan_id)
print("李四的 id:", lisi_id)
print("我是小白呀的 id:", iamarookie_id)
# Multiple retrieval
ids = series1[["张三", "李四"]] # look up the ids of 张三 and 李四 at once
print("张三 & 李四的 id: \n{}".format(ids)) # debug output
Output result:
张三 1
李四 2
我是小白呀 3
dtype: int64
张三的 id: 1
李四的 id: 2
我是小白呀的 id: 3
张三 & 李四的 id:
张三 1
李四 2
dtype: int64
Data modification
In a Series array, you can use indexing to modify the data in the Series.
example:
import pandas as pd
# Create a Series with an index
student_name = ["张三", "李四", "我是小白呀"] # list of student names, used as the index
student_id = [1, 2, 3] # list of student ids
series1 = pd.Series(student_id, index=student_name) # create the Series
print(series1) # debug output
# Data modification
series1["张三"] = 123 # set the id indexed by 张三 to 123
print(series1) # debug output
Output result:
张三 1
李四 2
我是小白呀 3
dtype: int64
张三 123
李四 2
我是小白呀 3
dtype: int64
Filtering
Series arrays can be filtered using Boolean indexes.
example:
import pandas as pd
# Create a Series of student grades
student_name = ["张三", "李四", "我是小白呀"] # list of student names, used as the index
student_grade = [88, 90, 55] # list of grades
series1 = pd.Series(student_grade, index=student_name) # create the Series
print(series1) # debug output
# Data filtering
result = series1[series1 < 60] # boolean index: keep grades below 60
print("成绩不及格的同学: \n{}".format(result)) # debug output
Output result:
张三 88
李四 90
我是小白呀 55
dtype: int64
成绩不及格的同学:
我是小白呀 55
dtype: int64
Series arithmetic operations
import pandas as pd
# Create a Series of student grades
student_name = ["张三", "李四", "我是小白呀"] # list of student names, used as the index
student_grade = [88, 90, 55] # list of grades
series1 = pd.Series(student_grade, index=student_name) # create the Series
print("加分前: \n{}".format(series1)) # debug output
# Series arithmetic
series1 = series1 + 5 # since 小白 failed, the teacher adds 5 points for everyone
print("加分后: \n{}".format(series1)) # debug output
Output result:
加分前:
张三 88
李四 90
我是小白呀 55
dtype: int64
加分后:
张三 93
李四 95
我是小白呀 60
dtype: int64
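One subtlety worth knowing: arithmetic between two Series aligns values by index label, not by position. A minimal sketch with illustrative data:

```python
import pandas as pd

# Two Series over the same students, entered in different orders
midterm = pd.Series({"张三": 88, "李四": 90, "我是小白呀": 55})
final = pd.Series({"李四": 80, "我是小白呀": 65, "张三": 92})

# Addition matches values by label, not by position
total = midterm + final
print(total)
```

A label present in only one of the two Series would produce NaN for that label in the result.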
Summarize
Series in Pandas provides a flexible and powerful way to process data. Whether it is data analysis, data cleaning or data manipulation, Series is a very useful tool.
DataFrame array
What is a DataFrame?
DataFrame is a two-dimensional labeled data structure, similar to an Excel table. Each column of a DataFrame is a Series, and all columns share the same length. DataFrame is the most commonly used and most powerful data structure in Pandas.
DataFrame creation
Using the pd.DataFrame function we can create a DataFrame. A DataFrame can be constructed from a variety of data sources, such as a dictionary, a list, or an external file.
Format:
pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)
parameter:
- data: array-like data
- index: row index, default is None
- columns: column names, default is None
- dtype: data type of the returned DataFrame, default is None
- copy: whether to copy the input data, default is None
example:
import pandas as pd
# Create a DataFrame
data = {"名字":["张三", "李四", "我是小白呀"], "年龄":[25, 32, 18]} # create a dict
df = pd.DataFrame(data) # create a DataFrame from the dict
print(df) # debug output
Output result:
名字 年龄
0 张三 25
1 李四 32
2 我是小白呀 18
Data operations
In Pandas, the index is a very powerful tool that helps us access, query and manipulate data efficiently. Having covered the data structures, let's look at how to use indexes to operate on them.
Access column data
Through the column names, we can retrieve the data in the DataFrame.
data:
名字 年龄
第一行 张三 25
第二行 李四 32
第三行 我是小白呀 18
example:
import pandas as pd
# Create a DataFrame
data = {"名字":["张三", "李四", "我是小白呀"], "年龄":[25, 32, 18]} # create a dict
df = pd.DataFrame(data, index=["第一行", "第二行", "第三行"]) # create a DataFrame from the dict
print(df) # debug output
# Retrieve the 名字 column
name = df["名字"] # extract the 名字 column
print("提取名字列: \n{}".format(name)) # debug output
# Slicing with iloc
name = df.iloc[:,0] # extract the 名字 column (all rows, first column)
print("提取名字列: \n{}".format(name)) # debug output
# Slicing with loc
name = df.loc[:,"名字"] # extract the 名字 column
print("提取名字列: \n{}".format(name)) # debug output
Output result:
名字 年龄
第一行 张三 25
第二行 李四 32
第三行 我是小白呀 18
提取名字列:
第一行 张三
第二行 李四
第三行 我是小白呀
Name: 名字, dtype: object
提取名字列:
第一行 张三
第二行 李四
第三行 我是小白呀
Name: 名字, dtype: object
提取名字列:
第一行 张三
第二行 李四
第三行 我是小白呀
Name: 名字, dtype: object
Access row data
example:
import pandas as pd
# Create a DataFrame
data = {"名字":["张三", "李四", "我是小白呀"], "年龄":[25, 32, 18]} # create a dict
df = pd.DataFrame(data, index=["第一行", "第二行", "第三行"]) # create a DataFrame from the dict
print(df) # debug output
# Retrieve the first row with iloc
row0 = df.iloc[0] # extract the first row
print("提取第一行: \n{}".format(row0)) # debug output
# Retrieve the first row with loc
row0 = df.loc["第一行"] # extract the first row
print("提取第一行: \n{}".format(row0)) # debug output
Output result:
名字 年龄
第一行 张三 25
第二行 李四 32
第三行 我是小白呀 18
提取第一行:
名字 张三
年龄 25
Name: 第一行, dtype: object
提取第一行:
名字 张三
年龄 25
Name: 第一行, dtype: object
loc vs iloc vs ix
In Pandas, loc, iloc, and ix are all methods used to select data.
The difference between the three:
- loc["row label", "column label"]: select data by label
- Select rows: df.loc["row label"]
- Select columns: df.loc[:, "column label"]
- iloc[row index, column index]: select data by integer position
- Select rows: df.iloc[row index]
- Select columns: df.iloc[:, column index]
- ix: accepted either labels or integer positions, roughly the functionality of loc + iloc combined (just know that it existed)
- For code readability, use loc or iloc. ix is deprecated and has been removed in modern versions of Pandas.
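The contrast between loc and iloc is easiest to see on a tiny DataFrame; a sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({"名字": ["张三", "李四"], "年龄": [25, 32]},
                  index=["第一行", "第二行"])

by_label = df.loc["第一行", "名字"]  # label-based: row "第一行", column "名字"
by_position = df.iloc[0, 0]          # position-based: row 0, column 0
print(by_label, by_position)
```

Both expressions select the same cell here; they only differ in how the cell is addressed.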
DataFrame operations
Select data
example:
import pandas as pd
# Create a DataFrame
data = {"名字":["张三", "李四", "我是小白呀"], "年龄":[25, 32, 18]} # create a dict
df = pd.DataFrame(data) # create a DataFrame from the dict
print(df) # debug output
# Retrieve the 名字 column
name = df["名字"] # extract the 名字 column
print("提取名字列: \n{}".format(name)) # debug output
Output result:
名字 年龄
0 张三 25
1 李四 32
2 我是小白呀 18
提取名字列:
0 张三
1 李四
2 我是小白呀
Name: 名字, dtype: object
Sorting
Format:
pd.DataFrame.sort_values(by, *, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)
parameter:
- by: column name (or list of names) to sort by
- axis: axis to sort along, default is 0 (sort rows)
- ascending: sort from low to high, default is True
- inplace: modify the original DataFrame in place, default is False
example:
import pandas as pd
# Create a DataFrame
data = {"名字":["张三", "李四", "我是小白呀"], "年龄":[25, 32, 18]} # create a dict
df = pd.DataFrame(data) # create a DataFrame from the dict
print(df) # debug output
# Sort the DataFrame
df = df.sort_values(by="年龄") # sort by the 年龄 column
df.reset_index(inplace=True) # re-index
print("排序: \n{}".format(df)) # debug output
Note: df.reset_index(inplace=True) re-indexes the DataFrame; pass drop=True as well if you want to discard the old index instead of keeping it as a column.
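To sketch how sorting and re-indexing interact (drop=True discards the old index instead of keeping it as a column; data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"名字": ["张三", "李四", "我是小白呀"], "年龄": [25, 32, 18]})

df = df.sort_values(by="年龄", ascending=False)  # sort from oldest to youngest
df = df.reset_index(drop=True)  # renumber rows 0..n-1 and discard the old index
print(df)
```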
Aggregation
We can perform various aggregation operations on DataFrame columns.
example:
import pandas as pd
# Create a DataFrame
data = {"名字":["张三", "李四", "我是小白呀"], "年龄":[25, 32, 18]} # create a dict
df = pd.DataFrame(data) # create a DataFrame from the dict
print(df) # debug output
# DataFrame aggregation
mean = df["年龄"].mean() # compute the mean of the 年龄 column
print("平均年龄:", mean) # debug output
Output result:
名字 年龄
0 张三 25
1 李四 32
2 我是小白呀 18
平均年龄: 25.0
Add and delete columns
import pandas as pd
# Create a DataFrame
data = {"名字":["张三", "李四", "我是小白呀"], "年龄":[25, 32, 18]} # create a dict
df = pd.DataFrame(data) # create a DataFrame from the dict
print(df) # debug output
# Add a column
df["成绩"] = [78, 82, 60] # add a new column, 成绩
print(df) # debug output
# Delete a column
del df["年龄"] # delete the 年龄 column
print(df)
Output result:
名字 年龄
0 张三 25
1 李四 32
2 我是小白呀 18
名字 年龄 成绩
0 张三 25 78
1 李四 32 82
2 我是小白呀 18 60
名字 成绩
0 张三 78
1 李四 82
2 我是小白呀 60
Data loading
We often need to load data from CSV files. Pandas provides the read_csv method, which makes loading data from CSV files very simple.
CSV file loading
Format:
pandas.read_csv(filepath_or_buffer, *, sep=_NoDefault.no_default, delimiter=None, header='infer', names=_NoDefault.no_default, index_col=None, usecols=None, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=None, infer_datetime_format=_NoDefault.no_default, keep_date_col=False, date_parser=_NoDefault.no_default, date_format=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, encoding_errors='strict', dialect=None, on_bad_lines='error', delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None, storage_options=None, dtype_backend=_NoDefault.no_default)
parameter:
- filepath_or_buffer: file path
- header: header
- names: column names
example:
import pandas as pd
# Read a txt/csv file
data = pd.read_csv("test.txt", header=None, names=["链接"])
print(data) # debug output
Output result:
链接
0 http://melanz.phorum.pl/viewtopic.php?f=7&t=64041
1 http://www.reo14.moe.go.th/phpBB3/viewtopic.ph...
2 https://www.xroxy.com/xorum/viewtopic.php?p=30...
3 http://armasow.forumbb.ru/viewtopic.php?id=840...
4 http://telecom.liveforums.ru/viewtopic.php?id=...
5 http://www.crpsc.org.br/forum/viewtopic.php?f=...
6 http://community.getvideostream.com/topic/4803...
7 http://www.shop.minecraftcommand.science/forum...
8 https://www.moddingway.com/forums/thread-31914...
9 https://webhitlist.com/forum/topics/main-featu...
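read_csv also accepts any file-like object, which is handy for quick experiments without a file on disk. A minimal sketch using an in-memory string (the column names here are illustrative):

```python
import io

import pandas as pd

csv_text = "id,name\n1,张三\n2,李四\n3,我是小白呀\n"
df = pd.read_csv(io.StringIO(csv_text))  # file-like objects work like file paths
print(df)
```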
Excel file loading
example:
df = pd.read_excel('path_to_file.xlsx')
Data exploration
In a DataFrame, we can use a few functions to inspect the basic structure and content of the data.
Commonly used functions:
- df.info(): returns basic information about the data, including data types, non-null counts, etc.
- df.head(): shows the first 5 rows
- df.tail(): shows the last 5 rows
- df.describe(): shows basic statistics, including mean, standard deviation, minimum, 25th, 50th (median) and 75th percentiles, maximum, etc.
example:
import pandas as pd
# Read the data
data = pd.read_csv("students.txt", header=None)
print(data.info()) # overview, including each column's dtype and non-null count
print(data.head()) # first 5 rows
print(data.tail()) # last 5 rows
print(data.describe()) # basic statistics
Debug output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131 entries, 0 to 130
Data columns (total 2 columns):
0 131 non-null object
1 131 non-null object
dtypes: object(2)
memory usage: 2.1+ KB
None
0 1
0 c1235666 Fink-Nottle Augustus James
1 c3456765 O'Mahoney Geoffrey
2 c8732719 De Leo Victoria Margaret
3 c9676814 Thompson Sabrina
4 c4418710 Heck Kevin
0 1
126 c6060052 Long Marilyn
127 c2390980 Martz Perry Tony William
128 c5456142 Wilson Christine Mabel
129 c1036678 Bunch Richard Frank
130 c8306065 Hartley Marcel Jonathan Philip
0 1
count 131 131
unique 131 127
top c3827371 Bush Thomas
freq 1 2
Pandas missing value filling
Real-world data is rarely complete, so we have to deal with missing values. Pandas provides several ways to handle missing data.
Identify missing values
In Pandas, missing values are usually represented as NaN (Not a Number). We can use the isnull() or isna() function to locate missing values in the data.
The following dataset has missing values in several columns (note the NaN entries in the Cabin column):
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
example:
import pandas as pd
# Read the data
data = pd.read_csv("train.csv")
print(data)
# Print the number of missing values per column
print(data.isnull().sum())
Output result:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
5 6 0 3 ... 8.4583 NaN Q
6 7 0 1 ... 51.8625 E46 S
7 8 0 3 ... 21.0750 NaN S
8 9 1 3 ... 11.1333 NaN S
9 10 1 2 ... 30.0708 NaN C
10 11 1 3 ... 16.7000 G6 S
11 12 1 1 ... 26.5500 C103 S
12 13 0 3 ... 8.0500 NaN S
13 14 0 3 ... 31.2750 NaN S
14 15 0 3 ... 7.8542 NaN S
15 16 1 2 ... 16.0000 NaN S
16 17 0 3 ... 29.1250 NaN Q
17 18 1 2 ... 13.0000 NaN S
18 19 0 3 ... 18.0000 NaN S
19 20 1 3 ... 7.2250 NaN C
20 21 0 2 ... 26.0000 NaN S
21 22 1 2 ... 13.0000 D56 S
22 23 1 3 ... 8.0292 NaN Q
23 24 1 1 ... 35.5000 A6 S
24 25 0 3 ... 21.0750 NaN S
25 26 1 3 ... 31.3875 NaN S
26 27 0 3 ... 7.2250 NaN C
27 28 0 1 ... 263.0000 C23 C25 C27 S
28 29 1 3 ... 7.8792 NaN Q
29 30 0 3 ... 7.8958 NaN S
.. ... ... ... ... ... ... ...
861 862 0 2 ... 11.5000 NaN S
862 863 1 1 ... 25.9292 D17 S
863 864 0 3 ... 69.5500 NaN S
864 865 0 2 ... 13.0000 NaN S
865 866 1 2 ... 13.0000 NaN S
866 867 1 2 ... 13.8583 NaN C
867 868 0 1 ... 50.4958 A24 S
868 869 0 3 ... 9.5000 NaN S
869 870 1 3 ... 11.1333 NaN S
870 871 0 3 ... 7.8958 NaN S
871 872 1 1 ... 52.5542 D35 S
872 873 0 1 ... 5.0000 B51 B53 B55 S
873 874 0 3 ... 9.0000 NaN S
874 875 1 2 ... 24.0000 NaN C
875 876 1 3 ... 7.2250 NaN C
876 877 0 3 ... 9.8458 NaN S
877 878 0 3 ... 7.8958 NaN S
878 879 0 3 ... 7.8958 NaN S
879 880 1 1 ... 83.1583 C50 C
880 881 1 2 ... 26.0000 NaN S
881 882 0 3 ... 7.8958 NaN S
882 883 0 3 ... 10.5167 NaN S
883 884 0 2 ... 10.5000 NaN S
884 885 0 3 ... 7.0500 NaN S
885 886 0 3 ... 29.1250 NaN Q
886 887 0 2 ... 13.0000 NaN S
887 888 1 1 ... 30.0000 B42 S
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
890 891 0 3 ... 7.7500 NaN Q
[891 rows x 12 columns]
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
Use dropna() to discard missing values
df.dropna() deletes all rows that contain missing values.
example:
import pandas as pd
import numpy as np
# Create a mock dataset
data = {
'Product': ['Apple', 'Banana', 'Cherry', 'Date', 'Fig', 'Grape', 'Mango', 'Watermelon'],
'Price': [1, 0.5, np.nan, 0.75, np.nan, 2.5, 1.2, np.nan],
'Date_sold': [np.nan, '2023-01-15', '2023-01-16', '2023-01-17', '2023-01-18', '2023-01-19', np.nan, '2023-01-21']
}
df = pd.DataFrame(data)
print("原始数据:")
print(df)
# Drop any row containing NaN
df_dropped = df.dropna()
print("\n删除含有 NaN 后的数据:")
print(df_dropped)
Output result:
原始数据:
Product Price Date_sold
0 Apple 1.00 NaN
1 Banana 0.50 2023-01-15
2 Cherry NaN 2023-01-16
3 Date 0.75 2023-01-17
4 Fig NaN 2023-01-18
5 Grape 2.50 2023-01-19
6 Mango 1.20 NaN
7 Watermelon NaN 2023-01-21
删除含有 NaN 后的数据:
Product Price Date_sold
1 Banana 0.50 2023-01-15
3 Date 0.75 2023-01-17
5 Grape 2.50 2023-01-19
Fill missing values using fillna()
df.fillna() fills missing data with a specified value.
example:
import pandas as pd
import numpy as np
# Create a mock dataset
data = {
'Product': ['Apple', 'Banana', 'Cherry', 'Date', 'Fig', 'Grape', 'Mango', 'Watermelon'],
'Price': [1, 0.5, np.nan, 0.75, np.nan, 2.5, 1.2, np.nan],
'Date_sold': [np.nan, '2023-01-15', '2023-01-16', '2023-01-17', '2023-01-18', '2023-01-19', np.nan, '2023-01-21']
}
df = pd.DataFrame(data)
print("原始数据:")
print(df)
# Fill with a fixed value
df1 = df.copy()
df1['Price'].fillna("1", inplace=True) # note: filling with the string "1" turns the column into object dtype
print("\n使用固定值填充后的数据:")
print(df1)
# Fill with the previous value (forward fill)
df2 = df.copy()
df2['Price'].fillna(method='ffill', inplace=True)
print("\n使用前一个值填充后的数据:")
print(df2)
# Fill with the mean
df3 = df.copy()
df3['Price'].fillna(df3['Price'].mean(), inplace=True)
print("\n使用平均值填充后的数据:")
print(df3)
Output result:
原始数据:
Product Price Date_sold
0 Apple 1.00 NaN
1 Banana 0.50 2023-01-15
2 Cherry NaN 2023-01-16
3 Date 0.75 2023-01-17
4 Fig NaN 2023-01-18
5 Grape 2.50 2023-01-19
6 Mango 1.20 NaN
7 Watermelon NaN 2023-01-21
使用固定值填充后的数据:
Product Price Date_sold
0 Apple 1 NaN
1 Banana 0.5 2023-01-15
2 Cherry 1 2023-01-16
3 Date 0.75 2023-01-17
4 Fig 1 2023-01-18
5 Grape 2.5 2023-01-19
6 Mango 1.2 NaN
7 Watermelon 1 2023-01-21
使用前一个值填充后的数据:
Product Price Date_sold
0 Apple 1.00 NaN
1 Banana 0.50 2023-01-15
2 Cherry 0.50 2023-01-16
3 Date 0.75 2023-01-17
4 Fig 0.75 2023-01-18
5 Grape 2.50 2023-01-19
6 Mango 1.20 NaN
7 Watermelon 1.20 2023-01-21
使用平均值填充后的数据:
Product Price Date_sold
0 Apple 1.00 NaN
1 Banana 0.50 2023-01-15
2 Cherry 1.19 2023-01-16
3 Date 0.75 2023-01-17
4 Fig 1.19 2023-01-18
5 Grape 2.50 2023-01-19
6 Mango 1.20 NaN
7 Watermelon 1.19 2023-01-21
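Note that newer Pandas versions deprecate fillna(method='ffill') in favor of the dedicated ffill() and bfill() methods; a small sketch of the modern spelling:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
forward = s.ffill()   # propagate the last valid value forward
backward = s.bfill()  # propagate the next valid value backward
print(forward.tolist(), backward.tolist())
```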
Remove duplicates
Remove duplicate rows:
df.drop_duplicates()
Data conversion:
df['column_name'] = df['column_name'].astype('new_type') # convert the data type
df['new_column'] = df['column1'] + df['column2']
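The two operations above can be combined in a short, runnable sketch (data is illustrative):

```python
import pandas as pd

# 年龄 is stored as strings and one row is a full duplicate
df = pd.DataFrame({"名字": ["张三", "李四", "张三"], "年龄": ["25", "32", "25"]})

df = df.drop_duplicates()                # remove fully duplicated rows
df["年龄"] = df["年龄"].astype("int64")  # convert the string column to integers
print(df)
```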
inplace parameter
The inplace parameter appears in many Pandas functions. It controls whether the original object is modified in place.
The meaning of True and False in inplace:
- inplace = True: Do not create a new object, directly modify the original object
- inplace = False: Modify the data, create and return a new object to carry the modification results
example:
# Note: these two lines are NOT equivalent
df.dropna(inplace=True) # modifies df itself and returns None
df_dropped = df.dropna() # returns a new object and leaves df unchanged
The inplace parameter defaults to False, i.e. a new object carrying the modification is created and returned while the original object remains unchanged.
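A common pitfall: with inplace=True the method returns None, so assigning its result loses the data. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0]})

dropped = df.dropna()             # default inplace=False: df untouched, new object returned
result = df.dropna(inplace=True)  # modifies df itself and returns None
print(len(dropped), len(df), result)
```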
Data merging, connection and management
Concatenation
The concat() function can help us concatenate two or more datasets.
Format:
pd.concat(objs, axis=0, join='outer', ignore_index=False,
 keys=None, levels=None, names=None, verify_integrity=False,
 sort=False, copy=True)
parameter:
- objs: the objects to concatenate
- axis: the axis to concatenate along, default is 0
- join: how to handle indexes on the other axis ('outer' or 'inner')
example:
import pandas as pd
# Initialize the DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
'B': ['B3', 'B4', 'B5']})
# Debug output
print(df1)
print(df2)
# Concatenate with concat
result = pd.concat([df1, df2])
print(result)
Debug output:
A B
0 A0 B0
1 A1 B1
2 A2 B2
A B
0 A3 B3
1 A4 B4
2 A5 B5
A B
0 A0 B0
1 A1 B1
2 A2 B2
0 A3 B3
1 A4 B4
2 A5 B5
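concat's other parameters change the shape of the result; a small sketch of ignore_index and axis=1 (data is illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
df2 = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]})

stacked = pd.concat([df1, df2], ignore_index=True)  # renumber rows 0..n-1
side = pd.concat([df1, df2], axis=1)                # place frames side by side
print(stacked)
print(side)
```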
The merge() function can also perform merge operations, similar to SQL's JOIN.
example:
import pandas as pd
# Initialize the DataFrames
left = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
'C': ['C0', 'C1', 'C2'],
'D': ['D0', 'D1', 'D2']})
# Debug output
print(left)
print(right)
# Join with merge
result = pd.merge(left, right, on='key')
print(result)
Output result:
key A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
key C D
0 K0 C0 D0
1 K1 C1 D1
2 K2 C2 D2
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
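Like SQL JOINs, merge supports different join types through the how parameter; a sketch with partially overlapping keys (data is illustrative):

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1"], "A": ["A0", "A1"]})
right = pd.DataFrame({"key": ["K1", "K2"], "C": ["C1", "C2"]})

inner = pd.merge(left, right, on="key")               # default inner join: only K1 matches
outer = pd.merge(left, right, on="key", how="outer")  # all keys, NaN where data is missing
print(inner)
print(outer)
```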
Grouping and aggregation
Use groupby() to group a DataFrame by column values and then apply an aggregate function to each group.
Common aggregate functions:
- sum(): sum
- mean(): mean
- median(): median
- min(): minimum
- max(): maximum
Example 1:
import pandas as pd
# Initialize the DataFrame
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
'B': [1, 2, 3, 4],
'C': [2.0, 3.0, 4.0, 5.0]})
print(df)
# Group and sum
grouped = df.groupby('A').sum()
print(grouped)
Debug output:
A B C
0 foo 1 2.0
1 bar 2 3.0
2 foo 3 4.0
3 bar 4 5.0
B C
A
bar 6 8.0
foo 4 6.0
Example 2, a data set on global employee salaries:
import pandas as pd
# Dataset
data = {
 'Country': ['USA', 'India', 'UK', 'USA', 'India', 'UK', 'USA', 'India'],
 'Employee': ['Sam', 'Amit', 'John', 'Alice', 'Alok', 'Bob', 'Charlie', 'Deepak'],
 'Salary': [70000, 45000, 60000, 80000, 50000, 55000, 85000, 65000]
}
# Initialize the DataFrame
df = pd.DataFrame(data)
print(df)
# Use groupby() to compute the average salary per country
salary_by_country = df.groupby("Country")["Salary"].mean()
print("平均薪水:", salary_by_country)
# Use groupby() to count employees per country
employee_by_country = df.groupby("Country")["Employee"].count()
print("员工数量: \n{}".format(employee_by_country))
# Multiple aggregations
result = df.groupby("Country")['Salary'].agg(['mean', 'median', 'sum', 'max', 'min'])
print(result)
Output result:
Country Employee Salary
0 USA Sam 70000
1 India Amit 45000
2 UK John 60000
3 USA Alice 80000
4 India Alok 50000
5 UK Bob 55000
6 USA Charlie 85000
7 India Deepak 65000
平均薪水: Country
India 53333.333333
UK 57500.000000
USA 78333.333333
Name: Salary, dtype: float64
员工数量:
Country
India 3
UK 2
USA 3
Name: Employee, dtype: int64
mean median sum max min
Country
India 53333.333333 50000 160000 65000 45000
UK 57500.000000 57500 115000 60000 55000
USA 78333.333333 80000 235000 85000 70000
Time series analysis
Pandas provides the datetime and timedelta types for processing time data.
example:
import pandas as pd
# Time series analysis
data = pd.to_datetime(['2023-01-01', '2023-02-01'])
print(data) # debug output
Output result:
DatetimeIndex(['2023-01-01', '2023-02-01'], dtype='datetime64[ns]', freq=None)
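Subtracting two datetimes yields a Timedelta, and adding a Timedelta shifts a date; a minimal sketch:

```python
import pandas as pd

start = pd.to_datetime("2023-01-01")
end = pd.to_datetime("2023-02-01")

delta = end - start                   # subtraction yields a Timedelta
later = start + pd.Timedelta(days=7)  # shift a date forward by 7 days
print(delta.days, later.date())
```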
Data visualization
We can do basic data visualization with Pandas.
In addition to data processing and analysis, Pandas provides simple and intuitive plotting methods:
- plot(): draws line charts, usually used for time series data
- hist(): draws a histogram to help observe the data distribution
- boxplot(): draws box plots, useful for observing statistics such as medians and quantiles
Use plot() to draw a line chart:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Generate sample data
date_rng = pd.date_range(start='2020-01-01', end='2020-12-31', freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0, 100, size=(len(date_rng)))
df.set_index('date', inplace=True)
df.plot(figsize=(10, 6))
plt.title('Time Series Data Visualization')
plt.ylabel('Random Data')
plt.show()
Output result: (a line chart of the random time series is displayed)
Use hist() to draw a histogram:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Generate sample data
date_rng = pd.date_range(start='2020-01-01', end='2020-12-31', freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0, 100, size=(len(date_rng)))
# Plot the histogram
df['data'].hist(bins=30, figsize=(10, 6))
plt.title('Histogram Data Visualization')
plt.xlabel('Random Data')
plt.ylabel('Frequency')
plt.show()
Output result: (a histogram is displayed)
Use boxplot() to draw a box plot:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Generate sample data
date_rng = pd.date_range(start='2020-01-01', end='2020-12-31', freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0, 100, size=(len(date_rng)))
# Plot the boxplot
df.boxplot(column='data', figsize=(6, 10))
plt.title('Boxplot Data Visualization')
plt.ylabel('Random Data')
plt.show()
Output result: (a box plot is displayed)
Summary
Pandas is an indispensable library for data analysis in Python. With Pandas, data scientists and analysts can easily read, process, analyze and visualize data.
First, we introduced the core data structures of Pandas - Series and DataFrame, which provide rich operations and functions for one-dimensional and two-dimensional data respectively. Next, we discuss in detail how to index, select, and modify data to make data processing simple and efficient.
In the data cleaning section, we learned various data cleaning techniques such as handling missing values, duplicate data, and string operations. These are key steps in the data preprocessing process and are essential for subsequent analysis and model building.
After that, we dived into aggregating, transforming, and filtering data. By grouping and summarizing data, we can gain interesting insights and statistics about the data.
Finally, we explored advanced features of Pandas such as data merging, reshaping, pivoting, and how to handle big data and performance optimization. These tips can help us better organize and optimize our code to make it more readable and efficient.
Overall, Pandas is a powerful, flexible, and efficient tool that you should learn and master in depth whether you are a data novice or an experienced analyst. I hope this blog can provide valuable reference and guidance for your data analysis journey. Continue to explore, learn and practice, and let the charm of data help you go further!
Practice
Dataset introduction
Titanic dataset. This dataset contains information about the passengers on the Titanic and whether they survived the sinking.
Data description:
- PassengerId: passenger number
- Survived: Whether the passenger survived (0 = No, 1 = Yes)
- Pclass: ticket category (1 = 1st, 2 = 2nd, 3 = 3rd)
- Name: Passenger’s name
- Sex: gender
- Age: age
- SibSp: Number of siblings and spouses on board
- Parch: Number of parents and children on board
- Ticket: ticket number
- Fare: fare
- Cabin: cabin number
- Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Data record
Download the Titanic dataset.
Use Pandas to read the downloaded CSV file.
Initial exploration of data
Initial exploration of data:
- Describe the size of the data set and the data type of the columns.
- See if there are any missing values in the data set and decide what to do with them.
Data filtering and manipulation
Data filtering and manipulation:
- Select the Pclass, Sex, and Survived columns for further analysis.
- Find all female passengers in 1st class.
- Create a new column IsChild to mark passengers younger than 18 years old.
Data statistics and aggregation
Data statistics and aggregation:
- Calculate the average survival rate for female passengers.
- Use groupby() on Pclass and calculate the average survival rate for each class.
- Create a pivot table showing survival rates by cabin class and gender.
Data cleaning
Data cleaning:
- To handle missing values in the Age column, consider filling in the mean age or using other methods.
- Find possible duplicate passengers based on name.
- Convert the Sex column to numeric type, such as: 0 is male, 1 is female.
Data visualization
Data visualization:
- Use Pandas to draw a bar chart of survival rates for different Pclasses.
- Draw a histogram of the Age column to observe the passenger age distribution.
- Draw a boxplot of the Fare column to observe the distribution of fares.
Reference answer
import pandas as pd
from matplotlib import pyplot as plt
# Read the data
data = pd.read_csv("train.csv")
print(data)
# Show basic information
print(data.info())
print(data.describe())
print(data.isnull().sum()) # print missing-value counts
# Fill missing values
data["Age"].fillna(data["Age"].mean(), inplace=True) # fill with the mean
data['Cabin'].fillna('Unknown', inplace=True)
print(data.isnull().sum()) # print missing-value counts again
# Data filtering and manipulation
selected_data = data[['Pclass', 'Sex', 'Survived']]
first_class_females = data[(data['Pclass'] == 1) & (data['Sex'] == 'female')]
# Mark passengers under 18
data['IsChild'] = data['Age'].apply(lambda x: 1 if x < 18 else 0)
# Statistics and aggregation
female_survival_rate = data[data['Sex'] == 'female']['Survived'].mean()
print(f"Female Survival Rate: {female_survival_rate}")
class_survival_rates = data.groupby('Pclass')['Survived'].mean()
print(class_survival_rates)
pivot_table = data.pivot_table('Survived', index='Sex', columns='Pclass')
print(pivot_table)
# Data cleaning
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1}) # convert the Sex column to numeric
# Data visualization
class_survival_rates.plot(kind='bar', title='Survival Rates by Pclass')
plt.ylabel('Survival Rate')
plt.show()
data['Age'].hist(bins=30, edgecolor='black')
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Number of Passengers')
plt.show()
data['Fare'].plot(kind='box')
plt.title('Fare Distribution')
plt.ylabel('Fare')
plt.show()
Output result:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
5 6 0 3 ... 8.4583 NaN Q
6 7 0 1 ... 51.8625 E46 S
7 8 0 3 ... 21.0750 NaN S
8 9 1 3 ... 11.1333 NaN S
9 10 1 2 ... 30.0708 NaN C
10 11 1 3 ... 16.7000 G6 S
11 12 1 1 ... 26.5500 C103 S
12 13 0 3 ... 8.0500 NaN S
13 14 0 3 ... 31.2750 NaN S
14 15 0 3 ... 7.8542 NaN S
15 16 1 2 ... 16.0000 NaN S
16 17 0 3 ... 29.1250 NaN Q
17 18 1 2 ... 13.0000 NaN S
18 19 0 3 ... 18.0000 NaN S
19 20 1 3 ... 7.2250 NaN C
20 21 0 2 ... 26.0000 NaN S
21 22 1 2 ... 13.0000 D56 S
22 23 1 3 ... 8.0292 NaN Q
23 24 1 1 ... 35.5000 A6 S
24 25 0 3 ... 21.0750 NaN S
25 26 1 3 ... 31.3875 NaN S
26 27 0 3 ... 7.2250 NaN C
27 28 0 1 ... 263.0000 C23 C25 C27 S
28 29 1 3 ... 7.8792 NaN Q
29 30 0 3 ... 7.8958 NaN S
.. ... ... ... ... ... ... ...
861 862 0 2 ... 11.5000 NaN S
862 863 1 1 ... 25.9292 D17 S
863 864 0 3 ... 69.5500 NaN S
864 865 0 2 ... 13.0000 NaN S
865 866 1 2 ... 13.0000 NaN S
866 867 1 2 ... 13.8583 NaN C
867 868 0 1 ... 50.4958 A24 S
868 869 0 3 ... 9.5000 NaN S
869 870 1 3 ... 11.1333 NaN S
870 871 0 3 ... 7.8958 NaN S
871 872 1 1 ... 52.5542 D35 S
872 873 0 1 ... 5.0000 B51 B53 B55 S
873 874 0 3 ... 9.0000 NaN S
874 875 1 2 ... 24.0000 NaN C
875 876 1 3 ... 7.2250 NaN C
876 877 0 3 ... 9.8458 NaN S
877 878 0 3 ... 7.8958 NaN S
878 879 0 3 ... 7.8958 NaN S
879 880 1 1 ... 83.1583 C50 C
880 881 1 2 ... 26.0000 NaN S
881 882 0 3 ... 7.8958 NaN S
882 883 0 3 ... 10.5167 NaN S
883 884 0 2 ... 10.5000 NaN S
884 885 0 3 ... 7.0500 NaN S
885 886 0 3 ... 29.1250 NaN Q
886 887 0 2 ... 13.0000 NaN S
887 888 1 1 ... 30.0000 B42 S
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
890 891 0 3 ... 7.7500 NaN Q
[891 rows x 12 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
PassengerId Survived Pclass ... SibSp Parch Fare
count 891.000000 891.000000 891.000000 ... 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 ... 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 ... 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 ... 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 ... 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 ... 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 ... 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 ... 8.000000 6.000000 512.329200
[8 rows x 7 columns]
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 0
Embarked 2
dtype: int64
Female Survival Rate: 0.7420382165605095
Pclass
1 0.629630
2 0.472826
3 0.242363
Name: Survived, dtype: float64
Pclass 1 2 3
Sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447