Introduction
If Hive is the dragon-slaying saber of big data, then pandas is the sword that helps us with data mining, data analysis, and data cleaning.
This article introduces basic pandas syntax and usage tips; it is worth bookmarking.
Table of Contents

1. Data preparation
2. DataFrame basic operations
   2.1 View
   2.2 Modify
   2.3 Filter
   2.4 Sort
   2.5 Deduplication
   2.6 Aggregation
   2.7 Joins
   2.8 Custom functions
   2.9 Index operations
   2.10 Null value handling
   2.11 to_csv: write a CSV file
3. Series basic operations
## 1. Data Preparation

Run the following script in your Python environment:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame([['A10','Sone',2,'20200801'],['A10','welsh',3,'20200801'],
                   ['A10','Sone',16,'20200801'],['A10','Albert',20,'20200802'],
                   ['A10','GG',32,'20200801'],['A20','Albert',42,'20200801'],
                   ['A20','welsh',10,'20200801'],['A20','welsh',15,'20200802'],
                   ['A10','Albert',20,'20200801'],['A20','Sone',np.NaN,'20200802'],
                   ['A20','welsh',15,'20200802'],['A20','Albert',10,'20200802'],
                   ['A10','Jojo',16,'20200802'],['A20','welsh',35,'20200803'],
                   ['A10','welsh',33,'20200803'],['A20','Sone',66,'20200803'],
                   ['A20','Jojo',15,'20200802'],['A10','Albert',53,'20200803'],
                   ['A10','Jojo',12,'20200803'],['A20','GG',35,'20200803'],
                   ['A20','J.K',30,'20200803']],
                  index=[x for x in range(21)],
                  columns=['site_id','user_name','pv','dt'])

site = pd.DataFrame([['A02','北京东直门'],['A10','北京朝阳门店'],
                     ['A06','北京六里桥店'],['A20','北京西黄村店']],
                    index=[x for x in range(4)],
                    columns=['site_id','site_name'])
```
Data preview:

site_id | user_name | pv | dt
---|---|---|---
A10 | Sone | 2 | 20200801
A10 | welsh | 3 | 20200801
A10 | Sone | 16 | 20200801
A10 | Albert | 20 | 20200802
A10 | GG | 32 | 20200801
A20 | Albert | 42 | 20200801
A20 | welsh | 10 | 20200801
A20 | welsh | 15 | 20200802
A10 | Albert | 20 | 20200801
A20 | Sone | NaN | 20200802
A20 | welsh | 15 | 20200802
A20 | Albert | 10 | 20200802
A10 | Jojo | 16 | 20200802
A20 | welsh | 35 | 20200803
A10 | welsh | 33 | 20200803
A20 | Sone | 66 | 20200803
A20 | Jojo | 15 | 20200802
A10 | Albert | 53 | 20200803
A10 | Jojo | 12 | 20200803
A20 | GG | 35 | 20200803
A20 | J.K | 30 | 20200803
Store preview:

site_id | site_name
---|---
A02 | Beijing Dongzhimen
A10 | Beijing Chaoyangmen Store
A06 | Beijing Liuliqiao Store
A20 | Beijing Xihuangcun Store
## 2. DataFrame Basic Operations

A pandas DataFrame is essentially a two-dimensional structure made up of columns, rows, and an index, similar to a MySQL table.
This section introduces basic syntax for viewing, modifying, filtering, sorting, aggregating, joining, and handling null values.
### 2.1 View

1. columns: get the column names

```python
df.columns
# Output:
Index(['site_id', 'user_name', 'pv', 'dt'], dtype='object')
```

2. index: get the index

```python
df.index
# Output:
RangeIndex(start=0, stop=21, step=1)
```

3. values: get the data

```python
df.values
# Output:
array([['A10', 'Sone', 2, '20200801'],
       ['A10', 'welsh', 3, '20200801'],
       ['A10', 'Sone', 16, '20200801'],
       ...
       ['A10', 'Jojo', 12, '20200803'],
       ['A20', 'GG', 35, '20200803'],
       ['A20', 'J.K', 30, '20200803']], dtype=object)
```

4. dtypes: view column types

```python
df.dtypes
# Output:
site_id      object
user_name    object
pv           object
dt           object
dtype: object
```

Note: when joining two tables, you often need to confirm that the two join fields have the same type. If they differ, convert with astype, for example: df["dt"] = df["dt"].astype("int64").

5. head: view the first rows

```python
df.head(2)  # show the first 2 rows
# Output:
  site_id user_name  pv        dt
0     A10      Sone   2  20200801
1     A10     welsh   3  20200801
```

6. df.xx / loc: view columns

```python
df.user_name  # single-column view
# Output:
0      Sone
1     welsh
2      Sone
...
18     Jojo
19       GG
20      J.K
Name: user_name, dtype: object

df.loc[:, ['user_name', 'pv']]  # multi-column view
# Output:
   user_name  pv
0       Sone   2
1      welsh   3
2       Sone  16
...
19        GG  35
20       J.K  30
```

7. iloc: view rows

```python
df.iloc[[0, 1, 8], ]  # show the rows with index 0, 1, and 8
# Output:
  site_id user_name  pv        dt
0     A10      Sone   2  20200801
1     A10     welsh   3  20200801
8     A10    Albert  20  20200801
```

8. shape: overall row and column counts

```python
df.shape  # 21 rows, 4 columns
# Output:
(21, 4)
```

9. count: count a single column

```python
df.pv.count()
# Output:
20
```

Note: the total returned by count() does not include NaN.
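The difference between count() and the raw row count matters whenever a column has missing values. A minimal sketch on a toy Series (values here are illustrative, not the article's dataset):

```python
import pandas as pd
import numpy as np

# a small series with one missing value
pv = pd.Series([2, 3, np.nan, 16])

print(len(pv))     # total rows, NaN included
print(pv.size)     # same as len()
print(pv.count())  # non-NaN values only
```

len() and .size both report 4, while count() reports 3, because count() skips NaN.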
### 2.2 Modify

1. rename: rename a column

```python
df.rename(columns={'pv': 'page_view'})
# Output:
   site_id user_name  page_view        dt
0      A10      Sone        2.0  20200801
1      A10     welsh        3.0  20200801
2      A10      Sone       16.0  20200801
...
19     A20        GG       35.0  20200803
20     A20       J.K       30.0  20200803
```

Note: to make the change take effect on the original table, re-assign it: df = df.rename(columns={'pv': 'page_view'}).

2. drop: remove a column

```python
df.drop(['dt'], axis=1)
# Output:
   site_id user_name    pv
0      A10      Sone   2.0
1      A10     welsh   3.0
2      A10      Sone  16.0
3      A10    Albert  20.0
...
19     A20        GG  35.0
20     A20       J.K  30.0
```

Note: likewise, re-assign to make it take effect: df = df.drop(['dt'], axis=1).

3. df['xx']: add a column

```python
df['copy_dt'] = df['dt']  # new column copy_dt, copied from the dt column
df
# Output:
   site_id user_name    pv        dt   copy_dt
0      A10      Sone   2.0  20200801  20200801
1      A10     welsh   3.0  20200801  20200801
2      A10      Sone  16.0  20200801  20200801
...
19     A20        GG  35.0  20200803  20200803
20     A20       J.K  30.0  20200803  20200803
```
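Instead of re-assigning, rename and drop also accept inplace=True to modify the table directly. A minimal sketch on a toy DataFrame (columns here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'pv': [2, 3], 'dt': ['20200801', '20200801']})

# rename/drop return new DataFrames by default;
# inplace=True mutates df instead of returning a copy
df.rename(columns={'pv': 'page_view'}, inplace=True)
df.drop(['dt'], axis=1, inplace=True)

print(df.columns.tolist())  # ['page_view']
```

Whether to prefer inplace=True or re-assignment is a style choice; re-assignment chains more cleanly.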
### 2.3 Filter

1. df[xx > x]: single-condition filter

```python
df[df.pv > 30]  # rows where pv is greater than 30
# Output:
   site_id user_name    pv        dt
4      A10        GG  32.0  20200801
5      A20    Albert  42.0  20200801
13     A20     welsh  35.0  20200803
14     A10     welsh  33.0  20200803
15     A20      Sone  66.0  20200803
17     A10    Albert  53.0  20200803
19     A20        GG  35.0  20200803
```

2. df[(xx > x) & (yy == y)]: multi-condition filter

```python
df["dt"] = df["dt"].astype("int64")     # cast dt to int64 first
df[(df.pv > 30) & (df.dt == 20200801)]  # pv > 30 and dt is 2020-08-01
# Output:
  site_id user_name    pv        dt
4     A10        GG  32.0  20200801
5     A20    Albert  42.0  20200801
```
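Two related filtering tools worth knowing are isin() for membership tests and query() for expressing multi-condition filters as a string. A minimal sketch on toy data (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'site_id': ['A10', 'A20', 'A10'],
                   'pv': [32, 42, 5],
                   'dt': [20200801, 20200801, 20200802]})

# isin() keeps rows whose value appears in the given list
subset = df[df.site_id.isin(['A10'])]

# query() expresses the same kind of multi-condition filter as a string
high = df.query('pv > 30 and dt == 20200801')
print(high)
```

query() is often more readable than chained boolean masks once conditions pile up.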
### 2.4 Sort

1. sort_values: sort by values

```python
df.sort_values(by=["pv"], ascending=False)  # pv descending
# Output:
   site_id user_name    pv        dt
15     A20      Sone  66.0  20200803
17     A10    Albert  53.0  20200803
5      A20    Albert  42.0  20200801
19     A20        GG  35.0  20200803
...
1      A10     welsh   3.0  20200801
0      A10      Sone   2.0  20200801
9      A20      Sone   NaN  20200802

df.sort_values(by=["pv"], ascending=True)  # pv ascending
# Output:
   site_id user_name    pv        dt
0      A10      Sone   2.0  20200801
1      A10     welsh   3.0  20200801
11     A20    Albert  10.0  20200802
6      A20     welsh  10.0  20200801
...
17     A10    Albert  53.0  20200803
15     A20      Sone  66.0  20200803
9      A20      Sone   NaN  20200802
```

Note: rows with a null pv sort last whether the order is ascending or descending, **so handle null values before sorting**.
2. sort_index: sort by index

```python
df = df.sort_index(axis=0)
# Output:
   site_id user_name    pv        dt
0      A10      Sone   2.0  20200801
1      A10     welsh   3.0  20200801
2      A10      Sone  16.0  20200801
...
19     A20        GG  35.0  20200803
20     A20       J.K  30.0  20200803
```

Note: after aggregation the index can end up out of order, so index-based sorting is useful in those cases.
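If you only need to control where the nulls land rather than fill them first, sort_values accepts na_position. A minimal sketch on toy data (values are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'pv': [2.0, np.nan, 66.0]})

# na_position='first' pushes NaN rows to the top instead of the default bottom
out = df.sort_values(by=['pv'], ascending=False, na_position='first')
print(out)
```

The NaN row appears first, followed by 66.0 and 2.0 in descending order.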
### 2.5 Deduplication

1. nunique: count distinct values in a column

```python
df.groupby('site_id').agg({'user_name': pd.Series.nunique})  # 5 users under A10, 6 under A20
# Output:
         user_name
site_id
A10              5
A20              6
```
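nunique counts distinct values; to actually remove duplicate rows, drop_duplicates is the usual tool. A minimal sketch on toy data (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'site_id': ['A10', 'A10', 'A20'],
                   'user_name': ['Sone', 'Sone', 'Sone']})

# drop_duplicates removes repeated rows;
# subset= limits which columns are compared
deduped = df.drop_duplicates(subset=['site_id', 'user_name'])
print(len(deduped))  # 2
```

By default the first occurrence of each duplicate group is kept; pass keep='last' to keep the last instead.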
### 2.6 Aggregation

1. groupby('xx'): aggregate on a single column

```python
df.groupby('site_id').count()
# Output:
         user_name  pv  dt
site_id
A10             10  10  10
A20             11  10  11

df.groupby('site_id').min()
# Output:
        user_name    pv        dt
site_id
A10        Albert   2.0  20200801
A20        Albert  10.0  20200801

df.groupby('site_id').max()
# Output:
        user_name    pv        dt
site_id
A10         welsh  53.0  20200803
A20         welsh  66.0  20200803
```

Note: supported aggregate functions include count() | min() | max() | mean() | sum() | std() | var(); they operate on non-NaN data.
2. groupby(['xx','yy']).agg: aggregate on multiple columns

```python
df.groupby(['site_id','user_name']).agg({'pv': 'sum', 'dt': 'count'})
# Output:
                     pv  dt
site_id user_name
A10     Albert     93.0   3
        GG         32.0   1
        Jojo       28.0   2
        Sone       18.0   2
        welsh      36.0   2
A20     Albert     52.0   2
        GG         35.0   1
        J.K        30.0   1
        Jojo       15.0   1
        Sone       66.0   2
        welsh      75.0   4
```
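A nicety on top of the dict-style agg: named aggregation (pandas >= 0.25) lets you choose the output column names directly. A minimal sketch on toy data (values and the names total_pv / days are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'site_id': ['A10', 'A10', 'A20'],
                   'pv': [2, 16, 66],
                   'dt': ['20200801', '20200801', '20200803']})

# each keyword becomes an output column: name=(source_column, agg_function)
agg = df.groupby('site_id').agg(total_pv=('pv', 'sum'),
                                days=('dt', 'nunique'))
print(agg)
```

This avoids the awkward multi-level column names that dict-style agg can produce with multiple functions per column.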
### 2.7 Joins

1. merge: join on a field

```python
df = pd.merge(df, site, how='inner', on='site_id')
# Output:
   site_id user_name    pv        dt site_name
0      A10      Sone   2.0  20200801    北京朝阳门店
1      A10     welsh   3.0  20200801    北京朝阳门店
...
19     A20        GG  35.0  20200803    北京西黄村店
20     A20       J.K  30.0  20200803    北京西黄村店
```
2. left_index: join on the index

```python
df = df.groupby("site_id").count()
df = pd.merge(df, site, how='inner', left_index=True, right_on="site_id")
# Output:
   user_name  pv  dt site_id site_name
1         10  10  10     A10    北京朝阳门店
3         11  10  11     A20    北京西黄村店
```

Note: after table A is aggregated by site_id, that field becomes A's index; the merge then matches A's index against B's site_id column, bringing in site_name.
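Besides how='inner', merge supports left/right/outer joins, and indicator=True adds a _merge column showing where each row came from, which is handy for spotting unmatched keys. A minimal sketch on toy data (site A30 is a made-up unmatched key):

```python
import pandas as pd

df = pd.DataFrame({'site_id': ['A10', 'A30'], 'pv': [2, 5]})
site = pd.DataFrame({'site_id': ['A10', 'A20'],
                     'site_name': ['Chaoyangmen', 'Xihuangcun']})

# how='left' keeps every row of the left table; unmatched site_name becomes NaN
# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
out = pd.merge(df, site, how='left', on='site_id', indicator=True)
print(out)
```

Filtering on _merge == 'left_only' is a quick way to audit which keys failed to join.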
### 2.8 Custom Functions

1. For example, to generate a new column by concatenating the pv and dt fields, use apply with a lambda:

```python
df['pv'] = df['pv'].astype("str")  # cast pv to str
df['dt'] = df['dt'].astype("str")  # cast dt to str
df['pv_dt'] = df.apply(lambda r: (r['pv'] + "_" + r['dt']), axis=1)  # concatenate pv and dt
# Output:
   site_id user_name    pv        dt          pv_dt
0      A10      Sone   2.0  20200801   2.0_20200801
1      A10     welsh   3.0  20200801   3.0_20200801
2      A10      Sone  16.0  20200801  16.0_20200801
...
18     A10      Jojo  12.0  20200803  12.0_20200803
19     A20        GG  35.0  20200803  35.0_20200803
20     A20       J.K  30.0  20200803  30.0_20200803
```
2. Method two: a named custom function. With axis=1, apply passes each row in as a Series, so the function receives and returns a row:

```python
def str_split(row: pd.Series):
    row['pv_dt'] = row['pv'] + "_" + row['dt']
    return row

df = df.apply(str_split, axis=1)
# Output:
   site_id user_name    pv        dt          pv_dt
0      A10      Sone   2.0  20200801   2.0_20200801
1      A10     welsh   3.0  20200801   3.0_20200801
2      A10      Sone  16.0  20200801  16.0_20200801
...
18     A10      Jojo  12.0  20200803  12.0_20200803
19     A20        GG  35.0  20200803  35.0_20200803
20     A20       J.K  30.0  20200803  30.0_20200803
```
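For simple string concatenation, apply is not actually needed: pandas string columns support vectorized operators, which avoid the per-row Python overhead. A minimal sketch on toy data (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'pv': [2.0, 3.0], 'dt': ['20200801', '20200801']})

# vectorized string concatenation: no apply, no per-row lambda
df['pv_dt'] = df['pv'].astype(str) + '_' + df['dt'].astype(str)
print(df['pv_dt'].tolist())  # ['2.0_20200801', '3.0_20200801']
```

Reserve apply for logic that genuinely needs a Python function per row; vectorized expressions are both shorter and faster.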
### 2.9 Index Operations

1. reset_index: rebuild the index, typically for aggregated data whose index needs reordering

```python
df = df.groupby("user_name").count()  # the index is now user_name
# Output:
           site_id  pv  dt
user_name
Albert           5   5   5
GG               2   2   2
J.K              1   1   1
Jojo             3   3   3
Sone             4   3   4
welsh            6   6   6

df.reset_index('user_name')
# Output:
  user_name  site_id  pv  dt    # index rebuilt as 0..5
0    Albert        5   5   5
1        GG        2   2   2
2       J.K        1   1   1
3      Jojo        3   3   3
4      Sone        4   3   4
5     welsh        6   6   6
```
2. set_index: designate a column as the index

```python
df.set_index("site_id")
# Output:
        user_name    pv        dt
site_id
A10          Sone   2.0  20200801
A10         welsh   3.0  20200801
A10          Sone  16.0  20200801
...
A20          Jojo  15.0  20200802
A10        Albert  53.0  20200803
A10          Jojo  12.0  20200803
```
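A related idiom: after filtering, the surviving rows keep their old index labels; reset_index(drop=True) renumbers from 0 and discards the old index rather than turning it into a column. A minimal sketch on toy data (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'pv': [2, 42, 66]})

# the filter keeps index labels 1 and 2; drop=True renumbers them 0, 1
filtered = df[df.pv > 30].reset_index(drop=True)
print(filtered.index.tolist())  # [0, 1]
```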
### 2.10 Null Value Handling

1. isnull(): null statistics. True means the column contains nulls, False means it does not. Combine with any() to see which columns have nulls, and with sum() to count the nulls per column:

```python
df.isnull().any()
# Output:
site_id      False
user_name    False
pv            True
dt           False
dtype: bool

df.isnull().sum()
# Output:
site_id      0
user_name    0
pv           1
dt           0
dtype: int64
```
2. notnull(): non-null statistics. True means the column contains non-null values, False means the column is entirely null.

```python
df.notnull().any()
# Output:
site_id      True
user_name    True
pv           True
dt           True
dtype: bool
```
3. fillna: fill null values; here Sone's null pv is filled with 0

```python
df['pv'] = df.pv.fillna(0)
df
# Output:
   site_id user_name    pv        dt
0      A10      Sone   2.0  20200801
1      A10     welsh   3.0  20200801
...
9      A20      Sone   0.0  20200802
...
20     A20       J.K  30.0  20200803
```
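The other common treatment is dropping rows with nulls instead of filling them; dropna() does this, and subset= limits the check to specific columns. A minimal sketch on toy data (values are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'user_name': ['Sone', 'welsh'],
                   'pv': [np.nan, 3.0]})

# drop rows whose pv is NaN; other columns are not checked
clean = df.dropna(subset=['pv'])
print(len(clean))  # 1
```

Whether to fill or drop depends on the analysis: filling with 0 keeps the row but skews averages, while dropping loses the other fields of that row.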
### 2.11 to_csv: Write a CSV File

```python
df.to_csv("pv.csv")
```
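By default to_csv also writes the index as an extra column; index=False skips it, and read_csv reads the file back. A minimal sketch using a temporary path (the toy data and file name are illustrative):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'site_id': ['A10', 'A20'], 'pv': [2, 66]})

# write without the index column, then round-trip the file
path = os.path.join(tempfile.mkdtemp(), 'pv.csv')
df.to_csv(path, index=False, encoding='utf-8')
back = pd.read_csv(path)

print(back.equals(df))  # True
```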
## 3. Series Basic Operations

A pandas Series is essentially a one-dimensional array composed of values and an index, similar to a single-column MySQL table. The same kinds of operations apply: viewing, statistics, filtering, sorting, aggregation, and null handling.
### 3.1 View

1. head: view the first rows

```python
user_name = df['user_name']
user_name.head(2)
# Output:
0     Sone
1    welsh
Name: user_name, dtype: object
```
### 3.2 Statistics

1. shape: row count

```python
user_name = df['user_name']
user_name.shape
# Output:
(21,)
```
### 3.3 Filter

1. s[s == 'x']: filter by value

```python
user_name = df['user_name']
user_name[user_name == 'Sone']
# Output:
0     Sone
2     Sone
9     Sone
15    Sone
Name: user_name, dtype: object
```
### 3.4 Sort

1. sort_values

```python
user_name = df['user_name']
user_name.sort_values()
# Output:
17    Albert
3     Albert
5     Albert
8     Albert
...
13     welsh
14     welsh
7      welsh
6      welsh
1      welsh
10     welsh
Name: user_name, dtype: object
```
### 3.5 Aggregation

1. count

```python
user_name = df['user_name']
user_name.count()
# Output:
21
```
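Beyond a plain count, value_counts() is the workhorse Series aggregation: it tallies each distinct value, most frequent first. A minimal sketch on a toy Series (values are illustrative):

```python
import pandas as pd

user_name = pd.Series(['Sone', 'welsh', 'Sone', 'Albert'])

# per-value frequencies, sorted descending
counts = user_name.value_counts()
print(counts['Sone'])  # 2
```

nunique() on the same Series gives the number of distinct values (3 here).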
### 3.6 Null Value Handling

1. isnull(): null statistics

```python
pv = df['pv']
pv.isnull().sum()
# Output:
1
```
2. fillna(0): fill null values

```python
pv = df['pv']
pv = pv.fillna(0)
# Output:
0      2.0
...
9      0.0
...
20    30.0
Name: pv, dtype: float64
```
Follow my WeChat public account [Data Ape Wen Da]
Get the Chinese version of pandas official documentation