Introduction
Background
pandas is a Python library for data analysis. Development began at AQR Capital Management in April 2008, and the library was open-sourced in 2009; it was originally built as a tool for financial data analysis. The name pandas comes from "panel data" and "Python data analysis". It is well suited to data analysis in fields such as finance and statistics.
Features: two data structures, Series and DataFrame
(1) Series: one-dimensional data (column + index)
pandas.Series(['东汉', '马腾', '?', 212], index=['国家', '姓名', '出生年份', '逝世年份'])
(2) DataFrame: two-dimensional data (table: multiple columns + row/column index)
pandas.DataFrame([
['东汉', 300],
['魏国', 800],
['蜀国', 400],
['吴国', 600],
['西晋', 1000]
], columns=['国家', '国力'])
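As a minimal sketch of how the two structures relate, the snippet below rebuilds a Series and a shortened DataFrame like the ones above and inspects them. The variable names `person` and `power` are illustrative assumptions, not part of the original.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
person = pd.Series(['东汉', '马腾', '?', 212],
                   index=['国家', '姓名', '出生年份', '逝世年份'])

# A DataFrame: a table with a row index and column labels
power = pd.DataFrame([
    ['东汉', 300],
    ['魏国', 800],
], columns=['国家', '国力'])

name = person['姓名']     # look up a Series value by its label
shape = power.shape       # (number of rows, number of columns)
column = power['国力']    # selecting one column returns a Series

print(name, shape, type(column).__name__)
```

Selecting a single column of a DataFrame gives back a Series, which is why most column operations later in this article are really Series operations.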
Install
If you are using Anaconda, the Python distribution for data science, you can install pandas with conda:
conda install pandas
In a regular Python environment, you can install it with pip:
pip install pandas
Hands-on practice
Let's first look at the data, which is stored in the file sanguo.csv:
$ head sanguo.csv
(1) Import the module
import pandas as pd
(2) Read the CSV data
# Read sanguo.csv from the current directory; na_values lists the strings to treat as missing
df = pd.read_csv('./sanguo.csv', na_values=['na', '-', 'N/A', '?'])
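To see what na_values does without needing the real file, here is a small sketch that reads CSV text from memory. The two sample rows (including the values in the character column) are hypothetical stand-ins; only the column names come from the article.

```python
import io
import pandas as pd

# Hypothetical sample rows; the real sanguo.csv may differ
csv_text = """kindom,name,birth,die,character
东汉,马腾,?,212,武将
魏国,曹操,155,220,主公
"""

sample = pd.read_csv(io.StringIO(csv_text), na_values=['na', '-', 'N/A', '?'])

# The '?' in the birth column is parsed as NaN, so the column becomes float
print(sample['birth'].isna().tolist())
print(sample['birth'].dtype)
```

Because NaN is a floating-point value, a column that would otherwise hold integers is read as float64 once it contains a missing entry.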
1) View data
# View the first 5 rows
df.head(5)
# NaN marks a missing value
# View the last 5 rows
df.tail(5)
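2) View data overview
# View the data type of each column
df.dtypes
df.info()
# 25 rows, 5 columns
# For each column (kindom, name, birth, die, character): name, non-null count, data type
df.describe()
# Summary statistics for numeric columns: count, mean, standard deviation, minimum, 25%/50%/75% quantiles, maximum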
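The following sketch illustrates two details of describe() on a tiny, made-up frame: by default it only summarizes numeric columns, and its count ignores missing values. The frame `sample` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one text column and one missing birth year
sample = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'birth': [155.0, np.nan, 181.0],
    'die': [220, 212, 234],
})

summary = sample.describe()            # numeric columns only by default
print(list(summary.columns))           # the text column 'name' is left out
print(summary.loc['count', 'birth'])   # count skips the NaN entry
```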
3) Data operations
Set column names
df.columns = ['国家', '姓名', '出生年份', '逝世年份', '角色']
df.head()
Add a new column
# Compute age
df['年龄'] = df['逝世年份'] - df['出生年份']
df.head(10)
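Column arithmetic works element by element, and a missing operand produces a missing result. A minimal sketch, assuming a two-row frame with one unknown birth year:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    '姓名': ['曹操', '马腾'],
    '出生年份': [155.0, np.nan],   # NaN stands for an unknown birth year
    '逝世年份': [220, 212],
})

# Subtraction is element-wise; a missing operand yields NaN
sample['年龄'] = sample['逝世年份'] - sample['出生年份']
print(sample['年龄'].tolist())
```

Rows with an unknown birth year therefore end up with NaN in the new age column rather than raising an error.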
Calculate a column's mean, median, mode, and maximum/minimum value
Statistic | Function | Result |
---|---|---|
Mean | df['年龄'].mean() | 50.57142857142857 |
Median | df['年龄'].median() | 53.0 |
Mode | df['年龄'].mode() | 72.0 |
Maximum | df['年龄'].max() | 72.0 |
Minimum | df['年龄'].min() | 12.0 |
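Two behaviours of these statistics calls are easy to miss: missing values are skipped by default, and mode() returns a Series rather than a single number because several values can tie. The sketch below uses hypothetical ages, not the article's data.

```python
import numpy as np
import pandas as pd

ages = pd.Series([12.0, 53.0, 72.0, 72.0, np.nan])  # hypothetical values

mean_age = ages.mean()        # NaN is skipped by default (skipna=True)
median_age = ages.median()
modes = ages.mode()           # a Series, since several values can tie
most_common = modes.iloc[0]   # take the first mode as a scalar

print(mean_age, median_age, most_common)
```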
Filter rows by column value
# Select rows where the age is less than 50
df[df['年龄'] < 50]
# Select rows where the name starts with the surname 曹 (Cao)
df[df['姓名'].str.startswith('曹')]
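Filters can be combined, and string filters need some care when a column contains missing values. The following is a sketch on made-up rows; the `na=False` argument of `str.startswith` tells pandas to treat a missing name as a non-match.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    '姓名': ['曹操', '曹丕', '马腾', np.nan],
    '年龄': [65.0, 39.0, np.nan, 40.0],
})

# na=False treats a missing name as "no match" instead of NaN
is_cao = sample['姓名'].str.startswith('曹', na=False)
is_young = sample['年龄'] < 50      # comparisons with NaN are False

# Combine boolean masks with & (and), | (or), ~ (not)
young_cao = sample[is_cao & is_young]
print(young_cao['姓名'].tolist())
```

Note that Python's `and`/`or` keywords do not work on boolean Series; the element-wise operators `&` and `|` are needed, with parentheses around inline comparisons.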
Group
df.groupby('国家')['姓名'].count()
# Similar to SQL: SELECT 国家, COUNT(姓名) FROM x GROUP BY 国家
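Beyond counting, a grouped column can be summarized with several aggregations at once through agg(). A minimal sketch, assuming three illustrative rows:

```python
import pandas as pd

sample = pd.DataFrame({
    '国家': ['魏国', '魏国', '蜀国'],
    '姓名': ['曹操', '曹丕', '刘备'],
    '年龄': [65.0, 39.0, 62.0],
})

# Number of non-null names per kingdom
counts = sample.groupby('国家')['姓名'].count()

# Several aggregations of the age column per kingdom
stats = sample.groupby('国家')['年龄'].agg(['mean', 'max'])

print(counts)
print(stats)
```

Like SQL's COUNT(column), count() ignores missing values; size() would count every row in each group.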
Apply a function
df['状态'] = df['年龄'].apply(lambda x: '长寿' if isinstance(x, (int, float)) and x > 50 else '一般')
df.head()
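The apply call above runs a Python function once per value. For a simple threshold like this, a vectorized alternative with numpy.where is one possible option; the sketch below compares the two on hypothetical ages. It is an alternative shown for illustration, not the article's method.

```python
import numpy as np
import pandas as pd

ages = pd.Series([65.0, 39.0, np.nan], name='年龄')

# Element-by-element version, as in the text above
by_apply = ages.apply(
    lambda x: '长寿' if isinstance(x, (int, float)) and x > 50 else '一般'
)

# Vectorized version: NaN > 50 evaluates to False, so NaN maps to '一般'
by_where = pd.Series(np.where(ages > 50, '长寿', '一般'), index=ages.index)

print(by_apply.tolist())
print(by_where.tolist())
```

Both versions label a missing age as '一般', because any comparison with NaN is False.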
Fetch data: loc, iloc
Expression | Description |
---|---|
df.loc[4] | Take row 5 (the index starts from 0) |
df.loc[4:5] | Take rows 5 to 6 |
df.loc[4, '姓名'] or df.iloc[4, 1] | Take the name column of row 5, or column 2 of row 5 |
df.loc[4, ['姓名', '年龄']] or df.iloc[4, [1, 5]] | Take the name and age columns of row 5, or columns 2 and 6 of row 5 |
df.loc[4:5, ['姓名', '年龄']] or df.iloc[[4, 5], [1, 5]] or df.iloc[4:6, [1, 5]] | Take the name and age columns of rows 5 to 6, or columns 2 and 6 of rows 5 to 6 |
df.iloc[4:9, 1:4] | Take columns 2 to 4 of rows 5 to 9 (iloc excludes the end position) |
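The key difference between the two accessors is that loc selects by label and includes both ends of a slice, while iloc selects by position and excludes the end. The sketch below demonstrates this on a made-up seven-row frame, and shows how labels and positions drift apart after filtering.

```python
import pandas as pd

sample = pd.DataFrame({'姓名': list('abcdefg'), '年龄': range(7)})

by_label = sample.loc[4:5]       # label slice: both ends included
by_position = sample.iloc[4:6]   # position slice: end excluded

# After filtering, labels and positions no longer line up
odd = sample[sample['年龄'] % 2 == 1]     # keeps the rows labelled 1, 3, 5
first_by_position = odd.iloc[0]['姓名']   # first remaining row
row_by_label = odd.loc[3, '姓名']         # the row whose label is 3

print(by_label.index.tolist(), by_position.index.tolist())
print(first_by_position, row_by_label)
```

On a freshly loaded frame with the default 0..n-1 index the two accessors look interchangeable, which is why the examples above pair them; after filtering or sorting they address different rows.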
Append and merge data
concat
# Create a Series holding the new row
newpeople = pd.Series(['东汉', '马腾', '?', 212, '?'], index=['国家', '姓名', '出生年份', '逝世年份', '年龄'])
# Convert the Series to a DataFrame and transpose it (column to row)
newpeople = newpeople.to_frame().T
# Append the row (axis=0) and rebuild the index (ignore_index=True)
df2 = pd.concat([df, newpeople], axis=0, ignore_index=True)
df2.tail()
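As an alternative sketch, the new row can also be built directly as a one-row DataFrame from a dict, which keeps numeric columns numeric instead of going through a transposed Series. The frames `base` and `new_row` below are illustrative assumptions.

```python
import pandas as pd

base = pd.DataFrame({'国家': ['魏国'], '姓名': ['曹操'], '逝世年份': [220]})

# A one-row DataFrame built from a list containing one dict
new_row = pd.DataFrame([{'国家': '东汉', '姓名': '马腾', '逝世年份': 212}])

# Stack the rows and renumber the index from 0
combined = pd.concat([base, new_row], axis=0, ignore_index=True)

print(combined)
print(combined['逝世年份'].dtype)
```

Without ignore_index=True, the appended row would keep its own label 0 and the result would contain a duplicate index value.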
merge
# Create a lookup table of kingdom strength
kindom_power = pd.DataFrame([
['东汉', 300],
['魏国', 800],
['蜀国', 400],
['吴国', 600],
['西晋', 1000]
], columns=['国家', '国力'])
# Merge the two tables (left: df, right: kindom_power) on the 国家 column
df3 = pd.merge(left=df, right=kindom_power, on='国家')
df3.head(10)
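By default pd.merge performs an inner join, so rows whose key has no match in the other table are dropped. The sketch below contrasts this with how='left' on made-up data; the kingdom label '群雄' is a hypothetical value that is deliberately missing from the right-hand table.

```python
import pandas as pd

people = pd.DataFrame({'国家': ['魏国', '蜀国', '群雄'],
                       '姓名': ['曹操', '刘备', '吕布']})
power = pd.DataFrame({'国家': ['魏国', '蜀国'], '国力': [800, 400]})

# Default how='inner': only keys present in both tables survive
inner = pd.merge(left=people, right=power, on='国家')

# how='left': keep every row of the left table, fill gaps with NaN
left = pd.merge(left=people, right=power, on='国家', how='left')

print(len(inner), len(left))
print(left)
```

If the merged result has fewer rows than the original frame, unmatched keys under the default inner join are a likely cause.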
4) Export data
# Write to sanguo_result.csv without the index column
df.to_csv('sanguo_result.csv', index=False)
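To check what index=False produces without touching the file system, the following sketch writes to an in-memory buffer and reads the text back. The one-row frame is an illustrative assumption.

```python
import io
import pandas as pd

sample = pd.DataFrame({'国家': ['魏国'], '姓名': ['曹操']})

buffer = io.StringIO()                # in-memory stand-in for a file
sample.to_csv(buffer, index=False)    # no index column is written
csv_text = buffer.getvalue()

# Reading the text back gives the same columns and values
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
```

With the default index=True, the output would gain an unnamed first column holding 0, 1, 2, ..., which then shows up as an extra column when the file is read again.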
References
- https://pandas.pydata.org/
- https://www.runoob.com/pandas/pandas-tutorial.html
- https://github.com/xchenhao/code-notes/blob/master/data/sanguo.csv (sanguo.csv data)