Pandas in practice: analyzing the characters of the Three Kingdoms

Introduction

Background

Pandas is a Python library for data analysis. Development began at AQR Capital Management in April 2008 and the library was open sourced in 2009; it was originally built as a tool for financial data analysis. The name Pandas comes from "panel data" and "Python data analysis". It is well suited to data analysis in fields such as finance and statistics.

Features: two data structures

Series and DataFrame
(1) Series: one-dimensional data (column + index)

pandas.Series(['东汉', '马腾', '?', 212], index=['国家', '姓名', '出生年份', '逝世年份'])
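As a minimal sketch of how the index works, the same Series can be read either by label or by position. The variable names here (`person`, `name_by_label`, `death_by_position`) are only illustrative.

```python
import pandas as pd

# Same Series as above; `person` is an illustrative variable name
person = pd.Series(['东汉', '马腾', '?', 212],
                   index=['国家', '姓名', '出生年份', '逝世年份'])

name_by_label = person['姓名']       # look up a value by its index label
death_by_position = person.iloc[3]   # look up a value by integer position

print(name_by_label)
print(death_by_position)
print(person.index.tolist())         # the index labels as a plain list
```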

(2) DataFrame: two-dimensional data (table: multiple columns + row/column index)

pandas.DataFrame([
    ['东汉', 300],
    ['魏国', 800],
    ['蜀国', 400],
    ['吴国', 600],
    ['西晋', 1000]
], columns=['国家', '国力'])
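To make the relationship between the two structures concrete, the sketch below builds the same table and shows that selecting one column gives back a Series. The names `power` and `strength` are illustrative assumptions.

```python
import pandas as pd

# Same table as above; `power` is an illustrative variable name
power = pd.DataFrame([
    ['东汉', 300],
    ['魏国', 800],
    ['蜀国', 400],
    ['吴国', 600],
    ['西晋', 1000]
], columns=['国家', '国力'])

print(power.shape)                # (number of rows, number of columns)
strength = power['国力']          # selecting one column returns a Series
print(type(strength).__name__)
print(strength.sum())             # arithmetic over the whole column
```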

Install

If you are using Anaconda, the Python distribution for data science, you can install it with conda:

conda install pandas

In a regular Python environment, you can install it with pip:

pip install pandas
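One simple way to check that the installation worked is to import the library and print its version from Python:

```python
import pandas as pd

# Print the installed version to confirm that the import works
print(pd.__version__)
```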

Hands-on

Let's first look at the data, which is stored in the file sanguo.csv:

$ head sanguo.csv

(1) Import the module

import pandas as pd

(2) Read the CSV data

# sanguo.csv in the current directory; na_values lists the values to treat as missing
df = pd.read_csv('./sanguo.csv', na_values=['na', '-', 'N/A', '?'])
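To see what `na_values` changes, here is a small sketch that parses the same text with and without it. The rows are hypothetical stand-ins in the same column layout, not the real contents of sanguo.csv, and `io.StringIO` is used only so the example runs without the file. The names `csv_text`, `raw`, and `sample` are illustrative.

```python
import io
import pandas as pd

# Hypothetical rows in the same column layout as sanguo.csv (not the real file)
csv_text = (
    "kindom,name,birth,die,character\n"
    "蜀国,关羽,?,220,武将\n"
    "魏国,曹操,155,220,君主\n"
    "吴国,周瑜,175,210,-\n"
)

# Without na_values, the '?' keeps the birth column from being parsed as numbers
raw = pd.read_csv(io.StringIO(csv_text))
print(pd.api.types.is_numeric_dtype(raw['birth']))

# With na_values, '?' and '-' become NaN and birth is parsed as a float column
sample = pd.read_csv(io.StringIO(csv_text), na_values=['na', '-', 'N/A', '?'])
print(pd.api.types.is_numeric_dtype(sample['birth']))
print(sample.isna().sum())
```

Parsing the placeholder markers as NaN up front is what allows later arithmetic on the year columns.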

1) View data

# Show the first 5 rows
df.head(5)
# NaN marks a missing value

# Show the last 5 rows
df.tail(5)

2) View data overview

df.dtypes
# Show the data type of each column

df.info()
# 25 rows, 5 columns
# Shows each column's name (kindom, name, birth, die, character), non-null count, and data type

df.describe()
# Summary statistics for numeric columns: count, mean, standard deviation, minimum, 25%/50%/75% quantiles, maximum

3) Data operations
Set column names

df.columns = ['国家', '姓名', '出生年份', '逝世年份', '角色']
df.head()

Add a new column

# Compute the age
df['年龄'] = df['逝世年份'] - df['出生年份']
df.head(10)
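The subtraction is applied element by element, so a missing birth year produces a missing age. A minimal sketch with illustrative rows (not the real file; `people` is a made-up variable name):

```python
import numpy as np
import pandas as pd

# Illustrative rows (not the real file); the unknown birth year is already NaN
people = pd.DataFrame({
    '姓名': ['曹操', '关羽', '周瑜'],
    '出生年份': [155, np.nan, 175],
    '逝世年份': [220, 220, 210],
})

# Element-wise subtraction; a NaN operand yields a NaN result
people['年龄'] = people['逝世年份'] - people['出生年份']
print(people)
```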

Calculate the column mean, median, mode, and maximum/minimum values

Statistic        Function                 Result
Mean             df['年龄'].mean()        50.57142857142857
Median           df['年龄'].median()      53.0
Mode             df['年龄'].mode()        72.0
Maximum          df['年龄'].max()         72.0
Minimum          df['年龄'].min()         12.0
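Two details of these functions are easy to miss: missing values are skipped by default, and `mode()` returns a Series rather than a single number, because several values can tie. The sketch below uses made-up ages, not the values from the file; `ages` and `most_common` are illustrative names.

```python
import numpy as np
import pandas as pd

# Illustrative ages, with one missing value
ages = pd.Series([65, np.nan, 35, 65], name='年龄')

print(ages.mean())     # NaN is skipped by default (skipna=True)
print(ages.median())
print(ages.mode())     # a Series, because several values can tie
most_common = ages.mode().iloc[0]   # take the first mode as a scalar
print(most_common)
```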

Filter rows by column value

# Select rows where the age is less than 50
df[df['年龄'] < 50]

# Select rows whose name starts with the surname 曹 (Cao)
df[df['姓名'].str.startswith('曹')]
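Filters can also be combined, and string filters need some care when the column may contain missing values. The following is a sketch on illustrative rows (not the real file); `people`, `young_wei`, and `cao` are made-up names.

```python
import numpy as np
import pandas as pd

# Illustrative rows (not the real file), including a missing name and a missing age
people = pd.DataFrame({
    '国家': ['魏国', '蜀国', '吴国', '魏国'],
    '姓名': ['曹操', '关羽', '周瑜', np.nan],
    '年龄': [65, np.nan, 35, 39],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses
young_wei = people[(people['年龄'] < 50) & (people['国家'] == '魏国')]
print(young_wei)

# na=False treats a missing name as "no match" instead of propagating NaN
cao = people[people['姓名'].str.startswith('曹', na=False)]
print(cao)
```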

Group

df.groupby('国家')['姓名'].count()
# Similar to SQL: SELECT 国家, COUNT(姓名) FROM x GROUP BY 国家
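As with SQL's COUNT(column), `count()` only counts non-missing values, while `size()` counts every row in the group. The sketch below shows the difference and also computes several aggregates at once with `agg`. The rows are illustrative, not the real file, and `people` and `summary` are made-up names.

```python
import numpy as np
import pandas as pd

# Illustrative rows (not the real file), with one missing age
people = pd.DataFrame({
    '国家': ['魏国', '魏国', '蜀国', '蜀国', '吴国'],
    '姓名': ['曹操', '曹丕', '刘备', '关羽', '周瑜'],
    '年龄': [65, 39, 62, np.nan, 35],
})

# count() ignores NaN, size() counts every row in the group
print(people.groupby('国家')['年龄'].count())
print(people.groupby('国家').size())

# Several aggregations at once
summary = people.groupby('国家')['年龄'].agg(['count', 'mean', 'max'])
print(summary)
```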

Apply a function

df['状态'] = df['年龄'].apply(lambda x: '长寿' if isinstance(x, (int, float)) and x > 50 else '一般')
df.head()
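For a simple two-way label like this one, a vectorized alternative is `numpy.where`, which evaluates the condition on the whole column at once. This is a sketch of that variation, not a replacement the original text prescribes; the rows and the names `people`, `by_apply`, and `by_where` are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative rows (not the real file), with one missing age
people = pd.DataFrame({'姓名': ['曹操', '关羽', '周瑜'], '年龄': [65, np.nan, 35]})

# Row-by-row version, as in the text above
by_apply = people['年龄'].apply(
    lambda x: '长寿' if isinstance(x, (int, float)) and x > 50 else '一般')

# Vectorized version: NaN > 50 is False, so missing ages also get '一般'
by_where = np.where(people['年龄'] > 50, '长寿', '一般')

print(by_apply.tolist())
print(by_where.tolist())
```

Both versions label a missing age as '一般', which may or may not be the desired behavior for unknown values.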

Fetch data: loc, iloc

# Row 5 (the index starts from 0)
df.loc[4]

# Rows 5 to 6 (a loc slice includes both end labels)
df.loc[4:5]

# The name column of row 5, or equivalently column 2 of row 5
df.loc[4, '姓名']
df.iloc[4, 1]

# The name and age columns of row 5, or equivalently columns 2 and 6 of row 5
df.loc[4, ['姓名', '年龄']]
df.iloc[4, [1, 5]]

# The name and age columns of rows 5 to 6, or equivalently columns 2 and 6 of rows 5 to 6
df.loc[4:5, ['姓名', '年龄']]
df.iloc[[4, 5], [1, 5]]
df.iloc[4:6, [1, 5]]

# Columns 2 to 4 of rows 5 to 9 (an iloc slice excludes the end position)
df.iloc[4:9, 1:4]
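The main difference between the two accessors is how slices treat the end point: `loc` slices by label and includes the end, while `iloc` slices by position and excludes it. The sketch below checks this on a placeholder table whose cell values are made up; `columns`, `demo`, `by_label`, `by_position`, and `block` are illustrative names.

```python
import pandas as pd

# Placeholder table with 10 rows and the six columns used above (values are made up)
columns = ['国家', '姓名', '出生年份', '逝世年份', '角色', '年龄']
demo = pd.DataFrame(
    [[f'{col}{row}' for col in columns] for row in range(10)],
    columns=columns,
)

by_label = demo.loc[4:5]           # label slice: both ends included
by_position = demo.iloc[4:6]       # position slice: end excluded
print(by_label.equals(by_position))

block = demo.iloc[4:9, 1:4]        # row positions 4..8 and column positions 1..3
print(block.shape)
print(block.columns.tolist())
```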

Append and merge data
concat

# Create a Series for the new person
newpeople = pd.Series(['东汉', '马腾', '?', 212, '?'], index=['国家', '姓名', '出生年份', '逝世年份', '年龄'])

# Convert the Series to a DataFrame, then transpose it (column to row)
newpeople = newpeople.to_frame().T

# Append the row (axis=0) and reset the index (ignore_index=True)
df2 = pd.concat([df, newpeople], axis=0, ignore_index=True)
df2.tail()
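Note that appending a literal '?' string puts text into otherwise numeric columns. One possible variation, sketched below on an illustrative one-row table, is to build the new row as a one-row DataFrame and leave unknown fields out, so that `concat` fills them with NaN. The names `people`, `new_row`, and `combined`, and the sample role value, are assumptions for illustration.

```python
import pandas as pd

# Illustrative one-row table (not the real file)
people = pd.DataFrame({
    '国家': ['魏国'], '姓名': ['曹操'],
    '出生年份': [155], '逝世年份': [220], '角色': ['君主'],
})

# Build the new row directly as a one-row DataFrame; unknown fields are left out
new_row = pd.DataFrame([{'国家': '东汉', '姓名': '马腾', '逝世年份': 212}])

# Columns missing from new_row are filled with NaN in the result
combined = pd.concat([people, new_row], ignore_index=True)
print(combined)
```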

merge

# Create a table
kindom_power = pd.DataFrame([
    ['东汉', 300],
    ['魏国', 800],
    ['蜀国', 400],
    ['吴国', 600],
    ['西晋', 1000]
], columns=['国家', '国力'])

# Merge the two tables (left: df, right: kindom_power) on the 国家 column
df3 = pd.merge(left=df, right=kindom_power, on='国家')
df3.head(10)
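By default `pd.merge` performs an inner join, so rows whose key has no match in the other table are dropped. The sketch below contrasts this with `how='left'`, which keeps every row of the left table. The rows are illustrative, and '群雄' is a made-up key chosen only because it has no match.

```python
import pandas as pd

people = pd.DataFrame({
    '国家': ['魏国', '蜀国', '群雄'],   # '群雄' is a made-up key with no match below
    '姓名': ['曹操', '刘备', '吕布'],
})
kindom_power = pd.DataFrame([
    ['魏国', 800],
    ['蜀国', 400],
], columns=['国家', '国力'])

# Default how='inner': only keys present in both tables survive
inner = pd.merge(left=people, right=kindom_power, on='国家')

# how='left': keep all rows of the left table, fill unmatched values with NaN
left = pd.merge(left=people, right=kindom_power, on='国家', how='left')

print(len(inner))
print(left)
```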

4) Export data

# Write to sanguo_result.csv without the index values
df.to_csv('sanguo_result.csv', index=False)
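To see what `index=False` changes without writing a file, `to_csv` can be called with no path, in which case it returns the CSV text. A small sketch on illustrative rows (`people`, `without_index`, and `with_index` are made-up names):

```python
import pandas as pd

# Illustrative rows (not the real file)
people = pd.DataFrame({'国家': ['魏国', '蜀国'], '姓名': ['曹操', '刘备']})

# With no path, to_csv returns the CSV text instead of writing a file
without_index = people.to_csv(index=False)
with_index = people.to_csv()

print(without_index.splitlines()[0])   # header without the index column
print(with_index.splitlines()[0])      # header starts with an empty index column name
```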

References

  • https://pandas.pydata.org/
  • https://www.runoob.com/pandas/pandas-tutorial.html
  • https://github.com/xchenhao/code-notes/blob/master/data/sanguo.csv (sanguo.csv data)

Origin blog.csdn.net/xchenhao/article/details/128667563