Python Study Notes 37: pandas

pandas

Series: a one-dimensional labeled data structure
DataFrame: a two-dimensional labeled data structure

import pandas as pd

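Creating a Series

Parameters:

  • data: the values to store (a list, dict, array, or scalar)
  • index: the labels for the values
  • name: the name of the Series
  • copy: whether to copy the input data
  • dtype: the data type of the values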
ser_obj = pd.Series([1,2,3])
ser_obj
0    1
1    2
2    3
dtype: int64
# Specify the index
ser_obj = pd.Series([1,2,3],index=['a','b','c'])
ser_obj
a    1
b    2
c    3
dtype: int64
# Create from a dictionary
dit = {2001:100,2002:200,2003:150}
ser_obj1 = pd.Series(dit)
ser_obj1
2001    100
2002    200
2003    150
dtype: int64
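The name and dtype parameters in the list above are not exercised in these notes. A minimal sketch of how they behave, using made-up values and an illustrative variable name (scores) that does not come from the original:

```python
import pandas as pd

# Build a Series with explicit labels, a name, and a float dtype
scores = pd.Series([1, 2, 3], index=['a', 'b', 'c'], name='score', dtype='float64')

print(scores.name)          # the name attached to the Series
print(scores.dtype)         # float64, because dtype was passed explicitly
print(list(scores.index))   # the labels
print(scores['b'])          # label-based lookup
```

Without dtype, pandas would infer int64 from the integers, as in the outputs shown earlier.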

Creating a DataFrame

Parameters:

  • data: the values to store
  • index: the row labels
  • columns: the column labels
  • copy: whether to copy the input data
  • dtype: the data type of the values
import numpy as np
data = np.arange(6).reshape(2,3)
df_obj = pd.DataFrame(data)
df_obj
   0  1  2
0  0  1  2
1  3  4  5
data = np.arange(6).reshape(2,3)
df_obj = pd.DataFrame(data,columns=['a','b','c'])
df_obj
   a  b  c
0  0  1  2
1  3  4  5
# Add a column
df_obj['d'] = [1,2]
df_obj
   a  b  c  d
0  0  1  2  1
1  3  4  5  2
# Delete a column
del df_obj['a']
df_obj
   b  c  d
0  1  2  1
1  4  5  2
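A DataFrame can also be built from a dict, and a column can be removed without mutating the original object. The following is a small sketch under those assumptions; the names df and trimmed are illustrative only:

```python
import pandas as pd

# Keys of the dict become column labels
df = pd.DataFrame({'a': [0, 3], 'b': [1, 4], 'c': [2, 5]})

# Add a column in place, then drop one into a new object
df['d'] = [1, 2]
trimmed = df.drop(columns=['a'])   # df itself keeps column 'a'

print(list(df.columns))
print(list(trimmed.columns))
```

Unlike del, drop returns a new DataFrame by default, which is handy when the original is still needed.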
# Reindex
ser_obj1 = pd.Series([1,2,3,4],index=['c','b','a','d'])
ser_obj1
c    1
b    2
a    3
d    4
dtype: int64
# fill_value fills every missing value with the same value
ser_obj2 = ser_obj1.reindex(['a','b','c','d','e','f'],fill_value=5)
ser_obj2
a    3
b    2
c    1
d    4
e    5
f    5
dtype: int64
# ffill / pad: fill forward from the previous value
# bfill / backfill: fill backward from the next value
# nearest: fill from the nearest index value
ser_obj3 = pd.Series([1,3,5,7], index=[0,2,4,6])
ser_obj3
0    1
2    3
4    5
6    7
dtype: int64
# reindex returns a new Series; ser_obj3 itself is left unchanged
ser_obj3.reindex([1,2,3,4,5,6],method='bfill')
1    3
2    3
3    5
4    5
5    7
6    7
dtype: int64
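The fill methods named in the comments above can be compared side by side. A minimal sketch that reuses the same data under illustrative variable names; note that reindex returns a new Series and leaves the original untouched:

```python
import pandas as pd

ser = pd.Series([1, 3, 5, 7], index=[0, 2, 4, 6])
new_index = [1, 2, 3, 4, 5, 6]

forward = ser.reindex(new_index, method='ffill')    # take the previous existing label
backward = ser.reindex(new_index, method='bfill')   # take the next existing label

print(forward.tolist())    # [1, 3, 3, 5, 5, 7]
print(backward.tolist())   # [3, 3, 5, 5, 7, 7]
print(ser.index.tolist())  # [0, 2, 4, 6], the original is unchanged
```

Both methods require the original index to be sorted (monotonic), which holds here.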

Indexing

arr = np.arange(12).reshape(3,4)
df_obj = pd.DataFrame(arr, columns=['a','b','c','d'])
df_obj
   a  b   c   d
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
df_obj['a']
0    0
1    4
2    8
Name: a, dtype: int64
df_obj[0:1]
   a  b  c  d
0  0  1  2  3
# Multiple columns: pass the labels as a list
df_obj[['a','c']]
   a   c
0  0   2
1  4   6
2  8  10
# loc (label-based) and iloc (position-based)
df_obj.loc[:,['c','a']]
    c  a
0   2  0
1   6  4
2  10  8
df_obj.iloc[:,[2,0]]
    c  a
0   2  0
1   6  4
2  10  8
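The examples above select columns only. A short sketch of row selection with loc and iloc, assuming a hypothetical string row index ('x', 'y', 'z') so that labels and positions are easy to tell apart:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=['a', 'b', 'c', 'd'],
                  index=['x', 'y', 'z'])   # hypothetical row labels

by_label = df.loc['x':'y', ['a', 'c']]     # label slice includes both endpoints
by_position = df.iloc[0:2, [0, 2]]         # position slice excludes the stop
filtered = df.loc[df['a'] > 0, 'd']        # boolean mask on rows, one column

print(by_label.equals(by_position))        # True
print(filtered.tolist())                   # [7, 11]
```

The main thing to remember is that loc slices are inclusive of the end label, while iloc slices follow the usual Python half-open convention.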

Arithmetic and data alignment

# Align on the index first, then compute
obj_one = pd.Series(range(10,13),index=range(3))
obj_one
0    10
1    11
2    12
dtype: int64
obj_two = pd.Series(range(10,16),index=range(6))
obj_two
0    10
1    11
2    12
3    13
4    14
5    15
dtype: int64
# Labels present on only one side produce NaN; alternatively, pass fill_value
obj_one + obj_two
0    20.0
1    22.0
2    24.0
3     NaN
4     NaN
5     NaN
dtype: float64
obj_one.add(obj_two,fill_value=0)
0    20.0
1    22.0
2    24.0
3    13.0
4    14.0
5    15.0
dtype: float64
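The same alignment applies to DataFrames, on rows and columns at once. A minimal sketch with made-up frames (left and right are illustrative names):

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=[0, 1])
right = pd.DataFrame({'b': [10, 20], 'c': [30, 40]}, index=[1, 2])

plain = left + right                     # only cell (1, 'b') exists on both sides
filled = left.add(right, fill_value=0)   # one-sided cells survive as-is

print(plain.loc[1, 'b'])                 # 14.0
print(filled.loc[2, 'b'])                # 20.0
print(filled.isna().sum().sum())         # 2: cells missing from both frames stay NaN
```

fill_value only substitutes for a value missing on one side; a cell absent from both operands, such as (2, 'a') here, is still NaN in the result.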

Sorting data

# By index
ser_obj = pd.Series(range(10,13),index=range(3))
ser_obj
0    10
1    11
2    12
dtype: int64
ser_obj.sort_index(ascending=False)
2    12
1    11
0    10
dtype: int64

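Parameters:

  • axis: the axis to sort along; 0 (the row index) by default
  • level: sort on the specified index level
  • ascending: True by default
  • inplace: False by default, which returns a new object; True sorts the existing object in place
  • kind: the sorting algorithm; quicksort by default
# By value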
ser_obj = pd.DataFrame(np.arange(12).reshape(3,4))
ser_obj
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
ser_obj[4] = [12,3,2]
ser_obj
   0  1   2   3   4
0  0  1   2   3  12
1  4  5   6   7   3
2  8  9  10  11   2
ser_obj.sort_values(by=4)
   0  1   2   3   4
2  8  9  10  11   2
1  4  5   6   7   3
0  0  1   2   3  12
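sort_values also accepts several columns and a sort direction per column. A small sketch with made-up data (the column names group and value are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'group': ['b', 'a', 'b', 'a'], 'value': [2, 4, 1, 3]})

# Sort by group ascending, then by value descending within each group
ordered = df.sort_values(by=['group', 'value'], ascending=[True, False])

print(ordered['value'].tolist())   # [4, 3, 2, 1]
print(ordered.index.tolist())      # [1, 3, 0, 2], original row labels travel with the rows
print(df['value'].tolist())        # [2, 4, 1, 3], df is unchanged because inplace defaults to False
```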

Common statistical calculations

  • sum: sum of the values
  • mean: arithmetic mean
  • median: median
  • idxmax: index label of the maximum value
  • idxmin: index label of the minimum value
  • count: number of non-NaN values
  • var: sample variance
  • std: sample standard deviation
  • cumsum: cumulative sum
  • cumprod: cumulative product
  • describe: summary statistics for each column
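The methods in this list can be tried on a small Series. The values below are made up so that the results are easy to check by hand:

```python
import pandas as pd

ser = pd.Series([2, 4, 4, 10], index=['a', 'b', 'c', 'd'])

total = ser.sum()                 # 20
average = ser.mean()              # 5.0
middle = ser.median()             # 4.0
largest_at = ser.idxmax()         # 'd'
smallest_at = ser.idxmin()        # 'a'
non_missing = ser.count()         # 4
variance = ser.var()              # 12.0, sample variance (divides by n - 1)
running_sum = ser.cumsum()        # 2, 6, 10, 20
running_product = ser.cumprod()   # 2, 8, 32, 320
summary = ser.describe()          # count, mean, std, min, quartiles, max

print(summary)
```

Note that var and std use the sample definition (ddof=1) by default, which differs from NumPy's default of ddof=0.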

Hierarchical Index

df_obj = pd.DataFrame({'students':[1,2,3,4]},
                     index=[['School 1','School 1','School 2','School 2'],['Class 1','Class 2','Class 1','Class 2']])
df_obj
                  students
School 1 Class 1         1
         Class 2         2
School 2 Class 1         3
         Class 2         4
from pandas import MultiIndex
# Three ways to build a hierarchical index
# MultiIndex.from_arrays
# MultiIndex.from_product
# MultiIndex.from_tuples
# Method 1: from_tuples
list_tuple = [('School 1','Class 1'),('School 1','Class 2'),('School 2','Class 1'),('School 2','Class 2')]
m_index = MultiIndex.from_tuples(tuples=list_tuple)
m_index
MultiIndex(levels=[['School 1', 'School 2'], ['Class 1', 'Class 2']],
           codes=[[0, 0, 1, 1], [0, 1, 0, 1]])
df_obj = pd.DataFrame({'students':[1,2,3,4]},
                     index = m_index)
df_obj
                  students
School 1 Class 1         1
         Class 2         2
School 2 Class 1         3
         Class 2         4
# Method 2: from_product
schools = ['School 1', 'School 2']
classes = ['Class 1', 'Class 2']
m_index = MultiIndex.from_product(iterables=[schools,classes])  # optionally: names=['school','class']
df_obj = pd.DataFrame({'students':[1,2,3,4]},
                     index = m_index)
df_obj
                  students
School 1 Class 1         1
         Class 2         2
School 2 Class 1         3
         Class 2         4

Level index operations

# df_obj['School 1']  # raises KeyError: [] looks up columns, not index levels
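The notes list MultiIndex.from_arrays but do not show it, and the level-operations section is left empty. Below is a hedged sketch of both. The English labels and the level names school and class are assumptions made for illustration, and the selection calls shown (loc, xs, swaplevel) are only some of the available options:

```python
import pandas as pd

# MultiIndex.from_arrays: one list per level, matched position by position
arrays = [['School 1', 'School 1', 'School 2', 'School 2'],
          ['Class 1', 'Class 2', 'Class 1', 'Class 2']]
m_index = pd.MultiIndex.from_arrays(arrays, names=['school', 'class'])
df = pd.DataFrame({'students': [1, 2, 3, 4]}, index=m_index)

school_1 = df.loc['School 1']                            # all rows under one outer label
one_cell = df.loc[('School 2', 'Class 1'), 'students']   # a full key picks a single row
class_1 = df.xs('Class 1', level='class')                # cross-section on the inner level
swapped = df.swaplevel().sort_index()                    # inner level becomes the outer one

print(school_1['students'].tolist())   # [1, 2]
print(one_cell)                        # 3
print(class_1['students'].tolist())    # [1, 3]
print(swapped['students'].tolist())    # [1, 3, 2, 4]
```

Row selection on a hierarchical index goes through loc or xs; plain square brackets on a DataFrame select columns.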

Read and write

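  • pd.read_csv(): reads a delimited text file; the default separator is ","
  • DataFrame.to_csv(): writes a DataFrame to a CSV file (a method of the DataFrame, not a top-level pd function)
  • pd.read_table(): reads a delimited text file; the default separator is "\t"
  • pd.read_excel(): reads an Excel worksheet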
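A round trip through CSV can be sketched without touching the disk by writing to an in-memory buffer. The data and column names below are made up for illustration:

```python
import io
import pandas as pd

df = pd.DataFrame({'year': [2017, 2018], 'score': [555, 576]})

# to_csv is called on the DataFrame; index=False omits the row labels
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

restored = pd.read_csv(io.StringIO(csv_text))              # default separator ","

# The same data with tabs, read by naming the separator explicitly
tab_text = csv_text.replace(',', '\t')
from_tabs = pd.read_csv(io.StringIO(tab_text), sep='\t')

print(restored.equals(df))    # True
print(from_tabs.equals(df))   # True
```

With a real file, the buffer would simply be replaced by a path such as 'scores.csv'.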
# The sheet has two header rows, so header=[0,1] marks the first two rows as column labels
# Labels: 年份 = year, 一本分数线 = first-tier score line, 二本分数线 = second-tier score line, 文科 = liberal arts, 理科 = science
df_obj = pd.read_excel('scores.xlsx',header=[0,1],index_col=0)
df_obj
年份    一本分数线        二本分数线
        文科    理科    文科    理科
2018    576     532     488     432
2017    555     537     468     439
2016    583     548     532     494
2015    579     548     527     495
2014    565     543     507     495
2013    549     550     494     505
2012    495     477     446     433
2011    524     484     481     435
2010    524     494     474     441
2009    532     501     489     459
2008    515     502     472     455
2007    528     531     486     478
2006    516     528     476     476
# Get the highest and lowest score lines for liberal arts and science over the years, plus the range
df_obj.max()
年份       
一本分数线  文科    583
       理科    550
二本分数线  文科    532
       理科    505
dtype: int64
df_obj.min()
年份       
一本分数线  文科    495
       理科    477
二本分数线  文科    446
       理科    432
dtype: int64
# Series.ptp() is deprecated and was removed from later pandas releases; use numpy.ptp instead
np.ptp(df_obj["一本分数线","文科"].values)
88
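The range is simply the maximum minus the minimum, so it can be computed without ptp at all. A minimal sketch that copies the first-tier liberal-arts column from the table above into a standalone Series (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# First-tier liberal-arts score lines for 2018 down to 2006, copied from the table
liberal_arts = pd.Series(
    [576, 555, 583, 579, 565, 549, 495, 524, 524, 532, 515, 528, 516],
    index=range(2018, 2005, -1),
)

value_range = liberal_arts.max() - liberal_arts.min()   # 583 - 495
via_numpy = np.ptp(liberal_arts.to_numpy())             # same result through NumPy

print(value_range)             # 88
print(via_numpy)               # 88
print(liberal_arts.idxmax())   # 2016, the year of the highest line
print(liberal_arts.idxmin())   # 2012, the year of the lowest line
```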
df_obj.describe()
年份        一本分数线                二本分数线
              文科        理科        文科        理科
count   13.000000   13.000000   13.000000   13.000000
mean   541.615385  521.153846  487.692308  464.384615
std     28.150010   25.986683   23.570407   27.274953
min    495.000000  477.000000  446.000000  432.000000
25%    524.000000  501.000000  474.000000  439.000000
50%    532.000000  531.000000  486.000000  459.000000
75%    565.000000  543.000000  494.000000  494.000000
max    583.000000  550.000000  532.000000  505.000000

Origin www.cnblogs.com/zheng1076/p/11453525.html