[Repost] Python - DataFrame Basic Operations

Understanding the DataFrame

A DataFrame can be seen as an ordered arrangement of several Series objects, where "arrangement" means the Series all share a common index.
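A minimal sketch of this view (the data is illustrative, not from the original post):

import pandas as pd

population = pd.Series({'Illinois': 12.8, 'Texas': 29.1})
area = pd.Series({'Illinois': 149995, 'Texas': 695662})
# The two Series are aligned on their shared index to form the DataFrame.
df = pd.DataFrame({'pop': population, 'area': area})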

I. Reading Files

dt = pd.read_csv(path)               # comma-separated values
dt = pd.read_excel(path)             # Excel workbook
dt = pd.read_table(path, sep=',')    # general delimited text; sep=',' makes it equivalent to read_csv
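A runnable sketch (the file name example.csv is hypothetical):

import pandas as pd

# Write a small CSV so the read call has something to load.
pd.DataFrame({'state': ['Illinois', 'Ohio'], 'pop': [12.8, 11.7]}).to_csv('example.csv', index=False)

dt = pd.read_csv('example.csv')
print(dt)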

 

II. Indexing

The first kind of indexer is the iloc attribute, where selection and slicing use the implicit integer position: dt.iloc[1:3]  # note: a 0-based, left-closed, right-open interval

The second kind is the loc attribute, which uses the explicit label-based index (both slice endpoints are included), e.g. dt.loc[:'Illinois', :'pop']

The third indexer, ix, mixed the two styles, e.g. dt.ix[:3, :'pop'] (ix has since been deprecated and removed in modern pandas).
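A minimal sketch of the difference (the states frame is illustrative):

import pandas as pd

states = pd.DataFrame({'pop': [12.8, 19.5, 38.3]},
                      index=['Illinois', 'New York', 'California'])
states.iloc[0:2]                     # positional: rows 0 and 1, endpoint excluded
states.loc['Illinois':'New York']    # label-based: both endpoints included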

 

III. Merging and Joining

1. pd.concat()

 

pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
          keys=None, levels=None)

Here axis=0 concatenates along rows and axis=1 concatenates along columns (also written axis='columns'). With axis=1, the objects are aligned on their index.

ser1 = pd.Series(['A','B','C'])
ser2 = pd.Series(['D','E','F'])
ser3 = pd.Series(['G','H','I'])

a = pd.concat([ser1,ser2])

The result a is shown below. If you then try to concatenate ser3 with it along columns, an error is raised:

0    A
1    B
2    C
0    D
1    E
2    F
dtype: object

pd.concat([a, ser3], axis=1)

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

This happens because a keeps the original indexes of the two Series, so its index contains duplicate labels. Setting ignore_index=True resets the index, as in the sketch below.
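A minimal sketch of the fix:

b = pd.concat([ser1, ser2], ignore_index=True)   # index is reset to 0..5
pd.concat([b, ser3], axis=1)                     # aligns on the index; no error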

2. append()

ser1.append(ser2)

Pandas' append() does not modify the original object in place; it creates a new object containing the combined data.
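Note that append() was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat() is the modern equivalent:

combined = pd.concat([ser1, ser2])   # same result as ser1.append(ser2)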

3. pd.merge()

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=False,
         suffixes=('_x', '_y'), copy=True, indicator=False)

 

See help(pd.merge) for details.
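A minimal sketch of an inner merge on a shared key column (the frames are illustrative):

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'x': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'y': [4, 5, 6]})
pd.merge(df1, df2, on='key', how='inner')   # keeps only keys present in both: A and B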

 

IV. Grouping (groupby)

Grouped statistics use the groupby method, which produces its result through a split-apply-combine process: split the data into groups, apply a function to each group, and combine the results.

groupby splits the data by the specified column and returns a DataFrameGroupBy object. The object holds the groups internally, but nothing is computed until an aggregation function is applied.

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A','B','C','A','B','C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                  columns=['key','data1','data2'])

print(df)
df.groupby('key')
 
  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9
Out[15]:
<pandas.core.groupby.DataFrameGroupBy object at 0x000001D9BC42A860>

(1) Aggregation: aggregate

Applying a function triggers the computation over the DataFrameGroupBy object.

df.groupby('key').aggregate(['min', np.median, max])

    data1            data2
      min median max   min median max
key
A       0    1.5   3     3    4.0   5
B       1    2.5   4     0    3.5   7
C       2    3.5   5     3    6.0   9

You can also use a dictionary to specify a different function for each column:

df.groupby('key').aggregate({'data1': 'min',
                             'data2': 'max'})

     data1  data2
key
A        0      5
B        1      7
C        2      9

(2) Filtering: filter

filter drops entire groups that fail a predicate. Here group A is removed because the standard deviation of its data2 values (about 1.41) is not greater than 4:

def filter_func(x):
    return x['data2'].std() > 4

print(df.groupby('key').std())
print(df.groupby('key').filter(filter_func))

 

       data1     data2
key
A    2.12132  1.414214
B    2.12132  4.949747
C    2.12132  4.242641

 

  key  data1  data2
1   B      1      0
2   C      2      3
4   B      4      7
5   C      5      9

(3) Transformation: transform

Unlike aggregate, transform returns an object with the same shape as the input. Here each value is centered on its group mean:

df.groupby('key').transform(lambda x: x - x.mean())

   data1  data2
0   -1.5    1.0
1   -1.5   -3.5
2   -1.5   -3.0
3    1.5   -1.0
4    1.5    3.5
5    1.5    3.0

 

V. Pivot Tables

See help(pd.pivot_table) for details.
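The example below assumes a titanic DataFrame; it can be loaded, for instance, from seaborn's sample datasets (an assumption — the original post does not say where the data comes from):

import seaborn as sns                    # hypothetical data source for this example
titanic = sns.load_dataset('titanic')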

titanic.pivot_table(index='sex', columns='class', aggfunc={'survived': sum, 'fare': 'mean'})

 

VI. Other Operations (sorting, de-duplication, computation, applying functions along rows or columns, etc.)

1. Sorting

Sort by index

df.sort_index(ascending=False)

  key  data1  data2
5   C      5      9
4   B      4      7
3   A      3      3
2   C      2      3
1   B      1      0
0   A      0      5

Sort by values

df.sort_values(by=['key','data2'])

  key  data1  data2
3   A      3      3
0   A      0      5
1   B      1      0
4   B      4      7
2   C      2      3
5   C      5      9

2. De-duplication

df.drop_duplicates('data2', keep='first')

  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
4   B      4      7
5   C      5      9

To find how many distinct values a column contains, use df['data1'].nunique().

3. Deletion: drop

Delete by row:

First select the index of the rows to delete, then drop those index labels:

df.drop(df.loc[df['key']=='A'].index, axis=0)

  key  data1  data2
1   B      1      0
2   C      2      3
4   B      4      7
5   C      5      9

Delete by column:

df.drop(['key','data2'], axis=1) 
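The section title also mentions applying a function along rows or columns; a minimal sketch with apply, using the df from the groupby section:

df[['data1', 'data2']].apply(lambda col: col.max() - col.min())   # applied to each column
df[['data1', 'data2']].apply(lambda row: row.sum(), axis=1)       # applied to each row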


Source: www.cnblogs.com/chanshion/p/12388873.html