Common data processing functions in Pandas (reindexing, dropping, selection, sorting, mapping/apply, etc.)

Pandas has two major data structures, Series and DataFrame. A Series holds one-dimensional data, and each value has a corresponding index label. A DataFrame holds two-dimensional data, and each value is addressed by a row index label and a column label.

For more details about Series and DataFrame, please refer to:

Series: Introduction to the Series data structure in the Python data processing library pandas

DataFrame: Introduction to the DataFrame data structure in the Python data processing library pandas

This article introduces some basic functions that are commonly used when processing data with Pandas. For more details, please refer to the online documentation of pandas. The examples assume that pandas and NumPy have been imported as pd and np.

Reindexing 

Series

reindex rearranges the data to match a new index. If a label in the new index has no corresponding data, its value is automatically filled with NaN.

In [2]: obj = pd.Series([5.2, 2.1, -4.5, 2.65], index=['c', 'a', 'd', 'b'])

In [3]: obj
Out[3]: 
c    5.20
a    2.10
d   -4.50
b    2.65
dtype: float64


In [6]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'f'])

In [7]: obj2
Out[7]: 
a    2.10
b    2.65
c    5.20
d   -4.50
f     NaN
dtype: float64
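As a small sketch of the behaviour above, the following rebuilds the same Series and also uses the fill_value argument of reindex. The fill_value argument does not appear in the original example; it is shown here only to illustrate how the default NaN can be replaced.

```python
import pandas as pd

obj = pd.Series([5.2, 2.1, -4.5, 2.65], index=['c', 'a', 'd', 'b'])

# 'f' is not in the original index, so its value becomes NaN by default.
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'f'])
print(obj2.isna())

# fill_value substitutes a chosen value for labels that have no data.
obj3 = obj.reindex(['a', 'b', 'c', 'd', 'f'], fill_value=0.0)
print(obj3)
```

Note that reindex returns a new Series in both cases; obj itself keeps its original four labels.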

DataFrame

A DataFrame can be reindexed by row or by column.

In [27]: frame = pd.DataFrame(np.random.randint(10, size=(3, 3)),
    ...:                      index=['d', 'b', 'a'],
    ...:                      columns=['three', 'four', 'one'])
    ...: 

In [28]: frame
Out[28]: 
   three  four  one
d      7     9    7
b      6     8    5
a      4     1    8

# By default, reindex operates on the rows
In [29]: frame2 = frame.reindex(['a', 'b', 'c', 'd'])

In [30]: frame2
Out[30]: 
   three  four  one
a    4.0   1.0  8.0
b    6.0   8.0  5.0
c    NaN   NaN  NaN
d    7.0   9.0  7.0

# Pass columns= to reindex by column
In [31]: frame3 = frame.reindex(columns=['one', 'two', 'three', 'four'])

In [32]: frame3
Out[32]: 
   one  two  three  four
d    7  NaN      7     9
b    5  NaN      6     8
a    8  NaN      4     1

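You can also use loc to select and reorder rows and columns by label, but loc does not create NaN data: every label you pass must already exist (in pandas 1.0 and later, a missing label raises a KeyError).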

In [33]: frame.loc[['a', 'b', 'd'], ['one', 'three', 'four']]
Out[33]: 
   one  three  four
a    8      4     1
b    5      6     8
d    7      7     9
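The difference between reindex and loc can be sketched as follows. The frame is rebuilt here with the fixed values shown in the outputs above (the original uses random integers), and the loc_failed flag is a name introduced only for this illustration.

```python
import pandas as pd

frame = pd.DataFrame([[7, 9, 7], [6, 8, 5], [4, 1, 8]],
                     index=['d', 'b', 'a'],
                     columns=['three', 'four', 'one'])

# reindex accepts the unknown label 'c' and fills its row with NaN.
reindexed = frame.reindex(['a', 'b', 'c', 'd'])

# loc requires every label to exist; in pandas 1.0+ a missing label raises KeyError.
try:
    frame.loc[['a', 'b', 'c', 'd']]
    loc_failed = False
except KeyError:
    loc_failed = True

print(reindexed)
print("loc raised KeyError:", loc_failed)
```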

Dropping 

drop removes the specified rows or columns. By default it returns a new object and leaves the original unchanged.

Series

In [34]: obj
Out[34]: 
c    5.20
a    2.10
d   -4.50
b    2.65
dtype: float64

In [35]: obj.drop('a')
Out[35]: 
c    5.20
d   -4.50
b    2.65
dtype: float64

In [36]: obj.drop(['a', 'c'])
Out[36]: 
d   -4.50
b    2.65
dtype: float64

DataFrame

In [40]: frame
Out[40]: 
   three  four  one
d      7     9    7
b      6     8    5
a      4     1    8

In [41]: frame.drop(['a', 'd'])
Out[41]: 
   three  four  one
b      6     8    5

# Drop columns by passing axis=1 or axis='columns'
In [42]: frame.drop('one', axis=1)
Out[42]: 
   three  four
d      7     9
b      6     8
a      4     1

In [43]: frame.drop('one', axis='columns')
Out[43]: 
   three  four
d      7     9
b      6     8
a      4     1
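To illustrate that drop returns a new object, here is a minimal sketch. It uses the index= and columns= keyword arguments of drop, which are an alternative spelling of the axis argument and are not used in the original examples; the variable names are chosen for this illustration.

```python
import pandas as pd

frame = pd.DataFrame([[7, 9, 7], [6, 8, 5], [4, 1, 8]],
                     index=['d', 'b', 'a'],
                     columns=['three', 'four', 'one'])

# Equivalent to frame.drop('one', axis=1); frame itself is not modified.
without_one = frame.drop(columns='one')

# Equivalent to frame.drop(['a', 'd']); only row 'b' remains in the result.
only_b = frame.drop(index=['a', 'd'])

print(frame.columns.tolist())        # still contains 'one'
print(without_one.columns.tolist())
print(only_b.index.tolist())
```

To keep the change, assign the result back to a name, as done with without_one and only_b above.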

Indexing 

Series

The first method: index with [], as with a NumPy array:

In [44]: obj
Out[44]: 
c    5.20
a    2.10
d   -4.50
b    2.65
dtype: float64

In [45]: obj['a']
Out[45]: 2.1

In [46]: obj[1]
Out[46]: 2.1

In [47]: obj[2:4]
Out[47]: 
d   -4.50
b    2.65
dtype: float64

In [48]: obj[['a', 'b']]
Out[48]: 
a    2.10
b    2.65
dtype: float64

In [49]: obj[[1, 3]]
Out[49]: 
a    2.10
b    2.65
dtype: float64

In [50]: obj[obj<0]
Out[50]: 
d   -4.5
dtype: float64

The second method: index with loc or iloc. loc selects by index label, and iloc selects by integer position.

In [51]: obj.loc[['a', 'b']]
Out[51]: 
a    2.10
b    2.65
dtype: float64

If the index consists of integers, both loc and the first method [] treat an integer key as an index label rather than a position, as follows.

In [55]: obj1 = pd.Series([1, 2, 3], index=[2, 0, 1])

In [59]: obj1[[0, 1, 2]]
Out[59]: 
0    2
1    3
2    1
dtype: int64

In [60]: obj1.loc[[0, 1, 2]]
Out[60]: 
0    2
1    3
2    1
dtype: int64

In this case, if you want to index by position, you need to use the iloc method.

In [61]: obj1.iloc[[0, 1, 2]]
Out[61]: 
2    1
0    2
1    3
dtype: int64

Note: When using the first method or loc on an integer index, an integer key is always looked up as a label. In that case you cannot index by position, for example:

In [67]: ser = pd.Series(np.arange(4))

In [68]: ser[-1]
KeyError: -1

The lookup here is made against the labels 0, 1, 2, 3 of ser. -1 is not among these labels, so a KeyError is raised. If the index is not made of integers, there is no such problem, because [] treats the integer as a position rather than a label (recent pandas versions deprecate this positional fallback and recommend iloc instead):

In [69]: ser1 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

In [70]: ser1[-1]
Out[70]: 3
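A way to avoid this ambiguity is to state the intent explicitly with iloc or loc. The sketch below reuses ser and ser1 from above; the result variable names are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(4))                                # integer index 0, 1, 2, 3
ser1 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])   # string index

# iloc is always positional, so -1 means "last element" for both Series.
last_int_index = ser.iloc[-1]
last_str_index = ser1.iloc[-1]

# loc is always label-based.
by_label = ser1.loc['d']

print(last_int_index, last_str_index, by_label)
```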

DataFrame

The first method: select columns with []:

In [62]: frame
Out[62]: 
   three  four  one
d      7     9    7
b      6     8    5
a      4     1    8

In [63]: frame['three']
Out[63]: 
d    7
b    6
a    4
Name: three, dtype: int64

In [64]: frame[['three', 'one']]
Out[64]: 
   three  one
d      7    7
b      6    5
a      4    8

The second method: loc (by label) and iloc (by position):

In [65]: frame.loc['a', ['one', 'three']]
Out[65]: 
one      8
three    4
Name: a, dtype: int64

In [66]: frame.iloc[2, [2, 0]]
Out[66]: 
one      8
three    4
Name: a, dtype: int64
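The two calls above select the same data, once by label and once by position. The sketch below checks this and also shows one difference that is easy to miss: a label slice with loc includes its end label, while a position slice with iloc excludes its end position. The frame uses the fixed values shown above, and the variable names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame([[7, 9, 7], [6, 8, 5], [4, 1, 8]],
                     index=['d', 'b', 'a'],
                     columns=['three', 'four', 'one'])

# Row 'a' is at position 2; columns 'one' and 'three' are at positions 2 and 0.
by_label = frame.loc['a', ['one', 'three']]
by_position = frame.iloc[2, [2, 0]]
same = by_label.equals(by_position)

# loc slices include the end label; iloc slices exclude the end position.
loc_slice = frame.loc['d':'b']     # rows 'd' and 'b'
iloc_slice = frame.iloc[0:1]       # row 'd' only

print(same)
print(loc_slice.index.tolist(), iloc_slice.index.tolist())
```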

Function application (apply) 

In [82]: frame
Out[82]: 
   three  four  one
d      7     9    7
b      6     8    5
a      4     1    8

In [83]: f = lambda x: x.max() + x.min()

Apply along the rows (the default, axis=0): f is called once per column
In [84]: frame.apply(f)
Out[84]: 
three    11
four     10
one      13
dtype: int64

Apply along the columns (axis='columns'): f is called once per row
In [85]: frame.apply(f, axis='columns')
Out[85]: 
d    16
b    13
a     9
dtype: int64
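The axis argument is easy to misread, so here is a small sketch that reproduces both results with the same function f on a frame built from the fixed values shown above. The last line is an addition for comparison: for a reduction like this one, the built-in max and min methods give the same per-column answer without apply.

```python
import pandas as pd

frame = pd.DataFrame([[7, 9, 7], [6, 8, 5], [4, 1, 8]],
                     index=['d', 'b', 'a'],
                     columns=['three', 'four', 'one'])

f = lambda x: x.max() + x.min()

# axis=0 (default): each column is passed to f, so the result is indexed by column name.
per_column = frame.apply(f)

# axis='columns': each row is passed to f, so the result is indexed by row label.
per_row = frame.apply(f, axis='columns')

# Same per-column result using built-in reductions.
per_column_builtin = frame.max() + frame.min()

print(per_column)
print(per_row)
```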

For an element-wise operation, use applymap instead of apply (in pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map).

In [82]: frame
Out[82]: 
   three  four  one
d      7     9    7
b      6     8    5
a      4     1    8

In [89]: f1 = lambda x: x+ 10

In [90]: frame.applymap(f1)
Out[90]: 
   three  four  one
d     17    19   17
b     16    18   15
a     14    11   18
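Since applymap is deprecated in pandas 2.1 and later in favor of DataFrame.map, a version-tolerant sketch might look like the following. The hasattr check and the names add_ten, elementwise and vectorized are choices made for this illustration, not part of the original example.

```python
import pandas as pd

frame = pd.DataFrame([[7, 9, 7], [6, 8, 5], [4, 1, 8]],
                     index=['d', 'b', 'a'],
                     columns=['three', 'four', 'one'])

add_ten = lambda x: x + 10

# Use DataFrame.map where it exists (pandas 2.1+), otherwise fall back to applymap.
if hasattr(pd.DataFrame, 'map'):
    elementwise = frame.map(add_ten)
else:
    elementwise = frame.applymap(add_ten)

# For simple arithmetic, plain vectorized operators give the same result.
vectorized = frame + 10

print(elementwise)
```

Element-wise functions are most useful when no vectorized operator exists, for example when formatting each value as a string.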

Sorting 

Data can be sorted by its index (sort_index) or by its values (sort_values).

Series

In [91]: obj
Out[91]: 
c    5.20
a    2.10
d   -4.50
b    2.65
dtype: float64

Sort by index
In [92]: obj.sort_index()
Out[92]: 
a    2.10
b    2.65
c    5.20
d   -4.50
dtype: float64

Sort by values
In [93]: obj.sort_values()
Out[93]: 
d   -4.50
a    2.10
b    2.65
c    5.20
dtype: float64

DataFrame

In [94]: frame
Out[94]: 
   three  four  one
d      7     9    7
b      6     8    5
a      4     1    8

Sort by row index
In [95]: frame.sort_index()
Out[95]: 
   three  four  one
a      4     1    8
b      6     8    5
d      7     9    7

Sort by column index
In [96]: frame.sort_index(axis=1)
Out[96]: 
   four  one  three
d     9    7      7
b     8    5      6
a     1    8      4

Sort in descending order
In [97]: frame.sort_index(axis=1, ascending=False)
Out[97]: 
   three  one  four
d      7    7     9
b      6    5     8
a      4    8     1

Sort by the values in a specified column
In [99]: frame.sort_values(by='one')
Out[99]: 
   three  four  one
b      6     8    5
d      7     9    7
a      4     1    8
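sort_values also accepts ascending=False and a list of columns. The sketch below is an extension of the examples above rather than something taken from them; the variable names are illustrative, and the frame uses the fixed values shown earlier.

```python
import pandas as pd

frame = pd.DataFrame([[7, 9, 7], [6, 8, 5], [4, 1, 8]],
                     index=['d', 'b', 'a'],
                     columns=['three', 'four', 'one'])

# Descending order by column 'one': values 8, 7, 5.
desc = frame.sort_values(by='one', ascending=False)

# Several sort keys: rows are ordered by 'three', and 'four' only breaks ties.
multi = frame.sort_values(by=['three', 'four'])

print(desc)
print(multi)
```

Like the other methods in this article, both calls return a new DataFrame and leave frame in its original order.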

Reference: Python for Data Analysis, 2nd Edition by Wes McKinney

Origin blog.csdn.net/bo17244504/article/details/124732529