Pandas has two major data structures, Series and DataFrame. A Series holds one-dimensional data, and each value has a corresponding index label. A DataFrame holds two-dimensional data, and each value has a corresponding row index label and column label.
For more details about Series and DataFrame, please refer to:
Series: Introduction to the Series data structure in the Python data processing library pandas
DataFrame: Introduction to the DataFrame data structure in the Python data processing library pandas
This article introduces some basic functions that are commonly used when processing data with pandas. For more details, please refer to the online pandas documentation.
Reindexing
Series
reindex rearranges the data to match a new index. If a label in the new index has no corresponding data, the value is automatically filled with NaN.
In [1]: import numpy as np; import pandas as pd
In [2]: obj = pd.Series([5.2, 2.1, -4.5, 2.65], index=['c', 'a', 'd', 'b'])
In [3]: obj
Out[3]:
c 5.20
a 2.10
d -4.50
b 2.65
dtype: float64
In [6]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'f'])
In [7]: obj2
Out[7]:
a 2.10
b 2.65
c 5.20
d -4.50
f NaN
dtype: float64
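If NaN is not the placeholder you want, reindex accepts a fill_value argument. The following is a minimal sketch that reuses the obj Series from above; the choice of 0.0 as the filler is only an illustration.

```python
import pandas as pd

obj = pd.Series([5.2, 2.1, -4.5, 2.65], index=["c", "a", "d", "b"])

# fill_value replaces the NaN that reindex would otherwise insert
# for labels that do not exist in the original index ("f" here).
filled = obj.reindex(["a", "b", "c", "d", "f"], fill_value=0.0)

print(filled)
```

Existing values are left untouched; only the newly introduced label receives the fill value.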
DataFrame
A DataFrame can be reindexed by row or by column.
In [27]: frame = pd.DataFrame(np.random.randint(10, size=(3, 3)),
...: index=['d', 'b', 'a'],
...: columns=['three', 'four', 'one'])
...:
In [28]: frame
Out[28]:
three four one
d 7 9 7
b 6 8 5
a 4 1 8
# By default, reindex works on the rows
In [29]: frame2 = frame.reindex(['a', 'b', 'c', 'd'])
In [30]: frame2
Out[30]:
three four one
a 4.0 1.0 8.0
b 6.0 8.0 5.0
c NaN NaN NaN
d 7.0 9.0 7.0
# Pass columns= to reindex the columns
In [31]: frame3 = frame.reindex(columns=['one', 'two', 'three', 'four'])
In [32]: frame3
Out[32]:
one two three four
d 7 NaN 7 9
b 5 NaN 6 8
a 8 NaN 4 1
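Rows and columns can also be reindexed in a single call. The sketch below hard-codes the values printed above (the original frame was generated randomly) and uses fill_value=0 purely as an example.

```python
import pandas as pd

# Same values as the frame shown above, written out explicitly.
frame = pd.DataFrame(
    [[7, 9, 7], [6, 8, 5], [4, 1, 8]],
    index=["d", "b", "a"],
    columns=["three", "four", "one"],
)

# Reindex both axes at once; new row "c" and new column "two"
# are filled with 0 instead of NaN.
both = frame.reindex(
    index=["a", "b", "c", "d"],
    columns=["one", "two", "three", "four"],
    fill_value=0,
)

print(both)
```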
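You can also use loc to select and reorder existing labels. Unlike reindex, loc never creates NaN data: every requested label must already exist, and recent pandas versions raise a KeyError if one is missing.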
In [33]: frame.loc[['a', 'b', 'd'], ['one', 'three', 'four']]
Out[33]:
one three four
a 8 4 1
b 5 6 8
d 7 7 9
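The difference between loc and reindex shows up when a label is missing. This is a small sketch, assuming pandas 1.0 or later (where a list containing a missing label raises KeyError) and using the frame values printed above.

```python
import pandas as pd

frame = pd.DataFrame(
    [[7, 9, 7], [6, 8, 5], [4, 1, 8]],
    index=["d", "b", "a"],
    columns=["three", "four", "one"],
)

# Selecting only existing labels works and keeps the integer dtype.
subset = frame.loc[["a", "b", "d"], ["one", "three", "four"]]

# Row label "c" does not exist, so loc refuses the request.
try:
    frame.loc[["a", "c"]]
    missing_raises = False
except KeyError:
    missing_raises = True

# reindex is the tool to use when new labels should appear as NaN rows.
with_nan = frame.reindex(["a", "c"])

print(missing_raises)
print(with_nan)
```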
Dropping
drop removes the data in the given rows or columns. By default it returns a new object and leaves the original unchanged.
Series
In [34]: obj
Out[34]:
c 5.20
a 2.10
d -4.50
b 2.65
dtype: float64
In [35]: obj.drop('a')
Out[35]:
c 5.20
d -4.50
b 2.65
dtype: float64
In [36]: obj.drop(['a', 'c'])
Out[36]:
d -4.50
b 2.65
dtype: float64
DataFrame
In [40]: frame
Out[40]:
three four one
d 7 9 7
b 6 8 5
a 4 1 8
In [41]: frame.drop(['a', 'd'])
Out[41]:
three four one
b 6 8 5
# Pass axis=1 or axis='columns' to drop columns
In [42]: frame.drop('one', axis=1)
Out[42]:
three four
d 7 9
b 6 8
a 4 1
In [43]: frame.drop('one', axis='columns')
Out[43]:
three four
d 7 9
b 6 8
a 4 1
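As a complement to the axis argument, drop also accepts index= and columns= keywords, and errors="ignore" to skip labels that are not present. A minimal sketch using the frame values printed above; the column name "two" is deliberately one that does not exist.

```python
import pandas as pd

frame = pd.DataFrame(
    [[7, 9, 7], [6, 8, 5], [4, 1, 8]],
    index=["d", "b", "a"],
    columns=["three", "four", "one"],
)

# Drop a row and a column in one call; frame itself is not modified.
trimmed = frame.drop(index=["a"], columns=["one"])

# "two" is not a column of frame; errors="ignore" suppresses the KeyError
# and returns a copy with nothing removed.
ignored = frame.drop(columns=["two"], errors="ignore")

print(trimmed)
print(frame.shape)
```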
Indexing
Series
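First method: indexing with [], similar to a NumPy array. Note that pandas 2.1 and later deprecate treating an integer inside [] as a position when the index is not integer-based, so obj.iloc[1] is the forward-compatible spelling of obj[1]: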
In [44]: obj
Out[44]:
c 5.20
a 2.10
d -4.50
b 2.65
dtype: float64
In [45]: obj['a']
Out[45]: 2.1
In [46]: obj[1]
Out[46]: 2.1
In [47]: obj[2:4]
Out[47]:
d -4.50
b 2.65
dtype: float64
In [48]: obj[['a', 'b']]
Out[48]:
a 2.10
b 2.65
dtype: float64
In [49]: obj[[1, 3]]
Out[49]:
a 2.10
b 2.65
dtype: float64
In [50]: obj[obj<0]
Out[50]:
d -4.5
dtype: float64
Second method: indexing with loc/iloc. loc selects by index label, and iloc selects by integer position.
In [51]: obj.loc[['a', 'b']]
Out[51]:
a 2.10
b 2.65
dtype: float64
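One practical difference between the two accessors is how they treat slices: a label slice with loc includes its end label, while a position slice with iloc excludes its end position. A short sketch with the obj Series from above:

```python
import pandas as pd

obj = pd.Series([5.2, 2.1, -4.5, 2.65], index=["c", "a", "d", "b"])

# Label slice: both endpoints "a" and "d" are included.
by_label = obj.loc["a":"d"]

# Position slice: position 1 is included, position 2 is not.
by_position = obj.iloc[1:2]

print(by_label)
print(by_position)
```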
If the index consists of integers, both loc and the first method ([]) treat an integer as an index label rather than a position, as follows.
In [55]: obj1 = pd.Series([1, 2, 3], index=[2, 0, 1])
In [59]: obj1[[0, 1, 2]]
Out[59]:
0 2
1 3
2 1
dtype: int64
In [60]: obj1.loc[[0, 1, 2]]
Out[60]:
0 2
1 3
2 1
dtype: int64
In this case, if you want to index by position, you need to use the iloc method.
In [61]: obj1.iloc[[0, 1, 2]]
Out[61]:
2 1
0 2
1 3
dtype: int64
Note: with the first method or with loc, an integer is interpreted as an index label whenever the index itself consists of integers. In that case you cannot select by position, for example:
In [67]: ser = pd.Series(np.arange(4))
In [68]: ser[-1]
KeyError: -1
The lookup here is against the labels 0, 1, 2, 3 of ser. -1 is not one of these labels, so a KeyError is raised. If the index does not consist of integers, there is no such problem, because the number is treated as a position rather than as a label:
In [69]: ser1 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
In [70]: ser1[-1]
Out[70]: 3
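To avoid depending on the index type at all, position-based access can always go through iloc. A minimal sketch with the same two Series:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(4))

# iloc always means "position", so -1 is the last element.
last_by_position = ser.iloc[-1]

# loc always means "label"; there is no label -1 in 0..3.
try:
    ser.loc[-1]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# The same iloc call works unchanged on a string-labelled Series.
ser1 = pd.Series(np.arange(4), index=["a", "b", "c", "d"])
last_of_ser1 = ser1.iloc[-1]

print(last_by_position, label_lookup_failed, last_of_ser1)
```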
DataFrame
First method: indexing with [] selects columns:
In [62]: frame
Out[62]:
three four one
d 7 9 7
b 6 8 5
a 4 1 8
In [63]: frame['three']
Out[63]:
d 7
b 6
a 4
Name: three, dtype: int64
In [64]: frame[['three', 'one']]
Out[64]:
three one
d 7 7
b 6 5
a 4 8
Second method: loc (by label) and iloc (by position), with the row selector first and the column selector second:
In [65]: frame.loc['a', ['one', 'three']]
Out[65]:
one 8
three 4
Name: a, dtype: int64
In [66]: frame.iloc[2, [2, 0]]
Out[66]:
one 8
three 4
Name: a, dtype: int64
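The boolean filtering shown earlier for a Series carries over to a DataFrame. The sketch below uses the frame values printed above; the threshold 5 is arbitrary.

```python
import pandas as pd

frame = pd.DataFrame(
    [[7, 9, 7], [6, 8, 5], [4, 1, 8]],
    index=["d", "b", "a"],
    columns=["three", "four", "one"],
)

# A boolean Series aligned with the row index selects rows.
mask = frame["three"] > 5
rows = frame[mask]

# loc accepts the same mask together with a column selection.
picked = frame.loc[mask, ["one", "four"]]

print(rows)
print(picked)
```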
Function application (apply)
In [82]: frame
Out[82]:
three four one
d 7 9 7
b 6 8 5
a 4 1 8
In [83]: f = lambda x: x.max() + x.min()
By default (axis=0), the function is called once per column:
In [84]: frame.apply(f)
Out[84]:
three 11
four 10
one 13
dtype: int64
With axis='columns', the function is called once per row:
In [85]: frame.apply(f, axis='columns')
Out[85]:
d 16
b 13
a 9
dtype: int64
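The function passed to apply may also return a Series, in which case the result is a DataFrame. The following sketch uses the frame values printed above; the helper names spread and min_max are made up for this illustration.

```python
import pandas as pd

frame = pd.DataFrame(
    [[7, 9, 7], [6, 8, 5], [4, 1, 8]],
    index=["d", "b", "a"],
    columns=["three", "four", "one"],
)

def spread(x):
    # x is one column (axis=0) or one row (axis="columns") as a Series.
    return x.max() + x.min()

per_column = frame.apply(spread)                   # indexed by column name
per_row = frame.apply(spread, axis="columns")      # indexed by row label

def min_max(x):
    # Returning a Series produces one output row per returned label.
    return pd.Series([x.min(), x.max()], index=["min", "max"])

summary = frame.apply(min_max)

print(per_column)
print(per_row)
print(summary)
```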
For an element-wise operation, use applymap instead of apply. (pandas 2.1 and later rename this method to DataFrame.map and deprecate applymap.)
In [82]: frame
Out[82]:
three four one
d 7 9 7
b 6 8 5
a 4 1 8
In [89]: f1 = lambda x: x+ 10
In [90]: frame.applymap(f1)
Out[90]:
three four one
d 17 19 17
b 16 18 15
a 14 11 18
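For simple arithmetic such as adding 10, a vectorized expression gives the same result without calling a Python function per element. The sketch below also shows one way to pick whichever element-wise method the installed pandas provides; the hasattr check is an assumption made for portability, not something the original article does.

```python
import pandas as pd

frame = pd.DataFrame(
    [[7, 9, 7], [6, 8, 5], [4, 1, 8]],
    index=["d", "b", "a"],
    columns=["three", "four", "one"],
)

# Vectorized arithmetic: equivalent to applying lambda x: x + 10 element-wise.
shifted = frame + 10

# DataFrame.map exists from pandas 2.1; older versions only have applymap.
elementwise = frame.map if hasattr(frame, "map") else frame.applymap

# A per-element function is still needed for things like string formatting.
labels = elementwise(lambda v: f"{int(v):02d}")

print(shifted)
print(labels)
```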
Sorting
Data can be sorted either by index (sort_index) or by value (sort_values).
Series
In [91]: obj
Out[91]:
c 5.20
a 2.10
d -4.50
b 2.65
dtype: float64
Sort by index
In [92]: obj.sort_index()
Out[92]:
a 2.10
b 2.65
c 5.20
d -4.50
dtype: float64
Sort by value
In [93]: obj.sort_values()
Out[93]:
d -4.50
a 2.10
b 2.65
c 5.20
dtype: float64
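sort_values also takes ascending and na_position arguments. A small sketch using the reindexed Series obj2 from the Reindexing section, which contains one NaN:

```python
import pandas as pd

obj = pd.Series([5.2, 2.1, -4.5, 2.65], index=["c", "a", "d", "b"])
obj2 = obj.reindex(["a", "b", "c", "d", "f"])  # "f" becomes NaN

# Descending order; missing values are placed last by default.
descending = obj2.sort_values(ascending=False)

# na_position="first" moves missing values to the front instead.
nan_first = obj2.sort_values(na_position="first")

print(descending)
print(nan_first)
```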
DataFrame
In [94]: frame
Out[94]:
three four one
d 7 9 7
b 6 8 5
a 4 1 8
Sort by row index
In [95]: frame.sort_index()
Out[95]:
three four one
a 4 1 8
b 6 8 5
d 7 9 7
Sort by column index
In [96]: frame.sort_index(axis=1)
Out[96]:
four one three
d 9 7 7
b 8 5 6
a 1 8 4
Sort in descending order
In [97]: frame.sort_index(axis=1, ascending=False)
Out[97]:
three one four
d 7 7 9
b 6 5 8
a 4 8 1
Sort by the values of a given column
In [99]: frame.sort_values(by='one')
Out[99]:
three four one
b 6 8 5
d 7 9 7
a 4 1 8
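The by argument also accepts a list of columns, with a matching list for ascending. Because frame has no tied values to break, the sketch below uses a small hypothetical table (the names scores, group and value are invented for this example).

```python
import pandas as pd

# Hypothetical data with repeated group labels so the second key matters.
scores = pd.DataFrame({
    "group": ["x", "y", "x", "y"],
    "value": [3, 1, 2, 4],
})

# Sort by group ascending, then by value descending within each group.
ordered = scores.sort_values(by=["group", "value"], ascending=[True, False])

print(ordered)
```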
Reference: Python for Data Analysis, 2nd Edition by Wes McKinney