Introduction
pandas is one of the most important Python libraries for data processing. The name comes from "panel data", an econometrics term for multi-dimensional structured data sets (so it has nothing to do with the animal T-T). Accordingly, pandas focuses on structured data. Unlike NumPy, which processes plain array data, pandas attaches a label (an index) to every value, which makes it well suited to tabular data. pandas has two major data structures, Series and DataFrame. This article covers the basic usage of Series.
In [1]: import pandas as pd
Series: a one-dimensional array of data in which each value has an associated label, called its index.
In [2]: from pandas import Series
In [3]: obj = pd.Series([3, 2, -4, 9, 3])
In [4]: obj
Out[4]:
0 3
1 2
2 -4
3 9
4 3
dtype: int64
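The index is displayed on the left, and the values are displayed on the right. Since no index was specified, a default integer index from 0 to 4 was created.
The underlying values and the index labels can be obtained through .array and .index. Note that .array returns a pandas extension array wrapping the data rather than a plain NumPy array; use .to_numpy() to get the NumPy array itself.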
In [5]: obj.array
Out[5]:
<PandasArray>
[3, 2, -4, 9, 3]
Length: 5, dtype: int64
In [6]: obj.index
Out[6]: RangeIndex(start=0, stop=5, step=1)
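As a small illustration of the difference between the wrapped values and a plain NumPy array, the sketch below rebuilds the same Series and pulls out its parts. The variable names values, wrapped and labels are only for illustration, and the class name printed for .array depends on the installed pandas version.

```python
import numpy as np
import pandas as pd

obj = pd.Series([3, 2, -4, 9, 3])

# .to_numpy() returns a plain NumPy ndarray holding the values
values = obj.to_numpy()
print(type(values).__name__)

# .array returns a pandas extension array that wraps the same data;
# its class name varies between pandas versions
wrapped = obj.array
print(type(wrapped).__name__)

# .index is a pandas Index object; tolist() turns it into a Python list
labels = obj.index.tolist()
print(labels)
```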
A Series can also be created with custom labels:
In [7]: obj2 = pd.Series([3, 2, -4, 9, 3], index=['a', 'b', 'e', 'ff', 'g'])
In [8]: obj2
Out[8]:
a 3
b 2
e -4
ff 9
g 3
dtype: int64
Values can be retrieved by label, which is not possible with a plain NumPy array:
In [9]: obj2['a']
Out[9]: 3
In [11]: obj2[['b', 'e']]  # note: ['b', 'e'] is a list of labels
Out[11]:
b 2
e -4
dtype: int64
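Square brackets work well for string labels, but pandas also offers the explicit accessors .loc (by label) and .iloc (by position), which are not used in the session above. The following is a minimal sketch of how they behave on obj2; the variable names on the left are only illustrative.

```python
import pandas as pd

obj2 = pd.Series([3, 2, -4, 9, 3], index=['a', 'b', 'e', 'ff', 'g'])

by_label = obj2.loc['ff']          # label-based: the value labelled 'ff'
by_position = obj2.iloc[3]         # position-based: the fourth value
label_slice = obj2.loc['b':'ff']   # label slices include the end label
pos_slice = obj2.iloc[1:3]         # position slices exclude the end position

print(by_label, by_position)
print(label_slice)
print(pos_slice)
```

Being explicit about label versus position avoids ambiguity when the index itself consists of integers.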
Operations on a Series, such as boolean filtering or arithmetic, preserve the link between each value and its label:
In [12]: obj2[obj2>0]
Out[12]:
a 3
b 2
ff 9
g 3
dtype: int64
In [13]: obj2 / 2
Out[13]:
a 1.5
b 1.0
e -2.0
ff 4.5
g 1.5
dtype: float64
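To check that the labels really do travel with the values, the sketch below repeats the filter, adds a multiplication, and applies a NumPy function. The use of np.abs is an extra example that is not part of the session above.

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([3, 2, -4, 9, 3], index=['a', 'b', 'e', 'ff', 'g'])

positive = obj2[obj2 > 0]   # the filter keeps the labels of the surviving values
doubled = obj2 * 2          # arithmetic keeps every label
absolute = np.abs(obj2)     # NumPy ufuncs return a Series with the same index

print(positive.index.tolist())
print(doubled.index.equals(obj2.index))
print(absolute)
```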
A Series can also be seen as a mapping from index labels to values, so a dictionary can be converted directly into a Series:
In [14]: data = {'class1': 100, 'class2': 300, 'class3': 200, 'class4': 50}
In [15]: obj3 = pd.Series(data)
In [16]: obj3
Out[16]:
class1 100
class2 300
class3 200
class4 50
dtype: int64
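Because a Series behaves like a mapping, several dictionary-style idioms carry over. The sketch below assumes the same data dictionary as above; the label 'class9' is a made-up key used only to show the missing-key case.

```python
import pandas as pd

data = {'class1': 100, 'class2': 300, 'class3': 200, 'class4': 50}
obj3 = pd.Series(data)

has_class2 = 'class2' in obj3      # membership tests look at the labels
has_class9 = 'class9' in obj3      # 'class9' is a hypothetical, absent label
fallback = obj3.get('class9', 0)   # like dict.get, returns the default if absent
round_trip = obj3.to_dict()        # convert back to a plain dictionary

print(has_class2, has_class9, fallback)
print(round_trip)
```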
You can also pass a list of keys as the index to choose which dictionary entries appear in the Series and in what order:
In [17]: keys = ['class5', 'class4', 'class3', 'class2']
In [18]: obj4 = pd.Series(data, index=keys)
In [19]: obj4
Out[19]:
class5 NaN
class4 50.0
class3 200.0
class2 300.0
dtype: float64
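class5 is not a key in the dictionary above, so its value is NaN (not a number), which pandas uses to mark missing data. Because NaN is a floating-point value, the dtype changes from int64 to float64. Real-world data often contains many missing values, so data preprocessing is very important.
You can use pd.isna (or the equivalent method obj4.isna()) to detect missing data: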
In [20]: pd.isna(obj4)
Out[20]:
class5 True
class4 False
class3 False
class2 False
dtype: bool
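Once missing values are detected, a typical preprocessing step is to count, drop, or fill them. The following is a minimal sketch using standard Series methods; filling with 0 is just an assumption for illustration, and the right replacement value depends on the data.

```python
import pandas as pd

data = {'class1': 100, 'class2': 300, 'class3': 200, 'class4': 50}
keys = ['class5', 'class4', 'class3', 'class2']
obj4 = pd.Series(data, index=keys)

n_missing = int(obj4.isna().sum())   # count the missing values
present = obj4[obj4.notna()]         # keep only non-missing values via a mask
dropped = obj4.dropna()              # the same result with a dedicated method
filled = obj4.fillna(0)              # replace NaN with 0 (illustrative choice)

print(n_missing)
print(dropped)
print(filled)
```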
A very important feature of Series is data alignment: arithmetic operations match values by index label, not by position. A label that is missing from either Series produces NaN in the result.
In [21]: obj3
Out[21]:
class1 100
class2 300
class3 200
class4 50
dtype: int64
In [22]: obj4
Out[22]:
class5 NaN
class4 50.0
class3 200.0
class2 300.0
dtype: float64
In [23]: obj3 + obj4
Out[23]:
class1 NaN
class2 600.0
class3 400.0
class4 100.0
class5 NaN
dtype: float64
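If NaN is not wanted for labels that exist in only one Series, the add method accepts a fill_value argument. The sketch below contrasts it with the + operator. Note that class5 stays NaN either way, because it is NaN in obj4 and absent from obj3, so there is no value on either side to add.

```python
import pandas as pd

data = {'class1': 100, 'class2': 300, 'class3': 200, 'class4': 50}
keys = ['class5', 'class4', 'class3', 'class2']
obj3 = pd.Series(data)
obj4 = pd.Series(data, index=keys)

plain_sum = obj3 + obj4                      # class1 becomes NaN
filled_sum = obj3.add(obj4, fill_value=0)    # a value missing on one side counts as 0

print(plain_sum)
print(filled_sum)
```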
You can give both the Series itself and its index a name:
In [24]: obj4.name = 'dataset'
In [25]: obj4.index.name = 'class'
In [26]: obj4
Out[26]:
class
class5 NaN
class4 50.0
class3 200.0
class2 300.0
Name: dataset, dtype: float64
Reference: Python for Data Analysis, 2nd Edition by Wes McKinney