Introduction to the Series data structure in the Python data processing library pandas

Introduction

pandas is one of the most important Python libraries for data processing. Its name comes from "panel data", an econometrics term for multi-dimensional structured data sets (so it has nothing to do with the animal T-T). The focus of pandas is therefore on structured data. Its main difference from NumPy, which works with plain arrays, is that every value carries an additional label (the index), which makes pandas well suited to tabular data. pandas has two major data structures, Series and DataFrame. This article covers the basic usage of Series.

In [1]: import pandas as pd

Series: a one-dimensional array of data in which every element has a label (its index).

In [2]: from pandas import Series

In [3]: obj = pd.Series([3, 2, -4, 9, 3])

In [4]: obj
Out[4]: 
0    3
1    2
2   -4
3    9
4    3
dtype: int64

The index is displayed on the left, and the values are displayed on the right. Since no index was specified, a default integer index starting at 0 is created.

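The underlying values and the index labels can be obtained through .array and .index. Note that .array returns a pandas extension array that wraps the NumPy data (displayed as PandasArray here, renamed NumpyExtensionArray in newer pandas releases), not a bare NumPy ndarray.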

In [5]: obj.array
Out[5]: 
<PandasArray>
[3, 2, -4, 9, 3]
Length: 5, dtype: int64

In [6]: obj.index
Out[6]: RangeIndex(start=0, stop=5, step=1)
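
If a plain NumPy array is what you need, .to_numpy() is one way to get it. The following is only a minimal sketch; the variable names values and labels are illustrative and not part of the original session:

```python
import numpy as np
import pandas as pd

obj = pd.Series([3, 2, -4, 9, 3])

values = obj.to_numpy()      # plain NumPy ndarray holding the data
labels = obj.index.tolist()  # index labels as an ordinary Python list

print(type(values).__name__)
print(values)
print(labels)
```

The index itself stays a pandas object (here a RangeIndex); converting it to a list is just a convenient way to inspect the labels.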

You can also create a Series with custom labels:

In [7]: obj2 = pd.Series([3, 2, -4, 9, 3], index=['a', 'b', 'e', 'ff', 'g'])

In [8]: obj2
Out[8]: 
a     3
b     2
e    -4
ff    9
g     3
dtype: int64

Data can be selected by label, which is not possible with a plain NumPy array:

In [9]: obj2['a']
Out[9]: 3

In [11]: obj2[['b', 'e']]  # note: ['b', 'e'] is a list of labels
Out[11]: 
b    2
e   -4
dtype: int64
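
Label-based selection can also be written explicitly with .loc, which avoids any ambiguity between labels and positions. A small sketch, re-creating obj2 from above (the names single, subset and sliced are illustrative):

```python
import pandas as pd

obj2 = pd.Series([3, 2, -4, 9, 3], index=['a', 'b', 'e', 'ff', 'g'])

single = obj2.loc['a']         # one label -> a scalar
subset = obj2.loc[['b', 'e']]  # list of labels -> a Series
sliced = obj2.loc['b':'ff']    # label slices include BOTH endpoints

print(single)
print(subset.tolist())
print(sliced.tolist())
print('ff' in obj2)  # membership tests look at the labels, not the values
```

Unlike ordinary Python slicing, a label slice such as 'b':'ff' includes the end label.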

Operations on a Series, such as boolean filtering or arithmetic, do not change the labels; each result stays attached to its label:

In [12]: obj2[obj2>0]
Out[12]: 
a     3
b     2
ff    9
g     3
dtype: int64

In [13]: obj2 / 2
Out[13]: 
a     1.5
b     1.0
e    -2.0
ff    4.5
g     1.5
dtype: float64
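
The same holds for NumPy functions applied to a Series. The sketch below assumes the obj2 defined earlier; np.abs is just one example of a NumPy ufunc:

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([3, 2, -4, 9, 3], index=['a', 'b', 'e', 'ff', 'g'])

absolute = np.abs(obj2)  # a NumPy ufunc applied to a Series returns a Series
doubled = obj2 * 2       # element-wise arithmetic keeps the index too

print(absolute)
print(doubled['e'])
```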

A Series can also be seen as a mapping from index labels to values, so you can convert a dictionary into a Series:

In [14]: data = {'class1': 100, 'class2': 300, 'class3': 200, 'class4': 50}

In [15]: obj3 = pd.Series(data)

In [16]: obj3
Out[16]: 
class1    100
class2    300
class3    200
class4     50
dtype: int64
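
Because of this mapping view, a Series supports several dictionary-like operations. A brief sketch (the value 350 and the name round_trip are made up for illustration):

```python
import pandas as pd

data = {'class1': 100, 'class2': 300, 'class3': 200, 'class4': 50}
obj3 = pd.Series(data)

print('class2' in obj3)  # key lookup against the index labels
print('class9' in obj3)

obj3['class2'] = 350          # assign by label, like a dict
round_trip = obj3.to_dict()   # convert back to a plain dictionary
print(round_trip)
```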

Passing an index along with the dictionary builds the Series from the specified keys, in the specified order:

In [17]: keys = ['class5', 'class4', 'class3', 'class2']

In [18]: obj4 = pd.Series(data, index=keys)

In [19]: obj4
Out[19]: 
class5      NaN
class4     50.0
class3    200.0
class2    300.0
dtype: float64

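class5 is not defined in the dictionary above, so NaN (not a number) is used in its place, and the dtype becomes float64 because NaN is a floating-point value. class1 is not in keys, so it is left out of the result. Real-world data often contains many missing values, so data preprocessing is very important~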

You can use pd.isna (or the equivalent obj4.isna() method) to detect missing data:

In [20]: pd.isna(obj4)
Out[20]: 
class5     True
class4    False
class3    False
class2    False
dtype: bool
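
Once missing values are detected, two common preprocessing steps are dropping them or filling them in. This is only a sketch; whether to drop or fill, and the fill value 0 used here, are assumptions that depend on your data:

```python
import pandas as pd

data = {'class1': 100, 'class2': 300, 'class3': 200, 'class4': 50}
obj4 = pd.Series(data, index=['class5', 'class4', 'class3', 'class2'])

n_missing = int(obj4.isna().sum())  # count the missing entries
dropped = obj4.dropna()             # new Series without the NaN rows
filled = obj4.fillna(0)             # new Series with NaN replaced by 0

print(n_missing)
print(dropped)
print(filled)
```

Both dropna and fillna return a new Series and leave obj4 itself unchanged.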

A very important feature of Series is that arithmetic operations are aligned by index label (data alignment). Labels that appear in only one of the two Series, or whose value is NaN, produce NaN in the result:

In [21]: obj3
Out[21]: 
class1    100
class2    300
class3    200
class4     50
dtype: int64

In [22]: obj4
Out[22]: 
class5      NaN
class4     50.0
class3    200.0
class2    300.0
dtype: float64

In [23]: obj3 + obj4
Out[23]: 
class1      NaN
class2    600.0
class3    400.0
class4    100.0
class5      NaN
dtype: float64
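
If you would rather treat a label that is missing from one side as a default value instead of getting NaN, the add method accepts a fill_value argument. A minimal sketch, assuming that 0 is a sensible default for this data:

```python
import pandas as pd

data = {'class1': 100, 'class2': 300, 'class3': 200, 'class4': 50}
obj3 = pd.Series(data)
obj4 = pd.Series(data, index=['class5', 'class4', 'class3', 'class2'])

# Labels present in only one Series are treated as 0 before adding.
total = obj3.add(obj4, fill_value=0)
print(total)
```

Here class1 keeps its value from obj3, while class5 is still NaN because it has no valid value in either Series.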

You can give both the Series itself and its index a name:

In [24]: obj4.name = 'dataset'

In [25]: obj4.index.name = 'class'

In [26]: obj4
Out[26]: 
class
class5      NaN
class4     50.0
class3    200.0
class2    300.0
Name: dataset, dtype: float64
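
One place where these names become useful is when a Series is turned into a DataFrame: they are used as the column names. A short sketch (the variable name frame is illustrative):

```python
import pandas as pd

data = {'class1': 100, 'class2': 300, 'class3': 200, 'class4': 50}
obj4 = pd.Series(data, index=['class5', 'class4', 'class3', 'class2'])
obj4.name = 'dataset'
obj4.index.name = 'class'

# reset_index moves the index into a regular column and returns a DataFrame.
frame = obj4.reset_index()
print(list(frame.columns))
print(frame)
```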

Reference: Python for Data Analysis, 2nd Edition by Wes McKinney

Origin blog.csdn.net/bo17244504/article/details/124686751