Three data structures
- Series: series (array)
- DataFrame: data frame (table)
- Panel: panel (table container)
Description and comparison
Data structure | Dimensions | Description |
---|---|---|
Series | 1 | Labeled array of homogeneous data |
DataFrame | 2 | Size-mutable tabular data structure |
Panel | 3 | Size-mutable; can be understood as a container of tables |
Note: A higher-dimensional data structure is a container for the lower-dimensional one. For example, a DataFrame is effectively a container of Series. Of the three data structures, DataFrame is the most widely used and is one of the most important data structures in pandas. Panel was deprecated in pandas 0.20 and removed in 0.25, so it is not available in current releases.
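To make the "container" idea concrete, here is a minimal sketch. The frame and its values are made up for illustration and are not part of the original examples; it only shows that selecting a single column of a DataFrame gives back a Series that shares the row index.

```python
import numpy as np
import pandas as pd

# Illustrative data only: 3 rows, 2 columns
df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'B'])

col = df['A']                       # selecting one column yields a Series
print(type(col).__name__)           # Series
print(col.ndim, df.ndim)            # 1 2
print(col.index.equals(df.index))   # True: the column shares the row index
```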
Series object creation and manipulation
A Series object stores data as a one-dimensional array, so a one-dimensional array (or list) is normally required as the data source.
- Create a Series object
import numpy as np
import pandas as pd

# Series object
s = pd.Series([2, 4, np.nan, 'lolita'])
print(s)
'''
0 2
1 4
2 NaN
3 lolita
dtype: object'''
s = pd.Series(list('abcde'))
print(s)
'''
0 a
1 b
2 c
3 d
4 e
dtype: object'''
# Passing multi-dimensional data raises an error
s = pd.Series(np.arange(8).reshape(2,4))
print(s)
'''raise Exception('Data must be 1-dimensional')
Exception: Data must be 1-dimensional
'''
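The sketch below illustrates two points from the examples above: how the element types decide the resulting dtype, and one way around the one-dimensional restriction. It assumes that flattening the array with `ravel()` is acceptable for your data. As far as we know, recent pandas versions raise a `ValueError` for 2D input rather than a bare `Exception`; catching `Exception` covers both cases.

```python
import numpy as np
import pandas as pd

numeric = pd.Series([2, 4, np.nan])
print(numeric.dtype)   # float64: NaN forces the integers to float

mixed = pd.Series([2, 4, np.nan, 'lolita'])
print(mixed.dtype)     # object: numbers and strings are mixed

two_d = np.arange(8).reshape(2, 4)
try:
    pd.Series(two_d)
    raised = False
except Exception as exc:   # ValueError in recent versions (assumption, see above)
    raised = True
    print(type(exc).__name__)

# Flatten to one dimension first if a single Series is what you want
flat = pd.Series(two_d.ravel())
print(len(flat))       # 8
```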
- View and modify arrays
Series data can be viewed much like an ordinary one-dimensional array, by subscripting or slicing.
s = pd.Series(list('abcde'))
Retrieve data
value = s[2]
print(value)
'''c'''
value = s[1:3]
print(value)
'''
1 b
2 c
dtype: object
'''
Modify data value
s[3:] = np.arange(0, len(s)-3)
print(s)
'''
0 a
1 b
2 c
3 0
4 1
dtype: object'''
Modify array size
s[len(s)] = 3
print(s)
'''
0 a
1 b
2 c
3 0
4 1
5 3
dtype: object'''
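The subscripts above work because the Series uses the default 0..n-1 index, where labels and positions coincide. As a hedged sketch with a made-up, non-default integer index, the explicit `.iloc` (position) and `.loc` (label) indexers remove the ambiguity; `.loc` can also enlarge the Series by assigning to a new label.

```python
import pandas as pd

# Illustrative index: labels 10..50 no longer match positions 0..4
s = pd.Series(list('abcde'), index=[10, 20, 30, 40, 50])

print(s.iloc[2])             # 'c', third element by position
print(s.loc[30])             # 'c', element with label 30

by_pos = s.iloc[1:3]         # positions 1 and 2, end excluded
by_label = s.loc[20:30]      # labels 20 through 30, end included

s.loc[60] = 'f'              # assigning to a new label appends an element
print(len(s))                # 6
```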
DataFrame object creation and manipulation
The DataFrame object is a container of Series objects, so it is two-dimensional. Unlike a two-dimensional array in numpy, a DataFrame has both a row index and a column index. Intuitively, a DataFrame resembles a table, so you can treat it as table-structured data.
- Create a DataFrame object
First, we use numpy to create a 2D array as the data source, and do not set the row and column indices
df = pd.DataFrame(np.arange(15).reshape(3,5))
print(df)
'''
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14'''
We can see that when no row index or column index is set, the DataFrame labels rows and columns with integers starting from 0 by default.
Next, we specify the row and column indices for the DataFrame
# Row index
dates = pd.date_range('20180506', periods=7)
print(dates)
'''
DatetimeIndex(['2018-05-06', '2018-05-07', '2018-05-08', '2018-05-09',
'2018-05-10', '2018-05-11', '2018-05-12'],
dtype='datetime64[ns]', freq='D')
'''
# Column index
columns = list('ABCD')
df = pd.DataFrame(np.random.randn(7,4), index=dates, columns=columns)
print(df)
'''
A B C D
2018-05-06 -0.274367 0.402984 -0.381829 0.123850
2018-05-07 0.422842 0.548137 -0.183929 0.800568
2018-05-08 -0.485918 -2.088587 1.407923 -0.249723
2018-05-09 1.929589 0.579739 1.395986 -0.602761
2018-05-10 0.016730 0.278051 0.100124 -0.208399
2018-05-11 1.050533 0.147563 0.480859 -0.608219
2018-05-12 0.253236 -1.476788 0.115376 -0.488298
'''
Alternatively, we can create a DataFrame from a dictionary whose values can each be converted to a Series-like object.
For example:
# Another way to create a DataFrame
df2 = pd.DataFrame({
'A':pd.Series(1, index=list(range(4))),
'B':26,
'C':pd.Timestamp('20180506'),
'D':np.array(np.arange(4))
})
print(df2)
'''
A B C D
0 1 26 2018-05-06 0
1 1 26 2018-05-06 1
2 1 26 2018-05-06 2
3 1 26 2018-05-06 3'''
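In the dictionary form, the scalar values (26 and the Timestamp) are repeated for every row, and each column keeps its own dtype. The sketch below rebuilds the same frame and inspects the dtypes; the exact integer width and datetime resolution depend on the platform and pandas version, so no fixed output is shown.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'A': pd.Series(1, index=list(range(4))),
    'B': 26,                           # scalar, repeated for each row
    'C': pd.Timestamp('20180506'),     # scalar, repeated for each row
    'D': np.array(np.arange(4)),
})

# One dtype per column, unlike a plain 2D numpy array
print(df2.dtypes)

is_datetime = pd.api.types.is_datetime64_any_dtype(df2['C'])
is_integer = pd.api.types.is_integer_dtype(df2['D'])
print(is_datetime, is_integer)         # True True
```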
- Select area to view data
Sample data
dates = pd.date_range('20180506', periods=7)
columns = list('ABCD')
df = pd.DataFrame(np.random.randn(7,4), index=dates, columns=columns)
1. Index the DataFrame directly to get data
a. Select a column by column label
data = df['A'] # equivalent to df.A
print(data)
'''
2018-05-06 0.151124
2018-05-07 0.771603
2018-05-08 -1.457813
2018-05-09 -0.102214
2018-05-10 -1.266102
2018-05-11 0.780995
2018-05-12 0.957048
Freq: D, Name: A, dtype: float64'''
b. Select some rows by a slice of row labels (both endpoints are included)
data = df['20180506':'20180510']
print(data)
'''
A B C D
2018-05-06 -0.361994 -2.217695 -0.789552 -0.966792
2018-05-07 0.479107 0.723792 0.976540 1.522255
2018-05-08 1.463182 -0.106349 -1.781704 -1.399375
2018-05-09 0.728835 0.768247 0.206403 -0.759910
2018-05-10 0.862932 0.025200 -0.246032 0.626610'''
c. Select some rows by positional slicing (start:stop:step)
data = df[1:5:2]
print(data)
'''
A B C D
2018-05-07 0.479107 0.723792 0.976540 1.522255
2018-05-09 0.728835 0.768247 0.206403 -0.759910'''
2. The loc indexer + labels to get data
a. By row label
data = df.loc['20180508']
print(data)
'''
A -0.084551
B 0.666451
C 2.010399
D 0.121208
Name: 2018-05-08 00:00:00, dtype: float64'''
b. By column labels
data = df.loc[:, 'A':'C']
print(data)
'''
A B C
2018-05-06 0.268044 -2.011492 0.934763
2018-05-07 1.054199 -0.147792 -0.464180
2018-05-08 -0.132866 0.494600 -0.275043
2018-05-09 0.713573 1.727417 -0.121440
2018-05-10 0.479439 1.195971 2.710756
2018-05-11 0.374772 -0.800548 0.239096
2018-05-12 -1.009797 0.224585 -0.577983'''
c. Row Label + Column Label
# Row labels + column labels
data = df.loc[dates[1:4], ['C','B']]
print(data)
'''
C B
2018-05-07 -0.981423 1.515804
2018-05-08 2.010399 0.666451
2018-05-09 -0.541498 -0.760648'''
Note: with loc, the first argument selects row labels and the second selects column labels; each argument can be a single label, a label slice, or a list of labels.
3. The iloc indexer + integer positions to get data
a. By row index
data = df.iloc[2]
print(data)
'''
A -0.118724
B 0.328970
C 0.748574
D 0.654055
Name: 2018-05-08 00:00:00, dtype: float64'''
b. By column index
data = df.iloc[:, :3:2]
print(data)
'''
A C
2018-05-06 0.548679 -0.420244
2018-05-07 0.028167 -0.878201
2018-05-08 -0.118724 0.748574
2018-05-09 -1.015111 -1.348442
2018-05-10 -0.226723 0.486991
2018-05-11 -0.553960 -0.485923
2018-05-12 1.044797 -0.158911'''
c. By row index + column index
data = df.iloc[2:4, :2]
print(data)
'''
A B
2018-05-08 -0.118724 0.328970
2018-05-09 -1.015111 0.239186'''
Note: each argument can also be a list of integer positions, such as df.iloc[[1,3,4],[2,3]]
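One difference between loc and iloc is easy to miss: a label slice in loc includes both endpoints, while a position slice in iloc excludes the end, as in ordinary Python slicing. The sketch below uses deterministic values (np.arange instead of random numbers, an assumption made so the shapes are easy to check).

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180506', periods=7)
df = pd.DataFrame(np.arange(28).reshape(7, 4), index=dates, columns=list('ABCD'))

# Label slices: both endpoints included -> 3 rows, 3 columns
by_label = df.loc['20180507':'20180509', 'A':'C']
print(by_label.shape)    # (3, 3)

# Position slices: end excluded -> 2 rows, 2 columns
by_pos = df.iloc[1:3, 0:2]
print(by_pos.shape)      # (2, 2)

# Lists of positions pick arbitrary rows and columns
picked = df.iloc[[1, 3, 4], [2, 3]]
print(list(picked.columns))   # ['C', 'D']
```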
4. Boolean indexing: filter by condition
a. Use one column as the filter condition
data = df[df.A>0]
print(data)
'''
A B C D
2018-05-08 0.516235 3.104029 -1.670880 -0.437195
2018-05-11 0.974800 -2.326192 -0.281022 -0.476735'''
b. Filter the whole DataFrame (values that fail the condition become NaN)
data = df[df>0]
print(data)
'''
A B C D
2018-05-06 NaN 0.805111 NaN 0.257760
2018-05-07 NaN 1.852752 0.154370 NaN
2018-05-08 NaN 1.023995 NaN 1.861297
2018-05-09 0.606395 NaN 0.650511 0.287029
2018-05-10 0.257984 0.632350 0.743557 0.548811
2018-05-11 NaN 0.986047 NaN 0.393672
2018-05-12 1.622094 NaN NaN NaN'''
c. isin() filter condition
To keep rows whose value appears in a given set of values, use the isin() method to build the condition.
First, we add a column to the test data
df['E'] = ['jim', 'baobe', 'sam', 'tom', 'tonny', 'shilly', 'tom']
Then we build the filter and apply it
data = df[df.E.isin(['sam','tom'])]
print(data)
'''
A B C D E
2018-05-08 1.821732 1.077904 -0.833458 -0.478336 sam
2018-05-09 0.106113 1.337638 -0.080521 -0.057544 tom
2018-05-12 0.252590 2.860070 1.128609 0.726241 tom'''
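The filter conditions above can also be combined. The following sketch uses a small made-up frame (not the random data above) to show `&` (and), `|` (or) and `~` (not); each comparison needs its own parentheses because these operators bind more tightly than `>` or `==`.

```python
import pandas as pd

# Illustrative data only
df = pd.DataFrame({
    'A': [-1.0, 0.5, 2.0, -0.3],
    'E': ['jim', 'sam', 'tom', 'tom'],
})

# Both conditions must hold
both = df[(df.A > 0) & df.E.isin(['sam', 'tom'])]
print(list(both.index))      # [1, 2]

# Either condition may hold
either = df[(df.A > 0) | (df.E == 'jim')]
print(list(either.index))    # [0, 1, 2]

# Negate a condition: rows whose E is NOT in the list
not_in = df[~df.E.isin(['sam', 'tom'])]
print(list(not_in.index))    # [0]
```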
5. The head() and tail() methods
The head() and tail() methods return rows from the top or the bottom of the data
a. View the first 2 rows of data
data = df.head(2)
print(data)
'''
A B C D E
2018-05-06 -1.287527 0.847162 -0.267924 -0.983046 jim
2018-05-07 1.449559 0.075858 1.514339 -0.796074 baobe'''
b. View the last 2 rows of data
data = df.tail(2)
print(data)
'''
A B C D E
2018-05-11 -0.082329 -1.738341 0.442104 0.798396 shilly
2018-05-12 -1.095280 0.091995 -1.044384 -0.090420 tom'''
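Two details of head() and tail() that the examples above do not show: called without an argument they return 5 rows, and a negative n means "everything except that many rows". A short sketch on made-up data:

```python
import pandas as pd

# Illustrative data only: 7 rows
df = pd.DataFrame({'x': range(7)})

print(len(df.head()))             # 5, the default n
print(len(df.tail()))             # 5

all_but_last_two = df.head(-2)    # drop the last 2 rows
print(list(all_but_last_two['x']))    # [0, 1, 2, 3, 4]

all_but_first_two = df.tail(-2)   # drop the first 2 rows
print(list(all_but_first_two['x']))   # [2, 3, 4, 5, 6]
```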
- Sort
a. Sort by index, sort_index()
# Sort by index. axis: 0 sorts the row index, 1 sorts the column labels; ascending: True for ascending, False for descending
df2 = df.sort_index(axis=0, ascending=False)
print(df2)
'''
A B C D E
2018-05-12 -1.172311 0.249176 0.619619 1.201507 tom
2018-05-11 -0.903955 0.479631 -0.963870 0.630743 shilly
2018-05-10 0.338365 2.306580 -0.283463 0.058125 tonny
2018-05-09 0.383475 0.676785 0.526411 -0.818025 tom
2018-05-08 -0.905796 -0.069078 0.073586 1.430755 sam
2018-05-07 0.294005 0.271840 0.428224 1.292046 baobe
2018-05-06 0.260828 0.618682 -1.491793 0.122746 jim
'''
b. Sort by value
df2 = df.sort_values(by='A', ascending=False)
print(df2)
'''
A B C D E
2018-05-10 1.575477 0.344951 0.733876 0.907723 tonny
2018-05-07 0.932946 0.780123 -0.414745 0.658090 baobe
2018-05-11 0.843189 -0.913189 -0.315761 0.660586 shilly
2018-05-08 0.160137 -1.695199 1.934444 0.084714 sam
2018-05-06 -0.293306 -1.162267 0.872631 1.044711 jim
2018-05-12 -0.645678 -0.482878 0.636781 0.217972 tom
2018-05-09 -1.554232 -0.304241 1.669206 1.139795 tom
'''
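The examples above sort along axis 0 and by a single column. As a sketch on a small made-up frame, the same methods can also sort the column labels (axis=1) and sort by several columns with a separate direction for each. Both methods return a new object and leave the original unchanged.

```python
import pandas as pd

# Illustrative data only
df = pd.DataFrame({'B': [2, 1, 2], 'A': [3, 9, 1]}, index=['y', 'x', 'z'])

by_cols = df.sort_index(axis=1)     # order the column labels: A, B
print(list(by_cols.columns))        # ['A', 'B']

by_rows = df.sort_index(axis=0)     # order the row labels: x, y, z
print(list(by_rows.index))          # ['x', 'y', 'z']

# B descending first, then A ascending to break ties
multi = df.sort_values(by=['B', 'A'], ascending=[False, True])
print(list(multi.index))            # ['z', 'y', 'x']

print(list(df.index))               # ['y', 'x', 'z'], original untouched
```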
- View the attribute information of the DataFrame
index, columns, values, describe()
# Row index
print(df.index)
'''
DatetimeIndex(['2018-05-06', '2018-05-07', '2018-05-08', '2018-05-09',
'2018-05-10', '2018-05-11', '2018-05-12'],
dtype='datetime64[ns]', freq='D')
'''
# Columns
print(df.columns)
'''Index(['A', 'B', 'C', 'D', 'E'], dtype='object')'''
# Values
print(df.values)
'''
[[-0.0730245774443805 0.33905255249941385 2.5015746151296807
-0.33123450303072904 'jim']
[0.3790907350980047 -0.18159752603921003 0.9152898508787708
-1.310437299063603 'baobe']
[-1.0601500232526542 1.3286400269975474 1.0971376542803182
1.5576173636367971 'sam']
[1.0412081612316724 -0.4888964096447794 0.780175389471046
-0.47686838058306735 'tom']
[-0.10804989737868699 -0.8324112840144998 1.4111047235564484
-0.48340663193714745 'tonny']
[0.37660491844670085 0.45699643985409844 0.2527901594952756
1.289500089112413 'shilly']
[-0.8105633809548823 0.18592373788802305 0.5400829628974213
-0.5655611250881947 'tom']]
'''
# Summary statistics
print(df.describe())
'''
A B C D
count 7.000000 7.000000 7.000000 7.000000
mean -0.036412 0.115387 1.071165 -0.045770
std 0.725524 0.706560 0.732924 1.055322
min -1.060150 -0.832411 0.252790 -1.310437
25% -0.459307 -0.335247 0.660129 -0.524484
50% -0.073025 0.185924 0.915290 -0.476868
75% 0.377848 0.398024 1.254121 0.479133
max 1.041208 1.328640 2.501575 1.557617
'''
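Note that the describe() output above has no column E: by default only numeric columns are summarized. The sketch below, on a small made-up frame, shows `include='all'` to bring text columns in, and why `values` printed an array of generic Python objects (mixed column types fall back to the object dtype).

```python
import numpy as np
import pandas as pd

# Illustrative data only
df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'E': ['jim', 'tom', 'tom']})

numeric_only = df.describe()             # default: numeric columns only
print(list(numeric_only.columns))        # ['A']
print(numeric_only.loc['mean', 'A'])     # 2.0

everything = df.describe(include='all')  # adds unique/top/freq for text columns
print(everything.loc['top', 'E'])        # tom
print(everything.loc['freq', 'E'])       # 2

mixed_values = df.to_numpy()             # same data as df.values
print(mixed_values.dtype)                # object, because the columns are mixed
print(df[['A']].to_numpy().dtype)        # float64 for a purely numeric selection
```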