Data Analysis Study Notes (9) -- Pandas: Data Structure and Basic Use

three data structures

  • Series: series (array)
  • DataFrame: data frame (table)
  • Panel: Panel (table container)

    description and comparison

Data structure   Dimensions   Description
Series           1            Array of homogeneous data
DataFrame        2            Size-mutable tabular data
Panel            3            Size-mutable; can be understood as a container of tables

Note: A higher-dimensional data structure is a container for the lower-dimensional one. For example, a DataFrame is in effect a container of Series. Of the three, DataFrame is the most widely used and is one of the most important data structures in pandas. (Panel was deprecated in pandas 0.20 and removed in later releases, so it is listed here only for completeness.)
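
The container relationship in the note can be checked directly. The sketch below is only an illustration: the column names `price` and `qty` and their values are made up. It shows that a DataFrame can be assembled from Series sharing an index, and that selecting one column returns a Series.

```python
import pandas as pd

# Two Series sharing the same index (names and values are illustrative)
price = pd.Series([1.5, 2.0, 3.25], index=['a', 'b', 'c'])
qty = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Each Series becomes one column of the DataFrame
df = pd.DataFrame({'price': price, 'qty': qty})

# Selecting a single column gives back a Series
col = df['price']
print(type(col).__name__)   # Series
print(df.shape)             # (3, 2)
```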

Series object creation and manipulation

Series objects store data as one-dimensional arrays, so a one-dimensional array (or list) is normally required as the data source.

  • Create a Series object
import numpy as np
import pandas as pd

# Series object
s = pd.Series([2, 4, np.nan, 'lolita'])
print(s)
'''
0         2
1         4
2       NaN
3    lolita
dtype: object'''
s = pd.Series(list('abcde'))
print(s)
'''
0    a
1    b
2    c
3    d
4    e
dtype: object'''
# Multi-dimensional data raises an error
s = pd.Series(np.arange(8).reshape(2,4))
print(s)
'''raise Exception('Data must be 1-dimensional')
Exception: Data must be 1-dimensional
'''
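
The `dtype: object` in the first output above comes from mixing numbers with a string. As a small sketch of how the dtype depends on the input (the variable names are illustrative), a list of numbers alone yields a numeric dtype, and `np.nan` forces floats. The exception class and message for 2-D input differ between pandas versions, so the code catches the general `Exception` rather than assuming one.

```python
import numpy as np
import pandas as pd

ints = pd.Series([2, 4, 6])                  # integers only
floats = pd.Series([2, 4, np.nan])           # NaN forces a float dtype
mixed = pd.Series([2, 4, np.nan, 'lolita'])  # mixed types fall back to object

print(ints.dtype, floats.dtype, mixed.dtype)

# 2-D input is rejected; flatten it first if a Series is really wanted
arr = np.arange(8).reshape(2, 4)
try:
    pd.Series(arr)
    raised = False
except Exception as exc:   # exception class differs between pandas versions
    raised = True
    print(type(exc).__name__, exc)

flat = pd.Series(arr.ravel())  # the same eight values as a 1-D array
print(len(flat))               # 8
```
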
  • View and modify arrays

Series data can be read much like an ordinary one-dimensional array: use a subscript for a single element or a slice for several.

s = pd.Series(list('abcde'))

retrieve data

value = s[2]
print(value)
'''c'''
value = s[1:3]
print(value)
'''
1    b
2    c
dtype: object
'''

Modify data value

s[3:] = np.arange(0, len(s)-3)
print(s)
'''
0    a
1    b
2    c
3    0
4    1
dtype: object'''

Modify array size

s[len(s)] = 3
print(s)
'''
0    a
1    b
2    c
3    0
4    1
5    3
dtype: object'''
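
The examples above use the default integer index, where labels and positions coincide. When the index is something else, `.loc` (labels) and `.iloc` (positions) make the intent explicit. The following is a minimal sketch with a hypothetical string index, not part of the original notes.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_label = s.loc['b']         # look up by index label
by_position = s.iloc[1]       # look up by integer position
print(by_label, by_position)  # 20 20

# Label slices include both endpoints; positional slices exclude the stop
label_slice = list(s.loc['b':'c'])
position_slice = list(s.iloc[1:2])
print(label_slice)            # [20, 30]
print(position_slice)         # [20]

# Assigning to a label that does not exist yet appends a new element
s.loc['e'] = 50
print(len(s))                 # 5
```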

DataFrame object creation and manipulation

A DataFrame is a container of Series objects, so it is two-dimensional. Unlike a two-dimensional NumPy array, a DataFrame has both a row index and a column index. Intuitively it resembles a table, so you can treat it as table-structured data.

  • Create a DataFrame object

First, we use NumPy to create a 2D array as the data source, without setting the row or column index:

df = pd.DataFrame(np.arange(15).reshape(3,5))
print(df)
'''
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14'''

As the output shows, when no row or column index is given, the DataFrame labels rows and columns with integers starting from 0.
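
To make the default labels concrete, the sketch below inspects them and then replaces them. The label names (`r0`, `a`, and so on) are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(3, 5))

default_index = df.index
default_columns = list(df.columns)
print(default_index)      # RangeIndex(start=0, stop=3, step=1)
print(default_columns)    # [0, 1, 2, 3, 4]

# Labels can also be assigned after construction
df.columns = list('abcde')
df.index = ['r0', 'r1', 'r2']
print(df.loc['r1', 'c'])  # 7
```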

Next, we specify the row and column indices for the DataFrame:

# Row index
dates = pd.date_range('20180506', periods=7)
print(dates)
'''
DatetimeIndex(['2018-05-06', '2018-05-07', '2018-05-08', '2018-05-09',
               '2018-05-10', '2018-05-11', '2018-05-12'],
              dtype='datetime64[ns]', freq='D')
              '''
# Column index
columns = list('ABCD')
df = pd.DataFrame(np.random.randn(7,4), index=dates, columns=columns)
print(df)
'''
                   A         B         C         D
2018-05-06 -0.274367  0.402984 -0.381829  0.123850
2018-05-07  0.422842  0.548137 -0.183929  0.800568
2018-05-08 -0.485918 -2.088587  1.407923 -0.249723
2018-05-09  1.929589  0.579739  1.395986 -0.602761
2018-05-10  0.016730  0.278051  0.100124 -0.208399
2018-05-11  1.050533  0.147563  0.480859 -0.608219
2018-05-12  0.253236 -1.476788  0.115376 -0.488298
'''

Alternatively, we can create a DataFrame from a dictionary whose values can each be converted to a Series-like object (a Series, an array, or a scalar that is repeated for every row).

For example:

# Another way to create a DataFrame
df2 = pd.DataFrame({
    'A':pd.Series(1, index=list(range(4))),
    'B':26,
    'C':pd.Timestamp('20180506'),
    'D':np.array(np.arange(4))
})
print(df2)
'''
   A   B          C  D
0  1  26 2018-05-06  0
1  1  26 2018-05-06  1
2  1  26 2018-05-06  2
3  1  26 2018-05-06  3'''
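
A short sketch of what the dictionary form does with each kind of value: the Series supplies the row index, scalars are repeated on every row, and each column keeps its own dtype. The exact dtype names printed depend on platform and pandas version, so the code prints them without assuming specific ones.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'A': pd.Series(1, index=list(range(4))),  # a Series fixes the row index
    'B': 26,                                  # scalar, repeated on every row
    'C': pd.Timestamp('20180506'),            # scalar timestamp, also repeated
    'D': np.arange(4),                        # array, length must match the index
})

print(df2.dtypes)   # one dtype per column
print(df2.shape)    # (4, 4)
```
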
  • Select area to view data

Sample data (the values are random, so the printed numbers differ from run to run)

dates = pd.date_range('20180506', periods=7)
columns = list('ABCD')
df = pd.DataFrame(np.random.randn(7,4), index=dates, columns=columns)

1. Index the DataFrame directly to get data

a. Select a column by column label

data = df['A']     # equivalent to df.A
print(data)
'''
2018-05-06    0.151124
2018-05-07    0.771603
2018-05-08   -1.457813
2018-05-09   -0.102214
2018-05-10   -1.266102
2018-05-11    0.780995
2018-05-12    0.957048
Freq: D, Name: A, dtype: float64'''

b. Get some rows by row label

data = df['20180506':'20180510']
print(data)
'''
                   A         B         C         D
2018-05-06 -0.361994 -2.217695 -0.789552 -0.966792
2018-05-07  0.479107  0.723792  0.976540  1.522255
2018-05-08  1.463182 -0.106349 -1.781704 -1.399375
2018-05-09  0.728835  0.768247  0.206403 -0.759910
2018-05-10  0.862932  0.025200 -0.246032  0.626610'''

c. Select some rows by slicing

data = df[1:5:2]
print(data)
'''
                   A         B         C         D
2018-05-07  0.479107  0.723792  0.976540  1.522255
2018-05-09  0.728835  0.768247  0.206403 -0.759910'''
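
The three forms of direct indexing return different things depending on what goes inside the brackets. The sketch below uses `np.arange` instead of random numbers (an assumption made so that the results are reproducible).

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180506', periods=7)
df = pd.DataFrame(np.arange(28).reshape(7, 4), index=dates, columns=list('ABCD'))

one_col = df['A']            # single label -> Series
two_cols = df[['A', 'C']]    # list of labels -> DataFrame
rows = df[1:5:2]             # slice -> rows at positions 1 and 3

print(type(one_col).__name__, type(two_cols).__name__)  # Series DataFrame
print(list(rows['A']))                                  # [4, 12]
```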

2. The loc indexer + labels to get data

a. Via row labels

data = df.loc['20180508']
print(data)
'''
A   -0.084551
B    0.666451
C    2.010399
D    0.121208
Name: 2018-05-08 00:00:00, dtype: float64'''

b. By column labels

data = df.loc[:, 'A':'C']
print(data)
'''
                   A         B         C
2018-05-06  0.268044 -2.011492  0.934763
2018-05-07  1.054199 -0.147792 -0.464180
2018-05-08 -0.132866  0.494600 -0.275043
2018-05-09  0.713573  1.727417 -0.121440
2018-05-10  0.479439  1.195971  2.710756
2018-05-11  0.374772 -0.800548  0.239096
2018-05-12 -1.009797  0.224585 -0.577983'''

c. Row Label + Column Label

# Row labels + column labels
data = df.loc[dates[1:4], ['C','B']]
print(data)
'''
                   C         B
2018-05-07 -0.981423  1.515804
2018-05-08  2.010399  0.666451
2018-05-09 -0.541498 -0.760648'''

Note: with loc, the first argument selects rows by label and the second selects columns by label; either argument can also be a list of labels or a label slice.
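
One detail worth checking is that label slices with `loc` include both endpoints. A minimal sketch, again assuming deterministic `np.arange` data rather than the random sample data:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180506', periods=7)
df = pd.DataFrame(np.arange(28).reshape(7, 4), index=dates, columns=list('ABCD'))

# Label slices are inclusive at both ends: three rows, columns A to C
block = df.loc['20180507':'20180509', 'A':'C']
print(block.shape)            # (3, 3)

# A single row label plus a single column label returns a scalar
value = df.loc[dates[2], 'B']
print(value)                  # 9
```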

3. The iloc indexer + integer positions to get data

a. By row index

data = df.iloc[2]
print(data)
'''
A   -0.118724
B    0.328970
C    0.748574
D    0.654055
Name: 2018-05-08 00:00:00, dtype: float64'''

b. By column index

data = df.iloc[:, :3:2]
print(data)
'''
                   A         C
2018-05-06  0.548679 -0.420244
2018-05-07  0.028167 -0.878201
2018-05-08 -0.118724  0.748574
2018-05-09 -1.015111 -1.348442
2018-05-10 -0.226723  0.486991
2018-05-11 -0.553960 -0.485923
2018-05-12  1.044797 -0.158911'''

c. By row index + column index

data = df.iloc[2:4, :2]
print(data)
'''
                   A         B
2018-05-08 -0.118724  0.328970
2018-05-09 -1.015111  0.239186'''

Note: each argument can also be a list of integer positions, such as df.iloc[[1,3,4],[2,3]]
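
The list form mentioned in the note can be tried on reproducible data. This is a sketch that assumes `np.arange` values in place of the random sample data:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180506', periods=7)
df = pd.DataFrame(np.arange(28).reshape(7, 4), index=dates, columns=list('ABCD'))

# Lists of positions: rows 1, 3, 4 and columns 2, 3 (that is, C and D)
picked = df.iloc[[1, 3, 4], [2, 3]]
print(picked.values.tolist())   # [[6, 7], [14, 15], [18, 19]]

# Positional slices exclude the stop position
sub = df.iloc[2:4, :2]
print(sub.shape)                # (2, 2)
```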

4. Boolean indexing: pass a filter condition

a. A column as a filter condition

data = df[df.A>0]
print(data)
'''
                   A         B         C         D
2018-05-08  0.516235  3.104029 -1.670880 -0.437195
2018-05-11  0.974800 -2.326192 -0.281022 -0.476735'''

b. Filter every value (entries that fail the condition become NaN)

data = df[df>0]
print(data)
'''
                   A         B         C         D
2018-05-06       NaN  0.805111       NaN  0.257760
2018-05-07       NaN  1.852752  0.154370       NaN
2018-05-08       NaN  1.023995       NaN  1.861297
2018-05-09  0.606395       NaN  0.650511  0.287029
2018-05-10  0.257984  0.632350  0.743557  0.548811
2018-05-11       NaN  0.986047       NaN  0.393672
2018-05-12  1.622094       NaN       NaN       NaN'''

c. isin() filter condition

To keep only the rows whose value appears in a given set, build the condition with isin().

First, add a column E to the test data:

df['E'] = ['jim', 'baobe', 'sam', 'tom', 'tonny', 'shilly', 'tom']

Then build the filter and apply it:

data = df[df.E.isin(['sam','tom'])]
print(data)
'''
                   A         B         C         D    E
2018-05-08  1.821732  1.077904 -0.833458 -0.478336  sam
2018-05-09  0.106113  1.337638 -0.080521 -0.057544  tom
2018-05-12  0.252590  2.860070  1.128609  0.726241  tom'''
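
Boolean conditions can also be combined or negated. The sketch below uses a small hand-written DataFrame (the values are illustrative, not the sample data above) to show `&` and `~` together with `isin()`.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [-1.0, 0.5, 2.0, 3.0, -0.5],
    'E': ['jim', 'sam', 'tom', 'tonny', 'tom'],
})

in_names = df.E.isin(['sam', 'tom'])

# & combines masks (each comparison needs its own parentheses); ~ negates a mask
positive_and_named = df[(df.A > 0) & in_names]
not_named = df[~in_names]

print(positive_and_named.E.tolist())  # ['sam', 'tom']
print(not_named.E.tolist())           # ['jim', 'tonny']
```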

5. The head() and tail() methods

head() returns rows from the top of the data and tail() returns rows from the bottom.

a. View the first 2 rows of data

data = df.head(2)
print(data)
'''
                   A         B         C         D      E
2018-05-06 -1.287527  0.847162 -0.267924 -0.983046    jim
2018-05-07  1.449559  0.075858  1.514339 -0.796074  baobe'''

b. View the last 2 rows of data

data = df.tail(2)
print(data)
'''
                   A         B         C         D       E
2018-05-11 -0.082329 -1.738341  0.442104  0.798396  shilly
2018-05-12 -1.095280  0.091995 -1.044384 -0.090420     tom'''
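
Both methods take an optional row count. As a brief sketch on reproducible data: calling them with no argument returns five rows, and a negative count drops rows from the opposite end.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(28).reshape(7, 4), columns=list('ABCD'))

print(len(df.head()))             # 5, the default number of rows
print(len(df.tail(2)))            # 2
print(len(df.head(-2)))           # 5, every row except the last two
print(df.tail(2).index.tolist())  # [5, 6]
```
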
  • Sort

a. Index sorting, sort_index()

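# Sort by index. axis=0 sorts the row index (vertical), axis=1 sorts the column labels (horizontal); ascending=True is ascending order, False is descending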
df2 = df.sort_index(axis=0, ascending=False)
print(df2)
'''
                   A         B         C         D       E
2018-05-12 -1.172311  0.249176  0.619619  1.201507     tom
2018-05-11 -0.903955  0.479631 -0.963870  0.630743  shilly
2018-05-10  0.338365  2.306580 -0.283463  0.058125   tonny
2018-05-09  0.383475  0.676785  0.526411 -0.818025     tom
2018-05-08 -0.905796 -0.069078  0.073586  1.430755     sam
2018-05-07  0.294005  0.271840  0.428224  1.292046   baobe
2018-05-06  0.260828  0.618682 -1.491793  0.122746     jim
'''

b. Sort by value

df2 = df.sort_values(by='A', ascending=False)
print(df2)
'''
                   A         B         C         D       E
2018-05-10  1.575477  0.344951  0.733876  0.907723   tonny
2018-05-07  0.932946  0.780123 -0.414745  0.658090   baobe
2018-05-11  0.843189 -0.913189 -0.315761  0.660586  shilly
2018-05-08  0.160137 -1.695199  1.934444  0.084714     sam
2018-05-06 -0.293306 -1.162267  0.872631  1.044711     jim
2018-05-12 -0.645678 -0.482878  0.636781  0.217972     tom
2018-05-09 -1.554232 -0.304241  1.669206  1.139795     tom
'''
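
`sort_values` also accepts several columns, each with its own direction. A minimal sketch on a small hand-written DataFrame (the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [0.3, -1.2, 0.9, 0.1],
    'E': ['tom', 'sam', 'tom', 'jim'],
})

# Sort by E ascending, then by A descending within equal E values
ordered = df.sort_values(by=['E', 'A'], ascending=[True, False])
print(ordered.index.tolist())   # [3, 1, 2, 0]

# sort_values returns a new object; the original order is unchanged
print(df.index.tolist())        # [0, 1, 2, 3]
```
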
  • View the attribute information of the DataFrame

index, columns, values, describe()

# Index
print(df.index)
'''
DatetimeIndex(['2018-05-06', '2018-05-07', '2018-05-08', '2018-05-09',
               '2018-05-10', '2018-05-11', '2018-05-12'],
              dtype='datetime64[ns]', freq='D')
              '''

# Columns
print(df.columns)
'''Index(['A', 'B', 'C', 'D', 'E'], dtype='object')'''

# Values
print(df.values)
'''
[[-0.0730245774443805 0.33905255249941385 2.5015746151296807
  -0.33123450303072904 'jim']
 [0.3790907350980047 -0.18159752603921003 0.9152898508787708
  -1.310437299063603 'baobe']
 [-1.0601500232526542 1.3286400269975474 1.0971376542803182
  1.5576173636367971 'sam']
 [1.0412081612316724 -0.4888964096447794 0.780175389471046
  -0.47686838058306735 'tom']
 [-0.10804989737868699 -0.8324112840144998 1.4111047235564484
  -0.48340663193714745 'tonny']
 [0.37660491844670085 0.45699643985409844 0.2527901594952756
  1.289500089112413 'shilly']
 [-0.8105633809548823 0.18592373788802305 0.5400829628974213
  -0.5655611250881947 'tom']]
  '''

# Summary statistics
print(df.describe())
'''
              A         B         C         D
count  7.000000  7.000000  7.000000  7.000000
mean  -0.036412  0.115387  1.071165 -0.045770
std    0.725524  0.706560  0.732924  1.055322
min   -1.060150 -0.832411  0.252790 -1.310437
25%   -0.459307 -0.335247  0.660129 -0.524484
50%   -0.073025  0.185924  0.915290 -0.476868
75%    0.377848  0.398024  1.254121  0.479133
max    1.041208  1.328640  2.501575  1.557617
'''
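
Note that the describe() output above omits column E: by default only numeric columns are summarized. The sketch below, on a small illustrative DataFrame, shows `include='all'` bringing the text column in, and why `values` printed an array of mixed objects.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0],
    'E': ['jim', 'tom', 'sam', 'tom'],
})

# By default describe() summarizes only the numeric columns
default_cols = df.describe().columns.tolist()
print(default_cols)                        # ['A']

# include='all' adds non-numeric columns (unique, top and freq rows)
summary = df.describe(include='all')
print(summary.loc['top', 'E'], summary.loc['freq', 'E'])

# With mixed column types, .values falls back to an object array
print(df.values.dtype)                     # object
```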
