Life is short, I use Python
Previous articles in this series:
Python Data Analysis for Beginners (1): Data Analysis Basics
Python Data Analysis for Beginners (2): Pandas (1) Overview
Python Data Analysis for Beginners (3): Pandas (2) Data Structure: Series
Python Data Analysis for Beginners (4): Pandas (3) Data Structure: DataFrame
Python Data Analysis for Beginners (5): Pandas (4) Basic Operations (1) Viewing Data
Introduction
The previous article described some basic ways to view data in Pandas. For accessing data, however, the official recommendation is to use .at, .iat, .loc, and .iloc, the data access methods that Pandas has optimized.
First we create a DataFrame for demonstration. To be a little lazy, we copy the DataFrame from the previous article, as follows:
import numpy as np
import pandas as pd
dates = pd.date_range('20200101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df)
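Because the DataFrame above is filled with random numbers, the values you see will differ from the output printed in this article. If reproducible values are wanted, one option is a seeded generator. This is only a sketch: the seed 0 and the names rng, df_seeded, and df_again are our own choices, not part of the original example.

```python
import numpy as np
import pandas as pd

# Assumption: the seed value 0 is arbitrary and chosen only for illustration.
rng = np.random.default_rng(0)
dates = pd.date_range('20200101', periods=6)
df_seeded = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list('ABCD'))

# Rebuilding with a generator that uses the same seed yields identical values.
rng_again = np.random.default_rng(0)
df_again = pd.DataFrame(rng_again.standard_normal((6, 4)), index=dates, columns=list('ABCD'))

print(df_seeded.shape)             # (6, 4)
print(df_seeded.equals(df_again))  # True
```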
A DataFrame is made up of columns and can be viewed as a collection of Series. Selecting a single column directly gives us a Series, as follows:
# Select a single column, which yields a Series
print(df['A'])
# Output
2020-01-01 -0.065477
2020-01-02 -1.089716
2020-01-03 0.049215
2020-01-04 -0.017615
2020-01-05 -0.910402
2020-01-06 -0.008887
Freq: D, Name: A, dtype: float64
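To make the relationship between a DataFrame and a Series concrete, the sketch below compares selecting with a single column name and with a one-element list. It uses fixed integer values instead of random numbers so the result is predictable; the names col and sub are ours.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20200101', periods=6)
# Fixed values (0..23) instead of random ones, so the result is predictable.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

col = df['A']      # a single column name returns a Series
sub = df[['A']]    # a list of column names returns a DataFrame, even with one name

print(type(col).__name__)  # Series
print(type(sub).__name__)  # DataFrame
print(col.name)            # A
```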
Next, we can use [] to slice a DataFrame, for example:
# Row slicing
print(df[0:3])
print(df['20200101' : '20200103'])
# Output
A B C D
2020-01-01 -0.065477 1.603827 1.152969 0.742842
2020-01-02 -1.089716 -0.540936 0.456917 0.295272
2020-01-03 0.049215 -1.182454 -0.294177 -0.698877
A B C D
2020-01-01 -0.065477 1.603827 1.152969 0.742842
2020-01-02 -1.089716 -0.540936 0.456917 0.295272
2020-01-03 0.049215 -1.182454 -0.294177 -0.698877
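As we can see, [] slices the rows of a DataFrame, either by integer position or by index label. Note that the integer slice 0:3 excludes position 3, while the label slice includes its end label '20200103'.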
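The two row slices can be compared directly. In this sketch, with fixed values rather than random ones and with names of our own choosing, the positional slice stops before its end position while the label slice includes its end label, so both return the same three rows.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20200101', periods=6)
# Fixed values (0..23) instead of random ones, so the result is predictable.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

by_position = df[0:3]                  # positions 0, 1, 2; position 3 is excluded
by_label = df['20200101':'20200103']   # 2020-01-01 through 2020-01-03, end label included

print(len(by_position), len(by_label))  # 3 3
print(by_position.equals(by_label))     # True
```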
loc
With loc, we can locate data by index labels (row names) and column names.
For example, we extract one row of data by its label, as follows:
# Extract one row by label
print(df.loc[dates[0]])
# Output
A -0.065477
B 1.603827
C 1.152969
D 0.742842
Name: 2020-01-01 00:00:00, dtype: float64
Note that dates here is the date index we generated earlier. The same row can also be selected with df.loc['20200101'].
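A quick check of that equivalence, assuming a daily index like the one in this article (the variable names are ours and the values are fixed rather than random):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20200101', periods=6)
# Fixed values (0..23) instead of random ones, so the result is predictable.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

row_by_timestamp = df.loc[dates[0]]   # label given as a Timestamp taken from the index
row_by_string = df.loc['20200101']    # the same label given as a date string

print(row_by_timestamp.equals(row_by_string))  # True
```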
Similarly, we can select several columns by label, or combine a label slice of rows with a list of columns, as follows:
# Extract multiple columns by label
print(df.loc[:, ['A', 'B']])
# Output
A B
2020-01-01 -0.065477 1.603827
2020-01-02 -1.089716 -0.540936
2020-01-03 0.049215 -1.182454
2020-01-04 -0.017615 -0.777637
2020-01-05 -0.910402 -0.173959
2020-01-06 -0.008887 0.525035
# Slice rows by label and select columns by label; both endpoints of the row slice are included
print(df.loc['20200101':'20200103', ['A', 'B']])
# Output
A B
2020-01-01 -0.065477 1.603827
2020-01-02 -1.089716 -0.540936
2020-01-03 0.049215 -1.182454
# Return two columns from one row
print(df.loc['20200101', ['A', 'B']])
# Output
A -0.065477
B 1.603827
Name: 2020-01-01 00:00:00, dtype: float64
So how do we get the value at one specific position? If we picture the DataFrame as a coordinate system, then specifying both the row and the column identifies a unique point, as follows:
# Get a single scalar value
print(df.loc[dates[0], 'A'])
# Output
-0.06547653622759132
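The loc calls in this section return different kinds of objects depending on what is passed for the rows and columns. The sketch below summarizes them with fixed values; the variable names row, cols, block, and value are ours.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20200101', periods=6)
# Fixed values (0..23) instead of random ones, so the result is predictable.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

row = df.loc[dates[0]]                              # one row label -> Series
cols = df.loc[:, ['A', 'B']]                        # all rows, list of columns -> DataFrame
block = df.loc['20200101':'20200103', ['A', 'B']]   # row slice and column list -> DataFrame
value = df.loc[dates[0], 'A']                       # one row label and one column -> scalar

print(type(row).__name__)    # Series
print(type(cols).__name__)   # DataFrame
print(block.shape)           # (3, 2)
print(value)                 # 0
```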
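iloc
iloc is very similar to loc above. loc locates data by label, while iloc locates data by integer position, which is why its arguments are integers. The name iloc is usually read as "integer location".
Let's look at a simple example first, using an integer to select one row:
# Select by integer position
print(df.iloc[3])
# Output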
A -0.017615
B -0.777637
C 0.824364
D 0.210244
Name: 2020-01-04 00:00:00, dtype: float64
We can also select using slices:
# Slice rows and columns by integer position
print(df.iloc[3:5, 0:2])
# Output
A B
2020-01-04 -0.017615 -0.777637
2020-01-05 -0.910402 -0.173959
# Select by position using lists of integers
print(df.iloc[[1, 2, 4], [0, 2]])
# Output
A C
2020-01-02 -1.089716 0.456917
2020-01-03 0.049215 -0.294177
2020-01-05 -0.910402 -1.140222
# Slice whole rows
print(df.iloc[1:3, :])
# Output
A B C D
2020-01-02 -1.089716 -0.540936 0.456917 0.295272
2020-01-03 0.049215 -1.182454 -0.294177 -0.698877
# Slice whole columns
print(df.iloc[:, 1:3])
# Output
B C
2020-01-01 1.603827 1.152969
2020-01-02 -0.540936 0.456917
2020-01-03 -1.182454 -0.294177
2020-01-04 -0.777637 0.824364
2020-01-05 -0.173959 -1.140222
2020-01-06 0.525035 -1.076101
Likewise, we can select a single scalar value directly with iloc:
# Get a single scalar value, as above
print(df.iloc[1, 1])
# Output
-0.540936460611594
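The difference between label-based and position-based access is easiest to see when the row labels are themselves integers that do not match the row positions. The small frame below is hypothetical and exists only for this illustration.

```python
import pandas as pd

# Hypothetical frame: the row labels 10, 20, 30 differ from the positions 0, 1, 2.
df_int = pd.DataFrame({'x': ['a', 'b', 'c']}, index=[10, 20, 30])

by_label = df_int.loc[10, 'x']      # label 10 is the first row
by_position = df_int.iloc[0, 0]     # position 0 is also the first row
print(by_label, by_position)        # a a

# Slices differ too: loc includes the end label, iloc excludes the end position.
print(len(df_int.loc[10:20]))       # 2
print(len(df_int.iloc[0:1]))        # 1

# There is no row labelled 0, so a label lookup fails.
try:
    df_int.loc[0]
    found = True
except KeyError:
    found = False
print(found)                        # False
```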
at and iat
at and iat are both used to access a single element, and they are faster than loc and iloc above.
at is used in much the same way as loc, for example:
print(df.at[dates[0], 'A'])
# Output
-0.06547653622759132
iat relates to iloc in the same way that at relates to loc, for example:
print(df.iat[1, 1])
# Output
-0.540936460611594
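The sketch below checks that at and iat return the same values as loc and iloc, and gives a rough way to measure the speed difference with timeit. The repetition count of 1000 is an arbitrary choice of ours, and the timings depend on the machine and the Pandas version, so no expected numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

dates = pd.date_range('20200101', periods=6)
# Fixed values (0..23) instead of random ones, so the result is predictable.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# at / loc use labels, iat / iloc use integer positions; the values are the same.
same_by_label = df.at[dates[0], 'A'] == df.loc[dates[0], 'A']
same_by_position = df.iat[1, 1] == df.iloc[1, 1]
print(same_by_label, same_by_position)  # True True

# Rough timing of 1000 scalar lookups each (assumption: 1000 is an arbitrary count).
t_at = timeit.timeit(lambda: df.at[dates[0], 'A'], number=1000)
t_loc = timeit.timeit(lambda: df.loc[dates[0], 'A'], number=1000)
print(f'at: {t_at:.4f}s, loc: {t_loc:.4f}s')
```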
Other
We can also select data with conditions, for example by filtering on the values of a single column:
print(df[df.A > 0])
# Output
A B C D
2020-01-03 0.049215 -1.182454 -0.294177 -0.698877
The example above outputs all rows whose value in column A is greater than 0.
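Under the hood, the condition produces a boolean Series, and [] keeps the rows where it is True. The sketch below makes that mask explicit and also combines two conditions with &. The thresholds and variable names are ours, and fixed values are used so the row counts are predictable.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20200101', periods=6)
# Fixed values: column A is 0, 4, 8, 12, 16, 20 and column B is 1, 5, 9, 13, 17, 21.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

mask = df['A'] > 8          # boolean Series, one True/False per row
selected = df[mask]         # keeps only the rows where the mask is True
print(mask.dtype)           # bool
print(len(selected))        # 3

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
both = df[(df['A'] > 8) & (df['B'] < 20)]
print(len(both))            # 2
```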
We can also apply a condition to the whole df. Values that do not satisfy the condition are shown as NaN, for example:
print(df[df < 0])
# Output
A B C D
2020-01-01 -0.065477 NaN NaN NaN
2020-01-02 -1.089716 -0.540936 NaN NaN
2020-01-03 NaN -1.182454 -0.294177 -0.698877
2020-01-04 -0.017615 -0.777637 NaN NaN
2020-01-05 -0.910402 -0.173959 -1.140222 -0.662615
2020-01-06 -0.008887 NaN -1.076101 -0.862407
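Selecting with a whole-DataFrame condition keeps the original shape and replaces every value that fails the condition with NaN. As a sketch with fixed values, the result matches what DataFrame.where returns for the same condition; the names masked and via_where are ours.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20200101', periods=6)
# Fixed values from -10 to 13, so exactly ten of the 24 values are negative.
df = pd.DataFrame(np.arange(24).reshape(6, 4) - 10, index=dates, columns=list('ABCD'))

masked = df[df < 0]            # same shape as df, NaN where the condition is False
via_where = df.where(df < 0)   # the equivalent explicit call

print(masked.shape)                   # (6, 4)
print(int(masked.isna().sum().sum())) # 14
print(masked.equals(via_where))       # True
```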
Sample code
As usual, all the sample code is uploaded to the code repositories on Github and Gitee for everyone's convenience.