When working with pandas, you often need to compute statistics on a particular row, a particular column, or the subset of data that meets some condition.
This post summarizes the common ways to select data in pandas, including loc, iloc, and other methods.
First read the data:
import pandas as pd
df = pd.read_excel('zpxx.xlsx')
1. Obtain elements, indexes, and column names
You can use the DataFrame attributes values, index, and columns to get the elements, the row index, and the column names, respectively.
print('Elements:\n', df.values) # returns a 2-D NumPy array (ndarray), not a list
print('Index:\n', df.index) # returns the row index; wrap it in list() to get a list
print('Column names:\n', df.columns) # returns the field names; wrap it in list() to get a list
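To make the return types concrete, here is a minimal sketch. It assumes a small stand-in DataFrame with made-up values instead of the real zpxx.xlsx file; only the column names are borrowed from the examples in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the spreadsheet; the values are made up.
df = pd.DataFrame({
    '关键词': ['python', 'java', 'sql'],
    '学历': ['本科', '硕士', '本科'],
})

values = df.values              # 2-D NumPy array, one inner row per DataFrame row
index_list = list(df.index)     # the default RangeIndex becomes [0, 1, 2]
column_list = list(df.columns)  # ['关键词', '学历']

print(type(values).__name__, values.shape)
print(index_list)
print(column_list)
```

The pandas documentation recommends df.to_numpy() over df.values for new code; both return an ndarray here.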
2. Row selection
(1) head() and tail() methods
The head() and tail() methods of DataFrame return a run of consecutive rows from the beginning or the end of the data. By default they return the first or last 5 rows; pass a number to the method to choose how many rows to view.
By default:
print('First 5 rows (default):\n', df.head())
print('Last 5 rows (default):\n', df.tail())
Specify the number of rows to view:
print('First 3 rows:\n', df.head(3))
View the first rows of a single field:
print('First 3 rows of the 关键词 (keyword) field:\n', df['关键词'].head(3))
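The following sketch shows the row counts these calls produce. The 8-row DataFrame and its helper column n are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Hypothetical 8-row frame; values are placeholders.
df = pd.DataFrame({'关键词': list('abcdefgh'), 'n': range(8)})

first_default = df.head()        # no argument: first 5 rows
last_two = df.tail(2)            # last 2 rows, original index labels kept
col_head = df['关键词'].head(3)   # head() also works on a single column (a Series)
all_but_last_two = df.head(-2)   # negative n: every row except the last 2

print(len(first_default), list(last_two.index), list(col_head), len(all_but_last_two))
```

A negative argument is handy when you want to drop a fixed number of trailing rows without knowing the total length.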
(2) Slicing method
Format: df[m:n], where m and n are row positions; the interval is left-closed and right-open.
print('Rows 2 through 6:\n', df[1:6])
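A short sketch of the slicing rule, using a made-up frame with string index labels so that positions and labels cannot be confused. With integer bounds the slice is positional and excludes the end, which matches df.iloc[m:n].

```python
import pandas as pd

# Hypothetical frame with string labels; values are placeholders.
df = pd.DataFrame({'n': [10, 20, 30, 40, 50]}, index=['a', 'b', 'c', 'd', 'e'])

sliced = df[1:3]            # positions 1 and 2 -> labels 'b' and 'c'
explicit = df.iloc[1:3]     # the explicit positional equivalent

print(list(sliced.index))
print(sliced.equals(explicit))
```

Writing df.iloc[m:n] states the intent more clearly than df[m:n], although both select the same rows here.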
3. Column selection
(1) Dictionary-style access by key.
Select one column: df['column name']
Select multiple columns: df[['column name 1', 'column name 2', 'column name 3']]
Select one column:
print('Select the 采集时间 (collection time) column:\n', df['采集时间'])
Select multiple columns:
print('Select multiple columns:\n', df[['关键词', '采集时间']])
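One detail worth seeing in code is the return type: a single key gives a Series, while a list of keys gives a DataFrame, even when the list has one element. The sketch below uses made-up values under the column names from this post.

```python
import pandas as pd

# Hypothetical stand-in data; the dates and keywords are made up.
df = pd.DataFrame({
    '关键词': ['python', 'java'],
    '采集时间': ['2021-01-01', '2021-01-02'],
    '学历': ['本科', '硕士'],
})

one = df['采集时间']              # single key -> Series
one_as_frame = df[['采集时间']]   # one-element list -> DataFrame with one column
two = df[['关键词', '采集时间']]  # list of keys -> DataFrame, columns in the order given

print(type(one).__name__, type(one_as_frame).__name__, two.shape)
```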
(2) Attribute access.
Usage: df.column_name
It is best avoided, because a field name can easily be confused with a built-in attribute or method name.
print('Select the 采集时间 (collection time) column:\n', df.采集时间)
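The name clash is easy to reproduce. In this sketch the column names count and city are hypothetical, chosen because count is also a DataFrame method.

```python
import pandas as pd

# Hypothetical frame; 'count' deliberately collides with DataFrame.count().
df = pd.DataFrame({'count': [1, 2, 3], 'city': ['a', 'b', 'c']})

by_attr = df.count     # the bound method DataFrame.count, NOT the column
by_key = df['count']   # the column, as intended
city = df.city         # works only because 'city' clashes with nothing

print(callable(by_attr))
print(list(by_key))
```

Bracket access also works for column names that contain spaces or are not valid Python identifiers, which attribute access cannot handle.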
4. Row and column selection with loc and iloc
(1) loc usage
Syntax: df.loc[row label or condition, column label]
loc selects by index label (name). You must pass labels rather than positions, and the row indexer cannot be omitted; use ":" to select all rows.
The first form passes both a row indexer and a column label:
print('Select the whole 采集时间 column:\n', df.loc[:, '采集时间']) # loc usage
print('Select 采集时间 for the first 5 rows:\n', df.loc[:4, '采集时间']) # loc usage
Note: when the row indexer is a slice, both endpoints are included. The ":4" above covers row labels 0 through 4 inclusive, which is 5 rows under the default integer index.
print('Select 采集时间 for the 3rd row:\n', df.loc[2, '采集时间']) # loc usage
The second form passes only row labels.
Note: as above, a slice of row labels includes both endpoints.
print('Select the first row:\n', df.loc[0])
print('Select rows 1 and 4:\n', df.loc[[0, 3]])
print('Select the first 3 rows:\n', df.loc[0:2])
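The closed-interval behaviour is easier to see when the index labels are not integers. This is a minimal sketch with a made-up string index and placeholder values.

```python
import pandas as pd

# Hypothetical frame with string row labels; values are placeholders.
df = pd.DataFrame({'采集时间': ['d1', 'd2', 'd3', 'd4']}, index=['a', 'b', 'c', 'd'])

closed = df.loc['a':'c', '采集时间']  # label slice: 'c' is included -> 3 rows
single = df.loc['b']                 # one label -> that row as a Series
picked = df.loc[['a', 'd']]          # list of labels -> DataFrame with those rows

print(list(closed.index))
print(type(single).__name__)
print(list(picked.index))
```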
The third form passes a condition:
print('Select rows where 学历 (education) is 本科 (bachelor degree):\n', df.loc[df['学历'] == '本科', ['学历', '所在地']])
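Conditions can also be combined. The sketch below assumes made-up rows under the 学历 (education) and 所在地 (location) columns; note that each comparison needs its own parentheses and that & is used instead of the Python keyword and.

```python
import pandas as pd

# Hypothetical stand-in data; the rows are made up.
df = pd.DataFrame({
    '学历': ['本科', '硕士', '本科', '大专'],
    '所在地': ['北京', '上海', '上海', '北京'],
})

# Boolean mask: True where both conditions hold.
mask = (df['学历'] == '本科') & (df['所在地'] == '上海')
result = df.loc[mask, ['学历', '所在地']]

print(result)
```

Use | for "or" and ~ for "not" in the same way.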
(2) iloc usage
Syntax: df.iloc[row position, column position]
The difference between iloc and loc is that iloc selects data by position. It accepts integer positions (single integers, slices, and lists of integers) as well as boolean arrays, for example df.iloc[1], df.iloc[1, 2], df.iloc[:4, 3], df.iloc[1, [1, 2, 5]].
print('Select the first 4 rows of the 关键词 (keyword) field:\n', df.iloc[:4, 0]) # iloc usage
Note: ":4" here means row positions [0, 4), counting from 0, left-closed and right-open; "0" means that the 关键词 (keyword) field is in the first column position.
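With the default index, positions and labels coincide, which hides the difference between iloc and loc. The sketch below uses a made-up frame and filters it first so that the remaining labels are 0, 2, and 4; the helper column n is an assumption for illustration.

```python
import pandas as pd

# Hypothetical frame; values are placeholders.
df = pd.DataFrame({'关键词': ['a', 'b', 'c', 'd', 'e'], 'n': [1, 2, 3, 4, 5]})
subset = df[df['n'] % 2 == 1]          # keeps rows labelled 0, 2, 4

by_pos = subset.iloc[:3, 0]            # positions 0, 1, 2 -> all three rows
by_label = subset.loc[:3, '关键词']     # labels up to and including 3 -> labels 0 and 2

third_by_pos = subset.iloc[2, 0]       # third row by position -> label 4
row_labelled_2 = subset.loc[2, '关键词']  # row whose label is 2

print(list(by_pos), list(by_label))
print(third_by_pos, row_labelled_2)
```

After filtering or sorting, the two accessors return different rows for the same number, so it helps to choose one deliberately.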
Overall, loc is more flexible to use and the code is more readable.
5. ix data selection
The ix indexer accepted both index labels and index positions.
Syntax: df.ix[row label or position or condition, column label or position]
Note: when labels and positions could both match, ix tried the label first by default.
ix was deprecated in pandas 0.20.0 and removed in pandas 1.0.0; use loc and iloc instead.
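Code that used ix to mix positions and labels can usually be rewritten in one of two ways. This is a sketch under the assumption of a made-up frame with string row labels; the commented-out ix call shows the old style and no longer runs on pandas 1.0 or later.

```python
import pandas as pd

# Hypothetical frame; values are placeholders.
df = pd.DataFrame(
    {'关键词': ['a', 'b', 'c'], '采集时间': ['d1', 'd2', 'd3']},
    index=['x', 'y', 'z'],
)

# Old style (removed): df.ix[:2, '采集时间']  # rows by position, column by label

# Option 1: convert the row positions to labels, then use loc.
via_loc = df.loc[df.index[:2], '采集时间']

# Option 2: convert the column label to a position, then use iloc.
via_iloc = df.iloc[:2, df.columns.get_loc('采集时间')]

print(list(via_loc), via_loc.equals(via_iloc))
```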
The above are common uses of pandas data selection.