1 Introduction
Pandas is a widely used Python library for data analysis, and it offers several convenient ways to select data. This article introduces the main ones.
Pandas stores data mainly in two structures, DataFrame and Series. Selection works in basically the same way for both, so this article uses DataFrame for all examples.
Selecting data from a DataFrame falls roughly into three cases:
1) Row (column) selection (single-dimensional selection): df[]. One selection filters either rows or columns, never both, so the filter condition applies to a single dimension.
2) Area selection (multi-dimensional selection): df.loc[], df.iloc[]. These let you set filter conditions on rows and columns at the same time.
3) Cell selection (point selection): df.at[], df.iat[]. These locate exactly one cell.
The following data is used as the running example for all three cases.
import pandas as pd
import numpy as np
data = {'name': ['Joe', 'Mike', 'Jack', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jason', 'Even'],
'age': [25, 32, 18, np.nan, 15, 20, 41, np.nan, 37, 32],
'gender': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
df
name age gender isMarried
a Joe 25.0 1 yes
b Mike 32.0 0 yes
c Jack 18.0 1 no
d Rose NaN 1 yes
e David 15.0 0 no
f Marry 20.0 1 no
g Wansi 41.0 0 no
h Sidy NaN 0 yes
i Jason 37.0 1 no
j Even 32.0 0 no
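As a quick preview of the three cases listed above, the sketch below applies one accessor of each kind to the sample data. The variable names (rows, column, area, area_pos, cell, cell_pos) are introduced here for illustration only; each form is covered in detail in the sections that follow.

```python
import numpy as np
import pandas as pd

# Same sample data as above.
data = {'name': ['Joe', 'Mike', 'Jack', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jason', 'Even'],
        'age': [25, 32, 18, np.nan, 15, 20, 41, np.nan, 37, 32],
        'gender': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
        'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
df = pd.DataFrame(data, index=list('abcdefghij'))

rows = df[0:2]                             # 1) whole rows, by position slice
column = df['name']                        # 1) a whole column, by name
area = df.loc['a':'c', ['name', 'age']]    # 2) rows and columns, by label
area_pos = df.iloc[0:3, 0:2]               # 2) the same area, by position
cell = df.at['b', 'name']                  # 3) one cell, by label
cell_pos = df.iat[1, 0]                    # 3) the same cell, by position

print(area.equals(area_pos))   # True
print(cell, cell_pos)          # Mike Mike
```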
2 Row (column) selection: df[]
Row (column) selection picks data along a single dimension, either whole rows or whole columns. Every row of a DataFrame has an index. By default it is the integer sequence [0, 1, 2, …], but you can also supply your own, such as the labels above. To keep the two apart, this article calls the default, position-based index the integer index and a custom index the label index. Every column of a DataFrame has a column name, and columns are selected by that name.
1) Row selection
There are three ways to select rows: integer index slices, label index slices, and Boolean arrays.
a) Integer index slicing: the start is included, the end is excluded
Select the first row:
df[0:1]
name age gender isMarried
a Joe 25.0 1 yes
Select the first two rows:
df[0:2]
name age gender isMarried
a Joe 25.0 1 yes
b Mike 32.0 0 yes
b) Label index slicing: both the start and the end are included
Select the first row:
df[:'a']
name age gender isMarried
a Joe 25.0 1 yes
Select the first two rows:
df['a':'b']
name age gender isMarried
a Joe 25.0 1 yes
b Mike 32.0 0 yes
Note: integer index slices exclude the end position, while label index slices include the end label. This difference is easy to get wrong.
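The end-point rule in the note can be checked directly. This is a minimal sketch on the sample data; by_position and by_label are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

data = {'name': ['Joe', 'Mike', 'Jack', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jason', 'Even'],
        'age': [25, 32, 18, np.nan, 15, 20, 41, np.nan, 37, 32],
        'gender': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
        'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
df = pd.DataFrame(data, index=list('abcdefghij'))

by_position = df[0:2]     # positions 0 and 1; position 2 is excluded
by_label = df['a':'c']    # labels 'a', 'b' and 'c'; the end label is included

print(list(by_position.index))   # ['a', 'b']
print(list(by_label.index))      # ['a', 'b', 'c']
```

Row 'c' sits at position 2, so the two slices look alike but return two and three rows respectively.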
c) Boolean array
Select the first three rows:
df[[True,True,True,False,False,False,False,False,False,False]]
name age gender isMarried
a Joe 25.0 1 yes
b Mike 32.0 0 yes
c Jack 18.0 1 no
Select all rows with age greater than 30
df[[each>30 for each in df['age']]]
name age gender isMarried
b Mike 32.0 0 yes
g Wansi 41.0 0 no
i Jason 37.0 1 no
j Even 32.0 0 no
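A comparison on a column produces such a Boolean array directly, which gives the following, more common forms. Rows whose age is NaN are left out, because any comparison with NaN is False.
Select all rows with age greater than 30: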
df[df['age']>30]
name age gender isMarried
b Mike 32.0 0 yes
g Wansi 41.0 0 no
i Jason 37.0 1 no
j Even 32.0 0 no
Select all rows whose age is greater than 30 and isMarried is no
df[(df['age']>30) & (df['isMarried']=='no')]
name age gender isMarried
g Wansi 41.0 0 no
i Jason 37.0 1 no
j Even 32.0 0 no
Select all rows with age 20 or 32
df[(df['age']==20) | (df['age']==32)]
name age gender isMarried
b Mike 32.0 0 yes
f Marry 20.0 1 no
j Even 32.0 0 no
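Note: when combining several Boolean conditions as above, enclose each condition in parentheses. The operators & and | bind more tightly than comparisons such as > and ==, so without parentheses the expression is parsed differently from what was intended and usually raises an error.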
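To illustrate the note, the sketch below combines two numeric conditions with and without parentheses, and then shows isin() as an alternative to a chain of == conditions joined by |. The names both, raised and either are introduced here for illustration, and the gender condition is chosen only because it makes the unparenthesized form fail in a predictable way.

```python
import numpy as np
import pandas as pd

data = {'name': ['Joe', 'Mike', 'Jack', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jason', 'Even'],
        'age': [25, 32, 18, np.nan, 15, 20, 41, np.nan, 37, 32],
        'gender': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
        'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
df = pd.DataFrame(data, index=list('abcdefghij'))

# With parentheses: each comparison is evaluated first, then combined.
both = df[(df['age'] > 30) & (df['gender'] == 1)]
print(list(both.index))    # ['i']

# Without parentheses Python reads this as
#   df['age'] > (30 & df['gender']) == 1
# which is a chained comparison and needs the truth value of a whole Series.
try:
    df[df['age'] > 30 & df['gender'] == 1]
    raised = False
except ValueError:
    raised = True
print(raised)              # True

# isin() replaces several == conditions joined by |.
either = df[df['age'].isin([20, 32])]
print(list(either.index))  # ['b', 'f', 'j']
```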
2) Column selection
There are also three ways to select columns: a label index, a label list, and a callable object.
a) Label index: select a single column
Select all data in the name column:
df['name']
a Joe
b Mike
c Jack
d Rose
e David
f Marry
g Wansi
h Sidy
i Jason
j Even
Name: name, dtype: object
b) Label list: select multiple columns
Select the two columns of data name and age
df[['name','age']]
name age
a Joe 25.0
b Mike 32.0
c Jack 18.0
d Rose NaN
e David 15.0
f Marry 20.0
g Wansi 41.0
h Sidy NaN
i Jason 37.0
j Even 32.0
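c) Callable object
The callable receives the DataFrame and returns any key that df[] accepts, here the name of the first column.
Select the first column: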
df[lambda df: df.columns[0]]
a Joe
b Mike
c Jack
d Rose
e David
f Marry
g Wansi
h Sidy
i Jason
j Even
Name: name, dtype: object
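A callable may also return a Boolean mask, in which case it selects rows. This is mostly useful in a method chain, where the intermediate DataFrame has no variable name to refer to. A minimal sketch, with the lambda parameter d and the names chained and direct introduced here for illustration:

```python
import numpy as np
import pandas as pd

data = {'name': ['Joe', 'Mike', 'Jack', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jason', 'Even'],
        'age': [25, 32, 18, np.nan, 15, 20, 41, np.nan, 37, 32],
        'gender': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
        'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
df = pd.DataFrame(data, index=list('abcdefghij'))

# The lambda is called with the result of dropna(), not with df itself.
chained = df.dropna(subset=['age'])[lambda d: d['age'] > 30]

# Equivalent selection written against the named DataFrame.
direct = df[df['age'] > 30]

print(list(chained.index))   # ['b', 'g', 'i', 'j']
```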
3 Area selection
Area selection filters data along both dimensions (rows and columns) and is done with df.loc[] and df.iloc[]. Older tutorials also list df.ix[], but it was deprecated in pandas 0.20 and removed in pandas 1.0, so it is not covered here. Inside the square brackets, the first parameter is the filter condition for rows and the second is the filter condition for columns, separated by a comma. The column filter may be omitted, in which case all columns are returned; the examples below always write both for clarity. The differences between the methods are as follows:
df.loc[] uses label indexes (or Boolean arrays), not integer positions. Label slices include both the start and the end.
df.iloc[] uses integer positions (or Boolean arrays), not labels. Integer slices include the start and exclude the end.
df.ix[] accepted both label indexes and integer positions, which made it ambiguous; use df.loc[] or df.iloc[] instead.
Both remaining methods are demonstrated by examples below.
3.1 df.loc[]
1) Row selection
Select the row with index 'a':
df.loc['a', :]
name Joe
age 25.0
gender 1
isMarried yes
Name: a, dtype: object
Select row with index 'a' or 'b' or 'c'
df.loc[['a','b','c'], :]
name age gender isMarried
a Joe 25.0 1 yes
b Mike 32.0 0 yes
c Jack 18.0 1 no
Select all rows from 'a' to 'd' (including row 'd')
df.loc['a':'d', :]
name age gender isMarried
a Joe 25.0 1 yes
b Mike 32.0 0 yes
c Jack 18.0 1 no
d Rose NaN 1 yes
Select all rows with age greater than 30
df.loc[df['age']>30,:]
name age gender isMarried
b Mike 32.0 0 yes
g Wansi 41.0 0 no
i Jason 37.0 1 no
j Even 32.0 0 no
The Boolean condition can also be built with df.loc[] or df.iloc[] itself:
Method one:
df.loc[df.loc[:,'age']>30, :]
name age gender isMarried
b Mike 32.0 0 yes
g Wansi 41.0 0 no
i Jason 37.0 1 no
j Even 32.0 0 no
Method two:
df.loc[df.iloc[:,1]>30, :]
name age gender isMarried
b Mike 32.0 0 yes
g Wansi 41.0 0 no
i Jason 37.0 1 no
j Even 32.0 0 no
Use callable object to select all rows with age greater than 30
df.loc[lambda df:df['age'] > 30, :]
name age gender isMarried
b Mike 32.0 0 yes
g Wansi 41.0 0 no
i Jason 37.0 1 no
j Even 32.0 0 no
2) Column selection
Output everyone's name (select the name column):
df.loc[:, 'name']
a Joe
b Mike
c Jack
d Rose
e David
f Marry
g Wansi
h Sidy
i Jason
j Even
Name: name, dtype: object
Output everyone's name and age (select the columns from name to age; a label slice includes both ends):
df.loc[:, 'name':'age']
name age
a Joe 25.0
b Mike 32.0
c Jack 18.0
d Rose NaN
e David 15.0
f Marry 20.0
g Wansi 41.0
h Sidy NaN
i Jason 37.0
j Even 32.0
Output everyone's name, age, and marital status (select the name, age, and isMarried columns):
df.loc[:, ['name','age','isMarried']]
name age isMarried
a Joe 25.0 yes
b Mike 32.0 yes
c Jack 18.0 no
d Rose NaN yes
e David 15.0 no
f Marry 20.0 no
g Wansi 41.0 no
h Sidy NaN yes
i Jason 37.0 no
j Even 32.0 no
Select the first 3 columns with a Boolean array:
df.loc[:, [True,True,True,False]]
name age gender
a Joe 25.0 1
b Mike 32.0 0
c Jack 18.0 1
d Rose NaN 1
e David 15.0 0
f Marry 20.0 1
g Wansi 41.0 0
h Sidy NaN 0
i Jason 37.0 1
j Even 32.0 0
3) Filter rows and columns at the same time
Output the names and ages of people whose age is greater than 30
df.loc[df['age']>30,['name','age']]
name age
b Mike 32.0
g Wansi 41.0
i Jason 37.0
j Even 32.0
Output the name and age of the rows whose name is 'Mike' or 'Marry':
df.loc[(df['name']=='Mike') |(df['name']=='Marry'),['name','age']]
name age
b Mike 32.0
f Marry 20.0
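When the row filter tests one column against several values, isin() is a shorter way to write it than a chain of == conditions. The sketch below repeats the last selection that way and adds isna() to pick the rows with a missing age. The names wanted, picked, chained and missing_age are introduced here for illustration.

```python
import numpy as np
import pandas as pd

data = {'name': ['Joe', 'Mike', 'Jack', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jason', 'Even'],
        'age': [25, 32, 18, np.nan, 15, 20, 41, np.nan, 37, 32],
        'gender': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
        'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
df = pd.DataFrame(data, index=list('abcdefghij'))

wanted = ['Mike', 'Marry']

# Row filter: name is one of the wanted values. Column filter: name and age.
picked = df.loc[df['name'].isin(wanted), ['name', 'age']]

# The same selection written with explicit == conditions.
chained = df.loc[(df['name'] == 'Mike') | (df['name'] == 'Marry'), ['name', 'age']]

# Rows whose age is missing; a comparison such as age > 30 never matches these.
missing_age = df.loc[df['age'].isna(), ['name', 'age']]

print(picked.equals(chained))     # True
print(list(missing_age.index))    # ['d', 'h']
```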
3.2 df.iloc[]
1) Row selection
Select row 2
df.iloc[1, :]
name Mike
age 32.0
gender 0
isMarried yes
Name: b, dtype: object
Select the first 3 rows
df.iloc[:3, :]
name age gender isMarried
a Joe 25.0 1 yes
b Mike 32.0 0 yes
c Jack 18.0 1 no
Select row 2, row 4, row 6
df.iloc[[1,3,5],:]
name age gender isMarried
b Mike 32.0 0 yes
d Rose NaN 1 yes
f Marry 20.0 1 no
2) Column selection
Select column 2
df.iloc[:, 1]
a 25.0
b 32.0
c 18.0
d NaN
e 15.0
f 20.0
g 41.0
h NaN
i 37.0
j 32.0
Name: age, dtype: float64
Select first 3 columns
df.iloc[:, 0:3]
name age gender
a Joe 25.0 1
b Mike 32.0 0
c Jack 18.0 1
d Rose NaN 1
e David 15.0 0
f Marry 20.0 1
g Wansi 41.0 0
h Sidy NaN 0
i Jason 37.0 1
j Even 32.0 0
Select columns 1, 3 and 4
df.iloc[:, [0,2,3]]
name gender isMarried
a Joe 1 yes
b Mike 0 yes
c Jack 1 no
d Rose 1 yes
e David 0 no
f Marry 1 no
g Wansi 0 no
h Sidy 0 yes
i Jason 1 no
j Even 0 no
Select first 3 columns by boolean array
df.iloc[:,[True,True,True,False]]
name age gender
a Joe 25.0 1
b Mike 32.0 0
c Jack 18.0 1
d Rose NaN 1
e David 15.0 0
f Marry 20.0 1
g Wansi 41.0 0
h Sidy NaN 0
i Jason 37.0 1
j Even 32.0 0
3) Select rows and columns at the same time
Select column 1, column 3, column 4 of row 2
df.iloc[1, [0,2,3]]
name Mike
gender 0
isMarried yes
Name: b, dtype: object
Select the first 3 columns of the first 3 rows
df.iloc[:3, :3]
name age gender
a Joe 25.0 1
b Mike 32.0 0
c Jack 18.0 1
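Sometimes one dimension is known by position and the other by label, which is the case df.ix[] used to cover. A possible approach, sketched below, is to translate one side so that a single accessor can be used: df.index[:3] turns positions into labels for df.loc[], and df.columns.get_loc() turns a column name into a position for df.iloc[]. The names via_loc and via_iloc are introduced here for illustration.

```python
import numpy as np
import pandas as pd

data = {'name': ['Joe', 'Mike', 'Jack', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jason', 'Even'],
        'age': [25, 32, 18, np.nan, 15, 20, 41, np.nan, 37, 32],
        'gender': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
        'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
df = pd.DataFrame(data, index=list('abcdefghij'))

# First 3 rows (by position) of the name column (by label).
via_loc = df.loc[df.index[:3], 'name']                     # positions -> labels
via_iloc = df.iloc[:3, df.columns.get_loc('name')]         # label -> position

print(list(via_loc))            # ['Joe', 'Mike', 'Jack']
print(via_loc.equals(via_iloc)) # True
```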
4 Cell selection
Cell selection uses df.at[] and df.iat[]. Both require exactly two parameters, the row index and the column index. df.at[] accepts only label indexes, and df.iat[] accepts only integer indexes. Because both select a single cell (one row and one column), they return a scalar value such as a str or a number rather than a Series or DataFrame.
4.1 df.at[]
Select the name column of row b
df.at['b','name']
Mike
4.2 df.iat[]
Select row 2, column 1
df.iat[1,0]
Mike
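The return types can be inspected directly. The sketch below compares a cell selection with a row selection and with a one-by-one area selected through lists. The names cell, age, row and block are introduced here for illustration.

```python
import numpy as np
import pandas as pd

data = {'name': ['Joe', 'Mike', 'Jack', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jason', 'Even'],
        'age': [25, 32, 18, np.nan, 15, 20, 41, np.nan, 37, 32],
        'gender': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
        'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
df = pd.DataFrame(data, index=list('abcdefghij'))

cell = df.at['b', 'name']          # scalar: a Python str
age = df.iat[1, 1]                 # scalar: a NumPy float
row = df.loc['b']                  # one row picked by a scalar label -> Series
block = df.loc[['b'], ['name']]    # lists keep both dimensions -> 1x1 DataFrame

print(type(cell).__name__)    # str
print(type(row).__name__)     # Series
print(type(block).__name__)   # DataFrame
```

Note that the one-by-one block is still a DataFrame: it is the use of a scalar key, not the size of the result, that removes a dimension.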
5 Expansion and summary
1) To select one or more whole rows, or one or more whole columns, df[], df.loc[], and df.iloc[] all work; df[] is the shortest to write.
2) For area selection, use df.loc[] when selecting by label and df.iloc[] when selecting by integer position. df.loc[] can be used not only to read data but also to assign new values to the selected area, so it is strongly recommended.
3) To select a single cell, df.at[], df.iat[], df.loc[], and df.iloc[] all work, but each expects its own kind of parameters: labels for df.at[] and df.loc[], integer positions for df.iat[] and df.iloc[].
4) The type of the return value depends on what is selected:
If a single row or a single column is selected with a scalar label or position, the return value is a Series object;
If rows and columns are both selected with slices, lists, or Boolean arrays, the return value is a DataFrame object, even when it holds only one row or one column;
If exactly one cell is selected (a single row and a single column), the return value is a scalar, such as a str or an int.
5) df[] can only select whole rows or whole columns and cannot address a single cell, so its return value is always a DataFrame or Series object.
6) With the DataFrame's default index (0, 1, 2, …), the labels coincide with the integer positions, so df.loc[] and df.iloc[] accept the same row numbers. Slices still follow each method's own end-point rule. For example, build a DataFrame from the data above without custom labels:
df2 = pd.DataFrame(data)
df2.loc[1,'name']
Mike
df2.iloc[1,0]
Mike
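Two points from the summary are easy to check in code. First, with the default index the labels only coincide with positions until the rows are reordered, and slices still differ at the end point. Second, df.loc[] can assign to the selected cells. The sketch below is one way to see both; the names by_label, by_position, shuffled and filled are introduced here for illustration, and filling missing ages with the mean is only an example of an assignment, not a recommendation from the article.

```python
import numpy as np
import pandas as pd

data = {'name': ['Joe', 'Mike', 'Jack', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jason', 'Even'],
        'age': [25, 32, 18, np.nan, 15, 20, 41, np.nan, 37, 32],
        'gender': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
        'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
df2 = pd.DataFrame(data)   # default index 0..9

# Same numbers, different end-point rule.
by_label = df2.loc[0:2, 'name']     # labels 0, 1, 2 -> 3 rows
by_position = df2.iloc[0:2, 0]      # positions 0, 1 -> 2 rows
print(len(by_label), len(by_position))   # 3 2

# After sorting, label 1 and position 1 are different rows.
shuffled = df2.sort_values('age')
print(shuffled.loc[1, 'name'])    # Mike (the row labeled 1)
print(shuffled.iloc[1, 0])        # Jack (the second-youngest row)

# Writing through df.loc[]: replace missing ages in a copy.
filled = df2.copy()
filled.loc[filled['age'].isna(), 'age'] = filled['age'].mean()
print(filled.loc[3, 'age'])       # 27.5
```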