The difference and usage of df[], df.loc[], df.iloc[], df.at[], df.iat[] in Pandas data selection

1. Introduction

  Pandas is a well-known toolkit for Python data analysis. It provides a variety of data selection methods, which are convenient and practical. This article mainly introduces several methods of data selection in Pandas.

  In Pandas, data is mainly stored in two data structures, DataFrame and Series. Data selection works in basically the same way for both, so this article uses DataFrame for its examples.

  Selecting data in a DataFrame roughly falls into 3 situations:

  1) Row (column) selection (single-dimensional selection): df[]. A single df[] expression filters on one dimension only: it selects either rows or columns, never both at once.

  2) Area selection (multi-dimensional selection): df.loc[], df.iloc[]. This way you can set filter conditions for multiple dimensions at the same time.

  3) Cell selection (point selection): df.at[], df.iat[]. These locate exactly one cell.

  The following data is used to illustrate all three situations.

import pandas as pd
import numpy as np

data = {
    'name': ['Joe', 'Mike', 'Jack', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jason', 'Even'],
    'age': [25, 32, 18, np.nan, 15, 20, 41, np.nan, 37, 32],
    'gender': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
    'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df = pd.DataFrame(data, index=labels)

df
    name   age  gender isMarried
a    Joe  25.0       1       yes
b   Mike  32.0       0       yes
c   Jack  18.0       1        no
d   Rose   NaN       1       yes
e  David  15.0       0        no
f  Marry  20.0       1        no
g  Wansi  41.0       0        no
h   Sidy   NaN       0       yes
i  Jason  37.0       1        no
j   Even  32.0       0        no
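
  Before selecting anything, it can help to check what the frame looks like. The following is a minimal sketch that rebuilds the example data and inspects its shape, index, and columns. It also shows why age is printed as 25.0 rather than 25: the two missing values (np.nan) make the whole column floating point.

```python
import numpy as np
import pandas as pd

# Rebuild the article's example frame so the snippet runs on its own.
data = {
    'name': ['Joe', 'Mike', 'Jack', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jason', 'Even'],
    'age': [25, 32, 18, np.nan, 15, 20, 41, np.nan, 37, 32],
    'gender': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
    'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
}
df = pd.DataFrame(data, index=list('abcdefghij'))

print(df.shape)                  # (10, 4)
print(list(df.index))            # the ten labels 'a' through 'j'
print(list(df.columns))          # ['name', 'age', 'gender', 'isMarried']
print(df['age'].dtype)           # float64, because NaN forces a float column
print(df['age'].isna().sum())    # 2 missing ages (rows 'd' and 'h')
```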

2. Row (column) selection: df[]

  Row (column) selection picks data along a single dimension, either by rows or by columns. Every row of a DataFrame has an index. By default this is the integer sequence [0, 1, 2, …], but you can also supply your own index, such as labels above. To tell the two apart, this article calls the default index an integer index and a custom index a label index. Every column of a DataFrame has a column name, and columns are selected by that name.

  1) Row selection

There are three ways to select rows: integer-indexed slices, label-indexed slices, and Boolean arrays.

  a) Integer index slicing: the start is included and the end is excluded

  • Select the first row:

df[0:1]
  name   age  gender isMarried
a  Joe  25.0       1       yes
  • Select the first two rows:

df[0:2]
   name   age  gender isMarried
a   Joe  25.0       1       yes
b  Mike  32.0       0       yes

  b) Label index slicing: both the start and the end are included

  • Select the first row:

df[:'a']
  name   age  gender isMarried
a  Joe  25.0       1       yes
  • Select the first two rows:

df['a':'b']
   name   age  gender isMarried
a   Joe  25.0       1       yes
b  Mike  32.0       0       yes

  Note: Integer index slices include the start and exclude the end (half-open), while label index slices include both the start and the end (closed). This difference is easy to overlook and especially important.
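
  The difference between the two slice styles can be checked directly. This is a small sketch on the same example data; the names by_position and by_label are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the article's example frame so the snippet runs on its own.
data = {
    'name': ['Joe', 'Mike', 'Jack', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jason', 'Even'],
    'age': [25, 32, 18, np.nan, 15, 20, 41, np.nan, 37, 32],
    'gender': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
    'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
}
df = pd.DataFrame(data, index=list('abcdefghij'))

by_position = df[0:2]     # positions 0 and 1; position 2 is excluded
by_label = df['a':'b']    # labels 'a' through 'b'; 'b' is included

print(list(by_position.index))   # ['a', 'b']
print(list(by_label.index))      # ['a', 'b']

# The end value means different things in the two styles:
print(len(df[0:3]))       # 3 rows (positions 0, 1, 2)
print(len(df['a':'d']))   # 4 rows ('a', 'b', 'c', 'd')
```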

  c) Boolean array

  • Select the first three rows

df[[True,True,True,False,False,False,False,False,False,False]]
   name   age  gender isMarried
a   Joe  25.0       1       yes
b  Mike  32.0       0       yes
c  Jack  18.0       1        no
  • Select all rows with age greater than 30

df[[each>30 for each in df['age']]]
    name   age  gender isMarried
b   Mike  32.0       0       yes
g  Wansi  41.0       0        no
i  Jason  37.0       1        no
j   Even  32.0       0        no

  Building on the Boolean array approach, conditions can be written directly on columns:

  • Select all rows with age greater than 30

df[df['age']>30]
    name   age  gender isMarried
b   Mike  32.0       0       yes
g  Wansi  41.0       0        no
i  Jason  37.0       1        no
j   Even  32.0       0        no
  • Select all rows whose age is greater than 30 and isMarried is no

df[(df['age']>30) & (df['isMarried']=='no')]
    name   age  gender isMarried
g  Wansi  41.0       0        no
i  Jason  37.0       1        no
j   Even  32.0       0        no
  • Select all rows with age 20 or 32

df[(df['age']==20) | (df['age']==32)]
    name   age  gender isMarried
b   Mike  32.0       0       yes
f  Marry  20.0       1        no
j   Even  32.0       0        no
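
  The conditions above all reduce to a Boolean mask with one entry per row. The sketch below looks at that mask on its own, shows that a missing age compares as False (so rows 'd' and 'h' never match), and uses Series.isin() as an alternative way to express the "20 or 32" filter. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the article's example frame so the snippet runs on its own.
data = {
    'name': ['Joe', 'Mike', 'Jack', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jason', 'Even'],
    'age': [25, 32, 18, np.nan, 15, 20, 41, np.nan, 37, 32],
    'gender': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
    'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
}
df = pd.DataFrame(data, index=list('abcdefghij'))

mask = df['age'] > 30            # Boolean Series aligned with df's index
print(mask.dtype)                # bool
print(mask['d'], mask['h'])      # False False: NaN > 30 evaluates to False

# The list comprehension from the text produces the same values.
same_as_list = [each > 30 for each in df['age']]
print(list(mask) == same_as_list)    # True

# isin() is another way to write (age == 20) | (age == 32).
either = df[df['age'].isin([20, 32])]
print(list(either.index))        # ['b', 'f', 'j']

# Rows with a missing age need an explicit test.
missing = df[df['age'].isna()]
print(list(missing.index))       # ['d', 'h']
```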

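  Note: When combining multiple Boolean conditions as above, each condition must be enclosed in parentheses. The operators & and | bind more tightly than comparison operators such as > and ==, so without parentheses the expression is parsed differently and usually raises an error.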
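
  A short sketch of what goes wrong without the parentheses. The exact exception type is not asserted here because it may depend on the pandas version; the point is only that the unparenthesized form does not evaluate the intended filter.

```python
import numpy as np
import pandas as pd

# Rebuild the article's example frame so the snippet runs on its own.
data = {
    'name': ['Joe', 'Mike', 'Jack', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jason', 'Even'],
    'age': [25, 32, 18, np.nan, 15, 20, 41, np.nan, 37, 32],
    'gender': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
    'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
}
df = pd.DataFrame(data, index=list('abcdefghij'))

# With parentheses: each comparison is evaluated first, then combined.
ok = df[(df['age'] > 30) & (df['isMarried'] == 'no')]
print(list(ok.index))    # ['g', 'i', 'j']

# Without parentheses, & binds tighter than > and ==, so Python reads
#   df['age'] > (30 & df['isMarried']) == 'no'
# which is not the intended filter and fails on this data.
try:
    df[df['age'] > 30 & df['isMarried'] == 'no']
    raised = False
except Exception as exc:     # exact type may vary between pandas versions
    raised = True
    print('failed with', type(exc).__name__)
```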

  2) Column selection

  There are also three ways to select columns: a single label, a list of labels, and a callable object.

  a) Label index: select a single column

  • Select all data in the name column

df['name']
a      Joe
b     Mike
c     Jack
d     Rose
e    David
f    Marry
g    Wansi
h     Sidy
i    Jason
j     Even
Name: name, dtype: object

  b) Label list: select multiple columns

  • Select the two columns of data name and age

df[['name','age']]
    name   age
a    Joe  25.0
b   Mike  32.0
c   Jack  18.0
d   Rose   NaN
e  David  15.0
f  Marry  20.0
g  Wansi  41.0
h   Sidy   NaN
i  Jason  37.0
j   Even  32.0

  c) Callable object

  • Select the first column

df[lambda df: df.columns[0]]
a      Joe
b     Mike
c     Jack
d     Rose
e    David
f    Marry
g    Wansi
h     Sidy
i    Jason
j     Even
Name: name, dtype: object
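
  The three column selectors differ in what they return. As a rough sketch on the same data: a single label gives a Series, a list of labels gives a DataFrame (even when the list has one element), and a callable is called with the frame and may return any key that df[] accepts. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the article's example frame so the snippet runs on its own.
data = {
    'name': ['Joe', 'Mike', 'Jack', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jason', 'Even'],
    'age': [25, 32, 18, np.nan, 15, 20, 41, np.nan, 37, 32],
    'gender': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
    'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
}
df = pd.DataFrame(data, index=list('abcdefghij'))

single = df['name']               # one label -> Series
multi = df[['name', 'age']]       # label list -> DataFrame
one_col_frame = df[['name']]      # one-element list -> one-column DataFrame

print(type(single).__name__)          # Series
print(type(multi).__name__)           # DataFrame
print(type(one_col_frame).__name__)   # DataFrame
print(one_col_frame.shape)            # (10, 1)

# The callable receives the frame; here it returns a list of two labels.
first_two = df[lambda d: list(d.columns[:2])]
print(list(first_two.columns))        # ['name', 'age']
```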

3. Area selection: df.loc[], df.iloc[]

  Area selection filters data along both dimensions (rows and columns) at once, using df.loc[] or df.iloc[]. Older tutorials also list df.ix[], which accepted both label and integer indexes, but it was deprecated in pandas 0.20 and removed in pandas 1.0, so it is not covered here. Inside the square brackets, the first parameter is the filter condition for rows and the second is the filter condition for columns, separated by a comma. The column parameter may be omitted, in which case all columns are returned. The differences between df.loc[] and df.iloc[] are as follows:

  df.loc[] selects by label index, not by integer position (it also accepts Boolean arrays and callables). A label slice includes both its start and its end.

  df.iloc[] selects by integer position, not by label. An integer slice includes its start and excludes its end.

  Both methods are demonstrated by examples below.
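
  The contrast between the two accessors can be summarized in a few lines. This is a sketch on the example data, not an exhaustive list of accepted keys; the flags loc_rejected and iloc_rejected are illustrative names.

```python
import numpy as np
import pandas as pd

# Rebuild the article's example frame so the snippet runs on its own.
data = {
    'name': ['Joe', 'Mike', 'Jack', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jason', 'Even'],
    'age': [25, 32, 18, np.nan, 15, 20, 41, np.nan, 37, 32],
    'gender': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
    'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
}
df = pd.DataFrame(data, index=list('abcdefghij'))

# Label slice: both endpoints are included.
print(list(df.loc['a':'c', 'name']))    # ['Joe', 'Mike', 'Jack']
# Position slice: the stop position is excluded.
print(list(df.iloc[0:2, 0]))            # ['Joe', 'Mike']

# The column selector is optional; a single key selects rows.
print(df.loc['a'].equals(df.loc['a', :]))    # True
print(df.iloc[0].equals(df.iloc[0, :]))      # True

# On this frame each accessor rejects the other's kind of key.
loc_rejected = iloc_rejected = False
try:
    df.loc[0]          # 0 is not one of the labels 'a'..'j'
except KeyError:
    loc_rejected = True
try:
    df.iloc['a']       # 'a' is not an integer position
except TypeError:
    iloc_rejected = True
print(loc_rejected, iloc_rejected)      # True True
```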

3.1 df.loc[]

  1) Row selection

  • Select the row with index 'a':

df.loc['a', :]
name          Joe
age          25.0
gender          1
isMarried     yes
Name: a, dtype: object
  • Select row with index 'a' or 'b' or 'c'

df.loc[['a','b','c'], :]
   name   age  gender isMarried
a   Joe  25.0       1       yes
b  Mike  32.0       0       yes
c  Jack  18.0       1        no
  • Select all rows from 'a' to 'd' (including row 'd')

df.loc['a':'d', :]
   name   age  gender isMarried
a   Joe  25.0       1       yes
b  Mike  32.0       0       yes
c  Jack  18.0       1        no
d  Rose   NaN       1       yes
  • Select all rows with age greater than 30

df.loc[df['age']>30,:]
    name   age  gender isMarried
b   Mike  32.0       0       yes
g  Wansi  41.0       0        no
i  Jason  37.0       1        no
j   Even  32.0       0        no

You can also use the following two methods:

Method one:

df.loc[df.loc[:,'age']>30, :]
    name   age  gender isMarried
b   Mike  32.0       0       yes
g  Wansi  41.0       0        no
i  Jason  37.0       1        no
j   Even  32.0       0        no

Method two:

df.loc[df.iloc[:,1]>30, :]
    name   age  gender isMarried
b   Mike  32.0       0       yes
g  Wansi  41.0       0        no
i  Jason  37.0       1        no
j   Even  32.0       0        no
  • Use callable object to select all rows with age greater than 30

df.loc[lambda df:df['age'] > 30, :]
    name   age  gender isMarried
b   Mike  32.0       0       yes
g  Wansi  41.0       0        no
i  Jason  37.0       1        no
j   Even  32.0       0        no
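
  The four spellings of the "age greater than 30" filter shown above select the same rows, which can be confirmed with DataFrame.equals(). The sketch also hints at where the callable form is handy: in a method chain, where the intermediate frame has no name. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the article's example frame so the snippet runs on its own.
data = {
    'name': ['Joe', 'Mike', 'Jack', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jason', 'Even'],
    'age': [25, 32, 18, np.nan, 15, 20, 41, np.nan, 37, 32],
    'gender': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
    'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
}
df = pd.DataFrame(data, index=list('abcdefghij'))

a = df.loc[df['age'] > 30, :]                     # column via df[]
b = df.loc[df.loc[:, 'age'] > 30, :]              # column via df.loc[]
c = df.loc[df.iloc[:, 1] > 30, :]                 # column via df.iloc[]
d = df.loc[lambda frame: frame['age'] > 30, :]    # callable

print(a.equals(b) and a.equals(c) and a.equals(d))   # True
print(list(a.index))                                  # ['b', 'g', 'i', 'j']

# In a chain, the callable refers to the frame produced by the previous step.
chained = df.dropna(subset=['age']).loc[lambda frame: frame['age'] > 30, 'name']
print(list(chained))     # ['Mike', 'Wansi', 'Jason', 'Even']
```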

  2) Column selection

  • Output everyone's name (select the name column)

df.loc[:, 'name']
a      Joe
b     Mike
c     Jack
d     Rose
e    David
f    Marry
g    Wansi
h     Sidy
i    Jason
j     Even
Name: name, dtype: object
  • Output the name and age of everyone (select the name and age columns)

df.loc[:, 'name':'age']
    name   age
a    Joe  25.0
b   Mike  32.0
c   Jack  18.0
d   Rose   NaN
e  David  15.0
f  Marry  20.0
g  Wansi  41.0
h   Sidy   NaN
i  Jason  37.0
j   Even  32.0
  • Output everyone's name, age, and marital status (select the name, age, and isMarried columns)

df.loc[:, ['name','age','isMarried']]
    name   age isMarried
a    Joe  25.0       yes
b   Mike  32.0       yes
c   Jack  18.0        no
d   Rose   NaN       yes
e  David  15.0        no
f  Marry  20.0        no
g  Wansi  41.0        no
h   Sidy   NaN       yes
i  Jason  37.0        no
j   Even  32.0        no
  • Select the first 3 columns with a Boolean array

df.loc[:, [True,True,True,False]]
    name   age  gender
a    Joe  25.0       1
b   Mike  32.0       0
c   Jack  18.0       1
d   Rose   NaN       1
e  David  15.0       0
f  Marry  20.0       1
g  Wansi  41.0       0
h   Sidy   NaN       0
i  Jason  37.0       1
j   Even  32.0       0

  3) Filter rows and columns at the same time

  • Output the names and ages of people whose age is greater than 30

df.loc[df['age']>30,['name','age']]
    name   age
b   Mike  32.0
g  Wansi  41.0
i  Jason  37.0
j   Even  32.0
  • Output the name and age of the rows whose name is 'Mike' or 'Marry'

df.loc[(df['name']=='Mike') | (df['name']=='Marry'), ['name','age']]
    name   age
b   Mike  32.0
f  Marry  20.0
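
  Row and column selectors of different kinds can be combined freely in df.loc[]. As a small sketch, the first call below uses label slices on both axes (both ends included on each), and the second pairs a Boolean row mask with a column slice. The names block and picked are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the article's example frame so the snippet runs on its own.
data = {
    'name': ['Joe', 'Mike', 'Jack', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jason', 'Even'],
    'age': [25, 32, 18, np.nan, 15, 20, 41, np.nan, 37, 32],
    'gender': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
    'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
}
df = pd.DataFrame(data, index=list('abcdefghij'))

# Label slices on both axes: rows 'b'..'d' and columns 'name'..'gender'.
block = df.loc['b':'d', 'name':'gender']
print(block.shape)            # (3, 3)
print(list(block.index))      # ['b', 'c', 'd']
print(list(block.columns))    # ['name', 'age', 'gender']

# Boolean mask for rows, label slice for columns.
picked = df.loc[df['gender'] == 1, 'name':'age']
print(list(picked.index))     # ['a', 'c', 'd', 'f', 'i']
print(list(picked.columns))   # ['name', 'age']
```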

3.2 df.iloc[]

  1) Row selection

  • Select row 2

df.iloc[1, :]
name         Mike
age          32.0
gender          0
isMarried     yes
Name: b, dtype: object
  • Select the first 3 rows

df.iloc[:3, :]
   name   age  gender isMarried
a   Joe  25.0       1       yes
b  Mike  32.0       0       yes
c  Jack  18.0       1        no
  • Select row 2, row 4, row 6

df.iloc[[1,3,5],:]
    name   age  gender isMarried
b   Mike  32.0       0       yes
d   Rose   NaN       1       yes
f  Marry  20.0       1        no

  2) Column selection

  • Select column 2

df.iloc[:, 1]
a    25.0
b    32.0
c    18.0
d     NaN
e    15.0
f    20.0
g    41.0
h     NaN
i    37.0
j    32.0
Name: age, dtype: float64
  • Select first 3 columns

df.iloc[:, 0:3]
    name   age  gender
a    Joe  25.0       1
b   Mike  32.0       0
c   Jack  18.0       1
d   Rose   NaN       1
e  David  15.0       0
f  Marry  20.0       1
g  Wansi  41.0       0
h   Sidy   NaN       0
i  Jason  37.0       1
j   Even  32.0       0
  • Select columns 1, 3 and 4

df.iloc[:, [0,2,3]]
    name  gender isMarried
a    Joe       1       yes
b   Mike       0       yes
c   Jack       1        no
d   Rose       1       yes
e  David       0        no
f  Marry       1        no
g  Wansi       0        no
h   Sidy       0       yes
i  Jason       1        no
j   Even       0        no
  • Select first 3 columns by boolean array

df.iloc[:,[True,True,True,False]]
    name   age  gender
a    Joe  25.0       1
b   Mike  32.0       0
c   Jack  18.0       1
d   Rose   NaN       1
e  David  15.0       0
f  Marry  20.0       1
g  Wansi  41.0       0
h   Sidy   NaN       0
i  Jason  37.0       1
j   Even  32.0       0

  3) Select rows and columns at the same time

  • Select column 1, column 3, column 4 of row 2

df.iloc[1, [0,2,3]]
name         Mike
gender          0
isMarried     yes
Name: b, dtype: object
  • Select the first 3 columns of the first 3 rows

df.iloc[:3, :3]
   name   age  gender
a   Joe  25.0       1
b  Mike  32.0       0
c  Jack  18.0       1
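
  A few further df.iloc[] behaviours, sketched on the same data: a one-element list keeps the result two-dimensional, negative positions count from the end, and a selection that mixes positions with labels can be done by translating one side first with Index.get_loc() or by slicing df.index. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the article's example frame so the snippet runs on its own.
data = {
    'name': ['Joe', 'Mike', 'Jack', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jason', 'Even'],
    'age': [25, 32, 18, np.nan, 15, 20, 41, np.nan, 37, 32],
    'gender': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
    'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
}
df = pd.DataFrame(data, index=list('abcdefghij'))

row_as_series = df.iloc[1, :]     # integer -> Series
row_as_frame = df.iloc[[1], :]    # one-element list -> one-row DataFrame
print(type(row_as_series).__name__, type(row_as_frame).__name__)   # Series DataFrame

# Negative positions count from the end, as with Python lists.
print(df.iloc[-1, 0])             # Even
print(list(df.iloc[-2:, 0]))      # ['Jason', 'Even']

# Rows by position, column by label: translate one side first.
age_pos = df.columns.get_loc('age')             # label -> position
print(age_pos)                                  # 1
print(list(df.iloc[:3, age_pos]))               # [25.0, 32.0, 18.0]
first_three_labels = df.index[:3]               # positions -> labels
print(list(df.loc[first_three_labels, 'age']))  # [25.0, 32.0, 18.0]
```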

4. Cell selection

  Cell selection has two methods, df.at[] and df.iat[]. Both take exactly two parameters, a row index and a column index. df.at[] accepts only label indexes, and df.iat[] accepts only integer indexes. Because both select a single cell (one row and one column), the return value is a scalar rather than a Series or DataFrame.

4.1 df.at[]

  • Select the name column of row b

df.at['b','name']
Mike

4.2 df.iat[]

  • Select row 2, column 1

df.iat[1,0]
Mike
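
  As a brief sketch, the cell accessors return the same values as the corresponding df.loc[] and df.iloc[] calls, and they can also be used to assign to a single cell. The assignment is done on a copy (the name patched is illustrative) so that the original frame is left untouched.

```python
import numpy as np
import pandas as pd

# Rebuild the article's example frame so the snippet runs on its own.
data = {
    'name': ['Joe', 'Mike', 'Jack', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jason', 'Even'],
    'age': [25, 32, 18, np.nan, 15, 20, 41, np.nan, 37, 32],
    'gender': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
    'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
}
df = pd.DataFrame(data, index=list('abcdefghij'))

print(df.at['b', 'name'])       # Mike
print(df.iat[1, 0])             # Mike
print(df.loc['b', 'name'] == df.at['b', 'name'])   # True
print(df.iloc[1, 0] == df.iat[1, 0])               # True

# Assigning to single cells, here filling the two missing ages on a copy.
patched = df.copy()
patched.at['d', 'age'] = 28     # by labels
patched.iat[7, 1] = 30          # by positions: row 'h', column 'age'
print(patched['age'].isna().sum())   # 0
print(df['age'].isna().sum())        # 2 (the original is unchanged)
```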

5. Expansion and summary

  1) To select one or more entire rows, or one or more entire columns, you can use df[], df.loc[], or df.iloc[]; of these, df[] is the most concise to write.

  2) For area selection, use df.loc[] when selecting by label index and df.iloc[] when selecting by integer index. df.loc[] can be used not only to read data but also to overwrite it by assignment, so it is strongly recommended.

  3) To select a single cell, df.at[], df.iat[], df.loc[], and df.iloc[] all work, but each expects its own kind of index: labels for df.at[] and df.loc[], integers for df.iat[] and df.iloc[].

  4) The type of the return value depends on the shape of the selection:

  • If the selection is a single row with multiple columns, or multiple rows with a single column, the return value is a Series object (provided the single row or column is given as a scalar key rather than a one-element list);

  • If the return value includes multiple rows and columns, the return value is a DataFrame object;

  • If the selection is a single cell (one row and one column), the return value is a scalar, such as a str or a number.

  5) df[] can only select whole rows or whole columns and cannot address a single cell, so df[] always returns a DataFrame or Series object.

  6) When a DataFrame uses the default index, the integer index is also the label index, so df.loc[] and df.iloc[] accept the same row numbers. For example, instantiate a DataFrame from the data above without custom labels:

df2 = pd.DataFrame(data)
df2.loc[1,'name']
Mike
df2.iloc[1,0]
Mike
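
  Points 2) and 6) can be sketched together. Even with the default index, df.loc[] still treats numbers as labels and df.iloc[] still treats them as positions, which shows up in slices and after reordering. The last lines illustrate writing through df.loc[]; filling missing ages with 0 is only an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

# Rebuild the article's example data so the snippet runs on its own.
data = {
    'name': ['Joe', 'Mike', 'Jack', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jason', 'Even'],
    'age': [25, 32, 18, np.nan, 15, 20, 41, np.nan, 37, 32],
    'gender': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
    'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
}
df2 = pd.DataFrame(data)            # default index 0..9

print(len(df2.loc[0:2]))            # 3: labels 0, 1, 2 (end included)
print(len(df2.iloc[0:2]))           # 2: positions 0, 1 (end excluded)

# After reordering, labels and positions no longer coincide.
by_age = df2.sort_values('age')
print(by_age.loc[0, 'name'])        # Joe   (the row labelled 0)
print(by_age.iloc[0, 0])            # David (the first row, age 15)

# df.loc[] on the left-hand side writes to the selected region.
df2.loc[df2['age'].isna(), 'age'] = 0
print(df2['age'].isna().sum())      # 0
print(list(df2.loc[[3, 7], 'age'])) # [0.0, 0.0]
```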

Origin blog.csdn.net/qq_39312146/article/details/129769974