DataFrame data value traversal access method

To traverse, query, or modify the values in a DataFrame, the loc and iloc indexers are usually used to locate elements. This article compares how loc and iloc are used in traversal and how they differ.

1. Use of loc and iloc

loc: selects data by the labels in the row index (access by row label and column name, or by a boolean condition)
iloc: selects data by integer position (access by row number and column number; column names cannot be used)

Note: loc stands for location. The i in iloc stands for integer, so iloc takes integer positions rather than labels (boolean arrays are also accepted).
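As a minimal sketch of the difference, assuming a small hypothetical DataFrame with string row labels (the data and labels are not from the original article), the same cell can be reached by label with loc or by position with iloc:

```python
import pandas as pd

# Hypothetical sample data; the labels "x", "y", "z" are chosen for illustration.
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]}, index=["x", "y", "z"])

by_label = df.loc["y", "B"]    # row label "y", column label "B"
by_position = df.iloc[1, 1]    # second row, second column (positions start at 0)
print(by_label, by_position)

# iloc rejects a label where an integer position is expected.
try:
    df.iloc["y"]
    rejected = False
except (TypeError, ValueError):
    rejected = True
print(rejected)
```

Both lookups return the same value here; only the way the cell is addressed differs.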

The official pandas documentation describes loc as primarily label based:

.loc is primarily label based, but may also be used with a boolean
array.

  • A single label, e.g. 5 or ‘a’ (Note that 5 is interpreted as a label of the index. This use is not an integer position along the index.).
  • A list or array of labels [‘a’, ‘b’, ‘c’].
  • A slice object with labels ‘a’:‘f’ (Note that contrary to usual Python slices, both the start and the stop are included, when present in the index! See Slicing with labels and Endpoints are inclusive.)
  • A boolean array (any NA values will be treated as False).
  • A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
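The input types listed above can be tried on a small example. This is only a sketch; the DataFrame and its labels are hypothetical:

```python
import pandas as pd

# Hypothetical data with string labels as the index.
df = pd.DataFrame({"v": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])

single = df.loc["b", "v"]                     # single label: one scalar
listed = df.loc[["a", "c"]]                   # list of labels: rows a and c
sliced = df.loc["a":"c"]                      # label slice: a, b, c (stop is included)
masked = df.loc[df["v"] > 2]                  # boolean array: rows where v > 2
called = df.loc[lambda d: d["v"] % 2 == 0]    # callable returning a boolean mask

print(single)
print(list(sliced.index))
print(list(masked.index))
print(list(called.index))
```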

The official documentation describes iloc as primarily integer-position based:

.iloc is primarily integer position based (from 0 to length-1 of the
axis), but may also be used with a boolean array. .iloc will raise
IndexError if a requested indexer is out-of-bounds, except slice
indexers which allow out-of-bounds indexing. (this conforms with
Python/NumPy slice semantics). Allowed inputs are:

  • An integer e.g. 5.
  • A list or array of integers [4, 3, 0].
  • A slice object with ints 1:7.
  • A boolean array (any NA values will be treated as False).
  • A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
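A short sketch of these inputs, including the out-of-bounds behavior described in the quoted text. The DataFrame is hypothetical and only serves as an illustration:

```python
import pandas as pd

# Hypothetical data; the string index shows that iloc ignores labels.
df = pd.DataFrame({"v": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

listed = df.iloc[[3, 0]]                       # list of positions: rows d, a
sliced = df.iloc[1:3]                          # positions 1 and 2 (stop is excluded)
beyond = df.iloc[2:100]                        # out-of-bounds slice is allowed
masked = df.iloc[(df["v"] > 20).to_numpy()]    # boolean NumPy array

# A single out-of-bounds position raises IndexError.
try:
    df.iloc[100]
    raised = False
except IndexError:
    raised = True

print(list(listed.index), list(sliced.index), list(beyond.index), raised)
```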

2. loc data value traversal

(1) Based on date index

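import numpy as np
import pandas as pd

# Use dates as the index
df = pd.DataFrame(abs(np.random.randn(10, 4)), index=pd.date_range('1/1/2023', periods=10),
                  columns=list('ABCD'))
df.index.name = 'date'
i = 1
# Traverse the rows by date index
for d in df.index:
    # Set column A to 0 on even-numbered rows (counting from 1)
    if i % 2 == 0:
        df.loc[df.index == d, 'A'] = 0

    i += 1
# Set column A to 100 on the row with the smallest index value
df.loc[df.index == df.index.min(), ['A']] = 100

print(df)
pre = 100
# Where column A is 0, fill it with the value from the row above
for d in df.index:
    if df.loc[d, 'A'] == 0.0:
        df.loc[d, 'A'] = pre
    pre = df.loc[d, 'A']

print(df)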

The output is as follows:

                     A         B         C         D
date                                                
2023-01-01  100.000000  1.532303  1.667700  0.799870
2023-01-02    0.000000  0.183089  1.239692  1.321370
2023-01-03    0.741102  0.046467  1.132106  0.019921
2023-01-04    0.000000  1.051709  0.236322  1.521744
2023-01-05    0.385593  0.533345  0.762100  0.654683
2023-01-06    0.000000  1.244160  0.433445  1.050108
2023-01-07    0.243023  2.255967  0.165955  0.287973
2023-01-08    0.000000  0.454799  1.382565  0.732341
2023-01-09    0.497839  0.371737  0.366683  0.524772
2023-01-10    0.000000  0.677077  0.542580  1.384272
                     A         B         C         D
date                                                
2023-01-01  100.000000  1.532303  1.667700  0.799870
2023-01-02  100.000000  0.183089  1.239692  1.321370
2023-01-03    0.741102  0.046467  1.132106  0.019921
2023-01-04    0.741102  1.051709  0.236322  1.521744
2023-01-05    0.385593  0.533345  0.762100  0.654683
2023-01-06    0.385593  1.244160  0.433445  1.050108
2023-01-07    0.243023  2.255967  0.165955  0.287973
2023-01-08    0.243023  0.454799  1.382565  0.732341
2023-01-09    0.497839  0.371737  0.366683  0.524772
2023-01-10    0.497839  0.677077  0.542580  1.384272
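The second loop in this example fills each 0 in column A with the value from the row above. As a hedged alternative, not part of the original article, the same fill-down can probably be written without an explicit loop by masking the zeros and forward-filling. The sketch below uses fixed hypothetical values instead of random ones so the result is reproducible:

```python
import pandas as pd

# Hypothetical fixed values standing in for the random data above.
df = pd.DataFrame(
    {"A": [100.0, 0.0, 0.7, 0.0, 0.3]},
    index=pd.date_range("2023-01-01", periods=5, name="date"),
)

# Turn zeros into NaN, then carry the last valid value downward.
filled = df["A"].mask(df["A"] == 0).ffill()
print(filled)
```

This assumes that 0 only ever marks a value to be replaced, as it does in the loop version.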

(2) Default index

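import numpy as np
import pandas as pd

df = pd.DataFrame(abs(np.random.randn(10, 4)), columns=list('ABCD'))
print(df)
df.index.name = 'no'
# Default integer index (0, 1, 2, ...)
for i in df.index:
    df.loc[df.index == i, ['A', 'C']] = i

# Set column D to NaN
df['D'] = np.nan
# Assign column D on rows with an odd index label (every second row)
for i in df.index:
    if i % 2:
        df.loc[df.index == i, ['D']] = i

print(df)
# Backward-fill the missing values (each NaN takes the next valid value below it)
df1 = df['D'].bfill()
print(df1)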

The output is as follows:

         A         B         C         D
0  1.355061  1.784947  0.530280  0.343836
1  0.591961  1.587958  0.700280  0.096845
2  0.945876  1.036163  0.903821  0.161356
3  1.144042  1.162818  0.148023  1.971303
4  0.424846  0.960678  0.891586  1.687668
5  0.441317  2.275049  0.168477  0.297483
6  0.791475  0.894168  1.309116  1.826531
7  0.349400  0.878078  1.748874  2.238486
8  0.501033  0.608020  0.346233  2.553355
9  0.795990  1.267664  0.565392  1.510390
      A         B    C    D
no                         
0   0.0  1.784947  0.0  NaN
1   1.0  1.587958  1.0  1.0
2   2.0  1.036163  2.0  NaN
3   3.0  1.162818  3.0  3.0
4   4.0  0.960678  4.0  NaN
5   5.0  2.275049  5.0  5.0
6   6.0  0.894168  6.0  NaN
7   7.0  0.878078  7.0  7.0
8   8.0  0.608020  8.0  NaN
9   9.0  1.267664  9.0  9.0
no
0    1.0
1    1.0
2    3.0
3    3.0
4    5.0
5    5.0
6    7.0
7    7.0
8    9.0
9    9.0
Name: D, dtype: float64
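The filled column above takes each missing value from the next valid row, which is a backward fill. A small sketch with a hypothetical Series shows how backward fill and forward fill differ:

```python
import numpy as np
import pandas as pd

# Hypothetical column with gaps, shaped like column D above.
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan], name="D")

back = s.bfill()   # each NaN takes the next valid value; a trailing NaN stays NaN
fwd = s.ffill()    # each NaN takes the previous valid value; a leading NaN stays NaN

print(back.tolist())
print(fwd.tolist())
```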

3. iloc data value traversal

(1) Based on date and default index

Because iloc takes only positional arguments, it behaves the same whether the DataFrame has a date index or the default index.
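To illustrate that point, the sketch below builds two hypothetical DataFrames from the same values, one with a date index and one with the default index, and applies the same iloc selection to both:

```python
import numpy as np
import pandas as pd

# Hypothetical values shared by both DataFrames.
values = np.arange(12).reshape(3, 4)
by_date = pd.DataFrame(values, index=pd.date_range("2023-01-01", periods=3),
                       columns=list("ABCD"))
by_default = pd.DataFrame(values, columns=list("ABCD"))

# The same positional selection returns the same values from both.
a = by_date.iloc[0:2, [2, 3]].to_numpy()
b = by_default.iloc[0:2, [2, 3]].to_numpy()
print(np.array_equal(a, b))
```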

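import numpy as np
import pandas as pd

df = pd.DataFrame(abs(np.random.randn(10, 4)), index=pd.date_range('1/1/2023', periods=10),
                  columns=list('ABCD'))
df.index.name = 'date'
# Loop over the rows by position
for i in range(df.shape[0]):
    # Even positions; column positions 2 and 3 are columns C and D
    if i % 2 == 0:
        df.iloc[[i], [2, 3]] = 0

# Set columns C and D of the first row to 100
df.iloc[[0], [2, 3]] = 100
# print(df.iloc[[0], [3]].values[0][0])
print(df)

# Initial previous values; row 0 is nonzero, so they are replaced before being used
pre_c = pre_d = 100
for i in range(df.shape[0]):
    # Where column C is 0, fill it with the value from the row above
    if df.iloc[[i], [2]].values[0][0] == 0.0:
        df.iloc[[i], [2]] = pre_c

    pre_c = df.iloc[[i], [2]].values[0][0]

    # Where column D is 0, fill it with the value from the row above
    if df.iloc[[i], [3]].values[0][0] == 0.0:
        df.iloc[[i], [3]] = pre_d

    pre_d = df.iloc[[i], [3]].values[0][0]

print(df)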

The output is as follows:

                   A         B           C           D
date                                                  
2023-01-01  1.040645  0.369780  100.000000  100.000000
2023-01-02  1.850851  1.422875    0.066909    1.137934
2023-01-03  0.321779  0.376273    0.000000    0.000000
2023-01-04  0.316248  1.198039    1.707555    0.539617
2023-01-05  0.350327  0.144577    0.000000    0.000000
2023-01-06  0.396593  1.054268    0.791154    0.898749
2023-01-07  0.685409  1.286553    0.000000    0.000000
2023-01-08  0.366570  0.997236    1.534733    0.689972
2023-01-09  0.417907  0.823729    0.000000    0.000000
2023-01-10  1.316604  0.867192    0.514058    0.945503
                   A         B           C           D
date                                                  
2023-01-01  1.040645  0.369780  100.000000  100.000000
2023-01-02  1.850851  1.422875    0.066909    1.137934
2023-01-03  0.321779  0.376273    0.066909    1.137934
2023-01-04  0.316248  1.198039    1.707555    0.539617
2023-01-05  0.350327  0.144577    1.707555    0.539617
2023-01-06  0.396593  1.054268    0.791154    0.898749
2023-01-07  0.685409  1.286553    0.791154    0.898749
2023-01-08  0.366570  0.997236    1.534733    0.689972
2023-01-09  0.417907  0.823729    1.534733    0.689972
2023-01-10  1.316604  0.867192    0.514058    0.945503

Note:
The data types involved when reading a value through df.iloc[[0],[3]]:

print(type(df.iloc[[0],[3]])) 
print(type(df.iloc[[0],[3]].values[0])) 
print(type(df.iloc[[0],[3]].values[0][0])) 

They are a DataFrame, a NumPy array, and a floating-point number, respectively.

<class 'pandas.core.frame.DataFrame'>
<class 'numpy.ndarray'>
<class 'numpy.float64'>
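When only a single value is needed, passing plain integers instead of lists returns the scalar directly, so the .values[0][0] step is not required. A minimal sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame for illustration.
df = pd.DataFrame({"C": [1.5, 2.5], "D": [3.5, 4.5]})

as_frame = df.iloc[[0], [1]]   # list indexers keep the result a 1x1 DataFrame
as_scalar = df.iloc[0, 1]      # integer indexers return the value itself
via_iat = df.iat[0, 1]         # iat is the scalar-only positional accessor

print(type(as_frame))
print(type(as_scalar))
print(as_scalar == via_iat)
```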

4. Summary and comparison

(1)loc

  • Rows can be filtered by row label or by a condition on the index or columns; columns are selected by column label.
  • To select all rows or all columns, use : in that position.
  • Row label slices include both ends. With the default index, 0:3 selects labels 0, 1, 2, and 3, a total of 4 rows.
  • Specifying columns by label makes the program more readable.
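The slicing points in this list can be checked with a short sketch. The DataFrame is hypothetical and uses the default integer index:

```python
import pandas as pd

# Hypothetical data with the default index 0..4.
df = pd.DataFrame({"A": [0, 1, 2, 3, 4], "B": [5, 6, 7, 8, 9]})

rows_loc = df.loc[0:3, "A"]    # labels 0, 1, 2, 3: both ends included
rows_iloc = df.iloc[0:3, 0]    # positions 0, 1, 2: stop excluded
all_rows = df.loc[:, "B"]      # ":" selects every row

print(len(rows_loc), len(rows_iloc), len(all_rows))
```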

(2) iloc

  • iloc is based on integer position; row and column positions start from 0.
  • iloc cannot select by label. If the index holds dates, a date range cannot be passed to iloc directly; a condition must first be turned into a plain boolean array.
  • Row slices are left-inclusive and right-exclusive: 0:3 selects positions 0, 1, and 2.
  • If the DataFrame has many columns, picking columns by position is inconvenient and makes the program harder to read.
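As an illustration of the date-range point, the sketch below selects the same three days in three ways. The DataFrame is hypothetical, and building the mask from the index is just one possible approach:

```python
import pandas as pd

# Hypothetical data with a daily date index.
df = pd.DataFrame({"A": range(10)}, index=pd.date_range("2023-01-01", periods=10))

# loc accepts the date labels directly; both ends are included.
by_label = df.loc["2023-01-03":"2023-01-05"]

# iloc needs positions, or a boolean NumPy array computed from the index.
mask = (df.index >= pd.Timestamp("2023-01-03")) & (df.index <= pd.Timestamp("2023-01-05"))
by_mask = df.iloc[mask]
by_position = df.iloc[2:5]

print(by_label["A"].tolist(), by_mask["A"].tolist(), by_position["A"].tolist())
```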

Comparison example (df is the default-index DataFrame from section 2 (2)):

print(df.iloc[0:2,1:3])
print(df.loc[0:2,['C','D']])

With iloc, the row slice 0:2 excludes position 2, and the column slice 1:3 covers positions 1 and 2 (columns B and C) and excludes position 3 (column D). With loc, the row slice 0:2 includes the row labeled 2.
Result:

           B    C
no               
0   1.141945  0.0
1   1.010452  1.0
      C    D
no          
0   0.0  NaN
1   1.0  1.0
2   2.0  NaN

Origin blog.csdn.net/qq_39065491/article/details/132822000