I am confused by the different indexing methods that use iloc in pandas.
Let's say I am trying to convert a 1-d DataFrame to a 2-d DataFrame. First I have the following 1-d DataFrame:
import pandas as pd
a_array = [1,2,3,4,5,6,7,8]
a_df = pd.DataFrame(a_array).T
I want to convert that into a 2-d DataFrame of size 2x4. I start by presetting the 2-d DataFrame as follows:
b_df = pd.DataFrame(columns=range(4),index=range(2))
Then I use a for-loop to convert a_df (1-d) to b_df (2-d) with the following code:
for i in range(2):
    b_df.iloc[i,:] = a_df.iloc[0,i*4:(i+1)*4]
It only gives me the following result:
0 1 2 3
0 1 2 3 4
1 NaN NaN NaN NaN
But when I changed b_df.iloc[i,:] to b_df.iloc[i][:], the result is correct, as shown below, which is what I want:
0 1 2 3
0 1 2 3 4
1 5 6 7 8
Could anyone explain to me what the difference between .iloc[i,:] and .iloc[i][:] is, and why .iloc[i][:] worked in my example above but .iloc[i,:] did not?
There is a very, very big difference between series.iloc[:] and series[:] when assigning back. (i)loc always checks to make sure whatever you're assigning from matches the index of the assignee. Meanwhile, the [:] syntax assigns to the underlying NumPy array, bypassing index alignment.
s = pd.Series(index=[0, 1, 2, 3], dtype='float')
s
0 NaN
1 NaN
2 NaN
3 NaN
dtype: float64
# Let's get a reference to the underlying array with `copy=False`
arr = s.to_numpy(copy=False)
arr
# array([nan, nan, nan, nan])
# Reassign using slicing syntax
s[:] = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s
0 1
1 2
2 3
3 4
dtype: int64
arr
# array([1., 2., 3., 4.]) # underlying array has changed
# Now, reassign again with `iloc`
s.iloc[:] = pd.Series([5, 6, 7, 8], index=[3, 4, 5, 6])
s
0 NaN
1 NaN
2 NaN
3 5.0
dtype: float64
arr
# array([1., 2., 3., 4.]) # `iloc` created a new array for the series
# during reassignment leaving this unchanged
s.to_numpy(copy=False) # the new underlying array, for reference
# array([nan, nan, nan, 5.])
Now that you understand the difference, let's look at what happens in your code. Just print out the right-hand side of the assignment in your loop to see what you are assigning:
for i in range(2):
    print(a_df.iloc[0, i*4:(i+1)*4])
# output - first row
0 1
1 2
2 3
3 4
Name: 0, dtype: int64
# second row. Notice the index is different
4 5
5 6
6 7
7 8
Name: 0, dtype: int64
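To make the mismatch concrete, here is a small sketch that compares the labels carried by each slice with the column labels of b_df. It reuses the a_df and b_df definitions from the question; the names first, second, shared_first and shared_second are only illustrative.

```python
import pandas as pd

a_df = pd.DataFrame([1, 2, 3, 4, 5, 6, 7, 8]).T
b_df = pd.DataFrame(columns=range(4), index=range(2))

first = a_df.iloc[0, 0:4]   # Series labelled 0..3
second = a_df.iloc[0, 4:8]  # Series labelled 4..7

# Labels that each slice has in common with the columns of b_df
shared_first = b_df.columns.intersection(first.index).tolist()
shared_second = b_df.columns.intersection(second.index).tolist()

print(first.index.tolist(), shared_first)    # [0, 1, 2, 3] [0, 1, 2, 3]
print(second.index.tolist(), shared_second)  # [4, 5, 6, 7] []
```

The second slice shares no labels with the columns of b_df, so an assignment that aligns on labels has nothing to place in the second row.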
When assigning to b_df.iloc[i, :] in the second iteration, the indexes are different, so nothing is assigned and you only see NaNs. However, changing b_df.iloc[i, :] to b_df.iloc[i][:] means you assign to the underlying NumPy array, so index alignment is bypassed. This operation is better expressed as:
for i in range(2):
    b_df.iloc[i, :] = a_df.iloc[0, i*4:(i+1)*4].to_numpy()
b_df
0 1 2 3
0 1 2 3 4
1 5 6 7 8
It's also worth mentioning that b_df.iloc[i][:] = ... is a form of chained assignment, which pandas discourages because it may write to a temporary copy instead of b_df, and which also makes your code harder to read and understand.
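If the goal is only to turn the eight values into a 2x4 frame, the loop can be avoided altogether. A minimal sketch, assuming the values should fill the rows in order (reshaped is an illustrative name, not something from the question):

```python
import pandas as pd

a_df = pd.DataFrame([1, 2, 3, 4, 5, 6, 7, 8]).T

# Drop the labels with to_numpy(), reshape the 1x8 array to 2x4,
# and wrap the result in a new DataFrame with default 0..n-1 labels.
reshaped = pd.DataFrame(a_df.to_numpy().reshape(2, 4))
print(reshaped)
```

Because the data passes through a plain NumPy array, no index alignment is involved and no chained assignment is needed.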