I encountered this unexpected behavior using pandas that I don't quite know how to explain, and have not found any related questions here in SO.
When creating a dataframe from a dictionary of lists, each element of the iterable goes, as expected, into a new row of the column named by its key:
pd.DataFrame({'a':[1,2,3]})
a
0 1
1 2
2 3
However, trying to do the same with a set produced:
pd.DataFrame({'a':{1,2,3}})
a
0 {1, 2, 3}
1 {1, 2, 3}
2 {1, 2, 3}
So it seems that the set was replicated once for each element it contains, i.e. 3 times.
I know it does not really make sense to use a set for this, since sets are by definition unordered collections. However I could not find any references or explanations behind this behaviour. Is this specified somewhere in the docs? Is there an obvious reason behind this that I'm missing?
pd.__version__
# '1.0.0'
The issue is in extract_index, and also somewhat in sanitize_array. To give the full walk-through:
import pandas as pd
from pandas.core.internals.construction import init_dict
#pd.DataFrame({'a':{1,2,3}})
data = {'a': {1,2,3}}
index = None
columns = None
dtype = None
Construction from a dict goes through this block of the DataFrame constructor:
elif isinstance(data, dict):
mgr = init_dict(data, index, columns, dtype=dtype)
And you can see the index is incorrect:
BlockManager
Items: Index(['a'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
ObjectBlock: slice(0, 1, 1), 1 x 3, dtype: object
This happens because init_dict passes arrays=[{1, 2, 3}] to extract_index, and pandas considers a set to be list_like. It therefore takes the length of the set as the length of your Index.
from pandas.core.dtypes.common import is_list_like
is_list_like({1,2,3})
#True
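The length inference described above can be sketched roughly as follows. This is a simplified illustration, not the actual pandas source: infer_index_length is a hypothetical helper, and the real extract_index handles Series, dicts and other cases that are omitted here.

```python
from pandas.api.types import is_list_like


def infer_index_length(arrays):
    """Hypothetical helper: common length of the list-like inputs, or None."""
    # A set passes is_list_like, so its len() is collected like a list's.
    lengths = {len(a) for a in arrays if is_list_like(a)}
    if len(lengths) > 1:
        raise ValueError("arrays must all be the same length")
    return lengths.pop() if lengths else None


set_length = infer_index_length([{1, 2, 3}])
list_length = infer_index_length([[1, 2, 3]])
print(set_length, list_length)
```

Both calls report a length of 3, which is why the set ends up with a three-row index even though it is later stored as a single object.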
The other issue comes from the difference in ndim between an array built from a list and one built from a set, so the underlying np.array is created differently. This is buried fairly deep in the construction code:
np.array({1,2,3}).ndim
#0
np.array([1,2,3]).ndim
#1
And so, the set is treated as a "scalar" which gets broadcast to the entire RangeIndex specified above to become array([{1, 2, 3}, {1, 2, 3}, {1, 2, 3}], dtype=object), while the list remains as array([1, 2, 3]).
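The broadcast can be imitated with NumPy alone. This is only a sketch of the effect, assuming a manual fill loop; it is not how pandas performs the step internally.

```python
import numpy as np

values = {1, 2, 3}

# NumPy cannot iterate a set into a 1-d array, so it wraps it as a 0-d object.
as_array = np.array(values)
n_rows = len(values)  # the index length inferred from the set

# Fill an object array of that length with the same set, row by row.
broadcast = np.empty(n_rows, dtype=object)
for i in range(n_rows):
    broadcast[i] = values

print(as_array.ndim)
print(broadcast)
```

Every row holds a reference to the same set object, which matches the repeated {1, 2, 3} rows shown in the question.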
Because the problem lies in extracting the index, the simple workaround is to specify the index explicitly so that construction does not have to infer it.
pd.DataFrame({'a': {1,2,3}}, index=[0])
# a
#0 {1, 2, 3}
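If you would rather not pass a set to the constructor at all, two alternatives are sketched below. They assume you want either one element per row (sorted here to give a deterministic order, since sets are unordered) or the whole set in a single cell; the variable names are illustrative only.

```python
import pandas as pd

values = {1, 2, 3}

# One element per row: turn the set into an ordered list first.
one_per_row = pd.DataFrame({'a': sorted(values)})

# Whole set in one cell: wrap it in a list of length 1.
single_cell = pd.DataFrame({'a': [values]})

print(one_per_row)
print(single_cell)
```

Either way the column value is a plain list, so the index length and the stored data agree.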