Creating a column with a set replicates the set n times

yatu :

I ran into some unexpected behavior in pandas that I can't quite explain, and I haven't found any related questions here on SO.

When creating a dataframe from a dictionary of lists, each element of the list goes, as expected, into a new row of the column named by the corresponding key:

pd.DataFrame({'a':[1,2,3]})

   a
0  1
1  2
2  3

However, trying the same with a set produced:

pd.DataFrame({'a':{1,2,3}})

       a
0  {1, 2, 3}
1  {1, 2, 3}
2  {1, 2, 3}

So it seems that the set was replicated as many times as the number of elements it contains, i.e. 3.

I know it does not really make sense to use a set for this, since sets are by definition unordered collections. However, I could not find any reference or explanation for this behavior. Is it specified somewhere in the docs? Is there an obvious reason behind it that I'm missing?

pd.__version__
# '1.0.0'

ALollz :

The issue lies in extract_index and, to a lesser extent, in sanitize_array. Here is the full walk-through:

import pandas as pd
from pandas.core.internals.construction import init_dict

#pd.DataFrame({'a':{1,2,3}})
data = {'a': {1,2,3}}
index = None
columns = None
dtype = None

Construction from a dict goes through this block of DataFrame.__init__:

elif isinstance(data, dict):
    mgr = init_dict(data, index, columns, dtype=dtype)

And you can see the index is incorrect:

BlockManager
Items: Index(['a'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
ObjectBlock: slice(0, 1, 1), 1 x 3, dtype: object

This happens because init_dict passes arrays=[{1, 2, 3}] to extract_index, and pandas considers a set to be list-like. As a result, extract_index takes the length of the set as the length of the Index.

from pandas.core.dtypes.common import is_list_like

is_list_like({1,2,3})
#True
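To make the list-like check concrete, here is a small sketch that uses the public pandas.api.types.is_list_like instead of the internal import above. The sample values and the names samples, results and strict are only illustrative, not anything pandas defines.

```python
from pandas.api.types import is_list_like

# Illustrative sample values (hypothetical names, not from pandas itself).
samples = {
    "list": [1, 2, 3],
    "set": {1, 2, 3},
    "tuple": (1, 2, 3),
    "str": "abc",
    "int": 5,
}

# Sets count as list-like by default, strings and plain scalars do not.
results = {name: is_list_like(value) for name, value in samples.items()}
print(results)

# allow_sets=False excludes sets from the check.
strict = is_list_like({1, 2, 3}, allow_sets=False)
print(strict)
```

So a set passes the same check as a list, which is why its length ends up being used for the index.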

The other issue comes from the difference in ndim between an array built from a list and one built from a set, so the underlying np.array is created differently. This is buried fairly deep in the construction code (sanitize_array):

np.array({1,2,3}).ndim
#0

np.array([1,2,3]).ndim
#1

And so the set is treated as a "scalar", which gets broadcast over the entire RangeIndex shown above to become array([{1, 2, 3}, {1, 2, 3}, {1, 2, 3}], dtype=object), while the list remains array([1, 2, 3]).
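The scalar treatment can be imitated with plain NumPy. This is only a rough sketch of the effect, not the code pandas actually runs; the names values, as_scalar, as_vector and broadcast are made up for the illustration.

```python
import numpy as np

values = {1, 2, 3}

# A set is not a sequence, so NumPy wraps it whole in a 0-d object array.
as_scalar = np.array(values)

# Converting to a list first gives an ordinary 1-d array of the elements.
as_vector = np.array(list(values))

# Imitate the broadcast: one row per element of the set (the inferred
# index length), each row holding a reference to the same set object.
n_rows = len(values)
broadcast = np.empty(n_rows, dtype=object)
for i in range(n_rows):
    broadcast[i] = values

print(as_scalar.ndim, as_vector.ndim)
print(broadcast)
```

The index length comes from len() of the set, while the values come from the set as a single object, and the two together give the repeated rows.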


Because the problem arises while extracting the index, a simple work-around is to specify the index explicitly, so that pandas never has to infer it from the set.

pd.DataFrame({'a': {1,2,3}}, index=[0])
#           a
#0  {1, 2, 3}
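Two other ways to sidestep the ambiguity, depending on what you actually want, are sketched below. Both avoid passing a bare set as a column value; the names one_cell and one_per_row are just for illustration.

```python
import pandas as pd

values = {1, 2, 3}

# Option 1: keep the whole set in a single cell by wrapping it in a list,
# so the column has exactly one list element and therefore one row.
one_cell = pd.DataFrame({'a': [values]})

# Option 2: put one element per row by converting the set to a list first.
# sorted() also fixes the row order, which a set does not guarantee.
one_per_row = pd.DataFrame({'a': sorted(values)})

print(one_cell)
print(one_per_row)
```

Wrapping or converting the set makes the intent explicit, so the result does not depend on how the constructor infers the index from a set.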
