Which Pandas dataframe is better: super long dataframe VS badly structured one with lists

user8491363 :

I've been doing some NLP by tokenizing texts into n-grams. I have to count how many occurrences of each n-gram there are, for labels A and B respectively.

However, I have to choose between putting a long list into a single column and building a very long dataframe, and I'm not sure which structure is superior.

AFAIK, it is bad structure to have lists inside a column of a dataframe, since you can hardly get any useful information out of them with pandas operations, like getting the frequency (occurrence) of an item that appears in several lists. Also, it would require more computation to do any task, even where it is possible.

However, I know that a very long dataframe will eat up a lot of RAM, and possibly even kill other processes if the data gets too big to fit in memory. That's a situation I certainly don't want to be in.

So now I have to make a choice. What I want to do is count each n-gram's occurrences by label.

For example (the dataframes are shown below):

[
    {ngram: hey, occurrence_A: 2, occurrence_B: 0},
    {ngram: python, occurrence_A: 2, occurrence_B: 1},
    ...
]
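
Put differently, the result I'm after is a small frame with one row per n-gram and one occurrence column per label, something like this sketch (the column names are just illustrative):

import pandas as pd

# Desired result: one row per n-gram, one occurrence column per label
wanted = pd.DataFrame([
    {'ngram': 'hey', 'occurrence_A': 2, 'occurrence_B': 0},
    {'ngram': 'python', 'occurrence_A': 2, 'occurrence_B': 1},
])
print(wanted)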

I think it's relevant to state my computer's specs.

CPU: i3-6100

RAM: 16GB

GPU: n/a

DataFrame 1:

+------------+-------------------------------------------+-------+
|    DATE    |                   NGRAM                   | LABEL |
+------------+-------------------------------------------+-------+
| 2019-02-01 | [hey, hey, reddit, reddit, learn, python] | A     |
| 2019-02-02 | [python, reddit, pandas, dataframe]       | B     |
| 2019-02-03 | [python, reddit, ask, learn]              | A     |
+------------+-------------------------------------------+-------+

DataFrame 2:

+------------+-----------+-------+
|    DATE    |   NGRAM   | LABEL |
+------------+-----------+-------+
| 2019-02-01 | hey       | A     |
| 2019-02-01 | hey       | A     |
| 2019-02-01 | reddit    | A     |
| 2019-02-01 | reddit    | A     |
| 2019-02-01 | learn     | A     |
| 2019-02-01 | python    | A     |
| 2019-02-02 | python    | B     |
| 2019-02-02 | reddit    | B     |
| 2019-02-02 | pandas    | B     |
| 2019-02-02 | dataframe | B     |
| 2019-02-03 | python    | A     |
| 2019-02-03 | reddit    | A     |
| 2019-02-03 | ask       | A     |
| 2019-02-03 | learn     | A     |
+------------+-----------+-------+
Toukenize :

Like you mentioned, having a list inside a column of a dataframe is bad structure, and the long-format dataframe is preferred. Let me attempt to answer the question from several aspects:

  1. Added complexity for data manipulation & lack of native support for list-like columns

With a list-like column, you are not able to use Pandas functions readily.

For example, you mentioned you are interested in the NGRAM counts by LABEL. With the long-format dataframe (your DataFrame 2; call it df1 in the snippets below), you can obtain what you need readily with a simple groupby and count, while for the dataframe with the list column (your DataFrame 1; df2 below) you need to explode the list-like column before you can work on it:

df1.groupby(['LABEL','NGRAM']).count().unstack(-1).fillna(0)

df2.explode(column='NGRAM').groupby(['LABEL','NGRAM']).count().unstack(-1).fillna(0)
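
The two one-liners above assume df1 and df2 already exist; here is a self-contained sketch built from your sample data that you can run end to end:

import pandas as pd

# df2: one row per document, NGRAM holds a Python list (your DataFrame 1)
df2 = pd.DataFrame({
    'DATE': ['2019-02-01', '2019-02-02', '2019-02-03'],
    'NGRAM': [
        ['hey', 'hey', 'reddit', 'reddit', 'learn', 'python'],
        ['python', 'reddit', 'pandas', 'dataframe'],
        ['python', 'reddit', 'ask', 'learn'],
    ],
    'LABEL': ['A', 'B', 'A'],
})

# df1: one row per n-gram occurrence (your DataFrame 2), built here by exploding df2
df1 = df2.explode('NGRAM').reset_index(drop=True)

# Long format: a plain groupby/count, then pivot NGRAM into columns
counts_long = df1.groupby(['LABEL', 'NGRAM']).count().unstack(-1).fillna(0)

# List column: has to be exploded before the same groupby works
counts_list = df2.explode(column='NGRAM').groupby(['LABEL', 'NGRAM']).count().unstack(-1).fillna(0)

print(counts_long.equals(counts_list))  # True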

Both give you the same thing:

(NGRAM counts per LABEL, pivoted so each NGRAM becomes a column)

+-------+-----+-----------+-----+-------+--------+--------+--------+
| LABEL | ask | dataframe | hey | learn | pandas | python | reddit |
+-------+-----+-----------+-----+-------+--------+--------+--------+
| A     |   1 |         0 |   2 |     2 |      0 |      2 |      3 |
| B     |   0 |         1 |   0 |     0 |      1 |      1 |      1 |
+-------+-----+-----------+-----+-------+--------+--------+--------+

In addition, many native Pandas functions (e.g. my favourite, value_counts) can't work on lists directly, so explode is almost always needed.
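
For instance, calling value_counts directly on the list column fails (lists are not hashable), so you have to explode first; a quick sketch using df2 from above:

# df2['NGRAM'].value_counts()            # raises TypeError: unhashable type: 'list'
ngram_freq = df2['NGRAM'].explode().value_counts()
print(ngram_freq)
# reddit    4
# python    3
# hey       2
# learn     2
# ...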

  2. Lower computation time for long data than for list-like data (generally speaking, since we don't need to explode the column first)

Imagine you decided to capitalize your NGRAM values. You would do the following for each dataframe respectively, and you can see that df2 (the list-column one) takes much longer to execute:

df1['NGRAM'] = df1['NGRAM'].str.capitalize()
# 1000 loops, best of 5: 246 µs per loop

df2['NGRAM'] = df2['NGRAM'].explode().str.capitalize().groupby(level=0).apply(list)
# 1000 loops, best of 5: 1.49 ms per loop
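
The per-loop figures above look like IPython %timeit output; outside IPython you could reproduce the comparison with the standard timeit module, a rough sketch reusing df1 and df2 from the earlier snippet (absolute numbers will depend on your machine and data size):

import timeit

# Time the simple string method on the long format vs explode/regroup on the list column
t_long = timeit.timeit(lambda: df1['NGRAM'].str.capitalize(), number=1000)
t_list = timeit.timeit(
    lambda: df2['NGRAM'].explode().str.capitalize().groupby(level=0).apply(list),
    number=1000,
)
print(f'long format : {t_long / 1000 * 1e6:.0f} µs per loop')
print(f'list column : {t_list / 1000 * 1e6:.0f} µs per loop')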

If memory is an issue for you, you might want to consider working with NGRAM counts per label directly (the pivoted structure shown in the table above, rather than storing them as either df1 or df2), or using NumPy arrays (which reduce the overhead of Pandas slightly) while keeping an NGRAM dictionary separately.
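
A minimal sketch of that idea (the variable names and input format are my assumptions): keep a vocabulary that maps each n-gram to a row index, and accumulate the per-label counts in a small integer matrix instead of a dataframe.

import numpy as np

# Hypothetical raw input: (label, list of n-grams) pairs, e.g. streamed from disk
docs = [
    ('A', ['hey', 'hey', 'reddit', 'reddit', 'learn', 'python']),
    ('B', ['python', 'reddit', 'pandas', 'dataframe']),
    ('A', ['python', 'reddit', 'ask', 'learn']),
]
labels = {'A': 0, 'B': 1}   # label -> column index

vocab = {}   # n-gram -> row index (the separate "NGRAM dictionary")
rows = []    # one per-label count array per n-gram

for label, ngrams in docs:
    col = labels[label]
    for ng in ngrams:
        idx = vocab.setdefault(ng, len(vocab))
        if idx == len(rows):                      # first time we see this n-gram
            rows.append(np.zeros(len(labels), dtype=np.int64))
        rows[idx][col] += 1

count_matrix = np.vstack(rows)                    # shape: (n_ngrams, n_labels)
print(dict(zip(vocab, count_matrix.tolist())))
# {'hey': [2, 0], 'reddit': [3, 1], 'learn': [2, 0], 'python': [2, 1], ...}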
