[Yugong Series] Pandas Data Analysis, July 2023: MultiIndex


Foreword

A multi-level index (MultiIndex) in Pandas uses multiple index levels to organize the data in a DataFrame or Series. Multi-level indexes can store high-dimensional data, such as time series or data with several categorical variables.

In Pandas, a MultiIndex can be created in several ways:

  1. From a list of tuples: create a MultiIndex by passing a list of tuples, where each tuple holds the labels of one index entry. For example:
import pandas as pd

index = pd.MultiIndex.from_tuples([('A', 'X'), ('A', 'Y'), ('B', 'X'), ('B', 'Y')])
  2. From multiple arrays: create a MultiIndex by passing a list of arrays, one per level. All arrays must have the same length. For example:
import pandas as pd
import numpy as np

index1 = np.array(['A', 'A', 'B', 'B'])
index2 = np.array(['X', 'Y', 'X', 'Y'])
index = pd.MultiIndex.from_arrays([index1, index2])
  3. From a Cartesian product: create a MultiIndex by passing a list of iterables, one per level, each holding that level's unique values; from_product builds every combination. For example:
import pandas as pd

index1 = ['A', 'B']
index2 = ['X', 'Y']
index = pd.MultiIndex.from_product([index1, index2])

After creating a MultiIndex, you can use the MultiIndex.get_level_values() method to get the labels of each level and .loc[] to select data at a specific level. For example:

import pandas as pd

index = pd.MultiIndex.from_product([['A', 'B'], ['X', 'Y']])
data = pd.DataFrame({'value': [1, 2, 3, 4]}, index=index)
print(data.loc['A'])

This will output:

   value
X      1
Y      2

1. MultiIndex

1. The concept of MultiIndex

Pandas provides functions and methods that allow us to manipulate data quickly and easily.

[image]
For those who have never heard of a MultiIndex, its most straightforward use is a second index column that complements the first to identify each row uniquely. For example, to disambiguate cities from different states, the state name is often appended to the city name: there are about 40 Springfields in the US. In relational databases this is called a composite primary key.

You can specify the columns to include in the index either after parsing the DataFrame from a CSV file (with set_index) or immediately, as arguments to read_csv.

[image]
You can also append a column as an extra level to an existing index using append=True, as shown in the image below:
[image]
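A minimal sketch of both variants (the column names city, state and population are hypothetical, not from the article):

import pandas as pd

df = pd.DataFrame({'city': ['Springfield', 'Springfield', 'Portland'],
                   'state': ['IL', 'MA', 'OR'],
                   'population': [114, 155, 653]})

# build the composite (city, state) index after the fact ...
df2 = df.set_index(['city', 'state'])

# ... or directly while parsing:
# df2 = pd.read_csv('cities.csv', index_col=['city', 'state'])

# append another column to an already existing index
df3 = df.set_index('city').set_index('state', append=True)
print(df2.index.equals(df3.index))   # True: the same two-level index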
Another, more typical use case is representing multi-dimensional data, when you have a set of objects with specific properties, or objects that evolve over time. For example:

  • Results of a sociological survey

  • The Titanic data set

  • Historical weather observations

  • A chronology of championship standings.

This is also known as "panel data", after which Pandas is named.

Let's add such a dimension:

[image]
We now have a 4D space that looks like this:

  • Years form one (nearly continuous) dimension

  • City names are arranged along a second dimension

  • State names along a third

  • Specific city attributes ("population", "density", "area", etc.) act as "tick marks" on the fourth dimension.

The following diagram illustrates the concept:

[image]
To make room for the names of the column levels, Pandas shifts the whole header up:
[image]
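A rough sketch of such a frame, with (state, city) as row levels and (year, attribute) as column levels (the numbers are made up for illustration):

import pandas as pd

rows = pd.MultiIndex.from_tuples([('OR', 'Portland'), ('ME', 'Portland')],
                                 names=['state', 'city'])
cols = pd.MultiIndex.from_product([[2010, 2020], ['population', 'area']],
                                  names=['year', None])

# one cell per (state, city) x (year, attribute) combination
df = pd.DataFrame([[584, 145, 653, 145],
                   [66, 69, 68, 69]], index=rows, columns=cols)
print(df)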

2. Grouping

The first thing to note about a MultiIndex is that it does not actually group anything, despite how it may appear. Internally it is just a flat sequence of labels, like this:
[image]
You can get the same groupby effect by sorting the row labels:
[image]
You can even disable the visual grouping entirely by setting the corresponding Pandas option: pd.options.display.multi_sparse = False.
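For example (with made-up labels), the grouped look only appears after sorting, and can be switched off entirely:

import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['OR', 'CA', 'OR', 'CA'], ['Portland', 'Fresno', 'Salem', 'Anaheim']],
    names=['state', 'city'])
df = pd.DataFrame({'population': [653, 542, 175, 346]}, index=index)

print(df)                # labels are stored flat and displayed as-is
print(df.sort_index())   # sorting the row labels produces the grouped look

pd.options.display.multi_sparse = False   # always repeat the outer labels
print(df.sort_index())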

3. Type conversion

Pandas (and Python itself) distinguishes between numbers and strings, so it is usually best to convert the labels to a numeric type when it is not detected automatically:

pdi.set_level(df.columns, 0, pdi.get_level(df.columns, 0).astype('int'))

If you're feeling adventurous, you can do the same with standard tools:

df.columns = df.columns.set_levels(df.columns.levels[0].astype(int), level=0)

But to use them properly, you need to understand what levels and codes are, whereas pdi lets you work with a MultiIndex as if its levels were ordinary lists or NumPy arrays.

If you really want to know, levels and codes are what a regular list of labels for a given level is decomposed into, to speed up operations such as pivot, join and so on:

  • pdi.get_level(df, 0) == Int64Index([2010, 2010, 2020, 2020])

  • df.columns.levels[0] == Int64Index([2010, 2020])

  • df.columns.codes[0] == Int64Index([0, 0, 1, 1])
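The same decomposition can be inspected with plain Pandas (a small sketch; codes are positions into levels):

import pandas as pd

cols = pd.MultiIndex.from_arrays([[2010, 2010, 2020, 2020],
                                  ['population', 'area', 'population', 'area']])

print(cols.get_level_values(0))   # [2010, 2010, 2020, 2020] -- the full label list
print(cols.levels[0])             # [2010, 2020]             -- the unique labels
print(cols.codes[0])              # [0, 0, 1, 1]             -- positions into levels[0]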

4. Building a DataFrame with a MultiIndex

Besides reading from CSV files and building from existing columns, there are other ways to create a MultiIndex. They are less commonly used, mostly for testing and debugging.

The most intuitive approach, using Pandas' own MultiIndex representation directly, doesn't work, for historical reasons.

[image]
Here levels and codes are (now) considered implementation details that should not be exposed to the end user, but we have what we have.

Probably the easiest way to build a MultiIndex is the following:
[image]
This has the disadvantage of having to specify the level names on a separate line. There are several alternative constructors that bundle the names together with the labels:
[image]
When levels form a regular structure, you can specify key elements and let Pandas interweave them automatically, as follows:
[image]
All the methods listed above also work with columns. For example:
[image]
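In plain Pandas, the constructors sketched in the images look roughly like this (the labels are illustrative):

import numpy as np
import pandas as pd

# names given on a separate line
idx = pd.MultiIndex.from_tuples([('OR', 'Portland'), ('CA', 'Fresno')])
idx.names = ['state', 'city']

# names bundled together with the labels
idx = pd.MultiIndex.from_arrays([['OR', 'CA'], ['Portland', 'Fresno']],
                                names=['state', 'city'])

# a regular structure: give the key elements and let Pandas interweave them
cols = pd.MultiIndex.from_product([[2010, 2020], ['population', 'area']],
                                  names=['year', None])

# the same constructors work for columns as well
df = pd.DataFrame(np.zeros((2, 4)), index=idx, columns=cols)
print(df)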

5. Indexing with Multiple Indexes

The benefit of accessing a DataFrame via a MultiIndex is that you can easily refer to all levels at once (possibly omitting the inner levels) with familiar syntax.

Columns - via plain square brackets

[image]
Rows and Cells - Using .loc[]
[image]
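A small sketch with illustrative labels, selecting columns with plain brackets and rows/cells with .loc[]:

import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_tuples([('OR', 'Portland'), ('OR', 'Salem'),
                                  ('CA', 'Fresno')], names=['state', 'city'])
cols = pd.MultiIndex.from_product([[2010, 2020], ['population', 'area']],
                                  names=['year', None])
df = pd.DataFrame(np.arange(12).reshape(3, 4), index=rows, columns=cols)

print(df[2010])                    # every column under the outer label 2010
print(df[2010, 'population'])      # a single column -> Series
print(df.loc['OR'])                # every row under the outer label 'OR'
print(df.loc[('OR', 'Portland')])  # a single row -> Series
print(df.loc[('OR', 'Portland'), (2010, 'population')])   # a single cell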
Now, what if you want to select all the cities in Oregon, or keep only the column with the population? Python syntax imposes two limitations here:

  1. There's no way to tell df['a', 'b'] apart from df[('a', 'b')] - both are handled the same way, so you can't just write df[:, 'Oregon']. Otherwise, Pandas would never know whether you mean the column Oregon or the second-level row label Oregon.

  2. Python only allows colons inside square brackets, not inside parentheses, so you can't write df.loc[(:, 'Oregon'), :]

Technically, this is not difficult to arrange. I monkey-patched the DataFrame to add such functionality, which you can see here:
[image]
The only downside to this syntax is that it returns a copy when you use both indexers, so you can't write df.mi[:, 'Oregon'].co['population'] = 10. There are many alternative indexers, some of which allow such assignments, but they all have their own quirks:

  1. You can swap the inner and outer levels and use plain brackets.
    [image]
    Thus, df[:, 'population'] can be implemented as df.swaplevel(axis=1)['population'].

This feels hacky and is inconvenient for more than two levels.

  2. You can use the xs method: df.xs('population', level=1, axis=1).

It doesn't feel very pythonic, especially when multiple levels are selected. This method cannot filter rows and columns at the same time, so the reason behind the name xs (which stands for "cross-section") is not entirely clear. It cannot be used to set a value.

  3. You can create an alias for pd.IndexSlice: idx = pd.IndexSlice; df.loc[:, idx[:, 'population']]

This is more pythonic, but you have to use the alias to access elements, which is a bit cumbersome (the code without the alias gets too long). You can select rows and columns at the same time. Writable.

  4. You can learn to use slice instead of a colon. If you know that a[3:10:2] == a[slice(3, 10, 2)], then you may also understand the following code: df.loc[:, (slice(None), 'population')], but it is barely readable. You can select rows and columns at the same time. Writable.

The bottom line is that Pandas has multiple ways of accessing elements of a DataFrame with a MultiIndex using brackets, but none of them is convenient enough, so an alternative indexing syntax had to be adopted:

  5. A mini-language for the .query method: df.query('state == "Oregon" or city == "Portland"').

It is convenient and fast, but lacks IDE support (no autocompletion, no syntax highlighting, etc.), and it only filters rows, not columns. This means you can't implement df[:, 'population'] with it without transposing the DataFrame. Non-writable.
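Here is a sketch of options 1 to 5 in plain Pandas (the monkey-patched pdi accessor is not shown), rebuilding the same illustrative dataframe as above:

import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_tuples([('OR', 'Portland'), ('OR', 'Salem'),
                                  ('CA', 'Fresno')], names=['state', 'city'])
cols = pd.MultiIndex.from_product([[2010, 2020], ['population', 'area']],
                                  names=['year', None])
df = pd.DataFrame(np.arange(12).reshape(3, 4), index=rows, columns=cols)

# 1. swap the inner and outer column levels, then use plain brackets
print(df.swaplevel(axis=1)['population'])

# 2. cross-section
print(df.xs('population', level=1, axis=1))

# 3. an alias for pd.IndexSlice (rows and columns, writable)
idx = pd.IndexSlice
print(df.loc[:, idx[:, 'population']])

# 4. slice(None) instead of a colon (rows and columns, writable)
print(df.loc[:, (slice(None), 'population')])

# 5. the query mini-language (rows only, non-writable)
print(df.query('state == "OR" or city == "Fresno"'))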

6. Stacking and Unstacking

Pandas does not have set_index for columns. A common way of adding levels to the columns is to "unstack" existing levels from the index:
[image]
Pandas' stack is very different from NumPy's stack. Let's see what the documentation says about the naming convention:

"The function is named like a reorganized collection of books from horizontally positioned side by side (column of dataframe) to vertically stacked (in index of dataframe)."

The "on top" part doesn't sound convincing to me, but at least the explanation helps remember who moved things in which direction. By the way, Series has unstack, but not stack, because it's already "stacked". Since it is one-dimensional, Series can be used as a row vector or a column vector in different situations, but is generally considered a column vector (such as a dataframe column).

For example:
[image]
You can also specify the level to stack/unstack by name or by positional index. In this example, df.stack(), df.stack(1) and df.stack('year') produce the same result. The destination is always "after the last level" and is not configurable. If you need to place the level elsewhere, you can use df.swaplevel().sort_index() or pdi.swap_level(df, sort=True).
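A minimal sketch of stack and unstack (illustrative data; note that after unstacking, the level returns as the innermost column level):

import numpy as np
import pandas as pd

rows = pd.Index(['Portland', 'Salem'], name='city')
cols = pd.MultiIndex.from_product([[2010, 2020], ['population', 'area']],
                                  names=['year', None])
df = pd.DataFrame(np.arange(8).reshape(2, 4), index=rows, columns=cols)

tall = df.stack('year')      # the 'year' column level moves into the row index
print(tall)

wide = tall.unstack('year')  # ... and moves back out, as the innermost column level
print(wide)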

The columns must not contain duplicate values to be stackable (and the same goes for the index when unstacking):

[image]

7. How to prevent sorting during stack/unstack

Both stack and unstack have a bad habit of unpredictably sorting the resulting index lexicographically. This can be irritating at times, but it's the only way to give predictable results when there are a lot of missing values.

Consider the following example. In what order do you want the days of the week to appear in the table on the right?

[image]
You might reason that if John's Monday is to the left of John's Friday, then 'Mon' < 'Fri'; similarly, Silvia's 'Fri' < 'Sun', so the result should be 'Mon' < 'Fri' < 'Sun'. This is legitimate, but what if the remaining columns come in a different order, say 'Mon' < 'Fri' and 'Tue' < 'Fri'? Or 'Mon' < 'Fri' and 'Wed' < 'Sat'?

Well, there aren't that many days in a week, and Pandas can infer the order based on prior knowledge. However, mankind has not yet come to a decisive conclusion whether Sunday should be the end or the beginning of the week. Which order should Pandas use by default? Read locales? What about less trivial orders, like the order of states in the US?

In this case, all Pandas does is simply sort alphabetically, as shown below:
[image]
While this is a reasonable default, it still feels wrong. There should be a solution! And there is one: the CategoricalIndex. It remembers the order even if some labels are missing, and it has recently been smoothly integrated into the Pandas toolchain. The only thing it lacks is infrastructure: it is hard to construct and it is fragile (it falls back to object dtype in certain operations), yet it is perfectly usable, and the pdi library has some helpers to flatten the learning curve.

For example, to tell Pandas to lock the order of a simple index holding products (which will inevitably get sorted if you decide to unstack the days of the week back into a column), you need to write something as horrible as df.index = pd.CategoricalIndex(df.index, categories=df.index, ordered=True). And it is even more cumbersome for a MultiIndex.
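In plain Pandas, locking the order of a simple index looks roughly like this (a sketch; pdi's helpers below do the same, and more, for MultiIndex levels):

import pandas as pd

df = pd.DataFrame({'sales': [3, 1, 2]}, index=['banana', 'apple', 'cherry'])

# promote the index to an ordered CategoricalIndex that remembers the current order
df.index = pd.CategoricalIndex(df.index, categories=df.index, ordered=True)

print(df.sort_index())   # stays banana, apple, cherry -- not alphabetical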

The pdi library has a helper function locked (and an alias lock that defaults to inplace=True) that locks the ordering of a certain MultiIndex level by promoting it to a CategoricalIndex. A check mark next to a level name indicates that the level is locked:
[image]
The check marks can be shown manually with pdi.vis(df), or automatically by monkey-patching the DataFrame HTML output with pdi.vis_patch(). After applying the patch, simply writing df in a Jupyter cell will show check marks for all levels with locked ordering.

lock and locked work automatically in simple cases (such as client names), but need a hint from the user in trickier cases (such as days of the week with missing days).

[image]
After the level is switched to CategoricalIndex, it will keep the original order in sort_index, stack, unstack, pivot, pivot_table and other operations.

However, it is fragile. Even something as simple as df['new_col'] = 1 will break it. Use pdi.insert(df.columns, 0, 'new_col', 1) to correctly handle levels with CategoricalIndex.

8. Manipulating levels

In addition to the previously mentioned methods, there are some other methods:

  • pdi.get_level(obj, level_id) returns a specific level referenced by number or name, available for DataFrames,
    Series and MultiIndex
  • pdi.set_level(obj, level_id, labels) replaces the labels of the level with the given array (list, NumPy array, Series,
    Index, etc.)

[image]

  • pdi.insert_level(obj, pos, labels, name) adds a level with the given labels (broadcast if necessary)
  • pdi.drop_level(obj, level_id) removes the specified level from the MultiIndex

[image]
  • pdi.swap_levels(obj, src=-2, dst=-1) swaps two levels (by default the two innermost ones)

  • pdi.move_level(obj, src, dst) moves the level src to position dst

[image]
In addition to the above parameters, all functions in this section have the following parameters:

  • axis=None, where None means "columns" for a DataFrame and "index" for a Series

  • sort=False, optionally sorts the corresponding MultiIndex after the operation

  • inplace=False, optionally performs the operation in place (cannot be used on a bare Index, since it is immutable).

All the operations above understand the word "level" in the conventional sense (a level has as many labels as there are columns in the dataframe), hiding the machinery of index.levels and index.codes from the end user.

In rare cases, when moving and swapping individual levels is not enough, you can reorder all the levels at once with a pure Pandas call: df.columns = df.columns.reorder_levels(['M', 'L', 'K']), where ['M', 'L', 'K'] is the desired order of the levels.
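For instance (with hypothetical level names):

import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product([[2010], ['population'], ['est']],
                                  names=['K', 'L', 'M'])
df = pd.DataFrame(np.zeros((1, 1)), columns=cols)

df.columns = df.columns.reorder_levels(['M', 'L', 'K'])
print(df.columns.names)   # ['M', 'L', 'K']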

Usually it's enough to use get_level and set_level to fix the labels as necessary, but if you want to apply a transformation to all levels of a MultiIndex at once, Pandas has an (ambiguously named) function rename that accepts a dict or a function:

[image]
As for renaming the levels themselves, their names are stored in the .names attribute. The attribute does not support direct assignment (why not?): df.index.names[1] = 'x'   # TypeError, but it can be replaced as a whole:

[image]
When you only need to rename a specific level, the syntax is as follows:
[image]
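A plain-Pandas sketch of the renaming options (the labels are illustrative):

import pandas as pd

index = pd.MultiIndex.from_product([['OR', 'CA'], ['x', 'y']],
                                   names=['state', 'city'])
df = pd.DataFrame({'population': [1, 2, 3, 4]}, index=index)

# rename individual labels of one level with a dict or a function
df2 = df.rename(index={'OR': 'Oregon', 'CA': 'California'}, level='state')
df3 = df.rename(index=str.upper, level='city')

# replace all level names at once ...
df.index.names = ['st', 'ct']

# ... or rename a single level name
df = df.rename_axis(index={'st': 'state'})
print(df.index.names)   # ['state', 'ct']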

9. Convert multi-index to flat index and restore it

As we saw above, the convenient query method only solves the complexity of dealing with a MultiIndex in the rows. And despite all the helper functions, when certain Pandas functions return a MultiIndex in the columns, it has a shock effect on beginners. So the pdi library has the following:

join_levels(obj, sep='_', name=None) joins all MultiIndex levels into one index

split_level(obj, sep='_', names=None) splits the index back into a MultiIndex

[image]
They both have optional axis and inplace parameters.
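Without pdi, a rough plain-Pandas equivalent is to join and split the labels by hand (a sketch, assuming the separator does not occur in the labels themselves):

import pandas as pd

cols = pd.MultiIndex.from_product([['2010', '2020'], ['population', 'area']])
df = pd.DataFrame([[1, 2, 3, 4]], columns=cols)

# flatten the two column levels into one ...
df.columns = ['_'.join(pair) for pair in df.columns]
print(list(df.columns))   # ['2010_population', '2010_area', '2020_population', '2020_area']

# ... and split them back into a MultiIndex
df.columns = pd.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])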

10. Sort MultiIndex

Since a MultiIndex consists of several levels, sorting is a bit more involved than for a single index. It can still be done with the sort_index method, and can be fine-tuned further with the following parameters.
[image]
To sort by column levels, specify axis=1.
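For example (illustrative labels):

import pandas as pd

index = pd.MultiIndex.from_tuples([('B', 2), ('A', 1), ('B', 1), ('A', 2)],
                                  names=['letter', 'number'])
df = pd.DataFrame({'value': [1, 2, 3, 4]}, index=index)

print(df.sort_index())                                  # sort by all row levels
print(df.sort_index(level='number'))                    # sort by one level first
print(df.sort_index(level='letter', ascending=False))   # reverse the order
# for a column MultiIndex, pass axis=1 as well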

11. Read and write multi-index dataframe to disk

Pandas can write a DataFrame with a MultiIndex to a CSV file in a fully automated fashion: df.to_csv('df.csv'). But when reading such a file back, Pandas cannot resolve the MultiIndex automatically and needs some hints from the user. For example, to read a DataFrame with columns three levels high and an index four levels wide, you would specify pd.read_csv('df.csv', header=[0,1,2], index_col=[0,1,2,3]).

This means that the first three rows contain the column level information, and the first four fields of each subsequent row contain the index levels (if the columns have more than one level, you can no longer refer to row levels by name, only by number).
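A self-contained round-trip sketch with two column levels and two index levels (smaller than the example above, but the idea is the same):

import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([['OR', 'CA'], ['x', 'y']],
                                  names=['state', 'city'])
cols = pd.MultiIndex.from_product([[2010, 2020], ['population', 'area']],
                                  names=['year', 'metric'])
df = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

df.to_csv('df.csv')   # writing is fully automatic

# reading needs hints: two header rows (column levels), two index columns (row levels)
df2 = pd.read_csv('df.csv', header=[0, 1], index_col=[0, 1])

# note: the year labels come back as strings; see the type conversion section above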

Manually deciphering the number of layers in a multi-index is inconvenient, so a better idea is to stack() all column header layers before saving the DataFrame to CSV, and unstack() them after reading.

If you need a "forget it" solution, you might want to look into a binary format, such as Python's pickle format:

Direct call: df.to_pickle('df.pkl'), pd.read_pickle('df.pkl')

Use storemagic in Jupyter: %store df, then %store -r df (the data is kept in $HOME/.ipython/profile_default/db/autorestore).

Python's pickle is small and fast, but only accessible from Python. If you need to interoperate with other ecosystems, look at more standard formats such as Excel (which requires the same hints as read_csv when reading a MultiIndex). The code is shown below:

!pip install openpyxl
df.to_excel('df3.xlsx')
df3 = pd.read_excel('df3.xlsx', header=[0, 1, 2], index_col=[0, 1, 2, 3])

Or look at other options (see docs).

12. MultiIndex Arithmetic

When working with a MultiIndexed dataframe, the same rules apply as for ordinary dataframes (see above). But handling a subset of the cells has some peculiarities of its own.

You can update a whole block of columns through the outer MultiIndex level, as follows:
[image]
If you want to keep the original data unchanged, you can use df1 = df.assign(population=df.population*10).

You can also easily get the population density with density=df.population/df.area.

[image]
But unfortunately, you cannot assign the result to the original dataframe with df.assign.

One approach is to stack all unrelated levels of the column index into the row index, perform the necessary calculations, and then unstack them back (using pdi.lock to preserve the original order of the columns).
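A plain-Pandas sketch of that approach (without pdi.lock, so the final sort puts the attribute labels in alphabetical order, which is exactly what lock is meant to prevent):

import pandas as pd

rows = pd.Index(['Portland', 'Salem'], name='city')
cols = pd.MultiIndex.from_product([[2010, 2020], ['population', 'area']])
df = pd.DataFrame([[584.0, 145.0, 653.0, 145.0],
                   [154.0, 49.0, 176.0, 49.0]], index=rows, columns=cols)

tall = df.stack(0)                                # rows: (city, year); columns: attributes
tall['density'] = tall['population'] / tall['area']
wide = tall.unstack()                             # the year comes back as the innermost level
wide = wide.swaplevel(axis=1).sort_index(axis=1)  # put the year back on the outside
print(wide)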

[image]
Alternatively, you can also use pdi.assign:
[image]
pdi.assign is aware of locked ordering, so if you give it a dataframe with one or more locked levels, it won't unlock them, and subsequent stack/unstack/etc. operations will keep the original column and row order.

Source: blog.csdn.net/aa2528877987/article/details/131545528