Advanced pandas (data analysis)

Table of contents

Chapter 12 Advanced Pandas

12.1 Categorical data

12.1.1 Background and objectives

12.1.2 Categorical types in pandas

12.1.3 Computing with Categorical objects

12.1.4 Categorical methods

12.2 Advanced GroupBy use

12.2.1 Group transforms and "unwrapped" GroupBys

12.2.2 Grouped time resampling

12.3 Techniques for method chaining

12.3.1 The pipe method

References


Chapter 12 Advanced Pandas

12.1 Categorical data

This section introduces the Categorical type of pandas.

12.1.1 Background and objectives

A column often contains repeated instances of a smaller set of distinct values. We have already seen functions like unique and value_counts, which extract the distinct values from an array and compute their frequencies, respectively:
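A minimal sketch of this, with made-up example values:

import pandas as pd

# A small Series with repeated values, for illustration
values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

print(values.unique())        # array(['apple', 'orange'], dtype=object)
print(values.value_counts())  # apple: 6, orange: 2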

Many data systems (for data warehousing, statistical computing, or other uses) have developed specialized approaches for representing data with repeated values, enabling more efficient storage and computation. In data warehousing, a best practice is to use so-called dimension tables containing the distinct values, and to store the primary observations as integer keys referencing the dimension table:

You can use the take method to restore the original Series of strings:
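A sketch of the idea, with made-up values:

import pandas as pd

# Integer keys referencing a small dimension table of distinct values
values = pd.Series([0, 1, 0, 0] * 2)
dim = pd.Series(['apple', 'orange'])

# take restores the original Series of strings from the integer keys
print(dim.take(values))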

This representation in terms of integers is called a categorical or dictionary-encoded representation.

The array of distinct values can be referred to as the categories, dictionary, or levels of the data.

12.1.2 Categorical types in pandas

pandas has a special Categorical type for holding data that uses the integer-based categorical representation or encoding.

Let's consider the previous example Series:

df['fruit'] is an array of Python string objects. We can convert it to categorical by calling astype('category'):

The underlying values of fruit_cat are not a NumPy array but an instance of pandas.Categorical:

Categorical objects have categories and codes properties:
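A minimal sketch of these steps, assuming a small made-up DataFrame with a 'fruit' column:

import numpy as np
import pandas as pd

fruits = ['apple', 'orange', 'apple', 'apple'] * 2
df = pd.DataFrame({'fruit': fruits,
                   'basket_id': np.arange(len(fruits))})

# Convert the string column to a categorical-dtype Series
fruit_cat = df['fruit'].astype('category')

# The underlying data is a pandas.Categorical, not a NumPy array
c = fruit_cat.array
print(type(c))         # <class 'pandas.core.arrays.categorical.Categorical'>
print(c.categories)    # Index(['apple', 'orange'], dtype='object')
print(c.codes)         # [0 1 0 0 0 1 0 0] -- integer codes into categories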

A column of a DataFrame can be converted to a Categorical object by assigning the transformed result:

It is also possible to generate pandas.Categorical directly from other Python sequence types:

If you already have categorical codes from another data source, you can use the from_codes constructor:

A categorical conversion does not guarantee any particular order for the categories unless one is explicitly specified, so the categories array may not be in the same order as the input data. When using from_codes or any of the other constructors, you can specify a meaningful order for the categories:

The output [foo<bar<baz] indicates that 'foo' comes before 'bar', and so on. An unordered categorical instance can be ordered using as_ordered:
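For example, a sketch using made-up codes:

import pandas as pd

categories = ['foo', 'bar', 'baz']
codes = [0, 1, 2, 0, 0, 1]

# Codes from another source, with an explicit, meaningful category order
ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)
print(ordered_cat)   # Categories (3, object): ['foo' < 'bar' < 'baz']

# An unordered categorical instance can be given an order after the fact
unordered = pd.Categorical.from_codes(codes, categories)
print(unordered.as_ordered())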

Finally, note that categorical data need not be strings. A categorical array can contain any immutable value type.

12.1.3 Computing with Categorical objects

Using Categorical in pandas generally behaves the same as the non-encoded version (for example, an array of strings). Certain parts of pandas, such as the groupby function, perform better when working with Categorical objects. There are also functions that can take advantage of the ordered flag.

Let's consider some random numeric data and use the pandas.qcut binning function.

Compute the quartile bins of the above data and extract some statistics:

While the exact sample quartiles are useful, quartile names are often more useful than the quartiles themselves when generating a report. This can be achieved with the labels argument to qcut:

The labeled bins categorical does not contain information about the bin edges in the data, so we can use groupby to extract some summary statistics:
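A sketch of the whole sequence, assuming randomly generated data and the made-up labels Q1 through Q4:

import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)
draws = rng.standard_normal(1000)

# Label the quartile bins Q1-Q4 instead of keeping the interval edges
bins = pd.Series(pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4']),
                 name='quartile')

# groupby recovers summary statistics for each labeled bin
results = (pd.Series(draws)
           .groupby(bins, observed=False)
           .agg(['count', 'min', 'max'])
           .reset_index())
print(results)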

The 'quartile' column in the result preserves the original categorical information in the bins, including order:

12.1.3.1 Better performance with categoricals

If you're doing a lot of analysis on a particular dataset, converting the data to categorical can yield substantial performance gains. The categorical version of a DataFrame column also usually uses significantly less memory.

Consider a Series with 10 million elements and a small number of distinct categories:

Convert labels to Categorical objects:

Note that labels use significantly more memory than categories:

The conversion to categorical is not free, of course, but it is a one-time cost:
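A sketch of the comparison, with made-up labels:

import pandas as pd

N = 10_000_000
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))

# One-time conversion cost, then large memory savings afterwards
categories = labels.astype('category')

print(labels.memory_usage(deep=True))      # hundreds of MB for object strings
print(categories.memory_usage(deep=True))  # roughly an order of magnitude smaller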

GroupBy operations using categorical objects are significantly faster because the underlying algorithm uses arrays based on integer codes rather than string arrays.

12.1.4 Categorical methods

A Series containing categorical data has several special methods, similar to the Series.str specialized string methods.

These methods provide quick access to categories and codes. Consider the following Series:

The special accessor attribute cat provides access to the categorical methods:

Suppose you know that the actual set of categories for this data exceeds the four values observed in the data. The categories can be changed using the set_categories method:

Although it looks like the data hasn't changed, the new categories will be reflected in operations that use them. For example, value_counts will follow new categories (if present):
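A small sketch of the cat accessor and set_categories, with made-up data:

import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd'] * 2).astype('category')

# The cat accessor exposes the integer codes and the categories
print(s.cat.codes.tolist())     # [0, 1, 2, 3, 0, 1, 2, 3]
print(list(s.cat.categories))   # ['a', 'b', 'c', 'd']

# Suppose the full category set also includes an unobserved 'e'
s2 = s.cat.set_categories(['a', 'b', 'c', 'd', 'e'])
print(s2.value_counts())        # 'e' appears with a count of 0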

In large datasets, categorical data is often used as a convenient tool for saving memory and achieving better performance.

After filtering a large DataFrame or Series, many categories will not appear in the data.

To help with this, the remove_unused_categories method can be used to remove unobserved categories:
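For example, a sketch continuing with the same made-up Series:

import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd'] * 2).astype('category')

# After filtering, 'c' and 'd' no longer appear but remain as categories
s3 = s[s.isin(['a', 'b'])]
print(list(s3.cat.categories))   # ['a', 'b', 'c', 'd']

# Trim the categories that are no longer observed in the data
print(list(s3.cat.remove_unused_categories().cat.categories))   # ['a', 'b']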

Categorical methods available through the cat accessor of a Series include add_categories, as_ordered, as_unordered, remove_categories, remove_unused_categories, rename_categories, reorder_categories, and set_categories.

12.1.4.1 Creating dummy variables for modeling (one-hot encoding)

When you work with statistics or machine learning tools, it is common to convert categorical data into dummy variables, also known as one-hot encoding. This produces a DataFrame with a column for each distinct category. These columns contain 1 for occurrences of a given category and 0 otherwise.

Example:

The pandas.get_dummies function converts one-dimensional categorical data into a DataFrame containing dummy variables:
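A minimal sketch, with a made-up categorical Series (dtype=int is used so the indicators print as 0/1):

import pandas as pd

cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

# Each distinct category becomes a column of indicator (0/1) values
print(pd.get_dummies(cat_s, dtype=int))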

12.2 Advanced GroupBy use

12.2.1 Group transforms and "unwrapped" GroupBys

In Chapter 10, we learned that the apply method is used to perform transformation operations in grouping operations.

There is another built-in method transform that is similar to the apply method but places more restrictions on the kinds of functions you can use:

        transform can produce a scalar value to be broadcast to the shape of the group

        transform can produce an object of the same shape as the input group

        transform must not mutate its input

Example:

Mean grouped by 'key':

Suppose instead you want to produce a Series of the same size as df['value'], but with the values replaced by the mean grouped by 'key'. An anonymous function lambda x: x.mean() can be passed to transform:
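A sketch of this, assuming a made-up DataFrame with 'key' and 'value' columns:

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4,
                   'value': np.arange(12.)})
g = df.groupby('key')['value']

print(g.mean())                           # the group means by 'key'

# A Series the same size as df['value'], holding each group's mean
print(g.transform(lambda x: x.mean()))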

For built-in aggregate functions, a string alias can be passed like GroupBy's agg method:

Like apply, transform can be used with functions returning a Series, but the result must have the same size as the input. For example, you can use a lambda function to multiply each group by 2:

As a more complex example, ranks could be computed in descending order for each group:
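Sketches of these variants, using the same made-up 'key'/'value' DataFrame as above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4, 'value': np.arange(12.)})
g = df.groupby('key')['value']

print(g.transform('mean'))                              # string alias for a built-in aggregation
print(g.transform(lambda x: x * 2))                     # same-size output: every value doubled
print(g.transform(lambda x: x.rank(ascending=False)))   # descending rank within each group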

Consider a grouping transformation function consisting of simple aggregations:

Equivalent results can be obtained using transform or apply:
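A sketch with the same made-up data; group_keys=False keeps apply's result indexed like the input so the two outputs match:

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4, 'value': np.arange(12.)})
g = df.groupby('key', group_keys=False)['value']

def normalize(x):
    # a group transformation composed of simple aggregations
    return (x - x.mean()) / x.std()

# transform and apply produce equivalent results here
print(g.transform(normalize))
print(g.apply(normalize))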

Built-in aggregation functions like 'mean' or 'sum' are usually faster than the apply function.

These functions also have a "fast path" when used with transform. This allows us to perform a so-called unwrapped group operation:
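For example, a sketch with the same made-up data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4, 'value': np.arange(12.)})
g = df.groupby('key')['value']

# "Unwrapped" group operation: vectorized arithmetic on broadcast aggregates
normalized = (df['value'] - g.transform('mean')) / g.transform('std')
print(normalized)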

While an unwrapped group operation may involve multiple group aggregations, the overall advantage of the vectorized operations often outweighs this.

12.2.2 Grouped time resampling

For time series data, the resample method is semantically a grouping operation based on time segments. Here is a small example table:

It is possible to index by 'time' and then resample:
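A sketch of such a table and the resampling step, with made-up timestamps and values:

import numpy as np
import pandas as pd

N = 15
times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N)
df = pd.DataFrame({'time': times, 'value': np.arange(N)})

# Index by 'time', then resample into 5-minute buckets and count rows
print(df.set_index('time').resample('5min').count())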

Assuming a DataFrame containing multiple time series labeled by an additional grouping key column:

To do the same resampling for each value of 'key', we can use a pandas.Grouper object (called pandas.TimeGrouper in older pandas versions):

time_key = pd.Grouper(freq='5min')  # pd.TimeGrouper('5min') in older pandas versions

You can set the time index, group by 'key' and time_key, and then aggregate:
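A sketch of the grouped resampling, with a made-up DataFrame containing a 'key' column alongside 'time' and 'value':

import numpy as np
import pandas as pd

N = 15
times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N)
df2 = pd.DataFrame({'time': times.repeat(3),
                    'key': np.tile(['a', 'b', 'c'], N),
                    'value': np.arange(N * 3.)})

time_key = pd.Grouper(freq='5min')

# Set the time index, group by 'key' and the time grouper, then aggregate
resampled = (df2.set_index('time')
             .groupby(['key', time_key])
             .sum())
print(resampled.reset_index())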

One limitation of using Grouper this way is that the time must be the index of the Series or DataFrame.

12.3 Techniques for method chaining

When applying a series of transformations to a dataset, you may find yourself creating many temporary variables that are never used in the analysis. For example, consider the following example:
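A sketch of such a pipeline; load_data is a hypothetical stand-in for reading a real dataset, and the column names 'key', 'col1', and 'col2' are placeholders:

import numpy as np
import pandas as pd

def load_data():
    # Hypothetical stand-in for loading a real dataset
    return pd.DataFrame({'key': list('abab'),
                         'col1': np.arange(4.),
                         'col2': [-1.0, 2.0, -3.0, 4.0]})

# Each step creates a temporary variable that is never used again
df = load_data()
df2 = df[df['col2'] < 0]
df2['col1_demeaned'] = df2['col1'] - df2['col1'].mean()
result = df2.groupby('key').col1_demeaned.std()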

Although we didn't use real data here, this example demonstrates some new approaches. First, the DataFrame.assign method is a functional alternative to column assignments of the form df[k] = v. Rather than modifying the object in place, it returns a new DataFrame with the indicated modifications. Therefore, the following expressions are equivalent:
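A minimal sketch of the equivalence, with made-up data and a placeholder column name 'k':

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
v = [10, 20, 30]

# In-place column assignment on a copy of df
df2 = df.copy()
df2['k'] = v

# assign returns a new, modified DataFrame and leaves df untouched
df3 = df.assign(k=v)
print(df2.equals(df3))  # True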

In-place assignment may be faster than using assign, but assign allows for more convenient method chaining:
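For example, a sketch with a made-up df2 and placeholder column names:

import numpy as np
import pandas as pd

df2 = pd.DataFrame({'key': list('abab'),
                    'col1': np.arange(4.)})

# assign slots naturally into a method chain
result = (df2.assign(col1_demeaned=df2['col1'] - df2['col1'].mean())
              .groupby('key')
              .col1_demeaned.std())
print(result)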

When doing method chaining, keep in mind that you may need to refer to temporary objects. In the previous example, we could not refer to the result of load_data until it had been assigned to the temporary variable df. To handle this situation, assign and many other pandas functions accept function-like arguments, also known as callables.

To illustrate the callable object in action, consider the following snippet from the previous section:

The above code can be rewritten as:
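A sketch of both versions; load_data and the column names are the same hypothetical placeholders as before:

import pandas as pd

def load_data():
    # Hypothetical stand-in for the data-loading step in the text
    return pd.DataFrame({'key': list('abab'),
                         'col1': [1.0, 2.0, 3.0, 4.0],
                         'col2': [-1.0, 2.0, -3.0, 4.0]})

# Before: a temporary variable is needed to refer to load_data()'s result
df = load_data()
df2 = df[df['col2'] < 0]

# After: a callable passed to [] is applied to the object at that point in the chain
df2 = (load_data()
       [lambda x: x['col2'] < 0])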

Here, the result of load_data is not assigned to a variable, so the function passed into [] is bound to the object at that stage of the method chain.

Afterwards, we can proceed to write the entire sequence as a single-chain expression:
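A sketch of the full chain, again using the hypothetical load_data and placeholder column names:

import pandas as pd

def load_data():
    # Hypothetical stand-in for the data-loading step in the text
    return pd.DataFrame({'key': list('abab'),
                         'col1': [1.0, 2.0, 3.0, 4.0],
                         'col2': [-1.0, 2.0, -3.0, 4.0]})

# The whole pipeline as one chained expression, with no temporaries
result = (load_data()
          [lambda x: x['col2'] < 0]
          .assign(col1_demeaned=lambda x: x['col1'] - x['col1'].mean())
          .groupby('key')
          .col1_demeaned.std())
print(result)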

12.3.1 The pipe method

You can accomplish a lot with the built-in pandas functions and the approach to method chaining with callables that we just saw. However, sometimes you need to use your own functions or functions from third-party libraries. This is where the pipe method comes in.

Consider the following sequence of function calls:

When using functions that accept and return Series or DataFrame objects, you can rewrite the code to call the pipe method:
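A sketch of the equivalence; f, g, h and the arguments v1 through v4 are toy stand-ins chosen here for illustration:

import pandas as pd

# Toy stand-ins for functions that accept and return a DataFrame
def f(df, arg1):     return df + arg1
def g(df, v2, arg3): return df * v2 + arg3
def h(df, arg4):     return df - arg4

df = pd.DataFrame({'x': [1.0, 2.0, 3.0]})
v1, v2, v3, v4 = 1, 2, 3, 4

# A sequence of function calls with intermediate variables...
a = f(df, arg1=v1)
b = g(a, v2, arg3=v3)
c = h(b, arg4=v4)

# ...can be rewritten as an equivalent pipe chain
result = (df.pipe(f, arg1=v1)
            .pipe(g, v2, arg3=v3)
            .pipe(h, arg4=v4))
print(c.equals(result))  # True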

The expressions f(df) and df.pipe(f) are equivalent, but pipe makes chaining more convenient.

Generalizing sequences of operations into reusable functions is one potential use of the pipe method. As an example, let's consider subtracting the group mean from a column:
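A sketch of the basic operation, with a made-up DataFrame and placeholder column names 'key1', 'key2', and 'col1':

import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': list('aabb'),
                   'key2': list('xyxy'),
                   'col1': np.arange(4.)})

# Subtract the group mean from col1 using a grouped transform
g = df.groupby(['key1', 'key2'])
df['col1'] = df['col1'] - g['col1'].transform('mean')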

Suppose you want to demean multiple columns and be able to easily change the grouping keys. You may also want to perform the transformation as part of a method chain. Here is an example implementation:
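A sketch of such a helper; the name group_demean is chosen here for illustration:

import pandas as pd

def group_demean(df, by, cols):
    # Subtract the group-wise mean from each column in cols, grouping by `by`
    result = df.copy()
    g = df.groupby(by)
    for c in cols:
        result[c] = df[c] - g[c].transform('mean')
    return result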

Then you can write the following code:
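Building on the group_demean sketch above, usage in a chain might look like this (data and column names are again placeholders):

import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': list('aabb'),
                   'key2': list('xyxy'),
                   'col1': np.arange(4.),
                   'col2': [-1.0, 2.0, -3.0, 4.0]})

# Filter, then demean col1 within (key1, key2) groups, all in one chain
result = (df[df['col2'] < 0]
          .pipe(group_demean, ['key1', 'key2'], ['col1']))
print(result)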

References

        -- "Using Python to Realize Data Analysis"
