New improvements and features in Pandas 2.1

Hello everyone, Pandas 2.1 was released on August 30, 2023. This article walks through what is new in this version and how it can help you improve your Pandas workloads. It contains a series of improvements and a new set of deprecations.

Pandas 2.1 brings a number of improvements to the PyArrow integration introduced in Pandas 2.0. This article focuses on new features that are expected to become defaults in Pandas 3.0, with the most important improvements detailed below.

Avoid using NumPy object types in string columns

A major problem in Pandas has long been its inefficient string representation, and the Pandas team spent quite some time working on this issue. The first PyArrow-backed string dtype became available in pandas 1.3; it can reduce memory usage by about 70% and improve performance.

The Pandas team has now introduced a new configuration option that stores all string columns in PyArrow arrays. You no longer need to worry about converting string columns; it happens automatically.

This option can be turned on via:

pd.options.future.infer_string = True

This behavior will become the default in pandas 3.0, which means that string columns will then always be backed by PyArrow. PyArrow must be installed to use this option.

PyArrow dtypes behave differently from NumPy object dtype in some details, which can be hard to get right. For this option, pandas implements a string dtype that is compatible with NumPy semantics, so it behaves just like a NumPy object column would.
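
As a quick illustration (a minimal sketch, not from the original article, assuming PyArrow is installed), enabling the option makes newly created string columns use a PyArrow-backed string dtype instead of NumPy object dtype; the exact dtype name shown may vary between versions:

import pandas as pd

# Enable the new behavior; only affects newly created data.
pd.options.future.infer_string = True

ser = pd.Series(["apple", "banana", "cherry"])
print(ser.dtype)  # a PyArrow-backed string dtype rather than "object"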

Improved PyArrow support

One of the main goals of the developers over the past few months has been to improve the integration of PyArrow-backed DataFrames, which were introduced in pandas 2.0. Their goal was to make switching from NumPy-backed DataFrames as easy as possible, with a focus on fixing performance bottlenecks that caused unexpected slowdowns.

Let's look at an example:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "foo": np.random.randint(1, 10, (1_000_000, )),
        "bar": np.random.randint(1, 100, (1_000_000,)),
    }, dtype="int64[pyarrow]"
)
grouped = df.groupby("foo")

The DataFrame in this example has 1 million rows and nine distinct groups. Now let's compare the performance of this groupby sum in pandas 2.0.3 and pandas 2.1:

# pandas 2.0.3
10.6 ms ± 72.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# pandas 2.1.0
1.91 ms ± 3.16 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

This particular example is about 5x faster on the new version, and merge, another commonly used function, is now faster as well.
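
As a hedged sketch (the DataFrames and column names here are illustrative, not from the original article), the same PyArrow-backed dtypes can be used in a merge:

import pandas as pd
import numpy as np

# Join two PyArrow-backed DataFrames on a shared key column.
left = pd.DataFrame(
    {
        "key": np.random.randint(1, 1_000, (1_000_000,)),
        "x": np.random.randint(1, 100, (1_000_000,)),
    }, dtype="int64[pyarrow]"
)
right = pd.DataFrame(
    {"key": np.arange(1, 1_000), "y": np.arange(1, 1_000)},
    dtype="int64[pyarrow]"
)
merged = left.merge(right, on="key")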

Copy-on-Write

Copy-on-Write was originally introduced in pandas 1.5.0 and is expected to become the default behavior in pandas 3.0; it already provides a good experience in pandas 2.0.x. The Pandas team, focused primarily on fixing known bugs and making it run faster, recommends using this mode in production environments, and has already seen Copy-on-Write improve real-world workflow performance by more than 50%.
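
A minimal sketch of how to turn the mode on today (the data here is made up for illustration): with Copy-on-Write enabled, any object derived from another behaves like a copy, so modifying it no longer mutates the parent:

import pandas as pd

# Enable Copy-on-Write globally (pandas 2.x).
pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
subset = df["a"]        # no eager defensive copy is made
subset.iloc[0] = 100    # the data is copied lazily at this point
print(df.loc[0, "a"])   # the parent DataFrame is unchanged: still 1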

Deprecating silent type conversions in setitem-like operations

Previously, setting an incompatible value into a pandas column silently changed the column's data type. Let's look at an example:

ser = pd.Series([1, 2, 3])

0    1
1    2
2    3
dtype: int64

This example has a Series containing integers, so the result has integer dtype. Now set the string "a" into the second row:

ser.iloc[1] = "a"

0    1
1    a
2    3
dtype: object

This changes the Series' dtype to object, the only dtype that can hold both integers and strings, which is a big problem for many users: object columns take up a lot of memory, cause calculations to fail, degrade performance, and create many other problems. They also required a lot of special-case handling internally. In the past, silent dtype changes in a DataFrame caused a lot of trouble. This behavior is now deprecated and raises a FutureWarning:

FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future 
error of pandas. Value 'a' has dtype incompatible with int64, please explicitly cast to a 
compatible dtype first.
  ser.iloc[1] = "a"

Operations like the one in this example will raise an error in pandas 3.0, and a DataFrame's dtypes will remain consistent across operations. When you want to change the dtype, you have to do so explicitly, which adds a bit of code but makes the intent easier to understand for subsequent developers. This change affects all data types; for example, setting a floating point value into an integer column will also raise.
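
A minimal sketch of the explicit cast (not from the original article): convert the Series to a compatible dtype first, then assign:

ser = pd.Series([1, 2, 3])
ser = ser.astype("object")   # explicit, intentional dtype change
ser.iloc[1] = "a"            # no warning: the dtype can already hold strings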

Upgrade to new version

The new pandas version can be installed using the following command:

pip install -U pandas

or:

mamba install -c conda-forge pandas=2.1

This will install the new version in the user's environment.
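
To confirm the upgrade worked, a small optional check:

import pandas as pd

print(pd.__version__)  # should print 2.1.x after the upgrade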
