Hello everyone, pandas 2.1 was released on August 30, 2023. This article walks through what is new in this version and how it can help users improve their pandas workloads. The release contains a series of improvements and a new set of deprecations.
Pandas 2.1 brings a number of improvements to the PyArrow integration introduced in pandas 2.0. It focuses on features that are expected to become defaults in pandas 3.0; the most important improvements are detailed below.
Avoid using NumPy object dtype for string columns
A major problem in pandas has been its inefficient string representation, and the pandas team spent quite some time working on this issue. The first PyArrow-based string dtype became available in pandas 1.3; it can reduce memory usage by about 70% and improve performance.
The pandas team decided to introduce a new configuration option that stores all string columns in PyArrow arrays. You no longer need to worry about converting string columns; it happens automatically.
This option can be turned on via:
pd.options.future.infer_string = True
This behavior will become the default in pandas 3.0, which means that string columns will always be backed by PyArrow; PyArrow must be installed to use this option.
PyArrow dtypes behave differently from NumPy object dtype, and the differences can be hard to understand in detail. For this option, the pandas team implemented a string dtype that is compatible with NumPy's semantics, so it behaves exactly like a NumPy object column.
Improved PyArrow support
One of the team's main goals over the past few months has been to improve the integration of PyArrow-backed DataFrames, which were introduced in pandas 2.0. Their aim was to make switching from NumPy-backed DataFrames as easy as possible, with a focus on fixing the performance bottlenecks that caused unexpected slowdowns.
Consider an example:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        "foo": np.random.randint(1, 10, (1_000_000,)),
        "bar": np.random.randint(1, 100, (1_000_000,)),
    },
    dtype="int64[pyarrow]",
)
grouped = df.groupby("foo")
This DataFrame has 1 million rows and 9 distinct groups (np.random.randint(1, 10, ...) draws integers from 1 through 9). Now compare the performance of a grouped sum on pandas 2.0.3 and pandas 2.1:
# pandas 2.0.3
10.6 ms ± 72.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# pandas 2.1.0
1.91 ms ± 3.16 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
This particular example runs about 5x faster on the new version. merge, another commonly used function, is now faster as well.
Copy-on-Write
Copy-on-Write was originally introduced in pandas 1.5.0 and is expected to become the default behavior in pandas 3.0; it already provides a good experience in pandas 2.0.x. The pandas team, primarily focused on fixing known bugs and making it run faster, recommends using this mode in production environments and has already seen Copy-on-Write improve real-world workflow performance by more than 50%.
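Copy-on-Write is not yet the default in pandas 2.1, but it can be enabled explicitly. A minimal sketch (assuming pandas 2.x, where this opt-in option is available):

```python
import pandas as pd

# Opt in to Copy-on-Write (the behavior planned as the pandas 3.0 default).
pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

subset = df["a"]
subset.iloc[0] = 100  # under CoW this modifies only `subset`

print(df["a"].tolist())  # the parent DataFrame is unchanged
print(subset.tolist())
```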
Deprecating silent type conversion in setitem-like operations
Previously, setting an incompatible value into a pandas column silently changed the column's data type. Consider an example:
ser = pd.Series([1, 2, 3])
0 1
1 2
2 3
dtype: int64
This Series contains integers, so its dtype is int64. Now set the string "a" into the second row:
ser.iloc[1] = "a"
0 1
1 a
2 3
dtype: object
This changes the Series' dtype to object, the only dtype that can accommodate both integers and strings. This was a big problem for many users: object columns take up a lot of memory, cause calculations to fail, degrade performance, and lead to many other issues. Pandas also had to add a lot of special-case handling internally to deal with them, and in the past these silent dtype changes in DataFrames caused a great deal of trouble. The behavior is now deprecated and raises a FutureWarning:
FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future
error of pandas. Value 'a' has dtype incompatible with int64, please explicitly cast to a
compatible dtype first.
ser.iloc[1] = "a"
Operations like this one will raise an error in pandas 3.0, so a DataFrame's dtypes will stay consistent across operations. When you want to change a dtype, you have to cast explicitly; this adds a little code but is easier for subsequent developers to understand. The change affects all data types: for example, setting a floating-point value into an integer column will also raise.
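Under the new rules, the explicit route is to cast first and then assign; a minimal sketch:

```python
import pandas as pd

ser = pd.Series([1, 2, 3])

# Cast explicitly before assigning a value of an incompatible dtype;
# this is the pattern pandas 3.0 will require.
ser = ser.astype(object)
ser.iloc[1] = "a"

print(ser.tolist())  # [1, 'a', 3]
```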
Upgrading to the new version
The new pandas version can be installed using the following command:
pip install -U pandas
or:
mamba install -c conda-forge pandas=2.1
This will install the new version in the user's environment.