Exclusive | Pandas 2.0: A Game Changer for Data Scientists (with link)


Author: Miriam Santos
Translation: Chen Chao
Proofreading: Zrx


This article is about 4,800 words and takes roughly 12 minutes to read.
It introduces the main advantages of the new pandas 2.0 release, along with the code to put them to use.

Top five features for efficient data manipulation.


Image via Yancy Min on Unsplash

In April, the official release of pandas 2.0.0 caused an uproar in the data science community.

Given its wide functionality and versatility, data manipulation is almost unthinkable without import pandas as pd, right?

Now, hear me out: with all the hype around large language models over the past few months, I somehow missed the fact that pandas had just gone through a major release! Yes, pandas 2.0 arrived with guns blazing (What's new in 2.0.0 (April 3, 2023) — pandas documentation (pydata.org))!

While I wasn't aware of all the hype, the Data-Centric AI Community promptly brought me up to speed:


Screenshot by author.

The 2.0 release seems to have had a considerable impact on the data science community, with many users praising the improvements in the new version.

Fun fact: did you realize that this release was an amazing 3 years in the making? That's what I call a "commitment to the community"!

So what does pandas 2.0 bring? Let's take a deep dive right away!

1. Performance, speed and memory efficiency

As we know, pandas was built on top of numpy, which was never intentionally designed as a backend for dataframe libraries. For this reason, one of pandas' main limitations has been the in-memory handling of larger datasets.

The big change in this release comes from the introduction of the Apache Arrow backend for pandas data.

Essentially, Arrow is a standardized in-memory columnar data format with libraries available for several programming languages (C, C++, R, Python, etc.). For Python there is PyArrow, which is based on the C++ implementation of Arrow, and is therefore fast!

So, to make a long story short, PyArrow addresses the memory limitations of the earlier 1.X versions and allows us to perform faster, more memory-efficient data operations, especially for larger datasets.

Here is a comparison between reading the data without and with the pyarrow backend, using the Hacker News dataset (about 650 MB, license CC BY-NC-SA 4.0):

%timeit df = pd.read_csv("data/hn.csv")
# 12 s ± 304 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df_arrow = pd.read_csv("data/hn.csv", engine='pyarrow', dtype_backend='pyarrow')
# 329 ms ± 65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Comparing read_csv(): more than 35 times faster with the pyarrow backend. Snippet by author.

As you can see, using the new backend makes reading the data about 35 times faster. Other points worth noting:

  • Without the pyarrow backend, each column/feature is stored with its own unique data type: numeric features are stored as int64 or float64, while string values are stored as objects;

  • With pyarrow, all features use Arrow dtypes: note the [pyarrow] annotation and the different types of data: int64, string, timestamp and double:

df = pd.read_csv("data/hn.csv")
df.info()

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3885799 entries, 0 to 3885798
# Data columns (total 8 columns):
# #   Column              Dtype 
# ---  ------              ----- 
# 0   Object ID           int64 
# 1   Title               object
# 2   Post Type           object
# 3   Author              object
# 4   Created At          object
# 5   URL                 object
# 6   Points              int64 
# 7   Number of Comments  float64
# dtypes: float64(1), int64(2), object(5)
# memory usage: 237.2+ MB

df_arrow = pd.read_csv("data/hn.csv", dtype_backend='pyarrow', engine='pyarrow')
df_arrow.info()

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3885799 entries, 0 to 3885798
# Data columns (total 8 columns):
# #   Column              Dtype               
# ---  ------              -----               
# 0   Object ID           int64[pyarrow]      
# 1   Title               string[pyarrow]     
# 2   Post Type           string[pyarrow]     
# 3   Author              string[pyarrow]     
# 4   Created At          timestamp[s][pyarrow]
# 5   URL                 string[pyarrow]     
# 6   Points              int64[pyarrow]      
# 7   Number of Comments  double[pyarrow]     
# dtypes: double[pyarrow](1), int64[pyarrow](2), string[pyarrow](4), timestamp[s][pyarrow](1)
# memory usage: 660.2 MB

df.info(): inspecting the dtypes of each dataframe. Snippet by author.
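If the data has already been loaded with numpy dtypes, there is no need to read it again from disk to try Arrow out: pandas 2.0 also lets convert_dtypes() target the pyarrow backend. Here is a minimal sketch, reusing the df loaded above (the exact dtype mapping will depend on your data):

# Convert an existing numpy-backed dataframe to Arrow-backed dtypes
df_converted = df.convert_dtypes(dtype_backend='pyarrow')

# Inspect the result: object columns become string[pyarrow], int64 becomes int64[pyarrow], and so on
df_converted.dtypes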

2. Arrow data types and numpy indices

Besides reading data (which is the simplest case), you can expect additional improvements for a range of other operations, especially those involving string manipulation, since pyarrow has a very efficient implementation of the string data type:

%timeit df["Author"].str.startswith('phy')
# 851 ms ± 7.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
          


%timeit df_arrow["Author"].str.startswith('phy')
# 27.9 ms ± 538 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Comparing string operations: demonstrating the efficiency of Arrow's implementation. Snippet by author.

In fact, Arrow has more (and better-supported) data types than numpy, which are needed outside the scientific (numeric) scope: dates and times, durations, binaries, decimals, lists, and maps. Browsing the equivalence between the data types supported by pyarrow and the numpy data types is actually a good exercise, so that you learn how to take advantage of them.
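To experiment with these dtypes directly, Arrow-backed columns can also be requested explicitly through their string aliases. A quick sketch (it assumes pyarrow is installed):

# Arrow-backed columns created explicitly via their dtype aliases
s_int = pd.Series([1, 2, None], dtype='int64[pyarrow]')       # the missing value stays <NA>, no cast to float
s_str = pd.Series(['foo', 'bar', None], dtype='string[pyarrow]')

s_str.str.upper()  # string methods run on Arrow's efficient string implementation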

It is now also possible to hold more numpy numeric types in the index. The traditional int64, uint64 and float64 are making room for all numpy numeric dtypes as Index values, so we can, for instance, specify their 32-bit versions:

import numpy as np

pd.Index([1, 2, 3])
# Index([1, 2, 3], dtype='int64')

pd.Index([1, 2, 3], dtype=np.int32)
# Index([1, 2, 3], dtype='int32')

Leveraging 32-bit numpy indices to make the code more memory-efficient. Snippet by author.
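As a quick, back-of-the-envelope check of the memory gain (an illustrative sketch, not part of the original benchmark), Index.memory_usage() can be used to compare the two versions:

idx64 = pd.Index(np.arange(1_000_000, dtype=np.int64))
idx32 = pd.Index(np.arange(1_000_000, dtype=np.int32))

# 8 bytes vs 4 bytes per entry: the 32-bit index takes roughly half the memory
idx64.memory_usage(), idx32.memory_usage()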

3. Easier handling of missing values

Being built on top of numpy made it hard for pandas to handle missing values in an easy, flexible way, since numpy does not support null values for some data types.

For example, integers are automatically converted to floats, which is not ideal:

df = pd.read_csv("data/hn.csv")

points = df["Points"]
points.isna().sum()
# 0

points[0:5]
# 0    61
# 1    16
# 2     7
# 3     5
# 4     7
# Name: Points, dtype: int64

# Setting first position to None
points.iloc[0] = None

points[0:5]
# 0     NaN
# 1    16.0
# 2     7.0
# 3     5.0
# 4     7.0
# Name: Points, dtype: float64

Missing values are converted to floats. Snippet by author.

Note how Points automatically changed from int64 to float64 after the introduction of a single None value.

There is nothing worse for a data flow than wrong typesets (i.e., wrong data types), especially within a data-centric AI paradigm.

Wrong typesets directly impact data preparation decisions, cause incompatibilities between different chunks of data, and, even when passing silently, they can compromise certain operations that then output nonsensical results.

For example, in the Data-Centric AI Community (DCAI Community (discord.com)) we are running a project around synthetic data for data privacy (GitHub - Data-Centric-AI-Community/nist-crc-2023: NIST Collaborative Research Cycle on Synthetic Data. Learn about Synthetic Data week by week!). One of the features, NOC (number of children), has missing values and is therefore automatically converted to float when the data is loaded. When the data is then passed into a generative model as floats, we may get decimal output values such as 2.5 - and unless you are a mathematician with 2 kids, a newborn, and a weird sense of humor, 2.5 children is not OK.

In pandas 2.0, we can take advantage of dtype_backend='numpy_nullable', where missing values are accounted for without any dtype change, so we keep the original data type (int64 in this case):

df_null = pd.read_csv("data/hn.csv", dtype_backend='numpy_nullable')

points_null = df_null["Points"]
points_null.isna().sum()
# 0

points_null[0:5]
# 0    61
# 1    16
# 2     7
# 3     5
# 4     7
# Name: Points, dtype: Int64

points_null.iloc[0] = None

points_null[0:5]
# 0    <NA>
# 1      16
# 2       7
# 3       5
# 4       7
# Name: Points, dtype: Int64

With "numpy_nullable", pandas 2.0 can handle missing values ​​without changing the original data type. Author snippet.

This may seem like a subtle change, but under the hood it means that pandas can now natively use Arrow's way of dealing with missing values. This makes operations much more efficient, since pandas doesn't have to implement its own version for handling null values for each data type.
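The same behavior applies when building data directly rather than reading it from a file. A small sketch using the nullable Int64 dtype:

# A nullable integer column: introducing a missing value does not force a cast to float
s = pd.Series([61, 16, 7], dtype='Int64')
s.iloc[0] = None

s.dtype  # still Int64; the missing entry is displayed as <NA>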

4. Copy-on-write optimization

Pandas 2.0 also adds a new lazy copy mechanism that delays copying DataFrame and Series objects until they are modified.

This means that when copy-on-write is enabled, certain methods will return views rather than copies, which improves memory efficiency by minimizing unnecessary duplication of data.

This also means that extra care needs to be taken when using chained assignment.

Chained assignment will not work when copy-on-write is enabled, because it points to a temporary object that is the result of an indexing operation (and which behaves like a copy under copy-on-write).

When copy_on_write is disabled, operations such as slicing may change the original df if the new dataframe is changed:

pd.options.mode.copy_on_write = False # disable copy-on-write (default in pandas 2.0)

df = pd.read_csv("data/hn.csv")
df.head()

# Throws a 'SettingWithCopy' warning
# SettingWithCopyWarning:
# A value is trying to be set on a copy of a slice from a DataFrame
df["Points"][0] = 2000

df.head() # <---- df changes

Copy-on-write disabled: the original dataframe is changed through chained assignment. Snippet by author.

When copy_on_write is enabled, a copy is created on assignment (python - What rules does Pandas use to generate a view vs a copy? - Stack Overflow), so the original dataframe is never altered. Pandas 2.0 raises a ChainedAssignmentError in these cases to avoid silent errors:

pd.options.mode.copy_on_write = True

df = pd.read_csv("data/hn.csv")
df.head()

# Throws a ChainedAssignmentError
df["Points"][0] = 2000

# ChainedAssignmentError: A value is trying to be set on a copy of a DataFrame
# or Series through chained assignment. When using the Copy-on-Write mode,
# such chained assignment never works to update the original DataFrame
# or Series, because the intermediate object on which we are setting
# values always behaves as a copy.
# Try using '.loc[row_indexer, col_indexer] = value' instead,
# to perform the assignment in a single step.

df.head() # <---- df does not change

Copy-on-write enabled: the original dataframe is not altered by chained assignment. Snippet by author.
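The error message itself points to the fix: perform the assignment in a single step with .loc, which updates the original dataframe even with copy-on-write enabled. A minimal sketch following the same example:

pd.options.mode.copy_on_write = True

df = pd.read_csv("data/hn.csv")

# Single-step assignment: no intermediate object is created, so it works under copy-on-write
df.loc[0, "Points"] = 2000

df.head() # <---- df reflects the new value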

5. Optional dependencies

When using pip, version 2.0 gives us the flexibility to install optional dependencies, which is a plus in terms of customization and optimization of resources.

We can tailor the installation to our specific requirements without spending disk space on things we don't really need.

Additionally, it saves a lot of "dependency headaches", reducing the possibility of compatibility issues or conflicts with other packages that may exist in the development environment:

pip install "pandas[postgresql, aws, spss]>=2.0.0"

Installing optional dependencies. Snippet by author.
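To check which optional dependencies are actually present in a given environment, pandas ships a small helper that lists installed (and missing) dependency versions:

import pandas as pd

# Prints pandas' version along with its required and optional dependencies;
# optional packages that are not installed are listed as None
pd.show_versions()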

Let's try it out!

However, the question lingers: is the enthusiasm really justified? I was curious whether pandas 2.0 would bring significant improvements to some of the packages I use daily: ydata-profiling, matplotlib, seaborn, scikit-learn.

From these, I decided to give ydata-profiling a try - it has just added support for pandas 2.0, which seems like a must-have for the community! In the new release, users can rest assured that their pipelines won't break if they use pandas 2.0, and that's a major advantage! But what else?

To be honest, ydata-profiling has always been one of my favorite tools for exploratory data analysis, and it is also a good, quick benchmark - just one line of code on my side, but under the hood it is full of computations that I, as a data scientist, need: descriptive statistics, histogram plotting, correlation analysis, and so on.

So, what better way to test the impact of the pyarrow engine on all of those computations at once, with minimal effort?

import pandas as pd
from ydata_profiling import ProfileReport

# Using pandas 1.5.3 and ydata-profiling 4.2.0
%timeit df = pd.read_csv("data/hn.csv")
# 10.1 s ± 215 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit profile = ProfileReport(df, title="Pandas Profiling Report")
# 4.85 ms ± 77.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit profile.to_file("report.html")
# 18.5 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Using pandas 2.0.2 and ydata-profiling 4.3.1
%timeit df_arrow = pd.read_csv("data/hn.csv", engine='pyarrow')
# 3.27 s ± 38.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit profile_arrow = ProfileReport(df_arrow, title="Pandas Profiling Report")
# 5.24 ms ± 448 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit profile_arrow.to_file("report.html")
# 19 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Benchmarking with ydata-profiling. Snippet by author.

Once again, reading the data is definitely faster with the pyarrow engine, although creating the data profile has not changed significantly in terms of speed.

However, the differences may also lie in memory efficiency, for which we would have to run a different analysis. Additionally, we could further investigate the type of analysis being performed on the data: for some operations, the differences between versions 1.5.2 and 2.0 seem negligible.

But the main thing I noticed that could make a difference in this regard is that ydata-profiling does not yet take advantage of the pyarrow data types. This update could have a big impact on both speed and memory, and it is something I look forward to in future developments!

Conclusion: Performance, Flexibility, Interoperability!

This new pandas 2.0 release brings a lot of flexibility and performance optimizations, with subtle but critical modifications "under the hood".

They may not be "flashy" to newbies in the world of data manipulation, but to seasoned data scientists who have jumped through hoops to overcome the limitations of previous versions, they are like water in the desert.

To summarize, these are the main advantages introduced in the new version:

  • Performance optimization: with the introduction of the Apache Arrow backend, more numpy dtypes for indices, and copy-on-write mode;

  • Added flexibility and customization: allowing users to control optional dependencies and take advantage of the Apache Arrow data types (including nullability from the start!);

  • Interoperability: perhaps a less "appreciated" advantage of the new version, but a huge one. Since Arrow is language-independent, in-memory data can be transferred not only between programs built on top of Python, but also between R, Spark, and other programs that use the Apache Arrow backend!

There you have it, folks! I hope this summary answers some of your questions about pandas 2.0 and its applicability to our data manipulation tasks.

I'm still curious whether, with the introduction of pandas 2.0, you have also noticed significant differences in your day-to-day coding! If you like, come find me at the Data-Centric AI Community (Discord) and let me know what you think! Shall we meet there?

About me

Ph.D., machine learning researcher, educator, data advocate, and overall "jack of all trades". On Medium, I write about data-centric AI and data quality, educating the data science and machine learning communities on how to move from imperfect to smart data.

Original title:

Pandas 2.0: A Game-Changer for Data Scientists?

Original link:

https://medium.com/towards-data-science/pandas-2-0-a-game-changer-for-data-scientists-3cd281fcc4b4?source=topic_portal_recommended_stories---------2-85----------machine_learning----------30a1af14_d40c_416a_bc92_b752b8fd806c-------

Editor: Wang Jing

Translator profile


Chen Chao, Master of Applied Psychology at Peking University and data analysis enthusiast. I studied computer science as an undergraduate and then pursued psychology with determination. In the process of learning, I have increasingly found that data analysis has a wide range of applications, and I hope to produce meaningful work with what I have learned. I am very happy to join the big family of Datapi (Data School), to stay humble and to stay eager.

Translation Team Recruitment Information

Job description: translating selected foreign-language articles into fluent Chinese requires a meticulous mind. If you are an international student in data science/statistics/computer science, are engaged in related work overseas, or are confident in your foreign-language proficiency, you are welcome to join the translation team.

What you can get: regular translation training to improve volunteers' translation skills and their awareness of the data science frontier; overseas friends can stay in touch with the development of technology applications in China; and THU Datapi's industry-university-research background brings good development opportunities to volunteers.

Other benefits: data science practitioners from well-known companies and students from Peking University, Tsinghua University, and renowned overseas universities will all become your partners in the translation team.

Click "Read the original text" at the end of the article to join the Datapai team~

Reprint Notice

If you need to reprint, please indicate the author and the source prominently at the beginning of the article (from: Datapi; ID: DatapiTHU), and place an eye-catching QR code at the end of the article. For articles marked as original, please send [article name - name and ID of the official account to be authorized] to the contact email to apply for whitelist authorization, and edit as required.

After publishing, please send the link back to the contact email (see below). Unauthorized reprinting or adaptation will be pursued for legal liability in accordance with the law.
