The new king of data processing? Differences between Polars and Pandas


If you have followed the progress of Python DataFrames over the past year, you must have heard of  Polars , a powerful DataFrame library designed for working with large data sets .


Unlike other libraries such as  Spark , Dask  , and  Ray  that handle large data sets, Polars is used on a single machine, which has attracted many   comparisons with pandas . In fact, Polars differs from pandas in many important ways, including how it handles data and its best applications. The following will explore the technical differences between the two DataFrame libraries and analyze their respective advantages and limitations.


If you want to hear  it from Ritchie Vink , the creator of Polars  , you can watch our interview with him!



Why use Polars 

Instead of pandas?


Two words: performance . Polars has been extremely fast from the start, performing common operations 5 to 10 times faster than pandas. In addition, the memory requirements of Polars operations are significantly smaller than pandas: pandas requires about 5 to 10 times the RAM of the data set size to perform operations, while Polars requires 2 to 4 times.


You can learn how Polars performance compares to other DataFrame libraries here . For common operations, Polars is 10 to 100 times faster than pandas and is one of the fastest DataFrame libraries. Additionally, it can handle larger data sets than pandas before out of memory errors.



Why are Polars so fast?


These results are so impressive that you might be wondering: How does Polars achieve this kind of performance when running on a single machine? Because the library was designed with performance in mind from the beginning and achieved it in a variety of ways.


Written in Rust


One of the most famous facts about Polars is that it  is written in  Rust , a low-level language that is almost as fast as C and C++. And pandas is built on top of Python libraries, one of which is  NumPy . Although the core of NumPy is written in C, it still suffers from problems inherent in Python's handling of certain types in memory (such as strings for categorical data), resulting in poor performance when handling these types (see this article by  Wes  McKinney A great blog post for more details).


Another advantage of using Rust is that it allows for safe concurrency , making parallelism as predictable as possible. This means that Polars can safely use all machine cores to execute complex queries involving multiple columns, leading Ritchie Vink to describe Polar's performance as "overly parallel." The performance of Polars is therefore much higher than that of pandas, which uses only one core to perform operations. Watch Nico Kreiling's great talk at this year's PyCon DE detailing how Polars achieves this goal.


Based on Arrow


Another factor in Polars' amazing performance is Apache Arrow , a language-independent memory format. Arrow was actually co-created by Wes McKinney to solve a problem he saw with pandas as the amount of data exploded. It is also the backend for pandas 2.0, a higher-performance version of pandas released in March this year. However, the library's Arrow backend is slightly different: while pandas 2.0 is built on PyArrow, the Polars team built its own Arrow implementation.


One of the main advantages of building a database on Arrow is interoperability . Arrow was designed to standardize the in-memory data format used across libraries and is already used in many important libraries and databases, as figure below .


This interoperability improves performance because it avoids the need to convert data into different formats to pass between different steps of the data pipeline (in other words, it avoids the need to serialize and deserialize the data need). It is also more memory efficient because two processes can share the same data without creating copies. It is estimated that serialization/deserialization  accounts for 80-90% of the computational overhead in data workflows , and Arrow's universal data format has brought significant performance improvements to Polars.


Arrow also has built-in support for a wider range of data types than pandas. Since pandas is based on NumPy, it is great at handling integer and floating point columns, but struggles with other data types. In contrast , Arrow provides sophisticated support for datetime, boolean, binary, and even complex column types such as those containing lists. Additionally, Arrow can handle missing data natively, which requires extra steps in NumPy.


Finally, Arrow uses columnar data storage , where all columns are stored in contiguous blocks of memory regardless of data type. Not only does this make parallelization easier, it also makes data retrieval faster.


Query optimization


Another core part of Polars performance is how you evaluate your code. Pandas uses Eager execution by default, and operations are performed in the order written. In contrast, Polars performs both eager and lazy execution , where the query optimizer evaluates all required operations and formulates the most efficient way to execute the code. This may include rewriting the order in which operations are performed or removing redundant calculations. For example, use the following expression to get the mean of the columns for each category "A" and "B" Categoryin .Number1

(
df
.groupby(by = "Category").agg(pl.col("Number1").mean())
.filter(pl.col("Category").is_in(["A""B"]))
)


If the expression Eager is executed, the entire DataFrame is unnecessarily evaluated groupbyand then Categoryfiltered. With lazy execution, the DataFrame is filtered and executed only on the required data groupby.


Expressive API


最后,Polars 拥有一个极具表达性的 API,基本上您想执行的任何运算都可以用 Polars 方法表达。相比之下,pandas 中更复杂的运算通常需要作为 lambda 表达式传递给apply方法。apply方法的问题是它循环遍历 DataFrame 的行,对每一行按顺序执行运算。内置方法能够让您在列级别上工作并利用另一种称为 SIMD 的并行形式。



什么时候应该继续使用 pandas?


这一切看起来都棒极了,您可能现在就想抛弃 pandas。先别急!虽然 Polars 非常适合进行极其高效的数据转换,但它目前并不是数据探索或机器学习管道的最佳选择。 这些都是 pandas 可以继续崭露头角的领域。


其中一个原因是,虽然 Polars 与其他使用 Arrow 的软件包具有良好的互操作性,但它尚不兼容大多数 Python 数据可视化软件包或机器学习库,例如 scikit-learn PyTorch 唯一的例外是 Plotly,它允许您直接从 Polars DataFrame 创建图表。


当前讨论较多的解决方案是在这些软件包中使用 Python DataFrame 交换协议,以允许其支持一系列 DataFrame 库,让数据科学和机器学习工作流不再遭遇 pandas 的瓶颈。不过,这个想法相对较新,项目距离实现还有一段时间。



Polars 和 pandas 的工具


想亲自尝试 Polars 了吗?DataSpell 和 PyCharm Professional 2023.2 都提供了出色的工具,可在 Jupyter Notebook 中处理 pandas 和 Polars。特别是,pandas 和 Polars DataFrame 以交互功能显示,使数据探索更快、更舒适。


我最喜欢的功能包括 DataFrame 所有行和列的不截断滚动浏览、DataFrame 值聚合的一键获取以及 DataFrame 的多种格式导出(包括 Markdown!)。


如果您还没有使用 DataSpell,可以立即开始 30 天试用。



本博文英文原作者:Jodie Burchell

DataSpell 相关阅读

关于 DataSpell

DataSpell 是一款专门用于探索性数据分析和机器学习模型原型设计的 IDE。它在一个人性化环境中将 Jupyter Notebook 的交互性与 PyCharm 的智能 Python 和 R 编码辅助相结合。


DataSpell 具有智能代码辅助、版本控制和其他特定于 IDE 的功能,以及诸如表、图和微件等交互式输出功能,这些交互式输出功能可以帮助您可视化数据并从中获取洞见。

进一步了解 DataSpell

⏬ 戳「阅读原文」了解更多

本文分享自微信公众号 - JetBrains(JetBrainsChina)。
如有侵权,请联系 [email protected] 删除。
本文参与“OSC源创计划”,欢迎正在阅读的你也加入,一起分享。

阿里云严重故障,全线产品受影响(已恢复) 俄罗斯操作系统 Aurora OS 5.0 全新 UI 亮相 汤不热 (Tumblr) 凉了 多家互联网公司急招鸿蒙程序员 .NET 8 正式 GA,最新 LTS 版本 UNIX 时间即将进入 17 亿纪元(已进入) 小米官宣 Xiaomi Vela 全面开源,底层内核为 NuttX Linux 上的 .NET 8 独立体积减少 50% FFmpeg 6.1 "Heaviside" 发布 微软推出全新“Windows App”
{{o.name}}
{{m.name}}

Guess you like

Origin my.oschina.net/u/5494143/blog/10114360