Extremely fast data processing third-party modules for Polars

d6c6a1cf173d905236be5754192ee87c.png

Polars is an extremely fast DataFrames library.

It has the following properties:

1. Multithreading

2. Powerful expression API

3. Query optimization

Here is a brief introduction to how to use this module.

1. Prepare

Before starting, you need to make sure that Python and pip have been successfully installed on your computer. If not, you can visit this article: Super detailed Python installation guide  for installation.

(Optional 1)  If you use Python for data analysis, you can install Anaconda directly: Anaconda, a good helper for Python data analysis and mining , has built-in Python and pip.

(Optional 2)  In addition, it is recommended that you use the VSCode editor, which has many advantages: The best partner for Python programming—VSCode detailed guide .

Please choose one of the following ways to enter the command to install dependencies :
1. Open Cmd (Start-Run-CMD) in the Windows environment.
2. Open Terminal in the MacOS environment (command+space to enter Terminal).
3. If you are using VSCode editor or Pycharm, you can directly use the Terminal at the bottom of the interface.

pip install polars

2. Polars introduction

When initializing variables, the way Polars is used is not much different from Pandas. Below we define an initial variable, which will be used in all subsequent examples:

import polars as pl
df = pl.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "fruits": ["banana", "banana", "apple", "apple", "banana"],
        "B": [5, 4, 3, 2, 1],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
    }
)

Select the data to display:

(df.select([
    pl.col("A"),
    "B", # the col part is inferred
    pl.lit("B"), # we must tell polars we mean the literal "B"
    pl.col("fruits"),
]))

The effect is as follows:

c08abaaeed00b3a5c3e0818068a08737.png

He can also use regular expressions to filter values ​​and perform operations such as summing:

# 正则表达式
(df.select([
    pl.col("^A|B$").sum()
]))
# 或者多选
(df.select([
    pl.col(["A", "B"]).sum()
]))
cd88e24251b5f45978a0d50a54f2c4c1.png

Polars supports the following complex and efficient query and display:

>>> df.sort("fruits").select(
...     [
...         "fruits",
...         "cars",
...         pl.lit("fruits").alias("literal_string_fruits"),
...         pl.col("B").filter(pl.col("cars") == "beetle").sum(),
...         pl.col("A").filter(pl.col("B") > 2).sum().over("cars").alias("sum_A_by_cars"),
...         pl.col("A").sum().over("fruits").alias("sum_A_by_fruits"),
...         pl.col("A").reverse().over("fruits").alias("rev_A_by_fruits"),
...         pl.col("A").sort_by("B").over("fruits").alias("sort_A_by_B_by_fruits"),
...     ]
... )
shape: (5, 8)
┌──────────┬──────────┬──────────────┬─────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ fruits ┆ cars ┆ literal_stri ┆ B ┆ sum_A_by_ca ┆ sum_A_by_fr ┆ rev_A_by_fr ┆ sort_A_by_B │
│ --- ┆ --- ┆ ng_fruits ┆ --- ┆ rs ┆ uits ┆ uits ┆ _by_fruits │
│ str ┆ str ┆ --- ┆ i64 ┆ --- ┆ --- ┆ --- ┆ --- │
│ ┆ ┆ str ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞══════════╪══════════╪══════════════╪═════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 4           ┆ 4           │
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 3           ┆ 3           │
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 5           ┆ 5           │
│ "banana" ┆ "audi"   ┆ "fruits"     ┆ 11  ┆ 2           ┆ 8           ┆ 2           ┆ 2           │
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 1           ┆ 1           │
└──────────┴──────────┴──────────────┴─────┴─────────────┴─────────────┴─────────────┴─────────────┘

3. Advanced use of Polars

Reverse operation, reverse the value and put it back into the variable, named xxx_reverse:

(df.select([
    pl.all(),
    pl.all().reverse().suffix("_reverse")
]))

0d29cef4d54ac5e72fbd0fe01cac4282.png

Sum all the columns and put them back into a variable named xxx_sum:

(df.select([
    pl.all(),
    pl.all().sum().suffix("_sum")
]))
bb3cc8b897c1ae0fd4588fec392c10c7.png

Regex can also be used for filtering:

predicate = pl.col("fruits").str.contains("^b.*")

(df.select([
    predicate
]))
b0be4747057c51b5913f87bb2e5b68b3.png

When setting a new column, it is even possible to set values ​​for different rows based on conditions:

(df.select([
    "fruits",
    "B",
    pl.when(pl.col("fruits") == "banana").then(pl.col("B")).otherwise(-1).alias("b")
]))
4ff18cfd499f4f947d9ad5cfde8d05db.png

The fold function is powerful, and it performs operations on columns for the fastest, aka vectorized execution:

df = pl.DataFrame({
        "a": [1, 2, 3],
        "b": [10, 20, 30],
    }
)

out = df.select(
    pl.fold(acc=pl.lit(0), f=lambda acc, x: acc + x, exprs=pl.col("*")).alias("sum"),
)
print(out)
# shape: (3, 1)
# ┌─────┐
# │ sum │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 11 │
# ├╌╌╌╌╌┤
# │ 22 │
# ├╌╌╌╌╌┤
# │ 33 │
# └─────┘

Polars also has many other useful features, if you are interested, you can visit their user manual to read and learn:

https://pola-rs.github.io/polars-book/user-guide

This is the end of our article. If you like today's Python practical tutorial, please continue to pay attention to Python Practical Collection.

If you have any questions, you can reply in the background of the official account: join the group , answer the corresponding red letter verification information , and enter the mutual assistance group to ask.

Originality is not easy, I hope you can give me a thumbs up below and watch to support me to continue creating, thank you!

Click below to read the original text for a better reading experience

Python Practical Collection (pythondict.com)
is not just a collection.
Welcome to pay attention to the official account: Python Practical Collection

9d0ad61b14f712d4b9338ff9183281c7.jpeg

Guess you like

Origin blog.csdn.net/u010751000/article/details/128556763