Another data analysis powerhouse: Polars is really powerful

I believe that for many data analysis practitioners, two tools see the most use: Pandas and SQL. Pandas can not only clean and analyze a data set but also draw all kinds of cool charts. However, when the data set is very large, Pandas is clearly not up to the task.

Today I will introduce another data processing and analysis tool called Polars. It is faster at processing data and offers two APIs: the Eager API and the Lazy API. The Eager API is used much like Pandas, with similar syntax, executing immediately and producing results.


Very much like Spark, the Lazy API runs operations in parallel and optimizes the query logic before executing it.

Module installation and import

Let's install the module first, using the pip command:

pip install polars

After the installation succeeds, we read the same data with Pandas and with Polars and compare their performance. First, we import the modules we will use:

import pandas as pd
import polars as pl
import matplotlib.pyplot as plt
%matplotlib inline

Reading the file with Pandas

The data set used this time contains the user names of a website's registered users and is about 360 MB in size. First, we read the csv file with the Pandas module:

%%time 
df = pd.read_csv("users.csv")
df.head()


It can be seen that it took Pandas a total of 12 seconds to read the CSV file. The data set has two columns: the user name, and the number of times that user name is repeated, "n". Let's sort the data set by calling the sort_values() method, as shown below:

%%time 
df.sort_values("n", ascending=False).head()


Reading and operating on the file with Polars

Next, we use the Polars module to read and operate on the file, to see how long it takes. The code is as follows:

%%time 
data = pl.read_csv("users.csv")
data.head()


It can be seen that reading the data with the polars module takes only about 730 milliseconds, which is a lot faster. Now we sort the data set by the column "n". Note that recent Polars versions use the keyword "descending"; older releases spelled it "reverse". The code is as follows:

%%time
data.sort("n", descending=True).head()


Sorting the data set takes about 1.39 seconds. Next, we use the polars module to conduct a preliminary exploratory analysis of a data set: which columns does it contain, and what are their names? We will use the familiar "Titanic" data set as the example:

df_titanic = pl.read_csv("titanic.csv")
df_titanic.columns

output

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 ......]

The method is called just like its Pandas counterpart and outputs the column names. Next, let's see how many rows and columns the data set has in total:

df_titanic.shape

output

(891, 12)

Take a look at the data type of each column in the dataset

df_titanic.dtypes

output

[polars.datatypes.Int64,
 polars.datatypes.Int64,
 polars.datatypes.Int64,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8,
 polars.datatypes.Float64,
......]

Filling Null Values and Statistical Analysis

Let's take a look at the distribution of null values in the data set by calling the null_count() method:

df_titanic.null_count()


We can see that the "Age" and "Cabin" columns contain null values. Let's fill the missing ages with the mean age. Values missing from a CSV load as null (not NaN) in Polars, so fill_null() is the method to use here:

df_titanic = df_titanic.with_columns(pl.col("Age").fill_null(pl.col("Age").mean()))

To calculate the mean of a column, you only need to call the mean() method; the median, maximum, and minimum are computed the same way. The code is as follows:

print(f'Median Age: {df_titanic["Age"].median()}')
print(f'Average Age: {df_titanic["Age"].mean()}')
print(f'Maximum Age: {df_titanic["Age"].max()}')
print(f'Minimum Age: {df_titanic["Age"].min()}')

output

Median Age: 29.69911764705882
Average Age: 29.699117647058817
Maximum Age: 80.0
Minimum Age: 0.42

Data filtering and visualization

Let's filter out the passengers who are over 40 years old. The code is as follows:

df_titanic.filter(pl.col("Age") > 40)


Finally, let's draw a simple chart. The code is as follows:

fig, ax = plt.subplots(figsize=(10, 5))
ax.boxplot(df_titanic["Age"].to_numpy())  # convert the Polars Series to a NumPy array
plt.xticks(rotation=90)
plt.xlabel('Age Column')
plt.ylabel('Age')
plt.show()


In general, the polars module has many similarities with Pandas for data analysis and processing, though some of its APIs differ. Interested readers can refer to the official website: https://www.pola.rs/


Origin blog.csdn.net/qq_34160248/article/details/124359835