I believe that for many data analysis practitioners, Pandas and SQL are the two most frequently used tools. Pandas can not only clean and analyze a data set but also draw all kinds of cool charts. However, when the data set is very large, Pandas is clearly not enough to process it.
Today I will introduce another data processing and analysis tool called Polars. It is faster at data processing and offers two APIs: the Eager API and the Lazy API. The Eager API is used much like Pandas: the syntax is similar, and each operation executes immediately and produces a result.
The Lazy API, on the other hand, works much like Spark: operations run in parallel and the query logic is optimized before execution.
Module installation and import
Let's install the module first, using the pip command:
pip install polars
After the installation succeeds, we use Pandas and Polars to read the same data and compare their performance. First, we import the modules that will be used:
import pandas as pd
import polars as pl
import matplotlib.pyplot as plt
%matplotlib inline
Reading files with Pandas
The data set used this time contains the user names of a website's registered users, 360 MB in total. We first use the Pandas module to read the CSV file:
%%time
df = pd.read_csv("users.csv")
df.head()
output
It can be seen that Pandas took a total of 12 seconds to read the CSV file. The data set has two columns: the user name, and "n", the number of times that user name is repeated. Let's sort the data set by calling the sort_values() method, with the code shown below:
%%time
df.sort_values("n", ascending=False).head()
output
Reading and operating on files with Polars
Next, we use the Polars module to read and operate on the same file and see how long it takes. The code is as follows:
%%time
data = pl.read_csv("users.csv")
data.head()
output
It can be seen that Polars takes only 730 milliseconds to read the same data, which is a lot faster. Now we sort the data set by the column "n". The code is as follows:
%%time
# newer Polars versions use descending= (older versions used reverse=)
data.sort(by="n", descending=True).head()
output
Sorting the data set takes 1.39 seconds. Next, we use the Polars module to conduct a preliminary exploratory analysis: what columns does the data set have, and what are their names? Here we take the familiar "Titanic" data set as the example:
df_titanic = pl.read_csv("titanic.csv")  # read with Polars (pl), not Pandas
df_titanic.columns
output
['PassengerId',
'Survived',
'Pclass',
'Name',
'Sex',
'Age',
......]
The columns attribute is called the same way as in Pandas and outputs the column names. Next, let's see how many rows and columns the data set has in total:
df_titanic.shape
output
(891, 12)
Take a look at the data type of each column in the dataset
df_titanic.dtypes
output
[polars.datatypes.Int64,
polars.datatypes.Int64,
polars.datatypes.Int64,
polars.datatypes.Utf8,
polars.datatypes.Utf8,
polars.datatypes.Float64,
......]
Filling null values and statistical analysis
Let's take a look at the distribution of null values in the dataset and call the null_count()
method
df_titanic.null_count()
output
We can see that there are null values in the "Age" and "Cabin" columns. We can try to fill the "Age" column with its average value, the code is as follows

df_titanic = df_titanic.with_columns(
    # values missing from the CSV are nulls in Polars, so use fill_null (not fill_nan)
    pl.col("Age").fill_null(pl.col("Age").mean())
)
To calculate the average value of a column, you only need to call the mean() method; the median and the maximum/minimum values are calculated the same way. The code is as follows:
print(f'Median Age: {df_titanic["Age"].median()}')
print(f'Average Age: {df_titanic["Age"].mean()}')
print(f'Maximum Age: {df_titanic["Age"].max()}')
print(f'Minimum Age: {df_titanic["Age"].min()}')
output
Median Age: 29.69911764705882
Average Age: 29.699117647058817
Maximum Age: 80.0
Minimum Age: 0.42
Data filtering and visualization
We filter out the passengers who are over 40 years old, the code is as follows

# Polars uses filter() with expressions instead of boolean-mask indexing
df_titanic.filter(pl.col("Age") > 40)
output
Finally, we simply draw a chart, the code is as follows
fig, ax = plt.subplots(figsize=(10, 5))
ax.boxplot(df_titanic["Age"].to_numpy())  # convert the Polars Series for Matplotlib
plt.xticks(rotation=90)
plt.xlabel('Age Column')
plt.ylabel('Age')
plt.show()
output
In general, Polars has many similarities with Pandas in data analysis and processing, although some APIs differ. Interested readers can refer to the official website: https://www.pola.rs/