Detailed explanation of the Polars library in Python

The Polars library in Python

What is Polars

Polars is a Python library for data manipulation. Its API is similar to Pandas, but it is designed to be considerably faster and more memory-efficient.

Polars can handle very large datasets and process them quickly. It offers a Pandas-like API for common operations such as filtering, aggregation, and transformation. It also aims to be intuitive, so that more complex data processing steps stay easy to write.

Common functions

1.read_csv()
The read_csv() function reads data from a CSV file and returns a DataFrame object. It accepts various parameters, such as the file path, the column delimiter (separator), and the end-of-line character (eol_char).

Sample code:

import polars as pl

df = pl.read_csv('data.csv')

2.head()
The head() function returns the first n rows of the DataFrame; the default is 5 rows.

Sample code:

import polars as pl

df = pl.read_csv('data.csv')
print(df.head())

3.filter()
The filter() function keeps only the rows of a DataFrame that satisfy the specified condition.

Sample code:

import polars as pl

df = pl.read_csv('data.csv')
filtered_df = df.filter(pl.col('age') > 18)  # keep rows where age is greater than 18
print(filtered_df)

4.select()
The select() function selects columns from a DataFrame and returns them as a new DataFrame.

Sample code:

import polars as pl

df = pl.read_csv('data.csv')
selected_df = df.select(['name', 'age'])  # return only the name and age columns
print(selected_df)

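5.group_by()
The group_by() function groups the rows of a DataFrame by one or more columns so that each group can be aggregated. Polars releases before 0.19 spell this method groupby().

Sample code:

import polars as pl

df = pl.read_csv('data.csv')
# Aggregations are written as expressions; alias() names each output column
grouped_df = df.group_by('gender').agg([
    pl.col('age').min().alias('age_min'),
    pl.col('age').max().alias('age_max'),
    pl.col('age').mean().alias('age_mean'),
    pl.col('salary').sum().alias('salary_sum'),
])
print(grouped_df)

.agg() performs aggregate operations on grouped data. It accepts one or more expressions, each naming the column to aggregate and the aggregation to apply, for example pl.col('age').mean(). Unlike Pandas, it does not accept a dictionary that maps column names to function names.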

6.join()
The join() function is used to join the data in two DataFrames according to the specified column.

Sample code:

import polars as pl

df1 = pl.read_csv('data1.csv')
df2 = pl.read_csv('data2.csv')
joined_df = df1.join(df2, on='id')
print(joined_df)

7.sort()
The sort() function sorts the rows of the DataFrame by the specified column, in ascending order by default.

Sample code:

import polars as pl

df = pl.read_csv('data.csv')
sorted_df = df.sort(by='age')
print(sorted_df)

8.fill_null()
The fill_null() function is used to fill the null values in the DataFrame with the specified value.

Sample code:

import polars as pl

df = pl.read_csv('data.csv')
filled_df = df.fill_null(0)  # fill null values with 0
print(filled_df)

9.describe()
The describe() function generates summary statistics for the columns of a DataFrame, including count, mean, standard deviation, minimum, and maximum.

Sample code:

import polars as pl

df = pl.read_csv('data.csv')
description = df.describe()
print(description)

10.pl.DataFrame
pl.DataFrame is the class in the Polars library used to create DataFrame objects. DataFrame is a two-dimensional tabular data structure in which each column can be a different data type, similar to an Excel table or a data table in SQL.

Sample code:

import polars as pl

data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [20, 30, 25],
    'gender': ['F', 'M', 'M']
}

df = pl.DataFrame(data)
print(df)

In the sample code above, we first define a dictionary data, which contains three key-value pairs, representing the three column data of name, age and gender respectively. Next, we use the pl.DataFrame class to create a DataFrame object df, and pass in data as a parameter of the constructor. Finally, we print out the value of the df object.

11.get_column() and pl.col()
get_column() is a DataFrame method that selects a single column and returns it as a Series object. In Polars, a DataFrame object consists of multiple Series objects, each Series object representing a column of data. A DataFrame has no col() method; pl.col() is a module-level function that builds an expression referring to a column, for use inside methods such as select(), filter(), and agg().

Sample code:

import polars as pl

df = pl.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [20, 30, 25],
    'gender': ['F', 'M', 'M']
})

age_col = df.get_column('age')  # df['age'] returns the same Series
print(age_col)

In the above sample code, we first created a DataFrame object, then used the get_column() method to select the age column data, and assigned it to the age_col variable. Finally, we print out the value of the age_col variable, which is a Series object representing the age column data.

Selecting a column as a Series is a convenient way to operate on it directly, such as calculating the average or maximum of that column. To work with several columns at once, use the select() method with column names or pl.col() expressions, and then operate on the resulting DataFrame.

Origin blog.csdn.net/m0_68678046/article/details/130301656