Goodbye Pandas: another powerful data processing tool!

cuDF (Pandas GPU replacement) for loading, joining, aggregating, filtering and other data operations.

Introduction to cuDF

cuDF is a Python GPU DataFrame library built on the Apache Arrow columnar memory format for loading, joining, aggregating, filtering, and otherwise manipulating data. cuDF also provides an API similar to pandas.


GitHub: https://github.com/rapidsai/cudf

Documentation: https://docs.rapids.ai/api/cudf/stable

Introduction to related frameworks

cuDF: cuDF is a Python GPU DataFrame library built on Apache Arrow's columnar memory format. It loads, joins, aggregates, filters, and otherwise manipulates tabular data through a pandas-like DataFrame API. It allows data engineers and data scientists to accelerate their workflows without having to delve into the details of CUDA programming. cuDF is designed to process large-scale data sets on GPUs, providing high-performance support for data processing tasks.

Dask: Dask is a flexible Python parallel computing library that makes scaling smooth and simple in your workflow. On the CPU, Dask uses Pandas to perform operations on DataFrame partitions in parallel. It allows users to process data at a larger scale and make full use of computing resources without requiring large-scale changes to the code.

Dask-cuDF: Dask-cuDF extends Dask where necessary to allow its DataFrame partitions to be processed using a cuDF GPU DataFrame instead of a Pandas DataFrame. For example, when dask_cudf.read_csv(...) is called, the cluster's GPU performs the work of parsing the CSV file by calling cudf.read_csv(). This enables leveraging cuDF's high-performance data processing capabilities on the GPU, thereby accelerating large-scale data processing tasks.
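The read_csv flow described above might look roughly like the sketch below. It assumes a machine with RAPIDS installed and at least one CUDA-capable GPU. The function name total_by_key, the glob pattern, and the "key" and "value" column names are all hypothetical. The import sits inside the function so that defining it does not itself require a GPU.

```python
def total_by_key(csv_glob):
    """Sum the (hypothetical) "value" column per "key" across many CSV files."""
    import dask_cudf  # needs RAPIDS and a CUDA-capable GPU

    # Lazily build a GPU-backed Dask DataFrame.
    # Each partition is handled as a cuDF DataFrame rather than a pandas one.
    ddf = dask_cudf.read_csv(csv_glob)

    # Nothing is parsed or aggregated until .compute() is called.
    return ddf.groupby("key")["value"].sum().compute()
```

On a GPU machine this would be called as, for example, total_by_key("data/*.csv").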

cuDF and Pandas comparison

cuDF is a DataFrame library that closely matches the Pandas API, but is not a complete replacement for Pandas when used directly. There are some differences between cuDF and Pandas in terms of API and behavior. Here is a comparison of the similarities and differences between cuDF and Pandas:

Supported operations:

cuDF supports many of the same data structures and operations as Pandas, including Series, DataFrame, and Index, along with their unary and binary operations, indexing, filtering, joins, grouping, and window operations.

Data types:

cuDF supports commonly used data types in Pandas, including numeric, datetime, timedelta, string and categorical data types. Additionally, cuDF supports special data types for decimal, list, and "struct" values.

Missing values:

Unlike Pandas, all data types in cuDF are nullable, meaning they can contain missing values (represented by cudf.NA).
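A small sketch of what nullable types mean in practice. The xdf alias and the pandas fallback are assumptions added here so that the snippet also runs on a machine without a GPU; on the fallback path the dtype differs, which is exactly the difference being described.

```python
try:
    import cudf as xdf  # GPU path
except ImportError:
    import pandas as xdf  # CPU fallback (assumption) so the sketch runs without a GPU

s = xdf.Series([1, 2, None, 4])

# On cuDF the dtype stays int64 and the gap is a null (<NA>).
# On the pandas fallback the same input is upcast to float64 with NaN.
print(s.dtype)

missing = int(s.isna().sum())  # number of missing entries
total = int(s.sum())           # missing values are skipped by default
filled = s.fillna(0)           # replace missing values with 0
print(missing, total)
```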

Iteration:

cuDF does not support iterating over a Series, DataFrame, or Index. GPUs are optimized for highly parallel operations rather than sequential ones, so iterating over data on the GPU would perform extremely poorly.
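In practice this usually means replacing a Python loop with one vectorized expression, and copying to host memory explicitly in the rare case where element-by-element access is unavoidable. A minimal sketch follows; the xdf alias and the pandas fallback are assumptions so that it runs without a GPU.

```python
try:
    import cudf as xdf  # GPU path
except ImportError:
    import pandas as xdf  # CPU fallback (assumption) so the sketch runs without a GPU

values = xdf.Series([1, 2, 3, 4])

# Preferred: a single vectorized expression instead of a Python loop.
doubled = values * 2

# If element-wise Python access is really needed, copy to host memory first.
# cuDF objects expose to_pandas(); the pandas fallback is already on the host.
host = doubled.to_pandas() if hasattr(doubled, "to_pandas") else doubled
host_values = [int(v) for v in host]
print(host_values)
```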

Sorting of results:

By default, join (or merge) and groupby operations in cuDF do not guarantee output ordering. To match Pandas behavior, explicitly pass sort=True or enable the mode.pandas_compatible option.
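A short sketch of requesting ordered groupby output. The column names are made up for illustration, and the xdf alias with its pandas fallback is an assumption so that the snippet runs without a GPU.

```python
try:
    import cudf as xdf  # GPU path
except ImportError:
    import pandas as xdf  # CPU fallback (assumption) so the sketch runs without a GPU

df = xdf.DataFrame({"key": ["b", "a", "c", "a"], "val": [1, 2, 3, 4]})

# sort=True asks for the result to be ordered by group key.
sums = df.groupby("key", sort=True)["val"].sum()

# Copy to host only to inspect the result.
host = sums.to_pandas() if hasattr(sums, "to_pandas") else sums
keys = list(host.index)
totals = [int(v) for v in host]
print(keys, totals)
```

Sorting has a cost, so it is probably worth requesting only when downstream code depends on row order.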

Floating point operations:

cuDF uses the GPU to perform operations in parallel, so the order of operations is not always deterministic. Because floating point arithmetic is non-associative, the same computation can produce slightly different results from run to run. When comparing floating point results, use the functions provided by the cudf.testing module, which let you compare values to a chosen precision.

Column names:
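The non-associativity is easy to see in plain Python, with no GPU involved. In the sketch below, math.isclose stands in for a tolerance-based comparison; it is not the cudf.testing API, only an illustration of the same idea.

```python
import math

a, b, c = 0.1, 0.2, 0.3

# Same three numbers, two different groupings.
left = (a + b) + c
right = a + (b + c)

exactly_equal = left == right                          # False: rounding differs
close_enough = math.isclose(left, right, rel_tol=1e-9)  # True within tolerance
print(exactly_equal, close_enough)
```

A parallel reduction on the GPU may group the additions differently from one run to the next, which is why exact equality checks on floating point results are fragile.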

Unlike Pandas, cuDF does not support duplicate column names. It's best to use unique strings as column names.

There is no real "object" data type:

Unlike Pandas and NumPy, cuDF does not support the "object" data type, which is used to store collections of arbitrary Python objects.

.apply() function limitations:

cuDF supports the .apply() function, but it relies on Numba to JIT compile user-defined functions (UDFs) and execute them on the GPU. This can be very fast, but imposes some restrictions on the operations allowed in the UDF.
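A minimal sketch of a UDF that stays within those restrictions: only scalar arithmetic and branching, with no Python objects or library calls inside the function. The function name is hypothetical, and the xdf alias with its pandas fallback is an assumption so that the snippet runs without a GPU (on the fallback path no JIT compilation happens).

```python
try:
    import cudf as xdf  # GPU path: the UDF is JIT compiled by Numba
except ImportError:
    import pandas as xdf  # CPU fallback (assumption) so the sketch runs without a GPU


def clip_and_double(x):
    # Plain scalar arithmetic and branching only.
    if x < 0:
        return 0
    return x * 2


s = xdf.Series([-3, 1, 4])
result = s.apply(clip_and_double)

# Copy to host only to inspect the result.
host = result.to_pandas() if hasattr(result, "to_pandas") else result
out = [int(v) for v in host]
print(out)
```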

When to use cuDF and Dask-cuDF

cuDF:
  • You'll want to use cuDF when your workflow is fast enough on a single GPU, or your data fits easily in a single GPU's memory.

  • When the amount of data is not large and can be processed in a single GPU memory, cuDF provides support for high-performance data operations on a single GPU.

Dask-cuDF:
  • You'll want to use Dask-cuDF when you want to distribute your workflow across multiple GPUs, or when your data volume exceeds the capacity of a single GPU's memory, or when you want to analyze data distributed across many files simultaneously.

  • Dask-cuDF allows you to perform high-performance data processing in distributed GPU environments, especially when the data set is too large to fit in a single GPU memory.

cuDF code example

import pandas as pd
import cudf

# Create a cudf.Series; None becomes a missing value.
s = cudf.Series([1, 2, 3, None, 4])

# Create a cudf.DataFrame from a dict of columns.
df = cudf.DataFrame(
    {
        "a": list(range(20)),
        "b": list(reversed(range(20))),
        "c": list(range(20)),
    }
)

# Convert a pandas DataFrame into a cudf.DataFrame.
pdf = pd.DataFrame({"a": [0, 1, 2, 3], "b": [0.1, 0.2, None, 0.3]})
gdf = cudf.DataFrame.from_pandas(pdf)
gdf

# View the top rows of a GPU DataFrame.
df.head(2)

# Sort by values.
df.sort_values(by="b")

# Select a single column.
df["a"]

# Select rows from index 2 to index 5 from columns "a" and "b".
df.loc[2:5, ["a", "b"]]

# Select via integers and integer slices, like numpy/pandas.
df.iloc[0:3, 0:2]

# Select rows in a DataFrame or Series by direct Boolean indexing.
df[df.b > 15]

# Add a grouping key (1 for even rows, 0 for odd rows); df has no "agg_col1" until now.
df["agg_col1"] = [1 if x % 2 == 0 else 0 for x in range(len(df))]

# Group by the key, then apply a different aggregation to each column.
df.groupby("agg_col1").agg({"a": "max", "b": "mean", "c": "sum"})

Origin blog.csdn.net/qq_34160248/article/details/134905341