Running Pandas and sklearn on the GPU

Pandas handles data well, but all of its computations run on the CPU. Parallel processing can speed things up, yet it still falls short when very large amounts of data are involved.

In the past, GPUs were used mainly for rendering video and playing games. With the advancement of the technology, most large projects now rely on GPU support because of its potential to accelerate deep learning workloads.

RAPIDS, NVIDIA's open-source library suite, allows us to perform data science calculations entirely on the GPU. In this article we compare the performance of a RAPIDS (cuDF) DataFrame on the GPU with a normal Pandas DataFrame.

We will test it in Google Colab: we need very little disk space but a GPU with plenty of memory (15 GB), and Colab provides exactly that. We will start with the installation; please follow the steps below to complete the process.

Enable GPU

Select "Change runtime type" in the "Runtime" option of Colab in the menu bar. Then select GPU as hardware accelerator.

NVIDIA cards are the only GPUs that support CUDA, and in Google Colab RAPIDS only supports the P4, P100, T4, and V100 GPUs. After a GPU has been assigned, run the following command to confirm which one you got:

 !nvidia-smi

Here a T4 with 15 GB of memory has been allocated. If you are assigned a different GPU (such as a P4), you can request a new one from the "Runtime" menu by selecting "Factory reset runtime".

Install Rapids

 !git clone https://github.com/rapidsai/rapidsai-csp-utils.git
 !python rapidsai-csp-utils/colab/env-check.py

Running the following commands updates the existing Colab environment and restarts the kernel; after they run, the current session restarts automatically.

 ! bash rapidsai-csp-utils/colab/update_gcc.sh
 import os
 os._exit(00)

Install CondaColab

 import condacolab
 condacolab.install()

This command will cause the kernel to reboot again. After rebooting, run the following command to determine if the installation was successful:

 import condacolab
 condacolab.check()

The following installs RAPIDS on the Colab instance:

 !python rapidsai-csp-utils/colab/install_rapids.py stable

Once that's done, it's time to test the GPU's performance!

Simple comparison test

Creating a large DataFrame lets us test the full potential of the GPU. We will create a cuDF (CUDA DataFrame) of 10,000,000 rows x 2 columns (10M x 2). First, import the required libraries:

 import cudf 
 import pandas as pd
 import numpy as np

Create the DataFrames

 gpuDF = cudf.DataFrame({'col_1': np.random.randint(0, 10000000, size=10000000),
                           'col_2': np.random.randint(0, 10000000, size=10000000)})
 pandasDF = pd.DataFrame({'col_1': np.random.randint(0, 10000000, size=10000000),
                           'col_2': np.random.randint(0, 10000000, size=10000000)})

cuDF is a DataFrame that lives on the GPU. Because it is built as a mirror of Pandas, almost every Pandas function can be run on it. The operations look the same as in Pandas, but they are all performed in GPU memory.
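For example (a small illustration, not from the original post), the familiar Pandas calls carry over directly to the cuDF frame created above:

 # The same Pandas-style calls work on the cuDF frame, but execute on the GPU
 print(gpuDF.head())
 print(gpuDF['col_1'].mean())
 print(gpuDF.describe())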

Let's compare how long it takes to create the two DataFrames:
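The post shows this comparison as timed notebook cells whose output is not reproduced here; a minimal sketch of how it can be reproduced, using the standard time module (variable names follow the cells above, and measured numbers will vary), is:

 import time
 
 # Re-create the cuDF (GPU) frame so the construction itself is what gets timed
 start = time.time()
 gpuDF = cudf.DataFrame({'col_1': np.random.randint(0, 10000000, size=10000000),
                         'col_2': np.random.randint(0, 10000000, size=10000000)})
 print(f"cuDF creation:   {time.time() - start:.3f} s")
 
 # Re-create the pandas (CPU) frame for comparison
 start = time.time()
 pandasDF = pd.DataFrame({'col_1': np.random.randint(0, 10000000, size=10000000),
                          'col_2': np.random.randint(0, 10000000, size=10000000)})
 print(f"pandas creation: {time.time() - start:.3f} s")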

Now let's see if the GPU improves performance by performing some operations on these dataframes!

Logarithmic operations

To get a reliable average, we'll apply the np.log function to one column of each DataFrame and run 10 loops:
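The original timing cells are not shown in the post; a sketch using IPython's %timeit magic with 10 runs per loop (cuDF is assumed to dispatch np.log to the GPU, which it supports for NumPy ufuncs) could look like this:

 # Apply np.log to one column of each DataFrame, 10 runs each
 %timeit -n 10 gpuDF['log_col_1'] = np.log(gpuDF['col_1'])
 %timeit -n 10 pandasDF['log_col_1'] = np.log(pandasDF['col_1'])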

The GPU result is 32.8ms, while the CPU (regular pandas) is 2.55s! GPU-based processing is much faster.

Data type conversion from int to string

Let's test further by converting "col_1" (which contains integer values from 0 to 10M) to strings (object dtype).
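A sketch of this benchmark (again with %timeit; the exact cells are not shown in the post):

 # Cast the integer column to strings on the GPU and on the CPU, 10 runs each
 %timeit -n 10 gpuDF['col_1'].astype('str')
 %timeit -n 10 pandasDF['col_1'].astype('str')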

As you can see, the speed difference is even bigger here.

Linear regression model testing

Training a model can take a long time, and how much the GPU helps depends on the type of model. We will use GPU-based cuML to test a simple model and compare its performance with sklearn.

 import cudf
 from cuml import make_regression, train_test_split
 from cuml.linear_model import LinearRegression as cuLinearRegression
 from cuml.metrics.regression import r2_score
 from sklearn.linear_model import LinearRegression as skLinearRegression

Create dummy data and split it into training and test sets:

 n_samples = 2**20 
 n_features = 399
 random_state = 23
 X, y = make_regression(n_samples=n_samples, n_features=n_features, random_state=random_state)
 X = cudf.DataFrame(X)
 y = cudf.DataFrame(y)[0]
 X_cudf, X_cudf_test, y_cudf, y_cudf_test = train_test_split(X, y, test_size = 0.2, random_state=random_state)
 X_train = X_cudf.to_pandas()
 X_test = X_cudf_test.to_pandas()
 y_train = y_cudf.to_pandas()
 y_test = y_cudf_test.to_pandas()

cuML's make_regression and train_test_split work the same way as sklearn's functions of the same names. The .to_pandas() method converts the GPU data into ordinary pandas DataFrames so that the sklearn model can be trained on the CPU.

Train a sklearn-based model:
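A minimal sketch of the CPU training cell, using the pandas copies produced above (the %%time magic, the variable name ols_sk, and the default parameters are assumptions, not necessarily the author's exact settings):

 %%time
 # Fit scikit-learn's LinearRegression on the CPU using the pandas copies
 ols_sk = skLinearRegression(fit_intercept=True, n_jobs=-1)
 ols_sk.fit(X_train, y_train)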

Training the GPU-based model looks almost identical to training the CPU-based one; only the library changes.
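A corresponding sketch for the GPU model, fitting cuML's LinearRegression directly on the cuDF split so everything stays in GPU memory (again with assumed default parameters and an illustrative variable name):

 %%time
 # Fit cuML's LinearRegression entirely in GPU memory using the cuDF split
 ols_cuml = cuLinearRegression(fit_intercept=True)
 ols_cuml.fit(X_cudf, y_cudf)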

Here it takes 16.2 seconds to train the sklearn model but only 342 milliseconds to train the GPU-based cuML model!
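Since r2_score was imported from cuml.metrics above, both models can also be scored on the held-out data; a sketch of that check (not shown in the original post) might be:

 # Compare accuracy on the test split: sklearn predicts on pandas data, cuML on cuDF data
 r2_sk = r2_score(y_test, ols_sk.predict(X_test))
 r2_cuml = r2_score(y_cudf_test, ols_cuml.predict(X_cudf_test))
 print("sklearn R^2:", float(r2_sk), " cuML R^2:", float(r2_cuml))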

Summary

Pandas and sklearn are among the most commonly used basic libraries. RAPIDS brings the functionality of Pandas and sklearn to the GPU, which is very helpful. If you are interested in these two libraries, you can try them out with reference to the official documentation:

https://avoid.overfit.cn/post/7ee1826e416a40b7965bca9ab4ee28f1

Author: Onepagecode

Origin: blog.csdn.net/m0_46510245/article/details/127180543