Pandas and PySpark join forces to achieve both functionality and speed!

Data scientists and practitioners who use Python for data processing are no strangers to pandas, and many of them (the author included) are heavy pandas users. The first line of code in most projects is import pandas as pd.

Pandas is the go-to tool for data processing, but its shortcomings are just as obvious. Pandas runs only on a single machine and cannot scale with the amount of data. For example, if pandas tries to read a dataset larger than the machine's available memory, it fails with an out-of-memory error.

In addition, pandas is slow on large datasets. Libraries such as Dask and Vaex improve processing speed, but none of them match Spark, the dominant framework for big data processing.

Fortunately, Spark 3.2 ships a new pandas API that brings most of the pandas functionality into PySpark. You keep writing code against the familiar pandas interface while Spark does the work in the background, combining the convenience of pandas with the power of Spark.

It all started at the Spark + AI Summit 2019 with Koalas, an open source project that runs pandas on top of Spark. At first it covered only a small part of the pandas API, but it gradually grew, and in Spark 3.2 Koalas was merged into PySpark.

Spark now integrates the pandas API, so you can run pandas on Spark by changing a single import:

import pyspark.pandas as ps

From this we can gain many advantages:

  • If we are familiar with using Python and Pandas, but not familiar with Spark, we can skip the complicated learning process and use PySpark immediately.
  • One codebase can be used for everything: small and big data, single and distributed machines.
  • You can run Pandas code faster on the Spark distributed framework.

This last point is especially noteworthy.

Not only can pandas code now run as distributed computation, but thanks to the Spark engine it is faster even on a single machine. The figure below compares Spark and plain pandas analyzing a 130 GB CSV dataset on a machine with 96 vCPUs and 384 GiB of memory.

[Figure: performance comparison of pandas vs. the pandas API on Spark on a 130 GB CSV dataset]

Multithreading and the Spark SQL Catalyst optimizer both contribute to the speedup. For example, a join-then-count operation is roughly 4 times faster with whole-stage code generation: 5.9 seconds without it and 1.6 seconds with it.

Spark shines particularly in chained operations. The Catalyst query optimizer recognizes filters and pushes them down to prune data intelligently, and it can fall back to disk-based joins, whereas pandas loads all data into memory at every step.
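To make this concrete, here is a minimal sketch of such a chained pipeline written against the pandas API on Spark; the file names and columns (orders.csv, customers.csv, amount, customer_id, country) are hypothetical and only for illustration.

import pyspark.pandas as ps

# Hypothetical inputs and columns, for illustration only
orders = ps.read_csv("orders.csv")
customers = ps.read_csv("customers.csv")

# The whole chain is optimized as a single Spark query plan:
# the filter can be applied before the join, and the join can spill to disk
result = (
    orders[orders["amount"] > 100]
    .merge(customers, on="customer_id")
    .groupby("country")["amount"]
    .sum()
)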

Are you eager to try out how to write some code using the Pandas API on Spark? Let's get started now!

Switch between Pandas/Pandas-on-Spark/Spark

The first thing to know is what exactly we are working with. Plain pandas uses the class pandas.core.frame.DataFrame, while the pandas API on Spark uses pyspark.pandas.frame.DataFrame. The two are similar but not the same: the main difference is that the former lives on a single machine, while the latter is distributed.
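A quick way to check which of the two you are holding is to inspect the type (a trivial sketch):

import pandas as pd
import pyspark.pandas as ps

# The two classes look alike but live in different modules
print(type(pd.DataFrame({"a": [1, 2]})))   # <class 'pandas.core.frame.DataFrame'>
print(type(ps.DataFrame({"a": [1, 2]})))   # <class 'pyspark.pandas.frame.DataFrame'>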

You can use Pandas-on-Spark to create a DataFrame and convert it to pandas and vice versa:

# import Pandas-on-Spark 
import pyspark.pandas as ps

# Create a DataFrame with Pandas-on-Spark 
ps_df = ps.DataFrame(range(10))

# Convert a Pandas-on-Spark DataFrame to a pandas DataFrame 
pd_df = ps_df.to_pandas()

# Convert a pandas DataFrame to a Pandas-on-Spark DataFrame 
ps_df = ps.from_pandas(pd_df)

Note that when running on multiple machines, converting a Pandas-on-Spark DataFrame to a pandas DataFrame collects the data from all machines onto one machine, and the reverse conversion distributes it again (see the PySpark Guide [1]).

You can also convert a Pandas-on-Spark DataFrame to a Spark DataFrame and vice versa:

# Create a DataFrame with Pandas-on-Spark 
ps_df = ps.DataFrame(range(10))

# Convert a Pandas-on-Spark DataFrame to a Spark DataFrame 
spark_df = ps_df.to_spark()

# Convert a Spark DataFrame to a Pandas-on-Spark DataFrame 
ps_df_new = spark_df.to_pandas_on_spark()

How do data types change?

The data types are basically the same in Pandas-on-Spark and pandas. When converting a Pandas-on-Spark DataFrame to a Spark DataFrame, the data types are automatically cast to the appropriate type (see the PySpark Guide [2]).

The following example shows how data types are mapped when a PySpark DataFrame is converted to a pandas-on-Spark DataFrame.

>>> from decimal import Decimal
>>> from datetime import datetime
>>> sdf = spark.createDataFrame([
...     (1, Decimal(1.0), 1., 1., 1, 1, 1, datetime(2020, 10, 27), "1", True, datetime(2020, 10, 27)),
... ], 'tinyint tinyint, decimal decimal, float float, double double, integer integer, long long, short short, timestamp timestamp, string string, boolean boolean, date date')
>>> sdf
DataFrame[tinyint: tinyint, decimal: decimal(10,0), float: float, double: double, integer: int, long: bigint, short: smallint, timestamp: timestamp, string: string, boolean: boolean, date: date]
>>> psdf = sdf.pandas_api()
>>> psdf.dtypes
tinyint                int8
decimal              object
float               float32
double              float64
integer               int32
long                  int64
short                 int16
timestamp    datetime64[ns]
string               object
boolean                bool
date                 object
dtype: object
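Going in the other direction, converting the pandas-on-Spark DataFrame back with to_spark() yields Spark SQL types again; the output below is abbreviated and illustrative.

>>> sdf2 = psdf.to_spark()
>>> sdf2.dtypes
[('tinyint', 'tinyint'), ('decimal', 'decimal(10,0)'), ('float', 'float'), ...]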

Pandas-on-Spark vs Spark functions

Below, the most commonly used DataFrame operations are shown side by side in Spark and in Pandas-on-Spark. Note that the only syntax difference between Pandas-on-Spark and pandas is the import pyspark.pandas as ps line.

After you read the following content, you will find that even if you are not familiar with Spark, you can easily use it through the Pandas API.

Import library

# Start a Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder \
          .appName("Spark") \
          .getOrCreate()

Running Pandas on Spark

import pyspark.pandas as ps

Read data

Take the classic iris dataset as an example.

# SPARK 
sdf = spark.read.options(inferSchema='True', 
              header='True').csv('iris.csv')
# PANDAS-ON-SPARK 
pdf = ps.read_csv('iris.csv')

Select

# SPARK 
sdf.select("sepal_length","sepal_width").show()
# PANDAS-ON-SPARK 
pdf[["sepal_length","sepal_width"]].head()

Drop a column

# SPARK 
sdf.drop('sepal_length').show()
# PANDAS-ON-SPARK 
pdf.drop('sepal_length').head()

Remove duplicates

# SPARK 
sdf.dropDuplicates(["sepal_length","sepal_width"]).show()
# PANDAS-ON-SPARK 
pdf[["sepal_length", "sepal_width"]].drop_duplicates()

Filter

# SPARK 
sdf.filter( (sdf.flower_type == "Iris-setosa") & (sdf.petal_length > 1.5) ).show()
# PANDAS-ON-SPARK 
pdf.loc[ (pdf.flower_type == "Iris-setosa") & (pdf.petal_length > 1.5) ].head()

Count

# SPARK 
sdf.filter(sdf.flower_type == "Iris-virginica").count()
# PANDAS-ON-SPARK 
pdf.loc[pdf.flower_type == "Iris-virginica"].count()

Unique values

# SPARK 
sdf.select("flower_type").distinct().show()
# PANDAS-ON-SPARK 
pdf["flower_type"].unique()

Sort

# SPARK 
sdf.sort("sepal_length", "sepal_width").show()
# PANDAS-ON-SPARK 
pdf.sort_values(["sepal_length", "sepal_width"]).head()

Group

# SPARK 
sdf.groupBy("flower_type").count().show()
# PANDAS-ON-SPARK 
pdf.groupby("flower_type").count()

Replace

# SPARK 
sdf.replace("Iris-setosa", "setosa").show()
# PANDAS-ON-SPARK 
pdf.replace("Iris-setosa", "setosa").head()

Concatenate

# SPARK 
sdf.union(sdf)
# PANDAS-ON-SPARK 
pdf.append(pdf)

Function application with transform and apply

There are several APIs that let users apply a function to a pandas-on-Spark DataFrame, for example:

DataFrame.transform() 
DataFrame.apply()
DataFrame.pandas_on_spark.transform_batch()  
DataFrame.pandas_on_spark.apply_batch()  
Series.pandas_on_spark.transform_batch()

Each API serves a different purpose and works differently internally.

transform and apply

The main difference between DataFrame.transform() and DataFrame.apply() is that the former must return output of the same length as its input, while the latter does not.

# transform
psdf = ps.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

def pandas_plus(pser):
    return pser + 1  # should always return the same length as the input

psdf.transform(pandas_plus)

# apply
psdf = ps.DataFrame({'a': [1, 2, 3], 'b': [5, 6, 7]})

def pandas_plus(pser):
    return pser[pser % 2 == 1]  # arbitrary length is allowed

psdf.apply(pandas_plus)

In this case, each function takes a pandas Series and the pandas API on Spark computes the functions in a distributed manner as shown below.

[Figure: the function is applied to pandas Series chunks and computed in a distributed manner]

With axis='columns', the function instead receives each row as a pandas Series.

psdf = ps.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

def pandas_plus(pser):
    return sum(pser)  # arbitrary length is allowed

psdf.apply(pandas_plus, axis='columns')

The example above computes the sum of each row, where each row is passed in as a pandas Series.


pandas_on_spark.transform_batch and pandas_on_spark.apply_batch

The batch suffix refers to each chunk of a pandas-on-Spark DataFrame or Series. These APIs slice the pandas-on-Spark DataFrame or Series into chunks and apply the given function, with a pandas DataFrame or Series as both input and output. See the following examples:

psdf = ps.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

def pandas_plus(pdf):
    return pdf + 1  # should always return the same length as the input

psdf.pandas_on_spark.transform_batch(pandas_plus)

psdf = ps.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

def pandas_plus(pdf):
    return pdf[pdf.a > 1]  # arbitrary length is allowed

psdf.pandas_on_spark.apply_batch(pandas_plus)

The functions in both examples take a pandas DataFrame that is a chunk of a pandas-on-Spark DataFrame and output a pandas DataFrame. The pandas API on Spark then combines the resulting pandas DataFrames back into a pandas-on-Spark DataFrame.


Things to note when using pandas API on Spark

Avoid shuffle

Certain operations, such as sort_values, are much harder to carry out in a parallel or distributed environment than in memory on a single machine, because they require sending data to other nodes and exchanging it across the network.
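For example (a minimal sketch), you can see the shuffle in the underlying query plan: sorting introduces an Exchange step.

import pyspark.pandas as ps

psdf = ps.DataFrame({'id': range(10)}).sort_values(by="id")

# Inspect the Spark plan; the Exchange step is the network shuffle
psdf.spark.explain()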

Avoid computing on a single partition

Another common pitfall is computation that ends up on a single partition. Currently, some APIs such as DataFrame.rank use PySpark's Window without a partition specification, which moves all the data onto a single partition on one machine and can cause severe performance degradation. For very large datasets, such APIs should be avoided.
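For example (a minimal sketch), the query plan for rank() shows a Window that moves all rows into one partition, which is harmless on this toy data but dangerous at scale.

import pyspark.pandas as ps

psdf = ps.DataFrame({'id': range(10)})

# rank() uses a Window without a partition specification,
# so all rows end up in a single partition
psdf.rank().spark.explain()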

Don't use duplicate column names

Duplicate column names are not allowed as Spark SQL generally does not allow this. The Pandas API on Spark inherits this behavior. For example, see below:

import pyspark.pandas as ps

psdf = ps.DataFrame({'a': [1, 2], 'b': [3, 4]})
psdf.columns = ["a", "a"]
# Raises: Reference 'a' is ambiguous, could be: a, a.;

Additionally, it is strongly discouraged to use column names that differ only in case. The pandas API on Spark does not allow them by default.

import pyspark.pandas as ps

psdf = ps.DataFrame({'a': [1, 2], 'A': [3, 4]})
# Raises: Reference 'a' is ambiguous, could be: a, a.;

You can, however, turn on spark.sql.caseSensitive in the Spark configuration to enable it, but do so at your own risk.

from pyspark.sql import SparkSession
builder = SparkSession.builder.appName("pandas-on-spark")
builder = builder.config("spark.sql.caseSensitive", "true")
builder.getOrCreate()

import pyspark.pandas as ps
psdf = ps.DataFrame({'a': [1, 2], 'A': [3, 4]})
psdf
   a  A
0  1  3
1  2  4

Use a distributed default index

A common problem for pandas-on-Spark users is performance degradation caused by the default index. When the index is unknown, for example when a Spark DataFrame is converted directly to a pandas-on-Spark DataFrame, a default index is attached.

If you plan to process big data in production, make the default index distributed by configuring it as distributed or distributed-sequence.
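A minimal sketch, assuming the pandas-on-Spark options API: switch the default index type before converting, so that attaching the index does not funnel everything through a single node.

import pyspark.pandas as ps

# Use a distributed default index
# (possible values: 'sequence', 'distributed-sequence', 'distributed')
ps.set_option("compute.default_index_type", "distributed-sequence")

# Converting a Spark DataFrame now attaches a distributed default index
psdf = spark.range(10).pandas_api()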

Use the pandas API on Spark directly

Although the pandas API on Spark covers most of the pandas API, some APIs are not yet implemented or are explicitly unsupported. Even so, use the pandas API on Spark directly whenever possible instead of falling back on workarounds.

For example, the pandas API on Spark does not implement __iter__(), in order to prevent users from collecting all the data from the cluster onto the client (driver) side. Unfortunately, many external APIs, including Python built-ins such as min, max, and sum, require their argument to be iterable. With pandas this works out of the box:

>>> import pandas as pd
>>> max(pd.Series([1, 2, 3]))
3
>>> min(pd.Series([1, 2, 3]))
1
>>> sum(pd.Series([1, 2, 3]))
6

Pandas datasets live on a single machine and can naturally be iterated locally. pandas-on-Spark datasets, however, live on multiple machines and are computed in a distributed fashion: they are hard to iterate locally, and a user might collect the entire dataset to the client without realizing it. Therefore, it is better to stick to the pandas-on-Spark API. The example above can be rewritten as follows:

>>> import pyspark.pandas as ps
>>> ps.Series([1, 2, 3]).max()
3
>>> ps.Series([1, 2, 3]).min()
1
>>> ps.Series([1, 2, 3]).sum()
6

Another common pattern among pandas users is relying on list comprehensions or generator expressions. These, too, assume that the dataset is natively iterable under the hood, which is why they work seamlessly in pandas:

import pandas as pd
data = []
countries = ['London', 'New York', 'Helsinki']
pser = pd.Series([20., 21., 12.], index=countries)
for temperature in pser:
     assert temperature > 0
     if temperature > 1000:
         temperature = None
     data.append(temperature ** 2)

pd.Series(data, index=countries)
London      400.0
New York    441.0
Helsinki    144.0
dtype: float64

With the pandas API on Spark, however, iterating over the Series runs into the same problem described above. The example can instead be rewritten to use the pandas-on-Spark API directly:

import pyspark.pandas as ps
import numpy as np
countries = ['London', 'New York', 'Helsinki']
psser = ps.Series([20., 21., 12.], index=countries)
def square(temperature) -> np.float64:
     assert temperature > 0
     if temperature > 1000:
         temperature = None
     return temperature ** 2

psser.apply(square)
London      400.0
New York    441.0
Helsinki    144.0

Reduce operations on different DataFrames

The pandas API on Spark does not allow operations on different DataFrames (or Series) by default, in order to prevent expensive operations under the hood. Whenever possible, such operations should be avoided.
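If you really do need to combine two different frames, a hedged sketch: the check can be relaxed via the compute.ops_on_diff_frames option, at the cost of an implicit join.

import pyspark.pandas as ps

psser1 = ps.Series([1, 2, 3])
psser2 = ps.Series([4, 5, 6])

# psser1 + psser2 raises an error by default, because combining
# two different pandas-on-Spark objects requires an expensive join.

# Opt in explicitly, use it, then restore the default:
ps.set_option("compute.ops_on_diff_frames", True)
result = psser1 + psser2
ps.reset_option("compute.ops_on_diff_frames")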

Closing thoughts

From now on, we can run pandas on Spark. This brings significant speed improvements over pandas, a gentler learning curve when migrating to Spark, and a single code base for both single-machine and distributed computing.

