Spark DataFrame vs. pandas DataFrame: comparison and conversion

A note up front

A Spark program can use numpy and pandas, as long as they are installed on the machines where it runs.

First, why convert between a Spark DataFrame and a pandas DataFrame

A pandas DataFrame can only operate on a single machine, while a Spark DataFrame can run distributed across a cluster, so converting lets you use whichever fits the data size and the libraries you need.

Second, comparison

See the post "Spark and Pandas in DataFrame contrast", which covers the comparison well.

Third, the conversion

spark -> pandas: pandas_df = spark_df.toPandas()
pandas -> spark: spark_df = spark.createDataFrame(pandas_df)
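One caveat worth knowing: when a Spark integer column containing nulls comes back through toPandas(), it typically arrives in pandas as float64, because a plain pandas int64 column cannot hold missing values. The pandas side of that behavior can be shown without a cluster (the column name "a" and the values are just illustrative):

```python
import pandas as pd

# Simulate the frame toPandas() would hand back for an integer
# column that contains a null: pandas has no missing value for
# plain int64, so the whole column is upcast to float64.
df = pd.DataFrame({"a": [1, 2, None]})
print(df["a"].dtype)  # float64
```

So check dtypes after conversion if downstream code assumes integer columns.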

Because pandas is single-machine, toPandas() is likewise a single-machine operation: it pulls the entire DataFrame to the driver in one step. The version below converts each partition to pandas in parallel on the executors first. (From Spark 2.3 onward, setting spark.sql.execution.arrow.enabled to true also speeds up toPandas() considerably via Apache Arrow.)

import pandas as pd

def _map_to_pandas(rdds):
    # Turn the rows of a single partition into one pandas DataFrame.
    return [pd.DataFrame(list(rdds))]

def topas(df, n_partitions=None):
    if n_partitions is not None:
        df = df.repartition(n_partitions)
    # Convert each partition to pandas on the executors in parallel,
    # then collect the small frames and concatenate them on the driver.
    df_pand = df.rdd.mapPartitions(_map_to_pandas).collect()
    df_pand = pd.concat(df_pand)
    df_pand.columns = df.columns
    return df_pand

pandas_df = topas(spark_df)
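The per-partition step above can be sketched without Spark at all: treat a list of row-lists as the partitions, convert each to a small DataFrame with the same helper logic, and concatenate. The partitions list and column names below are made-up sample data, not part of the original code:

```python
import pandas as pd

def _map_to_pandas(rows):
    # Same shape as the helper above: one partition -> [one DataFrame].
    return [pd.DataFrame(list(rows))]

# Pretend these three lists are the rows of three Spark partitions.
partitions = [[(1, "a"), (2, "b")], [(3, "c")], [(4, "d"), (5, "e")]]

# mapPartitions(...).collect() would yield one small DataFrame per
# partition; sum(..., []) flattens the single-element lists the
# helper returns into one flat list of frames.
frames = sum((_map_to_pandas(p) for p in partitions), [])
df_pand = pd.concat(frames, ignore_index=True)
df_pand.columns = ["id", "letter"]
print(len(df_pand))  # 5
```

This is why the helper returns a list: mapPartitions expects each partition to map to an iterable, so wrapping the DataFrame in a one-element list makes collect() return one frame per partition.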

Reference posts:
"Spark and pandas data conversion"
"Interconversion between pandas and Spark DataFrames"

Fourth, in Spark 2.x, SparkContext is consolidated into SparkSession, which serves as the unified entry point for the entire Spark application.

Reference posts:
"Spark core series: SparkContext"
"Spark 2.0 series: SparkSession explained"



Origin blog.csdn.net/weixin_43469047/article/details/104010581