Conversion between Spark DataFrames and Pandas DataFrames

Spark's DataFrame has been introduced before, along with the conversions between RDD, DataSet, and DataFrame. PySpark can be thought of as the combination of Spark and Python: it also uses DataFrames, and they can likewise be converted to and from RDDs. Python's Pandas library has a DataFrame of its own, a tabular data structure made up of named columns, and in practice the two kinds of DataFrame often need to be converted into each other.

DataFrame conversion between Spark and Pandas

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").getOrCreate()

# Create a pandas DataFrame
df = pd.DataFrame([["zhangsan", 25], ["lisi", 24]], columns=["name", "age"])

# Extract the values and the column names as Python lists
value = df.values.tolist()
column = list(df.columns)

# Convert the pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(value, column)
spark_df.show()

# Convert the Spark DataFrame back to a pandas DataFrame
pd_df = spark_df.toPandas()

print(type(spark_df))  # <class 'pyspark.sql.dataframe.DataFrame'>
print(type(pd_df))     # <class 'pandas.core.frame.DataFrame'>
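
Going through values.tolist() works, but spark.createDataFrame also accepts a pandas DataFrame directly. On Spark 3.x you can additionally enable Apache Arrow to speed up the transfer in both directions (on 2.x the config key was spark.sql.execution.arrow.enabled, and pyarrow must be installed on the driver). A minimal sketch reusing df and spark from above; spark_df2 and pd_df2 are just illustrative names:

# Enable Arrow-based columnar transfer (Spark 3.x key; assumes pyarrow is installed)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# createDataFrame accepts a pandas DataFrame directly, no list detour needed
spark_df2 = spark.createDataFrame(df)
spark_df2.show()

# toPandas() also benefits from Arrow when the setting above is enabled
pd_df2 = spark_df2.toPandas()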

Conversion between Spark DataFrame and RDD

# Build a pandas DataFrame from a dict
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").getOrCreate()

# Note: don't name this variable "str" -- that would shadow the Python builtin
data = {
    "name": ["zhangsan", "lisi", "wangwu", "qianyi"],
    "age": [22, 24, 34, 25]
}
pd_df = pd.DataFrame(data)

# Convert the pandas DataFrame to a Spark DataFrame
value = pd_df.values.tolist()
column = list(pd_df.columns)
spark_df = spark.createDataFrame(value, column)
spark_df.show()

# Convert the Spark DataFrame to an RDD (an RDD of Row objects)
spark_rdd = spark_df.rdd
print(spark_rdd.collect())

# Convert the RDD back to a Spark DataFrame
sp_df = spark.createDataFrame(spark_rdd)
sp_df.show()
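
Note that spark_df.rdd yields an RDD of Row objects, which is why spark.createDataFrame(spark_rdd) can infer the schema. For an RDD of plain tuples you normally supply column names or an explicit schema yourself. A small sketch under that assumption; rdd2, df_a, df_b, and the sample data are illustrative, not from the original post:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# An RDD of plain tuples (hypothetical data for illustration)
rdd2 = spark.sparkContext.parallelize([("zhangsan", 25), ("lisi", 24)])

# Option 1: pass just the column names and let Spark infer the types
df_a = spark.createDataFrame(rdd2, ["name", "age"])
df_a.show()

# Option 2: pass an explicit schema for full control over names and types
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df_b = spark.createDataFrame(rdd2, schema)
df_b.show()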

I won't go over the conversions between RDD, DataSet, and DataFrame again here; they were covered in an earlier post, which you can refer back to.

