Spark's DataFrame, and the conversions among RDD, Dataset, and DataFrame, have been introduced before. PySpark is essentially Spark combined with Python: it also works with DataFrames and supports converting them to and from RDDs. Python's Pandas library has its own DataFrame as well, a tabular data structure made up of multiple columns, and in practice the two kinds of DataFrame often need to be converted back and forth.
DataFrame conversion between Spark and Pandas
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("test").getOrCreate()
# Create a pandas DataFrame
df = pd.DataFrame([["zhangsan", 25], ["lisi", 24]], columns=["name", "age"])
# Extract the values and the column names as Python lists
value = df.values.tolist()
column = list(df.columns)
# Convert the pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(value, column)
spark_df.show()
# Convert the Spark DataFrame back to a pandas DataFrame
pd_df = spark_df.toPandas()
print(type(spark_df))  # <class 'pyspark.sql.dataframe.DataFrame'>
print(type(pd_df))     # <class 'pandas.core.frame.DataFrame'>
Conversion between Spark DataFrame and RDD
# Build a pandas DataFrame
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("test").getOrCreate()
data = {
    "name": ["zhangsan", "lisi", "wangwu", "qianyi"],
    "age": [22, 24, 34, 25]
}
pd_df = pd.DataFrame(data)
# Convert the pandas DataFrame to a Spark DataFrame
value = pd_df.values.tolist()
column = list(pd_df.columns)
spark_df = spark.createDataFrame(value, column)
spark_df.show()
# Convert the Spark DataFrame to an RDD
spark_rdd = spark_df.rdd
print(spark_rdd.collect())
# Convert the RDD back to a Spark DataFrame
sp_df = spark.createDataFrame(spark_rdd)
sp_df.show()
The conversions among RDD, Dataset, and DataFrame themselves were covered in an earlier article, so refer back to that if you need a refresher.