Spark DataFrame in comparison with Pandas

Each comparison below lists the Pandas behaviour first, then the Spark behaviour.

Way of working
Pandas: a single-machine tool with no parallelism; it does not support Hadoop, so large data sets quickly become a bottleneck.
Spark: a distributed computing framework with parallelism built in; data and operations are automatically distributed across the cluster nodes. Where Pandas does in-memory processing on one machine, Spark does distributed data processing. It supports Hadoop and can handle large amounts of data.

Evaluation
Pandas: not lazy-evaluated; operations run immediately.
Spark: lazy-evaluated; transformations are only executed when an action is called.

Memory caching
Pandas: single-machine, in-memory caching only.
Spark: persist() or cache() keeps the underlying RDDs in memory.

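To make the laziness and caching concrete, here is a minimal sketch, assuming a SparkSession named spark and a small made-up two-column data set (the names and ages are purely illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data, just to have something to work with.
df = spark.createDataFrame([("Alice", 30), ("Bob", 19)], ["name", "age"])

adults = df.filter(df["age"] > 21)  # transformation: builds a plan, nothing runs yet
adults.cache()                      # mark the result to be kept in memory
print(adults.count())               # action: triggers execution and populates the cache
print(adults.count())               # a second action reuses the cached data
```
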
DataFrame mutability
Pandas: DataFrames are mutable.
Spark: RDDs are immutable, so Spark DataFrames are immutable as well.

Creation
Pandas:
  conversion from a Spark DataFrame: pandas_df = spark_df.toPandas()
  conversion from a list, dict, or ndarray
  reading CSV files
  reading HDF5 files
  reading Excel files
Spark:
  conversion from a Pandas DataFrame: spark_df = sqlContext.createDataFrame(pandas_df)
  createDataFrame also accepts a list, whose elements may be tuples or dicts
  conversion from existing RDDs
  reading structured data files
  reading JSON data sets
  reading Hive tables
  reading external databases

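A sketch of the creation paths listed above, assuming a SparkSession named spark and hypothetical file names (people.csv, people.json):

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# From a Pandas DataFrame
pandas_df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 19]})
spark_df = spark.createDataFrame(pandas_df)

# From a list of tuples (a list of dicts or an existing RDD also works)
spark_df2 = spark.createDataFrame([("Alice", 30), ("Bob", 19)], ["name", "age"])

# From structured files (hypothetical paths)
csv_df = spark.read.csv("people.csv", header=True, inferSchema=True)
json_df = spark.read.json("people.json")

# Back to Pandas
pandas_again = spark_df.toPandas()
```
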
Index
Pandas: an index is created automatically.
Spark: no index; if you need one, you must add an extra column yourself.

Row structure
Pandas: a row is a Series, part of the Pandas DataFrame structure.
Spark: a row is a Row object, part of the Spark DataFrame structure.

Column structure
Pandas: a column is a Series, part of the Pandas DataFrame structure.
Spark: a column is a Column object, part of the Spark DataFrame structure, e.g. DataFrame[name: string].

Column names
Pandas: duplicate column names are not allowed.
Spark: duplicate column names are allowed; use the alias method when renaming columns.

Add a column
Pandas: df["xx"] = 0
Spark: df.withColumn("xx", 0).show() raises an error; a literal must be wrapped with functions.lit:
from pyspark.sql import functions
df.withColumn("xx", functions.lit(0)).show()

Modify a column
Pandas: if the column df["xx"] already exists, df["xx"] = 1
Spark: if the column df["xx"] already exists, df.withColumn("xx", functions.lit(1)).show() (as above, the literal must be wrapped with functions.lit)

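A short sketch, reusing the spark session and df from the first example above (so the age column is assumed to exist): literals go through functions.lit, while expressions built from existing columns can be passed to withColumn directly.

```python
from pyspark.sql import functions as F

df2 = df.withColumn("flag", F.lit(0))             # add a constant column
df3 = df2.withColumn("flag", F.lit(1))            # replace it with another constant
df4 = df3.withColumn("age_next", df3["age"] + 1)  # derive a column from an existing one
df4.show()
```
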
Display
Pandas: evaluating df outputs the actual contents; there is no tree-style schema output.
Spark: evaluating df does not output the contents, only the schema in the form DataFrame[age: bigint, name: string]; use df.show() to output the contents, df.printSchema() to print the schema as a tree, and df.collect() to bring all rows back to the driver.

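For example, with the same hypothetical df:

```python
print(df)            # prints only the schema, e.g. DataFrame[name: string, age: bigint]
df.show()            # prints the rows as a table
df.printSchema()     # prints the schema as a tree
rows = df.collect()  # pulls all rows back to the driver as a list of Row objects
print(rows[0]["name"])
```
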
Sorting
Pandas: df.sort_index() sorts by the index; df.sort_values() sorts by the values in a column (older Pandas versions used df.sort()).
Spark: df.sort() sorts by the values in a column.

Selection and slicing
Pandas:
  df.name, df[], and df["name"] all output the actual contents
  df[0], df.ix[0]
  df.head(2), df.tail(2)
  slicing: df.ix[:3], df.ix[:"xx"], or df[:"xx"]
  df.loc[] selects by label, df.iloc[] selects by position
Spark:
  df[] and df["name"] do not output the contents; use the show method
  df.select() selects one or more columns, e.g. df.select("name")
  slicing: df.select(df['name'], df['age'] + 1)
  df.first()
  df.head(2) or df.take(2)

Filtering
Pandas: df[df['age'] > 21]
Spark: df.filter(df['age'] > 21) or df.where(df['age'] > 21)

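A sketch combining selection and filtering on the same hypothetical df:

```python
df.select("name").show()                     # select one column
df.select(df["name"], df["age"] + 1).show()  # projection with an expression
df.filter(df["age"] > 21).show()             # filter by a condition
df.where(df["age"] > 21).show()              # where is an alias for filter
print(df.first())                            # first Row
print(df.take(2))                            # first two Rows as a list
```
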
Grouping and aggregation
Pandas:
  df.groupby("age")
  df.groupby("A")["B"].mean()  (Pandas groups use mean rather than avg)
Spark:
  df.groupBy("age")
  df.groupBy("A").avg("B").show() applies a single function
  from pyspark.sql import functions
  df.groupBy("A").agg(functions.avg("B"), functions.min("B"), functions.max("B")).show() applies multiple functions

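The same aggregation side by side, with made-up data and hypothetical column names A (group key) and B (numeric), reusing the spark session from the first sketch:

```python
import pandas as pd
from pyspark.sql import functions as F

pdf = pd.DataFrame({"A": ["x", "x", "y"], "B": [1, 2, 3]})  # made-up data
sdf = spark.createDataFrame(pdf)

# Pandas: several statistics of B per group
print(pdf.groupby("A")["B"].agg(["mean", "min", "max"]))

# Spark: the equivalent with agg()
sdf.groupBy("A").agg(F.avg("B"), F.min("B"), F.max("B")).show()
```
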
Statistics
Pandas: df.count() returns the number of non-null values in each column; df.describe() reports count, mean, std, min, 25%, 50%, 75%, and max for each numeric column.
Spark: df.count() returns the total number of rows; df.describe() reports count, mean, stddev, min, and max.

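The count difference is easy to trip over; a small sketch with one missing value (made-up data, spark session as above):

```python
import pandas as pd

pdf = pd.DataFrame({"age": [30.0, None, 19.0]})
print(pdf.count())   # non-null values per column: age -> 2

sdf = spark.createDataFrame(pdf)
print(sdf.count())   # total number of rows: 3
```
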
Merging
Pandas:
  concat supports concatenation along an axis
  merge supports joining on multiple columns
  columns with the same name automatically receive suffixes, and only one copy of the join key is kept
  df.append() supports appending rows
Spark:
  joins go through df.join()
  columns with the same name do not receive suffixes automatically, and a single copy of the key is kept only when the key values match exactly
  df.join() supports joining on multiple columns

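A join sketch with two made-up Spark DataFrames that share an id column; passing the column name (or a list of names) to on keeps a single copy of the key:

```python
left = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
right = spark.createDataFrame([(1, 85), (3, 70)], ["id", "score"])

# Joining on the column name keeps one "id" column in the result.
joined = left.join(right, on="id", how="inner")
joined.show()

# Multiple join keys are passed as a list, e.g. on=["id", "name"].
```
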
Missing data
Pandas: missing values are automatically filled in as NaN; fillna: df.fillna(); dropna: df.dropna()
Spark: NaNs are not added automatically, and no error is raised; fillna: df.na.fill(); dropna: df.na.drop()

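A sketch of the null-handling calls, with made-up data containing missing values (spark session as above):

```python
import pandas as pd

pdf = pd.DataFrame({"name": ["Alice", None], "age": [30.0, None]})
pdf_filled = pdf.fillna({"name": "unknown", "age": 0})
pdf_clean = pdf.dropna()

sdf = spark.createDataFrame([("Alice", 30), (None, None)], "name string, age int")
sdf_filled = sdf.na.fill({"name": "unknown", "age": 0})
sdf_clean = sdf.na.drop()
sdf_filled.show()
```
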
SQL statements
Pandas:
  import sqlite3
  pd.read_sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19", conn)  (conn is a database connection, e.g. created with sqlite3.connect)
Spark:
  Table registration: register the DataFrame as a table that SQL statements can query
  df.registerTempTable("people") or sqlContext.registerDataFrameAsTable(df, "people")
  sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")
  Function registration: register a function so that SQL statements can call it
  sqlContext.registerFunction("stringLengthString", lambda x: len(x))
  sqlContext.sql("SELECT stringLengthString('test')")

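registerTempTable and registerFunction are the older SQLContext spellings; on Spark 2 and later the equivalents are createOrReplaceTempView and spark.udf.register. A sketch with the newer names, reusing the df and spark session from the earlier sketches:

```python
from pyspark.sql.types import IntegerType

df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19").show()

# Register a Python function so SQL statements can call it
spark.udf.register("stringLengthInt", lambda x: len(x), IntegerType())
spark.sql("SELECT stringLengthInt(name) FROM people").show()
```
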
Converting between the two
pandas_df = spark_df.toPandas()
spark_df = sqlContext.createDataFrame(pandas_df)

Applying functions
Pandas: df.apply(f) applies the function f to each column of df.
Spark: df.foreach(f) or df.rdd.foreach(f) applies f to each row of df; df.foreachPartition(f) or df.rdd.foreachPartition(f) applies f to each partition.

Map-reduce operations
Pandas / plain Python: map(func, list) and functools.reduce(func, list) return sequences or values.
Spark: df.rdd.map(func) and df.rdd.reduce(func) operate on the underlying RDD (in PySpark, map and reduce are RDD methods rather than DataFrame methods).

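A sketch of per-row application and map-reduce on the same hypothetical df; foreach runs a side-effecting function on the executors, while map and reduce are reached through the underlying RDD in PySpark:

```python
from functools import reduce

# Plain Python map/reduce
squares = list(map(lambda x: x * x, [1, 2, 3]))
total = reduce(lambda a, b: a + b, [1, 2, 3])

# Spark: side effects per row, and map/reduce on the RDD
df.foreach(lambda row: None)                        # apply a function to each Row
ages = df.rdd.map(lambda row: row["age"]).collect()
age_sum = df.rdd.map(lambda row: row["age"]).reduce(lambda a, b: a + b)
```
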
diff operations
Pandas: has a diff operation for time-series data; it compares each row with the previous row.
Spark: has no diff operation; rows are independent of one another and stored in a distributed fashion.

Origin blog.csdn.net/weixin_43064185/article/details/103909256