Spark DataFrame in comparison with Pandas

| | Pandas | Spark |
| --- | --- | --- |
| Way of working | Single-machine tool, no parallelism; does not support Hadoop, so large data sets become a bottleneck | Distributed parallel computing framework with parallelism built in; all data and operations are automatically distributed in parallel across the cluster nodes. Performs distributed data processing in memory. Supports Hadoop and can handle large amounts of data |
| Delay mechanism | Not lazy-evaluated | Lazy-evaluated (see the sketch below) |
| Memory cache | Single-machine caching | persist() or cache() keeps the underlying RDDs in memory |
| DataFrame mutability | Pandas DataFrames are mutable | Spark RDDs are immutable, so Spark DataFrames are not mutable either |
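To make the delay mechanism concrete, here is a minimal sketch, assuming a local PySpark installation (the session name and data are illustrative): Pandas executes every step immediately, while Spark only records transformations until an action such as show() or count() triggers execution, at which point cache() can keep the computed result in memory.

```python
# Sketch: eager Pandas vs. lazy Spark, plus in-memory caching
# (session name and data are illustrative).
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

# Pandas: every step executes immediately.
pdf = pd.DataFrame({"age": [15, 22, 40]})
doubled = pdf["age"] * 2                    # computed right now

# Spark: transformations only build a plan; nothing runs yet.
sdf = spark.createDataFrame([(15,), (22,), (40,)], ["age"])
plan = sdf.selectExpr("age * 2 AS age2")    # lazy: no job is launched here
plan.cache()                                # mark the result for in-memory caching
plan.show()                                 # first action: executes the plan, fills the cache
print(plan.count())                         # second action: served from the cache
```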
| | Pandas | Spark |
| --- | --- | --- |
| Create (see the sketch below) | Conversion from a Spark DataFrame: pandas_df = spark_df.toPandas() | Conversion from a Pandas DataFrame: spark_df = sqlContext.createDataFrame(pandas_df). createDataFrame also supports building a spark_df from a list, whose elements may be tuples or dicts, and from an RDD |
| | Conversion from list, dict, ndarray | Conversion from existing RDDs |
| | Reading CSV data sets | Reading structured data files |
| | Reading HDF5 | Reading JSON data sets |
| | Reading Excel | Reading Hive tables |
| | | Reading external databases |
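A minimal sketch of the creation paths in this block, assuming Spark 2.x or later, where the SparkSession entry point replaces the older sqlContext (the sample data is illustrative):

```python
# Sketch: the creation paths from the table, using the Spark 2.x+
# SparkSession entry point instead of the older sqlContext.
import pandas as pd
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("create-demo").getOrCreate()

# From a Pandas DataFrame.
pandas_df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [15, 22]})
spark_df = spark.createDataFrame(pandas_df)

# From a list of tuples (with explicit column names) or of Rows.
spark_df2 = spark.createDataFrame([("Alice", 15), ("Bob", 22)], ["name", "age"])
spark_df3 = spark.createDataFrame([Row(name="Alice", age=15)])

# From an existing RDD.
rdd = spark.sparkContext.parallelize([("Alice", 15), ("Bob", 22)])
spark_df4 = spark.createDataFrame(rdd, ["name", "age"])

# And back to Pandas.
pandas_again = spark_df.toPandas()
```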
| | Pandas | Spark |
| --- | --- | --- |
| Index | Created automatically | No index; if one is needed, an extra column must be created |
| Row structure | Series, a structure belonging to the Pandas DataFrame | Row, a structure belonging to the Spark DataFrame |
| Column structure | Series, a structure belonging to the Pandas DataFrame | Column, a structure belonging to the Spark DataFrame, e.g. DataFrame[name: string] |
| Column names | Duplicate names are not allowed | Duplicate names are allowed; rename a column with the alias method |
| Add a column | df["xx"] = 0 | df.withColumn("xx", 0).show() raises an error; instead use from pyspark.sql import functions and df.withColumn("xx", functions.lit(0)).show() (see the sketch below) |
| Modify a column | Given an existing df["xx"] column: df["xx"] = 1 | Given an existing "xx" column: df.withColumn("xx", functions.lit(1)).show() |
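A short sketch of the add-column and modify-column rows; the data is illustrative. The key point is that withColumn() expects a Column as its second argument, so plain literals must be wrapped in functions.lit():

```python
# Sketch: adding and modifying columns in Spark (names/data are illustrative).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("withcolumn-demo").getOrCreate()
spark_df = spark.createDataFrame([("Alice", 15), ("Bob", 22)], ["name", "age"])

# spark_df.withColumn("xx", 0) raises an error: the second argument must be
# a Column, so wrap literals with functions.lit().
with_flag = spark_df.withColumn("xx", F.lit(0))              # add a constant column
modified = with_flag.withColumn("xx", F.lit(1))              # overwrite the same column
derived = modified.withColumn("age2", modified["age"] + 1)   # derive from another column
derived.show()
```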
| | Pandas | Spark |
| --- | --- | --- |
| Display | df prints its contents directly | df does not print its contents, only the schema, in the form DataFrame[age: bigint, name: string]; print the contents with the show method |
| | df prints the contents | df.show() prints the contents |
| | No tree-form summary output | Print the schema as a tree: df.printSchema() |
| | | df.collect() returns all rows to the driver |
| Sorting | df.sort_index() sorts by axis labels | |
| | df.sort() sorts by the values in a column (df.sort_values() in newer Pandas) | df.sort() sorts by the values in a column |
| Select or slice | df.name prints the column contents | df["name"] does not print the contents; print them with the show method |
| | df[] and df["name"] print the contents | df.select() selects one or more columns: df.select("name"); slicing: df.select(df['name'], df['age'] + 1) |
| | df[0], df.ix[0] | df.first() |
| | df.head(2) | df.head(2) or df.take(2) |
| | df.tail(2) | |
| | Slicing: df.ix[:3], df.ix[:"xx"], df[:"xx"] | |
| | df.loc[] selects by label | |
| | df.iloc[] selects by position | |
| Filter | df[df['age'] > 21] | df.filter(df['age'] > 21) or df.where(df['age'] > 21) (see the sketch below) |
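A quick sketch of selection and filtering on the Spark side (illustrative data):

```python
# Sketch: selection, slicing and filtering in Spark (data is illustrative).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("select-demo").getOrCreate()
spark_df = spark.createDataFrame(
    [("Alice", 15), ("Bob", 22), ("Carol", 40)], ["name", "age"])

spark_df.select("name").show()                                  # one column
spark_df.select(spark_df["name"], spark_df["age"] + 1).show()   # derived column
spark_df.filter(spark_df["age"] > 21).show()                    # filter ...
spark_df.where(spark_df["age"] > 21).show()                     # ... where() is an alias
print(spark_df.first())                                         # first Row
print(spark_df.head(2))                                         # list of the first 2 Rows
print(spark_df.take(2))                                         # same as head(2)
```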
| | Pandas | Spark |
| --- | --- | --- |
| Aggregation | df.groupby("age")<br>df.groupby("A").avg("B") | df.groupBy("age")<br>df.groupBy("A").avg("B").show() applies a single function<br>from pyspark.sql import functions<br>df.groupBy("A").agg(functions.avg("B"), functions.min("B"), functions.max("B")).show() applies multiple functions (see the sketch below) |
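A sketch of the aggregation row; the column names A and B follow the table, and the data is illustrative:

```python
# Sketch: grouped aggregation with one or several functions
# (columns "A" and "B" mirror the table; data is illustrative).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("agg-demo").getOrCreate()
spark_df = spark.createDataFrame(
    [("x", 1.0), ("x", 3.0), ("y", 2.0)], ["A", "B"])

spark_df.groupBy("A").avg("B").show()          # a single built-in function
spark_df.groupBy("A").agg(                     # several functions at once
    F.avg("B"), F.min("B"), F.max("B")).show()

# Pandas equivalent, for comparison:
# pdf.groupby("A")["B"].agg(["mean", "min", "max"])
```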
| | Pandas | Spark |
| --- | --- | --- |
| Statistics | df.count() returns the number of non-null rows in each column | df.count() returns the total number of rows |
| | df.describe() reports count, mean, std, min, 25%, 50%, 75%, max for the columns | df.describe() reports count, mean, stddev, min, max for the columns |
| Merging | Pandas has a concat method supporting axis-wise concatenation | |
| | Pandas has a merge method supporting multi-column joins;<br>columns with the same name automatically get a suffix, and only one copy of the join key is kept | Spark has a join method, df.join();<br>columns with the same name do not get a suffix automatically, and a single copy is kept only when the key values match exactly |
| | df.join() supports multi-column joins | |
| | df.append() supports row-wise concatenation (deprecated in newer Pandas in favor of concat) | |
| Missing data | NaNs are inserted automatically for missing data | NaNs are not inserted automatically, and no error is thrown |
| | fillna: df.fillna() | fillna: df.na.fill() (see the sketch below) |
| | dropna: df.dropna() | dropna: df.na.drop() |
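A sketch of the missing-data rows (illustrative data). Note that Spark also accepts df.fillna() and df.dropna() directly as aliases for the na.* methods:

```python
# Sketch: missing-data handling on both sides (data is illustrative).
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("na-demo").getOrCreate()

pdf = pd.DataFrame({"age": [15.0, None, 40.0]})
print(pdf.fillna(0))                 # Pandas: replace NaNs with 0
print(pdf.dropna())                  # Pandas: drop rows containing NaNs

spark_df = spark.createDataFrame([(15.0,), (None,), (40.0,)], ["age"])
spark_df.na.fill(0.0).show()         # Spark: fill nulls (df.fillna() is an alias)
spark_df.na.drop().show()            # Spark: drop rows with nulls (df.dropna() is an alias)
```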
| | Pandas | Spark |
| --- | --- | --- |
| SQL statements | import sqlite3<br>pd.read_sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19", conn), where conn is a database connection | Table registration: register the DataFrame so it can be queried with SQL<br>df.registerTempTable("people") or sqlContext.registerDataFrameAsTable(df, "people")<br>sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19") (see the sketch below) |
| | | Function registration: register a function so it can be used in SQL<br>sqlContext.registerFunction("stringLengthString", lambda x: len(x))<br>sqlContext.sql("SELECT stringLengthString('test')") |
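A sketch of SQL over a Spark DataFrame. registerTempTable and sqlContext.registerFunction are the Spark 1.x names; the version below uses their Spark 2.x+ equivalents, createOrReplaceTempView and spark.udf.register:

```python
# Sketch: SQL over a Spark DataFrame, using the Spark 2.x+ names
# createOrReplaceTempView / spark.udf.register in place of the older
# registerTempTable / sqlContext.registerFunction shown in the table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()
spark_df = spark.createDataFrame([("Alice", 15), ("Bob", 22)], ["name", "age"])

spark_df.createOrReplaceTempView("people")     # table registration
spark.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19").show()

# Function registration: the default UDF return type is string.
spark.udf.register("stringLengthString", lambda x: len(x))
spark.sql("SELECT stringLengthString('test')").show()
```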
| | Pandas | Spark |
| --- | --- | --- |
| Mutual conversion | pandas_df = spark_df.toPandas() | spark_df = sqlContext.createDataFrame(pandas_df) |
| Function application | df.apply(f) applies the function f to each column of df (see the sketch below) | df.foreach(f) or df.rdd.foreach(f) applies f to each row of df<br>df.foreachPartition(f) or df.rdd.foreachPartition(f) applies f to each partition of df |
| map-reduce | map(func, list) and reduce(func, list); the result is a sequence | df.rdd.map(func) and df.rdd.reduce(func); map returns an RDD, reduce a single value |
| diff | Has a diff operation for time-series data (Pandas compares each row with the previous one) | No diff operation (rows in Spark are independent of each other and stored in a distributed fashion) |
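Finally, a sketch of the function-application row (illustrative data): Pandas apply() runs the function once per column on the driver, while Spark foreach() runs it once per row on the executors, and map/reduce go through the underlying RDD:

```python
# Sketch: function application and map-reduce on both sides
# (data is illustrative).
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("apply-demo").getOrCreate()

pdf = pd.DataFrame({"age": [15, 22, 40]})
print(pdf.apply(lambda col: col.max()))          # Pandas: f runs once per column

spark_df = spark.createDataFrame([(15,), (22,), (40,)], ["age"])
spark_df.foreach(lambda row: print(row))         # Spark: f runs once per row, on the executors

ages = spark_df.rdd.map(lambda row: row["age"])  # map over rows -> new RDD
print(ages.reduce(lambda a, b: a + b))           # reduce the RDD to a single value
```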
Source: blog.csdn.net/weixin_43064185/article/details/103909256