PySpark DataFrame API at a glance

A quick tour of what DataFrame already provides, so you do not end up reimplementing it.

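Version: Spark 2.2 (note: a few of the methods listed here, e.g. exceptAll, intersectAll and repartitionByRange, only appeared in Spark 2.4)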

Correlation
cov sample covariance of two columns
corr Pearson correlation coefficient of two columns

Deletion
dropDuplicates can be restricted to a subset of columns
dropna can be restricted to a subset of columns

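Selection
select
selectExpr select that accepts SQL expressions
colRegex select columns by regular expression
where
filter
exceptAll rows in df1 that are not in df2, duplicates preserved
union union by column position (not by name), duplicates preserved
unionByName union by column name
subtract set difference, deduplicated
intersect intersection, deduplicated
intersectAll intersection, duplicates preserved
limit
first
head
take
randomSplit split by the given weights
sample random sampling
sampleBy stratified sampling by column value; only a single column is supported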

Sorting
orderBy
sort accepts several call styles, see the appendix
sortWithinPartitions

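Storage
cache
registerTempTable deprecated since Spark 2.0, use createOrReplaceTempView instead
coalesce reduces the number of partitions without a full shuffle; compare with repartition!
repartition changes the number of partitions (full shuffle)
repartitionByRange range-partitions by the given expressions
write supports several save modes and a custom number of partitions; you can also write your own connector, e.g. a TFRecord connector
writeStream
toDF
toJSON
toPandas requires pandas to be installed (on the executors, per this note); uploading it yourself as a package may fail because of size limits
checkpoint saves a checkpointed version of the DataFrame
localCheckpoint
persist
unpersist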

Modification
withColumn
withColumnRenamed
replace supports replacing several values at once; the old and new values must have the same type
fillna
na

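Display / statistics
show truncate cuts off long values in each column, vertical prints rows vertically
schema
stat
dtypes
describe compare with summary
summary
printSchema
rollup groupBy and agg over hierarchical combinations of the given columns (cube covers all combinations)
freqItems
foreach shorthand for df.rdd.foreach()
foreachPartition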

Aggregation
groupby
agg

Other
toLocalIterator
approxQuantile
explain
hint
isLocal
isStreaming
withWatermark

Appendix
sort


>>> df.sort(df.age.desc()).collect()
[Row(age=5, name='Bob'), Row(age=2, name='Alice')]
>>> df.sort("age", ascending=False).collect()
[Row(age=5, name='Bob'), Row(age=2, name='Alice')]
>>> df.orderBy(df.age.desc()).collect()
[Row(age=5, name='Bob'), Row(age=2, name='Alice')]
>>> from pyspark.sql.functions import asc, desc
>>> df.sort(asc("age")).collect()
[Row(age=2, name='Alice'), Row(age=5, name='Bob')]
>>> df.orderBy(desc("age"), "name").collect()
[Row(age=5, name='Bob'), Row(age=2, name='Alice')]
>>> df.orderBy(["age", "name"], ascending=[0, 1]).collect()
[Row(age=5, name='Bob'), Row(age=2, name='Alice')]
