PySpark RDD operations explained: zip, zipWithUniqueId, and zipWithIndex

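1. zip(other)
Zips this RDD with another one, returning key-value pairs made of the first element of each RDD, the second element of each RDD, and so on. It assumes that the two RDDs have the same number of partitions and the same number of elements in each partition (e.g. one was made through a map on the other).
In short, zip pairs two RDDs element by element and returns (key, value) tuples.
Prerequisite: the two RDDs must have the same number of partitions, and corresponding partitions must hold the same number of elements.

Example: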

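from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # the later examples reuse this context

x = sc.parallelize(range(5), 2)
y = sc.parallelize(range(1000, 1005), 2)
# glom() groups the elements by partition, which makes the layout visible
a = x.zip(y).glom().collect()
print(a)
a = x.zip(y).collect()
print(a)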

Output:
[[(0, 1000), (1, 1001)], [(2, 1002), (3, 1003), (4, 1004)]]
[(0, 1000), (1, 1001), (2, 1002), (3, 1003), (4, 1004)]
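The prerequisite on partitions can be illustrated without a cluster. The following is a minimal pure-Python sketch, not Spark's implementation: `zip_partitions` is a hypothetical helper that treats each RDD as a list of partitions and only models the documented rule that both the partition count and the per-partition element count must match.

```python
def zip_partitions(left, right):
    """Pair two partitioned datasets partition by partition.

    `left` and `right` are lists of partitions (each partition is a list).
    Hypothetical helper: it only models the precondition described for zip().
    """
    if len(left) != len(right):
        raise ValueError("different number of partitions")
    zipped = []
    for part_l, part_r in zip(left, right):
        if len(part_l) != len(part_r):
            raise ValueError("different number of elements in a partition")
        zipped.append(list(zip(part_l, part_r)))
    return zipped


# Partition layout taken from the glom() output of the example.
x_parts = [[0, 1], [2, 3, 4]]
y_parts = [[1000, 1001], [1002, 1003, 1004]]
print(zip_partitions(x_parts, y_parts))

# Same elements, but split differently: the precondition is violated.
try:
    zip_partitions(x_parts, [[1000, 1001, 1002], [1003, 1004]])
except ValueError as err:
    print("cannot zip:", err)
```

In real Spark code, deriving one RDD from the other with map is the usual way to guarantee matching partitions.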

2. zipWithIndex()
Zips this RDD with its element indices.
The ordering is first based on the partition index and then on the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index.
Returns: (element, index) pairs.
The indices do not depend on how the data is partitioned: each index equals the element's position in the original list.

Example:
a=sc.parallelize(list('abczyx'),1).zipWithIndex().glom().collect()
print(a)

a=sc.parallelize(list('abczyx'),2).zipWithIndex().glom().collect()
print(a)

a=sc.parallelize(list('abczyx'),6).zipWithIndex().glom().collect()
print(a)

Output:

[[('a', 0), ('b', 1), ('c', 2), ('z', 3), ('y', 4), ('x', 5)]]
[[('a', 0), ('b', 1), ('c', 2)], [('z', 3), ('y', 4), ('x', 5)]]
[[('a', 0)], [('b', 1)], [('c', 2)], [('z', 3)], [('y', 4)], [('x', 5)]]
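Why the indices come out the same for every partition count can be sketched in plain Python. This is a simplified model under the assumption that an RDD is just a list of partitions; `zip_with_index` is a hypothetical helper, not the PySpark source. The idea is that each partition starts counting where the previous partitions ended.

```python
def zip_with_index(partitions):
    """Model of zipWithIndex on a list of partitions (hypothetical helper)."""
    # Step 1: count the elements of every partition. Needing these counts is
    # why Spark has to run a job when there is more than one partition.
    sizes = [len(part) for part in partitions]
    # Step 2: the first index of partition p is the total size of partitions 0..p-1.
    starts = [0]
    for size in sizes[:-1]:
        starts.append(starts[-1] + size)
    # Step 3: index = start offset of the partition + position inside it.
    return [
        [(item, start + pos) for pos, item in enumerate(part)]
        for part, start in zip(partitions, starts)
    ]


even = zip_with_index([list("abc"), list("zyx")])
uneven = zip_with_index([list("ab"), list("czyx")])
print(even)
print(uneven)
```

Both layouts assign 'a' to 0 and 'x' to 5; only the grouping into partitions differs.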

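3. zipWithUniqueId()

Zips this RDD with generated unique Long ids.

Items in the kth partition get the ids k, n+k, 2n+k, ..., where n is the number of partitions. So there may be gaps in the ids, but unlike zipWithIndex, this method does not trigger a Spark job.
Returns (element, id) pairs; the ids depend on how the data is partitioned.
In the sequence k, n+k, 2n+k, ...:
n is the total number of partitions;
k is the index of the partition that holds the element;
the element at position i within partition k gets the id i*n + k;
both k and i are counted from 0.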
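The id rule and the gaps it can produce are easy to check in plain Python. The sketch below is a model under the assumption that an RDD is a list of partitions; `zip_with_unique_id` is a hypothetical helper, not the PySpark source. It uses n for the number of partitions, k for the partition index, and i for the position inside the partition, so each id is i*n + k.

```python
def zip_with_unique_id(partitions):
    """Model of zipWithUniqueId: element i of partition k gets id i*n + k."""
    n = len(partitions)  # number of partitions
    return [
        [(item, i * n + k) for i, item in enumerate(part)]
        for k, part in enumerate(partitions)
    ]


# Partitions of unequal size: one element in partition 0, three in partition 1.
uneven = zip_with_unique_id([["a"], ["b", "c", "d"]])
print(uneven)

# Collect the ids that were handed out; 2 and 4 are never used.
ids = sorted(uid for part in uneven for _, uid in part)
print(ids)
```

No partition needs to know the size of any other partition, which is consistent with the method not needing a separate counting job.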

Example: one partition

a=sc.parallelize(list('abczyx'),1).zipWithUniqueId().glom().collect()
print(a)

With a single partition the result is the same as for zipWithIndex().

Output:
[[('a', 0), ('b', 1), ('c', 2), ('z', 3), ('y', 4), ('x', 5)]]

Two partitions

Code:
rdd=sc.parallelize(list('abczyx'),2)
print(rdd.glom().collect())
a=rdd.zipWithUniqueId().glom().collect()
print(a)

Output:

[['a', 'b', 'c'], ['z', 'y', 'x']]
[[('a', 0), ('b', 2), ('c', 4)], [('z', 1), ('y', 3), ('x', 5)]]


Four partitions


rdd=sc.parallelize(list('45abczyx'),4)
print(rdd.glom().collect())
a=rdd.zipWithUniqueId().glom().collect()
print(a)

Output:
[['4', '5'], ['a', 'b'], ['c', 'z'], ['y', 'x']]
[[('4', 0), ('5', 4)], [('a', 1), ('b', 5)], [('c', 2), ('z', 6)], [('y', 3), ('x', 7)]]

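Analysis:

With n = 4 partitions, the element at position i (counting from 0) in partition k gets the id i*4 + k. The first elements of partitions 0 to 3 therefore get the ids 0, 1, 2, 3, and the second elements get the ids 4, 5, 6, 7. For example, 'z' is at position 1 in partition 2, so its id is 1*4 + 2 = 6.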
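The same rule can be read backwards: given an id and the number of partitions, integer division recovers where the element sat. The sketch below is plain Python with a hypothetical helper `locate`; the layout and the ids are copied from the four-partition output.

```python
def locate(unique_id, num_partitions):
    """Return (partition index, position) for an id of the form position*n + partition."""
    position, partition = divmod(unique_id, num_partitions)
    return partition, position


# Partition layout printed by rdd.glom().collect() in the four-partition example.
layout = [['4', '5'], ['a', 'b'], ['c', 'z'], ['y', 'x']]

for element, uid in [('z', 6), ('5', 4), ('y', 3)]:
    k, i = locate(uid, len(layout))
    # layout[k][i] should be the element that carried this id
    print(element, uid, "-> partition", k, "position", i, "->", layout[k][i])
```

The remainder gives the partition and the quotient gives the position, so an id identifies exactly one slot in the layout.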
