4. RDD Programming API
4.1 RDD operator classification
Transformation: creates a new dataset from an existing one; the computation returns a new RDD. Example: map applies a function to each element of an RDD and produces a new RDD.
Action: runs a computation on the RDD and returns a value to the driver program, or writes the computed result to an external storage system (e.g. HDFS).
Example: collect gathers all elements of the dataset and returns them to the driver.
4.2 Transformation
All RDD transformations are lazy: they do not compute their results immediately. Instead, they simply remember the transformations applied to a base dataset (e.g. a file). The transformations actually run only when an action requires a result to be returned to the driver or written to external storage. This design allows Spark to run more efficiently.
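The transformation/action split can be illustrated without a cluster. This is an analogy, not Spark code: a plain Scala collection view likewise records a `map` without running it, and only a terminal operation (playing the role of an action) forces the computation.

```scala
// Plain-Scala sketch of lazy transformations (an analogy for RDD behavior).
object LazySketch {
  var applied = 0 // counts how many times the mapped function has actually run

  // "Transformation": a view records the map but does not execute it yet.
  val mapped = List(1, 2, 3, 4).view.map { x => applied += 1; x * 2 }

  def main(args: Array[String]): Unit = {
    println(s"after map: applied = $applied") // still 0 -- nothing has run
    val total = mapped.sum                    // "action": forces the map to run
    println(s"after sum: applied = $applied, total = $total") // applied = 4, total = 20
  }
}
```

Before `sum` is called, the mapping function has run zero times; the terminal call triggers one full traversal, just as an action triggers evaluation of an RDD's recorded transformations.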
Commonly used transformations:

Transformation | Meaning
map(func) | Returns a new RDD formed by passing each element of the source through the function func
filter(func) | Returns a new RDD formed by selecting those elements of the source on which func returns true
flatMap(func) | Similar to map, but each input element can be mapped to 0 or more output elements (func should return a sequence rather than a single element)
mapPartitions(func) | Similar to map, but runs separately on each partition of the RDD; when running on an RDD of type T, func must be of type Iterator[T] => Iterator[U]
mapPartitionsWithIndex(func) | Similar to mapPartitions, but func also takes an integer parameter giving the partition index; when running on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U]
union(otherDataset) | Returns a new RDD containing the union of the elements of the source RDD and the argument RDD
intersection(otherDataset) | Returns a new RDD containing the intersection of the elements of the source RDD and the argument RDD
distinct([numTasks]) | Returns a new RDD containing the distinct elements of the source RDD
groupByKey([numTasks]) | When called on an RDD of (K, V) pairs, returns an RDD of (K, Iterable[V]) pairs
reduceByKey(func, [numTasks]) | When called on an RDD of (K, V) pairs, returns an RDD of (K, V) pairs in which the values for each key are aggregated with the given reduce function; as with groupByKey, the number of reduce tasks can be set through an optional second argument
sortByKey([ascending], [numTasks]) | When called on an RDD of (K, V) pairs where K implements the Ordered interface, returns an RDD of (K, V) pairs sorted by key
sortBy(func, [ascending], [numTasks]) | Similar to sortByKey, but more flexible: sorts by the result of applying func to each element
join(otherDataset, [numTasks]) | When called on RDDs of type (K, V) and (K, W), returns an RDD of (K, (V, W)) pairs containing all pairs of elements for each matching key
cogroup(otherDataset, [numTasks]) | When called on RDDs of type (K, V) and (K, W), returns an RDD of type (K, (Iterable[V], Iterable[W]))
coalesce(numPartitions) | Decreases the number of partitions in the RDD to numPartitions
repartition(numPartitions) | Reshuffles the data in the RDD to produce exactly numPartitions partitions
repartitionAndSortWithinPartitions(partitioner) | Repartitions the RDD according to the given partitioner and, within each resulting partition, sorts records by their keys
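Running these operators for real requires Spark, but the per-key behavior of groupByKey and reduceByKey can be sketched with plain Scala collections. This is an illustrative analogy, not Spark code; the pair data is made up for the example.

```scala
// Plain-Scala sketch of the (K, V) operators above; no cluster needed.
object PairOpsSketch {
  val pairs = List(("tom", 1), ("jerry", 3), ("tom", 2))

  // groupByKey: (K, V) pairs => (K, all values for K)
  val grouped: Map[String, List[Int]] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

  // reduceByKey(_ + _): like groupByKey followed by a per-key reduce
  // (in real Spark, values are also pre-combined on each partition)
  val reduced: Map[String, Int] =
    grouped.map { case (k, vs) => (k, vs.reduce(_ + _)) }

  def main(args: Array[String]): Unit = {
    println(grouped) // tom -> List(1, 2), jerry -> List(3)
    println(reduced) // tom -> 3, jerry -> 3
  }
}
```

The sketch also shows why reduceByKey is usually preferred over groupByKey plus a manual reduce: Spark can apply the reduce function before shuffling, moving less data across the network.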
4.3 Action
Action | Meaning
reduce(func) | Aggregates the elements of the RDD using func: the first two elements are passed to func, the result is paired with the next element and passed to func again, and so on until a single value remains
collect() | Returns all elements of the dataset as an array to the driver program
count() | Returns the number of elements in the RDD
first() | Returns the first element of the RDD (similar to take(1))
take(n) | Returns an array containing the first n elements of the dataset
takeOrdered(n, [ordering]) | Returns the first n elements of the RDD, using either their natural order or a custom ordering
saveAsTextFile(path) | Writes the elements of the dataset as a text file to HDFS or another supported file system; Spark calls toString on each element to convert it to a line of text
saveAsSequenceFile(path) | Writes the elements of the dataset as a Hadoop SequenceFile to the given directory, on HDFS or another Hadoop-supported file system
saveAsObjectFile(path) | Writes the elements of the dataset to the given directory using Java serialization
countByKey() | For an RDD of type (K, V), returns a (K, Int) map giving the number of elements for each key
foreach(func) | Runs the function func on each element of the dataset
foreachPartition(func) | Runs the function func on each partition of the dataset
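Unlike transformations, every action above returns a plain value (or writes to storage) rather than a new RDD. As a plain-Scala analogy, not Spark code, the same value-returning shape can be sketched with ordinary collections:

```scala
// Plain-Scala sketch of common actions: each produces a value, not a new dataset.
object ActionSketch {
  val nums = List(1, 2, 3, 4, 5)
  val pairs = List(("tom", 1), ("tom", 2), ("jerry", 3))

  val summed: Int = nums.reduce(_ + _)        // like rdd.reduce(_ + _)
  val firstTwo: List[Int] = nums.take(2)      // like rdd.take(2)
  val topTwo: List[Int] = nums.sorted.take(2) // like rdd.takeOrdered(2)
  val counts: Map[String, Int] =              // like rdd.countByKey()
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.size) }

  def main(args: Array[String]): Unit = {
    println(summed)   // 15
    println(firstTwo) // List(1, 2)
    println(topTwo)   // List(1, 2)
    println(counts)   // tom -> 2, jerry -> 1
  }
}
```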
5. Common RDD operator exercises
For details of all Spark RDD operators, see 《sparkRDD函数详解.docx》.
Start spark-shell to run the exercises:
spark-shell --master spark://node1:7077
Exercise 1: map, filter
//Create an RDD by parallelizing a collection
val rdd1 = sc.parallelize(List(5, 6, 4, 7, 3, 8, 2, 9, 1, 10))
//Multiply each element of rdd1 by 2, then sort
val rdd2 = rdd1.map(_ * 2).sortBy(x => x, true)
//Keep only the elements greater than or equal to 5
val rdd3 = rdd2.filter(_ >= 5)
//Return the elements to the driver as an array
rdd3.collect
Exercise 2: flatMap
val rdd1 = sc.parallelize(Array("a b c", "d e f", "h i j"))
//Split each element of rdd1, then flatten the results
val rdd2 = rdd1.flatMap(_.split(" "))
rdd2.collect
Exercise 3: intersection and union
val rdd1 = sc.parallelize(List(5, 6, 4, 3))
val rdd2 = sc.parallelize(List(1, 2, 3, 4))
//Union
val rdd3 = rdd1.union(rdd2)
//Intersection
val rdd4 = rdd1.intersection(rdd2)
//Deduplicate
rdd3.distinct.collect
rdd4.collect
Exercise 4: join, groupByKey
val rdd1 = sc.parallelize(List(("tom", 1), ("jerry", 3), ("kitty", 2)))
val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))
//Join
val rdd3 = rdd1.join(rdd2)
rdd3.collect
//Union
val rdd4 = rdd1 union rdd2
rdd4.collect
//Group by key
val rdd5=rdd4.groupByKey
rdd5.collect
Exercise 5: cogroup
val rdd1 = sc.parallelize(List(("tom", 1), ("tom", 2), ("jerry", 3), ("kitty", 2)))
val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("jim", 2)))
//cogroup
val rdd3 = rdd1.cogroup(rdd2)
//Note the difference between cogroup and groupByKey
rdd3.collect
Exercise 6: reduce
val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5))
//Aggregate with reduce
val rdd2 = rdd1.reduce(_ + _)
//reduce is an action: rdd2 is already an Int (15), not an RDD, so there is nothing to collect
rdd2
Exercise 7: reduceByKey, sortByKey
val rdd1 = sc.parallelize(List(("tom", 1), ("jerry", 3), ("kitty", 2), ("shuke", 1)))
val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 3), ("shuke", 2), ("kitty", 5)))
val rdd3 = rdd1.union(rdd2)
//Aggregate by key
val rdd4 = rdd3.reduceByKey(_ + _)
rdd4.collect
//Sort by value in descending order (swap key and value, sort, then swap back)
val rdd5 = rdd4.map(t => (t._2, t._1)).sortByKey(false).map(t => (t._2, t._1))
rdd5.collect
Exercise 8: repartition, coalesce
val rdd1 = sc.parallelize(1 to 10,3)
//Use repartition to change the number of partitions of rdd1
//Decrease partitions
rdd1.repartition(2).partitions.size
//Increase partitions
rdd1.repartition(4).partitions.size
//Use coalesce to change the number of partitions of rdd1
//Decrease partitions
rdd1.coalesce(2).partitions.size
Note: repartition can both increase and decrease the number of partitions of an RDD, while coalesce by default can only decrease it; asking coalesce for more partitions has no effect unless shuffle = true is passed.