sparkRDD: Section 3. Common RDD operators

4. RDD Programming API

4.1 RDD operator classification

Transformation: builds a new dataset from an existing one; the computation returns a new RDD. Example: map operates on one RDD and produces a new RDD.

Action: runs a computation on an RDD and returns a value to the driver program, or writes the result to an external storage system (e.g. HDFS).

For example: the collect operator returns all elements of the dataset to the driver once the computation is complete.

 

4.2 Transformation

All RDD transformations are lazy: they do not compute their results right away. Instead, they simply record the transformations applied to the base dataset (e.g. a file). The transformations actually run only when an action requires a result to be returned to the driver or written to external storage. This design lets Spark run more efficiently.
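A minimal spark-shell sketch of this laziness (the input path /tmp/words.txt is just an assumed example):

val lines = sc.textFile("/tmp/words.txt")  // transformation: no data is read yet
val lengths = lines.map(_.length)          // transformation: still nothing runs
val total = lengths.reduce(_ + _)          // action: only now does the job actually execute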

 

Commonly used transformations:

map(func): Returns a new RDD formed by passing each element of the source through the function func.

filter(func): Returns a new RDD formed by selecting those elements of the source on which func returns true.

flatMap(func): Similar to map, but each input element can be mapped to zero or more output elements (so func should return a sequence rather than a single element).

mapPartitions(func): Similar to map, but runs separately on each partition of the RDD; when running on an RDD of type T, func must be of type Iterator[T] => Iterator[U].

mapPartitionsWithIndex(func): Similar to mapPartitions, but func also takes an integer parameter representing the index of the partition; when running on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U].

union(otherDataset): Returns a new RDD that is the union of the source RDD and the argument RDD.

intersection(otherDataset): Returns a new RDD that is the intersection of the source RDD and the argument RDD.

distinct([numTasks]): Returns a new RDD containing the distinct elements of the source RDD.

groupByKey([numTasks]): When called on an RDD of (K, V) pairs, returns an RDD of (K, Iterable[V]) pairs.

reduceByKey(func, [numTasks]): When called on an RDD of (K, V) pairs, returns an RDD of (K, V) pairs in which the values for each key are aggregated using the given reduce function func. As with groupByKey, the number of reduce tasks can be set through an optional second argument.

sortByKey([ascending], [numTasks]): When called on an RDD of (K, V) pairs where K implements the Ordered interface, returns an RDD of (K, V) pairs sorted by key.

sortBy(func, [ascending], [numTasks]): Similar to sortByKey, but more flexible: it sorts by the value computed by func.

join(otherDataset, [numTasks]): When called on RDDs of type (K, V) and (K, W), returns an RDD of (K, (V, W)) pairs with all pairs of elements for each matching key.

cogroup(otherDataset, [numTasks]): When called on RDDs of type (K, V) and (K, W), returns an RDD of type (K, (Iterable[V], Iterable[W])).

coalesce(numPartitions): Decreases the number of partitions in the RDD to the specified value.

repartition(numPartitions): Reshuffles the data in the RDD to create the given number of partitions.

repartitionAndSortWithinPartitions(partitioner): Repartitions the RDD according to the given partitioner and, within each resulting partition, sorts records by their keys.
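The partition-level operators are easiest to understand by watching them run. A small spark-shell sketch of mapPartitionsWithIndex (the data and partition count are arbitrary examples):

val rdd = sc.parallelize(1 to 6, 2)
// Tag every element with the index of the partition it lives in
rdd.mapPartitionsWithIndex((idx, it) => it.map(x => s"partition $idx -> $x")).collect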

4.3 Action

reduce(func): Aggregates the elements of the RDD with func: the first two elements are passed to the function to produce a new value; that value and the next element (the third) are then passed to the function again, and so on, until only a single value remains.

collect(): Returns all elements of the dataset to the driver program as an array.

count(): Returns the number of elements in the RDD.

first(): Returns the first element of the RDD (similar to take(1)).

take(n): Returns an array made up of the first n elements of the dataset.

takeOrdered(n, [ordering]): Returns the first n elements using either their natural order or a custom ordering.

saveAsTextFile(path): Saves the elements of the dataset as a text file to HDFS or another supported file system; for each element, Spark calls its toString method to convert it into a line of text in the file.

saveAsSequenceFile(path): Saves the elements of the dataset in Hadoop SequenceFile format to the specified directory, on HDFS or any other Hadoop-supported file system.

saveAsObjectFile(path): Saves the elements of the dataset to the specified directory using Java serialization.

countByKey(): For an RDD of type (K, V), returns a (K, Int) map giving the number of elements for each key.

foreach(func): Runs the function func on every element of the dataset.

foreachPartition(func): Runs the function func on every partition of the dataset.
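A few of these actions are not covered by the exercises below; a quick spark-shell sketch with example data only:

val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
pairs.countByKey()                                    // Map(a -> 2, b -> 1): number of pairs per key
sc.parallelize(List(5, 1, 4, 2, 3)).takeOrdered(3)    // Array(1, 2, 3)
// foreachPartition runs on the executors; println output appears in executor
// logs (or directly in the console when running in local mode)
pairs.foreachPartition(it => println(s"partition size: ${it.size}"))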

5. Common RDD operations in practice

For the full set of Spark RDD operators, see 《sparkRDD函数详解.docx》.

Start spark-shell to run the tests:

spark-shell --master spark://node1:7077

 

Exercise 1: map, filter

// Create an rdd by parallelizing a collection

val rdd1 = sc.parallelize(List(5, 6, 4, 7, 3, 8, 2, 9, 1, 10))

// Multiply every element of rdd1 by 2, then sort ascending

val rdd2 = rdd1.map(_ * 2).sortBy(x => x, true)

// Keep the elements that are greater than or equal to 5

val rdd3 = rdd2.filter(_ >= 5)

// Return the elements to the client as an array

rdd3.collect
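With this input, rdd3.collect should return Array(6, 8, 10, 12, 14, 16, 18, 20): every doubled value from 6 upward, in ascending order.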

 

Exercise 2: flatMap

val rdd1 = sc.parallelize(Array("a b c", "d e f", "h i j"))

// Split each element of rdd1, then flatten the results

val rdd2 = rdd1.flatMap(_.split(" "))

rdd2.collect
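Here rdd2.collect should return the nine individual words, Array(a, b, c, d, e, f, h, i, j): each three-word string is split and the resulting sequences are flattened into one RDD.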

 

Exercise 3: intersection and union

val rdd1 = sc.parallelize(List(5, 6, 4, 3))

val rdd2 = sc.parallelize(List(1, 2, 3, 4))

// Compute the union

val rdd3 = rdd1.union(rdd2)

// Compute the intersection

val rdd4 = rdd1.intersection(rdd2)

// Remove duplicates

rdd3.distinct.collect

rdd4.collect
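Since union keeps duplicates, rdd3.distinct.collect should contain 1 through 6 exactly once (3 and 4 appear in both inputs), and rdd4.collect should contain 3 and 4; the element order after these shuffles is not guaranteed.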

 

Exercise 4: join, groupByKey

val rdd1 = sc.parallelize(List(("tom", 1), ("jerry", 3), ("kitty", 2)))

val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))

// Join the two RDDs by key

val rdd3 = rdd1.join(rdd2)

rdd3.collect

// Compute the union

val rdd4 = rdd1 union rdd2

rdd4.collect

// Group by key

val rdd5=rdd4.groupByKey

rdd5.collect
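rdd3 keeps only the keys present in both RDDs, so rdd3.collect should contain (tom,(1,1)) and (jerry,(3,2)). rdd5.collect groups every value in the union under its key, e.g. tom maps to an Iterable containing 1 and 1 (displayed as a CompactBuffer in spark-shell; element order may vary).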

 

Exercise 5: cogroup

val rdd1 = sc.parallelize(List(("tom", 1), ("tom", 2), ("jerry", 3), ("kitty", 2)))

val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("jim", 2)))

//cogroup

val rdd3 = rdd1.cogroup(rdd2)

// Note the difference between cogroup and groupByKey

rdd3.collect
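Unlike groupByKey, cogroup keeps the values from each source RDD in a separate Iterable, and it includes keys that appear in only one of the two RDDs: here tom maps to (Iterable(1, 2), Iterable(1)), while kitty maps to (Iterable(2), Iterable()).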

 

Exercise 6: reduce

val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5))

// Aggregate all elements with reduce

val rdd2 = rdd1.reduce(_ + _)

rdd2  // reduce is an action: rdd2 is already a plain Int (15 here), not an RDD, so there is no collect to call

 

Exercise 7: reduceByKey, sortByKey

val rdd1 = sc.parallelize(List(("tom", 1), ("jerry", 3), ("kitty", 2),  ("shuke", 1)))

val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 3), ("shuke", 2), ("kitty", 5)))

val rdd3 = rdd1.union(rdd2)

// Aggregate by key

val rdd4 = rdd3.reduceByKey(_ + _)

rdd4.collect

// Sort by value in descending order

val rdd5 = rdd4.map(t => (t._2, t._1)).sortByKey(false).map(t => (t._2, t._1))

rdd5.collect
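The map/sortByKey/map round trip swaps each pair so the value becomes the key, sorts descending, then swaps back. A sketch of an equivalent, arguably simpler form using sortBy (same result on the same input):

val rdd5 = rdd4.sortBy(_._2, false)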

Exercise 8: repartition, coalesce

val rdd1 = sc.parallelize(1 to 10,3)

// Use repartition to change the number of partitions of rdd1

// Decrease partitions

rdd1.repartition(2).partitions.size

// Increase partitions

rdd1.repartition(4).partitions.size

// Use coalesce to change the number of partitions of rdd1

// Decrease partitions

rdd1.coalesce(2).partitions.size

 

Note: repartition can both increase and decrease the number of partitions of an RDD, while coalesce can only decrease it; asking coalesce for more partitions has no effect, because coalesce does not perform a shuffle by default.
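A quick spark-shell check of this, using the three-partition rdd1 from above:

rdd1.coalesce(4).partitions.size                  // still 3: without a shuffle, coalesce cannot grow the partition count
rdd1.coalesce(4, shuffle = true).partitions.size  // 4: forcing the shuffle makes it behave like repartition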

 

Origin: www.cnblogs.com/mediocreWorld/p/11432268.html