RDD operations are divided into transformations and actions.
Each transformation produces a new RDD, which the next operation can then consume.
The RDD returned by a transformation is lazily evaluated.
That is, nothing is actually computed while transformations are chained together; Spark only records the lineage of the transformations.
The real computation happens when an action is encountered: it starts from the source of the DAG and runs through the recorded transformations from beginning to end.
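To make the lazy behaviour concrete, here is a minimal sketch, assuming Spark (spark-core) is on the classpath and a local master is acceptable. The object name `LazyEvalSketch`, the `local[1]` master and the accumulator name `mapCalls` are illustrative choices, not part of the original text. An accumulator counts how many times the function passed to map() has actually run.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvalSketch {
  // Returns the number of map-function calls seen before and after the action.
  def run(): (Long, Long) = {
    // Assumption: a throwaway local context, used only for this illustration.
    val conf = new SparkConf().setMaster("local[1]").setAppName("lazy-eval-sketch")
    val sc = new SparkContext(conf)
    try {
      val calls = sc.longAccumulator("mapCalls")

      // Transformation: only recorded in the lineage, nothing runs yet.
      val mapped = sc.makeRDD(List(1, 2, 3, 4, 5)).map { x => calls.add(1); x + 1 }
      val before = calls.sum

      // Action: triggers the job, so the map function is now applied.
      mapped.count()
      val after = calls.sum

      (before, after)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Under these assumptions, `run()` should return (0, 5): the counter stays at zero after map() is declared and only reaches five once count() forces the computation.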
Common operations
| Operation type | Function | Effect |
| --- | --- | --- |
| Transformation | map() | Takes a function, applies it to every element of the RDD, and returns a new RDD of the results |
| | flatMap() | Takes a function that returns a collection (iterator) of values for each element, applies it to every element, and returns a new RDD with the results flattened |
| | filter() | Takes a predicate function and returns a new RDD containing only the elements that satisfy it |
| | distinct() | No parameters; removes duplicate elements from the RDD |
| | union() | Takes an RDD; returns a new RDD containing all elements of both RDDs |
| | intersection() | Takes an RDD; returns the elements common to both RDDs |
| | subtract() | Takes an RDD; removes from the original RDD the elements that also appear in the argument RDD |
| | cartesian() | Takes an RDD; returns the Cartesian product of the two RDDs |
| Action | collect() | Returns all elements of the RDD |
| | count() | Returns the number of elements in the RDD |
| | countByValue() | Returns the number of times each element appears in the RDD |
| | reduce() | Combines all elements of the RDD in parallel with the given function, for example to compute a sum |
| | fold(0)(func) | Same as reduce, but takes an initial value |
| | aggregate(0)(seqOp, combOp) | Similar to reduce, but the type of the result can differ from the element type of the RDD |
| | foreach(func) | Applies the given function to each element of the RDD |
In addition, we have also used the following transformations:
1. groupByKey(): applied to a dataset of (K, V) key-value pairs; returns a new dataset of (K, Iterable[V]) pairs.
2. reduceByKey(func): applied to a dataset of (K, V) key-value pairs; returns a new dataset of (K, V) pairs in which the values for each key are aggregated with func.
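The difference between the two by-key transformations can be sketched as follows. This is an illustration only: the object name `ByKeySketch`, the sample pairs and the `local[1]` master are assumptions, and the results are collected into plain Scala maps so they are easy to inspect.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ByKeySketch {
  // Returns the grouped values per key and the summed values per key.
  def run(): (Map[String, List[Int]], Map[String, Int]) = {
    // Assumption: a throwaway local context, used only for this illustration.
    val conf = new SparkConf().setMaster("local[1]").setAppName("by-key-sketch")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)))

      // groupByKey: (K, V) => (K, Iterable[V]); the values are sorted here
      // only to make the result deterministic.
      val grouped = pairs.groupByKey().mapValues(_.toList.sorted).collect().toMap

      // reduceByKey: (K, V) => (K, V), merging the values of each key with func.
      val reduced = pairs.reduceByKey(_ + _).collect().toMap

      (grouped, reduced)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

With this sample data, key "a" should group to List(1, 3) and reduce to 4. When only an aggregate per key is needed, reduceByKey is usually the better fit, because it can combine values within each partition before shuffling.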
We have also used the following actions:
1. first(): returns the first element of the dataset.
2. take(n): returns the first n elements of the dataset as an array.
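A short sketch of these two actions, reusing the integer list from the examples in this article. The object name `FirstTakeSketch` and the `local[1]` master are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FirstTakeSketch {
  // Returns the first element and the first three elements of the RDD.
  def run(): (Int, List[Int]) = {
    // Assumption: a throwaway local context, used only for this illustration.
    val conf = new SparkConf().setMaster("local[1]").setAppName("first-take-sketch")
    val sc = new SparkContext(conf)
    try {
      val rddInt = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 2, 5, 1))

      val head = rddInt.first()          // first element of the dataset
      val firstThree = rddInt.take(3)    // first n elements, returned as an Array

      (head, firstThree.toList)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Both calls are actions, so each one triggers a job. Unlike collect(), take(n) brings only n elements back to the driver, which makes it a safer way to peek at a large RDD.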
Examples
Transformations
    import org.apache.spark.rdd.RDD

    // Assumes `sc` is an existing SparkContext and `path` points to a comma-separated text file.
    val rddInt: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 2, 5, 1))
    val rddStr: RDD[String] = sc.parallelize(Array("a", "b", "c", "d", "b", "a"), 1)
    val rddFile: RDD[String] = sc.textFile(path, 1)
    val rdd01: RDD[Int] = sc.makeRDD(List(1, 3, 5, 3))
    val rdd02: RDD[Int] = sc.makeRDD(List(2, 4, 5, 1))

    /* map: add 1 to every element */
    println("======map======")
    println(rddInt.map(x => x + 1).collect().mkString(","))
    println("======map======")

    /* filter: keep elements greater than 4 */
    println("======filter======")
    println(rddInt.filter(x => x > 4).collect().mkString(","))
    println("======filter======")

    /* flatMap: split every line on commas and print the first token */
    println("======flatMap======")
    println(rddFile.flatMap { x => x.split(",") }.first())
    println("======flatMap======")

    /* distinct: remove duplicates */
    println("======distinct======")
    println(rddInt.distinct().collect().mkString(","))
    println(rddStr.distinct().collect().mkString(","))
    println("======distinct======")

    /* union: all elements of both RDDs */
    println("======union======")
    println(rdd01.union(rdd02).collect().mkString(","))
    println("======union======")

    /* intersection: elements present in both RDDs */
    println("======intersection======")
    println(rdd01.intersection(rdd02).collect().mkString(","))
    println("======intersection======")

    /* subtract: elements of rdd01 that are not in rdd02 */
    println("======subtract======")
    println(rdd01.subtract(rdd02).collect().mkString(","))
    println("======subtract======")

    /* cartesian: all pairs (a, b) with a from rdd01 and b from rdd02 */
    println("======cartesian======")
    println(rdd01.cartesian(rdd02).collect().mkString(","))
    println("======cartesian======")
Actions
    import org.apache.spark.rdd.RDD

    // Assumes `sc` is an existing SparkContext.
    val rddInt: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 2, 5, 1))
    val rddStr: RDD[String] = sc.parallelize(Array("a", "b", "c", "d", "b", "a"), 1)

    /* count: number of elements */
    println("======count======")
    println(rddInt.count())
    println("======count======")

    /* countByValue: occurrences of each distinct element */
    println("======countByValue======")
    println(rddInt.countByValue())
    println("======countByValue======")

    /* reduce: sum of all elements */
    println("======reduce======")
    println(rddInt.reduce((x, y) => x + y))
    println("======reduce======")

    /* fold: sum of all elements, starting from the initial value 0 */
    println("======fold======")
    println(rddInt.fold(0)((x, y) => x + y))
    println("======fold======")

    /* aggregate: compute (sum, count), a result type that differs from the element type */
    println("======aggregate======")
    val res: (Int, Int) = rddInt.aggregate((0, 0))(
      (acc, x) => (acc._1 + x, acc._2 + 1),   // seqOp: fold one element into a partition's (sum, count)
      (a, b) => (a._1 + b._1, a._2 + b._2))   // combOp: merge the (sum, count) pairs of two partitions
    println(s"${res._1},${res._2}")
    println("======aggregate======")

    /* foreach: run a function on every element; it returns Unit, so there is nothing to print around it */
    println("======foreach======")
    rddStr.foreach { x => println(x) }
    println("======foreach======")