[Spark] RDD operations

RDD operations are divided into transformations and actions.

Each transformation on an RDD produces a new RDD, which is then consumed by the next operation.

The RDDs obtained from transformations are lazily evaluated.

That is, nothing is actually computed during the transformation phase; Spark only records the trajectory of the transformations (the lineage).

Only when an action is encountered does real computation happen: Spark starts from the source of the DAG and runs the whole chain of transformations from beginning to end.
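A minimal sketch of this laziness, assuming an existing SparkContext named sc (as in the examples below):

val data = sc.makeRDD(List(1, 2, 3, 4, 5))
// Nothing is computed yet: map() only records the lineage.
val doubled = data.map(x => x * 2)
// collect() is an action, so the whole DAG is executed at this point.
println(doubled.collect().mkString(","))  // 2,4,6,8,10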

Common operations

| Operation type | Function | Effect |
| --- | --- | --- |
| Transformation | map() | Takes a function as its parameter, applies it to each element of the RDD, and returns the new RDD |
| Transformation | flatMap() | Takes a function as its parameter, applies it to each element, flattens the iterator produced for each element, and returns the new RDD |
| Transformation | filter() | Takes a predicate as its parameter, drops the elements that do not satisfy it, and returns a new RDD |
| Transformation | distinct() | Takes no parameters; removes duplicate elements from the RDD |
| Transformation | union() | Takes an RDD as its parameter; returns a new RDD containing all elements of both RDDs |
| Transformation | intersection() | Takes an RDD as its parameter; returns the elements common to both RDDs |
| Transformation | subtract() | Takes an RDD as its parameter; removes from the original RDD the elements that also appear in the parameter RDD |
| Transformation | cartesian() | Takes an RDD as its parameter; returns the Cartesian product of the two RDDs |
| Action | collect() | Returns all elements of the RDD |
| Action | count() | Returns the number of elements in the RDD |
| Action | countByValue() | Returns the number of times each element appears in the RDD |
| Action | reduce() | Aggregates all elements of the RDD in parallel, e.g. a sum |
| Action | fold(0)(func) | Same as reduce, but fold takes an initial (zero) value |
| Action | aggregate(0)(seqOp, combOp) | Like reduce, but the result type may differ from the element type of the original RDD |
| Action | foreach(func) | Applies the given function to each element of the RDD |
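To make the difference between map() and flatMap() concrete, here is a small sketch (again assuming an existing SparkContext sc):

val lines = sc.parallelize(List("a,b", "c,d,e"))
// map() produces exactly one output element per input element,
// so here each line becomes a single Array[String].
println(lines.map(_.split(",")).count())                      // 2
// flatMap() flattens the per-element results into one RDD of Strings.
println(lines.flatMap(_.split(",")).collect().mkString(","))  // a,b,c,d,e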

In addition, the transformations we have used include:

1.groupByKey(): applied to a dataset of (K, V) key-value pairs; returns a new dataset of (K, Iterable) pairs

2.reduceByKey(func): applied to a dataset of (K, V) key-value pairs; returns a new dataset of (K, V) pairs, where each value is the result of aggregating all the values of that key with func (see the sketch below)
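A minimal sketch of both pair-RDD transformations, assuming an existing SparkContext sc:

val pairs = sc.makeRDD(List(("a", 1), ("b", 2), ("a", 3)))
// groupByKey collects all values of each key into an Iterable: (a, [1, 3]), (b, [2])
println(pairs.groupByKey().collect().mkString(","))
// reduceByKey merges the values of each key with func: (a, 4), (b, 2)
println(pairs.reduceByKey((x, y) => x + y).collect().mkString(","))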

Besides these, the actions we have used also include:

1.first(): returns the first element of the dataset

2.take(n): returns the first n elements of the dataset as an array.
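Both are straightforward; a minimal sketch (assuming an existing SparkContext sc):

val nums = sc.makeRDD(List(10, 20, 30, 40))
println(nums.first())                // 10
println(nums.take(2).mkString(","))  // 10,20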

Examples

Transformations

val rddInt:RDD[Int] = sc.makeRDD(List(1,2,3,4,5,6,2,5,1))
val rddStr:RDD[String] = sc.parallelize(Array("a","b","c","d","b","a"), 1)
val rddFile:RDD[String] = sc.textFile(path, 1)
val rdd01:RDD[Int] = sc.makeRDD(List(1,3,5,3))
val rdd02:RDD[Int] = sc.makeRDD(List(2,4,5,1))
/* map */
    println("======map======")
    println(rddInt.map(x => x + 1).collect().mkString(","))
    println("======map======")

/* filter */
    println("======filter======")
    println(rddInt.filter(x => x > 4).collect().mkString(","))
    println("======filter======")

/* flatMap */
    println("======flatMap======")
    println(rddFile.flatMap { x => x.split(",") }.first())
    println("======flatMap======")

/* distinct (deduplication) */
    println("======distinct======")
    println(rddInt.distinct().collect().mkString(","))
    println(rddStr.distinct().collect().mkString(","))
    println("======distinct======")

/* union */
    println("======union======")
    println(rdd01.union(rdd02).collect().mkString(","))
    println("======union======")

/* intersection */
    println("======intersection======")
    println(rdd01.intersection(rdd02).collect().mkString(","))
    println("======intersection======")

/* subtract */
    println("======subtract======")
    println(rdd01.subtract(rdd02).collect().mkString(","))
    println("======subtract======")

/* cartesian */
    println("======cartesian======")
    println(rdd01.cartesian(rdd02).collect().mkString(","))
    println("======cartesian======")

Actions

val rddInt:RDD[Int] = sc.makeRDD(List(1,2,3,4,5,6,2,5,1))
val rddStr:RDD[String] = sc.parallelize(Array("a","b","c","d","b","a"), 1)

  

/* count */
    println("======count======")
    println(rddInt.count())
    println("======count======")

/* countByValue */
    println("======countByValue======")
    println(rddInt.countByValue())
    println("======countByValue======")

/* reduce */
    println("======reduce======")
    println(rddInt.reduce((x, y) => x + y))
    println("======reduce======")

/* fold */
    println("======fold======")
    println(rddInt.fold(0)((x, y) => x + y))
    println("======fold======")

/* aggregate: compute (sum, count) of the elements in one pass */
    println("======aggregate======")
    // seqOp folds each element into the (sum, count) accumulator;
    // combOp merges the accumulators of different partitions.
    val res: (Int, Int) = rddInt.aggregate((0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),
      (a, b) => (a._1 + b._1, a._2 + b._2))
    println(res._1 + "," + res._2)
    println("======aggregate======")

/* foreach */
    println("======foreach======")
    // foreach returns Unit, so it is not wrapped in println; note that on a
    // cluster the output goes to the executors' logs, not the driver console.
    rddStr.foreach { x => println(x) }
    println("======foreach======")

 
