Spark operator examples

RDD transformation operator Transformation (lazy): lazy mode (transformation)

  • A transformation turns a data set into a new RDD; two RDDs can also be merged into one. No job runs until an action is called, as the sketch below shows.
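  • A minimal sketch of the lazy behavior, assuming a SparkContext named sc is available (as in spark-shell); the names nums and doubled are illustrative:

    val nums = sc.parallelize(1 to 4)
    val doubled = nums.map(_ * 2) // only recorded in the lineage, no job runs yet
    doubled.collect               // the action triggers the actual computation
    // result: Array(2, 4, 6, 8)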

map

  • Applies the input transform function to every element of the RDD

    val a = sc.parallelize(1 to 8)
    val b = a.map(s=>(s+1))
    b.collect
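    // result: Array(2, 3, 4, 5, 6, 7, 8, 9)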
    

flatMap

  • Applies the input transform function to every element of the RDD and flattens all the resulting collections into one collection.

    sc.parallelize(1 to 10).flatMap(it=>it to 10).collect
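    // result: 55 elements, since each element i expands to i to 10: Array(1, 2, ..., 10, 2, 3, ..., 10, ..., 10)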
    

filter

  • Keeps only the elements for which the predicate function returns true.

    val a = sc.parallelize(1 to 9)
    val b = a.filter(s=>(s%2==0))
    b.collect
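    // result: Array(2, 4, 6, 8)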
    

mapValues

  • Applies a function to each value of a key-value RDD while leaving the keys unchanged; the keys and new values together form a new RDD.

    val a = sc.parallelize(List("aa","bb","cc","dd"))
    val b=a.map(x=>(x.length,x))
    b.mapValues("x"+_+"x").collect
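    // result: Array((2,xaax), (2,xbbx), (2,xccx), (2,xddx))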
    

mapPartitions

  • Operates on each partition (block) of the RDD independently, so for an RDD of type T the function must have type Iterator[T] => Iterator[T].

    val data = sc.parallelize(1 to 10, 3)
    def function(it: Iterator[Int]): Iterator[Int] = {
      val res = for (e <- it) yield e * 2 // yield buffers the results into a collection
      res
    }
    val result4 = data.mapPartitions(function)
    result4.collect
    // result: Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
    

sample(withReplacement, fraction, seed)

  • withReplacement: whether to sample with replacement; fraction: the expected fraction of elements to sample; seed: the seed for the random number generator.

    val data = sc.parallelize(1 to 10, 3)
    val result6 = data.sample(false, 0.5, 1) // kept as an RDD so it can be reused below
    result6.collect
    // result varies with the seed; with fraction 0.5 roughly half the elements are kept
    

union()

  • Returns the union of the source data set and another data set, without removing duplicates.

    val result7 = data.union(result6)
    result7.collect
    // result: all elements of data followed by the sampled elements (duplicates kept)
    

intersection

  • Returns the intersection of the source data set and another data set, with duplicates removed.

    val result8 = data.intersection(result6)
    result8.collect
    // result: the elements present in both RDDs (order may vary, duplicates removed)
    

distinct

  • Returns a new data set with the source data set deduplicated; the result is locally unordered within partitions but ordered overall when returned.

    val data1 = sc.parallelize(1 to 10, 3)
    val result = data1.distinct
    result.collect
    // result: 1 to 10 with duplicates removed
    

groupByKey

  • Groups the values that share the same key into one sequence; the order of values is not guaranteed. If a single key has too many values, this can easily cause an out-of-memory error (a safer alternative is sketched after the example below).

    val data=sc.parallelize(1 to 10)
    val pair1=data.map(x=>{(x,1)})
    val pair2=data.map(x=>{(x,2)})
    val pair3=pair1.union(pair2)
    val groupedPair=pair3.groupByKey
    groupedPair.collect
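    // result (order may vary): Array((1,CompactBuffer(1, 2)), (2,CompactBuffer(1, 2)), ..., (10,CompactBuffer(1, 2)))

  • A hedged side note, not from the original list: when only a per-key aggregate is needed, reduceByKey is the usual safer alternative, since it combines values map-side before the shuffle and never holds a key's full group of values in memory (the name summed is illustrative):

    // sums the two values for each key without materializing the groups
    val summed = pair3.reduceByKey(_ + _)
    summed.collect
    // result (order may vary): Array((1,3), (2,3), ..., (10,3))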
    

join

  • For pairs with the same key, extracts the values and combines them into a tuple (x, y).

    val data=sc.parallelize(1 to 10)
    val pair1=data.map(x=>{(x,1)})
    val pair2=data.map(x=>{(x,2)})
    val joinpair=pair1.join(pair2,2).collect
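    // result (order may vary): Array((1,(1,2)), (2,(1,2)), ..., (10,(1,2)))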
    

sortByKey(ascending, numTasks)

  • Sorts the RDD by key; ascending defaults to true.

val data=sc.parallelize(1 to 10)
val pair1=data.map(x=>{(x,1)})
val pair2=data.map(x=>{(x,2)})
val pair3=pair1.union(pair2)
val sortPair=pair3.sortByKey(true,2)        // keys in ascending order
val sortPairDesc=pair3.sortByKey(false,2)   // keys in descending order

RDD action operator Action (non-lazy): eager mode (action)

  • An action returns concrete values to the driver after the RDD's chain of transformations has run.

reduce

  • Passes the elements of the RDD to the function pairwise, producing a new value each time.
val data=sc.parallelize(1 to 10)
data.reduce((a,b)=>a+b) // method 1
data.reduce(_+_)        // method 2
// result: Int = 55

take()

  • Returns the first n elements.
val data = sc.parallelize(1 to 10)
data.take(2)
// result
res17: Array[Int] = Array(1, 2)

 
