RDD Programming Basics: RDD Operations

scala> val rdd1 = sc.textFile("file:///Users/***/spark/test_data/word.txt")
scala> rdd1.filter(x=>x.contains("huahua")) foreach println
huahua hadoop spark
huahua hadoop

You can also define the predicate func in advance:
scala> val func:String=>Boolean = {x:String=>x.contains("huahua")}
scala> rdd1.filter(func) foreach println
huahua hadoop spark
huahua hadoop
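The same predicate style works on plain Scala collections, which is handy for testing filter logic without a Spark session. A minimal local sketch (the sample lines below are assumed to resemble word.txt):

```scala
// Local stand-in for the RDD; sample lines are assumed, not from the real file.
val lines = List("huahua hadoop spark", "huahua hadoop", "hbase kylin")

// Same predefined predicate as in the REPL session above.
val func: String => Boolean = { x: String => x.contains("huahua") }

// filter keeps only the lines for which the predicate returns true.
val matched = lines.filter(func)
matched.foreach(println)
```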

Putting it together: WordCount

// flatMap, map, reduceByKey
scala> rdd1.flatMap(_.split(" ")).map((_,1)).reduceByKey((x,y)=>x+y).foreach(println) // the func in reduceByKey acts only on the values of the pair RDD; reduceByKey((x,y)=>x+y) can be shortened to reduceByKey(_+_)
(mapreduce,2)
(huahua,2)
(spark,5)
(hadoop,5)
(spark2.2,1)
(spark2.4,2)
(kylin,1)
(hbase,4)
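The same flatMap → map → reduce-by-key pipeline can be emulated on local collections for quick checks. A sketch, assuming a small sample input; groupBy plus a per-key sum stands in for Spark's reduceByKey:

```scala
// Assumed sample input standing in for the text file.
val lines = List("huahua hadoop spark", "huahua hadoop")

val counts = lines
  .flatMap(_.split(" "))                          // split each line into words
  .map((_, 1))                                    // pair each word with 1
  .groupBy(_._1)                                  // group the pairs by word
  .map { case (w, ps) => (w, ps.map(_._2).sum) }  // sum the 1s for each word

counts.foreach(println)
```

Note the difference in where the work happens: on a real cluster, reduceByKey combines values on the map side before the shuffle, which is why it is usually preferred over groupByKey for aggregations.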


// flatMap, map, groupByKey
scala> rdd1.flatMap(_.split(" ")).map((_,1)).groupByKey().map(t=>(t._1,t._2.sum)) foreach println // groupByKey yields an org.apache.spark.rdd.RDD[(String, Iterable[Int])]; each element is a tuple, and the subsequent map is applied to each tuple
(mapreduce,2)
(huahua,2)
(spark,5)
(hadoop,5)
(spark2.2,1)
(spark2.4,2)
(kylin,1)
(hbase,4)

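The two-step shape of the groupByKey variant, grouping first into (word, collection-of-1s) and only then summing, can also be sketched locally. The sample input is assumed; a Map[String, List[Int]] plays the role of RDD[(String, Iterable[Int])]:

```scala
// Assumed sample input standing in for the text file.
val pairs = List("huahua hadoop spark", "huahua hadoop")
  .flatMap(_.split(" "))
  .map((_, 1))

// Step 1: group by word, keeping the collection of 1s per key,
// analogous to the RDD[(String, Iterable[Int])] that groupByKey returns.
val grouped = pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2)) }

// Step 2: a separate map sums each key's collection.
val summed = grouped.map { case (w, vs) => (w, vs.sum) }
summed.foreach(println)
```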
Reprinted from www.cnblogs.com/wooluwalker/p/12319144.html