RDD operations (continued)

map and flatMap

scala> val nums = sc.parallelize(List(1,2,3,4,5,6))
nums: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:24

scala> val squares = nums.map(x => x * x)
squares: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[5] at map at <console>:25

scala> squares.collect
res5: Array[Int] = Array(1, 4, 9, 16, 25, 36)

scala> nums.flatMap(x =>1 to x).collect
res6: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6)

scala> nums.map(x =>1 to x).collect
res7: Array[scala.collection.immutable.Range.Inclusive] = Array(Range(1), Range(1, 2), Range(1, 2, 3), Range(1, 2, 3, 4), Range(1, 2, 3, 4, 5), Range(1, 2, 3, 4, 5, 6))


scala> nums.map(x =>1 to x).flatMap(x => x).collect
res9: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6)

join
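
The pair RDDs a and b used below are not created in this transcript; a minimal sketch of how they could be built, with their contents inferred from the collect output that follows (the construction itself is an assumption):

// Key-value RDDs: the join operations below work on RDD[(K, V)]
val a = sc.parallelize(List(("a","a1"), ("b","b1"), ("c","c1"), ("d","d1"), ("f","f1"), ("f","f2")))
val b = sc.parallelize(List(("a","a2"), ("c","c2"), ("c","c3"), ("e","e1")))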

scala> a.collect
res10: Array[(String, String)] = Array((a,a1), (b,b1), (c,c1), (d,d1), (f,f1), (f,f2))

scala> b.collect
res11: Array[(String, String)] = Array((a,a2), (c,c2), (c,c3), (e,e1))

//inner join
scala> a.join(b).collect
res14: Array[(String, (String, String))] = Array((a,(a1,a2)), (c,(c1,c2)), (c,(c1,c3)))

//left join
scala> a.leftOuterJoin(b).collect
res15: Array[(String, (String, Option[String]))] = Array((d,(d1,None)), (b,(b1,None)), (f,(f1,None)), (f,(f2,None)), (a,(a1,Some(a2))), (c,(c1,Some(c2))), (c,(c1,Some(c3))))

//right join
scala> a.rightOuterJoin(b).collect
res16: Array[(String, (Option[String], String))] = Array((e,(None,e1)), (a,(Some(a1),a2)), (c,(Some(c1),c2)), (c,(Some(c1),c3)))

//full join
scala> a.fullOuterJoin(b).collect
res17: Array[(String, (Option[String], Option[String]))] = Array((d,(Some(d1),None)), (b,(Some(b1),None)), (f,(Some(f1),None)), (f,(Some(f2),None)), (e,(None,Some(e1))), (a,(Some(a1),Some(a2))), (c,(Some(c1),Some(c2))), (c,(Some(c1),Some(c3))))

Word frequency statistics (word count) using Spark Core

Source File:

[hadoop@hadoop001 data]$ cat wordcount.txt 
world world hello
China hello
people person
love
scala> val wc = sc.textFile("file:///home/hadoop/data/wordcount.txt")
wc: org.apache.spark.rdd.RDD[String] = file:///home/hadoop/data/wordcount.txt MapPartitionsRDD[28] at textFile at <console>:24

scala> wc.collect
res18: Array[String] = Array(world world hello, China hello, people person, love)
// Each element of this RDD is a String
// i.e. "world world hello" is one element, "China hello" is another, and so on

scala> val splits = wc.flatMap(x => x.split(" "))
splits: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[46] at flatMap at <console>:25
// flatMap splits each line on spaces and flattens the per-line results, so the lines above become a single RDD of words: (world, world, hello, China, hello, people, person, love)

scala> splits.collect
res31: Array[String] = Array(world, world, hello, China, hello, people, person, love)
// Each element of the RDD is a String

scala> val wordone = splits.map(x => (x,1))
wordone: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[47] at map at <console>:25
// Map each element x to the pair (x, 1)
// Each word in splits is mapped to a tuple of two elements: the word itself and its count for this occurrence, which is always 1
scala> wordone.collect
res32: Array[(String, Int)] = Array((world,1), (world,1), (hello,1), (China,1), (hello,1), (people,1), (person,1), (love,1))
// The RDD elements are now of type (String, Int), i.e. (key, value) pairs

scala> val result = wordone.reduceByKey(_+_)
result: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[49] at reduceByKey at <console>:25
// reduceByKey involves a shuffle: records with the same key are sent to the same reducer, where the values for each key are summed
scala> result.collect
res34: Array[(String, Int)] = Array((love,1), (hello,2), (world,2), (people,1), (China,1), (person,1))
// The result elements have type (String, Int)

// There are several ways to save the result (shown via tab completion):
scala> result.saveAs
saveAsHadoopDataset   saveAsNewAPIHadoopDataset   saveAsObjectFile     saveAsTextFile   
saveAsHadoopFile      saveAsNewAPIHadoopFile      saveAsSequenceFile  
scala> result.saveAsTextFile("/data/wcresult.txt")
// Save the result as a text file at the given path

map applies a function to every element of the RDD; flatMap applies the function to every element and then flattens the results into a single RDD.
Building on the above, sort the words by the number of times each appears, in descending / ascending order.

scala> val result = wc.flatMap(x => x.split(" ")).map(x => (x,1)).reduceByKey(_+_)
result: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[52] at reduceByKey at <console>:25

scala> result.collect
res35: Array[(String, Int)] = Array((love,1), (hello,2), (world,2), (people,1), (China,1), (person,1))

// Sort by the number of times each word appears, in ascending order (using sortByKey)
// sortByKey sorts by key, so swap key and value first, e.g. (world,2) becomes (2,world), then sort, then swap back
// sortByKey sorts by the key; the default is ascending order
scala> result.map(x => (x._2,x._1)).sortByKey().map(x => (x._2,x._1)).collect
res2: Array[(String, Int)] = Array((love,1), (people,1), (China,1), (person,1), (hello,2), (world,2))
// map(x => (x._2,x._1)) swaps the first and second element of each tuple
// which gives the sorted-by-count result
The above can be combined into one line:
scala> wc.flatMap(x => x.split(" ")).map(x => (x,1)).reduceByKey(_+_).map(x => (x._2,x._1)).sortByKey().map(x => (x._2,x._1)).collect
res3: Array[(String, Int)] = Array((love,1), (people,1), (China,1), (person,1), (hello,2), (world,2))

// Sort by the number of times each word appears, in descending order:
// sortByKey sorts by key, ascending by default; for descending order pass an argument: sortByKey(false)
// everything else is the same as the ascending case
scala> wc.flatMap(x => x.split(" ")).map(x => (x,1)).reduceByKey(_+_).map(x => (x._2,x._1)).sortByKey(false).map(x => (x._2,x._1)).collect
res4: Array[(String, Int)] = Array((hello,2), (world,2), (love,1), (people,1), (China,1), (person,1))

// top can also be used
// top is a curried method, top(num)(implicit ord): it returns the first num elements, already sorted in descending order, as the output below shows. Note that top returns an Array on the driver, so the final map runs on a plain Scala Array rather than an RDD
scala> wc.flatMap(x => x.split(" ")).map(x => (x,1)).reduceByKey(_+_).map(x => (x._2,x._1)).top(6).map(x => (x._2,x._1))
res9: Array[(String, Int)] = Array((world,2), (hello,2), (person,1), (people,1), (love,1), (China,1))





// Sort by the number of times each word appears, in ascending order (using sortBy)
// The sorting above used sortByKey; sortBy is another way to do it
sortBy(f, ascending): the first argument is a function selecting which tuple field to sort by; the second is a Boolean, true for ascending (the default, may be omitted), false for descending
// Sort by value in ascending order
// Sorts directly on the second element of each tuple in the RDD, ascending by default
scala> result.sortBy(_._2).collect
res43: Array[(String, Int)] = Array((love,1), (people,1), (China,1), (person,1), (hello,2), (world,2))
scala> result.sortBy(x => x._2).collect    // same as above
res46: Array[(String, Int)] = Array((love,1), (people,1), (China,1), (person,1), (hello,2), (world,2))
// Can be combined into one line:
scala> wc.flatMap(x => x.split(" ")).map(x => (x,1)).reduceByKey(_+_).sortBy(_._2,false).collect
res15: Array[(String, Int)] = Array((hello,2), (world,2), (love,1), (people,1), (China,1), (person,1))

// Sort by the number of times each word appears, in descending order (using sortBy)
scala> result.sortBy(_._2,false).collect
res14: Array[(String, Int)] = Array((hello,2), (world,2), (love,1), (people,1), (China,1), (person,1))


There is also this (foreach prints on the executors, so with multiple partitions the printed order is not guaranteed to match the sort order):
scala> result.sortBy(_._2).foreach(println)
(hello,2)
(world,2)
(love,1)
(people,1)
(China,1)
(person,1)

subtract: set difference

a.subtract(b):
Returns the elements that are in RDD a but not in RDD b

scala> val a = sc.parallelize(1.to(5))
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[62] at parallelize at <console>:24

scala> val b = sc.parallelize(2.to(3))
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[63] at parallelize at <console>:24

scala> a.collect
res18: Array[Int] = Array(1, 2, 3, 4, 5)

scala> b.collect
res19: Array[Int] = Array(2, 3)

scala> a.subtract(b).collect
res20: Array[Int] = Array(4, 1, 5)

intersection: set intersection

scala> a.collect
res18: Array[Int] = Array(1, 2, 3, 4, 5)

scala> b.collect
res19: Array[Int] = Array(2, 3)

scala> a.intersection(b).collect
res27: Array[Int] = Array(2, 3)

cartesian: Cartesian product

scala> a.collect
res18: Array[Int] = Array(1, 2, 3, 4, 5)

scala> b.collect
res19: Array[Int] = Array(2, 3)

// The result is an RDD of pairs (tuples)
scala> a.cartesian(b)
res29: org.apache.spark.rdd.RDD[(Int, Int)] = CartesianRDD[108] at cartesian at <console>:28

scala> a.cartesian(b).collect
res28: Array[(Int, Int)] = Array((1,2), (2,2), (1,3), (2,3), (3,2), (4,2), (5,2), (3,3), (4,3), (5,3))

takeOrdered

takeOrdered(n, [ordering])
def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]
Returns the first num elements of the RDD, sorted using the implicit Ordering[T];
that is, the first n elements either in their natural order or according to a custom comparator.

scala> val a = sc.parallelize(List(2,1,3,5,4,8,6,7))
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[112] at parallelize at <console>:24

scala> a.collect
res34: Array[Int] = Array(2, 1, 3, 5, 4, 8, 6, 7)

scala> a.takeOrdered(5)
res35: Array[Int] = Array(1, 2, 3, 4, 5)
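
The transcript uses only the natural ordering; a small sketch of supplying a custom Ordering through the second (implicit) parameter list, here the reversed Int ordering so the largest elements come first:

// "smallest" 5 under the reversed ordering = the 5 largest elements
a.takeOrdered(5)(Ordering[Int].reverse)   // Array(8, 7, 6, 5, 4) for the RDD above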

takeSample

takeSample(withReplacement, num, [seed])

def takeSample(
    withReplacement: Boolean,
    num: Int,
    seed: Long = Utils.random.nextLong): Array[T]

Returns an array containing a random sample of num elements from the data set, sampled with or without replacement, optionally using a pre-specified random number generator seed.
withReplacement = false means each element can be taken at most once; true means each draw is made from the whole set, so the same element may be returned more than once.

scala> val a = sc.parallelize(1 to 10)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[109] at parallelize at <console>:24

scala> a.collect
res51: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> a.takeSample(true,5)
res54: Array[Int] = Array(9, 3, 3, 1, 8)

scala> a.takeSample(true,5)
res55: Array[Int] = Array(5, 6, 5, 5, 9)

// With true (sampling with replacement) duplicates may appear; with false (without replacement) there are none

scala> a.takeSample(false,5)
res56: Array[Int] = Array(1, 10, 8, 2, 6)

scala> a.takeSample(false,5)
res57: Array[Int] = Array(2, 9, 6, 4, 5)
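
The optional seed mentioned above is not exercised in the transcript; a small sketch (the seed value 100L is arbitrary):

// Fixing the seed makes the sample reproducible: repeated calls with the
// same seed return the same elements
a.takeSample(false, 5, 100L)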

sample

def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T]

Samples a fraction of the data, with or without replacement, using the given random number generator seed (a usage sketch follows the list below).

Sampling Method:

  • Without replacement: fraction is the probability that each element is selected; it must be in [0, 1]
  • With replacement: fraction is the expected number of times each element is selected; it must be >= 0
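
No sample call appears in the transcript; a minimal sketch, reusing the RDD a = sc.parallelize(1 to 10) from the takeSample example (the fraction and seed values are arbitrary):

// Roughly 30% of the elements, without replacement; unlike takeSample,
// sample is a transformation and returns an RDD, and the exact number of
// elements returned varies from run to run
val sampled = a.sample(false, 0.3)
sampled.collect()

// With an explicit seed the sample is reproducible
a.sample(false, 0.3, 42L).collect()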

countByKey

Counts the elements of the RDD by key, returning a Map from each key to the number of elements with that key.

scala> val data = sc.parallelize(List((1,3),(1,2),(5,4),(1, 4),(2,3),(2,4)),3)
data: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[135] at parallelize at <console>:24

scala> data.collect
res58: Array[(Int, Int)] = Array((1,3), (1,2), (5,4), (1,4), (2,3), (2,4))

scala> data.countByKey()
res59: scala.collection.Map[Int,Long] = Map(1 -> 3, 5 -> 1, 2 -> 2)

mapPartitions

def mapPartitions[U: ClassTag](
    f: Iterator[T] => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U]

Similar to map, but runs separately on each partition of the RDD; the function func has type Iterator[T] => Iterator[U].
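
No example is given in the transcript; a minimal sketch, assuming an RDD with two partitions (the values and partition count are chosen for illustration):

// The function is called once per partition with an Iterator over that
// partition's elements; here each partition is reduced to its sum
val nums2 = sc.parallelize(List(1, 2, 3, 4, 5, 6), 2)
nums2.mapPartitions(iter => Iterator(iter.sum)).collect()   // e.g. Array(6, 15) with this partitioning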

foreach

def foreach(f: T => Unit): Unit
Applies the function f to every element of the RDD.
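
A minimal sketch; note that in cluster mode the println output goes to the executor logs rather than the driver console:

// foreach is an action: it returns Unit and runs only for its side effect
sc.parallelize(1 to 5).foreach(x => println(x))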

Refer to: https://blog.csdn.net/goldlone/article/details/83868822#t3

Origin blog.csdn.net/liweihope/article/details/90919049