map and flatMap
scala> val nums = sc.parallelize(List(1,2,3,4,5,6))
nums: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:24
scala> val squares = nums.map(x => x * x)
squares: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[5] at map at <console>:25
scala> squares.collect
res5: Array[Int] = Array(1, 4, 9, 16, 25, 36)
scala> nums.flatMap(x =>1 to x).collect
res6: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6)
scala> nums.map(x =>1 to x).collect
res7: Array[scala.collection.immutable.Range.Inclusive] = Array(Range(1), Range(1, 2), Range(1, 2, 3), Range(1, 2, 3, 4), Range(1, 2, 3, 4, 5), Range(1, 2, 3, 4, 5, 6))
scala> nums.map(x =>1 to x).flatMap(x => x).collect
res9: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6)
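The same map vs. flatMap behavior can be checked on a plain Scala List, since the RDD API mirrors the Scala collections API (a local sketch, no SparkContext needed):

```scala
// Plain Scala List: the map/flatMap semantics mirror the RDD versions above.
val nums = List(1, 2, 3, 4, 5, 6)

// map: exactly one output element per input element
val squares = nums.map(x => x * x)          // List(1, 4, 9, 16, 25, 36)

// flatMap: each element expands to a collection, then everything is flattened
val expanded = nums.flatMap(x => 1 to x)    // List(1, 1, 2, 1, 2, 3, ...)

// map followed by flattening is equivalent to a single flatMap
val viaMap = nums.map(x => 1 to x).flatten
assert(viaMap == expanded)
```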
join
scala> a.collect
res10: Array[(String, String)] = Array((a,a1), (b,b1), (c,c1), (d,d1), (f,f1), (f,f2))
scala> b.collect
res11: Array[(String, String)] = Array((a,a2), (c,c2), (c,c3), (e,e1))
//inner join
scala> a.join(b).collect
res14: Array[(String, (String, String))] = Array((a,(a1,a2)), (c,(c1,c2)), (c,(c1,c3)))
//left join
scala> a.leftOuterJoin(b).collect
res15: Array[(String, (String, Option[String]))] = Array((d,(d1,None)), (b,(b1,None)), (f,(f1,None)), (f,(f2,None)), (a,(a1,Some(a2))), (c,(c1,Some(c2))), (c,(c1,Some(c3))))
//right join
scala> a.rightOuterJoin(b).collect
res16: Array[(String, (Option[String], String))] = Array((e,(None,e1)), (a,(Some(a1),a2)), (c,(Some(c1),c2)), (c,(Some(c1),c3)))
//full join
scala> a.fullOuterJoin(b).collect
res17: Array[(String, (Option[String], Option[String]))] = Array((d,(Some(d1),None)), (b,(Some(b1),None)), (f,(Some(f1),None)), (f,(Some(f2),None)), (e,(None,Some(e1))), (a,(Some(a1),Some(a2))), (c,(Some(c1),Some(c2))), (c,(Some(c1),Some(c3))))
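To make the Option semantics of the joins explicit, here is an illustrative sketch of two of the flavors on plain Scala sequences. The helper names innerJoin and leftJoin are made up for this example; they are not Spark API:

```scala
// Hypothetical helpers (not Spark API) reproducing join semantics on
// plain Seq[(K, V)] pairs.
val aSeq = Seq("a" -> "a1", "b" -> "b1", "c" -> "c1", "d" -> "d1", "f" -> "f1", "f" -> "f2")
val bSeq = Seq("a" -> "a2", "c" -> "c2", "c" -> "c3", "e" -> "e1")

// inner join: only keys present on both sides survive
def innerJoin[K, V, W](l: Seq[(K, V)], r: Seq[(K, W)]): Seq[(K, (V, W))] =
  for ((k, v) <- l; (k2, w) <- r; if k == k2) yield (k, (v, w))

// left outer join: every left row survives; the right side becomes an Option
def leftJoin[K, V, W](l: Seq[(K, V)], r: Seq[(K, W)]): Seq[(K, (V, Option[W]))] =
  l.flatMap { case (k, v) =>
    val matches = r.collect { case (k2, w) if k2 == k => w }
    if (matches.isEmpty) Seq((k, (v, None)))
    else matches.map(w => (k, (v, Some(w))))
  }

assert(innerJoin(aSeq, bSeq).toSet ==
  Set("a" -> ("a1", "a2"), "c" -> ("c1", "c2"), "c" -> ("c1", "c3")))
// b, d, f, f have no match on the right, so 4 rows carry None
assert(leftJoin(aSeq, bSeq).count { case (_, (_, o)) => o.isEmpty } == 4)
```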
Word count (word frequency statistics) with Spark Core
Source File:
[hadoop@hadoop001 data]$ cat wordcount.txt
world world hello
China hello
people person
love
scala> val wc = sc.textFile("file:///home/hadoop/data/wordcount.txt")
wc: org.apache.spark.rdd.RDD[String] = file:///home/hadoop/data/wordcount.txt MapPartitionsRDD[28] at textFile at <console>:24
scala> wc.collect
res18: Array[String] = Array(world world hello, China hello, people person, love)
//Each element here is a String
//i.e. "world world hello" is one element, "China hello" is another, and so on
scala> val splits = wc.flatMap(x => x.split(" "))
splits: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[46] at flatMap at <console>:25
//flatMap flattens the RDD's elements: each line is split on spaces and the pieces are merged into one RDD, so the three elements above become (world, world, hello, China, hello, people, person, love)
scala> splits.collect
res31: Array[String] = Array(world, world, hello, China, hello, people, person, love)
//Each element of this RDD is a String
scala> val wordone = splits.map(x => (x,1))
wordone: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[47] at map at <console>:25
//Pair each element with the value 1: x becomes (x,1)
//Each word in splits is mapped to a tuple of two elements: the word itself, and the count for this occurrence, fixed at 1
scala> wordone.collect
res32: Array[(String, Int)] = Array((world,1), (world,1), (hello,1), (China,1), (hello,1), (people,1), (person,1), (love,1))
//Each element of this RDD has type (String, Int), i.e. (key, value)
scala> val result = wordone.reduceByKey(_+_)
result: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[49] at reduceByKey at <console>:25
//This involves a shuffle: records with the same key are sent to the same reducer, which aggregates by key, summing up the values for each key
scala> result.collect
res34: Array[(String, Int)] = Array((love,1), (hello,2), (world,2), (people,1), (China,1), (person,1))
//The result type is (String, Int)
//There are several ways to save the result:
scala> result.saveAs
saveAsHadoopDataset saveAsNewAPIHadoopDataset saveAsObjectFile saveAsTextFile
saveAsHadoopFile saveAsNewAPIHadoopFile saveAsSequenceFile
scala> result.saveAsTextFile("/data/wcresult.txt")
//Save the result as a text file
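The whole pipeline can be reproduced locally on a plain Scala List, with groupBy standing in for the shuffle that reduceByKey performs:

```scala
// Local word count: groupBy + sum stands in for reduceByKey's shuffle.
val lines = List("world world hello", "China hello", "people person", "love")

val counts = lines
  .flatMap(_.split(" "))                                // split lines into words
  .map(word => (word, 1))                               // pair each word with 1
  .groupBy(_._1)                                        // group pairs by word
  .map { case (w, pairs) => (w, pairs.map(_._2).sum) }  // sum the 1s per word

assert(counts == Map("world" -> 2, "hello" -> 2, "China" -> 1,
                     "people" -> 1, "person" -> 1, "love" -> 1))
```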
To recap: map applies the function to every element of the RDD, while flatMap first applies the function and then flattens the results.
Building on the above, sort the words by the number of times each appears, in descending / ascending order.
scala> val result = wc.flatMap(x => x.split(" ")).map(x => (x,1)).reduceByKey(_+_)
result: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[52] at reduceByKey at <console>:25
scala> result.collect
res35: Array[(String, Int)] = Array((love,1), (hello,2), (world,2), (people,1), (China,1), (person,1))
//Sort by word count in ascending order (using sortByKey)
//sortByKey sorts by the key, ascending by default. To sort by count, first swap key and value, e.g. (world,2) becomes (2,world), sort, then swap back.
scala> result.map(x => (x._2,x._1)).sortByKey().map(x => (x._2,x._1)).collect
res2: Array[(String, Int)] = Array((love,1), (people,1), (China,1), (person,1), (hello,2), (world,2))
//map(x => (x._2,x._1)) swaps the first and second elements of each tuple
//which achieves the sort by count
The above can be combined into a single chain:
scala> wc.flatMap(x => x.split(" ")).map(x => (x,1)).reduceByKey(_+_).map(x => (x._2,x._1)).sortByKey().map(x => (x._2,x._1)).collect
res3: Array[(String, Int)] = Array((love,1), (people,1), (China,1), (person,1), (hello,2), (world,2))
//Sort by word count in descending order:
//sortByKey sorts by key, ascending by default; for descending order just pass sortByKey(false)
//everything else is the same as the ascending case
scala> wc.flatMap(x => x.split(" ")).map(x => (x,1)).reduceByKey(_+_).map(x => (x._2,x._1)).sortByKey(false).map(x => (x._2,x._1)).collect
res4: Array[(String, Int)] = Array((hello,2), (world,2), (love,1), (people,1), (China,1), (person,1))
//top can also be used
//top is a curried function that returns the first n elements; the result comes back already sorted in descending order
scala> wc.flatMap(x => x.split(" ")).map(x => (x,1)).reduceByKey(_+_).map(x => (x._2,x._1)).top(6).map(x => (x._2,x._1))
res9: Array[(String, Int)] = Array((world,2), (hello,2), (person,1), (people,1), (love,1), (China,1))
//Sort by word count in ascending order (using sortBy)
//Above we sorted with sortByKey; sortBy is another way
//sortBy(f, ascending): the first argument selects the tuple field to sort on; the second is a Boolean, true (the default, may be omitted) for ascending, false (must be written explicitly) for descending
//ascending by value
//sort directly on the second element of each tuple; ascending by default
scala> result.sortBy(_._2).collect
res43: Array[(String, Int)] = Array((love,1), (people,1), (China,1), (person,1), (hello,2), (world,2))
scala> result.sortBy(x => x._2).collect //same as above
res46: Array[(String, Int)] = Array((love,1), (people,1), (China,1), (person,1), (hello,2), (world,2))
//combined into one chain:
scala> wc.flatMap(x => x.split(" ")).map(x => (x,1)).reduceByKey(_+_).sortBy(_._2,false).collect
res15: Array[(String, Int)] = Array((hello,2), (world,2), (love,1), (people,1), (China,1), (person,1))
//Sort by word count in descending order (using sortBy):
scala> result.sortBy(_._2,false).collect
res14: Array[(String, Int)] = Array((hello,2), (world,2), (love,1), (people,1), (China,1), (person,1))
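Scala collections also have a sortBy, though unlike RDD.sortBy it takes an implicit Ordering rather than an ascending Boolean; descending order uses Ordering[...].reverse:

```scala
// Local equivalent of sorting the word counts both ways.
val counted = List(("love", 1), ("hello", 2), ("world", 2),
                   ("people", 1), ("China", 1), ("person", 1))

val asc  = counted.sortBy(_._2)                        // ascending by count
val desc = counted.sortBy(_._2)(Ordering[Int].reverse) // descending by count

assert(asc.map(_._2)  == List(1, 1, 1, 1, 2, 2))
assert(desc.map(_._2) == List(2, 2, 1, 1, 1, 1))
```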
And also this (note that foreach runs on the executors, so the printed order is not guaranteed to follow the sort):
scala> result.sortBy(_._2).foreach(println)
(hello,2)
(world,2)
(love,1)
(people,1)
(China,1)
(person,1)
subtract: set difference
a.subtract(b):
returns the elements of RDD a that do not appear in RDD b
scala> val a = sc.parallelize(1.to(5))
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[62] at parallelize at <console>:24
scala> val b = sc.parallelize(2.to(3))
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[63] at parallelize at <console>:24
scala> a.collect
res18: Array[Int] = Array(1, 2, 3, 4, 5)
scala> b.collect
res19: Array[Int] = Array(2, 3)
scala> a.subtract(b).collect
res20: Array[Int] = Array(4, 1, 5)
intersection: set intersection
scala> a.collect
res18: Array[Int] = Array(1, 2, 3, 4, 5)
scala> b.collect
res19: Array[Int] = Array(2, 3)
scala> a.intersection(b).collect
res27: Array[Int] = Array(2, 3)
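Both operations can be checked on plain collections: subtract is a filterNot against the other side's elements, and intersection a filter. This is a local sketch of the semantics, not how Spark implements them, and note that Spark returns the elements unordered (Array(4, 1, 5) above):

```scala
val xs = (1 to 5).toList
val ys = (2 to 3).toList
val ySet = ys.toSet   // a Set is also a predicate Int => Boolean

// subtract: keep elements of xs that do not appear in ys
assert(xs.filterNot(ySet) == List(1, 4, 5))

// intersection: keep elements of xs that also appear in ys
assert(xs.filter(ySet) == List(2, 3))
```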
cartesian: Cartesian product
scala> a.collect
res18: Array[Int] = Array(1, 2, 3, 4, 5)
scala> b.collect
res19: Array[Int] = Array(2, 3)
//the result is an RDD of key-value pairs
scala> a.cartesian(b)
res29: org.apache.spark.rdd.RDD[(Int, Int)] = CartesianRDD[108] at cartesian at <console>:28
scala> a.cartesian(b).collect
res28: Array[(Int, Int)] = Array((1,2), (2,2), (1,3), (2,3), (3,2), (4,2), (5,2), (3,3), (4,3), (5,3))
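Locally, the Cartesian product is a nested for-comprehension. Spark's output order interleaves partitions (as in the transcript above), while this local version is strictly nested, so the comparison below uses sets:

```scala
val xs2 = (1 to 5).toList
val ys2 = (2 to 3).toList

// every (x, y) pair: 5 * 2 = 10 combinations
val product = for (x <- xs2; y <- ys2) yield (x, y)

assert(product.size == xs2.size * ys2.size)
assert(product.toSet == Set((1,2),(1,3),(2,2),(2,3),(3,2),(3,3),(4,2),(4,3),(5,2),(5,3)))
```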
takeOrdered
takeOrdered(n, [ordering])
def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]
Returns the first num elements of the RDD sorted by the implicit Ordering[T], i.e. their natural order or a custom comparator.
scala> val a = sc.parallelize(List(2,1,3,5,4,8,6,7))
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[112] at parallelize at <console>:24
scala> a.collect
res34: Array[Int] = Array(2, 1, 3, 5, 4, 8, 6, 7)
scala> a.takeOrdered(5)
res35: Array[Int] = Array(1, 2, 3, 4, 5)
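On a local collection, takeOrdered(n) amounts to sorting and taking the first n elements:

```scala
val unsorted = List(2, 1, 3, 5, 4, 8, 6, 7)

// smallest five elements, in ascending order
val smallest5 = unsorted.sorted.take(5)
assert(smallest5 == List(1, 2, 3, 4, 5))
```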
takeSample
takeSample(withReplacement, num, [seed])
def takeSample(
withReplacement: Boolean,
num: Int,
seed: Long = Utils.random.nextLong): Array[T]
Returns an array of num elements randomly sampled from the dataset, with or without replacement, optionally using a pre-specified random number generator seed.
withReplacement = false means each element can be drawn at most once; true means each draw is taken from the whole set, so duplicates are possible.
scala> val a = sc.parallelize(1 to 10)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[109] at parallelize at <console>:24
scala> a.collect
res51: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> a.takeSample(true,5)
res54: Array[Int] = Array(9, 3, 3, 1, 8)
scala> a.takeSample(true,5)
res55: Array[Int] = Array(5, 6, 5, 5, 9)
//Above, withReplacement is true, so duplicates may appear; below it is false, so there are none
scala> a.takeSample(false,5)
res56: Array[Int] = Array(1, 10, 8, 2, 6)
scala> a.takeSample(false,5)
res57: Array[Int] = Array(2, 9, 6, 4, 5)
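A local sketch of the two sampling modes using scala.util.Random with a fixed seed (for reproducibility); this illustrates the semantics, not Spark's actual implementation:

```scala
import scala.util.Random

val pool = (1 to 10).toList
val rnd  = new Random(42L)  // fixed seed, like the optional seed parameter

// with replacement: the same element may be drawn more than once
val withRepl = List.fill(5)(pool(rnd.nextInt(pool.size)))
assert(withRepl.size == 5 && withRepl.forall(pool.contains))

// without replacement: shuffle, then take; all drawn elements are distinct
val withoutRepl = rnd.shuffle(pool).take(5)
assert(withoutRepl.distinct.size == 5)
```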
sample
def sample(
withReplacement: Boolean,
fraction: Double,
seed: Long = Utils.random.nextLong): RDD[T]
Samples the data with or without replacement, using the given random number generator seed.
Sampling modes:
- Without replacement: fraction is the probability that each element is selected; it must be in [0, 1]
- With replacement: fraction is the expected number of times each element is selected; it must be >= 0
countByKey
Counts the elements of the RDD by key
scala> val data = sc.parallelize(List((1,3),(1,2),(5,4),(1, 4),(2,3),(2,4)),3)
data: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[135] at parallelize at <console>:24
scala> data.collect
res58: Array[(Int, Int)] = Array((1,3), (1,2), (5,4), (1,4), (2,3), (2,4))
scala> data.countByKey()
res59: scala.collection.Map[Int,Long] = Map(1 -> 3, 5 -> 1, 2 -> 2)
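Locally, countByKey is a groupBy on the key followed by the size of each group (Spark returns the counts as Long):

```scala
val pairs = List((1, 3), (1, 2), (5, 4), (1, 4), (2, 3), (2, 4))

// group the pairs by key, then count the entries per key
val byKey = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.size.toLong) }
assert(byKey == Map(1 -> 3L, 5 -> 1L, 2 -> 2L))
```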
mapPartitions
def mapPartitions[U: ClassTag](
f: Iterator[T] => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U]
Similar to map, but runs separately on each partition of the RDD; func has type Iterator[T] => Iterator[U]
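The point of mapPartitions is that the function sees a whole partition as one Iterator, so per-partition setup (opening a database connection, say) runs once per partition rather than once per element. A local sketch, modeling partitions as plain Lists:

```scala
// Partitions modeled as plain Lists; perPartition receives each one as an Iterator.
val parts = List(List(1, 2, 3), List(4, 5, 6))

def perPartition(it: Iterator[Int]): Iterator[Int] = {
  // one-time setup per partition would go here
  it.map(_ * 10)
}

val mapped = parts.flatMap(p => perPartition(p.iterator))
assert(mapped == List(10, 20, 30, 40, 50, 60))
```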
foreach
def foreach(f: T => Unit): Unit
Applies the function f to every element of the RDD
Refer to: https://blog.csdn.net/goldlone/article/details/83868822#t3