Creating and Operating on RDDs

I. How to Create an RDD

1) There are two ways to create RDDs: parallelizing an existing collection in your driver program,
2) referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Explanation:
RDDs can be created in two ways:

1) Call SparkContext's parallelize() method on an existing collection (e.g. an array) in the driver program.
2) Reference a dataset in an external storage system, for example by loading data from a local file, or from external data sources such as HDFS, HBase, Cassandra, or Amazon S3. Spark supports text files, SequenceFiles (Hadoop's SequenceFile is a flat file made up of binary-serialized key/value byte streams), and any other file format that conforms to Hadoop's InputFormat.

1. Creating an RDD by Parallelizing a Collection

To create an RDD by parallelizing a collection, call SparkContext's parallelize() method on a collection in your program. Spark copies the data in the collection to the cluster to form a distributed dataset, i.e. an RDD: part of the data lands on one node and the rest on other nodes, and this distributed dataset can then be operated on in parallel.
// Create an RDD by parallelizing a collection
val arr = Array(1,2,3,4,5)
val rdd = sc.parallelize(arr)

Note:
When calling parallelize(), an important parameter you can specify is the number of partitions the collection should be split into. Spark runs one task per partition. The official Spark recommendation is 2-4 partitions per CPU in the cluster. By default Spark sets the number of partitions based on the cluster, but you can also pass a second argument to parallelize() to set the RDD's partition count explicitly, e.g. parallelize(arr, 10), as in the sketch below.
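
A minimal sketch of controlling the partition count, assuming a spark-shell session where sc is already available; the values and the partition count of 10 are only illustrative:

// Let Spark choose the default number of partitions
val arr = Array(1, 2, 3, 4, 5)
val rddDefault = sc.parallelize(arr)
println(rddDefault.getNumPartitions)  // depends on the cluster / master setting

// Request 10 partitions explicitly via the second argument
val rdd10 = sc.parallelize(arr, 10)
println(rdd10.getNumPartitions)       // 10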

2. Creating an RDD from a Local File or HDFS with textFile()

Spark can create RDDs from files in any storage system supported by Hadoop, such as HDFS, Cassandra, HBase, and the local filesystem. Calling SparkContext's textFile() method creates an RDD from a local file or an HDFS file.
// textFile() takes a local file path or an HDFS path
// HDFS: hdfs://spark1:9000/data.txt
// local: /home/hadoop/data.txt
val rdd = sc.textFile("/home/hadoop/data.txt")

Note:
1. For local files: if you are testing locally on Windows, having the file on that Windows machine is enough. If you are running against a Linux local file on a Spark cluster (i.e. you ran spark-submit with --master pointing at the master node in standalone mode, but textFile() still refers to a Linux local path), the file must be copied to every worker node.
2. Spark's textFile() supports creating RDDs from directories, compressed files, and wildcards.
3. By default Spark creates one partition per block of an HDFS file, but you can also set the number of partitions via textFile()'s second argument; you can only ask for more partitions than there are blocks, not fewer. See the sketch below.
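
A short sketch of these textFile() variants, assuming a spark-shell session with sc available; the paths below (a log directory, a .gz file, a wildcard) are hypothetical and only illustrate points 2 and 3:

// A directory: every file under it is read
val dirRDD = sc.textFile("/home/hadoop/logs")

// A compressed file: decompressed transparently
val gzRDD = sc.textFile("/home/hadoop/data.txt.gz")

// A wildcard
val logRDD = sc.textFile("/home/hadoop/logs/*.log")

// Second argument: request a minimum number of partitions
// (per the note above, more than the number of blocks, not fewer)
val hdfsRDD = sc.textFile("hdfs://spark1:9000/data.txt", 8)
println(hdfsRDD.getNumPartitions)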

II. Basic RDD Operations

They fall into three broad categories:

1) Transformations (do not trigger a job)

Create a new RDD from an existing RDD.

2) Actions (trigger a job, which you can see in the Spark UI)

3) Cache

If cached data is lost or corrupted, Spark follows the lineage (dependency chain) back to the lost partition's parent and recomputes it. A short sketch of all three categories follows.
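
A minimal sketch of how the three categories behave, assuming a spark-shell session with sc available; the numbers are illustrative:

val nums = sc.parallelize(1 to 100)

// Transformation: lazy, returns a new RDD, no job is triggered yet
val doubled = nums.map(_ * 2)

// cache: also lazy; the data is materialized in memory the first time an action runs
doubled.cache()

// Action: triggers a job (visible in the Spark UI) and returns a result to the driver
doubled.reduce(_ + _)  // 10100

// If a cached partition is later lost, Spark recomputes it from the lineage (nums -> map)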

Specific operations:

1) map: apply a function to every element of the RDD, returning a new RDD

scala> var a=sc.parallelize(1 to 9)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> a.map(_*2)
res0: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:26

scala> a.map(_*2).collect
res1: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18)

scala> var a=sc.parallelize(List("lion","cat","tiger"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> var b=a.map(x=>(x,1))
b: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[4] at map at <console>:25

scala> b.collect
res2: Array[(String, Int)] = Array((lion,1), (cat,1), (tiger,1))

scala> 

2) filter: keep only the elements that satisfy a predicate

scala> var a = sc.parallelize(1 to 9)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:24

scala> a.collect
res6: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)

scala> a.filter(_%2 == 0).collect
res7: Array[Int] = Array(2, 4, 6, 8)

scala> a.filter(_<5).collect
res8: Array[Int] = Array(1, 2, 3, 4)

scala> var c=sc.parallelize(1 to 6)
c: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:24

scala> val mapRDD=c.map(_*2)
mapRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[9] at map at <console>:25

scala> mapRDD.collect
res9: Array[Int] = Array(2, 4, 6, 8, 10, 12)

scala> mapRDD.filter(_>5).collect
res10: Array[Int] = Array(6, 8, 10, 12)

3) The difference between flatMap and map

map is easy to understand: it applies the given function to every element of the RDD, one by one, mapping it into another RDD. flatMap also applies the function to every element of the RDD, but each call returns an iterator, and the contents of all those iterators are combined into the new RDD; there is an extra flattening step. It is commonly used to split lines into words, e.g. for word counting (see the sketch after the transcript below).

The essential difference:
map performs the specified operation on each input record and returns exactly one object per record, whereas flatMap is a combination of two operations, "map first, then flatten", finally merging all the returned collections into a single flat RDD.

scala> var a = sc.parallelize(1 to 9)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at parallelize at <console>:24

scala> var nums = a.map(x=>(x*x))
nums: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[12] at map at <console>:25

scala> nums.collect
res11: Array[Int] = Array(1, 4, 9, 16, 25, 36, 49, 64, 81)

scala> nums.flatMap(x => 1 to x).collect
res12: Array[Int] = Array(1, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 1, 2, 3, 4, 5,...
scala> 
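
Since flatMap is commonly used to split lines into words, here is a minimal sketch of that use case, assuming a spark-shell session with sc available; the two sample lines are made up:

val lines = sc.parallelize(List("hello spark", "hello rdd"))

// map returns one element per line: an Array of words
lines.map(_.split(" ")).collect
// Array(Array(hello, spark), Array(hello, rdd))

// flatMap flattens those arrays into a single RDD of words
lines.flatMap(_.split(" ")).collect
// Array(hello, spark, hello, rdd)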

4) mapValues: operate only on the values; the keys are left unchanged

scala> val a = sc.parallelize(List("dog","tiger","cat")) 
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[14] at parallelize at <console>:24

scala> val b = a.map(x=>(x,x.length))
b: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[15] at map at <console>:25

scala> b.collect
res13: Array[(String, Int)] = Array((dog,3), (tiger,5), (cat,3))

scala> b.mapValues("x"+_+"x").collect
res14: Array[(String, String)] = Array((dog,x3x), (tiger,x5x), (cat,x3x))

5) sum: sum the elements

scala> var a = sc.parallelize(1 to 100)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at parallelize at <console>:24

scala> a.sum()
res15: Double = 5050.0

scala> a.reduce(_+_)
res16: Int = 5050

6) first: return the first element of the dataset, similar to take(1)

scala> val a = sc.parallelize(List("dog","tiger","cat")) 
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[19] at parallelize at <console>:24

scala> a.first()
res13: String = dog

scala> a.take(1)
res14: Array[String] = Array(dog)

7) top: return the top N elements according to an ordering

scala>  sc.parallelize(Array(6,9,4,7,5,8)).top(2)
res2: Array[Int] = Array(9, 8)

// After bringing a reversed implicit Ordering into scope, top returns the smallest elements (ascending order)
scala>  implicit val myorder = implicitly[Ordering[Int]].reverse 
myorder: scala.math.Ordering[Int] = scala.math.Ordering$$anon$4@470ce6e7

scala> sc.parallelize(Array(6,9,4,7,5,8)).top(2)
res3: Array[Int] = Array(4, 5)

8) subtract: return the elements of this RDD that are not in the other RDD (set difference)

scala> val a = sc.parallelize(1 to 5)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:24

scala> a.collect
res3: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val b = sc.parallelize(2 to 3)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:24

scala> b.collect
res4: Array[Int] = Array(2, 3)

scala> a.subtract(b).collect
res5: Array[Int] = Array(4, 1, 5)

9) intersection: return the elements that appear in both RDDs

scala> val a = sc.parallelize(1 to 5)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:24

scala> a.collect
res3: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val b = sc.parallelize(2 to 3)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:24

scala> b.collect
res4: Array[Int] = Array(2, 3)

scala> a.intersection(b).collect
res5: Array[Int] = Array(2, 3)

10) cartesian: return the Cartesian product of the two RDDs

scala> val a = sc.parallelize(1 to 5)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala>  a.collect
res0: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val b = sc.parallelize(2 to 3)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala>  b.collect
res1: Array[Int] = Array(2, 3)

scala> a.cartesian(b).collect
res2: Array[(Int, Int)] = Array((1,2), (2,2), (1,3), (2,3), (3,2), (4,2), (5,2), (3,3), (4,3), (5,3))


Reposted from blog.csdn.net/weixin_43212365/article/details/90141670