Spark Learning (2): Spark RDDs

What is an RDD

An RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. It represents an immutable, partitionable collection of elements that can be computed on in parallel. RDDs have the characteristics of a data-flow model: automatic fault tolerance, location-aware scheduling, and scalability. RDDs allow users to explicitly cache a working set in memory across multiple queries, so that subsequent queries can reuse the working set, which significantly improves query speed.
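A minimal sketch of this caching idea (it assumes a spark-shell session, so sc already exists, and a made-up input path):

val logs   = sc.textFile("/spark/hello.txt")
val errors = logs.filter(_.contains("error"))
errors.cache()     // explicitly keep the working set in memory

errors.count()     // first action: reads the file and fills the cache
errors.count()     // later queries reuse the cached partitions instead of re-reading the file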

The five main properties of an RDD

  1. A list of partitions (getPartitions): the data is divided into pieces (partitions) that can be computed in parallel; a partition is the atomic piece of the dataset.
  2. A function for computing each partition (compute): the computation is defined per partition, which makes results easy to obtain; it is executed against the partitions of the parent RDD.
  3. A list of dependencies on other RDDs (getDependencies): each RDD records the parent RDDs its computation depends on; source RDDs have no dependencies. This lineage describes how the RDD was derived.
  4. Optionally, a partitioner for key-value RDDs (@transient val partitioner: Option[Partitioner] = None): it describes the partitioning scheme and where the data is placed; key-value RDDs are partitioned by the hash of the key.
  5. Optionally, a list of preferred locations for computing each partition (getPreferredLocations): the preferred location(s) at which each split should be computed.
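Most of these properties can be inspected from the spark-shell. Below is a minimal sketch (it assumes a running spark-shell, so sc already exists; reduceByKey is used only because it gives the resulting RDD a hash partitioner, and property 2, compute, is invoked internally by Spark for each partition rather than called from user code):

scala> val nums  = sc.parallelize(1 to 10, 2)                  // source RDD with 2 partitions
scala> val pairs = nums.map(x => (x % 2, x)).reduceByKey(_ + _)

scala> pairs.partitions.length                        // property 1: the list of partitions (2 here)
scala> pairs.dependencies                             // property 3: a shuffle dependency on the parent RDD
scala> nums.dependencies                              // ...while a source RDD has no parent dependencies
scala> pairs.partitioner                              // property 4: Some(HashPartitioner) for this shuffled key-value RDD
scala> nums.preferredLocations(nums.partitions(0))    // property 5: preferred locations (empty for a locally parallelized collection)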

Put simply, an RDD can be understood as a distributed collection of objects; essentially it is a read-only set of partitioned records. Each RDD can be divided into multiple partitions, and each partition is a fragment of the dataset. Different partitions of an RDD may be stored on different nodes of the cluster, so they can be computed in parallel on different nodes.

The basic concept of RDD

RDD is the most important abstraction provided by Spark. It is a special, fault-tolerant collection of data that can be distributed across the nodes of a cluster, so that various functional, collection-style operations can be performed on it in parallel.

Spark is built on the resilient distributed dataset (RDD) model, which gives it good generality, parallel processing capability, and fault tolerance. An RDD (Resilient Distributed Dataset) is a flexible distributed dataset (comparable to a collection). Essentially it is a description of a dataset (a read-only, partitionable, distributed dataset) rather than the dataset itself: it records where the data is stored, how it is processed, the conversion relationships, and the form of the data after processing. When partitioning, the data is spread as evenly as possible.
Figure 1 shows the relationship between RDD partitions and worker nodes.

[Figure 1: RDD partitions distributed across worker nodes]

Advantages of RDD

Spark's design is built around this abstract dataset (the RDD): you operate on this abstraction just as you would on a local collection, while Spark hides the underlying details (task scheduling, Task execution, retrying failed tasks, and so on), which makes development much simpler and more convenient.

Operating on an RDD is really operating on each of its partitions: the computation logic for a partition is packaged into a Task, the Task is dispatched to an Executor, and only there does it actually touch the data.
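A small sketch of this partition-to-Task relationship (assuming a spark-shell in local mode, so the println output is visible in the console):

val rdd = sc.parallelize(1 to 10, 4)   // 4 partitions

rdd.partitions.length                  // 4, so the stage computing this RDD runs 4 Tasks

rdd.foreachPartition { it =>
  // this closure is packaged into a Task and runs once per partition, on an Executor
  println(s"this task handled ${it.size} records")
}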

Understanding the default partitioning when reading a file from HDFS

By default, the number of partitions Spark uses when reading a file from HDFS equals the number of HDFS blocks of that file; an HDFS block is the smallest unit of distributed storage. If we upload a 30 GB uncompressed file to HDFS and the default block size is 128 MB, the file is split into roughly 235 blocks (30 GB / 128 MB); when Spark reads the file with SparkContext.textFile(), the default number of partitions is likewise roughly 235, equal to the number of blocks.
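A quick way to confirm the partition count from the spark-shell (a sketch; the HDFS address and file name are made up for illustration):

scala> val big = sc.textFile("hdfs://namenode:9000/data/big.log")
scala> big.partitions.length    // roughly one partition per 128 MB HDFS block of the file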

RDD diagram

[Figure: RDD diagram]

Ways to create an RDD

2.1 Creating an RDD by reading a file

An RDD can be created from datasets in external storage systems, including the local file system and any data source supported by Hadoop, such as HDFS, Cassandra, HBase, and so on.

scala> val file = sc.textFile("/spark/hello.txt")

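textFile also accepts an optional second argument, a minimum number of partitions, for cases where more parallelism is wanted than the block count alone provides. A short sketch with the same file:

scala> val file = sc.textFile("/spark/hello.txt", 4)   // ask for at least 4 partitions
scala> file.partitions.length                          // typically 4 here (not fewer than requested)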

2.2 Creating an RDD through parallelization

Created from an existing Scala collection.

scala> val array = Array(1,2,3,4,5)
array: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val rdd = sc.parallelize(array,2)    // the second argument specifies the number of partitions (2 here)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[27] at parallelize at <console>:26


3. Calling a Transformation on an existing RDD also generates a new RDD.

RDD programming API

Spark supports two types of operators (operations): Transformations and Actions.

Transformation

The main job of a Transformation is to generate a new RDD from an existing one. Transformations are lazy (lazy loading): the code of a Transformation operator is not executed immediately. Only when an Action operator is encountered is the code actually executed. This design allows Spark to run more efficiently.
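A small sketch of this lazy behavior in the spark-shell (local mode, so the println calls inside the task are visible in the console):

scala> val nums    = sc.parallelize(1 to 5)
scala> val doubled = nums.map { x => println(s"computing $x"); x * 2 }   // nothing is printed yet: the map is only recorded

scala> doubled.collect()   // only this action triggers the computation, so the printlns appear now
res0: Array[Int] = Array(2, 4, 6, 8, 10)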

Commonly used Transformations (each entry gives the function, its effect, an example, and the result):

map(func): applies the function to every element of the RDD and returns a new RDD. Example: rdd1.map(x => x + 1) → {2,3,4,4}
mapValues(func): applies the function to the value of each (K, V) element, leaving the key unchanged, and returns a new RDD. Example: rdd1.mapValues(_.sum) turns ("jarry",(1)), ("tom",(1,2,3)) into ("jarry",1), ("tom",6)
flatMap(func): applies the function to every element of the RDD; each element is mapped to an iterator whose contents are flattened into the new RDD. Example: rdd1.flatMap(x => x.to(3)) → {1,2,3,2,3,3,3}
filter(func): keeps only the elements that satisfy the predicate and returns a new RDD. Example: rdd1.filter(x => x != 1) → {2,3,3}
distinct(): removes duplicate elements from the RDD. Example: rdd1.distinct() → {1,2,3}
union(): returns a new RDD containing all elements of both RDDs. Example: rdd1.union(rdd2) → {1,2,3,3,3,4,5}
join(): called on two (K, V) RDDs; for every key that appears in both, the values are combined into (K, (V1, V2)). Example: {("jarry",1),("tom",1),("tom",2)}.join({("tom",1),("lin",1)}) → {("tom",(1,1)),("tom",(2,1))}
intersection(): returns the elements common to both RDDs. Example: rdd1.intersection(rdd2) → {3}
subtract(): removes from the original RDD the elements that also appear in the argument RDD. Example: rdd1.subtract(rdd2) → {1,2}
cartesian(): returns the Cartesian product of the two RDDs. Example: rdd1.cartesian(rdd2) → {(1,3),(1,4)…(3,5)}
mapPartitions(func): like map, but runs once per partition of the RDD; for an RDD of type T the function must have type Iterator[T] => Iterator[U]. Example: rdd.mapPartitions((it: Iterator[Int]) => it.map(e => s"$e"))
mapPartitionsWithIndex(func): like mapPartitions, but the function also receives the integer index of the partition, so for an RDD of type T it must have type (Int, Iterator[T]) => Iterator[U] (the Int is the partition number, the Iterator is a reference to the partition's data). A partition holds no data itself, only a record of what data to read; the data is actually read when the generated Task runs. Because the partition number comes along with the data, you can tell which partition (and therefore which Task) each record belongs to. Example: val func = (index: Int, it: Iterator[Int]) => it.map(e => s"part: $index, ele: $e"). This function pulls out the data of a partition together with its partition number; the first parameter is the partition index and the second is an iterator over the partition's data, which can be processed independently.
groupByKey([numTasks]): called on a (K, V) RDD; returns a (K, Iterable[V]) RDD. Example: ("jarry",1),("tom",1),("tom",2),("tom",3) → ("jarry",(1)), ("tom",(1,2,3))
reduceByKey(func, [numTasks]): called on a (K, V) RDD; returns a (K, V) RDD in which the values of each key are aggregated with the given reduce function. Similar to groupByKey; the number of reduce tasks can be set with the optional second parameter. Aggregation is done locally within each partition first and then globally, with the same function used for both steps.
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]): called on a (K, V) RDD; aggregates within each partition first and then across partitions, using the two functions for the two stages, with the initial value taking part in the per-partition aggregation. Example: aggregateByKey(0)(_ + _, _ + _) on a key/value RDD.
sortByKey([ascending], [numTasks]): called on a (K, V) RDD where K implements the Ordered interface; returns a (K, V) RDD sorted by key.
sortBy(func, [ascending], [numTasks]): like sortByKey but more flexible: the first argument selects what to sort by, the second sets the order (false means descending), and the third sets the number of partitions after sorting (by default the same as the original RDD).
mapValues(func): as the name suggests, the function is applied to the Value of each Key-Value element of the RDD; the original Key is kept unchanged and, together with the new Value, forms an element of the new RDD. This function therefore only applies to RDDs whose elements are K-V pairs. Example: b.mapValues("x" + _ + "x")
foldByKey(zero)(func, [numTasks]): like reduceByKey, but with an additional initial value.
combineByKey(func1, func2, func3): called on a (K, V) RDD. The first function takes the first value of each key within a partition and turns it into the combined form; the second function merges each remaining value in that partition into the result; the third function merges the per-partition results that share the same key. Because this is a fairly low-level method, the parameter types must be written out. Example 1: rdd.combineByKey(x => x, (a: Int, b: Int) => a + b, (n: Int, m: Int) => n + m), a plain key/value sum. Example 2: rdd.combineByKey(x => ListBuffer(x), (a: ListBuffer[String], b: String) => a += b, (n: ListBuffer[String], m: ListBuffer[String]) => n ++= m), where the first function puts each key's first value into a ListBuffer, the second appends the remaining values for that key, and the third merges the per-partition results by key.

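A runnable sketch of the list-building combineByKey variant described in the table above (it assumes a spark-shell session; the exact output order may vary):

import scala.collection.mutable.ListBuffer

val pairs = sc.parallelize(List(("tom", "cat"), ("tom", "dog"), ("jerry", "mouse"), ("tom", "fish")), 2)

val grouped = pairs.combineByKey(
  (v: String) => ListBuffer(v),                                   // first function: put each key's first value (per partition) into a ListBuffer
  (buf: ListBuffer[String], v: String) => buf += v,               // second function: append the partition's remaining values for that key
  (b1: ListBuffer[String], b2: ListBuffer[String]) => b1 ++= b2   // third function: merge the per-partition buffers of the same key
)

grouped.collect()   // e.g. Array((tom,ListBuffer(cat, dog, fish)), (jerry,ListBuffer(mouse)))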

Action

An Action triggers the actual execution of the code; a piece of Spark code needs at least one Action operation. Only when the program encounters an Action operator is the code really executed.

Commonly used Actions (each entry gives the function, its effect, an example, and the result):

collect(): returns all elements of the RDD. Example: rdd.collect() → {1,2,3,3}
count(): the number of elements in the RDD. Example: rdd.count() → 4
countByKey(): the number of occurrences of each key in the RDD. Example: rdd.countByKey() turns ("a",1),("b",2),("c",3),("a",1) into a → 2, b → 1, c → 1
countByValue(): the number of occurrences of each element in the RDD. Example: rdd.countByValue() → {(1,1),(2,1),(3,2)}
take(num): returns the elements with indexes 0 to num-1 of the RDD, without sorting. Example: rdd.take(2) → {1,2}
top(num): returns the first num elements according to the default (descending) or a specified ordering. Example: rdd.top(2) → {3,3}
reduce(): aggregates all elements of the RDD in parallel, for example to compute a sum. Example: rdd.reduce((x,y) => x + y) → 9
fold(zero)(func): same as reduce() but requires an initial value. Example: rdd.fold(0)((x,y) => x + y) → 9
foreach(func): applies the given function to every element of the RDD. Example: rdd1.foreach(x => println(x)) prints every element
foreachPartition(func): applies the given function once per partition; the function receives an Iterator[V]. Example: rdd1.foreachPartition(it => it.foreach(x => println(x))) prints every element, partition by partition
saveAsTextFile(path): saves the elements of the dataset as text files in the given file system. Example: rdd1.saveAsTextFile("file:///home/test")
saveAsSequenceFile(path): saves the elements of the dataset in SequenceFile format under the given directory. Example: rdd.saveAsSequenceFile("hdfs://home/test")
aggregate(zero)(func1, func2): aggregates each partition first and then combines the partition results; the first argument is the initial value and the two functions handle the local and global steps (both use the initial value). Examples: rdd1.aggregate(20)(math.max(_, _), _ + _), which finds the maximum within each partition (each comparison also involves the initial value) and then sums the partition maxima; and rdd1.aggregate("@")(_ + _, _ + _), a string concatenation in which both the local and the global step start from the initial value "@", so partitions (a,b,c) and (d,e,f) give @@abc@def
flatMapValues(func): applies the function to the value of each (K, V) element, splits the result into an iterator and flattens it, pairing each resulting value with the original key. Example: rdd1.flatMapValues(_.split("_")) turns ("a","1_2"), ("b","2_2") into (a,1), (a,2), (b,2), (b,2)
sortBy: sorts the elements of the RDD according to the given key function. Example (sorting by value in descending order): rdd1.sortBy(x => x._2, false).collect → Array((B,7), (B,6), (B,3), (A,2), (A,1))
(Note: flatMapValues and sortBy are in fact transformations rather than actions.)
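A sketch of the string-concatenation aggregate example from the table (assuming a spark-shell with the six letters split across two partitions; the order in which the partition results reach the driver can swap):

scala> val rdd1 = sc.parallelize(List("a", "b", "c", "d", "e", "f"), 2)

scala> rdd1.aggregate("@")(_ + _, _ + _)
res0: String = @@abc@def    // each partition starts from "@" ("@abc" and "@def"), and the global step starts from "@" again; @@def@abc is equally possible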

WordCount Code

Concise version

   sc.textFile(args(0)).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_).sortBy(_._2, false).saveAsTextFile(args(1))

Detailed version


import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {

  def main(args: Array[String]): Unit = {

    // create the Spark configuration and set the application name
    //val conf = new SparkConf().setAppName("ScalaWordCount")
    val conf = new SparkConf().setAppName("ScalaWordCount").setMaster("local[4]")
    // create the entry point for Spark execution
    val sc = new SparkContext(conf)

    // specify where to read the data from and create the RDD (resilient distributed dataset)
    val lines: RDD[String] = sc.textFile("hdfs://node-4:9000/wc1", 1)
    // number of partitions
    val partitionsNum = lines.partitions.length

    // split the lines and flatten the words
    val words: RDD[String] = lines.flatMap(_.split(" "))

    // pair each word with a 1
    val wordAndOne: RDD[(String, Int)] = words.map((_, 1))

    // aggregate by key
    val reduced: RDD[(String, Int)] = wordAndOne.reduceByKey(_ + _)

    // sort by count, descending
    val sorted: RDD[(String, Int)] = reduced.sortBy(_._2, false)

    // save the result to HDFS
    sorted.saveAsTextFile(args(1))

    // release resources
    sc.stop()
  }

}

For the full RDD API, see the official Spark documentation.


Origin blog.csdn.net/heartless_killer/article/details/104525395