Good Programmers Big Data learning route: Resilient Distributed Datasets (RDD)

  This installment of the Good Programmers Big Data learning route covers Resilient Distributed Datasets (RDD). An RDD (Resilient Distributed Dataset) is Spark's most basic data abstraction: it represents an immutable (in both data and metadata), partitioned collection whose elements can be computed in parallel.

RDD features: automatic fault tolerance, location-aware scheduling, and scalability.

RDD properties

1. A list of partitions (slices)

The partition is the basic unit of the dataset. Each partition is processed by one compute task, so the number of partitions determines the granularity of parallelism. The user can specify the number of partitions when creating an RDD; if none is given, a default is used, which is the number of CPU cores allocated to the program.
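A minimal spark-shell sketch of this property (the exact default depends on how many cores the application was given):

val data = sc.parallelize(1 to 10)       // uses the default number of partitions
data.partitions.length                   // equals the CPU cores allocated to the program
val data4 = sc.parallelize(1 to 10, 4)   // explicitly request 4 partitions
data4.partitions.length                  // Int = 4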

2. A function that computes each partition

Spark computes RDDs in units of partitions, and each RDD implements a compute function for this purpose. The compute function composes iterators, so the result of each intermediate computation does not need to be saved.

3. Dependencies between RDDs

Each transformation on an RDD produces a new RDD, so upstream and downstream RDDs form a pipeline-like dependency chain (lineage).

Fault tolerance: when the data of some partitions is lost, Spark can recompute just the lost partitions through this dependency chain, rather than recomputing all partitions of the RDD.
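A short spark-shell sketch showing how to inspect the lineage this recovery relies on (dependencies and toDebugString are standard RDD methods):

val base   = sc.parallelize(1 to 100)
val paired = base.map(x => (x % 10, x)).reduceByKey(_ + _)
paired.dependencies            // the direct parent dependency (a shuffle dependency here)
println(paired.toDebugString)  // prints the whole lineage graph used for recomputation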

4. A Partitioner (partitioning function)

The Partitioner is the RDD's partitioning function. Spark currently provides two implementations: the hash-based HashPartitioner and the range-based RangePartitioner. Only key-value RDDs have a Partitioner; for non-key-value RDDs the Partitioner is None. The Partitioner determines both the number of partitions of the RDD itself and the number of partitions of the shuffle output of the parent RDD.
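A small spark-shell sketch of this behaviour (HashPartitioner comes from org.apache.spark; the sample data is made up):

import org.apache.spark.HashPartitioner

val nums = sc.parallelize(1 to 10)
nums.partitioner                                        // None: not a key-value RDD
val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val hashed = pairs.partitionBy(new HashPartitioner(4))
hashed.partitioner                                      // Some(HashPartitioner)
hashed.partitions.length                                // Int = 4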

5. A list of preferred locations

Each partition has a list of preferred locations for accessing it (preferred locations) -> the data-locality principle.

For an HDFS file, this list holds the block locations of each partition. Following the idea that "moving computation is cheaper than moving data", Spark tries, when scheduling tasks, to assign each task to the node that stores the data block it will process.
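A sketch of how to ask Spark for these preferred locations; the HDFS path is just the example file used later in this article, so adjust it to your own cluster:

val lines = sc.textFile("hdfs://hadoop01:9000/wordcount/input/a.txt")
lines.partitions.foreach { p =>
  println(s"partition ${p.index} -> ${lines.preferredLocations(p)}")
}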

RDD operation types

1. Transformation -> records the computation (the operation and its parameters are recorded; nothing is executed until an action runs). A short sketch of two of these follows the list.

Common transformations and their meanings:

map(func): Returns a new RDD formed by passing each element of the source RDD through the function func.

filter(func): Returns a new RDD formed by selecting the elements of the source RDD for which func returns true.

flatMap(func): Similar to map, but each input element can be mapped to zero or more output elements (func should return a sequence rather than a single element).

mapPartitions(func): Similar to map, but runs separately on each partition of the RDD, so when running on an RDD of type T, func must be of type Iterator[T] => Iterator[U].

mapPartitionsWithIndex(func): Similar to mapPartitions, but func takes an extra integer parameter for the partition index, so when running on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U].

sample(withReplacement, fraction, seed): Samples the data at the ratio given by fraction; withReplacement chooses whether to sample with replacement; seed specifies the random number generator seed.

union(otherDataset): Returns a new RDD that is the union of the source RDD and the argument RDD.

intersection(otherDataset): Returns a new RDD that is the intersection of the source RDD and the argument RDD (the related subtract operator gives the set difference).

distinct([numTasks]): Returns a new RDD with the duplicate elements of the source RDD removed (numTasks can change the number of partitions).

groupByKey([numTasks]): Called on an RDD of (K, V) pairs, returns an RDD of (K, Iterator[V]) pairs.

reduceByKey(func, [numTasks]): Called on an RDD of (K, V) pairs, returns an RDD of (K, V) pairs in which the values for each key are aggregated with the given reduce function; as with groupByKey, the number of reduce tasks can be set with an optional second parameter.

aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]): Called on an RDD of (K, V) pairs; aggregates the values of each key starting from zeroValue, using seqOp within each partition and combOp across partitions.

sortByKey([ascending], [numTasks]): Called on an RDD of (K, V) pairs where K implements the Ordered interface, returns an RDD of (K, V) pairs sorted by key.

sortBy(func, [ascending], [numTasks]): Similar to sortByKey, but more flexible.

join(otherDataset, [numTasks]): Called on RDDs of type (K, V) and (K, W), returns an RDD of (K, (V, W)) pairs with all pairs of elements for each matching key.

cogroup(otherDataset, [numTasks]): Called on RDDs of type (K, V) and (K, W), returns an RDD of type (K, (Iterable[V], Iterable[W])).

cartesian(otherDataset): Cartesian product.

pipe(command, [envVars]): Pipes each partition of the RDD through an external shell command.

coalesce(numPartitions): Reduces the number of partitions of the RDD to numPartitions.

repartition(numPartitions): Repartitions the RDD into numPartitions partitions.

repartitionAndSortWithinPartitions(partitioner): Repartitions the RDD according to the given partitioner and sorts the records by key within each resulting partition.
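Two of the transformations above, mapPartitionsWithIndex and aggregateByKey, are easy to get wrong, so here is a minimal spark-shell sketch of them (the sample data is made up for illustration):

val nums = sc.parallelize(1 to 9, 3)

// mapPartitionsWithIndex: func must be (Int, Iterator[T]) => Iterator[U]
nums.mapPartitionsWithIndex { (idx, it) =>
  it.map(x => s"partition $idx has element $x")
}.collect

// aggregateByKey: zeroValue plus a within-partition seqOp and a cross-partition combOp
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3), ("b", 4)), 2)
pairs.aggregateByKey(0)((a, b) => math.max(a, b), _ + _).collect   // e.g. Array((a,2), (b,4))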


2. Action -> triggers job execution (one job is generated per action operator). A short sketch of a few of these follows the list.

Common actions and their meanings:

reduce(func): Aggregates all elements of the RDD with the function func; the function must be commutative and associative so it can be computed in parallel.

collect(): Returns all elements of the dataset to the driver program as an array.

count(): Returns the number of elements in the RDD.

first(): Returns the first element of the RDD (similar to take(1)).

take(n): Returns an array of the first n elements of the dataset.

takeSample(withReplacement, num, [seed]): Returns an array of num elements randomly sampled from the dataset, with or without replacement; seed specifies the random number generator seed.

takeOrdered(n, [ordering]): Similar to top, but returns the elements in the opposite order to top.

saveAsTextFile(path): Saves the elements of the dataset as a text file on HDFS or another supported file system; for each element, Spark calls toString to convert it to a line of text in the file.

saveAsSequenceFile(path): Saves the elements of the dataset in Hadoop SequenceFile format to the given directory, on HDFS or another Hadoop-supported file system.

saveAsObjectFile(path): Saves the elements of the dataset as serialized objects to the given directory; they can be read back with SparkContext.objectFile.

countByKey(): For RDDs of type (K, V), returns a (K, Int) map giving the number of elements for each key.

foreach(func): Runs the function func on each element of the dataset.
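A quick spark-shell sketch of a few actions from the list that the exercises below do not touch (the data is made up):

val pairs = sc.parallelize(List(("tom", 1), ("tom", 2), ("jerry", 3)))
pairs.countByKey()                 // Map(tom -> 2, jerry -> 1), returned to the driver

val nums = sc.parallelize(1 to 100)
nums.takeSample(false, 5, 42)      // 5 random elements, without replacement, seed 42

nums.foreach(println)              // runs on the executors; output goes to the executor logs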

Creating an RDD

Enter the Spark shell on Linux:

/usr/local/spark.../bin/spark-shell \

--master spark://hadoop01:7077 \

--executor-memory 512m \

--total-executor-cores 2

Or in a Maven project:

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object lx03 {

  def main(args: Array[String]): Unit = {

    val conf: SparkConf = new SparkConf()
      .setAppName("SparkAPI")
      .setMaster("local[*]")

    val sc: SparkContext = new SparkContext(conf)

    // create an RDD by parallelizing a local collection
    val rdd1: RDD[Int] = sc.parallelize(List(24, 56, 3, 2, 1))

    // multiply each element of rdd1 by 2, then sort ascending
    val rdd2: RDD[Int] = rdd1.map(_ * 2).sortBy(x => x, true)

    println(rdd2.collect().toBuffer)

    // filter out the elements greater than or equal to 10
    // val rdd3: RDD[Int] = rdd2.filter(_ >= 10)
    // println(rdd3.collect().toBuffer)
  }
}

Exercise 2

val rdd1 = sc.parallelize(Array("a b c", "d e f", "h i j"))

// split each element of rdd1, then flatten

val rdd2 = rdd1.flatMap(_.split(' '))

rdd2.collect

// a more complex case:

val rdd1 = sc.parallelize(List(List("a b c", "a b b"), List("e f g", "a f g"), List("h i j", "a a b")))

// split each element of rdd1, then flatten

val rdd2 = rdd1.flatMap(_.flatMap(_.split(" ")))

Exercise 3

val rdd1 = sc.parallelize(List(5, 6, 4, 3))

val rdd2 = sc.parallelize(List(1, 2, 3, 4))

// union

val rdd3 = rdd1.union(rdd2)


// intersection

val rdd4 = rdd1.intersection(rdd2)

// remove duplicates

rdd3.distinct.collect

rdd4.collect

Exercise 4

val rdd1 = sc.parallelize(List(("tom", 1), ("jerry", 3), ("kitty", 2)))

val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))

// join

val rdd3 = rdd1.join(rdd2)   // pairs with the same key are combined into (key, (value1, value2))

// result: Array[(String,(Int,Int))] = Array((tom,(1,1)), (jerry,(3,2)))

rdd3.collect

// left outer join and right outer join

val rdd3 = rdd1.leftOuterJoin(rdd2)

rdd3.collect

val rdd3 = rdd1.rightOuterJoin(rdd2)

rdd3.collect

// union

val rdd4 = rdd1 union rdd2

// group by key

rdd4.groupByKey.collect

// implement word count with groupByKey and with reduceByKey, respectively

val rdd3 = rdd1 union rdd2

rdd3.groupByKey().mapValues(_.sum).collect

rdd3.reduceByKey(_+_).collect

The difference between groupByKey and reduceByKey

The reduceByKey operator is special: it first aggregates locally within each partition and then aggregates globally after the shuffle, so we only need to pass in a single aggregation function.

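A rough sketch of what that local-then-global aggregation means; this hand-written version only illustrates the idea and is not Spark's actual implementation:

val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), 2).map((_, 1))

// step 1 (local): collapse duplicate keys inside each partition before any shuffle
val locallyCombined = words.mapPartitions { it =>
  it.toSeq.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }.iterator
}

// step 2 (global): only the partial counts per partition are shuffled and merged
locallyCombined.reduceByKey(_ + _).collect   // same result as words.reduceByKey(_ + _)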

Exercise 5

val rdd1 = sc.parallelize(List(("tom", 1), ("tom", 2), ("jerry", 3), ("kitty", 2)))

val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))

//cogroup

val rdd3 = rdd1.cogroup(rdd2)

// note the difference between cogroup and groupByKey

rdd3.collect


val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5))

// aggregate with reduce

val sum = rdd1.reduce(_ + _)


// sort by value in descending order (rdd4 is assumed to be a (key, count) pair RDD from an earlier step)

val rdd5 = rdd4.map(t => (t._2, t._1)).sortByKey(false).map(t => (t._2, t._1))

rdd5.collect

// Cartesian product

val rdd3 = rdd1.cartesian(rdd2)


Count the number of elements

scala> val rdd1 = sc.parallelize(List(2,3,1,5,7,3,4))

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:27


scala> rdd1.count

res0: Long = 7  

top returns the n largest elements, sorted in descending order

scala> rdd1.top(3)

res1: Array[Int] = Array(7, 5, 4)                                               


scala> rdd1.top(0)

res2: Array[Int] = Array()


scala> rdd1.top(100)

res3: Array[Int] = Array(7, 5, 4, 3, 3, 2, 1)

take returns the first n elements of the original collection; if fewer than n exist, it returns as many as there are

scala> rdd1.take(3)

res4: Array[Int] = Array(2, 3, 1)


scala> rdd1.take(100)

res5: Array[Int] = Array(2, 3, 1, 5, 7, 3, 4)


scala> rdd1.first

res6: Int = 2

takeOrdered returns the n smallest elements in ascending order (the opposite order to top)

scala> rdd1.takeOrdered(3)

res7: Array[Int] = Array(1, 2, 3)


scala> rdd1.takeOrdered(30)

res8: Array[Int] = Array(1, 2, 3, 3, 4, 5, 7)

                             

Two ways to create an RDD

1. Create an RDD by parallelizing a collection (two partitions by default here)

Default versus manually specified number of partitions:

scala> val rdd1 = sc.parallelize(List(1,2,3,5))

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:27


scala> rdd1.partitions.length  // get the number of partitions

res9: Int = 2


scala> val rdd1 = sc.parallelize(List(1,2,3,5),3)

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:27


scala> rdd1.partitions.length

res10: Int = 3

2. Use textFile to read data from a file storage system (such as HDFS)

scala> val rdd2 = sc.textFile("hdfs://hadoop01:9000/wordcount/input/a.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)

rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[11] at reduceByKey at <console>:27


scala> rdd2.collect  // call an action to compute the RDD and show the result

res11: Array[(String, Int)] = Array((hello,6), (beijing,1), (java,1), (gp1808,1), (world,1), (good,1), (qianfeng,1))


scala> val rdd2 =  sc.textFile("hdfs://hadoop01:9000/wordcount/input/a.txt",4).flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)

rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[26] at reduceByKey at <console>:27


scala> rdd2.partitions.length    // you can also specify the number of partitions yourself

res15: Int = 4
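To round the example off, the result can be written back out with saveAsTextFile from the action list; the output path is only an example and must not already exist:

scala> rdd2.saveAsTextFile("hdfs://hadoop01:9000/wordcount/output")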

