Spark-Core


1. Spark Introduction
2. Spark-Core core operators
3. Spark-Core
4. SparkSQL



1. RDD programming

1. RDD serialization

Initialization work is performed on the Driver side, while the actual program runs on the Executor side. This involves cross-process communication, so the objects involved must be serializable.


import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import org.junit.Test

class User extends Serializable {
  var name: String = _
}

class Test04 {

  Logger.getLogger("org").setLevel(Level.ERROR)

  @Test
  def test(): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("SparkCore").setMaster("local[*]")
    val sc: SparkContext = new SparkContext(conf)
    val rdd01: RDD[(Int, String)] = sc.makeRDD(Array((111, "aaa"), (222, "bbbb"), (333, "ccccc")), 3)

    val user01: User = new User()
    user01.name = "list"
    val user02: User = new User()
    user02.name = "lisi"
    val userRdd01: RDD[User] = sc.makeRDD(List(user01, user02))

    //  If User does not extend Serializable, this fails with java.io.NotSerializableException: day04.User
    userRdd01.foreach(user => println(user.name))
    sc.stop()
  }
}
1.2 Kryo serialization framework

Reference address: https://github.com/EsotericSoftware/kryo

Java serialization can serialize any class, but it is heavyweight and the serialized objects are relatively large.

For performance reasons, Spark 2.0 began to support another serialization mechanism, Kryo. Kryo is roughly 10 times faster than Java serialization. When RDD data is shuffled, simple data types, arrays, and strings are already serialized with Kryo internally by Spark.

import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import org.junit.Test

class Test04 {

  Logger.getLogger("org").setLevel(Level.ERROR)

  @Test
  def test(): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("SparkCore").setMaster("local[*]")
      // Replace the default serialization mechanism
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Register the custom classes that should be serialized with Kryo
      .registerKryoClasses(Array(classOf[Search]))
    val sc: SparkContext = new SparkContext(conf)

    val rdd: RDD[String] = sc.makeRDD(Array("hello world", "hello", "world"))
    val search: Search = new Search("hello")
    val result: RDD[String] = rdd.filter(search.isMatch)
    println(result.collect().toList)
  }
}

//  The search keyword is encapsulated in a class
//  The class must still implement Serializable before Kryo serialization can be used for it
class Search(val query: String) extends Serializable {

  def isMatch(s: String): Boolean = {
    s.contains(query)
  }
}

2. RDD dependencies

2.1 Viewing lineage

RDDs only support coarse-grained transformations, i.e., single operations applied to a large number of records. To recover lost partitions, an RDD records the series of transformations (its Lineage) used to create it. An RDD's Lineage records the RDD's metadata and transformation behavior; when some of the RDD's partition data is lost, the lost partitions can be recomputed and restored from this information.


  • The number in parentheses indicates the degree of parallelism of the RDD, that is, how many partitions there are
rdd03.toDebugString
@Test
def test(): Unit = {
  val conf: SparkConf = new SparkConf().setAppName("SparkCore").setMaster("local[*]")
  val sc: SparkContext = new SparkContext(conf)
  val rdd01: RDD[String] = sc.textFile("input/1.txt")
  println(rdd01.toDebugString)
  println("rdd01===")
  val rdd02: RDD[String] = rdd01.flatMap(_.split(" "))
  println(rdd02.toDebugString)
  println("rdd02====")
  val rdd03: RDD[(String, Int)] = rdd02.map((_, 1))
  println(rdd03.toDebugString)
  println("rdd03====")
  val rdd04: RDD[(String, Int)] = rdd03.reduceByKey(_ + _)
  println(rdd04.toDebugString)
  sc.stop()
}


2.2 View dependencies


The relationship between RDDs can be understood from two dimensions: one is which RDDs an RDD was transformed from, that is, what its parent RDD(s) are (lineage); the other is which Partition(s) of the parent RDD(s) the RDD depends on (dependencies).

There are two different types of dependencies between an RDD and the parent RDD(s) it depends on: narrow dependency (NarrowDependency) and wide dependency (ShuffleDependency).

2.3 Narrow dependencies

One-to-one, many-to-one

  • Narrow dependency means that each Partition of the parent RDD is used by at most one Partition of the child RDD (one-to-one or many-to-one).
  • Narrow dependency can be vividly likened to an only child; see the sketch below.

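As a quick illustration (a minimal sketch assuming a local SparkContext named sc, not part of the original example), a narrow-dependency operator such as map reports a OneToOneDependency when you inspect the RDD's dependencies:

val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4), 2)
// map is a narrow (one-to-one) transformation
val mapped: RDD[Int] = rdd.map(_ * 2)
// Prints something like List(org.apache.spark.OneToOneDependency@...)
println(mapped.dependencies)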

2.4 Wide dependencies

One-to-many will cause Shuffle

  • Wide dependency means that the same Partition of the parent RDD is depended on by Partitions of multiple child RDDs (one-to-many), which causes a Shuffle.

  • Summary: wide dependency can be vividly likened to a family with more than one child, in contrast to the only child of narrow dependency.

  • Transformations with wide dependencies include operations such as sort, reduceByKey, groupByKey, join, and repartition.

  • Wide dependencies have a more significant impact on how Spark evaluates a transformation, for example on performance.

  • Without affecting business requirements, try to avoid transformation operators with wide dependencies: a wide dependency always introduces a shuffle, which hurts performance.

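Similarly (again a minimal sketch assuming sc, not from the original example), a shuffle operator such as reduceByKey reports a ShuffleDependency:

val pairs: RDD[(String, Int)] = sc.makeRDD(List(("a", 1), ("b", 1), ("a", 1)), 2)
// reduceByKey is a wide (shuffle) transformation
val reduced: RDD[(String, Int)] = pairs.reduceByKey(_ + _)
// Prints something like List(org.apache.spark.ShuffleDependency@...)
println(reduced.dependencies)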

2.5 Stage task division

DAG directed acyclic graph

DAG (Directed Acyclic Graph) is a topological graph composed of vertices and edges; the graph has direction and contains no cycles.

DAG records the transformation process of RDD and the stages of tasks.

RDD task segmentation

RDD tasks are divided into: Application, Job, Stage and Task

  • Application: Initializing a SparkContext generates an Application;

  • Job: An Action operator will generate a Job;

  • Stage: the number of Stages equals the number of wide dependencies plus 1;

  • Task: In a Stage, the number of partitions of the last RDD is the number of Tasks.

Note: Application -> Job -> Stage -> Task; each layer has a 1-to-n relationship with the next.

@Test
def Test(): Unit = {
  val conf: SparkConf = new SparkConf().setAppName("SparkCore").setMaster("local[*]")
  //  1. Application: initializing a SparkContext creates one Application
  val sc: SparkContext = new SparkContext(conf)
  val lineRdd: RDD[String] = sc.textFile("input/1.txt")
  val rdd01: RDD[String] = lineRdd.flatMap(_.split(" "))
  val rdd02: RDD[(String, Int)] = rdd01.map((_, 1))
  //  3. Stage: reduceByKey introduces a wide dependency, adding one stage (2 stages in total)
  val resultRdd: RDD[(String, Int)] = rdd02.reduceByKey(_ + _)
  //  2. Job: each action operator generates one Job (2 Jobs in total)
  resultRdd.collect().foreach(println)
  resultRdd.saveAsTextFile("output")
  Thread.sleep(Long.MaxValue)
  sc.stop()
}

Number of Applications:

//  1. Application: initializing a SparkContext creates one Application
val sc: SparkContext = new SparkContext(conf)

Number of jobs


Number of stages


Number of Tasks

  • If there is a shuffle, the system automatically caches the shuffle output, and the UI shows the skipped part.
  • Looking at the Stages, there are 2 Tasks.


3. RDD persistence

3.1 Cache

An RDD caches the results of previous computations through the cache or persist method. By default the data is cached in the JVM's heap memory. However, these two methods do not cache immediately when called; only when a subsequent action operator is triggered is the RDD cached in the memory of the computing nodes, to be reused later.

//  cache() calls persist() under the hood; the default storage level is MEMORY_ONLY
wordToOneRdd.cache()
//  The storage level can be changed
wordToOneRdd.persist(StorageLevel.MEMORY_AND_DISK_2)

Example:

val wordRdd: RDD[String] = lineRdd.flatMap(line => line.split(" "))
val wordToOneRdd: RDD[(String, Int)] = wordRdd.map(word => (word, 1))
//  Print the lineage (before caching)
println(wordToOneRdd.toDebugString)
//  Cache the data
//  cache() calls persist() under the hood; the default storage level is MEMORY_ONLY
wordToOneRdd.cache()
//  The storage level can be changed
//    wordToOneRdd.persist(StorageLevel.MEMORY_AND_DISK_2)
wordToOneRdd.collect().foreach(println)
//  Print the lineage (after caching)
println(wordToOneRdd.toDebugString)


Cache storage levels

The default storage level stores only one copy in memory. Adding "_2" to the end of a storage level name means the persisted data is stored as two copies.

SER: stands for serialization.

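As an illustration (a sketch, not from the original; rddA, rddB and rddC are placeholder RDDs), different storage levels are passed to persist:

import org.apache.spark.storage.StorageLevel

//  Each RDD is assigned a single storage level
rddA.persist(StorageLevel.MEMORY_ONLY)        // what cache() uses: one copy in memory
rddB.persist(StorageLevel.MEMORY_ONLY_SER)    // SER: stored in serialized form to save space
rddC.persist(StorageLevel.MEMORY_AND_DISK_2)  // _2: two copies, spilling to disk when memory is insufficient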

The cache may be lost, or data stored in memory may be evicted because memory is insufficient. The RDD cache fault-tolerance mechanism guarantees that the computation still executes correctly even if the cache is lost: the lost data is recomputed through the RDD's chain of transformations. Since the Partitions of an RDD are relatively independent, only the missing Partitions need to be recomputed, not all of them.

Built-in caching with reduceByKey

// reduceByKey comes with built-in caching
val wordByKeyRDD: RDD[(String, Int)] = wordToOneRdd.reduceByKey(_+_)
3.2 Checkpoint

Checkpointing works by writing intermediate RDD results to disk.

Reason: when the lineage becomes too long, the cost of fault tolerance becomes too high, so it is better to checkpoint at an intermediate stage. If a node fails after the checkpoint, the lineage can be replayed from the checkpoint onward, reducing overhead.

Checkpoint storage path: Checkpoint data is usually stored in a fault-tolerant, highly available file system such as HDFS.

The storage format is: Binary file.

Checkpoint cuts the lineage: during checkpointing, all of the RDD's dependency information on its parent RDDs is removed.

Checkpoint trigger time: a checkpoint operation on an RDD is not executed immediately; an Action operation must be executed to trigger it. For data safety, the checkpoint recomputes the RDD from the beginning of its lineage.


//	Set the checkpoint data storage path:
sc.setCheckpointDir("./checkpoint1")
//	Call the checkpoint method:
wordToOneRdd.checkpoint()

Code:

val rdd: RDD[String] = sc.textFile("input/1.txt")
//  Business logic
val rdd01: RDD[String] = rdd.flatMap(line => line.split(" "))
val rdd02: RDD[(String, Long)] = rdd01.map(word => (word, System.currentTimeMillis()))
//  Add a cache to avoid running another job just for the checkpoint
rdd02.cache()
//  Data checkpoint: checkpoint rdd02
rdd02.checkpoint()
//  A new job is started specifically for the checkpoint computation (2 jobs in total)
rdd02.collect().foreach(println)
//  Trigger the execution logic twice more, for comparison
rdd02.collect().foreach(println)
rdd02.collect().foreach(println)

Execution result:

View the DAG diagram at http://localhost:4040/jobs. You can see that the checkpoint cuts off the lineage dependency.

  • With only checkpoint and no Cache: after the first job finishes, the checkpoint is triggered; a second job runs the checkpoint computation and writes the data to the checkpoint. The third and fourth jobs read the data directly from the checkpoint.
  • With both checkpoint and Cache: after the first job finishes, the data is saved in the Cache; the second job runs the checkpoint, reads the data directly from the Cache, and writes it to the checkpoint. The third and fourth jobs read the data directly from the checkpoint.


3.3 Differences between cache and checkpoints
  • Cache only saves the data and does not cut the lineage dependency; Checkpoint cuts the lineage.
  • Data cached with Cache is usually stored on disk or in memory, with relatively low reliability; Checkpoint data is usually stored in a fault-tolerant, highly available file system such as HDFS, with high reliability.
  • It is recommended to Cache the RDD that checkpoint() is called on, so that the checkpoint job only needs to read the data from the Cache; otherwise the RDD has to be computed from scratch.
  • Once the cache is no longer needed, it can be released with the unpersist() method (see the sketch below).
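Putting these recommendations together (a minimal sketch reusing the rdd02 from the earlier checkpoint example, not a separate example from the original):

sc.setCheckpointDir("./checkpoint1")
rdd02.cache()                    // cache first, so the checkpoint job reads from the cache
rdd02.checkpoint()               // the checkpoint is written when the first action runs
rdd02.collect().foreach(println)
rdd02.unpersist()                // release the cache once it is no longer needed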
3.4 Checkpoint storage to HDFS cluster

If checkpoint data is stored in an HDFS cluster, pay attention to configuring the user name for accessing the cluster. Otherwise, an access permission exception will be reported.

// Set the user name for accessing the HDFS cluster
System.setProperty("HADOOP_USER_NAME", "atguigu")

// Set the path; the /checkpoint directory must be created on the HDFS cluster in advance
sc.setCheckpointDir("hdfs://hadoop102:8020/checkpoint")

//  Data checkpoint: checkpoint rdd02
rdd02.checkpoint()

4. Key-value pair RDD data partitioning

Spark currently supports Hash partitioning, Range partitioning, and user-defined partitioning; Hash partitioning is the default. The partitioner directly determines the number of partitions in an RDD, which partition each record of the RDD goes to after a Shuffle, and the number of Reduce tasks.

  • Only RDDs of Key-Value type have a partitioner; the partitioner of a non-Key-Value RDD is None.
  • The partition ID range of each RDD is 0 to numPartitions - 1, which determines which partition each value belongs to.
val conf: SparkConf = new SparkConf().setAppName("SparkCore").setMaster("local[*]")
val sc: SparkContext = new SparkContext(conf)
//	Prepare the data source
val rdd: RDD[(Int, Int)] = sc.makeRDD(List((1, 1), (2, 2), (3, 3)))
//	Print the partitioner
println(rdd.partitioner)
//	Repartition the RDD with a HashPartitioner
val rdd02: RDD[(Int, Int)] = rdd.partitionBy(new HashPartitioner(2))
//	Print the partitioner
println(rdd02.partitioner)
sc.stop()

Hash partition

The principle of HashPartitioner: for a given key, compute its hashCode and take the remainder modulo the number of partitions. If the remainder is less than 0, add the number of partitions to it (otherwise add 0); the final result is the partition ID that the key belongs to.

Disadvantage of HashPartitioner: it may lead to an uneven amount of data in each partition; in extreme cases, some partitions can end up with all of the RDD's data.

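As a rough illustration of that rule (a sketch; this helper is not Spark's internal code):

//  Which partition a key lands in under HashPartitioner, for a given number of partitions
def hashPartition(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}

println(hashPartition("hello", 2)) // partition ID of the key "hello" with 2 partitions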

Range partition

  The role of RangePartitioner: map keys within a certain range to a certain partition, trying to ensure that the amount of data in each partition is roughly even and that the partitions are ordered. All elements in one partition are smaller (or larger) than the elements in another partition, but the order of elements within a partition is not guaranteed. Simply put, it maps keys within a certain range to a certain partition.

The implementation process is:

Step 1: use reservoir sampling to extract sample data from the whole RDD, sort the samples, compute the maximum key value for each partition, and form an Array[KEY] variable rangeBounds;

Step 2: determine which range of rangeBounds the key falls into, which gives the partition ID of that key in the child RDD. This partitioner requires that the KEY type in the RDD be sortable.

  • 1) Assume there are 1 million records to be divided into 4 partitions
  • 2) Sample 100 numbers from the 1 million (1, 2, 3, ... 100)
  • 3) Sort the 100 numbers and divide them evenly into 4 segments
  • 4) Go through the 1 million records, compare each value with the ranges of the 4 partitions, and put it into the appropriate partition (see the sketch below)
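A minimal usage sketch (assuming sc; not part of the original example):

import org.apache.spark.RangePartitioner

val pairs: RDD[(Int, String)] = sc.makeRDD(List((5, "e"), (1, "a"), (9, "i"), (3, "c")), 2)
//  RangePartitioner samples the RDD to compute range bounds, then partitions by key range
val ranged: RDD[(Int, String)] = pairs.partitionBy(new RangePartitioner(2, pairs))
println(ranged.partitioner) // Some(org.apache.spark.RangePartitioner@...)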

2. Accumulator

Distributed shared write-only variables (data cannot be read between one Executor and another)

  • The accumulator is used to aggregate variable information from the Executor side to the Driver side. For a variable defined on the Driver, each task on the Executor side gets a new copy of it; after each task updates its copy, the value is sent back to the Driver side to be merged.
  • Note: tasks on the Executor side cannot read the accumulator's value (for example, calling sum.value on the Executor side does not return the accumulator's final value). That is why the accumulator is called a distributed shared write-only variable.


//	Define an accumulator (sc.longAccumulator(name))
val sum: LongAccumulator = sc.longAccumulator("sum")
//	Add data to the accumulator (accumulator.add)
sum.add(count)
//	Read the accumulator's value (accumulator.value)
sum.value
val dataRdd: RDD[(String, Int)] = sc.makeRDD(List(("a", 1), ("a", 2), ("a", 3), ("a", 4)))
//  Create a new accumulator
val accSum: LongAccumulator = sc.longAccumulator("sum")
dataRdd.foreach(line => {
  //  Accumulate with the accumulator
  accSum.add(line._2)
})
//  Get the accumulated value
println(accSum.value)

The accumulator should be placed in an action operator

  • Because the number of times a transformation operator is executed depends on the number of jobs, if a Spark application has multiple action operators, the accumulator in the transformation operator may be updated more than once, resulting in incorrect results.
  • So, if we want an accumulator that is absolutely reliable regardless of failure or repeated calculations, we must put it in an action operator like foreach().
  • For accumulators used in action operators, Spark will only apply modifications to each accumulator once per job.
val mapRDD: RDD[Unit] = dataRdd.map {
  case (a, count) => {
    accSum.add(count)
  }
}
// If the accumulator is placed in map and two action operators are called,
// map runs twice and the final accumulator value is doubled
mapRDD.collect()
mapRDD.collect()

3. Broadcast variables

Distributed shared read-only variables

Broadcast variables are used to distribute larger objects efficiently: a large read-only value is sent to all worker nodes for use by one or more Spark operations.

Steps:

  1. Call SparkContext.broadcast on a variable to create a Broadcast object; any serializable type can be used.
  2. Access the object's value through the broadcast variable's .value method.
  3. The broadcast variable is sent to each node only once and is treated as a read-only value (modifying it does not affect other nodes).
//	Declare the broadcast variable
val bdStr: Broadcast[Int] = sc.broadcast(num)
//	Use the broadcast variable
bdStr.value
val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6), 3)
//  The value to broadcast
val num: Int = 1
//  Declare the broadcast variable
val bdStr: Broadcast[Int] = sc.broadcast(num)
//  Use the broadcast variable
val rdd02: RDD[Int] = rdd.filter(lin => {
  lin.equals(bdStr.value)
})
rdd02.foreach(println)


Origin blog.csdn.net/weixin_44624117/article/details/134016399