Spark Core Running Architecture (Continued: Terminology)

The five characteristics of RDD and how they are reflected in the source code

Before getting into today's content: the five characteristics of RDD, and how they are reflected in the source code, were explained in detail in a previous post, so here is only a brief recap. For details, refer to: Spark RDD.

  • ① A list of partitions: an RDD is made up of a series of Partitions.
    Corresponding source method: getPartitions
  • ② A function for computing each split: a computation on an RDD is in fact performed on each of its partitions.
    Corresponding source method: compute
  • ③ A list of dependencies on other RDDs: an RDD obtained through a transformation depends on its parent RDD(s), and there is a correspondence between them.
    Corresponding source method: getDependencies
  • ④ Optionally, a Partitioner for key-value RDDs: a key-value RDD can specify a partitioner, which tells it how to partition the data.
    Corresponding source member: partitioner
  • ⑤ Optionally, a list of preferred locations to compute each split on: for each split, compute its preferred locations, i.e. the machine(s) on which it is best to run.
    Corresponding source method: getPreferredLocations
    Also note that one partition corresponds to one task: however many partitions there are, that many tasks are executed.

So the questions are:

  • 1) Do the three methods above (①②③) run on the driver side or on the executor side?
    Hold that thought; the answer comes later.
    After you write Spark code, you need to be very clear about which code runs on the driver side and which runs on the executor side. For example, some code may run out of memory if executed on the driver side, and some may run out of memory on the executor side; different scenarios call for different approaches, so you must be clear about this.
    From the previous section we know that a Spark application consists of one driver and multiple executors. The driver is the process that runs the application's main() function and creates the SparkContext inside main. An executor is a process started for the application on a worker node; it can run multiple tasks and keep data in memory or on disk.
  • 2) What are the inputs and outputs of the three methods ①②③?
    ① getPartitions has no input; its output is Array[Partition], an array whose elements are Partitions.
    ② compute takes a Partition as input; its output is Iterator[T], an iterator.
    (An Iterator is not a collection; it provides a way to access the elements of a collection, and can be traversed with while or a for loop.)
    ③ getDependencies has no input; its output is Seq[Dependency[_]], a sequence of dependencies.
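For reference, here is a simplified sketch of how these five characteristics appear as members of org.apache.spark.rdd.RDD; the modifiers and exact signatures are paraphrased from the Spark source and may differ slightly between versions.

// Simplified, not verbatim, excerpt of org.apache.spark.rdd.RDD
abstract class RDD[T](/* sc, deps, ... */) {
  // ① the set of partitions that make up this RDD (no input, Array[Partition] out)
  protected def getPartitions: Array[Partition]
  // ② compute one partition, returning an iterator over its elements (Partition in, Iterator[T] out)
  def compute(split: Partition, context: TaskContext): Iterator[T]
  // ③ the dependencies on parent RDDs (no input, Seq[Dependency[_]] out)
  protected def getDependencies: Seq[Dependency[_]]
  // ④ optionally, how a key-value RDD is partitioned
  @transient val partitioner: Option[Partitioner] = None
  // ⑤ optionally, the preferred locations (hosts) for computing a given partition
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}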

About Stage

Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
Every action triggers a job, and each job is cut into smaller sets of tasks called stages, which depend on each other (similar to the map and reduce stages in MapReduce). A job can have multiple stages.
Encountering an action triggers a job; encountering a shuffle triggers a new stage.
Access: http://hadoop001:4040/jobs/
[Screenshot: Spark UI Jobs page]
Example demonstration:
Execute the following code:

scala> sc.parallelize(List("a","wc","b","a","word","wc")).map((_,1)).collect
res1: Array[(String, Int)] = Array((a,1), (wc,1), (b,1), (a,1), (word,1), (wc,1))

Check the page:
[Screenshot: Spark UI showing the job triggered by collect]
Encountering an action triggers a job, and collect above is an action.
You can see there is one stage in total, containing the two operators parallelize and map, and that it has two tasks.
[Screenshot: the stage and its tasks in the Spark UI]
In more detail:
[Screenshots: further details of the job and stage in the Spark UI]
Now execute the code below:

scala> sc.parallelize(List("a","wc","b","a","word","wc")).map((_,1)).reduceByKey(_+_).collect
res2: Array[(String, Int)] = Array((b,1), (word,1), (wc,2), (a,2))

Check the page:
[Screenshot: Spark UI Jobs page after the second job]
Encountering an action triggers a job, and collect above is an action.
Encountering a shuffle triggers a new stage; reduceByKey above involves a shuffle, so it triggers a new stage.
You can see there are two stages in total, each with two tasks, for a total of four tasks.
[Screenshots: the two stages and their tasks in the Spark UI]
To summarize: a job may contain multiple stages, a stage is made up of a set of tasks, and a task is the smallest unit of work sent to an executor for execution.
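Outside the web UI, the stage boundary can also be seen with toDebugString, which prints an RDD's lineage. The following is a hedged spark-shell sketch; the commented output is abbreviated and may differ between Spark versions.

scala> val wc = sc.parallelize(List("a","wc","b","a","word","wc")).map((_,1)).reduceByKey(_+_)
scala> wc.toDebugString
// Roughly:
// (2) ShuffledRDD[...] at reduceByKey ...          <- second stage, after the shuffle
//  +-(2) MapPartitionsRDD[...] at map ...          <- first stage: parallelize + map, pipelined
//     |  ParallelCollectionRDD[...] at parallelize ...

The indentation break (the "+-") marks the shuffle, which is exactly where the job is cut into two stages.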

RDD cache (persist)

The official documentation's explanation

Official docs: http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence

One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.
One of the most important capabilities of Spark is that it can persist (or cache) a dataset in memory across operations (here, in the executors). When you persist or cache an RDD (an RDD is made up of partitions), each node stores the partitions it computes in memory and reuses them in other actions on that dataset (or on datasets derived from it). This makes later actions much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.

You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark's cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
You can mark an RDD as persisted with the persist() or cache() method. The first time the RDD is computed in an action, the result is kept in memory on the nodes; the computed data is cached on the compute nodes in the form of partitions. Spark's cache is fault tolerant: if any partition of an RDD is lost, it is automatically recomputed, via the lineage information, using the transformations that originally created it.

In addition, each persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), or replicate it across nodes. These levels are set by passing a StorageLevel object (Scala, Java, Python) to persist(). The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory).
In other words, each persisted RDD can be stored with a different StorageLevel: for example, you can save the dataset on disk, keep it in memory but as serialized Java objects (to save space), or replicate it across nodes. These levels are set by passing a StorageLevel object (Scala, Java, Python) to persist(). The cache() method is shorthand for the default storage level, StorageLevel.MEMORY_ONLY (store deserialized objects in memory).

So the storage form or storage medium of an RDD can be defined by its storage level (StorageLevel). For example, you can persist the data to disk, cache it in memory as serialized Java objects (which helps save space), enable replication (an RDD's partition data is backed up on more than one node to prevent loss), or use off-heap memory (Tachyon). persist() accepts a StorageLevel object (Scala, Java, Python) to define the storage level; for the default storage level (StorageLevel.MEMORY_ONLY), Spark provides the convenience method cache().

An example

Execute the code below:

scala> val lines = sc.textFile("hdfs://hadoop001:9000/data/wordcount.txt")
lines: org.apache.spark.rdd.RDD[String] = hdfs://hadoop001:9000/data/wordcount.txt MapPartitionsRDD[6] at textFile at <console>:24

scala> lines.collect
res3: Array[String] = Array(world       world   hello, China    hello, people   person, love)

Look at the page: nothing is cached yet.
[Screenshot: Spark UI Storage page, empty]
Then execute:

scala> lines.cache
res4: lines.type = hdfs://hadoop001:9000/data/wordcount.txt MapPartitionsRDD[6] at textFile at <console>:24

Looking at the page again, nothing is cached yet. This is because cache and persist are lazy, just like transformations: they need to be triggered by an action.
On top of that, execute the following, which contains an action and therefore triggers the caching; then look at the page:

scala> lines.collect
res5: Array[String] = Array(world       world   hello, China    hello, people   person, love)

[Screenshots: Spark UI Storage page showing the cached RDD]
As can be seen above, the input is 74.0 B, but once cached it occupies 312.0 B in memory, i.e. it has become larger.
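The growth is because MEMORY_ONLY stores deserialized Java objects, which carry per-object overhead. As a hedged follow-up that is not part of the original demo, you could cache the same RDD in serialized form and compare the sizes on the Storage page:

scala> import org.apache.spark.storage.StorageLevel
scala> lines.unpersist(true)                        // drop the existing MEMORY_ONLY copy first
scala> lines.persist(StorageLevel.MEMORY_ONLY_SER)  // serialized caching; still lazy
scala> lines.collect                                // the action that repopulates the cache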

The difference between cache and persist (a frequent interview question)

Look at the corresponding source code:

   // Persist this RDD with the default storage level (`MEMORY_ONLY`).
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

   // Persist this RDD with the default storage level (`MEMORY_ONLY`).
  def cache(): this.type = persist()

We can see that cache calls persist underneath, and persist() in turn calls persist(StorageLevel.MEMORY_ONLY) (an overloaded method).
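As a minimal sketch (assuming an existing RDD called rdd that has not been persisted yet), the three calls below are interchangeable ways of requesting the default MEMORY_ONLY level:

import org.apache.spark.storage.StorageLevel

rdd.cache()                            // shorthand for ...
rdd.persist()                          // ... persist(), which delegates to ...
rdd.persist(StorageLevel.MEMORY_ONLY)  // ... the explicit default storage level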

persist and StorageLevel

The storage level options are as follows:

  • MEMORY_ONLY: Store the RDD as deserialized Java objects in the JVM, without serialization. If the RDD does not fit in memory, the partitions that do not fit are not cached and are recomputed when they are needed. This is the default level.
  • MEMORY_AND_DISK: Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, the partitions that do not fit are stored on disk and read from there when needed.
  • MEMORY_ONLY_SER (Java and Scala): Store the RDD serialized (each partition serialized into a byte array) in memory. This is more space-efficient, but deserializing uses more CPU. Partitions that do not fit in memory are not cached and are recomputed when needed.
  • MEMORY_AND_DISK_SER (Java and Scala): Like MEMORY_ONLY_SER, but when memory runs low the serialized partitions are spilled to disk instead of being recomputed.
  • DISK_ONLY: Store the RDD partitions only on disk.
  • MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: Same as the corresponding level above, except that each partition is replicated on two different nodes in the cluster.
  • OFF_HEAP (experimental): Similar to MEMORY_ONLY_SER, but the data is stored in off-heap memory. This requires off-heap memory to be enabled.
Source code:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala

Look at the source code

class StorageLevel private(
    private var _useDisk: Boolean,        // whether to use disk
    private var _useMemory: Boolean,      // whether to use memory
    private var _useOffHeap: Boolean,     // whether to use off-heap memory
    private var _deserialized: Boolean,   // whether to keep the data deserialized
    private var _replication: Int = 1)    // number of replicas (defaults to 1)
  extends Externalizable {
  // TODO: Also add fields for caching priority, dataset ID, and flushing.
  private def this(flags: Int, replication: Int) {
    this((flags & 8) != 0, (flags & 4) != 0, (flags & 2) != 0, (flags & 1) != 0, replication)
  }

Take a look at the predefined storage levels:
[Screenshot: the predefined StorageLevel constants]
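For comparison with the screenshot, the companion object of StorageLevel predefines the common levels by combining these constructor flags. The excerpt below is paraphrased from StorageLevel.scala (flag order: useDisk, useMemory, useOffHeap, deserialized, replication); the exact set may vary by Spark version.

val DISK_ONLY           = new StorageLevel(true,  false, false, false)
val MEMORY_ONLY         = new StorageLevel(false, true,  false, true)
val MEMORY_ONLY_2       = new StorageLevel(false, true,  false, true,  2)
val MEMORY_ONLY_SER     = new StorageLevel(false, true,  false, false)
val MEMORY_AND_DISK     = new StorageLevel(true,  true,  false, true)
val MEMORY_AND_DISK_SER = new StorageLevel(true,  true,  false, false)
val OFF_HEAP            = new StorageLevel(true,  true,  true,  false, 1)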
Then look at the page in the Spark UI:
[Screenshot: Spark UI Storage page]
How do you use caching?
Given an RDD such as lines, you can call: lines.persist(StorageLevel.MEMORY_ONLY_SER_2)
How do you remove the cache?
lines.unpersist(true)
After executing this, the cache no longer appears on the page, which shows that unpersist is not lazy; it takes effect immediately, just like an action.
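Putting that together, a minimal spark-shell sketch (assuming lines is an RDD that has not already been persisted with a different level; otherwise unpersist it first):

scala> import org.apache.spark.storage.StorageLevel
scala> lines.persist(StorageLevel.MEMORY_ONLY_SER_2)  // lazy: serialized, 2 replicas, cached on the next action
scala> lines.collect                                  // the action that actually materializes the cache
scala> lines.unpersist(true)                          // eager: removes the cached blocks immediately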

How to choose the Storage Level?

Spark also automatically persists some intermediate data in shuffle operations (e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend users call persist on the resulting RDD if they plan to reuse it.
In other words, even if the user does not call persist explicitly, Spark automatically persists some intermediate data in operations that involve a shuffle (such as reduceByKey). This avoids recomputing the entire input if a node fails during the shuffle. If the resulting RDD will be reused later, it is still recommended to call persist or cache on it.
So how do you choose the storage level?
Choosing a storage level is really a tradeoff between memory efficiency and CPU cost; you can refer to the following:

  • 1) If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option and lets operations on the RDD run as fast as possible.
  • 2) If 1) is not possible, try MEMORY_ONLY_SER and select a fast serialization library, which makes the objects much more space-efficient while still reasonably fast to access. (Java and Scala; see the Kryo sketch after this list.)
  • 3) Do not persist or cache data to disk unless the computation is very "expensive" or it filters out a large amount of data; otherwise recomputing a partition may well be faster than reading the partition's data back from disk. (For Spark core, as long as memory is sufficient, avoid disk altogether.)
  • 4) If you need fast failure recovery, use a replicated storage level such as MEMORY_ONLY_2 or MEMORY_AND_DISK_2. All storage levels provide fault tolerance by recomputing lost data, but a replicated level lets the application keep running without interruption: when data is lost, the backup copy is used directly instead of recomputing it. (In production, resources are rarely abundant enough for extra replicas; adding several copies often does more harm than good, so the default replication of 1 is usually fine.)
  • 5) In environments with very large memory or multiple applications, OFF_HEAP brings the following benefits:
    a. It allows Spark executors to share a pool of memory in Tachyon;
    b. It significantly reduces the performance overhead caused by JVM garbage collection;
    c. Data is not lost if an individual Spark executor crashes.

Putting it all together: for Spark core, as long as memory is sufficient, do not use disk at all, and there is no need for extra replicas, so keep the default of 1. In practice you therefore only need to consider MEMORY_ONLY and MEMORY_ONLY_SER.
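Point 2) above mentions picking a fast serialization library. As a hedged sketch (the configuration key and Kryo class come from the standard Spark documentation; the application name and registered class are only illustrative), Kryo can be enabled like this so that the *_SER levels store compact byte arrays:

import org.apache.spark.SparkConf

case class Word(text: String)  // illustrative class to register

val conf = new SparkConf()
  .setAppName("cache-demo")  // illustrative app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Word]))  // register the classes you cache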

Removing cached data (Removing Data)

Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
Spark automatically monitors cache usage on each node and deletes old partition data in a least-recently-used (LRU) fashion. If you want to remove an RDD manually, instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
Cached data lives in the executors; after you call sc.stop(), in theory these caches are gone, but it is still best practice to call xxx.unpersist(true) at the end.

Recompute (recomputation)

[Figure: recomputing a lost partition (r3) from its parent partition (p2)]
As the figure shows, if partition r3 of RDD1 is lost, the downstream computation obviously cannot continue. We need to find the corresponding partition of r3's parent RDD, here p2, and recompute from p2 to regenerate r3; the other partitions are not involved in the recomputation. Similarly, if p2 were also lost, we would keep going further back until we find an available partition.
In the case above, recomputing r3 is simple because the lineage is simple.
Next, let's look at narrow and wide dependencies.

Narrow dependency and wide dependency (a frequent interview question)

There are two different types of dependency relationships between an RDD and its parent RDD: narrow dependency and wide dependency.
Narrow dependency: a partition of the parent RDD is used by at most one partition of the child RDD.
Wide dependency: a partition of the parent RDD is used by multiple partitions of the child RDD.

A wide dependency involves a shuffle.
If you are asked whether join is a narrow or wide dependency, the answer is that it depends on the situation (it can be either).
Lineage: RDDs support only coarse-grained transformations, that is, a single operation performed on a whole dataset is what gets recorded. An RDD records a series of lineage information, i.e. the metadata and the transformation behaviour that produced it, so that lost partitions can be recovered: when part of an RDD's partition data is lost, the recorded operations can be replayed to recompute and recover the lost data.
In the figure below, the left side shows narrow dependencies and the right side shows wide dependencies:
[Figure: narrow dependencies (left) vs. wide dependencies (right)]

The example in the figure has three stages. From A to B, groupBy introduces a shuffle, so a stage (stage 1) is split off;
from C to D and from D and E to F, map and union are narrow dependencies with no shuffle, so no new stage is split off;
joining B and F to produce G splits off stage 3.
[Figure: a DAG split into stages at shuffle boundaries]
Before an action is reached, every shuffle operator that is encountered splits the computation into another stage. Encountering an action triggers a job.
An action triggers a job; a job is made up of n stages; a stage is made up of n tasks.
It is not the case that each operator counts as one task.
As in the figure, narrow dependencies are executed in a pipelined fashion, like water flowing through a pipe straight to the end: a partition of C corresponds to a partition of D, which in turn corresponds to a partition of F, and the data flows straight through; that whole chain is one task. One partition corresponds to one task, so the degree of parallelism = the number of partitions = the number of tasks.
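As a hedged spark-shell sketch, you can also inspect whether an operator introduced a narrow or wide dependency through the dependencies field (the class names in the comments are indicative; the printed output may vary by Spark version):

scala> val pairs = sc.parallelize(1 to 10).map(x => (x % 3, x))
scala> pairs.dependencies    // OneToOneDependency: narrow, introduced by map
scala> val reduced = pairs.reduceByKey(_ + _)
scala> reduced.dependencies  // ShuffleDependency: wide, introduced by reduceByKey, so a new stage starts here
scala> reduced.toDebugString // prints the lineage with the shuffle boundary visible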

Key-Value Pairs

Spark supports many operations on RDDs containing objects of all kinds of types, but a few special operations are only available on RDDs of key-value pairs. The most common are shuffle operations, such as grouping or aggregating elements by key.
In Scala, these operations are automatically available on RDDs containing Tuple2 objects (the language's built-in tuples, written simply as (a, b)). The key-value operations are defined in the PairRDDFunctions class, which enhances RDDs of tuples through an implicit conversion.
For example, the following code uses the reduceByKey operation on key-value pairs to count how many times each line of text occurs in a file:

val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)

//Although reduceByKey does not appear to use PairRDDFunctions on the surface, it is PairRDDFunctions that is called underneath: reduceByKey is defined in PairRDDFunctions.scala and is reached through an implicit conversion. You can trace this in the source code.
//Interview question: which class is reduceByKey defined in? It is not in RDD.scala.

We could also use counts.sortByKey(), for example, to sort the pairs alphabetically, and finally counts.collect() to bring the results back to the driver as an array.
Note: when using custom objects as the key in key-value operations, if you override the equals() method you must also override hashCode(). For details, refer to the contract outlined in the Object.hashCode() documentation.
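As a minimal continuation of the example above (a sketch only; the actual values depend on the contents of data.txt):

val sortedCounts = counts.sortByKey()  // transformation: sort the pairs alphabetically by key
sortedCounts.collect()                 // action: returns an Array[(String, Int)] to the driver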


Origin: blog.csdn.net/liweihope/article/details/91349815