Spark architecture

Abstract: I recently came across a blog post on the Spark architecture by Alexey Grishchenko. Anyone who has read Alexey's posts knows that he has a very deep understanding of Spark; his "Spark Architecture" post is an eye-opener, going deeper step by step from JVM memory allocation to Spark cluster resource management.

I recently came across a blog post on the Spark architecture by Alexey Grishchenko. Anyone who has read Alexey's posts knows that he has a very deep understanding of Spark; reading his "Spark Architecture" post is an eye-opener, going deeper step by step from JVM memory allocation to Spark cluster resource management, and I got a lot out of it. So, in my spare time over the weekend, I translated the core content of the article into Chinese and am sharing it here. If I made any mistakes during translation, please point them out.

First, take a look at an official Spark 1.3.0 architecture diagram:

This picture is full of terms: "executor", "task", "cache", "Worker Node", and so on. The original author says that when he started learning Spark, this picture was the only one he could find (for Spark 1.3.0), and the situation was not encouraging. Worse still, the diagram fails to express some of the concepts inherent in Spark. Through continued study, the author organized what he learned into a series of articles, of which the one translated here is only the first. The core points follow.

Spark Memory Allocation

Any Spark program running on your cluster or on a local machine is a JVM process. As with any JVM process, you can configure its heap size with -Xmx and -Xms. The question is: how do these processes use the heap, and why do they need it? The rest of this section works through that question step by step.
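As a quick aside, you can check the heap a JVM process was actually granted. Here is a small standalone Scala sketch (not from the original post; HeapCheck is just an illustrative name):

    object HeapCheck {
      def main(args: Array[String]): Unit = {
        // Run with e.g. `scala -J-Xmx512m HeapCheck`: maxMemory reports
        // (approximately) the -Xmx value granted to this JVM process.
        val rt = Runtime.getRuntime
        println(s"max heap (-Xmx): ${rt.maxMemory() / (1 << 20)} MB")
        println(s"current heap:    ${rt.totalMemory() / (1 << 20)} MB")
      }
    }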

First, let's take a look at the following Spark JVM heap memory allocation diagram:

Heap Size

By default, Spark starts with 512 MB of JVM heap. To stay safe and avoid OOM errors, Spark only allows 90% of the heap to be used; this fraction is controlled by the spark.storage.safetyFraction parameter. Now, you may have heard that Spark is an in-memory tool that lets you keep data in memory. If you have read the author's "Spark Misconceptions" article, you will know that Spark is not really an in-memory tool: it merely uses memory as an LRU cache (http://en.wikipedia.org/wiki/Cache_algorithms). So part of the memory is used as a data cache, and this part is usually 60% of the safe heap (the 90%), controlled by spark.storage.memoryFraction. If you want to know how much data you can cache in Spark, take the sum of the heap sizes of all executors and multiply it by safetyFraction and by spark.storage.memoryFraction; by default that is 0.9 * 0.6 = 0.54, i.e. 54% of the total heap is available to Spark for caching data.
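To make the arithmetic concrete, here is a minimal Scala sketch of this formula (not Spark code; the 4 GB heap is an assumed example value, and the two fractions are the Spark 1.x defaults):

    val heapBytes = 4L * 1024 * 1024 * 1024   // assumed executor heap, i.e. -Xmx4g
    val safetyFraction = 0.9                  // spark.storage.safetyFraction default
    val memoryFraction = 0.6                  // spark.storage.memoryFraction default

    val cacheBytes = heapBytes * safetyFraction * memoryFraction
    println(f"cacheable: ${cacheBytes / math.pow(2, 30)}%.2f GB")  // ≈ 2.16 GB, 54% of the heap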

Shuffle Memory

Next, let's talk about shuffle memory. It is calculated as "Heap Size" * spark.shuffle.safetyFraction * spark.shuffle.memoryFraction. The default value of spark.shuffle.safetyFraction is 0.8 (80%), and the default value of spark.shuffle.memoryFraction is 0.2 (20%), so the share of the JVM heap you can ultimately use for shuffle is 0.8 * 0.2 = 0.16, i.e. 16% of the heap. The question is: how does Spark use this memory? There is a detailed explanation in the official source on GitHub (https://github.com/apache/spark/blob/branch-1.3/core/src/main/scala/org/apache/spark/shuffle/ShuffleMemoryManager.scala). In general, Spark uses this memory for the work tasks perform during the shuffle phase. After the shuffle, the data sometimes needs to be sorted, and the sort phase usually needs a buffer to hold the sorted data (remember, you cannot modify data that already sits in the LRU cache, because it may be reused later). So a certain amount of RAM is required to hold the sorted blocks of data. What if you don't have enough memory to sort? Search Wikipedia for "external sorting" and read it carefully: external sorting lets you sort chunks of the data separately and then merge the final results together.
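For intuition: the ShuffleMemoryManager linked above tries to give each of the N concurrently running tasks between 1/(2N) and 1/N of this shuffle pool. A toy Scala sketch of that bound (not Spark source; the heap size and task count are assumed example values):

    val heapBytes = 4L * 1024 * 1024 * 1024             // assumed 4 GB executor heap
    val shufflePool = (heapBytes * 0.8 * 0.2).toLong    // safety * memory fraction = 16%
    val activeTasks = 8                                 // N tasks shuffling concurrently
    val maxPerTask = shufflePool / activeTasks          // a task may grow to 1/N of the pool
    val minPerTask = shufflePool / (2 * activeTasks)    // but is guaranteed only 1/(2N)
    println(s"per-task shuffle memory: $minPerTask .. $maxPerTask bytes")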

Unroll Memory

The last part of RAM to discuss is "unroll" memory. The total amount of memory available for the unroll process is spark.storage.unrollFraction * spark.storage.memoryFraction * spark.storage.safetyFraction, which by default is 0.2 * 0.6 * 0.9 = 0.108, or 10.8% of the heap. It is used when you need to unroll a block of data in memory. Why is unrolling needed at all? Spark allows data to be stored in both serialized and deserialized form, but serialized data cannot be used directly; it must be unrolled before use, and this part of the RAM is what the unroll operation works in. Unroll memory is shared with storage RAM: if you need memory to unroll a block and not enough is free, the allocation may evict some data blocks from Spark's LRU cache.
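If you want to pin these fractions down explicitly, they are all ordinary Spark 1.x configuration keys. A hedged sketch (these are just the defaults written out; note that Spark 1.6+ replaced this legacy model with a unified memory manager):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.storage.safetyFraction", "0.9")   // safe share of the heap
      .set("spark.storage.memoryFraction", "0.6")   // LRU cache share of the safe heap
      .set("spark.storage.unrollFraction", "0.2")   // 0.2 * 0.6 * 0.9 ≈ 10.8% of the heap
      .set("spark.shuffle.safetyFraction", "0.8")
      .set("spark.shuffle.memoryFraction", "0.2")   // 0.8 * 0.2 = 16% of the heap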

Spark Cluster Mode JVM Allocation

OK. With the explanations above, we now have a better picture of the Spark process and how it uses the memory of its JVM. Let's now switch to a cluster, taking YARN mode as the example.

In a YARN cluster there is a YARN ResourceManager daemon that controls the cluster's resources (i.e. its memory), plus a set of YARN NodeManagers running on each node of the cluster that control the resource usage of their node. From YARN's point of view, each node is a pool of RAM that can be allocated. When you send a resource request to the ResourceManager, it returns information about some NodeManagers; each of those NodeManagers then provides you with an execution container, and each container is a JVM process with the heap size you specified in the request. The placement of these JVMs is managed by the YARN ResourceManager; you have no control over it. If a node has 64 GB of RAM under YARN's control (set with the yarn.nodemanager.resource.memory-mb parameter in the yarn-site.xml configuration file) and you request 10 executors with 4 GB each, those executors may all end up running on the same node, no matter how big your cluster is.

When you start a Spark application in YARN mode, you can specify the number of executors (--num-executors or spark.executor.instances), the fixed memory size of each executor (--executor-memory or spark.executor.memory), the number of CPU cores each executor may use (--executor-cores or spark.executor.cores), the number of cores allocated to each task (spark.task.cpus), and the memory used by the driver (--driver-memory or spark.driver.memory).
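The same knobs can be set programmatically through SparkConf, as in this sketch (the values are the example numbers used below; note that spark.driver.memory must be set before the driver JVM starts, so in practice pass it to spark-submit rather than setting it in code):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("yarn-example")
      .set("spark.executor.instances", "24")   // --num-executors
      .set("spark.executor.memory", "26g")     // --executor-memory
      .set("spark.executor.cores", "12")       // --executor-cores
      .set("spark.task.cpus", "1")             // cores reserved per task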

When you execute an application on the cluster, the job is split into stages, each stage into tasks, and each task is scheduled separately. You can think of each executor's JVM process as a pool of task execution slots: each executor gives your tasks spark.executor.cores / spark.task.cpus slots. For example, suppose the cluster runs YARN NodeManagers on 12 nodes, each node with 64 GB of RAM and 32 CPU cores (16 physical cores with hyper-threading). Each node can start 2 executors with 26 GB each (the remaining RAM is left for system processes, the YARN NodeManager and the DataNode), and each executor gets 12 CPU cores for executing tasks (the remaining cores are likewise left for system processes, the YARN NodeManager and the DataNode). The whole cluster can then handle 12 machines * 2 executors per machine * 12 cores per executor / 1 core per task = 288 task execution slots, which means your Spark cluster can run 288 tasks concurrently, making use of almost all of its resources. The memory the whole cluster can use to cache data is 0.9 (spark.storage.safetyFraction) * 0.6 (spark.storage.memoryFraction) * 12 machines * 2 executors per machine * 26 GB per executor = 336.96 GB. Not all that much, actually, but in most cases it is enough.
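The same arithmetic in a few lines of Scala, reproducing the numbers above:

    val nodes = 12
    val executorsPerNode = 2
    val coresPerExecutor = 12
    val cpusPerTask = 1                         // spark.task.cpus
    val executorMemGb = 26.0

    val taskSlots = nodes * executorsPerNode * coresPerExecutor / cpusPerTask
    val cacheGb = 0.9 * 0.6 * nodes * executorsPerNode * executorMemGb
    println(s"task slots:    $taskSlots")       // 288
    println(f"cluster cache: $cacheGb%.2f GB")  // 336.96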

At this point you should roughly understand how Spark uses the JVM's memory and what the cluster's execution slots are. As for the task: it is the unit of work Spark executes, and it runs as a thread inside the executor's JVM process. This is why Spark jobs start quickly: starting a thread inside an existing JVM is much faster than starting a whole new JVM process, which is what Hadoop does when it executes a MapReduce application.

Spark Partition

Let's talk about another Spark abstraction: the "partition". During the run of a Spark program, all data is divided into multiple partitions. The questions are: what is a partition, and how is the number of partitions determined? First of all, partitioning depends entirely on your data source. In Spark, most of the methods used to read data let you specify the number of partitions of the resulting RDD. When you read a file from HDFS, Spark uses Hadoop's InputFormat, and by default every InputSplit returned by the InputFormat is mapped to one partition of the RDD. For most files on HDFS, one InputSplit is generated per data block, covering roughly 64 MB or 128 MB of data. Roughly, because block boundaries on HDFS are measured in bytes (say, a 64 MB block), but when the data is processed it is split by record: for text files the split character is the newline, for sequence files it is the block end, and so on. Compressed files are the special case: since the whole file is compressed, it cannot be split by line, the whole file yields a single InputSplit, and hence a single partition in Spark, so you have to repartition it manually during processing.
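A short sketch of how this looks in practice (the HDFS paths are placeholders, and the gzip case assumes the usual non-splittable codec):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("partitions").setMaster("local[*]"))

    val plain = sc.textFile("hdfs:///data/big.txt")       // one partition per InputSplit/block
    println(plain.partitions.size)

    val gzipped = sc.textFile("hdfs:///data/big.txt.gz")  // compressed file: not splittable
    println(gzipped.partitions.size)                      // => 1

    val widened = gzipped.repartition(64)                 // repartition manually before heavy work
    println(widened.partitions.size)                      // => 64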

This article is a translation of the core points of the first article in Alexey Grishchenko's Distributed Systems Architecture series. The author's second article is about shuffle [Original link], and the third is about the memory management model: http://click.aliyun.com/m/25869/
