Spark bit by bit

 

The core of Spark is to abstract the data source into a distributed object, the RDD (Resilient Distributed Dataset), whose data is spread across the memory of the computing nodes. The combination of data-local computing, in-memory storage, and cluster computing is what makes Spark efficient.

For users, an RDD comes with a rich set of operators, which makes writing a distributed program feel much like writing a local one. Spark converts the RDD operators into tasks that do the actual work and serializes those tasks, shipping them together with the compiled classes they need, so that each computing node can deserialize, load, and run them.

Every new action causes the RDD and everything it depends on to be recomputed from scratch. Persisting the RDD with cache() (or persist()) solves this.

Calling sc.textFile() does not read the data in. Like a transformation, the read is lazy and only happens when an action runs, so without caching the same data may be read several times, which wastes I/O and computation.
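The two points above (lazy reads and recomputation on every action) can be sketched with a toy model in plain Python. This is not Spark's API: `LazyDataset`, `compute_count`, and `total` are hypothetical names invented for the illustration, and the "source" is just an in-memory list rather than a file.

```python
class LazyDataset:
    """Toy stand-in for an RDD: it stores a recipe, not the data itself."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function that produces the data
        self._use_cache = False
        self._cached = None        # filled by the first action once cache() is set
        self.compute_count = 0     # how many times the recipe has actually run

    def map(self, fn):
        # Transformation: only builds a new recipe, nothing is computed here.
        return LazyDataset(lambda: [fn(x) for x in self._materialize()])

    def cache(self):
        # Mark the dataset so the first computed result is kept for later actions.
        self._use_cache = True
        return self

    def _materialize(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.compute_count += 1
        data = self._compute()
        if self._use_cache:
            self._cached = data
        return data

    # Actions: these are the only calls that trigger computation.
    def count(self):
        return len(self._materialize())

    def total(self):
        return sum(self._materialize())


source = LazyDataset(lambda: list(range(5)))   # nothing is read yet

uncached = source.map(lambda x: x * x)
uncached.count()
uncached.total()      # second action recomputes everything, including the source

cached = source.map(lambda x: x * x).cache()
cached.count()
cached.total()        # second action is served from the cached result
```

In this sketch `uncached.compute_count` ends at 2 while `cached.compute_count` stays at 1, and the source is only touched when an action runs.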

An RDD consists of partitions distributed across different nodes. The partition is the smallest unit of parallelism, and each partition is processed by one task. A Spark program can control how data is distributed through partitioning in order to minimize network transfer. Try to keep the computation and its data on the same node to improve performance.

If two RDDs are joined, the large RDD can be partitioned and persisted first, so that only the small RDD's data is sent over the network to the partitions where the large RDD already lives. Operators such as reduceByKey and groupByKey also produce partitioned results. Controlling partitioning well can improve performance.
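A rough plain-Python sketch of the idea: if the large dataset is already hash-partitioned by key, a join only needs to route the small dataset's records to the matching partitions. The function names `hash_partition` and `join_with_partitioned` are made up for this example, the "partitions" are just lists in one process, and integer keys are assumed so that `hash(key)` is predictable.

```python
def hash_partition(pairs, num_partitions):
    """Place each (key, value) pair in partition hash(key) % num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def join_with_partitioned(large_parts, small_pairs):
    """Join a small dataset against an already-partitioned large one.

    Only the small side is routed to partitions; the large side stays where it is.
    Returns the joined records and the number of records that had to move.
    """
    small_parts = hash_partition(small_pairs, len(large_parts))
    moved = len(small_pairs)
    result = []
    for large_part, small_part in zip(large_parts, small_parts):
        # Build a per-partition lookup table from the small side.
        lookup = {}
        for key, value in small_part:
            lookup.setdefault(key, []).append(value)
        # Matching keys are guaranteed to be in the same partition index.
        for key, value in large_part:
            for other in lookup.get(key, []):
                result.append((key, (value, other)))
    return result, moved


# Large side: partitioned once up front (and, in Spark terms, persisted).
large = [(i % 4, f"user{i}") for i in range(8)]
large_parts = hash_partition(large, 2)

# Small side: only these two records are moved for the join.
small = [(1, "click"), (2, "view")]
joined, moved = join_with_partitioned(large_parts, small)
```

Here `moved` is 2 even though the large side holds 8 records, which is the saving the paragraph above describes.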

An accumulator provides a simple syntax for aggregating values from the worker nodes back into the driver program.

Variables referenced in an RDD closure are copied to every task, whereas a broadcast variable is copied only once to each node. If the variable holds a large amount of data, broadcasting it can improve performance.

Operating on data one partition at a time (for example with mapPartitions or foreachPartition) avoids repeating setup work for every data element, such as opening a database connection.
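The difference can be sketched in plain Python, without Spark. `FakeConnection` is a hypothetical stand-in for an expensive resource such as a database connection, and the nested lists play the role of partitions; none of these names come from Spark.

```python
class FakeConnection:
    """Hypothetical expensive resource; counts how many times it is opened."""
    opened = 0

    def __init__(self):
        FakeConnection.opened += 1

    def lookup(self, x):
        return x * 10


def map_per_element(partitions):
    # A new connection for every element: setup cost is paid len(data) times.
    return [[FakeConnection().lookup(x) for x in part] for part in partitions]


def map_per_partition(partitions):
    # One connection per partition, reused for all elements in that partition.
    out = []
    for part in partitions:
        conn = FakeConnection()
        out.append([conn.lookup(x) for x in part])
    return out


partitions = [[1, 2, 3], [4, 5]]

FakeConnection.opened = 0
per_element_result = map_per_element(partitions)
per_element_opened = FakeConnection.opened      # one per element

FakeConnection.opened = 0
per_partition_result = map_per_partition(partitions)
per_partition_opened = FakeConnection.opened    # one per partition
```

Both variants produce the same output, but the per-element version opens 5 connections and the per-partition version opens 2.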

In yarn-client mode, the client runs the user program directly after startup, so the driver runs inside the client process. The driver registers with the cluster manager (YARN's ResourceManager), which launches an ApplicationMaster that requests the containers in which the executors are started. The driver converts the series of RDD operations into a DAG and submits it to the DAGScheduler, which splits it into stages; each stage consists of multiple tasks. The tasks are then sent to the executors for execution.

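In yarn-cluster mode, the client submits the application to the ResourceManager after startup. The ResourceManager starts the ApplicationMaster in a container on the cluster, and the driver runs inside it; the driver turns the RDD operations into a series of tasks and distributes them to the executors for execution.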

Spark here runs on top of the YARN framework. Spark's cluster manager is YARN's ResourceManager, the driver's scheduling role corresponds to YARN's ApplicationMaster (in yarn-cluster mode the driver actually runs inside it), and each Spark executor runs in a YARN container.

RDD operations connected by narrow dependencies are placed in the same stage.

The Hadoop configuration is broadcast to each node.

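Stages that depend on one another are executed sequentially: a stage cannot start until all of its parent stages have finished. Only stages with no dependency between them, such as the two input sides of a join, can run at the same time.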

Each Spark application can contain multiple actions. Each action triggers a job, and each job consists of multiple stages. Stage division starts from the RDD that triggered the job (that is, the last RDD of the job) and works backward: whenever a wide dependency is encountered, it becomes a stage boundary, so every stage except the last ends at a wide dependency. Within a stage, the chain of RDD operations on a partition is performed by the same task.

There are two types of tasks. A ShuffleMapTask processes all the operations in its stage up to the shuffle of the wide dependency. A ResultTask is the task of the last stage; it computes the final result of the job and reports it to the driver.
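The backward walk described above can be sketched in plain Python for the simple case of a linear lineage. This is a toy model, not Spark's DAGScheduler: `split_into_stages` and the `(name, dependency)` tuples are assumptions made for the example, and real lineages can branch.

```python
def split_into_stages(lineage):
    """Split a linear lineage into stages at wide dependencies.

    lineage: list of (op_name, dependency) ordered from the first RDD to the last,
    where dependency says how the op depends on its parent: "narrow" or "wide".
    Returns a list of (ops_in_stage, task_type), ordered from first stage to last.
    """
    stages = []
    current = []
    # Walk backward from the last RDD, as the scheduler does.
    for name, dependency in reversed(lineage):
        current.append(name)
        if dependency == "wide":
            # This op reads shuffled data, so it is the first op of its stage;
            # everything before it belongs to a parent stage.
            stages.append(list(reversed(current)))
            current = []
    if current:
        stages.append(list(reversed(current)))
    stages.reverse()

    last = len(stages) - 1
    return [
        (ops, "ResultTask" if index == last else "ShuffleMapTask")
        for index, ops in enumerate(stages)
    ]


# A word-count-like lineage: only reduceByKey needs a shuffle.
lineage = [
    ("textFile", "narrow"),
    ("flatMap", "narrow"),
    ("map", "narrow"),
    ("reduceByKey", "wide"),
    ("mapValues", "narrow"),
]
stages = split_into_stages(lineage)
```

For this lineage the sketch yields two stages: `textFile, flatMap, map` run as ShuffleMapTasks, and `reduceByKey, mapValues` run as ResultTasks.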

The checkpoint mechanism saves the computed result of an RDD, that is, the outcome of all the calculations it depends on, into a CheckpointRDD, which then replaces the RDD's original dependencies. In later computations the partition data can be read directly from the CheckpointRDD.
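As a rough illustration of how checkpointing cuts the dependency chain, here is a toy model in plain Python. `ToyRDD` and its methods are hypothetical and simplified: Spark writes the data to reliable storage and installs a CheckpointRDD as the new parent, whereas this sketch just keeps the computed list on the node itself.

```python
class ToyRDD:
    """Toy RDD node: either holds data directly or derives it from a parent."""

    def __init__(self, name, parent=None, fn=None, data=None):
        self.name = name
        self.parent = parent   # upstream ToyRDD, or None
        self.fn = fn           # per-element function applied to the parent's data
        self.data = data       # set for source nodes and for checkpointed nodes

    def compute(self):
        if self.data is not None:
            return list(self.data)              # no need to look at any parent
        return [self.fn(x) for x in self.parent.compute()]

    def lineage(self):
        # Names of this node and all of its ancestors, nearest first.
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names

    def checkpoint(self):
        # Save the computed result, then drop the dependencies it came from.
        self.data = self.compute()
        self.parent = None
        self.fn = None


source = ToyRDD("source", data=[1, 2, 3])
doubled = ToyRDD("doubled", parent=source, fn=lambda x: x * 2)
plus_one = ToyRDD("plus_one", parent=doubled, fn=lambda x: x + 1)

lineage_before = plus_one.lineage()
plus_one.checkpoint()
lineage_after = plus_one.lineage()
result = plus_one.compute()    # served from the saved data
```

Before the checkpoint the lineage is `plus_one, doubled, source`; afterwards it is just `plus_one`, and `compute()` no longer touches the upstream nodes.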

