Several important terms in Spark

 

 

Several core Spark nouns, as I understand them:

1 Job: each action operation triggers one job.

The available actions are listed at http://spark.apache.org/docs/latest/programming-guide.html#actions
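
A minimal sketch of the idea (assuming a SparkContext named sc is available, e.g. in spark-shell; the data is made up):

  // Transformations such as map are lazy and trigger nothing by themselves.
  val nums    = sc.parallelize(1 to 100)
  val doubled = nums.map(_ * 2)

  // count() is an action, so this line triggers one job.
  val n = doubled.count()

  // Every further action triggers another job.
  doubled.take(5)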

2 Stage: how a job is divided into stages (a word-count sketch follows this list).

  1) A shuffle operation marks a stage boundary: everything before the shuffle forms one stage.

  2) Writing the data out (landing the final output) forms the last stage.
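
A small word-count sketch of the split (same sc as above; the paths are illustrative; reduceByKey forces a shuffle, so Spark cuts the job into two stages there):

  val lines  = sc.textFile("hdfs:///tmp/input.txt")
  val counts = lines
    .flatMap(_.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)                          // shuffle: everything above is stage 0

  counts.saveAsTextFile("hdfs:///tmp/output")    // the landed output closes stage 1

  // toDebugString prints the lineage with the shuffle boundary visible.
  println(counts.toDebugString)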

 

3 Task: the concrete unit of execution; personally I think of it as a thread. How many tasks run concurrently depends on two things:

  1) The number of allocated CPUs and the number of cores per CPU: CPUs * cores is the total number of task slots.

  2) The total number of partitions.

  If the slot count in 1) is smaller than the partition count in 2), the concurrency is the slot count; otherwise it is the partition count. In other words, the number of partitions caps how many tasks run at once.

  If the source data has too few partitions, it can be repartitioned; otherwise there is not enough concurrency (see the sketch below).
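
A rough sketch of the min(slots, partitions) rule and of repartitioning (the slot count and path are assumptions for illustration):

  val slots = 50                                    // e.g. 5 executors * 10 cores
  val rdd   = sc.textFile("hdfs:///tmp/input.txt")

  println(rdd.getNumPartitions)                     // partitions in the source data

  // Per stage, concurrency is effectively min(slots, partitions).
  val concurrent = math.min(slots, rdd.getNumPartitions)

  // Too few partitions? Repartition so every slot gets work.
  val wider = rdd.repartition(slots)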

4 Worker: informally, the machines that do the work; a Worker Node is a physical node in the cluster.

5 Executor: a worker process running on a worker machine (personally I think of it roughly as a CPU on the worker). If num-executors=5 is set, 5 executors are allocated to the application to execute its tasks.

  If executor-cores=10, each executor is allocated 10 cores, i.e. 10 threads run tasks inside each executor, so the total number of task slots is 5 * 10 = 50.

  executor-memory=2g allocates 2g of memory per executor (not per task). If this value is too large, it limits how many executors can actually be started.

A complete submit command putting these flags together:

spark-submit --master yarn-cluster \
  --name importdtaweather3 \
  --num-executors 10 \
  --executor-cores 12 \
  --executor-memory 3g \
  --queue def0 \
  --class com.jusn.spark.test.DFTestRowkeySelf \
  weatherimport-1.0-jar-with-dependencies.jar
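
With those flags the total slot count is 10 executors * 12 cores = 120 tasks running at once. A quick way to sanity-check this from inside the application (a sketch; the exact value depends on the cluster and scheduler):

  // On YARN, defaultParallelism typically reflects the total allocated cores.
  println(sc.defaultParallelism)   // expect something around 10 * 12 = 120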

 
