Spark: several key terms
1 Job: an action operation triggers a job.
For the list of actions, see http://spark.apache.org/docs/latest/programming-guide.html#actions
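The point above is that transformations are lazy and only an action triggers a job. A minimal sketch of that idea (this is a toy class, not the Spark API):

```python
# Toy illustration (not Spark): transformations are recorded lazily,
# and only an action such as collect() actually runs the work.

class LazyPipeline:
    def __init__(self, data):
        self.data = data
        self.transforms = []   # recorded, not yet executed

    def map(self, fn):
        # "Transformation": just remember the function (lazy, no job yet).
        self.transforms.append(fn)
        return self

    def collect(self):
        # "Action": only now is the recorded work actually executed.
        result = list(self.data)
        for fn in self.transforms:
            result = [fn(x) for x in result]
        return result

p = LazyPipeline([1, 2, 3]).map(lambda x: x * 10)
# Nothing has run yet; calling collect() is what triggers the "job".
print(p.collect())  # [10, 20, 30]
```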
2 Stage division
1) A shuffle operation marks a stage boundary: the operations before the shuffle form one stage.
2) Writing the data out (the final output) closes the last stage.
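The two rules above can be sketched as a small function that splits a chain of operations into stages, cutting at every shuffle. The operation names and the set of shuffle operations here are illustrative assumptions, not an exhaustive list:

```python
# Illustrative sketch: start a new stage after every shuffle operation;
# the final stage ends with the output. Op names are assumptions for the demo.

SHUFFLE_OPS = {"reduceByKey", "groupByKey", "repartition", "join"}

def split_stages(ops):
    stages, current = [], []
    for op in ops:
        current.append(op)
        if op in SHUFFLE_OPS:      # a shuffle closes the current stage
            stages.append(current)
            current = []
    if current:                    # whatever remains ends with the output
        stages.append(current)
    return stages

plan = ["textFile", "map", "reduceByKey", "map", "saveAsTextFile"]
print(split_stages(plan))
# [['textFile', 'map', 'reduceByKey'], ['map', 'saveAsTextFile']]
```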
3 Task: the concrete unit of execution; personally I understand it as a thread. How many tasks run concurrently depends on two things:
1) The allocated CPU resources: the number of CPUs * cores per CPU gives the total number of task slots.
2) The total number of partitions of the data.
If the number in 1) is smaller than the number in 2), the concurrency is the number in 1); otherwise it is the number of partitions in 2). In other words, the number of partitions caps how many tasks can execute at once.
If the data has too few partitions, repartition it; otherwise there is no parallelism.
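The rule above is just a minimum of the two numbers; a quick sketch:

```python
# Sketch of the rule above:
# concurrent tasks = min(total CPU slots, number of partitions).

def concurrent_tasks(num_executors, executor_cores, num_partitions):
    slots = num_executors * executor_cores
    return min(slots, num_partitions)

print(concurrent_tasks(5, 10, 200))  # 50 -- the 50 slots cap concurrency
print(concurrent_tasks(5, 10, 8))    # 8  -- only 8 partitions, so only 8 run at once
```

This is why repartitioning too-few partitions matters: with 8 partitions, 42 of the 50 slots sit idle.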
4 Worker: personally I understand it as the machines doing the work; a Worker Node is the physical node.
5 Executor: a process launched on a worker node (not, strictly, a CPU). If num-executors=5 is set, 5 executors are allocated to run the application's tasks.
If executor-cores=10, each executor is given 10 cores, i.e. 10 threads run tasks in parallel on each executor, so 5*10 tasks can run at once in total.
executor-memory=2g allocates 2g of memory to each executor (not to each task). If this value is too large, it limits how many executors can be started.
spark-submit --master yarn-cluster --name importdtaweather3 --num-executors 10 --executor-cores 12 --executor-memory 3g --queue def0 --class com.jusn.spark.test.DFTestRowkeySelf weatherimport-1.0-jar-with-dependencies.jar
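Plugging the numbers from the submit command above into the same arithmetic (ignoring driver memory and per-executor overhead) gives the resources this job asks for:

```python
# Resource math for the spark-submit example above:
# --num-executors 10, --executor-cores 12, --executor-memory 3g

num_executors = 10
executor_cores = 12
executor_memory_gb = 3

task_slots = num_executors * executor_cores          # tasks runnable at once
total_memory_gb = num_executors * executor_memory_gb # executor memory requested

print(task_slots)        # 120
print(total_memory_gb)   # 30
```

So the cluster must have at least 120 free cores and 30g of executor memory (plus overhead) for the job to get everything it requested.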