When a big data computing framework runs on YARN, where do its tasks actually go?

Recently, a friend asked me a very interesting question: when big data frameworks such as Spark and Flink run on YARN, where do their tasks essentially end up?

My first instinct was that the answer was obvious: the tasks run on the framework's own cluster, and "on YARN" just means swapping the scheduler out for YARN.

But just as I was about to say that, it didn't feel right, and a problem suddenly occurred to me. When Hadoop runs MR tasks by itself, YARN can play its role because the ResourceManager and NodeManager are there. Other frameworks, however, don't seem to have such a bridge when they run on YARN. Even though we configure an association with YARN, the components that actually communicate with YARN don't seem to exist in those frameworks. That leaves the awkward situation of a head with no body, so the claim that the tasks run on the framework's own cluster felt a bit far-fetched.

After that I looked at architecture diagrams online and thought it through myself, and it suddenly dawned on me: when other frameworks run on YARN, their tasks are essentially running on Hadoop.

I won't paste the diagrams here, because they can actually make things harder to follow. Instead I'll explain it in plain language, and you'll see what's going on.

If you have read my article on the MR execution model, or other materials, you'll know that an MR job can run because a driver process plans the tasks before the job is submitted, and that same driver keeps running while the tasks execute, handling things like collecting and processing the task results. At that point, a bold guess had already formed in my mind.
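To make the driver's role concrete, here is a minimal sketch of an MR job submission. It's my own illustration, not code from the original post; it assumes the Hadoop client libraries and a yarn-site.xml are on the classpath, and the input/output paths are hypothetical:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

object MrSubmitSketch {
  def main(args: Array[String]): Unit = {
    // With yarn-site.xml on the classpath, the job client talks to YARN.
    val job = Job.getInstance(new Configuration(), "mr-on-yarn-sketch")
    job.setJarByClass(this.getClass)
    FileInputFormat.addInputPath(job, new Path(args(0)))   // hypothetical input
    FileOutputFormat.setOutputPath(job, new Path(args(1))) // hypothetical output
    // Inside waitForCompletion, this client-side "driver" computes input splits,
    // ships job resources, and asks the ResourceManager to start an MRAppMaster.
    job.waitForCompletion(true)
  }
}
```

Everything up to submission is driver-side planning; the map and reduce work itself lands in YARN containers.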

Out of curiosity, I did an experiment with Spark to verify my idea, and I finally reached this conclusion: for any framework on YARN, both Hadoop and the other computing framework need to be started normally before you execute a task. When a job is submitted, the framework first initializes it locally and spawns a driver of its own, and that driver interacts with YARN. This is exactly the same process as when an MR job is submitted and a driver process plans the tasks.
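Along the lines of that experiment, here is a minimal Spark sketch. The job itself is arbitrary, and the master is assumed to come from spark-submit --master yarn rather than being hard-coded:

```scala
import org.apache.spark.sql.SparkSession

object SparkOnYarnSketch {
  def main(args: Array[String]): Unit = {
    // The master ("yarn") is normally supplied by spark-submit --master yarn.
    val spark = SparkSession.builder().appName("spark-on-yarn-sketch").getOrCreate()
    val sc = spark.sparkContext
    // The Spark driver plans this job; the actual work runs in YARN containers.
    val sum = sc.parallelize(1 to 100).map(_ * 2).sum()
    println(s"sum = $sum")
    spark.stop()
  }
}
```

Launched this way, the application shows up in YARN's ResourceManager UI while the Spark driver, not any MR component, plans and tracks the stages.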

Later, while the tasks are running, this driver is still the one in charge of how they run. In other words, when another framework runs on YARN, it simply substitutes its own driver for the one Hadoop uses when plain MR executes. The tasks therefore still run inside Hadoop's containers, but what runs inside each container is no longer under Hadoop's control.
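The piece that makes this substitution possible is YARN's public ApplicationMaster protocol. Below is a hedged sketch of what any framework's master process does, using YARN's real AMRMClient API but heavily abbreviated; note it only works inside a container that YARN itself launched as the application's master:

```scala
import org.apache.hadoop.yarn.api.records.{FinalApplicationStatus, Priority, Resource}
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest
import org.apache.hadoop.yarn.conf.YarnConfiguration

object AppMasterSketch {
  def main(args: Array[String]): Unit = {
    val amrm = AMRMClient.createAMRMClient[ContainerRequest]()
    amrm.init(new YarnConfiguration())
    amrm.start()
    // Register this process with the ResourceManager as the app's master.
    // The third argument is the tracking URL, which is how the YARN web UI
    // can later link to the framework's own UI.
    amrm.registerApplicationMaster("", 0, "")
    // Ask for one container (1 GB, 1 vcore); the framework, not Hadoop,
    // decides what code gets launched inside the granted container.
    val capability = Resource.newInstance(1024, 1)
    amrm.addContainerRequest(
      new ContainerRequest(capability, null, null, Priority.newInstance(0)))
    val granted = amrm.allocate(0.0f).getAllocatedContainers // heartbeat + grants
    // ... launch task processes in `granted` via NMClient, then deregister:
    amrm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "")
    amrm.stop()
  }
}
```

This is also why no extra bridge component has to ship with Hadoop: the framework brings its own master process, and YARN only hands out containers.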

As for the YARN association we configure, besides telling the framework YARN's address, it also carries task status back and forth. For example, with Spark on YARN, you can click into an application's details in the YARN web UI and jump straight to the Spark UI.
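That association usually amounts to pointing the framework at the Hadoop configuration directory. The sketch below hard-codes the master only for illustration; in practice, HADOOP_CONF_DIR or YARN_CONF_DIR plus spark-submit --master yarn is the normal route:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object YarnAssociationSketch {
  def main(args: Array[String]): Unit = {
    // yarn-site.xml (found via HADOOP_CONF_DIR/YARN_CONF_DIR) supplies the
    // ResourceManager address; setMaster here is only for illustration.
    val conf = new SparkConf().setMaster("yarn").setAppName("yarn-association-sketch")
    val spark = SparkSession.builder().config(conf).getOrCreate()
    // While running, Spark reports status to YARN and registers a tracking URL,
    // so the ApplicationMaster link in the YARN web UI opens the Spark UI.
    spark.range(1000).count()
    spark.stop()
  }
}
```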


Origin: blog.csdn.net/dudadudadd/article/details/114648566