Flink: Task exit process and failover mechanism

1 TaskExecutor-side Task exit logic

Task.doRun() is the core method that drives a Task's initialization and the execution of its code.
It constructs and instantiates the Task's executable object, AbstractInvokable invokable,
and then calls AbstractInvokable.invoke() to start the computation logic contained in the Task.

After AbstractInvokable.invoke() returns, the Task performs the operations that correspond to the exit type:

  1. Normal completion: flush the ResultPartition buffer data, close the buffers, and mark the Task as FINISHED;
  2. Exit caused by a cancel operation: mark the Task as CANCELED and close the user code;
  3. AbstractInvokable.invoke() throws an exception during execution: mark the Task as FAILED, close the user code, and record the exception;
  4. The JVM throws an error while AbstractInvokable.invoke() is executing: the virtual machine is forcibly terminated and the current process exits.

The Task then releases its network, memory, and file system resources. Finally, the Task's terminal state is reported to the leader JobMaster thread along the Task -> TaskManager -> JobMaster call chain:

Task.notifyFinalState() -> TaskManagerActions.updateTaskExecutionState(TaskExecutionState) -> JobMasterGateway.updateTaskExecutionState(TaskExecutionState)
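The exit handling above can be summarized in a few lines of Java. This is only a minimal sketch, not Flink's code: TaskExitSketch, Invokable, FinalState and the cancelRequested flag are hypothetical names, and the fourth case (a JVM error that terminates the process) is deliberately left out, so a java.lang.Error simply propagates.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Consumer;

final class TaskExitSketch {
    /** Terminal states a Task can report (hypothetical, simplified enum). */
    enum FinalState { FINISHED, CANCELED, FAILED }

    /** Hypothetical stand-in for AbstractInvokable. */
    interface Invokable {
        void invoke() throws Exception;
    }

    /**
     * Runs the invokable, maps the way it exited to a terminal state,
     * and reports that state through the notifier callback.
     */
    static FinalState runAndClassify(
            Invokable invokable,
            BooleanSupplier cancelRequested,
            Consumer<FinalState> notifyFinalState) {
        FinalState result;
        try {
            invokable.invoke();
            // invoke() returned: finished, unless a cancel was requested meanwhile
            result = cancelRequested.getAsBoolean() ? FinalState.CANCELED : FinalState.FINISHED;
        } catch (Exception e) {
            // Assumption: an exception raised while cancelling counts as CANCELED
            result = cancelRequested.getAsBoolean() ? FinalState.CANCELED : FinalState.FAILED;
        }
        // Releasing network, memory and file system resources would happen here.
        notifyFinalState.accept(result);
        return result;
    }
}
```

The notifier callback plays the role of the Task -> TaskManager -> JobMaster chain: whatever the exit path, exactly one terminal state is reported.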

  • Key information carried by TaskExecutionState:

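    TaskExecutionState {
        JobID // Job ID
        ExecutionAttemptID  // Unique ID of a Task execution; identifies each execution attempt
        ExecutionState  // Enum value: the Task's execution state
        SerializedThrowable // If the Task threw an exception, this field records the exception stack trace
        ...
    }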
    
  • Task execution state transitions:

   CREATED  -> SCHEDULED -> DEPLOYING -> RUNNING -> FINISHED
      |            |            |          |
      |            |            |   +------+
      |            |            V   V
      |            |         CANCELLING -----+----> CANCELED
      |            |                         |
      |            +-------------------------+
      |
      |                                   ... -> FAILED
      V
  RECONCILING  -> RUNNING | FINISHED | CANCELED | FAILED
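The diagram can also be read as a transition table. The following is a minimal sketch that transcribes its edges into an enum. SketchState is a hypothetical name rather than Flink's ExecutionState class, and the "... -> FAILED" edge is interpreted here as "any non-terminal state may fail", which is an assumption.

```java
import java.util.EnumSet;
import java.util.Set;

enum SketchState {
    CREATED, SCHEDULED, DEPLOYING, RUNNING, FINISHED, CANCELING, CANCELED, FAILED, RECONCILING;

    /** FINISHED, CANCELED and FAILED have no outgoing edges in the diagram. */
    boolean isTerminal() {
        return this == FINISHED || this == CANCELED || this == FAILED;
    }

    /** Successor states, transcribed from the diagram above. */
    Set<SketchState> successors() {
        switch (this) {
            case CREATED:
                return EnumSet.of(SCHEDULED, RECONCILING, FAILED);
            case SCHEDULED:
                return EnumSet.of(DEPLOYING, CANCELED, FAILED);
            case DEPLOYING:
                return EnumSet.of(RUNNING, CANCELING, FAILED);
            case RUNNING:
                return EnumSet.of(FINISHED, CANCELING, FAILED);
            case CANCELING:
                return EnumSet.of(CANCELED, FAILED);
            case RECONCILING:
                return EnumSet.of(RUNNING, FINISHED, CANCELED, FAILED);
            default:
                // Terminal states: nothing follows
                return EnumSet.noneOf(SketchState.class);
        }
    }

    boolean canTransitionTo(SketchState next) {
        return successors().contains(next);
    }
}
```

For example, SketchState.RUNNING.canTransitionTo(SketchState.CANCELING) holds, while no transition out of FINISHED is allowed.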

2 JobMaster-side failover process

2.1 Task execution state handling

The JobMaster receives the task execution state change sent by the TaskManager over RPC and notifies the scheduler (SchedulerNG) of the current Flink job to process it. Because all of these calls run on the same thread, later read and write operations on stateful instances such as the ExecutionGraph (the runtime execution plan) and the failover count have no thread-safety issues.
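The single-thread argument can be illustrated with a plain java.util.concurrent executor. This is a sketch under assumptions, not Flink's RPC machinery: MainThreadSketch and its fields are hypothetical, and a single-threaded ExecutorService stands in for the JobMaster main thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class MainThreadSketch {
    // Plain, unsynchronized state: safe only because one thread ever touches it
    private final Map<String, String> taskStates = new HashMap<>();
    private int failoverCount;
    private final ExecutorService mainThread = Executors.newSingleThreadExecutor();

    /** Callable from any thread; the mutation itself runs on the main thread. */
    void updateTaskExecutionState(String attemptId, String state) {
        mainThread.execute(() -> {
            taskStates.put(attemptId, state);
            if ("FAILED".equals(state)) {
                failoverCount++;
            }
        });
    }

    /** Sends FAILED updates from several caller threads and returns the final count. */
    static int demo(int callers, int updatesPerCaller) {
        MainThreadSketch scheduler = new MainThreadSketch();
        List<Thread> threads = new ArrayList<>();
        for (int c = 0; c < callers; c++) {
            final int caller = c;
            Thread t = new Thread(() -> {
                for (int i = 0; i < updatesPerCaller; i++) {
                    scheduler.updateTaskExecutionState("attempt-" + caller + "-" + i, "FAILED");
                }
            });
            threads.add(t);
            t.start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
            // Read the counter on the main thread as well, after all queued updates
            return scheduler.mainThread.submit(() -> scheduler.failoverCount)
                    .get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            scheduler.mainThread.shutdown();
        }
    }
}
```

Even though the callers race with each other, every update is queued and applied one at a time, so no lock or atomic counter is needed for the HashMap or the int.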

The JobMaster's core processing logic is in SchedulerBase.updateTaskExecutionState(TaskExecutionStateTransition) (TaskExecutionStateTransition is mainly a readability wrapper around TaskExecutionState).
Processing logic: try to apply the received Task execution state to the ExecutionGraph. If the update succeeds and the target state is FINISHED, the SchedulingStrategy selects the consumable result partitions and schedules the corresponding consumer Tasks according to its concrete implementation; if the update succeeds and the target state is FAILED, the failover process begins.

  • SchedulerBase.updateTaskExecutionState(TaskExecutionStateTransition)

        public final boolean updateTaskExecutionState(
                final TaskExecutionStateTransition taskExecutionState) {
            final Optional<ExecutionVertexID> executionVertexId =
                    getExecutionVertexId(taskExecutionState.getID());

            boolean updateSuccess = executionGraph.updateState(taskExecutionState);

            if (updateSuccess) {
                checkState(executionVertexId.isPresent());
                if (isNotifiable(executionVertexId.get(), taskExecutionState)) {
                    updateTaskExecutionStateInternal(executionVertexId.get(), taskExecutionState);
                }
                return true;
            } else {
                return false;
            }
        }
    
  • ExecutionGraph.updateState(TaskExecutionStateTransition): the update fails when the target ExecutionAttemptID cannot be found in the current physical execution topology. Note that this ID uniquely identifies one Execution, and an Execution represents one execution attempt of an ExecutionVertex (a topology vertex that represents a subtask plan), which can be executed multiple times. This means that once a subtask is rerun, currentExecutions no longer holds the ID of the previous execution.

        /**
         * Updates the state of one of the ExecutionVertex's Execution attempts. If the new status is
         * "FINISHED", this also updates the accumulators.
         *
         * @param state The state update.
         * @return True, if the task update was properly applied, false, if the execution attempt was
         *     not found.
         */
        public boolean updateState(TaskExecutionStateTransition state) {
            assertRunningInJobMasterMainThread();
            final Execution attempt = currentExecutions.get(state.getID());
            if (attempt != null) {
                try {
                    final boolean stateUpdated = updateStateInternal(state, attempt);
                    maybeReleasePartitions(attempt);
                    return stateUpdated;
                } catch (Throwable t) {
                    ......
                    return false;
                }
            } else {
                return false;
            }
        }
    
  • JobMaster: the central class responsible for running a single job topology; it covers job scheduling, resource management, external communication, and so on.

  • SchedulerNG: responsible for scheduling the job topology. All calls to methods of this type are triggered through the ComponentMainThreadExecutor, so there are no concurrent calls.

  • ExecutionGraph: the central data structure of the current execution topology, which coordinates the Executions distributed across the nodes. It describes every subtask of the job and its partition data, and maintains communication with them.
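The rule that ExecutionGraph.updateState rejects updates for unknown attempt IDs can be sketched with a small registry. This is an illustration under assumptions: AttemptRegistrySketch, the String IDs and the string-valued states are hypothetical simplifications, and only the currentExecutions name comes from the code above.

```java
import java.util.HashMap;
import java.util.Map;

final class AttemptRegistrySketch {
    // attemptId -> last known state; only live attempts are kept
    private final Map<String, String> currentExecutions = new HashMap<>();
    private int nextAttemptNumber;

    /** Registers a new execution attempt for a vertex and returns its unique ID. */
    String startAttempt(String vertexId) {
        String attemptId = vertexId + "#" + nextAttemptNumber++;
        currentExecutions.put(attemptId, "RUNNING");
        return attemptId;
    }

    /** Returns false when the attempt is unknown, for example a stale attempt. */
    boolean updateState(String attemptId, String newState) {
        if (!currentExecutions.containsKey(attemptId)) {
            return false;
        }
        if (isTerminal(newState)) {
            // A terminated attempt is deregistered; later updates for it are ignored
            currentExecutions.remove(attemptId);
        } else {
            currentExecutions.put(attemptId, newState);
        }
        return true;
    }

    private static boolean isTerminal(String state) {
        return "FINISHED".equals(state) || "CANCELED".equals(state) || "FAILED".equals(state);
    }
}
```

A late message that refers to the previous attempt of a rerun subtask therefore returns false and cannot disturb the new attempt.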

2.2 Job Failover

2.2.1 Task failure handling

  • The main flow for handling a Task exception is DefaultScheduler.handleTaskFailure(ExecutionVertexID, Throwable): the RestartBackoffTimeStrategy decides whether to restart or to fail the job; the FailoverStrategy selects the subtasks that need to be restarted; finally, the SchedulingStrategy restarts those subtasks according to the job's current scheduling strategy.

        private void handleTaskFailure(
                final ExecutionVertexID executionVertexId, @Nullable final Throwable error) {
            // Update the job's current failure information
            setGlobalFailureCause(error);
            // If the related operators (source, sink) have a coordinator, notify it so it can react
            notifyCoordinatorsAboutTaskFailure(executionVertexId, error);
            // Apply the current restart strategy and obtain the FailureHandlingResult
            final FailureHandlingResult failureHandlingResult =
                    executionFailureHandler.getFailureHandlingResult(executionVertexId, error);
            // Restart the Tasks or fail the job, depending on the result
            maybeRestartTasks(failureHandlingResult);
        }

        public class FailureHandlingResult {
            // All subtasks that must be restarted for recovery
            Set<ExecutionVertexID> verticesToRestart;
            // Restart delay
            long restartDelayMS;
            // The root cause of the failure
            Throwable error;
            // Whether this is a global failure
            boolean globalFailure;
        }
    
  • ExecutionFailureHandler: handles the failure and returns the failure handling result according to the strategies currently in effect.

        public FailureHandlingResult getFailureHandlingResult(
                ExecutionVertexID failedTask, Throwable cause) {
            return handleFailure(
                    cause,
                    failoverStrategy.getTasksNeedingRestart(failedTask, cause),  // Select the subtasks that need to be restarted
                    false);
        }

        private FailureHandlingResult handleFailure(
                final Throwable cause,
                final Set<ExecutionVertexID> verticesToRestart,
                final boolean globalFailure) {

            if (isUnrecoverableError(cause)) {
                return FailureHandlingResult.unrecoverable(
                        new JobException("The failure is not recoverable", cause), globalFailure);
            }

            restartBackoffTimeStrategy.notifyFailure(cause);
            if (restartBackoffTimeStrategy.canRestart()) {
                numberOfRestarts++;

                return FailureHandlingResult.restartable(
                        verticesToRestart, restartBackoffTimeStrategy.getBackoffTime(), globalFailure);
            } else {
                return FailureHandlingResult.unrecoverable(
                        new JobException(
                                "Recovery is suppressed by " + restartBackoffTimeStrategy, cause),
                        globalFailure);
            }
        }
    
  • FailoverStrategy: the failover strategy, which selects the subtasks to restart.

    • RestartAllFailoverStrategy: when a failure occurs, the entire job is restarted, that is, all subtasks are restarted.
    • RestartPipelinedRegionFailoverStrategy: when a failure occurs, the Region that contains the failed subtask is restarted.
  • RestartBackoffTimeStrategy: when a Task fails, decides whether to restart and how long to delay the restart.

    • FixedDelayRestartBackoffTimeStrategy: allows the job to restart a fixed number of times, with the specified delay between restarts.
    • FailureRateRestartBackoffTimeStrategy: allows restarts with a fixed delay as long as the failure rate stays within the configured limit.
    • NoRestartBackoffTimeStrategy: never restarts.
  • SchedulingStrategy: the scheduling strategy for Task execution instances.

    • EagerSchedulingStrategy: eager scheduling; all Tasks are scheduled at the same time.
    • LazyFromSourcesSchedulingStrategy: schedules downstream Tasks once the partition data they consume is ready; used for batch jobs.
    • PipelinedRegionSchedulingStrategy: uses a Region, that is, the set of Tasks connected by pipelined edges, as the scheduling granularity.
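To make the restart decision more concrete, here is a minimal sketch of a fixed-delay and a failure-rate backoff strategy. The three method names follow the calls made in handleFailure above; everything else (BackoffSketch, FixedDelaySketch, FailureRateSketch, the constructor parameters and the injected clock) is hypothetical, and the bodies are assumptions rather than Flink's implementations.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

/** Hypothetical interface mirroring the three calls used by handleFailure. */
interface BackoffSketch {
    void notifyFailure(Throwable cause);
    boolean canRestart();
    long getBackoffTime();
}

/** Allows at most maxRestarts restarts, each after the same delay. */
final class FixedDelaySketch implements BackoffSketch {
    private final int maxRestarts;
    private final long delayMs;
    private int failures;

    FixedDelaySketch(int maxRestarts, long delayMs) {
        this.maxRestarts = maxRestarts;
        this.delayMs = delayMs;
    }

    @Override public void notifyFailure(Throwable cause) { failures++; }
    @Override public boolean canRestart() { return failures <= maxRestarts; }
    @Override public long getBackoffTime() { return delayMs; }
}

/** Allows restarts while the failures inside a sliding window stay within a limit. */
final class FailureRateSketch implements BackoffSketch {
    private final int maxFailuresPerInterval;
    private final long intervalMs;
    private final long delayMs;
    private final LongSupplier clock; // injected so the sketch can be tested
    private final Deque<Long> failureTimes = new ArrayDeque<>();

    FailureRateSketch(int maxFailuresPerInterval, long intervalMs, long delayMs, LongSupplier clock) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMs = intervalMs;
        this.delayMs = delayMs;
        this.clock = clock;
    }

    @Override
    public void notifyFailure(Throwable cause) {
        long now = clock.getAsLong();
        failureTimes.addLast(now);
        // Forget failures that have fallen out of the window
        while (now - failureTimes.peekFirst() > intervalMs) {
            failureTimes.removeFirst();
        }
    }

    @Override public boolean canRestart() { return failureTimes.size() <= maxFailuresPerInterval; }
    @Override public long getBackoffTime() { return delayMs; }
}
```

With new FixedDelaySketch(2, 100), the first two failures still allow a restart and the third does not; the failure-rate variant recovers once old failures age out of the window.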

2.2.2 Restart Task


    private void maybeRestartTasks(final FailureHandlingResult failureHandlingResult) {
        if (failureHandlingResult.canRestart()) {
            restartTasksWithDelay(failureHandlingResult);
        } else {
            failJob(failureHandlingResult.getError());
        }
    }

    private void restartTasksWithDelay(final FailureHandlingResult failureHandlingResult) {
        final Set<ExecutionVertexID> verticesToRestart =
                failureHandlingResult.getVerticesToRestart();

        final Set<ExecutionVertexVersion> executionVertexVersions =
                new HashSet<>(
                        executionVertexVersioner
                                .recordVertexModifications(verticesToRestart)
                                .values());
        final boolean globalRecovery = failureHandlingResult.isGlobalFailure();

        addVerticesToRestartPending(verticesToRestart);

        // Cancel all subtasks that need to be restarted
        final CompletableFuture<?> cancelFuture = cancelTasksAsync(verticesToRestart);

        delayExecutor.schedule(
                () ->
                        FutureUtils.assertNoException(
                                cancelFuture.thenRunAsync(   // Restart only after cancellation has completed
                                        restartTasks(executionVertexVersions, globalRecovery),
                                        getMainThreadExecutor())),
                failureHandlingResult.getRestartDelayMS(),
                TimeUnit.MILLISECONDS);
    }
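The "cancel first, then restart after a delay" composition in restartTasksWithDelay can be reproduced with plain JDK classes. The sketch below is an illustration under assumptions: DelayedRestartSketch and its three executors are hypothetical stand-ins for the TaskExecutor side, delayExecutor and the main thread executor, and the event list exists only to make the ordering observable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class DelayedRestartSketch {
    /** Runs one cancel-then-restart cycle and returns the observed event order. */
    static List<String> run(long restartDelayMs) {
        List<String> events = Collections.synchronizedList(new ArrayList<>());
        ExecutorService taskSide = Executors.newSingleThreadExecutor();
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        ScheduledExecutorService delayExecutor = Executors.newSingleThreadScheduledExecutor();
        try {
            // Cancellation starts immediately and completes asynchronously
            CompletableFuture<Void> cancelFuture =
                    CompletableFuture.runAsync(() -> events.add("cancelled"), taskSide);

            CompletableFuture<Void> restarted = new CompletableFuture<>();
            // After the delay, chain the restart onto the cancel future so it
            // runs only once cancellation is done, and on the main thread
            delayExecutor.schedule(
                    () -> {
                        cancelFuture.thenRunAsync(
                                () -> {
                                    events.add("restarted");
                                    restarted.complete(null);
                                },
                                mainThread);
                    },
                    restartDelayMs,
                    TimeUnit.MILLISECONDS);

            restarted.get(5, TimeUnit.SECONDS);
            return new ArrayList<>(events);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            taskSide.shutdownNow();
            mainThread.shutdownNow();
            delayExecutor.shutdownNow();
        }
    }
}
```

Whichever finishes last, the delay or the cancellation, gates the restart: the restart callback needs both the timer to fire and cancelFuture to be complete.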

2.2.3 Cancel Task

  • Cancel the subtasks that are still waiting for slot allocation. For subtasks that are already deploying or running, notify the TaskExecutor to stop them and wait for the operation to complete.

        private CompletableFuture<?> cancelTasksAsync(final Set<ExecutionVertexID> verticesToRestart) {
            // clean up all the related pending requests to avoid that immediately returned slot
            // is used to fulfill the pending requests of these tasks
            verticesToRestart.stream().forEach(executionSlotAllocator::cancel); // Cancel subtasks that may still be waiting for a slot

            final List<CompletableFuture<?>> cancelFutures =
                    verticesToRestart.stream()
                            .map(this::cancelExecutionVertex) // Start stopping the subtask
                            .collect(Collectors.toList());

            return FutureUtils.combineAll(cancelFutures);
        }

        // Execution.cancel()
        public void cancel() {
            // Retry if the state change fails
            while (true) {
                ExecutionState current = this.state;
                if (current == CANCELING || current == CANCELED) {
                    // already taken care of, no need to cancel again
                    return;
                }
                else if (current == RUNNING || current == DEPLOYING) {
                    // Set the state to CANCELING and send an RPC to the TaskExecutor to stop the subtask
                    if (startCancelling(NUM_CANCEL_CALL_TRIES)) {
                        return;
                    }
                } else if (current == FINISHED) {
                    // Even though it finished, its output can no longer be consumed; release the result partitions
                    sendReleaseIntermediateResultPartitionsRpcCall();
                    return;
                } else if (current == FAILED) {
                    return;
                } else if (current == CREATED || current == SCHEDULED) {
                    // from here, we can directly switch to cancelled, because no task has been deployed
                    if (cancelAtomically()) {
                        return;
                    }
                } else {
                    throw new IllegalStateException(current.name());
                }
            }
        }
    
  • After the cancellation completes, the Task exit process runs and notifies the ExecutionGraph to update its data accordingly: ExecutionGraph.updateState(TaskExecutionStateTransition) -> ExecutionGraph.updateStateInternal(TaskExecutionStateTransition, Execution) -> Execution.completeCancelling(..) -> Execution.finishCancellation(boolean) -> ExecutionGraph.deregisterExecution(Execution). The deregisterExecution operation removes the stopped Execution from currentExecutions.
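The while (true) loop in cancel() retries whenever a state change does not go through. One common way to build such a loop is compare-and-set on an atomic reference; the sketch below uses that technique as an assumption, since the source does not show how startCancelling or cancelAtomically change the state. CancelCasSketch and its State enum are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicReference;

final class CancelCasSketch {
    enum State { CREATED, SCHEDULED, DEPLOYING, RUNNING, FINISHED, CANCELING, CANCELED, FAILED }

    private final AtomicReference<State> state;

    CancelCasSketch(State initial) {
        this.state = new AtomicReference<>(initial);
    }

    State current() {
        return state.get();
    }

    /** Requests cancellation and returns the state the execution ends up in. */
    State cancel() {
        while (true) {
            State current = state.get();
            switch (current) {
                case CANCELING:
                case CANCELED:
                case FINISHED:
                case FAILED:
                    // Nothing left to cancel
                    return current;
                case RUNNING:
                case DEPLOYING:
                    // Deployed work must be stopped remotely first
                    if (state.compareAndSet(current, State.CANCELING)) {
                        return State.CANCELING;
                    }
                    break; // state changed concurrently: re-read and retry
                case CREATED:
                case SCHEDULED:
                    // Nothing was deployed yet, so jump straight to CANCELED
                    if (state.compareAndSet(current, State.CANCELED)) {
                        return State.CANCELED;
                    }
                    break; // state changed concurrently: re-read and retry
                default:
                    throw new IllegalStateException(current.name());
            }
        }
    }
}
```

If another thread changes the state between the read and the compareAndSet, the swap fails, the loop re-reads the state and picks the branch that matches the new value.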

2.2.4 Start Task

        private Runnable restartTasks(
                final Set<ExecutionVertexVersion> executionVertexVersions,
                final boolean isGlobalRecovery) {
            return () -> {
                final Set<ExecutionVertexID> verticesToRestart =
                        executionVertexVersioner.getUnmodifiedExecutionVertices(
                                executionVertexVersions);

                removeVerticesFromRestartPending(verticesToRestart);
                // Instantiate new subtask execution instances (Execution)
                resetForNewExecutions(verticesToRestart);

                try {
                    // Restore state
                    restoreState(verticesToRestart, isGlobalRecovery);
                } catch (Throwable t) {
                    handleGlobalFailure(t);
                    return;
                }
                // Start scheduling: request slots and deploy
                schedulingStrategy.restartTasks(verticesToRestart);
            };
        }
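The version check at the start of restartTasks is easy to overlook. A plausible reading is that a delayed restart only acts on vertices whose version has not changed since the restart was scheduled, so a newer failover supersedes an older one. The sketch below illustrates that idea under assumptions: VertexVersionerSketch is hypothetical, vertex IDs are plain strings, and only the two method names are taken from the code above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class VertexVersionerSketch {
    // vertexId -> latest version; bumped every time a restart is scheduled
    private final Map<String, Long> versions = new HashMap<>();

    /** Bumps the version of each vertex and returns the versions just recorded. */
    Map<String, Long> recordVertexModifications(Set<String> vertices) {
        Map<String, Long> recorded = new HashMap<>();
        for (String vertex : vertices) {
            recorded.put(vertex, versions.merge(vertex, 1L, Long::sum));
        }
        return recorded;
    }

    /** Keeps only the vertices whose recorded version is still the latest one. */
    Set<String> getUnmodifiedExecutionVertices(Map<String, Long> recorded) {
        Set<String> unmodified = new HashSet<>();
        for (Map.Entry<String, Long> entry : recorded.entrySet()) {
            if (entry.getValue().equals(versions.get(entry.getKey()))) {
                unmodified.add(entry.getKey());
            }
        }
        return unmodified;
    }
}
```

If a first failover records versions for vertices a and b, and a second failover then records b again before the first delayed restart fires, the first restart only handles a, while b is left to the second one.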

3 Possible problems

The current failover mechanism hides a potential problem; can you spot it? It will be covered in a follow-up post.
Problems and optimization: Flink restart strategy (restart-strategy) optimization

Origin blog.csdn.net/qq_30708747/article/details/123080836