[SPARK-50648][CORE] Cleanup zombie tasks in non-running stages when the job is cancelled

### What changes were proposed in this pull request?

This is a problem Spark has always had; see the next section for the scenario in which it occurs.

When a job is cancelled, some of its tasks may keep running. The reason is that when `DAGScheduler#handleTaskCompletion` encounters a FetchFailed, it calls `markStageAsFinished`, which removes the stage from `DAGScheduler#runningStages` (see https://github.com/apache/spark/blob/7cd5c4a1d1eb56fa92c10696bdbd8450d357b128/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L2059) without calling `killAllTaskAttempts`. Because `DAGScheduler#cancelRunningIndependentStages` only looks at `runningStages`, cancellation leaves zombie shuffle tasks behind, occupying cluster resources (see the sketch after this message).

### Why are the changes needed?

Assume a job runs stage1 -> stage2. When a FetchFailed occurs during stage2, both stage1 and stage2 are resubmitted (stage2 may still have some tasks running even after it is resubmitted; this is expected, because those tasks may eventually succeed and avoid a retry). If the SQL query is cancelled while stage1-retry is executing, the tasks in stage1 and stage1-retry are all killed, but the tasks that were already running in stage2 keep running and cannot be killed. These zombie tasks can significantly affect cluster stability and occupy resources.

### Does this PR introduce _any_ user-facing change?

No

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #49270 from yabola/zombie-task-when-shuffle-retry.

Authored-by: chenliang.lu <chenlianglu@tencent.com>
Signed-off-by: Yi Wu <yi.wu@databricks.com>
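The gap described above can be illustrated with a small standalone Scala model. This is a hand-written toy sketch, not Spark's actual `DAGScheduler` code: `ToyScheduler`, `submitStage`, `handleFetchFailed`, `cancelJobNaive`, and `cancelJobCleanup` are hypothetical names used only to mirror the shape of the problem, namely that a stage dropped from `runningStages` on FetchFailed keeps its task attempts alive, and a cancellation that only walks `runningStages` never reaches them.

```scala
// Toy model of the zombie-task gap (not Spark code; all names hypothetical).
object ZombieTaskSketch {
  final case class Task(stageId: Int, attempt: Int)

  final class ToyScheduler {
    import scala.collection.mutable

    // Stages the scheduler still considers running.
    val runningStages = mutable.Set[Int]()
    // stageId -> task attempts still executing on the cluster.
    val liveTasks = mutable.Map[Int, List[Task]]().withDefaultValue(Nil)

    def submitStage(id: Int, tasks: List[Task]): Unit = {
      runningStages += id
      liveTasks(id) = tasks
    }

    // Mirrors the problematic path: on FetchFailed the stage is marked finished
    // (dropped from runningStages), but its still-running attempts are not killed.
    def handleFetchFailed(stageId: Int): Unit =
      runningStages -= stageId

    // Cancellation that only walks runningStages misses those attempts.
    def cancelJobNaive(): List[Task] =
      runningStages.toList.flatMap { id =>
        val killed = liveTasks(id)
        liveTasks(id) = Nil
        killed
      }

    // A fix in this spirit also kills attempts of stages that are no longer in
    // runningStages but still have live tasks (a sketch, not the PR's actual code).
    def cancelJobCleanup(): List[Task] = {
      val killed = liveTasks.values.flatten.toList
      liveTasks.clear()
      runningStages.clear()
      killed
    }
  }

  def main(args: Array[String]): Unit = {
    val s = new ToyScheduler
    s.submitStage(2, List(Task(2, 0), Task(2, 1))) // stage2 attempts start running
    s.handleFetchFailed(2)                         // stage2 leaves runningStages; attempts keep running
    s.submitStage(1, List(Task(1, 0)))             // stage1-retry
    println(s"cancel kills:    ${s.cancelJobNaive()}") // only stage1-retry's attempt
    println(s"zombie attempts: ${s.liveTasks(2)}")     // stage2 attempts are still live
  }
}
```

Running the model shows the naive cancellation killing only the stage1-retry attempt while stage2's two attempts stay live, which is the zombie-task situation the commit message describes.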