
tasks not consumed from DSQ #1153

Open
arighi opened this issue Jan 3, 2025 · 9 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed), kernel (Expose a kernel issue)

Comments


arighi commented Jan 3, 2025

Under certain conditions it seems that per-CPU kthreads can get stuck in a DSQ without being consumed.

This is an old issue that was noticed in rustland and bpfland a long time ago and it was masked by allowing per-CPU kthreads to preempt other tasks (queuing them with SCX_ENQ_PREEMPT).
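For context, that workaround looked roughly like the sketch below. This is a minimal sketch under assumed names (the sched_enqueue callback name, SHARED_DSQ, slice_ns), not the actual bpfland/rustland code: per-CPU kthreads are sent straight to the local DSQ with SCX_ENQ_PREEMPT, while everything else goes to a shared vtime-ordered DSQ.

/*
 * Minimal sketch of the per-CPU kthread workaround (assumed names, not
 * the actual bpfland code): kthreads bound to a single CPU are
 * dispatched directly to the local DSQ and preempt the current task.
 */
void BPF_STRUCT_OPS(sched_enqueue, struct task_struct *p, u64 enq_flags)
{
        if ((p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1) {
                scx_bpf_dispatch(p, SCX_DSQ_LOCAL, slice_ns,
                                 enq_flags | SCX_ENQ_PREEMPT);
                return;
        }

        /* Regular tasks wait in a shared DSQ, ordered by vtime. */
        scx_bpf_dispatch_vtime(p, SHARED_DSQ, slice_ns,
                               p->scx.dsq_vtime, enq_flags);
}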

However, always allowing preemption can introduce excessive scheduling overhead with certain workloads, leading to poor performance (see for example #1148).

For example, here is one such stall:

CPU 1   : nr_run=2 flags=0x1 cpu_rel=0 ops_qseq=1648 pnt_seq=10281
          curr=dxvk-shader-n[20288] class=ext_sched_class

 *R dxvk-shader-n[20288] -1ms
      scx_state/flags=3/0xd dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=(n/a) dsq_vtime=101176519145
      cpus=3f

  R kworker/1:0[29] -5056ms
      scx_state/flags=3/0x9 dsq_flags=0x1 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x0 dsq_vtime=0
      cpus=02

    kthread+0x14a/0x170
    ret_from_fork+0x37/0x50
    ret_from_fork_asm+0x1a/0x30

In this case kworker/1 has dsq_vtime=0, so it should be the first one in the queue; however, it has been sitting in DSQ 0 for more than 5 seconds (tasks from DSQ 0 can be consumed from all CPUs). The task currently running on CPU 1 is dxvk-shader-n[20288], which apparently has been running for only 1ms, so 1ms ago kworker/1 was still sitting in the queue, but it wasn't selected to run.

Looking at the kernel code I don't see any reason why a per-CPU kthread could get stuck in a DSQ indefinitely (especially in this case where DSQ 0 can be consumed by any CPU, so it's not a cpumask-related issue).
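For reference, the consume side being discussed is roughly the following. This is a minimal sketch with assumed callback names, not the actual scx_vder code: every CPU's ops.dispatch() simply pulls the next task from the shared DSQ 0.

/*
 * Sketch of the consume path (assumed names): any CPU that runs out of
 * work moves the first task from the shared DSQ 0 to its local DSQ, so
 * nothing queued there should be pinned to a particular CPU.
 */
void BPF_STRUCT_OPS(sched_dispatch, s32 cpu, struct task_struct *prev)
{
        scx_bpf_consume(0);
}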

This trace was generated using this scheduler https://github.com/sched-ext/scx/tree/scx_vder-testing (which is a simplified version of scx_bpfland, designed specifically to debug this issue).

Am I missing something obvious here?

arighi added the kernel, bug, and help wanted labels on Jan 3, 2025

arighi commented Jan 5, 2025

Another nonsense trace (at least I'm unable to explain what is happening). This one is with scx_bpfland, bumping .timeout_ms to 30s (kernel 6.13-rc4):

CPU 1   : nr_run=4 flags=0x1 cpu_rel=0 ops_qseq=520949 pnt_seq=458781
          curr=TaskThread2[4793] class=ext_sched_class

 *R TaskThread2[4793] -3ms
      scx_state/flags=3/0x5 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=(n/a) dsq_vtime=1440445754412
      cpus=3f

  R wineserver[4698] -31260ms
      scx_state/flags=3/0x9 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002 dsq_vtime=1409568584439
      cpus=ff

    __x64_sys_epoll_wait+0xc8/0x100
    do_syscall_64+0x88/0x170
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

  R kworker/1:2[182] -31232ms
      scx_state/flags=3/0x9 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002 dsq_vtime=5296670
      cpus=02

    kthread+0x14a/0x170
    ret_from_fork+0x37/0x50
    ret_from_fork_asm+0x1a/0x30

  R kworker/1:1H[106] -29696ms
      scx_state/flags=3/0x9 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002 dsq_vtime=1875
      cpus=02

    kthread+0x14a/0x170
    ret_from_fork+0x37/0x50
    ret_from_fork_asm+0x1a/0x30

How is it possible that wineserver and kworker/1 were stuck for more than 30s if they were dispatched to the local DSQ of CPU 1?

arighi changed the title from "kthreads not consumed from DSQ" to "tasks not consumed from DSQ" on Jan 5, 2025

arighi commented Jan 5, 2025

Updated the subject: it seems to happen with tasks in general, not just per-CPU kthreads.


vax-r commented Jan 6, 2025

@arighi can I try to help with this issue if it isn't taken yet?
Btw, a few questions:
Does it only happen in bpfland and rustland? (If it's caused by the kernel, I'd expect all schedulers to be affected.)
What workload did you use to reproduce this bug?
Also, how do you generate the debug info you posted above?


arighi commented Jan 6, 2025

@arighi can I try to help with this issue if it isn't taken yet?

Absolutely! This issue is literally killing me, so any help is very appreciated! :D

Btw, a few questions: Does it only happen in bpfland and rustland? (If it's caused by the kernel, I'd expect all schedulers to be affected.) What workload did you use to reproduce this bug? Also, how do you generate the debug info you posted above?

It seems to happen only with bpfland. I've also implemented a completely new scheduler from scratch, which is essentially a simplified version of bpfland (https://github.com/sched-ext/scx/tree/scx_vder), to see if we could better track down the issue, and it's happening with that one as well.

I really don't understand why it's not happening with scx_rusty, scx_lavd, or scx_simple...

Unfortunately I've never been able to reproduce this problem on any of my systems; only people from the CachyOS community can reproduce it. They've been helping a lot with tests, but it's difficult to coordinate effective tests remotely, so we're kinda stuck at this point.

The good thing is that the CachyOS guys are able to reproduce this quickly; apparently running a simple kernel build is enough.


htejun commented Jan 6, 2025

Another nonsense trace (at least I'm unable to explain what is happening). This one is with scx_bpfland, bumping .timeout_ms to 30s (kernel 6.13-rc4):

CPU 1   : nr_run=4 flags=0x1 cpu_rel=0 ops_qseq=520949 pnt_seq=458781
          curr=TaskThread2[4793] class=ext_sched_class

 *R TaskThread2[4793] -3ms

...

R wineserver[4698] -31260ms
...
How is it possible that wineserver and kworker/1 were stuck for more than 30s if they were dispatched to the local DSQ of CPU 1?

SCX_ENQ_PREEMPT implies SCX_ENQ_HEAD, so if you keep preempting a CPU with new tasks and the CPU stays saturated, tasks on the local DSQ can stall out.


arighi commented Jan 6, 2025

Right, but here the problem seems to be the opposite: if I remove SCX_ENQ_PREEMPT the stall happens, and I need to add SCX_ENQ_PREEMPT to prevent it.


arighi commented Jan 6, 2025

Great news! I've been able to reproduce the stall locally (on my son's old PC, which is now my new test PC for sched_ext). 😄

According to the first tests, the main problem seems to be related to refilling the time slice in ops.running(). The stall condition also seems to happen when large I/O is involved (and the storage is not particularly fast); for example, during a build it always happens at the end, during linking.

I'll continue the investigation tomorrow.
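For reference, the suspect pattern is something like the following. This is a minimal sketch with assumed names (the sched_running callback name, slice_ns), not the actual bpfland code.

/*
 * Suspect pattern (assumed names): the slice is refilled every time the
 * task starts running. If the task keeps cycling through
 * ops.running()/ops.stopping() without going through ops.dispatch(),
 * its slice never expires and it can hold the CPU indefinitely.
 */
void BPF_STRUCT_OPS(sched_running, struct task_struct *p)
{
        p->scx.slice = slice_ns;
}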

multics69 commented

So is it not related to SCX_ENQ_PREEMPT?

arighi added a commit that referenced this issue Jan 7, 2025
While prioritizing per-CPU kthreads is generally a good idea due to
their critical role in the system, allowing them to preempt other tasks
indiscriminately can sometimes result in significant scheduling
overhead.

A notable example is starting a VM, which can be quite inefficient if we
allow per-CPU kthread preemption.

For instance, running a simple `time vng -r -- uname -r` produces the
following results:

 - EEVDF:              1.53 seconds
 - bpfland:            3.30 seconds
 - bpfland-no-preempt: 1.32 seconds

Moreover, disabling per-CPU kthread preemption does not appear to
introduce any regressions in typical latency-sensitive benchmarks.

Therefore, it makes sense to disable preemption for the per-CPU
kthreads.

Additionally, avoid refilling the task's time slice in ops.running() to
prevent a task from monopolizing a CPU for too long by repeatedly
transitioning between ops.running() and ops.stopping() without ever
triggering ops.dispatch(). This could starve other tasks limited to
that CPU (e.g., per-CPU kthreads).

This also fixes issue #1153.

Signed-off-by: Andrea Righi <arighi@nvidia.com>

arighi commented Jan 7, 2025

So is it not related to SCX_ENQ_PREEMPT?

SCX_ENQ_PREEMPT was the (ugly) workaround. Details below.

The main problem was refilling the task's time slice in ops.running(). My assumption was that a task can end up in a sequence of multiple ops.running() + ops.stopping() only by expiring its time slice (if it's blocking on something it would go to ops.quiescent() and follow the wakeup/enqueue path again).

Instead, it turns out a task can also end up in a sequence of multiple ops.running() + ops.stopping() calls with the SCX_TASK_QUEUED flag set and p->scx.slice > 0. When this happens, the task is re-added to the head of the local DSQ (see SCX_ENQ_HEAD below):

static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
                              struct task_struct *next)
{
        update_curr_scx(rq);

        /* see dequeue_task_scx() on why we skip when !QUEUED */
        if (SCX_HAS_OP(stopping) && (p->scx.flags & SCX_TASK_QUEUED))
                SCX_CALL_OP_TASK(SCX_KF_REST, stopping, p, true);

        if (p->scx.flags & SCX_TASK_QUEUED) {
                set_task_runnable(rq, p);

                /*
                 * If @p has slice left and is being put, @p is getting
                 * preempted by a higher priority scheduler class or core-sched
                 * forcing a different task. Leave it at the head of the local
                 * DSQ.
                 */
                if (p->scx.slice && !scx_rq_bypassing(rq)) {
                        dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
                        return;
                }
...

This particular condition seems to happen in the presence of large I/O operations (a task preempting itself? cond_resched() maybe?).

If the running/stopping sequence is rapid enough we may end up constantly refilling the time slice and the task will monopolize the CPU, starving the other tasks that are waiting in the local DSQ.

With the workaround of allowing per-CPU kthreads to preempt other tasks (SCX_ENQ_PREEMPT) I was basically breaking this loop, allowing per-CPU kthreads to just go ahead of the "endless" tasks (hiding the real problem).

With #1148 I removed the time slice refill in ops.running() and per-CPU kthreads are not allowed to preempt other tasks anymore. With these changes in place I wasn't able to stall the scheduler and I also got a significant performance improvement with certain workloads (e.g., booting a VM is now 2x faster).
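In other words, the fixed flow looks roughly like the sketch below (assumed names, not the exact #1148 diff): the slice budget is assigned only when the task is enqueued, and ops.running() no longer touches it, so a task that keeps being put back at the head of the local DSQ with leftover slice eventually runs out and lets the queued tasks in.

/*
 * Sketch of the fix (assumed names, not the exact #1148 change): the
 * slice is handed out once per enqueue. Per-CPU kthreads now take the
 * same path, without SCX_ENQ_PREEMPT.
 */
void BPF_STRUCT_OPS(sched_enqueue, struct task_struct *p, u64 enq_flags)
{
        scx_bpf_dispatch_vtime(p, SHARED_DSQ, slice_ns,
                               p->scx.dsq_vtime, enq_flags);
}

/* ...and the slice is deliberately NOT refilled in ops.running() anymore. */
void BPF_STRUCT_OPS(sched_running, struct task_struct *p)
{
        /* no p->scx.slice refill here */
}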
