
tasks not consumed from DSQ #1153

Open
arighi opened this issue Jan 3, 2025 · 9 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed), kernel (Expose a kernel issue)

Comments


arighi commented Jan 3, 2025

Under certain conditions it seems that per-CPU kthreads can get stuck in a DSQ without being consumed.

This is an old issue that was noticed in rustland and bpfland a long time ago and it was masked by allowing per-CPU kthreads to preempt other tasks (queuing them with SCX_ENQ_PREEMPT).
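For context, that workaround looked roughly like the sketch below. This is a minimal sketch under assumed names (the sched_enqueue callback name, SHARED_DSQ, slice_ns), not the actual bpfland/rustland code: per-CPU kthreads are sent straight to the local DSQ with SCX_ENQ_PREEMPT, while everything else goes to a shared vtime-ordered DSQ.

/*
 * Minimal sketch of the per-CPU kthread workaround (assumed names, not
 * the actual bpfland code): kthreads bound to a single CPU are
 * dispatched directly to the local DSQ and preempt the current task.
 */
void BPF_STRUCT_OPS(sched_enqueue, struct task_struct *p, u64 enq_flags)
{
        if ((p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1) {
                scx_bpf_dispatch(p, SCX_DSQ_LOCAL, slice_ns,
                                 enq_flags | SCX_ENQ_PREEMPT);
                return;
        }

        /* Regular tasks wait in a shared DSQ, ordered by vtime. */
        scx_bpf_dispatch_vtime(p, SHARED_DSQ, slice_ns,
                               p->scx.dsq_vtime, enq_flags);
}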

However, always allowing preemption can introduce excessive scheduling overhead with certain workloads, leading to poor performance (see for example #1148).

For example, here is one such stall:

CPU 1   : nr_run=2 flags=0x1 cpu_rel=0 ops_qseq=1648 pnt_seq=10281
          curr=dxvk-shader-n[20288] class=ext_sched_class

 *R dxvk-shader-n[20288] -1ms
      scx_state/flags=3/0xd dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=(n/a) dsq_vtime=101176519145
      cpus=3f

  R kworker/1:0[29] -5056ms
      scx_state/flags=3/0x9 dsq_flags=0x1 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x0 dsq_vtime=0
      cpus=02

    kthread+0x14a/0x170
    ret_from_fork+0x37/0x50
    ret_from_fork_asm+0x1a/0x30

In this case kworker/1 has dsq_vtime=0, so it should be the first one in the queue; however, it has been sitting in DSQ 0 for more than 5 seconds (tasks from DSQ 0 can be consumed from all CPUs). The task currently running on CPU 1 is dxvk-shader-n[20288], which apparently has been running for only 1ms, so 1ms ago kworker/1 was still sitting in the queue, but it wasn't selected to run.

Looking at the kernel code I don't see any reason why a per-CPU kthread could get stuck in a DSQ indefinitely (especially in this case where DSQ 0 can be consumed by any CPU, so it's not a cpumask-related issue).
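For reference, the consume side being discussed is roughly the following. This is a minimal sketch with assumed callback names, not the actual scx_vder code: every CPU's ops.dispatch() simply pulls the next task from the shared DSQ 0.

/*
 * Sketch of the consume path (assumed names): any CPU that runs out of
 * work moves the first task from the shared DSQ 0 to its local DSQ, so
 * nothing queued there should be pinned to a particular CPU.
 */
void BPF_STRUCT_OPS(sched_dispatch, s32 cpu, struct task_struct *prev)
{
        scx_bpf_consume(0);
}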

This trace was generated using this scheduler https://github.com/sched-ext/scx/tree/scx_vder-testing (which is a simplified version of scx_bpfland, designed specifically to debug this issue).

Am I missing something obvious here?

arighi added the kernel, bug, and help wanted labels on Jan 3, 2025

arighi commented Jan 5, 2025

Another nonsense trace (at least I'm unable to explain what is happening). This one is with scx_bpfland, bumping .timeout_ms to 30s (kernel 6.13-rc4):

CPU 1   : nr_run=4 flags=0x1 cpu_rel=0 ops_qseq=520949 pnt_seq=458781
          curr=TaskThread2[4793] class=ext_sched_class

 *R TaskThread2[4793] -3ms
      scx_state/flags=3/0x5 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=(n/a) dsq_vtime=1440445754412
      cpus=3f

  R wineserver[4698] -31260ms
      scx_state/flags=3/0x9 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002 dsq_vtime=1409568584439
      cpus=ff

    __x64_sys_epoll_wait+0xc8/0x100
    do_syscall_64+0x88/0x170
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

  R kworker/1:2[182] -31232ms
      scx_state/flags=3/0x9 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002 dsq_vtime=5296670
      cpus=02

    kthread+0x14a/0x170
    ret_from_fork+0x37/0x50
    ret_from_fork_asm+0x1a/0x30

  R kworker/1:1H[106] -29696ms
      scx_state/flags=3/0x9 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002 dsq_vtime=1875
      cpus=02

    kthread+0x14a/0x170
    ret_from_fork+0x37/0x50
    ret_from_fork_asm+0x1a/0x30

How is it possible that wineserver and kworker/1 were stuck for more than 30s if they were dispatched to the local DSQ of CPU 1?

arighi changed the title from "kthreads not consumed from DSQ" to "tasks not consumed from DSQ" on Jan 5, 2025

arighi commented Jan 5, 2025

Updated the subject: it seems to happen with tasks in general, not just per-CPU kthreads.


vax-r commented Jan 6, 2025

@arighi can I try to help with this issue if it isn't taken yet?
Btw, a few questions:
Does it only happen in bpfland and rustland? (If it's caused by the kernel, I'd expect all schedulers to be affected.)
What workload did you use to reproduce this bug?
Also, how do you generate the debug info you posted above?


arighi commented Jan 6, 2025

@arighi can I try to help with this issue if it isn't taken yet?

Absolutely! This issue is literally killing me, so any help is very appreciated! :D

Btw, a few questions: Does it only happen in bpfland and rustland? (If it's caused by the kernel, I'd expect all schedulers to be affected.) What workload did you use to reproduce this bug? Also, how do you generate the debug info you posted above?

It seems to happen only with bpfland. I've also implemented a completely new scheduler from scratch, which is essentially a simplified version of bpfland (https://github.com/sched-ext/scx/tree/scx_vder), to see if we could better track down the issue, and it's happening with that one as well.

I really don't understand why it's not happening with scx_rusty, scx_lavd, or scx_simple...

Unfortunately I've never been able to reproduce this problem on any of my systems; only people from the CachyOS community can reproduce it. They've been helping a lot with tests, but it's difficult to coordinate effective tests remotely, so we're kinda stuck at this point.

The good thing is that the CachyOS guys are able to reproduce this quickly; apparently running a simple kernel build is enough.


htejun commented Jan 6, 2025

Another nonsense trace (at least I'm unable to explain what is happening). This one is with scx_bpfland, bumping .timeout_ms to 30s (kernel 6.13-rc4):

CPU 1   : nr_run=4 flags=0x1 cpu_rel=0 ops_qseq=520949 pnt_seq=458781
          curr=TaskThread2[4793] class=ext_sched_class

 *R TaskThread2[4793] -3ms

...

R wineserver[4698] -31260ms
...
How is it possible that wineserver and kworker/1 were stuck for more than 30s if they were dispatched to the local DSQ of CPU 1?

SCX_ENQ_PREEMPT implies SCX_ENQ_HEAD, so if you keep preempting a CPU with new tasks and the CPU stays saturated, tasks on the local DSQ can stall out.


arighi commented Jan 6, 2025

Right, but here the problem seems to be the opposite: if I remove SCX_ENQ_PREEMPT the stall happens, and I need to add SCX_ENQ_PREEMPT to prevent it.


arighi commented Jan 6, 2025

Great news! I've been able to reproduce the stall locally (on my son's old PC, which is now my new test PC for sched_ext). 😄

According to the first tests, the main problem seems to be related to refilling the time slice in ops.running(). The stall condition also seems to happen when large I/O is involved (and the storage is not particularly fast); for example, during a build it always happens at the end, during linking.

I'll continue the investigation tomorrow.
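For reference, the suspect pattern is something like the following. This is a minimal sketch with assumed names (the sched_running callback name, slice_ns), not the actual bpfland code.

/*
 * Suspect pattern (assumed names): the slice is refilled every time the
 * task starts running. If the task keeps cycling through
 * ops.running()/ops.stopping() without going through ops.dispatch(),
 * its slice never expires and it can hold the CPU indefinitely.
 */
void BPF_STRUCT_OPS(sched_running, struct task_struct *p)
{
        p->scx.slice = slice_ns;
}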

multics69 commented

So is it not related to SCX_ENQ_PREEMPT?

arighi added a commit that referenced this issue Jan 7, 2025
While prioritizing per-CPU kthreads is generally a good idea due to
their critical role in the system, allowing them to preempt other tasks
indiscriminately can sometimes result in significant scheduling
overhead.

A notable example is starting a VM, which can be quite inefficient if we
allow per-CPU kthread preemption.

For instance, running a simple `time vng -r -- uname -r` produces the
following results:

 - EEVDF:              1.53 seconds
 - bpfland:            3.30 seconds
 - bpfland-no-preempt: 1.32 seconds

Moreover, disabling per-CPU kthread preemption does not appear to
introduce any regressions in typical latency-sensitive benchmarks.

Therefore, it makes sense to disable preemption for the per-CPU
kthreads.

Additionally, avoid refilling the task's time slice in ops.running() to
prevent a task from monopolizing a CPU for too long by repeatedly
transitioning between ops.running() and ops.stopping() without ever
triggering ops.dispatch(). This could starve other tasks limited to
that CPU (e.g., per-CPU kthreads).

This also fixes issue #1153.

Signed-off-by: Andrea Righi <arighi@nvidia.com>

arighi commented Jan 7, 2025

So is it not related to SCX_ENQ_PREEMPT?

SCX_ENQ_PREEMPT was the (ugly) workaround. Details below.

The main problem was refilling the task's time slice in ops.running(). My assumption was that a task can end up in a sequence of multiple ops.running() + ops.stopping() only by expiring its time slice (if it's blocking on something it would go to ops.quiescent() and follow the wakeup/enqueue path again).

Instead, it turns out a task can also end up in a sequence of multiple ops.running() + ops.stopping() calls with the SCX_TASK_QUEUED flag set and p->scx.slice > 0. When this happens, the task is re-added to the head of the local DSQ (see SCX_ENQ_HEAD below):

static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
                              struct task_struct *next)
{
        update_curr_scx(rq);

        /* see dequeue_task_scx() on why we skip when !QUEUED */
        if (SCX_HAS_OP(stopping) && (p->scx.flags & SCX_TASK_QUEUED))
                SCX_CALL_OP_TASK(SCX_KF_REST, stopping, p, true);

        if (p->scx.flags & SCX_TASK_QUEUED) {
                set_task_runnable(rq, p);

                /*
                 * If @p has slice left and is being put, @p is getting
                 * preempted by a higher priority scheduler class or core-sched
                 * forcing a different task. Leave it at the head of the local
                 * DSQ.
                 */
                if (p->scx.slice && !scx_rq_bypassing(rq)) {
                        dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
                        return;
                }
...

This particular condition seems to happen in the presence of large I/O operations (a task preempting itself? cond_resched() maybe?).

If the running/stopping sequence is rapid enough we may end up constantly refilling the time slice and the task will monopolize the CPU, starving the other tasks that are waiting in the local DSQ.

With the workaround of allowing per-CPU kthreads to preempt other tasks (SCX_ENQ_PREEMPT) I was basically breaking this loop, allowing per-CPU kthreads to just go ahead of the "endless" tasks (hiding the real problem).

With #1148 I removed the time slice refill in ops.running() and per-CPU kthreads are not allowed to preempt other tasks anymore. With these changes in place I wasn't able to stall the scheduler and I also got a significant performance improvement with certain workloads (e.g., booting a VM is now 2x faster).
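In other words, the fixed flow looks roughly like the sketch below (assumed names, not the exact #1148 diff): the slice budget is assigned only when the task is enqueued, and ops.running() no longer touches it, so a task that keeps being put back at the head of the local DSQ with leftover slice eventually runs out and lets the queued tasks in.

/*
 * Sketch of the fix (assumed names, not the exact #1148 change): the
 * slice is handed out once per enqueue. Per-CPU kthreads now take the
 * same path, without SCX_ENQ_PREEMPT.
 */
void BPF_STRUCT_OPS(sched_enqueue, struct task_struct *p, u64 enq_flags)
{
        scx_bpf_dispatch_vtime(p, SHARED_DSQ, slice_ns,
                               p->scx.dsq_vtime, enq_flags);
}

/* ...and the slice is deliberately NOT refilled in ops.running() anymore. */
void BPF_STRUCT_OPS(sched_running, struct task_struct *p)
{
        /* no p->scx.slice refill here */
}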
