tasks not consumed from DSQ #1153
Another nonsense trace (at least I'm unable to explain what is happening). This one is with scx_bpfland, bumping the
How is it possible that wineserver and kworker/1 were stuck for more than 30s if they were dispatched to the local DSQ of CPU 1?
Updated the subject: it seems to happen with tasks in general, not just per-CPU kthreads.
@arighi can I try to help with this issue if it isn't taken yet?
Absolutely! This issue is literally killing me, so any help is very appreciated! :D
It seems to happen only with bpfland. I've also tried implementing a totally new scheduler from scratch that is like a simplified version of bpfland (https://github.com/sched-ext/scx/tree/scx_vder) to see if we could better track down the issue, and it's happening with that one as well. I really don't understand why it's not happening with scx_rusty, scx_lavd, or scx_simple...

Unfortunately I've never been able to reproduce this problem on any of my systems; only people from the CachyOS community are able to reproduce it. They've been helping a lot with tests, but it's difficult to coordinate effective tests remotely, so we're kinda stuck at this point. The good thing is that the CachyOS folks can reproduce this quickly: apparently it's enough to run a simple kernel build.
...
Right, but here the problem seems to be the opposite: if I remove
Great news! I've been able to reproduce the stall locally (on my son's old PC, which is now my new test PC for sched_ext). 😄 According to the first tests, it seems that the main problem is related to refilling the time slice in ops.running(). The stall condition also seems to happen when large I/O is involved (and the storage is not really fast); for example, during a build it always happens at the end, during the linking. I'll continue the investigation tomorrow.
So is it not related to
While prioritizing per-CPU kthreads is generally a good idea due to their critical role in the system, allowing them to preempt other tasks indiscriminately can sometimes result in significant scheduling overhead. A notable example is starting a VM, which can be quite inefficient with per-CPU kthread preemption enabled. For instance, running a simple `time vng -r -- uname -r` produces the following results:

- EEVDF: 1.53 seconds
- bpfland: 3.30 seconds
- bpfland-no-preempt: 1.32 seconds

Moreover, disabling per-CPU kthread preemption does not appear to introduce any regressions in typical latency-sensitive benchmarks. Therefore, it makes sense to disable preemption for per-CPU kthreads.

Additionally, avoid refilling the task's time slice in ops.running(): if a task repeatedly transitions between ops.running() and ops.stopping() without ever giving the scheduler a chance to trigger ops.dispatch(), constantly refilling its slice lets it monopolize the CPU and starve other tasks that can only run on that CPU (e.g., per-CPU kthreads).

This also fixes issue #1153.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
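To make the two changes above concrete, here is a minimal BPF sketch (not the actual scx_bpfland code; callback, macro, and kfunc names follow the scx common headers as I understand them and may differ across sched_ext versions; `SHARED_DSQ` and the `sketch_*` names are placeholders, and the struct_ops registration boilerplate is omitted):

```c
/* Minimal sketch of the two changes described above (illustrative only). */
#include <scx/common.bpf.h>

#ifndef PF_KTHREAD
#define PF_KTHREAD 0x00200000	/* from include/linux/sched.h */
#endif

#define SHARED_DSQ 0	/* placeholder for the scheduler's shared DSQ */

void BPF_STRUCT_OPS(sketch_enqueue, struct task_struct *p, u64 enq_flags)
{
	/*
	 * Per-CPU kthreads are still sent straight to the local DSQ, but
	 * without SCX_ENQ_PREEMPT: they no longer kick out the task that
	 * is currently running on the CPU.
	 */
	if ((p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1) {
		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
		return;
	}

	/* All other tasks go to the shared DSQ (vtime handling omitted). */
	scx_bpf_dispatch(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
}

void BPF_STRUCT_OPS(sketch_running, struct task_struct *p)
{
	/*
	 * Deliberately do NOT refill p->scx.slice here: the task keeps
	 * whatever budget it has left, so a rapid running/stopping
	 * sequence cannot extend its time on the CPU indefinitely.
	 */
}
```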
The main problem was refilling the task's time slice in `ops.running()`: a task can end up in a sequence of multiple consecutive `ops.running()`/`ops.stopping()` transitions without ever going through `ops.dispatch()`.

This particular condition seems to happen in the presence of large I/O operations (a task preempting itself?). If the running/stopping sequence is rapid enough, we may end up constantly refilling the time slice, and the task will monopolize the CPU, starving the other tasks that are waiting in the local DSQ. With the workaround of allowing per-CPU kthreads to preempt other tasks (`SCX_ENQ_PREEMPT`) I was basically breaking this loop, allowing per-CPU kthreads to just go ahead of the "endless" tasks (hiding the real problem). With #1148 I removed the time slice refill in `ops.running()`.
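For contrast, the problematic pattern would look roughly like this (a sketch, not the exact bpfland code; the callback name is made up):

```c
/* Sketch of the problematic pattern described above (illustrative only). */
#include <scx/common.bpf.h>

void BPF_STRUCT_OPS(buggy_running, struct task_struct *p)
{
	/*
	 * Refilling the slice on every transition to RUNNING means that a
	 * task bouncing through a rapid running/stopping sequence never
	 * exhausts its budget, so ops.dispatch() is never invoked on that
	 * CPU and tasks sitting in its local DSQ (e.g. per-CPU kthreads)
	 * can be stalled for tens of seconds.
	 */
	p->scx.slice = SCX_SLICE_DFL;	/* <-- this refill is the problem */
}
```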
Under certain conditions it seems that per-CPU kthreads can get stuck in a DSQ without being consumed.
This is an old issue that was noticed in rustland and bpfland a long time ago, and it was masked by allowing per-CPU kthreads to preempt other tasks (queuing them with `SCX_ENQ_PREEMPT`). However, always allowing preemption can introduce excessive scheduling overhead with certain workloads, leading to poor performance (see for example #1148).
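The old masking workaround was along these lines (a rough sketch, not the exact rustland/bpfland code; names are placeholders and kfunc signatures may differ across sched_ext versions):

```c
/* Sketch of the old workaround (illustrative only). */
#include <scx/common.bpf.h>

#ifndef PF_KTHREAD
#define PF_KTHREAD 0x00200000	/* from include/linux/sched.h */
#endif

void BPF_STRUCT_OPS(old_enqueue, struct task_struct *p, u64 enq_flags)
{
	/*
	 * Per-CPU kthreads were queued to the local DSQ with
	 * SCX_ENQ_PREEMPT, so they would immediately kick the current
	 * task off the CPU. This hid the stall rather than fixing it,
	 * and it adds scheduling overhead under some workloads.
	 */
	if ((p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1) {
		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL,
				 enq_flags | SCX_ENQ_PREEMPT);
		return;
	}

	/* ... regular enqueue path omitted ... */
}
```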
For example from this stall:
In this case `kworker/1` has `dsq_vtime=0`, so it should be the first one in the queue; however, it has been sitting in DSQ 0 for more than 5 seconds (tasks from DSQ 0 can be consumed by all CPUs). The task that is currently running on CPU 1 is `dxvk-shader-n[20288]`, which apparently has been running for only 1ms, so 1ms ago `kworker/1` was still sitting in the queue, but it wasn't selected to run.

Looking at the kernel code I don't see any reason why a per-CPU kthread could get stuck in a DSQ indefinitely (especially in this case, where DSQ 0 can be consumed by any CPU, so it's not a cpumask-related issue); see the sketch below for the consumption path I'd expect.
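For reference, the consumption path I'd expect to pick that task up is roughly the following (a sketch of a typical sched_ext dispatch callback; `SHARED_DSQ` is a placeholder for DSQ 0 from the trace, and kfunc names may differ between sched_ext versions):

```c
/* Sketch of the expected consumption path (illustrative only). */
#include <scx/common.bpf.h>

#define SHARED_DSQ 0	/* placeholder for DSQ 0 seen in the trace */

void BPF_STRUCT_OPS(sketch_dispatch, s32 cpu, struct task_struct *prev)
{
	/*
	 * Move the first task from the shared DSQ to this CPU's local DSQ.
	 * If every CPU does this whenever it runs out of local work, a
	 * task with dsq_vtime=0 sitting in DSQ 0 should be picked up
	 * almost immediately, not after 5+ seconds.
	 */
	scx_bpf_consume(SHARED_DSQ);
}
```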
This trace was generated using this scheduler https://github.com/sched-ext/scx/tree/scx_vder-testing (which is a simplified version of scx_bpfland, designed specifically to debug this issue).
Am I missing something obvious here?