kernel: Implement cached current time interface #858
Comments
I have looked a bit, and this is what I am thinking. Considering the discussion on the BPF mailing list (https://lore.kernel.org/bpf/CAEf4BzaBNNCYaf9a4oHsB2AzYyc6JCWXpHx6jk22Btv=UAgX4A@mail.gmail.com/), I think we can assume the following two (or something similar) new APIs:
Based on these two new APIs, we can provide two common utility functions at
In my opinion, it would be quite difficult to handle the clock drift of
What do you think, @htejun?
One challenge is that
For reference, Chris Mason implemented a simple benchmark to test TSC performance: https://github.com/masoncl/tscbench. The main problem being observed is
Some thoughts on the problem:
Thoughts @multics69 @htejun?
Thank you for the feedback @htejun and @etsal !
BTW, @htejun -- do you have some TSC benchmark numbers on Sapphire Rapids? It would be great to understand how bad
I only heard results second-hand. IIRC, on Sapphire Rapids, rdtsc wasn't much better than rdtscp.
TSCBench results with a single-threaded execution (default)
2-socket Xeon (Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz, 40 CPUs = 10x2x2)
1-socket AMD (AMD Ryzen 9 PRO 6950H, 16 CPUs = 8x1x2)
1-socket Intel Raptor Lake (13th Gen Intel(R) Core(TM) i7-13700H, 20 CPUs = 6x2+8)
Steam Deck OLED (AMD, 8 CPUs = 4x2)
TSCBench results with a multi-threaded execution (N=the max CPUs)
2-socket Xeon (Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz, 40 CPUs = 10x2x2, nthreads=40)
1-socket AMD (AMD Ryzen 9 PRO 6950H, 16 CPUs = 8x1x2, nthreads=16)
The observations so far can be summarized as follows:
The patch v2 was posted: https://lore.kernel.org/lkml/20241202043849.1465664-1-changwoo@igalia.com/
SCX schedulers tend to use `bpf_ktime_get_ns()` a lot. On x86, this is eventually serviced by `rdtsc_ordered()`, which is the `rdtscp` instruction. The instruction is known to be expensive and has scalability issues on large machines when the CPUs are saturated (the cost of the instruction increases as the machine gets saturated).

In most cases, we don't really care about the nanosecond accuracy that we're paying for. There are a couple of approaches to addressing this:

- Cache the `ktime_get_ns()` result and provide a kfunc to access the cached time. The kernel should have reasonable cache invalidation points (e.g. at the start of dispatch, when the rq lock is released during dispatch, and so on).
- … (`rdtsc` in x86) along with helpers to compare and calculate the delta between two timestamps. `rdtsc` is cheaper and doesn't have the scalability issues that `rdtscp` has, but it's unclear how this would map to other archs.

Considerations:

- … `rdtscp` guarantees this, but that's why it's expensive.
- … `after - before` underflowing) can cause some headaches. This can probably be alleviated reasonably well with a good set of helpers.
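To make the two ideas in the description concrete, here is a minimal user-space sketch in C, not kernel code. It shows a cached clock that only re-reads the expensive time source after an explicit invalidation, plus comparison and delta helpers that tolerate `after - before` underflowing. All names (`cached_clock`, `cached_now`, `cached_invalidate`, `ts_before`, `ts_delta`, `expensive_clock_ns`) are hypothetical and chosen for illustration. The fake counter stands in for `ktime_get_ns()`, and where the real invalidation points belong is left to the kernel side.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the expensive clock read (ktime_get_ns() in the kernel).
 * Assumption: each real read advances time by 1000 ns, for determinism. */
static uint64_t fake_hw_clock_ns;

static uint64_t expensive_clock_ns(void)
{
    fake_hw_clock_ns += 1000;
    return fake_hw_clock_ns;
}

/* Hypothetical cache; in the kernel this would likely live per-CPU/per-rq. */
struct cached_clock {
    uint64_t ns;     /* last value read from the expensive clock */
    bool     valid;  /* cleared at invalidation points */
    uint64_t reads;  /* number of expensive reads, for illustration only */
};

/* Return the cached time, reading the expensive clock only when stale. */
static uint64_t cached_now(struct cached_clock *c)
{
    if (!c->valid) {
        c->ns = expensive_clock_ns();
        c->valid = true;
        c->reads++;
    }
    return c->ns;
}

/* Call at invalidation points, e.g. when the rq lock is released. */
static void cached_invalidate(struct cached_clock *c)
{
    c->valid = false;
}

/* True if a is earlier than b. Works across counter wraparound as long as
 * the two values are less than half the u64 range apart. */
static bool ts_before(uint64_t a, uint64_t b)
{
    return (int64_t)(a - b) < 0;
}

/* Delta that clamps to 0 instead of underflowing when 'after' turns out
 * to be earlier than 'before' (e.g. timestamps taken on different CPUs). */
static uint64_t ts_delta(uint64_t after, uint64_t before)
{
    return ts_before(after, before) ? 0 : after - before;
}
```

Between two invalidation points every call to `cached_now()` returns the same value, so callers trade accuracy for cost, and `ts_delta()` keeps a slightly out-of-order pair of timestamps from turning into a huge unsigned delta.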