What is CacULE exactly? #37
Replies: 3 comments
-
My first attempt to hack the Linux scheduler was around 2017. Before that, I took a course at the University of Regina with Dr. Robert Hilderman (https://www.cs.uregina.ca/People/Profiles/RobertHilderman.html) where we studied a simplified Unix-like operating system called xv6. My project was to change the round-robin CPU scheduler of xv6 and compare the results. I implemented several schedulers for xv6, along with different benchmarks to test throughput, response, scheduler latency, and desktop usage. After that I tried to hack the Linux kernel, since I didn't feel xv6 was representative enough to show the actual effect of a CPU scheduler.

In late 2019, I wondered what would happen if I implemented a round-robin scheduler with no CPU balancing at all, to better utilize the CPU cache: the round-robin pointer would go through tasks like a pendulum to get more cache hits (hence the name Cachy). I spent 2 months learning the CFS code, and finally I was able to implement something with clearly different effects: very high throughput but very bad interactivity! The reason: no task priorities; the scheduler was dumb. So for people who like to judge by FPS numbers and think that means a better scheduler, let me know so I can find and send that bad scheduler to make you more satisfied. The scheduler had audio interruptions when playing YouTube even though there was no heavy load.

After a couple of weeks tweaking that RR nightmare without any improvement, I almost gave up. But then I remembered the Popcorn work. I opened the patch and noticed many wrong implementations in the Popcorn code; every single line of code was wrong. So I decided to try HRRN again, but this time, instead of using the RBtree, I would use the linked list I already had in Cachy. After I applied HRRN, I ran the tests again, and all the audio interruptions were gone! I recompiled the kernel and made sure it was not CFS I was testing. Everything worked well, and the response was super high even under heavy loads.
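To illustrate the HRRN (Highest Response Ratio Next) policy mentioned above, here is a minimal sketch of the classic textbook formula, response ratio = (waiting time + service time) / service time. This is only an illustration of the general idea, not the actual Cachy or kernel code; the struct and names are hypothetical.

```c
#include <assert.h>

/* HRRN sketch: the response ratio of a task grows the longer it waits,
 * so long-waiting tasks cannot starve, while short tasks are favored. */
struct task {
    const char *name;
    unsigned long wait_ns;    /* time spent waiting in the runqueue */
    unsigned long service_ns; /* CPU time consumed so far (or an estimate) */
};

/* Scale by 1000 to keep integer arithmetic (kernel code avoids floats). */
static unsigned long response_ratio(const struct task *t)
{
    unsigned long service = t->service_ns ? t->service_ns : 1;
    return (t->wait_ns + service) * 1000 / service;
}

/* Pick the runnable task with the highest response ratio. */
static const struct task *hrrn_pick(const struct task *tasks, int n)
{
    const struct task *best = &tasks[0];
    for (int i = 1; i < n; i++)
        if (response_ratio(&tasks[i]) > response_ratio(best))
            best = &tasks[i];
    return best;
}
```

With two tasks of equal service time, the one that has waited longer wins the pick, which is why HRRN removed the audio stutters: waiting interactive tasks quickly climb to the top.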
I did a simple test where I ran 16 threads of make (gcc) plus stress-ng, and the system was still as responsive as if nothing were running in the background. I compared it with CFS and noticed the difference in responsiveness. So I posted on Reddit, and I got frustrated since no one even tried to test it. They just criticized the idea of Cachy because of the lack of task balancing. I changed it to balance only when a CPU is idle. It was good enough for my usage, but no one cared at the time.

After reading some toxic comments on Reddit, I said fuck the whole Linux community, let's move to FreeBSD! I was a FreeBSD user for a couple of weeks, promising myself not to open GitHub anymore. I went through the FreeBSD CPU scheduler code (ULE); it was a pleasure just to read. The whole scheduler was 3K LOC, if I recall correctly. Then I found an email from Alexandre Frade (the maintainer of XanMod) asking me to change some stuff in Cachy so it could be added to the XanMod kernel. I opened GitHub and saw 100 stars, a result of Michael Larabel's post on Phoronix (https://www.phoronix.com/scan.php?page=news_item&px=Cachy-Linux-Scheduler). Luckily, they were arguing over why I used YouTube instead of other tests. The post got about 63 comments, of which only 2 were really related to the Cachy scheduler. TBC
-
How did XanMod learn about Cachy?

A Turkish guy (he still refuses to tell me his name) living in Germany posted about Cachy on the XanMod forum with the Phoronix link. XanMod tested Cachy, went through the code, and liked the fact that Cachy is only a relatively small patch on top of CFS.

Why the loyalty to XanMod?

Cachy back then wasn't good enough for throughput; it was all about responsiveness. With XanMod's help, we achieved acceptable throughput (similar to CFS) without affecting the responsiveness Cachy provides. At that time, many other kernel maintainers were quite happy to just criticize Cachy without any attempt to at least give positive feedback. Alexandre did almost 50% of the total work to enhance Cachy. He adjusted the Cachy code and tested the changes. Not only that, he also suggested ideas from other schedulers that could fix particular issues we had. I would say he is a very good tester! At the same time, he takes risks (no success without risks). I remember one time he just shipped Cachy instead of CFS in the main kernel so people could test it and give feedback. That was a risk which could have affected XanMod's reputation. However, after many people had issues with Cachy, he brought CFS back and told me that at least now we knew what else to fix in Cachy. Such a smart risk-taker is ideal to maintain and lead a very popular and trusted custom Linux kernel. I have seen many… I hope all the best for XanMod 👍
-
The bad reputation of Cachy and the transition to CacULE

Renaming Cachy to CacULE was 99% a business-driven action. By the time Cachy reached its stable form, most people who had tested Cachy earlier already had a bad impression of it. They had tested it when it lacked CPU balancing, NUMA support, and SCHED_GROUP. After the hard work by XanMod and me, we finally had a good scheduler, but Cachy's reputation was still bad. The good version of Cachy stayed a couple of months without any changes.

At that time I was reading about some issues with the CFS balancer where it doesn't fully utilize the CPUs (this could be outdated and fixed by now). I tried to modify the CFS load balancer, but with no success. I then read some research on task load balancing for multi-processor systems. There are many interesting ideas in the literature. I like the idea of the global runqueue, where all tasks are placed in one runqueue protected by a global lock, and each CPU picks one task from the global runqueue after obtaining the lock. Pros of a global runqueue:
Cons:
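The global-runqueue design described above can be sketched in a few lines. This is only a user-space illustration of the concept (one shared queue, one lock, every CPU picks through it), not the actual GRQ patch; all names here are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

/* One shared list of runnable tasks, protected by a single global lock. */
struct grq_task {
    int pid;
    struct grq_task *next;
};

static struct grq_task *grq_head;
static pthread_mutex_t grq_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called when a task becomes runnable, from any CPU. */
static void grq_enqueue(struct grq_task *t)
{
    pthread_mutex_lock(&grq_lock);
    t->next = grq_head;
    grq_head = t;
    pthread_mutex_unlock(&grq_lock);
}

/* Each CPU picks its next task from the shared queue. The single lock is
 * both the pro (perfect balance: no CPU idles while work exists) and the
 * con (every pick on every CPU contends on the same lock). */
static struct grq_task *grq_pick_next(void)
{
    pthread_mutex_lock(&grq_lock);
    struct grq_task *t = grq_head;
    if (t)
        grq_head = t->next;
    pthread_mutex_unlock(&grq_lock);
    return t;
}
```

The crashes described below come exactly from the part this sketch glosses over: inside the real kernel, every place that used to take a per-CPU runqueue lock must be converted to the global lock, and missing even one spot leads to corruption that may take hours to surface.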
Anyway, I tried to make a global runqueue, which was called GRQ. On my machine it was super fast, but only until it crashed, since I had missed converting some local runqueue locking to the global lock in a few places in the scheduler. The scheduler code is huge, and whatever I did to convert the local runqueues to a global runqueue felt like nonsense: every time you compile the kernel after changing something, you are full of hope that these are the last changes and it is finally fixed, and then the new kernel runs for a random time anywhere between 1 minute and 5 hours! It really hurts when the system freezes after 5 hours of pride. I decided to…

While working on GRQ, I thought: what if I just make multiple runqueues with a simple balancer, just to test whether the crashes I had in GRQ came from something else I didn't know about? So I made a very basic balancer in a few hours (compared to GRQ, which was about 2 months of work, 12 hours a day). The new balancer worked without any crashes, but I didn't consider it the main work; it was just a test for the GRQ work. Luckily, I pushed a new branch with that simple balancer and called it rdb. I continued messing with GRQ without any success, until @raykzhao and @hf29h8sh321 suggested to me that the rdb branch somehow worked better than normal Cachy, though slightly worse than GRQ (#21). So I tested the performance of RDB again and noticed that it was not bad.

CacULE

While I was disappointed by the GRQ errors, and before realizing the abandoned rdb branch was good, I was reading the ULE research paper by its author Jeff Roberson (https://web.cs.ucdavis.edu/~roper/ecs150/ULE.pdf). I was looking at the task balancing mechanism in ULE. I also read the interactivity score (IS) section, which was quite interesting. I thought: what if I just change the HRRN in Cachy to use the interactivity score instead? I already had a compiled kernel, and all I needed was to edit fair.c, which would not cause a full recompilation.
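For context, here is a simplified sketch of ULE-style interactivity scoring as described in Roberson's paper: tasks that sleep more than they run score low (interactive), tasks that run more than they sleep score high (batch). This is an illustration of the idea, not the CacULE implementation, and the constants and names are assumptions.

```c
/* ULE-style interactivity score sketch: scores fall in [0, 100].
 * Low score  => interactive (sleeps more than it runs).
 * High score => CPU hog (runs more than it sleeps). */
#define INTERACT_MAX  100
#define INTERACT_HALF (INTERACT_MAX / 2)

static int interact_score(unsigned long run, unsigned long sleep)
{
    unsigned long div;

    if (run > sleep) {          /* CPU hog: map into the upper half */
        div = run / INTERACT_HALF ? run / INTERACT_HALF : 1;
        return INTERACT_HALF + (INTERACT_HALF - sleep / div);
    }
    if (sleep > run) {          /* mostly sleeping: map into the lower half */
        div = sleep / INTERACT_HALF ? sleep / INTERACT_HALF : 1;
        return run / div;
    }
    return INTERACT_HALF;       /* equal run and sleep time */
}
```

A task that slept 90% of the time lands near 0, a task that burned CPU 90% of the time lands near 100, and the scheduler can then favor low scorers, which is exactly the property that makes it a drop-in replacement for the HRRN ranking in Cachy's queue.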
After booting the new kernel with IS, I did a test on GNOME that somehow shows how responsive the scheduler is. The test was make -j16 + stress-ng on 4 CPUs while I kept opening applications and searching for…

As I mentioned before, we were thinking of renaming Cachy since the name was no longer applicable: there is no CPU cache consideration once Cachy has balancing support, and another issue was the bad experience people had had with Cachy. So we needed a new name. With the IS changes, I thought it would be a good idea to plug the ULE name into the scheduler's name, putting FreeBSD's excellent work to use on our side. I was thinking Cachy + ULE = CULE, which can be pronounced "cool", but then my left brain said no, it is better to have some weird and totally unheard-of name; let's pick CacULE. After renaming the GitHub repo, I remembered that I should have done a web search beforehand to make sure the name cacule wasn't taken. The first search result was a city in Brazil named Caculé. I said it's OK, it could be worse; I accepted it. Five minutes later, a Russian guy in the Telegram group said that cacule means sh*t in Russian, and not only that, it is the childish way of saying it. Another Russian guy confirmed, and the chat switched to Russian. At that time I was playing some chess to train my damn left brain.
-
I just wanted to explain what CacULE is exactly, plus a bit of the history of CacULE's development, starting from its old name Cachy, or maybe even before (I had an attempt called Popcorn several years ago).
This post will be written and edited gradually.
In short, if we imagine that the Linux kernel is a car, then the Completely Fair Scheduler (CFS) is the engine, CacULE is a hack on the throttle system (to make the engine more responsive), and RDB is a turbo.
As you can see, CacULE itself doesn't increase your car engine's horsepower; it just makes the engine respond faster when you push the gas pedal. I have noticed some people like to compare CacULE with other schedulers in performance tests as if CacULE were fully different from CFS, which is not correct. Most likely, CacULE will not make any big difference in performance compared to CFS.
I tried to make CacULE as small as possible while keeping all CFS features, which also helps it adapt to any future CFS changes without trouble. RDB, on the other hand, is a replacement for CFS's task load balancer. RDB is very hacky, and therefore it doesn't support CFS's FAIR_GROUP (or SCHED_GROUP). Not supporting some task-group features means some container tools such as Docker may not work properly (though some of my friends say Docker still works fine with RDB).
RDB is short for Response Driven Balancer. It enhances the responsiveness that CacULE provides and also increases performance over CFS. Why does it increase performance? Because RDB is a lightweight balancer that depends purely on tasks' response measurements to balance tasks among CPUs.
What makes RDB lightweight?
1- I dropped many task/group statistics updates that are used by the CFS load balancer.
2- Only one task is moved at a time (from a single CPU's perspective); there is no need to go through all tasks in the runqueues to pick the best task(s). RDB just picks the task at the head of the runqueue (since the head always holds the task with the highest Interactivity Score (IS)).
3- The runqueue lock is held for less time, since we move only one task and then we are done with balancing.
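The three points above can be sketched as a single balancing step. This is an illustrative sketch, not the actual RDB patch: it assumes each runqueue keeps its head as the highest-IS task (as described above), and the struct and function names are hypothetical.

```c
#include <stddef.h>

struct rdb_task {
    int is;                 /* interactivity score */
    struct rdb_task *next;
};

struct rdb_rq {
    struct rdb_task *head;  /* highest-IS runnable task */
    struct rdb_task *curr;  /* currently running task, may be NULL */
};

/* Move at most one task from src to dst: no scan over all tasks, and the
 * (implied) runqueue locks are held only for this single splice. */
static int rdb_try_pull(struct rdb_rq *dst, struct rdb_rq *src)
{
    struct rdb_task *cand = src->head;

    if (!cand)
        return 0;
    /* Pull only when the candidate would run sooner on dst than what dst
     * is currently running (higher IS wins in this sketch). */
    if (dst->curr && cand->is <= dst->curr->is)
        return 0;
    src->head = cand->next;
    cand->next = dst->head;
    dst->head = cand;
    return 1;
}
```

Because only the head is ever examined and at most one task moves per call, the cost per balancing round stays constant regardless of how many tasks sit in the queues.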
Why does RDB work as a performant balancer in the first place?
The core idea of RDB is that:
Example:
CPU0: (t0 10 running) (t1 8) (t4 1) (t5 1) (t6 1) (t7 1) (t8 1)
CPU1: (t2 5 running) (t3 1)
CPU1 will pull task `t1` from CPU0, since `t1` has an IS of 8, which is higher than `t2`'s; therefore `t1` will run much more fairly since its IS is higher. Notice that we don't need to pull any of `t4`-`t8`: even though CPU0 has more tasks than CPU1, moving those tasks won't help at the moment, since they will not run any sooner anyway (in terms of CacULE's strategy).

Almost all the recent benchmarks that others and I have done on RDB show that RDB increases performance. See one of the results that shows an extreme difference between RDB and others (https://forum.garudalinux.org/t/unigine-720p-testing-tkg-against-zen-kernel-on-ryzen-5-1600/9546/5). Thanks to binarydepth for his work. Of course, not all tests show as big a difference as the above test, but usually RDB is on top.
TBC