Phantom events #1450
NDStrahilevitz started this conversation in Development
Thanks a lot @NDStrahilevitz for the description. It helps a lot when we are remote. Let me see what I can find about this.
Background
Over the last few months, @mtcherni95 and I have been trying to find and measure performance bottlenecks in tracee. As part of that, we had to find a way to benchmark tracee reliably and consistently.
We've arrived at a solution where we simulate tracee in a single binary and run it in a container alongside a workload (which must also be a container).
We have the following settings for tracee (in addition to standard CNDR settings and signatures): to match the `ls` workload (explained soon), we trace only `ls` commands, and we run the container with `--cpuset-cpus=0 --cpus="0.1"`, limiting it to run on core no. 0 with up to 10% of its usage.
The workload we run is therefore simply a slightly more robust version of `while true; do ls; done`, which also counts the number of `ls` runs.
During our benchmarks, and even before reaching this method, we had a suspicion we were missing events over time that we could not measure. During our runs using the method I will present below, we came across a clear observation of this phenomenon of "lost lost events", or as we will call them, phantom events.
And now...
Some math
In our scenario, where we run this tracee simulator alongside the `ls` workload, there is a simple equation relating measured events (X), lost events (Y) and expected events (LS_COUNT * EVENTS_PER_LS = Z):

X + Y >= Z

If there are 0 lost events, then theoretically X = Z (we measure exactly the events related to the workload; all others are filtered out). Once lost events do appear, the lost-event count does not consider the "relevancy" of the lost events to the `--trace` settings, so irrelevant lost events are counted too and the sum can amount to more than the expected amount (Z).
The experiment
As we ran this benchmark multiple times, what happened over and over was a break of the established equation above; that is, we repeatedly observed that in fact Z > X + Y.
In all cases where we observed that, Y > 0; otherwise (if Y = 0), we observed Z = X.
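As a purely illustrative worked example of a broken run (these numbers are made up, not taken from our benchmark tables), suppose:

```latex
% Illustrative numbers only:
Z = \mathrm{LS\_COUNT} \times \mathrm{EVENTS\_PER\_LS} = 1000 \times 5 = 5000
\qquad
X + Y = 4800 + 150 = 4950 < 5000 = Z
```

The 50-event deficit shows up in neither the event count nor the lost-event count; those 50 would be phantom events.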
Below are the benchmark sets we ran.
Note that in this benchmark the first two rows were run on a different machine; however, the metrics remain fairly consistent (according to the last two rows).
The next two benchmarks were run on a non-dedicated AWS host:
An important thing to note about the next set is that for the first three runs we disabled the `tracee-rules` component, thus giving the `tracee-ebpf` component more processing time.
Introducing: Phantom Events
So, the equation was broken in almost all of the above benchmarks. Since there is obviously a case of lost events here (X < Z), we'd expect the count of lost events to at least complement X up to, or beyond, Z.
However, the lost-event count is itself missing lost events, which is how we knew we had observed what we suspected: the existence of phantom events.
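To make the bookkeeping concrete, a post-processing check along these lines would flag such a run (variable names and numbers are ours, purely illustrative):

```shell
#!/bin/sh
# Hypothetical sanity check over one benchmark run (illustrative numbers):
# X = events tracee emitted, Y = lost events it reported,
# Z = LS_COUNT * EVENTS_PER_LS = events the workload should have produced.
X=9700
Y=200
LS_COUNT=2000
EVENTS_PER_LS=5
Z=$((LS_COUNT * EVENTS_PER_LS))

# X + Y >= Z should always hold; a positive deficit means phantom events.
PHANTOM=$((Z - X - Y))
if [ "$PHANTOM" -gt 0 ]; then
  echo "phantom events: $PHANTOM"
else
  echo "equation holds (X + Y >= Z)"
fi
```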
How do we explain this?
Currently, we don't. We haven't yet found a satisfying explanation for this (we've tried to find some inner caching layer for example), which is why I've opened this discussion.