Phantom events #1450
NDStrahilevitz started this conversation in Development
Thanks a lot @NDStrahilevitz for the description. It helps a lot when we are remote. Let me see what I can find about this.
Background
Over the last few months, @mtcherni95 and I have been trying to find and measure performance bottlenecks in tracee. As part of that, we had to find a way to benchmark tracee reliably and consistently.
We've arrived at a solution where we simulate tracee in a single binary and run it in a container alongside a workload (which must also be a container).
We have the following settings for tracee (in addition to standard CNDR settings and signatures): to match the `ls` workload (explained soon), we trace only `ls` commands, and we run the container with `--cpuset-cpus=0 --cpus="0.1"`, limiting it to run on core no. 0 with up to 10% of its usage.
The workload we run is therefore simply a slightly more robust version of `while true; do ls; done`, which also counts the number of `ls` runs.
During our benchmarks, and even before reaching this method, we had a suspicion we were missing events over time that we could not measure. During our runs using the method I will present below, we came across a clear observation of this phenomenon of "lost lost events", or as we will call them, phantom events.
And now...
Some math
In our scenario, where we run this tracee simulator alongside the `ls` workload, there is a simple equation relating measured events (X), lost events (Y) and expected events (LS_COUNT * EVENTS_PER_LS = Z):

X + Y >= Z

If there are 0 lost events, then theoretically X = Z (we measure exactly the events related to the workload; all others are filtered out). Once lost events do appear, the lost-event count does not consider the "relevancy" of the lost events to the `--trace` settings, so irrelevant lost events are counted too and the sum can amount to more than the expected amount (Z).
The experiment
As we ran this benchmark multiple times, what happened over and over was a break of the established equation above; that is, we repeatedly observed that in fact Z > X + Y.
In all cases where we observed that, Y > 0; otherwise (if Y = 0), we observed Z = X.
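As a purely illustrative worked example of a broken run (these numbers are made up, not taken from our benchmark tables), suppose:

```latex
% Illustrative numbers only:
Z = \mathrm{LS\_COUNT} \times \mathrm{EVENTS\_PER\_LS} = 1000 \times 5 = 5000
\qquad
X + Y = 4800 + 150 = 4950 < 5000 = Z
```

The 50-event deficit shows up in neither the event count nor the lost-event count; those 50 would be phantom events.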
Below are the benchmark sets we ran.
Note that in this benchmark the first two rows were run on a different machine; however, the metrics remain fairly consistent (according to the last two rows).
The next two benchmarks were run on a non-dedicated AWS host:
An important thing to note about the next set is that for the first three runs we disabled the `tracee-rules` component, thus giving the `tracee-ebpf` component more processing time.
Introducing: Phantom Events
So, the equation was broken in almost all of the above benchmarks. Since there is obviously a case of lost events here (X < Z), we'd expect the count of lost events to at least complement X up to, or beyond, Z.
However, the lost-event count is itself missing lost events, which is how we knew we had observed what we suspected: the existence of phantom events.
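To make the bookkeeping concrete, a post-processing check along these lines would flag such a run (variable names and numbers are ours, purely illustrative):

```shell
#!/bin/sh
# Hypothetical sanity check over one benchmark run (illustrative numbers):
# X = events tracee emitted, Y = lost events it reported,
# Z = LS_COUNT * EVENTS_PER_LS = events the workload should have produced.
X=9700
Y=200
LS_COUNT=2000
EVENTS_PER_LS=5
Z=$((LS_COUNT * EVENTS_PER_LS))

# X + Y >= Z should always hold; a positive deficit means phantom events.
PHANTOM=$((Z - X - Y))
if [ "$PHANTOM" -gt 0 ]; then
  echo "phantom events: $PHANTOM"
else
  echo "equation holds (X + Y >= Z)"
fi
```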
How do we explain this?
Currently, we don't. We haven't yet found a satisfying explanation for this (we've tried to find some inner caching layer for example), which is why I've opened this discussion.