Flaky smoke tests and performance issues #1442
-
I would like feedback from @mtcherni95, @itaysk and @yanivagman on this topic, if possible. For the smoke tests case, @danielpacak and I can select a single signature and a single event (ptrace + anti_debugging_ptraceme.rego) and, by filtering the amount of events, make the smoke test very trustworthy (even when the github node is loaded). I'm more worried about the general usage, tbh.
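To make that concrete, the reduced smoke test would be shaped roughly like the sketch below. The flag spellings are assumptions (they vary across tracee versions); the point is only a single traced event (ptrace) feeding a single loaded signature (TRC-2 / anti_debugging_ptraceme.rego).

```bash
# sketch only: exact flag syntax is an assumption and differs between tracee versions
sudo ./dist/tracee-ebpf --output format:gob --trace event=ptrace \
  | ./dist/tracee-rules --input-tracee file:stdin --input-tracee format:gob --rules TRC-2
```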
-
A quick example to illustrate the reduced throughput in between tests (1) and (2). By using the "pipe viewer" tool (pv), I'm able to see the throughput in the pipe.
The speed tops at ~1.9 MiB/s in this test box.
The speed in between tracee-ebpf and tracee-rules tops at ~7 MiB/s. If we have a big enough circular buffer acting like a FIFO in between the …
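For reference, this is how pv can sit in the pipe to take those readings (invocations are illustrative; `[flags]` stands for whatever the test already passes to each binary):

```bash
# producer throughput alone (events discarded)
sudo ./dist/tracee-ebpf [flags] | pv > /dev/null

# throughput actually sustained while tracee-rules is consuming
sudo ./dist/tracee-ebpf [flags] | pv | ./dist/tracee-rules [flags]
```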
-
Alright, I played with this idea a little bit (still using pipe and not changing tracee)... I have created a small buffering program:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <glib.h>
#include <glib/gprintf.h>

#define chunk_size (64 * 1024 * 1024)

GAsyncQueue *queue;

/* reader thread: drain stdin as fast as possible into an in-memory queue */
void *readstdin(void *ptr) {
    int multi = 1000;
    gchar *entry;
    for (;;) {
        entry = g_malloc0(chunk_size);
        /* read one chunk; leave at least one trailing NUL so the writer can use %s */
        if (read(fileno(stdin), entry, chunk_size - 1) <= 0) {
            g_free(entry);
            break; /* EOF or read error */
        }
        g_async_queue_push(queue, entry);
        /* print queue depth every 1000 chunks */
        if ((multi % 1000) == 0) {
            fprintf(stderr, "buffer size: %d\n", g_async_queue_length(queue));
            fflush(stderr);
        }
        multi++;
    }
    return NULL;
}

/* writer thread: pop chunks from the queue and forward them to stdout */
void *writestdout(void *ptr) {
    gchar *entry;
    for (;;) {
        entry = g_async_queue_pop(queue);
        g_printf("%s", (gchar *) entry);
        fflush(stdout);
        g_free(entry);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t read_t, write_t;
    int ret = 0;

    queue = g_async_queue_new();

    ret |= pthread_create(&read_t, NULL, readstdin, NULL);
    ret |= pthread_create(&write_t, NULL, writestdout, NULL);
    if (ret != 0) {
        fprintf(stderr, "error creating threads\n");
        exit(1);
    }

    pthread_join(read_t, NULL);
    pthread_join(write_t, NULL);

    return 0;
}
```

I haven't lost eBPF events while caching (of course there is an OS limit for the cache size) and I haven't lost any TRC-2 signature detection during "bursts" of the test case. Of course, because of caching, the detection sometimes happens much later than the time the event actually happened.
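Something along these lines is enough to build it and wire it into the pipeline (the file name and the tracee flags are placeholders, not the exact ones used in the test):

```bash
# build the prototype against glib ("pipebuf.c" is just a placeholder name)
gcc -O2 -pthread pipebuf.c -o pipebuf $(pkg-config --cflags --libs glib-2.0)

# drop the buffer between the producer and the consumer
sudo ./dist/tracee-ebpf [flags] | ./pipebuf | ./dist/tracee-rules [flags]
```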
-
@mtcherni95 So, summarizing, with a big cache in between tracee-ebpf and tracee-rules: up to 8 MB/s of data is being produced with the selected events, and around 2 MB/s is being consumed with multiple signatures loaded. The delta is what defines the size of the buffer we need (for this particular case). Considering a sustained workload (of this test case) for X amount of seconds will define the buffer size needed not to lose detections (within OS limits as well).

Watermarks of this buffer usage might define event priority reductions (so the input throughput is reduced and the output can catch up if the buffer wasn't enough). We can have classes of events, and start reducing priorities per class when the buffer watermark reaches 50%, 60%, 70%, etc. Not sure this is the direction you are taking, just thinking about it a bit more.

Note: removing the OS file handling logic (and the data copying from user land to kernel and back) will also help the throughput, but the 4:1 production:consumption ratio will, likely, remain.
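To put rough numbers on it: with ~8 MB/s produced and ~2 MB/s consumed, the cache has to absorb about 6 MB for each second the burst is sustained, so a 60 second burst, for example, would already need on the order of 6 MB/s × 60 s ≈ 360 MB of buffer (illustrative math only, using the rates from this test box).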
-
Performance Observations (env: 5.15 kernel, 4 cpus, 4 GB)
@danielpacak and I investigated some flaky smoke tests today. Smoke tests started being VERY flaky after we made the docker daemon download and instantiate a container image right before the test execution, instead of pulling the image first and then initializing the container.
What happened was that the workload of the docker daemon, in the docker hub node, was enough to make a very simple test (TRC-2) fail significantly. I'm enumerating reproducers in multiple steps here, isolating the variables:
The test is the following:

- `tracee-ebpf` running
- `while true; do ps -ef; done` running in a 4 cpu box
- `strace ls` executed multiple times (wait a bit so the pipeline gets full)

The tests:

1. Executed: 5 `strace` commands. Got: 2 detections.
2. Executed: 5 `strace` commands. Got: 4 detections.
3. Only the `ptrace` event selected, keeping the same parallel workload (`while true; do ps -ef; done`). Executed: 10 `strace` commands. Got: 10 detections.
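A consolidated shell sketch of the steps above (the counts and pauses only illustrate what the tests describe):

```bash
# parallel workload keeping the 4 cpu box busy
while true; do ps -ef; done &

# trigger the TRC-2 (anti_debugging_ptraceme) detection repeatedly,
# waiting a bit in between so the pipeline gets full
for i in $(seq 1 10); do strace ls > /dev/null 2>&1; sleep 2; done
```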
Some conclusions:
- When loading all existing signatures (14, which is not much if we consider CNDR), we reduce the detection rate considerably (way more than if we select multiple events to be probed for a single signature), at least when the test is done in a small environment (4 cpus, 4 GB).
- The fact that the tracee-ebpf output is not consumed fast enough makes it hold the pipeline (because of the nature of the OS I/O handling), and then tracee-ebpf cannot consume the perf buffer fast enough. This makes the eBPF programs overwrite perf buffer events that were not yet consumed, causing event loss.
I do know that:
I'm wondering if we don't also need some sort of event caching in between tracee-ebpf and tracee-rules (not relying on the channel buffering only, when both are part of the same process, as it may exhaust very fast).
Don't we need something very fast to remove perf buffer pressure and allow tracee-rules to be slower if it needs to be? Eventually this cache would be drained during workload relief.