Flaky smoke tests and performance issues #1442
-
I would like feedback from @mtcherni95, @itaysk and @yanivagman on this topic, if possible. For the smoke tests case, @danielpacak and I can select a single signature and a single event (ptrace + anti_debugging_ptraceme.rego) and, by filtering the amount of events, make the smoke test very trustworthy (even when the github node is loaded). I'm more worried about the general usage, tbh.
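To make that concrete, the reduced smoke test would be shaped roughly like the sketch below. The flag spellings are assumptions (they vary across tracee versions); the point is only a single traced event (ptrace) feeding a single loaded signature (TRC-2 / anti_debugging_ptraceme.rego).

```bash
# sketch only: exact flag syntax is an assumption and differs between tracee versions
sudo ./dist/tracee-ebpf --output format:gob --trace event=ptrace \
  | ./dist/tracee-rules --input-tracee file:stdin --input-tracee format:gob --rules TRC-2
```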
-
A quick example to illustrate the reduced throughput in between tests (1) and (2). By using the "pipe viewer" tool (pv), I'm able to see the throughput in the pipe.
The speed tops at ~1.9 MiB/s in this test box.
The speed in between tracee-ebpf and tracee-rules tops at ~7 MiB/s. If we have a big enough circular buffer acting like a FIFO in between the …
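For reference, this is how pv can sit in the pipe to take those readings (invocations are illustrative; `[flags]` stands for whatever the test already passes to each binary):

```bash
# producer throughput alone (events discarded)
sudo ./dist/tracee-ebpf [flags] | pv > /dev/null

# throughput actually sustained while tracee-rules is consuming
sudo ./dist/tracee-ebpf [flags] | pv | ./dist/tracee-rules [flags]
```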
-
Alright, I played with this idea a little bit (still using pipe and not changing tracee)... I have created a small buffering program:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <glib.h>
#include <glib/gprintf.h>

#define chunk_size (64 * 1024 * 1024)

GAsyncQueue *queue;

/* reader thread: drain stdin as fast as possible into an in-memory queue */
void *readstdin(void *ptr) {
    int multi = 1000;
    gchar *entry;
    for (;;) {
        entry = g_malloc0(chunk_size);
        /* read one chunk; leave at least one trailing NUL so the writer can use %s */
        if (read(fileno(stdin), entry, chunk_size - 1) <= 0) {
            g_free(entry);
            break; /* EOF or read error */
        }
        g_async_queue_push(queue, entry);
        /* print queue depth every 1000 chunks */
        if ((multi % 1000) == 0) {
            fprintf(stderr, "buffer size: %d\n", g_async_queue_length(queue));
            fflush(stderr);
        }
        multi++;
    }
    return NULL;
}

/* writer thread: pop chunks from the queue and forward them to stdout */
void *writestdout(void *ptr) {
    gchar *entry;
    for (;;) {
        entry = g_async_queue_pop(queue);
        g_printf("%s", (gchar *) entry);
        fflush(stdout);
        g_free(entry);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t read_t, write_t;
    int ret = 0;

    queue = g_async_queue_new();

    ret |= pthread_create(&read_t, NULL, readstdin, NULL);
    ret |= pthread_create(&write_t, NULL, writestdout, NULL);
    if (ret != 0) {
        fprintf(stderr, "error creating threads\n");
        exit(1);
    }

    pthread_join(read_t, NULL);
    pthread_join(write_t, NULL);

    return 0;
}
```

I haven't lost eBPF events while caching (of course there is an OS limit for the cache size) and I haven't lost any TRC-2 signature detection during "bursts" of the test case. Of course, because of caching, the detection sometimes happens much later than the time the event actually happened.
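Something along these lines is enough to build it and wire it into the pipeline (the file name and the tracee flags are placeholders, not the exact ones used in the test):

```bash
# build the prototype against glib ("pipebuf.c" is just a placeholder name)
gcc -O2 -pthread pipebuf.c -o pipebuf $(pkg-config --cflags --libs glib-2.0)

# drop the buffer between the producer and the consumer
sudo ./dist/tracee-ebpf [flags] | ./pipebuf | ./dist/tracee-rules [flags]
```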
-
@mtcherni95 So, summarizing, with a big cache in between tracee-ebpf and tracee-rules: up to 8 MB/s of data is being produced with the selected events, and around 2 MB/s is being consumed with multiple signatures loaded. The delta is what defines the size of the buffer we need (for this particular case). Considering a sustained workload (of this test case) for X amount of seconds will define the buffer size needed not to lose detections (within OS limits as well).

Watermarks of this buffer usage might define event priority reductions (so the input throughput is reduced and the output can catch up if the buffer wasn't enough). We can have classes of events, and start reducing priorities per class when the buffer watermark reaches 50%, 60%, 70%, etc. Not sure this is the direction you are taking, just thinking about it a bit more.

Note: removing the OS file handling logic (and the data copying from user land to kernel and back) will also help the throughput, but the 4:1 production:consumption ratio will, likely, remain.
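To put rough numbers on it: with ~8 MB/s produced and ~2 MB/s consumed, the cache has to absorb about 6 MB for each second the burst is sustained, so a 60 second burst, for example, would already need on the order of 6 MB/s × 60 s ≈ 360 MB of buffer (illustrative math only, using the rates from this test box).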
-
Performance Observations (env: 5.15 kernel, 4 cpus, 4 GB)
@danielpacak and I investigated some flaky smoke tests today. Smoke tests started being VERY flaky after we made the docker daemon download and instantiate a container image right before the test execution, instead of pulling the image first and then initializing the container.
What happened was that the workload of the docker daemon, in the docker hub node, was enough to make a very simple test (TRC-2) fail significantly. I'm enumerating reproducers in multiple steps here, isolating the variables:
The test is the following:

- `tracee-ebpf` running
- `while true; do ps -ef; done` running in a 4 cpu box
- `strace ls` executed multiple times (wait a bit so the pipeline gets full)

The tests:

1. Executed: 5 `strace` commands. Got: 2 detections.
2. Executed: 5 `strace` commands. Got: 4 detections.
3. Only the `ptrace` event selected, keeping the same parallel workload (`while true; do ps -ef; done`). Executed: 10 `strace` commands. Got: 10 detections.
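A consolidated shell sketch of the steps above (the counts and pauses only illustrate what the tests describe):

```bash
# parallel workload keeping the 4 cpu box busy
while true; do ps -ef; done &

# trigger the TRC-2 (anti_debugging_ptraceme) detection repeatedly,
# waiting a bit in between so the pipeline gets full
for i in $(seq 1 10); do strace ls > /dev/null 2>&1; sleep 2; done
```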
Some conclusions:
- When loading all existing signatures (14, which is not much if we consider CNDR), we reduce the detection rate considerably (way more than if we select multiple events to be probed for a single signature), at least when the test is done in a small environment (4 cpus, 4 GB).
- The fact that the tracee-ebpf output is not consumed fast enough makes it hold the pipeline (because of the nature of the OS I/O handling), and then tracee-ebpf cannot consume the perf buffer fast enough. This makes the eBPF programs overwrite perf buffer events that were not yet consumed, causing event loss.
I do know that:
I'm wondering if we don't also need some sort of event caching in between tracee-ebpf and tracee-rules (not relying on the channel buffering only, when both are part of the same process, as it may exhaust very fast).
Don't we need something very fast to remove perf buffer pressure and allow tracee-rules to be slower if it needs to be? Eventually this cache would be drained during workload relief.