
The event pipeline needs a way to signal to push based inputs that they should backpressure incoming requests #41892

Open
cmacknz opened this issue Dec 4, 2024 · 2 comments
Labels
Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@cmacknz
Member

cmacknz commented Dec 4, 2024

Calls to the Beat event pipeline's Publish() method will block if the internal queue is full. This provides a way to propagate backpressure from the output back to pull-based inputs like filestream: a pull-based input will block on the call to Publish and stop reading additional data until there is space available in the internal queue.

In the case of push-based inputs, like the http_endpoint input that accepts HTTP requests from clients, this approach does not work. To make a call to Publish() the input must be holding an event in memory. For the http_endpoint input, there could be an unbounded number of concurrent HTTP requests that have deserialized the request body and are holding it in memory while blocked on a call to Publish() because the internal queue is full.

The internal queue being full is only one variant of this problem. The lack of a concurrency limit in a push-based input means the memory usage is no longer bounded by the size of the internal queue, because each request handler holds an event in memory. The queue could have space for 100 more events and a client could open 500 concurrent requests.

Rather than require inputs to implement arbitrary concurrency limits, the Beat event pipeline should provide a way for push-based inputs to detect that they should backpressure the source before doing any work with the data they need to accept. For the http_endpoint input, this means a way to know that it should immediately return a 429 or 503 error to clients, before processing the request body at all.

@cmacknz cmacknz added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Dec 4, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@leehinman
Contributor

One concern I have is when the input receives input that expands to several orders of magnitude more events than the queue size. A good example would be something like an AWS VPC flow log, where a single gzipped log file could contain 100k events. In a situation like that, the single flow log would keep the queue full and signaling for back pressure for hundreds of round trips to Elasticsearch. In that scenario, the input could have signaled multiple times to the "pusher" to back off. And with geometric back-off you could have large holes, where the pusher is waiting for its back-off to expire before trying again, and the queue is empty. This would result in an EPS (events per second) that is lower than what it could really be doing.

We need to make sure we have some way to "tune" ourselves out of a problem like that.
