Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix monitoring metrics for individual collectors #389

Conversation

ananya-mallik-ps
Copy link
Contributor

@ananya-mallik-ps ananya-mallik-ps commented Nov 19, 2024

Fix monitoring metrics

Problem

Currently, each time a metrics endpoint is hit (with collect parameter e.g., /metrics?collect=pubsub.googleapis.com/topic), a new collector is created. This causes the monitoring metrics (stackdriver_monitoring_api_calls_total, stackdriver_monitoring_scrapes_total etc.) to reset on each scrape.

Solution

This PR:

  • Add collector caching in the handler using a map
  • Key the cache using project ID + filters combination
  • Reuse collectors to maintain metric state across scrapes

Testing

  • Tested build locally, with one of our GCP project for different collector(s)
  • Deployed in one of our cluster and tested with multiple collector (~ 110+ different metrics-type-prefixes)
  • Verified metrics accumulate properly across scrapes for each collector
  • Memory usage remains stable over time and is similar to previous version

@ananya-mallik-ps ananya-mallik-ps marked this pull request as ready for review November 19, 2024 18:13
@ananya-mallik-ps
Copy link
Contributor Author

@SuperQ @kgeckhart

Copy link
Contributor

@kgeckhart kgeckhart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is awesome and allows aggregate deltas + descriptor cache to be compatible with collect time filters (resolving #315).

Since both of those features prune their storage via collector usage we should probably add a collector cache TTL that at least matches the largest TTL if either feature is in use. This way if a scrape config changes the collect filters the exporter will eventually clean itself up without needing to be restarted.

stackdriver_exporter.go Outdated Show resolved Hide resolved
@ananya-mallik-ps ananya-mallik-ps force-pushed the fix/collector-metrics-reset branch from 028c6cb to f53ae4e Compare December 3, 2024 11:50
Signed-off-by: Ananya Kumar Mallik <ananya.mallik@paymentsense.com>
@ananya-mallik-ps ananya-mallik-ps force-pushed the fix/collector-metrics-reset branch from c9ce282 to 715e0fe Compare December 3, 2024 11:59
@ananya-mallik-ps
Copy link
Contributor Author

ananya-mallik-ps commented Dec 3, 2024

Since both of those features prune their storage via collector usage we should probably add a collector cache TTL that at least matches the largest TTL if either feature is in use. This way if a scrape config changes the collect filters the exporter will eventually clean itself up without needing to be restarted.

@kgeckhart Added default collector cache TTL to be 2 hours (open to suggestion) and overriding it with largest of all 3 TTLs.

@ananya-mallik-ps ananya-mallik-ps force-pushed the fix/collector-metrics-reset branch from 715e0fe to e39be65 Compare December 3, 2024 13:39
Signed-off-by: Ananya Kumar Mallik <ananya.mallik@paymentsense.com>
Signed-off-by: Ananya Kumar Mallik <ananya.mallik@paymentsense.com>
Signed-off-by: Ananya Kumar Mallik <ananya.mallik@paymentsense.com>
@ananya-mallik-ps ananya-mallik-ps force-pushed the fix/collector-metrics-reset branch from e620717 to d9499df Compare December 3, 2024 13:59
Copy link
Contributor

@kgeckhart kgeckhart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I think I didn't fully explain the purpose of the TTL and how it relates to the descriptor cache/aggregate deltas.

There's three use cases,

  1. A Collector is consistently being used -> safe to keep forever
  2. A Collector is no longer being used with the descriptor cache and/or aggregate deltas enabled -> Remove after greatest TTL between the two features
    • Both features add memory overhead through caching but prune themselves when the collector is in use
    • In the case of aggregate deltas, this could be a very large amount of memory because every metric is kept around to enable the delta aggregations
    • Removing the collector after the TTL will drop these caches that would otherwise be left around until the exporter is restarted
  3. A Collector is no longer being used and descriptor cache/aggregate deltas is disabled -> Remove after the default 2 hour TTL (overhead of these collectors should negligible)

I left some more direct comments for the implementation.

collectors/cache.go Outdated Show resolved Hide resolved
collectors/cache.go Outdated Show resolved Hide resolved
@ananya-mallik-ps
Copy link
Contributor Author

@kgeckhart : Added suggested changes for descriptor cache/aggregate deltas use case.

Signed-off-by: Ananya Kumar Mallik <ananya.mallik@paymentsense.com>
@ananya-mallik-ps ananya-mallik-ps force-pushed the fix/collector-metrics-reset branch from 8a786a7 to 172634e Compare December 6, 2024 15:38
@kgeckhart kgeckhart merged commit 6a15a74 into prometheus-community:master Dec 12, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants