Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Record Metrics for Reminder #4831

Closed

Conversation

Vyom-Yadav
Copy link
Collaborator

Summary

Provide a brief overview of the changes and the issue being addressed.
Explain the rationale and any background necessary for understanding the changes.
List dependencies required by this change, if any.

Fixes #(related issue)

Change Type

Mark the type of change your PR introduces:

  • Bug fix (resolves an issue without affecting existing features)
  • Feature (adds new functionality without breaking changes)
  • Breaking change (may impact existing functionalities or require documentation updates)
  • Documentation (updates or additions to documentation)
  • Refactoring or test improvements (no bug fixes or new functionality)

Testing

Outline how the changes were tested, including steps to reproduce and any relevant configurations.
Attach screenshots if helpful.

Review Checklist:

  • Reviewed my own code for quality and clarity.
  • Added comments to complex or tricky code sections.
  • Updated any affected documentation.
  • Included tests that validate the fix or feature.
  • Checked that related changes are merged.

Signed-off-by: Vyom Yadav <jackhammervyom@gmail.com>
Copy link
Member

@evankanderson evankanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bunch of comments, mostly aiming to help simplify your life. 😁

Comment on lines +31 to +34
// For tracking average calculation
// TODO: consider persisting this to avoid reset on restart (maybe)
totalBatches int64
totalReminders int64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than storing these as int64s, why not use an Int64Counter? That will automatically record the total, but also that the metric is cumulative over time.

Comment on lines +25 to +26
// Current number of reminders in the batch
BatchSize metric.Int64Gauge
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use an Int64Histogram for this, which will also give you the Average metric you're tracking next (approximately, but with more-detailed frequency information, which is probably interesting).

Comment on lines +10 to +18
0, // immediate
10, // 10 seconds
20, // 20 seconds
40, // 40 seconds
80, // 1m 20s
160, // 2m 40s
320, // 5m 20s
640, // 10m 40s
1280, // 21m 20s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While powers of 2 are generally handy, I'd tend to align this more with human units, e.g.:

0, 10, 30, 60, 120, 300, 600, 900, 1800, 3600, 7200

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what frequency we'll run reminder at, but 2 hours seems reasonably pessimistic for now -- it's really annoying when you run into problems and your frequency distribution has always been "it's not so good".

Given that our expected value for latency is 1/2 scan interval, I could even be convinced that < 1 minute is not particularly interesting (which could trim a couple buckets if you're worried about it -- I'm not!).

Comment on lines +75 to +77
// Update running average
m.totalBatches++
m.totalReminders += size
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should aim to get the metrics code to do this for you automatically -- in particular, if we end up with two reminder processes running, getting a global average from reminder_avg_batch_size is impossible without knowing the numerator and denominator for each process. Exporting them as two cumulative totals allows aggregation across multiple processes.

remMetrics.SendDelay.Record(ctx, (time.Now().Sub(reminderLastSent.Time) - r.cfg.RecurrenceConfig.MinElapsed).Seconds())
} else {
// TODO: Should the send delay be zero if the reminder has never been sent?
remMetrics.SendDelay.Record(ctx, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will produce a bunch of +Inf values in the histogram bucket, which will disturb any sort of Nth percentile.

It's probably better to either export nothing here, or export it under some different counter metric ("new reminders").

otel.SetMeterProvider(mp)

// Create metrics
meter := mp.Meter("reminder-service")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
meter := mp.Meter("reminder-service")
meter := mp.Meter("reminder")

}

// Shutdown gracefully shuts down the metrics server
func (p *Provider) Shutdown(ctx context.Context) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you make this method private, then cancelling the context to Start or NewProvider becomes the only way to reach this code, which simplifies your lifecycle over having multiple callers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(among other things, you can guarantee that p is not nil, because you don't call the shutdown method unless p is non-nil.)

Comment on lines +130 to +132
func (p *Provider) Metrics() *Metrics {
return p.metrics
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main suggestion is to remove the Provider object in favor of a static setup function with cancellation, but if you do keep it, I'd suggesting making Metrics exported, rather than needing to add an accessor function.

@@ -143,6 +163,11 @@ func (r *reminder) sendReminders(ctx context.Context) error {
return fmt.Errorf("error creating reminder messages: %w", err)
}

remMetrics := r.metricsProvider.Metrics()
if remMetrics != nil {
remMetrics.RecordBatch(ctx, int64(len(repos)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you make batches a histogram, this can just be a Record() on that histogram.

Comment on lines +82 to +84
func (m *Metrics) RecordSendDelay(ctx context.Context, delaySeconds float64) {
m.SendDelay.Record(ctx, delaySeconds)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this function isn't doing anything -- you're already exposing SendDelay, and there's not much semantic benefit to foo.RecordSendDelay(...) over foo.SendDelay.Record(...). That also saves you from needing to write a comment.

If you transform BatchSize into a histogram as well, then you might be able to get rid of all the methods on Metrics, which would be nice.

Copy link
Contributor

This PR needs additional information before we can continue. It is now marked as stale because it has been open for 30 days with no activity. Please provide the necessary details to continue or it will be closed in 30 days.

@github-actions github-actions bot added the Stale label Nov 28, 2024
@github-actions github-actions bot closed this Dec 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants