Grafana: improved aspnetcore.json #7021
base: main
Conversation
Detailed my changes as thoroughly as I could
@@ -198,7 +198,7 @@
        "uid": "${DS_PROMETHEUS}"
      },
      "editorMode": "code",
-     "expr": "histogram_quantile(0.50, sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval])) by (le))",
+     "expr": "histogram_quantile(0.50, sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (le))",
You're going to see this change on every data plot on every chart:

`{job=\"$job\", instance=\"$instance\"}`

to

`{job=~\"$job\", instance=~\"$instance\"}`

The `=~` being the key change here - this allows Grafana to expand the variables to include all of the selected `job` / `instance` values. I'm not going to comment on every instance of this change because it's all the same, but that's what this is.
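To make the mechanics concrete: with a multi-value variable, Grafana substitutes the selection as a regex alternation, which only the `=~` matcher can handle. A minimal sketch with hypothetical job names (not taken from this dashboard):

```promql
# Two jobs selected in the Job dropdown expand to roughly "frontend|backend",
# so the regex matcher picks up the series for both applications:
sum(rate(http_server_request_duration_seconds_bucket{job=~"frontend|backend"}[$__rate_interval])) by (le)

# With the old exact matcher, the same substitution would look for a single
# literal label value called "frontend|backend" and return no data.
```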
So this chart is the 99% latency chart, and in addition to changing it to support multi-select I've also modified it to use the `$__range` value in Grafana, which corresponds to the trailing time window selected in the dashboard, versus the `$__rate_interval` variable, which is just 4x the Prometheus scraping interval: https://grafana.com/blog/2020/09/28/new-in-grafana-7.2-__rate_interval-for-prometheus-rate-queries-that-just-work/

It's useful for rate queries, but that's not what we're measuring here - we're instead trying to determine "what was the 99% latency over X period of time?" - the `$__range` value is better for that, and I've made that change in several charts.
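For reference, here's roughly what the range-scoped query ends up looking like, following the same shape as the base query above (the exact expression committed in the PR may differ):

```promql
# p99 latency computed over the whole dashboard time range ($__range)
# instead of a rolling window of ~4x the scrape interval ($__rate_interval):
histogram_quantile(
  0.99,
  sum(rate(http_server_request_duration_seconds_bucket{job=~"$job", instance=~"$instance"}[$__range])) by (le)
)
```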
Let's do a before and after comparison for this chart specifically: same app and time-range, but only a single instance:
Before
This PR
This is probably mostly a taste thing, but what I appreciate about the latter chart is the following:
- It's possible to "scope in" using cursor selection in Grafana if I want to look at a specific time period on the chart. With the original chart I get "no data" in many scenarios.
- This chart design scales down to smaller time-scales for applications that have less request data - this application does ~1m requests per day and I still have zero data using the default chart design unless I zoom out to a 24 hour time window or so.
The only drawback of this second design is that it's harder to see really large outliers - that's easier to see with the original chart design. This design is averaging rates over a longer period of time, which is what makes it more reliable at showing data over smaller time intervals - you can still see where spikes occur, such as this instance here:
But it's not nearly as pronounced. Happy to take feedback on what's more appropriate here, but my objective was making sure the chart works correctly across a wider range of traffic workloads / time ranges.
@@ -424,7 +425,7 @@
        "uid": "${DS_PROMETHEUS}"
      },
      "editorMode": "code",
-     "expr": "sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\", http_response_status_code=~\"4..|5..\"}[$__rate_interval]) or vector(0)) / sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval]))",
This is the error rate chart, which had some of the same problems that the Requests Duration chart did:
- Not showing data over many intervals
- The data it showed didn't always make a lot of sense.
Before
I actually don't know what to make of those error rates - 0% last and 97.5% max? What does that mean?
This PR
We've changed the error rate chart to use the `$__range`, and now the story it's telling is much clearer:
I have a roughly 70% 400-level error rate over the past 24 hours - and that makes sense with what I know about this application: it runs a private NuGet package endpoint and we get a lot of 404s from clients looking for packages that this server doesn't host, as a function of how NuGet clients are designed.
If I scope in to the past 6 hours, which mostly covers Saturday when our customers aren't working:
That 404 rate scales down, with the rest of our traffic, to roughly 11%. It's now easier to tell that by looking at the charts than before.
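To make that concrete, here is a sketch of the range-scoped error-rate expression, following the same shape as the original query shown above (the committed expression may differ slightly):

```promql
# Fraction of requests in the selected time range that returned a 4xx/5xx status:
sum(rate(http_server_request_duration_seconds_bucket{job=~"$job", instance=~"$instance", http_response_status_code=~"4..|5.."}[$__range]) or vector(0))
/
sum(rate(http_server_request_duration_seconds_bucket{job=~"$job", instance=~"$instance"}[$__range]))
```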
"expr": "sum(kestrel_active_connections{job=\"$job\", instance=\"$instance\"})", | ||
"legendFormat": "__auto", | ||
"expr": "sum(kestrel_active_connections{job=~\"$job\", instance=~\"$instance\"})", | ||
"legendFormat": "active connections", |
Changed the legend to use a manual entry because otherwise, now that we can aggregate Kestrel data across multiple applications, you end up with a very gross and verbose-looking auto-generated tooltip here on the legend.
"expr": "sum(http_server_active_requests{job=\"$job\", instance=\"$instance\"})", | ||
"legendFormat": "__auto", | ||
"expr": "sum(http_server_active_requests{job=~\"$job\", instance=~\"$instance\"})", | ||
"legendFormat": "active requests", |
Same deal as with the kestrel connections - manually renamed the legend here.
I toyed with the idea of breaking out active connections by service / instance and that's easily doable, but opted for something simpler - the user can get that same data by changing the template variables on the selector.
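For anyone who does want that breakdown, here is a sketch of what the per-application variant could look like - illustrative only, not something this PR ships:

```promql
# One series per selected application/instance instead of a single total;
# pairing this with a legendFormat of "{{job}} - {{instance}}" keeps the
# legend readable.
sum(kestrel_active_connections{job=~"$job", instance=~"$instance"}) by (job, instance)
```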
"label": "Job", | ||
"multi": false, | ||
"multi": true, |
Enables multi-select to be used, which is now feasible since all of the PromQL queries have been updated to support it.
@@ -1304,16 +1316,17 @@
        "type": "query"
      },
      {
+       "allValue": ".*",
This is the `instance` template variable - it has all of the same changes as the `job` template variable.
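Put together, the two template variables end up looking roughly like this (only `multi` and `allValue` are visible in the hunks above; the remaining fields are standard Grafana dashboard JSON shown for context, not copied from the PR):

```json
{
  "name": "instance",
  "label": "Instance",
  "type": "query",
  "multi": true,
  "includeAll": true,
  "allValue": ".*"
}
```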
@@ -1350,6 +1364,6 @@
      "timezone": "",
      "title": "ASP.NET Core",
      "uid": "KdDACDp4z",
-     "version": 1,
+     "version": 2,
Need to bump the chart version if this gets pushed to Grafana Cloud.
Great work. I think the interval and refresh values should be reset back (I think small values are good for immediately after someone imports the chart to verify that it works), but otherwise the changes all look good. There is also the endpoint dashboard. It has most of the same set of issues and the changes here should be applied to it. You're welcome to do it, or I can copy the changes to it in a follow-up PR.
Done!
I hadn't even seen that one, but I'll take a look at it - ok to do that in a separate PR?
I tried it out in the metrics samples app - dotnet/aspire-samples#638. I got some results that don't look right. Steps when trying out the metrics sample were something like:
What I see:
I played around with the dashboard queries.
Counts are still a little off. For example: `increase` + `floor` = 260
I think for the numbers, `ceil` might need to be moved to the outside of the query, rather than applied to the components being summed.
Sorry, I don't know what this means. Is `ceil` + `increase` the right approach if it creates inaccurate numbers?
Turns out that approach didn't work. I also don't think these numbers are "inaccurate", either - there may be a loss of precision because the values are now calculated over a time range, versus just showing whatever the current cumulative value of the counter is. The trade-off there is that if you changed the time range picker, the counter value would never change - which makes that value not super useful (all it can tell you is how many HTTP requests there have been since this counter was last reset.) By scoping that value to a time range with `increase`, the number actually responds to the time picker.
I did some research and there doesn't seem to be a good way to get an exact count over a range. I guess we have to live with it. I don't think it is too bad because these counters are increased by 1 at a time - if I understand things correctly, we'd lose a few requests at boundary values, so totals won't be too far off. On the changes to the charts, they look subjective rather than strictly better. I'll test some more to see whether they're a good improvement.
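To ground the counting discussion in actual expressions, here is a sketch of the shapes being debated - the `_count` series belongs to the same histogram used elsewhere in this dashboard, and none of these lines are copied verbatim from the PR:

```promql
# Raw cumulative counter: exact, but it ignores the time picker entirely -
# it only says "requests since the counter was last reset".
sum(http_server_request_duration_seconds_count{job=~"$job", instance=~"$instance"})

# Scoped to the dashboard range: responds to the time picker, but increase()
# extrapolates across scrape boundaries, so it's an estimate, not an exact count.
sum(increase(http_server_request_duration_seconds_count{job=~"$job", instance=~"$instance"}[$__range]))

# "Moving ceil to the outside": rounding the summed result once caps the
# rounding error at roughly one request, whereas ceil-ing each series before
# summing can overshoot by up to one request per series.
ceil(sum(increase(http_server_request_duration_seconds_count{job=~"$job", instance=~"$instance"}[$__range])))
```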
I think the time series graphs should keep using `$__rate_interval`. The output of the graphs makes more sense with the interval: in your screenshots the graphs look messy, but that's because you're looking at 24 hours of data. It seems like messy output on these kinds of graphs is expected until you zoom in.
So for the error rates and request duration - change those back to `rate_interval`?
Yes, in the time series graphs.
Description
I'm going to do a self-review with some screenshots because many of the changes made to the Grafana dashboard here can't be appreciated without the data being visualized, but this PR contains several quality of life improvements for the default Grafana dashboards for ASP.NET Core + Prometheus, such as:
- An `All` option for both `job` and `instance`, so it's easier to aggregate metrics by application / across several applications.