
Grafana: improved aspnetcore.json #7021

Open · wants to merge 9 commits into main

Conversation

@Aaronontheweb commented Jan 4, 2025

Description

I'm going to do a self-review with some screenshots, since many of the changes made to the Grafana dashboard here can't be appreciated without the data being visualized. This PR contains several quality-of-life improvements to the default Grafana dashboards for ASP.NET Core + Prometheus, such as:

  1. Rather than displaying absolute raw counter totals, we display the delta between the beginning and end of the selected time range.
  2. Include an "All" option for both job and instance so it's easier to aggregate metrics by application or across several applications.
  3. Change several metrics to use the dashboard's time range, rather than the Grafana refresh interval, for calculations, since the former is usually what people care about.

Checklist

  • Is this feature complete?
    • Yes. Ready to ship.
    • No. Follow-up changes expected.
  • Are you including unit tests for the changes and scenario tests if relevant?
    • Yes
    • No
  • Did you add public API?
    • Yes
      • If yes, did you have an API Review for it?
        • Yes
        • No
      • Did you add <remarks /> and <code /> elements on your triple slash comments?
        • Yes
        • No
    • No
  • Does the change make any security assumptions or guarantees?
    • Yes
      • If yes, have you done a threat model and had a security review?
        • Yes
        • No
    • No
  • Does the change require an update in our Aspire docs?

@dotnet-policy-service bot added the community-contribution label Jan 4, 2025
@Aaronontheweb (Author) left a comment


Detailed my changes as thoroughly as I could

@@ -198,7 +198,7 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "histogram_quantile(0.50, sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval])) by (le))",
"expr": "histogram_quantile(0.50, sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (le))",
Author

You're going to see this change on every data plot on every chart:

{job=\"$job\", instance=\"$instance\"}

to

{job=~\"$job\", instance=~\"$instance\"}

The =~ matcher is the key change here: it allows Grafana to expand the variables to include all of the selected job / instance values. I'm not going to comment on every instance of this change because it's identical everywhere, but that's what this is.
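For example, with two instances of an app selected, Grafana expands $instance into a regex alternation, which only the =~ matcher can match (label values below are made up for illustration):

# single-select, exact matcher - only ever matches one literal value
http_server_active_requests{job="myapp", instance="myapp-0:8080"}

# multi-select / All, regex matcher - $instance expands to "myapp-0:8080|myapp-1:8080"
http_server_active_requests{job=~"myapp", instance=~"myapp-0:8080|myapp-1:8080"}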

Author

So this chart is the 99% latency chart, and in addition to changing it to support multi-select I've also modified it to use the $__range value in Grafana, which corresponds to the trailing time window, versus the $__rate_interval variable, which is at least 4x the Prometheus scrape interval: https://grafana.com/blog/2020/09/28/new-in-grafana-7.2-__rate_interval-for-prometheus-rate-queries-that-just-work/

It's useful for rate queries, but that's not what we're measuring here - we're instead trying to determine "what was the 99th-percentile latency over X period of time?" - the $__range value is better for that, and I've made that change in several charts.
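As a sketch, the same expression switched over to $__range looks roughly like this for the 99th percentile (shown unescaped for readability; not copied verbatim from the dashboard JSON):

histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket{job=~"$job", instance=~"$instance"}[$__range])) by (le))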

Let's do a before and after comparison for this chart specifically: same app and time-range, but only a single instance:

Before

image

This PR

image

This is probably mostly a taste thing, but here's what I appreciate about the latter chart:

  1. It's possible to "scope in" using cursor selection in Grafana if I want to look at a specific time period on the chart. With the original chart I get "no data" in many scenarios.
  2. This chart design scales down to smaller time-scales for applications that have less request data - this application does ~1m requests per day and I still have zero data using the default chart design unless I zoom out to a 24 hour time window or so.

The only drawback of this second design is that it's harder to see really large outliers - that's easier to see with the original chart design. This design is averaging rates over a longer period of time, which is what makes it more reliable at showing data over smaller time intervals - you can still see where spikes occur, such as this instance here:

image

But it's not nearly as pronounced. Happy to take feedback on what's more appropriate, but my objective was making sure the chart works correctly across a wider range of traffic workloads / time ranges.

@@ -424,7 +425,7 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\", http_response_status_code=~\"4..|5..\"}[$__rate_interval]) or vector(0)) / sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval]))",
Author

This is the error rate chart, which had some of the same problems that the Requests Duration chart did:

  1. Not showing data over many intervals
  2. The data it showed didn't always make a lot of sense.

Before

image

I actually don't know what to make of those error rates - 0% last and 97.5% max? What does that mean?

This PR

We've changed the error rate chart to use the $__range and now the story it's telling is much clearer:

image

I have a roughly 70% 400-level error rate over the past 24 hours - and that makes sense with what I know about this application: it runs a private NuGet package endpoint and we get a lot of 404s from clients looking for packages that this server doesn't host, as a function of how NuGet clients are designed.

If I scope in to the past 6 hours, which mostly covers Saturday when our customers aren't working:

image

That 404 rate scales down, with the rest of our traffic, to roughly 11%. It's now easier to tell that by looking at the charts than before.
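For reference, the shape of the updated error-rate expression is roughly the following (unescaped and illustrative; the actual dashboard JSON escapes the quotes):

sum(rate(http_server_request_duration_seconds_bucket{job=~"$job", instance=~"$instance", http_response_status_code=~"4..|5.."}[$__range]) or vector(0))
  / sum(rate(http_server_request_duration_seconds_bucket{job=~"$job", instance=~"$instance"}[$__range]))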

src/Grafana/dashboards/aspnetcore.json (outdated review thread, resolved)
"expr": "sum(kestrel_active_connections{job=\"$job\", instance=\"$instance\"})",
"legendFormat": "__auto",
"expr": "sum(kestrel_active_connections{job=~\"$job\", instance=~\"$instance\"})",
"legendFormat": "active connections",
Author

Changed the legend to use a manual entry because otherwise, now that we can aggregate Kestrel data across multiple applications, you end up with a very verbose auto-generated label on the legend.

"expr": "sum(http_server_active_requests{job=\"$job\", instance=\"$instance\"})",
"legendFormat": "__auto",
"expr": "sum(http_server_active_requests{job=~\"$job\", instance=~\"$instance\"})",
"legendFormat": "active requests",
Author

Same deal as with the Kestrel connections - manually renamed the legend here.

I toyed with the idea of breaking out active connections by service / instance and that's easily doable, but opted for something simpler - the user can get that same data by changing the template variables on the selector.
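For anyone who does want that breakdown, a per-series variant would look roughly like this (hypothetical alternative, not part of this PR):

sum by (job, instance) (kestrel_active_connections{job=~"$job", instance=~"$instance"})

paired with a legendFormat along the lines of {{job}} ({{instance}}) instead of a fixed string.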

"label": "Job",
"multi": false,
"multi": true,
Author

Enables multi-select, which is now feasible since all of the PromQL queries have been updated to support it.

@@ -1304,16 +1316,17 @@
"type": "query"
},
{
"allValue": ".*",
Author

This is the instance template variable - it has all of the same changes as the job template variable.
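For reference, the multi-select-related settings on each of these template variables end up looking roughly like this in the dashboard JSON (abridged sketch; field set assumed from Grafana's standard variable schema, not copied verbatim from this file):

"allValue": ".*",
"includeAll": true,
"label": "Instance",
"multi": true,

The allValue of ".*" is what makes the All option work with the =~ matchers used in the queries above.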

src/Grafana/dashboards/aspnetcore.json (outdated review thread, resolved)
@@ -1350,6 +1364,6 @@
"timezone": "",
"title": "ASP.NET Core",
"uid": "KdDACDp4z",
"version": 1,
"version": 2,
Author

Need to bump the chart version if this gets pushed to Grafana Cloud.

src/Grafana/dashboards/aspnetcore.json (outdated review thread, resolved)
@JamesNK (Member) commented Jan 5, 2025

Great work. I think the interval and refresh values should be reset back (I think small values are good for immediately after someone imports the chart to verify that it works) but otherwise the changes all look good.

There is also the endpoint dashboard. It has most of the same set of issues and the changes here should be applied to it. You're welcome to do it, or I can copy the changes to it in a follow up PR.

@Aaronontheweb (Author)

> I think the interval and refresh values should be reset back (I think small values are good for immediately after someone imports the chart to verify that it works)

Done!

> There is also the endpoint dashboard. It has most of the same set of issues and the changes here should be applied to it. You're welcome to do it, or I can copy the changes to it in a follow up PR.

I hadn't even seen that one, but I'll take a look at it - ok to do that in a separate PR?

@JamesNK (Member) commented Jan 6, 2025

I tried it out in the metrics samples app - dotnet/aspire-samples#638. I got some results that don't look right.

Steps when trying out the metrics sample were something like:

  • Launch the aspire solution
  • Visit weather page. Generates an auth error when getting the weather api
  • Visit auth page and login
  • Visit weather page again and refresh data a couple of times. Generates a mix of successful data fetches and errors

What I see:

  • Fractional number of requests in the counts:

    image

  • Another issue is the counts in the top requested and unhandled exception endpoint panels. There has been at least one request to /api/login and one error getting /api/weather, but they both report zero.

    image

@JamesNK (Member) commented Jan 6, 2025

I played around with the dashboard queries.

  • I think the problem with the unhandled exception panel is that __rate_interval wasn't replaced with __range.
  • In the top 10 panels, ceil seems to give more correct results with small numbers than floor.
  • In the counts panels (e.g. total requests) ceil seems to fix the fractions

Counts are still a little off. For example:

image

image

image

increase + floor = 260
increase + ceil = 267
main = 270

@Aaronontheweb (Author)

I think for the numbers ceil might need to be moved to the outside of the query, rather than on the components being summed

@JamesNK (Member) commented Jan 8, 2025

> I think for the numbers ceil might need to be moved to the outside of the query, rather than on the components being summed

Sorry, I don't know what this means. Is ceil + increase the right approach if it creates inaccurate numbers?

@Aaronontheweb (Author)

> I think for the numbers ceil might need to be moved to the outside of the query, rather than on the components being summed

> Sorry, I don't know what this means. Is ceil + increase the right approach if it creates inaccurate numbers?

Turns out that approach didn't work because ceil expects a vector and not a scalar value, so ceil has to stay where it is in the PromQL query.

I also don't think these numbers are "inaccurate" either - there might be a loss of precision because the values are now calculated over a time range, versus just showing whatever the current cumulative value of the counter happens to be. The trade-off with the old approach is that if you changed the time range / picker, the counter value would never change - which makes that value not super useful (all it can tell you is how many HTTP requests there have been since the counter was last reset).

By scoping that value to a time range with increase you can actually see what happened over a given span of time, and that is more accurate if you're trying to understand your application's performance over that window of time.
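As a sketch of what's being discussed, a counts panel like total requests ends up being computed along these lines, with ceil applied to the per-series increase and sum on the outside (the metric name and exact placement here are illustrative assumptions, not copied from the dashboard):

sum(ceil(increase(http_server_request_duration_seconds_count{job=~"$job", instance=~"$instance"}[$__range])))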

@JamesNK (Member) commented Jan 9, 2025

I did some research and increase and rate look like they're inaccurate by design - prometheus/prometheus#3746 and https://www.innoq.com/en/blog/2019/05/prometheus-counters/

There doesn't seem to be a good way to get an exact count over a range. I guess we have to live with it. I don't think it is too bad because these counters are incremented by 1 at a time. If I understand things correctly, we'd lose a few requests at the range boundaries, but that means totals won't be too far off.

On the changes to the charts, they look subjective rather than strictly better. I'll test some more to see whether they're a good improvement.

@JamesNK (Member) commented Jan 9, 2025

I think the time series graphs should keep using $__rate_interval. All the samples and discussion I've seen say that this is the right value to use. I believe $__range attempts to average out all the values in the visible range, which doesn't seem like what you want in a time series graph.

The output of the graphs makes more sense with interval:

range:
image

rate_interval:
image

In your screenshots the graphs look messy, but that's because you're looking at 24 hours of data. It seems like messy output on these kinds of graphs is expected until you zoom in.

@Aaronontheweb (Author)

So for the error rates and request duration - change those back to rate_interval?

@JamesNK (Member) commented Jan 13, 2025

Yes, in the time series graphs.

Labels
community-contribution Indicates that the PR has been added by a community member
2 participants