
Grafana: improved aspnetcore.json #7021

Open · wants to merge 9 commits into main

Conversation

@Aaronontheweb commented Jan 4, 2025

Description

I'm going to do a self-review with some screenshots, since many of the changes made to the Grafana dashboard here can't be appreciated without the data being visualized. This PR contains several quality-of-life improvements to the default Grafana dashboards for ASP.NET Core + Prometheus, such as:

  1. Rather than displaying absolute raw counter totals, we display the delta between the beginning and end of the selected time range.
  2. Include an "All" option for both job and instance so it's easier to aggregate metrics by application or across several applications.
  3. Change several metrics to use the dashboard's time range, rather than the Grafana refresh interval, for calculations, since the former is usually what people care about.

Checklist

  • Is this feature complete?
    • Yes. Ready to ship.
    • No. Follow-up changes expected.
  • Are you including unit tests for the changes and scenario tests if relevant?
    • Yes
    • No
  • Did you add public API?
    • Yes
      • If yes, did you have an API Review for it?
        • Yes
        • No
      • Did you add <remarks /> and <code /> elements on your triple slash comments?
        • Yes
        • No
    • No
  • Does the change make any security assumptions or guarantees?
    • Yes
      • If yes, have you done a threat model and had a security review?
        • Yes
        • No
    • No
  • Does the change require an update in our Aspire docs?

@dotnet-policy-service bot added the community-contribution label Jan 4, 2025
@Aaronontheweb (Author) left a comment


Detailed my changes as thoroughly as I could

@@ -198,7 +198,7 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "histogram_quantile(0.50, sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval])) by (le))",
"expr": "histogram_quantile(0.50, sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (le))",
Author

You're going to see this change on every data plot on every chart:

{job=\"$job\", instance=\"$instance\"}

to

{job=~\"$job\", instance=~\"$instance\"}

The =~ matcher is the key change here: it allows Grafana to expand the variables to include all of the selected job / instance values. I'm not going to comment on every instance of this change because it's identical everywhere, but that's what this is.
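For example, with two instances of an app selected, Grafana expands $instance into a regex alternation, which only the =~ matcher can match (label values below are made up for illustration):

# single-select, exact matcher - only ever matches one literal value
http_server_active_requests{job="myapp", instance="myapp-0:8080"}

# multi-select / All, regex matcher - $instance expands to "myapp-0:8080|myapp-1:8080"
http_server_active_requests{job=~"myapp", instance=~"myapp-0:8080|myapp-1:8080"}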

Author

So this chart is the 99% latency chart, and in addition to changing it to support multi-select I've also modified it to use the $__range value in Grafana, which corresponds to the trailing time window, versus the $__rate_interval variable, which is at least 4x the Prometheus scrape interval: https://grafana.com/blog/2020/09/28/new-in-grafana-7.2-__rate_interval-for-prometheus-rate-queries-that-just-work/

It's useful for rate queries, but that's not what we're measuring here - we're instead trying to determine "what was the 99th-percentile latency over X period of time?" - the $__range value is better for that, and I've made that change in several charts.
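As a sketch, the same expression switched over to $__range looks roughly like this for the 99th percentile (shown unescaped for readability; not copied verbatim from the dashboard JSON):

histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket{job=~"$job", instance=~"$instance"}[$__range])) by (le))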

Let's do a before and after comparison for this chart specifically: same app and time-range, but only a single instance:

Before

image

This PR

image

This is probably mostly a taste thing, but here's what I appreciate about the latter chart:

  1. It's possible to "scope in" using cursor selection in Grafana if I want to look at a specific time period on the chart. With the original chart I get "no data" in many scenarios.
  2. This chart design scales down to smaller time-scales for applications that have less request data - this application does ~1m requests per day and I still have zero data using the default chart design unless I zoom out to a 24 hour time window or so.

The only drawback of this second design is that it's harder to see really large outliers - that's easier to see with the original chart design. This design is averaging rates over a longer period of time, which is what makes it more reliable at showing data over smaller time intervals - you can still see where spikes occur, such as this instance here:

image

But it's not nearly as pronounced. Happy to take feedback on what's more appropriate, but my objective was making sure the chart works correctly across a wider range of traffic workloads / time ranges.

@@ -424,7 +425,7 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\", http_response_status_code=~\"4..|5..\"}[$__rate_interval]) or vector(0)) / sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval]))",
Author

This is the error rate chart, which had some of the same problems that the Requests Duration chart did:

  1. Not showing data over many intervals
  2. The data it showed didn't always make a lot of sense.

Before

image

I actually don't know what to make of those error rates - 0% last and 97.5% max? What does that mean?

This PR

We've changed the error rate chart to use the $__range and now the story it's telling is much clearer:

image

I have a roughly 70% 400-level error rate over the past 24 hours - and that makes sense with what I know about this application: it runs a private NuGet package endpoint and we get a lot of 404s from clients looking for packages that this server doesn't host, as a function of how NuGet clients are designed.

If I scope in to the past 6 hours, which mostly covers Saturday when our customers aren't working:

image

That 404 rate scales down, with the rest of our traffic, to roughly 11%. It's now easier to tell that by looking at the charts than before.
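For reference, the shape of the updated error-rate expression is roughly the following (unescaped and illustrative; the actual dashboard JSON escapes the quotes):

sum(rate(http_server_request_duration_seconds_bucket{job=~"$job", instance=~"$instance", http_response_status_code=~"4..|5.."}[$__range]) or vector(0))
  / sum(rate(http_server_request_duration_seconds_bucket{job=~"$job", instance=~"$instance"}[$__range]))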

src/Grafana/dashboards/aspnetcore.json (outdated review thread, resolved)
"expr": "sum(kestrel_active_connections{job=\"$job\", instance=\"$instance\"})",
"legendFormat": "__auto",
"expr": "sum(kestrel_active_connections{job=~\"$job\", instance=~\"$instance\"})",
"legendFormat": "active connections",
Author

Changed the legend to use a manual entry because otherwise, now that we can aggregate Kestrel data across multiple applications, you end up with a very verbose auto-generated label on the legend.

"expr": "sum(http_server_active_requests{job=\"$job\", instance=\"$instance\"})",
"legendFormat": "__auto",
"expr": "sum(http_server_active_requests{job=~\"$job\", instance=~\"$instance\"})",
"legendFormat": "active requests",
Author

Same deal as with the Kestrel connections - manually renamed the legend here.

I toyed with the idea of breaking out active connections by service / instance and that's easily doable, but opted for something simpler - the user can get that same data by changing the template variables on the selector.
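For anyone who does want that breakdown, a per-series variant would look roughly like this (hypothetical alternative, not part of this PR):

sum by (job, instance) (kestrel_active_connections{job=~"$job", instance=~"$instance"})

paired with a legendFormat along the lines of {{job}} ({{instance}}) instead of a fixed string.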

"label": "Job",
"multi": false,
"multi": true,
Author

Enables multi-select, which is now feasible since all of the PromQL queries have been updated to support it.

@@ -1304,16 +1316,17 @@
"type": "query"
},
{
"allValue": ".*",
Author

This is the instance template variable - it has all of the same changes as the job template variable.
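For reference, the multi-select-related settings on each of these template variables end up looking roughly like this in the dashboard JSON (abridged sketch; field set assumed from Grafana's standard variable schema, not copied verbatim from this file):

"allValue": ".*",
"includeAll": true,
"label": "Instance",
"multi": true,

The allValue of ".*" is what makes the All option work with the =~ matchers used in the queries above.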

src/Grafana/dashboards/aspnetcore.json (outdated review thread, resolved)
@@ -1350,6 +1364,6 @@
"timezone": "",
"title": "ASP.NET Core",
"uid": "KdDACDp4z",
"version": 1,
"version": 2,
Author

Need to bump the chart version if this gets pushed to Grafana Cloud.

src/Grafana/dashboards/aspnetcore.json (outdated review thread, resolved)
@JamesNK (Member) commented Jan 5, 2025

Great work. I think the interval and refresh values should be reset back (I think small values are good for immediately after someone imports the chart to verify that it works) but otherwise the changes all look good.

There is also the endpoint dashboard. It has most of the same set of issues and the changes here should be applied to it. You're welcome to do it, or I can copy the changes to it in a follow up PR.

@Aaronontheweb (Author)

> I think the interval and refresh values should be reset back (I think small values are good for immediately after someone imports the chart to verify that it works)

Done!

> There is also the endpoint dashboard. It has most of the same set of issues and the changes here should be applied to it. You're welcome to do it, or I can copy the changes to it in a follow up PR.

I hadn't even seen that one, but I'll take a look at it - ok to do that in a separate PR?

@JamesNK (Member) commented Jan 6, 2025

I tried it out in the metrics samples app - dotnet/aspire-samples#638. I got some results that don't look right.

Steps when trying out the metrics sample were something like:

  • Launch the aspire solution
  • Visit weather page. Generates an auth error when getting the weather api
  • Visit auth page and login
  • Visit weather page again and refresh data a couple of times. Generates a mix of successful data fetches and errors

What I see:

  • Fractional number of requests in the counts:

    image

  • Another issue is the counts in the top requested and unhandled exception endpoint panels. There has been at least one request to /api/login and one error getting /api/weather, but they both report zero.

    image

@JamesNK (Member) commented Jan 6, 2025

I played around with the dashboard queries.

  • I think the problem with the unhandled exception panel is that __rate_interval wasn't replaced with __range.
  • In the top 10 panels, ceil seems to give more correct results with small numbers than floor.
  • In the counts panels (e.g. total requests) ceil seems to fix the fractions

Counts are still a little off. For example:

image

image

image

increase + floor = 260
increase + ceil = 267
main = 270

@Aaronontheweb (Author)

I think for the numbers ceil might need to be moved to the outside of the query, rather than on the components being summed

@JamesNK (Member) commented Jan 8, 2025

> I think for the numbers ceil might need to be moved to the outside of the query, rather than on the components being summed

Sorry, I don't know what this means. Is ceil + increase the right approach if it creates inaccurate numbers?

@Aaronontheweb (Author)

> I think for the numbers ceil might need to be moved to the outside of the query, rather than on the components being summed

> Sorry, I don't know what this means. Is ceil + increase the right approach if it creates inaccurate numbers?

Turns out that approach didn't work because ceil expects a vector and not a scalar value, so ceil has to stay where it is in the PromQL query.

I also don't think these numbers are "inaccurate" either - there might be a loss of precision because the values are now calculated over a time range, versus just showing whatever the current cumulative value of the counter happens to be. The trade-off with the old approach is that if you changed the time range / picker, the counter value would never change - which makes that value not super useful (all it can tell you is how many HTTP requests there have been since the counter was last reset).

By scoping that value to a time range with increase you can actually see what happened over a given span of time, and that is more accurate if you're trying to understand your application's performance over that window of time.
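As a sketch of what's being discussed, a counts panel like total requests ends up being computed along these lines, with ceil applied to the per-series increase and sum on the outside (the metric name and exact placement here are illustrative assumptions, not copied from the dashboard):

sum(ceil(increase(http_server_request_duration_seconds_count{job=~"$job", instance=~"$instance"}[$__range])))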

@JamesNK (Member) commented Jan 9, 2025

I did some research and increase and rate look like they're inaccurate by design - prometheus/prometheus#3746 and https://www.innoq.com/en/blog/2019/05/prometheus-counters/

There doesn't seem to be a good way to get an exact count over a range. I guess we have to live with it. I don't think it is too bad because these counters are incremented by 1 at a time. If I understand things correctly, we'd lose a few requests at the range boundaries, but that means totals won't be too far off.

On the changes to the charts, they look subjective rather than strictly better. I'll test some more to see whether they're a good improvement.

@JamesNK (Member) commented Jan 9, 2025

I think the time series graphs should keep using $__rate_interval. All the samples and discussion I've seen say that this is the right value to use. I believe $__range attempts to average out all the values in the visible range, which doesn't seem like what you want in a time series graph.

The output of the graphs makes more sense with interval:

range:
image

rate_interval:
image

In your screenshots the graphs look messy, but that's because you're looking at 24 hours of data. It seems like messy output on these kinds of graphs is expected until you zoom in.

@Aaronontheweb (Author)

So for the error rates and request duration - change those back to rate_interval?

@JamesNK (Member) commented Jan 13, 2025

Yes, in the time series graphs.

Labels
community-contribution Indicates that the PR has been added by a community member
2 participants