Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

badarith errors when collecting metrics #12815

Open
mkuratczyk opened this issue Nov 26, 2024 · 4 comments
Open

badarith errors when collecting metrics #12815

mkuratczyk opened this issue Nov 26, 2024 · 4 comments
Labels

Comments

@mkuratczyk
Copy link
Contributor

mkuratczyk commented Nov 26, 2024

Describe the bug

I don't have the exact repro steps. I think this is a race condition when a queue is deleted while the metrics are collected and the queue is found by the collector but then returns empty values ('') instead of the expected numbers. I've seen this occur in 3 different places:

crasher:
  initial call: rabbit_mgmt_metrics_collector:init/1
  pid: <0.699.0>
  registered_name: connection_coarse_metrics_metrics_collector
  exception error: an error occurred when evaluating an arithmetic expression
    in operator  -/2
       called as '' - 19495760921
    in call from rabbit_mgmt_metrics_collector:difference/2 (rabbit_mgmt_metrics_collector.erl, line 649)
    in call from rabbit_mgmt_metrics_collector:aggregate_entry/4 (rabbit_mgmt_metrics_collector.erl, line 234)
    in call from lists:foldl/3 (lists.erl, line 1594)
    in call from ets:do_foldl/4 (ets.erl, line 656)
    in call from ets:foldl/3 (ets.erl, line 645)
    in call from rabbit_mgmt_metrics_collector:aggregate_metrics/2 (rabbit_mgmt_metrics_collector.erl, line 172)
    in call from rabbit_mgmt_metrics_collector:handle_info/2 (rabbit_mgmt_metrics_collector.erl, line 106)
crasher:
  initial call: rabbit_mgmt_metrics_collector:init/1
  pid: <0.7180.0>
  registered_name: connection_coarse_metrics_metrics_collector
  exception error: an error occurred when evaluating an arithmetic expression
    in operator  +/2
       called as 19340950184 + ''
    in call from rabbit_mgmt_metrics_collector:sum_entry/2 (rabbit_mgmt_metrics_collector.erl, line 631)
    in call from rabbit_mgmt_metrics_collector:'-insert_entry_op/4-fun-0-'/2 (rabbit_mgmt_metrics_collector.erl, line 598)
    in call from maps:update_with/4 (maps.erl, line 296)
    in call from rabbit_mgmt_metrics_collector:insert_entry_op/4 (rabbit_mgmt_metrics_collector.erl, line 597)
    in call from lists:foldl/3 (lists.erl, line 1594)
    in call from rabbit_mgmt_metrics_collector:aggregate_entry/4 (rabbit_mgmt_metrics_collector.erl, line 235)
    in call from lists:foldl/3 (lists.erl, line 1594)
crasher:
  initial call: cowboy_stream_h:request_process/3
  pid: <0.95144.0>
  registered_name: []
  exception error: an error occurred when evaluating an arithmetic expression
    in operator  +/2
       called as '' + 4193012
    in call from prometheus_rabbitmq_core_metrics_collector:'-get_data/3-fun-5-'/3 (prometheus_rabbitmq_core_metrics_collector.erl, line 648)
    in call from lists:foldl/3 (lists.erl, line 1594)
    in call from ets:do_foldl/4 (ets.erl, line 656)
    in call from ets:foldl/3 (ets.erl, line 645)
    in call from prometheus_rabbitmq_core_metrics_collector:get_data/3 (prometheus_rabbitmq_core_metrics_collector.erl, line 633)
    in call from prometheus_rabbitmq_core_metrics_collector:'-collect/5-lc$^0/1-0-'/6 (prometheus_rabbitmq_core_metrics_collector.erl, line 327)
    in call from prometheus_rabbitmq_core_metrics_collector:collect/5 (prometheus_rabbitmq_core_metrics_collector.erl, line 329)

Observed on the main branch on November 26

Reproduction steps

Not clear

Expected behavior

No crashes

Additional context

No response

@mkuratczyk mkuratczyk added the bug label Nov 26, 2024
@mkuratczyk mkuratczyk changed the title Occasional badarith errors when collecting metrics badarith errors when collecting metrics Nov 27, 2024
@mkuratczyk
Copy link
Contributor Author

I'm not sure if that's because I'm testing a slightly different scenario perhaps these errors are caused by some recent changes but I haven't seen them in the past at all and now I see them very often.

@michaelklishin
Copy link
Member

We have seen these before, usually on a freshly booted node when some values are not yet available. '' is used as undefined historically by a few metrics that come from the environment, e.g. free disk space, which is not computed instantly the node starts booting.

The Prometheus plugin should filter such values out because if a data point isn't available yet, what else can it do?

@mkuratczyk
Copy link
Contributor Author

I've seen this enough times now to say that it's certainly not about a freshly booted node, but about a queue getting deleted.

@michaelklishin
Copy link
Member

@mkuratczyk yup, that can be another case where some metrics no longer exist. During a boot, they do not yet exist, and after queue deletion, they no longer exist.

I am all for making the code more defensive but if some samples are missing… what other than an error can the Prometheus scraping API endpoint return?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants