[Kubernetes Integration] Fix for apiserver token expiration #42016

Merged: 26 commits, Jan 7, 2025

Conversation

gizas (Contributor) commented Dec 12, 2024

  • Bug

Proposed commit message

WHAT: Adds to NewPrometheusClient the ability to refresh the authentication bearer token.

WHY: This is needed by specific K8s metricsets that use Prometheus to retrieve their metrics. In such cases, when the token expires, the client keeps using the old connection and is eventually rejected with a 401 Unauthorized error.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Use the following documentation to build Metricbeat locally.

Use only the following manifest:

    metricbeat.autodiscover:
      providers:
        - type: kubernetes
          scope: cluster
          node: ${NODE_NAME}
          unique: true
          templates:
            - config:
                - module: kubernetes
                  metricsets:
                    - apiserver
                  hosts: ["https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}"]
                  # use_kubeadm: true
                  # bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
                  bearer_token_file: /service-account/token
                  ssl.certificate_authorities:
                    - /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                  period: 60s
...
        volumeMounts:
          - name: token-vol
            mountPath: /service-account
            readOnly: true
...
        volumes:
          - name: token-vol
            projected:
              sources:
                - serviceAccountToken:
                    path: token
                    expirationSeconds: 600
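
While testing with `expirationSeconds: 600`, it can help to check when the mounted token actually expires. The sketch below decodes the `exp` claim of a JWT without verifying its signature, so it is for inspection only. `tokenExpiry` and `fakeToken` are hypothetical helpers, and the sample token is fabricated for the demo; in a pod you would pass the contents of /service-account/token instead.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// tokenExpiry (hypothetical helper) returns the "exp" claim of a JWT.
// It does NOT verify the signature; use it for debugging only.
func tokenExpiry(token string) (time.Time, error) {
	parts := strings.Split(strings.TrimSpace(token), ".")
	if len(parts) != 3 {
		return time.Time{}, errors.New("not a three-part JWT")
	}
	// JWT segments are base64url-encoded without padding.
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return time.Time{}, err
	}
	var claims struct {
		Exp int64 `json:"exp"`
	}
	if err := json.Unmarshal(payload, &claims); err != nil {
		return time.Time{}, err
	}
	if claims.Exp == 0 {
		return time.Time{}, errors.New("no exp claim")
	}
	return time.Unix(claims.Exp, 0).UTC(), nil
}

// fakeToken (hypothetical helper) builds an unsigned token for the demo.
func fakeToken(exp int64) string {
	payload := base64.RawURLEncoding.EncodeToString([]byte(fmt.Sprintf(`{"exp":%d}`, exp)))
	return "e30." + payload + ".signature"
}

func main() {
	exp, err := tokenExpiry(fakeToken(time.Now().Add(10 * time.Minute).Unix()))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("token expires in about", time.Until(exp).Round(time.Minute))
}
```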

Related issues

Use cases

Reported here: https://github.com/elastic/sdh-beats/issues/5439

Screenshots

[Screenshot: apiserver]

Processing has continued for the last 30 minutes or so.

Logs

Consecutive messages:
"message":"OLEEEE--- -> Denotes that Prometheus metrics are being processed
"message":"PASSSS--- -> Denotes that the Prometheus metrics request got a 401 and the token is then refreshed
"message":"OLEEEE--- -> Denotes that Prometheus metrics processing continues

{"log.level":"info","@timestamp":"2024-12-12T14:32:07.296Z","log.logger":"PASSSSOLEEE","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/apiserver.(*Metricset).Fetch","file.name":"apiserver/metricset.go","file.line":80},"message":"OLEEEE--- TWe need to march 1:%!s(<nil>) and err: unexpected status code 401","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-12-12T14:32:34.199Z","log.logger":"monitoring","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot","file.name":"log/log.go","file.line":192},"message":"Non-zero metrics in the last 30s","service.name":"metricbeat","monitoring":{"metrics":{"beat":{"cgroup":{"memory":{"mem":{"usage":{"bytes":3196932096}}}},"cpu":{"system":{"ticks":790,"time":{"ms":20}},"total":{"ticks":3960,"time":{"ms":160},"value":3960},"user":{"ticks":3170,"time":{"ms":140}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":10},"info":{"ephemeral_id":"eef1f440-b9d3-4dcf-954c-555c1cf5d5fe","uptime":{"ms":1110038},"version":"9.0.0"},"memstats":{"gc_next":29111896,"memory_alloc":25035752,"memory_total":1164904664,"rss":103301120},"runtime":{"goroutines":25}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"acked":528,"active":0,"batches":1,"total":528},"read":{"bytes":13880,"errors":1},"write":{"bytes":56510,"latency":{"histogram":{"count":19,"max":37,"mean":26.68421052631579,"median":27,"min":19,"p75":32,"p95":37,"p99":37,"p999":37,"stddev":5.2621051578736795}}}},"pipeline":{"clients":1,"events":{"active":0,"published":528,"total":528},"queue":{"acked":528,"added":{"bytes":451342,"events":528},"consumed":{"bytes":451342,"events":528},"filled":{"bytes":0,"events":0,"pct":0},"max_bytes":0,"max_events":3200,"removed":{"bytes":451342,"events":528}}}},"metricbeat":{"kubernetes":{"apiserver":{"events":528,"success":528}}},"system":{"load":{"1":5.87,"15":4.69,"5":4.9,"norm":{"1":0.8386,"15":0.67,"5":0.7}}}},"ecs.version":"1.6.0"}}
{"log.level":"info","@timestamp":"2024-12-12T14:33:04.201Z","log.logger":"monitoring","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot","file.name":"log/log.go","file.line":192},"message":"Non-zero metrics in the last 30s","service.name":"metricbeat","monitoring":{"metrics":{"beat":{"cgroup":{"memory":{"mem":{"usage":{"bytes":3199254528}}}},"cpu":{"system":{"ticks":800,"time":{"ms":10}},"total":{"ticks":4010,"time":{"ms":50},"value":4010},"user":{"ticks":3210,"time":{"ms":40}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":10},"info":{"ephemeral_id":"eef1f440-b9d3-4dcf-954c-555c1cf5d5fe","uptime":{"ms":1140036},"version":"9.0.0"},"memstats":{"gc_next":29111896,"memory_alloc":26091616,"memory_total":1165960528,"rss":103825408},"runtime":{"goroutines":25}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"active":0},"write":{"latency":{"histogram":{"count":19,"max":37,"mean":26.68421052631579,"median":27,"min":19,"p75":32,"p95":37,"p99":37,"p999":37,"stddev":5.2621051578736795}}}},"pipeline":{"clients":1,"events":{"active":0},"queue":{"filled":{"bytes":0,"events":0,"pct":0},"max_bytes":0,"max_events":3200}}},"system":{"load":{"1":5.15,"15":4.67,"5":4.82,"norm":{"1":0.7357,"15":0.6671,"5":0.6886}}}},"ecs.version":"1.6.0"}}
{"log.level":"info","@timestamp":"2024-12-12T14:33:07.191Z","log.logger":"PASSSSOLEEE","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/apiserver.(*Metricset).Fetch","file.name":"apiserver/metricset.go","file.line":80},"message":"OLEEEE--- TWe need to march 1:unexpected status code 401 from server and err: unexpected status code 401","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-12-12T14:33:07.192Z","log.logger":"PASSSSOLEEE","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/apiserver.(*Metricset).Fetch","file.name":"apiserver/metricset.go","file.line":85},"message":"PASSSS--- This is the connection event with err: unexpected status code 401 from server","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-12-12T14:33:34.199Z","log.logger":"monitoring","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot","file.name":"log/log.go","file.line":192},"message":"Non-zero metrics in the last 30s","service.name":"metricbeat","monitoring":{"metrics":{"beat":{"cgroup":{"memory":{"mem":{"usage":{"bytes":3198558208}}}},"cpu":{"system":{"ticks":830,"time":{"ms":30}},"total":{"ticks":4170,"time":{"ms":160},"value":4170},"user":{"ticks":3340,"time":{"ms":130}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":10},"info":{"ephemeral_id":"eef1f440-b9d3-4dcf-954c-555c1cf5d5fe","uptime":{"ms":1170037},"version":"9.0.0"},"memstats":{"gc_next":20762024,"memory_alloc":13202504,"memory_total":1224858216,"rss":101543936},"runtime":{"goroutines":25}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"acked":528,"active":0,"batches":1,"total":528},"read":{"bytes":13880,"errors":1},"write":{"bytes":56434,"latency":{"histogram":{"count":20,"max":37,"mean":26.45,"median":26.5,"min":19,"p75":31.75,"p95":36.849999999999994,"p99":37,"p999":37,"stddev":5.229483722127836}}}},"pipeline":{"clients":1,"events":{"active":0,"published":528,"total":528},"queue":{"acked":528,"added":{"bytes":450892,"events":528},"consumed":{"bytes":450892,"events":528},"filled":{"bytes":0,"events":0,"pct":0},"max_bytes":0,"max_events":3200,"removed":{"bytes":450892,"events":528}}}},"metricbeat":{"kubernetes":{"apiserver":{"events":528,"success":528}}},"system":{"load":{"1":5.35,"15":4.71,"5":4.93,"norm":{"1":0.7643,"15":0.6729,"5":0.7043}}}},"ecs.version":"1.6.0"}}
{"log.level":"info","@timestamp":"2024-12-12T14:34:04.202Z","log.logger":"monitoring","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot","file.name":"log/log.go","file.line":192},"message":"Non-zero metrics in the last 30s","service.name":"metricbeat","monitoring":{"metrics":{"beat":{"cgroup":{"memory":{"mem":{"usage":{"bytes":3199209472}}}},"cpu":{"system":{"ticks":840,"time":{"ms":10}},"total":{"ticks":4220,"time":{"ms":50},"value":4220},"user":{"ticks":3380,"time":{"ms":40}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":10},"info":{"ephemeral_id":"eef1f440-b9d3-4dcf-954c-555c1cf5d5fe","uptime":{"ms":1200037},"version":"9.0.0"},"memstats":{"gc_next":20762024,"memory_alloc":14106328,"memory_total":1225762040,"rss":101543936},"runtime":{"goroutines":25}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"active":0},"write":{"latency":{"histogram":{"count":20,"max":37,"mean":26.45,"median":26.5,"min":19,"p75":31.75,"p95":36.849999999999994,"p99":37,"p999":37,"stddev":5.229483722127836}}}},"pipeline":{"clients":1,"events":{"active":0},"queue":{"filled":{"bytes":0,"events":0,"pct":0},"max_bytes":0,"max_events":3200}}},"system":{"load":{"1":4.66,"15":4.68,"5":4.81,"norm":{"1":0.6657,"15":0.6686,"5":0.6871}}}},"ecs.version":"1.6.0"}}
{"log.level":"info","@timestamp":"2024-12-12T14:34:07.306Z","log.logger":"PASSSSOLEEE","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/apiserver.(*Metricset).Fetch","file.name":"apiserver/metricset.go","file.line":80},"message":"OLEEEE--- TWe need to march 1:%!s(<nil>) and err: unexpected status code 401","service.name":"metricbeat","ecs.version":"1.6.0"}

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Dec 12, 2024
@gizas gizas added the Team:obs-ds-hosted-services Label for the Observability Hosted Services team label Dec 12, 2024
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Dec 12, 2024
@mergify mergify bot assigned gizas Dec 12, 2024
mergify bot (Contributor) commented Dec 12, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @gizas? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit

mergify bot (Contributor) commented Dec 12, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Dec 12, 2024
@gizas gizas added backport-8.16 Automated backport with mergify backport-8.17 Automated backport with mergify labels Dec 12, 2024
gizas (Contributor Author) commented Dec 13, 2024

Repeated tests for Controllermanager and Scheduler metricsets:

MetricBeat Manifest for Controllermanager and Scheduler metricsets
    - module: kubernetes
      enabled: true
      metricsets:
        - controllermanager
      hosts: ["https://localhost:10257"]
      bearer_token_file: /service-account/token
      ssl.verification_mode: none
      period: 60s
    - module: kubernetes
      enabled: true
      metricsets:
        - scheduler
      hosts: ["https://localhost:10259"]
      bearer_token_file: /service-account/token
      ssl.verification_mode: none
      period: 60s
[Screenshots: 2024-12-13 at 11:42:54 AM and 2024-12-13 at 12:07:26 PM]

@gizas gizas marked this pull request as ready for review December 13, 2024 10:13
@gizas gizas requested review from a team as code owners December 13, 2024 10:13
@gizas gizas requested review from belimawr and mauri870 December 13, 2024 10:13
elasticmachine (Collaborator)

Pinging @elastic/obs-ds-hosted-services (Team:obs-ds-hosted-services)

@gizas gizas changed the title Initial fix for apiserver token expiration [Kubernetes Integration] Fix for apiserver token expiration Dec 13, 2024
mergify bot (Contributor) commented Dec 18, 2024

This pull request now has conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b k8sapibearer upstream/k8sapibearer
git merge upstream/main
git push upstream k8sapibearer

Comment on lines 121 to 136
error_string := fmt.Sprintf("%s", err)
errorUnauthorisedMsg := fmt.Sprintf("unexpected status code %d", http.StatusUnauthorized)
if err != nil && strings.Contains(error_string, errorUnauthorisedMsg) {
	count := 2 // We retry twice to refresh the Authorisation token in case of http.StatusUnauthorize = 401 Error
	for count > 0 {
		if _, errAuth := m.http.RefreshAuthorizationHeader(); errAuth == nil {
			events, err = m.prometheusClient.GetProcessedMetrics(m.prometheusMappings)
		}
		if err != nil {
			time.Sleep(m.mod.Config().Period)
			count--
		} else {
			break
		}
	}
}
Contributor

This code is the same across 3 metricsets, right? Can we place it anywhere and call it from there for the 3 metricsets?

Contributor Author

Indeed, I tried that, but we run into circular dependencies between the packages and the imported interfaces. I guess that is why the code was already duplicated across the metricsets.

gizas (Contributor Author) commented Dec 18, 2024

@constanca-m, @MichaelKatsoulis I would say we are ready to merge.
I tried to implement some unit tests, but without success, as it was overkill to try to check the errors returned.
I guess part of the RefreshAuthorizationHeader logic is covered here

Let me know if you have any other suggestions

@gizas gizas merged commit 7e25c4d into main Jan 7, 2025
38 checks passed
@gizas gizas deleted the k8sapibearer branch January 7, 2025 07:50
mergify bot pushed 3 commits that referenced this pull request Jan 7, 2025, each with the message:
* initial fix for apiserver

* adding fix for controller and schedule

(cherry picked from commit 7e25c4d)
gizas added a commit that referenced this pull request Jan 7, 2025
…ken expiration (#42230)

* [Kubernetes Integration] Fix for apiserver token expiration (#42016)

* initial fix for apiserver

* adding fix for controller and schedule

(cherry picked from commit 7e25c4d)

* correcting changelog.next.asciidoc file

---------

Co-authored-by: Andrew Gizas <andreas.gkizas@elastic.co>
Co-authored-by: Denis <denis.rechkunov@elastic.co>
gizas added a commit that referenced this pull request Jan 7, 2025
…42231)

* initial fix for apiserver

* adding fix for controller and schedule

(cherry picked from commit 7e25c4d)

Co-authored-by: Andrew Gizas <andreas.gkizas@elastic.co>
gizas added a commit that referenced this pull request Jan 7, 2025
…42232)

* initial fix for apiserver

* adding fix for controller and schedule

(cherry picked from commit 7e25c4d)

Co-authored-by: Andrew Gizas <andreas.gkizas@elastic.co>
Labels
backport-8.x Automated backport to the 8.x branch with mergify backport-8.16 Automated backport with mergify backport-8.17 Automated backport with mergify Team:obs-ds-hosted-services Label for the Observability Hosted Services team
4 participants