[Azure] After some days etcd-main, etcd-events & kops-controller pods of Azure KOPS clusters filled with 401 errors while trying to access kops storage account #16839

Open
ajgupta42 opened this issue Sep 16, 2024 · 1 comment
Labels: kind/bug, lifecycle/stale

Comments

@ajgupta42
Contributor

/kind bug

After a few days, the etcd-main, etcd-events, and kops-controller pods on Azure kOps clusters start logging 401 errors when they access the kOps storage account.
We have seen this on multiple clusters.
After a few more days, the error detail changes to:
AuthenticationErrorDetail: Lifetime validation failed. The token is expired

Temporary fix: for kops-controller, deleting the pod is enough, and the replacement pod comes up fine. For the etcd pods, we have to restart the control-plane machine.
Expected: the token for the system-assigned identity should be refreshed automatically.
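
To make the expected behaviour concrete, here is a minimal Go sketch of a token cache that renews itself shortly before expiry. It is not the kOps or etcd-manager implementation. `Token`, `FetchFunc`, `RefreshingSource`, and `simulate` are hypothetical names, and the `fetch` callback only stands in for whatever call obtains a token for the instance identity.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Token is a bearer token plus its expiry time (hypothetical type).
type Token struct {
	Value     string
	ExpiresAt time.Time
}

// FetchFunc obtains a fresh token. In a real component this would call the
// identity endpoint; here it is an assumption supplied by the caller.
type FetchFunc func() (Token, error)

// RefreshingSource caches a token and replaces it before it expires.
type RefreshingSource struct {
	mu     sync.Mutex
	fetch  FetchFunc
	now    func() time.Time // injectable clock, so the logic can be simulated
	margin time.Duration    // renew this long before the expiry time
	cur    Token
}

func NewRefreshingSource(fetch FetchFunc, now func() time.Time, margin time.Duration) *RefreshingSource {
	return &RefreshingSource{fetch: fetch, now: now, margin: margin}
}

// Get returns the cached token, fetching a new one when nothing is cached or
// the cached token expires within the safety margin.
func (s *RefreshingSource) Get() (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.cur.Value == "" || !s.now().Add(s.margin).Before(s.cur.ExpiresAt) {
		t, err := s.fetch()
		if err != nil {
			return "", err
		}
		s.cur = t
	}
	return s.cur.Value, nil
}

// simulate drives the source with a fake clock and one-hour tokens. It returns
// how many fetches happened and which token each Get call saw.
func simulate() (fetches int, tokens []string) {
	clock := time.Date(2024, 9, 16, 0, 0, 0, 0, time.UTC)
	now := func() time.Time { return clock }
	fetch := func() (Token, error) {
		fetches++
		return Token{Value: fmt.Sprintf("token-%d", fetches), ExpiresAt: clock.Add(time.Hour)}, nil
	}
	src := NewRefreshingSource(fetch, now, 5*time.Minute)
	// Advance the fake clock between calls: start, +30m, +26m, +24h.
	for _, step := range []time.Duration{0, 30 * time.Minute, 26 * time.Minute, 24 * time.Hour} {
		clock = clock.Add(step)
		tok, _ := src.Get()
		tokens = append(tokens, tok)
	}
	return fetches, tokens
}

func main() {
	fetches, tokens := simulate()
	fmt.Println(fetches, tokens) // prints: 3 [token-1 token-1 token-2 token-3]
}
```

In the simulation the second call reuses the cached token, the third call renews it because it falls inside the five-minute margin, and the call a day later renews it again, so a long-running process would never present an expired token.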

W0916 05:23:12.468888    4968 controller.go:161] unexpected error running etcd cluster reconciliation loop: error checking control store: error reading cluster-creation marker file azureblob://cluster-configs/<cluster-key>.eastus2.azure.reai.io/backups/etcd/main/control/etcd-cluster-created: -> sigs.k8s.io/etcdadm/etcd-manager/vendor/github.com/Azure/azure-storage-blob-go/azblob.newStorageError, vendor/github.com/Azure/azure-storage-blob-go/azblob/zc_storage_error.go:42
===== RESPONSE ERROR (ServiceCode=InvalidAuthenticationInfo) =====
Description=Server failed to authenticate the request. Please refer to the information in the www-authenticate header.
RequestId:xxxxx-xxxxx-xxxxx-xxxxxxx
Time:2024-09-16T05:23:12.4718165Z, Details:
   AuthenticationErrorDetail: Signature validation failed. Signature key not found.
   Code: InvalidAuthenticationInfo
   GET https://<kops-storage-account>.blob.core.windows.net/cluster-configs/<cluster-key>.eastus2.azure.reai.io/backups/etcd/main/control/etcd-cluster-created?timeout=61
   Authorization: REDACTED
   User-Agent: [Azure-Storage/0.15 (go1.19.9; linux)]
   X-Ms-Client-Request-Id: [xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx]
   X-Ms-Version: [2020-10-02]
   --------------------------------------------------------------------------------
   RESPONSE Status: 401 Server failed to authenticate the request. Please refer to the information in the www-authenticate header.
   Content-Length: [408]
   Content-Type: [application/xml]
   Date: [Mon, 16 Sep 2024 05:23:12 GMT]
   Server: [Microsoft-HTTPAPI/2.0]
   Www-Authenticate: [Bearer authorization_uri=https://login.microsoftonline.com/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx/oauth2/authorize resource_id=https://storage.azure.com/]
   X-Ms-Error-Code: [InvalidAuthenticationInfo]
   X-Ms-Request-Id: [xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx]

1. What kops version are you running? The command kops version, will display
this information.

v1.28.5

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

v1.28.11

3. What cloud provider are you using?
azure

4. What commands did you run? What is the simplest way to reproduce this issue?
kubectl -n kube-system logs -f etcd-manager-main-control-plane-eastus2-3000005

5. What happened after the commands executed?
The etcd-manager log shows the stack trace above: a 401 while accessing the kOps storage account.

6. What did you expect to happen?
No error

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: <cluster_key>.eastus2.azure.reai.io
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudConfig:
    azure:
      adminUser: ubuntu
      resourceGroupName: <cluster_key>
      routeTableName: <cluster_key>
      subscriptionId: xxxx
      tenantId: xxxxxx
  cloudLabels:
    cluster-name: <cluster_key>
    k8s.io_cluster-autoscaler_<cluster_key>.eastus2.azure.reai.io: owned
    k8s.io_cluster-autoscaler_enabled: "1"
    k8s.io_cluster-autoscaler_node-template_label: "1"
  cloudProvider: azure
  configBase: azureblob://cluster-configs/<cluster_key>.eastus2.azure.reai.io
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: control-plane-eastus2-3
      volumeType: StandardSSD_LRS
      name: etcd-3
    manager:
      backupRetentionDays: 7
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: control-plane-eastus2-3
      volumeType: StandardSSD_LRS
      name: etcd-3
    manager:
      backupRetentionDays: 7
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeControllerManager:
    terminatedPodGCThreshold: 1024
  kubeDNS:
    provider: CoreDNS
    nodeLocalDNS:
      enabled: true
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    SerializeImagePulls: true
  kubernetesVersion: 1.28.11
  masterPublicName: api.<cluster-key>.eastus2.azure.reai.io
  networkCIDR: 172.26.240.0/20
  kubeProxy:
    enabled: true
  networking:
    cilium:
      enableNodePort: false
  nonMasqueradeCIDR: 100.64.0.0/10
  subnets:
  - cidr: 172.26.240.0/22
    name: utility-eastus2
    region: eastus2
    type: Public
  - cidr: 172.26.248.0/21
    name: eastus2
    region: eastus2
    type: Private
  topology:
    dns:
      type: None
    masters: private
    nodes: private
  updatePolicy: external

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know?
The kops-controller pod can be fixed by deleting it, and the replacement pod comes up fine. For the etcd pods, we have to restart the control-plane machine.

After a few more days, the log fills with:

W0916 04:58:08.982238    5081 controller.go:161] unexpected error running etcd cluster reconciliation loop: error checking control store: error reading cluster-creation marker file azureblob://cluster-configs/<cluster-key>.eastus2.azure.reai.io/backups/etcd/main/control/etcd-cluster-created: -> sigs.k8s.io/etcdadm/etcd-manager/vendor/github.com/Azure/azure-storage-blob-go/azblob.newStorageError, vendor/github.com/Azure/azure-storage-blob-go/azblob/zc_storage_error.go:42
===== RESPONSE ERROR (ServiceCode=InvalidAuthenticationInfo) =====
Description=Server failed to authenticate the request. Please refer to the information in the www-authenticate header.
RequestId:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx
Time:2024-09-16T04:58:08.9828773Z, Details:
   AuthenticationErrorDetail: Lifetime validation failed. The token is expired.
   Code: InvalidAuthenticationInfo
   GET https://<kops-storageaccount>.blob.core.windows.net/cluster-configs/<cluster-key>.eastus2.azure.reai.io/backups/etcd/main/control/etcd-cluster-created?timeout=61
   Authorization: REDACTED
   User-Agent: [Azure-Storage/0.15 (go1.19.9; linux)]
   X-Ms-Client-Request-Id: [xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx]
   X-Ms-Version: [2020-10-02]
   --------------------------------------------------------------------------------
   RESPONSE Status: 401 Server failed to authenticate the request. Please refer to the information in the www-authenticate header.
   Content-Length: [404]
   Content-Type: [application/xml]
   Date: [Mon, 16 Sep 2024 04:58:08 GMT]
   Server: [Microsoft-HTTPAPI/2.0]
   Www-Authenticate: [Bearer authorization_uri=https://login.microsoftonline.com/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx/oauth2/authorize resource_id=https://storage.azure.com]
   X-Ms-Error-Code: [InvalidAuthenticationInfo]
   X-Ms-Request-Id: [xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx]
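
Both traces are 401 responses to a request that carried a bearer token. If the cause is a cached credential that is never renewed (an assumption; the issue has not been diagnosed), one defensive pattern is to re-acquire the token and retry once when the service answers 401. The Go sketch below illustrates that pattern only. `withReauth`, `errUnauthorized`, and `simulateStaleToken` are hypothetical names, not etcd-manager or Azure SDK APIs.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnauthorized stands in for an HTTP 401 (InvalidAuthenticationInfo)
// returned by the storage service.
var errUnauthorized = errors.New("401 InvalidAuthenticationInfo")

// withReauth runs op with the current token. If op fails with a 401, it asks
// for a brand-new token and retries exactly once.
func withReauth(getToken func(forceNew bool) (string, error), op func(token string) error) error {
	tok, err := getToken(false)
	if err != nil {
		return err
	}
	err = op(tok)
	if !errors.Is(err, errUnauthorized) {
		return err // success, or a failure that a new token would not fix
	}
	tok, err = getToken(true)
	if err != nil {
		return fmt.Errorf("re-acquiring token after 401: %w", err)
	}
	return op(tok)
}

// simulateStaleToken models a process holding a stale token: the first read
// is rejected and the retry with a fresh token succeeds.
func simulateStaleToken() (attempts int, err error) {
	current := "stale"
	getToken := func(forceNew bool) (string, error) {
		if forceNew {
			current = "fresh"
		}
		return current, nil
	}
	readMarker := func(token string) error {
		attempts++
		if token != "fresh" {
			return errUnauthorized
		}
		return nil
	}
	err = withReauth(getToken, readMarker)
	return attempts, err
}

func main() {
	attempts, err := simulateStaleToken()
	fmt.Println(attempts, err) // prints: 2 <nil>
}
```

Retrying only once keeps a permanently misconfigured identity from turning into a tight retry loop; the reconciliation loop would then surface the error as it does today.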

cc: @hakman

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 16, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 19, 2024