Custom resource watchers die but operator does not in environments with restricted Kubernetes API access #1145

Open
james-mchugh opened this issue Dec 10, 2024 · 2 comments

james-mchugh commented Dec 10, 2024

Long story short

When the operator's service account has limited access to the Kubernetes cluster (for example, RBAC rules that only grant access to the current namespace), the watchers may die (for example, because of a temporary auth issue) and never recover. The operator then keeps running but no longer monitors resources for changes. This appears to happen only for operators that handle custom resources.

Kopf version

1.37.2

Kubernetes version

1.29.5

Python version

3.10.14

Related Issues

Code

# Monkeypatch errors.check_response so that an auth error can be simulated on demand via:
#   pkill -SIGUSR1 -nf kopf

import logging
import signal

import kopf
from kopf._cogs.clients import errors

logger = logging.getLogger(__name__)

# Keep a reference to the original so that normal responses are still checked.
old_check_response = errors.check_response

# Flipped to True by the SIGUSR1 handler below.
BROKEN_AUTH = False


def check_response(*args, **kwargs):
    logger.info("Running monkey patched checked response")
    if BROKEN_AUTH:
        logger.info("Auth is broken, raising error.")
        raise errors.APIUnauthorizedError(None, status=401)
    return old_check_response(*args, **kwargs)


errors.check_response = check_response


def break_auth(*_):
    # Signal handler: from now on, every API response check raises a 401.
    global BROKEN_AUTH
    logger.info("Breaking auth")
    BROKEN_AUTH = True


signal.signal(signal.SIGUSR1, break_auth)

@kopf.on.update(
    CR_GROUP,
    CR_VERSION,
    CR_KIND,
)
@kopf.on.create(
    CR_GROUP,
    CR_VERSION,
    CR_KIND,
)
def monitor_custom_resource(
    name: str,
    namespace: str,
    status: kopf.Status,
    labels: kopf.Labels,
    **_,
): ...

Logs

[2024-12-10 15:43:22,060] kopf._core.engines.a [INFO    ] Initial authentication has been initiated.
[2024-12-10 15:43:22,070] kopf.activities.auth [INFO    ] Activity 'login_via_client' succeeded.
[2024-12-10 15:43:22,070] kopf._core.engines.a [INFO    ] Initial authentication has finished.
[2024-12-10 15:43:22,080] __kopf_script_0__/Us [INFO    ] Running monkey patched checked response
[2024-12-10 15:43:22,081] __kopf_script_0__/Us [INFO    ] Running monkey patched checked response
[2024-12-10 15:43:22,083] __kopf_script_0__/Us [INFO    ] Running monkey patched checked response
[2024-12-10 15:43:22,084] __kopf_script_0__/Us [INFO    ] Running monkey patched checked response
[2024-12-10 15:43:22,087] __kopf_script_0__/Us [INFO    ] Running monkey patched checked response
[2024-12-10 15:43:22,088] __kopf_script_0__/Us [INFO    ] Running monkey patched checked response
[2024-12-10 15:43:22,091] __kopf_script_0__/Us [INFO    ] Running monkey patched checked response
[2024-12-10 15:43:22,091] __kopf_script_0__/Us [INFO    ] Running monkey patched checked response
[2024-12-10 15:43:22,091] kopf._core.reactor.o [WARNING ] Not enough permissions to list namespaces. Falling back to a list of namespaces which are assumed to exist: {'default'}
[2024-12-10 15:43:22,093] kopf._core.reactor.o [WARNING ] Not enough permissions to watch for resources: changes (creation/deletion/updates) will not be noticed; the resources are only refreshed on operator restarts.
[2024-12-10 15:43:22,094] __kopf_script_0__/Us [INFO    ] Running monkey patched checked response
[2024-12-10 15:43:22,094] kopf._core.reactor.o [WARNING ] Not enough permissions to watch for namespaces: changes (deletion/creation) will not be noticed; the namespaces are only refreshed on operator restarts.
[2024-12-10 15:43:22,115] __kopf_script_0__/Us [INFO    ] Running monkey patched checked response
[2024-12-10 15:43:22,126] __kopf_script_0__/Us [INFO    ] Running monkey patched checked response
[2024-12-10 15:43:43,996] __kopf_script_0__/Us [INFO    ] Breaking auth
[2024-12-10 15:43:48,157] __kopf_script_0__/Us [INFO    ] Running monkey patched checked response
[2024-12-10 15:43:48,157] __kopf_script_0__/Us [INFO    ] Auth is broken, raising error.
[2024-12-10 15:43:48,158] kopf._core.engines.a [INFO    ] Re-authentication has been initiated.
[2024-12-10 15:43:48,167] kopf.activities.auth [INFO    ] Activity 'login_via_client' succeeded.
[2024-12-10 15:43:48,167] kopf._core.engines.a [INFO    ] Re-authentication has finished.
[2024-12-10 15:43:48,167] kopf.objects         [ERROR   ] [default/custom-resource] Throttling for 1 seconds due to an unexpected error: LoginError('Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/')
Traceback (most recent call last):
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/auth.py", line 50, in wrapper
    response = await fn(*args, **kwargs, context=context)
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 85, in request
    await errors.check_response(response)  # but do not parse it!
  File "/Users/jamesmchugh/git/operators/test_operator_bug.py", line 23, in check_response
    raise errors.APIUnauthorizedError(None, status=401)
kopf._cogs.clients.errors.APIUnauthorizedError: (None, None)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_core/actions/throttlers.py", line 44, in throttled
    yield should_run
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_core/reactor/processing.py", line 130, in process_resource_event
    applied = await application.apply(
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_core/actions/application.py", line 60, in apply
    await patch_and_check(
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_core/actions/application.py", line 131, in patch_and_check
    resulting_body = await patching.patch_obj(
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/patching.py", line 47, in patch_obj
    patched_body = await api.patch(
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 155, in patch
    response = await request(
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/auth.py", line 56, in wrapper
    await vault.invalidate(key, exc=e)
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 297, in invalidate
    raise LoginError("Ran out of valid credentials. Consider installing "
kopf._cogs.structs.credentials.LoginError: Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/
[2024-12-10 15:43:49,168] kopf.objects         [INFO    ] [default/custom-resource] Throttling is over. Switching back to normal operations.
[2024-12-10 15:43:49,169] kopf.objects         [ERROR   ] [default/custom-resource] Throttling for 1 seconds due to an unexpected error: LoginError('Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/')
Traceback (most recent call last):
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_core/actions/throttlers.py", line 44, in throttled
    yield should_run
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_core/reactor/processing.py", line 130, in process_resource_event
    applied = await application.apply(
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_core/actions/application.py", line 60, in apply
    await patch_and_check(
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_core/actions/application.py", line 131, in patch_and_check
    resulting_body = await patching.patch_obj(
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/patching.py", line 47, in patch_obj
    patched_body = await api.patch(
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 155, in patch
    response = await request(
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/auth.py", line 48, in wrapper
    async for key, info, context in vault.extended(APIContext, 'contexts'):
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 158, in extended
    async for key, item in self._items():
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 195, in _items
    yielded_key, yielded_item = self.select()
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 214, in select
    raise LoginError("Ran out of valid credentials. Consider installing "
kopf._cogs.structs.credentials.LoginError: Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/
[2024-12-10 15:43:50,165] kopf._core.reactor.q [WARNING ] Unprocessed streams left for [(custom-resource.v1beta1.foo.com, 'd782af8b-1cf4-42bc-abc3-c02ff635470f')].
[2024-12-10 15:43:50,166] kopf._core.reactor.o [ERROR   ] Watcher for custom-resource.v1beta1.foo.com@default has failed: Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/
Traceback (most recent call last):
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/aiokits/aiotasks.py", line 96, in guard
    await coro
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_core/reactor/queueing.py", line 175, in watcher
    async for raw_event in stream:
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 86, in infinite_watch
    async for raw_event in stream:
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 201, in continuous_watch
    async for raw_input in stream:
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 266, in watch_objs
    async for raw_input in api.stream(
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 200, in stream
    response = await request(
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/auth.py", line 48, in wrapper
    async for key, info, context in vault.extended(APIContext, 'contexts'):
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 158, in extended
    async for key, item in self._items():
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 195, in _items
    yielded_key, yielded_item = self.select()
  File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 214, in select
    raise LoginError("Ran out of valid credentials. Consider installing "
kopf._cogs.structs.credentials.LoginError: Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/

# operator continues running, but doing nothing

Additional information

To reproduce this scenario, create a CRD and set the CR_* vars in the code above. Additionally, create a service account with roles that only have access to the resources in the namespace the operator is monitoring, such as below:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: operator-test
  namespace: default
automountServiceAccountToken: true
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: operator-test
  namespace: default
rules:
  - apiGroups: ["*"]
    resources: ["*"]
    verbs: ["*"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: operator-test
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: operator-test
subjects:
  - kind: ServiceAccount
    name: operator-test
    namespace: "default"

Create a token for that service account (kubectl create token operator-test) and add it as a new user to your kubeconfig. Switch contexts so that the new context with the new user is the active one.

Run the operator with

kopf run -n default <filename>

After startup, send the SIGUSR1 signal to the process to make the monkeypatched check_response raise an APIUnauthorizedError the next time it runs.

pkill -SIGUSR1 -nf kopf

Create or update the custom resource. Observe that the operator logs an error but continues running. Future create/update events of the custom resource (or any other resource if multiple handlers are used) are not observed.

In an environment where access is not restricted to a single namespace, the resource-observer and namespace-observer tasks run, and both are core operator tasks. An error such as an auth failure therefore causes those tasks to fail and the operator to die. This is not the case when observing a single namespace.

Additionally, for resources that are not custom, the event-poster core task uses the events API to report when handlers succeed/fail. This too will fail in the face of an auth issue, causing the event-poster task to die and the operator to then die.
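To illustrate the failure mode, here is a minimal stand-alone sketch using plain asyncio rather than kopf's internals. The names `watcher` and `operator` are hypothetical stand-ins, not kopf APIs. It shows a background task failing while the main loop, which never inspects that task, keeps running.

```python
import asyncio


async def watcher() -> None:
    # Hypothetical stand-in for a watcher that hits an auth error.
    await asyncio.sleep(0.01)
    raise RuntimeError("Ran out of valid credentials")


async def operator(ticks: int = 3) -> tuple:
    # The task is started, but nothing awaits it or checks its outcome.
    task = asyncio.create_task(watcher())
    completed_ticks = 0
    # Stand-in for the main loop, which carries on regardless of the watcher.
    for _ in range(ticks):
        await asyncio.sleep(0.02)
        completed_ticks += 1
    # Inspected only here, for the sake of the demonstration.
    return completed_ticks, task.done(), repr(task.exception())


completed_ticks, watcher_dead, error = asyncio.run(operator())
print(completed_ticks, watcher_dead, error)
```

The main loop finishes all of its iterations even though the watcher task died early, which mirrors an operator process that stays up while no longer watching anything.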

In the case of observing a custom resource within a single namespace, neither of the above safety nets gets triggered. As a result, the operator silently stops doing any work while the process keeps running. From reviewing the code, I think a fix could be to update

async def ochestrator(
        *,
        processor: ResourceWatchStreamProcessor,
        settings: configuration.OperatorSettings,
        identity: peering.Identity,
        insights: references.Insights,
        operator_paused: aiotoggles.ToggleSet,
) -> None:
    peering_missing = await operator_paused.make_toggle(name='peering CRD is missing')
    ensemble = Ensemble(
        peering_missing=peering_missing,
        operator_paused=operator_paused,
        operator_indexed=aiotoggles.ToggleSet(all),
    )
    try:
        async with insights.revised:
            while True:
                await insights.revised.wait()
                await adjust_tasks(
                    processor=processor,
                    insights=insights,
                    settings=settings,
                    identity=identity,
                    ensemble=ensemble,
                )
    except asyncio.CancelledError:
        tasks = ensemble.get_tasks(ensemble.get_keys())
        await aiotasks.stop(tasks, title="streaming", logger=logger, interval=10)
        raise

so that the orchestrator monitors the status of the tasks in the ensemble and raises an exception if any of them fails. An example is below:

async def ochestrator(
        *,
        processor: queueing.WatchStreamProcessor,
        settings: configuration.OperatorSettings,
        identity: peering.Identity,
        insights: references.Insights,
        operator_paused: aiotoggles.ToggleSet,
) -> None:
    peering_missing = await operator_paused.make_toggle(name='peering CRD is missing')
    ensemble = Ensemble(
        peering_missing=peering_missing,
        operator_paused=operator_paused,
        operator_indexed=aiotoggles.ToggleSet(all),
    )
    try:
        async with insights.revised:
            while True:
                wait_for_insights_task = aiotasks.create_guarded_task(insights.revised.wait(), "wait-for-insights")
                done, pending = await aiotasks.wait([wait_for_insights_task, *ensemble.get_tasks(ensemble.get_keys())], return_when=asyncio.FIRST_COMPLETED)
                for task in done:
                    if task.exception() is not None:
                        raise task.exception()
                if wait_for_insights_task.done():
                    await adjust_tasks(
                        processor=processor,
                        insights=insights,
                        settings=settings,
                        identity=identity,
                        ensemble=ensemble,
                    )
    except asyncio.CancelledError:
        tasks = ensemble.get_tasks(ensemble.get_keys())
        await aiotasks.stop(tasks, title="streaming", logger=logger, interval=10)
        raise

With this change, the watcher tasks, whose exit status was not previously monitored, are monitored, and an exception in any of them causes the operator to exit.

I am not sure whether this approach has other side effects, though.
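The supervision idea can also be sketched independently of kopf with plain asyncio. This is only an illustration under stated assumptions: `supervise`, `watcher`, and `main` are hypothetical names, and it uses `asyncio.wait` directly rather than kopf's `aiotasks` helpers. Unlike the snippet above, it skips cancelled tasks before calling `exception()` and only waits on tasks that are still pending, which would presumably be needed to avoid spinning on tasks that have already finished.

```python
import asyncio


async def watcher() -> None:
    # Hypothetical stand-in for a watcher that hits an auth error.
    await asyncio.sleep(0.01)
    raise RuntimeError("Ran out of valid credentials")


async def supervise(tasks: set) -> None:
    # Returns once every task has finished cleanly; re-raises the first failure.
    pending = set(tasks)
    while pending:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            if task.cancelled():
                continue  # exception() raises CancelledError on a cancelled task
            exc = task.exception()
            if exc is not None:
                # Stop the remaining tasks before propagating the failure.
                for other in pending:
                    other.cancel()
                await asyncio.gather(*pending, return_exceptions=True)
                raise exc


async def main() -> str:
    healthy = asyncio.create_task(asyncio.sleep(10))
    failing = asyncio.create_task(watcher())
    try:
        await supervise({healthy, failing})
    except RuntimeError as exc:
        return f"operator exits: {exc}"
    return "all tasks finished"


result = asyncio.run(main())
print(result)
```

Here the failure of one task surfaces in the supervising coroutine almost immediately, instead of staying hidden inside a task object that nothing inspects.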

@james-mchugh james-mchugh added the bug Something isn't working label Dec 10, 2024
@james-mchugh
Author

For some additional context, the auth-related error I mentioned is the one trigger I found that reproduces this issue. However, it may not be the only trigger.

james-mchugh commented Dec 10, 2024

In theory, this can also be reproduced by dropping all of the signal handling and monkeypatching from the above code, and instead just deleting the service account (or removing its rolebinding/role) to trigger the bug.
