Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Credentials JSONFileCache explodes if cache file has malformed JSON #3106

Open
mildmojo opened this issue Jan 25, 2024 · 7 comments
Open

Credentials JSONFileCache explodes if cache file has malformed JSON #3106

mildmojo opened this issue Jan 25, 2024 · 7 comments
Labels
bug This issue is a confirmed bug. confusing-error credentials p2 This is a standard priority issue

Comments

@mildmojo
Copy link

mildmojo commented Jan 25, 2024

Describe the bug

The JSONFileCache class __getitem__ accessor tries to load JSON from a file, but does not handle syntax errors in that file, raising an exception that bubbles up to the client application.

Given this is a cache, a failure to read from the cache should be treated as a cache miss, not an error.

Expected Behavior

In botocore/credentials.py, in the CachedCredentialFetcher class, I expected _load_from_cache() to return None if the cache was invalid, causing credentials to be re-fetched and the cache to be rebuilt.

Current Behavior

I found this while using awscli, which consumes botocore. When a credentials cache file was malformed, every command using that particular AWS account failed super cryptically, printing the bad file's cache key to STDERR:

$ aws ecr get-login-password --profile root

'c09eaf18178406053ca8802bc45deadbeefdc57a'

$ echo $?
255

Using a debugger, I located the failure at utils.py:3518, when botocore failed to parse a cache file.

Cache file contents (prettified here):

{
    "Credentials":
    {
        "AccessKeyId": "<redacted>",
        "SecretAccessKey": "<redacted>",
        "SessionToken": "<redacted>",
        "Expiration": "2024-01-24T02:49:26+00:00"
    },
    "AssumedRoleUser":
    {
        "AssumedRoleId": "<redacted>:botocore-session-1706060963",
        "Arn": "arn:aws:sts::<redacted>:assumed-role/Admin/botocore-session-1706060963"
    },
    "PackedPolicySize": 6,
    "ResponseMetadata":
    {
        "RequestId": "<redacted>",
        "HTTPStatusCode": 200,
        "HTTPHeaders":
        {
            "x-amzn-requestid": "<redacted>",
            "content-type": "text/xml",
            "content-length": "1569",
            "date": "Wed, 24 Jan 2024 01:49:26 GMT"
        },
        "RetryAttempts": 1
    }
}: true, "RetryAttempts": 2}}

(The bit at the bottom, : true, "RetryAttempts": 2}} invalidates the JSON.)

Reproduction Steps

Here's a minimal reproduction:

# json-cache-repro.py

from botocore.utils import JSONFileCache
from botocore.credentials import CachedCredentialFetcher
from os import remove

CACHE_KEY = 'foo-botocore-test'
CACHE_DIR = '/tmp'
CACHE_FILE = f'{CACHE_DIR}/{CACHE_KEY}.json'

class TestCredentialsFetcher(CachedCredentialFetcher):
    def _create_cache_key(_):
        return CACHE_KEY

with open(CACHE_FILE, 'w') as f:
    f.write('hello')

cache = JSONFileCache(working_dir=CACHE_DIR)

fetcher = TestCredentialsFetcher(cache=cache)

try:
    fetcher.fetch_credentials()
finally:
    remove(CACHE_FILE)

Example run:

$ python -m venv json-cache-repro
$ . json-cache-repro/bin/activate
(json-cache-repro) $ pip install botocore

(json-cache-repro) $ python3 json-cache-repro.py
Traceback (most recent call last):
  File "/Users/me/json-cache-repro/lib/python3.11/site-packages/botocore/utils.py", line 3518, in __getitem__
    return json.load(f)
           ^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.7_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/json/__init__.py", line 293, in load
    return loads(fp.read(),
           ^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.7_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.7_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.7_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/me/json-cache-repro.py", line 17, in <module>
    fetcher.fetch_credentials()
  File "/Users/me/json-cache-repro/lib/python3.11/site-packages/botocore/credentials.py", line 685, in fetch_credentials
    return self._get_cached_credentials()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/me/json-cache-repro/lib/python3.11/site-packages/botocore/credentials.py", line 693, in _get_cached_credentials
    response = self._load_from_cache()
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/me/json-cache-repro/lib/python3.11/site-packages/botocore/credentials.py", line 711, in _load_from_cache
    creds = deepcopy(self._cache[self._cache_key])
                     ~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/Users/me/json-cache-repro/lib/python3.11/site-packages/botocore/utils.py", line 3520, in __getitem__
    raise KeyError(cache_key)
KeyError: 'foo-botocore-test'

Possible Solution

Workaround

Delete the affected cache file, or delete all files in the cache dir (e.g. ~/.aws/cli/cache/).

Fix

This is tricky. JSONFileCache doesn't know what it's storing, and CachedCredentialFetcher doesn't know what kind of cache it's using. CachedCredentialFetcher assumes that if a key exists in the cache, it's valid to read, which is incorrect for JSONFileCache.

Suggestion:

  1. In JSONFileCache#__getitem__, catch (OSError, ValueError, json.decoder.JSONDecodeError) and re-raise any of them as a new CacheMissError.
  2. In CachedCredentialFetcher#_load_from_cache, catch CacheMissError and return None. This will hopefully trigger a credential refresh that will overwrite the invalid cache file.

Triage

If the maintainers prefer to raise exceptions here (maybe there's a lot of extra work to change behavior), in the meantime, the exception message in JSONFileCache#__getitem__ could be improved to suggest manual cleanup:

try:
    with open(actual_key) as f:
        return json.load(f)
except (OSError, ValueError, json.decoder.JSONDecodeError):
    raise KeyError(f'Cache file {actual_key} may be corrupt, please delete it.')

Additional Information/Context

It took me most of a day to track this down when it started causing awscli errors. I initially thought it was an AWS permissions issue, or a problem with my local credentials, or corrupted awscli configuration, or my environment, or something a reboot might fix. None of my teammates could reproduce it.

I was disappointed to find out it was a cache error!

Thanks for your consideration.

SDK version used

v1.34.28 (also observed in v2.0.0dev155, installed as awscli v2.15.13 dependency via homebrew)

Environment details (OS name and version, etc.)

MacOS 14.2.1 (arm64)

@mildmojo mildmojo added bug This issue is a confirmed bug. needs-triage This issue or PR still needs to be triaged. labels Jan 25, 2024
@mildmojo
Copy link
Author

Thinking about it last night, I also believe the current code will never heal a corrupted cache file on its own. With the current strategy, a cache entry is only updated if it doesn't exist yet (cache miss) or the data can be parsed and evaluated for expiration. A cache file that can't be read or parsed will break the library forever unless there's manual intervention to clear the cache on disk, and that intervention is less likely because the exceptions raised by the cache class don't contain enough information to suggest a solution.

@tim-finnigan tim-finnigan self-assigned this May 20, 2024
@tim-finnigan tim-finnigan added the investigating This issue is being investigated and/or work is in progress to resolve the issue. label May 20, 2024
@tim-finnigan
Copy link
Contributor

Thanks for reaching out and for your patience here. I could reproduce the errors you referenced using your code snippet. I will plan to bring this up with the team for discussion.

@tim-finnigan tim-finnigan added needs-review This issue or pull request needs review from a core team member. p2 This is a standard priority issue and removed investigating This issue is being investigated and/or work is in progress to resolve the issue. needs-triage This issue or PR still needs to be triaged. labels May 20, 2024
@tim-finnigan
Copy link
Contributor

After discussing with the team, we think that there may be some opportunities to improve the behavior here. I created a PR with the error message you suggested for the team to consider: #3183. Further investigation may still be needed in terms of when the error occurs in Botocore/the AWS CLI.

@tim-finnigan tim-finnigan removed their assignment May 21, 2024
@tim-finnigan tim-finnigan added credentials confusing-error and removed needs-review This issue or pull request needs review from a core team member. labels May 21, 2024
@mikelambert
Copy link

It looks like #3213 was closed in favor of this one, but this one looks to previously be about a confusing error message (and proposed fix to the message).

Is it safe to assume this one is now also about fixing the race condition (potentially with a flock) to avoid corrupting the cache file at all? Let me know if it’d be more helpful to submit an untested PR with a proposed fix. Thank you!

@tim-finnigan
Copy link
Contributor

Thanks for following up, this issue is more focused on the error message so I can reopen #3213 for further review/discussion. You can submit a PR for this but I can't provide any guarantees on if or when it might be considered. I think more research is needed here into what can cause JSONFileCache to break and how that may be addressed.

@mildmojo
Copy link
Author

I think the error message is a good triage, and it's definitely worth exploring corruption root causes. But long-term, I also think it's worth making cache failures non-fatal. Maybe a warning, if there's value in communicating that this optimization has failed.

@amberkushwaha
Copy link

think its been done in it for the mean time values for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a confirmed bug. confusing-error credentials p2 This is a standard priority issue
Projects
None yet
Development

No branches or pull requests

4 participants