Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redis lock acquired prematurely leading to 500 internal server errors #1972

Open
ngshiheng opened this issue Jun 24, 2024 · 1 comment
Open
Labels
bug Something isn't working

Comments

@ngshiheng
Copy link
Contributor

ngshiheng commented Jun 24, 2024

Describe the bug
Background: we use Athens with Redis (as our SingleFlightMode) to manage concurrent write access to our storage (S3).

We have encountered an issue in our Athens setup where requests for certain modules and versions (e.g. github.com/aws/aws-sdk-go/@v/v1.12.42.info) has resulted in a 500 status code and a "redislock: not obtained" error in our logs.

Error Message

When users see 500 Internal Server Error. Our Athens log:

redislock: not obtained

When users see 404 Not Found. Our Athens log (expected):

github.com/aws/aws-sdk-go@v1.12.42: verifying go.mod: github.com/aws/aws-sdk-go@v1.12.42/go.mod: reading https://sum.golang.org/lookup/github.com/aws/aws-sdk-go@v1.12.42: 404 Not Found
   server response:
   not found: create zip: models/apis/sagemaker/2017-07-24/api-2.json: case-insensitive file name collision: "models/apis/SageMaker/2017-07-24/api-2.json" and "models/apis/sagemaker/2017-07-24/api-2.json"
   models/apis/sagemaker/2017-07-24/docs-2.json: case-insensitive file name collision: "models/apis/SageMaker/2017-07-24/docs-2.json" and "models/apis/sagemaker/2017-07-24/docs-2.json"

To Reproduce
Steps to reproduce the behavior:

  1. Clone this repository
  2. Start a Redis locally: docker run -d --name some-redis -p 6379:6379 redis
  3. Update config.dev.toml to SingleFlightType = "redis"
  4. Start Athens: make run
  5. To reproduce, rapidly send few requests to time curl http://localhost:3000/github.com/aws/aws-sdk-go/@v/v1.12.42.info. This took me ~30s on average. Here, you should see redislock: not obtained getting logged

Expected behavior
The expected behavior here is for Athens to return a 404 status code as this module does not exist. However, Athens is returning a 500 status code (with a Redis lock error logged on the server-side).

Environment

  • OS: macOS
  • Go version: go1.22.2 darwin/arm64
  • Proxy version : v0.14.1
  • Storage (fs/mongodb/s3 etc.): S3

Additional context

  1. Sending the same request to the public proxy https://proxy.golang.org/github.com/aws/aws-sdk-go/@v/v1.12.42.info yields immediate 404 (probably because it caches non-200 responses too)
  2. /github.com/aws/aws-sdk-go/@v/v1.12.42 is part of our transitive dependencies so there isn't a way to avoid it
  3. /github.com/aws/aws-sdk-go/@v/v1.12.42 is just an example, we've seen similar cases with other Go modules like: /cloud.google.com/go/storage/@v/v1.35.0.info, /github.com/go-redis/redis/@v/v6.15.3+incompatible.info, /gopkg.in/go-playground/validator.v9/@v/v9.30.1.info (... and more). Generally they seemed to be 'problematic' modules to run go mod download -json on

Further investigation

The root cause appears to be related to Athens obtaining the Redis lock before fetching from the upstream source (code) at module.downloadModule. In this specific case, fetching from the upstream took a significant amount of time, and obtaining the Redis lock prematurely has lead to lock contention and timeouts.

Upon further investigation, it was discovered that the bottleneck occurred within the goGetFetcher.Fetch function, which calls module.downloadModule (reference)

The module.downloadModule function is responsible for executing the go mod download -json github.com/aws/aws-sdk-go@v1.12.42. Here, the go mod download -json command took on average about 30 seconds to complete

Proposal

  • I propose that instead of obtaining the Redis lock before fetching from the upstream source, the lock should be acquired only after successfully fetching the module from upstream and before stashing it to the storage backend
  • This approach ensures that the lock is held for a minimal duration (reduces the chances of lock contention and timeouts)
@matt0x6F matt0x6F added the bug Something isn't working label Jun 25, 2024
@matt0x6F
Copy link
Contributor

@ngshiheng thanks for the report!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants