Function crash details not reported #323

Open
DCVortexxx opened this issue May 2, 2024 · 7 comments
Labels
kind/enhancement Improvements to existing feature.

Comments

@DCVortexxx

DCVortexxx commented May 2, 2024

Expected behavior

When a function crashes for an unexpected reason (fatalError, memory corruption...), the stack trace and details of the error are not stored/logged/reported to CloudWatch.

I'm fairly new to server-side Swift and AWS in general, so if I'm missing something, feel free to point it out. 🙏

Actual behavior

I would like more information about the crash in order to debug and fix crashes in my Lambda function.

The only details I can see in CloudWatch when my function crashes are:

RequestId: xyz Error: Runtime exited with error: signal: illegal instruction
Runtime.ExitError

Steps to reproduce

  1. Create a new lambda function
  2. Make your function crash on purpose (using fatalError for instance)
  3. Deploy the function
  4. Execute it

If possible, minimal yet complete reproducer code (or URL to code)

You can simply use the ErrorHandling example from this repository.

Send a .fatal request, causing a crash.
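
For readers without the repository at hand, the shape of such a reproducer can be sketched as follows. The names below (`RequestedFailure`, `UnmanagedError`, `handle`) are hypothetical stand-ins and not the code of the ErrorHandling example; the sketch only assumes that one request value leads to a `fatalError` call while the others return or throw.

```swift
// Hypothetical stand-in for a handler that fails in different ways on request.
enum RequestedFailure: String {
    case none, thrown, fatal
}

struct UnmanagedError: Error {}

func handle(_ failure: RequestedFailure) throws -> String {
    switch failure {
    case .none:
        return "ok"
    case .thrown:
        // A thrown error propagates to the caller, which can log and report it.
        throw UnmanagedError()
    case .fatal:
        // fatalError traps: the process is terminated on the spot,
        // so no Swift code in the caller gets a chance to report anything.
        fatalError("crash!")
    }
}
```

Calling `handle(.thrown)` inside `do`/`catch` gives the caller an error to log, whereas `handle(.fatal)` ends the process, which is consistent with the bare `Runtime.ExitError` line seen in CloudWatch.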

What version of this project (swift-aws-lambda-runtime) are you using?

1.0.0-alpha

Swift version

The lambda is archived in a docker container using the image swift:5.9.0-amazonlinux2, on the ubuntu-latest runner (x86_64 architecture).

Amazon Linux 2 docker image version

swift:5.9.0-amazonlinux2

@sebsto
Contributor

sebsto commented May 13, 2024

Hello,

This illegalInstruction error is most likely because you compiled for Arm64 and are executing on x64 (or the other way around).

If you compile on an Apple Silicon machine (M1 or newer), be sure to create a Lambda function that runs on the Arm64 architecture.

If you use SAM to deploy, this is a one-line change in your SAM template:

         Architectures:
            - arm64

If you created your function in the AWS console, there is a similar parameter you can set at function creation time.

@DCVortexxx
Author

Hi Sébastien,

Thanks for replying.

I don’t think that is it: the architectures do match, and the lambda works fine most of the time.

However, I do have a race condition or logic error that makes it crash from time to time, and I can’t get any information or stack trace on the AWS console (other than the illegal instruction message).

Since I can’t reproduce locally, it is a pain to debug, and I’m trying to figure out if there is any way to get the stack trace of the crash.

Thanks for your time!

Max.

@sebsto
Contributor

sebsto commented May 15, 2024

@DCVortexxx You're saying that the error is intermittent and that it works most of the time. That rules out an architecture mismatch.

You can try setting the environment variable LOG_LEVEL=trace in the Lambda environment. The runtime will produce more tracing, and the cause may be visible there.

Otherwise, we will need to modify the error handling to produce more verbose output in case of a runtime crash.

@DCVortexxx
Author

Hello @sebsto, and thanks for your reply.
Sorry about the delay. I set the log level and, to be honest, it slipped my mind for a couple of weeks.

Setting LOG_LEVEL=trace in the environment does increase the logging from the Lambda runtime SDK. Unfortunately, it does not include the stack trace showing why a function exited with an error.

It is mentioned in the lifecycle management section of the README that:

> By default, the library also registers a Signal handler that traps INT and TERM, which are typical Signals used in modern deployment platforms to communicate shutdown request.

What I would expect (or like) is that when such a signal is captured, the SDK gives the developer enough information about what happened to fix their own issue.

However, I'm not fully sure how signal trapping works and maybe what I'm asking is impossible.
If so, maybe we could have an environment variable to disable signal trapping, letting the program crash and access the stack trace as we would in any other program that crashes?
Once again, I'm not very experienced in that area, so feel free to correct me if I'm misunderstanding something or if what I'm asking is impossible.
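
As a rough illustration of why trapping INT and TERM may not help here: a handler only runs for the signals it was registered for, and the crash above is reported as an illegal-instruction signal (SIGILL), which is how Swift runtime traps such as `fatalError` typically surface on x86_64 Linux. The sketch below uses the POSIX `signal` and `raise` functions directly. It is not how the runtime registers its handlers, `SignalLog` is a hypothetical helper, and `SIGUSR1`/`SIGUSR2` stand in for the real signals so that the example can run without terminating the process.

```swift
#if canImport(Glibc)
import Glibc
#elseif canImport(Darwin)
import Darwin
#endif

// Hypothetical helper: remembers the last signal a handler saw.
enum SignalLog {
    static var lastHandled: Int32 = 0
}

// Register a handler for SIGUSR1 only (standing in for INT/TERM).
signal(SIGUSR1) { received in
    SignalLog.lastHandled = received
}
// Ignore SIGUSR2 so that raising it below does not end the process;
// an untrapped SIGILL, by contrast, terminates it.
signal(SIGUSR2, SIG_IGN)

raise(SIGUSR1)                           // the SIGUSR1 handler runs
let afterFirst = SignalLog.lastHandled   // SIGUSR1

raise(SIGUSR2)                           // no handler is registered for this signal
let afterSecond = SignalLog.lastHandled  // unchanged, still SIGUSR1
```

Under this reading, disabling the INT/TERM handlers would probably not change what is logged for a SIGILL crash, since those handlers are never invoked for it.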

On a side note, I've added (a lot of) logs in my own function as well to help me debug, and I managed to pinpoint the location of the crash.
Still unsure about what happens and how to fix it, but at least there's progress 🙃

@sebsto
Contributor

sebsto commented May 29, 2024

I'm not sure it's possible to print a stack trace when the binary is compiled in release mode. Binaries typically crash with an EXC_BAD_ACCESS error and nothing more. Can you reproduce the crash when executing locally in DEBUG mode?

Another debug strategy I often use is to capture the raw event (as string) passed to the runtime. Setting LOG_LEVEL=trace should allow you to capture the raw JSON. Then I verify if the JSON can be decoded by the corresponding Lambda Event struct.
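
That check can be sketched in a few lines. `OrderEvent` and the JSON payloads below are hypothetical placeholders for the function's real event type and for a raw event copied from the trace logs; only `JSONDecoder` from Foundation is used.

```swift
import Foundation

// Hypothetical event type; replace with the Lambda event struct in use.
struct OrderEvent: Codable, Equatable {
    let id: Int
    let item: String
}

// Returns the decoded event, or prints the decoding error and returns nil.
func decodeCapturedEvent(_ rawJSON: String) -> OrderEvent? {
    do {
        return try JSONDecoder().decode(OrderEvent.self, from: Data(rawJSON.utf8))
    } catch {
        print("Event does not match OrderEvent: \(error)")
        return nil
    }
}

// A payload that matches the struct, and one with a wrong type and a missing key.
let good = decodeCapturedEvent(#"{"id": 42, "item": "book"}"#)
let bad = decodeCapturedEvent(#"{"id": "forty-two"}"#)
```

The printed `DecodingError` names the offending key or type, which is usually enough to tell a malformed event from a problem elsewhere in the handler.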

Anyway, we're about to rewrite the Lambda runtime to accommodate Swift 6 strict concurrency and Service Lifecycle. I suggest not changing anything related to signal handling in this version, but rather taking this feedback into consideration for v2. @fabianfett wdyt?

sebsto added the kind/enhancement label May 29, 2024
@DCVortexxx
Author

DCVortexxx commented Jun 13, 2024

> I'm not sure it's possible to print a stack trace when the binary is compiled in release mode. Binaries typically crash with an EXC_BAD_ACCESS error and nothing more.

That makes sense indeed.
However, I don't think the final, uploaded binary actually crashes, since as the documentation states, the termination signal is trapped.
So I was thinking that maybe, in that case, the stack trace of where the signal happened would be available somewhere.
Once again, that's only an assumption, I'm definitely not an expert in that field.

> Can you reproduce the crash when executing locally in DEBUG mode?

No, I did not manage to reproduce it in debug mode, but according to my production logs, it happens ~0.05% of the time.
And since my project does not have a lot of users currently, the data is not that easy to get.

> Another debug strategy I often use is to capture the raw event (as string) passed to the runtime. Setting LOG_LEVEL=trace should allow you to capture the raw JSON. Then I verify if the JSON can be decoded by the corresponding Lambda Event struct.

Yeah, all good on that side, there's nothing distinctive about the event that could explain it.
With the same event content, 99.95% of the time, the lambda executes and terminates as expected, but 0.05% of the time, the lambda logs Runtime exited with error: signal: illegal instruction.

I managed to narrow it down to a call to URLSession.dataTask(with:completionHandler:).
I have some logs on the line just before that call, and some logs just after it, and the second ones are not shown.
The completion is not called either, of course.

This call is wrapped in a withCheckedContinuation so that it can be used with Swift concurrency, because the async URLSession API is not available on Linux.
I'm currently trying to simply get rid of this async-await wrapper to see if it improves things.
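
For context, a wrapper of this kind often looks like the sketch below. It is a guess at the general pattern and not the author's code: `awaitCompletion`, `fetchData` and `FetchError` are hypothetical names, and the throwing variant of the continuation API is used instead of withCheckedContinuation. One point worth checking in any such wrapper is that the continuation is resumed exactly once on every path: a checked continuation that is resumed twice stops the program with a runtime error, and one that is never resumed leaves the caller suspended.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLSession lives in this module on Linux
#endif

enum FetchError: Error { case emptyResponse }

// Bridges a completion-handler API into async/await.
// `start` must call its callback exactly once.
func awaitCompletion<T>(
    _ start: (@escaping (Result<T, Error>) -> Void) -> Void
) async throws -> T {
    try await withCheckedThrowingContinuation { continuation in
        start { result in continuation.resume(with: result) }
    }
}

func fetchData(from url: URL, session: URLSession = .shared) async throws -> Data {
    try await awaitCompletion { done in
        let task = session.dataTask(with: url) { data, _, error in
            // Exactly one branch runs, so `done` is called exactly once.
            if let error {
                done(.failure(error))
            } else if let data {
                done(.success(data))
            } else {
                done(.failure(FetchError.emptyResponse))
            }
        }
        task.resume()  // a data task does nothing until it is resumed
    }
}
```

Keeping the bridging logic in one small helper makes the "resume exactly once" rule easier to audit than when it is repeated at every call site.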

> Anyway, we're about to rewrite the Lambda runtime to accommodate Swift 6 strict concurrency and Service Lifecycle. I suggest not changing anything related to signal handling in this version, but rather taking this feedback into consideration for v2. @fabianfett wdyt?

That definitely makes sense.
Thanks again for your time!

@sebsto
Contributor

sebsto commented Nov 7, 2024

Hello @DCVortexxx,
Do you have time to test against the runtime from the main branch? This is our current dev version for the future v2.
