Cloudflare error: The script will never generate a response #195

Closed
rndquu opened this issue Nov 12, 2024 · 28 comments · Fixed by #199

@rndquu
Member

rndquu commented Nov 12, 2024

Most of the requests to the kernel in the latest main branch (v2.5.3) fail with these Cloudflare errors:

  • The script will never generate a response
  • Worker exceeded CPU time limit

That is why the production kernel was downgraded to v2.5.2 (in particular this commit) which works as expected.

When the Devcon conference ends (15th November) we should:

  1. Release a new kernel version
  2. Set the newly released kernel version as the "latest" in the Cloudflare dashboard of the ubiquity-os-kernel-main worker
  3. Check if the errors still persist
  4. If the errors are still there then find the root cause and fix it

Possible solution (originally posted by gentlementlegen):

I was digging into this a bit; sadly it works fine locally. I also stumbled upon this thread: https://github.com/cloudflare/workerd/issues/210

Another theory is that we recently changed all plugins to be called simultaneously, which may hit a Cloudflare limit or not resolve properly, which would not be an issue when run locally.

https://github.com/ubiquity-os/ubiquity-os-kernel/blob/6037f76c1ec2bad7abf34b6971b477b1109439c9/src/github/handlers/index.ts#L72

Maybe it's worth a try to use Promise.allSettled, or move it back to sequential calls, and see if that clears the issue.
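For illustration, a minimal sketch of what the Promise.allSettled variant could look like (Plugin and dispatchPlugin are illustrative placeholders, not the kernel's actual types):

type Plugin = { name: string };

// Placeholder for whatever the kernel does to invoke a plugin (worker fetch or workflow dispatch).
async function dispatchPlugin(plugin: Plugin, payload: unknown): Promise<void> {
  // ...
}

async function dispatchAll(plugins: Plugin[], payload: unknown): Promise<void> {
  // allSettled never rejects, so a failing plugin cannot leave an unhandled
  // rejection (i.e. a dangling promise) behind in the worker.
  const results = await Promise.allSettled(plugins.map((p) => dispatchPlugin(p, payload)));
  results.forEach((result, i) => {
    if (result.status === "rejected") {
      console.error(`Plugin ${plugins[i].name} failed:`, result.reason);
    }
  });
}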

A new workroom has been created for this task. Join chat

@github-project-automation github-project-automation bot moved this to Done in Ubiquity Nov 14, 2024
@rndquu rndquu removed this from Development Nov 15, 2024
@rndquu rndquu reopened this Nov 27, 2024
@rndquu rndquu removed their assignment Nov 27, 2024
@rndquu
Member Author

rndquu commented Nov 27, 2024

Reopening this one because the error is back and the bot is unresponsive from time to time.

@gentlementlegen FYI

@Keyrxng
Contributor

Keyrxng commented Nov 27, 2024

https://developers.cloudflare.com/workers/platform/limits/#simultaneous-open-connections says 6 simultaneous open connections is the limit, and we currently have 6 worker plugins installed, so perhaps we need to start batching requests and/or handling connections/limits (see the sketch below).

https://developers.cloudflare.com/workers/ai/

"The script will never generate a response": This error occurs when the Workers runtime detects that all the code associated with the request has executed, but no events are left in the event loop, and a Response has not been returned. This is often caused by unresolved Promises or WebSocket connections that are never closed.

Worker exceeded CPU time limit: This error occurs when a Cloudflare Worker exceeds the allowed CPU time limit. CPU time is the time spent executing code, such as loops or parsing JSON, and does not include time spent on network requests. This error is thrown when the Worker's CPU usage exceeds the allowed limit, indicating that the code is not optimized for performance.
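If batching turns out to be necessary, a simple chunked dispatch could look like this (a sketch only; dispatchInBatches and the batch size of 6 are illustrative, not part of the kernel):

async function dispatchInBatches<T>(items: T[], batchSize: number, dispatch: (item: T) => Promise<void>): Promise<void> {
  for (let i = 0; i < items.length; i += batchSize) {
    // At most batchSize requests are in flight at once, keeping us under the
    // simultaneous open connections limit.
    await Promise.allSettled(items.slice(i, i + batchSize).map(dispatch));
  }
}

// e.g. await dispatchInBatches(plugins, 6, (p) => dispatchPlugin(p, payload));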

@gentlementlegen
Member

A lot of changes occurred in the plugins and the kernel due to the LLM addition. It is possible that some async code is breaking. I do not know that part of the codebase very well, so I would need to investigate.

@whilefoo rfc

@gentlementlegen
Member

I tried replicating this on my own Cloudflare account, unsuccessfully. I am not keen on working directly within this org, but I might have no choice.

@rndquu
Member Author

rndquu commented Nov 29, 2024

What if we wrap this line in https://developers.cloudflare.com/workers/runtime-apis/context/#waituntil? As far as I understand, waitUntil extends the worker's lifetime by up to 30 seconds.
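A minimal sketch of that idea, assuming the Workers modules syntax and @cloudflare/workers-types (handleEvent and Env are placeholders for the kernel's actual handler and bindings):

type Env = Record<string, unknown>; // placeholder for the kernel's bindings

async function handleEvent(request: Request, env: Env): Promise<void> {
  // ...verify the webhook signature and dispatch plugins here
}

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    // Respond to GitHub right away; waitUntil keeps the Worker alive
    // (up to ~30 s of wall-clock time) so the dispatch can finish in the background.
    ctx.waitUntil(handleEvent(request.clone(), env));
    return new Response("ok", { status: 200 });
  },
};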

@gentlementlegen
Member

gentlementlegen commented Nov 29, 2024

Worth a try, if the reason is that our workers take too much time to execute. It could also be an async exception that is not caught.

I think the following can be investigated:

  • exceptions during async operations
  • disabling plugins and seeing if a specific one is breaking
  • disabling LLM

@gentlementlegen
Member

@rndquu after merging the changes it seems to run. Let's keep monitoring.


+ Evaluating results. Please wait...


! No price label has been set. Skipping permit generation.

@rndquu
Member Author

rndquu commented Nov 29, 2024

@rndquu after merging the changes it seems to run. Let's keep monitoring.

Shouldn't we release a new kernel version in the main branch? Last time the kernel also worked fine for some time but then suddenly started throwing errors.

@gentlementlegen
Member

@rndquu Yes, we can. We should also release all the plugins, otherwise they won't work; I can take care of it.

@rndquu rndquu removed this from Development Nov 30, 2024
@gentlementlegen
Member

@rndquu Sad update about this: when overloaded, the Cloudflare instance got stuck and stopped taking any more actions, probably because waitUntil never let the instance's run end. I reverted this in the meantime; we should figure out the root cause of the script that will never generate a response.

I also noticed that, due to the caching of the manifest here, multiple child instances spawned by Cloudflare could try to read and write that variable at the same time, which could be the cause of the I/O issues silently crashing the run. I'll try to debug that. The problem is that I cannot see exactly what CF does behind the scenes with fetch and the spawning of children, which cannot really be replicated locally.
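For reference, the pattern in question is roughly this kind of module-scope cache (illustrative, not the kernel's actual code): the Map lives for the isolate's lifetime and is shared by every concurrent request that isolate handles.

const manifestCache = new Map<string, string>();

async function getManifest(url: string): Promise<string> {
  const cached = manifestCache.get(url);
  if (cached !== undefined) return cached;

  const res = await fetch(url);
  const text = await res.text();
  // Two concurrent requests can both miss and both fetch, but storing only the
  // resulting string (never the Response object) avoids holding request-scoped
  // I/O objects across requests.
  manifestCache.set(url, text);
  return text;
}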

@rndquu
Member Author

rndquu commented Dec 2, 2024

@gentlementlegen Last 24 hours were pretty much stable, only 4 errors out of ~8.6k requests:
(screenshot: Cloudflare requests/errors dashboard)

@gentlementlegen
Member

gentlementlegen commented Dec 2, 2024

@rndquu It has been steadily broken for the past hour: once one plugin didn't complete its run, all the subsequent calls got broken and nothing would respond to any command. I was preparing the bot for a demo and nothing would work.


Without this I see that we get a lot of Worker exceeded CPU time limit errors. With it, if any run hangs, it breaks indefinitely. There should be a way to set timeouts somewhere.


I tried adding passThroughOnException to let the workers fail open instead of hanging forever; let's see if that helps.
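For context, enabling it looks roughly like this in the modules syntax (handleEvent and Env are placeholders); note it only has somewhere to "fail open" to when the Worker runs in front of an origin:

type Env = Record<string, unknown>; // placeholder

async function handleEvent(request: Request, env: Env): Promise<Response> {
  return new Response("ok"); // ...webhook handling would go here
}

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    // On an uncaught exception, proxy the request to the origin instead of
    // returning the Worker's error response.
    ctx.passThroughOnException();
    return handleEvent(request, env);
  },
};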


Same result: the worker will hang forever.

@rndquu rndquu reopened this Dec 2, 2024
@rndquu
Member Author

rndquu commented Dec 2, 2024

Ok, so the The script will never generate a response error is solely Cloudflare related. It's thrown because somewhere there's an unresolved promise living in the global JS context.

The thing is that the unresolved promise (causing the error) may live in any 3rd-party npm package (I've checked the kernel code sources and haven't found any promises in the global scope). So any such import, for example import { EmitterWebhookEvent } from "@octokit/webhooks";, combined with 3rd-party package code like this (hypothetical example):

// hypothetical module-scope code inside a 3rd-party package
const myVar: Promise<string> = new Promise(() => {
  // never resolves
});

class EmitterWebhookEvent {
  // ... some code
}

will cause the worker to throw the The script will never generate a response error under heavy request load, when multiple requests end up on the same worker instance (physically).

Even if we find the root cause right now, there's no guarantee that in the future, with any new npm package import, we won't get the same error again, wasting tons of time on debugging.

I think the right strategy right now is to redeploy the kernel on some other platform (I've read that https://vercel.com/ doesn't have such issues) and, at first, check how it works.

@gentlementlegen
Member

Among the unresolved promises, the only one that comes to mind on our side is when Cloudflare makes a fetch request from a spawned child, which causes an I/O error and leaves the promise unresolved. This happens once in a while during the fetch of the manifests. The problem is that we do not store the Response itself but only the returned string, so I cannot grasp why this error appears. My only guess is that the variable instance creates a race condition between spawned children. But fixing this would mean using either KV or some external storage, which is something we tried to avoid.

I can deploy an instance to Vercel and see how it goes, why not.

@0x4007
Member

0x4007 commented Dec 2, 2024

Given that we are working towards a partnership with Microsoft, we should explore the use of Azure instead.

@whilefoo
Contributor

whilefoo commented Dec 2, 2024

I think we should immediately switch to Hono (I'll start working on this), which can run on basically any platform, so we are not dependent on one provider and can switch if issues appear that are not easily solved, like this one. After we get it stable on another platform we have more time to find the root cause.

@rndquu Last time you reverted to a previous commit that seemed not to have this problem, right? Which commit is that? We should compare what was added to the code after that commit, and maybe we can find what is causing it.

@rndquu
Member Author

rndquu commented Dec 2, 2024

I think we should immediately switch to Hono

If https://hono.dev/ allows keeping the same codebase for Cloudflare / Vercel / Azure then it's a good idea.
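For reference, a minimal Hono app is essentially platform-agnostic: the same app object can be exported directly for Cloudflare Workers or wrapped with an adapter (@hono/node-server, the Vercel adapter, etc.) elsewhere. A rough sketch, with the route handling purely illustrative:

import { Hono } from "hono";

const app = new Hono();

app.get("/", (c) => c.text("OK"));

app.post("/", async (c) => {
  const event = c.req.header("x-github-event");
  const payload = await c.req.json();
  // ...verify the signature and dispatch plugins here
  return c.json({ ok: true, event });
});

export default app; // works as a Cloudflare Workers module entry point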

Last time you reverted to a previous commit that seemed to not have this problem, right? Which commit is that? We should compare what was added in the code after that commit and maybe we can find what is causing it.

This is not "specific commit" related. If you switch the production kernel deployment in the Cloudflare dashboard to any commit, then the kernel, at first, works without errors, but after some time (under heavy load?) it starts throwing the The script will never generate a response error.

@rndquu rndquu mentioned this issue Dec 2, 2024
@gentlementlegen
Member

The latest change we made is here. It allows for longer runs, because otherwise we get a lot of "Exceeded CPU max duration" errors, which cancel the whole run (thus nothing happens). But having this seemed to introduce runs that would hang indefinitely. I also tried removing the caching mechanism for the manifests, but then we would exceed the maximum allowed fetch calls.

@gentlementlegen
Member

Example of run hanging:
ubiquity-os/plugins-wishlist#2 (comment)

Logs on Cloudflare

Somehow after "Events issues received" no callback is summoned.

@whilefoo
Contributor

whilefoo commented Dec 2, 2024

It allows for longer runs because otherwise we get a lot of "Exceeded CPU max duration" which cancels the whole run (thus nothing happens).

We get only 10 ms of CPU time on the free plan, so it is quite little. However, it seems weird to me that waitUntil would increase CPU time: the free plan has a 10 ms limit and the paid plan has a 30 s limit, which would mean we get the "paid" plan for free?

One detail I noticed in Cloudflare docs:

As long as the client that sent the request remains connected, the Worker can continue processing, making subrequests, and setting timeouts on behalf of that request. When the client disconnects, all tasks associated with that client request are canceled.

GitHub states that the webhook timeout is 10 seconds, so when that happens Cloudflare will cancel the worker; if our kernel takes more than 10 s, it will time out. This means we have to use waitUntil to prevent this timeout.

@gentlementlegen gentlementlegen mentioned this issue Dec 2, 2024
@gentlementlegen
Member

The slow part is that we sequentially summon the plugins. We could add back the Promise.all but it might lead to the I/O errors again.

I've spent the night testing Azure; we could test it as an alternative. Here is the related PR (not cleaned up nor finished, but the endpoint is working):

https://github.com/ubiquity-os/ubiquity-os-kernel/pull/214/files
https://ubiquity-os.azurewebsites.net

Even if we do not choose CF, the switch to Hono will be beneficial. Azure has a timeout of 5 minutes on functions, which should keep us covered, and it does not rely on V8 isolates but on a fully fledged Node.js instance.

@whilefoo
Contributor

whilefoo commented Dec 3, 2024

The slow part is that we sequentially summon the plugins. We could add back the Promise.all but it might lead to the I/O errors again.

I think I understand how time works in Cloudflare. As long as the client is connected to the worker, there's no real-time timeout, only the 10 ms CPU time limit. When you call waitUntil, the response is returned, but the background task has 30 real-time seconds to finish up (the 10 ms CPU limit still applies).

This means that if we use waitUntil and do sequential processing of plugins, we could easily reach 30 real-time seconds. If we don't use waitUntil, we have 10 seconds until GitHub disconnects, which also cancels the worker.

One option we could look at is Cloudflare Queues. They have 30 seconds of CPU time and 15 minutes of real time, and they don't mention any subrequest limit like in Workers. I'm not sure about the latency between the worker and the start of execution in the consumer.
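A rough sketch of what the Queues shape could look like (binding and handler names are illustrative; the queue binding would be configured in wrangler.toml):

interface Env {
  WEBHOOK_QUEUE: Queue;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // The fetch handler only enqueues the webhook and acknowledges GitHub immediately.
    const body = await request.text();
    await env.WEBHOOK_QUEUE.send({ headers: Object.fromEntries(request.headers), body });
    return new Response("queued", { status: 202 });
  },

  async queue(batch: MessageBatch, env: Env): Promise<void> {
    // The consumer gets a much larger CPU/wall-clock budget for plugin dispatch.
    for (const message of batch.messages) {
      // ...verify and dispatch plugins based on message.body here
      message.ack();
    }
  },
};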

We could switch to Azure permanently; however, there are probably some downsides, like higher boot times (Azure has a cold boot time, Cloudflare does not).

@gentlementlegen
Member

@whilefoo that makes sense. Queues are a paid service afaik, so it seems we've reached the free plan's limits either way.

Azure does not have a cold boot time on some plans; I do not know how slow it gets, I'll experiment.

@gentlementlegen
Member

Closing as we have Azure and Cloudflare has been stable again lately.

@gentlementlegen gentlementlegen closed this as not planned Dec 8, 2024
@rndquu rndquu removed this from Development Dec 9, 2024