Cloudflare error: The script will never generate a response
#195
A new workroom has been created for this task. Join chat
Reopening this one because the error is back and the bot is unresponsive from time to time. @gentlementlegen FYI
https://developers.cloudflare.com/workers/platform/limits/#simultaneous-open-connections says the limit is 6 simultaneous open connections, and we currently have 6 worker plugins installed, so we may need to start batching requests and/or handling connections and limits. https://developers.cloudflare.com/workers/ai/
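For illustration, a minimal TypeScript sketch (not taken from the kernel) of what batching the plugin calls under the 6-connection limit could look like; `dispatchInBatches`, `pluginUrls`, and the payload shape are hypothetical names:

```typescript
// Hypothetical sketch: keep at most MAX_OPEN simultaneous outbound
// connections by dispatching plugin calls in fixed-size batches.
// `pluginUrls` and the payload shape are illustrative, not the kernel's API.
const MAX_OPEN = 6;

async function dispatchInBatches(pluginUrls: string[], payload: unknown): Promise<Response[]> {
  const results: Response[] = [];
  for (let i = 0; i < pluginUrls.length; i += MAX_OPEN) {
    const batch = pluginUrls.slice(i, i + MAX_OPEN);
    // Each batch opens at most MAX_OPEN connections at once.
    const responses = await Promise.all(
      batch.map((url) =>
        fetch(url, {
          method: "POST",
          headers: { "content-type": "application/json" },
          body: JSON.stringify(payload),
        })
      )
    );
    results.push(...responses);
  }
  return results;
}
```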
A lot of changes occurred in the plugins and the kernel due to the LLM addition. It is possible that some async code is breaking. I do not know that part of the codebase really well, so I would need to investigate. @whilefoo rfc
I tried replicating this on my Cloudflare account, unsuccessfully. I am not keen on working directly within this org, but I might have no choice.
What if we wrap this line into https://developers.cloudflare.com/workers/runtime-apis/context/#waituntil? As far as I understand …
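For reference, a minimal sketch of what wrapping the dispatch in `ctx.waitUntil` could look like, assuming a module-syntax worker (the `ExecutionContext` type comes from `@cloudflare/workers-types`; `dispatchPlugins` is a hypothetical stand-in for the kernel's actual dispatch call):

```typescript
// Hypothetical sketch: return a response to GitHub immediately and let the
// plugin dispatch continue in the background via ctx.waitUntil, so the
// worker is not torn down as soon as the response is sent.
export default {
  async fetch(request: Request, env: unknown, ctx: ExecutionContext): Promise<Response> {
    const event = await request.json();
    // `dispatchPlugins` stands in for the kernel's real dispatch logic.
    ctx.waitUntil(dispatchPlugins(event).catch((err) => console.error("dispatch failed", err)));
    return new Response("ok", { status: 200 });
  },
};

declare function dispatchPlugins(event: unknown): Promise<void>;
```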
Worth a try, if the reason is that our workers take too much time to execute. It could also be an async exception that is not caught. I think the following can be investigated: …
@rndquu after merging the changes it seems to run. Let's keep monitoring.
```diff
+ Evaluating results. Please wait...
! No price label has been set. Skipping permit generation.
```
Shouldn't we release a new kernel version in the …?
@rndquu yes we can. We should also release all the plugins, otherwise they won't work; I can take care of it.
@rndquu Sad update about this: when overloaded, the Cloudflare instance got stuck and stopped taking any more actions, probably due to the … I also noticed that, because of the caching of the manifest here, multiple instances of children spawned by Cloudflare could try to read and write to that variable at the same time, which could be the cause of IO issues silently crashing the run. I'll try to debug that. The problem is that I cannot see exactly what CF does behind the scenes with …
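For context, a hedged sketch of the module-scope manifest cache pattern being described here, caching only the resolved string; `manifestCache` and `getManifest` are illustrative names, not the kernel's actual code:

```typescript
// Hypothetical sketch of a module-scope manifest cache. Only the resolved
// string is stored, never the Response or an in-flight promise, so nothing
// tied to a specific request's I/O is shared across events; concurrent
// events at worst fetch the same manifest twice.
const manifestCache = new Map<string, string>();

async function getManifest(url: string): Promise<string> {
  const cached = manifestCache.get(url);
  if (cached !== undefined) return cached;

  const res = await fetch(url);
  if (!res.ok) throw new Error(`manifest fetch failed: ${res.status}`);
  const text = await res.text();

  manifestCache.set(url, text); // last writer wins, which is harmless for plain strings
  return text;
}
```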
@gentlementlegen The last 24 hours were pretty much stable, only 4 errors out of ~8.6k requests:
@rndquu It has been steadily broken for the past hour: once one plugin didn't complete its run, all the subsequent calls got broken and nothing would respond to any command. I was preparing the bot for a demo and nothing would work. Without this I see that we have a lot of … I tried adding … Same result: the worker will hang forever.
Ok, so the … The thing is that the unresolved promise (causing the error) may live in any 3rd-party npm package (I've checked the kernel code sources and haven't found any promises in the global scope). So any such import, for example …, will cause the worker to throw the … error. Even if we find the root cause right now, there's no guarantee that in the future, on any new npm package import, we won't get the same error again, wasting tons of time on debugging. I think the right strategy right now is to redeploy the kernel on some other platform (I've read that https://vercel.com/ doesn't have such issues) and, at first, check how it works.
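To illustrate the kind of unresolved-promise hazard being described (this is not the kernel's code): a promise created while handling one request, stashed in module scope, and awaited by a later request is the sort of pattern Workers may never be able to resolve, which can surface as this error.

```typescript
// Hypothetical illustration of the failure mode, not actual kernel code.
// The fetch promise is created inside request A but cached at module scope;
// if request B awaits it, the response depends on I/O started on behalf of
// a different request, which the Workers runtime may never resolve.
let pending: Promise<string> | undefined;

export default {
  async fetch(request: Request): Promise<Response> {
    if (!pending) {
      pending = fetch("https://example.com/manifest.json").then((r) => r.text());
    }
    const body = await pending; // may hang forever for a request that did not create it
    return new Response(body);
  },
};
```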
Regarding the unresolved promises, the only one that comes to mind on our side is when Cloudflare makes a fetch request from a spawned child, which causes an IO error and leaves the promise unresolved. This happens once in a while during the fetch of the manifests. The problem is that we do not store the Response itself but only the returned string, so I cannot grasp why this error appears. My only guess is that the variable instance creates a race condition between spawned children. But this would mean either using KV or some external storage, which is something we tried to avoid. I can deploy an instance to Vercel and see how it goes, why not.
Given that we are working towards a partnership with Microsoft, we should explore the use of Azure instead. |
I think we should immediately switch to Hono (I'll start working on this), which can run on basically any platform, so we are not dependent on one provider and can switch if issues that are not easily solved, like this one, appear. After we get it stable on another platform, we have more time to find the root cause. @rndquu Last time you reverted to a previous commit that seemed not to have this problem, right? Which commit is that? We should compare what was added to the code after that commit and maybe we can find what is causing it.
If https://hono.dev/ allows keeping the same codebase for Cloudflare / Vercel / Azure then it's a good idea.
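For what it's worth, a minimal sketch of a portable Hono entry point; the route and handler body are hypothetical, not the kernel's actual routing:

```typescript
// Hypothetical sketch: one Hono app usable as a Cloudflare Workers fetch
// handler; the same `app` can also be served on Node-based hosts.
import { Hono } from "hono";

const app = new Hono();

app.post("/", async (c) => {
  const event = await c.req.json();
  // webhook handling / plugin dispatch would go here
  return c.json({ ok: true });
});

// Cloudflare Workers entry point.
export default app;
```

On a Node-based host (for example an Azure container), the same app could be served with `@hono/node-server`, e.g. `serve({ fetch: app.fetch, port: 3000 })`, which is what would make a provider switch cheap.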
This is not "specific commit related". If you switch the production kernel deploy in the Cloudflare dashboard to any commit, the kernel at first works without errors, but after some time (under heavy load?) it starts throwing the same errors again.
The latest change we made is here. It allows for longer runs, because otherwise we get a lot of "Exceeded CPU max duration" errors which cancel the whole run (thus nothing happens). But having this seemed to introduce runs that hang indefinitely. I also tried removing the caching mechanism for the manifests, but then we would exceed the maximum allowed number of fetch calls.
Example of a hanging run: somehow, after "Events issues received", no callback is invoked.
We only get 10 ms of CPU time on the free plan, so it is quite little; however, it seems weird to me that … One detail I noticed in the Cloudflare docs:
GitHub states that the webhook timeout is 10 seconds, and when that timeout hits, Cloudflare will cancel the worker; so if our kernel takes more than 10 s it will be cut off. This means we have to use waitUntil.
The slow part is that we summon the plugins sequentially. We could add back the … I've spent the night testing Azure; we could test it as an alternative. Here is the related PR (not cleaned up nor finished, but the endpoint is working): https://github.com/ubiquity-os/ubiquity-os-kernel/pull/214/files Even if we do not choose CF, the switch to …
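Since the sequential dispatch is identified as the slow part, here is a hedged sketch of dispatching the plugins concurrently with `Promise.allSettled`, so a single failing plugin does not stall the others; the `PluginTarget` shape is hypothetical, and this could be combined with the batching idea from earlier to stay under the connection limit.

```typescript
// Hypothetical sketch: dispatch plugins concurrently instead of one after
// another, and isolate failures so a single broken plugin does not block
// the rest of the run. `PluginTarget` is an illustrative shape.
interface PluginTarget {
  name: string;
  url: string;
}

async function dispatchAll(plugins: PluginTarget[], payload: unknown): Promise<void> {
  const results = await Promise.allSettled(
    plugins.map((plugin) =>
      fetch(plugin.url, {
        method: "POST",
        headers: { "content-type": "application/json" },
        body: JSON.stringify(payload),
      })
    )
  );

  results.forEach((result, index) => {
    if (result.status === "rejected") {
      console.error(`plugin ${plugins[index].name} failed`, result.reason);
    }
  });
}
```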
I think I understand how time works in Cloudflare. This means that if we use waitUntil and we do sequential processing of plugins, we could easily reach 30 real-time seconds. One option we could look at is Cloudflare Queues. They have 30 seconds of CPU time and 15 minutes of real time, and they don't mention any subrequest limit like in Workers. I'm not sure about the latency between the worker and the start of execution in the consumer. We could switch to Azure permanently, however there are probably some downsides like higher boot times (Azure has a cold boot time, Cloudflare does not).
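A hedged sketch of the Queues idea: the webhook handler only enqueues the event, and a queue consumer does the slow plugin dispatch with its larger CPU and wall-clock budget. The binding name `KERNEL_QUEUE` and `dispatchPlugins` are illustrative, and the `Queue`/`MessageBatch` types come from `@cloudflare/workers-types`.

```typescript
// Hypothetical sketch: the producer enqueues the webhook event, and the
// consumer processes it outside the original request with Queues' limits.
interface Env {
  KERNEL_QUEUE: Queue;
}

export default {
  // Producer: acknowledge GitHub quickly and push the event onto the queue.
  async fetch(request: Request, env: Env): Promise<Response> {
    const event = await request.json();
    await env.KERNEL_QUEUE.send(event);
    return new Response("queued", { status: 202 });
  },

  // Consumer: invoked by Cloudflare with a batch of queued messages.
  async queue(batch: MessageBatch<unknown>, env: Env): Promise<void> {
    for (const message of batch.messages) {
      try {
        await dispatchPlugins(message.body); // stand-in for the kernel's dispatch
        message.ack();
      } catch (err) {
        console.error("dispatch failed, will retry", err);
        message.retry();
      }
    }
  },
};

declare function dispatchPlugins(event: unknown): Promise<void>;
```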
@whilefoo that makes sense. Queues are a paid service afaik, so it seems we've reached the free plan limit either way. Azure does not have a cold boot time in some plans; I do not know how slow it gets, I'll experiment.
Closing as we have Azure and Cloudflare has been stable again lately. |
Most of the requests to the kernel in the latest `main` branch (v2.5.3) throw with these Cloudflare errors:

- The script will never generate a response
- Worker exceeded CPU time limit

That is why the production kernel was downgraded to v2.5.2 (in particular this commit), which works as expected.

When the Devcon conference ends (15th November) we should:

- … (involving the `ubiquity-os-kernel-main` worker)

Possible solution (originally posted by gentlementlegen): …