Cloudflare error: The script will never generate a response #195

Closed
rndquu opened this issue Nov 12, 2024 · 28 comments · Fixed by #199

@rndquu
Member

rndquu commented Nov 12, 2024

Most of the requests to the kernel in the latest main branch (v2.5.3) fail with these Cloudflare errors:

  • The script will never generate a response
  • Worker exceeded CPU time limit

That is why the production kernel was downgraded to v2.5.2 (in particular this commit) which works as expected.

When the Devcon conference ends (15th November) we should:

  1. Release a new kernel version
  2. Set the newly released kernel version as the "latest" in the Cloudflare dashboard of the ubiquity-os-kernel-main worker
  3. Check if the errors still persist
  4. If the errors are still there then find the root cause and fix it

Possible solution (originally posted by gentlementlegen):

I was digging into this a bit; sadly it works fine locally. I also stumbled upon this thread: https://github.com/cloudflare/workerd/issues/210

Another theory is that we recently changed all plugins to be called simultaneously, which may hit a Cloudflare limit or not resolve properly, which would not be an issue when run locally.

https://github.com/ubiquity-os/ubiquity-os-kernel/blob/6037f76c1ec2bad7abf34b6971b477b1109439c9/src/github/handlers/index.ts#L72

Maybe it's worth a try to use Promise.allSettled, or move it back to sequential calls, and see if that clears the issue.
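For illustration, a minimal sketch of what the Promise.allSettled variant could look like (Plugin and dispatchPlugin are illustrative placeholders, not the kernel's actual types):

type Plugin = { name: string };

// Placeholder for whatever the kernel does to invoke a plugin (worker fetch or workflow dispatch).
async function dispatchPlugin(plugin: Plugin, payload: unknown): Promise<void> {
  // ...
}

async function dispatchAll(plugins: Plugin[], payload: unknown): Promise<void> {
  // allSettled never rejects, so a failing plugin cannot leave an unhandled
  // rejection (i.e. a dangling promise) behind in the worker.
  const results = await Promise.allSettled(plugins.map((p) => dispatchPlugin(p, payload)));
  results.forEach((result, i) => {
    if (result.status === "rejected") {
      console.error(`Plugin ${plugins[i].name} failed:`, result.reason);
    }
  });
}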

A new workroom has been created for this task. Join chat

@github-project-automation github-project-automation bot moved this to Done in Ubiquity Nov 14, 2024
@rndquu rndquu removed this from Development Nov 15, 2024
@rndquu rndquu reopened this Nov 27, 2024
@rndquu rndquu removed their assignment Nov 27, 2024
@rndquu
Member Author

rndquu commented Nov 27, 2024

Reopening this one because the error is back and the bot is unresponsive from time to time.

@gentlementlegen FYI

@Keyrxng
Contributor

Keyrxng commented Nov 27, 2024

https://developers.cloudflare.com/workers/platform/limits/#simultaneous-open-connections says 6 simultaneous open connections is the limit, and we currently have 6 worker plugins installed, so perhaps we need to start batching requests and/or handling connections/limits (see the sketch below).

https://developers.cloudflare.com/workers/ai/

"The script will never generate a response": This error occurs when the Workers runtime detects that all the code associated with the request has executed, but no events are left in the event loop, and a Response has not been returned. This is often caused by unresolved Promises or WebSocket connections that are never closed.

Worker exceeded CPU time limit: This error occurs when a Cloudflare Worker exceeds the allowed CPU time limit. CPU time is the time spent executing code, such as loops or parsing JSON, and does not include time spent on network requests. This error is thrown when the Worker's CPU usage exceeds the allowed limit, indicating that the code is not optimized for performance.
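If batching turns out to be necessary, a simple chunked dispatch could look like this (a sketch only; dispatchInBatches and the batch size of 6 are illustrative, not part of the kernel):

async function dispatchInBatches<T>(items: T[], batchSize: number, dispatch: (item: T) => Promise<void>): Promise<void> {
  for (let i = 0; i < items.length; i += batchSize) {
    // At most batchSize requests are in flight at once, keeping us under the
    // simultaneous open connections limit.
    await Promise.allSettled(items.slice(i, i + batchSize).map(dispatch));
  }
}

// e.g. await dispatchInBatches(plugins, 6, (p) => dispatchPlugin(p, payload));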

@gentlementlegen
Member

A lot of changes occurred in the plugins and the kernel due to the LLM addition. It is possible that some async code is breaking. I do not know that part of the codebase very well, so I would need to investigate.

@whilefoo rfc

@gentlementlegen
Member

I tried replicating this on my own Cloudflare account, unsuccessfully. I am not keen on working directly within this org, but I might have no choice.

@rndquu
Member Author

rndquu commented Nov 29, 2024

What if we wrap this line in https://developers.cloudflare.com/workers/runtime-apis/context/#waituntil? As far as I understand, waitUntil extends the worker's lifetime by up to 30 seconds.
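A minimal sketch of that idea, assuming the Workers modules syntax and @cloudflare/workers-types (handleEvent and Env are placeholders for the kernel's actual handler and bindings):

type Env = Record<string, unknown>; // placeholder for the kernel's bindings

async function handleEvent(request: Request, env: Env): Promise<void> {
  // ...verify the webhook signature and dispatch plugins here
}

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    // Respond to GitHub right away; waitUntil keeps the Worker alive
    // (up to ~30 s of wall-clock time) so the dispatch can finish in the background.
    ctx.waitUntil(handleEvent(request.clone(), env));
    return new Response("ok", { status: 200 });
  },
};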

@gentlementlegen
Member

gentlementlegen commented Nov 29, 2024

Worth a try, if the reason is that our workers take too much time to execute. It could also be an async exception that is not caught.

I think the following can be investigated:

  • exceptions during async operations
  • disabling plugins and seeing if a specific one is breaking
  • disabling LLM

@gentlementlegen
Member

@rndquu after merging the changes it seems to run. Let's keep monitoring.


+ Evaluating results. Please wait...


! No price label has been set. Skipping permit generation.

@rndquu
Member Author

rndquu commented Nov 29, 2024

@rndquu after merging the changes it seems to run. Let's keep monitoring.

Shouldn't we release a new kernel version in the main branch? Last time the kernel also worked fine for some time but then suddenly started throwing errors.

@gentlementlegen
Member

@rndquu Yes, we can. We should also release all the plugins, otherwise they won't work; I can take care of it.

@rndquu rndquu removed this from Development Nov 30, 2024
@gentlementlegen
Member

@rndquu Sad update about this: when overloaded, the Cloudflare instance got stuck and stopped taking any more actions, probably because waitUntil never let the instance's run end. I reverted this in the meantime; we should figure out the root cause of the script that will never generate a response.

I also noticed that, due to the caching of the manifest here, multiple child instances spawned by Cloudflare could try to read and write that variable at the same time, which could be the cause of the I/O issues silently crashing the run. I'll try to debug that. The problem is that I cannot see exactly what CF does behind the scenes with fetch and the spawning of children, which cannot really be replicated locally.
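For reference, the pattern in question is roughly this kind of module-scope cache (illustrative, not the kernel's actual code): the Map lives for the isolate's lifetime and is shared by every concurrent request that isolate handles.

const manifestCache = new Map<string, string>();

async function getManifest(url: string): Promise<string> {
  const cached = manifestCache.get(url);
  if (cached !== undefined) return cached;

  const res = await fetch(url);
  const text = await res.text();
  // Two concurrent requests can both miss and both fetch, but storing only the
  // resulting string (never the Response object) avoids holding request-scoped
  // I/O objects across requests.
  manifestCache.set(url, text);
  return text;
}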

@rndquu
Member Author

rndquu commented Dec 2, 2024

@gentlementlegen Last 24 hours were pretty much stable, only 4 errors out of ~8.6k requests:
(screenshot: Cloudflare requests/errors dashboard)

@gentlementlegen
Member

gentlementlegen commented Dec 2, 2024

@rndquu It has been steadily broken for the past hour: once one plugin didn't complete its run, all the subsequent calls got broken and nothing would respond to any command. I was preparing the bot for a demo and nothing would work.


Without this I see that we get a lot of Worker exceeded CPU time limit errors. With it, if any run hangs, it breaks indefinitely. There should be a way to set timeouts somewhere.


I tried adding passThroughOnException to let the workers fail open instead of hanging forever; let's see if that helps.
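For context, enabling it looks roughly like this in the modules syntax (handleEvent and Env are placeholders); note it only has somewhere to "fail open" to when the Worker runs in front of an origin:

type Env = Record<string, unknown>; // placeholder

async function handleEvent(request: Request, env: Env): Promise<Response> {
  return new Response("ok"); // ...webhook handling would go here
}

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    // On an uncaught exception, proxy the request to the origin instead of
    // returning the Worker's error response.
    ctx.passThroughOnException();
    return handleEvent(request, env);
  },
};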


Same result: the worker will hang forever.

@rndquu rndquu reopened this Dec 2, 2024
@rndquu
Member Author

rndquu commented Dec 2, 2024

Ok, so the The script will never generate a response error is solely Cloudflare related. It's thrown because somewhere there's an unresolved promise living in the global JS context.

The thing is that the unresolved promise (causing the error) may live in any 3rd-party npm package (I've checked the kernel code sources and haven't found any promises in the global scope). So any such import, for example import { EmitterWebhookEvent } from "@octokit/webhooks";, combined with 3rd-party package code like this (hypothetical example):

// hypothetical module-scope code inside a 3rd-party package
const myVar: Promise<string> = new Promise(() => {
  // never resolves
});

class EmitterWebhookEvent {
  // ... some code
}

will cause the worker to throw the The script will never generate a response error under heavy request load, when multiple requests end up on the same worker instance (physically).

Even if we find the root cause right now, there's no guarantee that in the future, with any new npm package import, we won't get the same error again, wasting tons of time on debugging.

I think the right strategy right now is to redeploy the kernel on some other platform (I've read that https://vercel.com/ doesn't have such issues) and, at first, check how it works.

@gentlementlegen
Member

Among the unresolved promises, the only one that comes to mind on our side is when Cloudflare makes a fetch request from a spawned child, which causes an I/O error and leaves the promise unresolved. This happens once in a while during the fetch of the manifests. The problem is that we do not store the Response itself but only the returned string, so I cannot grasp why this error appears. My only guess is that the variable instance creates a race condition between spawned children. But fixing this would mean using either KV or some external storage, which is something we tried to avoid.

I can deploy an instance to Vercel and see how it goes, why not.

@0x4007
Member

0x4007 commented Dec 2, 2024

Given that we are working towards a partnership with Microsoft, we should explore the use of Azure instead.

@whilefoo
Contributor

whilefoo commented Dec 2, 2024

I think we should immediately switch to Hono (I'll start working on this), which can run on basically any platform, so we are not dependent on one provider and can switch if issues appear that are not easily solved, like this one. After we get it stable on another platform we have more time to find the root cause.

@rndquu Last time you reverted to a previous commit that seemed not to have this problem, right? Which commit is that? We should compare what was added to the code after that commit, and maybe we can find what is causing it.

@rndquu
Member Author

rndquu commented Dec 2, 2024

I think we should immediately switch to Hono

If https://hono.dev/ allows keeping the same codebase for Cloudflare / Vercel / Azure then it's a good idea.
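For reference, a minimal Hono app is essentially platform-agnostic: the same app object can be exported directly for Cloudflare Workers or wrapped with an adapter (@hono/node-server, the Vercel adapter, etc.) elsewhere. A rough sketch, with the route handling purely illustrative:

import { Hono } from "hono";

const app = new Hono();

app.get("/", (c) => c.text("OK"));

app.post("/", async (c) => {
  const event = c.req.header("x-github-event");
  const payload = await c.req.json();
  // ...verify the signature and dispatch plugins here
  return c.json({ ok: true, event });
});

export default app; // works as a Cloudflare Workers module entry point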

Last time you reverted to a previous commit that seemed to not have this problem, right? Which commit is that? We should compare what was added in the code after that commit and maybe we can find what is causing it.

This is not "specific commit" related. If you switch the production kernel deployment in the Cloudflare dashboard to any commit, then the kernel, at first, works without errors, but after some time (under heavy load?) it starts throwing the The script will never generate a response error.

@rndquu rndquu mentioned this issue Dec 2, 2024
@gentlementlegen
Member

The latest change we made is here. It allows for longer runs, because otherwise we get a lot of "Exceeded CPU max duration" errors, which cancel the whole run (thus nothing happens). But having this seemed to introduce runs that would hang indefinitely. I also tried removing the caching mechanism for the manifests, but then we would exceed the maximum allowed fetch calls.

@gentlementlegen
Member

Example of run hanging:
ubiquity-os/plugins-wishlist#2 (comment)

Logs on Cloudflare

Somehow after "Events issues received" no callback is summoned.

@whilefoo
Contributor

whilefoo commented Dec 2, 2024

It allows for longer runs because otherwise we get a lot of "Exceeded CPU max duration" which cancels the whole run (thus nothing happens).

We get only 10 ms of CPU time on the free plan, so it is quite little. However, it seems weird to me that waitUntil would increase CPU time: the free plan has a 10 ms limit and the paid plan has a 30 s limit, which would mean we get the "paid" plan for free?

One detail I noticed in Cloudflare docs:

As long as the client that sent the request remains connected, the Worker can continue processing, making subrequests, and setting timeouts on behalf of that request. When the client disconnects, all tasks associated with that client request are canceled.

GitHub states that the webhook timeout is 10 seconds, so when that happens Cloudflare will cancel the worker; if our kernel takes more than 10 s, it will time out. This means we have to use waitUntil to prevent this timeout.

@gentlementlegen gentlementlegen mentioned this issue Dec 2, 2024
@gentlementlegen
Member

The slow part is that we sequentially summon the plugins. We could add back the Promise.all but it might lead to the I/O errors again.

I've spent the night testing Azure; we could test it as an alternative. Here is the related PR (not cleaned up nor finished, but the endpoint is working):

https://github.com/ubiquity-os/ubiquity-os-kernel/pull/214/files
https://ubiquity-os.azurewebsites.net

Even if we do not choose CF, the switch to Hono will be beneficial. Azure has a timeout of 5 minutes on functions, which should keep us covered, and it does not rely on V8 isolates but on a fully fledged Node.js instance.

@whilefoo
Contributor

whilefoo commented Dec 3, 2024

The slow part is that we sequentially summon the plugins. We could add back the Promise.all but it might lead to the I/O errors again.

I think I understand how time works in Cloudflare. As long as the client is connected to the worker, there's no real-time timeout, only the 10 ms CPU time limit. When you call waitUntil, the response is returned, but the background task has 30 real-time seconds to finish up (the 10 ms CPU limit still applies).

This means that if we use waitUntil and do sequential processing of plugins, we could easily reach 30 real-time seconds. If we don't use waitUntil, we have 10 seconds until GitHub disconnects, which also cancels the worker.

One option we could look at is Cloudflare Queues. They have 30 seconds of CPU time and 15 minutes of real time, and they don't mention any subrequest limit like in Workers. I'm not sure about the latency between the worker and the start of execution in the consumer.
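A rough sketch of what the Queues shape could look like (binding and handler names are illustrative; the queue binding would be configured in wrangler.toml):

interface Env {
  WEBHOOK_QUEUE: Queue;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // The fetch handler only enqueues the webhook and acknowledges GitHub immediately.
    const body = await request.text();
    await env.WEBHOOK_QUEUE.send({ headers: Object.fromEntries(request.headers), body });
    return new Response("queued", { status: 202 });
  },

  async queue(batch: MessageBatch, env: Env): Promise<void> {
    // The consumer gets a much larger CPU/wall-clock budget for plugin dispatch.
    for (const message of batch.messages) {
      // ...verify and dispatch plugins based on message.body here
      message.ack();
    }
  },
};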

We could switch to Azure permanently; however, there are probably some downsides, like higher boot times (Azure has a cold boot time, Cloudflare does not).

@gentlementlegen
Member

@whilefoo that makes sense. Queues are a paid service afaik, so it seems we've reached the free plan's limits either way.

Azure does not have a cold boot time on some plans; I do not know how slow it gets, I'll experiment.

@gentlementlegen
Member

Closing as we have Azure and Cloudflare has been stable again lately.

@gentlementlegen gentlementlegen closed this as not planned Dec 8, 2024
@rndquu rndquu removed this from Development Dec 9, 2024