
Assistance Required with tvmjs Integration and webgpu.get_fmap Error #59

Open
ZoneLikeWonderland opened this issue Feb 28, 2024 · 14 comments

@ZoneLikeWonderland

Firstly, I'd like to express my admiration for the remarkable work done on this project. The advancements and capabilities it offers are truly impressive.

I've been following the provided "walkthrough.ipynb" to familiarize myself with the pipeline. Unfortunately, the trace step fails, possibly due to updates in the diffusers library. To work around this, I used a simplified network module, as shown below:

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x + 1

Following this, I proceeded to convert the network to ONNX format and subsequently to IR:

trace = torch.jit.trace(net, input.to(dtype).to(device))

torch.onnx.export(
    trace, input.to(dtype).to(device), "test/net.onnx", verbose=True, input_names=["input"], output_names=["input"],
)
# Exported graph: graph(%input.1 : Float(1, 3, strides=[3, 1], requires_grad=0, device=cpu)):
#   %/Constant_output_0 : Float(requires_grad=0, device=cpu) = onnx::Constant[value={1}, onnx_name="/Constant"](), scope: Net:: # /tmp/ipykernel_3004661/4051825751.py:6:0
#   %input : Float(1, 3, strides=[3, 1], requires_grad=0, device=cpu) = onnx::Add[onnx_name="/Add"](%input.1, %/Constant_output_0), scope: Net:: # /tmp/ipykernel_3004661/4051825751.py:6:0
#   return (%input)

# ============= Diagnostic Run torch.onnx.export version 2.0.0+cu117 =============
# verbose: False, log level: Level.ERROR
# ======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================
import onnx
from tvm.relax.frontend.onnx import from_onnx  # relax ONNX importer

onnx_model_path = "test/net.onnx"
model = onnx.load(onnx_model_path)
tvm_model = from_onnx(model, keep_params_in_input=True)
tvm_model

# # from tvm.script import ir as I
# # from tvm.script import relax as R

# @I.ir_module
# class Module:
#     @R.function
#     def main(input_1: R.Tensor((1, 3), dtype="float32")) -> R.Tensor((1, 3), dtype="float32"):
#         R.func_attr({"num_input": 1})
#         with R.dataflow():
#             gv: R.Tensor((1, 3), dtype="float32") = R.add(input_1, R.const(1, "float32"))
#             R.output(gv)
#         return gv

After that, I compiled it to wasm:

import tvm
from tvm import relax

tvm_model, model_params = relax.frontend.detach_params(tvm_model) # no params actually
target = tvm.target.Target(
    "webgpu", host="llvm -mtriple=wasm32-unknown-unknown-wasm"
)
ex = relax.build(mod=tvm_model, target=target)
ex.export_library("test/net.wasm")

Finally, I used the following JS to run it:

const tvmjs = require("./public/dist/tvmjs.bundle.js");
const EmccWASI = require("./public/dist/tvmjs_runtime.wasi.js");


window.tvmjs = tvmjs

async function asyncInitTVM() {


    const wasmSource = await (
        await fetch("./public/net.wasm")
    ).arrayBuffer();


    logger = function (message) {
        console.log(message);
    };

    const tvm = await tvmjs.instantiate(
        new Uint8Array(wasmSource),
        new EmccWASI(),
        logger
    );

    const output = await tvmjs.detectGPUDevice();
    if (output !== undefined) {
        var label = "WebGPU";
        if (output.adapterInfo.description.length != 0) {
            label += " - " + output.adapterInfo.description;
        } else {
            label += " - " + output.adapterInfo.vendor;
        }
        console.log("Initialize GPU device: " + label);
        tvm.initWebGPU(output.device);
    } else {
        console.log("This browser env do not support WebGPU");
    }



    // Create the VM and fetch the entry function on the chosen device.
    tvm.withNewScope(() => {
        device = tvm.webgpu();
        // device = tvm.cpu();
        vm = tvm.detachFromCurrentScope(tvm.createVirtualMachine(device));
        net = tvm.detachFromCurrentScope(vm.getFunction("main"));
    })

    // Compile the WebGPU compute pipelines used by the module; this is the call that fails below.
    await tvm.asyncLoadWebGPUPipelines(vm.getInternalModule());

    const input_cpu = tvm.withNewScope(() => {
        return tvm.detachFromCurrentScope(
            tvm.empty([1, 3], "float32", tvm.cpu()).copyFrom([1, 1, 1])
        )
    });
    const input_gpu = tvm.withNewScope(() => {
        return tvm.detachFromCurrentScope(
            tvm.empty([1, 3], "float32", device)
        )
    });

    input_gpu.copyFrom(input_cpu);
    await tvm.webgpu().sync();
    console.log("input_cpu", input_cpu.toArray());

    tvm.withNewScope(() => {
        output_gpu = net(input_gpu);
        output_gpu = tvm.detachFromCurrentScope(output_gpu);
    });


    const output_cpu = tvm.withNewScope(() => {
        return tvm.detachFromCurrentScope(
            tvm.empty([1, 3], "float32", tvm.cpu()).copyFrom([2, 3, 4])
        )
    });

    output_cpu.copyFrom(output_gpu);
    await tvm.webgpu().sync();
    console.log("output_cpu", output_cpu.toArray());

}

asyncInitTVM()

However, I've hit a roadblock during the execution phase, particularly at await tvm.asyncLoadWebGPUPipelines(vm.getInternalModule());, where the console outputs the following error:

tvmjs.bundle.js:1863  Uncaught (in promise) Error: Cannot find function webgpu.get_fmap
    at Module.getFunction (tvmjs.bundle.js:1863:23)
    at Instance.eval (tvmjs.bundle.js:2791:38)
    at Generator.next (<anonymous>)
    at eval (tvmjs.bundle.js:28:75)
    at new Promise (<anonymous>)
    at __awaiter (tvmjs.bundle.js:24:16)
    at Instance.asyncLoadWebGPUPipelines (tvmjs.bundle.js:2786:20)
    at asyncInitTVM (main.js:48:15)

In addition, I found that if I use llvm as the build target instead of webgpu, use tvm.cpu() as the device, and skip this line, the example works.
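For reference, the working CPU-only path looks roughly like this (a minimal sketch based on the snippets above, not the exact script; on the JS side the VM is created with tvm.cpu() and the asyncLoadWebGPUPipelines call is skipped):

import tvm
from tvm import relax

# Hedged sketch of the llvm/CPU variant described above; the output path
# "test/net_cpu.wasm" is illustrative.
target = tvm.target.Target("llvm -mtriple=wasm32-unknown-unknown-wasm")
ex = relax.build(mod=tvm_model, target=target)  # tvm_model as defined earlier
ex.export_library("test/net_cpu.wasm")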

Given the scarcity of detailed documentation and tutorials on integrating custom networks with tvmjs, especially regarding WebGPU support, I find myself in need of your expertise and guidance.

Could you please help me identify any potential missteps in my approach? I am particularly interested in making sure my network can be run successfully with tvmjs and would greatly appreciate any insights or suggestions you might have.

Thank you very much for your time and assistance.

@senlyu163

Same error... did you solve it?

@ZoneLikeWonderland
Author

Same error... did you solve it?

not yet👀

@senlyu163

@tqchen Sorry to bother you. If it's convenient, can you give some suggestions?

@senlyu163

@ZoneLikeWonderland The error still occurs on Ubuntu, but it works fine on an Apple M2 Pro. I'm new to TVM and don't know why...

@abacaaaaaa

@senlyu163
Sorry to bother you. Following the get-started documentation, I've successfully run build.py and deploy.py, and they run smoothly when deployed on the web. However, I then encountered the following issue:

Generate error, GPUPipelineError: Entry-point uses workgroup_size(1024, 1, 1) that exceeds the maximum allowed (256, 256, 64).
    at ValidateComputeStageWorkgroupSize (../../third_party/dawn/src/dawn/native/ShaderModule.cpp:1004)


My environment: Python 3.11, macOS 14.5, M2
Do you have any suggestions? Thanks.

@senlyu163

@senlyu163 Sorry to bother you. Following the get-started documentation, I've successfully run build.py and deploy.py, and they run smoothly when deployed on the web. However, I then encountered the following issue:

Generate error, GPUPipelineError: Entry-point uses workgroup_size(1024, 1, 1) that exceeds the maximum allowed (256, 256, 64).
    at ValidateComputeStageWorkgroupSize (../../third_party/dawn/src/dawn/native/ShaderModule.cpp:1004)


My environment: Python 3.11, macOS 14.5, M2. Do you have any suggestions? Thanks.

I also encountered this problem when deploying on the web, probably because of WebGPU limitations. You can tune with the target configuration shown in the following code.

    ms.relax_integration.tune_relax(
        mod=mod,
        target=tvm.target.Target("apple/m1-gpu-restricted"),
        ......
    )

I tested several configurations and succeeded under the “apple/m1-gpu-restricted” configuration.
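For anyone wondering why that tag helps: "apple/m1-gpu-restricted" is a preset target tag in TVM with tighter per-block thread limits, which keeps the tuned schedules within WebGPU's default workgroup-size limits. A small illustration of my own (not from the repo) to inspect what the tag constrains:

import tvm

# Hedged illustration: print the preset target to see its attributes, e.g. the
# reduced thread-block limits compared to the plain "apple/m1-gpu" tag.
restricted = tvm.target.Target("apple/m1-gpu-restricted")
print(restricted.kind.name)  # the registered target kind behind the tag
print(restricted)            # full target string, including the thread limits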

@abacaaaaaa

@senlyu163
Thank you for your response. I’ve found multiple instances of ms.relax_integration.tune_relax in the TVM source code and made modifications to all the files accordingly. However, the issue still persists. Could you please provide detailed information on the files that require modification? Thank you very much.

@senlyu163

@abacaaaaaa

  1. You only need to specify the target as "apple/m1-gpu-restricted" when generating log_db in the tune phase.
  2. Just use the repo's default code during the lib compilation phase.

I think the problem you are having is that the GPU thread count configured during the tune phase is too large.

I recommend referring to guoyaol's repo: https://github.com/guoyaol/web-real-esrgan
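A minimal sketch of how the two steps above can fit together (my own reading of the web-real-esrgan-style flow, not the repo's exact code; the "log_db" path and trial counts are hypothetical, and helpers such as ms.database.create and relax.transform.MetaScheduleApplyDatabase may differ across TVM versions):

import tvm
from tvm import meta_schedule as ms, relax

# Step 1: tune with the restricted target so the generated schedules stay
# within WebGPU's thread limits. "log_db" is a hypothetical work_dir.
ms.relax_integration.tune_relax(
    mod=mod,  # the relax IRModule to tune
    target=tvm.target.Target("apple/m1-gpu-restricted"),
    params={},
    builder=ms.builder.LocalBuilder(),
    runner=ms.runner.LocalRunner(),
    work_dir="log_db",
    max_trials_global=2000,
)

# Step 2: apply the tuned records, then build with the default webgpu flow.
db = ms.database.create(work_dir="log_db")  # read the JSON tuning records written above
target = tvm.target.Target("webgpu", host="llvm -mtriple=wasm32-unknown-unknown-wasm")
with target, db, tvm.transform.PassContext(opt_level=3):
    mod = relax.transform.MetaScheduleApplyDatabase()(mod)
ex = relax.build(mod, target=target)
ex.export_library("net.wasm")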

@abacaaaaaa

@senlyu163
Sorry to bother you. I tried specifying apple/m1-gpu-restricted when generating the log_db, referring to https://github.com/guoyaol/web-real-esrgan and https://github.com/happyme531/RK3588-stable-diffusion-GPU, but the issue still persists.

My code is as follows:

def tune(mod: tvm.IRModule) -> None:
    from tvm import meta_schedule as ms

    ms.relax_integration.tune_relax(
        mod=mod,
        target=tvm.target.Target("apple/m1-gpu-restricted"),
        params={},
        builder=ms.builder.LocalBuilder(
            max_workers=2,
        ),
        runner=ms.runner.LocalRunner(),
        work_dir="log_db_tuning_1000_small",
        max_trials_global=5000,  
        max_trials_per_task=200, 
        strategy=ms.search_strategy.EvolutionarySearch(init_min_unmeasured=10, max_fail_count=15),
    )

I’m just tinkering with the parameters — max_trials_global, max_trials_per_task, init_min_unmeasured, and max_fail_count — and they might not be entirely reasonable. However, even after this tentative tuning, the current log_db still throws errors on the web. Could you kindly provide some suggestions? If it’s convenient for you, could you please share the tune file or log_db files? Thank you very much.

@senlyu163

@abacaaaaaa
The tune code looks correct.

I verified WebGPU on the web-real-esrgan project and haven't tried stable-diffusion yet, so I can't provide the log_db for SD. You could reproduce the web-real-esrgan work first and then migrate it to SD.

I am new to TVM, so I can't provide more professional advice.

@grf53

grf53 commented Jul 8, 2024

Hi, is there anyone who made progress with the webgpu.get_fmap error?

@grf53

grf53 commented Jul 30, 2024

I found a way to get past await tvm.asyncLoadWebGPUPipelines(vm.getInternalModule()); without the webgpu.get_fmap error.

I realized that models built with the https://github.com/mlc-ai/mlc-llm project don't have that problem. There are quite a few differences, but I was able to narrow down the key one.

There is a build 'pipeline' used when you build an Executable from the relax module: https://github.com/apache/tvm/blob/9e88018c3a56ab378dd11410a662ed5c3da1f4df/python/tvm/relax/vm_build.py#L259
Without manipulating the pipeline, your only options are 'zero' or 'default'. Meanwhile, the pipeline used for the mlc-llm build includes quite a few more pipeline steps (compile passes):
https://github.com/mlc-ai/mlc-llm/blob/551f3fee7eaa09e6f024e2866510625842111c7d/python/mlc_llm/compiler_pass/pipeline.py#L77
Among those extra passes, the key part was the one applying schedules from tvm.dlight:
https://github.com/mlc-ai/mlc-llm/blob/551f3fee7eaa09e6f024e2866510625842111c7d/python/mlc_llm/compiler_pass/pipeline.py#L142-L148

After adding that step to the pipeline, the built wasm files no longer hit the webgpu.get_fmap problem.
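For the toy example from the original post, that extra step looks roughly like this (a minimal sketch of my understanding, not the mlc-llm pipeline itself; it assumes the relax ops are first lowered to TIR, e.g. via the stock 'zero' pipeline, so the dlight rules have PrimFuncs to schedule):

import tvm
from tvm import dlight as dl
from tvm import relax

# Legalize/fuse the relax ops into TIR PrimFuncs first...
tvm_model = relax.pipeline.get_pipeline("zero")(tvm_model)

# ...then let dlight attach GPU thread bindings before the usual build.
with tvm.target.Target("webgpu"):
    tvm_model = dl.ApplyDefaultSchedule(
        dl.gpu.Matmul(),
        dl.gpu.GEMV(),
        dl.gpu.Reduction(),
        dl.gpu.GeneralReduction(),
        dl.gpu.Fallback(),
    )(tvm_model)

target = tvm.target.Target("webgpu", host="llvm -mtriple=wasm32-unknown-unknown-wasm")
ex = relax.build(mod=tvm_model, target=target)
ex.export_library("test/net.wasm")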

But I still have trouble using TVM with WebGPU: copying a TVM NDArray between devices (cpu <-> gpu) does not seem to work properly.

I hope this helps someone with the same issue and can be part of a constructive discussion to help me resolve my remaining issues.

@Yumin-gd

@grf53
I'd like to express my gratitude to @grf53 for the solution to the webgpu.get_fmap error. I attempted to implement the suggestion of incorporating the tvm.dlight Schedules into the build pipeline. However, instead of resolving the initial error, I encountered a new issue after adding these schedules: the build process now fails for convolution operations with a kernel size of 1. This behavior exactly matches the problem described in mlc-ai/mlc-llm#2276.

I would be immensely grateful if @grf53 could provide some additional guidance:

1. Specific Implementation Details: Could you kindly provide more details about how and where you precisely integrated the tvm.dlight Schedules into the build pipeline?
2. Additional Adaptations: In addition to applying the tvm.dlight Schedules, were there any other modifications or adjustments you made to the code or the build process?
3. Build Environment: If possible, could you please share details about your build environment? I'm currently working with Ubuntu 20.04 and TVM v0.16.0.

Any further insights you can offer would be incredibly valuable in helping me overcome this hurdle. Thank you so much for your time and assistance!

@grf53

grf53 commented Aug 25, 2024

@Yumin-gd Hey, I'm not sure I can help with that issue, but let me answer your questions.

Specific Implementation Details: Could you kindly provide more details about how and where you precisely integrated the tvm.dlight Schedules into the build pipeline?

I made my own function that returns a module pass (pipeline) by mimicking https://github.com/mlc-ai/relax/blob/mlc/python/tvm/relax/pipeline.py. (I used mlc-ai/relax as the tvm repo.)

def my_build_pipeline(
    ...
):
    from tvm import dlight as dl

    @tvm.transform.module_pass(opt_level=0)
    def _pipeline(mod: tvm.ir.IRModule, _ctx: tvm.transform.PassContext) -> tvm.ir.IRModule:
        seq = tvm.transform.Sequential(
            [
                ...
            ]
        )
        mod = seq(mod)
        return mod

    return _pipeline

In that pipeline, the following step is included alongside the existing steps of the default pipeline in https://github.com/mlc-ai/relax/blob/mlc/python/tvm/relax/pipeline.py.

[
    ...
    dl.ApplyDefaultSchedule(
        dl.gpu.Matmul(),
        dl.gpu.GEMV(),
        dl.gpu.Reduction(),
        dl.gpu.GeneralReduction(),
        dl.gpu.Fallback(),
    ),
    ...
]

Alternatively, I also tried adding that step directly into the existing default_build_pipeline() in pipeline.py, and it had the same effect.
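Putting it together, the custom pipeline is then handed to relax.build in place of the stock one (a usage sketch; tvm_model and target are the names from the original post, my_build_pipeline() is the helper above, and the argument is called pipeline in the vm_build.py version linked earlier, though newer TVM versions may name it relax_pipeline):

from tvm import relax

# Hedged usage sketch: pass the custom pipeline to relax.build instead of
# letting it fall back to the built-in 'zero'/'default' options.
ex = relax.build(mod=tvm_model, target=target, pipeline=my_build_pipeline())
ex.export_library("test/net.wasm")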

Additional Adaptations: In addition to applying the tvm.dlight Schedules, were there any other modifications or adjustments you made to the code or the build process?

I checked my file changes, and there is no other modification to the tvm (mlc-ai/relax) repo itself; I only changed my own code that uses the tvm library, aside from the pipeline part described above.

Build Environment: If possible, could you please share details about your build environment? I'm currently working with Ubuntu 20.04 and TVM v0.16.0.

I am using a MacBook Pro with an M3 Pro chip. The OS version is currently Sonoma 14.6.1 (not significantly different from back then). I used the mlc-ai/relax repo as tvm, on the default branch named 'mlc'.
I have just checked the exact commit hash: f5f048bbd71513f087799f987019e3931f68a6d9, which no longer appears to be on the 'mlc' branch.

Any further insights you can offer would be incredibly valuable in helping me overcome this hurdle. Thank you so much for your time and assistance!

I'm not sure what else would help you. One lesson from using tvm is that the project is still moving quickly, so existing code easily becomes out of date. From what I can see, development and maintenance are heavily focused on the 'currently attractive topics' like generative AI and LLMs, so we may find clues more quickly by looking into that kind of work.
