
!!! Exception during processing !!! All input tensors need to be on the same GPU, but found some tensors to not be on a GPU: #6

Open
iamsuper123 opened this issue Dec 25, 2024 · 7 comments


@iamsuper123

[image: error screenshot]

here's my workflow:
[image: workflow screenshot]

here's the logs:
C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable>.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build
[START] Security scan
[DONE] Security scan

ComfyUI-Manager: installing dependencies done.

** ComfyUI startup time: 2024-12-25 22:57:57.250645
** Platform: Windows
** Python version: 3.12.7 (tags/v3.12.7:0b05ead, Oct 1 2024, 03:06:41) [MSC v.1941 64 bit (AMD64)]
** Python executable: C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\python_embeded\python.exe
** ComfyUI Path: C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI
** Log path: C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\comfyui.log

Prestartup times for custom nodes:
10.3 seconds: C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-Manager

Total VRAM 4096 MB, total RAM 16054 MB
pytorch version: 2.5.1+cu124
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3050 Laptop GPU : cudaMallocAsync
Using pytorch cross attention
[Prompt Server] web root: C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\web

Loading: ComfyUI-Manager (V2.55.4)

ComfyUI Revision: 2890 [9a616b81] *DETACHED | Released on '2024-12-04'

Import times for custom nodes:
0.0 seconds: C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\custom_nodes\websocket_image_save.py
0.2 seconds: C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_bnb_nf4_fp4_Loaders-master
2.0 seconds: C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-Manager

Starting server

To see the GUI go to: http://127.0.0.1:8188
FETCH DATA from: C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-Manager\extension-node-map.json [DONE]
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/alter-list.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/model-list.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/github-stats.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/custom-node-list.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/extension-node-map.json
got prompt
Using pytorch attention in VAE
Using pytorch attention in VAE
model weight dtype torch.bfloat16, manual cast: None
model_type FLUX
Using pytorch attention in VAE
Using pytorch attention in VAE
Requested to load FluxClipModel_
loaded partially 1884.2000005722045 1883.88671875 0
Requested to load Flux
loaded completely 1745.7710005722047 1745.7708854675293 False
0%| | 0/20 [00:00<?, ?it/s]
!!! Exception during processing !!! All input tensors need to be on the same GPU, but found some tensors to not be on a GPU:
[(torch.Size([4718592, 1]), device(type='cpu')), (torch.Size([1, 3072]), device(type='cuda', index=0)), (torch.Size([1, 3072]), device(type='cuda', index=0)), (torch.Size([147456]), device(type='cpu')), (torch.Size([16]), device(type='cpu'))]
Traceback (most recent call last):
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\execution.py", line 324, in execute
output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\execution.py", line 199, in get_output_data
return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\execution.py", line 170, in _map_node_over_list
process_inputs(input_dict, i)
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\execution.py", line 159, in process_inputs
results.append(getattr(obj, func)(**inputs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\comfy_extras\nodes_custom_sampler.py", line 633, in sample
samples = guider.sample(noise.generate_noise(latent), latent_image, sampler, sigmas, denoise_mask=noise_mask, callback=callback, disable_pbar=disable_pbar, seed=noise.seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\comfy\samplers.py", line 904, in sample
output = executor.execute(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\comfy\patcher_extension.py", line 110, in execute
return self.original(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\comfy\samplers.py", line 873, in outer_sample
output = self.inner_sample(noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\comfy\samplers.py", line 857, in inner_sample
samples = executor.execute(self, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\comfy\patcher_extension.py", line 110, in execute
return self.original(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\comfy\samplers.py", line 714, in sample
samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\utils_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\comfy\k_diffusion\sampling.py", line 155, in sample_euler
denoised = model(x, sigma_hat * s_in, **extra_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\comfy\samplers.py", line 384, in call
out = self.inner_model(x, sigma, model_options=model_options, seed=seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\comfy\samplers.py", line 839, in call
return self.predict_noise(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\comfy\samplers.py", line 842, in predict_noise
return sampling_function(self.inner_model, x, timestep, self.conds.get("negative", None), self.conds.get("positive", None), self.cfg, model_options=model_options, seed=seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\comfy\samplers.py", line 364, in sampling_function
out = calc_cond_batch(model, conds, x, timestep, model_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\comfy\samplers.py", line 200, in calc_cond_batch
return executor.execute(model, conds, x_in, timestep, model_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\comfy\patcher_extension.py", line 110, in execute
return self.original(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\comfy\samplers.py", line 313, in calc_cond_batch
output = model.apply_model(input_x, timestep_, **c).chunk(batch_chunks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\comfy\model_base.py", line 128, in apply_model
return comfy.patcher_extension.WrapperExecutor.new_class_executor(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\comfy\patcher_extension.py", line 110, in execute
return self.original(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\comfy\model_base.py", line 157, in _apply_model
model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds).float()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\comfy\ldm\flux\model.py", line 184, in forward
out = self.forward_orig(img, img_ids, context, txt_ids, timestep, y, guidance, control, transformer_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\comfy\ldm\flux\model.py", line 110, in forward_orig
vec = self.time_in(timestep_embedding(timesteps, 256).to(img.dtype))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\comfy\ldm\flux\layers.py", line 58, in forward
return self.out_layer(self.silu(self.in_layer(x)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_bnb_nf4_fp4_Loaders-master_init
.py", line 161, in forward
return functional_linear_4bits(x, self.weight, self.bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_bnb_nf4_fp4_Loaders-master_init
.py", line 15, in functional_linear_4bits
out = bnb.matmul_4bit(x, weight.t(), bias=bias, quant_state=weight.quant_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\python_embeded\Lib\site-packages\bitsandbytes\autograd_functions.py", line 528, in matmul_4bit
out = F.gemv_4bit(A, B.t(), out, state=quant_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\python_embeded\Lib\site-packages\bitsandbytes\functional.py", line 1989, in gemv_4bit
is_on_gpu([B, A, out, absmax, state.code])
File "C:\Users\alienware\Desktop\very important shit\ai (32 gb)\ComfyUI_windows_portable_nvidia\ComfyUI_windows_portable\python_embeded\Lib\site-packages\bitsandbytes\functional.py", line 464, in is_on_gpu
raise RuntimeError(
RuntimeError: All input tensors need to be on the same GPU, but found some tensors to not be on a GPU:
[(torch.Size([4718592, 1]), device(type='cpu')), (torch.Size([1, 3072]), device(type='cuda', index=0)), (torch.Size([1, 3072]), device(type='cuda', index=0)), (torch.Size([147456]), device(type='cpu')), (torch.Size([16]), device(type='cpu'))]

Prompt executed in 32.31 seconds

I have two GPUs:
Intel Iris Xe Graphics (power saving)
NVIDIA RTX 3050 Laptop GPU (high performance)

This is the error I get when trying to generate a normal image with flux_dev_bnb_nf4_v2.
Any ideas on how to fix this? Thanks.

@Dhrhciebcy

I have the same problem. I noticed this reply in kijai/ComfyUI-HunyuanVideoWrapper#130:

Possibly a connected error, see kijai/ComfyUI-HunyuanVideoWrapper#80. In my case the problem was that the --lowvram flag on launch was causing issues with memory allocation and forcing some tensors onto the CPU.

Are you using --lowvram when launching ComfyUI? If so, you could try removing it.

I tried running ComfyUI on CPU only and the model loaded perfectly, so something is wrong with the memory management. I'm investigating...

@Dhrhciebcy

OK, maybe I understand what is going on; unfortunately it's not good news. It's clearly a memory issue: the error says that the tensors are not all loaded in the memory of the same device, and this kind of "tensor crossover" between devices is not supported. The two devices here are the CPU and the GPU, each with its own memory (RAM and VRAM respectively). The problem is exactly that: all of the tensors must be loaded either entirely in RAM or entirely in VRAM.

By modifying the code, I think I managed to force all of the tensors to load into VRAM only; however, ComfyUI now hits torch.OutOfMemoryError, since my video card unfortunately only has 6 GB of VRAM.
So I fear the problem can only be permanently solved with enough VRAM, or by starting ComfyUI CPU-only and configuring the rest of the workflow to work without CUDA.
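
To make the failure mode concrete, here is a generic PyTorch illustration of the same class of error (a minimal sketch only, not the bitsandbytes kernel itself; it assumes a CUDA device is available):

import torch

# One tensor on the GPU and one left on the CPU: any op that mixes them fails at runtime.
a = torch.randn(1, 3072, device="cuda:0")   # e.g. an activation that was moved to VRAM
b = torch.randn(3072, 3072, device="cpu")   # e.g. a weight that stayed in RAM
try:
    _ = a @ b
except RuntimeError as err:
    print(err)  # "Expected all tensors to be on the same device, but found at least two devices..."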


The file I modified is ComfyUI\custom_nodes\ComfyUI_bnb_nf4_fp4_Loaders\__init__.py

I changed the whole functional_linear_4bits(x, weight, bias) function at line 14 to this:

def functional_linear_4bits(x, weight, bias):
    #Synchronize all tensors
    device = x.device
    dtype = x.dtype

    weight = weight.to(device=device, dtype=dtype)
    if bias is not None:
        bias = bias.to(device=device, dtype=dtype)

    if weight.quant_state is not None:
        weight.quant_state = copy_quant_state(weight.quant_state, device=device)

    #Secure call to bnb.matmul_4bit
    out = bnb.matmul_4bit(x, weight.t(), bias=bias, quant_state=weight.quant_state)
    out = out.to(x)
    return out
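
For reference, copy_quant_state is a helper that is already defined elsewhere in the same __init__.py (the patched function above calls it). Roughly, it rebuilds the bitsandbytes QuantState with every tensor it holds moved to the target device; the sketch below is based on the public QuantState fields and is only an approximation of what the real helper does:

from bitsandbytes.functional import QuantState

def copy_quant_state(state, device=None):
    # Sketch: return a copy of `state` whose tensors all live on `device`.
    if state is None:
        return None
    device = device or state.absmax.device
    # Double-quantized weights keep a nested QuantState in `state2`.
    state2 = None
    if state.nested:
        state2 = QuantState(
            absmax=state.state2.absmax.to(device),
            shape=state.state2.shape,
            code=state.state2.code.to(device),
            blocksize=state.state2.blocksize,
            quant_type=state.state2.quant_type,
            dtype=state.state2.dtype,
        )
    return QuantState(
        absmax=state.absmax.to(device),
        shape=state.shape,
        code=state.code.to(device),
        blocksize=state.blocksize,
        quant_type=state.quant_type,
        dtype=state.dtype,
        offset=state.offset.to(device) if state.nested else None,
        state2=state2,
    )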

For the sake of fairness, I should point out that ChatGPT helped me modify the code.

I hope this helps.

@iamsuper123
Author

thank you for your reply
just for anyone else reading this, here's the original code:

def functional_linear_4bits(x, weight, bias):
    out = bnb.matmul_4bit(x, weight.t(), bias=bias, quant_state=weight.quant_state)
    out = out.to(x)
    return out

@itsRevela

The change from @Dhrhciebcy above fixed it for me! Thank you so much.

@itsRevela

itsRevela commented Dec 31, 2024

This is what ended up working for me:

def functional_linear_4bits(x, weight, bias):
    device = x.device
    dtype = x.dtype

    # Move weight and bias to the correct device and dtype
    weight = weight.to(device=device, dtype=dtype, non_blocking=True)
    if bias is not None:
        bias = bias.to(device=device, dtype=dtype, non_blocking=True)

    # Copy quantization state if it exists
    quant_state = getattr(weight, 'quant_state', None)
    if quant_state is not None:
        quant_state = copy_quant_state(quant_state, device=device)

    # Process everything at once
    return bnb.matmul_4bit(x, weight.t(), bias=bias, quant_state=quant_state).to(dtype=x.dtype)

Differences between original functional_linear_4bits and the new one shown above:

  1. Device and Data Type Handling:

    • New Code: Ensures weight and bias are explicitly moved to the same device and data type as x for compatibility.
      weight = weight.to(device=device, dtype=dtype, non_blocking=True)
      if bias is not None:
          bias = bias.to(device=device, dtype=dtype, non_blocking=True)
    • Original Code: Does not explicitly handle device or dtype mismatches, which could cause runtime errors.
  2. Quantization State Handling:

    • New Code: Checks for the presence of quant_state in weight and ensures it is copied to the correct device.
      quant_state = getattr(weight, 'quant_state', None)
      if quant_state is not None:
          quant_state = copy_quant_state(quant_state, device=device)
    • Original Code: Assumes weight.quant_state exists and is ready to use without validation.
  3. Output Casting:

    • New Code: Explicitly casts the output tensor to the data type of x:
      return bnb.matmul_4bit(x, weight.t(), bias=bias, quant_state=quant_state).to(dtype=x.dtype)
    • Original Code: Uses out.to(x), implicitly setting both dtype and device.
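
If you want to check which pieces are still being left on the CPU, a small hypothetical debugging helper (the attribute names follow the QuantState fields used above: absmax, code, offset, and the nested state2) can list the devices a quantization state touches:

import torch

def quant_state_devices(quant_state):
    # Hypothetical debugging helper: collect the devices of the tensors held by a
    # bitsandbytes QuantState, including the nested state2 used for double quantization.
    devices = set()
    for name in ("absmax", "code", "offset"):
        t = getattr(quant_state, name, None)
        if isinstance(t, torch.Tensor):
            devices.add(t.device)
    nested = getattr(quant_state, "state2", None)
    if nested is not None:
        devices |= quant_state_devices(nested)
    return devices

Calling print(quant_state_devices(weight.quant_state)) inside functional_linear_4bits before and after the copy should show everything ending up on cuda:0.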

@terrabys

terrabys commented Jan 1, 2025

The fix from @itsRevela above did not work for me: the model loads now and starts to generate, but it runs on the CPU, and I get "Allocation on device". It works fine in ForgeUI, so I think there is an issue with the swap part; I managed to reproduce the "Allocation on device" error when switching the swap method from queue to async. Maybe it's the same problem here?

These are the logs from Forge when it works correctly:
[Memory Management] Target: KModel, Free GPU: 6998.53 MB, Model Require: 6246.84 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: -272.31 MB, CPU Swap Loaded (blocked method): 1554.30 MB, GPU Loaded: 4692.54 MB

@Morisgit148

Hi, I'm having the same issue. I copied the code below, and now I get a new out-of-memory error with bnb nf4 v2. I have a 6 GB GPU and 16 GB of RAM. I am able to run it on Forge, but it fails in ComfyUI. Is there any way to limit the GPU weights like Forge does?

def functional_linear_4bits(x, weight, bias):
    device = x.device
    dtype = x.dtype

    # Move weight and bias to the correct device and dtype
    weight = weight.to(device=device, dtype=dtype, non_blocking=True)
    if bias is not None:
        bias = bias.to(device=device, dtype=dtype, non_blocking=True)

    # Copy quantization state if it exists
    quant_state = getattr(weight, 'quant_state', None)
    if quant_state is not None:
        quant_state = copy_quant_state(quant_state, device=device)

    # Process everything at once
    return bnb.matmul_4bit(x, weight.t(), bias=bias, quant_state=quant_state).to(dtype=x.dtype)
