CUDA out of memory when changing Spaces hardware

#129
by pulb - opened

Hi,

thanks for this awesome app. I cloned your repo to my private Space to be able to change the hardware it runs on. No matter what Spaces hardware I choose (even Nvidia A100 large), I get a CUDA out of memory error on app startup. Any idea how to fix this?

Hi @pulb,
Keep it as it is and continue with ZeroGPU [H200].
The requirement is 60.8 GB, so it is expected to start smoothly.
If you encounter the OOM error again, go to App Settings and hit Factory Rebuild. It removes stale caches and rebuilds your app so it runs smoothly.

(Screenshot: Settings · prithivMLmods_Qwen-Image-Edit-2509-LoRAs-Fast)

Hi, thanks for the fast response! My problem is that even though I subscribed to Pro, I run out of the included compute time pretty fast, so I figured I'd go with one of the paid custom hardware options. But whichever option I choose, even with VRAM > 100 GB, I get this error. Factory Rebuild does not seem to solve it.

This is the actual error I get on 4x L40S:

Loading pipeline components...: 67%|██████▋ | 4/6 [00:01<00:00, 3.57it/s]
Loading pipeline components...: 100%|██████████| 6/6 [00:01<00:00, 5.36it/s]
Expected types for transformer: (<class 'diffusers.models.transformers.transformer_qwenimage.QwenImageTransformer2DModel'>,), got <class 'qwenimage.transformer_qwenimage.QwenImageTransformer2DModel'>.
Traceback (most recent call last):
File "/home/user/app/app.py", line 110, in
).to(device)
File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py", line 545, in to
module.to(device, dtype)
File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4343, in to
return super().to(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1371, in to
return self._apply(convert)
File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/torch/nn/modules/module.py", line 930, in _apply
module._apply(fn)
File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/torch/nn/modules/module.py", line 930, in _apply
module._apply(fn)
File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/torch/nn/modules/module.py", line 930, in _apply
module._apply(fn)
[Previous line repeated 3 more times]
File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/torch/nn/modules/module.py", line 957, in _apply
param_applied = fn(param)
File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1357, in convert
return t.to(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.00 MiB. GPU 0 has a total capacity of 44.39 GiB of which 23.38 MiB is free. Including non-PyTorch memory, this process has 44.36 GiB memory in use. Of the allocated memory 43.82 GiB is allocated by PyTorch, and 125.47 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
CUDA_VISIBLE_DEVICES= None
torch.version = 2.9.1+cu128
torch.version.cuda = 12.8
cuda available: True
cuda device count: 4
current device: 0
device name: NVIDIA L40S
Using device: cuda

Ok, I just noticed that the stated RAM of the paid options is the combined VRAM of all cards in total... I guess there is no way to account for this in the app, i.e. spread the workload across multiple graphics cards?

Hey @pulb
Instead of doing all this custom hardware setup, I already told you to start the app with ZeroGPU (H200), which is the best way too. Why go the hard way? (Since you have a Pro subscription, you can run up to 10 H200 Spaces with it.)

Remember, the fast 4-step inference can be achieved with ZeroGPU!
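
For context, a ZeroGPU app attaches the H200 per call through the `spaces` decorator. A minimal sketch of the usual pattern, not this app's actual code; the model ID, call signature, and `duration` value are illustrative:

```python
import spaces
import torch
from diffusers import DiffusionPipeline

# The pipeline is loaded once at startup; ZeroGPU defers the actual
# GPU allocation until a @spaces.GPU-decorated function is called.
pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit-2509",  # assumed model ID for illustration
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

@spaces.GPU(duration=60)  # seconds of H200 time reserved per call (illustrative)
def edit(image, prompt):
    # the fast 4-step inference mentioned above
    return pipe(image=image, prompt=prompt, num_inference_steps=4).images[0]
```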

My problem is that I only have 25 minutes of compute time on ZeroGPU. I wanted to go with another GPU option to be able to extend that time. Or is there a way to gain more compute time on ZeroGPU?

Okay, so you're ready to pay for credits, but the app can't be adapted to the accelerator as it stands. Alright, I'll come up with a fix for you.
@pulb

Sorry, I'm a developer, but I'm completely new to this CUDA stuff. I'm very interested in improving my knowledge though 😬 Thanks for your support!

I'm not sure if it's sufficient to just install the accelerate lib and set device_map to "auto" instead of "cuda".

@pulb -
Okay, map the device to auto and set the allocator configuration:
VAR_NAME: PYTORCH_CUDA_ALLOC_CONF (or PYTORCH_ALLOC_CONF)
VALUE: expandable_segments:True

Also remove the ZeroGPU-related functions.
I have not tried this on devices other than ZeroGPU [H200] yet; I am a little busy with other setups right now. Try these settings, and if they work, great. If not, I will bring a solution soon.
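
A minimal sketch of how to apply that at the very top of app.py (or set the same variable under the Space's Settings → Variables); the allocator only picks it up if it is set before torch initializes CUDA:

```python
import os

# Must be set before the CUDA caching allocator is initialized.
# PYTORCH_ALLOC_CONF is the newer spelling; PYTORCH_CUDA_ALLOC_CONF also works.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported after the env var on purpose
```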

I tried changing it to "auto", but it said that's not supported. If you have some code to test, I'm happy to try it out :)

@pulb
Yep, refer to the semantics-related notes in the environment variables section:
Environment Variables
Try to fix it with that.
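
On the device_map error: diffusers pipelines currently only accept the "balanced" placement strategy, not "auto" (which is why it complained). A minimal sketch of what could replace the `.to(device)` loading in app.py, with the model ID assumed for illustration:

```python
import torch
from diffusers import DiffusionPipeline

# "balanced" spreads the pipeline's components across all visible GPUs;
# pipeline-level device_map does not support "auto".
pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit-2509",  # assumed model ID for illustration
    torch_dtype=torch.bfloat16,
    device_map="balanced",
)
# Do not call pipe.to("cuda") afterwards -- that would pull everything
# back onto GPU 0 and reproduce the original OOM.
```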

Ok, I got it to start up without errors. But if I upload an image and hit Edit, it says "processing..." and keeps doing so seemingly forever. I aborted it after 3 minutes. So if you find time to try a fix, please let me know.
