StupidAE: d8c16 Tiny Patch Autoencoder
StupidAE is a very small, very fast, and intentionally simple model that still works surprisingly well.
It has 13.24M parameters, compresses by 8× per spatial dimension, and uses 16 latent channels.
The main goal: make an AE that doesn't slow everything down and is fast enough to run directly during text-to-image training.
Code
The code is available on GitHub:
https://github.com/Muinez/StupidAE
Key Numbers
- Total params: 13,243,539
- Compression: d8 (8×8 patching)
- Latent channels: 16 (c16)
- Training: 30k steps, batch size 256, ~3 RTX 5090-hours
- Optimizer: Muon + SnooC, LR = 1e-3
- Trained without KL loss (just MSE)
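Since there is no KL term, the objective is plain pixel-space MSE. A minimal sketch of one training step, with AdamW as a stand-in optimizer (the actual run used Muon + SnooC, which is not shown here):

```python
import torch
import torch.nn.functional as F

from stae import StupidAE  # repo module, same import as in the usage example below

vae = StupidAE().cuda()
opt = torch.optim.AdamW(vae.parameters(), lr=1e-3)  # stand-in; the real run used Muon + SnooC

def train_step(images):                     # images: (B, 3, H, W) in [-1, 1]
    latents = vae.encode(images)            # (B, 16, H/8, W/8)
    recon = vae.decode(latents)             # (B, 3, H, W)
    loss = F.mse_loss(recon, images)        # reconstruction only, no KL term
    opt.zero_grad(set_to_none=True)
    loss.backward()
    opt.step()
    return loss.item()
```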
Performance (compared to SDXL VAE)
Stats for 1024×1024:
| Component | SDXL VAE | StupidAE |
|---|---|---|
| Encoder FLOPs | 4.34 TFLOPs | 124.18 GFLOPs |
| Decoder FLOPs | 9.93 TFLOPs | 318.52 GFLOPs |
| Encoder Params | 34.16M | ~3.8M |
| Decoder Params | 49.49M | ~9.7M |
The model is tens of times faster and lighter, making it usable directly inside training loops.
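If you want to check FLOPs yourself, something like fvcore's flop counter works. This is only an illustration (not necessarily how the table was produced), it assumes the model exposes `encoder` and `decoder` submodules, and numbers may differ from the table depending on counting conventions:

```python
import torch
from fvcore.nn import FlopCountAnalysis, parameter_count

from stae import StupidAE  # repo module

vae = StupidAE().eval()
x = torch.randn(1, 3, 1024, 1024)           # same resolution as the table

with torch.no_grad():
    enc_flops = FlopCountAnalysis(vae.encoder, x).total()        # assumes an `encoder` submodule
    latents = vae.encode(x)
    dec_flops = FlopCountAnalysis(vae.decoder, latents).total()  # assumes a `decoder` submodule

print(f"encoder: {enc_flops / 1e9:.2f} GFLOPs, decoder: {dec_flops / 1e9:.2f} GFLOPs")
print("total params:", parameter_count(vae)[""])
```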
Architecture Overview
No Attention
It is simply unnecessary for this design and only slows things down.
Encoder
- Splits the image into 8Γ8 patches
- Each patch is encoded independently
- Uses only 1Γ1 convolutions
- Extremely fast
The encoder can handle any aspect ratio, but mixing different aspect ratios inside the same batch is inconvenient with the 1×1-conv version. A Linear-based encoder variant solves this completely (mixed batches work out of the box); it is not released yet, but I can publish it if needed.
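For intuition, here is a minimal sketch of what a patch-8, 1×1-conv encoder looks like. The hidden width and activation are made up for illustration; the actual layer stack is in the repo:

```python
import torch
from torch import nn

class PatchEncoderSketch(nn.Module):
    """Illustrative patch-8 encoder: each 8x8 patch is encoded independently,
    so the whole thing is a pixel-unshuffle followed by 1x1 convs."""

    def __init__(self, latent_channels: int = 16, hidden: int = 256):
        super().__init__()
        self.patchify = nn.PixelUnshuffle(8)       # (B, 3, H, W) -> (B, 192, H/8, W/8)
        self.net = nn.Sequential(
            nn.Conv2d(3 * 8 * 8, hidden, kernel_size=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=1),
            nn.SiLU(),
            nn.Conv2d(hidden, latent_channels, kernel_size=1),
        )

    def forward(self, x):
        # Each output position depends on exactly one 8x8 input patch,
        # which is why any resolution divisible by 8 works.
        return self.net(self.patchify(x))          # (B, 16, H/8, W/8)
```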
Decoder
- Uses standard 3×3 convolutions (but 1×1 also works with surprisingly few artifacts)
- Uses a PixNeRF-style head instead of stacked upsampling blocks
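To show the shape flow only, here is a sketch of a decoder with 3×3 convs and a plain projection + PixelShuffle head. This is a deliberate simplification: the real model uses a PixNeRF-style head, not this projection, and the widths here are made up:

```python
import torch
from torch import nn

class PatchDecoderSketch(nn.Module):
    """Shape-level illustration: a few 3x3 convs on the latent grid, then one
    projection + PixelShuffle back to pixels. Not the repo's PixNeRF-style head."""

    def __init__(self, latent_channels: int = 16, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(latent_channels, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, 3 * 8 * 8, kernel_size=1),   # 8x8 RGB values per latent position
        )
        self.unpatchify = nn.PixelShuffle(8)               # (B, 192, H/8, W/8) -> (B, 3, H, W)

    def forward(self, z):
        return self.unpatchify(self.net(z))
```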
Limitations
- Reconstruction is not perfect: small details may appear slightly blurred.
- Current MSE loss: 0.0020.
- This can likely be improved by increasing model size.
Notes on 32× Compression
If you want 32× spatial compression, do not use naive 32× patching; quality drops heavily.
A better approach:
- First stage: patch-8 → 16/32 channels
- Second stage: patch-4 → 256 channels
This trains much better and works well for text-to-image training too.
I've tested it, and the results are significantly more stable than naive approaches.
If you want to keep FLOPs low, you could try using patch-16 from the start, but I'm not sure yet how stable the training would be.
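To make the channel arithmetic concrete, here is the shape flow of the two-stage scheme. PixelUnshuffle stands in for a learned encoder at each stage (in practice both stages are learned, and stage 1 uses 16 or 32 channels as above):

```python
import torch
from torch import nn

x = torch.randn(1, 3, 1024, 1024)

# Stage 1: a learned patch-8 encoder (e.g. the d8c16 model) maps
# (1, 3, 1024, 1024) -> (1, 16, 128, 128).
z1 = torch.randn(1, 16, 128, 128)   # stand-in for stage-1 latents

# Stage 2: patching the latent grid by 4 gives 16 * 4 * 4 = 256 channels
# and 32x total spatial compression.
patch4 = nn.PixelUnshuffle(4)       # pure reshape here; learned in practice
z2 = patch4(z1)
print(z2.shape)                     # torch.Size([1, 256, 32, 32])
```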
I'm currently working on a d32c64 model with reconstruction quality better than Hunyuan VAE, but I'm limited by compute resources.
Support the Project
I'm renting an RTX 5090 and running all experiments on it.
I'm currently looking for work and would love to join a team doing text-to-image or video model research.
If you want to support development:
- TRC20: TPssa5ung2MgqbaVr1aeBQEpHC3xfmm1CL
- BTC: bc1qfv6pyq5dvs0tths682nhfdnmdwnjvm2av80ej4
- Boosty: https://boosty.to/muinez
How to use
Here's a minimal example:
```python
import requests
import torch
from huggingface_hub import hf_hub_download
from IPython.display import display
from PIL import Image
from torchvision.transforms import v2

from stae import StupidAE

# Load the d8c16 checkpoint from the Hugging Face Hub.
vae = StupidAE().cuda().half()
vae.load_state_dict(
    torch.load(hf_hub_download(repo_id="Muinez/StupidAE", filename="smol_f8c16.pt"))
)

# Map pixels to [-1, 1], matching the training normalization.
t = v2.Compose([
    v2.Resize((1024, 1024)),
    v2.ToTensor(),
    v2.Normalize([0.5], [0.5]),
])

image = Image.open(
    requests.get(
        "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG",
        stream=True,
    ).raw
).convert("RGB")

with torch.inference_mode():
    image = t(image).unsqueeze(0).cuda().half()
    latents = vae.encode(image)          # (1, 16, 128, 128) for a 1024x1024 input
    image_decoded = vae.decode(latents)  # (1, 3, 1024, 1024)

# Undo the [-1, 1] normalization and convert back to PIL for display.
image = v2.ToPILImage()(torch.clamp(image_decoded * 0.5 + 0.5, 0, 1).squeeze(0))
display(image)
```
Coming Soon
- Linear-encoder variant
- d32c64 model
- Tutorial: training text-to-image without bucketing (supports mixed aspect ratios)