StupidAE: d8c16 Tiny Patch Autoencoder
StupidAE is a very small, very fast, and intentionally simple model that still works surprisingly well.
It has 13.24M parameters, compresses by 8× per spatial dimension, and uses 16 latent channels.
The main goal: make an AE that doesn't slow everything down and is fast enough to run directly during text-to-image training.
Code
The code is available on GitHub:
https://github.com/Muinez/StupidAE
Key Numbers
- Total params: 13,243,539
- Compression: d8 (8×8 patching)
- Latent channels: 16 (c16)
- Training: 30k steps, batch size 256, ~3 RTX 5090-hours
- Optimizer: Muon + SnooC, LR = 1e-3
- Trained without KL loss (just MSE)
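Since there is no KL term, the objective is plain pixel-space MSE. A minimal sketch of one training step, with AdamW as a stand-in optimizer (the actual run used Muon + SnooC, which is not shown here):

```python
import torch
import torch.nn.functional as F

from stae import StupidAE  # repo module, same import as in the usage example below

vae = StupidAE().cuda()
opt = torch.optim.AdamW(vae.parameters(), lr=1e-3)  # stand-in; the real run used Muon + SnooC

def train_step(images):                     # images: (B, 3, H, W) in [-1, 1]
    latents = vae.encode(images)            # (B, 16, H/8, W/8)
    recon = vae.decode(latents)             # (B, 3, H, W)
    loss = F.mse_loss(recon, images)        # reconstruction only, no KL term
    opt.zero_grad(set_to_none=True)
    loss.backward()
    opt.step()
    return loss.item()
```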
Performance (compared to SDXL VAE)
Stats for 1024×1024:
| Component | SDXL VAE | StupidAE |
|---|---|---|
| Encoder FLOPs | 4.34 TFLOPs | 124.18 GFLOPs |
| Decoder FLOPs | 9.93 TFLOPs | 318.52 GFLOPs |
| Encoder Params | 34.16M | ~3.8M |
| Decoder Params | 49.49M | ~9.7M |
The model is tens of times faster and lighter, making it usable directly inside training loops.
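If you want to check FLOPs yourself, something like fvcore's flop counter works. This is only an illustration (not necessarily how the table was produced), it assumes the model exposes `encoder` and `decoder` submodules, and numbers may differ from the table depending on counting conventions:

```python
import torch
from fvcore.nn import FlopCountAnalysis, parameter_count

from stae import StupidAE  # repo module

vae = StupidAE().eval()
x = torch.randn(1, 3, 1024, 1024)           # same resolution as the table

with torch.no_grad():
    enc_flops = FlopCountAnalysis(vae.encoder, x).total()        # assumes an `encoder` submodule
    latents = vae.encode(x)
    dec_flops = FlopCountAnalysis(vae.decoder, latents).total()  # assumes a `decoder` submodule

print(f"encoder: {enc_flops / 1e9:.2f} GFLOPs, decoder: {dec_flops / 1e9:.2f} GFLOPs")
print("total params:", parameter_count(vae)[""])
```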
Architecture Overview
No Attention
It is simply unnecessary for this design and only slows things down.
Encoder
- Splits the image into 8Γ8 patches
- Each patch is encoded independently
- Uses only 1Γ1 convolutions
- Extremely fast
The encoder can handle any aspect ratio, but mixing different aspect ratios inside the same batch is inconvenient with the 1×1-conv version. A Linear-based encoder variant solves this completely (mixed batches work out of the box); it is not released yet, but I can publish it if needed.
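For intuition, here is a minimal sketch of what a patch-8, 1×1-conv encoder looks like. The hidden width and activation are made up for illustration; the actual layer stack is in the repo:

```python
import torch
from torch import nn

class PatchEncoderSketch(nn.Module):
    """Illustrative patch-8 encoder: each 8x8 patch is encoded independently,
    so the whole thing is a pixel-unshuffle followed by 1x1 convs."""

    def __init__(self, latent_channels: int = 16, hidden: int = 256):
        super().__init__()
        self.patchify = nn.PixelUnshuffle(8)       # (B, 3, H, W) -> (B, 192, H/8, W/8)
        self.net = nn.Sequential(
            nn.Conv2d(3 * 8 * 8, hidden, kernel_size=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=1),
            nn.SiLU(),
            nn.Conv2d(hidden, latent_channels, kernel_size=1),
        )

    def forward(self, x):
        # Each output position depends on exactly one 8x8 input patch,
        # which is why any resolution divisible by 8 works.
        return self.net(self.patchify(x))          # (B, 16, H/8, W/8)
```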
Decoder
- Uses standard 3×3 convolutions (but 1×1 also works with surprisingly few artifacts)
- Uses a PixNeRF-style head instead of stacked upsampling blocks
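To show the shape flow only, here is a sketch of a decoder with 3×3 convs and a plain projection + PixelShuffle head. This is a deliberate simplification: the real model uses a PixNeRF-style head, not this projection, and the widths here are made up:

```python
import torch
from torch import nn

class PatchDecoderSketch(nn.Module):
    """Shape-level illustration: a few 3x3 convs on the latent grid, then one
    projection + PixelShuffle back to pixels. Not the repo's PixNeRF-style head."""

    def __init__(self, latent_channels: int = 16, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(latent_channels, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, 3 * 8 * 8, kernel_size=1),   # 8x8 RGB values per latent position
        )
        self.unpatchify = nn.PixelShuffle(8)               # (B, 192, H/8, W/8) -> (B, 3, H, W)

    def forward(self, z):
        return self.unpatchify(self.net(z))
```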
Limitations
- Reconstruction is not perfect: small details may appear slightly blurred.
- Current MSE loss: 0.0020.
- This can likely be improved by increasing model size.
Notes on 32× Compression
If you want 32× spatial compression, do not use naive 32× patching; quality drops heavily.
A better approach:
- First stage: patch-8 → 16/32 channels
- Second stage: patch-4 → 256 channels
This trains much better and works well for text-to-image training too.
I've tested it, and the results are significantly more stable than naive approaches.
If you want to keep FLOPs low, you could try using patch-16 from the start, but I'm not sure yet how stable the training would be.
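To make the channel arithmetic concrete, here is the shape flow of the two-stage scheme. PixelUnshuffle stands in for a learned encoder at each stage (in practice both stages are learned, and stage 1 uses 16 or 32 channels as above):

```python
import torch
from torch import nn

x = torch.randn(1, 3, 1024, 1024)

# Stage 1: a learned patch-8 encoder (e.g. the d8c16 model) maps
# (1, 3, 1024, 1024) -> (1, 16, 128, 128).
z1 = torch.randn(1, 16, 128, 128)   # stand-in for stage-1 latents

# Stage 2: patching the latent grid by 4 gives 16 * 4 * 4 = 256 channels
# and 32x total spatial compression.
patch4 = nn.PixelUnshuffle(4)       # pure reshape here; learned in practice
z2 = patch4(z1)
print(z2.shape)                     # torch.Size([1, 256, 32, 32])
```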
I'm currently working on a d32c64 model with reconstruction quality better than Hunyuan VAE, but I'm limited by compute resources.
Support the Project
I'm renting an RTX 5090 and running all experiments on it.
I'm currently looking for work and would love to join a team doing text-to-image or video model research.
If you want to support development:
- TRC20: TPssa5ung2MgqbaVr1aeBQEpHC3xfmm1CL
- BTC: bc1qfv6pyq5dvs0tths682nhfdnmdwnjvm2av80ej4
- Boosty: https://boosty.to/muinez
How to use
Here's a minimal example:
```python
import requests
import torch
from huggingface_hub import hf_hub_download
from IPython.display import display
from PIL import Image
from torchvision.transforms import v2

from stae import StupidAE

# Load the d8c16 checkpoint from the Hugging Face Hub.
vae = StupidAE().cuda().half()
vae.load_state_dict(
    torch.load(hf_hub_download(repo_id="Muinez/StupidAE", filename="smol_f8c16.pt"))
)

# Map pixels to [-1, 1], matching the training normalization.
t = v2.Compose([
    v2.Resize((1024, 1024)),
    v2.ToTensor(),
    v2.Normalize([0.5], [0.5]),
])

image = Image.open(
    requests.get(
        "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG",
        stream=True,
    ).raw
).convert("RGB")

with torch.inference_mode():
    image = t(image).unsqueeze(0).cuda().half()
    latents = vae.encode(image)          # (1, 16, 128, 128) for a 1024x1024 input
    image_decoded = vae.decode(latents)  # (1, 3, 1024, 1024)

# Undo the [-1, 1] normalization and convert back to PIL for display.
image = v2.ToPILImage()(torch.clamp(image_decoded * 0.5 + 0.5, 0, 1).squeeze(0))
display(image)
```
Coming Soon
- Linear-encoder variant
- d32c64 model
- Tutorial: training text-to-image without bucketing (supports mixed aspect ratios)