---
datasets:
- Muinez/sankaku-webp-256shortest-edge
---

# StupidAE — d8c16 Tiny Patch Autoencoder

StupidAE is a very small, very fast, and intentionally simple model that still works surprisingly well.
It has **13.24M parameters**, compresses by **8× per spatial dimension**, and uses **16 latent channels**.

The main goal: build an AE that doesn't slow everything down and is fast enough to run directly during text-to-image training.

---

## Code

The code is available on GitHub:

👉 [https://github.com/Muinez/StupidAE](https://github.com/Muinez/StupidAE)

---

## Key Numbers

- Total params: **13,243,539**
- Compression: **d8 (8×8 patching)** (see the shape example below)
- Latent channels: **16 (c16)**
- Training: **30k steps**, batch size **256**, **~3 RTX 5090 GPU-hours**
- Optimizer: **Muon + SnooC**, LR = `1e-3`
- Trained **without KL loss** (just MSE)

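To make the compression figures concrete, here is the shape arithmetic for a 1024×1024 RGB input under d8c16 (8× downsampling per side, 16 latent channels):

```python
# d8c16 shape arithmetic: 8x downsampling per side, 16 latent channels
image_shape  = (1, 3, 1024, 1024)               # N, C, H, W
latent_shape = (1, 16, 1024 // 8, 1024 // 8)    # -> (1, 16, 128, 128)

values_in  = 3 * 1024 * 1024                    # 3,145,728 input values
values_out = 16 * 128 * 128                     # 262,144 latent values (12x fewer)
```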
---

## Performance (compared to SDXL VAE)

Stats for a 1024×1024 input:

| Component | SDXL VAE | StupidAE |
|----------|----------|-----------|
| Encoder FLOPs | 4.34 TFLOPs | **124.18 GFLOPs** |
| Decoder FLOPs | 9.93 TFLOPs | **318.52 GFLOPs** |
| Encoder Params | 34.16M | **~3.8M** |
| Decoder Params | 49.49M | **~9.7M** |

The model needs roughly 30× fewer FLOPs and 5-9× fewer parameters, making it cheap enough to run directly inside training loops.

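As a rough illustration of what running the autoencoder inside the training loop can look like, here is a sketch of on-the-fly latent encoding; `denoiser`, `optimizer`, and `dataloader` are placeholders for your own text-to-image setup, not part of this repo:

```python
import torch

# Hypothetical sketch (not code from this repo): `vae` is a loaded StupidAE
# (see "How to Use" below), `denoiser(latents, cond)` returns your text-to-image
# training loss, and `dataloader` yields (images, cond) with images in [-1, 1].
def train_with_online_encoding(vae, denoiser, optimizer, dataloader, device="cuda"):
    vae = vae.to(device).half().eval()
    for p in vae.parameters():
        p.requires_grad_(False)                             # the autoencoder stays frozen

    for images, cond in dataloader:                         # images: (N, 3, H, W)
        with torch.no_grad():
            latents = vae.encode(images.to(device).half())  # (N, 16, H/8, W/8), on the fly
        loss = denoiser(latents, cond)                      # your diffusion / T2I objective
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```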
---

## Architecture Overview

### ❌ No Attention
Attention is simply unnecessary for this design and only slows things down.

### 🟦 Encoder
- Splits the image into **8×8 patches**
- Each patch is encoded **independently**
- Uses **only 1×1 convolutions**
- Extremely fast

The encoder can handle any aspect ratio, but if you want to mix different aspect ratios inside the same batch, the 1×1-conv version becomes inconvenient.
A Linear-based encoder variant solves this completely (mixed batches work out of the box); it isn't released yet, but I can upload it if needed.

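To illustrate the patch-encoding idea (a sketch only, not the actual StupidAE code; the hidden width and activation are guesses): after `pixel_unshuffle`, every spatial position holds one flattened 8×8 patch, so a stack of 1×1 convolutions encodes each patch independently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEncoderSketch(nn.Module):
    """Illustrative only: 8x8 patching + 1x1 convs, as described above."""
    def __init__(self, patch=8, latent_channels=16, hidden=256):
        super().__init__()
        self.patch = patch
        self.net = nn.Sequential(
            nn.Conv2d(3 * patch * patch, hidden, kernel_size=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=1),
            nn.SiLU(),
            nn.Conv2d(hidden, latent_channels, kernel_size=1),
        )

    def forward(self, x):                             # x: (N, 3, H, W), H and W divisible by 8
        patches = F.pixel_unshuffle(x, self.patch)    # (N, 3*8*8 = 192, H/8, W/8)
        return self.net(patches)                      # (N, 16, H/8, W/8)
```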
### 🟥 Decoder
- Uses standard 3×3 convolutions (but 1×1 also works with surprisingly few artifacts)
- Uses a **PixNeRF-style head** instead of stacked upsampling blocks

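For intuition only, here is one way to read "a single head instead of stacked upsampling blocks". This is a guess for illustration, not the repo's actual PixNeRF-style head: a few 3×3 convs in latent space, then one layer that predicts all 8×8 output pixels per latent position and rearranges them with `pixel_shuffle`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchDecoderSketch(nn.Module):
    """Illustrative only: conv body + a single upsampling head (not the actual code)."""
    def __init__(self, patch=8, latent_channels=16, hidden=256):
        super().__init__()
        self.patch = patch
        self.body = nn.Sequential(
            nn.Conv2d(latent_channels, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
        )
        self.head = nn.Conv2d(hidden, 3 * patch * patch, kernel_size=1)

    def forward(self, z):                                  # z: (N, 16, H/8, W/8)
        h = self.body(z)
        return F.pixel_shuffle(self.head(h), self.patch)   # (N, 3, H, W)
```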
---

## Limitations

- Reconstruction is not perfect — small details may appear slightly blurred.
- Current reconstruction MSE: 0.0020.
- This can likely be improved by increasing model size.

---

## Notes on 32× Compression

If you want **32× spatial compression**, do **not** use naive 32× patching — quality drops heavily.

A better approach:

1. First stage: patch-8 → 16 or 32 latent channels
2. Second stage: patch-4 on the stage-1 latents → 256 channels (8× then 4× gives 32× total)

This trains much better and works well for text-to-image training too.
I've tested it, and the results are significantly more stable than naive approaches.

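As a rough sketch of the two-stage idea (module structure and names here are placeholders, not code from this repo), stacking an 8× patch stage and a 4× patch stage gives 32× total spatial compression:

```python
import torch
import torch.nn as nn

def patch_stage(in_ch, out_ch, patch, hidden=256):
    # pixel-unshuffle folds each patch into channels; 1x1 convs then encode it
    return nn.Sequential(
        nn.PixelUnshuffle(patch),
        nn.Conv2d(in_ch * patch * patch, hidden, kernel_size=1),
        nn.SiLU(),
        nn.Conv2d(hidden, out_ch, kernel_size=1),
    )

stage1 = patch_stage(in_ch=3,  out_ch=32,  patch=8)   # image (N, 3, H, W) -> (N, 32, H/8, W/8)
stage2 = patch_stage(in_ch=32, out_ch=256, patch=4)   # stage-1 latents    -> (N, 256, H/32, W/32)

x = torch.randn(1, 3, 256, 256)
print(stage2(stage1(x)).shape)                         # torch.Size([1, 256, 8, 8])
```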
If you want to keep FLOPs low, you could try using patch-16 from the start, but I'm not sure yet how stable the training would be.

I'm currently working on a **d32c64** model with reconstruction quality better than Hunyuan VAE, but I'm limited by compute resources.

---

## Support the Project

I'm renting an **RTX 5090** and running all experiments on it.
I'm currently looking for work and would love to join a team doing text-to-image or video model research.

If you want to support development:

- TRC20: 👉 TPssa5ung2MgqbaVr1aeBQEpHC3xfmm1CL
- BTC: bc1qfv6pyq5dvs0tths682nhfdnmdwnjvm2av80ej4
- Boosty: https://boosty.to/muinez

---

## How to Use

Here's a minimal example:

```python
import torch
from huggingface_hub import hf_hub_download
from PIL import Image
from torchvision.transforms import v2
from IPython.display import display
import requests
from stae import StupidAE

# Load the model and the pretrained weights from the Hub
vae = StupidAE().cuda().half()
vae.load_state_dict(
    torch.load(hf_hub_download(repo_id="Muinez/StupidAE", filename="smol_f8c16.pt"))
)

# Preprocess to a [-1, 1] tensor
t = v2.Compose([
    v2.Resize((1024, 1024)),
    v2.ToTensor(),
    v2.Normalize([0.5], [0.5])
])

image = Image.open(requests.get("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG", stream=True).raw).convert("RGB")

with torch.inference_mode():
    image = t(image).unsqueeze(0).cuda().half()

    latents = vae.encode(image)          # (1, 16, 128, 128) for a 1024×1024 input
    image_decoded = vae.decode(latents)  # back to (1, 3, 1024, 1024)

# Undo the normalization and show the reconstruction
image = v2.ToPILImage()(torch.clamp(image_decoded * 0.5 + 0.5, 0, 1).squeeze(0))
display(image)
```
---
## Coming Soon

- Linear-encoder variant
- d32c64 model
- Tutorial: training text-to-image **without bucketing** (supports mixed aspect ratios)