---
datasets:
- Muinez/sankaku-webp-256shortest-edge
---

# StupidAE — d8c16 Tiny Patch Autoencoder

StupidAE is a very small, very fast, and intentionally simple model that still works surprisingly well.
It has **13.24M parameters**, compresses by **8× per spatial dimension**, and uses **16 latent channels**.

The main goal: build an AE that doesn't slow everything down and is fast enough to run directly during text-to-image training.

---

## Code

The code is available on GitHub:

👉 [https://github.com/Muinez/StupidAE](https://github.com/Muinez/StupidAE)

---

## Key Numbers

- Total params: **13,243,539**
- Compression: **d8 (8×8 patching)** (see the shape example below)
- Latent channels: **16 (c16)**
- Training: **30k steps**, batch size **256**, **~3 RTX 5090 GPU-hours**
- Optimizer: **Muon + SnooC**, LR = `1e-3`
- Trained **without KL loss** (just MSE)

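To make the compression figures concrete, here is the shape arithmetic for a 1024×1024 RGB input under d8c16 (8× downsampling per side, 16 latent channels):

```python
# d8c16 shape arithmetic: 8x downsampling per side, 16 latent channels
image_shape  = (1, 3, 1024, 1024)               # N, C, H, W
latent_shape = (1, 16, 1024 // 8, 1024 // 8)    # -> (1, 16, 128, 128)

values_in  = 3 * 1024 * 1024                    # 3,145,728 input values
values_out = 16 * 128 * 128                     # 262,144 latent values (12x fewer)
```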
---

## Performance (compared to SDXL VAE)

Stats for a 1024×1024 input:

| Component | SDXL VAE | StupidAE |
|----------|----------|-----------|
| Encoder FLOPs | 4.34 TFLOPs | **124.18 GFLOPs** |
| Decoder FLOPs | 9.93 TFLOPs | **318.52 GFLOPs** |
| Encoder Params | 34.16M | **~3.8M** |
| Decoder Params | 49.49M | **~9.7M** |

The model needs roughly 30× fewer FLOPs and 5-9× fewer parameters, making it cheap enough to run directly inside training loops.

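As a rough illustration of what running the autoencoder inside the training loop can look like, here is a sketch of on-the-fly latent encoding; `denoiser`, `optimizer`, and `dataloader` are placeholders for your own text-to-image setup, not part of this repo:

```python
import torch

# Hypothetical sketch (not code from this repo): `vae` is a loaded StupidAE
# (see "How to Use" below), `denoiser(latents, cond)` returns your text-to-image
# training loss, and `dataloader` yields (images, cond) with images in [-1, 1].
def train_with_online_encoding(vae, denoiser, optimizer, dataloader, device="cuda"):
    vae = vae.to(device).half().eval()
    for p in vae.parameters():
        p.requires_grad_(False)                             # the autoencoder stays frozen

    for images, cond in dataloader:                         # images: (N, 3, H, W)
        with torch.no_grad():
            latents = vae.encode(images.to(device).half())  # (N, 16, H/8, W/8), on the fly
        loss = denoiser(latents, cond)                      # your diffusion / T2I objective
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```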
---

## Architecture Overview

### ❌ No Attention
Attention is simply unnecessary for this design and only slows things down.

### 🟦 Encoder
- Splits the image into **8×8 patches**
- Each patch is encoded **independently**
- Uses **only 1×1 convolutions**
- Extremely fast

The encoder can handle any aspect ratio, but if you want to mix different aspect ratios inside the same batch, the 1×1-conv version becomes inconvenient.
A Linear-based encoder variant solves this completely (mixed batches work out of the box); it isn't released yet, but I can upload it if needed.

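To illustrate the patch-encoding idea (a sketch only, not the actual StupidAE code; the hidden width and activation are guesses): after `pixel_unshuffle`, every spatial position holds one flattened 8×8 patch, so a stack of 1×1 convolutions encodes each patch independently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEncoderSketch(nn.Module):
    """Illustrative only: 8x8 patching + 1x1 convs, as described above."""
    def __init__(self, patch=8, latent_channels=16, hidden=256):
        super().__init__()
        self.patch = patch
        self.net = nn.Sequential(
            nn.Conv2d(3 * patch * patch, hidden, kernel_size=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=1),
            nn.SiLU(),
            nn.Conv2d(hidden, latent_channels, kernel_size=1),
        )

    def forward(self, x):                             # x: (N, 3, H, W), H and W divisible by 8
        patches = F.pixel_unshuffle(x, self.patch)    # (N, 3*8*8 = 192, H/8, W/8)
        return self.net(patches)                      # (N, 16, H/8, W/8)
```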
### 🟥 Decoder
- Uses standard 3×3 convolutions (but 1×1 also works with surprisingly few artifacts)
- Uses a **PixNeRF-style head** instead of stacked upsampling blocks

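For intuition only, here is one way to read "a single head instead of stacked upsampling blocks". This is a guess for illustration, not the repo's actual PixNeRF-style head: a few 3×3 convs in latent space, then one layer that predicts all 8×8 output pixels per latent position and rearranges them with `pixel_shuffle`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchDecoderSketch(nn.Module):
    """Illustrative only: conv body + a single upsampling head (not the actual code)."""
    def __init__(self, patch=8, latent_channels=16, hidden=256):
        super().__init__()
        self.patch = patch
        self.body = nn.Sequential(
            nn.Conv2d(latent_channels, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
        )
        self.head = nn.Conv2d(hidden, 3 * patch * patch, kernel_size=1)

    def forward(self, z):                                  # z: (N, 16, H/8, W/8)
        h = self.body(z)
        return F.pixel_shuffle(self.head(h), self.patch)   # (N, 3, H, W)
```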
---

## Limitations

- Reconstruction is not perfect — small details may appear slightly blurred.
- Current reconstruction MSE: 0.0020.
- This can likely be improved by increasing model size.

---

## Notes on 32× Compression

If you want **32× spatial compression**, do **not** use naive 32× patching — quality drops heavily.

A better approach:

1. First stage: patch-8 → 16 or 32 latent channels
2. Second stage: patch-4 on the stage-1 latents → 256 channels (8× then 4× gives 32× total)

This trains much better and works well for text-to-image training too.
I've tested it, and the results are significantly more stable than naive approaches.

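As a rough sketch of the two-stage idea (module structure and names here are placeholders, not code from this repo), stacking an 8× patch stage and a 4× patch stage gives 32× total spatial compression:

```python
import torch
import torch.nn as nn

def patch_stage(in_ch, out_ch, patch, hidden=256):
    # pixel-unshuffle folds each patch into channels; 1x1 convs then encode it
    return nn.Sequential(
        nn.PixelUnshuffle(patch),
        nn.Conv2d(in_ch * patch * patch, hidden, kernel_size=1),
        nn.SiLU(),
        nn.Conv2d(hidden, out_ch, kernel_size=1),
    )

stage1 = patch_stage(in_ch=3,  out_ch=32,  patch=8)   # image (N, 3, H, W) -> (N, 32, H/8, W/8)
stage2 = patch_stage(in_ch=32, out_ch=256, patch=4)   # stage-1 latents    -> (N, 256, H/32, W/32)

x = torch.randn(1, 3, 256, 256)
print(stage2(stage1(x)).shape)                         # torch.Size([1, 256, 8, 8])
```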
If you want to keep FLOPs low, you could try using patch-16 from the start, but I'm not sure yet how stable the training would be.

I'm currently working on a **d32c64** model with reconstruction quality better than Hunyuan VAE, but I'm limited by compute resources.

---

## Support the Project

I'm renting an **RTX 5090** and running all experiments on it.
I'm currently looking for work and would love to join a team doing text-to-image or video model research.

If you want to support development:

- TRC20: 👉 TPssa5ung2MgqbaVr1aeBQEpHC3xfmm1CL
- BTC: bc1qfv6pyq5dvs0tths682nhfdnmdwnjvm2av80ej4
- Boosty: https://boosty.to/muinez

---

## How to Use

Here's a minimal example:

```python
import torch
from huggingface_hub import hf_hub_download
from PIL import Image
from torchvision.transforms import v2
from IPython.display import display
import requests
from stae import StupidAE

# Load the model and the pretrained weights from the Hub
vae = StupidAE().cuda().half()
vae.load_state_dict(
    torch.load(hf_hub_download(repo_id="Muinez/StupidAE", filename="smol_f8c16.pt"))
)

# Preprocess to a [-1, 1] tensor
t = v2.Compose([
    v2.Resize((1024, 1024)),
    v2.ToTensor(),
    v2.Normalize([0.5], [0.5])
])

image = Image.open(requests.get("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG", stream=True).raw).convert("RGB")

with torch.inference_mode():
    image = t(image).unsqueeze(0).cuda().half()

    latents = vae.encode(image)          # (1, 16, 128, 128) for a 1024×1024 input
    image_decoded = vae.decode(latents)  # back to (1, 3, 1024, 1024)

# Undo the normalization and show the reconstruction
image = v2.ToPILImage()(torch.clamp(image_decoded * 0.5 + 0.5, 0, 1).squeeze(0))
display(image)
```
---
## Coming Soon

- Linear-encoder variant
- d32c64 model
- Tutorial: training text-to-image **without bucketing** (supports mixed aspect ratios)