---
datasets:
- Muinez/sankaku-webp-256shortest-edge
---

# StupidAE — d8c16 Tiny Patch Autoencoder

StupidAE is a very small, very fast, and intentionally simple model that still works surprisingly well.
It has **13.24M parameters**, compresses by **8× per spatial dimension**, and uses **16 latent channels**.

The main goal: make an AE that doesn’t slow everything down and is fast enough to run directly during text-to-image training.

---

## Code

The code is available on GitHub:

👉 [https://github.com/Muinez/StupidAE](https://github.com/Muinez/StupidAE)

---

## Key Numbers

- Total params: **13,243,539**
- Compression: **d8 (8×8 patching)** (see the shape example after this list)
- Latent channels: **16 (c16)**
- Training: **30k steps**, batch size **256**, **~3 RTX 5090 GPU-hours**
- Optimizer: **Muon + SnooC**, LR = `1e-3`
- Trained **without KL loss** (just MSE)
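
To make d8/c16 concrete, here is a small worked example of the shapes involved, assuming the usual convention that an RGB image of shape (3, H, W) maps to a latent of shape (16, H/8, W/8):

```python
# Worked shape example (assumes the usual f8/c16 latent convention; not code from this repo)
H, W = 1024, 1024
latent_shape = (16, H // 8, W // 8)        # (16, 128, 128)

pixel_values  = 3 * H * W                  # 3,145,728
latent_values = 16 * (H // 8) * (W // 8)   # 262,144
print(latent_shape, pixel_values / latent_values)  # (16, 128, 128) 12.0 -> ~12x fewer values overall
```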

---

## Performance (compared to SDXL VAE)

Stats for 1024×1024 inputs:

| Component | SDXL VAE | StupidAE |
|----------|----------|-----------|
| Encoder FLOPs | 4.34 TFLOPs | **124.18 GFLOPs** |
| Decoder FLOPs | 9.93 TFLOPs | **318.52 GFLOPs** |
| Encoder Params | 34.16M | **~3.8M** |
| Decoder Params | 49.49M | **~9.7M** |

The model is **tens of times faster and lighter**, making it usable directly inside training loops.

---

## Architecture Overview

### ❌ No Attention
Attention is simply unnecessary for this design and only slows things down.

### 🟦 Encoder
- Splits the image into **8×8 patches**
- Each patch is encoded **independently**
- Uses **only 1×1 convolutions**
- Extremely fast

The encoder can handle any aspect ratio, but if you want to mix different aspect ratios inside the same batch, the 1×1-conv version becomes inconvenient. There is also a Linear-based encoder variant that solves this completely: mixed batches work out of the box. I haven’t released it yet, but I can upload it if needed.
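
To illustrate the encoder design, here is a minimal sketch of a patch encoder built only from 1×1 convolutions. It is not the released implementation: the class name, layer widths, depth, and activation are all placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyPatchEncoder(nn.Module):
    """Sketch only: fold each 8x8 patch into channels, then mix with 1x1 convs.

    Hypothetical sizes; the real StupidAE encoder may be wired differently.
    """
    def __init__(self, patch=8, latent_channels=16, hidden=256):
        super().__init__()
        self.patch = patch
        in_ch = 3 * patch * patch  # 192 channels once an RGB patch is flattened
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, hidden, kernel_size=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=1),
            nn.SiLU(),
            nn.Conv2d(hidden, latent_channels, kernel_size=1),
        )

    def forward(self, x):
        # (B, 3, H, W) -> (B, 3*p*p, H/p, W/p): each spatial position now holds one whole patch
        x = F.pixel_unshuffle(x, self.patch)
        # 1x1 convs never look across positions, so every patch is encoded independently
        return self.net(x)

enc = TinyPatchEncoder()
z = enc(torch.randn(1, 3, 1024, 1024))
print(z.shape)  # torch.Size([1, 16, 128, 128])
```

Because every layer is 1×1, the receptive field of each latent is exactly one 8×8 patch, which is what keeps the encoder cheap. The Linear variant presumably treats the patches as a flat token sequence, which is why mixed aspect ratios batch naturally there.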

### 🟥 Decoder
- Uses standard 3×3 convolutions (but 1×1 also works, with surprisingly few artifacts)
- Uses a **PixNeRF-style head** instead of stacked upsampling blocks
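
One way to picture such a head is a small coordinate-conditioned MLP that emits all 8×8 pixels of a patch in a single step instead of going through stacked upsampling blocks. The sketch below only illustrates that general shape and is not the actual StupidAE decoder; the class name, MLP width, and coordinate encoding are all made up.

```python
import torch
import torch.nn as nn

class CoordPixelHead(nn.Module):
    """Sketch of an assumed reading of a "PixNeRF-style" head.

    For every latent position a tiny MLP is evaluated on (feature, sub-patch
    x/y coordinate) and outputs one RGB pixel per coordinate, producing an
    8x8 upsample in one step rather than via stacked upsampling blocks.
    """
    def __init__(self, feat_ch=128, patch=8):
        super().__init__()
        self.patch = patch
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, patch), torch.linspace(-1, 1, patch), indexing="ij"
        )
        # (p*p, 2) grid of within-patch coordinates
        self.register_buffer("coords", torch.stack([xs, ys], dim=-1).reshape(-1, 2))
        self.mlp = nn.Sequential(nn.Linear(feat_ch + 2, 128), nn.SiLU(), nn.Linear(128, 3))

    def forward(self, feats):                                   # feats: (B, C, h, w)
        B, C, h, w = feats.shape
        f = feats.permute(0, 2, 3, 1).reshape(B, h * w, 1, C)   # (B, hw, 1, C)
        f = f.expand(B, h * w, self.patch ** 2, C)              # one copy per output pixel
        c = self.coords.expand(B, h * w, self.patch ** 2, 2)
        rgb = self.mlp(torch.cat([f, c], dim=-1))               # (B, hw, p*p, 3)
        rgb = rgb.reshape(B, h, w, self.patch, self.patch, 3)
        rgb = rgb.permute(0, 5, 1, 3, 2, 4).reshape(B, 3, h * self.patch, w * self.patch)
        return rgb

head = CoordPixelHead()
img = head(torch.randn(1, 128, 32, 32))
print(img.shape)  # torch.Size([1, 3, 256, 256])
```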

---

## Limitations

- Reconstruction is not perfect — small details may appear slightly blurred.
- Current MSE loss: 0.0020.
- This can likely be improved by increasing model size.

---

## Notes on 32× Compression

If you want **32× spatial compression**, do **not** use naive 32× patching — quality drops heavily.

A better approach (sketched in code after this list):

1. First stage: patch-8 → 16/32 channels
2. Second stage: patch-4 → 256 channels

This trains much better and works well for text-to-image training too.
I’ve tested it, and the results are significantly more stable than naive approaches.
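
A shape-only sketch of that cascade, assuming the second stage runs on the first stage’s latents (so the spatial factors multiply: 8 × 4 = 32). The channel counts follow the list above; everything else is illustrative.

```python
# Shape-only sketch of the two-stage cascade (assumes stage 2 patches the stage-1 latents)
H, W = 1024, 1024

# Stage 1: patch-8 AE with, e.g., 32 latent channels
c1, h1, w1 = 32, H // 8, W // 8          # (32, 128, 128)

# Stage 2: patch-4 AE on top of the stage-1 latents, 256 channels
c2, h2, w2 = 256, h1 // 4, w1 // 4       # (256, 32, 32)

print((c1, h1, w1), (c2, h2, w2))        # overall spatial compression: 8 * 4 = 32x
```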

If you want to keep FLOPs low, you could try using patch-16 from the start, but I’m not sure yet how stable the training would be.

I’m currently working on a **d32c64** model with reconstruction quality better than Hunyuan VAE, but I’m limited by compute resources.

---

## Support the Project

I’m renting an **RTX 5090** and running all experiments on it.
I’m currently looking for work and would love to join a team doing text-to-image or video model research.

If you want to support development:

- TRC20: TPssa5ung2MgqbaVr1aeBQEpHC3xfmm1CL
- BTC: bc1qfv6pyq5dvs0tths682nhfdnmdwnjvm2av80ej4
- Boosty: https://boosty.to/muinez

---

## How to use

Here’s a minimal example:

```python
import requests
import torch
from huggingface_hub import hf_hub_download
from PIL import Image
from torchvision.transforms import v2
from IPython.display import display

from stae import StupidAE

# Load the f8c16 checkpoint from the Hub
vae = StupidAE().cuda().half()
vae.load_state_dict(
    torch.load(hf_hub_download(repo_id="Muinez/StupidAE", filename="smol_f8c16.pt"))
)

# Map pixels to [-1, 1]
t = v2.Compose([
    v2.Resize((1024, 1024)),
    v2.ToTensor(),
    v2.Normalize([0.5], [0.5])
])

image = Image.open(requests.get(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG",
    stream=True,
).raw).convert("RGB")

with torch.inference_mode():
    image = t(image).unsqueeze(0).cuda().half()

    # Encode to a (1, 16, 128, 128) latent, then decode back to pixels
    latents = vae.encode(image)
    image_decoded = vae.decode(latents)

# Undo the [-1, 1] normalization and display the reconstruction
image = v2.ToPILImage()(torch.clamp(image_decoded * 0.5 + 0.5, 0, 1).squeeze(0))
display(image)
```

---

## Coming Soon

- Linear-encoder variant
- d32c64 model
- Tutorial: training text-to-image **without bucketing** (supports mixed aspect ratios)