sethmcknight committed
Commit: 7e43525
Parent(s): 4e8e860
refactor: remove PyTorch dependency by implementing L2 normalization with NumPy
- README.md +1 -0
- constraints.txt +1 -0
- src/embedding/embedding_service.py +4 -6
README.md
CHANGED
@@ -6,6 +6,7 @@ This application includes comprehensive memory management and monitoring for sta
 
 - **App Factory Pattern & Lazy Loading:** Services (RAG pipeline, embedding, search) are initialized only when needed, reducing startup memory from ~400MB to ~50MB.
 -- **Embedding Model Optimization:** Swapped to `paraphrase-MiniLM-L3-v2` (384 dims) for vector embeddings to enable reliable operation within Render's memory limits.
+-- **Torch Dependency Removal (Oct 2025):** Replaced `torch.nn.functional.normalize` with pure NumPy L2 normalization to eliminate PyTorch from production runtime, shrinking image size, speeding builds, and lowering memory.
 - **Gunicorn Configuration:** Single worker, minimal threads, aggressive recycling (`max_requests=50`, `preload_app=False`) to prevent memory leaks and keep usage low.
 - **Memory Utilities:** Added `MemoryManager` and utility functions for real-time memory tracking, garbage collection, and memory-aware error handling.
 - **Production Monitoring:** Added Render-specific memory monitoring with `/memory/render-status` endpoint, memory trend analysis, and automated alerts when approaching memory limits. See [Memory Monitoring Documentation](docs/memory_monitoring.md).
constraints.txt
CHANGED
@@ -15,3 +15,4 @@ psycopg2-binary==2.9.7
 optimum==1.22.0
 onnxruntime==1.18.1
 psutil==5.9.0
+# torch removed: switched embedding normalization to pure NumPy
src/embedding/embedding_service.py
CHANGED
@@ -4,7 +4,6 @@ import logging
 from typing import Dict, List, Optional, Tuple
 
 import numpy as np
-import torch
 from optimum.onnxruntime import ORTModelForFeatureExtraction
 from transformers import AutoTokenizer, PreTrainedTokenizer
 
@@ -152,11 +151,10 @@ class EmbeddingService:
 model_output, encoded_input["attention_mask"]
 )
 
-# Normalize embeddings
-
-
-
-batch_embeddings = normalized_embeddings.numpy()
+# Normalize embeddings (L2) using pure NumPy to avoid torch dependency
+norms = np.linalg.norm(sentence_embeddings, axis=1, keepdims=True)
+norms = np.clip(norms, 1e-12, None)
+batch_embeddings = sentence_embeddings / norms
 
 log_memory_checkpoint(f"batch_end_{i}//{self.batch_size}")
 
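The NumPy normalization added in this commit can be tried on its own. The following is a minimal sketch, not the repository's code: the helper name `l2_normalize` and the sample array are hypothetical, and it assumes `sentence_embeddings` is a 2-D float array of shape `(batch, dim)`. The clip to `1e-12` keeps an all-zero row from causing a division by zero.

```python
import numpy as np


def l2_normalize(embeddings: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm (hypothetical helper, for illustration)."""
    # Row-wise Euclidean norm; keepdims=True keeps shape (batch, 1) for broadcasting.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Floor the norms at eps so an all-zero row divides by eps instead of 0.
    norms = np.clip(norms, eps, None)
    return embeddings / norms


# Sample input (assumed data): one ordinary row and one all-zero row.
sentence_embeddings = np.array([[3.0, 4.0], [0.0, 0.0]], dtype=np.float32)
batch_embeddings = l2_normalize(sentence_embeddings)

# The first row becomes approximately [0.6, 0.8]; the zero row stays zero.
print(batch_embeddings)
```

`torch.nn.functional.normalize` also floors the norm at a default `eps` of `1e-12`, so the NumPy version should give matching results for 2-D input normalized along `axis=1`.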