sethmcknight/msse-ai-engineering
sethmcknight committed on Oct 24

Commit 7e43525

Parent: 4e8e860

refactor: remove PyTorch dependency by implementing L2 normalization with NumPy

Files changed (3)

README.md

@@ -6,6 +6,7 @@ This application includes comprehensive memory management and monitoring for sta
 - **App Factory Pattern & Lazy Loading:** Services (RAG pipeline, embedding, search) are initialized only when needed, reducing startup memory from ~400MB to ~50MB.
   -- **Embedding Model Optimization:** Swapped to `paraphrase-MiniLM-L3-v2` (384 dims) for vector embeddings to enable reliable operation within Render's memory limits.
+  -- **Torch Dependency Removal (Oct 2025):** Replaced `torch.nn.functional.normalize` with pure NumPy L2 normalization to eliminate PyTorch from production runtime, shrinking image size, speeding builds, and lowering memory.
 - **Gunicorn Configuration:** Single worker, minimal threads, aggressive recycling (`max_requests=50`, `preload_app=False`) to prevent memory leaks and keep usage low.
 - **Memory Utilities:** Added `MemoryManager` and utility functions for real-time memory tracking, garbage collection, and memory-aware error handling.
 - **Production Monitoring:** Added Render-specific memory monitoring with `/memory/render-status` endpoint, memory trend analysis, and automated alerts when approaching memory limits. See [Memory Monitoring Documentation](docs/memory_monitoring.md).
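
For reference, the Gunicorn settings named in the README bullet above could be expressed in a config file roughly as follows. This is a sketch, not the repository's actual file: the file name `gunicorn.conf.py` and the `threads` value are assumptions (the README says only "minimal threads"), while `max_requests=50` and `preload_app=False` come from the README text.

```python
# gunicorn.conf.py (hypothetical file name; not taken from this commit)

# One worker process, per the README's "single worker".
workers = 1

# Assumption: the README says only "minimal threads"; 1 is a guess.
threads = 1

# Restart the worker after 50 requests so leaked memory is reclaimed.
max_requests = 50

# Load the app inside the worker rather than the master process,
# so lazily initialized services are created only when needed.
preload_app = False
```
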

constraints.txt

@@ -15,3 +15,4 @@ psycopg2-binary==2.9.7
 optimum==1.22.0
 onnxruntime==1.18.1
 psutil==5.9.0
+# torch removed: switched embedding normalization to pure NumPy

src/embedding/embedding_service.py

@@ -4,7 +4,6 @@ import logging
 from typing import Dict, List, Optional, Tuple
 import numpy as np
-import torch
 from optimum.onnxruntime import ORTModelForFeatureExtraction
 from transformers import AutoTokenizer, PreTrainedTokenizer
@@ -152,11 +151,10 @@ class EmbeddingService:
                     model_output, encoded_input["attention_mask"]
                 )
-                # Normalize embeddings
-                normalized_embeddings = torch.nn.functional.normalize(
-                    torch.from_numpy(sentence_embeddings), p=2, dim=1
-                )
-                batch_embeddings = normalized_embeddings.numpy()
+                # Normalize embeddings (L2) using pure NumPy to avoid torch dependency
+                norms = np.linalg.norm(sentence_embeddings, axis=1, keepdims=True)
+                norms = np.clip(norms, 1e-12, None)
+                batch_embeddings = sentence_embeddings / norms
                 log_memory_checkpoint(f"batch_end_{i}//{self.batch_size}")
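
The replacement above can be exercised in isolation. The sketch below wraps the same three NumPy steps in a standalone function; the function name `l2_normalize`, the `eps` parameter, and the sample batch are hypothetical and not part of the commit. The `1e-12` floor matches the default `eps` of `torch.nn.functional.normalize`, so the result should agree with the removed torch call for 2-D inputs.

```python
import numpy as np


def l2_normalize(embeddings: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row of a 2-D array to unit L2 length.

    Rows whose norm is below eps are divided by eps instead,
    which avoids division by zero for all-zero rows.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    norms = np.clip(norms, eps, None)  # floor the norms at eps
    return embeddings / norms


# Hypothetical batch: two ordinary rows and one all-zero row.
batch = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 0.0]], dtype=np.float32)
unit = l2_normalize(batch)
```

Here the first row becomes `[0.6, 0.8]`, and the all-zero row stays zero rather than producing NaN, which is the case the `np.clip` floor guards against.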