Commit 7e43525 by sethmcknight · 1 parent: 4e8e860

refactor: remove PyTorch dependency by implementing L2 normalization with NumPy

README.md CHANGED
@@ -6,6 +6,7 @@ This application includes comprehensive memory management and monitoring for sta

  - **App Factory Pattern & Lazy Loading:** Services (RAG pipeline, embedding, search) are initialized only when needed, reducing startup memory from ~400MB to ~50MB.
  -- **Embedding Model Optimization:** Swapped to `paraphrase-MiniLM-L3-v2` (384 dims) for vector embeddings to enable reliable operation within Render's memory limits.
+ -- **Torch Dependency Removal (Oct 2025):** Replaced `torch.nn.functional.normalize` with pure NumPy L2 normalization to eliminate PyTorch from production runtime, shrinking image size, speeding builds, and lowering memory.
  - **Gunicorn Configuration:** Single worker, minimal threads, aggressive recycling (`max_requests=50`, `preload_app=False`) to prevent memory leaks and keep usage low.
  - **Memory Utilities:** Added `MemoryManager` and utility functions for real-time memory tracking, garbage collection, and memory-aware error handling.
  - **Production Monitoring:** Added Render-specific memory monitoring with `/memory/render-status` endpoint, memory trend analysis, and automated alerts when approaching memory limits. See [Memory Monitoring Documentation](docs/memory_monitoring.md).
constraints.txt CHANGED
@@ -15,3 +15,4 @@ psycopg2-binary==2.9.7
  optimum==1.22.0
  onnxruntime==1.18.1
  psutil==5.9.0
+ # torch removed: switched embedding normalization to pure NumPy
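Reviewer note: dropping the pin only helps if nothing reintroduces a direct `import torch` later. One way to guard against that is a small static check over a module's source. The sketch below is an assumption, not part of this commit; the helper name `imported_top_level_modules` is hypothetical. It only detects direct imports in the given source text, so it cannot tell whether a third-party package imports torch on its own.

```python
import ast
from typing import Set


def imported_top_level_modules(source: str) -> Set[str]:
    """Return the top-level package names imported anywhere in `source`.

    Hypothetical helper for illustration: it parses the text statically,
    so nothing is executed and transitive imports are not followed.
    """
    found: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b.c` -> record "a"
            for alias in node.names:
                found.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (`from . import x`); they are local code.
            if node.module and node.level == 0:
                found.add(node.module.split(".")[0])
    return found


# Import lines as they appear in embedding_service.py after this commit.
new_imports = (
    "import numpy as np\n"
    "from optimum.onnxruntime import ORTModelForFeatureExtraction\n"
    "from transformers import AutoTokenizer, PreTrainedTokenizer\n"
)
modules = imported_top_level_modules(new_imports)
print("torch" in modules)
```

A check like this could run in CI against the files under `src/` and fail when `"torch"` shows up in the result.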
src/embedding/embedding_service.py CHANGED
@@ -4,7 +4,6 @@ import logging
  from typing import Dict, List, Optional, Tuple

  import numpy as np
- import torch
  from optimum.onnxruntime import ORTModelForFeatureExtraction
  from transformers import AutoTokenizer, PreTrainedTokenizer

@@ -152,11 +151,10 @@ class EmbeddingService:
      model_output, encoded_input["attention_mask"]
  )

- # Normalize embeddings
- normalized_embeddings = torch.nn.functional.normalize(
-     torch.from_numpy(sentence_embeddings), p=2, dim=1
- )
- batch_embeddings = normalized_embeddings.numpy()
+ # Normalize embeddings (L2) using pure NumPy to avoid torch dependency
+ norms = np.linalg.norm(sentence_embeddings, axis=1, keepdims=True)
+ norms = np.clip(norms, 1e-12, None)
+ batch_embeddings = sentence_embeddings / norms

  log_memory_checkpoint(f"batch_end_{i}//{self.batch_size}")
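Reviewer note: the added lines can be tried in isolation. The sketch below wraps the same three NumPy calls in a standalone function; the name `l2_normalize` and the sample batch are assumptions for illustration and do not appear in the repository. The `1e-12` floor matches the default `eps` of `torch.nn.functional.normalize`, which divides by `max(norm, eps)`, so an all-zero row stays zero rather than becoming NaN.

```python
import numpy as np


def l2_normalize(embeddings: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row of a 2-D array to unit L2 norm.

    Norms below `eps` are clamped to `eps` before dividing, so zero rows
    do not trigger a division by zero.
    """
    # keepdims=True keeps shape (n, 1) so the division broadcasts per row.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    norms = np.clip(norms, eps, None)
    return embeddings / norms


# Hypothetical batch: a 3-4-5 vector, an all-zero row, and a unit vector.
batch = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]], dtype=np.float32)
unit = l2_normalize(batch)
print(unit)
```

Because the rows come out with unit length, a dot product between two of them equals their cosine similarity, which is the usual reason embeddings are normalized before being stored for vector search.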