# PostgreSQL Migration Guide

## Overview
This branch adds PostgreSQL with pgvector as an alternative to ChromaDB for vector storage. It reduces memory usage from 400MB+ to roughly 50-100MB by storing vectors on disk instead of in RAM.
## What's Been Implemented

### 1. PostgresVectorService (`src/vector_db/postgres_vector_service.py`)
- Full PostgreSQL integration with pgvector extension
- Automatic table creation and indexing
- Similarity search using cosine distance
- Document CRUD operations
- Health monitoring and collection info
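The core of such a service comes down to two SQL statements: a DDL statement for a table with a pgvector column, and a similarity query ordered by cosine distance. The sketch below is illustrative, not the branch's actual code; the table name `documents` and the 384-dimension embedding size are assumptions.

```python
# Illustrative sketch of the SQL a pgvector-backed service might issue.
# The table name "documents" and dimension 384 are assumptions.

def create_table_sql(table: str = "documents", dim: int = 384) -> str:
    """DDL: one row per document, embedding stored as a pgvector column."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table} ("
        "id TEXT PRIMARY KEY, "
        "document TEXT, "
        "metadata JSONB, "
        f"embedding vector({dim}))"
    )

def cosine_search_sql(table: str = "documents", top_k: int = 5) -> str:
    """Similarity search; <=> is pgvector's cosine-distance operator."""
    return (
        "SELECT id, document, metadata, embedding <=> %s AS distance "
        f"FROM {table} ORDER BY distance LIMIT {top_k}"
    )
```

The query embedding is passed as a bound parameter (`%s`) rather than interpolated into the SQL string.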
### 2. PostgresVectorAdapter (`src/vector_db/postgres_adapter.py`)
- Compatibility layer for existing ChromaDB interface
- Ensures seamless migration without code changes
- Converts between PostgreSQL and ChromaDB result formats
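ChromaDB's `query()` returns parallel lists nested one level deep (one inner list per query). A minimal sketch of the kind of conversion the adapter performs, assuming rows arrive as `(id, document, metadata, distance)` tuples:

```python
def rows_to_chroma_results(rows):
    """Convert (id, document, metadata, distance) rows into the
    nested-list shape ChromaDB's query() returns for a single query."""
    return {
        "ids": [[r[0] for r in rows]],
        "documents": [[r[1] for r in rows]],
        "metadatas": [[r[2] for r in rows]],
        "distances": [[r[3] for r in rows]],
    }
```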
### 3. Updated Configuration (`src/config.py`)
- Added `VECTOR_STORAGE_TYPE` environment variable
- PostgreSQL connection settings
- Memory optimization parameters
### 4. Factory Pattern (`src/vector_store/vector_db.py`)
- `create_vector_database()` function selects the backend automatically
- Supports both ChromaDB and PostgreSQL based on configuration
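The selection logic can be sketched as below. Stand-in classes replace the real backend imports so the sketch runs on its own; the actual factory would import the real services.

```python
import os

# Stand-in backends so the sketch is self-contained; the real factory
# would import the actual ChromaDB and PostgreSQL services instead.
class ChromaBackend:
    name = "chroma"

class PostgresBackend:
    name = "postgres"

def create_vector_database(env=None):
    """Select the backend from VECTOR_STORAGE_TYPE; ChromaDB is the default."""
    env = os.environ if env is None else env
    backend = env.get("VECTOR_STORAGE_TYPE", "chroma").lower()
    if backend == "postgres":
        return PostgresBackend()
    if backend == "chroma":
        return ChromaBackend()
    raise ValueError(f"Unknown VECTOR_STORAGE_TYPE: {backend!r}")
```

Defaulting to ChromaDB keeps existing deployments working until the variable is explicitly set.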
### 5. Migration Script (`scripts/migrate_to_postgres.py`)
- Data optimization (text summarization, metadata cleaning)
- Batch processing with memory management
- Handles 4GB → 1GB data reduction for free tier
### 6. Tests (`tests/test_vector_store/test_postgres_vector.py`)
- Unit tests with mocked dependencies
- Integration tests for real database
- Compatibility tests for ChromaDB interface
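One common pattern for gating the integration tests on a real database is a skip condition tied to `TEST_DATABASE_URL`. A hedged sketch (the helper name is an assumption, not the suite's actual code):

```python
import os
import unittest

def integration_enabled(env=None):
    """True only when a test database URL is configured."""
    env = os.environ if env is None else env
    return bool(env.get("TEST_DATABASE_URL"))

@unittest.skipUnless(integration_enabled(), "set TEST_DATABASE_URL to run")
class TestPostgresVectorIntegration(unittest.TestCase):
    def test_health_check(self):
        # Would connect via TEST_DATABASE_URL and assert on health_check().
        pass
```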
## Setup Instructions

### Step 1: Create Render PostgreSQL Database
- Go to Render Dashboard
- Create → PostgreSQL
- Choose "Free" plan (1GB storage, 30 days)
- Save the connection details
### Step 2: Enable pgvector Extension
You have several options to enable pgvector:

**Option A: Use the initialization script (Recommended)**

```bash
# Set your database URL
export DATABASE_URL="postgresql://user:password@host:port/database"

# Run the initialization script
python scripts/init_pgvector.py
```
**Option B: Manual SQL**

Connect to your database and run:

```sql
CREATE EXTENSION IF NOT EXISTS vector;
```
**Option C: From Render Dashboard**
- Go to your PostgreSQL service → Info tab
- Use the "PSQL Command" to connect
- Run: `CREATE EXTENSION IF NOT EXISTS vector;`
The initialization script (`scripts/init_pgvector.py`) will:
- Test database connection
- Check PostgreSQL version compatibility (13+)
- Install pgvector extension safely
- Verify vector operations work correctly
- Provide detailed logging and error messages
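The 13+ version check can be done against PostgreSQL's `server_version_num`, which packs the version as `major * 10000 + minor`. A sketch of what such a check might look like (function names are illustrative):

```python
def parse_version(server_version_num: int) -> tuple:
    """PostgreSQL reports server_version_num as major * 10000 + minor."""
    return divmod(server_version_num, 10000)

def is_supported(server_version_num: int, minimum_major: int = 13) -> bool:
    """pgvector needs a reasonably recent server; the guide requires 13+."""
    major, _minor = parse_version(server_version_num)
    return major >= minimum_major
```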
### Step 3: Update Environment Variables
Add to your Render environment variables:

```bash
DATABASE_URL=postgresql://username:password@host:port/database
VECTOR_STORAGE_TYPE=postgres
MEMORY_LIMIT_MB=400
```
### Step 4: Install Dependencies

```bash
pip install psycopg2-binary==2.9.7
```
### Step 5: Run Migration (Optional)
If you have existing ChromaDB data:

```bash
python scripts/migrate_to_postgres.py --database-url="your-connection-string"
```
## Usage

### Switch to PostgreSQL
Set the environment variable:

```bash
export VECTOR_STORAGE_TYPE=postgres
```
### Use in Code (No Changes Required!)

```python
from src.vector_store.vector_db import create_vector_database

# Automatically uses PostgreSQL if VECTOR_STORAGE_TYPE=postgres
vector_db = create_vector_database()
vector_db.add_embeddings(embeddings, ids, documents, metadatas)
results = vector_db.search(query_embedding, top_k=5)
```
## Expected Memory Reduction
| Component | Before (ChromaDB) | After (PostgreSQL) | Savings |
|---|---|---|---|
| Vector Storage | 200-300MB | 0MB (disk) | 200-300MB |
| Embedding Model | 100MB | 50MB (smaller model) | 50MB |
| Application Code | 50-100MB | 50-100MB | 0MB |
| Total | 350-500MB | 50-150MB | 300-350MB |
## Migration Optimizations

### Data Size Reduction
- **Text Summarization**: Documents truncated to 1000 characters
- **Metadata Cleaning**: Only essential fields kept
- **Dimension Reduction**: Can use smaller embedding models
- **Quality Filtering**: Skip very short or low-quality documents
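These optimizations are simple per-document transforms. A sketch under assumptions (the essential-field list and the minimum-length threshold are illustrative, not the script's actual values):

```python
ESSENTIAL_KEYS = {"source", "title", "chunk_id"}  # assumed, not the script's list

def truncate_text(text: str, max_chars: int = 1000) -> str:
    """Text summarization step: hard-truncate to the 1000-character budget."""
    return text[:max_chars]

def clean_metadata(meta: dict) -> dict:
    """Metadata cleaning step: keep only essential fields."""
    return {k: v for k, v in meta.items() if k in ESSENTIAL_KEYS}

def passes_quality_filter(text: str, min_chars: int = 50) -> bool:
    """Quality filter: drop very short documents (threshold assumed)."""
    return len(text.strip()) >= min_chars
```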
### Memory Management
- **Batch Processing**: Process documents in small batches
- **Garbage Collection**: Aggressive cleanup between operations
- **Streaming**: Process data without loading everything into memory
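The batch-plus-GC pattern above can be sketched as a generator that yields fixed-size slices, with an explicit `gc.collect()` between batches. Batch size and function names are assumptions for illustration:

```python
import gc

def batches(items, size=100):
    """Yield fixed-size slices so only one batch is live at a time."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def process_all(items, handle, size=100):
    """Stream items through `handle` batch by batch, collecting garbage
    between batches to keep the resident set small."""
    processed = 0
    for batch in batches(items, size):
        handle(batch)
        processed += len(batch)
        gc.collect()  # aggressive cleanup between operations
    return processed
```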
## Testing

### Unit Tests

```bash
pytest tests/test_vector_store/test_postgres_vector.py -v
```
### Integration Tests (Requires Database)

```bash
export TEST_DATABASE_URL="postgresql://test:test@localhost:5432/test_db"
pytest tests/test_vector_store/test_postgres_vector.py -m integration -v
```
### Migration Test

```bash
python scripts/migrate_to_postgres.py --test-only
```
## Deployment

### Local Development
Keep using ChromaDB:

```bash
export VECTOR_STORAGE_TYPE=chroma
```
### Production (Render)
Switch to PostgreSQL:

```bash
export VECTOR_STORAGE_TYPE=postgres
export DATABASE_URL="your-render-postgres-url"
```
## Troubleshooting

### Common Issues

**"pgvector extension not found"**
- Run `CREATE EXTENSION vector;` in your database

**Connection errors**
- Verify the DATABASE_URL format: `postgresql://user:pass@host:port/db`
- Check firewall/network connectivity

**Memory still high**
- Verify `VECTOR_STORAGE_TYPE=postgres` is set
- Check that old ChromaDB files aren't being loaded
### Monitoring

```python
from src.vector_db.postgres_vector_service import PostgresVectorService

service = PostgresVectorService()
health = service.health_check()
print(health)  # Shows connection status, document count, etc.
```
## Rollback Plan
If issues occur, simply change back to ChromaDB:

```bash
export VECTOR_STORAGE_TYPE=chroma
```
The factory pattern ensures seamless switching between backends.
## Performance Comparison
| Operation | ChromaDB | PostgreSQL | Notes |
|---|---|---|---|
| Insert | Fast | Medium | Network overhead |
| Search | Very Fast | Fast | pgvector is optimized |
| Memory | High | Low | Vectors stored on disk |
| Persistence | File-based | Database | More reliable |
| Scaling | Limited | Excellent | Can upgrade storage |
## Next Steps
- Test locally with PostgreSQL
- Create Render PostgreSQL database
- Run migration script
- Deploy with `VECTOR_STORAGE_TYPE=postgres`
- Monitor memory usage in production