
PostgreSQL Migration Guide

Overview

This branch implements PostgreSQL with pgvector as an alternative to ChromaDB for vector storage. This reduces memory usage from 400MB+ to ~50-100MB by storing vectors on disk instead of in RAM.

What's Been Implemented

1. PostgresVectorService (src/vector_db/postgres_vector_service.py)

  • Full PostgreSQL integration with pgvector extension
  • Automatic table creation and indexing
  • Similarity search using cosine distance
  • Document CRUD operations
  • Health monitoring and collection info
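The similarity search relies on pgvector's cosine-distance operator `<=>`. As a rough sketch of how such a query can be built (the table and column names here are assumptions for illustration, not the actual schema used by `postgres_vector_service.py`):

```python
def build_search_query(table: str = "documents", top_k: int = 5) -> str:
    """Return a parameterized SQL query using pgvector's cosine-distance
    operator (<=>). The query embedding is supplied as a bind parameter;
    1 - distance converts cosine distance into a similarity score."""
    return (
        f"SELECT id, document, metadata, "
        f"1 - (embedding <=> %s::vector) AS similarity "
        f"FROM {table} "
        f"ORDER BY embedding <=> %s::vector "
        f"LIMIT {int(top_k)}"
    )
```

Because ordering happens in the database, only the top-k rows ever reach the application, which is what keeps application memory flat.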

2. PostgresVectorAdapter (src/vector_db/postgres_adapter.py)

  • Compatibility layer for existing ChromaDB interface
  • Ensures seamless migration without code changes
  • Converts between PostgreSQL and ChromaDB result formats
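The format conversion is mostly reshaping: PostgreSQL returns flat rows, while ChromaDB's `query()` returns a dict of nested lists. A minimal sketch of that translation (assuming rows of `(id, document, metadata, distance)`; the real adapter may carry more fields):

```python
def rows_to_chroma_results(rows):
    """Convert (id, document, metadata, distance) tuples from a PostgreSQL
    query into the nested-list dict shape ChromaDB's query() returns.
    The outer list wraps one inner list per query (here, always one query)."""
    return {
        "ids": [[r[0] for r in rows]],
        "documents": [[r[1] for r in rows]],
        "metadatas": [[r[2] for r in rows]],
        "distances": [[r[3] for r in rows]],
    }
```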

3. Updated Configuration (src/config.py)

  • Added VECTOR_STORAGE_TYPE environment variable
  • PostgreSQL connection settings
  • Memory optimization parameters

4. Factory Pattern (src/vector_store/vector_db.py)

  • create_vector_database() function selects backend automatically
  • Supports both ChromaDB and PostgreSQL based on configuration
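In outline, the factory is a small dispatch on the environment variable. A minimal sketch, with stand-in classes in place of the real backends:

```python
import os

class ChromaVectorDatabase:      # stand-in for the real ChromaDB backend
    backend = "chroma"

class PostgresVectorAdapter:     # stand-in for the real PostgreSQL adapter
    backend = "postgres"

def create_vector_database():
    """Select the vector storage backend from VECTOR_STORAGE_TYPE.
    Defaults to ChromaDB when the variable is unset."""
    storage = os.getenv("VECTOR_STORAGE_TYPE", "chroma").lower()
    if storage == "postgres":
        return PostgresVectorAdapter()
    return ChromaVectorDatabase()
```

Because both classes expose the same interface, calling code never needs to know which backend it received.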

5. Migration Script (scripts/migrate_to_postgres.py)

  • Data optimization (text summarization, metadata cleaning)
  • Batch processing with memory management
  • Reduces the dataset from roughly 4GB to under 1GB to fit the free tier's storage limit

6. Tests (tests/test_vector_store/test_postgres_vector.py)

  • Unit tests with mocked dependencies
  • Integration tests for real database
  • Compatibility tests for ChromaDB interface

Setup Instructions

Step 1: Create Render PostgreSQL Database

  1. Go to Render Dashboard
  2. Create → PostgreSQL
  3. Choose the "Free" plan (1GB storage, expires after 30 days)
  4. Save the connection details

Step 2: Enable pgvector Extension

You have several options to enable pgvector:

Option A: Use the initialization script (Recommended)

# Set your database URL
export DATABASE_URL="postgresql://user:password@host:port/database"

# Run the initialization script
python scripts/init_pgvector.py

Option B: Manual SQL

Connect to your database and run:

CREATE EXTENSION IF NOT EXISTS vector;

Option C: From Render Dashboard

  1. Go to your PostgreSQL service → Info tab
  2. Use the "PSQL Command" to connect
  3. Run: CREATE EXTENSION IF NOT EXISTS vector;

The initialization script (scripts/init_pgvector.py) will:

  • Test database connection
  • Check PostgreSQL version compatibility (13+)
  • Install pgvector extension safely
  • Verify vector operations work correctly
  • Provide detailed logging and error messages
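Conceptually, those checks boil down to three SQL statements run in order. This is a hypothetical sketch of the sequence, not the actual internals of `scripts/init_pgvector.py`:

```python
INIT_STEPS = [
    "SHOW server_version;",                    # confirm PostgreSQL 13+
    "CREATE EXTENSION IF NOT EXISTS vector;",  # install pgvector (idempotent)
    "SELECT '[1,2,3]'::vector;",               # verify the vector type works
]

def init_pgvector(conn):
    """Run each initialization step against an open psycopg2-style
    connection, then commit so the extension install persists."""
    with conn.cursor() as cur:
        for sql in INIT_STEPS:
            cur.execute(sql)
    conn.commit()
```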

Step 3: Update Environment Variables

Add to your Render environment variables:

DATABASE_URL=postgresql://username:password@host:port/database
VECTOR_STORAGE_TYPE=postgres
MEMORY_LIMIT_MB=400

Step 4: Install Dependencies

pip install psycopg2-binary==2.9.7

Step 5: Run Migration (Optional)

If you have existing ChromaDB data:

python scripts/migrate_to_postgres.py --database-url="your-connection-string"

Usage

Switch to PostgreSQL

Set environment variable:

export VECTOR_STORAGE_TYPE=postgres

Use in Code (No Changes Required!)

from src.vector_store.vector_db import create_vector_database

# Automatically uses PostgreSQL if VECTOR_STORAGE_TYPE=postgres
vector_db = create_vector_database()
vector_db.add_embeddings(embeddings, ids, documents, metadatas)
results = vector_db.search(query_embedding, top_k=5)

Expected Memory Reduction

| Component        | Before (ChromaDB) | After (PostgreSQL)   | Savings   |
|------------------|-------------------|----------------------|-----------|
| Vector Storage   | 200-300MB         | 0MB (on disk)        | 200-300MB |
| Embedding Model  | 100MB             | 50MB (smaller model) | 50MB      |
| Application Code | 50-100MB          | 50-100MB             | 0MB       |
| Total            | 350-500MB         | 50-150MB             | 300-350MB |

Migration Optimizations

Data Size Reduction

  • Text Summarization: Documents truncated to 1000 characters
  • Metadata Cleaning: Only essential fields kept
  • Dimension Reduction: Can use smaller embedding models
  • Quality Filtering: Skip very short or low-quality documents
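The first three reductions can be sketched as a single per-document pass. The field names in `ESSENTIAL_FIELDS` are assumptions for illustration, not the migration script's actual allowlist:

```python
ESSENTIAL_FIELDS = {"source", "title", "chunk_id"}  # assumed metadata keys

def optimize_document(text, metadata, max_chars=1000, min_chars=50):
    """Truncate text to max_chars, keep only essential metadata fields,
    and return None for documents too short to be worth storing."""
    if len(text) < min_chars:
        return None  # quality filter: skip very short documents
    cleaned = {k: v for k, v in metadata.items() if k in ESSENTIAL_FIELDS}
    return text[:max_chars], cleaned
```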

Memory Management

  • Batch Processing: Process documents in small batches
  • Garbage Collection: Aggressive cleanup between operations
  • Streaming: Process data without loading everything into memory
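Combined, batching and explicit garbage collection look roughly like this (a sketch of the pattern, assuming a caller-supplied `insert_batch` function rather than the script's real insert path):

```python
import gc

def batched(items, batch_size=100):
    """Yield fixed-size slices so only one batch is materialized at a time."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def migrate(documents, insert_batch, batch_size=100):
    """Insert documents batch by batch, forcing garbage collection
    between batches to keep the working set small."""
    for batch in batched(documents, batch_size):
        insert_batch(batch)
        gc.collect()  # aggressive cleanup between operations
```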

Testing

Unit Tests

pytest tests/test_vector_store/test_postgres_vector.py -v

Integration Tests (Requires Database)

export TEST_DATABASE_URL="postgresql://test:test@localhost:5432/test_db"
pytest tests/test_vector_store/test_postgres_vector.py -m integration -v

Migration Test

python scripts/migrate_to_postgres.py --test-only

Deployment

Local Development

Keep using ChromaDB:

export VECTOR_STORAGE_TYPE=chroma

Production (Render)

Switch to PostgreSQL:

export VECTOR_STORAGE_TYPE=postgres
export DATABASE_URL="your-render-postgres-url"

Troubleshooting

Common Issues

  1. "pgvector extension not found"

    • Run CREATE EXTENSION vector; in your database
  2. Connection errors

    • Verify DATABASE_URL format: postgresql://user:pass@host:port/db
    • Check firewall/network connectivity
  3. Memory still high

    • Verify VECTOR_STORAGE_TYPE=postgres
    • Check that old ChromaDB files aren't being loaded

Monitoring

from src.vector_db.postgres_vector_service import PostgresVectorService

service = PostgresVectorService()
health = service.health_check()
print(health)  # Shows connection status, document count, etc.

Rollback Plan

If issues occur, switch the backend back to ChromaDB:

export VECTOR_STORAGE_TYPE=chroma

The factory pattern ensures seamless switching between backends.

Performance Comparison

| Operation   | ChromaDB   | PostgreSQL | Notes                  |
|-------------|------------|------------|------------------------|
| Insert      | Fast       | Medium     | Network overhead       |
| Search      | Very Fast  | Fast       | pgvector is optimized  |
| Memory      | High       | Low        | Vectors stored on disk |
| Persistence | File-based | Database   | More reliable          |
| Scaling     | Limited    | Excellent  | Can upgrade storage    |

Next Steps

  1. Test locally with PostgreSQL
  2. Create Render PostgreSQL database
  3. Run migration script
  4. Deploy with VECTOR_STORAGE_TYPE=postgres
  5. Monitor memory usage in production