
PostgreSQL Migration Guide

Overview

This branch implements PostgreSQL with pgvector as an alternative to ChromaDB for vector storage. This reduces memory usage from 400MB+ to ~50-100MB by storing vectors on disk instead of in RAM.

What's Been Implemented

1. PostgresVectorService (src/vector_db/postgres_vector_service.py)

  • Full PostgreSQL integration with pgvector extension
  • Automatic table creation and indexing
  • Similarity search using cosine distance
  • Document CRUD operations
  • Health monitoring and collection info
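The similarity search relies on pgvector's cosine-distance operator `<=>`. As a rough sketch of how such a query can be built (the table and column names here are assumptions for illustration, not the actual schema used by `postgres_vector_service.py`):

```python
def build_search_query(table: str = "documents", top_k: int = 5) -> str:
    """Return a parameterized SQL query using pgvector's cosine-distance
    operator (<=>). The query embedding is supplied as a bind parameter;
    1 - distance converts cosine distance into a similarity score."""
    return (
        f"SELECT id, document, metadata, "
        f"1 - (embedding <=> %s::vector) AS similarity "
        f"FROM {table} "
        f"ORDER BY embedding <=> %s::vector "
        f"LIMIT {int(top_k)}"
    )
```

Because ordering happens in the database, only the top-k rows ever reach the application, which is what keeps application memory flat.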

2. PostgresVectorAdapter (src/vector_db/postgres_adapter.py)

  • Compatibility layer for existing ChromaDB interface
  • Ensures seamless migration without code changes
  • Converts between PostgreSQL and ChromaDB result formats
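The format conversion is mostly reshaping: PostgreSQL returns flat rows, while ChromaDB's `query()` returns a dict of nested lists. A minimal sketch of that translation (assuming rows of `(id, document, metadata, distance)`; the real adapter may carry more fields):

```python
def rows_to_chroma_results(rows):
    """Convert (id, document, metadata, distance) tuples from a PostgreSQL
    query into the nested-list dict shape ChromaDB's query() returns.
    The outer list wraps one inner list per query (here, always one query)."""
    return {
        "ids": [[r[0] for r in rows]],
        "documents": [[r[1] for r in rows]],
        "metadatas": [[r[2] for r in rows]],
        "distances": [[r[3] for r in rows]],
    }
```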

3. Updated Configuration (src/config.py)

  • Added VECTOR_STORAGE_TYPE environment variable
  • PostgreSQL connection settings
  • Memory optimization parameters

4. Factory Pattern (src/vector_store/vector_db.py)

  • create_vector_database() function selects backend automatically
  • Supports both ChromaDB and PostgreSQL based on configuration
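In outline, the factory is a small dispatch on the environment variable. A minimal sketch, with stand-in classes in place of the real backends:

```python
import os

class ChromaVectorDatabase:      # stand-in for the real ChromaDB backend
    backend = "chroma"

class PostgresVectorAdapter:     # stand-in for the real PostgreSQL adapter
    backend = "postgres"

def create_vector_database():
    """Select the vector storage backend from VECTOR_STORAGE_TYPE.
    Defaults to ChromaDB when the variable is unset."""
    storage = os.getenv("VECTOR_STORAGE_TYPE", "chroma").lower()
    if storage == "postgres":
        return PostgresVectorAdapter()
    return ChromaVectorDatabase()
```

Because both classes expose the same interface, calling code never needs to know which backend it received.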

5. Migration Script (scripts/migrate_to_postgres.py)

  • Data optimization (text summarization, metadata cleaning)
  • Batch processing with memory management
  • Reduces the dataset from roughly 4GB to under 1GB to fit the free tier's storage limit

6. Tests (tests/test_vector_store/test_postgres_vector.py)

  • Unit tests with mocked dependencies
  • Integration tests for real database
  • Compatibility tests for ChromaDB interface

Setup Instructions

Step 1: Create Render PostgreSQL Database

  1. Go to Render Dashboard
  2. Create → PostgreSQL
  3. Choose the "Free" plan (1GB storage, expires after 30 days)
  4. Save the connection details

Step 2: Enable pgvector Extension

You have several options to enable pgvector:

Option A: Use the initialization script (Recommended)

# Set your database URL
export DATABASE_URL="postgresql://user:password@host:port/database"

# Run the initialization script
python scripts/init_pgvector.py

Option B: Manual SQL

Connect to your database and run:

CREATE EXTENSION IF NOT EXISTS vector;

Option C: From Render Dashboard

  1. Go to your PostgreSQL service → Info tab
  2. Use the "PSQL Command" to connect
  3. Run: CREATE EXTENSION IF NOT EXISTS vector;

The initialization script (scripts/init_pgvector.py) will:

  • Test database connection
  • Check PostgreSQL version compatibility (13+)
  • Install pgvector extension safely
  • Verify vector operations work correctly
  • Provide detailed logging and error messages
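Conceptually, those checks boil down to three SQL statements run in order. This is a hypothetical sketch of the sequence, not the actual internals of `scripts/init_pgvector.py`:

```python
INIT_STEPS = [
    "SHOW server_version;",                    # confirm PostgreSQL 13+
    "CREATE EXTENSION IF NOT EXISTS vector;",  # install pgvector (idempotent)
    "SELECT '[1,2,3]'::vector;",               # verify the vector type works
]

def init_pgvector(conn):
    """Run each initialization step against an open psycopg2-style
    connection, then commit so the extension install persists."""
    with conn.cursor() as cur:
        for sql in INIT_STEPS:
            cur.execute(sql)
    conn.commit()
```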

Step 3: Update Environment Variables

Add to your Render environment variables:

DATABASE_URL=postgresql://username:password@host:port/database
VECTOR_STORAGE_TYPE=postgres
MEMORY_LIMIT_MB=400

Step 4: Install Dependencies

pip install psycopg2-binary==2.9.7

Step 5: Run Migration (Optional)

If you have existing ChromaDB data:

python scripts/migrate_to_postgres.py --database-url="your-connection-string"

Usage

Switch to PostgreSQL

Set environment variable:

export VECTOR_STORAGE_TYPE=postgres

Use in Code (No Changes Required!)

from src.vector_store.vector_db import create_vector_database

# Automatically uses PostgreSQL if VECTOR_STORAGE_TYPE=postgres
vector_db = create_vector_database()
vector_db.add_embeddings(embeddings, ids, documents, metadatas)
results = vector_db.search(query_embedding, top_k=5)

Expected Memory Reduction

| Component        | Before (ChromaDB) | After (PostgreSQL)   | Savings   |
|------------------|-------------------|----------------------|-----------|
| Vector Storage   | 200-300MB         | 0MB (on disk)        | 200-300MB |
| Embedding Model  | 100MB             | 50MB (smaller model) | 50MB      |
| Application Code | 50-100MB          | 50-100MB             | 0MB       |
| Total            | 350-500MB         | 50-150MB             | 300-350MB |

Migration Optimizations

Data Size Reduction

  • Text Summarization: Documents truncated to 1000 characters
  • Metadata Cleaning: Only essential fields kept
  • Dimension Reduction: Can use smaller embedding models
  • Quality Filtering: Skip very short or low-quality documents
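The first three reductions can be sketched as a single per-document pass. The field names in `ESSENTIAL_FIELDS` are assumptions for illustration, not the migration script's actual allowlist:

```python
ESSENTIAL_FIELDS = {"source", "title", "chunk_id"}  # assumed metadata keys

def optimize_document(text, metadata, max_chars=1000, min_chars=50):
    """Truncate text to max_chars, keep only essential metadata fields,
    and return None for documents too short to be worth storing."""
    if len(text) < min_chars:
        return None  # quality filter: skip very short documents
    cleaned = {k: v for k, v in metadata.items() if k in ESSENTIAL_FIELDS}
    return text[:max_chars], cleaned
```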

Memory Management

  • Batch Processing: Process documents in small batches
  • Garbage Collection: Aggressive cleanup between operations
  • Streaming: Process data without loading everything into memory
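Combined, batching and explicit garbage collection look roughly like this (a sketch of the pattern, assuming a caller-supplied `insert_batch` function rather than the script's real insert path):

```python
import gc

def batched(items, batch_size=100):
    """Yield fixed-size slices so only one batch is materialized at a time."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def migrate(documents, insert_batch, batch_size=100):
    """Insert documents batch by batch, forcing garbage collection
    between batches to keep the working set small."""
    for batch in batched(documents, batch_size):
        insert_batch(batch)
        gc.collect()  # aggressive cleanup between operations
```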

Testing

Unit Tests

pytest tests/test_vector_store/test_postgres_vector.py -v

Integration Tests (Requires Database)

export TEST_DATABASE_URL="postgresql://test:test@localhost:5432/test_db"
pytest tests/test_vector_store/test_postgres_vector.py -m integration -v

Migration Test

python scripts/migrate_to_postgres.py --test-only

Deployment

Local Development

Keep using ChromaDB:

export VECTOR_STORAGE_TYPE=chroma

Production (Render)

Switch to PostgreSQL:

export VECTOR_STORAGE_TYPE=postgres
export DATABASE_URL="your-render-postgres-url"

Troubleshooting

Common Issues

  1. "pgvector extension not found"

    • Run CREATE EXTENSION vector; in your database
  2. Connection errors

    • Verify DATABASE_URL format: postgresql://user:pass@host:port/db
    • Check firewall/network connectivity
  3. Memory still high

    • Verify VECTOR_STORAGE_TYPE=postgres
    • Check that old ChromaDB files aren't being loaded

Monitoring

from src.vector_db.postgres_vector_service import PostgresVectorService

service = PostgresVectorService()
health = service.health_check()
print(health)  # Shows connection status, document count, etc.

Rollback Plan

If issues occur, switch the backend back to ChromaDB:

export VECTOR_STORAGE_TYPE=chroma

The factory pattern ensures seamless switching between backends.

Performance Comparison

| Operation   | ChromaDB   | PostgreSQL | Notes                  |
|-------------|------------|------------|------------------------|
| Insert      | Fast       | Medium     | Network overhead       |
| Search      | Very Fast  | Fast       | pgvector is optimized  |
| Memory      | High       | Low        | Vectors stored on disk |
| Persistence | File-based | Database   | More reliable          |
| Scaling     | Limited    | Excellent  | Can upgrade storage    |

Next Steps

  1. Test locally with PostgreSQL
  2. Create Render PostgreSQL database
  3. Run migration script
  4. Deploy with VECTOR_STORAGE_TYPE=postgres
  5. Monitor memory usage in production