Seth McKnight

RAG Application Project Plan

This plan outlines the steps to design, build, and deploy a Retrieval-Augmented Generation (RAG) application as per the project requirements, with a focus on achieving a grade of 5. The approach prioritizes early deployment and continuous integration, following Test-Driven Development (TDD) principles.

1. Foundational Setup

  • Repository: Create a new GitHub repository.
  • Virtual Environment: Set up a local Python virtual environment (venv).
  • Initial Files:
    • Create requirements.txt with initial dependencies (Flask, pytest).
    • Create a .gitignore file for Python.
    • Create a README.md with initial setup instructions.
    • Create placeholder files: deployed.md and design-and-evaluation.md.
  • Testing Framework: Establish a tests/ directory and configure pytest.

2. "Hello World" Deployment

  • Minimal App: Develop a minimal Flask application (app.py) with a /health endpoint that returns a JSON status object.
  • Unit Test: Write a test for the /health endpoint to ensure it returns a 200 OK status and the correct JSON payload.
  • Local Validation: Run the app and tests locally to confirm everything works.

3. CI/CD and Initial Deployment

  • Render Setup: Create a new Web Service on Render and link it to the GitHub repository.
  • Environment Configuration: Configure necessary environment variables on Render (e.g., PYTHON_VERSION).
  • GitHub Actions: Create a CI/CD workflow (.github/workflows/main.yml) that:
    • Triggers on push/PR to the main branch.
    • Installs dependencies from requirements.txt.
    • Runs the pytest test suite.
    • On success, triggers a deployment to Render.
  • Deployment Validation: Push a change and verify that the workflow runs successfully and the application is deployed.
  • Documentation: Update deployed.md with the live URL of the deployed application.

CI/CD optimizations added

  • Add pip cache to CI to speed up dependency installation.
  • Optimize pre-commit in PRs to run only changed-file hooks (use pre-commit run --from-ref ... --to-ref ...).

4. Data Ingestion and Processing

  • Corpus Assembly: Collect or generate 5-20 policy documents (PDF, TXT, MD) and place them in a synthetic_policies/ directory.
  • Parsing Logic: Implement and test functions to parse different document formats.
  • Chunking Strategy: Implement and test a document chunking strategy (e.g., recursive character splitting with overlap).
  • Reproducibility: Set fixed seeds for any processes involving randomness (e.g., chunking, sampling) to ensure deterministic outcomes.
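As an illustration of the chunking step, the sketch below uses a fixed-size character window with overlap, which is simpler than the recursive splitter named above. The function name `chunk_text` and the default sizes are assumptions for illustration, not the project's actual values. The splitter involves no randomness, so the same input always yields the same chunks.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character windows that share `overlap` characters.

    Hypothetical helper: the name and defaults are illustrative assumptions.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text,
        # so no trailing chunk made only of overlap is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of storing more embeddings.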

5. Embedding and Vector Storage ✅ PHASE 2B COMPLETED

  • Vector DB Setup: Integrate a vector database (ChromaDB) into the project.
  • Embedding Model: Select and integrate a free embedding model (paraphrase-MiniLM-L3-v2 chosen for memory efficiency).
  • Ingestion Pipeline: Create an enhanced ingestion pipeline that:
    • Loads documents from the corpus.
    • Chunks the documents with metadata.
    • Embeds the chunks using sentence-transformers.
    • Stores the embeddings in ChromaDB vector database.
    • Provides detailed processing statistics.
  • Testing: Write comprehensive tests (60+ tests) verifying each step of the ingestion pipeline.
  • Search API: Implement POST /search endpoint for semantic search with:
    • JSON request/response format
    • Configurable parameters (top_k, threshold)
    • Comprehensive input validation
    • Detailed error handling
  • End-to-End Testing: Complete pipeline testing from ingestion through search.
  • Documentation: Full API documentation with examples and performance metrics.
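The input validation for POST /search could look roughly like the sketch below. Only `top_k` and `threshold` come from the plan; the `query` field name, the defaults, the allowed ranges, and the function name `validate_search_request` are assumptions for illustration.

```python
from typing import Optional


def validate_search_request(payload: object) -> tuple[Optional[dict], list[str]]:
    """Return (cleaned_params, errors); cleaned_params is None on failure.

    Hypothetical helper: field names other than top_k/threshold, the
    defaults, and the bounds are illustrative assumptions.
    """
    if not isinstance(payload, dict):
        return None, ["request body must be a JSON object"]

    errors: list[str] = []

    query = payload.get("query")
    if not isinstance(query, str) or not query.strip():
        errors.append("query must be a non-empty string")

    top_k = payload.get("top_k", 5)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(top_k, bool) or not isinstance(top_k, int) or not 1 <= top_k <= 20:
        errors.append("top_k must be an integer between 1 and 20")

    threshold = payload.get("threshold", 0.3)
    if (
        isinstance(threshold, bool)
        or not isinstance(threshold, (int, float))
        or not 0.0 <= threshold <= 1.0
    ):
        errors.append("threshold must be a number between 0 and 1")

    if errors:
        return None, errors
    return {"query": query.strip(), "top_k": top_k, "threshold": float(threshold)}, []
```

Collecting every error before returning lets the endpoint report all problems in a single 400 response instead of one per round trip.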

6. RAG Core Implementation ✅ PHASE 3 COMPLETED

  • Retrieval Logic: Implement a function to retrieve the top-k relevant document chunks from the vector store based on a user query.
  • Prompt Engineering: Design a prompt template that injects the retrieved context into the query for the LLM.
  • LLM Integration: Connect to a free-tier LLM (e.g., via OpenRouter or Groq) to generate answers.
  • Basic Guardrails: Implement and test basic guardrails for context validation and response length limits.
  • Enhanced Guardrails (Issue #24): ✅ COMPLETED - Comprehensive guardrails and response quality system:
    • Content Safety Filtering: PII detection, bias mitigation, inappropriate content filtering
    • Response Quality Scoring: Multi-dimensional quality assessment (relevance, completeness, coherence, source fidelity)
    • Source Attribution: Automated citation generation and validation
    • Error Handling: Circuit breaker patterns and graceful degradation
    • Configuration System: Flexible thresholds and feature toggles
    • Testing: 13 comprehensive tests with 100% pass rate
    • Integration: Enhanced RAG pipeline with backward compatibility
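To make the retrieval and prompt-engineering steps concrete, here is a minimal pure-Python sketch. It ranks in-memory chunks by cosine similarity rather than querying ChromaDB, and the chunk dictionary keys (`text`, `source`, `embedding`), the function names, and the prompt wording are all assumptions for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec: list[float], chunks: list[dict], k: int = 3) -> list[dict]:
    """Return the k chunks most similar to the query embedding.

    Stand-in for a vector-store query; a real deployment would delegate
    this ranking to the vector database.
    """
    scored = [(cosine_similarity(query_vec, c["embedding"]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Inject numbered context passages into a prompt (hypothetical template)."""
    context = "\n\n".join(
        f"[{i}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite passages by their bracketed number. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the passages in the prompt gives the model stable labels to cite, which makes later citation validation easier.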

7. Web Application Completion

  • Chat Interface: ✅ COMPLETED - Implement a simple web chat interface for the / endpoint.
    • Modern Chat UI: Interactive chat interface with real-time messaging
    • Message History: Conversation display with user and assistant messages
    • Source Citations: Visual display of source documents and confidence scores
    • Responsive Design: Mobile-friendly interface with modern styling
    • Error Handling: Graceful error display and loading states
    • System Health: Status indicators and health monitoring
  • API Endpoint: Create the /chat API endpoint that receives user questions (POST) and returns model-generated answers with citations and snippets.
  • UI/UX: ✅ COMPLETED - Ensure the web interface is clean, user-friendly, and handles loading/error states gracefully.
  • Testing: Write end-to-end tests for the chat functionality.

7.5. Memory Management & Production Optimization ✅ COMPLETED

  • Memory Architecture Redesign: ✅ COMPLETED - Comprehensive memory optimization for cloud deployment:

    • App Factory Pattern: Migrated from monolithic to factory pattern with lazy loading
      • Impact: 87% reduction in startup memory (400MB → 50MB)
      • Benefit: Services initialize only when needed, improving resource efficiency
    • Embedding Model Optimization: Changed from all-MiniLM-L6-v2 to paraphrase-MiniLM-L3-v2
      • Memory Savings: 75-85% reduction (550-1000MB → 132MB)
      • Quality Impact: <5% reduction in similarity scoring (acceptable trade-off)
      • Deployment Viability: Enables deployment on Render free tier (512MB limit)
    • Gunicorn Production Configuration: Optimized for memory-constrained environments
      • Configuration: Single worker, 2 threads, max_requests=50
      • Memory Control: Bound the impact of memory leaks by restarting the worker automatically once it has served max_requests requests
      • Performance: Balanced for I/O-bound LLM operations
  • Memory Management Utilities: ✅ COMPLETED - Comprehensive memory monitoring and optimization:

    • MemoryManager Class: Context manager for memory tracking and cleanup
    • Real-time Monitoring: Memory usage tracking with automatic garbage collection
    • Memory Statistics: Detailed memory reporting for production monitoring
    • Error Recovery: Memory-aware error handling with graceful degradation
    • Health Integration: Memory metrics exposed via /health endpoint
  • Database Pre-building Strategy: ✅ COMPLETED - Eliminate deployment memory spikes:

    • Local Database Building: build_embeddings.py script for development
    • Repository Commitment: Pre-built vector database (25MB) committed to git
    • Deployment Optimization: Zero embedding generation on production startup
    • Memory Impact: Avoid 150MB+ memory spikes during embedding generation
  • Production Deployment Optimization: ✅ COMPLETED - Full production readiness:

    • Memory Profiling: Comprehensive memory usage analysis and optimization
    • Performance Testing: Load testing with memory constraints validation
    • Error Handling: Production-grade error recovery for memory pressure
    • Monitoring Integration: Real-time memory tracking and alerting
    • Documentation: Complete memory management documentation across all files
  • Testing & Validation: ✅ COMPLETED - Memory-aware testing infrastructure:

    • Memory Constraint Testing: All 138 tests pass with memory optimizations
    • Performance Regression Testing: Response time validation maintained
    • Memory Leak Detection: Long-running tests validate memory stability
    • Production Simulation: Testing in memory-constrained environments
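The context-manager approach to memory tracking and cleanup described in this section can be sketched as follows. This is not the project's MemoryManager class: `MemoryTracker` is a hypothetical stand-in built only on the standard library. It uses `tracemalloc`, which counts Python-level allocations rather than the process's resident memory, so a production version would likely read process memory from the operating system instead.

```python
import gc
import tracemalloc


class MemoryTracker:
    """Measure Python allocation growth inside a `with` block, then run GC.

    Hypothetical stand-in for illustration; names and fields are assumptions.
    """

    def __init__(self, label: str) -> None:
        self.label = label
        self.delta_bytes = None  # net bytes still allocated when the block ends
        self.peak_bytes = None   # peak traced size observed by tracemalloc
        self._before = 0
        self._started_tracing = False

    def __enter__(self) -> "MemoryTracker":
        # Only start (and later stop) tracing if nobody else already did.
        if not tracemalloc.is_tracing():
            tracemalloc.start()
            self._started_tracing = True
        self._before, _ = tracemalloc.get_traced_memory()
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        current, peak = tracemalloc.get_traced_memory()
        self.delta_bytes = current - self._before
        self.peak_bytes = peak
        if self._started_tracing:
            tracemalloc.stop()
        gc.collect()  # cleanup pass after the tracked operation
        return False  # never swallow exceptions from the block
```

Wrapping an expensive step, for example `with MemoryTracker("embed") as t: ...`, leaves the measurements on `t` so they can be logged or surfaced through a health endpoint.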

8. Evaluation

  • Evaluation Set: Create an evaluation set of 15-30 questions and corresponding "gold" answers covering various policy topics.
  • Metric Implementation: Develop scripts to calculate:
    • Answer Quality: Groundedness and Citation Accuracy.
    • System Metrics: Latency (p50/p95).
  • Execution: Run the evaluation and record the results.
  • Documentation: Summarize the evaluation results in design-and-evaluation.md.
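For the metric scripts, a minimal sketch of the latency percentiles and one possible citation-accuracy measure is shown below. The nearest-rank percentile method and the definition of citation accuracy (share of cited sources that appear in the gold source list) are assumptions, as are the function names; the plan does not specify how these metrics are computed.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def citation_accuracy(cited: list[str], gold: list[str]) -> float:
    """Fraction of cited sources that are in the gold list (assumed definition)."""
    if not cited:
        return 0.0
    gold_set = set(gold)
    return sum(1 for source in cited if source in gold_set) / len(cited)


def summarize_latencies(latencies_ms: list[float]) -> dict:
    """Report p50 and p95 latency for a list of per-question timings."""
    return {"p50_ms": percentile(latencies_ms, 50), "p95_ms": percentile(latencies_ms, 95)}
```

With only 15-30 evaluation questions, p95 is determined by the one or two slowest requests, so it is worth reporting the sample size alongside it.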

9. Final Documentation and Submission

  • Design Document: ✅ COMPLETED - Complete design-and-evaluation.md with comprehensive technical analysis:
    • Memory Architecture Design: Detailed analysis of memory-constrained architecture decisions
    • Performance Evaluation: Comprehensive memory usage, response time, and quality metrics
    • Model Selection Analysis: Embedding model comparison with memory vs quality trade-offs
    • Production Deployment Evaluation: Platform compatibility and scalability analysis
    • Design Trade-offs Documentation: Lessons learned and future considerations
  • README: ✅ COMPLETED - Comprehensive documentation with memory management focus:
    • Memory Management Section: Detailed memory optimization architecture and utilities
    • Production Configuration: Gunicorn, database pre-building, and deployment strategies
    • Performance Metrics: Memory usage breakdown and production performance data
    • Setup Instructions: Memory-aware development and deployment guidelines
  • Deployment Documentation: ✅ COMPLETED - Updated deployed.md with production details:
    • Memory-Optimized Configuration: Production memory profile and optimization results
    • Performance Metrics: Real-time memory monitoring and capacity analysis
    • Production Features: Memory management system and error handling documentation
    • Deployment Pipeline: CI/CD integration with memory validation
  • Contributing Guidelines: ✅ COMPLETED - Updated CONTRIBUTING.md with memory-conscious development:
    • Memory Development Principles: Guidelines for memory-efficient code patterns
    • Memory Testing Procedures: Development workflow for memory constraint validation
    • Code Review Guidelines: Memory-focused review checklist and best practices
    • Production Testing: Memory leak detection and performance validation procedures
  • Demonstration Video: Record a 5-10 minute screen-share video demonstrating the deployed application, walking through the code architecture, explaining the evaluation results, and showing a successful CI/CD run.
  • Submission: Share the GitHub repository with the grader and submit the repository and video links.