msse-ai-engineering / project-plan.md

Seth McKnight

Add memory diagnostics endpoints and logging enhancements (#80)

0a7f9b4 about 2 months ago

RAG Application Project Plan

This plan outlines the steps to design, build, and deploy a Retrieval-Augmented Generation (RAG) application per the project requirements, targeting a grade of 5. The approach prioritizes early deployment and continuous integration and follows Test-Driven Development (TDD) principles.

1. Foundational Setup

Repository: Create a new GitHub repository.
Virtual Environment: Set up a local Python virtual environment (venv).
Initial Files:
- Create requirements.txt with initial dependencies (Flask, pytest).
- Create a .gitignore file for Python.
- Create a README.md with initial setup instructions.
- Create placeholder files: deployed.md and design-and-evaluation.md.
Testing Framework: Establish a tests/ directory and configure pytest.

2. "Hello World" Deployment

Minimal App: Develop a minimal Flask application (app.py) with a /health endpoint that returns a JSON status object.
Unit Test: Write a test for the /health endpoint to ensure it returns a 200 OK status and the correct JSON payload.
Local Validation: Run the app and tests locally to confirm everything works.
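
A minimal sketch of the app and its unit test is shown here. The exact payload ({"status": "ok"}) is an assumption, since the plan only asks for "a JSON status object", and in a real layout the test would live under tests/ rather than next to the app.

```python
from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/health")
def health():
    # Assumed payload; the plan only requires a JSON status object.
    return jsonify({"status": "ok"})


def test_health_returns_ok():
    # Flask's built-in test client exercises the route without starting a server.
    client = app.test_client()
    response = client.get("/health")
    assert response.status_code == 200
    assert response.get_json() == {"status": "ok"}
```

Running pytest against this file should collect and run test_health_returns_ok.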

3. CI/CD and Initial Deployment

Render Setup: Create a new Web Service on Render and link it to the GitHub repository.
Environment Configuration: Configure necessary environment variables on Render (e.g., PYTHON_VERSION).
GitHub Actions: Create a CI/CD workflow (.github/workflows/main.yml) that:
- Triggers on push/PR to the main branch.
- Installs dependencies from requirements.txt.
- Runs the pytest test suite.
- On success, triggers a deployment to Render.
Deployment Validation: Push a change and verify that the workflow runs successfully and the application is deployed.
Documentation: Update deployed.md with the live URL of the deployed application.

CI/CD optimizations added

Add pip cache to CI to speed up dependency installation.
Optimize pre-commit in PRs to run hooks only on changed files (use pre-commit run --from-ref ... --to-ref ...).

4. Data Ingestion and Processing

Corpus Assembly: Collect or generate 5-20 policy documents (PDF, TXT, MD) and place them in a synthetic_policies/ directory.
Parsing Logic: Implement and test functions to parse different document formats.
Chunking Strategy: Implement and test a document chunking strategy (e.g., recursive character splitting with overlap).
Reproducibility: Set fixed seeds for any processes involving randomness (e.g., chunking, sampling) to ensure deterministic outcomes.
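
As an illustration of the chunking step, here is a simplified sketch of recursive character splitting with overlap. It is not the project's implementation: the function names, the separator order, and the default sizes are assumptions, and separators are replaced by single spaces when pieces are merged.

```python
def split_recursive(text, limit, separators=("\n\n", "\n", " ")):
    """Break text into pieces of at most `limit` characters, trying coarse separators first."""
    if len(text) <= limit:
        return [text] if text.strip() else []
    if not separators:
        # No separator left to try: fall back to a hard character cut.
        return [text[i:i + limit] for i in range(0, len(text), limit)]
    pieces = []
    for part in text.split(separators[0]):
        pieces.extend(split_recursive(part, limit, separators[1:]))
    return pieces


def chunk_text(text, chunk_size=200, overlap=40):
    """Merge pieces into chunks of at most chunk_size characters with a character overlap."""
    if not 0 <= overlap < chunk_size - 1:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size - 1")
    # Pieces are kept small enough that overlap + space + piece never exceeds chunk_size.
    pieces = split_recursive(text, chunk_size - overlap - 1)
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip() if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            # Start the next chunk with the tail of the previous one.
            tail = current[-overlap:] if overlap else ""
            current = f"{tail} {piece}".strip()
    if current:
        chunks.append(current)
    return chunks
```

The splitter uses no randomness, so the same input and parameters always yield the same chunks, which supports the reproducibility goal without any seeding.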

5. Embedding and Vector Storage ✅ PHASE 2B COMPLETED

Vector DB Setup: Integrate a vector database (ChromaDB) into the project.
Embedding Model: Select and integrate a free embedding model (paraphrase-MiniLM-L3-v2 chosen for memory efficiency).
Ingestion Pipeline: Create an enhanced ingestion pipeline that:
- Loads documents from the corpus.
- Chunks the documents with metadata.
- Embeds the chunks using sentence-transformers.
- Stores the embeddings in the ChromaDB vector database.
- Provides detailed processing statistics.
Testing: Write comprehensive tests (60+ tests) verifying each step of the ingestion pipeline.
Search API: Implement POST /search endpoint for semantic search with:
- JSON request/response format
- Configurable parameters (top_k, threshold)
- Comprehensive input validation
- Detailed error handling
End-to-End Testing: Complete pipeline testing from ingestion through search.
Documentation: Full API documentation with examples and performance metrics.
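
To make the top_k and threshold parameters concrete, the sketch below validates a /search request body and ranks candidates by cosine similarity. It uses an in-memory dictionary in place of ChromaDB, and the defaults, the upper bound on top_k, and the function names are all assumptions for illustration.

```python
import math

DEFAULT_TOP_K = 5        # hypothetical default
MAX_TOP_K = 20           # hypothetical upper bound
DEFAULT_THRESHOLD = 0.3  # hypothetical default


def validate_search_request(payload):
    """Return (query, top_k, threshold) or raise ValueError with a readable message."""
    if not isinstance(payload, dict):
        raise ValueError("request body must be a JSON object")
    query = payload.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("'query' must be a non-empty string")
    top_k = payload.get("top_k", DEFAULT_TOP_K)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(top_k, bool) or not isinstance(top_k, int) or not 1 <= top_k <= MAX_TOP_K:
        raise ValueError(f"'top_k' must be an integer between 1 and {MAX_TOP_K}")
    threshold = payload.get("threshold", DEFAULT_THRESHOLD)
    if (isinstance(threshold, bool) or not isinstance(threshold, (int, float))
            or not 0.0 <= threshold <= 1.0):
        raise ValueError("'threshold' must be a number between 0 and 1")
    return query.strip(), top_k, float(threshold)


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_hits(query_vector, index, top_k, threshold):
    """Keep hits scoring at least `threshold`, best first, at most `top_k` of them."""
    scored = [(cosine_similarity(query_vector, vec), doc_id) for doc_id, vec in index.items()]
    hits = sorted((hit for hit in scored if hit[0] >= threshold), reverse=True)
    return [{"id": doc_id, "score": round(score, 4)} for score, doc_id in hits[:top_k]]
```

In the endpoint itself, a ValueError from validate_search_request would typically map to a 400 response with the message in the JSON body.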

6. RAG Core Implementation ✅ PHASE 3 COMPLETED

Retrieval Logic: Implement a function to retrieve the top-k relevant document chunks from the vector store based on a user query.
Prompt Engineering: Design a prompt template that injects the retrieved context into the query for the LLM.
LLM Integration: Connect to a free-tier LLM (e.g., via OpenRouter or Groq) to generate answers.
Basic Guardrails: Implement and test basic guardrails for context validation and response length limits.
Enhanced Guardrails (Issue #24): ✅ COMPLETED - Comprehensive guardrails and response quality system:
- Content Safety Filtering: PII detection, bias mitigation, inappropriate content filtering
- Response Quality Scoring: Multi-dimensional quality assessment (relevance, completeness, coherence, source fidelity)
- Source Attribution: Automated citation generation and validation
- Error Handling: Circuit breaker patterns and graceful degradation
- Configuration System: Flexible thresholds and feature toggles
- Testing: 13 comprehensive tests with 100% pass rate
- Integration: Enhanced RAG pipeline with backward compatibility
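
The prompt template and the two basic guardrails (context validation and a response length limit) might look roughly like the following. The template wording, the chunk shape (dicts with "source" and "text" keys), and the character limits are assumptions rather than the project's actual values.

```python
PROMPT_TEMPLATE = (
    "Answer the question using only the context below. "
    "Cite sources by their bracketed number. "
    "If the context does not contain the answer, say so.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)


def build_prompt(question, chunks, max_context_chars=4000):
    """Inject retrieved chunks into the template, refusing to proceed without context."""
    usable = [c for c in chunks if c.get("text", "").strip()]
    if not usable:
        # Context validation guardrail: never call the LLM with nothing to ground on.
        raise ValueError("no retrieved context available for this question")
    lines, used = [], 0
    for number, chunk in enumerate(usable, start=1):
        entry = f"[{number}] ({chunk.get('source', 'unknown')}) {chunk['text'].strip()}"
        # Always keep the first chunk; drop later ones once the budget is spent.
        if lines and used + len(entry) > max_context_chars:
            break
        lines.append(entry)
        used += len(entry)
    return PROMPT_TEMPLATE.format(context="\n".join(lines), question=question.strip())


def enforce_length(answer, max_chars=1200):
    """Response length guardrail: truncate overly long answers and mark the cut."""
    if len(answer) <= max_chars:
        return answer
    return answer[:max_chars - 3].rstrip() + "..."
```

Numbering the chunks in the prompt gives the model stable labels to cite, which in turn makes citation validation a simple lookup.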

7. Web Application Completion

Chat Interface: ✅ COMPLETED - Implement a simple web chat interface for the / endpoint.
- Modern Chat UI: Interactive chat interface with real-time messaging
- Message History: Conversation display with user and assistant messages
- Source Citations: Visual display of source documents and confidence scores
- Responsive Design: Mobile-friendly interface with modern styling
- Error Handling: Graceful error display and loading states
- System Health: Status indicators and health monitoring
API Endpoint: Create the /chat API endpoint that receives user questions (POST) and returns model-generated answers with citations and snippets.
UI/UX: ✅ COMPLETED - Ensure the web interface is clean, user-friendly, and handles loading/error states gracefully.
Testing: Write end-to-end tests for the chat functionality.

7.5. Memory Management & Production Optimization ✅ COMPLETED

Memory Architecture Redesign: ✅ COMPLETED - Comprehensive memory optimization for cloud deployment:
- App Factory Pattern: Migrated from monolithic to factory pattern with lazy loading
  - Impact: 87% reduction in startup memory (400MB → 50MB)
  - Benefit: Services initialize only when needed, improving resource efficiency
- Embedding Model Optimization: Changed from all-MiniLM-L6-v2 to paraphrase-MiniLM-L3-v2
  - Memory Savings: 75-85% reduction (550-1000MB → 132MB)
  - Quality Impact: <5% reduction in similarity scoring (acceptable trade-off)
  - Deployment Viability: Enables deployment on Render free tier (512MB limit)
- Gunicorn Production Configuration: Optimized for memory-constrained environments
  - Configuration: Single worker, 2 threads, max_requests=50
  - Memory Control: Limit the impact of memory leaks with automatic worker restarts
  - Performance: Balanced for I/O-bound LLM operations
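
Gunicorn reads its settings from a Python file, so the configuration above could be captured in a gunicorn.conf.py along these lines. Only the three values come from this plan; the file name and comments are assumptions about how the project wires it up.

```python
# gunicorn.conf.py (assumed file name; Gunicorn config files are plain Python)

# One worker process keeps a single copy of the app and models in memory.
workers = 1

# Two threads let the worker overlap I/O-bound waits such as LLM API calls.
threads = 2

# Restart the worker after 50 requests so slowly leaked memory is reclaimed.
max_requests = 50
```
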
Memory Management Utilities: ✅ COMPLETED - Comprehensive memory monitoring and optimization:
- MemoryManager Class: Context manager for memory tracking and cleanup
- Real-time Monitoring: Memory usage tracking with automatic garbage collection
- Memory Statistics: Detailed memory reporting for production monitoring
- Error Recovery: Memory-aware error handling with graceful degradation
- Health Integration: Memory metrics exposed via /health endpoint
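
A context manager for memory tracking and cleanup could be sketched as below. This is an illustrative stand-in, not the project's MemoryManager: it uses the standard library's tracemalloc, which only sees Python-level allocations, whereas a production version would more likely report process RSS. The attribute names and the stats() shape are assumptions.

```python
import gc
import tracemalloc


class MemoryManager:
    """Record Python heap usage around a block and run garbage collection on exit."""

    def __init__(self, label="operation"):
        self.label = label
        self.start_bytes = 0
        self.end_bytes = 0
        self.peak_bytes = 0
        self._started_tracing = False

    def __enter__(self):
        # Only start tracing if nobody else has, so nested use stays safe.
        if not tracemalloc.is_tracing():
            tracemalloc.start()
            self._started_tracing = True
        self.start_bytes, _ = tracemalloc.get_traced_memory()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.end_bytes, self.peak_bytes = tracemalloc.get_traced_memory()
        if self._started_tracing:
            tracemalloc.stop()
        gc.collect()  # cleanup step: release unreachable objects promptly
        return False  # never swallow exceptions from the wrapped block

    def stats(self):
        to_mb = lambda n: round(n / (1024 * 1024), 3)
        return {
            "label": self.label,
            "start_mb": to_mb(self.start_bytes),
            "end_mb": to_mb(self.end_bytes),
            "peak_mb": to_mb(self.peak_bytes),
        }
```

Wrapping an expensive step in `with MemoryManager("embed") as mm:` and returning `mm.stats()` from the /health handler is one way the metrics could be exposed.
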
Database Pre-building Strategy: ✅ COMPLETED - Eliminate deployment memory spikes:
- Local Database Building: build_embeddings.py script for development
- Repository Commit: Pre-built vector database (25MB) committed to git
- Deployment Optimization: Zero embedding generation on production startup
- Memory Impact: Avoid 150MB+ memory spikes during embedding generation
Production Deployment Optimization: ✅ COMPLETED - Full production readiness:
- Memory Profiling: Comprehensive memory usage analysis and optimization
- Performance Testing: Load testing with memory constraints validation
- Error Handling: Production-grade error recovery for memory pressure
- Monitoring Integration: Real-time memory tracking and alerting
- Documentation: Complete memory management documentation across all files
Testing & Validation: ✅ COMPLETED - Memory-aware testing infrastructure:
- Memory Constraint Testing: All 138 tests pass with memory optimizations
- Performance Regression Testing: Response time validation maintained
- Memory Leak Detection: Long-running tests validate memory stability
- Production Simulation: Testing in memory-constrained environments

8. Evaluation

Evaluation Set: Create an evaluation set of 15-30 questions and corresponding "gold" answers covering various policy topics.
Metric Implementation: Develop scripts to calculate:
- Answer Quality: Groundedness and Citation Accuracy.
- System Metrics: Latency (p50/p95).
Execution: Run the evaluation and record the results.
Documentation: Summarize the evaluation results in design-and-evaluation.md.
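
The metric scripts could start from helpers like these. The percentile uses the nearest-rank method, and citation accuracy is defined here as the share of cited sources that appear in the gold source list for the same question; both choices are assumptions, since the plan does not fix the definitions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    return {"p50": percentile(latencies_ms, 50), "p95": percentile(latencies_ms, 95)}


def citation_accuracy(cited_per_question, gold_per_question):
    """Fraction of all cited sources that match a gold source for the same question."""
    correct = total = 0
    for cited, gold in zip(cited_per_question, gold_per_question):
        gold_set = set(gold)
        total += len(cited)
        correct += sum(1 for source in cited if source in gold_set)
    return correct / total if total else 0.0
```

For example, with latencies of 1 to 100 ms in 1 ms steps, latency_summary reports a p50 of 50 and a p95 of 95.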

9. Final Documentation and Submission

Design Document: ✅ COMPLETED - Complete design-and-evaluation.md with comprehensive technical analysis:
- Memory Architecture Design: Detailed analysis of memory-constrained architecture decisions
- Performance Evaluation: Comprehensive memory usage, response time, and quality metrics
- Model Selection Analysis: Embedding model comparison with memory vs quality trade-offs
- Production Deployment Evaluation: Platform compatibility and scalability analysis
- Design Trade-offs Documentation: Lessons learned and future considerations
README: ✅ COMPLETED - Comprehensive documentation with memory management focus:
- Memory Management Section: Detailed memory optimization architecture and utilities
- Production Configuration: Gunicorn, database pre-building, and deployment strategies
- Performance Metrics: Memory usage breakdown and production performance data
- Setup Instructions: Memory-aware development and deployment guidelines
Deployment Documentation: ✅ COMPLETED - Updated deployed.md with production details:
- Memory-Optimized Configuration: Production memory profile and optimization results
- Performance Metrics: Real-time memory monitoring and capacity analysis
- Production Features: Memory management system and error handling documentation
- Deployment Pipeline: CI/CD integration with memory validation
Contributing Guidelines: ✅ COMPLETED - Updated CONTRIBUTING.md with memory-conscious development:
- Memory Development Principles: Guidelines for memory-efficient code patterns
- Memory Testing Procedures: Development workflow for memory constraint validation
- Code Review Guidelines: Memory-focused review checklist and best practices
- Production Testing: Memory leak detection and performance validation procedures
Demonstration Video: Record a 5-10 minute screen-share video demonstrating the deployed application, walking through the code architecture, explaining the evaluation results, and showing a successful CI/CD run.
Submission: Share the GitHub repository with the grader and submit the repository and video links.