RAG Application Project Plan
This plan outlines the steps to design, build, and deploy a Retrieval-Augmented Generation (RAG) application as per the project requirements, with a focus on achieving a grade of 5. The approach prioritizes early deployment and continuous integration, following Test-Driven Development (TDD) principles.
1. Foundational Setup
- Repository: Create a new GitHub repository.
- Virtual Environment: Set up a local Python virtual environment (`venv`).
- Initial Files:
  - Create `requirements.txt` with initial dependencies (`Flask`, `pytest`).
  - Create a `.gitignore` file for Python.
  - Create a `README.md` with initial setup instructions.
  - Create placeholder files: `deployed.md` and `design-and-evaluation.md`.
- Testing Framework: Establish a `tests/` directory and configure `pytest`.
2. "Hello World" Deployment
- Minimal App: Develop a minimal Flask application (`app.py`) with a `/health` endpoint that returns a JSON status object.
- Unit Test: Write a test for the `/health` endpoint to ensure it returns a `200 OK` status and the correct JSON payload.
- Local Validation: Run the app and tests locally to confirm everything works.
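As a rough illustration of the minimal app and its unit test, the sketch below assumes a payload of `{"status": "ok"}` and a factory function named `create_app`; both are illustrative choices, not details taken from the project.

```python
def health_payload():
    """Build the JSON-serializable status object (assumed shape)."""
    return {"status": "ok"}


def create_app():
    """Create the Flask app.

    Flask is imported inside the factory so that health_payload()
    can be imported and checked even where Flask is not installed.
    """
    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/health")
    def health():
        # Return the status object with an explicit 200 status code.
        return jsonify(health_payload()), 200

    return app


def test_health_endpoint():
    """pytest-style test: checks the status code and the JSON body."""
    client = create_app().test_client()
    response = client.get("/health")
    assert response.status_code == 200
    assert response.get_json() == {"status": "ok"}
```

Running `pytest` against a file containing this test would exercise the endpoint through Flask's built-in test client, without starting a server.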
3. CI/CD and Initial Deployment
- Render Setup: Create a new Web Service on Render and link it to the GitHub repository.
- Environment Configuration: Configure necessary environment variables on Render (e.g., `PYTHON_VERSION`).
- GitHub Actions: Create a CI/CD workflow (`.github/workflows/main.yml`) that:
  - Triggers on push/PR to the `main` branch.
  - Installs dependencies from `requirements.txt`.
  - Runs the `pytest` test suite.
  - On success, triggers a deployment to Render.
- Deployment Validation: Push a change and verify that the workflow runs successfully and the application is deployed.
- Documentation: Update `deployed.md` with the live URL of the deployed application.
CI/CD optimizations added
- Add pip cache to CI to speed up dependency installation.
- Optimize pre-commit in PRs to run hooks only on changed files (use `pre-commit run --from-ref ... --to-ref ...`).
4. Data Ingestion and Processing
- Corpus Assembly: Collect or generate 5-20 policy documents (PDF, TXT, MD) and place them in a `synthetic_policies/` directory.
- Parsing Logic: Implement and test functions to parse different document formats.
- Chunking Strategy: Implement and test a document chunking strategy (e.g., recursive character splitting with overlap).
- Reproducibility: Set fixed seeds for any processes involving randomness (e.g., chunking, sampling) to ensure deterministic outcomes.
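The chunking step could be sketched as follows. This is a plain fixed-size sliding window with overlap, which is simpler than the recursive character splitting mentioned above; the function name `chunk_text` and the default sizes are assumptions for illustration.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks that share `overlap` characters.

    The function is deterministic: the same input always yields the same chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

For example, `chunk_text("abcdefghij", chunk_size=4, overlap=2)` yields `["abcd", "cdef", "efgh", "ghij"]`. Because no randomness is involved, this variant needs no seed to be reproducible.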
5. Embedding and Vector Storage ✅ PHASE 2B COMPLETED
- Vector DB Setup: Integrate a vector database (ChromaDB) into the project.
- Embedding Model: Select and integrate a free embedding model (`paraphrase-MiniLM-L3-v2`, chosen for memory efficiency).
- Ingestion Pipeline: Create an enhanced ingestion pipeline that:
  - Loads documents from the corpus.
  - Chunks the documents with metadata.
  - Embeds the chunks using sentence-transformers.
  - Stores the embeddings in the ChromaDB vector database.
  - Provides detailed processing statistics.
- Testing: Write comprehensive tests (60+ tests) verifying each step of the ingestion pipeline.
- Search API: Implement a POST `/search` endpoint for semantic search with:
  - JSON request/response format
  - Configurable parameters (`top_k`, `threshold`)
  - Comprehensive input validation
  - Detailed error handling
- End-to-End Testing: Complete pipeline testing from ingestion through search.
- Documentation: Full API documentation with examples and performance metrics.
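Input validation for the `/search` request body might look like the sketch below. The `top_k` and `threshold` parameters come from the plan; the `query` field name, the default values, and the `max_top_k` limit are assumptions made for the example.

```python
def validate_search_request(payload, max_top_k=20):
    """Validate a parsed JSON body and return (params, errors).

    On success, params is a cleaned dict and errors is empty.
    On failure, params is None and errors lists every problem found.
    """
    if not isinstance(payload, dict):
        return None, ["request body must be a JSON object"]

    errors = []

    query = payload.get("query")
    if not isinstance(query, str) or not query.strip():
        errors.append("'query' must be a non-empty string")

    top_k = payload.get("top_k", 5)
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(top_k, bool) or not isinstance(top_k, int) or not 1 <= top_k <= max_top_k:
        errors.append(f"'top_k' must be an integer between 1 and {max_top_k}")

    threshold = payload.get("threshold", 0.0)
    if (isinstance(threshold, bool)
            or not isinstance(threshold, (int, float))
            or not 0.0 <= threshold <= 1.0):
        errors.append("'threshold' must be a number between 0 and 1")

    if errors:
        return None, errors
    return {"query": query.strip(), "top_k": top_k, "threshold": float(threshold)}, []
```

A route handler could call this helper first and respond with a `400` status and the error list whenever `errors` is non-empty.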
6. RAG Core Implementation ✅ PHASE 3 COMPLETED
- Retrieval Logic: Implement a function to retrieve the top-k relevant document chunks from the vector store based on a user query.
- Prompt Engineering: Design a prompt template that injects the retrieved context into the query for the LLM.
- LLM Integration: Connect to a free-tier LLM (e.g., via OpenRouter or Groq) to generate answers.
- Basic Guardrails: Implement and test basic guardrails for context validation and response length limits.
- Enhanced Guardrails (Issue #24): ✅ COMPLETED - Comprehensive guardrails and response quality system:
  - Content Safety Filtering: PII detection, bias mitigation, inappropriate content filtering
  - Response Quality Scoring: Multi-dimensional quality assessment (relevance, completeness, coherence, source fidelity)
  - Source Attribution: Automated citation generation and validation
  - Error Handling: Circuit breaker patterns and graceful degradation
  - Configuration System: Flexible thresholds and feature toggles
  - Testing: 13 comprehensive tests with 100% pass rate
  - Integration: Enhanced RAG pipeline with backward compatibility
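The retrieval and prompt-construction steps can be sketched in plain Python. In the project the vector store performs the similarity search; the brute-force cosine ranking below is only a stand-in to show the idea, and the chunk fields (`text`, `source`, `embedding`) and the prompt wording are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_embedding, chunks, k=3):
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_embedding, chunk["embedding"]),
        reverse=True,
    )
    return ranked[:k]


PROMPT_TEMPLATE = (
    "Answer the question using only the context below. "
    "Cite sources by their bracketed number. "
    "If the context does not contain the answer, say so.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(question, retrieved_chunks):
    """Inject numbered, source-labelled chunks into the prompt template."""
    context = "\n\n".join(
        f"[{i}] ({chunk['source']}) {chunk['text']}"
        for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

Numbering the chunks in the prompt gives the model a stable handle for citations, which a later source-attribution step can map back to document names.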
7. Web Application Completion
- Chat Interface: ✅ COMPLETED - Implement a simple web chat interface for the `/` endpoint.
  - Modern Chat UI: Interactive chat interface with real-time messaging
  - Message History: Conversation display with user and assistant messages
  - Source Citations: Visual display of source documents and confidence scores
  - Responsive Design: Mobile-friendly interface with modern styling
  - Error Handling: Graceful error display and loading states
  - System Health: Status indicators and health monitoring
- API Endpoint: Create the `/chat` API endpoint that receives user questions (POST) and returns model-generated answers with citations and snippets.
- UI/UX: ✅ COMPLETED - Ensure the web interface is clean, user-friendly, and handles loading/error states gracefully.
- Testing: Write end-to-end tests for the chat functionality.
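One way to keep the `/chat` logic testable is to separate it from the web framework. The sketch below is framework-agnostic; the `message` field name, the response keys, the 200-character snippet length, and the `answer_fn` callback are all hypothetical.

```python
def handle_chat(payload, answer_fn, snippet_length=200):
    """Turn a parsed JSON body into a (response_body, status_code) pair.

    answer_fn is any callable taking a question string and returning
    (answer_text, sources), where each source is a dict with
    'source' and 'text' keys.
    """
    question = payload.get("message") if isinstance(payload, dict) else None
    if not isinstance(question, str) or not question.strip():
        return {"error": "'message' must be a non-empty string"}, 400

    try:
        answer, sources = answer_fn(question.strip())
    except Exception:
        # Degrade gracefully instead of leaking internal errors to the client.
        return {"error": "The assistant is temporarily unavailable."}, 503

    citations = [
        {"source": item["source"], "snippet": item["text"][:snippet_length]}
        for item in sources
    ]
    return {"answer": answer, "citations": citations}, 200
```

An end-to-end test can pass a stub `answer_fn` to check the happy path and a raising one to check the error path, without calling a real LLM.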
7.5. Memory Management & Production Optimization ✅ COMPLETED
Memory Architecture Redesign: ✅ COMPLETED - Comprehensive memory optimization for cloud deployment:
- App Factory Pattern: Migrated from monolithic to factory pattern with lazy loading
  - Impact: 87% reduction in startup memory (400MB → 50MB)
  - Benefit: Services initialize only when needed, improving resource efficiency
- Embedding Model Optimization: Changed from `all-MiniLM-L6-v2` to `paraphrase-MiniLM-L3-v2`
  - Memory Savings: 75-85% reduction (550-1000MB → 132MB)
  - Quality Impact: <5% reduction in similarity scoring (acceptable trade-off)
  - Deployment Viability: Enables deployment on Render free tier (512MB limit)
- Gunicorn Production Configuration: Optimized for memory-constrained environments
  - Configuration: Single worker, 2 threads, `max_requests=50`
  - Memory Control: Prevent memory leaks with automatic worker restart
  - Performance: Balanced for I/O-bound LLM operations
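The lazy-loading idea behind the app factory can be illustrated with a small wrapper that builds an expensive object only on first use. This is a generic sketch, not the project's code: `LazyService` and `load_embedding_model` are hypothetical names, and the loader returns a placeholder rather than a real model.

```python
import threading


class LazyService:
    """Defer construction of an expensive object until it is first requested."""

    def __init__(self, factory):
        self._factory = factory
        self._instance = None
        self._lock = threading.Lock()

    @property
    def loaded(self):
        """True once the underlying object has been built."""
        return self._instance is not None

    def get(self):
        # Double-checked locking: with several threads per worker, only one
        # thread should pay the construction cost.
        if self._instance is None:
            with self._lock:
                if self._instance is None:
                    self._instance = self._factory()
        return self._instance


def load_embedding_model():
    """Placeholder for an expensive load (a real app would load the model here)."""
    return {"model": "paraphrase-MiniLM-L3-v2"}


# Nothing is loaded at import time; the cost is paid on the first get() call.
embedding_service = LazyService(load_embedding_model)
```

With this arrangement, a request that never needs embeddings (such as a health check) does not trigger the model load, which is what keeps startup memory low.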
Memory Management Utilities: ✅ COMPLETED - Comprehensive memory monitoring and optimization:
- MemoryManager Class: Context manager for memory tracking and cleanup
- Real-time Monitoring: Memory usage tracking with automatic garbage collection
- Memory Statistics: Detailed memory reporting for production monitoring
- Error Recovery: Memory-aware error handling with graceful degradation
- Health Integration: Memory metrics exposed via the `/health` endpoint
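A context manager for memory tracking and cleanup could look roughly like this. It is a stand-in built on the standard library, not the project's `MemoryManager`; the class name `MemoryTracker` and its attributes are hypothetical. Note that `tracemalloc` only sees allocations made by Python, so the numbers differ from the process's resident memory.

```python
import gc
import tracemalloc


class MemoryTracker:
    """Measure Python-level allocations inside a `with` block, then collect garbage."""

    def __init__(self, label):
        self.label = label
        self.delta_bytes = 0
        self.peak_bytes = 0

    def __enter__(self):
        # Start tracing only if nobody else already did, and remember that.
        self._started_tracing = not tracemalloc.is_tracing()
        if self._started_tracing:
            tracemalloc.start()
        self._before, _ = tracemalloc.get_traced_memory()
        return self

    def __exit__(self, exc_type, exc, tb):
        gc.collect()  # cleanup step: free unreachable objects before measuring
        current, peak = tracemalloc.get_traced_memory()
        self.delta_bytes = current - self._before
        self.peak_bytes = peak
        if self._started_tracing:
            tracemalloc.stop()
        return False  # never swallow exceptions raised inside the block

    def stats(self):
        """Return a small dict suitable for logging or a health payload."""
        return {
            "label": self.label,
            "delta_bytes": self.delta_bytes,
            "peak_bytes": self.peak_bytes,
        }
```

Typical use would be `with MemoryTracker("ingestion") as tracker: ...` followed by logging `tracker.stats()`.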
Database Pre-building Strategy: ✅ COMPLETED - Eliminate deployment memory spikes:
- Local Database Building: `build_embeddings.py` script for development
- Repository Commitment: Pre-built vector database (25MB) committed to git
- Deployment Optimization: Zero embedding generation on production startup
- Memory Impact: Avoid 150MB+ memory spikes during embedding generation
Production Deployment Optimization: ✅ COMPLETED - Full production readiness:
- Memory Profiling: Comprehensive memory usage analysis and optimization
- Performance Testing: Load testing with memory constraints validation
- Error Handling: Production-grade error recovery for memory pressure
- Monitoring Integration: Real-time memory tracking and alerting
- Documentation: Complete memory management documentation across all files
Testing & Validation: ✅ COMPLETED - Memory-aware testing infrastructure:
- Memory Constraint Testing: All 138 tests pass with memory optimizations
- Performance Regression Testing: Response time validation maintained
- Memory Leak Detection: Long-running tests validate memory stability
- Production Simulation: Testing in memory-constrained environments
8. Evaluation
- Evaluation Set: Create an evaluation set of 15-30 questions and corresponding "gold" answers covering various policy topics.
- Metric Implementation: Develop scripts to calculate:
  - Answer Quality: Groundedness and Citation Accuracy.
  - System Metrics: Latency (p50/p95).
- Execution: Run the evaluation and record the results.
- Documentation: Summarize the evaluation results in `design-and-evaluation.md`.
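Two of the metrics above lend themselves to a short sketch. The percentile helper uses the nearest-rank method, and citation accuracy is defined here, as an assumption, as the fraction of answers whose cited sources are non-empty and all appear among the gold sources; the project may define it differently. Groundedness is not covered because it usually needs a judgment step beyond simple counting.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def citation_accuracy(results):
    """Fraction of results whose citations are non-empty and all in the gold set.

    Each result is a dict with 'cited' and 'gold' lists of source names.
    """
    if not results:
        return 0.0
    correct = sum(
        1
        for result in results
        if result["cited"] and set(result["cited"]) <= set(result["gold"])
    )
    return correct / len(results)


# Example with made-up latencies in seconds.
latencies = [0.8, 1.2, 0.9, 2.5, 1.0]
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
```

With only 15-30 evaluation questions, p95 is effectively the slowest or second-slowest request, so it is worth reporting the sample size next to the number.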
9. Final Documentation and Submission
- Design Document: ✅ COMPLETED - Complete `design-and-evaluation.md` with comprehensive technical analysis:
  - Memory Architecture Design: Detailed analysis of memory-constrained architecture decisions
  - Performance Evaluation: Comprehensive memory usage, response time, and quality metrics
  - Model Selection Analysis: Embedding model comparison with memory vs quality trade-offs
  - Production Deployment Evaluation: Platform compatibility and scalability analysis
  - Design Trade-offs Documentation: Lessons learned and future considerations
- README: ✅ COMPLETED - Comprehensive documentation with memory management focus:
  - Memory Management Section: Detailed memory optimization architecture and utilities
  - Production Configuration: Gunicorn, database pre-building, and deployment strategies
  - Performance Metrics: Memory usage breakdown and production performance data
  - Setup Instructions: Memory-aware development and deployment guidelines
- Deployment Documentation: ✅ COMPLETED - Updated `deployed.md` with production details:
  - Memory-Optimized Configuration: Production memory profile and optimization results
  - Performance Metrics: Real-time memory monitoring and capacity analysis
  - Production Features: Memory management system and error handling documentation
  - Deployment Pipeline: CI/CD integration with memory validation
- Contributing Guidelines: ✅ COMPLETED - Updated `CONTRIBUTING.md` with memory-conscious development:
  - Memory Development Principles: Guidelines for memory-efficient code patterns
  - Memory Testing Procedures: Development workflow for memory constraint validation
  - Code Review Guidelines: Memory-focused review checklist and best practices
  - Production Testing: Memory leak detection and performance validation procedures
- Demonstration Video: Record a 5-10 minute screen-share video demonstrating the deployed application, walking through the code architecture, explaining the evaluation results, and showing a successful CI/CD run.
- Submission: Share the GitHub repository with the grader and submit the repository and video links.