# Scripts Documentation

Automated scripts for HeoCare Chatbot setup and maintenance.

## Quick Start

### One-Command Setup (Recommended)

```bash
# Run everything in one command
bash scripts/setup_rag.sh
```
**What it does:**

1. ✅ Check Python & dependencies
2. ✅ Install required packages
3. ✅ Download 6 medical datasets from HuggingFace
4. ✅ Build ChromaDB vector stores (~160 MB)
5. ✅ Generate training data (200 conversations)
6. ✅ Optional: Fine-tune agents

**Time:** ~15-20 minutes (depends on internet speed)
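To confirm the setup finished cleanly, a quick sanity check (assuming the default paths shown in the directory structure below) is to list the generated vector stores and their total size:

```bash
# One folder per vector store should exist after setup
ls rag/vector_store

# Total size should be roughly 160 MB
du -sh rag/vector_store
```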
---

## Available Scripts

### 1. `setup_rag.sh` - Main Setup

```bash
bash scripts/setup_rag.sh
```

**Features:**

- Downloads 6 datasets from HuggingFace:
  - ViMedical (603 diseases)
  - MentalChat16K (16K conversations)
  - Nutrition recommendations
  - Vietnamese food nutrition
  - Fitness exercises (1.66K)
  - Medical Q&A (9.3K pairs)
- Builds ChromaDB vector stores
- Generates training data
- Optional fine-tuning

**Existing databases are skipped automatically!**
---

### 2. `generate_training_data.py` - Training Data

```bash
python scripts/generate_training_data.py
```

**What it does:**

- Generates 200 synthetic conversations
- 50 scenarios per agent (nutrition, symptom, exercise, mental_health)
- Uses GPT-4o-mini
- Output: `fine_tuning/training_data/*.jsonl`

**Cost:** ~$0.50 (OpenAI API)
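A quick way to inspect the output (file names taken from the directory structure below; each JSONL line is one JSON training example):

```bash
# Count training examples per agent (JSONL = one JSON object per line)
wc -l fine_tuning/training_data/*.jsonl

# Pretty-print the first nutrition example
head -n 1 fine_tuning/training_data/nutrition_training.jsonl | python -m json.tool
```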
---

### 3. `auto_finetune.py` - Batch Fine-tuning

```bash
python scripts/auto_finetune.py
```

**What it does:**

- Fine-tunes all 4 agents automatically
- Uploads training files
- Creates fine-tuning jobs
- Tracks progress
- Updates model config

**Requirements:** the official OpenAI API (custom API endpoints are not supported)
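Before kicking off a batch run, a minimal pre-flight check (assuming the `.env` variables shown in Troubleshooting below) can save a failed job:

```bash
# Fine-tuning 404s on custom endpoints; check what the .env points at
grep OPENAI_BASE_URL .env   # should be https://api.openai.com/v1 (or absent)

# Confirm a key is set without echoing the whole secret
echo "${OPENAI_API_KEY:0:7}..."
```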
---

### 4. `fine_tune_agent.py` - Single Agent Fine-tuning

```bash
python scripts/fine_tune_agent.py nutrition_agent
```

**What it does:**

- Fine-tunes one specific agent
- Gives manual control over the process
- Alternative to `auto_finetune.py`

**Agents:** `nutrition_agent`, `symptom_agent`, `exercise_agent`, `mental_health_agent`
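If you want all four agents but prefer this script's manual control, a simple shell loop over the agent names above works (a sketch; jobs run sequentially):

```bash
for agent in nutrition_agent symptom_agent exercise_agent mental_health_agent; do
    python scripts/fine_tune_agent.py "$agent"
done
```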
---

### 5. `check_rag_status.py` - Diagnostic Tool

```bash
python scripts/check_rag_status.py
```

**What it checks:**

- ChromaDB folders exist
- Database sizes
- Document counts
- Test queries

**Note:** May need updates for new vector store paths
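If the script's paths are stale (see the note above), a manual fallback is to size each store directly:

```bash
# One line per vector store with its on-disk size
du -sh rag/vector_store/*/
```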
---

## Directory Structure

```
scripts/
├── setup_rag.sh                 # Main setup script
├── generate_training_data.py    # Generate synthetic data
├── auto_finetune.py             # Batch fine-tuning
├── fine_tune_agent.py           # Single agent fine-tuning
├── check_rag_status.py          # Diagnostic tool
└── README.md                    # This file

data_mining/                     # Dataset downloaders
├── mining_vimedical.py          # ViMedical diseases
├── mining_mentalchat.py         # Mental health conversations
├── mining_nutrition.py          # Nutrition recommendations
├── mining_vietnamese_food.py    # Vietnamese food data
├── mining_fitness.py            # Fitness exercises
└── mining_medical_qa.py         # Medical Q&A pairs

rag/vector_store/                # ChromaDB (NOT committed)
├── medical_diseases/            # ViMedical (603 diseases)
├── mental_health/               # MentalChat (16K conversations)
├── nutrition/                   # Nutrition plans
├── vietnamese_nutrition/        # Vietnamese foods (73)
├── fitness/                     # Exercises (1.66K)
├── symptom_qa/                  # Medical Q&A
└── general_health_qa/           # General health Q&A

fine_tuning/training_data/       # Generated data (NOT committed)
├── nutrition_training.jsonl
├── symptom_training.jsonl
├── exercise_training.jsonl
└── mental_health_training.jsonl
```
---

## Team Workflow

### First Time Setup (New Team Member)

```bash
# 1. Clone repo
git clone <repo-url>
cd heocare-chatbot

# 2. Create .env file
cp .env.example .env
# Add your OPENAI_API_KEY

# 3. Setup everything (one command)
bash scripts/setup_rag.sh

# 4. Run app
python app.py
```

**Time:** ~15-20 minutes
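For step 2, a minimal `.env` might look like the sketch below (the key is a placeholder; `OPENAI_BASE_URL` is only needed for a custom endpoint, which works for chat but not fine-tuning, as covered in Troubleshooting):

```bash
# .env -- placeholder values, replace with your own
OPENAI_API_KEY=sk-proj-your-key-here
OPENAI_BASE_URL=https://api.openai.com/v1
```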
---

### Daily Development

```bash
# Pull latest code
git pull

# If setup_rag.sh was updated, run it again
# (It will skip existing databases automatically)
bash scripts/setup_rag.sh

# Run app
python app.py
```
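To re-run setup only when the pull actually touched the script, a small convenience sketch (uses the reflog, so it assumes the pull just happened):

```bash
# Re-run setup only if the last pull changed setup_rag.sh
if git diff --name-only 'HEAD@{1}' HEAD | grep -q 'scripts/setup_rag.sh'; then
    bash scripts/setup_rag.sh
fi
```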
---

### Regenerate Training Data

```bash
# If you updated agent prompts or scenarios
python scripts/generate_training_data.py

# Optional: Fine-tune with new data
python scripts/auto_finetune.py
```
---

### Reset Everything

```bash
# Delete all generated data
rm -rf rag/vector_store/*
rm -rf fine_tuning/training_data/*
rm -rf data_mining/datasets/*
rm -rf data_mining/output/*

# Setup from scratch
bash scripts/setup_rag.sh
```
---

## Troubleshooting

### Setup Failed

```bash
# Check Python version (need 3.8+)
python --version

# Check dependencies
pip install -r requirements.txt

# Check API key
echo $OPENAI_API_KEY
```
---

### Dataset Download Failed

```bash
# Check internet connection
ping huggingface.co

# Try manual download for a specific dataset
python data_mining/mining_vimedical.py
python data_mining/mining_mentalchat.py
```
---

### ChromaDB Issues

```bash
# Check status
python scripts/check_rag_status.py

# Delete and rebuild a specific database
rm -rf rag/vector_store/medical_diseases
python data_mining/mining_vimedical.py

# Move the rebuilt store to the correct location
mkdir -p rag/vector_store
mv data_mining/output/medical_chroma rag/vector_store/medical_diseases
```
---

### Fine-tuning 404 Error

```
Error: 404 - {'detail': 'Not Found'}
```

**Cause:** The custom API endpoint doesn't support fine-tuning.

**Solution:**

1. Use the official OpenAI API for fine-tuning
2. Or skip fine-tuning (the app works fine with the base model + RAG)

```bash
# Option 1: Update .env to use the official API
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_API_KEY=sk-proj-your-official-key

# Option 2: Skip fine-tuning
# Just run the app without fine-tuning
python app.py
```
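To check up front whether the configured endpoint supports fine-tuning at all, you can query the jobs list directly (a diagnostic sketch: the official API returns a JSON object with a `data` array, while unsupported custom endpoints typically return the 404 above):

```bash
curl -s "${OPENAI_BASE_URL:-https://api.openai.com/v1}/fine_tuning/jobs?limit=1" \
  -H "Authorization: Bearer $OPENAI_API_KEY"
```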
---

## Performance

| Task | Time | Size |
|------|------|------|
| Download datasets | ~5-8 min | ~50 MB |
| Build ChromaDB | ~5-7 min | ~160 MB |
| Generate training data | ~2-3 min | ~500 KB |
| Fine-tuning (optional) | ~30-60 min | - |
| **Total Setup** | **~15-20 min** | **~160 MB** |
---

## Support

If you encounter issues:

1. Run `python scripts/check_rag_status.py` for diagnostics
2. Check console logs for errors
3. Verify `.gitignore` is correct (see the check below)
4. Try deleting and rebuilding specific databases
5. Check that `.env` has a valid API key
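For step 3, the generated folders marked "NOT committed" in the directory structure should all be ignored; a quick grep (patterns assumed from this README's layout):

```bash
# All generated locations should match an ignore rule
grep -E 'rag/vector_store|fine_tuning/training_data|data_mining/(datasets|output)' .gitignore
```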
---

**Happy Coding!**