# Scripts Documentation
Automated scripts for HeoCare Chatbot setup and maintenance.
## Quick Start

### One-Command Setup (Recommended)

```bash
# Run everything in one command
bash scripts/setup_rag.sh
```
What it does:
- Checks Python & dependencies
- Installs required packages
- Downloads 6 medical datasets from HuggingFace
- Builds ChromaDB vector stores (~160 MB)
- Generates training data (200 conversations)
- Optionally fine-tunes agents

Time: ~15-20 minutes (depending on internet speed)
## Available Scripts

### 1. setup_rag.sh - Main Setup

```bash
bash scripts/setup_rag.sh
```
Features:
- Downloads 6 datasets from HuggingFace:
  - ViMedical (603 diseases)
  - MentalChat16K (16K conversations)
  - Nutrition recommendations
  - Vietnamese food nutrition
  - Fitness exercises (1.66K)
  - Medical Q&A (9.3K pairs)
- Builds ChromaDB vector stores
- Generates training data
- Optional fine-tuning

Existing databases are skipped automatically! (See the sketch below.)
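
The script's internals aren't reproduced here, but the skip behaviour amounts to a directory check. A minimal Python sketch of the idea, assuming the vector-store paths listed under Directory Structure and that a non-empty directory means the store is already built:

```python
from pathlib import Path

# Store paths as listed under "Directory Structure" below (assumed layout).
VECTOR_STORES = [
    "rag/vector_store/medical_diseases",
    "rag/vector_store/mental_health",
    "rag/vector_store/nutrition",
    "rag/vector_store/vietnamese_nutrition",
    "rag/vector_store/fitness",
    "rag/vector_store/symptom_qa",
    "rag/vector_store/general_health_qa",
]

def needs_build(store: str) -> bool:
    """Treat a store as built if its directory exists and is non-empty."""
    path = Path(store)
    return not (path.is_dir() and any(path.iterdir()))

for store in VECTOR_STORES:
    if needs_build(store):
        print(f"[build] {store}")  # this is where setup_rag.sh would (re)build
    else:
        print(f"[skip]  {store} already exists")
```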
### 2. generate_training_data.py - Training Data

```bash
python scripts/generate_training_data.py
```

What it does:
- Generates 200 synthetic conversations
- 50 scenarios per agent (nutrition, symptom, exercise, mental_health)
- Uses GPT-4o-mini
- Output: `fine_tuning/training_data/*.jsonl` (example record below)

Cost: ~$0.50 (OpenAI API)
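
The record schema isn't documented here, but since these files feed OpenAI fine-tuning, each line presumably follows OpenAI's chat fine-tuning format (a JSON object with a `messages` list). A hypothetical example, with invented dialogue content:

```python
import json

# One training record in OpenAI's chat fine-tuning format.
# The system prompt and dialogue below are invented for illustration.
record = {
    "messages": [
        {"role": "system", "content": "You are HeoCare's nutrition agent."},
        {"role": "user", "content": "What should I eat after a workout?"},
        {"role": "assistant", "content": "Aim for protein plus carbs, for example..."},
    ]
}

# JSONL: one JSON object per line.
with open("fine_tuning/training_data/nutrition_training.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```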
### 3. auto_finetune.py - Batch Fine-tuning

```bash
python scripts/auto_finetune.py
```

What it does:
- Fine-tunes all 4 agents automatically
- Uploads training files
- Creates fine-tuning jobs
- Tracks progress
- Updates the model config

Requirements: official OpenAI API (custom endpoints are not supported; see Troubleshooting below)
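
The script itself isn't shown here, but the upload / create-job / track flow above maps directly onto the OpenAI Python SDK. A minimal sketch of one agent's job (the base-model name is an assumption):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY (and OPENAI_BASE_URL, if set) from the environment

# 1. Upload the training file.
training_file = client.files.create(
    file=open("fine_tuning/training_data/nutrition_training.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Create the fine-tuning job (assumed base model).
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)

# 3. Track progress: status moves through queued -> running -> succeeded.
job = client.fine_tuning.jobs.retrieve(job.id)
print(job.status, job.fine_tuned_model)
```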
### 4. fine_tune_agent.py - Single-Agent Fine-tuning

```bash
python scripts/fine_tune_agent.py nutrition_agent
```

What it does:
- Fine-tunes one specific agent
- Gives you manual control over the process
- Serves as an alternative to auto_finetune.py

Agents: `nutrition_agent`, `symptom_agent`, `exercise_agent`, `mental_health_agent`
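
How the script resolves an agent name to its training file isn't specified; one plausible sketch, using the file names listed under Directory Structure (the mapping itself is an assumption):

```python
import sys

# Assumed mapping from agent name to its generated training file.
AGENT_FILES = {
    "nutrition_agent": "fine_tuning/training_data/nutrition_training.jsonl",
    "symptom_agent": "fine_tuning/training_data/symptom_training.jsonl",
    "exercise_agent": "fine_tuning/training_data/exercise_training.jsonl",
    "mental_health_agent": "fine_tuning/training_data/mental_health_training.jsonl",
}

if __name__ == "__main__":
    if len(sys.argv) != 2 or sys.argv[1] not in AGENT_FILES:
        sys.exit("Usage: python scripts/fine_tune_agent.py <" + "|".join(AGENT_FILES) + ">")
    agent = sys.argv[1]
    print(f"Fine-tuning {agent} from {AGENT_FILES[agent]}")
    # ...then the same upload/create-job calls sketched under auto_finetune.py
```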
### 5. check_rag_status.py - Diagnostic Tool

```bash
python scripts/check_rag_status.py
```

What it checks:
- ChromaDB folders exist
- Database sizes
- Document counts
- Test queries

Note: may need updates for new vector store paths (a manual check is sketched below).
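
If the diagnostic script lags behind the current paths, you can inspect a store directly with the chromadb client. A minimal sketch (the path is one example store; note that `list_collections()` returns `Collection` objects in chromadb 0.4.x but may return names only in newer releases):

```python
import chromadb

# Open one persisted store from the tree under "Directory Structure".
client = chromadb.PersistentClient(path="rag/vector_store/medical_diseases")

for col in client.list_collections():
    print(f"{col.name}: {col.count()} documents")
    # Peek at a couple of stored chunks to sanity-check the content.
    print(col.peek(limit=2)["documents"])
```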
## Directory Structure

```text
scripts/
├── setup_rag.sh                # Main setup script
├── generate_training_data.py   # Generate synthetic data
├── auto_finetune.py            # Batch fine-tuning
├── fine_tune_agent.py          # Single-agent fine-tuning
├── check_rag_status.py         # Diagnostic tool
└── README.md                   # This file

data_mining/                    # Dataset downloaders
├── mining_vimedical.py         # ViMedical diseases
├── mining_mentalchat.py        # Mental health conversations
├── mining_nutrition.py         # Nutrition recommendations
├── mining_vietnamese_food.py   # Vietnamese food data
├── mining_fitness.py           # Fitness exercises
└── mining_medical_qa.py        # Medical Q&A pairs

rag/vector_store/               # ChromaDB (NOT committed)
├── medical_diseases/           # ViMedical (603 diseases)
├── mental_health/              # MentalChat (16K conversations)
├── nutrition/                  # Nutrition plans
├── vietnamese_nutrition/       # Vietnamese foods (73)
├── fitness/                    # Exercises (1.66K)
├── symptom_qa/                 # Medical Q&A
└── general_health_qa/          # General health Q&A

fine_tuning/training_data/      # Generated data (NOT committed)
├── nutrition_training.jsonl
├── symptom_training.jsonl
├── exercise_training.jsonl
└── mental_health_training.jsonl
```
## Team Workflow

### First-Time Setup (New Team Member)

```bash
# 1. Clone the repo
git clone <repo-url>
cd heocare-chatbot

# 2. Create a .env file
cp .env.example .env
# Add your OPENAI_API_KEY

# 3. Set everything up (one command)
bash scripts/setup_rag.sh

# 4. Run the app
python app.py
```
Time: ~15-20 minutes
### Daily Development

```bash
# Pull the latest code
git pull

# If setup_rag.sh was updated, run it again
# (existing databases are skipped automatically)
bash scripts/setup_rag.sh

# Run the app
python app.py
```
### Regenerate Training Data

```bash
# If you updated agent prompts or scenarios
python scripts/generate_training_data.py

# Optional: fine-tune with the new data
python scripts/auto_finetune.py
```
### Reset Everything

```bash
# Delete all generated data
rm -rf rag/vector_store/*
rm -rf fine_tuning/training_data/*
rm -rf data_mining/datasets/*
rm -rf data_mining/output/*

# Set up from scratch
bash scripts/setup_rag.sh
```
## Troubleshooting

### Setup Failed

```bash
# Check the Python version (3.8+ required)
python --version

# Check dependencies
pip install -r requirements.txt

# Check the API key
echo $OPENAI_API_KEY
```
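
If the key is set but you're not sure it works, a cheap end-to-end check is to list models through the SDK; an invalid key raises an authentication error:

```python
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY (and OPENAI_BASE_URL, if set) from the environment

# models.list() is a lightweight call that verifies both key and endpoint;
# a bad key raises openai.AuthenticationError.
models = client.models.list()
print(f"OK: {len(models.data)} models visible")
```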
### Dataset Download Failed

```bash
# Check the internet connection
ping huggingface.co

# Try a manual download for a specific dataset
python data_mining/mining_vimedical.py
python data_mining/mining_mentalchat.py
```
### ChromaDB Issues

```bash
# Check status
python scripts/check_rag_status.py

# Delete and rebuild a specific database
rm -rf rag/vector_store/medical_diseases
python data_mining/mining_vimedical.py

# Move the rebuilt store from the miner's output to the expected location
mkdir -p rag/vector_store
mv data_mining/output/medical_chroma rag/vector_store/medical_diseases
```
### Fine-tuning 404 Error

```text
Error: 404 - {'detail': 'Not Found'}
```

Cause: the custom API endpoint doesn't support fine-tuning.

Solution:
- Use the official OpenAI API for fine-tuning
- Or skip fine-tuning (the app works fine with the base model + RAG)

```bash
# Option 1: Update .env to use the official API
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_API_KEY=sk-proj-your-official-key

# Option 2: Skip fine-tuning; just run the app
python app.py
```
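
How the app falls back from a fine-tuned model to the base model isn't documented here; a plausible sketch of such a lookup (the config path, key names, and base model are all assumptions):

```python
import json
from pathlib import Path

# Hypothetical config that auto_finetune.py updates; path and keys are assumptions.
CONFIG_PATH = Path("fine_tuning/model_config.json")
BASE_MODEL = "gpt-4o-mini"  # assumed default when no fine-tuned model is configured

def resolve_model(agent: str) -> str:
    """Return the agent's fine-tuned model if configured, else the base model."""
    if CONFIG_PATH.exists():
        config = json.loads(CONFIG_PATH.read_text(encoding="utf-8"))
        return config.get(agent, BASE_MODEL)
    return BASE_MODEL

print(resolve_model("nutrition_agent"))  # e.g. "ft:gpt-4o-mini:..." or the base model
```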
## Performance
| Task | Time | Size |
|---|---|---|
| Download datasets | ~5-8 min | ~50 MB |
| Build ChromaDB | ~5-7 min | ~160 MB |
| Generate training data | ~2-3 min | ~500 KB |
| Fine-tuning (optional) | ~30-60 min | - |
| Total Setup | ~15-20 min | ~160 MB |
## Support

If you encounter issues:
- Run `python scripts/check_rag_status.py` for diagnostics
- Check console logs for errors
- Verify `.gitignore` is correct
- Try deleting and rebuilding specific databases
- Check that `.env` has a valid API key

Happy Coding!