
Scripts Documentation 🚀

Automated scripts for HeoCare Chatbot setup and maintenance.

📋 Quick Start

One-Command Setup (Recommended)

# Run everything in one command
bash scripts/setup_rag.sh

What it does:

  1. ✅ Check Python & dependencies
  2. ✅ Install required packages
  3. ✅ Download 6 medical datasets from HuggingFace
  4. ✅ Build ChromaDB vector stores (~160 MB)
  5. ✅ Generate training data (200 conversations)
  6. ✅ Optional: Fine-tune agents

Time: ~15-20 minutes (depends on internet speed)


📜 Available Scripts

1. setup_rag.sh ⭐ Main Setup

bash scripts/setup_rag.sh

Features:

  • Downloads 6 datasets from HuggingFace:
    • ViMedical (603 diseases)
    • MentalChat16K (16K conversations)
    • Nutrition recommendations
    • Vietnamese food nutrition
    • Fitness exercises (1.66K)
    • Medical Q&A (9.3K pairs)
  • Builds ChromaDB vector stores
  • Generates training data
  • Optional fine-tuning

Existing databases are skipped automatically, so the script can be re-run safely.
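
The skip behaviour can be pictured as a per-directory check. The sketch below is an assumption about how such a check might look, not the logic inside setup_rag.sh; `needs_build` and `stores_to_build` are hypothetical helpers, and the store names are taken from the Directory Structure section of this README.

```python
from pathlib import Path
from typing import List

# Store names as listed under rag/vector_store/ in this README.
VECTOR_STORES = [
    "medical_diseases", "mental_health", "nutrition",
    "vietnamese_nutrition", "fitness", "symptom_qa", "general_health_qa",
]


def needs_build(store_dir: Path) -> bool:
    """Return True if the store directory is missing or empty."""
    return not store_dir.is_dir() or not any(store_dir.iterdir())


def stores_to_build(root: Path) -> List[str]:
    """List the stores under `root` that still have to be built."""
    return [name for name in VECTOR_STORES if needs_build(root / name)]
```

Calling `stores_to_build(Path("rag/vector_store"))` on a fresh clone would return all seven names; after a successful setup it would return an empty list.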


2. generate_training_data.py - Training Data

python scripts/generate_training_data.py

What it does:

  • Generates 200 synthetic conversations
  • 50 scenarios per agent (nutrition, symptom, exercise, mental_health)
  • Uses GPT-4o-mini
  • Output: fine_tuning/training_data/*.jsonl

Cost: ~$0.50 (OpenAI API)
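
A quick way to sanity-check the generated files is to load them and count the records. The sketch below assumes the files follow the OpenAI chat fine-tuning layout (one JSON object per line with a `messages` list of role/content pairs) and that only the `system`, `user`, and `assistant` roles are used; the exact fields written by generate_training_data.py are not documented here, and `load_conversations` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import List

# Assumption: plain chat data with no tool or function messages.
VALID_ROLES = {"system", "user", "assistant"}


def load_conversations(path: Path) -> List[dict]:
    """Read a JSONL file and return its records, one per non-empty line."""
    records = []
    with path.open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                raise ValueError(f"{path}:{line_no}: missing 'messages' list")
            for msg in messages:
                if (not isinstance(msg, dict)
                        or msg.get("role") not in VALID_ROLES
                        or "content" not in msg):
                    raise ValueError(f"{path}:{line_no}: bad message {msg!r}")
            records.append(record)
    return records
```

With 50 scenarios per agent, `len(load_conversations(...))` should come out at 50 for each of the four files if generation completed.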


3. auto_finetune.py - Batch Fine-tuning

python scripts/auto_finetune.py

What it does:

  • Fine-tunes all 4 agents automatically
  • Uploads training files
  • Creates fine-tuning jobs
  • Tracks progress
  • Updates model config

Requirements: the official OpenAI API (custom API endpoints do not support fine-tuning)
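
The upload, create, and track steps map onto three calls in the OpenAI Python SDK. The following is a minimal sketch of that flow for a single file, not the code in auto_finetune.py; the `fine_tune` function, its parameters, and the polling interval are assumptions made for illustration.

```python
import time

# Job statuses after which polling can stop.
TERMINAL_STATES = {"succeeded", "failed", "cancelled"}


def fine_tune(client, training_path, base_model, poll_seconds=30.0):
    """Upload one training file, start a job, and wait for it to finish.

    `client` is expected to behave like `openai.OpenAI()`.
    Returns the fine-tuned model name, or None if the job did not succeed.
    """
    # Step 1: upload the JSONL training file.
    with open(training_path, "rb") as f:
        uploaded = client.files.create(file=f, purpose="fine-tune")

    # Step 2: create the fine-tuning job from the uploaded file.
    job = client.fine_tuning.jobs.create(
        training_file=uploaded.id, model=base_model
    )

    # Step 3: poll until the job reaches a terminal state.
    while job.status not in TERMINAL_STATES:
        time.sleep(poll_seconds)
        job = client.fine_tuning.jobs.retrieve(job.id)

    return job.fine_tuned_model if job.status == "succeeded" else None
```

With the real SDK, the client would be created with `from openai import OpenAI; client = OpenAI()`, which reads `OPENAI_API_KEY` from the environment. Passing the client in as a parameter keeps the function testable without network access.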


4. fine_tune_agent.py - Single Agent Fine-tuning

python scripts/fine_tune_agent.py nutrition_agent

What it does:

  • Fine-tune one specific agent
  • Manual control over the process
  • Alternative to auto_finetune.py

Agents: nutrition_agent, symptom_agent, exercise_agent, mental_health_agent
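
Restricting the positional argument to the four agent names can be done with `argparse`. This is a sketch of one way to do it, not the argument handling in fine_tune_agent.py; `training_file_for` is a hypothetical helper, and its name mapping is inferred from the file names under fine_tuning/training_data/.

```python
import argparse

AGENTS = ["nutrition_agent", "symptom_agent", "exercise_agent", "mental_health_agent"]


def parse_args(argv=None):
    """Parse the command line; rejects any agent name not in AGENTS."""
    parser = argparse.ArgumentParser(description="Fine-tune a single agent.")
    parser.add_argument("agent", choices=AGENTS, help="agent to fine-tune")
    return parser.parse_args(argv)


def training_file_for(agent: str) -> str:
    """Map an agent name to its training file (assumed naming scheme)."""
    topic = agent[: -len("_agent")]
    return f"fine_tuning/training_data/{topic}_training.jsonl"
```

An unknown agent name makes `argparse` print the list of valid choices and exit with a non-zero status.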


5. check_rag_status.py - Diagnostic Tool

python scripts/check_rag_status.py

What it checks:

  • ✅ ChromaDB folders exist
  • 📊 Database sizes
  • 📚 Document counts
  • 🧪 Test queries

Note: This script may need updating if the vector store paths have changed.
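
The folder and size checks need only the standard library. The sketch below covers those two checks under the assumption that each store is a subfolder of one root directory; it is not the code in check_rag_status.py, and `dir_size_mb` and `store_report` are hypothetical names.

```python
from pathlib import Path
from typing import Dict


def dir_size_mb(path: Path) -> float:
    """Total size of all files under `path`, in megabytes."""
    total = sum(p.stat().st_size for p in path.rglob("*") if p.is_file())
    return total / (1024 * 1024)


def store_report(root: Path) -> Dict[str, float]:
    """Map each subfolder of `root` to its size in MB (empty if root is missing)."""
    if not root.is_dir():
        return {}
    return {d.name: dir_size_mb(d) for d in sorted(root.iterdir()) if d.is_dir()}
```

Document counts and test queries would additionally require opening each store with the `chromadb` client, which is left out here.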


📁 Directory Structure

scripts/
├── setup_rag.sh                   # ⭐ Main setup script
├── generate_training_data.py      # Generate synthetic data
├── auto_finetune.py               # Batch fine-tuning
├── fine_tune_agent.py             # Single agent fine-tuning
├── check_rag_status.py            # Diagnostic tool
└── README.md                      # This file

data_mining/                       # Dataset downloaders
├── mining_vimedical.py            # ViMedical diseases
├── mining_mentalchat.py           # Mental health conversations
├── mining_nutrition.py            # Nutrition recommendations
├── mining_vietnamese_food.py      # Vietnamese food data
├── mining_fitness.py              # Fitness exercises
└── mining_medical_qa.py           # Medical Q&A pairs

rag/vector_store/                  # ChromaDB (NOT committed)
├── medical_diseases/              # ViMedical (603 diseases)
├── mental_health/                 # MentalChat (16K conversations)
├── nutrition/                     # Nutrition plans
├── vietnamese_nutrition/          # Vietnamese foods (73)
├── fitness/                       # Exercises (1.66K)
├── symptom_qa/                    # Medical Q&A
└── general_health_qa/             # General health Q&A

fine_tuning/training_data/         # Generated data (NOT committed)
├── nutrition_training.jsonl
├── symptom_training.jsonl
├── exercise_training.jsonl
└── mental_health_training.jsonl

🔄 Team Workflow

First Time Setup (New Team Member)

# 1. Clone repo
git clone <repo-url>
cd heocare-chatbot

# 2. Create .env file
cp .env.example .env
# Add your OPENAI_API_KEY

# 3. Setup everything (one command)
bash scripts/setup_rag.sh

# 4. Run app
python app.py

Time: ~15-20 minutes


Daily Development

# Pull latest code
git pull

# If setup_rag.sh was updated, run it again
# (It will skip existing databases automatically)
bash scripts/setup_rag.sh

# Run app
python app.py

Regenerate Training Data

# If you updated agent prompts or scenarios
python scripts/generate_training_data.py

# Optional: Fine-tune with new data
python scripts/auto_finetune.py

Reset Everything

# Delete all generated data
rm -rf rag/vector_store/*
rm -rf fine_tuning/training_data/*
rm -rf data_mining/datasets/*
rm -rf data_mining/output/*

# Setup from scratch
bash scripts/setup_rag.sh

🐛 Troubleshooting

Setup Failed

# Check Python version (need 3.8+)
python --version

# Check dependencies
pip install -r requirements.txt

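# Check that the API key is set (without printing the secret)
[ -n "$OPENAI_API_KEY" ] && echo "OPENAI_API_KEY is set" || echo "OPENAI_API_KEY is missing"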
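
The same checks can be run from Python, which avoids printing the key. This is a sketch under the assumption that the key is exported in the process environment; a key that lives only in .env will not be seen unless the file has been loaded first. `preflight` is a hypothetical helper, not part of the scripts.

```python
import os
import sys
from typing import List, Mapping, Tuple


def preflight(version: Tuple[int, int] = sys.version_info[:2],
              env: Mapping[str, str] = os.environ) -> List[str]:
    """Return a list of setup problems; an empty list means all checks passed."""
    problems = []
    if version < (3, 8):
        problems.append(f"Python 3.8+ required, found {version[0]}.{version[1]}")
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems
```

Running `print(preflight())` before setup_rag.sh gives a short list of what to fix, or `[]` when both checks pass.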

Dataset Download Failed

# Check internet connection
ping -c 4 huggingface.co

# Try manual download for specific dataset
python data_mining/mining_vimedical.py
python data_mining/mining_mentalchat.py

ChromaDB Issues

# Check status
python scripts/check_rag_status.py

# Delete and rebuild specific database
rm -rf rag/vector_store/medical_diseases
python data_mining/mining_vimedical.py

# Move to correct location
mkdir -p rag/vector_store
mv data_mining/output/medical_chroma rag/vector_store/medical_diseases

Fine-tuning 404 Error

Error: 404 - {'detail': 'Not Found'}

Cause: Custom API endpoint doesn't support fine-tuning

Solution:

  1. Use OpenAI official API for fine-tuning
  2. Or skip fine-tuning (the app works fine with the base model + RAG)

# Option 1: Update .env to use official API
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_API_KEY=sk-proj-your-official-key

# Option 2: Skip fine-tuning
# Just run the app without fine-tuning
python app.py
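
A script can detect this situation up front instead of failing with a 404. The sketch below compares the configured base URL against the official host; treating an unset value as official is an assumption based on the OpenAI SDK's default, and `is_official_endpoint` is a hypothetical helper.

```python
from urllib.parse import urlparse

OFFICIAL_HOST = "api.openai.com"


def is_official_endpoint(base_url: str) -> bool:
    """True if `base_url` is empty (SDK default) or points at the official host."""
    if not base_url:
        return True
    return urlparse(base_url).hostname == OFFICIAL_HOST
```

For example, `is_official_endpoint(os.environ.get("OPENAI_BASE_URL", ""))` could be checked before starting a fine-tuning job, and the step skipped with a clear message when it returns `False`.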

📊 Performance

| Task | Time | Size |
|------|------|------|
| Download datasets | ~5-8 min | ~50 MB |
| Build ChromaDB | ~5-7 min | ~160 MB |
| Generate training data | ~2-3 min | ~500 KB |
| Fine-tuning (optional) | ~30-60 min | - |
| Total Setup | ~15-20 min | ~160 MB |

🆘 Support

If you encounter issues:

  1. Run python scripts/check_rag_status.py for diagnostics
  2. Check console logs for errors
  3. Verify that .gitignore excludes the generated folders (rag/vector_store/ and fine_tuning/training_data/)
  4. Try deleting and rebuilding specific databases
  5. Check that .env has valid API key

Happy Coding! 🚀