# Scripts Documentation πŸš€
Automated scripts for HeoCare Chatbot setup and maintenance.
## πŸ“‹ Quick Start
### One-Command Setup (Recommended)
```bash
# Run everything in one command
bash scripts/setup_rag.sh
```
**What it does:**
1. βœ… Checks Python & dependencies
2. βœ… Installs required packages
3. βœ… Downloads 6 medical datasets from HuggingFace
4. βœ… Builds ChromaDB vector stores (~160 MB)
5. βœ… Generates training data (200 conversations)
6. βœ… Optionally fine-tunes the agents
**Time:** ~15-20 minutes (depending on internet speed)
---
## πŸ“œ Available Scripts
### 1. `setup_rag.sh` ⭐ Main Setup
```bash
bash scripts/setup_rag.sh
```
**Features:**
- Downloads 6 datasets from HuggingFace:
- ViMedical (603 diseases)
- MentalChat16K (16K conversations)
- Nutrition recommendations
- Vietnamese food nutrition
- Fitness exercises (1.66K)
- Medical Q&A (9.3K pairs)
- Builds ChromaDB vector stores
- Generates training data
- Optional fine-tuning
**Existing databases are skipped automatically**, so re-running the script is safe (see the sketch below).
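The skip check is roughly the following, expressed as a Python sketch (the actual logic lives in `setup_rag.sh` and may differ):
```python
# Sketch of the "skip existing databases" check; illustrative only,
# the real check is implemented in setup_rag.sh.
from pathlib import Path

STORES = [
    "medical_diseases", "mental_health", "nutrition",
    "vietnamese_nutrition", "fitness", "symptom_qa", "general_health_qa",
]

for name in STORES:
    store = Path("rag/vector_store") / name
    if store.is_dir() and any(store.iterdir()):
        print(f"[skip]  {name} (already built)")
    else:
        print(f"[build] {name}")
```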
---
### 2. `generate_training_data.py` - Training Data
```bash
python scripts/generate_training_data.py
```
**What it does:**
- Generates 200 synthetic conversations
- 50 scenarios per agent (nutrition, symptom, exercise, mental_health)
- Uses GPT-4o-mini
- Output: `fine_tuning/training_data/*.jsonl`
**Cost:** ~$0.50 (OpenAI API)
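Each output line follows the OpenAI chat fine-tuning JSONL format. A minimal sketch of one record (the dialogue content is illustrative, not taken from the real scenarios):
```python
# One training record in the OpenAI chat fine-tuning JSONL format.
# The system/user/assistant content below is placeholder text.
import json

record = {
    "messages": [
        {"role": "system", "content": "You are HeoCare's nutrition agent."},
        {"role": "user", "content": "What should I eat after a workout?"},
        {"role": "assistant", "content": "Aim for protein plus carbs within an hour, e.g. ..."},
    ]
}

with open("fine_tuning/training_data/nutrition_training.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```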
---
### 3. `auto_finetune.py` - Batch Fine-tuning
```bash
python scripts/auto_finetune.py
```
**What it does:**
- Fine-tunes all 4 agents automatically
- Uploads training files
- Creates fine-tuning jobs
- Tracks progress
- Updates model config
**Requirements:** the official OpenAI API (custom/proxy endpoints do not support fine-tuning; see Troubleshooting)
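Per agent, this boils down to three calls with the official `openai` Python SDK (v1+). A minimal sketch; the model name and file path are assumptions, not necessarily what `auto_finetune.py` uses:
```python
# Minimal single-agent fine-tuning round with the official openai SDK (v1+).
# Model name and training file are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Upload the training file
upload = client.files.create(
    file=open("fine_tuning/training_data/nutrition_training.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Create the fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=upload.id,
    model="gpt-4o-mini-2024-07-18",
)

# 3. Poll progress; fine_tuned_model is set once the job succeeds
status = client.fine_tuning.jobs.retrieve(job.id)
print(status.status, status.fine_tuned_model)
```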
---
### 4. `fine_tune_agent.py` - Single Agent Fine-tuning
```bash
python scripts/fine_tune_agent.py nutrition_agent
```
**What it does:**
- Fine-tunes one specific agent
- Gives manual control over the process
- Alternative to `auto_finetune.py`
**Agents:** `nutrition_agent`, `symptom_agent`, `exercise_agent`, `mental_health_agent`
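A hypothetical sketch of how the agent argument could map to its training file (the real resolution lives in `fine_tune_agent.py`):
```python
# Hypothetical agent-name -> training-file mapping; fine_tune_agent.py
# may resolve this differently.
import sys

TRAINING_FILES = {
    "nutrition_agent": "fine_tuning/training_data/nutrition_training.jsonl",
    "symptom_agent": "fine_tuning/training_data/symptom_training.jsonl",
    "exercise_agent": "fine_tuning/training_data/exercise_training.jsonl",
    "mental_health_agent": "fine_tuning/training_data/mental_health_training.jsonl",
}

agent = sys.argv[1]  # e.g. "nutrition_agent"
print("Training file:", TRAINING_FILES[agent])
```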
---
### 5. `check_rag_status.py` - Diagnostic Tool
```bash
python scripts/check_rag_status.py
```
**What it checks:**
- βœ… ChromaDB folders exist
- πŸ“Š Database sizes
- πŸ“š Document counts
- πŸ§ͺ Test queries
**Note:** May need updates for new vector store paths
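For a quick manual spot-check of a single store, something like this works (a sketch; the collection name is an assumption, adjust to your store):
```python
# Manual spot-check of one ChromaDB store. The collection name
# "medical_diseases" is an assumption; adjust as needed.
import chromadb

client = chromadb.PersistentClient(path="rag/vector_store/medical_diseases")
collection = client.get_collection("medical_diseases")
print(collection.count(), "documents")
```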
---
## πŸ“ Directory Structure
```
scripts/
β”œβ”€β”€ setup_rag.sh # ⭐ Main setup script
β”œβ”€β”€ generate_training_data.py # Generate synthetic data
β”œβ”€β”€ auto_finetune.py # Batch fine-tuning
β”œβ”€β”€ fine_tune_agent.py # Single agent fine-tuning
β”œβ”€β”€ check_rag_status.py # Diagnostic tool
└── README.md # This file
data_mining/ # Dataset downloaders
β”œβ”€β”€ mining_vimedical.py # ViMedical diseases
β”œβ”€β”€ mining_mentalchat.py # Mental health conversations
β”œβ”€β”€ mining_nutrition.py # Nutrition recommendations
β”œβ”€β”€ mining_vietnamese_food.py # Vietnamese food data
β”œβ”€β”€ mining_fitness.py # Fitness exercises
└── mining_medical_qa.py # Medical Q&A pairs
rag/vector_store/ # ChromaDB (NOT committed)
β”œβ”€β”€ medical_diseases/ # ViMedical (603 diseases)
β”œβ”€β”€ mental_health/ # MentalChat (16K conversations)
β”œβ”€β”€ nutrition/ # Nutrition plans
β”œβ”€β”€ vietnamese_nutrition/ # Vietnamese foods (73)
β”œβ”€β”€ fitness/ # Exercises (1.66K)
β”œβ”€β”€ symptom_qa/ # Medical Q&A
└── general_health_qa/ # General health Q&A
fine_tuning/training_data/ # Generated data (NOT committed)
β”œβ”€β”€ nutrition_training.jsonl
β”œβ”€β”€ symptom_training.jsonl
β”œβ”€β”€ exercise_training.jsonl
└── mental_health_training.jsonl
```
---
## πŸ”„ Team Workflow
### First Time Setup (New Team Member)
```bash
# 1. Clone repo
git clone <repo-url>
cd heocare-chatbot
# 2. Create .env file
cp .env.example .env
# Add your OPENAI_API_KEY
# 3. Setup everything (one command)
bash scripts/setup_rag.sh
# 4. Run app
python app.py
```
**Time:** ~15-20 minutes
---
### Daily Development
```bash
# Pull latest code
git pull
# If setup_rag.sh was updated, run it again
# (It will skip existing databases automatically)
bash scripts/setup_rag.sh
# Run app
python app.py
```
---
### Regenerate Training Data
```bash
# If you updated agent prompts or scenarios
python scripts/generate_training_data.py
# Optional: Fine-tune with new data
python scripts/auto_finetune.py
```
---
### Reset Everything
```bash
# Delete all generated data
rm -rf rag/vector_store/*
rm -rf fine_tuning/training_data/*
rm -rf data_mining/datasets/*
rm -rf data_mining/output/*
# Setup from scratch
bash scripts/setup_rag.sh
```
---
## πŸ› Troubleshooting
### Setup Failed
```bash
# Check Python version (need 3.8+)
python --version
# Check dependencies
pip install -r requirements.txt
# Check API key
echo $OPENAI_API_KEY
```
---
### Dataset Download Failed
```bash
# Check internet connection
ping huggingface.co
# Try manual download for specific dataset
python data_mining/mining_vimedical.py
python data_mining/mining_mentalchat.py
```
---
### ChromaDB Issues
```bash
# Check status
python scripts/check_rag_status.py
# Delete and rebuild specific database
rm -rf rag/vector_store/medical_diseases
python data_mining/mining_vimedical.py
# Move the rebuilt store to its expected location
mkdir -p rag/vector_store
mv data_mining/output/medical_chroma rag/vector_store/medical_diseases
```
---
### Fine-tuning 404 Error
```
Error: 404 - {'detail': 'Not Found'}
```
**Cause:** the configured custom API endpoint does not support fine-tuning.
**Solution:**
1. Use the official OpenAI API for fine-tuning, or
2. Skip fine-tuning entirely (the app works fine with the base model + RAG).
```bash
# Option 1: Update .env to use official API
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_API_KEY=sk-proj-your-official-key
# Option 2: Skip fine-tuning
# Just run the app without fine-tuning
python app.py
```
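To confirm which case you're in, a small probe against the configured endpoint (a sketch using the `openai` SDK v1+):
```python
# Probe whether the configured endpoint exposes the fine-tuning API.
# A 404 here usually means a proxy that only implements chat completions.
import openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY / OPENAI_BASE_URL from the environment

try:
    client.fine_tuning.jobs.list(limit=1)
    print("βœ… Fine-tuning API reachable")
except openai.NotFoundError:
    print("❌ 404: endpoint lacks fine-tuning; point OPENAI_BASE_URL at https://api.openai.com/v1")
```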
---
## πŸ“Š Performance
| Task | Time | Size |
|------|------|------|
| Download datasets | ~5-8 min | ~50 MB |
| Build ChromaDB | ~5-7 min | ~160 MB |
| Generate training data | ~2-3 min | ~500 KB |
| Fine-tuning (optional) | ~30-60 min | - |
| **Total Setup** | **~15-20 min** | **~160 MB** |
---
## πŸ†˜ Support
If you encounter issues:
1. Run `python scripts/check_rag_status.py` for diagnostics
2. Check console logs for errors
3. Verify `.gitignore` is correct
4. Try deleting and rebuilding specific databases
5. Check that `.env` contains a valid API key
---
**Happy Coding! πŸš€**