# Scripts Documentation πŸš€
Automated scripts for HeoCare Chatbot setup and maintenance.
## πŸ“‹ Quick Start
### One-Command Setup (Recommended)
```bash
# Run everything in one command
bash scripts/setup_rag.sh
```
**What it does:**
1. βœ… Checks Python & dependencies
2. βœ… Installs required packages
3. βœ… Downloads 6 medical datasets from HuggingFace
4. βœ… Builds ChromaDB vector stores (~160 MB)
5. βœ… Generates training data (200 conversations)
6. βœ… Optionally fine-tunes the agents
**Time:** ~15-20 minutes (depending on internet speed)
---
## πŸ“œ Available Scripts
### 1. `setup_rag.sh` ⭐ Main Setup
```bash
bash scripts/setup_rag.sh
```
**Features:**
- Downloads 6 datasets from HuggingFace:
- ViMedical (603 diseases)
- MentalChat16K (16K conversations)
- Nutrition recommendations
- Vietnamese food nutrition
- Fitness exercises (1.66K)
- Medical Q&A (9.3K pairs)
- Builds ChromaDB vector stores
- Generates training data
- Optional fine-tuning
**Existing databases are skipped automatically**, so re-running the script is safe (see the sketch below).
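The skip check is roughly the following, expressed as a Python sketch (the actual logic lives in `setup_rag.sh` and may differ):
```python
# Sketch of the "skip existing databases" check; illustrative only,
# the real check is implemented in setup_rag.sh.
from pathlib import Path

STORES = [
    "medical_diseases", "mental_health", "nutrition",
    "vietnamese_nutrition", "fitness", "symptom_qa", "general_health_qa",
]

for name in STORES:
    store = Path("rag/vector_store") / name
    if store.is_dir() and any(store.iterdir()):
        print(f"[skip]  {name} (already built)")
    else:
        print(f"[build] {name}")
```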
---
### 2. `generate_training_data.py` - Training Data
```bash
python scripts/generate_training_data.py
```
**What it does:**
- Generates 200 synthetic conversations
- 50 scenarios per agent (nutrition, symptom, exercise, mental_health)
- Uses GPT-4o-mini
- Output: `fine_tuning/training_data/*.jsonl`
**Cost:** ~$0.50 (OpenAI API)
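Each output line follows the OpenAI chat fine-tuning JSONL format. A minimal sketch of one record (the dialogue content is illustrative, not taken from the real scenarios):
```python
# One training record in the OpenAI chat fine-tuning JSONL format.
# The system/user/assistant content below is placeholder text.
import json

record = {
    "messages": [
        {"role": "system", "content": "You are HeoCare's nutrition agent."},
        {"role": "user", "content": "What should I eat after a workout?"},
        {"role": "assistant", "content": "Aim for protein plus carbs within an hour, e.g. ..."},
    ]
}

with open("fine_tuning/training_data/nutrition_training.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```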
---
### 3. `auto_finetune.py` - Batch Fine-tuning
```bash
python scripts/auto_finetune.py
```
**What it does:**
- Fine-tunes all 4 agents automatically
- Uploads training files
- Creates fine-tuning jobs
- Tracks progress
- Updates model config
**Requirements:** the official OpenAI API (custom/proxy endpoints do not support fine-tuning; see Troubleshooting)
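Per agent, this boils down to three calls with the official `openai` Python SDK (v1+). A minimal sketch; the model name and file path are assumptions, not necessarily what `auto_finetune.py` uses:
```python
# Minimal single-agent fine-tuning round with the official openai SDK (v1+).
# Model name and training file are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Upload the training file
upload = client.files.create(
    file=open("fine_tuning/training_data/nutrition_training.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Create the fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=upload.id,
    model="gpt-4o-mini-2024-07-18",
)

# 3. Poll progress; fine_tuned_model is set once the job succeeds
status = client.fine_tuning.jobs.retrieve(job.id)
print(status.status, status.fine_tuned_model)
```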
---
### 4. `fine_tune_agent.py` - Single Agent Fine-tuning
```bash
python scripts/fine_tune_agent.py nutrition_agent
```
**What it does:**
- Fine-tunes one specific agent
- Gives manual control over the process
- Alternative to `auto_finetune.py`
**Agents:** `nutrition_agent`, `symptom_agent`, `exercise_agent`, `mental_health_agent`
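A hypothetical sketch of how the agent argument could map to its training file (the real resolution lives in `fine_tune_agent.py`):
```python
# Hypothetical agent-name -> training-file mapping; fine_tune_agent.py
# may resolve this differently.
import sys

TRAINING_FILES = {
    "nutrition_agent": "fine_tuning/training_data/nutrition_training.jsonl",
    "symptom_agent": "fine_tuning/training_data/symptom_training.jsonl",
    "exercise_agent": "fine_tuning/training_data/exercise_training.jsonl",
    "mental_health_agent": "fine_tuning/training_data/mental_health_training.jsonl",
}

agent = sys.argv[1]  # e.g. "nutrition_agent"
print("Training file:", TRAINING_FILES[agent])
```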
---
### 5. `check_rag_status.py` - Diagnostic Tool
```bash
python scripts/check_rag_status.py
```
**What it checks:**
- βœ… ChromaDB folders exist
- πŸ“Š Database sizes
- πŸ“š Document counts
- πŸ§ͺ Test queries
**Note:** May need updates for new vector store paths
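For a quick manual spot-check of a single store, something like this works (a sketch; the collection name is an assumption, adjust to your store):
```python
# Manual spot-check of one ChromaDB store. The collection name
# "medical_diseases" is an assumption; adjust as needed.
import chromadb

client = chromadb.PersistentClient(path="rag/vector_store/medical_diseases")
collection = client.get_collection("medical_diseases")
print(collection.count(), "documents")
```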
---
## πŸ“ Directory Structure
```
scripts/
β”œβ”€β”€ setup_rag.sh # ⭐ Main setup script
β”œβ”€β”€ generate_training_data.py # Generate synthetic data
β”œβ”€β”€ auto_finetune.py # Batch fine-tuning
β”œβ”€β”€ fine_tune_agent.py # Single agent fine-tuning
β”œβ”€β”€ check_rag_status.py # Diagnostic tool
└── README.md # This file
data_mining/ # Dataset downloaders
β”œβ”€β”€ mining_vimedical.py # ViMedical diseases
β”œβ”€β”€ mining_mentalchat.py # Mental health conversations
β”œβ”€β”€ mining_nutrition.py # Nutrition recommendations
β”œβ”€β”€ mining_vietnamese_food.py # Vietnamese food data
β”œβ”€β”€ mining_fitness.py # Fitness exercises
└── mining_medical_qa.py # Medical Q&A pairs
rag/vector_store/ # ChromaDB (NOT committed)
β”œβ”€β”€ medical_diseases/ # ViMedical (603 diseases)
β”œβ”€β”€ mental_health/ # MentalChat (16K conversations)
β”œβ”€β”€ nutrition/ # Nutrition plans
β”œβ”€β”€ vietnamese_nutrition/ # Vietnamese foods (73)
β”œβ”€β”€ fitness/ # Exercises (1.66K)
β”œβ”€β”€ symptom_qa/ # Medical Q&A
└── general_health_qa/ # General health Q&A
fine_tuning/training_data/ # Generated data (NOT committed)
β”œβ”€β”€ nutrition_training.jsonl
β”œβ”€β”€ symptom_training.jsonl
β”œβ”€β”€ exercise_training.jsonl
└── mental_health_training.jsonl
```
---
## πŸ”„ Team Workflow
### First Time Setup (New Team Member)
```bash
# 1. Clone repo
git clone <repo-url>
cd heocare-chatbot
# 2. Create .env file
cp .env.example .env
# Add your OPENAI_API_KEY
# 3. Setup everything (one command)
bash scripts/setup_rag.sh
# 4. Run app
python app.py
```
**Time:** ~15-20 minutes
---
### Daily Development
```bash
# Pull latest code
git pull
# If setup_rag.sh was updated, run it again
# (It will skip existing databases automatically)
bash scripts/setup_rag.sh
# Run app
python app.py
```
---
### Regenerate Training Data
```bash
# If you updated agent prompts or scenarios
python scripts/generate_training_data.py
# Optional: Fine-tune with new data
python scripts/auto_finetune.py
```
---
### Reset Everything
```bash
# Delete all generated data
rm -rf rag/vector_store/*
rm -rf fine_tuning/training_data/*
rm -rf data_mining/datasets/*
rm -rf data_mining/output/*
# Setup from scratch
bash scripts/setup_rag.sh
```
---
## πŸ› Troubleshooting
### Setup Failed
```bash
# Check Python version (need 3.8+)
python --version
# Check dependencies
pip install -r requirements.txt
# Check API key
echo $OPENAI_API_KEY
```
---
### Dataset Download Failed
```bash
# Check internet connection
ping huggingface.co
# Try manual download for specific dataset
python data_mining/mining_vimedical.py
python data_mining/mining_mentalchat.py
```
---
### ChromaDB Issues
```bash
# Check status
python scripts/check_rag_status.py
# Delete and rebuild specific database
rm -rf rag/vector_store/medical_diseases
python data_mining/mining_vimedical.py
# Move the rebuilt store to its expected location
mkdir -p rag/vector_store
mv data_mining/output/medical_chroma rag/vector_store/medical_diseases
```
---
### Fine-tuning 404 Error
```
Error: 404 - {'detail': 'Not Found'}
```
**Cause:** the configured custom API endpoint does not support fine-tuning.
**Solution:**
1. Use the official OpenAI API for fine-tuning, or
2. Skip fine-tuning entirely (the app works fine with the base model + RAG).
```bash
# Option 1: Update .env to use official API
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_API_KEY=sk-proj-your-official-key
# Option 2: Skip fine-tuning
# Just run the app without fine-tuning
python app.py
```
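To confirm which case you're in, a small probe against the configured endpoint (a sketch using the `openai` SDK v1+):
```python
# Probe whether the configured endpoint exposes the fine-tuning API.
# A 404 here usually means a proxy that only implements chat completions.
import openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY / OPENAI_BASE_URL from the environment

try:
    client.fine_tuning.jobs.list(limit=1)
    print("βœ… Fine-tuning API reachable")
except openai.NotFoundError:
    print("❌ 404: endpoint lacks fine-tuning; point OPENAI_BASE_URL at https://api.openai.com/v1")
```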
---
## πŸ“Š Performance
| Task | Time | Size |
|------|------|------|
| Download datasets | ~5-8 min | ~50 MB |
| Build ChromaDB | ~5-7 min | ~160 MB |
| Generate training data | ~2-3 min | ~500 KB |
| Fine-tuning (optional) | ~30-60 min | - |
| **Total Setup** | **~15-20 min** | **~160 MB** |
---
## πŸ†˜ Support
If you encounter issues:
1. Run `python scripts/check_rag_status.py` for diagnostics
2. Check console logs for errors
3. Verify `.gitignore` is correct
4. Try deleting and rebuilding specific databases
5. Check that `.env` contains a valid API key
---
**Happy Coding! πŸš€**