---
title: VoiceKit MCP
emoji: 🎤
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: "6.0.0"
app_file: app.py
pinned: false
tags:
  - building-mcp-track-creative
  - mcp-server
---

# 🎤 VoiceKit MCP

> **Professional voice analysis as MCP tools — extract embeddings, compare voices, transcribe speech, and more.**

Six MCP tools for voice processing, all accepting base64-encoded audio.

📢 **Social Post:** [View on X](https://x.com/dahee_pk/status/1994389505898582442)<br>
🎬 **Demo Video:** [Watch on YouTube](https://www.youtube.com/watch?v=1VIqvpwfyWU)<br>
👥 **Team:** [@EricYoun](https://huggingface.co/EricYoun), [@NickEo](https://huggingface.co/NickEo), [@HYENA-WON](https://huggingface.co/HYENA-WON), [@jjin6573](https://huggingface.co/jjin6573), [@cocoajoa](https://huggingface.co/cocoajoa)

---

## 📋 Submission Info

| | |
|---|---|
| **Track** | Building MCP — Creative |
| **MCP Endpoint** | `https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse` |
| **Framework** | Gradio 6.0 |

---

## ✅ Track 1 Requirements

| Requirement | How We Fulfill It |
|-------------|-------------------|
| **Functioning MCP Server** | 6 MCP tools exposed via Gradio's `mcp_server=True` |
| **MCP Client Demo** | Video shows integration with Claude Desktop / MCP client |
| **Documented Tools** | Full API documentation with inputs/outputs below |
| **Gradio App** | Interactive demo UI + hidden MCP tool interfaces |

---

## 🛠️ MCP Tools (6 Tools)

All tools accept **base64-encoded audio** as input.
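
Since every tool takes base64 text rather than raw bytes, a small pair of helpers can keep encoding and decoding consistent on the client side. The sketch below uses only the Python standard library; the function names `encode_audio` and `decode_audio` are illustrative and not part of VoiceKit.

```python
import base64
import binascii


def encode_audio(raw: bytes) -> str:
    """Return the base64 text form of raw audio bytes."""
    return base64.b64encode(raw).decode("ascii")


def decode_audio(audio_base64: str) -> bytes:
    """Decode a base64 string back to bytes, rejecting malformed input."""
    try:
        return base64.b64decode(audio_base64, validate=True)
    except binascii.Error as exc:
        raise ValueError("audio_base64 is not valid base64") from exc


# Stand-in bytes; in practice these would come from reading an audio file.
sample = b"RIFF\x00\x00\x00\x00WAVE"
encoded = encode_audio(sample)
assert decode_audio(encoded) == sample
```

Validating on decode catches truncated or corrupted strings before they are written to disk or passed on.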

### 1. `extract_embedding` <img src="icons/extract_embedding.svg" width="20" height="20">
Extract voice embeddings using Wav2Vec2 model.

| | |
|---|---|
| **Input** | `audio_base64` (base64-encoded audio) |
| **Output** | `embedding_preview` (first 5 values), `embedding_length` (768) |
| **Use Case** | Speaker identification, voice fingerprinting |

<img src="imgs/extract_embedding.jpg" height="300">

### 2. `match_voice` <img src="icons/match_voice.svg" width="20" height="20">
Compare similarity between two voices.

| | |
|---|---|
| **Inputs** | `audio1_base64`, `audio2_base64` |
| **Output** | `similarity` (0-1), `tone_score` (0-100) |
| **Use Case** | Voice cloning verification, speaker matching |

<img src="imgs/match_voice.jpg" height="300">
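
This README does not say how `similarity` is computed. A common way to compare two embedding vectors is cosine similarity, sketched below in plain Python as an illustration only. Clamping the result to the 0-1 range is an assumption made here to match the documented output range, not a description of the tool's internals.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no direction to compare
    return dot / (norm_a * norm_b)


def clamp_to_unit(value: float) -> float:
    """Assumed mapping to 0-1: negative cosines are treated as 0."""
    return max(0.0, min(1.0, value))


# Toy 3-dimensional vectors standing in for 768-dimensional embeddings.
voice_a = [0.2, 0.1, 0.7]
voice_b = [0.2, 0.1, 0.7]
print(clamp_to_unit(cosine_similarity(voice_a, voice_b)))
```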

### 3. `analyze_acoustics` <img src="icons/analyze_acoustics.svg" width="20" height="20">
Extract detailed acoustic characteristics.

| | |
|---|---|
| **Input** | `audio_base64` |
| **Output** | Pitch, energy, rhythm, tempo, spectral info |
| **Use Case** | Emotional tone detection, voice profiling |

<img src="imgs/analyze_acoustics.jpg" height="300">

### 4. `transcribe_audio` <img src="icons/transcribe_audio.svg" width="20" height="20">
Convert speech to text (multilingual).

| | |
|---|---|
| **Inputs** | `audio_base64`, `language` (default: "en") |
| **Output** | Transcribed text, detected language |
| **Model** | ElevenLabs Scribe v1 |
| **Languages** | English, Korean, Japanese, and 15+ more |

<img src="imgs/transcribe_audio.jpg" height="300">

### 5. `isolate_voice` <img src="icons/isolate_voice.svg" width="20" height="20">
Remove background music/noise and extract clean voice.

| | |
|---|---|
| **Input** | `audio_base64` (audio with background sounds) |
| **Output** | Isolated audio (base64), BGM detection status |
| **Use Case** | Audio cleanup for memes, songs, movies |

<img src="imgs/isolate_voice.jpg" height="300">

### 6. `grade_voice` <img src="icons/grade_voice.svg" width="20" height="20">
Comprehensive voice comparison with multi-metric scoring.

| | |
|---|---|
| **Inputs** | `user_audio_base64`, `reference_audio_base64`, `reference_text` (optional), `category` (meme\|song\|movie) |
| **Output** | Pitch, rhythm, energy, pronunciation scores (0-100), overall score, user transcription |
| **Use Case** | Voice mimicry evaluation, pronunciation games |

<img src="imgs/grade_voice.jpg" height="300">
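
`grade_voice` has the most inputs of the six tools, so it can help to assemble and check the arguments before sending them. The sketch below uses the parameter names from the table above; the helper name `build_grade_voice_args` is hypothetical, and the client-side check on `category` is a convenience rather than documented server behavior.

```python
from typing import Dict, Optional

ALLOWED_CATEGORIES = ("meme", "song", "movie")


def build_grade_voice_args(
    user_audio_base64: str,
    reference_audio_base64: str,
    category: str,
    reference_text: Optional[str] = None,
) -> Dict[str, str]:
    """Assemble the argument dict for a grade_voice call."""
    if category not in ALLOWED_CATEGORIES:
        raise ValueError(f"category must be one of {ALLOWED_CATEGORIES}")
    args = {
        "user_audio_base64": user_audio_base64,
        "reference_audio_base64": reference_audio_base64,
        "category": category,
    }
    # reference_text is optional, so it is only sent when provided.
    if reference_text is not None:
        args["reference_text"] = reference_text
    return args


# Placeholder base64 strings; real calls would carry encoded audio.
print(build_grade_voice_args("AAAA", "BBBB", "meme"))
```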

---

## 🏗️ Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                        VoiceKit MCP                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │                    MCP Client (Claude)                     │ │
│  │               base64 audio → SSE endpoint                  │ │
│  └──────────────────────────┬─────────────────────────────────┘ │
│                             ↓                                   │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │                Gradio MCP Server (app.py)                  │ │
│  │           mcp_server=True • 6 tool interfaces              │ │
│  └──────────────────────────┬─────────────────────────────────┘ │
│                             ↓                                   │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │              Modal GPU Container (T4)                      │ │
│  │    Wav2Vec2 • librosa • ElevenLabs APIs • DTW              │ │
│  └──────────────────────────┬─────────────────────────────────┘ │
│                             ↓                                   │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │                    JSON Response                           │ │
│  │         embeddings • scores • transcripts • audio          │ │
│  └────────────────────────────────────────────────────────────┘ │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
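
The diagram lists DTW (dynamic time warping) among the analysis components but does not describe how it is applied. For readers unfamiliar with the technique, here is a minimal textbook sketch over two 1-D sequences, such as pitch contours sampled at different speeds. It illustrates the general algorithm under that assumption and is not taken from the VoiceKit code.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The second contour holds one value longer; DTW absorbs the stretch.
print(dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0]))
```

Because DTW allows one sequence to stretch against the other, two renditions of the same phrase at different speeds can still align closely.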

---

## 🔌 How to Connect

### Claude Desktop / MCP Client

Add to your MCP configuration:

```json
{
  "mcpServers": {
    "voicekit": {
      "url": "https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse"
    }
  }
}
```
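
If a configuration already lists other servers, the `voicekit` entry can be merged in rather than pasted by hand. The sketch below works on an in-memory dict; the helper name `add_voicekit_server` and the `other` server entry are made up for illustration, and reading or writing the actual config file is left out because its location depends on the client.

```python
import json

VOICEKIT_URL = "https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse"


def add_voicekit_server(config: dict) -> dict:
    """Return a copy of config with the voicekit server added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["voicekit"] = {"url": VOICEKIT_URL}
    merged["mcpServers"] = servers
    return merged


# Hypothetical existing configuration with one unrelated server.
existing = {"mcpServers": {"other": {"url": "https://example.com/sse"}}}
print(json.dumps(add_voicekit_server(existing), indent=2))
```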

### Example Usage

```python
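# 1. Encode audio to base64
import base64

with open("audio.wav", "rb") as f:
    audio_base64 = base64.b64encode(f.read()).decode("ascii")

# 2. Call the MCP tool
# `mcp_client` is a placeholder for whatever MCP client object you use;
# substitute your client's own tool-call method.
result = mcp_client.call("extract_embedding", {"audio_base64": audio_base64})

# 3. Read the documented output fields
preview = result["embedding_preview"]  # first 5 values
length = result["embedding_length"]    # 768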
```
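
Under the hood, an MCP client invokes a tool by sending a JSON-RPC 2.0 request with the method `tools/call`. The sketch below only builds that message so its shape is visible; the helper name `build_tool_call` is illustrative. A real client also performs the initialization handshake and manages the SSE transport, neither of which is shown here.

```python
import json


def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 `tools/call` request for an MCP server."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# "YWJj" is a placeholder base64 string, not real audio.
print(build_tool_call(1, "extract_embedding", {"audio_base64": "YWJj"}))
```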

---

## 🛠️ Tech Stack

| Component | Technology |
|-----------|------------|
| MCP Server | Gradio 6.0 (`mcp_server=True`) |
| GPU Compute | Modal (T4 GPU) |
| Embeddings | Wav2Vec2 (facebook/wav2vec2-base-960h) |
| Speech-to-Text | ElevenLabs Scribe v1 |
| Voice Isolation | ElevenLabs Voice Isolator |
| Acoustic Analysis | librosa + scipy |

---

## ⚡ Performance

| Metric | Value |
|--------|-------|
| Response Time (warm) | <200ms |
| Cold Start | 1-3s (memory snapshot optimized) |
| Embedding Dimensions | 768 |
| Supported Audio | Any format (auto-converts to WAV) |
| Max Duration | Tested up to 10 minutes |
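
Base64 encodes every 3 bytes as 4 characters, so request payloads are roughly a third larger than the audio itself. The sketch below estimates that size for uncompressed PCM WAV. The 16 kHz mono 16-bit format and the 44-byte header are assumptions chosen for illustration; compressed formats will be much smaller.

```python
def base64_length(num_bytes: int) -> int:
    """Length in characters of padded base64 for num_bytes of input."""
    return 4 * ((num_bytes + 2) // 3)


def pcm_wav_bytes(
    seconds: int,
    sample_rate: int = 16000,   # assumed sample rate
    channels: int = 1,          # assumed mono
    bytes_per_sample: int = 2,  # assumed 16-bit samples
) -> int:
    """Approximate size of a PCM WAV file, including a 44-byte header."""
    return seconds * sample_rate * channels * bytes_per_sample + 44


ten_minutes = pcm_wav_bytes(600)
print(ten_minutes, base64_length(ten_minutes))
```

Under these assumptions, a 10-minute clip is about 19.2 MB raw and about 25.6 MB once base64-encoded.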

---

## 🎯 Why VoiceKit MCP?

| Criteria | Our Approach |
|----------|--------------|
| **Functionality** | 6 production-ready tools covering full voice analysis pipeline |
| **Innovation** | First MCP server for comprehensive voice analysis |
| **Documentation** | Complete API docs with inputs/outputs/use cases |
| **Real-world Impact** | Powers Voice Sementle game; applicable to voice cloning, accessibility, language learning |

---

## 🎮 Interactive Demo

👆 **Click the interface above to try each tool!**

1. Upload or record audio
2. Select a tool to test
3. View JSON results with scores and analysis
4. Copy embeddings or transcripts for your app

---

## 🔗 Related Projects

- **[Voice Sementle](https://huggingface.co/spaces/MCP-1st-Birthday/Voice-Sementle)** — Daily voice puzzle game powered by VoiceKit MCP

---

**Built for [MCP's 1st Birthday Hackathon](https://huggingface.co/MCP-1st-Birthday)** 🎂

*Celebrating one year of Model Context Protocol!*