---
title: VoiceKit MCP
emoji: 🎤
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: "6.0.0"
app_file: app.py
pinned: false
tags:
  - building-mcp-track-creative
  - mcp-server
---

# 🎤 VoiceKit MCP

> **Professional voice analysis as MCP tools: extract embeddings, compare voices, transcribe speech, and more.**

6 powerful MCP tools for voice processing, all accepting base64-encoded audio.

📢 **Social Post:** [View on X](https://x.com/dahee_pk/status/1994389505898582442)<br>
🎬 **Demo Video:** [Watch on YouTube](https://www.youtube.com/watch?v=1VIqvpwfyWU)<br>
👥 **Team:** [@EricYoun](https://huggingface.co/EricYoun), [@NickEo](https://huggingface.co/NickEo), [@HYENA-WON](https://huggingface.co/HYENA-WON), [@jjin6573](https://huggingface.co/jjin6573), [@cocoajoa](https://huggingface.co/cocoajoa)

---

## 📋 Submission Info

| | |
|---|---|
| **Track** | Building MCP - Creative |
| **MCP Endpoint** | `https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse` |
| **Framework** | Gradio 6.0 |

---

## ✅ Track 1 Requirements

| Requirement | How We Fulfill It |
|-------------|-------------------|
| **Functioning MCP Server** | 6 MCP tools exposed via Gradio's `mcp_server=True` |
| **MCP Client Demo** | Video shows integration with Claude Desktop / MCP client |
| **Documented Tools** | Full API documentation with inputs/outputs below |
| **Gradio App** | Interactive demo UI + hidden MCP tool interfaces |

---

## 🛠️ MCP Tools (6 Tools)

All tools accept **base64-encoded audio** as input.

### 1. `extract_embedding` <img src="icons/extract_embedding.svg" width="20" height="20">
Extract voice embeddings using Wav2Vec2 model.

| | |
|---|---|
| **Input** | `audio_base64` (base64-encoded audio) |
| **Output** | `embedding_preview` (first 5 values), `embedding_length` (768) |
| **Use Case** | Speaker identification, voice fingerprinting |

<img src="imgs/extract_embedding.jpg" height="300">

### 2. `match_voice` <img src="icons/match_voice.svg" width="20" height="20">
Compare similarity between two voices.

| | |
|---|---|
| **Inputs** | `audio1_base64`, `audio2_base64` |
| **Output** | `similarity` (0-1), `tone_score` (0-100) |
| **Use Case** | Voice cloning verification, speaker matching |

<img src="imgs/match_voice.jpg" height="300">
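
This README does not say how `similarity` is computed. A common way to compare fixed-length embeddings, such as the 768-dimensional vectors from `extract_embedding`, is cosine similarity, and the sketch below illustrates only that general idea. The function names `cosine_similarity` and `to_unit_range`, and the rescaling from [-1, 1] to [0, 1], are assumptions made for illustration; they are not taken from VoiceKit's code.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm_a * norm_b)


def to_unit_range(cosine: float) -> float:
    """Hypothetical rescaling of a cosine value from [-1, 1] to [0, 1]."""
    return (cosine + 1.0) / 2.0
```

If you have two full embeddings from your own model, `to_unit_range(cosine_similarity(e1, e2))` gives a score on the same 0-1 scale as the documented `similarity` field, although the server's actual values may be derived differently.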

### 3. `analyze_acoustics` <img src="icons/analyze_acoustics.svg" width="20" height="20">
Extract detailed acoustic characteristics.

| | |
|---|---|
| **Input** | `audio_base64` |
| **Output** | Pitch, energy, rhythm, tempo, spectral info |
| **Use Case** | Emotional tone detection, voice profiling |

<img src="imgs/analyze_acoustics.jpg" height="300">

### 4. `transcribe_audio` <img src="icons/transcribe_audio.svg" width="20" height="20">
Convert speech to text (multilingual).

| | |
|---|---|
| **Inputs** | `audio_base64`, `language` (default: "en") |
| **Output** | Transcribed text, detected language |
| **Model** | ElevenLabs Scribe v1 |
| **Languages** | English, Korean, Japanese, and 15+ more |

<img src="imgs/transcribe_audio.jpg" height="300">

### 5. `isolate_voice` <img src="icons/isolate_voice.svg" width="20" height="20">
Remove background music/noise and extract clean voice.

| | |
|---|---|
| **Input** | `audio_base64` (audio with background sounds) |
| **Output** | Isolated audio (base64), BGM detection status |
| **Use Case** | Audio cleanup for memes, songs, movies |

<img src="imgs/isolate_voice.jpg" height="300">
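
Because the isolated audio comes back as base64, a client has to decode it before saving or playing it. The helper below is a minimal sketch of that step using only the standard library. The name `save_base64_audio` is hypothetical, and the sketch assumes the string is plain base64 with no data-URI prefix or embedded line breaks.

```python
import base64
from pathlib import Path


def save_base64_audio(audio_base64: str, out_path: str) -> int:
    """Decode a base64 string and write the raw bytes to out_path.

    Returns the number of bytes written.
    """
    # validate=True rejects any character outside the base64 alphabet
    audio_bytes = base64.b64decode(audio_base64, validate=True)
    Path(out_path).write_bytes(audio_bytes)
    return len(audio_bytes)
```

The exact key that holds the audio in the tool's JSON response is not documented here, so inspect the response before passing the value to this helper.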

### 6. `grade_voice` <img src="icons/grade_voice.svg" width="20" height="20">
Comprehensive voice comparison with multi-metric scoring.

| | |
|---|---|
| **Inputs** | `user_audio_base64`, `reference_audio_base64`, `reference_text` (optional), `category` (meme\|song\|movie) |
| **Output** | Pitch, rhythm, energy, pronunciation scores (0-100), overall score, user transcription |
| **Use Case** | Voice mimicry evaluation, pronunciation games |

<img src="imgs/grade_voice.jpg" height="300">
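
With four parameters, one of them optional and one restricted to three values, it can help to assemble the `grade_voice` arguments in one place. The sketch below uses the parameter names from the table above; the helper `build_grade_voice_args` is hypothetical, and omitting `reference_text` when it is not supplied is an assumption, since the README does not say how the server treats a missing value.

```python
from typing import Dict, Optional

VALID_CATEGORIES = ("meme", "song", "movie")  # values listed in the table above


def build_grade_voice_args(
    user_audio_base64: str,
    reference_audio_base64: str,
    category: str,
    reference_text: Optional[str] = None,
) -> Dict[str, str]:
    """Assemble the argument dict for a grade_voice call."""
    if category not in VALID_CATEGORIES:
        raise ValueError(
            f"category must be one of {VALID_CATEGORIES}, got {category!r}"
        )
    args = {
        "user_audio_base64": user_audio_base64,
        "reference_audio_base64": reference_audio_base64,
        "category": category,
    }
    # reference_text is documented as optional, so include it only when given
    if reference_text is not None:
        args["reference_text"] = reference_text
    return args
```

Validating `category` on the client side surfaces typos before any audio is uploaded.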

---

## 🏗️ Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                        VoiceKit MCP                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │                    MCP Client (Claude)                     │ │
│  │               base64 audio → SSE endpoint                  │ │
│  └──────────────────────────┬─────────────────────────────────┘ │
│                             ↓                                   │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │                Gradio MCP Server (app.py)                  │ │
│  │           mcp_server=True • 6 tool interfaces              │ │
│  └──────────────────────────┬─────────────────────────────────┘ │
│                             ↓                                   │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │              Modal GPU Container (T4)                      │ │
│  │    Wav2Vec2 • librosa • ElevenLabs APIs • DTW              │ │
│  └──────────────────────────┬─────────────────────────────────┘ │
│                             ↓                                   │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │                    JSON Response                           │ │
│  │         embeddings • scores • transcripts • audio          │ │
│  └────────────────────────────────────────────────────────────┘ │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
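
The diagram lists DTW (dynamic time warping) among the backend components, but the README does not describe how it is applied. DTW is typically used to compare two sequences of different lengths, for example pitch contours of a reference clip and a user recording. The sketch below is the textbook algorithm in plain Python, not VoiceKit's implementation; the name `dtw_distance` and the choice of absolute difference as the local cost are assumptions.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping distance.

    Uses the absolute difference as the local cost between two samples.
    """
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]
```

A contour and a time-stretched copy of it, such as `[1, 2, 3]` and `[1, 2, 2, 3]`, have a distance of `0.0`, which is why DTW tolerates differences in speaking rate that a point-by-point comparison would penalize.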

---

## 🔌 How to Connect

### Claude Desktop / MCP Client

Add to your MCP configuration:

```json
{
  "mcpServers": {
    "voicekit": {
      "url": "https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse"
    }
  }
}
```

### Example Usage

```python
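import base64

# 1. Encode audio to base64
with open("audio.wav", "rb") as f:
    audio_base64 = base64.b64encode(f.read()).decode("ascii")

# 2. Call the MCP tool (`mcp_client` stands in for your MCP client object)
result = mcp_client.call("extract_embedding", {"audio_base64": audio_base64})

# 3. Read the documented output fields
embedding_preview = result["embedding_preview"]  # first 5 values
embedding_length = result["embedding_length"]    # 768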

---

## 🛠️ Tech Stack

| Component | Technology |
|-----------|------------|
| MCP Server | Gradio 6.0 (`mcp_server=True`) |
| GPU Compute | Modal (T4 GPU) |
| Embeddings | Wav2Vec2 (facebook/wav2vec2-base-960h) |
| Speech-to-Text | ElevenLabs Scribe v1 |
| Voice Isolation | ElevenLabs Voice Isolator |
| Acoustic Analysis | librosa + scipy |

---

## ⚡ Performance

| Metric | Value |
|--------|-------|
| Response Time (warm) | <200ms |
| Cold Start | 1-3s (memory snapshot optimized) |
| Embedding Dimensions | 768 |
| Supported Audio | Any format (auto-converts to WAV) |
| Max Duration | Tested up to 10 minutes |

---

## 🎯 Why VoiceKit MCP?

| Criteria | Our Approach |
|----------|--------------|
| **Functionality** | 6 production-ready tools covering full voice analysis pipeline |
| **Innovation** | First MCP server for comprehensive voice analysis |
| **Documentation** | Complete API docs with inputs/outputs/use cases |
| **Real-world Impact** | Powers Voice Sementle game; applicable to voice cloning, accessibility, language learning |

---

## 🎮 Interactive Demo

👆 **Click the interface above to try each tool!**

1. Upload or record audio
2. Select a tool to test
3. View JSON results with scores and analysis
4. Copy embeddings or transcripts for your app

---

## 🔗 Related Projects

- **[Voice Sementle](https://huggingface.co/spaces/MCP-1st-Birthday/Voice-Sementle)**: Daily voice puzzle game powered by VoiceKit MCP

---

**Built for [MCP's 1st Birthday Hackathon](https://huggingface.co/MCP-1st-Birthday)** 🎂

*Celebrating one year of Model Context Protocol!*