title: VoiceKit MCP
emoji: 🎤
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 6.0.0
app_file: app.py
pinned: false
tags:
- building-mcp-track-creative
- mcp-server
🎤 VoiceKit MCP
Professional voice analysis as MCP tools: extract embeddings, compare voices, transcribe speech, and more.
6 powerful MCP tools for voice processing, all accepting base64-encoded audio.
📢 Social Post: View on X
🎬 Demo Video: Watch on YouTube
👥 Team: @EricYoun, @NickEo, @HYENA-WON, @jjin6573, @cocoajoa
Submission Info
| Track | Building MCP - Creative |
| MCP Endpoint | https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse |
| Framework | Gradio 6.0 |
Track 1 Requirements
| Requirement | How We Fulfill It |
|---|---|
| Functioning MCP Server | 6 MCP tools exposed via Gradio's mcp_server=True |
| MCP Client Demo | Video shows integration with Claude Desktop / MCP client |
| Documented Tools | Full API documentation with inputs/outputs below |
| Gradio App | Interactive demo UI + hidden MCP tool interfaces |
🛠️ MCP Tools (6 Tools)
All tools accept base64-encoded audio as input.
1. extract_embedding
Extract voice embeddings using the Wav2Vec2 model.
| Input | audio_base64 (base64-encoded audio) |
| Output | embedding_preview (first 5 values), embedding_length (768) |
| Use Case | Speaker identification, voice fingerprinting |
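The table above lists a single 768-value embedding, while Wav2Vec2 itself produces one 768-value vector per audio frame. One common way to collapse the frames into a fixed-size vector is mean pooling. Whether VoiceKit pools this way is an assumption; the sketch below only illustrates the idea on toy 4-value frames, and `mean_pool` is a hypothetical helper name.

```python
from typing import List


def mean_pool(frames: List[List[float]]) -> List[float]:
    """Average frame-level vectors into one fixed-size embedding.

    Assumption: VoiceKit's pooling strategy is not documented here;
    mean pooling is shown only as a common choice.
    """
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    if any(len(frame) != dim for frame in frames):
        raise ValueError("all frames must have the same dimension")
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


# Toy input: 3 frames of 4 values (real Wav2Vec2-base frames have 768 values).
frames = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]
embedding = mean_pool(frames)
print(embedding[:5], len(embedding))
```

The `embedding[:5]` slice mirrors the `embedding_preview` idea: a short preview plus the full length.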
2. match_voice
Compare similarity between two voices.
| Inputs | audio1_base64, audio2_base64 |
| Output | similarity (0-1), tone_score (0-100) |
| Use Case | Voice cloning verification, speaker matching |
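A 0-1 similarity between two voices is often derived from the cosine similarity of their embeddings. The sketch below assumes that approach and rescales the cosine from [-1, 1] to [0, 1]; both the rescaling and the function names are assumptions rather than VoiceKit's documented method, and `tone_score` is not modeled.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def to_unit_interval(cosine: float) -> float:
    """Map a cosine in [-1, 1] onto [0, 1] (assumed rescaling)."""
    return (cosine + 1.0) / 2.0


same = to_unit_interval(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
opposite = to_unit_interval(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
print(same, opposite)
```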
3. analyze_acoustics
Extract detailed acoustic characteristics.
| Input | audio_base64 |
| Output | Pitch, energy, rhythm, tempo, spectral info |
| Use Case | Emotional tone detection, voice profiling |
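To give a feel for what "energy" and "pitch" mean as numbers, here is a deliberately simplified, pure-Python stand-in: RMS energy and a zero-crossing pitch estimate on a synthetic 440 Hz tone. This is not how librosa or VoiceKit computes these features; zero-crossing counting only works for clean, single-tone signals and is used here purely for illustration.

```python
import math
from typing import List


def rms_energy(samples: List[float]) -> float:
    """Root-mean-square amplitude of the signal."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_pitch(samples: List[float], sample_rate: int) -> float:
    """Rough pitch estimate: a pure tone crosses zero twice per cycle."""
    crossings = sum(
        1 for prev, cur in zip(samples, samples[1:]) if (prev < 0) != (cur < 0)
    )
    duration = len(samples) / sample_rate
    return crossings / (2 * duration)


# One second of a 440 Hz sine at 16 kHz with amplitude 0.5 (synthetic test signal).
sample_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

energy = rms_energy(tone)
pitch = zero_crossing_pitch(tone, sample_rate)
print(round(energy, 4), round(pitch, 1))
```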
4. transcribe_audio
Convert speech to text (multilingual).
| Inputs | audio_base64, language (default: "en") |
| Output | Transcribed text, detected language |
| Model | ElevenLabs Scribe v1 |
| Languages | English, Korean, Japanese, and 15+ more |
5. isolate_voice
Remove background music/noise and extract clean voice.
| Input | audio_base64 (audio with background sounds) |
| Output | Isolated audio (base64), BGM detection status |
| Use Case | Audio cleanup for memes, songs, movies |
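Because the isolated audio comes back base64-encoded, a client has to decode it before saving or playing it. A minimal sketch follows; the response key `isolated_audio_base64` and the helper name are hypothetical, since the exact response schema is not given above.

```python
import base64
import tempfile
from pathlib import Path


def save_base64_audio(audio_base64: str, out_path: Path) -> int:
    """Decode base64 audio, write it to out_path, and return the byte count."""
    raw = base64.b64decode(audio_base64, validate=True)
    out_path.write_bytes(raw)
    return len(raw)


# Stand-in for a tool response; the key name is an assumption.
fake_audio = b"RIFF....WAVEfmt "
fake_response = {"isolated_audio_base64": base64.b64encode(fake_audio).decode("ascii")}

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "isolated.wav"
    written = save_base64_audio(fake_response["isolated_audio_base64"], out_file)
    restored = out_file.read_bytes()

print(written)
```

Passing `validate=True` makes malformed base64 raise an error instead of being silently truncated.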
6. grade_voice
Comprehensive voice comparison with multi-metric scoring.
| Inputs | user_audio_base64, reference_audio_base64, reference_text (optional), category (meme, song, or movie) |
| Output | Pitch, rhythm, energy, pronunciation scores (0-100), overall score, user transcription |
| Use Case | Voice mimicry evaluation, pronunciation games |
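The inputs above can be assembled and checked on the client side before the call is made. The sketch below uses the parameter names from the table; the helper name and the choice to omit `reference_text` when it is not supplied are assumptions.

```python
from typing import Dict, Optional

ALLOWED_CATEGORIES = ("meme", "song", "movie")


def build_grade_voice_args(
    user_audio_base64: str,
    reference_audio_base64: str,
    category: str,
    reference_text: Optional[str] = None,
) -> Dict[str, str]:
    """Build the argument dict for grade_voice, rejecting unknown categories."""
    if category not in ALLOWED_CATEGORIES:
        raise ValueError(f"category must be one of {ALLOWED_CATEGORIES}, got {category!r}")
    args = {
        "user_audio_base64": user_audio_base64,
        "reference_audio_base64": reference_audio_base64,
        "category": category,
    }
    # reference_text is optional, so it is only sent when provided.
    if reference_text is not None:
        args["reference_text"] = reference_text
    return args


args = build_grade_voice_args("dXNlcg==", "cmVm", "song")
print(sorted(args))
```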
Architecture
┌─────────────────────────────────────────────────────────────────┐
│                          VoiceKit MCP                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                    MCP Client (Claude)                    │  │
│  │                base64 audio → SSE endpoint                │  │
│  └─────────────────────────────┬─────────────────────────────┘  │
│                                ▼                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                Gradio MCP Server (app.py)                 │  │
│  │            mcp_server=True • 6 tool interfaces            │  │
│  └─────────────────────────────┬─────────────────────────────┘  │
│                                ▼                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                 Modal GPU Container (T4)                  │  │
│  │        Wav2Vec2 • librosa • ElevenLabs APIs • DTW         │  │
│  └─────────────────────────────┬─────────────────────────────┘  │
│                                ▼                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                       JSON Response                       │  │
│  │         embeddings • scores • transcripts • audio         │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
How to Connect
Claude Desktop / MCP Client
Add to your MCP configuration:
{
  "mcpServers": {
    "voicekit": {
      "url": "https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse"
    }
  }
}
Example Usage
# 1. Encode audio to base64
import base64

with open("audio.wav", "rb") as f:
    audio_base64 = base64.b64encode(f.read()).decode()

# 2. Call MCP tool (mcp_client stands for an already-connected MCP client object)
result = mcp_client.call("extract_embedding", {"audio_base64": audio_base64})

# 3. Use the 768-dim embedding
embedding = result["embedding"]
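For a concrete client, the same flow can be sketched with the official `mcp` Python SDK over SSE. This is a hedged sketch, not the project's own client: it assumes `pip install mcp`, network access to the Space, and that the tool is exposed under the name `extract_embedding` (Gradio may expose a different name, which is why the tools are listed first). `encode_audio_bytes` and `call_extract_embedding` are hypothetical helper names.

```python
import base64

MCP_URL = "https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse"


def encode_audio_bytes(raw: bytes) -> str:
    """Base64-encode raw audio bytes into an ASCII string."""
    return base64.b64encode(raw).decode("ascii")


async def call_extract_embedding(audio_path: str):
    """Connect over SSE, list the tools, and call extract_embedding."""
    # Imported lazily so the helper above works without the SDK installed.
    from mcp import ClientSession
    from mcp.client.sse import sse_client

    with open(audio_path, "rb") as f:
        audio_base64 = encode_audio_bytes(f.read())

    async with sse_client(MCP_URL) as (read_stream, write_stream):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()
            # Check the names the server actually exposes before calling.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])
            return await session.call_tool(
                "extract_embedding", {"audio_base64": audio_base64}
            )
```

To run it, call `asyncio.run(call_extract_embedding("audio.wav"))` from a script and inspect `result.content` for the returned payload.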
🛠️ Tech Stack
| Component | Technology |
|---|---|
| MCP Server | Gradio 6.0 (mcp_server=True) |
| GPU Compute | Modal (T4 GPU) |
| Embeddings | Wav2Vec2 (facebook/wav2vec2-base-960h) |
| Speech-to-Text | ElevenLabs Scribe v1 |
| Voice Isolation | ElevenLabs Voice Isolator |
| Acoustic Analysis | librosa + scipy |
⚡ Performance
| Metric | Value |
|---|---|
| Response Time (warm) | <200ms |
| Cold Start | 1-3s (memory snapshot optimized) |
| Embedding Dimensions | 768 |
| Supported Audio | Any format (auto-converts to WAV) |
| Max Duration | Tested up to 10 minutes |
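Since every tool takes base64 input, long clips translate into large request payloads: base64 emits 4 characters for every 3 input bytes. The sketch below estimates the payload for a 10-minute clip, assuming 16 kHz, mono, 16-bit PCM WAV with a 44-byte header. Those audio parameters are assumptions for the arithmetic, not documented limits of the service.

```python
def base64_length(raw_bytes: int) -> int:
    """Length in characters of padded base64 for raw_bytes of input."""
    return 4 * ((raw_bytes + 2) // 3)


def pcm_wav_bytes(
    seconds: int,
    sample_rate: int = 16000,
    channels: int = 1,
    bytes_per_sample: int = 2,
) -> int:
    """Size of a canonical PCM WAV file: 44-byte header plus raw samples."""
    return 44 + seconds * sample_rate * channels * bytes_per_sample


wav_size = pcm_wav_bytes(600)          # 10 minutes
payload_chars = base64_length(wav_size)
print(f"{wav_size / 1e6:.1f} MB WAV -> {payload_chars / 1e6:.1f} MB base64")
```

Under these assumptions a 10-minute clip is roughly 19 MB raw and about 26 MB once encoded, which is worth keeping in mind when sending long recordings.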
🎯 Why VoiceKit MCP?
| Criteria | Our Approach |
|---|---|
| Functionality | 6 production-ready tools covering full voice analysis pipeline |
| Innovation | First MCP server for comprehensive voice analysis |
| Documentation | Complete API docs with inputs/outputs/use cases |
| Real-world Impact | Powers Voice Sementle game; applicable to voice cloning, accessibility, language learning |
🎮 Interactive Demo
Click the interface above to try each tool!
- Upload or record audio
- Select a tool to test
- View JSON results with scores and analysis
- Copy embeddings or transcripts for your app
Related Projects
- Voice Sementle: Daily voice puzzle game powered by VoiceKit MCP
Built for MCP's 1st Birthday Hackathon
Celebrating one year of Model Context Protocol!