title: VoiceKit MCP
emoji: 🎤
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 6.0.0
app_file: app.py
pinned: false
tags:
  - building-mcp-track-creative
  - mcp-server

🎤 VoiceKit MCP

Professional voice analysis as MCP tools — extract embeddings, compare voices, transcribe speech, and more.

VoiceKit exposes six MCP tools for voice processing. Each one takes base64-encoded audio as input.

📢 Social Post: View on X
🎬 Demo Video: Watch on YouTube
👥 Team: @EricYoun, @NickEo, @HYENA-WON, @jjin6573, @cocoajoa

📋 Submission Info


| Field | Details |
|---|---|
| Track | Building MCP - Creative |
| MCP Endpoint | `https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse` |
| Framework | Gradio 6.0 |

✅ Track 1 Requirements

| Requirement | How We Fulfill It |
|---|---|
| Functioning MCP Server | 6 MCP tools exposed via Gradio's `mcp_server=True` |
| MCP Client Demo | Video shows integration with Claude Desktop / MCP client |
| Documented Tools | Full API documentation with inputs/outputs below |
| Gradio App | Interactive demo UI + hidden MCP tool interfaces |

🛠️ MCP Tools (6 Tools)

All tools accept base64-encoded audio as input.

1. `extract_embedding`

Extract voice embeddings using the Wav2Vec2 model.


| Field | Details |
|---|---|
| Input | `audio_base64` (base64-encoded audio) |
| Output | `embedding_preview` (first 5 values), `embedding_length` (768) |
| Use Case | Speaker identification, voice fingerprinting |

2. `match_voice`

Compare similarity between two voices.


| Field | Details |
|---|---|
| Inputs | `audio1_base64`, `audio2_base64` |
| Output | `similarity` (0-1), `tone_score` (0-100) |
| Use Case | Voice cloning verification, speaker matching |
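
The README does not state how `similarity` is computed. A common way to compare two fixed-length embeddings is cosine similarity, so the sketch below illustrates that idea only. The function names and the rescaling to the 0-1 range are assumptions for illustration, not VoiceKit's actual implementation.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


def to_unit_range(cosine: float) -> float:
    """Hypothetical rescaling of a cosine value from [-1, 1] to [0, 1]."""
    return (cosine + 1.0) / 2.0
```

Two embeddings that point in the same direction score 1.0 after rescaling, and orthogonal ones score 0.5. How the server derives `tone_score` from acoustic features is not documented here, so it is left out of the sketch.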

3. `analyze_acoustics`

Extract detailed acoustic characteristics.


| Field | Details |
|---|---|
| Input | `audio_base64` |
| Output | Pitch, energy, rhythm, tempo, spectral info |
| Use Case | Emotional tone detection, voice profiling |

4. `transcribe_audio`

Convert speech to text (multilingual).


| Field | Details |
|---|---|
| Inputs | `audio_base64`, `language` (default: "en") |
| Output | Transcribed text, detected language |
| Model | ElevenLabs Scribe v1 |
| Languages | English, Korean, Japanese, and 15+ more |

5. `isolate_voice`

Remove background music/noise and extract clean voice.


| Field | Details |
|---|---|
| Input | `audio_base64` (audio with background sounds) |
| Output | Isolated audio (base64), BGM detection status |
| Use Case | Audio cleanup for memes, songs, movies |
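
Because the isolated audio comes back as base64, a client has to decode it before saving or playing it. The helper below is a minimal sketch of that step. The function name is hypothetical, and the key under which the tool returns the audio is not documented here, so the sketch takes the base64 string directly.

```python
import base64
import binascii
from pathlib import Path


def save_base64_audio(audio_b64: str, out_path: str) -> int:
    """Decode a base64 string and write the bytes to out_path.

    Returns the number of bytes written.
    """
    try:
        # validate=True rejects characters outside the base64 alphabet
        raw = base64.b64decode(audio_b64, validate=True)
    except binascii.Error as exc:
        raise ValueError("input is not valid base64") from exc
    Path(out_path).write_bytes(raw)
    return len(raw)
```

The file extension should match whatever container the server actually returns, which this README does not specify.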

6. `grade_voice`

Comprehensive voice comparison with multi-metric scoring.


| Field | Details |
|---|---|
| Inputs | `user_audio_base64`, `reference_audio_base64`, `reference_text` (optional), `category` (meme\|song\|movie) |
| Output | Pitch, rhythm, energy, pronunciation scores (0-100), overall score, user transcription |
| Use Case | Voice mimicry evaluation, pronunciation games |
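
The Architecture section lists DTW (dynamic time warping) among the backend components, but the README does not say which scores use it. As background, here is a minimal sketch of the classic DTW distance between two numeric sequences, such as two pitch contours. The function name and the absolute-difference cost are assumptions for illustration, not the project's grading code.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping distance.

    Uses absolute difference as the local cost and the three standard
    step directions.
    """
    if len(a) == 0 or len(b) == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[len(a)][len(b)]
```

DTW tolerates differences in timing: a contour and a time-stretched copy of it, such as `[1, 2, 3]` and `[1, 2, 2, 3]`, align with zero distance. How a distance like this would be mapped to a 0-100 score is not specified here.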

🏗️ Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        VoiceKit MCP                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │                    MCP Client (Claude)                     │ │
│  │               base64 audio → SSE endpoint                  │ │
│  └──────────────────────────┬─────────────────────────────────┘ │
│                             ↓                                   │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │                Gradio MCP Server (app.py)                  │ │
│  │           mcp_server=True • 6 tool interfaces              │ │
│  └──────────────────────────┬─────────────────────────────────┘ │
│                             ↓                                   │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │              Modal GPU Container (T4)                      │ │
│  │    Wav2Vec2 • librosa • ElevenLabs APIs • DTW              │ │
│  └──────────────────────────┬─────────────────────────────────┘ │
│                             ↓                                   │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │                    JSON Response                           │ │
│  │         embeddings • scores • transcripts • audio          │ │
│  └────────────────────────────────────────────────────────────┘ │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

🔌 How to Connect

Claude Desktop / MCP Client

Add to your MCP configuration:

{
  "mcpServers": {
    "voicekit": {
      "url": "https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse"
    }
  }
}
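
If your MCP configuration file already lists other servers, the entry can be merged in rather than pasted over the file. The following is a minimal sketch of that merge. The function name is hypothetical, and the location of the config file depends on your client, so the path is left as a parameter.

```python
import json
from pathlib import Path

VOICEKIT_URL = "https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse"


def add_voicekit_server(config_path: str, name: str = "voicekit") -> dict:
    """Add the VoiceKit entry to an MCP config file, keeping other servers."""
    path = Path(config_path)
    # Start from the existing config if there is one, otherwise from scratch
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"url": VOICEKIT_URL}
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Restart the client after editing the file so that it picks up the new server entry.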

Example Usage

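# 1. Encode audio to base64
import base64

with open("audio.wav", "rb") as f:
    audio_base64 = base64.b64encode(f.read()).decode("ascii")

# 2. Call the MCP tool (`mcp_client` is a placeholder for your MCP client object)
result = mcp_client.call("extract_embedding", {"audio_base64": audio_base64})

# 3. Read the documented outputs
embedding_preview = result["embedding_preview"]  # first 5 values
embedding_length = result["embedding_length"]    # 768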
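
The `mcp_client.call(...)` line above is client-agnostic shorthand. On the wire, an MCP tool invocation is a JSON-RPC 2.0 request with the method `tools/call`, and the result carries a list of content items. The sketch below builds such a request and parses a text result. The helper names are hypothetical, and it assumes the tool returns its JSON as a text content item. The tool name exposed by the server may also differ from the one shown (for example, it may carry a prefix), so check the tool list your client reports.

```python
import json
from typing import Any


def build_tool_call(name: str, arguments: dict[str, Any], request_id: int = 1) -> dict[str, Any]:
    """Build an MCP `tools/call` JSON-RPC 2.0 request body."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def parse_tool_result(result: dict[str, Any]) -> Any:
    """Return the decoded JSON from the first text item of a tool result.

    Falls back to the raw string if the text is not valid JSON.
    """
    if result.get("isError"):
        raise RuntimeError(f"tool call failed: {result.get('content')}")
    for item in result.get("content", []):
        if item.get("type") == "text":
            try:
                return json.loads(item["text"])
            except json.JSONDecodeError:
                return item["text"]
    raise ValueError("tool result contains no text content")
```

In practice an MCP client library handles the transport and the session handshake. These helpers only show the shape of the messages involved.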

🛠️ Tech Stack

| Component | Technology |
|---|---|
| MCP Server | Gradio 6.0 (`mcp_server=True`) |
| GPU Compute | Modal (T4 GPU) |
| Embeddings | Wav2Vec2 (`facebook/wav2vec2-base-960h`) |
| Speech-to-Text | ElevenLabs Scribe v1 |
| Voice Isolation | ElevenLabs Voice Isolator |
| Acoustic Analysis | librosa + scipy |

⚡ Performance

| Metric | Value |
|---|---|
| Response Time (warm) | < 200 ms |
| Cold Start | 1-3 s (memory snapshot optimized) |
| Embedding Dimensions | 768 |
| Supported Audio | Any format (auto-converts to WAV) |
| Max Duration | Tested up to 10 minutes |

🎯 Why VoiceKit MCP?

| Criteria | Our Approach |
|---|---|
| Functionality | 6 production-ready tools covering the full voice analysis pipeline |
| Innovation | First MCP server for comprehensive voice analysis |
| Documentation | Complete API docs with inputs, outputs, and use cases |
| Real-world Impact | Powers the Voice Sementle game; applicable to voice cloning, accessibility, and language learning |

🎮 Interactive Demo

👆 Click the interface above to try each tool!

1. Upload or record audio
2. Select a tool to test
3. View JSON results with scores and analysis
4. Copy embeddings or transcripts for your app

🔗 Related Projects

Voice Sementle — Daily voice puzzle game powered by VoiceKit MCP

Built for MCP's 1st Birthday Hackathon 🎂

Celebrating one year of Model Context Protocol!