title: VoiceKit MCP
emoji: 🎤
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 6.0.0
app_file: app.py
pinned: false
tags:
  - building-mcp-track-creative
  - mcp-server

🎤 VoiceKit MCP

Professional voice analysis as MCP tools: extract embeddings, compare voices, transcribe speech, and more.

6 powerful MCP tools for voice processing, all accepting base64-encoded audio.

📢 Social Post: View on X
🎬 Demo Video: Watch on YouTube
👥 Team: @EricYoun, @NickEo, @HYENA-WON, @jjin6573, @cocoajoa


📋 Submission Info

| Field | Value |
|---|---|
| Track | Building MCP (Creative) |
| MCP Endpoint | https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse |
| Framework | Gradio 6.0 |

✅ Track 1 Requirements

| Requirement | How We Fulfill It |
|---|---|
| Functioning MCP Server | 6 MCP tools exposed via Gradio's mcp_server=True |
| MCP Client Demo | Video shows integration with Claude Desktop / MCP client |
| Documented Tools | Full API documentation with inputs/outputs below |
| Gradio App | Interactive demo UI + hidden MCP tool interfaces |
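
As a rough illustration of how a plain Python function can become an MCP tool through `mcp_server=True`, the sketch below wraps a hypothetical `audio_size_bytes` helper in a Gradio interface. The helper, its name, and the interface layout are assumptions made for illustration and are not taken from VoiceKit's `app.py`. It also assumes Gradio is installed with MCP support.

```python
import base64


def audio_size_bytes(audio_base64: str) -> dict:
    """Return the decoded size of a base64-encoded audio payload.

    Hypothetical stand-in for a real VoiceKit tool. Gradio uses the
    docstring and type hints to describe the tool to MCP clients.
    """
    raw = base64.b64decode(audio_base64, validate=True)
    return {"size_bytes": len(raw)}


def build_demo():
    """Build the Gradio app.

    gradio is imported lazily so the helper above can be used and
    tested without it being installed.
    """
    import gradio as gr

    return gr.Interface(
        fn=audio_size_bytes,
        inputs=gr.Textbox(label="audio_base64"),
        outputs=gr.JSON(),
    )


# To serve the web UI and the MCP endpoint together, one would run:
#     build_demo().launch(mcp_server=True)
```

Keeping the tool logic in a plain function, separate from the UI wiring, means it can be unit-tested without starting a server.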

๐Ÿ› ๏ธ MCP Tools (6 Tools)

All tools accept base64-encoded audio as input.

1. extract_embedding

Extract voice embeddings using a Wav2Vec2 model.

| Field | Value |
|---|---|
| Input | audio_base64 (base64-encoded audio) |
| Output | embedding_preview (first 5 values), embedding_length (768) |
| Use Case | Speaker identification, voice fingerprinting |
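
Wav2Vec2 produces one vector per audio frame, so a single utterance-level embedding needs some form of aggregation. The sketch below uses mean pooling over time, which is a common choice but an assumption here: the README does not say how frames are combined. The helper names `mean_pool` and `summarize_embedding` are hypothetical; only the output field names come from the table above.

```python
from typing import Sequence


def mean_pool(frames: Sequence[Sequence[float]]) -> list[float]:
    """Average frame-level vectors (time x dim) into one utterance vector."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    if any(len(f) != dim for f in frames):
        raise ValueError("all frames must have the same dimension")
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


def summarize_embedding(embedding: Sequence[float], preview: int = 5) -> dict:
    """Shape an embedding into the documented output fields."""
    return {
        "embedding_preview": list(embedding[:preview]),
        "embedding_length": len(embedding),
    }


# Toy input: 3 frames of dimension 768 (real frames would come from Wav2Vec2).
frames = [[float(t)] * 768 for t in (0, 1, 2)]
summary = summarize_embedding(mean_pool(frames))
```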

2. match_voice

Compare similarity between two voices.

| Field | Value |
|---|---|
| Inputs | audio1_base64, audio2_base64 |
| Output | similarity (0-1), tone_score (0-100) |
| Use Case | Voice cloning verification, speaker matching |
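
One plausible way to obtain a 0-1 similarity from two embeddings is cosine similarity clamped to the unit interval. This is a sketch under that assumption. The README does not state which metric `match_voice` uses or how `tone_score` is derived, and the function names below are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def similarity_01(a: Sequence[float], b: Sequence[float]) -> float:
    """Clamp cosine similarity into the documented 0-1 range."""
    return max(0.0, min(1.0, cosine_similarity(a, b)))
```

Clamping discards negative cosine values, so opposite-pointing embeddings score 0 rather than -1.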

3. analyze_acoustics

Extract detailed acoustic characteristics.

| Field | Value |
|---|---|
| Input | audio_base64 |
| Output | Pitch, energy, rhythm, tempo, spectral info |
| Use Case | Emotional tone detection, voice profiling |
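
To make the "energy" output more concrete, here is a minimal sketch of frame-wise root-mean-square (RMS) energy. It is written in pure Python so it runs without dependencies. The frame and hop sizes are illustrative assumptions, and the function name `frame_rms` is hypothetical rather than part of the tool's actual implementation.

```python
import math
from typing import Sequence


def frame_rms(
    samples: Sequence[float], frame_length: int = 2048, hop_length: int = 512
) -> list[float]:
    """Root-mean-square energy of each full frame of `samples`."""
    if frame_length <= 0 or hop_length <= 0:
        raise ValueError("frame_length and hop_length must be positive")
    energies = []
    # Only full frames are analysed; a trailing partial frame is ignored.
    for start in range(0, len(samples) - frame_length + 1, hop_length):
        frame = samples[start:start + frame_length]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_length))
    return energies


# Toy signal: one second of a 440 Hz sine at 16 kHz, standing in for decoded audio.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
rms = frame_rms(tone)
```

For a full-scale sine wave each frame's RMS should sit close to 1/√2, which gives a quick sanity check on the computation.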

4. transcribe_audio

Convert speech to text (multilingual).

| Field | Value |
|---|---|
| Inputs | audio_base64, language (default: "en") |
| Output | Transcribed text, detected language |
| Model | ElevenLabs Scribe v1 |
| Languages | English, Korean, Japanese, and 15+ more |

5. isolate_voice

Remove background music/noise and extract clean voice.

| Field | Value |
|---|---|
| Input | audio_base64 (audio with background sounds) |
| Output | Isolated audio (base64), BGM detection status |
| Use Case | Audio cleanup for memes, songs, movies |

6. grade_voice

Comprehensive voice comparison with multi-metric scoring.

| Field | Value |
|---|---|
| Inputs | user_audio_base64, reference_audio_base64, reference_text (optional), category (meme, song, or movie) |
| Output | Pitch, rhythm, energy, pronunciation scores (0-100), overall score, user transcription |
| Use Case | Voice mimicry evaluation, pronunciation games |
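
The README does not say how the overall score is derived from the four sub-scores. As one possible reading, the sketch below uses a weighted average with equal weights by default. Both the weighting scheme and the `overall_score` helper are assumptions for illustration.

```python
from typing import Mapping, Optional

METRICS = ("pitch", "rhythm", "energy", "pronunciation")


def overall_score(
    scores: Mapping[str, float], weights: Optional[Mapping[str, float]] = None
) -> float:
    """Weighted average of the four 0-100 sub-scores (equal weights by default)."""
    if weights is None:
        weights = {name: 1.0 for name in METRICS}
    total_weight = sum(weights[name] for name in METRICS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in METRICS) / total_weight


example = overall_score(
    {"pitch": 80, "rhythm": 60, "energy": 70, "pronunciation": 90}
)
```

Passing a different `weights` mapping would let a category such as song emphasise pitch over pronunciation, though whether VoiceKit does this is not documented.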

๐Ÿ—๏ธ Architecture

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                        VoiceKit MCP                             โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                                 โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚
โ”‚  โ”‚                    MCP Client (Claude)                     โ”‚ โ”‚
โ”‚  โ”‚               base64 audio โ†’ SSE endpoint                  โ”‚ โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚
โ”‚                             โ†“                                   โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚
โ”‚  โ”‚                Gradio MCP Server (app.py)                  โ”‚ โ”‚
โ”‚  โ”‚           mcp_server=True โ€ข 6 tool interfaces              โ”‚ โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚
โ”‚                             โ†“                                   โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚
โ”‚  โ”‚              Modal GPU Container (T4)                      โ”‚ โ”‚
โ”‚  โ”‚    Wav2Vec2 โ€ข librosa โ€ข ElevenLabs APIs โ€ข DTW              โ”‚ โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚
โ”‚                             โ†“                                   โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚
โ”‚  โ”‚                    JSON Response                           โ”‚ โ”‚
โ”‚  โ”‚         embeddings โ€ข scores โ€ข transcripts โ€ข audio          โ”‚ โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚
โ”‚                                                                 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
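
The diagram lists DTW (dynamic time warping) among the compute components. For readers unfamiliar with it, the following is a minimal sketch of the classic DTW distance between two one-dimensional sequences, such as pitch contours. It is a textbook formulation, not VoiceKit's code. Which features VoiceKit aligns, and with which local cost, is not documented, so the absolute-difference cost here is an assumption.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

Because DTW allows one sequence to stretch against the other, `[1, 2, 3]` and `[1, 2, 2, 3]` align at zero cost, which makes it useful for comparing utterances spoken at different speeds.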

🔌 How to Connect

Claude Desktop / MCP Client

Add to your MCP configuration:

{
  "mcpServers": {
    "voicekit": {
      "url": "https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse"
    }
  }
}

Example Usage

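# 1. Encode audio to base64
import base64

with open("audio.wav", "rb") as f:
    audio_base64 = base64.b64encode(f.read()).decode("ascii")

# 2. Call the MCP tool
#    (mcp_client is a placeholder for whichever connected MCP client you use)
result = mcp_client.call("extract_embedding", {"audio_base64": audio_base64})

# 3. Read the documented output fields
embedding_preview = result["embedding_preview"]  # first 5 values
embedding_length = result["embedding_length"]    # 768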
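
Under the hood, an MCP client sends tool invocations as JSON-RPC 2.0 requests with the method `tools/call`. The sketch below builds such a request body for illustration. The `build_tool_call` helper is hypothetical, the audio bytes are a placeholder, and the tool name actually exposed by the server may differ from the one shown (it should be checked against the server's tool listing).

```python
import base64
import json


def build_tool_call(tool_name: str, arguments: dict, request_id: int = 1) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)


# Placeholder bytes, not real audio; a real call would read a file instead.
audio_base64 = base64.b64encode(b"placeholder-audio-bytes").decode("ascii")
payload = build_tool_call("extract_embedding", {"audio_base64": audio_base64})
```

In practice an MCP client library handles session setup and transport, so this payload is mainly useful for debugging or for understanding what travels over the SSE connection.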

๐Ÿ› ๏ธ Tech Stack

Component Technology
MCP Server Gradio 6.0 (mcp_server=True)
GPU Compute Modal (T4 GPU)
Embeddings Wav2Vec2 (facebook/wav2vec2-base-960h)
Speech-to-Text ElevenLabs Scribe v1
Voice Isolation ElevenLabs Voice Isolator
Acoustic Analysis librosa + scipy

⚡ Performance

| Metric | Value |
|---|---|
| Response Time (warm) | < 200 ms |
| Cold Start | 1-3 s (memory snapshot optimized) |
| Embedding Dimensions | 768 |
| Supported Audio | Any format (auto-converts to WAV) |
| Max Duration | Tested up to 10 minutes |

🎯 Why VoiceKit MCP?

| Criteria | Our Approach |
|---|---|
| Functionality | 6 production-ready tools covering full voice analysis pipeline |
| Innovation | First MCP server for comprehensive voice analysis |
| Documentation | Complete API docs with inputs/outputs/use cases |
| Real-world Impact | Powers Voice Sementle game; applicable to voice cloning, accessibility, language learning |

🎮 Interactive Demo

👆 Click the interface above to try each tool!

  1. Upload or record audio
  2. Select a tool to test
  3. View JSON results with scores and analysis
  4. Copy embeddings or transcripts for your app

🔗 Related Projects

  • Voice Sementle: Daily voice puzzle game powered by VoiceKit MCP

Built for MCP's 1st Birthday Hackathon 🎂

Celebrating one year of Model Context Protocol!