Spaces: MCP-1st-Birthday/voicekit

Gideon committed 9 days ago

Commit 9deac74

1 Parent(s): 85f62fa

Add DOCS modal, tool icons

Files changed (8)

README.md +203 -64
app.py +253 -1
icons/analyze_acoustics.svg +7 -0
icons/extract_embedding.svg +8 -0
icons/grade_voice.svg +12 -0
icons/isolate_voice.svg +7 -0
icons/match_voice.svg +13 -0
icons/transcribe_audio.svg +6 -0

README.md CHANGED

@@ -1,102 +1,241 @@
 ---
-title: VoiceKit MCP Server
-emoji: 🎙️
-colorFrom: indigo
-colorTo: purple
 sdk: gradio
-sdk_version: 6.0.0
 app_file: app.py
 pinned: false
-license: mit
 tags:
   - building-mcp-track-creative
 ---
-# VoiceKit MCP Server
-**Voice analysis toolkit exposing 6 MCP tools for AI assistants.**
-VoiceKit provides comprehensive voice processing capabilities through the Model Context Protocol (MCP), enabling Claude and other AI assistants to analyze, compare, transcribe, and process audio.
-## Purpose
-VoiceKit bridges the gap between AI assistants and advanced voice analysis. It allows:
-- **Voice comparison** for mimicry games, pronunciation practice
-- **Audio transcription** in multiple languages
-- **Acoustic analysis** for voice coaching, music production
-- **Background removal** for clean audio extraction
-## MCP Endpoint
 ```
-https://MCP-1st-Birthday-voicekit.hf.space/gradio_api/mcp/sse
 ```
-## Quick Start
-Add to your `claude_desktop_config.json`:
 ```json
 {
-    "mcpServers": {
-        "voicekit": {
-            "url": "https://MCP-1st-Birthday-voicekit.hf.space/gradio_api/mcp/sse"
-        }
     }
 }
 ```
-## Available Tools (6)
-### Primitive Tools
-| Tool | Purpose | Input | Output |
-|------|---------|-------|--------|
-| `extract_embedding` | Get voice fingerprint | Audio file | 768-dim Wav2Vec2 vector |
-| `compare_voices` | Measure voice similarity | 2 audio files | Similarity score (0-1) |
-| `analyze_acoustic_features` | Analyze voice characteristics | Audio file | Pitch, energy, rhythm, tempo |
-| `transcribe_audio` | Speech-to-text | Audio + language | Transcribed text |
-| `isolate_voice` | Remove background noise/music | Audio file | Clean voice audio |
-### Composite Tool
-| Tool | Purpose | Input | Output |
-|------|---------|-------|--------|
-| `analyze_voice_similarity` | Full voice analysis | 2 audios + text | 5 metrics + overall score |
-## Use Cases
-### Voice Mimicry Game
-```
-User: "Compare my voice to this movie clip"
-Claude: [uses analyze_voice_similarity] → Returns pronunciation, tone, pitch, rhythm, energy scores
-```
-### Audio Transcription
-```
-User: "What does this Korean audio say?"
-Claude: [uses transcribe_audio with language="ko"] → Returns Korean text
-```
-### Clean Audio Extraction
-```
-User: "Remove the background music from this meme"
-Claude: [uses isolate_voice] → Returns isolated voice track
-```
-## Architecture
-```
-┌─────────────────┐     MCP/SSE      ┌─────────────────┐     API      ┌─────────────────┐
-│  Claude Desktop │ ◄──────────────► │   HF Space      │ ◄──────────► │   Modal GPU     │
-│  (MCP Client)   │                  │   (Gradio)      │              │   (Inference)   │
-└─────────────────┘                  └─────────────────┘              └─────────────────┘
-```
-- **Frontend**: Gradio 6 MCP Server on Hugging Face Spaces
-- **Backend**: Modal serverless GPU for ML inference
-- **Models**: Wav2Vec2, ElevenLabs Scribe STT, Voice Isolator
-## Demo
-Try each tool directly in the tabs on the Space UI!

 ---
+title: VoiceKit MCP
+emoji: 🎤
+colorFrom: purple
+colorTo: indigo
 sdk: gradio
+sdk_version: "6.0.0"
 app_file: app.py
 pinned: false
 tags:
   - building-mcp-track-creative
+  - mcp-server
 ---
+# 🎤 VoiceKit MCP
+> **Professional voice analysis as MCP tools — extract embeddings, compare voices, transcribe speech, and more.**
+6 powerful MCP tools for voice processing, all accepting base64-encoded audio.
+📢 **Social Post:** [View on X/Twitter](#) <!-- TODO: Add link to your social media post --><br>
+🎬 **Demo Video:** [Watch (1-5 min)](#) <!-- TODO: Add link to your demo video --><br>
+👥 **Team:** [@EricYoun](https://huggingface.co/EricYoun), [@NickEo](https://huggingface.co/NickEo), [@HYENA-WON](https://huggingface.co/HYENA-WON), [@jjin6573](https://huggingface.co/jjin6573), [@cocoajoa](https://huggingface.co/cocoajoa)
+---
+## 📋 Submission Info
+| | |
+|---|---|
+| **Track** | Building MCP — Creative |
+| **MCP Endpoint** | `https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse` |
+| **Framework** | Gradio 6.0 |
+---
+## ✅ Track 1 Requirements
+| Requirement | How We Fulfill It |
+|-------------|-------------------|
+| **Functioning MCP Server** | 6 MCP tools exposed via Gradio's `mcp_server=True` |
+| **MCP Client Demo** | Video shows integration with Claude Desktop / MCP client |
+| **Documented Tools** | Full API documentation with inputs/outputs below |
+| **Gradio App** | Interactive demo UI + hidden MCP tool interfaces |
+---
+## 🛠️ MCP Tools (6 Tools)
+All tools accept **base64-encoded audio** as input.
+### 1. `extract_embedding` <img src="icons/extract_embedding.svg" width="20" height="20">
+Extract voice embeddings using Wav2Vec2 model.
+| | |
+|---|---|
+| **Input** | `audio_base64` (base64-encoded audio) |
+| **Output** | `embedding_preview` (first 5 values), `embedding_length` (768) |
+| **Use Case** | Speaker identification, voice fingerprinting |
+<img src="imgs/extract_embedding.jpg" height="300">
+### 2. `match_voice` <img src="icons/match_voice.svg" width="20" height="20">
+Compare similarity between two voices.
+| | |
+|---|---|
+| **Inputs** | `audio1_base64`, `audio2_base64` |
+| **Output** | `similarity` (0-1), `tone_score` (0-100) |
+| **Use Case** | Voice cloning verification, speaker matching |
+<img src="imgs/match_voice.jpg" height="300">
+### 3. `analyze_acoustics` <img src="icons/analyze_acoustics.svg" width="20" height="20">
+Extract detailed acoustic characteristics.
+| | |
+|---|---|
+| **Input** | `audio_base64` |
+| **Output** | Pitch, energy, rhythm, tempo, spectral info |
+| **Use Case** | Emotional tone detection, voice profiling |
+<img src="imgs/analyze_acoustics.jpg" height="300">
+### 4. `transcribe_audio` <img src="icons/transcribe_audio.svg" width="20" height="20">
+Convert speech to text (multilingual).
+| | |
+|---|---|
+| **Inputs** | `audio_base64`, `language` (default: "en") |
+| **Output** | Transcribed text, detected language |
+| **Model** | ElevenLabs Scribe v1 |
+| **Languages** | English, Korean, Japanese, and 15+ more |
+<img src="imgs/transcribe_audio.jpg" height="300">
+### 5. `isolate_voice` <img src="icons/isolate_voice.svg" width="20" height="20">
+Remove background music/noise and extract clean voice.
+| | |
+|---|---|
+| **Input** | `audio_base64` (audio with background sounds) |
+| **Output** | Isolated audio (base64), BGM detection status |
+| **Use Case** | Audio cleanup for memes, songs, movies |
+<img src="imgs/isolate_voice.jpg" height="300">
+### 6. `grade_voice` <img src="icons/grade_voice.svg" width="20" height="20">
+Comprehensive voice comparison with multi-metric scoring.
+| | |
+|---|---|
+| **Inputs** | `user_audio_base64`, `reference_audio_base64`, `reference_text` (optional), `category` (meme\|song\|movie) |
+| **Output** | Pitch, rhythm, energy, pronunciation scores (0-100), overall score, user transcription |
+| **Use Case** | Voice mimicry evaluation, pronunciation games |
+<img src="imgs/grade_voice.jpg" height="300">
+---
+## 🏗️ Architecture
 ```
+┌─────────────────────────────────────────────────────────────────┐
+│                        VoiceKit MCP                             │
+├─────────────────────────────────────────────────────────────────┤
+│                                                                 │
+│  ┌────────────────────────────────────────────────────────────┐ │
+│  │                    MCP Client (Claude)                     │ │
+│  │               base64 audio → SSE endpoint                  │ │
+│  └──────────────────────────┬─────────────────────────────────┘ │
+│                             ↓                                   │
+│  ┌────────────────────────────────────────────────────────────┐ │
+│  │                Gradio MCP Server (app.py)                  │ │
+│  │           mcp_server=True • 6 tool interfaces              │ │
+│  └──────────────────────────┬─────────────────────────────────┘ │
+│                             ↓                                   │
+│  ┌────────────────────────────────────────────────────────────┐ │
+│  │              Modal GPU Container (T4)                      │ │
+│  │    Wav2Vec2 • librosa • ElevenLabs APIs • DTW              │ │
+│  └──────────────────────────┬─────────────────────────────────┘ │
+│                             ↓                                   │
+│  ┌────────────────────────────────────────────────────────────┐ │
+│  │                    JSON Response                           │ │
+│  │         embeddings • scores • transcripts • audio          │ │
+│  └────────────────────────────────────────────────────────────┘ │
+│                                                                 │
+└─────────────────────────────────────────────────────────────────┘
 ```
+---
+## 🔌 How to Connect
+### Claude Desktop / MCP Client
+Add to your MCP configuration:
 ```json
 {
+  "mcpServers": {
+    "voicekit": {
+      "url": "https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse"
     }
+  }
 }
 ```
+### Example Usage
+```python
+# 1. Encode audio to base64
+import base64
+with open("audio.wav", "rb") as f:
+    audio_base64 = base64.b64encode(f.read()).decode()
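+# 2. Call MCP tool (mcp_client is a placeholder for your MCP client session)
+result = mcp_client.call("extract_embedding", {"audio_base64": audio_base64})
+# 3. Read the output fields documented for extract_embedding
+embedding_preview = result["embedding_preview"]  # first 5 values
+embedding_length = result["embedding_length"]    # 768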
+```
+---
+## 🛠️ Tech Stack
+| Component | Technology |
+|-----------|------------|
+| MCP Server | Gradio 6.0 (`mcp_server=True`) |
+| GPU Compute | Modal (T4 GPU) |
+| Embeddings | Wav2Vec2 (facebook/wav2vec2-base-960h) |
+| Speech-to-Text | ElevenLabs Scribe v1 |
+| Voice Isolation | ElevenLabs Voice Isolator |
+| Acoustic Analysis | librosa + scipy |
+---
+## ⚡ Performance
+| Metric | Value |
+|--------|-------|
+| Response Time (warm) | <200ms |
+| Cold Start | 1-3s (memory snapshot optimized) |
+| Embedding Dimensions | 768 |
+| Supported Audio | Any format (auto-converts to WAV) |
+| Max Duration | Tested up to 10 minutes |
+---
+## 🎯 Why VoiceKit MCP?
+| Criteria | Our Approach |
+|----------|--------------|
+| **Functionality** | 6 production-ready tools covering full voice analysis pipeline |
+| **Innovation** | First MCP server for comprehensive voice analysis |
+| **Documentation** | Complete API docs with inputs/outputs/use cases |
+| **Real-world Impact** | Powers Voice Sementle game; applicable to voice cloning, accessibility, language learning |
+---
+## 🎮 Interactive Demo
+👆 **Click the interface above to try each tool!**
+1. Upload or record audio
+2. Select a tool to test
+3. View JSON results with scores and analysis
+4. Copy embeddings or transcripts for your app
+---
+## 🔗 Related Projects
+- **[Voice Sementle](https://huggingface.co/spaces/MCP-1st-Birthday/Voice-Sementle)** — Daily voice puzzle game powered by VoiceKit MCP
+---
+**Built for [MCP's 1st Birthday Hackathon](https://huggingface.co/MCP-1st-Birthday)** 🎂
+*Celebrating one year of Model Context Protocol!*
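
The README states that every tool accepts base64-encoded audio and lists `audio1_base64` and `audio2_base64` as the inputs of `match_voice`. The following is a minimal sketch of preparing those arguments, not the project's own client code. The helper names `encode_audio` and `build_match_voice_args` are hypothetical, and the placeholder byte strings stand in for real audio file contents. How the arguments are sent depends on the MCP client in use and is not shown.

```python
import base64
import json


def encode_audio(raw: bytes) -> str:
    """Return the base64 text form of raw audio bytes."""
    return base64.b64encode(raw).decode("ascii")


def build_match_voice_args(audio1: bytes, audio2: bytes) -> dict:
    """Build an argument dict using the input names listed for match_voice."""
    return {
        "audio1_base64": encode_audio(audio1),
        "audio2_base64": encode_audio(audio2),
    }


# Placeholder bytes stand in for file contents read with open(path, "rb").
args = build_match_voice_args(b"RIFF-sample-one", b"RIFF-sample-two")
print(json.dumps(args, indent=2))
```

Keeping the encoding step in one helper means the same function can prepare the `audio_base64` input of the single-file tools as well.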

app.py CHANGED

@@ -14,6 +14,7 @@ import os
 import json
 import tempfile
 import math
 # Set Gradio temp directory to current directory
 GRADIO_TEMP_DIR = os.path.join(os.getcwd(), "gradio_temp")
@@ -34,6 +35,96 @@ except Exception as e:
     print(f"Modal not available: {e}")
 def file_to_base64(file_path: str) -> str:
     """Convert file path to base64 string"""
     if not file_path:
@@ -1104,6 +1195,138 @@ footer,
     margin-left: 6px;
 }
 /* ===== CARD STYLES ===== */
 .card {
     background: rgba(15, 15, 35, 0.8);
@@ -1934,7 +2157,7 @@ with gr.Blocks() as demo:
     """)
     # ==================== HEADER (FLOATING) ====================
-    gr.HTML("""
     <div class="header-main">
         <div class="header-left">
             <span class="header-icon">
@@ -1971,6 +2194,35 @@ with gr.Blocks() as demo:
                 <span class="header-subtitle">MCP Server</span>
             </div>
         </div>
     </div>
     """)

 import json
 import tempfile
 import math
+import re
 # Set Gradio temp directory to current directory
 GRADIO_TEMP_DIR = os.path.join(os.getcwd(), "gradio_temp")
     print(f"Modal not available: {e}")
+# Load README.md and convert to HTML
+def load_readme_as_html():
+    """Load README.md and convert markdown to HTML"""
+    try:
+        with open("README.md", "r", encoding="utf-8") as f:
+            content = f.read()
+        # Remove YAML front matter
+        content = re.sub(r'^---\n.*?\n---\n', '', content, flags=re.DOTALL)
+        html = content
+        # Headers
+        html = re.sub(r'^### (.+)$', r'<h3>\1</h3>', html, flags=re.MULTILINE)
+        html = re.sub(r'^## (.+)$', r'<h2>\1</h2>', html, flags=re.MULTILINE)
+        html = re.sub(r'^# (.+)$', r'<h1>\1</h1>', html, flags=re.MULTILINE)
+        # Code blocks
+        html = re.sub(r'```(\w*)\n(.*?)```', r'<pre><code>\2</code></pre>', html, flags=re.DOTALL)
+        # Inline code
+        html = re.sub(r'`([^`]+)`', r'<code>\1</code>', html)
+        # Bold
+        html = re.sub(r'\*\*(.+?)\*\*', r'<strong>\1</strong>', html)
+        # Links
+        html = re.sub(r'\[([^\]]+)\]\(([^)]+)\)', r'<a href="\2" target="_blank">\1</a>', html)
+        # Tables
+        lines = html.split('\n')
+        in_table = False
+        table_html = []
+        new_lines = []
+        for line in lines:
+            if '|' in line and line.strip().startswith('|'):
+                if not in_table:
+                    in_table = True
+                    table_html = ['<table>']
+                if re.match(r'^\|[\s\-:|]+\|$', line.strip()):
+                    continue
+                cells = [c.strip() for c in line.strip().split('|')[1:-1]]
+                if len(table_html) == 1:
+                    table_html.append('<thead><tr>')
+                    for cell in cells:
+                        table_html.append(f'<th>{cell}</th>')
+                    table_html.append('</tr></thead><tbody>')
+                else:
+                    table_html.append('<tr>')
+                    for cell in cells:
+                        table_html.append(f'<td>{cell}</td>')
+                    table_html.append('</tr>')
+            else:
+                if in_table:
+                    table_html.append('</tbody></table>')
+                    new_lines.append('\n'.join(table_html))
+                    table_html = []
+                    in_table = False
+                new_lines.append(line)
+        if in_table:
+            table_html.append('</tbody></table>')
+            new_lines.append('\n'.join(table_html))
+        html = '\n'.join(new_lines)
+        # Lists
+        html = re.sub(r'^- (.+)$', r'<li>\1</li>', html, flags=re.MULTILINE)
+        html = re.sub(r'(<li>.*</li>\n?)+', r'<ul>\g<0></ul>', html)
+        # Paragraphs
+        lines = html.split('\n')
+        result = []
+        for line in lines:
+            stripped = line.strip()
+            if stripped and not stripped.startswith('<') and not stripped.startswith('```'):
+                result.append(f'<p>{stripped}</p>')
+            else:
+                result.append(line)
+        return '\n'.join(result)
+    except Exception as e:
+        return f"<p>Error loading README: {e}</p>"
+readme_html = load_readme_as_html()
 def file_to_base64(file_path: str) -> str:
     """Convert file path to base64 string"""
     if not file_path:
     margin-left: 6px;
 }
+/* ===== DOCS BUTTON ===== */
+.docs-button {
+    display: flex;
+    align-items: center;
+    gap: 8px;
+    padding: 10px 20px;
+    background: linear-gradient(135deg, rgba(124, 58, 237, 0.3), rgba(99, 102, 241, 0.3));
+    border: 1px solid rgba(124, 58, 237, 0.5);
+    border-radius: 12px;
+    color: #e0e7ff;
+    font-size: 14px;
+    font-weight: 600;
+    cursor: pointer;
+    transition: all 0.3s ease;
+    text-transform: uppercase;
+    letter-spacing: 0.5px;
+}
+.docs-button:hover {
+    background: linear-gradient(135deg, rgba(124, 58, 237, 0.5), rgba(99, 102, 241, 0.5));
+    border-color: rgba(124, 58, 237, 0.8);
+    transform: translateY(-2px);
+    box-shadow: 0 4px 20px rgba(124, 58, 237, 0.4);
+}
+.docs-button svg {
+    width: 18px;
+    height: 18px;
+}
+/* ===== DOCS MODAL ===== */
+.docs-modal-overlay {
+    display: none;
+    position: fixed !important;
+    top: 0 !important;
+    left: 0 !important;
+    right: 0 !important;
+    bottom: 0 !important;
+    width: 100vw !important;
+    height: 100vh !important;
+    background: rgba(0, 0, 0, 0.85) !important;
+    backdrop-filter: blur(10px) !important;
+    z-index: 99999 !important;
+    justify-content: center !important;
+    align-items: flex-start !important;
+    padding: 10px 20px !important;
+    box-sizing: border-box !important;
+}
+.docs-modal-overlay.active {
+    display: flex !important;
+}
+.docs-modal {
+    background: #0d0d1a !important;
+    border: 2px solid #7c3aed !important;
+    border-radius: 20px !important;
+    width: calc(100vw - 40px) !important;
+    max-width: 1800px !important;
+    height: auto !important;
+    max-height: 80vh !important;
+    overflow: hidden !important;
+    box-shadow: 0 25px 80px rgba(0, 0, 0, 0.9) !important;
+    margin: 0 auto !important;
+    position: relative !important;
+    top: 20px !important;
+}
+.docs-modal-header {
+    display: flex !important;
+    justify-content: space-between !important;
+    align-items: center !important;
+    padding: 20px 24px !important;
+    border-bottom: 2px solid #7c3aed !important;
+    background: #1a1a2e !important;
+}
+.docs-modal-title {
+    font-size: 20px;
+    font-weight: 700;
+    color: #e0e7ff;
+    display: flex;
+    align-items: center;
+    gap: 10px;
+}
+.docs-modal-close {
+    background: rgba(124, 58, 237, 0.3);
+    border: 2px solid rgba(124, 58, 237, 0.5);
+    border-radius: 12px;
+    color: #e0e7ff;
+    font-size: 28px;
+    font-weight: 300;
+    cursor: pointer;
+    padding: 4px 14px;
+    line-height: 1;
+    transition: all 0.2s;
+}
+.docs-modal-close:hover {
+    background: rgba(124, 58, 237, 0.4);
+    border-color: rgba(124, 58, 237, 0.6);
+}
+.docs-modal-content {
+    padding: 24px !important;
+    overflow-y: auto !important;
+    max-height: calc(80vh - 80px) !important;
+    color: #c7d2fe !important;
+    font-size: 15px !important;
+    line-height: 1.7 !important;
+    background: #0d0d1a !important;
+}
+.docs-modal-content h1 { font-size: 28px; color: #e0e7ff; margin: 0 0 16px 0; padding-bottom: 12px; border-bottom: 2px solid rgba(124, 58, 237, 0.3); }
+.docs-modal-content h2 { font-size: 22px; color: #e0e7ff; margin: 24px 0 12px 0; }
+.docs-modal-content h3 { font-size: 18px; color: #a5b4fc; margin: 20px 0 10px 0; }
+.docs-modal-content p { margin: 12px 0; }
+.docs-modal-content ul, .docs-modal-content ol { margin: 12px 0; padding-left: 24px; }
+.docs-modal-content li { margin: 6px 0; }
+.docs-modal-content code { background: rgba(124, 58, 237, 0.2); padding: 2px 6px; border-radius: 4px; font-family: 'SF Mono', 'Monaco', 'Consolas', monospace; font-size: 13px; color: #c4b5fd; }
+.docs-modal-content pre { background: rgba(0, 0, 0, 0.4); border: 1px solid rgba(124, 58, 237, 0.2); border-radius: 12px; padding: 16px; overflow-x: auto; margin: 16px 0; }
+.docs-modal-content pre code { background: transparent; padding: 0; color: #a5b4fc; }
+.docs-modal-content table { width: 100%; border-collapse: collapse; margin: 16px 0; }
+.docs-modal-content th, .docs-modal-content td { padding: 10px 12px; text-align: left; border: 1px solid rgba(124, 58, 237, 0.2); }
+.docs-modal-content th { background: rgba(124, 58, 237, 0.15); color: #e0e7ff; font-weight: 600; }
+.docs-modal-content td { color: #c7d2fe; }
+.docs-modal-content a { color: #a78bfa; text-decoration: none; }
+.docs-modal-content a:hover { text-decoration: underline; }
+.docs-modal-content strong { color: #e0e7ff; }
+.docs-modal-content img { max-width: 100%; height: auto; border-radius: 8px; margin: 12px 0; }
 /* ===== CARD STYLES ===== */
 .card {
     background: rgba(15, 15, 35, 0.8);
     """)
     # ==================== HEADER (FLOATING) ====================
+    gr.HTML(f"""
     <div class="header-main">
         <div class="header-left">
             <span class="header-icon">
                 <span class="header-subtitle">MCP Server</span>
             </div>
         </div>
+        <button class="docs-button" onclick="document.getElementById('docsModal').classList.add('active')">
+            <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
+                <path d="M14 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V8z"/>
+                <polyline points="14 2 14 8 20 8"/>
+                <line x1="16" y1="13" x2="8" y2="13"/>
+                <line x1="16" y1="17" x2="8" y2="17"/>
+                <polyline points="10 9 9 9 8 9"/>
+            </svg>
+            DOCS
+        </button>
+    </div>
+    <!-- DOCS Modal -->
+    <div id="docsModal" class="docs-modal-overlay" onclick="if(event.target === this) this.classList.remove('active')">
+        <div class="docs-modal">
+            <div class="docs-modal-header">
+                <div class="docs-modal-title">
+                    <svg width="24" height="24" viewBox="0 0 24 24" fill="none" stroke="#a78bfa" stroke-width="2">
+                        <path d="M14 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V8z"/>
+                        <polyline points="14 2 14 8 20 8"/>
+                    </svg>
+                    Documentation
+                </div>
+                <button class="docs-modal-close" onclick="document.getElementById('docsModal').classList.remove('active')">&times;</button>
+            </div>
+            <div class="docs-modal-content">
+                {readme_html}
+            </div>
+        </div>
     </div>
     """)
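
The paragraph pass in `load_readme_as_html` wraps any non-empty line that does not start with `<`. After the code-block pass, only the first line of a `<pre><code>` block starts with a tag, so the remaining lines of a JSON or Python snippet appear to get wrapped in `<p>` as well. The sketch below shows one possible way to skip lines inside `<pre>`. It is an illustration under that reading of the diff, not part of the commit, and `wrap_paragraphs` is a hypothetical helper name.

```python
def wrap_paragraphs(html: str) -> str:
    """Wrap plain text lines in <p>, leaving tags and <pre> bodies untouched."""
    result = []
    in_pre = False
    for line in html.split("\n"):
        stripped = line.strip()
        # Enter <pre> state before deciding how to treat the line.
        if "<pre>" in line:
            in_pre = True
        if in_pre or not stripped or stripped.startswith("<"):
            result.append(line)
        else:
            result.append(f"<p>{stripped}</p>")
        # Leave <pre> state after the closing tag has been emitted.
        if "</pre>" in line:
            in_pre = False
    return "\n".join(result)


sample = 'Intro text\n<pre><code>{\n  "a": 1\n}\n</code></pre>\nOutro'
print(wrap_paragraphs(sample))
```

Tracking a single `in_pre` flag keeps the approach close to the line-by-line style already used in the function. A full Markdown library would handle more cases but adds a dependency.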

icons/analyze_acoustics.svg ADDED

icons/extract_embedding.svg ADDED

icons/grade_voice.svg ADDED

icons/isolate_voice.svg ADDED

icons/match_voice.svg ADDED

icons/transcribe_audio.svg ADDED