likhonsheikhdev commited on
Commit
7270816
·
verified ·
1 Parent(s): f238f35

Upload folder using huggingface_hub

Browse files
Files changed (2) hide show
  1. README.md +111 -81
  2. main.py +172 -59
README.md CHANGED
@@ -11,25 +11,19 @@ pinned: false
11
 
12
  # Docker Model Runner
13
 
14
- **Anthropic API Compatible** - Text Generation endpoint with full Messages API support.
15
 
16
  ## Hardware
17
  - **CPU Basic**: 2 vCPU · 16 GB RAM
18
 
19
  ## Quick Start
20
 
21
- ### 1. Install Anthropic SDK
22
  ```bash
23
  pip install anthropic
24
- ```
25
-
26
- ### 2. Configure Environment
27
- ```bash
28
  export ANTHROPIC_BASE_URL=https://likhonsheikhdev-docker-model-runner.hf.space
29
  export ANTHROPIC_API_KEY=any-key
30
  ```
31
 
32
- ### 3. Call API
33
  ```python
34
  import anthropic
35
 
@@ -39,17 +33,7 @@ message = client.messages.create(
39
  model="MiniMax-M2",
40
  max_tokens=1000,
41
  system="You are a helpful assistant.",
42
- messages=[
43
- {
44
- "role": "user",
45
- "content": [
46
- {
47
- "type": "text",
48
- "text": "Hi, how are you?"
49
- }
50
- ]
51
- }
52
- ]
53
  )
54
 
55
  for block in message.content:
@@ -59,55 +43,36 @@ for block in message.content:
59
  print(f"Text:\n{block.text}\n")
60
  ```
61
 
62
- ## Supported Models
63
 
64
- | Model Name | Description |
65
- |------------|-------------|
66
- | MiniMax-M2 | Agentic capabilities, Advanced reasoning |
67
- | MiniMax-M2-Stable | High concurrency and commercial use |
68
 
69
- ## Compatibility
70
-
71
- ### Supported Parameters
72
-
73
- | Parameter | Status | Description |
74
- |-----------|--------|-------------|
75
- | model | ✅ Fully supported | MiniMax-M2, MiniMax-M2-Stable |
76
- | messages | ✅ Partial support | Text and tool calls |
77
- | max_tokens | ✅ Fully supported | Max tokens to generate |
78
- | stream | ✅ Fully supported | Streaming response |
79
- | system | ✅ Fully supported | System prompt |
80
- | temperature | ✅ Fully supported | Range (0.0, 1.0] |
81
- | tool_choice | ✅ Fully supported | Tool selection strategy |
82
- | tools | ✅ Fully supported | Tool definitions |
83
- | top_p | ✅ Fully supported | Nucleus sampling |
84
- | metadata | ✅ Fully supported | Metadata |
85
- | thinking | ✅ Fully supported | Reasoning content |
86
- | top_k | ⚪ Ignored | Parameter ignored |
87
- | stop_sequences | ⚪ Ignored | Parameter ignored |
88
-
89
- ### Message Types
90
 
91
- | Type | Status |
92
- |------|--------|
93
- | text | ✅ Fully supported |
94
- | tool_use | ✅ Fully supported |
95
- | tool_result | ✅ Fully supported |
96
- | thinking | ✅ Fully supported |
97
- | image | ❌ Not supported |
98
- | document | ❌ Not supported |
99
 
100
- ## Endpoints
 
 
 
 
 
 
 
 
101
 
102
- | Endpoint | Method | Description |
103
- |----------|--------|-------------|
104
- | `/v1/messages` | POST | Anthropic Messages API |
105
- | `/v1/chat/completions` | POST | OpenAI Chat API |
106
- | `/v1/models` | GET | List models |
107
- | `/health` | GET | Health check |
108
- | `/info` | GET | API info |
109
 
110
- ## Streaming Example
111
 
112
  ```python
113
  import anthropic
@@ -119,13 +84,23 @@ client = anthropic.Anthropic(
119
  with client.messages.stream(
120
  model="MiniMax-M2",
121
  max_tokens=1024,
 
122
  messages=[{"role": "user", "content": "Hello!"}]
123
  ) as stream:
124
- for text in stream.text_stream:
125
- print(text, end="", flush=True)
 
 
 
 
 
 
 
126
  ```
127
 
128
- ## Tool Calling Example
 
 
129
 
130
  ```python
131
  import anthropic
@@ -134,28 +109,82 @@ client = anthropic.Anthropic(
134
  base_url="https://likhonsheikhdev-docker-model-runner.hf.space"
135
  )
136
 
137
- tools = [
138
- {
139
- "name": "get_weather",
140
- "description": "Get the current weather in a location",
141
- "input_schema": {
142
- "type": "object",
143
- "properties": {
144
- "location": {"type": "string", "description": "City name"}
145
- },
146
- "required": ["location"]
147
- }
148
- }
149
- ]
150
 
151
- message = client.messages.create(
 
152
  model="MiniMax-M2",
153
  max_tokens=1024,
154
- tools=tools,
155
- messages=[{"role": "user", "content": "What's the weather in Tokyo?"}]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
156
  )
157
  ```
158
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
159
  ## cURL Example
160
 
161
  ```bash
@@ -165,8 +194,9 @@ curl -X POST https://likhonsheikhdev-docker-model-runner.hf.space/v1/messages \
165
  -d '{
166
  "model": "MiniMax-M2",
167
  "max_tokens": 1024,
 
168
  "messages": [
169
- {"role": "user", "content": "Hello, how are you?"}
170
  ]
171
  }'
172
  ```
 
11
 
12
  # Docker Model Runner
13
 
14
+ **Anthropic API Compatible** with **Interleaved Thinking** support.
15
 
16
  ## Hardware
17
  - **CPU Basic**: 2 vCPU · 16 GB RAM
18
 
19
  ## Quick Start
20
 
 
21
  ```bash
22
  pip install anthropic
 
 
 
 
23
  export ANTHROPIC_BASE_URL=https://likhonsheikhdev-docker-model-runner.hf.space
24
  export ANTHROPIC_API_KEY=any-key
25
  ```
26
 
 
27
  ```python
28
  import anthropic
29
 
 
33
  model="MiniMax-M2",
34
  max_tokens=1000,
35
  system="You are a helpful assistant.",
36
+ messages=[{"role": "user", "content": "Hi, how are you?"}]
 
 
 
 
 
 
 
 
 
 
37
  )
38
 
39
  for block in message.content:
 
43
  print(f"Text:\n{block.text}\n")
44
  ```
45
 
46
+ ## Interleaved Thinking
47
 
48
+ Enable thinking to get reasoning steps interleaved with responses:
 
 
 
49
 
50
+ ```python
51
+ import anthropic
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
52
 
53
+ client = anthropic.Anthropic(
54
+ base_url="https://likhonsheikhdev-docker-model-runner.hf.space"
55
+ )
 
 
 
 
 
56
 
57
+ message = client.messages.create(
58
+ model="MiniMax-M2",
59
+ max_tokens=1024,
60
+ thinking={
61
+ "type": "enabled",
62
+ "budget_tokens": 200
63
+ },
64
+ messages=[{"role": "user", "content": "Explain quantum computing"}]
65
+ )
66
 
67
+ # Response contains interleaved thinking and text blocks
68
+ for block in message.content:
69
+ if block.type == "thinking":
70
+ print(f"💭 Thinking: {block.thinking}")
71
+ elif block.type == "text":
72
+ print(f"📝 Response: {block.text}")
73
+ ```
74
 
75
+ ## Streaming with Thinking
76
 
77
  ```python
78
  import anthropic
 
84
  with client.messages.stream(
85
  model="MiniMax-M2",
86
  max_tokens=1024,
87
+ thinking={"type": "enabled", "budget_tokens": 100},
88
  messages=[{"role": "user", "content": "Hello!"}]
89
  ) as stream:
90
+ for event in stream:
91
+ if hasattr(event, 'type'):
92
+ if event.type == 'content_block_start':
93
+ print(f"\n[{event.content_block.type}]", end=" ")
94
+ elif event.type == 'content_block_delta':
95
+ if hasattr(event.delta, 'thinking'):
96
+ print(event.delta.thinking, end="")
97
+ elif hasattr(event.delta, 'text'):
98
+ print(event.delta.text, end="")
99
  ```
100
 
101
+ ## Multi-Turn with Thinking History
102
+
103
+ **Important**: In multi-turn conversations, append the complete model response (including thinking blocks) to maintain reasoning chain continuity.
104
 
105
  ```python
106
  import anthropic
 
109
  base_url="https://likhonsheikhdev-docker-model-runner.hf.space"
110
  )
111
 
112
+ messages = [{"role": "user", "content": "What is 2+2?"}]
 
 
 
 
 
 
 
 
 
 
 
 
113
 
114
+ # First turn
115
+ response = client.messages.create(
116
  model="MiniMax-M2",
117
  max_tokens=1024,
118
+ thinking={"type": "enabled", "budget_tokens": 100},
119
+ messages=messages
120
+ )
121
+
122
+ # Append full response (including thinking) to history
123
+ messages.append({
124
+ "role": "assistant",
125
+ "content": response.content # Includes both thinking and text blocks
126
+ })
127
+
128
+ # Second turn
129
+ messages.append({"role": "user", "content": "Now multiply that by 3"})
130
+
131
+ response2 = client.messages.create(
132
+ model="MiniMax-M2",
133
+ max_tokens=1024,
134
+ thinking={"type": "enabled", "budget_tokens": 100},
135
+ messages=messages
136
  )
137
  ```
138
 
139
+ ## Supported Models
140
+
141
+ | Model | Description |
142
+ |-------|-------------|
143
+ | MiniMax-M2 | Agentic capabilities, Advanced reasoning |
144
+ | MiniMax-M2-Stable | High concurrency and commercial use |
145
+
146
+ ## API Compatibility
147
+
148
+ ### Parameters
149
+
150
+ | Parameter | Status |
151
+ |-----------|--------|
152
+ | model | ✅ Fully supported |
153
+ | messages | ✅ Partial (text, tool calls) |
154
+ | max_tokens | ✅ Fully supported |
155
+ | stream | ✅ Fully supported |
156
+ | system | ✅ Fully supported |
157
+ | temperature | ✅ Range (0.0, 1.0] |
158
+ | thinking | ✅ Fully supported |
159
+ | thinking.budget_tokens | ✅ Fully supported |
160
+ | tools | ✅ Fully supported |
161
+ | tool_choice | ✅ Fully supported |
162
+ | top_p | ✅ Fully supported |
163
+ | metadata | ✅ Fully supported |
164
+ | top_k | ⚪ Ignored |
165
+ | stop_sequences | ⚪ Ignored |
166
+
167
+ ### Message Types
168
+
169
+ | Type | Status |
170
+ |------|--------|
171
+ | text | ✅ Supported |
172
+ | thinking | ✅ Supported |
173
+ | tool_use | ✅ Supported |
174
+ | tool_result | ✅ Supported |
175
+ | image | ❌ Not supported |
176
+ | document | ❌ Not supported |
177
+
178
+ ## Endpoints
179
+
180
+ | Endpoint | Method | Description |
181
+ |----------|--------|-------------|
182
+ | `/v1/messages` | POST | Anthropic Messages API |
183
+ | `/v1/chat/completions` | POST | OpenAI Chat API |
184
+ | `/v1/models` | GET | List models |
185
+ | `/health` | GET | Health check |
186
+ | `/info` | GET | API info |
187
+
188
  ## cURL Example
189
 
190
  ```bash
 
194
  -d '{
195
  "model": "MiniMax-M2",
196
  "max_tokens": 1024,
197
+ "thinking": {"type": "enabled", "budget_tokens": 100},
198
  "messages": [
199
+ {"role": "user", "content": "Explain AI briefly"}
200
  ]
201
  }'
202
  ```
main.py CHANGED
@@ -1,6 +1,6 @@
1
  """
2
  Docker Model Runner - Anthropic API Compatible
3
- Full compatibility with Anthropic Messages API format
4
  Optimized for: 2 vCPU, 16GB RAM
5
  """
6
  from fastapi import FastAPI, HTTPException, Header, Request
@@ -16,6 +16,7 @@ import uuid
16
  import time
17
  import json
18
  import asyncio
 
19
 
20
  # CPU-optimized lightweight models
21
  GENERATOR_MODEL = os.getenv("GENERATOR_MODEL", "distilgpt2")
@@ -52,7 +53,7 @@ async def lifespan(app: FastAPI):
52
 
53
  app = FastAPI(
54
  title="Docker Model Runner",
55
- description="Anthropic API Compatible Endpoint",
56
  version="1.0.0",
57
  lifespan=lifespan
58
  )
@@ -70,6 +71,11 @@ class ThinkingBlock(BaseModel):
70
  thinking: str
71
 
72
 
 
 
 
 
 
73
  class ToolUseBlock(BaseModel):
74
  type: Literal["tool_use"] = "tool_use"
75
  id: str
@@ -96,7 +102,7 @@ class ImageBlock(BaseModel):
96
  source: ImageSource
97
 
98
 
99
- ContentBlock = Union[TextBlock, ThinkingBlock, ToolUseBlock, ToolResultContent, ImageBlock, str]
100
 
101
 
102
  class MessageParam(BaseModel):
@@ -119,6 +125,7 @@ class Tool(BaseModel):
119
  class ToolChoice(BaseModel):
120
  type: Literal["auto", "any", "tool"] = "auto"
121
  name: Optional[str] = None
 
122
 
123
 
124
  class ThinkingConfig(BaseModel):
@@ -141,7 +148,7 @@ class AnthropicRequest(BaseModel):
141
  stream: Optional[bool] = False
142
  system: Optional[Union[str, List[TextBlock]]] = None
143
  tools: Optional[List[Tool]] = None
144
- tool_choice: Optional[ToolChoice] = None
145
  metadata: Optional[Metadata] = None
146
  thinking: Optional[ThinkingConfig] = None
147
  service_tier: Optional[str] = None # Ignored
@@ -158,23 +165,13 @@ class AnthropicResponse(BaseModel):
158
  id: str
159
  type: Literal["message"] = "message"
160
  role: Literal["assistant"] = "assistant"
161
- content: List[Union[TextBlock, ThinkingBlock, ToolUseBlock]]
162
  model: str
163
  stop_reason: Optional[Literal["end_turn", "max_tokens", "stop_sequence", "tool_use"]] = "end_turn"
164
  stop_sequence: Optional[str] = None
165
  usage: Usage
166
 
167
 
168
- # Streaming Event Models
169
- class StreamEvent(BaseModel):
170
- type: str
171
- index: Optional[int] = None
172
- content_block: Optional[Dict[str, Any]] = None
173
- delta: Optional[Dict[str, Any]] = None
174
- message: Optional[Dict[str, Any]] = None
175
- usage: Optional[Dict[str, Any]] = None
176
-
177
-
178
  # ============== Helper Functions ==============
179
 
180
  def extract_text_from_content(content: Union[str, List[ContentBlock]]) -> str:
@@ -207,7 +204,7 @@ def format_system_prompt(system: Optional[Union[str, List[TextBlock]]]) -> str:
207
  return " ".join([block.text for block in system if hasattr(block, 'text')])
208
 
209
 
210
- def format_messages_to_prompt(messages: List[MessageParam], system: Optional[Union[str, List[TextBlock]]] = None) -> str:
211
  """Convert chat messages to a single prompt string"""
212
  prompt_parts = []
213
 
@@ -217,12 +214,34 @@ def format_messages_to_prompt(messages: List[MessageParam], system: Optional[Uni
217
 
218
  for msg in messages:
219
  role = msg.role
220
- content_text = extract_text_from_content(msg.content)
221
-
222
- if role == "user":
223
- prompt_parts.append(f"Human: {content_text}\n\n")
224
- elif role == "assistant":
225
- prompt_parts.append(f"Assistant: {content_text}\n\n")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
226
 
227
  prompt_parts.append("Assistant:")
228
  return "".join(prompt_parts)
@@ -239,7 +258,7 @@ def generate_text(prompt: str, max_tokens: int, temperature: float, top_p: float
239
  with torch.no_grad():
240
  outputs = model.generate(
241
  **inputs,
242
- max_new_tokens=min(max_tokens, 256), # Limit for CPU
243
  temperature=temperature if temperature > 0 else 1.0,
244
  top_p=top_p,
245
  do_sample=temperature > 0,
@@ -254,13 +273,51 @@ def generate_text(prompt: str, max_tokens: int, temperature: float, top_p: float
254
  return generated_text.strip(), input_tokens, output_tokens
255
 
256
 
257
- async def generate_stream(prompt: str, max_tokens: int, temperature: float, top_p: float, message_id: str, model_name: str):
258
- """Generate streaming response in Anthropic SSE format"""
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
259
  tokenizer = models["tokenizer"]
260
  model = models["model"]
261
 
262
  inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
263
  input_tokens = inputs["input_ids"].shape[1]
 
264
 
265
  # Send message_start event
266
  message_start = {
@@ -278,15 +335,49 @@ async def generate_stream(prompt: str, max_tokens: int, temperature: float, top_
278
  }
279
  yield f"event: message_start\ndata: {json.dumps(message_start)}\n\n"
280
 
281
- # Send content_block_start event
282
- content_block_start = {
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
283
  "type": "content_block_start",
284
- "index": 0,
285
  "content_block": {"type": "text", "text": ""}
286
  }
287
- yield f"event: content_block_start\ndata: {json.dumps(content_block_start)}\n\n"
288
 
289
- # Generate tokens
290
  with torch.no_grad():
291
  outputs = model.generate(
292
  **inputs,
@@ -300,29 +391,29 @@ async def generate_stream(prompt: str, max_tokens: int, temperature: float, top_
300
 
301
  generated_tokens = outputs[0][input_tokens:]
302
  generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()
303
- output_tokens = len(generated_tokens)
304
 
305
  # Stream text in chunks
306
  chunk_size = 5
307
  for i in range(0, len(generated_text), chunk_size):
308
  chunk = generated_text[i:i+chunk_size]
309
- content_block_delta = {
310
  "type": "content_block_delta",
311
- "index": 0,
312
  "delta": {"type": "text_delta", "text": chunk}
313
  }
314
- yield f"event: content_block_delta\ndata: {json.dumps(content_block_delta)}\n\n"
315
- await asyncio.sleep(0.01) # Small delay for realistic streaming
316
 
317
- # Send content_block_stop event
318
- content_block_stop = {"type": "content_block_stop", "index": 0}
319
- yield f"event: content_block_stop\ndata: {json.dumps(content_block_stop)}\n\n"
320
 
321
  # Send message_delta event
322
  message_delta = {
323
  "type": "message_delta",
324
  "delta": {"stop_reason": "end_turn", "stop_sequence": None},
325
- "usage": {"output_tokens": output_tokens}
326
  }
327
  yield f"event: message_delta\ndata: {json.dumps(message_delta)}\n\n"
328
 
@@ -336,7 +427,6 @@ def handle_tool_call(tools: List[Tool], messages: List[MessageParam], generated_
336
  if not tools:
337
  return None
338
 
339
- # Simple heuristic: check if response mentions tool names
340
  for tool in tools:
341
  if tool.name.lower() in generated_text.lower():
342
  return ToolUseBlock(
@@ -353,7 +443,7 @@ def handle_tool_call(tools: List[Tool], messages: List[MessageParam], generated_
353
  @app.post("/v1/messages")
354
  async def create_message(request: AnthropicRequest):
355
  """
356
- Anthropic Messages API compatible endpoint
357
 
358
  POST /v1/messages
359
 
@@ -362,24 +452,39 @@ async def create_message(request: AnthropicRequest):
362
  - System prompts
363
  - Streaming responses
364
  - Tool/function calling
365
- - Thinking/reasoning blocks
 
 
366
  """
367
  try:
368
  message_id = f"msg_{uuid.uuid4().hex[:24]}"
369
 
370
- # Format messages to prompt
371
- prompt = format_messages_to_prompt(request.messages, request.system)
 
 
 
 
 
 
 
 
 
 
 
372
 
373
  # Handle streaming
374
  if request.stream:
375
  return StreamingResponse(
376
- generate_stream(
377
  prompt=prompt,
378
  max_tokens=request.max_tokens,
379
  temperature=request.temperature or 1.0,
380
  top_p=request.top_p or 1.0,
381
  message_id=message_id,
382
- model_name=request.model
 
 
383
  ),
384
  media_type="text/event-stream",
385
  headers={
@@ -390,20 +495,23 @@ async def create_message(request: AnthropicRequest):
390
  )
391
 
392
  # Non-streaming response
 
 
 
 
 
 
 
 
 
 
393
  generated_text, input_tokens, output_tokens = generate_text(
394
  prompt=prompt,
395
  max_tokens=request.max_tokens,
396
  temperature=request.temperature or 1.0,
397
  top_p=request.top_p or 1.0
398
  )
399
-
400
- # Build content blocks
401
- content_blocks = []
402
-
403
- # Add thinking block if enabled
404
- if request.thinking and request.thinking.type == "enabled":
405
- thinking_text = f"Analyzing the user's request and formulating a response..."
406
- content_blocks.append(ThinkingBlock(type="thinking", thinking=thinking_text))
407
 
408
  # Check for tool calls
409
  tool_use = handle_tool_call(request.tools, request.messages, generated_text) if request.tools else None
@@ -421,7 +529,7 @@ async def create_message(request: AnthropicRequest):
421
  content=content_blocks,
422
  model=request.model,
423
  stop_reason=stop_reason,
424
- usage=Usage(input_tokens=input_tokens, output_tokens=output_tokens)
425
  )
426
  except Exception as e:
427
  raise HTTPException(status_code=500, detail=str(e))
@@ -447,7 +555,6 @@ class ChatCompletionRequest(BaseModel):
447
  async def chat_completions(request: ChatCompletionRequest):
448
  """OpenAI Chat Completions API compatible endpoint"""
449
  try:
450
- # Convert to Anthropic format
451
  anthropic_messages = [
452
  MessageParam(role=msg.role if msg.role in ["user", "assistant"] else "user",
453
  content=msg.content)
@@ -502,7 +609,7 @@ async def list_models():
502
  async def root():
503
  """Welcome endpoint"""
504
  return {
505
- "message": "Docker Model Runner API (Anthropic Compatible)",
506
  "hardware": "CPU Basic: 2 vCPU · 16 GB RAM",
507
  "docs": "/docs",
508
  "api_endpoints": {
@@ -515,7 +622,8 @@ async def root():
515
  "system prompts",
516
  "streaming responses",
517
  "tool/function calling",
518
- "thinking blocks",
 
519
  "metadata"
520
  ]
521
  }
@@ -537,9 +645,14 @@ async def info():
537
  """API information"""
538
  return {
539
  "name": "Docker Model Runner",
540
- "version": "1.0.0",
541
  "api_compatibility": ["anthropic", "openai"],
542
  "supported_models": ["MiniMax-M2", "MiniMax-M2-Stable"],
 
 
 
 
 
543
  "supported_parameters": {
544
  "fully_supported": ["model", "messages", "max_tokens", "stream", "system", "temperature", "top_p", "tools", "tool_choice", "metadata", "thinking"],
545
  "ignored": ["top_k", "stop_sequences", "service_tier"]
 
1
  """
2
  Docker Model Runner - Anthropic API Compatible
3
+ Full compatibility with Anthropic Messages API + Interleaved Thinking
4
  Optimized for: 2 vCPU, 16GB RAM
5
  """
6
  from fastapi import FastAPI, HTTPException, Header, Request
 
16
  import time
17
  import json
18
  import asyncio
19
+ import re
20
 
21
  # CPU-optimized lightweight models
22
  GENERATOR_MODEL = os.getenv("GENERATOR_MODEL", "distilgpt2")
 
53
 
54
  app = FastAPI(
55
  title="Docker Model Runner",
56
+ description="Anthropic API Compatible with Interleaved Thinking",
57
  version="1.0.0",
58
  lifespan=lifespan
59
  )
 
71
  thinking: str
72
 
73
 
74
+ class SignatureBlock(BaseModel):
75
+ type: Literal["signature"] = "signature"
76
+ signature: str
77
+
78
+
79
  class ToolUseBlock(BaseModel):
80
  type: Literal["tool_use"] = "tool_use"
81
  id: str
 
102
  source: ImageSource
103
 
104
 
105
+ ContentBlock = Union[TextBlock, ThinkingBlock, SignatureBlock, ToolUseBlock, ToolResultContent, ImageBlock, str]
106
 
107
 
108
  class MessageParam(BaseModel):
 
125
  class ToolChoice(BaseModel):
126
  type: Literal["auto", "any", "tool"] = "auto"
127
  name: Optional[str] = None
128
+ disable_parallel_tool_use: Optional[bool] = False
129
 
130
 
131
  class ThinkingConfig(BaseModel):
 
148
  stream: Optional[bool] = False
149
  system: Optional[Union[str, List[TextBlock]]] = None
150
  tools: Optional[List[Tool]] = None
151
+ tool_choice: Optional[Union[ToolChoice, Dict[str, Any]]] = None
152
  metadata: Optional[Metadata] = None
153
  thinking: Optional[ThinkingConfig] = None
154
  service_tier: Optional[str] = None # Ignored
 
165
  id: str
166
  type: Literal["message"] = "message"
167
  role: Literal["assistant"] = "assistant"
168
+ content: List[Union[TextBlock, ThinkingBlock, SignatureBlock, ToolUseBlock]]
169
  model: str
170
  stop_reason: Optional[Literal["end_turn", "max_tokens", "stop_sequence", "tool_use"]] = "end_turn"
171
  stop_sequence: Optional[str] = None
172
  usage: Usage
173
 
174
 
 
 
 
 
 
 
 
 
 
 
175
  # ============== Helper Functions ==============
176
 
177
  def extract_text_from_content(content: Union[str, List[ContentBlock]]) -> str:
 
204
  return " ".join([block.text for block in system if hasattr(block, 'text')])
205
 
206
 
207
+ def format_messages_to_prompt(messages: List[MessageParam], system: Optional[Union[str, List[TextBlock]]] = None, include_thinking: bool = False) -> str:
208
  """Convert chat messages to a single prompt string"""
209
  prompt_parts = []
210
 
 
214
 
215
  for msg in messages:
216
  role = msg.role
217
+ content = msg.content
218
+
219
+ # Handle interleaved thinking in message history
220
+ if isinstance(content, list):
221
+ for block in content:
222
+ if isinstance(block, dict):
223
+ block_type = block.get('type', 'text')
224
+ if block_type == 'thinking' and include_thinking:
225
+ prompt_parts.append(f"<thinking>{block.get('thinking', '')}</thinking>\n")
226
+ elif block_type == 'text':
227
+ if role == "user":
228
+ prompt_parts.append(f"Human: {block.get('text', '')}\n\n")
229
+ else:
230
+ prompt_parts.append(f"Assistant: {block.get('text', '')}\n\n")
231
+ elif hasattr(block, 'type'):
232
+ if block.type == 'thinking' and include_thinking:
233
+ prompt_parts.append(f"<thinking>{block.thinking}</thinking>\n")
234
+ elif block.type == 'text':
235
+ if role == "user":
236
+ prompt_parts.append(f"Human: {block.text}\n\n")
237
+ else:
238
+ prompt_parts.append(f"Assistant: {block.text}\n\n")
239
+ else:
240
+ content_text = content if isinstance(content, str) else extract_text_from_content(content)
241
+ if role == "user":
242
+ prompt_parts.append(f"Human: {content_text}\n\n")
243
+ elif role == "assistant":
244
+ prompt_parts.append(f"Assistant: {content_text}\n\n")
245
 
246
  prompt_parts.append("Assistant:")
247
  return "".join(prompt_parts)
 
258
  with torch.no_grad():
259
  outputs = model.generate(
260
  **inputs,
261
+ max_new_tokens=min(max_tokens, 256),
262
  temperature=temperature if temperature > 0 else 1.0,
263
  top_p=top_p,
264
  do_sample=temperature > 0,
 
273
  return generated_text.strip(), input_tokens, output_tokens
274
 
275
 
276
+ def generate_thinking(prompt: str, budget_tokens: int = 100) -> tuple:
277
+ """Generate thinking/reasoning content"""
278
+ tokenizer = models["tokenizer"]
279
+ model = models["model"]
280
+
281
+ thinking_prompt = f"{prompt}\n\nLet me think through this step by step:\n"
282
+
283
+ inputs = tokenizer(thinking_prompt, return_tensors="pt", truncation=True, max_length=512)
284
+ input_tokens = inputs["input_ids"].shape[1]
285
+
286
+ with torch.no_grad():
287
+ outputs = model.generate(
288
+ **inputs,
289
+ max_new_tokens=min(budget_tokens, 128),
290
+ temperature=0.7,
291
+ top_p=0.9,
292
+ do_sample=True,
293
+ pad_token_id=tokenizer.pad_token_id,
294
+ eos_token_id=tokenizer.eos_token_id
295
+ )
296
+
297
+ generated_tokens = outputs[0][input_tokens:]
298
+ thinking_tokens = len(generated_tokens)
299
+ thinking_text = tokenizer.decode(generated_tokens, skip_special_tokens=True)
300
+
301
+ return thinking_text.strip(), thinking_tokens
302
+
303
+
304
+ async def generate_stream_with_thinking(
305
+ prompt: str,
306
+ max_tokens: int,
307
+ temperature: float,
308
+ top_p: float,
309
+ message_id: str,
310
+ model_name: str,
311
+ thinking_enabled: bool = False,
312
+ thinking_budget: int = 100
313
+ ):
314
+ """Generate streaming response with interleaved thinking in Anthropic SSE format"""
315
  tokenizer = models["tokenizer"]
316
  model = models["model"]
317
 
318
  inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
319
  input_tokens = inputs["input_ids"].shape[1]
320
+ total_output_tokens = 0
321
 
322
  # Send message_start event
323
  message_start = {
 
335
  }
336
  yield f"event: message_start\ndata: {json.dumps(message_start)}\n\n"
337
 
338
+ content_index = 0
339
+
340
+ # Generate thinking block if enabled
341
+ if thinking_enabled:
342
+ # Send thinking content_block_start
343
+ thinking_block_start = {
344
+ "type": "content_block_start",
345
+ "index": content_index,
346
+ "content_block": {"type": "thinking", "thinking": ""}
347
+ }
348
+ yield f"event: content_block_start\ndata: {json.dumps(thinking_block_start)}\n\n"
349
+
350
+ # Generate thinking content
351
+ thinking_text, thinking_tokens = generate_thinking(prompt, thinking_budget)
352
+ total_output_tokens += thinking_tokens
353
+
354
+ # Stream thinking in chunks
355
+ chunk_size = 10
356
+ for i in range(0, len(thinking_text), chunk_size):
357
+ chunk = thinking_text[i:i+chunk_size]
358
+ thinking_delta = {
359
+ "type": "content_block_delta",
360
+ "index": content_index,
361
+ "delta": {"type": "thinking_delta", "thinking": chunk}
362
+ }
363
+ yield f"event: content_block_delta\ndata: {json.dumps(thinking_delta)}\n\n"
364
+ await asyncio.sleep(0.01)
365
+
366
+ # Send thinking content_block_stop
367
+ thinking_block_stop = {"type": "content_block_stop", "index": content_index}
368
+ yield f"event: content_block_stop\ndata: {json.dumps(thinking_block_stop)}\n\n"
369
+
370
+ content_index += 1
371
+
372
+ # Send text content_block_start
373
+ text_block_start = {
374
  "type": "content_block_start",
375
+ "index": content_index,
376
  "content_block": {"type": "text", "text": ""}
377
  }
378
+ yield f"event: content_block_start\ndata: {json.dumps(text_block_start)}\n\n"
379
 
380
+ # Generate main response
381
  with torch.no_grad():
382
  outputs = model.generate(
383
  **inputs,
 
391
 
392
  generated_tokens = outputs[0][input_tokens:]
393
  generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()
394
+ total_output_tokens += len(generated_tokens)
395
 
396
  # Stream text in chunks
397
  chunk_size = 5
398
  for i in range(0, len(generated_text), chunk_size):
399
  chunk = generated_text[i:i+chunk_size]
400
+ text_delta = {
401
  "type": "content_block_delta",
402
+ "index": content_index,
403
  "delta": {"type": "text_delta", "text": chunk}
404
  }
405
+ yield f"event: content_block_delta\ndata: {json.dumps(text_delta)}\n\n"
406
+ await asyncio.sleep(0.01)
407
 
408
+ # Send text content_block_stop
409
+ text_block_stop = {"type": "content_block_stop", "index": content_index}
410
+ yield f"event: content_block_stop\ndata: {json.dumps(text_block_stop)}\n\n"
411
 
412
  # Send message_delta event
413
  message_delta = {
414
  "type": "message_delta",
415
  "delta": {"stop_reason": "end_turn", "stop_sequence": None},
416
+ "usage": {"output_tokens": total_output_tokens}
417
  }
418
  yield f"event: message_delta\ndata: {json.dumps(message_delta)}\n\n"
419
 
 
427
  if not tools:
428
  return None
429
 
 
430
  for tool in tools:
431
  if tool.name.lower() in generated_text.lower():
432
  return ToolUseBlock(
 
443
  @app.post("/v1/messages")
444
  async def create_message(request: AnthropicRequest):
445
  """
446
+ Anthropic Messages API compatible endpoint with Interleaved Thinking
447
 
448
  POST /v1/messages
449
 
 
452
  - System prompts
453
  - Streaming responses
454
  - Tool/function calling
455
+ - Interleaved thinking blocks
456
+ - Thinking budget tokens
457
+ - Metadata
458
  """
459
  try:
460
  message_id = f"msg_{uuid.uuid4().hex[:24]}"
461
 
462
+ # Check if thinking is enabled
463
+ thinking_enabled = False
464
+ thinking_budget = 100
465
+ if request.thinking:
466
+ if isinstance(request.thinking, dict):
467
+ thinking_enabled = request.thinking.get('type') == 'enabled'
468
+ thinking_budget = request.thinking.get('budget_tokens', 100)
469
+ else:
470
+ thinking_enabled = request.thinking.type == 'enabled'
471
+ thinking_budget = request.thinking.budget_tokens or 100
472
+
473
+ # Format messages to prompt (include thinking from history if enabled)
474
+ prompt = format_messages_to_prompt(request.messages, request.system, include_thinking=thinking_enabled)
475
 
476
  # Handle streaming
477
  if request.stream:
478
  return StreamingResponse(
479
+ generate_stream_with_thinking(
480
  prompt=prompt,
481
  max_tokens=request.max_tokens,
482
  temperature=request.temperature or 1.0,
483
  top_p=request.top_p or 1.0,
484
  message_id=message_id,
485
+ model_name=request.model,
486
+ thinking_enabled=thinking_enabled,
487
+ thinking_budget=thinking_budget
488
  ),
489
  media_type="text/event-stream",
490
  headers={
 
495
  )
496
 
497
  # Non-streaming response
498
+ content_blocks = []
499
+ total_output_tokens = 0
500
+
501
+ # Generate thinking block if enabled
502
+ if thinking_enabled:
503
+ thinking_text, thinking_tokens = generate_thinking(prompt, thinking_budget)
504
+ total_output_tokens += thinking_tokens
505
+ content_blocks.append(ThinkingBlock(type="thinking", thinking=thinking_text))
506
+
507
+ # Generate main response
508
  generated_text, input_tokens, output_tokens = generate_text(
509
  prompt=prompt,
510
  max_tokens=request.max_tokens,
511
  temperature=request.temperature or 1.0,
512
  top_p=request.top_p or 1.0
513
  )
514
+ total_output_tokens += output_tokens
 
 
 
 
 
 
 
515
 
516
  # Check for tool calls
517
  tool_use = handle_tool_call(request.tools, request.messages, generated_text) if request.tools else None
 
529
  content=content_blocks,
530
  model=request.model,
531
  stop_reason=stop_reason,
532
+ usage=Usage(input_tokens=input_tokens, output_tokens=total_output_tokens)
533
  )
534
  except Exception as e:
535
  raise HTTPException(status_code=500, detail=str(e))
 
555
  async def chat_completions(request: ChatCompletionRequest):
556
  """OpenAI Chat Completions API compatible endpoint"""
557
  try:
 
558
  anthropic_messages = [
559
  MessageParam(role=msg.role if msg.role in ["user", "assistant"] else "user",
560
  content=msg.content)
 
609
  async def root():
610
  """Welcome endpoint"""
611
  return {
612
+ "message": "Docker Model Runner API (Anthropic Compatible + Interleaved Thinking)",
613
  "hardware": "CPU Basic: 2 vCPU · 16 GB RAM",
614
  "docs": "/docs",
615
  "api_endpoints": {
 
622
  "system prompts",
623
  "streaming responses",
624
  "tool/function calling",
625
+ "interleaved thinking blocks",
626
+ "thinking budget tokens",
627
  "metadata"
628
  ]
629
  }
 
645
  """API information"""
646
  return {
647
  "name": "Docker Model Runner",
648
+ "version": "1.1.0",
649
  "api_compatibility": ["anthropic", "openai"],
650
  "supported_models": ["MiniMax-M2", "MiniMax-M2-Stable"],
651
+ "interleaved_thinking": {
652
+ "supported": True,
653
+ "streaming": True,
654
+ "budget_tokens": True
655
+ },
656
  "supported_parameters": {
657
  "fully_supported": ["model", "messages", "max_tokens", "stream", "system", "temperature", "top_p", "tools", "tool_choice", "metadata", "thinking"],
658
  "ignored": ["top_k", "stop_sequences", "service_tier"]