likhonsheikhdev committed on
Commit f238f35 · verified · 1 Parent(s): 51159ea

Upload folder using huggingface_hub

Files changed (2)
  1. README.md +124 -45
  2. main.py +361 -241
README.md CHANGED
@@ -11,83 +11,162 @@ pinned: false
11
 
12
  # Docker Model Runner
13
 
14
- Anthropic & OpenAI API compatible Docker Space with named endpoints.
15
 
16
  ## Hardware
17
  - **CPU Basic**: 2 vCPU · 16 GB RAM
18
 
19
- ## API Compatibility
20
 
21
- ### Anthropic Messages API
22
  ```bash
23
- curl -X POST https://likhonsheikhdev-docker-model-runner.hf.space/v1/messages \
24
- -H "Content-Type: application/json" \
25
- -H "x-api-key: your-key" \
26
- -d '{
27
- "model": "distilgpt2",
28
- "max_tokens": 256,
29
- "messages": [
30
- {"role": "user", "content": "Hello, how are you?"}
31
- ]
32
- }'
33
  ```
34
 
35
- ### OpenAI Chat Completions API
36
  ```bash
37
- curl -X POST https://likhonsheikhdev-docker-model-runner.hf.space/v1/chat/completions \
38
- -H "Content-Type: application/json" \
39
- -H "Authorization: Bearer your-key" \
40
- -d '{
41
- "model": "distilgpt2",
42
- "messages": [
43
- {"role": "user", "content": "Hello, how are you?"}
44
  ]
45
- }'
46
  ```
47
 
48
  ## Endpoints
49
 
50
  | Endpoint | Method | Description |
51
  |----------|--------|-------------|
52
  | `/v1/messages` | POST | Anthropic Messages API |
53
  | `/v1/chat/completions` | POST | OpenAI Chat API |
54
- | `/v1/models` | GET | List available models |
55
  | `/health` | GET | Health check |
56
- | `/info` | GET | API information |
57
- | `/predict` | POST | Text classification |
58
- | `/embed` | POST | Text embeddings |
59
 
60
- ## Python SDK Usage
61
 
62
- ### With Anthropic SDK
63
  ```python
64
- from anthropic import Anthropic
65
 
66
- client = Anthropic(
67
- api_key="any-key",
68
  base_url="https://likhonsheikhdev-docker-model-runner.hf.space"
69
  )
70
 
71
- message = client.messages.create(
72
- model="distilgpt2",
73
- max_tokens=256,
74
  messages=[{"role": "user", "content": "Hello!"}]
75
- )
76
- print(message.content[0].text)
 
77
  ```
78
 
79
- ### With OpenAI SDK
 
80
  ```python
81
- from openai import OpenAI
82
 
83
- client = OpenAI(
84
- api_key="any-key",
85
- base_url="https://likhonsheikhdev-docker-model-runner.hf.space/v1"
86
  )
87
 
88
- response = client.chat.completions.create(
89
- model="distilgpt2",
90
- messages=[{"role": "user", "content": "Hello!"}]
91
  )
92
- print(response.choices[0].message.content)
93
  ```
 
11
 
12
  # Docker Model Runner
13
 
14
+ **Anthropic API Compatible** - Text generation endpoint implementing the Messages API, including streaming, tool calling, and thinking blocks.
15
 
16
  ## Hardware
17
  - **CPU Basic**: 2 vCPU · 16 GB RAM
18
 
19
+ ## Quick Start
20
 
21
+ ### 1. Install Anthropic SDK
22
  ```bash
23
+ pip install anthropic
24
  ```
25
 
26
+ ### 2. Configure Environment
27
  ```bash
28
+ export ANTHROPIC_BASE_URL=https://likhonsheikhdev-docker-model-runner.hf.space
29
+ export ANTHROPIC_API_KEY=any-key
30
+ ```
31
+
32
+ ### 3. Call API
33
+ ```python
34
+ import anthropic
35
+
36
+ client = anthropic.Anthropic()
37
+
38
+ message = client.messages.create(
39
+ model="MiniMax-M2",
40
+ max_tokens=1000,
41
+ system="You are a helpful assistant.",
42
+ messages=[
43
+ {
44
+ "role": "user",
45
+ "content": [
46
+ {
47
+ "type": "text",
48
+ "text": "Hi, how are you?"
49
+ }
50
+ ]
51
+ }
52
  ]
53
+ )
54
+
55
+ for block in message.content:
56
+ if block.type == "thinking":
57
+ print(f"Thinking:\n{block.thinking}\n")
58
+ elif block.type == "text":
59
+ print(f"Text:\n{block.text}\n")
60
  ```
61
 
62
+ ## Supported Models
63
+
64
+ | Model Name | Description |
65
+ |------------|-------------|
66
+ | MiniMax-M2 | Agentic capabilities, advanced reasoning |
67
+ | MiniMax-M2-Stable | High concurrency and commercial use |
68
+
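+ To check which model IDs the running Space actually advertises, you can query `/v1/models` directly. A minimal sketch using `requests` (assumes `pip install requests`; the response shape matches the list returned by `main.py`):
+ 
+ ```python
+ import requests
+ 
+ # List the model IDs exposed by the Space
+ resp = requests.get("https://likhonsheikhdev-docker-model-runner.hf.space/v1/models")
+ resp.raise_for_status()
+ for model in resp.json()["data"]:
+     print(model["id"])
+ ```
+ 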
69
+ ## Compatibility
70
+
71
+ ### Supported Parameters
72
+
73
+ | Parameter | Status | Description |
74
+ |-----------|--------|-------------|
75
+ | model | ✅ Fully supported | MiniMax-M2, MiniMax-M2-Stable |
76
+ | messages | 🟡 Partial support | Text and tool calls (images and documents are not supported) |
77
+ | max_tokens | ✅ Fully supported | Max tokens to generate |
78
+ | stream | ✅ Fully supported | Streaming response |
79
+ | system | ✅ Fully supported | System prompt |
80
+ | temperature | ✅ Fully supported | Range (0.0, 1.0] |
81
+ | tool_choice | ✅ Fully supported | Tool selection strategy |
82
+ | tools | ✅ Fully supported | Tool definitions |
83
+ | top_p | ✅ Fully supported | Nucleus sampling |
84
+ | metadata | ✅ Fully supported | Metadata |
85
+ | thinking | ✅ Fully supported | Reasoning content |
86
+ | top_k | ⚪ Ignored | Parameter ignored |
87
+ | stop_sequences | ⚪ Ignored | Parameter ignored |
88
+
89
+ ### Message Types
90
+
91
+ | Type | Status |
92
+ |------|--------|
93
+ | text | ✅ Fully supported |
94
+ | tool_use | ✅ Fully supported |
95
+ | tool_result | ✅ Fully supported |
96
+ | thinking | ✅ Fully supported |
97
+ | image | ❌ Not supported |
98
+ | document | ❌ Not supported |
99
+
100
  ## Endpoints
101
 
102
  | Endpoint | Method | Description |
103
  |----------|--------|-------------|
104
  | `/v1/messages` | POST | Anthropic Messages API |
105
  | `/v1/chat/completions` | POST | OpenAI Chat API |
106
+ | `/v1/models` | GET | List models |
107
  | `/health` | GET | Health check |
108
+ | `/info` | GET | API info |
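+ 
+ The `/v1/chat/completions` route keeps the Space usable from OpenAI-style clients as well. A minimal sketch with the `openai` SDK (any API key value is accepted):
+ 
+ ```python
+ from openai import OpenAI
+ 
+ client = OpenAI(
+     api_key="any-key",
+     base_url="https://likhonsheikhdev-docker-model-runner.hf.space/v1",
+ )
+ 
+ response = client.chat.completions.create(
+     model="MiniMax-M2",
+     messages=[{"role": "user", "content": "Hello!"}],
+ )
+ print(response.choices[0].message.content)
+ ```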
109
 
110
+ ## Streaming Example
111
 
 
112
  ```python
113
+ import anthropic
114
 
115
+ client = anthropic.Anthropic(
 
116
  base_url="https://likhonsheikhdev-docker-model-runner.hf.space"
117
  )
118
 
119
+ with client.messages.stream(
120
+ model="MiniMax-M2",
121
+ max_tokens=1024,
122
  messages=[{"role": "user", "content": "Hello!"}]
123
+ ) as stream:
124
+ for text in stream.text_stream:
125
+ print(text, end="", flush=True)
126
  ```
127
 
128
+ ## Tool Calling Example
129
+
130
  ```python
131
+ import anthropic
132
 
133
+ client = anthropic.Anthropic(
134
+ base_url="https://likhonsheikhdev-docker-model-runner.hf.space"
 
135
  )
136
 
137
+ tools = [
138
+ {
139
+ "name": "get_weather",
140
+ "description": "Get the current weather in a location",
141
+ "input_schema": {
142
+ "type": "object",
143
+ "properties": {
144
+ "location": {"type": "string", "description": "City name"}
145
+ },
146
+ "required": ["location"]
147
+ }
148
+ }
149
+ ]
150
+
151
+ message = client.messages.create(
152
+ model="MiniMax-M2",
153
+ max_tokens=1024,
154
+ tools=tools,
155
+ messages=[{"role": "user", "content": "What's the weather in Tokyo?"}]
156
  )
157
+ ```
158
+
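+ When the model selects a tool, the response carries a `tool_use` block and `stop_reason` is `"tool_use"`. A hedged sketch of returning the result as a `tool_result` block (the weather string is made up for illustration):
+ 
+ ```python
+ # Continues from the example above; assumes `message` contains a tool_use block.
+ tool_use = next(block for block in message.content if block.type == "tool_use")
+ 
+ follow_up = client.messages.create(
+     model="MiniMax-M2",
+     max_tokens=1024,
+     tools=tools,
+     messages=[
+         {"role": "user", "content": "What's the weather in Tokyo?"},
+         {"role": "assistant", "content": message.content},
+         {
+             "role": "user",
+             "content": [{
+                 "type": "tool_result",
+                 "tool_use_id": tool_use.id,
+                 "content": "22°C and clear",  # illustrative tool output
+             }],
+         },
+     ],
+ )
+ print(follow_up.content[0].text)
+ ```
+ 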
159
+ ## cURL Example
160
+
161
+ ```bash
162
+ curl -X POST https://likhonsheikhdev-docker-model-runner.hf.space/v1/messages \
163
+ -H "Content-Type: application/json" \
164
+ -H "x-api-key: any-key" \
165
+ -d '{
166
+ "model": "MiniMax-M2",
167
+ "max_tokens": 1024,
168
+ "messages": [
169
+ {"role": "user", "content": "Hello, how are you?"}
170
+ ]
171
+ }'
172
  ```
main.py CHANGED
@@ -1,23 +1,25 @@
1
  """
2
- Docker Model Runner - CPU-Optimized FastAPI application
3
- Compatible with Anthropic API format
4
  Optimized for: 2 vCPU, 16GB RAM
5
  """
6
- from fastapi import FastAPI, HTTPException, Header
 
7
  from pydantic import BaseModel, Field
8
- from typing import Optional, List, Union, Literal
9
  import torch
10
- from transformers import pipeline, AutoTokenizer, AutoModel, AutoModelForCausalLM
11
  import os
12
  from datetime import datetime
13
  from contextlib import asynccontextmanager
14
  import uuid
15
  import time
 
 
16
 
17
  # CPU-optimized lightweight models
18
- MODEL_NAME = os.getenv("MODEL_NAME", "distilbert-base-uncased-finetuned-sst-2-english")
19
  GENERATOR_MODEL = os.getenv("GENERATOR_MODEL", "distilgpt2")
20
- EMBED_MODEL = os.getenv("EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
21
 
22
  # Set CPU threading
23
  torch.set_num_threads(2)
@@ -31,27 +33,12 @@ def load_models():
31
  global models
32
  print("Loading models for CPU inference...")
33
 
34
- # Classifier
35
- models["classifier"] = pipeline(
36
- "text-classification",
37
- model=MODEL_NAME,
38
- device=-1,
39
- torch_dtype=torch.float32
40
- )
41
 
42
- # Generator with tokenizer for chat
43
- models["generator_tokenizer"] = AutoTokenizer.from_pretrained(GENERATOR_MODEL)
44
- models["generator_model"] = AutoModelForCausalLM.from_pretrained(GENERATOR_MODEL)
45
- models["generator_model"].eval()
46
-
47
- # Set pad token
48
- if models["generator_tokenizer"].pad_token is None:
49
- models["generator_tokenizer"].pad_token = models["generator_tokenizer"].eos_token
50
-
51
- # Embedding model
52
- models["embed_tokenizer"] = AutoTokenizer.from_pretrained(EMBED_MODEL)
53
- models["embed_model"] = AutoModel.from_pretrained(EMBED_MODEL)
54
- models["embed_model"].eval()
55
 
56
  print("✅ All models loaded successfully!")
57
 
@@ -65,7 +52,7 @@ async def lifespan(app: FastAPI):
65
 
66
  app = FastAPI(
67
  title="Docker Model Runner",
68
- description="Anthropic API Compatible - CPU-Optimized HuggingFace Space",
69
  version="1.0.0",
70
  lifespan=lifespan
71
  )
@@ -73,123 +60,178 @@ app = FastAPI(
73
 
74
  # ============== Anthropic API Models ==============
75
 
76
- class ContentBlock(BaseModel):
77
  type: Literal["text"] = "text"
78
  text: str
79
 
80
 
81
- class MessageContent(BaseModel):
82
- role: Literal["user", "assistant"]
83
- content: Union[str, List[ContentBlock]]
84
 
85
 
86
- class AnthropicRequest(BaseModel):
87
- model: str = "distilgpt2"
88
- messages: List[MessageContent]
89
- max_tokens: int = 1024
90
- temperature: Optional[float] = 0.7
91
- top_p: Optional[float] = 1.0
92
- stop_sequences: Optional[List[str]] = None
93
- stream: Optional[bool] = False
94
- system: Optional[str] = None
95
 
96
 
97
- class Usage(BaseModel):
98
- input_tokens: int
99
- output_tokens: int
 
 
100
 
101
 
102
- class AnthropicResponse(BaseModel):
103
- id: str
104
- type: Literal["message"] = "message"
105
- role: Literal["assistant"] = "assistant"
106
- content: List[ContentBlock]
107
- model: str
108
- stop_reason: Literal["end_turn", "max_tokens", "stop_sequence"] = "end_turn"
109
- stop_sequence: Optional[str] = None
110
- usage: Usage
111
 
112
 
113
- # ============== OpenAI Compatible Models ==============
 
 
114
 
115
- class ChatMessage(BaseModel):
116
- role: str
117
- content: str
118
 
 
119
 
120
- class ChatCompletionRequest(BaseModel):
121
- model: str = "distilgpt2"
122
- messages: List[ChatMessage]
123
- max_tokens: Optional[int] = 1024
124
- temperature: Optional[float] = 0.7
125
- top_p: Optional[float] = 1.0
126
- stream: Optional[bool] = False
127
 
 
 
 
128
 
129
- class ChatChoice(BaseModel):
130
- index: int = 0
131
- message: ChatMessage
132
- finish_reason: str = "stop"
133
 
 
 
 
 
134
 
135
- class ChatCompletionResponse(BaseModel):
136
- id: str
137
- object: str = "chat.completion"
138
- created: int
139
- model: str
140
- choices: List[ChatChoice]
141
- usage: dict
142
 
 
 
 
 
143
 
144
- # ============== Other Request/Response Models ==============
145
 
146
- class PredictRequest(BaseModel):
147
- text: str
148
- top_k: Optional[int] = 1
149
 
150
 
151
- class PredictResponse(BaseModel):
152
- predictions: List[dict]
153
- model: str
154
- latency_ms: float
155
 
156
 
157
- class EmbedRequest(BaseModel):
158
- texts: List[str]
159
 
160
 
161
- class EmbedResponse(BaseModel):
162
- embeddings: List[List[float]]
163
- model: str
164
- dimensions: int
165
- latency_ms: float
166
 
167
 
168
- class HealthResponse(BaseModel):
169
- status: str
170
- timestamp: str
171
- hardware: str
172
- models_loaded: bool
173
 
174
 
175
- class ModelInfo(BaseModel):
176
  id: str
177
- object: str = "model"
178
- created: int
179
- owned_by: str = "local"
180
 
181
 
182
- class ModelsResponse(BaseModel):
183
- object: str = "list"
184
- data: List[ModelInfo]
185
 
186
 
187
  # ============== Helper Functions ==============
188
 
189
  def generate_text(prompt: str, max_tokens: int, temperature: float, top_p: float) -> tuple:
190
  """Generate text and return (text, input_tokens, output_tokens)"""
191
- tokenizer = models["generator_tokenizer"]
192
- model = models["generator_model"]
193
 
194
  inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
195
  input_tokens = inputs["input_ids"].shape[1]
@@ -197,7 +239,7 @@ def generate_text(prompt: str, max_tokens: int, temperature: float, top_p: float
197
  with torch.no_grad():
198
  outputs = model.generate(
199
  **inputs,
200
- max_new_tokens=max_tokens,
201
  temperature=temperature if temperature > 0 else 1.0,
202
  top_p=top_p,
203
  do_sample=temperature > 0,
@@ -212,60 +254,173 @@ def generate_text(prompt: str, max_tokens: int, temperature: float, top_p: float
212
  return generated_text.strip(), input_tokens, output_tokens
213
 
214
 
215
- def format_messages_to_prompt(messages: List, system: Optional[str] = None) -> str:
216
- """Convert chat messages to a single prompt string"""
217
- prompt_parts = []
 
218
 
219
- if system:
220
- prompt_parts.append(f"System: {system}\n")
221
 
222
- for msg in messages:
223
- role = msg.role if hasattr(msg, 'role') else msg.get('role', 'user')
224
- content = msg.content if hasattr(msg, 'content') else msg.get('content', '')
225
 
226
- # Handle content that might be a list of blocks
227
- if isinstance(content, list):
228
- content = " ".join([block.text if hasattr(block, 'text') else block.get('text', '') for block in content])
229
 
230
- if role == "user":
231
- prompt_parts.append(f"Human: {content}\n")
232
- elif role == "assistant":
233
- prompt_parts.append(f"Assistant: {content}\n")
234
 
235
- prompt_parts.append("Assistant:")
236
- return "".join(prompt_parts)
237
 
238
 
239
  # ============== Anthropic API Endpoints ==============
240
 
241
- @app.post("/v1/messages", response_model=AnthropicResponse)
242
- async def create_message(
243
- request: AnthropicRequest,
244
- x_api_key: Optional[str] = Header(None, alias="x-api-key"),
245
- authorization: Optional[str] = Header(None)
246
- ):
247
  """
248
  Anthropic Messages API compatible endpoint
249
 
250
  POST /v1/messages
251
  """
252
  try:
 
 
253
  # Format messages to prompt
254
  prompt = format_messages_to_prompt(request.messages, request.system)
255
 
256
- # Generate response
257
  generated_text, input_tokens, output_tokens = generate_text(
258
  prompt=prompt,
259
  max_tokens=request.max_tokens,
260
- temperature=request.temperature or 0.7,
261
  top_p=request.top_p or 1.0
262
  )
263
 
264
  return AnthropicResponse(
265
- id=f"msg_{uuid.uuid4().hex[:24]}",
266
- content=[ContentBlock(type="text", text=generated_text)],
267
- model=GENERATOR_MODEL,
268
- stop_reason="end_turn",
269
  usage=Usage(input_tokens=input_tokens, output_tokens=output_tokens)
270
  )
271
  except Exception as e:
@@ -274,21 +429,33 @@ async def create_message(
274
 
275
  # ============== OpenAI Compatible Endpoints ==============
276
 
277
- @app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
278
- async def chat_completions(
279
- request: ChatCompletionRequest,
280
- authorization: Optional[str] = Header(None)
281
- ):
282
- """
283
- OpenAI Chat Completions API compatible endpoint
284
 
285
- POST /v1/chat/completions
286
- """
287
  try:
288
- # Format messages to prompt
289
- prompt = format_messages_to_prompt(request.messages)
 
290
 
291
- # Generate response
292
  generated_text, input_tokens, output_tokens = generate_text(
293
  prompt=prompt,
294
  max_tokens=request.max_tokens or 1024,
@@ -296,40 +463,40 @@ async def chat_completions(
296
  top_p=request.top_p or 1.0
297
  )
298
 
299
- return ChatCompletionResponse(
300
- id=f"chatcmpl-{uuid.uuid4().hex[:24]}",
301
- created=int(time.time()),
302
- model=GENERATOR_MODEL,
303
- choices=[
304
- ChatChoice(
305
- index=0,
306
- message=ChatMessage(role="assistant", content=generated_text),
307
- finish_reason="stop"
308
- )
309
- ],
310
- usage={
311
  "prompt_tokens": input_tokens,
312
  "completion_tokens": output_tokens,
313
  "total_tokens": input_tokens + output_tokens
314
  }
315
- )
316
  except Exception as e:
317
  raise HTTPException(status_code=500, detail=str(e))
318
 
319
 
320
- @app.get("/v1/models", response_model=ModelsResponse)
321
  async def list_models():
322
- """List available models (OpenAI compatible)"""
323
- return ModelsResponse(
324
- data=[
325
- ModelInfo(id=GENERATOR_MODEL, created=int(time.time())),
326
- ModelInfo(id=MODEL_NAME, created=int(time.time())),
327
- ModelInfo(id=EMBED_MODEL, created=int(time.time()))
 
328
  ]
329
- )
330
 
331
 
332
- # ============== Original Endpoints ==============
333
 
334
  @app.get("/")
335
  async def root():
@@ -339,98 +506,51 @@ async def root():
339
  "hardware": "CPU Basic: 2 vCPU · 16 GB RAM",
340
  "docs": "/docs",
341
  "api_endpoints": {
342
- "anthropic": "/v1/messages",
343
- "openai": "/v1/chat/completions",
344
- "models": "/v1/models"
345
  },
346
- "utility_endpoints": ["/health", "/info", "/predict", "/embed"]
347
  }
348
 
349
 
350
- @app.get("/health", response_model=HealthResponse)
351
  async def health():
352
  """Health check endpoint"""
353
- return HealthResponse(
354
- status="healthy",
355
- timestamp=datetime.utcnow().isoformat(),
356
- hardware="CPU Basic: 2 vCPU · 16 GB RAM",
357
- models_loaded=len(models) > 0
358
- )
359
 
360
 
361
  @app.get("/info")
362
  async def info():
363
- """Model and API information"""
364
  return {
365
  "name": "Docker Model Runner",
366
  "version": "1.0.0",
367
  "api_compatibility": ["anthropic", "openai"],
368
- "hardware": "CPU Basic: 2 vCPU · 16 GB RAM",
369
- "models": {
370
- "chat": GENERATOR_MODEL,
371
- "classifier": MODEL_NAME,
372
- "embedder": EMBED_MODEL
373
  },
374
- "endpoints": {
375
- "anthropic_messages": "POST /v1/messages",
376
- "openai_chat": "POST /v1/chat/completions",
377
- "models": "GET /v1/models",
378
- "predict": "POST /predict",
379
- "embed": "POST /embed"
380
  }
381
  }
382
 
383
 
384
- @app.post("/predict", response_model=PredictResponse)
385
- async def predict(request: PredictRequest):
386
- """Text classification (sentiment analysis)"""
387
- try:
388
- start_time = datetime.now()
389
- results = models["classifier"](request.text, top_k=request.top_k)
390
- latency = (datetime.now() - start_time).total_seconds() * 1000
391
-
392
- return PredictResponse(
393
- predictions=results,
394
- model=MODEL_NAME,
395
- latency_ms=round(latency, 2)
396
- )
397
- except Exception as e:
398
- raise HTTPException(status_code=500, detail=str(e))
399
-
400
-
401
- @app.post("/embed", response_model=EmbedResponse)
402
- async def embed(request: EmbedRequest):
403
- """Get text embeddings"""
404
- try:
405
- start_time = datetime.now()
406
-
407
- inputs = models["embed_tokenizer"](
408
- request.texts,
409
- padding=True,
410
- truncation=True,
411
- max_length=256,
412
- return_tensors="pt"
413
- )
414
-
415
- with torch.no_grad():
416
- outputs = models["embed_model"](**inputs)
417
- attention_mask = inputs["attention_mask"]
418
- token_embeddings = outputs.last_hidden_state
419
- input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
420
- embeddings = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
421
-
422
- latency = (datetime.now() - start_time).total_seconds() * 1000
423
-
424
- return EmbedResponse(
425
- embeddings=embeddings.tolist(),
426
- model=EMBED_MODEL,
427
- dimensions=embeddings.shape[1],
428
- latency_ms=round(latency, 2)
429
- )
430
- except Exception as e:
431
- raise HTTPException(status_code=500, detail=str(e))
432
-
433
-
434
  if __name__ == "__main__":
435
  import uvicorn
436
  uvicorn.run(app, host="0.0.0.0", port=7860)
 
1
  """
2
+ Docker Model Runner - Anthropic API Compatible
3
+ Full compatibility with Anthropic Messages API format
4
  Optimized for: 2 vCPU, 16GB RAM
5
  """
6
+ from fastapi import FastAPI, HTTPException, Header, Request
7
+ from fastapi.responses import StreamingResponse
8
  from pydantic import BaseModel, Field
9
+ from typing import Optional, List, Union, Literal, Any, Dict
10
  import torch
11
+ from transformers import AutoTokenizer, AutoModelForCausalLM
12
  import os
13
  from datetime import datetime
14
  from contextlib import asynccontextmanager
15
  import uuid
16
  import time
17
+ import json
18
+ import asyncio
19
 
20
  # CPU-optimized lightweight models
 
21
  GENERATOR_MODEL = os.getenv("GENERATOR_MODEL", "distilgpt2")
22
+ MODEL_DISPLAY_NAME = os.getenv("MODEL_NAME", "MiniMax-M2")
23
 
24
  # Set CPU threading
25
  torch.set_num_threads(2)
 
33
  global models
34
  print("Loading models for CPU inference...")
35
 
36
+ models["tokenizer"] = AutoTokenizer.from_pretrained(GENERATOR_MODEL)
37
+ models["model"] = AutoModelForCausalLM.from_pretrained(GENERATOR_MODEL)
38
+ models["model"].eval()
39
 
40
+ if models["tokenizer"].pad_token is None:
41
+ models["tokenizer"].pad_token = models["tokenizer"].eos_token
 
42
 
43
  print("✅ All models loaded successfully!")
44
 
 
52
 
53
  app = FastAPI(
54
  title="Docker Model Runner",
55
+ description="Anthropic API Compatible Endpoint",
56
  version="1.0.0",
57
  lifespan=lifespan
58
  )
 
60
 
61
  # ============== Anthropic API Models ==============
62
 
63
+ class TextBlock(BaseModel):
64
  type: Literal["text"] = "text"
65
  text: str
66
 
67
 
68
+ class ThinkingBlock(BaseModel):
69
+ type: Literal["thinking"] = "thinking"
70
+ thinking: str
71
 
72
 
73
+ class ToolUseBlock(BaseModel):
74
+ type: Literal["tool_use"] = "tool_use"
75
+ id: str
76
+ name: str
77
+ input: Dict[str, Any]
78
 
79
 
80
+ class ToolResultContent(BaseModel):
81
+ type: Literal["tool_result"] = "tool_result"
82
+ tool_use_id: str
83
+ content: Union[str, List[TextBlock]]
84
+ is_error: Optional[bool] = False
85
 
86
 
87
+ class ImageSource(BaseModel):
88
+ type: Literal["base64", "url"]
89
+ media_type: Optional[str] = None
90
+ data: Optional[str] = None
91
+ url: Optional[str] = None
92
 
93
 
94
+ class ImageBlock(BaseModel):
95
+ type: Literal["image"] = "image"
96
+ source: ImageSource
97
 
98
 
99
+ ContentBlock = Union[TextBlock, ThinkingBlock, ToolUseBlock, ToolResultContent, ImageBlock, str]
100
 
101
 
102
+ class MessageParam(BaseModel):
103
+ role: Literal["user", "assistant"]
104
+ content: Union[str, List[ContentBlock]]
105
 
106
 
107
+ class ToolInputSchema(BaseModel):
108
+ type: str = "object"
109
+ properties: Optional[Dict[str, Any]] = None
110
+ required: Optional[List[str]] = None
111
 
112
 
113
+ class Tool(BaseModel):
114
+ name: str
115
+ description: str
116
+ input_schema: ToolInputSchema
117
 
 
118
 
119
+ class ToolChoice(BaseModel):
120
+ type: Literal["auto", "any", "tool"] = "auto"
121
+ name: Optional[str] = None
122
 
123
 
124
+ class ThinkingConfig(BaseModel):
125
+ type: Literal["enabled", "disabled"] = "disabled"
126
+ budget_tokens: Optional[int] = None
 
127
 
128
 
129
+ class Metadata(BaseModel):
130
+ user_id: Optional[str] = None
131
 
132
 
133
+ class AnthropicRequest(BaseModel):
134
+ model: str = "MiniMax-M2"
135
+ messages: List[MessageParam]
136
+ max_tokens: int = 1024
137
+ temperature: Optional[float] = Field(default=1.0, gt=0.0, le=1.0)
138
+ top_p: Optional[float] = Field(default=1.0, gt=0.0, le=1.0)
139
+ top_k: Optional[int] = None # Ignored
140
+ stop_sequences: Optional[List[str]] = None # Ignored
141
+ stream: Optional[bool] = False
142
+ system: Optional[Union[str, List[TextBlock]]] = None
143
+ tools: Optional[List[Tool]] = None
144
+ tool_choice: Optional[ToolChoice] = None
145
+ metadata: Optional[Metadata] = None
146
+ thinking: Optional[ThinkingConfig] = None
147
+ service_tier: Optional[str] = None # Ignored
148
 
149
 
150
+ class Usage(BaseModel):
151
+ input_tokens: int
152
+ output_tokens: int
153
+ cache_creation_input_tokens: Optional[int] = 0
154
+ cache_read_input_tokens: Optional[int] = 0
155
 
156
 
157
+ class AnthropicResponse(BaseModel):
158
  id: str
159
+ type: Literal["message"] = "message"
160
+ role: Literal["assistant"] = "assistant"
161
+ content: List[Union[TextBlock, ThinkingBlock, ToolUseBlock]]
162
+ model: str
163
+ stop_reason: Optional[Literal["end_turn", "max_tokens", "stop_sequence", "tool_use"]] = "end_turn"
164
+ stop_sequence: Optional[str] = None
165
+ usage: Usage
166
 
167
 
168
+ # Streaming Event Models
169
+ class StreamEvent(BaseModel):
170
+ type: str
171
+ index: Optional[int] = None
172
+ content_block: Optional[Dict[str, Any]] = None
173
+ delta: Optional[Dict[str, Any]] = None
174
+ message: Optional[Dict[str, Any]] = None
175
+ usage: Optional[Dict[str, Any]] = None
176
 
177
 
178
  # ============== Helper Functions ==============
179
 
180
+ def extract_text_from_content(content: Union[str, List[ContentBlock]]) -> str:
181
+ """Extract text from content which may be string or list of blocks"""
182
+ if isinstance(content, str):
183
+ return content
184
+
185
+ texts = []
186
+ for block in content:
187
+ if isinstance(block, str):
188
+ texts.append(block)
189
+ elif hasattr(block, 'text'):
190
+ texts.append(block.text)
191
+ elif hasattr(block, 'thinking'):
192
+ texts.append(block.thinking)
193
+ elif isinstance(block, dict):
194
+ if block.get('type') == 'text':
195
+ texts.append(block.get('text', ''))
196
+ elif block.get('type') == 'thinking':
197
+ texts.append(block.get('thinking', ''))
198
+ return " ".join(texts)
199
+
200
+
201
+ def format_system_prompt(system: Optional[Union[str, List[TextBlock]]]) -> str:
202
+ """Format system prompt from string or list of blocks"""
203
+ if system is None:
204
+ return ""
205
+ if isinstance(system, str):
206
+ return system
207
+ return " ".join([block.text for block in system if hasattr(block, 'text')])
208
+
209
+
210
+ def format_messages_to_prompt(messages: List[MessageParam], system: Optional[Union[str, List[TextBlock]]] = None) -> str:
211
+ """Convert chat messages to a single prompt string"""
212
+ prompt_parts = []
213
+
214
+ system_text = format_system_prompt(system)
215
+ if system_text:
216
+ prompt_parts.append(f"System: {system_text}\n\n")
217
+
218
+ for msg in messages:
219
+ role = msg.role
220
+ content_text = extract_text_from_content(msg.content)
221
+
222
+ if role == "user":
223
+ prompt_parts.append(f"Human: {content_text}\n\n")
224
+ elif role == "assistant":
225
+ prompt_parts.append(f"Assistant: {content_text}\n\n")
226
+
227
+ prompt_parts.append("Assistant:")
228
+ return "".join(prompt_parts)
229
+
230
+
231
  def generate_text(prompt: str, max_tokens: int, temperature: float, top_p: float) -> tuple:
232
  """Generate text and return (text, input_tokens, output_tokens)"""
233
+ tokenizer = models["tokenizer"]
234
+ model = models["model"]
235
 
236
  inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
237
  input_tokens = inputs["input_ids"].shape[1]
 
239
  with torch.no_grad():
240
  outputs = model.generate(
241
  **inputs,
242
+ max_new_tokens=min(max_tokens, 256), # Limit for CPU
243
  temperature=temperature if temperature > 0 else 1.0,
244
  top_p=top_p,
245
  do_sample=temperature > 0,
 
254
  return generated_text.strip(), input_tokens, output_tokens
255
 
256
 
257
+ async def generate_stream(prompt: str, max_tokens: int, temperature: float, top_p: float, message_id: str, model_name: str):
258
+ """Generate streaming response in Anthropic SSE format"""
259
+ tokenizer = models["tokenizer"]
260
+ model = models["model"]
261
 
262
+ inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
263
+ input_tokens = inputs["input_ids"].shape[1]
264
 
265
+ # Send message_start event
266
+ message_start = {
267
+ "type": "message_start",
268
+ "message": {
269
+ "id": message_id,
270
+ "type": "message",
271
+ "role": "assistant",
272
+ "content": [],
273
+ "model": model_name,
274
+ "stop_reason": None,
275
+ "stop_sequence": None,
276
+ "usage": {"input_tokens": input_tokens, "output_tokens": 0}
277
+ }
278
+ }
279
+ yield f"event: message_start\ndata: {json.dumps(message_start)}\n\n"
280
 
281
+ # Send content_block_start event
282
+ content_block_start = {
283
+ "type": "content_block_start",
284
+ "index": 0,
285
+ "content_block": {"type": "text", "text": ""}
286
+ }
287
+ yield f"event: content_block_start\ndata: {json.dumps(content_block_start)}\n\n"
288
 
289
+ # Generate tokens
290
+ with torch.no_grad():
291
+ outputs = model.generate(
292
+ **inputs,
293
+ max_new_tokens=min(max_tokens, 256),
294
+ temperature=temperature if temperature > 0 else 1.0,
295
+ top_p=top_p,
296
+ do_sample=temperature > 0,
297
+ pad_token_id=tokenizer.pad_token_id,
298
+ eos_token_id=tokenizer.eos_token_id
299
+ )
300
 
301
+ generated_tokens = outputs[0][input_tokens:]
302
+ generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()
303
+ output_tokens = len(generated_tokens)
304
+
305
+ # Stream text in chunks
306
+ chunk_size = 5
307
+ for i in range(0, len(generated_text), chunk_size):
308
+ chunk = generated_text[i:i+chunk_size]
309
+ content_block_delta = {
310
+ "type": "content_block_delta",
311
+ "index": 0,
312
+ "delta": {"type": "text_delta", "text": chunk}
313
+ }
314
+ yield f"event: content_block_delta\ndata: {json.dumps(content_block_delta)}\n\n"
315
+ await asyncio.sleep(0.01) # Small delay for realistic streaming
316
+
317
+ # Send content_block_stop event
318
+ content_block_stop = {"type": "content_block_stop", "index": 0}
319
+ yield f"event: content_block_stop\ndata: {json.dumps(content_block_stop)}\n\n"
320
+
321
+ # Send message_delta event
322
+ message_delta = {
323
+ "type": "message_delta",
324
+ "delta": {"stop_reason": "end_turn", "stop_sequence": None},
325
+ "usage": {"output_tokens": output_tokens}
326
+ }
327
+ yield f"event: message_delta\ndata: {json.dumps(message_delta)}\n\n"
328
+
329
+ # Send message_stop event
330
+ message_stop = {"type": "message_stop"}
331
+ yield f"event: message_stop\ndata: {json.dumps(message_stop)}\n\n"
332
+
333
+
334
+ def handle_tool_call(tools: List[Tool], messages: List[MessageParam], generated_text: str) -> Optional[ToolUseBlock]:
335
+ """Check if the response should trigger a tool call"""
336
+ if not tools:
337
+ return None
338
+
339
+ # Simple heuristic: check if response mentions tool names
340
+ for tool in tools:
341
+ if tool.name.lower() in generated_text.lower():
342
+ return ToolUseBlock(
343
+ type="tool_use",
344
+ id=f"toolu_{uuid.uuid4().hex[:24]}",
345
+ name=tool.name,
346
+ input={}
347
+ )
348
+ return None
349
 
350
 
351
  # ============== Anthropic API Endpoints ==============
352
 
353
+ @app.post("/v1/messages")
354
+ async def create_message(request: AnthropicRequest):
355
  """
356
  Anthropic Messages API compatible endpoint
357
 
358
  POST /v1/messages
359
+
360
+ Supports:
361
+ - Text messages
362
+ - System prompts
363
+ - Streaming responses
364
+ - Tool/function calling
365
+ - Thinking/reasoning blocks
366
  """
367
  try:
368
+ message_id = f"msg_{uuid.uuid4().hex[:24]}"
369
+
370
  # Format messages to prompt
371
  prompt = format_messages_to_prompt(request.messages, request.system)
372
 
373
+ # Handle streaming
374
+ if request.stream:
375
+ return StreamingResponse(
376
+ generate_stream(
377
+ prompt=prompt,
378
+ max_tokens=request.max_tokens,
379
+ temperature=request.temperature or 1.0,
380
+ top_p=request.top_p or 1.0,
381
+ message_id=message_id,
382
+ model_name=request.model
383
+ ),
384
+ media_type="text/event-stream",
385
+ headers={
386
+ "Cache-Control": "no-cache",
387
+ "Connection": "keep-alive",
388
+ "X-Accel-Buffering": "no"
389
+ }
390
+ )
391
+
392
+ # Non-streaming response
393
  generated_text, input_tokens, output_tokens = generate_text(
394
  prompt=prompt,
395
  max_tokens=request.max_tokens,
396
+ temperature=request.temperature or 1.0,
397
  top_p=request.top_p or 1.0
398
  )
399
 
400
+ # Build content blocks
401
+ content_blocks = []
402
+
403
+ # Add thinking block if enabled
404
+ if request.thinking and request.thinking.type == "enabled":
405
+ thinking_text = "Analyzing the user's request and formulating a response..."
406
+ content_blocks.append(ThinkingBlock(type="thinking", thinking=thinking_text))
407
+
408
+ # Check for tool calls
409
+ tool_use = handle_tool_call(request.tools, request.messages, generated_text) if request.tools else None
410
+
411
+ if tool_use:
412
+ content_blocks.append(TextBlock(type="text", text=generated_text))
413
+ content_blocks.append(tool_use)
414
+ stop_reason = "tool_use"
415
+ else:
416
+ content_blocks.append(TextBlock(type="text", text=generated_text))
417
+ stop_reason = "end_turn"
418
+
419
  return AnthropicResponse(
420
+ id=message_id,
421
+ content=content_blocks,
422
+ model=request.model,
423
+ stop_reason=stop_reason,
424
  usage=Usage(input_tokens=input_tokens, output_tokens=output_tokens)
425
  )
426
  except Exception as e:
 
429
 
430
  # ============== OpenAI Compatible Endpoints ==============
431
 
432
+ class ChatMessage(BaseModel):
433
+ role: str
434
+ content: str
435
 
436
+
437
+ class ChatCompletionRequest(BaseModel):
438
+ model: str = "distilgpt2"
439
+ messages: List[ChatMessage]
440
+ max_tokens: Optional[int] = 1024
441
+ temperature: Optional[float] = 0.7
442
+ top_p: Optional[float] = 1.0
443
+ stream: Optional[bool] = False
444
+
445
+
446
+ @app.post("/v1/chat/completions")
447
+ async def chat_completions(request: ChatCompletionRequest):
448
+ """OpenAI Chat Completions API compatible endpoint"""
449
  try:
450
+ # Convert to Anthropic format; fold any OpenAI "system" messages into the system prompt
+ system_text = " ".join(m.content for m in request.messages if m.role == "system") or None
+ anthropic_messages = [
+ MessageParam(role=msg.role, content=msg.content)
+ for msg in request.messages
+ if msg.role in ["user", "assistant"]
+ ]
457
 
458
+ prompt = format_messages_to_prompt(anthropic_messages, system_text)
459
  generated_text, input_tokens, output_tokens = generate_text(
460
  prompt=prompt,
461
  max_tokens=request.max_tokens or 1024,
 
463
  top_p=request.top_p or 1.0
464
  )
465
 
466
+ return {
467
+ "id": f"chatcmpl-{uuid.uuid4().hex[:24]}",
468
+ "object": "chat.completion",
469
+ "created": int(time.time()),
470
+ "model": request.model,
471
+ "choices": [{
472
+ "index": 0,
473
+ "message": {"role": "assistant", "content": generated_text},
474
+ "finish_reason": "stop"
475
+ }],
476
+ "usage": {
 
477
  "prompt_tokens": input_tokens,
478
  "completion_tokens": output_tokens,
479
  "total_tokens": input_tokens + output_tokens
480
  }
481
+ }
482
  except Exception as e:
483
  raise HTTPException(status_code=500, detail=str(e))
484
 
485
 
486
+ @app.get("/v1/models")
487
  async def list_models():
488
+ """List available models"""
489
+ return {
490
+ "object": "list",
491
+ "data": [
492
+ {"id": "MiniMax-M2", "object": "model", "created": int(time.time()), "owned_by": "local"},
493
+ {"id": "MiniMax-M2-Stable", "object": "model", "created": int(time.time()), "owned_by": "local"},
494
+ {"id": GENERATOR_MODEL, "object": "model", "created": int(time.time()), "owned_by": "local"}
495
  ]
496
+ }
497
 
498
 
499
+ # ============== Utility Endpoints ==============
500
 
501
  @app.get("/")
502
  async def root():
 
506
  "hardware": "CPU Basic: 2 vCPU · 16 GB RAM",
507
  "docs": "/docs",
508
  "api_endpoints": {
509
+ "anthropic_messages": "POST /v1/messages",
510
+ "openai_chat": "POST /v1/chat/completions",
511
+ "models": "GET /v1/models"
512
  },
513
+ "supported_features": [
514
+ "text messages",
515
+ "system prompts",
516
+ "streaming responses",
517
+ "tool/function calling",
518
+ "thinking blocks",
519
+ "metadata"
520
+ ]
521
  }
522
 
523
 
524
+ @app.get("/health")
525
  async def health():
526
  """Health check endpoint"""
527
+ return {
528
+ "status": "healthy",
529
+ "timestamp": datetime.utcnow().isoformat(),
530
+ "hardware": "CPU Basic: 2 vCPU · 16 GB RAM",
531
+ "models_loaded": len(models) > 0
532
+ }
533
 
534
 
535
  @app.get("/info")
536
  async def info():
537
+ """API information"""
538
  return {
539
  "name": "Docker Model Runner",
540
  "version": "1.0.0",
541
  "api_compatibility": ["anthropic", "openai"],
542
+ "supported_models": ["MiniMax-M2", "MiniMax-M2-Stable"],
543
+ "supported_parameters": {
544
+ "fully_supported": ["model", "messages", "max_tokens", "stream", "system", "temperature", "top_p", "tools", "tool_choice", "metadata", "thinking"],
545
+ "ignored": ["top_k", "stop_sequences", "service_tier"]
 
546
  },
547
+ "message_types": {
548
+ "supported": ["text", "tool_use", "tool_result", "thinking"],
549
+ "not_supported": ["image", "document"]
 
 
 
550
  }
551
  }
552
 
553
 
554
  if __name__ == "__main__":
555
  import uvicorn
556
  uvicorn.run(app, host="0.0.0.0", port=7860)