---
language:
- en
license: apache-2.0
tags:
- math
- reasoning
- agent
- qwen
- grpo
- reinforcement-learning
base_model: Qwen/Qwen3-4B-Thinking-2507
datasets:
- nvidia/OpenMathReasoning
metrics:
- accuracy
library_name: transformers
pipeline_tag: text-generation
---

# DeepMath: A Lightweight Math Reasoning Agent

*An LLM using a calculator to answer questions.*

## Model Description

**DeepMath** is a 4B-parameter mathematical reasoning model that combines a fine-tuned LLM with a sandboxed Python executor. Built on [Qwen3-4B Thinking](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507) and trained with **GRPO (Group Relative Policy Optimization)**, DeepMath generates concise Python snippets for computational steps instead of verbose text explanations, significantly reducing errors and output length.

- **Developed by:** Intel AI Labs
- **Model type:** Causal language model with agent capabilities
- **Language:** English
- **Base model:** [Qwen3-4B Thinking](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507)
- **License:** Apache 2.0
- **Blog:** 🔗
- **Repository:** 💻 [https://github.com/IntelLabs/DeepMath](https://github.com/IntelLabs/DeepMath)

## Key Features

✅ **Code-driven reasoning:** Generates short Python snippets for intermediate computational steps
✅ **Sandboxed execution:** No file I/O, no network calls, strict timeouts
✅ **Improved accuracy:** Offloading computation reduces arithmetic errors
✅ **Reduced verbosity:** Up to 66% shorter outputs compared to the baseline
✅ **Safe and auditable:** Deterministic execution with readable code snippets

## Model Architecture

DeepMath uses a LoRA adapter fine-tuned on top of Qwen3-4B Thinking with the following components:

- **Agent Interface:** Outputs special tokens for Python code execution during reasoning
- **Executor:** Sandboxed Python environment with allow-listed modules (a minimal sketch follows below)
- **Safety Constraints:** Per-snippet timeouts, no file/network access
- **Training Method:** GRPO with accuracy and code-generation rewards
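DeepMath's actual executor builds on SmolAgents (see Training Infrastructure below). For intuition only, here is a minimal, hypothetical sketch of a snippet runner with an import allow-list and a per-snippet timeout; the module set, limits, and error strings are illustrative assumptions, not DeepMath's real configuration:

```python
import ast
import subprocess
import sys

# Hypothetical allow-list; DeepMath's actual module set lives in its executor.
ALLOWED_MODULES = {"math", "fractions", "itertools", "sympy"}

def run_snippet(code: str, timeout_s: float = 5.0) -> str:
    """Run a model-generated snippet with an import allow-list and a hard timeout."""
    # Static check: reject any import outside the allow-list before executing.
    try:
        tree = ast.parse(code)
    except SyntaxError as err:
        return f"SyntaxError: {err}"
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            continue
        if any(name not in ALLOWED_MODULES for name in names):
            return "ImportError: module not allow-listed"
    # Execute in a separate interpreter (-I: isolated mode) so a hard timeout
    # can be enforced and the parent process stays untouched.
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "TimeoutError: snippet exceeded the time limit"
    return proc.stdout if proc.returncode == 0 else proc.stderr

print(run_snippet("import math\nprint(math.comb(100, 50))"))
```

Running each snippet in a separate interpreter process, rather than `exec` in-process, is what makes a hard timeout and isolation enforceable.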
*Figure 1: The vLLM client and server in the TRL library were modified so that the DeepMath agent generates the candidate completions while using the vLLM backend.*

## Training Details

### Training Data

- **Dataset:** [OpenMathReasoning](https://huggingface.co/datasets/nvidia/OpenMathReasoning) (tool-usage subset)
- **Note:** GRPO training uses only the problems, not the solutions
- **In-context Learning:** 4 solved examples demonstrating agent call syntax and patterns

### Training Procedure

**GRPO (Group Relative Policy Optimization)** fine-tuning with:

- **Accuracy Reward:** +1 for a correct answer
- **Code Generation Reward:** +1 for using code snippets (weighted 10:1 vs. accuracy)
- **Length Constraint:** GRPO completions limited to 5k tokens
- **Temperature Scheduling:** Linear schedule from T=1.2 → T=0.7 during training
- **Infrastructure:** Modified vLLM client and server in the TRL library

A toy sketch of this reward shaping and temperature schedule appears at the end of this section.

### Training Infrastructure

- Base inference engine: [vLLM](https://github.com/vllm-project/vllm)
- Agent framework: Based on [SmolAgents](https://github.com/huggingface/smolagents/)
- Training framework: Modified [TRL](https://github.com/huggingface/trl) GRPO trainer

## Performance

### Benchmark Results

We evaluated DeepMath on four mathematical reasoning datasets, using **majority@16** accuracy and mean output length as metrics:

*Main results table: performance across MATH500, AIME, HMMT, and HLE.*

**Key Findings:**

- **Accuracy:** Improved performance on the more challenging datasets (AIME, HMMT, HLE)
- **Efficiency:** Up to a **66% reduction** in output length
- **Robustness:** Consistent improvements when combining the agent with GRPO training

### Evaluation Datasets

- **MATH500:** A 500-problem subset of the MATH dataset
- **AIME:** American Invitational Mathematics Examination problems
- **HMMT:** Harvard-MIT Mathematics Tournament problems
- **HLE:** Humanity's Last Exam problems
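For concreteness, here is the toy sketch of the reward shaping and temperature schedule described under Training Procedure above. The function names are ours, and the assumption that the 10:1 weighting favors accuracy over code use is our reading of the ratio; the actual reward functions live in the modified TRL trainer:

```python
def reward(answer_correct: bool, used_code: bool,
           w_acc: float = 1.0, w_code: float = 0.1) -> float:
    """Combined GRPO reward: +1 for a correct answer, +1 for emitting code.
    The 10:1 accuracy-to-code weighting here is an assumed interpretation."""
    return w_acc * float(answer_correct) + w_code * float(used_code)

def temperature(step: int, total_steps: int,
                t_start: float = 1.2, t_end: float = 0.7) -> float:
    """Linear sampling-temperature schedule over the course of training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return t_start + frac * (t_end - t_start)

print(reward(True, True))       # 1.1: correct answer that used code
print(temperature(0, 1000))     # 1.2 at the start of training
print(temperature(500, 1000))   # 0.95 halfway through
print(temperature(1000, 1000))  # 0.7 at the end
```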
*Figure 2: Example output in which Python code is generated, evaluated, and the result is inserted back into the reasoning trace.*
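The loop behind Figure 2 can be pictured as alternating between generation and sandboxed execution. The sketch below is an illustration under assumed markers (`<code>`, `</code>`, `<output>`), not DeepMath's actual special tokens; `generate` stands in for an LLM call that stops after a closing code marker, and `run_snippet` is an executor like the one sketched earlier:

```python
CODE_OPEN, CODE_CLOSE = "<code>", "</code>"  # assumed markers, not the real special tokens

def solve(prompt: str, generate, run_snippet, max_steps: int = 50) -> str:
    """Alternate LLM generation with sandboxed execution until a final answer."""
    trace = prompt
    for _ in range(max_steps):
        chunk = generate(trace)  # continues the trace, stopping after CODE_CLOSE
        trace += chunk
        if CODE_CLOSE not in chunk:
            break  # no code was requested: the model produced its final answer
        # Extract the newest snippet, run it, and splice the result back in,
        # so the next generation step can read the computed value.
        snippet = chunk.rsplit(CODE_OPEN, 1)[-1].split(CODE_CLOSE, 1)[0]
        trace += f"\n<output>{run_snippet(snippet)}</output>\n"
    return trace
```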

## Usage

### Installation

```bash
# Install the uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/IntelLabs/DeepMath.git
cd DeepMath

# Install dependencies
uv pip install -r requirements.txt
uv pip install -e .
```

### Basic Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Intel/deepmath-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example problem, formatted with the chat template expected by the Qwen3 base model
problem = "What is the sum of the first 100 positive integers?"
messages = [{"role": "user", "content": problem}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=3000)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Inference with Agent

For full agent capabilities with sandboxed Python execution:

```bash
python inference.py \
    +model.use_vllm=true \
    +model.math_agent=true \
    +model.examples=deep_math/fewshot.txt \
    model.generation.max_new_tokens=3000 \
    +model.max_agent_output=20000 \
    +model.max_steps=50 \
    model.model_name_or_path=Intel/deepmath-v1 \
    hf_tag=HuggingFaceH4/MATH-500 \
    generated_file=output.jsonl
```

See the [repository](https://github.com/IntelLabs/DeepMath) for complete usage examples.

## Limitations and Biases

### Limitations

- **Scope:** Optimized for mathematical reasoning tasks; may not generalize to other domains
- **Problem Types:** Evaluated on contest-style math problems; performance on open-ended mathematical creativity or formal proofs is unknown
- **Model Size:** 4B parameters may limit reasoning depth on extremely complex problems
- **Code Execution:** Requires a sandboxed environment for full agent capabilities

### Safety Considerations

⚠️ **Code Execution Risk:** This model generates and executes Python code. While DeepMath uses strict sandboxing and resource limits, any deployment should:

- Carefully manage attack surfaces
- Enforce rate limits
- Use proper isolation (containers, VMs)
- Monitor resource usage
- Validate generated code before execution in production

### Ethical Considerations

- The model is trained on mathematical problem-solving datasets and should not be used for decision-making in critical applications without human oversight
- Generated code should be reviewed before execution in production environments
- The model may reflect biases present in the training data

## Citation

If you use DeepMath in your research, please cite:

```bibtex
@software{deepmath2025,
  author    = {Fleischer, Daniel and Berchansky, Moshe and Wasserblat, Moshe},
  title     = {DeepMath: A Lightweight Math Reasoning Agent for LLMs},
  year      = {2025},
  publisher = {Intel AI Labs},
  url       = {https://github.com/IntelLabs/DeepMath}
}
```

## Model Card Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/IntelLabs/DeepMath).