LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents Paper • 2311.05437 • Published Nov 9, 2023 • 51
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents Paper • 2410.23218 • Published Oct 30, 2024 • 49
ShowUI: One Vision-Language-Action Model for GUI Visual Agent Paper • 2411.17465 • Published Nov 26, 2024 • 90
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks Paper • 2501.11733 • Published Jan 20 • 28
Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills Paper • 2503.12533 • Published Mar 16 • 68
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks Paper • 2503.21696 • Published Mar 27 • 23
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning Paper • 2503.21620 • Published Mar 27 • 62
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions Paper • 2505.06111 • Published May 9 • 25
Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets Paper • 2505.15517 • Published May 21 • 4
Interactive Post-Training for Vision-Language-Action Models Paper • 2505.17016 • Published May 22 • 6
InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction Paper • 2505.10887 • Published May 16 • 10
Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers Paper • 2505.21497 • Published May 27 • 109
VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection Paper • 2505.20289 • Published May 26 • 10
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence Paper • 2505.23747 • Published May 29 • 68
Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents Paper • 2505.24878 • Published May 30 • 22
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics Paper • 2506.01844 • Published Jun 2 • 143
LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks Paper • 2506.00411 • Published May 31 • 31
VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments Paper • 2506.02387 • Published Jun 3 • 58
Multimodal DeepResearcher: Generating Text-Chart Interleaved Reports From Scratch with Agentic Framework Paper • 2506.02454 • Published Jun 3 • 7
SAFE: Multitask Failure Detection for Vision-Language-Action Models Paper • 2506.09937 • Published Jun 11 • 9
Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts Paper • 2506.10357 • Published Jun 12 • 21
VideoDeepResearch: Long Video Understanding With Agentic Tool Using Paper • 2506.10821 • Published Jun 12 • 19
BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models Paper • 2506.07961 • Published Jun 9 • 11
EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models Paper • 2506.10100 • Published Jun 11 • 9
From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models Paper • 2506.09930 • Published Jun 11 • 8
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective Paper • 2507.01925 • Published Jul 2 • 38
PresentAgent: Multimodal Agent for Presentation Video Generation Paper • 2507.04036 • Published Jul 5 • 10
A Survey on Vision-Language-Action Models for Autonomous Driving Paper • 2506.24044 • Published Jun 30 • 14
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning Paper • 2507.16815 • Published Jul 22 • 39
ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents Paper • 2507.22827 • Published Jul 30 • 99
villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models Paper • 2507.23682 • Published Jul 31 • 23
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning Paper • 2503.15558 • Published Mar 18 • 50
InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation Paper • 2507.17520 • Published Jul 23 • 14
RoboMemory: A Brain-inspired Multi-memory Agentic Framework for Lifelong Learning in Physical Embodied Systems Paper • 2508.01415 • Published Aug 2 • 7
OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use Paper • 2508.04482 • Published Aug 6 • 9
MolmoAct: Action Reasoning Models that can Reason in Space Paper • 2508.07917 • Published Aug 11 • 44
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents Paper • 2508.13186 • Published Aug 14 • 18
UI-Venus Technical Report: Building High-performance UI Agents with RFT Paper • 2508.10833 • Published Aug 14 • 44
Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory Paper • 2508.09736 • Published Aug 13 • 57
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent Paper • 2508.05748 • Published Aug 7 • 141
Do What? Teaching Vision-Language-Action Models to Reject the Impossible Paper • 2508.16292 • Published Aug 22 • 9
CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification Paper • 2508.21046 • Published Aug 28 • 9
Mind the Third Eye! Benchmarking Privacy Awareness in MLLM-powered Smartphone Agents Paper • 2508.19493 • Published Aug 27 • 11
EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control Paper • 2508.21112 • Published Aug 28 • 77
UItron: Foundational GUI Agent with Advanced Perception and Planning Paper • 2508.21767 • Published Aug 29 • 12
Robix: A Unified Model for Robot Interaction, Reasoning and Planning Paper • 2509.01106 • Published Sep 1 • 49
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning Paper • 2509.02544 • Published Sep 2 • 124
VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model Paper • 2509.09372 • Published Sep 11 • 239
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning Paper • 2509.09674 • Published Sep 11 • 80
F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions Paper • 2509.06951 • Published Sep 8 • 31
UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning Paper • 2509.11543 • Published Sep 15 • 47
Video2Roleplay: A Multimodal Dataset and Framework for Video-Guided Role-playing Agents Paper • 2509.15233 • Published Sep 17 • 2
A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning Paper • 2509.15937 • Published Sep 19 • 20
RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation Paper • 2509.15212 • Published Sep 18 • 21
D-Artemis: A Deliberative Cognitive Framework for Mobile GUI Multi-Agents Paper • 2509.21799 • Published Sep 26 • 8
Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation Paper • 2509.23866 • Published Sep 28 • 13
PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images Paper • 2509.25185 • Published Sep 29 • 4
UniMIC: Token-Based Multimodal Interactive Coding for Human-AI Collaboration Paper • 2509.22570 • Published Sep 26 • 3
VLA-R1: Enhancing Reasoning in Vision-Language-Action Models Paper • 2510.01623 • Published Oct 2 • 10
VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators Paper • 2510.00406 • Published Oct 1 • 65
Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning Paper • 2510.14300 • Published Oct 16 • 11
VLA^2: Empowering Vision-Language-Action Models with an Agentic Framework for Unseen Concept Manipulation Paper • 2510.14902 • Published Oct 16 • 15
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy Paper • 2510.13778 • Published Oct 15 • 16
X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model Paper • 2510.10274 • Published Oct 11 • 14
DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search Paper • 2510.12801 • Published Oct 14 • 13
Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning Paper • 2510.11027 • Published Oct 13 • 21
ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding Paper • 2510.11498 • Published Oct 13 • 10
GigaBrain-0: A World Model-Powered Vision-Language-Action Model Paper • 2510.19430 • Published Oct 22 • 48
DaMo: Data Mixing Optimizer in Fine-tuning Multimodal LLMs for Mobile Phone Agents Paper • 2510.19336 • Published Oct 22 • 16
PhysVLM-AVR: Active Visual Reasoning for Multimodal Large Language Models in Physical Environments Paper • 2510.21111 • Published Oct 24 • 2
π_RL: Online RL Fine-tuning for Flow-based Vision-Language-Action Models Paper • 2510.25889 • Published Oct 29 • 64
RoboOmni: Proactive Robot Manipulation in Omni-modal Context Paper • 2510.23763 • Published Oct 27 • 53
From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors Paper • 2510.17439 • Published Oct 20 • 26
World Simulation with Video Foundation Models for Physical AI Paper • 2511.00062 • Published Oct 28 • 40
Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process Paper • 2511.01718 • Published Nov 3 • 6
Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization Paper • 2510.25616 • Published Oct 29 • 96
UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist Paper • 2511.08521 • Published Nov 2025 • 37
AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models Paper • 2511.10017 • Published Nov 2025 • 6
WMPO: World Model-based Policy Optimization for Vision-Language-Action Models Paper • 2511.09515 • Published Nov 2025 • 17
WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation Paper • 2511.06251 • Published Nov 2025 • 13
Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution Paper • 2511.14210 • Published Nov 2025 • 19
MiMo-Embodied: X-Embodied Foundation Model Technical Report Paper • 2511.16518 • Published Nov 2025 • 23
RynnVLA-002: A Unified Vision-Language-Action and World Model Paper • 2511.17502 • Published Nov 2025 • 24
Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight Paper • 2511.16175 • Published Nov 2025 • 12
VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation Paper • 2511.17199 • Published Nov 2025 • 7
MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots Paper • 2511.17889 • Published Nov 2025 • 5
Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning Paper • 2511.19900 • Published Nov 2025 • 46
Agentic Learner with Grow-and-Refine Multimodal Semantic Memory Paper • 2511.21678 • Published Nov 2025 • 10
DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action Paper • 2511.22134 • Published Nov 2025 • 21
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling Paper • 2511.20785 • Published Nov 2025 • 148
VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference Paper • 2512.01031 • Published Dec 2025 • 22
WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning Paper • 2512.02425 • Published Dec 2025 • 22
SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead Paper • 2512.00903 • Published Dec 2025 • 5
SIMA 2: A Generalist Embodied Agent for Virtual Worlds Paper • 2512.04797 • Published Dec 2025 • 15
Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach Paper • 2512.02834 • Published Dec 2025 • 37