# User Story 009: Web Worker Architecture (Main-thread Safe Web Library)
## Story
**As a** user building robotics UIs that also render live camera previews and interactive controls
**I want** `@lerobot/web` to run heavy control/recording work off the main thread
**So that** my UI stays smooth (no flicker/jank) even when teleoperation and recording are active
## Background
The current browser implementation runs teleoperation control loops, dataset assembly, and export logic on the main thread. When activating keyboard teleoperation while previewing a camera stream, the preview can flicker due to main-thread contention. This is a UX blocker for real-world apps that combine live video, UI interactions, and hardware control.
A worker-based architecture lets us move CPU-intensive, frequent, or bursty work off the main thread. The main thread remains responsible for DOM, video rendering and user interactions. The library must preserve the existing API (`calibrate()`, `teleoperate()`, `record()`) while transparently using workers when available, and cleanly falling back to the current approach otherwise.
## Goals
- Identical public API to today’s `@lerobot/web` (no breaking changes)
- Main-thread safe by default: heavy or frequent work executes in a Web Worker
- Graceful fallback when workers or specific APIs aren’t available
- Type-safe, minimal-copy message protocol using Transferables when possible
- Strict library/demo separation: UI and storage remain in demos
- Maintain Python lerobot UX parity and behavior
## Non-Goals (for this story)
- Changing dataset formats or camera acquisition approach
- Moving Web Serial API usage into a worker (browser support for Web Serial inside workers is limited)
- Introducing new external dependencies
## Acceptance Criteria
- Smooth UI under load:
  - With at least one active camera preview and keyboard teleoperation at 60–120 Hz, the preview does not flicker and UI remains responsive at ~60 FPS
- API compatibility:
  - `calibrate()`, `teleoperate()`, `record()` signatures and return shapes are unchanged
  - Feature-detect workers; automatically use worker-backed runtime when available, otherwise use current main-thread runtime
- Clear separation of responsibilities:
  - Worker executes control loops, interpolation, dataset assembly, export packaging, and CPU-heavy transforms
  - Main thread owns DOM/UI and browser-only APIs that are unavailable in workers (e.g., Web Serial write calls)
- Type-safe protocol:
  - Strongly typed request/response messages with versioned `type` fields; Transferable payloads used for large data
- Reliability & fallback:
  - If the worker crashes or becomes unavailable, operations fail gracefully with descriptive errors and suggest retry
  - Fallback path (main-thread) is automatically used when worker creation fails
- Tests & docs:
  - Unit tests cover protocol routing and basic round-trips
  - Planning docs updated; README notes main-thread-safe architecture
## Architecture Overview
### Worker Boundaries
- Execute in Worker:
  - Control loop scheduling and target computation for teleoperation (keyboard/direct and future teleoperators)
  - Episode/frame buffering and interpolation (regularization) for recording
  - Dataset assembly (tables/metadata), packaging (ZIP writer), and background export streaming
  - Lightweight telemetry aggregation for UI
- Execute on Main Thread:
  - DOM, UI, and camera previews (`<video>` elements)
  - Web Serial API read/write bridge (if browser does not permit worker access)
  - MediaRecorder handling (the browser's native implementation already does its heavy work off the main thread in many engines)
### Threading Model
- Main thread spawns one worker per “process” instance as needed:
  - TeleoperationProcess → TeleopWorker
  - RecordProcess → RecordWorker (can be shared or composed with teleop worker depending on lifecycle)
- The public process objects returned from `teleoperate()`/`record()` are proxies. Method calls post messages to the worker and return promises where appropriate.
- SerialBridge (main-thread): worker requests motor write/read; main thread performs Web Serial operations and returns results. This preserves worker advantages while respecting browser API constraints.
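
The proxy behaviour described above (method calls post a message and return a promise) could be sketched as follows. This is a minimal illustration, not the library's implementation: `PortLike`, `RequestClient`, and the `{ ok, result, error }` response shape are hypothetical names chosen for the example.

```typescript
// Sketch only: PortLike, RequestClient and the response shape are assumptions.
interface PortLike {
  postMessage(message: unknown): void;
}

interface WorkerResponse {
  requestId: number;
  ok: boolean;
  result?: unknown;
  error?: string;
}

interface Pending {
  resolve: (value: unknown) => void;
  reject: (reason: Error) => void;
}

class RequestClient {
  private nextId = 1;
  private readonly pending = new Map<number, Pending>();
  private readonly port: PortLike;

  constructor(port: PortLike) {
    this.port = port;
  }

  // Post a request and return a promise settled by the matching response.
  request(type: string, payload: unknown = {}): Promise<unknown> {
    const requestId = this.nextId++;
    return new Promise((resolve, reject) => {
      this.pending.set(requestId, { resolve, reject });
      this.port.postMessage({ type, requestId, payload });
    });
  }

  // Call from the main thread's message handler for each worker response.
  // Returns false when the requestId is unknown or already settled.
  handleResponse(msg: WorkerResponse): boolean {
    const entry = this.pending.get(msg.requestId);
    if (!entry) return false;
    this.pending.delete(msg.requestId);
    if (msg.ok) entry.resolve(msg.result);
    else entry.reject(new Error(msg.error ?? "worker request failed"));
    return true;
  }

  get pendingCount(): number {
    return this.pending.size;
  }
}
```

A proxy process would hold one such client per worker, so that a call like `process.stop()` becomes `client.request("teleop/stop")` internally.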
### Message Protocol (Typed)
All messages include a discriminant `type` and a `requestId` when a response is expected.
- Teleoperation (examples):
  - `teleop/start`, `teleop/stop`
  - `teleop/update_key_state` { key, pressed }
  - `teleop/move_motor` { motorName, position }
  - `teleop/state_update` { motorConfigs, keyStates, lastUpdate } (worker → main)
  - `serial/write_position` { id, position } (worker → main) → `serial/ack`
- Recording (examples):
  - `record/start`, `record/stop`, `record/next_episode`
  - `record/frame_append` { payload transferable }
  - `record/export_zip` { options } → streaming progress events
- Error & lifecycle:
  - `worker/error`, `worker/ready`, `worker/teardown`
Use Transferables (ArrayBuffer/MessagePort) for large payloads to avoid copies.
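
As a rough sketch of how `protocol.ts` might type a few of these messages, the example below uses a discriminated union on `type`, an envelope guard, and a helper that lists the Transferables of a message. The `v` version field, `PROTOCOL_VERSION`, `isWorkerMessage`, and `transferablesOf` are assumptions made for illustration; only the `type` strings come from the list above.

```typescript
// Sketch only: the `v` field and all helper names are illustrative assumptions.
const PROTOCOL_VERSION = 1;

type WorkerMessage =
  | { v: number; type: "teleop/start"; requestId: number }
  | { v: number; type: "teleop/stop"; requestId: number }
  | { v: number; type: "teleop/update_key_state"; key: string; pressed: boolean }
  | { v: number; type: "teleop/move_motor"; motorName: string; position: number }
  | { v: number; type: "record/frame_append"; payload: ArrayBuffer }
  | { v: number; type: "worker/error"; message: string };

const KNOWN_TYPES: ReadonlySet<string> = new Set([
  "teleop/start",
  "teleop/stop",
  "teleop/update_key_state",
  "teleop/move_motor",
  "record/frame_append",
  "worker/error",
]);

// Envelope check only: verifies the version and a known `type`.
// Per-type field validation is omitted to keep the sketch short.
function isWorkerMessage(value: unknown): value is WorkerMessage {
  if (typeof value !== "object" || value === null) return false;
  const msg = value as { v?: unknown; type?: unknown };
  return (
    msg.v === PROTOCOL_VERSION &&
    typeof msg.type === "string" &&
    KNOWN_TYPES.has(msg.type)
  );
}

// Buffers that should be transferred (moved) rather than copied.
function transferablesOf(msg: WorkerMessage): ArrayBuffer[] {
  switch (msg.type) {
    case "record/frame_append":
      return [msg.payload];
    default:
      return [];
  }
}
```

A sender would then call `worker.postMessage(msg, transferablesOf(msg))`. After the call, a transferred `ArrayBuffer` is detached on the sending side, so the sender must not reuse it.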
### File Structure (web package)
```
packages/web/src/
├── workers/
│   ├── teleop.worker.ts      # Teleoperation control loop
│   ├── record.worker.ts      # Recording assembly/export
│   ├── protocol.ts           # Message types & guards
│   └── utils.worker.ts       # Worker-side helpers (interpolation, zip)
├── bridges/
│   └── serial-bridge.ts      # Main-thread serial proxy for workers
├── teleoperate.ts            # Spawns worker, returns proxy process
├── record.ts                 # Spawns worker, returns proxy process
└── types/
    └── worker.ts             # Public worker-related types (narrow)
```
### Lifecycle & Fallback
- On `teleoperate()`/`record()` call:
  - Try to instantiate the corresponding worker via `new Worker(new URL(...), { type: 'module' })`
  - On success: wire protocol channels and return a proxy-backed process
  - On failure: fall back to the current main-thread implementation (no behavioral changes)
- On `process.stop()` or page unload: send `worker/teardown` and terminate the worker
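
The try-then-fall-back step could look roughly like the sketch below. `createRuntime`, `detectWorkerSupport`, and the `Runtime` shape are hypothetical; the two factory callbacks stand in for the worker-backed and main-thread implementations so the selection logic can be exercised without a browser.

```typescript
// Sketch only: all names here are assumptions, not existing @lerobot/web APIs.
interface Runtime<P> {
  kind: "worker" | "main-thread";
  process: P;
}

// Feature detection: is a global Worker constructor present?
function detectWorkerSupport(): boolean {
  const g = globalThis as unknown as { Worker?: unknown };
  return typeof g.Worker === "function";
}

function createRuntime<P>(
  createWorkerBacked: () => P, // would call `new Worker(...)` and wire the protocol
  createMainThread: () => P, // the current main-thread implementation
  workersSupported: boolean = detectWorkerSupport()
): Runtime<P> {
  if (workersSupported) {
    try {
      return { kind: "worker", process: createWorkerBacked() };
    } catch (err) {
      // Worker construction can throw (for example under a restrictive CSP).
      console.warn("Worker creation failed, falling back to main thread:", err);
    }
  }
  return { kind: "main-thread", process: createMainThread() };
}
```

Because both factories return the same process type `P`, callers of `teleoperate()`/`record()` would not see which path was taken.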
### Performance Notes
- Control loop cadence generated inside worker to avoid main-thread timers
- Batch serial commands from worker to main-thread bridge to minimize postMessage overhead
- Use coarse-to-fine update: high-rate calculations in worker; lower-rate UI state updates to main thread (e.g., 10–20 Hz) for rendering
- For export, stream chunks from worker; main thread triggers download or HF upload
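
One possible shape for the coarse-to-fine update is a small rate limiter inside the worker: the control loop calls it on every tick, and it forwards state to the main thread only at the lower UI rate. `StateThrottle` is a hypothetical helper, and the clock is passed in explicitly so the behaviour is deterministic.

```typescript
// Sketch only: StateThrottle is a hypothetical helper, not a library API.
class StateThrottle<T> {
  private lastEmitMs = Number.NEGATIVE_INFINITY;
  private readonly minIntervalMs: number;
  private readonly emit: (state: T) => void;

  constructor(maxRateHz: number, emit: (state: T) => void) {
    this.minIntervalMs = 1000 / maxRateHz;
    this.emit = emit;
  }

  // Called on every control tick. Returns true when the state was forwarded.
  // States pushed between emissions are dropped; the next tick supplies a fresh one.
  push(state: T, nowMs: number): boolean {
    if (nowMs - this.lastEmitMs < this.minIntervalMs) return false;
    this.lastEmitMs = nowMs;
    this.emit(state);
    return true;
  }
}
```

In a worker, `emit` would typically wrap `postMessage` with a `teleop/state_update` message and `nowMs` would come from `performance.now()`.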
### Error Handling
- All request/response messages enforce timeouts with descriptive errors
- Worker initialization guarded with feature detection and clear fallback
- Protocol version field enables future evolution without breaking older callers
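
A request timeout with a descriptive error might be implemented along these lines. `withTimeout` and the error wording are assumptions for the sketch; it only relies on the standard `Promise.race`, `setTimeout`, and `clearTimeout`.

```typescript
// Sketch only: withTimeout and its error text are illustrative assumptions.
function withTimeout<T>(
  promise: Promise<T>,
  timeoutMs: number,
  label: string
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      reject(
        new Error(
          `${label} timed out after ${timeoutMs} ms; ` +
            "the worker may be busy or unavailable, please retry"
        )
      );
    }, timeoutMs);
  });
  // Whichever settles first wins; the timer is always cleaned up afterwards.
  return Promise.race([promise, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}
```

Note that a timeout only rejects the caller's promise. It does not cancel work already running in the worker, so a real implementation would also need to drop the pending request entry.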
## Phased Implementation Plan
### Phase 1: Dataset & Export Offload (Low Risk)
- Move episode interpolation, dataset assembly, and ZIP packaging to `record.worker.ts`
- Main thread keeps MediaRecorder and camera preview as-is
- Public API unchanged; verify ZIP download and HF upload via streamed messages
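
To make the interpolation step concrete, here is a minimal sketch of regularizing irregularly timed frames onto a fixed frame rate. It assumes linear interpolation between neighbouring samples, which this story does not specify; `Frame` and `resample` are hypothetical names.

```typescript
// Sketch only: linear interpolation is an assumption; names are hypothetical.
interface Frame {
  t: number; // timestamp in seconds
  values: number[]; // e.g. one entry per motor
}

// Resample frames (sorted by t) onto a regular grid starting at the first frame.
function resample(frames: Frame[], fps: number): Frame[] {
  if (frames.length === 0) return [];
  const start = frames[0].t;
  const end = frames[frames.length - 1].t;
  // Small epsilon so a grid point landing exactly on `end` is not lost to rounding.
  const count = Math.floor((end - start) * fps + 1e-9) + 1;
  const out: Frame[] = [];
  let j = 0; // index of the segment [frames[j], frames[j + 1]] containing t
  for (let i = 0; i < count; i++) {
    const t = start + i / fps;
    while (j < frames.length - 2 && frames[j + 1].t <= t) j++;
    const a = frames[j];
    const b = frames[Math.min(j + 1, frames.length - 1)];
    const span = b.t - a.t;
    const w = span > 0 ? Math.min(1, Math.max(0, (t - a.t) / span)) : 0;
    out.push({ t, values: a.values.map((v, k) => v + w * (b.values[k] - v)) });
  }
  return out;
}
```

Since the function is pure and works on plain arrays, it can run unchanged inside `record.worker.ts` or on the main-thread fallback path.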
### Phase 2: Teleoperation Offload with SerialBridge
- Move control loop scheduling and target computation to `teleop.worker.ts`
- Implement SerialBridge on main thread for Web Serial commands
- Worker posts motor write requests; main thread executes and responds
- Throttle state updates to UI while maintaining high-rate control internally
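
The worker-to-bridge write requests could be coalesced once per control tick so that only the latest target per motor crosses the thread boundary. The sketch below assumes a batched `serial/write_positions` message, which is a hypothetical variant of the `serial/write_position` message in the protocol; `CommandBatcher` is likewise illustrative.

```typescript
// Sketch only: `serial/write_positions` (plural) and CommandBatcher are assumptions.
interface PositionCommand {
  id: number;
  position: number;
}

interface BatchMessage {
  type: "serial/write_positions";
  commands: PositionCommand[];
}

class CommandBatcher {
  // Latest requested position per motor id; later writes overwrite earlier ones.
  private readonly latest = new Map<number, number>();
  private readonly post: (msg: BatchMessage) => void;

  constructor(post: (msg: BatchMessage) => void) {
    this.post = post;
  }

  queue(id: number, position: number): void {
    this.latest.set(id, position);
  }

  // Send everything queued as a single message. Returns the number of commands sent.
  flush(): number {
    if (this.latest.size === 0) return 0;
    const commands = Array.from(this.latest, ([id, position]) => ({ id, position }));
    this.latest.clear();
    this.post({ type: "serial/write_positions", commands });
    return commands.length;
  }
}
```

The worker would call `queue()` while computing targets and `flush()` once at the end of each tick, giving one `postMessage` per tick regardless of motor count.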
### Phase 3: Fine-Grained Optimizations
- Introduce Transferables for large buffers
- Optional OffscreenCanvas pipelines for future video transforms (not required for current scope)
- Tune batching and message cadence under hardware testing
### Phase 4: Reliability & Observability
- Heartbeat messages and auto-restart policy for worker failures
- Dev diagnostics toggles; production minimal logging
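
A heartbeat check with a bounded restart policy might be as simple as the watchdog below. The timeout, the restart limit, and the three-way result are assumptions for illustration; the actual policy is left open by this story.

```typescript
// Sketch only: HeartbeatWatchdog and its policy are illustrative assumptions.
type WatchdogAction = "ok" | "restart" | "give-up";

class HeartbeatWatchdog {
  private lastBeatMs: number;
  private restarts = 0;
  private readonly timeoutMs: number;
  private readonly maxRestarts: number;

  constructor(timeoutMs: number, maxRestarts: number, nowMs: number) {
    this.timeoutMs = timeoutMs;
    this.maxRestarts = maxRestarts;
    this.lastBeatMs = nowMs;
  }

  // Record a heartbeat message received from the worker.
  beat(nowMs: number): void {
    this.lastBeatMs = nowMs;
  }

  // Polled periodically on the main thread to decide what to do.
  check(nowMs: number): WatchdogAction {
    if (nowMs - this.lastBeatMs <= this.timeoutMs) return "ok";
    if (this.restarts >= this.maxRestarts) return "give-up";
    this.restarts++;
    this.lastBeatMs = nowMs; // give the restarted worker a full timeout window
    return "restart";
  }
}
```

On `"restart"` the caller would terminate and respawn the worker; on `"give-up"` it would surface a descriptive error or switch to the main-thread fallback.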
## Risks & Mitigations
- Web Serial availability in workers: use main-thread SerialBridge (design accounts for this)
- Message overhead at high Hz: batch commands and reduce UI state update frequency
- Browser differences: feature-detect and test on Chromium, Firefox (where supported), Safari Technology Preview
## Definition of Done
- UI remains smooth with active camera preview and keyboard teleoperation; no flicker observed in manual tests
- Worker-backed runtime enabled by default when available; fallback path verified
- `calibrate()`, `teleoperate()`, `record()` maintain identical signatures and behavior
- Typed protocol implemented with Transferables where applicable
- Unit tests for protocol routing and error timeouts
- Documentation updated (this user story + README note)