LeRobot Dataset Format v2.1 — Complete Spec

This document describes the structure and syntax of a dataset repository that follows the LeRobot v2.1 dataset format.
It is intended as an implementation guide: follow it exactly to stay compatible with the dataset viewer and the LeRobotDataset loader.

1) Repository Layout (typical)

<dataset-root>/
├─ data/
│  ├─ chunk-000/
│  │  ├─ episode_000000.parquet
│  │  └─ episode_000001.parquet
│  └─ chunk-001/
│     └─ ...
├─ videos/
│  ├─ chunk-000/
│  │  ├─ observation.images.front/
│  │  │  ├─ episode_000000.mp4
│  │  │  └─ episode_000001.mp4
│  │  └─ observation.images.wrist/
│  │     ├─ episode_000000.mp4
│  │     └─ episode_000001.mp4
│  └─ chunk-001/
│     └─ ...
├─ meta/
│  ├─ info.json
│  ├─ episodes.jsonl
│  ├─ tasks.jsonl
│  └─ episodes_stats.jsonl  # (v2.1); older v2.0 used stats.json
└─ README.md

2) `data/`

Contains one Parquet file per episode.

Path pattern (from info.json.data_path):

data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet
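Because the template uses Python-style format fields, it can be filled in directly with `str.format`. The following is a minimal sketch; `resolve_data_path` is a hypothetical helper name, and the default `chunks_size` of 1000 is taken from the info.json example in this document.

```python
# Template copied from the info.json example in this document.
DATA_PATH = "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet"


def resolve_data_path(episode_index: int, chunks_size: int = 1000) -> str:
    """Return the relative Parquet path for one episode (hypothetical helper)."""
    # The chunk index follows the integer-division rule from the chunking section.
    episode_chunk = episode_index // chunks_size
    return DATA_PATH.format(episode_chunk=episode_chunk, episode_index=episode_index)


print(resolve_data_path(1))     # data/chunk-000/episode_000001.parquet
print(resolve_data_path(1234))  # data/chunk-001/episode_001234.parquet
```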

Each Parquet file stores synchronized timesteps with:

- timestamp
- observations (sensor values, images, poses, etc.)
- actions (target joint commands, end-effector states, etc.)

3) `videos/`

Contains one MP4 video per episode per camera key.

Path pattern (from info.json.video_path):

videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4

{video_key} MUST exactly match a feature key in info.json.features under observation.images.*.
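The key rule can be enforced at the point where the path is built. This sketch assumes `features` is the `features` object from info.json already parsed into a dict; `resolve_video_path` is a hypothetical helper, not part of any library.

```python
# Template copied from the info.json example in this document.
VIDEO_PATH = "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4"


def resolve_video_path(features: dict, video_key: str, episode_index: int,
                       chunks_size: int = 1000) -> str:
    """Return the relative MP4 path, rejecting keys that are not declared features."""
    if video_key not in features:
        raise KeyError(f"{video_key!r} is not a feature key in info.json")
    if not video_key.startswith("observation.images."):
        raise ValueError(f"{video_key!r} is not under observation.images.*")
    return VIDEO_PATH.format(
        episode_chunk=episode_index // chunks_size,
        video_key=video_key,
        episode_index=episode_index,
    )


features = {"observation.images.front": {"dtype": "video"}}
print(resolve_video_path(features, "observation.images.front", 1))
# videos/chunk-000/observation.images.front/episode_000001.mp4
```

Passing a shortened key such as `"front"` raises `KeyError`, which catches the folder-naming mistake before any file is written.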

Requirements:

- Codec: H.264 (avc1)
- FPS matches info.json.fps
- Frame count equals the number of rows (timesteps) in the corresponding Parquet file

4) `meta/`

`info.json`

Top-level dataset metadata and templates.

{
  "version": "2.1",
  "fps": 30,
  "chunks_size": 1000,
  "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet",
  "video_path": "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4",
  "features": {
    "observation.images.front": {
      "dtype": "video",
      "shape": [240, 320, 3],
      "names": ["height", "width", "channel"],
      "video_info": { "video.fps": 30.0, "video.codec": "avc1" }
    }
  }
}
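As an illustration of how a loader might consume this file, the sketch below parses the example above and derives the list of video keys from the features whose `dtype` is `"video"`. The helper names `video_keys` and `fps_mismatches` are hypothetical; the field names are the ones shown in the example.

```python
import json

# The info.json example from this document, inlined so the sketch is self-contained.
INFO_JSON = """
{
  "version": "2.1",
  "fps": 30,
  "chunks_size": 1000,
  "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet",
  "video_path": "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4",
  "features": {
    "observation.images.front": {
      "dtype": "video",
      "shape": [240, 320, 3],
      "names": ["height", "width", "channel"],
      "video_info": {"video.fps": 30.0, "video.codec": "avc1"}
    }
  }
}
"""


def video_keys(info: dict) -> list[str]:
    """Feature keys that are stored as MP4 files under videos/."""
    return [key for key, feature in info["features"].items()
            if feature.get("dtype") == "video"]


def fps_mismatches(info: dict) -> list[str]:
    """Video keys whose declared video.fps differs from the dataset-level fps."""
    return [key for key in video_keys(info)
            if info["features"][key]["video_info"]["video.fps"] != info["fps"]]


info = json.loads(INFO_JSON)
print(video_keys(info))      # ['observation.images.front']
print(fps_mismatches(info))  # []
```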

`episodes.jsonl`

Line-delimited JSON, one entry per episode.

{"episode_id":"000000","task":"pick_and_place","length":243,"chunk":0}
{"episode_id":"000001","task":"pick_and_place","length":198,"chunk":0}

Fields:

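- episode_id → six-digit, zero-padded episode index; must match the filenames (e.g. "000000" for episode_000000.parquet and episode_000000.mp4)
- task → must exist in tasks.jsonl
- length → number of timesteps (Parquet rows and video frames)
- chunk → numeric index (0, 1, 2, …) matching the chunk folder that holds the episode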

`tasks.jsonl`

Defines available tasks.

{"task":"pick_and_place","description":"Pick an object and place it"}
{"task":"push_button","description":"Push the button until LED lights up"}
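The cross-references between episodes.jsonl and tasks.jsonl can be checked mechanically. The sketch below is one possible validator under the field names used in the examples in this document; `read_jsonl` and `validate_episodes` are hypothetical helpers, and the positive-length check is an added assumption.

```python
import json
import re

EPISODES_JSONL = '''{"episode_id":"000000","task":"pick_and_place","length":243,"chunk":0}
{"episode_id":"000001","task":"pick_and_place","length":198,"chunk":0}'''

TASKS_JSONL = '''{"task":"pick_and_place","description":"Pick an object and place it"}
{"task":"push_button","description":"Push the button until LED lights up"}'''


def read_jsonl(text: str) -> list[dict]:
    """Parse line-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def validate_episodes(episodes: list[dict], tasks: list[dict],
                      chunks_size: int = 1000) -> list[str]:
    """Return human-readable problems; an empty list means the entries are consistent."""
    known_tasks = {task["task"] for task in tasks}
    problems = []
    for episode in episodes:
        episode_id = episode["episode_id"]
        if not re.fullmatch(r"[0-9]{6}", episode_id):
            problems.append(f"{episode_id!r}: not a six-digit zero-padded id")
            continue
        if episode["task"] not in known_tasks:
            problems.append(f"{episode_id}: unknown task {episode['task']!r}")
        if episode["chunk"] != int(episode_id) // chunks_size:
            problems.append(f"{episode_id}: chunk {episode['chunk']} does not match the id")
        if episode["length"] <= 0:  # assumption: an episode has at least one timestep
            problems.append(f"{episode_id}: length must be positive")
    return problems


print(validate_episodes(read_jsonl(EPISODES_JSONL), read_jsonl(TASKS_JSONL)))  # []
```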

`episodes_stats.jsonl`

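Dataset statistics for analysis and normalization. The record below is pretty-printed for readability; in the .jsonl file each record must occupy a single line.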

{
  "total_frames": 25873,
  "episode_lengths": { "min": 120, "max": 450, "mean": 215 },
  "action_stats": {
    "x": { "min": -0.12, "max": 0.12, "mean": 0.0, "std": 0.05 },
    "y": { "min": -0.1, "max": 0.1, "mean": 0.0, "std": 0.04 },
    "z": { "min": 0.05, "max": 0.35, "mean": 0.2, "std": 0.06 }
  }
}
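One common use of such statistics is standardizing actions before training. The sketch below applies z-score normalization with the `mean` and `std` values from the example above; the choice of z-score (rather than, say, min-max scaling) and the helper name `normalize_action` are assumptions made for illustration, not something this format prescribes.

```python
# action_stats copied from the example in this document.
ACTION_STATS = {
    "x": {"min": -0.12, "max": 0.12, "mean": 0.0, "std": 0.05},
    "y": {"min": -0.1, "max": 0.1, "mean": 0.0, "std": 0.04},
    "z": {"min": 0.05, "max": 0.35, "mean": 0.2, "std": 0.06},
}


def normalize_action(action: dict, stats: dict) -> dict:
    """Z-score each axis: (value - mean) / std (assumed normalization scheme)."""
    return {
        axis: (value - stats[axis]["mean"]) / stats[axis]["std"]
        for axis, value in action.items()
    }


normalized = normalize_action({"x": 0.05, "y": 0.0, "z": 0.2}, ACTION_STATS)
print(normalized)
```

An action one standard deviation above the mean on an axis maps to 1.0 on that axis, and an action equal to the mean maps to 0.0.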

5) Chunking explained

Purpose: purely structural; keeps folders manageable and enables fast lookups.
Decided by dataset builder: choose chunks_size (often 1000).

Rule:

episode_chunk = episode_index // chunks_size

Examples

Small dataset (N ≤ 1000, chunks_size=1000) → all in chunk-000
Large dataset (N=2000, chunks_size=1000):
- episodes 0–999 → chunk-000
- episodes 1000–1999 → chunk-001

Naming

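Chunk folders must be named chunk-XXX, where XXX is the chunk index zero-padded to three digits (matching {episode_chunk:03d} in the path templates).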
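The rule and the naming convention fit in a few lines. This is a small sketch with a hypothetical helper name, `chunk_dir`, showing where the boundary between chunks falls for chunks_size=1000.

```python
def chunk_dir(episode_index: int, chunks_size: int = 1000) -> str:
    """Folder name for the chunk that holds the given episode."""
    if episode_index < 0 or chunks_size <= 0:
        raise ValueError("episode_index must be >= 0 and chunks_size must be > 0")
    # Integer division picks the chunk; :03d applies the zero padding.
    return f"chunk-{episode_index // chunks_size:03d}"


for index in (0, 999, 1000, 1999):
    print(index, chunk_dir(index))
# 0 chunk-000
# 999 chunk-000
# 1000 chunk-001
# 1999 chunk-001
```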

6) Synchronization rules

- Parquet rows = MP4 frames = length in episodes.jsonl
- Episode IDs are consistent across Parquet files, MP4 files, and JSONL entries
- Video folder name = exact feature key (observation.images.front)
- Codec = H.264 (avc1)
- Forward slashes in all paths
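The count rule can be expressed as a small check once the numbers have been measured. The sketch below assumes the row and frame counts were obtained elsewhere (for example with a Parquet reader and a video probing tool); `EpisodeCounts` and `sync_problems` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class EpisodeCounts:
    """Counts measured for one episode (how they are measured is out of scope here)."""
    episode_id: str
    parquet_rows: int
    video_frames: dict[str, int]  # frame count per video key
    declared_length: int          # "length" from episodes.jsonl


def sync_problems(counts: EpisodeCounts) -> list[str]:
    """Return a message for every count that disagrees; empty means in sync."""
    problems = []
    if counts.parquet_rows != counts.declared_length:
        problems.append(
            f"{counts.episode_id}: {counts.parquet_rows} Parquet rows, "
            f"but length is {counts.declared_length}"
        )
    for video_key, frames in counts.video_frames.items():
        if frames != counts.parquet_rows:
            problems.append(
                f"{counts.episode_id}: {video_key} has {frames} frames, "
                f"but Parquet has {counts.parquet_rows} rows"
            )
    return problems


ok = EpisodeCounts("000000", 243, {"observation.images.front": 243}, 243)
off = EpisodeCounts("000001", 198, {"observation.images.front": 197}, 198)
print(sync_problems(ok))   # []
print(sync_problems(off))
```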

7) Common pitfalls

- Wrong folder naming (front/ instead of observation.images.front/)
- Skipping chunk-000 entirely (must exist, even for small datasets)
- Filename mismatch (000001.mp4 vs episode_000001.mp4)
- Pretty-printed or nested JSON instead of JSONL (one object per line)
- Using an unsupported codec (HEVC/AV1)
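Several of these naming pitfalls can be caught with a single pattern match on the relative path. The regular expression below is an illustrative sketch derived from the video_path template in this document, not an official validator; `is_valid_video_path` is a hypothetical name.

```python
import re

# Mirrors videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4
VIDEO_PATH_RE = re.compile(
    r"videos/chunk-[0-9]{3,}/observation\.images\.[^/\\]+/episode_[0-9]{6,}\.mp4"
)


def is_valid_video_path(path: str) -> bool:
    """True if the relative path follows the naming rules in this document."""
    return VIDEO_PATH_RE.fullmatch(path) is not None


candidates = [
    "videos/chunk-000/observation.images.front/episode_000001.mp4",  # correct
    "videos/chunk-000/front/episode_000001.mp4",                     # short folder name
    "videos/chunk-000/observation.images.front/000001.mp4",          # missing episode_ prefix
    "videos\\chunk-000\\observation.images.front\\episode_000001.mp4",  # backslashes
]
for candidate in candidates:
    print(is_valid_video_path(candidate), candidate)
```

Only the first candidate passes; the other three correspond to the folder-naming, filename, and path-separator mistakes.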

8) Quick implementation checklist

- info.json.version == "2.1"
- chunks_size defined (e.g., 1000)
- data_path and video_path templates correct
- All video_key entries match features keys under observation.images.*
- Episode IDs zero-padded: episode_{:06d}
- episodes.jsonl has one JSON object per line
- Parquet rows == MP4 frames == length
- Codec = H.264 (avc1), FPS correct
- All paths use / slashes

9) Minimal valid dataset (≤1000 episodes)

my_dataset/
├─ data/chunk-000/episode_000000.parquet
├─ data/chunk-000/episode_000001.parquet
├─ videos/chunk-000/observation.images.front/episode_000000.mp4
├─ videos/chunk-000/observation.images.front/episode_000001.mp4
└─ meta/{info.json, episodes.jsonl, tasks.jsonl, episodes_stats.jsonl}
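To compare a dataset on disk against this layout, it can help to generate the list of files that should exist. The sketch below does that from the two path templates; `expected_files` is a hypothetical helper, and the four meta file names are the ones listed in this document.

```python
# Templates and meta file names copied from this document.
DATA_PATH = "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet"
VIDEO_PATH = "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4"
META_FILES = [
    "meta/info.json",
    "meta/episodes.jsonl",
    "meta/tasks.jsonl",
    "meta/episodes_stats.jsonl",
]


def expected_files(num_episodes: int, video_keys: list[str],
                   chunks_size: int = 1000) -> list[str]:
    """Relative paths a dataset with the given episodes and cameras should contain."""
    files = []
    for episode_index in range(num_episodes):
        episode_chunk = episode_index // chunks_size
        files.append(DATA_PATH.format(episode_chunk=episode_chunk,
                                      episode_index=episode_index))
        for video_key in video_keys:
            files.append(VIDEO_PATH.format(episode_chunk=episode_chunk,
                                           video_key=video_key,
                                           episode_index=episode_index))
    return files + META_FILES


for path in expected_files(2, ["observation.images.front"]):
    print(path)
```

For two episodes and one camera this yields the eight files in the tree above. Checking each path with `pathlib.Path(root, path).exists()` then reveals anything missing.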
