
LeRobot Dataset Format v2.1: Complete Spec

This document describes the exact structure and syntax of a dataset repository following the LeRobot 2.1 standard.
It is intended as an implementation guide: follow it exactly to ensure 100% compatibility with the dataset viewer and the LeRobotDataset loader.


1) Repository Layout (typical)

<dataset-root>/
├─ data/
│  ├─ chunk-000/
│  │  ├─ episode_000000.parquet
│  │  └─ episode_000001.parquet
│  └─ chunk-001/
│     └─ ...
├─ videos/
│  ├─ chunk-000/
│  │  ├─ observation.images.front/
│  │  │  ├─ episode_000000.mp4
│  │  │  └─ episode_000001.mp4
│  │  └─ observation.images.wrist/
│  │     ├─ episode_000000.mp4
│  │     └─ episode_000001.mp4
│  └─ chunk-001/
│     └─ ...
├─ meta/
│  ├─ info.json
│  ├─ episodes.jsonl
│  ├─ tasks.jsonl
│  └─ episodes_stats.jsonl  # (v2.1); older v2.0 used stats.json
└─ README.md

2) data/

  • Contains one Parquet file per episode.
  • Path pattern (from info.json.data_path):
    data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet
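
The placeholders in the template use Python format-spec syntax, so one way to expand it is with `str.format`. The sketch below is illustrative only: the helper name `episode_data_path` is hypothetical, and the chunk index is derived with the rule from section 5.

```python
# Template as it appears in info.json.data_path.
DATA_PATH = "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet"


def episode_data_path(episode_index: int, chunks_size: int = 1000,
                      template: str = DATA_PATH) -> str:
    """Expand the data_path template for one episode (hypothetical helper)."""
    if episode_index < 0 or chunks_size <= 0:
        raise ValueError("episode_index must be >= 0 and chunks_size must be > 0")
    # Chunk rule from section 5: integer division by chunks_size.
    episode_chunk = episode_index // chunks_size
    return template.format(episode_chunk=episode_chunk,
                           episode_index=episode_index)


print(episode_data_path(0))     # data/chunk-000/episode_000000.parquet
print(episode_data_path(1234))  # data/chunk-001/episode_001234.parquet
```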
    

Each Parquet file stores synchronized timesteps with:

  • timestamp
  • observations (sensor values, images, poses, etc.)
  • actions (target joint commands, end-effector states, etc.)

3) videos/

  • Contains one MP4 video per episode per camera key.
  • Path pattern (from info.json.video_path):
    videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4
    
  • {video_key} MUST exactly match a feature key in info.json.features under observation.images.*.
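
A minimal sketch of how the video path could be resolved while enforcing the key rule above. It assumes `info` is the parsed content of info.json with the fields shown in section 4; the helper name `episode_video_path` is hypothetical.

```python
def episode_video_path(info: dict, video_key: str, episode_index: int) -> str:
    """Expand info['video_path'] after checking that video_key is a video feature."""
    feature = info["features"].get(video_key)
    if feature is None or feature.get("dtype") != "video":
        raise KeyError(f"{video_key!r} is not a video feature in info.json")
    episode_chunk = episode_index // info["chunks_size"]
    return info["video_path"].format(episode_chunk=episode_chunk,
                                     video_key=video_key,
                                     episode_index=episode_index)


# Example metadata, trimmed to the fields this helper reads.
info = {
    "chunks_size": 1000,
    "video_path": "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4",
    "features": {"observation.images.front": {"dtype": "video"}},
}

print(episode_video_path(info, "observation.images.front", 1))
```

Passing a shortened key such as `"front"` raises `KeyError`, which mirrors the first pitfall listed in section 7.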

Requirements:

  • Codec: H.264 (avc1)
  • FPS matches info.json.fps
  • Frame count == number of timesteps in corresponding Parquet

4) meta/

info.json

Top-level dataset metadata and templates.

{
  "version": "2.1",
  "fps": 30,
  "chunks_size": 1000,
  "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet",
  "video_path": "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4",
  "features": {
    "observation.images.front": {
      "dtype": "video",
      "shape": [240, 320, 3],
      "names": ["height", "width", "channel"],
      "video_info": { "video.fps": 30.0, "video.codec": "avc1" }
    }
  }
}
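
The following sketch shows one way to sanity-check a parsed info.json against the fields used in the example above. The key list and the helper name `check_info` are assumptions based on this document, not an official schema.

```python
import json

# Keys taken from the example in this section (an assumption, not an official list).
REQUIRED_KEYS = ("version", "fps", "chunks_size", "data_path", "video_path", "features")


def check_info(info: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in info]
    for key, feature in info.get("features", {}).items():
        if feature.get("dtype") != "video":
            continue
        video_fps = feature.get("video_info", {}).get("video.fps")
        if video_fps != info.get("fps"):
            problems.append(f"{key}: video.fps {video_fps} != fps {info.get('fps')}")
    return problems


INFO_TEXT = """
{
  "version": "2.1",
  "fps": 30,
  "chunks_size": 1000,
  "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet",
  "video_path": "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4",
  "features": {
    "observation.images.front": {
      "dtype": "video",
      "shape": [240, 320, 3],
      "names": ["height", "width", "channel"],
      "video_info": {"video.fps": 30.0, "video.codec": "avc1"}
    }
  }
}
"""

info = json.loads(INFO_TEXT)
print(check_info(info))  # no problems expected for the example above
```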

episodes.jsonl

Line-delimited JSON, one entry per episode.

{"episode_id":"000000","task":"pick_and_place","length":243,"chunk":0}
{"episode_id":"000001","task":"pick_and_place","length":198,"chunk":0}

Fields:

  • episode_id → must match filenames (episode_000000)
  • task → must exist in tasks.jsonl
  • length → number of timesteps (and frames)
  • chunk → numeric index (0,1,2…) matching the chunk folder
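
As an illustration, the field rules above can be checked line by line. This sketch assumes the field names from the example (`episode_id`, `task`, `length`, `chunk`); the function name `check_episodes` is hypothetical, and the chunk check uses the rule from section 5.

```python
import json


def check_episodes(episodes_text: str, known_tasks: set, chunks_size: int = 1000) -> list:
    """Validate the content of episodes.jsonl and return a list of problems."""
    problems = []
    for line_no, line in enumerate(episodes_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        episode = json.loads(line)  # each line must be one complete JSON object
        episode_id = episode["episode_id"]
        if len(episode_id) != 6 or not episode_id.isdigit():
            problems.append(f"line {line_no}: episode_id {episode_id!r} is not 6 digits")
            continue
        if episode["task"] not in known_tasks:
            problems.append(f"line {line_no}: unknown task {episode['task']!r}")
        if episode["length"] <= 0:
            problems.append(f"line {line_no}: length must be positive")
        if episode["chunk"] != int(episode_id) // chunks_size:
            problems.append(f"line {line_no}: chunk {episode['chunk']} does not match episode_id")
    return problems


EPISODES = (
    '{"episode_id":"000000","task":"pick_and_place","length":243,"chunk":0}\n'
    '{"episode_id":"000001","task":"pick_and_place","length":198,"chunk":0}\n'
)
print(check_episodes(EPISODES, {"pick_and_place", "push_button"}))
```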

tasks.jsonl

Defines available tasks.

{"task":"pick_and_place","description":"Pick an object and place it"}
{"task":"push_button","description":"Push the button until LED lights up"}

episodes_stats.jsonl

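Dataset statistics for analysis and normalization. The record is pretty-printed here for readability; in the .jsonl file each record must occupy a single line.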

{
  "total_frames": 25873,
  "episode_lengths": { "min": 120, "max": 450, "mean": 215 },
  "action_stats": {
    "x": { "min": -0.12, "max": 0.12, "mean": 0.0, "std": 0.05 },
    "y": { "min": -0.1, "max": 0.1, "mean": 0.0, "std": 0.04 },
    "z": { "min": 0.05, "max": 0.35, "mean": 0.2, "std": 0.06 }
  }
}
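
A rough sketch of how a record with this shape could be computed and serialized on one line. The helper names are hypothetical, and the use of the population standard deviation is an assumption, since this document does not say which estimator is expected.

```python
import json
import statistics


def summarize(values: list) -> dict:
    """min/max/mean/std of a list of numbers (std is the population std, an assumption)."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
    }


def build_stats_record(episode_lengths: list, actions: dict) -> dict:
    """Build a record shaped like the example above; actions maps axis name to values."""
    return {
        "total_frames": sum(episode_lengths),
        "episode_lengths": {
            "min": min(episode_lengths),
            "max": max(episode_lengths),
            "mean": statistics.fmean(episode_lengths),
        },
        "action_stats": {axis: summarize(vals) for axis, vals in actions.items()},
    }


record = build_stats_record([243, 198], {"x": [-0.1, 0.0, 0.1]})
line = json.dumps(record)  # json.dumps without indent yields a single line
print(line)
```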

5) Chunking explained

  • Purpose: purely structural; keeps folders manageable and enables fast lookups.
  • Decided by dataset builder: choose chunks_size (often 1000).
  • Rule:
    episode_chunk = episode_index // chunks_size
    

Examples

  • Small dataset (N ≤ 1000, chunks_size=1000) → all in chunk-000
  • Large dataset (N=2000, chunks_size=1000):
    • episodes 0–999 → chunk-000
    • episodes 1000–1999 → chunk-001
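
The examples above follow directly from integer division. The short sketch below reproduces them; the function names are hypothetical, and the chunk-count formula is simply the ceiling of N / chunks_size.

```python
def chunk_dir(episode_index: int, chunks_size: int = 1000) -> str:
    """Folder name for an episode, using episode_chunk = episode_index // chunks_size."""
    return f"chunk-{episode_index // chunks_size:03d}"


def num_chunks(total_episodes: int, chunks_size: int = 1000) -> int:
    """Number of chunk folders needed: ceiling division without floats."""
    return (total_episodes + chunks_size - 1) // chunks_size


print(chunk_dir(999))    # chunk-000
print(chunk_dir(1000))   # chunk-001
print(num_chunks(2000))  # 2
```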

Naming

  • Chunk folder must be chunk-XXX with zero padding.

6) Synchronization rules

  • Parquet rows = MP4 frames = episodes.jsonl.length
  • Episode IDs consistent across parquet, mp4, and JSONL
  • Video folder name = exact feature key (observation.images.front)
  • Codecs = H.264 (avc1)
  • Forward slashes in all paths
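
The count rule can be expressed as a small comparison once the numbers are available. This sketch assumes the row and frame counts were obtained elsewhere (for example with a Parquet reader and a video probe, neither shown here); `EpisodeCounts` and `sync_problems` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class EpisodeCounts:
    """Counts gathered for one episode (hypothetical container)."""
    episode_id: str
    declared_length: int   # "length" from episodes.jsonl
    parquet_rows: int      # rows in the episode's Parquet file
    video_frames: dict = field(default_factory=dict)  # video_key -> frame count


def sync_problems(ep: EpisodeCounts) -> list:
    """Check Parquet rows == MP4 frames == declared length for one episode."""
    problems = []
    if ep.parquet_rows != ep.declared_length:
        problems.append(
            f"{ep.episode_id}: {ep.parquet_rows} parquet rows, expected {ep.declared_length}")
    for video_key, frames in ep.video_frames.items():
        if frames != ep.declared_length:
            problems.append(
                f"{ep.episode_id}: {video_key} has {frames} frames, expected {ep.declared_length}")
    return problems


good = EpisodeCounts("000000", 243, 243, {"observation.images.front": 243})
bad = EpisodeCounts("000001", 198, 198, {"observation.images.front": 197})
print(sync_problems(good))
print(sync_problems(bad))
```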

7) Common pitfalls

  1. Wrong folder naming (front/ instead of observation.images.front/)
  2. Skipping chunk-000 entirely (must exist, even for small datasets)
  3. Filename mismatch (000001.mp4 vs episode_000001.mp4)
  4. Nested JSON instead of JSONL
  5. Using unsupported codec (HEVC/AV1)
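
Pitfalls 1 to 3 are purely about naming, so they can be caught without opening any file. The sketch below checks relative paths with regular expressions; the helper name `path_problems` and the exact patterns are assumptions derived from the templates in this document. It does not inspect codecs or file contents.

```python
import re

CHUNK_DIR = re.compile(r"chunk-\d{3,}")
EPISODE_FILE = re.compile(r"episode_\d{6}\.(parquet|mp4)")


def path_problems(rel_path: str, video_keys: set) -> list:
    """Check one dataset-relative path against the naming rules in this document."""
    problems = []
    if "\\" in rel_path:
        problems.append("backslash in path; use forward slashes")
    parts = rel_path.replace("\\", "/").split("/")
    root = parts[0]
    expected_len = {"data": 3, "videos": 4}.get(root)
    if expected_len is None or len(parts) != expected_len:
        return problems + [f"unexpected path shape: {rel_path}"]
    if not CHUNK_DIR.fullmatch(parts[1]):
        problems.append(f"bad chunk folder: {parts[1]}")
    if root == "videos" and parts[2] not in video_keys:
        problems.append(f"folder is not a feature key: {parts[2]}")
    expected_ext = ".parquet" if root == "data" else ".mp4"
    name = parts[-1]
    if not EPISODE_FILE.fullmatch(name) or not name.endswith(expected_ext):
        problems.append(f"bad filename: {name}")
    return problems


keys = {"observation.images.front"}
print(path_problems("videos/chunk-000/observation.images.front/episode_000001.mp4", keys))
print(path_problems("videos/chunk-000/front/000001.mp4", keys))
```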

8) Quick implementation checklist

  • info.json.version == "2.1"
  • chunks_size defined (e.g., 1000)
  • data_path and video_path templates correct
  • All video_key entries match features keys under observation.images.*
  • Episode IDs zero-padded: episode_{:06d}
  • episodes.jsonl one JSON per line
  • Parquet rows == MP4 frames == length
  • Codec = H.264 (avc1), FPS correct
  • All paths use / slashes

9) Minimal valid dataset (≤1000 episodes)

my_dataset/
├─ data/chunk-000/episode_000000.parquet
├─ data/chunk-000/episode_000001.parquet
├─ videos/chunk-000/observation.images.front/episode_000000.mp4
├─ videos/chunk-000/observation.images.front/episode_000001.mp4
└─ meta/{info.json, episodes.jsonl, tasks.jsonl, episodes_stats.jsonl}
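
As a starting point, the directory layout and the text-based meta files can be scaffolded with the standard library. This is only a sketch under the field names used in this document: `create_skeleton` and `write_jsonl` are hypothetical helpers, and the Parquet files, MP4 files, and episodes_stats.jsonl are left out because they require real recorded data.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path: Path, records: list) -> None:
    """Write one JSON object per line."""
    path.write_text("".join(json.dumps(r) + "\n" for r in records), encoding="utf-8")


def create_skeleton(root: Path, video_key: str = "observation.images.front") -> Path:
    """Create folders plus info.json, episodes.jsonl and tasks.jsonl under root."""
    (root / "data" / "chunk-000").mkdir(parents=True, exist_ok=True)
    (root / "videos" / "chunk-000" / video_key).mkdir(parents=True, exist_ok=True)
    meta = root / "meta"
    meta.mkdir(exist_ok=True)
    info = {
        "version": "2.1",
        "fps": 30,
        "chunks_size": 1000,
        "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet",
        "video_path": "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4",
        "features": {video_key: {"dtype": "video", "shape": [240, 320, 3],
                                 "names": ["height", "width", "channel"]}},
    }
    (meta / "info.json").write_text(json.dumps(info, indent=2), encoding="utf-8")
    write_jsonl(meta / "tasks.jsonl",
                [{"task": "pick_and_place", "description": "Pick an object and place it"}])
    write_jsonl(meta / "episodes.jsonl",
                [{"episode_id": "000000", "task": "pick_and_place", "length": 243, "chunk": 0}])
    return root


with tempfile.TemporaryDirectory() as tmp:
    root = create_skeleton(Path(tmp) / "my_dataset")
    print(sorted(p.relative_to(root).as_posix() for p in root.rglob("*")))
```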