# LeRobot Dataset Format v2.1: Complete Spec

This document describes the **exact structure and syntax** of a dataset repository following the **LeRobot 2.1 standard**.
It is intended as an implementation guide: follow it exactly to ensure 100% compatibility with the dataset viewer and `LeRobotDataset` loader.

---
## 1) Repository Layout (typical)

```
<dataset-root>/
├─ data/
│  ├─ chunk-000/
│  │  ├─ episode_000000.parquet
│  │  └─ episode_000001.parquet
│  └─ chunk-001/
│     └─ ...
├─ videos/
│  ├─ chunk-000/
│  │  ├─ observation.images.front/
│  │  │  ├─ episode_000000.mp4
│  │  │  └─ episode_000001.mp4
│  │  └─ observation.images.wrist/
│  │     ├─ episode_000000.mp4
│  │     └─ episode_000001.mp4
│  └─ chunk-001/
│     └─ ...
├─ meta/
│  ├─ info.json
│  ├─ episodes.jsonl
│  ├─ tasks.jsonl
│  └─ episodes_stats.jsonl   # (v2.1); older v2.0 used stats.json
└─ README.md
```
---

## 2) `data/`

- Contains **one Parquet file per episode**.
- Path pattern (from `info.json.data_path`):
  ```
  data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet
  ```

Each Parquet file stores synchronized timesteps with:
- `timestamp`
- `observations` (sensor values, images, poses, etc.)
- `actions` (target joint commands, end-effector states, etc.)
---

## 3) `videos/`

- Contains **one MP4 video per episode per camera key**.
- Path pattern (from `info.json.video_path`):
  ```
  videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4
  ```
- `{video_key}` MUST exactly match a feature key in `info.json.features` under `observation.images.*`.

**Requirements:**
- Codec: H.264 (`avc1`)
- FPS matches `info.json.fps`
- Frame count == number of timesteps in the corresponding Parquet file
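
To illustrate how `{video_key}` ties the folder layout to `info.json.features`, here is a minimal sketch that lists the MP4 paths one episode should have. The helper name `expected_video_paths` is hypothetical (it is not part of the LeRobot API), and the `action` feature with dtype `float32` is an illustrative assumption added only to show that non-video features are skipped.

```python
def expected_video_paths(info: dict, episode_index: int, episode_chunk: int) -> list[str]:
    """Return one relative MP4 path per video feature declared in info["features"]."""
    video_keys = [
        key for key, feature in info["features"].items()
        if feature.get("dtype") == "video"
    ]
    return [
        info["video_path"].format(
            episode_chunk=episode_chunk,
            video_key=key,
            episode_index=episode_index,
        )
        for key in sorted(video_keys)
    ]


info = {
    "video_path": "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4",
    "features": {
        "observation.images.front": {"dtype": "video"},
        "observation.images.wrist": {"dtype": "video"},
        "action": {"dtype": "float32"},  # assumed non-video feature, ignored here
    },
}
paths = expected_video_paths(info, episode_index=1, episode_chunk=0)
print(paths)
```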
---

## 4) `meta/`

### `info.json`

Top-level dataset metadata and templates.

```json
{
  "version": "2.1",
  "fps": 30,
  "chunks_size": 1000,
  "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet",
  "video_path": "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4",
  "features": {
    "observation.images.front": {
      "dtype": "video",
      "shape": [240, 320, 3],
      "names": ["height", "width", "channel"],
      "video_info": { "video.fps": 30.0, "video.codec": "avc1" }
    }
  }
}
```
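
A quick sanity check of `info.json` might look like the sketch below. It assumes that every key shown in the example above is required, which this document does not state explicitly; `check_info` and `REQUIRED_KEYS` are hypothetical names, not LeRobot functions.

```python
import json

# Keys taken from the info.json example above; treating all of them as
# required is an assumption of this sketch.
REQUIRED_KEYS = ("version", "fps", "chunks_size", "data_path", "video_path", "features")


def check_info(info: dict) -> list[str]:
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in info]
    if problems:
        return problems
    if info["version"] != "2.1":
        problems.append(f"unexpected version: {info['version']!r}")
    # Both templates need the chunk and episode placeholders.
    for field in ("data_path", "video_path"):
        for name in ("episode_chunk", "episode_index"):
            if "{" + name not in info[field]:
                problems.append(f"{field} has no {name} placeholder")
    if "{video_key}" not in info["video_path"]:
        problems.append("video_path has no video_key placeholder")
    return problems


info = json.loads("""
{
  "version": "2.1",
  "fps": 30,
  "chunks_size": 1000,
  "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet",
  "video_path": "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4",
  "features": {}
}
""")
problems = check_info(info)
print(problems)
```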
---

### `episodes.jsonl`

Line-delimited JSON, one entry per episode.

```json
{"episode_id":"000000","task":"pick_and_place","length":243,"chunk":0}
{"episode_id":"000001","task":"pick_and_place","length":198,"chunk":0}
```

Fields:
- `episode_id`: must match filenames (`episode_000000`)
- `task`: must exist in `tasks.jsonl`
- `length`: number of timesteps (and frames)
- `chunk`: numeric index (0, 1, 2, …) matching the chunk folder
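
The field rules above can be checked line by line. The following is a minimal sketch using only the standard library; `check_episodes` is a hypothetical helper, and the six-digit rule for `episode_id` is inferred from the `episode_000000` filename pattern.

```python
import json


def check_episodes(episodes_text: str, known_tasks: set[str]) -> list[str]:
    """Validate JSONL episode entries against the field rules listed above."""
    problems = []
    for line_no, line in enumerate(episodes_text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            entry = None
        if not isinstance(entry, dict):
            problems.append(f"line {line_no}: not a single-line JSON object")
            continue
        episode_id = str(entry.get("episode_id", ""))
        if not (len(episode_id) == 6 and episode_id.isdigit()):
            problems.append(f"line {line_no}: episode_id must be six digits")
        if entry.get("task") not in known_tasks:
            problems.append(f"line {line_no}: unknown task {entry.get('task')!r}")
        length = entry.get("length")
        if not isinstance(length, int) or length <= 0:
            problems.append(f"line {line_no}: length must be a positive integer")
        chunk = entry.get("chunk")
        if not isinstance(chunk, int) or chunk < 0:
            problems.append(f"line {line_no}: chunk must be a non-negative integer")
    return problems


episodes_text = (
    '{"episode_id":"000000","task":"pick_and_place","length":243,"chunk":0}\n'
    '{"episode_id":"1","task":"wave","length":198,"chunk":0}'
)
problems = check_episodes(episodes_text, known_tasks={"pick_and_place"})
print(problems)
```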
---

### `tasks.jsonl`

Defines available tasks.

```json
{"task":"pick_and_place","description":"Pick an object and place it"}
{"task":"push_button","description":"Push the button until LED lights up"}
```
---

### `episodes_stats.jsonl`

Dataset statistics for analysis & normalization.

```json
{
  "total_frames": 25873,
  "episode_lengths": { "min": 120, "max": 450, "mean": 215 },
  "action_stats": {
    "x": { "min": -0.12, "max": 0.12, "mean": 0.0, "std": 0.05 },
    "y": { "min": -0.1, "max": 0.1, "mean": 0.0, "std": 0.04 },
    "z": { "min": 0.05, "max": 0.35, "mean": 0.2, "std": 0.06 }
  }
}
```
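
One way such statistics could be used for normalization is plain standardization, `(value - mean) / std`, per action dimension. This document does not specify the normalization scheme, so treat the formula as an assumption; `normalize_action` is a hypothetical helper operating on the `action_stats` structure from the example above.

```python
def normalize_action(action: dict[str, float], action_stats: dict) -> dict[str, float]:
    """Standardize each action dimension as (value - mean) / std."""
    return {
        name: (value - action_stats[name]["mean"]) / action_stats[name]["std"]
        for name, value in action.items()
    }


# Values copied from the example stats above (x and z dimensions only).
action_stats = {
    "x": {"min": -0.12, "max": 0.12, "mean": 0.0, "std": 0.05},
    "z": {"min": 0.05, "max": 0.35, "mean": 0.2, "std": 0.06},
}
normalized = normalize_action({"x": 0.05, "z": 0.26}, action_stats)
print(normalized)
```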
---

## 5) Chunking explained

- **Purpose:** purely structural; keeps folders manageable and enables fast lookups.
- **Decided by dataset builder:** choose `chunks_size` (often 1000).
- **Rule:**
  ```
  episode_chunk = episode_index // chunks_size
  ```

### Examples

- Small dataset (N ≤ 1000, chunks_size=1000) → all in `chunk-000`
- Large dataset (N=2000, chunks_size=1000):
  - episodes 0–999 → `chunk-000`
  - episodes 1000–1999 → `chunk-001`

### Naming

- Chunk folder must be `chunk-XXX` with zero padding.
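
The chunk rule and the folder naming convention combine into a few lines of code. This is a minimal sketch; `chunk_folder` is a hypothetical helper name, and the default of 1000 mirrors the `chunks_size` value used in the examples.

```python
def chunk_folder(episode_index: int, chunks_size: int = 1000) -> str:
    """Map an episode index to its zero-padded chunk folder name."""
    if episode_index < 0 or chunks_size <= 0:
        raise ValueError("episode_index must be >= 0 and chunks_size > 0")
    episode_chunk = episode_index // chunks_size  # integer division, as in the rule above
    return f"chunk-{episode_chunk:03d}"


for index in (0, 999, 1000, 1999):
    print(index, chunk_folder(index))
```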
---

## 6) Synchronization rules

- Parquet rows = MP4 frames = `episodes.jsonl.length`
- Episode IDs consistent across parquet, mp4, and JSONL
- Video folder name = exact feature key (`observation.images.front`)
- Codecs = H.264 (`avc1`)
- Forward slashes in all paths
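
A first-pass consistency check can confirm that every episode listed in `episodes.jsonl` has its Parquet file and one MP4 per camera key. The sketch below only tests file presence; comparing row and frame counts would need a Parquet reader and a video probe, which are left out. `missing_files` is a hypothetical helper, and the temporary directory merely stands in for a dataset root.

```python
import tempfile
from pathlib import Path


def missing_files(root: Path, episodes: list[dict], video_keys: list[str]) -> list[str]:
    """List relative paths that the episode index implies but that are absent on disk."""
    missing = []
    for episode in episodes:
        chunk = f"chunk-{episode['chunk']:03d}"
        name = f"episode_{episode['episode_id']}"
        expected = [f"data/{chunk}/{name}.parquet"]
        expected += [f"videos/{chunk}/{key}/{name}.mp4" for key in video_keys]
        missing += [rel for rel in expected if not (root / rel).is_file()]
    return missing


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    parquet = root / "data/chunk-000/episode_000000.parquet"
    parquet.parent.mkdir(parents=True)
    parquet.touch()  # empty placeholder; a real file would hold Parquet data
    episodes = [{"episode_id": "000000", "task": "pick_and_place", "length": 243, "chunk": 0}]
    missing = missing_files(root, episodes, ["observation.images.front"])
print(missing)
```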
---

## 7) Common pitfalls

1. Wrong folder naming (`front/` instead of `observation.images.front/`)
2. Skipping `chunk-000` entirely (must exist, even for small datasets)
3. Filename mismatch (`000001.mp4` vs `episode_000001.mp4`)
4. Nested JSON instead of JSONL
5. Using unsupported codec (HEVC/AV1)
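
Pitfalls 1 and 3 are naming mistakes, so they can be caught with a pattern match on relative paths. The regular expressions below are one possible reading of the path templates in this document, not an official validator; `is_valid_path` is a hypothetical helper.

```python
import re

# Patterns derived from the data_path and video_path templates in this document.
DATA_RE = re.compile(r"data/chunk-\d{3,}/episode_\d{6}\.parquet")
VIDEO_RE = re.compile(r"videos/chunk-\d{3,}/observation\.images\.[^/]+/episode_\d{6}\.mp4")


def is_valid_path(rel_path: str) -> bool:
    """Check a forward-slash relative path against the expected naming patterns."""
    return bool(DATA_RE.fullmatch(rel_path) or VIDEO_RE.fullmatch(rel_path))


candidates = [
    "videos/chunk-000/observation.images.front/episode_000001.mp4",  # follows the template
    "videos/chunk-000/front/episode_000001.mp4",                     # pitfall 1
    "videos/chunk-000/observation.images.front/000001.mp4",          # pitfall 3
]
for path in candidates:
    print(path, is_valid_path(path))
```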
---

## 8) Quick implementation checklist

- [ ] `info.json.version == "2.1"`
- [ ] `chunks_size` defined (e.g., 1000)
- [ ] `data_path` and `video_path` templates correct
- [ ] All `video_key` entries match `features` keys under `observation.images.*`
- [ ] Episode IDs zero-padded: `episode_{:06d}`
- [ ] `episodes.jsonl` one JSON per line
- [ ] Parquet rows == MP4 frames == `length`
- [ ] Codec = H.264 (`avc1`), FPS correct
- [ ] All paths use `/` slashes
---

## 9) Minimal valid dataset (≤1000 episodes)

```
my_dataset/
├─ data/chunk-000/episode_000000.parquet
├─ data/chunk-000/episode_000001.parquet
├─ videos/chunk-000/observation.images.front/episode_000000.mp4
├─ videos/chunk-000/observation.images.front/episode_000001.mp4
└─ meta/{info.json, episodes.jsonl, tasks.jsonl, episodes_stats.jsonl}
```