# LeRobot Dataset Format v2.1: Complete Spec
This document describes the **exact structure and syntax** of a dataset repository following the **LeRobot 2.1 standard**.
It is intended as an implementation guide: follow it exactly to stay compatible with the dataset viewer and the `LeRobotDataset` loader.
---
## 1) Repository Layout (typical)
```
<dataset-root>/
├─ data/
│  ├─ chunk-000/
│  │  ├─ episode_000000.parquet
│  │  └─ episode_000001.parquet
│  └─ chunk-001/
│     └─ ...
├─ videos/
│  ├─ chunk-000/
│  │  ├─ observation.images.front/
│  │  │  ├─ episode_000000.mp4
│  │  │  └─ episode_000001.mp4
│  │  └─ observation.images.wrist/
│  │     ├─ episode_000000.mp4
│  │     └─ episode_000001.mp4
│  └─ chunk-001/
│     └─ ...
├─ meta/
│  ├─ info.json
│  ├─ episodes.jsonl
│  ├─ tasks.jsonl
│  └─ episodes_stats.jsonl   # (v2.1); older v2.0 used stats.json
└─ README.md
```
---
## 2) `data/`
- Contains **one Parquet file per episode**.
- Path pattern (from `info.json.data_path`):
```
data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet
```
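The template uses Python format-spec syntax, so it can be expanded with `str.format`. The following is a minimal sketch; `episode_data_path` is a hypothetical helper name, not part of any library.

```python
# Template copied from the data_path pattern in this document.
DATA_PATH = "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet"


def episode_data_path(episode_index: int, episode_chunk: int,
                      template: str = DATA_PATH) -> str:
    """Expand the data_path template for one episode (hypothetical helper)."""
    return template.format(episode_chunk=episode_chunk,
                           episode_index=episode_index)


print(episode_data_path(episode_index=1, episode_chunk=0))
# data/chunk-000/episode_000001.parquet
```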
Each Parquet file stores synchronized timesteps with:
- `timestamp`
- `observations` (sensor values, images, poses, etc.)
- `actions` (target joint commands, end-effector states, etc.)
---
## 3) `videos/`
- Contains **one MP4 video per episode per camera key**.
- Path pattern (from `info.json.video_path`):
```
videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4
```
- `{video_key}` MUST exactly match a feature key in `info.json.features` under `observation.images.*`.
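One way to catch a mismatched `{video_key}` early is to compare the folder names under a chunk against the feature keys. This is a sketch under two assumptions: `unmatched_video_keys` is a hypothetical helper, and video features are identified by `"dtype": "video"` as in the `info.json` example in this document.

```python
def unmatched_video_keys(features: dict, video_dirs: list[str]) -> list[str]:
    """Return folder names that are not video features under observation.images.*.

    `features` is the `features` object from info.json; `video_dirs` is the
    list of folder names found under videos/chunk-XXX/ (both supplied by the caller).
    """
    video_keys = {
        key for key, spec in features.items()
        if key.startswith("observation.images.") and spec.get("dtype") == "video"
    }
    return sorted(name for name in video_dirs if name not in video_keys)


features = {"observation.images.front": {"dtype": "video"}}
print(unmatched_video_keys(features, ["observation.images.front", "front"]))
# ['front']
```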
**Requirements:**
- Codec: H.264 (`avc1`)
- FPS matches `info.json.fps`
- Frame count == number of timesteps (rows) in the corresponding Parquet file
---
## 4) `meta/`
### `info.json`
Top-level dataset metadata and templates.
```json
{
"version": "2.1",
"fps": 30,
"chunks_size": 1000,
"data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet",
"video_path": "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4",
"features": {
"observation.images.front": {
"dtype": "video",
"shape": [240, 320, 3],
"names": ["height", "width", "channel"],
"video_info": { "video.fps": 30.0, "video.codec": "avc1" }
}
}
}
```
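A loader-independent sanity check for `info.json` might look like the sketch below. The list of required keys is taken from the example above, not from an authoritative schema, and `check_info` is a hypothetical helper.

```python
import json

# Assumption: these are the keys shown in the example above.
REQUIRED_KEYS = ("version", "fps", "chunks_size", "data_path", "video_path", "features")


def check_info(info: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in info]
    if "version" in info and info["version"] != "2.1":
        problems.append(f"unexpected version: {info['version']!r}")
    for key, spec in info.get("features", {}).items():
        if spec.get("dtype") != "video":
            continue
        # Each video feature should declare the same FPS as the dataset.
        video_fps = spec.get("video_info", {}).get("video.fps")
        if video_fps != info.get("fps"):
            problems.append(f"{key}: video.fps {video_fps} != fps {info.get('fps')}")
    return problems


info = json.loads("""
{
  "version": "2.1", "fps": 30, "chunks_size": 1000,
  "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet",
  "video_path": "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4",
  "features": {
    "observation.images.front": {
      "dtype": "video",
      "video_info": {"video.fps": 30.0, "video.codec": "avc1"}
    }
  }
}
""")
print(check_info(info))
# []
```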
---
### `episodes.jsonl`
Line-delimited JSON, one entry per episode.
```json
{"episode_id":"000000","task":"pick_and_place","length":243,"chunk":0}
{"episode_id":"000001","task":"pick_and_place","length":198,"chunk":0}
```
Fields:
- `episode_id` → must match filenames (`episode_000000`)
- `task` → must exist in `tasks.jsonl`
- `length` → number of timesteps (and frames)
- `chunk` → numeric index (0,1,2…) matching folder
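The field rules above can be checked line by line. The sketch below assumes the entry shape shown in this document; `check_episodes` is a hypothetical helper, and the chunk comparison uses the `episode_index // chunks_size` rule from the chunking section.

```python
import json
import re


def check_episodes(jsonl_text: str, known_tasks: set[str], chunks_size: int) -> list[str]:
    """Validate episodes.jsonl content; return a list of problems (empty if none)."""
    problems = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        ep = json.loads(line)
        ep_id = ep["episode_id"]
        if not re.fullmatch(r"[0-9]{6}", ep_id):
            problems.append(f"line {line_no}: episode_id {ep_id!r} is not six digits")
            continue
        if ep["task"] not in known_tasks:
            problems.append(f"line {line_no}: unknown task {ep['task']!r}")
        if ep["length"] <= 0:
            problems.append(f"line {line_no}: length must be positive")
        if ep["chunk"] != int(ep_id) // chunks_size:
            problems.append(f"line {line_no}: chunk {ep['chunk']} does not match episode_id")
    return problems


text = '{"episode_id":"000000","task":"pick_and_place","length":243,"chunk":0}\n'
print(check_episodes(text, {"pick_and_place"}, chunks_size=1000))
# []
```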
---
### `tasks.jsonl`
Defines available tasks.
```json
{"task":"pick_and_place","description":"Pick an object and place it"}
{"task":"push_button","description":"Push the button until LED lights up"}
```
---
### `episodes_stats.jsonl`
Dataset statistics for analysis & normalization.
```json
{
"total_frames": 25873,
"episode_lengths": { "min": 120, "max": 450, "mean": 215 },
"action_stats": {
"x": { "min": -0.12, "max": 0.12, "mean": 0.0, "std": 0.05 },
"y": { "min": -0.1, "max": 0.1, "mean": 0.0, "std": 0.04 },
"z": { "min": 0.05, "max": 0.35, "mean": 0.2, "std": 0.06 }
}
}
```
---
## 5) Chunking explained
- **Purpose:** purely structural; keeps folders manageable and enables fast lookups.
- **Decided by dataset builder:** choose `chunks_size` (often 1000).
- **Rule:**
```
episode_chunk = episode_index // chunks_size
```
### Examples
- Small dataset (N ≤ 1000, chunks_size=1000) → all in `chunk-000`
- Large dataset (N=2000, chunks_size=1000):
  - episodes 0–999 → `chunk-000`
  - episodes 1000–1999 → `chunk-001`
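The rule and the examples above can be reproduced in a few lines. This is an illustrative sketch; `episode_chunk` and `chunk_dir` are hypothetical helper names.

```python
def episode_chunk(episode_index: int, chunks_size: int = 1000) -> int:
    """Chunk index for an episode: integer division by the chunk size."""
    return episode_index // chunks_size


def chunk_dir(episode_index: int, chunks_size: int = 1000) -> str:
    """Folder name for the chunk that holds the given episode."""
    return f"chunk-{episode_chunk(episode_index, chunks_size):03d}"


for index in (0, 999, 1000, 1999):
    print(index, chunk_dir(index))
# 0 chunk-000
# 999 chunk-000
# 1000 chunk-001
# 1999 chunk-001
```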
### Naming
- Chunk folder must be `chunk-XXX`, where `XXX` is the chunk index zero-padded to three digits (`chunk-{episode_chunk:03d}`).
---
## 6) Synchronization rules
- Parquet rows = MP4 frames = `episodes.jsonl.length`
- Episode IDs consistent across parquet, mp4, and JSONL
- Video folder name = exact feature key (`observation.images.front`)
- Codecs = H.264 (`avc1`)
- Forward slashes in all paths
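The count rule in the first bullet reduces to comparing three numbers per episode. The sketch below assumes the caller has already obtained the Parquet row count and the per-camera frame counts by whatever tooling is available; `sync_problems` is a hypothetical helper.

```python
def sync_problems(length: int, parquet_rows: int,
                  video_frames: dict[str, int]) -> list[str]:
    """Compare episodes.jsonl length with Parquet rows and per-camera frame counts."""
    problems = []
    if parquet_rows != length:
        problems.append(f"parquet rows {parquet_rows} != length {length}")
    for video_key, frames in video_frames.items():
        if frames != length:
            problems.append(f"{video_key}: {frames} frames != length {length}")
    return problems


print(sync_problems(243, 243, {"observation.images.front": 242}))
# ['observation.images.front: 242 frames != length 243']
```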
---
## 7) Common pitfalls
1. Wrong folder naming (`front/` instead of `observation.images.front/`)
2. Skipping `chunk-000` entirely (must exist, even for small datasets)
3. Filename mismatch (`000001.mp4` vs `episode_000001.mp4`)
4. A single nested or pretty-printed JSON document instead of JSONL (one JSON object per line)
5. Using unsupported codec (HEVC/AV1)
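Pitfalls 1 to 3 are naming errors, so simple pattern checks can catch them. In the sketch below the patterns are derived from the path templates in this document. Because `:03d` and `:06d` are minimum widths in Python format specs, the patterns accept longer numbers as well; the helper names are hypothetical.

```python
import re

# Derived from "chunk-{episode_chunk:03d}" and "episode_{episode_index:06d}".
CHUNK_DIR = re.compile(r"chunk-[0-9]{3,}")
EPISODE_FILE = re.compile(r"episode_[0-9]{6,}\.(parquet|mp4)")


def is_valid_chunk_dir(name: str) -> bool:
    return CHUNK_DIR.fullmatch(name) is not None


def is_valid_episode_file(name: str) -> bool:
    return EPISODE_FILE.fullmatch(name) is not None


def is_valid_video_dir(name: str) -> bool:
    # Camera folders must carry the full feature key, not a short alias.
    return name.startswith("observation.images.") and len(name) > len("observation.images.")


print(is_valid_episode_file("000001.mp4"))          # False
print(is_valid_episode_file("episode_000001.mp4"))  # True
print(is_valid_video_dir("front"))                  # False
```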
---
## 8) Quick implementation checklist
- [ ] `info.json.version == "2.1"`
- [ ] `chunks_size` defined (e.g., 1000)
- [ ] `data_path` and `video_path` templates correct
- [ ] All `video_key` entries match `features` keys under `observation.images.*`
- [ ] Episode IDs zero‑padded: `episode_{:06d}`
- [ ] `episodes.jsonl` one JSON per line
- [ ] Parquet rows == MP4 frames == `length`
- [ ] Codec = H.264 (`avc1`), FPS correct
- [ ] All paths use `/` slashes
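The file-layout items in this checklist can be automated with the standard library alone. The following sketch lists every expected file that is absent from a dataset root. It does not inspect file contents, codecs, or frame counts, and `missing_files` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

META_FILES = ("info.json", "episodes.jsonl", "tasks.jsonl", "episodes_stats.jsonl")


def missing_files(root: Path, episode_indices: list[int],
                  video_keys: list[str], chunks_size: int = 1000) -> list[str]:
    """Return expected relative paths (with forward slashes) that do not exist."""
    expected = [f"meta/{name}" for name in META_FILES]
    for index in episode_indices:
        chunk = index // chunks_size
        expected.append(f"data/chunk-{chunk:03d}/episode_{index:06d}.parquet")
        for key in video_keys:
            expected.append(f"videos/chunk-{chunk:03d}/{key}/episode_{index:06d}.mp4")
    return [rel for rel in expected if not (root / rel).is_file()]


# Demo on a throwaway directory that contains only two of the expected files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ("meta/info.json", "data/chunk-000/episode_000000.parquet"):
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    demo_missing = missing_files(root, [0], ["observation.images.front"])

print(demo_missing)
```

In the demo, the three remaining `meta/` files and the front-camera MP4 are reported as missing.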
---
## 9) Minimal valid dataset (≤1000 episodes)
```
my_dataset/
├─ data/chunk-000/episode_000000.parquet
├─ data/chunk-000/episode_000001.parquet
├─ videos/chunk-000/observation.images.front/episode_000000.mp4
├─ videos/chunk-000/observation.images.front/episode_000001.mp4
└─ meta/{info.json, episodes.jsonl, tasks.jsonl, episodes_stats.jsonl}
```