
	# LeRobot Dataset Format v2.1 — Complete Spec

	This document describes the structure and syntax of a dataset repository that follows the LeRobot 2.1 standard.
	It is intended as an implementation guide: follow it exactly to stay compatible with the dataset viewer and the `LeRobotDataset` loader.

	---

	## 1) Repository Layout (typical)

	```
	<dataset-root>/
	├─ data/
	│  ├─ chunk-000/
	│  │  ├─ episode_000000.parquet
	│  │  └─ episode_000001.parquet
	│  └─ chunk-001/
	│     └─ ...
	├─ videos/
	│  ├─ chunk-000/
	│  │  ├─ observation.images.front/
	│  │  │  ├─ episode_000000.mp4
	│  │  │  └─ episode_000001.mp4
	│  │  └─ observation.images.wrist/
	│  │     ├─ episode_000000.mp4
	│  │     └─ episode_000001.mp4
	│  └─ chunk-001/
	│     └─ ...
	├─ meta/
	│  ├─ info.json
	│  ├─ episodes.jsonl
	│  ├─ tasks.jsonl
	│  └─ episodes_stats.jsonl  # (v2.1); older v2.0 used stats.json
	└─ README.md
	```

	---

	## 2) `data/`

	- Contains one Parquet file per episode.
	- Path pattern (from `info.json.data_path`):
	```
	data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet
	```

	Each Parquet file stores synchronized timesteps with:

	- `timestamp`
	- `observations` (sensor values, images, poses, etc.)
	- `actions` (target joint commands, end-effector states, etc.)
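
As a minimal sketch, the `data_path` pattern above can be filled in with Python's `str.format`, because the template uses Python format-spec syntax (`:03d`, `:06d`). The helper name `episode_data_path` is hypothetical and not part of any LeRobot API; the chunk index is passed in explicitly.

```python
# Template copied from info.json.data_path as given in this spec.
DATA_PATH = "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet"


def episode_data_path(episode_index: int, episode_chunk: int) -> str:
    """Hypothetical helper: fill the data_path template for one episode."""
    return DATA_PATH.format(episode_chunk=episode_chunk, episode_index=episode_index)


print(episode_data_path(episode_index=1, episode_chunk=0))
# data/chunk-000/episode_000001.parquet
```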

	---

	## 3) `videos/`

	- Contains one MP4 video per episode per camera key.
	- Path pattern (from `info.json.video_path`):
	```
	videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4
	```
	- `{video_key}` MUST exactly match a feature key in `info.json.features` under `observation.images.*`.

	Requirements:

	- Codec: H.264 (`avc1`)
	- FPS matches `info.json.fps`
	- Frame count == number of timesteps (rows) in the corresponding Parquet file
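
The key-matching rule can be enforced at the point where a video path is built. The sketch below assumes `features` is the already-parsed `info.json.features` object; `episode_video_path` is a hypothetical helper name.

```python
# Template copied from info.json.video_path as given in this spec.
VIDEO_PATH = "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4"
IMAGE_PREFIX = "observation.images."


def episode_video_path(features: dict, video_key: str,
                       episode_index: int, episode_chunk: int) -> str:
    """Hypothetical helper: build a video path, rejecting unknown camera keys."""
    if video_key not in features:
        raise ValueError(f"{video_key!r} is not a key in info.json.features")
    if not video_key.startswith(IMAGE_PREFIX):
        raise ValueError(f"{video_key!r} is not under {IMAGE_PREFIX}*")
    return VIDEO_PATH.format(
        episode_chunk=episode_chunk,
        video_key=video_key,
        episode_index=episode_index,
    )


features = {"observation.images.front": {"dtype": "video"}}
print(episode_video_path(features, "observation.images.front", 0, 0))
```

Passing a shortened key such as `front` raises `ValueError` instead of silently producing a folder the viewer cannot find.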

	---

	## 4) `meta/`

	### `info.json`

	Top-level dataset metadata and templates.

	```json
	{
	  "version": "2.1",
	  "fps": 30,
	  "chunks_size": 1000,
	  "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet",
	  "video_path": "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4",
	  "features": {
	    "observation.images.front": {
	      "dtype": "video",
	      "shape": [240, 320, 3],
	      "names": ["height", "width", "channel"],
	      "video_info": { "video.fps": 30.0, "video.codec": "avc1" }
	    }
	  }
	}
	```
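
One way to catch metadata drift early is to compare each video feature's `video_info` with the top-level `fps` and the H.264 requirement. This is a sketch that uses only the fields shown in the example above; `video_feature_problems` is a hypothetical name.

```python
import json

# Trimmed copy of the info.json example above (path templates omitted).
INFO_JSON = """
{
  "version": "2.1",
  "fps": 30,
  "chunks_size": 1000,
  "features": {
    "observation.images.front": {
      "dtype": "video",
      "shape": [240, 320, 3],
      "names": ["height", "width", "channel"],
      "video_info": {"video.fps": 30.0, "video.codec": "avc1"}
    }
  }
}
"""


def video_feature_problems(info: dict) -> list[str]:
    """Hypothetical check: list video features whose fps or codec disagree with the spec."""
    problems = []
    for key, feature in info["features"].items():
        if feature.get("dtype") != "video":
            continue  # only video features carry video_info
        video_info = feature.get("video_info", {})
        if video_info.get("video.fps") != info["fps"]:
            problems.append(f"{key}: video.fps differs from info.json.fps")
        if video_info.get("video.codec") != "avc1":
            problems.append(f"{key}: codec is not avc1")
    return problems


info = json.loads(INFO_JSON)
print(video_feature_problems(info))
```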

	---

	### `episodes.jsonl`

	Line-delimited JSON, one entry per episode.

	```json
	{"episode_id":"000000","task":"pick_and_place","length":243,"chunk":0}
	{"episode_id":"000001","task":"pick_and_place","length":198,"chunk":0}
	```

	Fields:

	- `episode_id` → must match filenames (`episode_000000`)
	- `task` → must exist in `tasks.jsonl`
	- `length` → number of timesteps (and frames)
	- `chunk` → numeric chunk index (0, 1, 2, …) matching the episode's `chunk-XXX` folder
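
The field rules above can be checked while reading the file line by line. In this sketch the chunk check reuses the `episode_index // chunks_size` rule from the chunking section, and the positive-length check is an extra assumption rather than something the spec states; `parse_episodes` is a hypothetical name.

```python
import json

EPISODES_JSONL = """\
{"episode_id":"000000","task":"pick_and_place","length":243,"chunk":0}
{"episode_id":"000001","task":"pick_and_place","length":198,"chunk":0}
"""


def parse_episodes(text: str, chunks_size: int = 1000) -> list[dict]:
    """Hypothetical parser: one JSON object per line, validated field by field."""
    episodes = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines such as a trailing newline
        episode = json.loads(line)
        episode_id = episode["episode_id"]
        if not (len(episode_id) == 6 and episode_id.isascii() and episode_id.isdigit()):
            raise ValueError(f"line {line_number}: episode_id must be 6 digits")
        if episode["chunk"] != int(episode_id) // chunks_size:
            raise ValueError(f"line {line_number}: chunk does not match episode_id")
        if episode["length"] <= 0:  # assumption: an episode has at least one timestep
            raise ValueError(f"line {line_number}: length must be positive")
        episodes.append(episode)
    return episodes


for episode in parse_episodes(EPISODES_JSONL):
    print(episode["episode_id"], episode["length"])
```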

	---

	### `tasks.jsonl`

	Defines available tasks.

	```json
	{"task":"pick_and_place","description":"Pick an object and place it"}
	{"task":"push_button","description":"Push the button until LED lights up"}
	```
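
Because every `task` in `episodes.jsonl` must exist in `tasks.jsonl`, a cross-check between the two files is cheap to run. A minimal sketch follows; `unknown_tasks` is a hypothetical name, and the `open_drawer` episode is invented here to trigger the check.

```python
import json

TASKS_JSONL = """\
{"task":"pick_and_place","description":"Pick an object and place it"}
{"task":"push_button","description":"Push the button until LED lights up"}
"""

# The second line uses a task that is deliberately missing from TASKS_JSONL.
EPISODES_JSONL = """\
{"episode_id":"000000","task":"pick_and_place","length":243,"chunk":0}
{"episode_id":"000001","task":"open_drawer","length":198,"chunk":0}
"""


def unknown_tasks(episodes_text: str, tasks_text: str) -> set[str]:
    """Hypothetical check: task names used by episodes but not defined in tasks.jsonl."""
    defined = {json.loads(line)["task"] for line in tasks_text.splitlines() if line.strip()}
    used = {json.loads(line)["task"] for line in episodes_text.splitlines() if line.strip()}
    return used - defined


print(unknown_tasks(EPISODES_JSONL, TASKS_JSONL))
```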

	---

	### `episodes_stats.jsonl`

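	Dataset statistics for analysis and normalization. The record below is pretty-printed for readability; in a `.jsonl` file each record occupies a single line.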

	```json
	{
	  "total_frames": 25873,
	  "episode_lengths": { "min": 120, "max": 450, "mean": 215 },
	  "action_stats": {
	    "x": { "min": -0.12, "max": 0.12, "mean": 0.0, "std": 0.05 },
	    "y": { "min": -0.1, "max": 0.1, "mean": 0.0, "std": 0.04 },
	    "z": { "min": 0.05, "max": 0.35, "mean": 0.2, "std": 0.06 }
	  }
	}
	```
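
The `min`/`max`/`mean`/`std` entries can be produced with Python's standard `statistics` module. This sketch assumes a population standard deviation, which the spec does not specify, and uses made-up sample values; `summarize` is a hypothetical name.

```python
import json
import statistics


def summarize(values: list[float]) -> dict:
    """Hypothetical helper: min/max/mean/std for one action dimension."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),  # assumption: population std
    }


# Made-up sample values for the "x" action dimension.
x_values = [-0.1, 0.0, 0.1]
record = {"action_stats": {"x": summarize(x_values)}}

# json.dumps without indent yields a single line, as JSONL requires.
line = json.dumps(record)
print(line)
```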

	---

	## 5) Chunking explained

	- Purpose: purely structural; keeps folders manageable and enables fast lookups.
	- Decided by dataset builder: choose `chunks_size` (often 1000).
	- Rule:
	```
	episode_chunk = episode_index // chunks_size
	```

	### Examples

	- Small dataset (N ≤ 1000, chunks_size=1000) → all in `chunk-000`
	- Large dataset (N=2000, chunks_size=1000):
	  - episodes 0–999 → `chunk-000`
	  - episodes 1000–1999 → `chunk-001`
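
The chunk rule and the examples above translate directly into integer division. The helpers `chunk_dir` and `total_chunks` are hypothetical names used only for this sketch.

```python
def chunk_dir(episode_index: int, chunks_size: int = 1000) -> str:
    """Hypothetical helper: folder name for the chunk that holds an episode."""
    return f"chunk-{episode_index // chunks_size:03d}"


def total_chunks(num_episodes: int, chunks_size: int = 1000) -> int:
    """Number of chunk folders needed for num_episodes (ceiling division)."""
    return -(-num_episodes // chunks_size)


print(chunk_dir(999))      # chunk-000
print(chunk_dir(1000))     # chunk-001
print(total_chunks(2000))  # 2
```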

	### Naming

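	- Chunk folders must be named `chunk-XXX`, where `XXX` is the chunk index zero-padded to three digits (the `{episode_chunk:03d}` part of the path templates).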

	---

	## 6) Synchronization rules

	- Parquet rows = MP4 frames = `episodes.jsonl.length`
	- Episode IDs consistent across parquet, mp4, and JSONL
	- Video folder name = exact feature key (`observation.images.front`)
	- Codecs = H.264 (`avc1`)
	- Forward slashes in all paths
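
The count rule can be expressed as a small comparison. Reading row and frame counts from real Parquet and MP4 files needs third-party libraries, so this sketch takes the counts as plain integers; `sync_problems` is a hypothetical name.

```python
def sync_problems(declared_length: int, parquet_rows: int,
                  video_frames: dict[str, int]) -> list[str]:
    """Hypothetical check: compare Parquet rows and per-camera frames to `length`."""
    problems = []
    if parquet_rows != declared_length:
        problems.append(f"parquet has {parquet_rows} rows, expected {declared_length}")
    for video_key, frames in video_frames.items():
        if frames != declared_length:
            problems.append(f"{video_key} has {frames} frames, expected {declared_length}")
    return problems


# Counts for one episode, supplied by the caller.
print(sync_problems(243, 243, {"observation.images.front": 243,
                               "observation.images.wrist": 242}))
```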

	---

	## 7) Common pitfalls

	1. Wrong folder naming (`front/` instead of `observation.images.front/`)
	2. Skipping `chunk-000` entirely (must exist, even for small datasets)
	3. Filename mismatch (`000001.mp4` vs `episode_000001.mp4`)
	4. A single nested JSON document (or array) instead of JSONL with one object per line
	5. Using unsupported codec (HEVC/AV1)
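
Several of these pitfalls show up in the path alone and can be caught with regular expressions. This is a sketch for video paths only; `video_path_problems` is a hypothetical name.

```python
import re

CHUNK_DIR = re.compile(r"chunk-[0-9]{3}")
EPISODE_MP4 = re.compile(r"episode_[0-9]{6}\.mp4")


def video_path_problems(path: str) -> list[str]:
    """Hypothetical check: flag naming pitfalls in a dataset-relative video path."""
    problems = []
    if "\\" in path:
        problems.append("path uses backslashes")
    parts = path.replace("\\", "/").split("/")
    if len(parts) != 4 or parts[0] != "videos":
        return problems + ["expected videos/<chunk>/<video_key>/<file>"]
    _, chunk, video_key, filename = parts
    if not CHUNK_DIR.fullmatch(chunk):
        problems.append(f"bad chunk folder: {chunk}")
    if not video_key.startswith("observation.images."):
        problems.append(f"video folder is not a full feature key: {video_key}")
    if not EPISODE_MP4.fullmatch(filename):
        problems.append(f"bad filename: {filename}")
    return problems


print(video_path_problems("videos/chunk-000/front/000001.mp4"))
```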

	---

	## 8) Quick implementation checklist

	- [ ] `info.json.version == "2.1"`
	- [ ] `chunks_size` defined (e.g., 1000)
	- [ ] `data_path` and `video_path` templates correct
	- [ ] All `video_key` entries match `features` keys under `observation.images.*`
	- [ ] Episode IDs zero-padded: `episode_{:06d}`
	- [ ] `episodes.jsonl` one JSON per line
	- [ ] Parquet rows == MP4 frames == `length`
	- [ ] Codec = H.264 (`avc1`), FPS correct
	- [ ] All paths use `/` slashes

	---

	## 9) Minimal valid dataset (≤1000 episodes)

	```
	my_dataset/
	├─ data/chunk-000/episode_000000.parquet
	├─ data/chunk-000/episode_000001.parquet
	├─ videos/chunk-000/observation.images.front/episode_000000.mp4
	├─ videos/chunk-000/observation.images.front/episode_000001.mp4
	└─ meta/{info.json, episodes.jsonl, tasks.jsonl, episodes_stats.jsonl}
	```
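
As a rough sketch, the folder skeleton above can be created with `pathlib`. The helper `scaffold_dataset` is hypothetical; it writes only the folders, a trimmed `info.json` (the `features` object is omitted), and empty JSONL files, leaving the Parquet and MP4 files to whatever records the episodes.

```python
import json
import tempfile
from pathlib import Path


def scaffold_dataset(root: Path, video_key: str = "observation.images.front") -> None:
    """Hypothetical helper: create folders and metadata files for a small dataset."""
    (root / "data" / "chunk-000").mkdir(parents=True, exist_ok=True)
    (root / "videos" / "chunk-000" / video_key).mkdir(parents=True, exist_ok=True)
    meta = root / "meta"
    meta.mkdir(exist_ok=True)
    info = {
        "version": "2.1",
        "fps": 30,
        "chunks_size": 1000,
        "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet",
        "video_path": "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4",
    }
    (meta / "info.json").write_text(json.dumps(info, indent=2))
    # Empty JSONL files; records are appended as episodes are recorded.
    for name in ("episodes.jsonl", "tasks.jsonl", "episodes_stats.jsonl"):
        (meta / name).touch()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "my_dataset"
    scaffold_dataset(root)
    for path in sorted(root.rglob("*")):
        print(path.relative_to(root).as_posix())
```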