Paper · Website · Checkpoints
mimic-video extracts generalist, language-conditioned robot policies (Video-Action Models / VAMs) from pretrained video models by conditioning small action decoders on the video backbone's latent representations. Because the decoders draw on the video model's knowledge of real-world dynamics and behaviors, they can be trained efficiently and without updating the video model. By decoupling the flow times for video and for actions, inference requires only a single video model forward pass per action chunk.
We instantiate our approach with the lightweight 2B Cosmos-Predict2 video model and release trained checkpoints for Bridge and LIBERO.
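To make the inference scheme concrete, here is a minimal pseudocode sketch of one control step. All names and signatures (`video_backbone`, `action_decoder`, the `flow_time` argument) are illustrative placeholders rather than this repository's actual API; see MODEL.md for the real interfaces.

```python
import torch

# Hypothetical stand-ins for the real components in this repo:
#   video_backbone - frozen pretrained video model (e.g. the 2B Cosmos-Predict2 backbone)
#   action_decoder - small flow-matching decoder trained on the backbone's latents

@torch.no_grad()
def predict_action_chunk(video_backbone, action_decoder, context_frames, instruction,
                         video_flow_time=0.5, num_action_steps=10):
    """One control step: a single video-backbone forward pass, then iterative action decoding."""
    # 1. Single forward pass of the frozen video model at a fixed video flow time.
    #    Its intermediate latents summarize the predicted scene dynamics.
    latents = video_backbone(context_frames, instruction, flow_time=video_flow_time)

    # 2. Integrate the action flow from noise to an action chunk, reusing the same latents,
    #    so the expensive video model is never re-run within the chunk.
    actions = torch.randn(action_decoder.chunk_shape)
    for step in range(num_action_steps):
        action_flow_time = step / num_action_steps
        velocity = action_decoder(actions, latents, flow_time=action_flow_time)
        actions = actions + velocity / num_action_steps  # Euler step

    return actions  # executed until the next chunk is predicted
```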
We provide our data (DATA.md), modeling (MODEL.md), and evaluation (EVAL.md) code. See the respective markdowns for details.
```
mimic-video
├── data_preprocessing   # data preprocessing
├── eval                 # evaluation
└── model                # dataloading, model architecture, training, and inference
```
- Create uv environment.
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
cd model
uv sync --extra cu126
source .venv/bin/activate
```
- (Optional) Download trained bridge or libero checkpoints.
```bash
hf auth login
python scripts/download_checkpoints.py
```
If you want to run guardrails (enabled by default), additionally request access to and download nvidia/Cosmos-Guardrail1 and meta-llama/Llama-Guard-3-8B to the checkpoints directory.
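If you prefer to script the guardrail downloads, the sketch below uses `huggingface_hub`; the target layout under `checkpoints/` is an assumption, so adjust `local_dir` to wherever `download_checkpoints.py` places the other models.

```python
# Sketch: fetch the gated guardrail models after `hf auth login` (or with HF_TOKEN set).
# The local_dir layout is assumed; match it to your checkpoints directory.
from huggingface_hub import snapshot_download

for repo_id in ["nvidia/Cosmos-Guardrail1", "meta-llama/Llama-Guard-3-8B"]:
    snapshot_download(
        repo_id=repo_id,
        local_dir=f"checkpoints/{repo_id}",  # assumed target path
    )
```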
You can find an overview of the training repository (which is built on the Cosmos-Predict2 repo) in MODEL.md. A quickstart for training your own models is given below.
Multi-node & multi-GPU configuration is handled through torchrun.
This assumes you have downloaded at least the text encoder, video tokenizer, and v2w_pretrained_cosmos.
- Extract videos and language instructions.
  - Choose a `/path/to/dataset/`.
  - Populate `/path/to/dataset/video/` with `ep.mp4` and `/path/to/dataset/metas/` with `ep.txt` files. Example scripts for bridge and libero are provided in data_preprocessing/video.
- Precompute language embeddings in `/path/to/dataset/t5_xxl/`.
```bash
cd data_preprocessing/video/
python get_t5_embeddings.py --dataset_path /path/to/dataset/
```
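Before launching finetuning, it can help to sanity-check the layout. The sketch below is not part of the repo; it only assumes the three directories described above and makes no assumption about the embedding file format.

```python
# Sketch: verify that every video has a matching language instruction and that
# embeddings were written. Only the directory names from the steps above are assumed.
from pathlib import Path

dataset = Path("/path/to/dataset")
videos = {p.stem for p in (dataset / "video").glob("*.mp4")}
metas = {p.stem for p in (dataset / "metas").glob("*.txt")}
embeddings = list((dataset / "t5_xxl").iterdir()) if (dataset / "t5_xxl").exists() else []

print(f"{len(videos)} videos, {len(metas)} instructions, {len(embeddings)} embedding files")
print("videos without instructions:", sorted(videos - metas)[:5])
print("instructions without videos:", sorted(metas - videos)[:5])
```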
- Create video finetuning config.
  - Add your dataset to `train_datasets` in data_video.py (line 24).
  - Add your experiment hyperparameters to video2world.py.
- Start training with torchrun. The experiment name is the one you defined in video2world.py in the previous step.
```bash
torchrun -m scripts.train --config=cosmos_predict2/configs/config.py -- experiment=...
```
This assumes you have downloaded the text encoder, video tokenizer, and the video backbone you would like to train an action decoder for.
- Download raw data and unzip.
```bash
aria2c -x 16 -s 16 -c "https://rail.eecs.berkeley.edu/datasets/bridge_release/data/demos_8_17.zip"
7z x demos_8_17.zip -obridge/
```
- Convert to zarr.
```bash
cd data_preprocessing/action/
python process_bridge.py --raw-dir ../../bridge/raw --output-dir /path/to/data/bridge/
```
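To sanity-check the conversion, you can inspect the resulting zarr store. This is a generic sketch with the `zarr` package; it makes no assumption about the group or array names inside the store (adjust the path if the store is nested in the output directory).

```python
# Sketch: print the structure (groups, array shapes, dtypes) of the converted dataset.
import zarr

root = zarr.open("/path/to/data/bridge", mode="r")
print(root.tree())
```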
- Precompute language embeddings.
```bash
python precompute_t5.py --dataset-path /path/to/data/bridge/
```
- Create training config.
  - Adapt `dataset.data_dir` in bridge.yaml to point to the directory containing the data you want to train on. See DATA.md for details on the data config structure.
  - Choose training hyperparameters (cross-attention layer, learning rate, batch size) and the video model checkpoint in experiment/world2action.py. To use the same hyperparameters as the pretrained checkpoints, you can select the correct configuration via the experiment name without changing code. (An illustrative sketch of the cross-attention conditioning follows these steps.)
- Start training with torchrun. The experiment name is the one you defined in world2action.py in the previous step.
```bash
cd ../../model
torchrun -m scripts.train --config=cosmos_predict2/configs/config.py -- experiment=...
```
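For intuition about the cross-attention layer hyperparameter mentioned above: the action decoder reads latents from one layer of the frozen video backbone via cross-attention. Below is a purely illustrative, minimal module (assumed names and dimensions, not this repository's implementation) showing that pattern.

```python
# Illustrative only: a tiny action decoder whose learned action queries cross-attend
# to video latents taken from one chosen layer of the frozen backbone.
import torch
import torch.nn as nn

class TinyActionDecoder(nn.Module):
    def __init__(self, action_dim=7, chunk_len=16, latent_dim=1024, hidden_dim=512):
        super().__init__()
        self.action_queries = nn.Parameter(torch.randn(chunk_len, hidden_dim))
        self.latent_proj = nn.Linear(latent_dim, hidden_dim)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.head = nn.Linear(hidden_dim, action_dim)

    def forward(self, video_latents):  # (batch, num_tokens, latent_dim) from one backbone layer
        keys = self.latent_proj(video_latents)
        queries = self.action_queries.unsqueeze(0).expand(video_latents.shape[0], -1, -1)
        attended, _ = self.cross_attn(queries, keys, keys)
        return self.head(attended)  # (batch, chunk_len, action_dim)
```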
- Follow the LIBERO dependency installation steps.
- Download the official datasets.
```bash
cd LIBERO
python benchmark_scripts/download_libero_datasets.py --use-huggingface
```
- Regenerate h5 recordings (filter success, filter no-op, rotate image, re-render at higher resolution).
```bash
cd ../../../data_preprocessing/action/
PYTHONPATH=../../eval/libero/LIBERO/ python regenerate_libero.py --in-dir /path/to/libero/datasets/ --out-dir /path/to/libero/regenerated_datasets/
```
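If you want to spot-check a regenerated recording before converting it, a small `h5py` sketch like the one below prints a file's group and dataset structure; it makes no assumption about the specific keys inside.

```python
# Sketch: list every group and dataset in one regenerated HDF5 file.
import h5py

# Adjust the path to an actual regenerated file.
with h5py.File("/path/to/libero/regenerated_datasets/some_task_demo.hdf5", "r") as f:
    f.visititems(lambda name, obj: print(name, getattr(obj, "shape", "")))
```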
- Convert to zarr.
```bash
python process_libero.py --input-dir /path/to/libero/regenerated_datasets/ --output-dir /path/to/data/
```
- Precompute language embeddings.
```bash
python precompute_t5.py --dataset-path /path/to/data/libero_*
```
- Create training config.
  - Adapt `dataset.data_dir` in libero.yaml to point to the directory containing the data you want to train on. See DATA.md for details on the data config structure. If you want to train on different LIBERO subsets, you might want to set it from the top level of the data config (and have several of those).
  - Choose training hyperparameters (cross-attention layer, learning rate, batch size) and the video model checkpoint in experiment/world2action.py. To use the same hyperparameters as the pretrained checkpoints, you can select the correct configuration via the experiment name without changing code.
- Start training with torchrun. The experiment name is the one you defined in world2action.py in the previous step.
```bash
cd ../../model
torchrun -m scripts.train --config=cosmos_predict2/configs/config.py -- experiment=...
```
We have integrated vanilla SIMPLER-Bridge, human-in-the-loop SIMPLER-Bridge (for ground-truth future video generation), and vanilla LIBERO evals in this repo. To reproduce the sim results with our checkpoints, follow these quick steps:
```bash
sudo apt install libvulkan1
cd eval/bridge
uv pip install -r SimplerEnv/requirements.txt
uv pip install -e SimplerEnv/ManiSkill2_real2sim
uv pip install -e SimplerEnv
```
This assumes you have the checkpoints from Environment Setup and Downloading Checkpoints.
```bash
# Adapt the `GPUS` list (line 1) to which GPUs to parallelize over (now: 0-7).
# Adapt line 8 for how many evals can run in parallel per GPU (now: 2).
# Fill in `checkpoint_dir` with the path to the checkpoint directory (line 33).
bash eval.sh
```
For this one, you have to sit down and teleop: the policy receives the ground-truth future video from the teleop, adds noise to it, and then decodes actions.
```bash
# Fill in `checkpoint_dir` with the path to the checkpoint directory (line 1).
bash eval_hil.sh
```
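Conceptually, the human-in-the-loop path differs from standard inference only in where the future video comes from. The sketch below is a hedged illustration of that step; the names and the exact flow-time convention are assumptions, not the repository's API.

```python
import torch

@torch.no_grad()
def hil_latents(video_backbone, context_frames, gt_future_frames, instruction,
                video_flow_time=0.5):
    """HIL variant: the teleoperated ground-truth future video, noised to the video
    flow time, replaces generated video before the single backbone forward pass.
    Names and the interpolation convention are illustrative assumptions."""
    noise = torch.randn_like(gt_future_frames)
    # one common flow-matching convention: x_t = (1 - t) * noise + t * data
    noisy_future = (1.0 - video_flow_time) * noise + video_flow_time * gt_future_frames
    return video_backbone(context_frames, noisy_future, instruction, flow_time=video_flow_time)
```

Action decoding then proceeds as in the sketch near the top of this README, reusing these latents for the whole chunk.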
```bash
cd eval/libero
uv pip install -r LIBERO/requirements.txt
uv pip install -e LIBERO
```
This also assumes you have the checkpoints from Environment Setup and Downloading Checkpoints.
```bash
# Adapt the `GPUS` list (line 1) to which GPUs to parallelize over (now: 0-7).
# Adapt line 8 for how many evals can run in parallel per GPU (now: 2).
# Fill in `checkpoint_dir` with the path to the checkpoint directory (line 29).
bash eval.sh
```
Copyright 2026 mimic-video authors and mimic robotics AG
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this repository except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```bibtex
@misc{pai2025mimicvideo,
  title={mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs},
  author={Jonas Pai and Liam Achenbach and Victoriano Montesinos and Benedek Forrai and Oier Mees and Elvis Nava},
  year={2025},
  eprint={2512.15692},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2512.15692},
}
```