LTX Video 2 Quickstart¶
In this example, we'll train an LTX Video 2 LoRA using the LTX-2 video/audio VAEs and a Gemma3 text encoder.
Hardware requirements¶
LTX Video 2 is a heavy 19B model. It combines:

1. Gemma3: the text encoder.
2. LTX-2 Video VAE (plus the Audio VAE when conditioning on audio).
3. 19B Video Transformer: a large DiT backbone.
This setup is VRAM-intensive, and the VAE pre-caching step can spike memory usage.
- Single-GPU training: start with `train_batch_size: 1` and enable group offload.
    - Note: the initial VAE pre-caching step can require more VRAM. You may need CPU offloading or a larger GPU just for the caching phase.
    - Tip: set `"offload_during_startup": true` in your `config.json` so the VAE and text encoder are never loaded onto the GPU at the same time, which significantly reduces pre-caching memory pressure.
- Multi-GPU training: FSDP2 or aggressive group offload is recommended if you need more headroom.
- System RAM: 64GB+ is recommended for larger runs; more RAM helps with caching.
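A rough back-of-envelope estimate helps explain these requirements. The sketch below computes the VRAM needed just to hold the 19B transformer weights at a given precision; activations, gradients, and optimizer state come on top of this, which is why the bf16 field reports below land well above the weight footprint alone.

```python
def weight_gib(params: float, bytes_per_param: int) -> float:
    """GiB needed to hold the model weights alone at a given precision."""
    return params * bytes_per_param / 1024**3

# 19B-parameter transformer:
bf16 = weight_gib(19e9, 2)  # 2 bytes/param in bf16 -> ~35.4 GiB
int8 = weight_gib(19e9, 1)  # 1 byte/param in int8  -> ~17.7 GiB
print(f"bf16 weights: {bf16:.1f} GiB, int8 weights: {int8:.1f} GiB")
```

This lines up with the observed numbers: ~35 GiB of bf16 weights plus training overhead approaches the ~48 GB no-offload figure, while int8 quantization roughly halves the weight footprint.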
Observed performance and memory (field reports)¶
- Baseline settings: 480p, 17 frames, batch size 2 (minimal video length/resolution).
- RamTorch (incl. text encoder): ~13 GB VRAM used on an AMD 7900XTX.
- NVIDIA 3090/4090/5090+ should see similar or better VRAM headroom.
- No offload (int8 TorchAO): ~29-30 GB VRAM used; 32 GB hardware recommended.
- Peak system RAM: ~46 GB when loading bf16 Gemma3 then quantizing to int8 (~32 GB VRAM).
- Peak system RAM: ~34 GB when loading bf16 LTX-2 transformer then quantizing to int8 (~30 GB VRAM).
- No offload (full bf16): ~48 GB VRAM required for model training without any offload enabled.
- Throughput:
    - ~8 sec/step on A100-80G SXM4 (no compile).
    - ~16 sec/step on 7900XTX (local run).
    - ~30 min for 200 steps on A100-80G SXM4.
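These throughput figures can be sanity-checked against each other with simple arithmetic: at ~8 sec/step, 200 steps should take just under half an hour, matching the A100 report.

```python
def eta_minutes(sec_per_step: float, steps: int) -> float:
    """Estimated wall-clock time for a run, in minutes."""
    return sec_per_step * steps / 60

print(eta_minutes(8, 200))   # A100-80G: ~26.7 min for 200 steps
print(eta_minutes(16, 200))  # 7900XTX:  ~53.3 min for 200 steps
```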
Memory offloading (Critical)¶
For most single-GPU setups training LTX Video 2, you should enable grouped offloading. It is optional but recommended to keep VRAM headroom for larger batches/resolutions.
Add this to your config.json:
View example config
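As a minimal sketch, the offload-related keys below are the ones used by the 7900XTX example config later on this page. The exact option names for grouped offload can vary between SimpleTuner versions, so verify against the options reference for your installed release.

```json
{
  "offload_during_startup": true,
  "ramtorch": true,
  "ramtorch_text_encoder": true
}
```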
Prerequisites¶
Ensure Python 3.12 is installed.
Installation¶
```bash
pip install 'simpletuner[cuda]'

# CUDA 13 / Blackwell users (NVIDIA B-series GPUs)
pip install 'simpletuner[cuda13]' --extra-index-url https://download.pytorch.org/whl/cu130
```
See INSTALL.md for advanced installation options.
Setting up the environment¶
Web interface¶
Access the web interface at http://localhost:8001.

Manual configuration¶
Run the helper script:
Or copy the example and edit manually:
Configuration parameters¶
Key settings for LTX Video 2:
- `model_family`: `ltxvideo2`
- `model_flavour`: `dev` (default), `dev-fp4`, `dev-fp8`, `2.3-dev`, or `2.3-distilled`.
- `pretrained_model_name_or_path`: `Lightricks/LTX-2`, `dg845/LTX-2.3-Diffusers`, `dg845/LTX-2.3-Distilled-Diffusers`, or a local `.safetensors` file.
- `train_batch_size`: `1`. Do not increase this unless you have an A100/H100.
- `validation_resolution`: `512x768` is a safe default for testing. `720x1280` (720p) is possible but heavy.
- `validation_num_video_frames`: must be compatible with VAE compression (4x).
    - For 5s (at ~12-24fps): use `61` or `49`.
    - Formula: `(frames - 1) % 4 == 0`.
- `validation_guidance`: `5.0`.
- `frame_rate`: default is 25.
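The frame-count formula above can be sketched as a quick validity check before launching a run:

```python
def valid_frame_count(frames: int) -> bool:
    # The LTX-2 video VAE compresses time 4x, so (frames - 1)
    # must be divisible by 4.
    return frames >= 1 and (frames - 1) % 4 == 0

print([n for n in (17, 49, 61, 81) if valid_frame_count(n)])  # all valid
print(valid_frame_count(60))  # False: 59 is not divisible by 4
```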
LTX-2 2.0 flavours ship as a single .safetensors checkpoint that includes the transformer, video VAE, audio VAE, and vocoder.
For LTX-2.3, SimpleTuner loads the matching Diffusers repo selected by model_flavour (2.3-dev or 2.3-distilled).
Optional: VRAM optimizations¶
If you need more VRAM headroom:
- Musubi block swap: Set musubi_blocks_to_swap (try 4-8) and optionally musubi_block_swap_device (default cpu) to stream the last transformer blocks from CPU. Expect lower throughput but lower peak VRAM.
- VAE patch convolution: Set --vae_enable_patch_conv=true to enable temporal chunking in the LTX-2 VAE; expect a small speed hit but lower peak VRAM.
- VAE temporal roll: Set --vae_enable_temporal_roll=true for more aggressive temporal chunking (larger speed hit).
- VAE tiling: Set --vae_enable_tiling=true to tile VAE encode/decode for large resolutions.
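The CLI flags above map to the equivalent `config.json` keys, as used in the 7900XTX example config on this page. A fragment enabling all of them (treat the block-swap value as a starting point):

```json
{
  "musubi_blocks_to_swap": 4,
  "vae_enable_patch_conv": true,
  "vae_enable_temporal_roll": true,
  "vae_enable_tiling": true
}
```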
Optional: CREPA temporal regularizer¶
To reduce flicker and keep subjects stable across frames:
- In Training → Loss functions, enable CREPA.
- Recommended starting values: Block Index = 8, Weight = 0.5, Adjacent Distance = 1, Temporal Decay = 1.0.
- Keep the default vision encoder (dinov2_vitg14, size 518) unless you need a smaller one (dinov2_vits14 + 224).
- Requires network (or a cached torch hub) to fetch DINOv2 weights the first time.
- Only enable Drop VAE Encoder if you are training entirely from cached latents; otherwise leave it off.
Advanced Experimental Features¶
Show advanced experimental details
SimpleTuner includes experimental features that can significantly improve training stability and performance.

- **[Scheduled Sampling (Rollout)](../experimental/SCHEDULED_SAMPLING.md):** reduces exposure bias and improves output quality by letting the model generate its own inputs during training.

> ⚠️ These features increase the computational overhead of training.

Dataset considerations¶
Video datasets require careful setup. Create `config/multidatabackend.json`:

```json
[
  {
    "id": "my-video-dataset",
    "type": "local",
    "dataset_type": "video",
    "instance_data_dir": "datasets/videos",
    "caption_strategy": "textfile",
    "resolution": 512,
    "video": {
      "num_frames": 61,
      "min_frames": 61,
      "frame_rate": 25,
      "bucket_strategy": "aspect_ratio"
    },
    "repeats": 10
  },
  {
    "id": "text-embeds",
    "type": "local",
    "dataset_type": "text_embeds",
    "default": true,
    "cache_dir": "cache/text/ltxvideo2",
    "disabled": false
  }
]
```
For audio conditioning, the dataset entry can additionally include an `audio` block:

```json
"audio": {
  "auto_split": true,
  "sample_rate": 16000,
  "channels": 1,
  "duration_interval": 3.0,
  "allow_zero_audio": false
}
```
```bash
mkdir -p datasets/videos
# Place .mp4 / .mov files here.
# Place corresponding .txt files with the same filename for captions.
```
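Since the `textfile` caption strategy pairs each clip with a `.txt` of the same stem, a small sketch like this (hypothetical helper, not part of SimpleTuner) can catch missing captions before training starts:

```python
from pathlib import Path

def missing_captions(root: str = "datasets/videos") -> list[str]:
    """List video files that lack a matching .txt caption file."""
    missing = []
    for video in sorted(Path(root).glob("*")):
        if video.suffix.lower() in {".mp4", ".mov"}:
            if not video.with_suffix(".txt").exists():
                missing.append(video.name)
    return missing

print(missing_captions())  # empty list when every clip is captioned
```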
View example config
View 7900XTX config (lowest VRAM use)
```json
{
  "base_model_precision": "int8-quanto",
  "checkpoint_step_interval": 100,
  "data_backend_config": "config/ltx2/multidatabackend.json",
  "disable_benchmark": true,
  "dynamo_mode": "",
  "evaluation_type": "none",
  "hub_model_id": "simpletuner-ltxvideo2-19b-t2v-lora-test",
  "learning_rate": 0.00006,
  "lr_warmup_steps": 50,
  "lycoris_config": "config/lycoris_config.json",
  "max_grad_norm": 0.1,
  "max_train_steps": 200,
  "minimum_image_size": 0,
  "model_family": "ltxvideo2",
  "model_flavour": "dev",
  "model_type": "lora",
  "num_train_epochs": 0,
  "offload_during_startup": true,
  "optimizer": "adamw_bf16",
  "output_dir": "output/examples/ltxvideo2-19b-t2v.peft-lora",
  "override_dataset_config": true,
  "ramtorch": true,
  "ramtorch_text_encoder": true,
  "report_to": "none",
  "resolution": 480,
  "scheduled_sampling_reflexflow": false,
  "seed": 42,
  "skip_file_discovery": "",
  "tracker_project_name": "lora-training",
  "tracker_run_name": "example-training-run",
  "train_batch_size": 2,
  "vae_batch_size": 1,
  "vae_enable_patch_conv": true,
  "vae_enable_slicing": true,
  "vae_enable_temporal_roll": true,
  "vae_enable_tiling": true,
  "validation_disable": true,
  "validation_disable_unconditional": true,
  "validation_guidance": 5,
  "validation_num_inference_steps": 40,
  "validation_num_video_frames": 81,
  "validation_prompt": "🟫 is holding a sign that says hello world from ltxvideo2",
  "validation_resolution": "768x512",
  "validation_seed": 42,
  "validation_using_datasets": false
}
```
Example audio dataset configuration:

```json
[
  {
    "id": "my-audio-dataset",
    "type": "local",
    "dataset_type": "audio",
    "instance_data_dir": "datasets/audio",
    "caption_strategy": "textfile",
    "audio": {
      "sample_rate": 16000,
      "channels": 2,
      "duration_interval": 3.0,
      "truncation_mode": "beginning"
    },
    "repeats": 10
  },
  {
    "id": "text-embeds",
    "type": "local",
    "dataset_type": "text_embeds",
    "default": true,
    "cache_dir": "cache/text/ltxvideo2",
    "disabled": false
  }
]
```