
LongCat‑Video Edit (Image‑to‑Video) Quickstart

This guide walks you through training and validating the image‑to‑video workflow for LongCat‑Video. You don’t need to flip flavours; the same final checkpoint covers both text‑to‑video and image‑to‑video. The difference comes from your datasets and validation settings.


1) Model differences vs base LongCat‑Video

|                    | Base (text2video)              | Edit / I2V                                              |
|--------------------|--------------------------------|---------------------------------------------------------|
| Flavour            | final                          | final (same weights)                                    |
| Conditioning       | none                           | requires conditioning frame (first latent kept fixed)   |
| Text encoder       | Qwen‑2.5‑VL                    | Qwen‑2.5‑VL (same)                                      |
| Pipeline           | TEXT2IMG                       | IMG2VIDEO                                               |
| Validation inputs  | prompt only                    | prompt and conditioning image                           |
| Buckets / stride   | 64px buckets, (frames-1)%4==0  | same                                                    |

Core defaults you inherit:
  • Flow matching with shift 12.0.
  • Aspect buckets enforced at 64px.
  • Qwen‑2.5‑VL text encoder; empty negatives auto‑added when CFG is on.
  • Default frames: 93 (satisfies (frames - 1) % 4 == 0).
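
The 12.0 shift refers to the flow‑matching timestep schedule. As a rough illustration of what such a shift does (this is the generic shifted‑sigma formulation, not LongCat‑Video's exact scheduler code), larger values concentrate the schedule near the high‑noise end:

import numpy as np

def shift_sigmas(sigmas: np.ndarray, shift: float = 12.0) -> np.ndarray:
    # Generic flow-matching sigma shift (illustrative only): values stay in
    # (0, 1], but they crowd toward 1.0 (high noise) as `shift` grows.
    return shift * sigmas / (1.0 + (shift - 1.0) * sigmas)

base = np.linspace(1.0, 1.0 / 40, 40)   # a plain 40-step schedule
print(shift_sigmas(base)[:3])           # early steps land very close to 1.0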


2) Config changes (CLI/WebUI)

{
  "model_family": "longcat_video",
  "model_flavour": "final",
  "model_type": "lora",
  "train_batch_size": 1,
  "gradient_checkpointing": true,
  "lora_rank": 8,
  "learning_rate": 1e-4,
  "validation_resolution": "480x832",
  "validation_num_video_frames": 93,
  "validation_num_inference_steps": 40,
  "validation_guidance": 4.0,
  "validation_using_datasets": true,
  "eval_dataset_id": "longcat-video-val"
}

Keep aspect_bucket_alignment at 64. The first latent frame holds the start image; leave it intact. Stick with 93 frames (already matches the VAE stride rule (frames - 1) % 4 == 0) unless you have a strong reason to change it.
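
If you script your preprocessing, the geometry rules above are easy to check up front. A minimal sketch (an illustrative helper, not part of SimpleTuner):

def check_longcat_geometry(width: int, height: int, frames: int) -> list[str]:
    # Checks the constraints quoted above: height/width divisible by 16 and
    # the VAE stride rule (frames - 1) % 4 == 0.
    problems = []
    if width % 16 or height % 16:
        problems.append("height and width must be divisible by 16")
    if (frames - 1) % 4 != 0:
        problems.append("frame count must satisfy (frames - 1) % 4 == 0")
    return problems

print(check_longcat_geometry(832, 480, 93))  # [] -> the defaults in this guide pass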

Quick setup:

cp config/config.json.example config/config.json
Fill in model_family, model_flavour, output_dir, data_backend_config, and eval_dataset_id. Leave the defaults above unless you know you need different values.
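
If you want to catch an incomplete config before launching, a short check like this can help (an illustrative script, not a SimpleTuner command; adjust the path if your config lives elsewhere):

import json

REQUIRED = ["model_family", "model_flavour", "output_dir",
            "data_backend_config", "eval_dataset_id"]

with open("config/config.json") as f:
    cfg = json.load(f)

missing = [key for key in REQUIRED if not cfg.get(key)]
if missing:
    raise SystemExit(f"Fill these fields in config/config.json: {missing}")
print("config looks complete")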

CUDA attention options:
  • On CUDA, LongCat‑Video automatically prefers the bundled block‑sparse Triton kernel when present and falls back to the standard dispatcher otherwise. No manual toggle is required.
  • To force xFormers instead, set attention_implementation: "xformers" in your config/CLI.
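
If you are unsure which backends exist in your environment, a quick probe like the following can tell you before you force a setting (illustrative only; the trainer performs its own dispatching):

import importlib.util

for name in ("triton", "xformers"):
    found = importlib.util.find_spec(name) is not None
    print(f"{name}: {'available' if found else 'not installed'}")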


3) Dataloader: pair clips with start frames

  • Create two datasets:
      • Clips: the target videos + captions (edit instructions). Mark them is_i2v: true and set conditioning_data to the start-frame dataset ID.
      • Start frames: one image per clip, same filenames, no captions.
  • Keep both on the 64px grid (e.g., 480x832). Height/width must be divisible by 16. Frame counts must meet (frames - 1) % 4 == 0; 93 is already valid.
  • Use separate VAE caches for clips vs start frames.

Example multidatabackend.json:

[
  {
    "id": "longcat-video-train",
    "type": "local",
    "dataset_type": "video",
    "is_i2v": true,
    "instance_data_dir": "/data/video-clips",
    "caption_strategy": "textfile",
    "resolution": 480,
    "cache_dir_vae": "/cache/vae/longcat/video",
    "conditioning_data": ["longcat-video-cond"]
  },
  {
    "id": "longcat-video-cond",
    "type": "local",
    "dataset_type": "conditioning",
    "instance_data_dir": "/data/video-start-frames",
    "caption_strategy": null,
    "resolution": 480,
    "cache_dir_vae": "/cache/vae/longcat/video-cond"
  }
]

See caption_strategy options and requirements in DATALOADER.md.
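
Because start frames are matched to clips by filename, it is worth verifying the pairing before VAE caching runs. A minimal sketch using the directories from the example above (the file extensions are assumptions; adjust to your data):

from pathlib import Path

clip_stems = {p.stem for p in Path("/data/video-clips").iterdir()
              if p.suffix.lower() in {".mp4", ".mov", ".webm"}}
frame_stems = {p.stem for p in Path("/data/video-start-frames").iterdir()
               if p.suffix.lower() in {".png", ".jpg", ".jpeg"}}

print("clips missing a start frame:", sorted(clip_stems - frame_stems) or "none")
print("start frames without a clip:", sorted(frame_stems - clip_stems) or "none")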


4) Validation specifics

  • Add a small validation split with the same paired structure as training. Set validation_using_datasets: true and point eval_dataset_id to that split (e.g., longcat-video-val) so validation pulls the start frame automatically.
  • WebUI previews: start simpletuner server, choose LongCat‑Video edit, and upload the start frame + prompt.
  • Guidance: 3.5–5.0 works; empty negatives are auto‑filled when CFG is on.
  • For low‑VRAM previews or training, set musubi_blocks_to_swap (start with 4–8) and optionally musubi_block_swap_device to stream the last transformer blocks from CPU during forward/backward. It trades some throughput for lower peak VRAM.
  • The conditioning frame stays fixed during sampling; only later frames denoise.
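
Conceptually, the sampler re‑injects the encoded start image into the first latent frame at every step, so it cannot drift while the remaining frames denoise. A heavily simplified sketch of that idea (not the pipeline's actual code; denoise_step and the latent layout here are stand‑ins):

import torch

def sample_i2v(latents: torch.Tensor, cond_latent: torch.Tensor, sigmas, denoise_step):
    # latents: (batch, channels, frames, height, width); cond_latent is the
    # encoded start image for the first latent frame. Illustrative only.
    for sigma in sigmas:
        latents[:, :, :1] = cond_latent          # hold the conditioning frame fixed
        latents = denoise_step(latents, sigma)   # hypothetical single denoising step
    latents[:, :, :1] = cond_latent              # re-assert after the final step
    return latents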

5) Training start (CLI)

After config and dataloader are set:

simpletuner train --config config/config.json
Ensure conditioning frames are present in the training data so the pipeline can build conditioning latents.


6) Troubleshooting

  • Missing conditioning image: provide a conditioning dataset via conditioning_data with matching filenames; set eval_dataset_id to your validation split ID.
  • Height/width errors: keep dimensions divisible by 16 and on the 64px grid.
  • First frame drifts: lower guidance (3.5–4.0) or reduce steps.
  • OOM: lower validation resolution/frames, reduce lora_rank, enable group offload, or use int8-quanto/fp8-torchao.