Skip to content

LongCat‑Image Edit Quickstart

This is the edit/img2img variant of LongCat‑Image. Read LONGCAT_IMAGE.md first; this file only lists what changes for the edit flavour.


1) Model differences vs base LongCat‑Image

Base (text2img) Edit
Flavour final / dev edit
Conditioning none requires conditioning latents (reference image)
Text encoder Qwen‑2.5‑VL Qwen‑2.5‑VL with vision context (prompt encoding needs ref image)
Pipeline TEXT2IMG IMG2IMG/EDIT
Validation inputs prompt only prompt and reference

2) Config changes (CLI/WebUI)

{
  "model_type": "lora",
  "model_family": "longcat_image",
  "model_flavour": "edit",
  "base_model_precision": "int8-quanto",      // fp8-torchao also fine; helps fit 16–24 GB
  "train_batch_size": 1,
  "gradient_checkpointing": true,
  "learning_rate": 5e-5,
  "validation_guidance": 4.5,
  "validation_num_inference_steps": 40,
  "validation_resolution": "768x768"
}

Keep aspect_bucket_alignment at 64. Do not disable conditioning latents; the edit pipeline expects them.

Fast config creation:

cp config/config.json.example config/config.json
Then set model_family, model_flavour, dataset paths, and output_dir.


3) Dataloader: paired edit + reference

Use two aligned datasets: edit images (caption = edit instruction) and reference images. The edit dataset’s conditioning_data must point to the reference dataset ID. Filenames must match 1‑to‑1.

[
  {
    "id": "edit-images",
    "type": "local",
    "instance_data_dir": "/data/edits",
    "caption_strategy": "textfile",
    "resolution": 768,
    "cache_dir_vae": "/cache/vae/longcat/edit",
    "conditioning_data": ["ref-images"]
  },
  {
    "id": "ref-images",
    "type": "local",
    "instance_data_dir": "/data/refs",
    "caption_strategy": null,
    "resolution": 768,
    "cache_dir_vae": "/cache/vae/longcat/ref"
  }
]

See caption_strategy options and requirements in DATALOADER.md.

Notes: - Aspect buckets: keep on the 64px grid. - Reference captions are optional; if present they replace edit captions (usually undesired). - VAE caches for edit and reference should be separate paths. - If you see cache misses or shape errors, clear the VAE caches for both datasets and regenerate.


4) Validation specifics

  • Validation needs reference images to produce conditioning latents. Point the validation split of edit-images to ref-images via conditioning_data.
  • Guidance: 4–6 works well; keep negative prompt empty.
  • Preview callbacks are supported; latents are unpacked for decoders automatically.
  • If validation fails due to missing conditioning latents, check that the validation dataloader includes both edit and reference entries with matching filenames.

5) Inference / validation commands

Quick CLI validation:

simpletuner validate \
  --model_family longcat_image \
  --model_flavour edit \
  --validation_resolution 768x768 \
  --validation_guidance 4.5 \
  --validation_num_inference_steps 40

WebUI: choose the Edit pipeline, supply both the source image and the edit instruction.


6) Training start (CLI)

After config and dataloader are set:

simpletuner train --config config/config.json
Ensure the reference dataset is present during training so conditioning latents can be computed or loaded from cache.


7) Troubleshooting

  • Missing conditioning latents: ensure the reference dataset is wired via conditioning_data and filenames match.
  • MPS dtype errors: the pipeline auto‑downgrades pos‑ids to float32 on MPS; keep the rest at float32/bf16.
  • Channel mismatch in previews: previews un‑patchify latents before decoding (keep this SimpleTuner version).
  • OOM during edit: lower validation resolution/steps, reduce lora_rank, enable group offload, and prefer int8-quanto/fp8-torchao.