# DCM Distillation Quickstart (SimpleTuner)
In this example, we'll train a 4-step student using DCM distillation from a large flow-matching teacher model such as Wan 2.1 T2V.
DCM supports:
- Semantic mode: standard flow-matching with CFG baked in.
- Fine mode: optional GAN-based adversarial supervision (experimental).
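At a high level, few-step distillation of this kind carves the teacher's long Euler sampling trajectory into a handful of phases, one per student step. The sketch below is only an illustration of that partitioning idea, assuming an evenly divided timestep grid; the `phase_boundaries` helper is hypothetical and not SimpleTuner's internal implementation.

```python
# Sketch: partition a 100-step Euler grid into 4 student phases.
# The even split is an assumption for illustration, not SimpleTuner's
# exact internals.

def phase_boundaries(euler_steps: int, num_phases: int) -> list[int]:
    """Return the grid indices where each student phase starts/ends."""
    return [round(i * euler_steps / num_phases) for i in range(num_phases + 1)]

bounds = phase_boundaries(euler_steps=100, num_phases=4)
print(bounds)  # -> [0, 25, 50, 75, 100]
```

Each student step then learns to jump across one whole phase of the teacher's trajectory, which is why a 4-step student can approximate a 100-step sampler.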
✅ Hardware Requirements¶
| Model | Batch Size | Min VRAM | Notes |
|---|---|---|---|
| Wan 1.3B | 1 | 12 GB | A5000 / 3090+ tier GPU |
| Wan 14B | 1 | 24 GB | Slower; use `--offload_during_startup` |
| Fine mode | 1 | +10% | Discriminator runs per-GPU |
⚠️ Apple silicon Macs are slow and not recommended; expect roughly 10 min/step even in semantic mode.
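The VRAM figures in the table are consistent with a quick back-of-envelope estimate: with `int8-quanto`, base weights cost about one byte per parameter, and LoRA adapters, optimizer states, and activations add overhead on top. The helper below is purely illustrative arithmetic, not a memory profiler.

```python
# Back-of-envelope weight-memory estimate (illustrative, not a guarantee):
# int8 quantization ~= 1 byte/param; real usage adds activations, LoRA
# adapters in bf16, optimizer state, and framework overhead.

def estimate_weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Rough GiB footprint of the base model weights alone."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

base = estimate_weight_gb(1.3, 1.0)  # Wan 1.3B with int8-quanto weights
print(f"~{base:.1f} GiB for quantized base weights")
```

The 1.3B model's quantized weights are only a small fraction of the 12 GB minimum; the rest goes to activations and training state, which is why gradient checkpointing is enabled in the config below.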
## 📦 Installation
Same steps as the Wan guide:
git clone --branch=release https://github.com/bghira/SimpleTuner.git
cd SimpleTuner
python3.13 -m venv .venv && source .venv/bin/activate
# Install with automatic platform detection
pip install -e .
Note: The setup.py automatically detects your platform (CUDA/ROCm/Apple) and installs the appropriate dependencies.
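If activation succeeded, a quick stdlib-only sanity check confirms the venv is on the Python version the guide targets (the `check_python` helper is just for illustration):

```python
# Sanity-check the interpreter the venv activated (stdlib only).
import sys

def check_python(required: tuple = (3, 13)) -> bool:
    """Return True if the running interpreter meets the guide's target."""
    return sys.version_info[:2] >= required

print("python ok:", check_python())
```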
## 📁 Configuration
Edit your config/config.json:
{
  "aspect_bucket_rounding": 2,
  "attention_mechanism": "diffusers",
  "base_model_precision": "int8-quanto",
  "caption_dropout_probability": 0.1,
  "checkpoint_step_interval": 100,
  "checkpoints_total_limit": 5,
  "compress_disk_cache": true,
  "data_backend_config": "config/wan/multidatabackend.json",
  "delete_problematic_images": false,
  "disable_benchmark": false,
  "disable_bucket_pruning": true,
  "distillation_method": "dcm",
  "distillation_config": {
    "mode": "semantic",
    "euler_steps": 100
  },
  "ema_update_interval": 2,
  "ema_validation": "ema_only",
  "flow_schedule_shift": 17,
  "grad_clip_method": "value",
  "gradient_accumulation_steps": 1,
  "gradient_checkpointing": true,
  "hub_model_id": "wan-disney-DCM-distilled",
  "ignore_final_epochs": true,
  "learning_rate": 1e-4,
  "lora_alpha": 128,
  "lora_rank": 128,
  "lora_type": "standard",
  "lr_scheduler": "cosine",
  "lr_warmup_steps": 400000,
  "lycoris_config": "config/wan/lycoris_config.json",
  "max_grad_norm": 0.01,
  "max_train_steps": 400000,
  "minimum_image_size": 0,
  "mixed_precision": "bf16",
  "model_family": "wan",
  "model_type": "lora",
  "num_train_epochs": 0,
  "optimizer": "adamw_bf16",
  "output_dir": "output/wan",
  "pretrained_model_name_or_path": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
  "pretrained_t5_model_name_or_path": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
  "prodigy_steps": 100000,
  "push_checkpoints_to_hub": true,
  "push_to_hub": true,
  "quantize_via": "cpu",
  "report_to": "wandb",
  "resolution": 480,
  "resolution_type": "pixel_area",
  "resume_from_checkpoint": "latest",
  "seed": 42,
  "text_encoder_1_precision": "int8-quanto",
  "tracker_project_name": "lora-training",
  "tracker_run_name": "wan-AdamW-DCM",
  "train_batch_size": 2,
  "use_ema": false,
  "vae_batch_size": 1,
  "validation_guidance": 1.0,
  "validation_negative_prompt": "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走",
  "validation_num_inference_steps": 8,
  "validation_num_video_frames": 16,
  "validation_prompt": "A black and white animated scene unfolds featuring a distressed upright cow with prominent horns and expressive eyes, suspended by its legs from a hook on a static background wall. A smaller Mickey Mouse-like character enters, standing near a wooden bench, initiating interaction between the two. The cow's posture changes as it leans, stretches, and falls, while the mouse watches with a concerned expression, its face a mixture of curiosity and worry, in a world devoid of color.",
  "validation_prompt_library": false,
  "validation_resolution": "832x480",
  "validation_seed": 42,
  "validation_step_interval": 4,
  "webhook_config": "config/wan/webhook.json"
}
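Before committing to a long run, it can help to sanity-check the distillation-relevant keys. This is a small stdlib-only sketch; the `check_dcm_config` helper and the checks it performs are illustrative assumptions, not part of SimpleTuner.

```python
# Sketch: sanity-check distillation-related keys before launching a run.
# check_dcm_config is a hypothetical helper, not a SimpleTuner API.
import json

def check_dcm_config(cfg: dict) -> list:
    problems = []
    if cfg.get("distillation_method") != "dcm":
        problems.append("distillation_method should be 'dcm'")
    mode = cfg.get("distillation_config", {}).get("mode")
    if mode not in ("semantic", "fine"):
        problems.append(f"unknown DCM mode: {mode!r}")
    # Distilled students are sampled without CFG, so guidance stays at 1.0.
    if cfg.get("validation_guidance") != 1.0:
        problems.append("validation_guidance should be 1.0 for DCM students")
    return problems

# Example against a minimal fragment of the config above:
fragment = json.loads("""{
  "distillation_method": "dcm",
  "distillation_config": {"mode": "semantic", "euler_steps": 100},
  "validation_guidance": 1.0
}""")
print(check_dcm_config(fragment) or "config looks consistent")
```

In practice you would `json.load` your actual `config/config.json` instead of the inline fragment.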
### Optional

- For fine mode, change `"mode": "fine"` in `distillation_config`.
- Fine mode is currently experimental in SimpleTuner and requires extra setup steps that are not yet covered in this guide.
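Fine mode's adversarial supervision pits the student against a per-GPU discriminator. As a flavor of what that objective looks like, here is a generic hinge-style GAN loss in plain Python; this is a common formulation in adversarial distillation work generally, and explicitly not SimpleTuner's fine-mode implementation.

```python
# Generic hinge-style GAN losses (illustrative sketch only, not
# SimpleTuner's fine-mode code).

def d_hinge_loss(real_logit: float, fake_logit: float) -> float:
    """Discriminator: push real logits above +1 and fake logits below -1."""
    return max(0.0, 1.0 - real_logit) + max(0.0, 1.0 + fake_logit)

def g_hinge_loss(fake_logit: float) -> float:
    """Generator (student): raise the discriminator's logit on its samples."""
    return -fake_logit

print(d_hinge_loss(0.5, -0.2))  # neither margin satisfied -> positive loss
print(g_hinge_loss(-0.2))
```

The extra discriminator pass is also why the hardware table above budgets roughly +10% VRAM for fine mode.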
## 🎬 Dataset & Dataloader
Reuse the Disney dataset and data_backend_config JSON from the Wan quickstart.
Note: this dataset alone is inadequate for distillation; a much more diverse, higher-volume dataset is required to succeed.
Make sure:
- `num_frames`: 75–81
- `resolution`: 480
- `crop`: false (leave videos uncropped)
- `repeats`: 0 for now
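Frame counts like 77 or 81 are convenient because video VAEs such as Wan's are commonly described as compressing time 4x, encoding `4k + 1` pixel frames into `k + 1` latent frames. Treat the stride here as an assumption for illustration rather than a SimpleTuner guarantee:

```python
# Why 4k + 1 frame counts (77, 81, ...) are convenient, assuming a 4x
# temporal-compression video VAE. Illustrative sketch only.

def latent_frames(pixel_frames: int, temporal_stride: int = 4) -> int:
    """Number of latent frames produced from a pixel-frame count."""
    if (pixel_frames - 1) % temporal_stride != 0:
        raise ValueError("frame count should be stride*k + 1 (e.g. 77, 81)")
    return (pixel_frames - 1) // temporal_stride + 1

print(latent_frames(81))  # 81 pixel frames -> 21 latent frames
```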
## 📌 Notes
- Semantic mode is stable and recommended for most use cases.
- Fine mode can add realism, but it needs more steps and careful tuning, and SimpleTuner's support for it is still immature.
## 🧩 Troubleshooting
| Problem | Fix |
|---|---|
| Results are blurry | Increase `euler_steps`, or increase the multiphase count |
| Validation is degrading | Set `validation_guidance: 1.0` |
| OOM in fine mode | Lower `train_batch_size`, reduce precision levels, or use a larger GPU |
| Fine mode not converging | Prefer semantic mode; fine mode is not yet well-tested in SimpleTuner |