# DCM Distillation Quickstart (SimpleTuner)
In this example, we'll train a 4-step student using DCM distillation from a large flow-matching teacher model such as Wan 2.1 T2V.
DCM supports:
- Semantic mode: standard flow-matching with CFG baked in.
- Fine mode: optional GAN-based adversarial supervision (experimental).
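At a high level, few-step distillation of this kind carves the teacher's long Euler sampling trajectory into a handful of phases, one per student step. The sketch below is only an illustration of that partitioning idea, assuming an evenly divided timestep grid; the `phase_boundaries` helper is hypothetical and not SimpleTuner's internal implementation.

```python
# Sketch: partition a 100-step Euler grid into 4 student phases.
# The even split is an assumption for illustration, not SimpleTuner's
# exact internals.

def phase_boundaries(euler_steps: int, num_phases: int) -> list[int]:
    """Return the grid indices where each student phase starts/ends."""
    return [round(i * euler_steps / num_phases) for i in range(num_phases + 1)]

bounds = phase_boundaries(euler_steps=100, num_phases=4)
print(bounds)  # -> [0, 25, 50, 75, 100]
```

Each student step then learns to jump across one whole phase of the teacher's trajectory, which is why a 4-step student can approximate a 100-step sampler.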
✅ Hardware Requirements¶
| Model | Batch Size | Min VRAM | Notes |
|---|---|---|---|
| Wan 1.3B | 1 | 12 GB | A5000 / 3090+ tier GPU |
| Wan 14B | 1 | 24 GB | Slower; use `--offload_during_startup` |
| Fine mode | 1 | +10% | Discriminator runs per-GPU |
⚠️ Apple silicon Macs are slow and not recommended; expect roughly 10 min/step even in semantic mode.
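The VRAM figures in the table are consistent with a quick back-of-envelope estimate: with `int8-quanto`, base weights cost about one byte per parameter, and LoRA adapters, optimizer states, and activations add overhead on top. The helper below is purely illustrative arithmetic, not a memory profiler.

```python
# Back-of-envelope weight-memory estimate (illustrative, not a guarantee):
# int8 quantization ~= 1 byte/param; real usage adds activations, LoRA
# adapters in bf16, optimizer state, and framework overhead.

def estimate_weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Rough GiB footprint of the base model weights alone."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

base = estimate_weight_gb(1.3, 1.0)  # Wan 1.3B with int8-quanto weights
print(f"~{base:.1f} GiB for quantized base weights")
```

The 1.3B model's quantized weights are only a small fraction of the 12 GB minimum; the rest goes to activations and training state, which is why gradient checkpointing is enabled in the config below.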
## 📦 Installation
Same steps as the Wan guide:
git clone --branch=release https://github.com/bghira/SimpleTuner.git
cd SimpleTuner
python3.13 -m venv .venv && source .venv/bin/activate
# Install with automatic platform detection
pip install -e .
Note: The setup.py automatically detects your platform (CUDA/ROCm/Apple) and installs the appropriate dependencies.
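If activation succeeded, a quick stdlib-only sanity check confirms the venv is on the Python version the guide targets (the `check_python` helper is just for illustration):

```python
# Sanity-check the interpreter the venv activated (stdlib only).
import sys

def check_python(required: tuple = (3, 13)) -> bool:
    """Return True if the running interpreter meets the guide's target."""
    return sys.version_info[:2] >= required

print("python ok:", check_python())
```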
## 📁 Configuration
Edit your config/config.json:
{
  "aspect_bucket_rounding": 2,
  "attention_mechanism": "diffusers",
  "base_model_precision": "int8-quanto",
  "caption_dropout_probability": 0.1,
  "checkpoint_step_interval": 100,
  "checkpoints_total_limit": 5,
  "compress_disk_cache": true,
  "data_backend_config": "config/wan/multidatabackend.json",
  "delete_problematic_images": false,
  "disable_benchmark": false,
  "disable_bucket_pruning": true,
  "distillation_method": "dcm",
  "distillation_config": {
    "mode": "semantic",
    "euler_steps": 100
  },
  "ema_update_interval": 2,
  "ema_validation": "ema_only",
  "flow_schedule_shift": 17,
  "grad_clip_method": "value",
  "gradient_accumulation_steps": 1,
  "gradient_checkpointing": true,
  "hub_model_id": "wan-disney-DCM-distilled",
  "ignore_final_epochs": true,
  "learning_rate": 1e-4,
  "lora_alpha": 128,
  "lora_rank": 128,
  "lora_type": "standard",
  "lr_scheduler": "cosine",
  "lr_warmup_steps": 400000,
  "lycoris_config": "config/wan/lycoris_config.json",
  "max_grad_norm": 0.01,
  "max_train_steps": 400000,
  "minimum_image_size": 0,
  "mixed_precision": "bf16",
  "model_family": "wan",
  "model_type": "lora",
  "num_train_epochs": 0,
  "optimizer": "adamw_bf16",
  "output_dir": "output/wan",
  "pretrained_model_name_or_path": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
  "pretrained_t5_model_name_or_path": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
  "prodigy_steps": 100000,
  "push_checkpoints_to_hub": true,
  "push_to_hub": true,
  "quantize_via": "cpu",
  "report_to": "wandb",
  "resolution": 480,
  "resolution_type": "pixel_area",
  "resume_from_checkpoint": "latest",
  "seed": 42,
  "text_encoder_1_precision": "int8-quanto",
  "tracker_project_name": "lora-training",
  "tracker_run_name": "wan-AdamW-DCM",
  "train_batch_size": 2,
  "use_ema": false,
  "vae_batch_size": 1,
  "validation_guidance": 1.0,
  "validation_negative_prompt": "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走",
  "validation_num_inference_steps": 8,
  "validation_num_video_frames": 16,
  "validation_prompt": "A black and white animated scene unfolds featuring a distressed upright cow with prominent horns and expressive eyes, suspended by its legs from a hook on a static background wall. A smaller Mickey Mouse-like character enters, standing near a wooden bench, initiating interaction between the two. The cow's posture changes as it leans, stretches, and falls, while the mouse watches with a concerned expression, its face a mixture of curiosity and worry, in a world devoid of color.",
  "validation_prompt_library": false,
  "validation_resolution": "832x480",
  "validation_seed": 42,
  "validation_step_interval": 4,
  "webhook_config": "config/wan/webhook.json"
}
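Before committing to a long run, it can help to sanity-check the distillation-relevant keys. This is a small stdlib-only sketch; the `check_dcm_config` helper and the checks it performs are illustrative assumptions, not part of SimpleTuner.

```python
# Sketch: sanity-check distillation-related keys before launching a run.
# check_dcm_config is a hypothetical helper, not a SimpleTuner API.
import json

def check_dcm_config(cfg: dict) -> list:
    problems = []
    if cfg.get("distillation_method") != "dcm":
        problems.append("distillation_method should be 'dcm'")
    mode = cfg.get("distillation_config", {}).get("mode")
    if mode not in ("semantic", "fine"):
        problems.append(f"unknown DCM mode: {mode!r}")
    # Distilled students are sampled without CFG, so guidance stays at 1.0.
    if cfg.get("validation_guidance") != 1.0:
        problems.append("validation_guidance should be 1.0 for DCM students")
    return problems

# Example against a minimal fragment of the config above:
fragment = json.loads("""{
  "distillation_method": "dcm",
  "distillation_config": {"mode": "semantic", "euler_steps": 100},
  "validation_guidance": 1.0
}""")
print(check_dcm_config(fragment) or "config looks consistent")
```

In practice you would `json.load` your actual `config/config.json` instead of the inline fragment.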
### Optional

- For fine mode, change `"mode": "fine"` in `distillation_config`.
- Fine mode is currently experimental in SimpleTuner and requires extra setup steps that are not yet covered in this guide.
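Fine mode's adversarial supervision pits the student against a per-GPU discriminator. As a flavor of what that objective looks like, here is a generic hinge-style GAN loss in plain Python; this is a common formulation in adversarial distillation work generally, and explicitly not SimpleTuner's fine-mode implementation.

```python
# Generic hinge-style GAN losses (illustrative sketch only, not
# SimpleTuner's fine-mode code).

def d_hinge_loss(real_logit: float, fake_logit: float) -> float:
    """Discriminator: push real logits above +1 and fake logits below -1."""
    return max(0.0, 1.0 - real_logit) + max(0.0, 1.0 + fake_logit)

def g_hinge_loss(fake_logit: float) -> float:
    """Generator (student): raise the discriminator's logit on its samples."""
    return -fake_logit

print(d_hinge_loss(0.5, -0.2))  # neither margin satisfied -> positive loss
print(g_hinge_loss(-0.2))
```

The extra discriminator pass is also why the hardware table above budgets roughly +10% VRAM for fine mode.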
## 🎬 Dataset & Dataloader
Reuse the Disney dataset and data_backend_config JSON from the Wan quickstart.
Note: this dataset alone is inadequate for distillation; a much more diverse, higher-volume dataset is required to succeed.
Make sure:
- `num_frames`: 75–81
- `resolution`: 480
- `crop`: false (leave videos uncropped)
- `repeats`: 0 for now
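Frame counts like 77 or 81 are convenient because video VAEs such as Wan's are commonly described as compressing time 4x, encoding `4k + 1` pixel frames into `k + 1` latent frames. Treat the stride here as an assumption for illustration rather than a SimpleTuner guarantee:

```python
# Why 4k + 1 frame counts (77, 81, ...) are convenient, assuming a 4x
# temporal-compression video VAE. Illustrative sketch only.

def latent_frames(pixel_frames: int, temporal_stride: int = 4) -> int:
    """Number of latent frames produced from a pixel-frame count."""
    if (pixel_frames - 1) % temporal_stride != 0:
        raise ValueError("frame count should be stride*k + 1 (e.g. 77, 81)")
    return (pixel_frames - 1) // temporal_stride + 1

print(latent_frames(81))  # 81 pixel frames -> 21 latent frames
```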
## 📌 Notes
- Semantic mode is stable and recommended for most use cases.
- Fine mode can add realism, but it needs more steps and careful tuning, and SimpleTuner's support for it is still immature.
## 🧩 Troubleshooting
| Problem | Fix |
|---|---|
| Results are blurry | Increase `euler_steps`, or increase the multiphase count |
| Validation is degrading | Set `validation_guidance: 1.0` |
| OOM in fine mode | Lower `train_batch_size`, reduce precision levels, or use a larger GPU |
| Fine mode not converging | Prefer semantic mode; fine mode is not yet well-tested in SimpleTuner |