DMD Distillation Quickstart (SimpleTuner)¶
In this example, we'll be training a 3-step student using DMD (Distribution Matching Distillation) from a large flow-matching teacher model like Wan 2.1 T2V.
DMD features:
- Generator (Student): Learns to match teacher in fewer steps
- Fake Score Transformer: Discriminates between teacher and student outputs
- Multi-step simulation: Optional train-inference consistency mode
✅ Hardware Requirements¶
⚠️ DMD is memory-intensive due to the fake score transformer which requires a complete second copy of the base model be retained in-memory.
It's recommended to attempt LCM or DCM distillation methods for the 14B Wan model instead of DMD, if you do not have the needed VRAM.
An NVIDIA B200 may be required when distilling the 14B model without sparse attention support.
Using LoRA student training can reduce the requirements substantially, but still quite hefty.
📦 Installation¶
git clone --branch=release https://github.com/bghira/SimpleTuner.git
cd SimpleTuner
python3.13 -m venv .venv && source .venv/bin/activate
# Install with automatic platform detection
pip install -e .
Note: The setup.py automatically detects your platform (CUDA/ROCm/Apple) and installs the appropriate dependencies.
📁 Configuration¶
Edit your config/config.json:
{
"aspect_bucket_rounding": 2,
"attention_mechanism": "diffusers",
"base_model_precision": "int8-quanto",
"caption_dropout_probability": 0.1,
"checkpoint_step_interval": 200,
"checkpoints_total_limit": 3,
"compress_disk_cache": true,
"data_backend_config": "config/wan/multidatabackend.json",
"delete_problematic_images": false,
"disable_benchmark": false,
"disable_bucket_pruning": true,
"distillation_method": "dmd",
"distillation_config": {
"dmd_denoising_steps": "1000,757,522",
"generator_update_interval": 1,
"real_score_guidance_scale": 3.0,
"fake_score_lr": 1e-5,
"fake_score_weight_decay": 0.01,
"fake_score_betas": [0.9, 0.999],
"fake_score_eps": 1e-8,
"fake_score_grad_clip": 1.0,
"fake_score_guidance_scale": 0.0,
"min_timestep_ratio": 0.02,
"max_timestep_ratio": 0.98,
"num_frame_per_block": 3,
"independent_first_frame": false,
"same_step_across_blocks": false,
"last_step_only": false,
"num_training_frames": 21,
"context_noise": 0,
"ts_schedule": true,
"ts_schedule_max": false,
"min_score_timestep": 0,
"timestep_shift": 1.0
},
"ema_update_interval": 5,
"ema_validation": "ema_only",
"flow_schedule_shift": 5,
"grad_clip_method": "value",
"gradient_accumulation_steps": 1,
"gradient_checkpointing": true,
"hub_model_id": "wan-disney-DMD-3step",
"ignore_final_epochs": true,
"learning_rate": 2e-5,
"lora_alpha": 128,
"lora_rank": 128,
"lora_type": "standard",
"lr_scheduler": "cosine_with_min_lr",
"lr_warmup_steps": 100,
"max_grad_norm": 1.0,
"max_train_steps": 4000,
"minimum_image_size": 0,
"mixed_precision": "bf16",
"model_family": "wan",
"model_type": "lora",
"num_train_epochs": 0,
"optimizer": "adamw_bf16",
"output_dir": "output/wan-dmd",
"pretrained_model_name_or_path": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
"pretrained_t5_model_name_or_path": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
"push_checkpoints_to_hub": true,
"push_to_hub": true,
"quantize_via": "cpu",
"report_to": "wandb",
"resolution": 480,
"resolution_type": "pixel_area",
"resume_from_checkpoint": "latest",
"seed": 1000,
"text_encoder_1_precision": "int8-quanto",
"tracker_project_name": "dmd-training",
"tracker_run_name": "wan-DMD-3step",
"train_batch_size": 1,
"use_ema": true,
"vae_batch_size": 1,
"validation_guidance": 1.0,
"validation_negative_prompt": "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走",
"validation_num_inference_steps": 3,
"validation_num_video_frames": 121,
"validation_prompt": "A black and white animated scene unfolds featuring a distressed upright cow with prominent horns and expressive eyes, suspended by its legs from a hook on a static background wall. A smaller Mickey Mouse-like character enters, standing near a wooden bench, initiating interaction between the two. The cow's posture changes as it leans, stretches, and falls, while the mouse watches with a concerned expression, its face a mixture of curiosity and worry, in a world devoid of color.",
"validation_prompt_library": "config/wan/validation_prompts_dmd.json",
"validation_resolution": "1280x704",
"validation_seed": 42,
"validation_step_interval": 200,
"webhook_config": "config/wan/webhook.json"
}
Key DMD Settings:¶
dmd_denoising_steps– Target timesteps for the backward simulation (default:1000,757,522for a 3-step student).generator_update_interval– Run the expensive generator replay every N trainer steps. Increase to trade sample quality for speed.fake_score_lr/fake_score_weight_decay/fake_score_betas– Optimiser hyperparameters for the fake score transformer.fake_score_guidance_scale– Optional classifier-free guidance on the fake score network (defaults to off).num_frame_per_block,same_step_across_blocks,last_step_only– Control how temporal blocks are scheduled during self-forcing rollout.num_training_frames– Maximum frames generated during the backward simulation (larger values improve fidelity at a memory cost).min_timestep_ratio,max_timestep_ratio,timestep_shift– Shape the KL sampling window. Match these with your teacher’s flow schedule if you deviate from defaults.
🎬 Dataset & Dataloader¶
For DMD to work well, you need diverse, high-quality data:
{
"dataset_type": "video",
"cache_dir": "cache/wan-dmd",
"resolution_type": "pixel_area",
"crop": false,
"num_frames": 121,
"frame_interval": 1,
"resolution": 480,
"minimum_image_size": 0,
"repeats": 0
}
Note: The Disney dataset is inadequate for DMD. DON'T use it! It's provided merely for illustrative purposes.
You need:
- High volume (10k+ videos minimum)
- Diverse content (different styles, motions, subjects)
- High quality (no compression artifacts)
These may be generated from the parent model.
🚀 Training Tips¶
- Keep generator interval small: Start with
"generator_update_interval": 1. Increase only if you need throughput and can tolerate noisier updates. - Monitor both losses: Watch
dmd_lossandfake_score_lossin wandb - Validation frequency: DMD converges quickly, validate often
- Memory management:
- Use
gradient_checkpointing - Lower
train_batch_sizeto 1 - Consider
base_model_precision: "int8-quanto"
📌 DMD vs DCM¶
| Feature | DCM | DMD |
|---|---|---|
| Memory usage | Lower | Higher (fake score model) |
| Training time | Longer | Shorter (4k steps typical) |
| Quality | Good | Excellent |
| Inference steps | 4-8+ | 3-8 |
| Stability | Stable | Requires tuning |
🧩 Troubleshooting¶
| Problem | Fix |
|---|---|
| OOM errors | Reduce num_training_frames, drop generator_update_interval, or lower batch size |
| Fake score not learning | Increase fake_score_lr or use different scheduler |
| Generator overfitting | Increase generator_update_interval to 10 |
| Poor 3-step quality | Try "1000,500" for 2-step first |
| Training unstable | Lower learning rates, check data quality |
🔬 Advanced Options¶
For brave souls wanting to experiment:
"distillation_config": {
"dmd_denoising_steps": "1000,666,333",
"generator_update_interval": 4,
"fake_score_guidance_scale": 1.2,
"num_training_frames": 28,
"same_step_across_blocks": true,
"timestep_shift": 7.0
}
⚠️ It's recommended to use the original FastVideo implementation of DMD for resource-constrained projects as it supports sequence-parallel and video-sparse attention (VSA) for far more efficient runtime usage.