# LayerSync (SimpleTuner)
LayerSync is a “teach yourself” nudge for transformer models: one layer (the student) learns to line up with a stronger layer (the teacher). It is lightweight, self-contained, and requires no extra models to download.
## When to use it
- You are training transformer families that expose hidden states (e.g., Flux/Flux Kontext/Flux.2, PixArt Sigma, SD3/SDXL, Sana, Wan, Qwen Image/Edit, Hunyuan Video, LTXVideo, Kandinsky5 Video, Chroma, ACE-Step, HiDream, Cosmos/LongCat/Z-Image/Auraflow).
- You want a built-in regularizer without shipping an external teacher checkpoint.
- You are seeing mid-training drift or unstable heads and want to pull a mid-layer back toward a deeper teacher.
- You have a bit of VRAM headroom to hold student/teacher activations for the current step.
## Quick setup (WebUI)
- Open Training → Loss functions.
- Enable LayerSync.
- Set Student Block to a mid-layer and Teacher Block to a deeper one. On 24-layer DiT-style models (Flux, PixArt, SD3), start with `8→16`; on shorter stacks, keep the teacher a few blocks deeper than the student.
- Leave Weight at `0.2` (the default when LayerSync is enabled).
- Train normally; logs will include `layersync_loss` and `layersync_similarity`.
## Quick setup (config JSON / CLI)
```json
{
  "layersync_enabled": true,
  "layersync_student_block": 8,
  "layersync_teacher_block": 16,
  "layersync_lambda": 0.2
}
```
## Tuning knobs
- `layersync_student_block` / `layersync_teacher_block`: 1-based-friendly indexing; we try `idx-1` first, then `idx` (see the sketch after this list).
- `layersync_lambda`: scales the cosine loss; must be > 0 when enabled (defaults to `0.2`).
- Teacher defaults to the student block when omitted, making the loss self-similarity.
- VRAM: activations for both layers are kept until the aux loss runs; disable LayerSync (or CREPA) if you need to free memory.
- Plays fine with CREPA/TwinFlow; they share the same hidden-state buffer.
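The index handling is easiest to see as code. Below is a minimal, hypothetical sketch of the “1-based-friendly” lookup against the captured `layer_{idx}` states; the helper name and buffer layout are illustrative, not SimpleTuner's actual internals.

```python
# Hypothetical helper illustrating the "1-based-friendly" lookup; not the
# actual SimpleTuner code. Captured states are keyed as "layer_{idx}".
def resolve_layer(hidden_buffer: dict, idx: int):
    # Try the 1-based interpretation (idx - 1) first, then the raw index.
    for candidate in (idx - 1, idx):
        key = f"layer_{candidate}"
        if key in hidden_buffer:
            return hidden_buffer[key]
    raise ValueError(f"LayerSync could not find layer {idx}")
```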
## How it works (practitioner)
- Computes negative cosine similarity between flattened student and teacher tokens; a higher weight pushes the student toward the teacher’s features (a minimal sketch of the computation follows this list).
- Teacher tokens are always detached so no gradients flow backward into the teacher.
- Handles 3D `(B, S, D)` and 4D `(B, T, P, D)` hidden states for both image and video transformers.
- Upstream option mapping:
  - `--encoder-depth` → `--layersync_student_block`
  - `--gt-encoder-depth` → `--layersync_teacher_block`
  - `--reg-weight` → `--layersync_lambda`
- Defaults: off by default; when enabled and unset, `layersync_lambda=0.2`.
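The loss itself is small enough to sketch. The snippet below illustrates the behaviour described above (flattened tokens, detached teacher, mean cosine similarity) and is not the literal SimpleTuner implementation.

```python
import torch
import torch.nn.functional as F

def layersync_loss(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    """Negative mean cosine similarity between student and teacher tokens."""
    # Video-style (B, T, P, D) states are flattened to (B, T*P, D) token sequences.
    if student.dim() == 4:
        student = student.flatten(1, 2)
        teacher = teacher.flatten(1, 2)
    # Detach the teacher so gradients only flow into the student layer.
    teacher = teacher.detach()
    # cosine_similarity normalizes both sides; negate so minimizing aligns them.
    return -F.cosine_similarity(student, teacher, dim=-1).mean()
```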
## Technical (SimpleTuner internals)

- Implementation: `simpletuner/helpers/training/layersync.py`; invoked from `ModelFoundation._apply_layersync_regularizer`.
- Hidden-state capture: triggered when LayerSync or CREPA requests it; transformers store states as `layer_{idx}` via `_store_hidden_state`.
- Layer resolution: tries 1-based, then 0-based indices; errors if the requested layers are missing.
- Loss path: normalizes student/teacher tokens, computes the mean cosine similarity, logs `layersync_loss` and `layersync_similarity`, and adds the scaled loss to the main objective (an illustrative sketch of this flow follows the list).
- Interaction: runs after CREPA so both can reuse the same buffer; clears the buffer afterward.
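Roughly, the per-step flow ties these pieces together as shown below. This sketch reuses the hypothetical helpers from the earlier snippets; only the config keys and logged metric names come from SimpleTuner, the surrounding structure is illustrative.

```python
def apply_layersync(main_loss, hidden_buffer, cfg, logs):
    # Resolve student/teacher states; the teacher falls back to the student block.
    student = resolve_layer(hidden_buffer, cfg.layersync_student_block)
    teacher = resolve_layer(
        hidden_buffer, cfg.layersync_teacher_block or cfg.layersync_student_block
    )
    aux = layersync_loss(student, teacher)
    logs["layersync_loss"] = aux.item()
    logs["layersync_similarity"] = -aux.item()  # the loss is the negated similarity
    hidden_buffer.clear()  # release stored activations once the aux loss has run
    return main_loss + cfg.layersync_lambda * aux
```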
## Common pitfalls

- Missing student block → startup error; set `layersync_student_block` explicitly.
- Weight ≤ 0 → startup error; keep the default `0.2` if unsure.
- Requesting blocks deeper than the model exposes → “LayerSync could not find layer” errors; lower the indices.
- Enabling on models that do not expose transformer hidden states (Kolors, Lumina2, Stable Cascade C, Kandinsky5 Image, OmniGen) will fail; stick to transformer-backed families.
- VRAM spikes: lower the block indices or disable CREPA/LayerSync to free the hidden-state buffer.
Use LayerSync when you want a cheap, built-in regularizer to gently steer intermediate representations without adding external teachers.