FLUX.2 Quickstart¶
This guide covers training LoRAs on FLUX.2, Black Forest Labs' latest image generation model family.
Note: The default model flavour is `klein-9b`, but this guide focuses on `dev` (the full 12B transformer with the 24B Mistral-3 text encoder) since it has the highest resource requirements. Klein models are easier to run - see Model Variants below.
Model Variants¶
FLUX.2 comes in three variants:
| Variant | Transformer | Text Encoder | Total Blocks | Default |
|---|---|---|---|---|
| `dev` | 12B params | Mistral-3 (24B) | 56 (8+48) | |
| `klein-9b` | 9B params | Qwen3 (bundled) | 32 (8+24) | ✓ |
| `klein-4b` | 4B params | Qwen3 (bundled) | 25 (5+20) | |
Key differences:

- **dev**: uses the standalone Mistral-Small-3.1-24B text encoder and has guidance embeddings
- **klein models**: use a Qwen3 text encoder bundled in the model repo and have no guidance embeddings (guidance training options are ignored)
To select a variant, set model_flavour in your config:
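For example, to train against the 9B Klein variant (flavour names as listed in the table above, key names as in the example config later in this guide):

```json
{
  "model_family": "flux2",
  "model_flavour": "klein-9b"
}
```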
Important: For `klein-4b` and `klein-9b`, leave `pretrained_text_encoder_model_name_or_path` unset unless you intentionally want to replace the bundled Qwen3 text encoder. Setting that field overrides the Klein default and can trigger downloads of a different text encoder.
Model Overview¶
FLUX.2-dev introduces significant architectural changes from FLUX.1:
- Text Encoder: Mistral-Small-3.1-24B (dev) or Qwen3 (klein)
- Architecture: 8 DoubleStreamBlocks + 48 SingleStreamBlocks (dev)
- Latent Channels: 32 VAE channels → 128 after pixel shuffle (vs 16 in FLUX.1)
- VAE: Custom VAE with batch normalization and pixel shuffling
- Embedding Dimension: 15,360 for dev (3×5,120), 12,288 for klein-9b (3×4,096), 7,680 for klein-4b (3×2,560)
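The channel and embedding-dimension figures above are simple products; a quick sanity check (hidden sizes taken from the list above):

```python
# 2x2 pixel shuffle folds spatial positions into channels: 32 * 2 * 2 = 128
latent_channels = 32
print(latent_channels * 2 * 2)  # 128

# Embedding dimension is 3x the per-variant hidden size
dims = {"dev": 3 * 5120, "klein-9b": 3 * 4096, "klein-4b": 3 * 2560}
print(dims["dev"], dims["klein-9b"], dims["klein-4b"])  # 15360 12288 7680
```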
Hardware Requirements¶
Hardware requirements vary significantly by model variant.
Klein Models (Recommended for Most Users)¶
Klein models are much more accessible:
| Variant | bf16 VRAM | int8 VRAM | System RAM |
|---|---|---|---|
| `klein-4b` | ~12GB | ~8GB | 32GB+ |
| `klein-9b` | ~22GB | ~14GB | 64GB+ |
- Recommended for klein-9b: single 24GB GPU (RTX 3090/4090, A5000)
- Recommended for klein-4b: single 16GB GPU (RTX 4080, A4000)
FLUX.2-dev (Advanced)¶
FLUX.2-dev has significant resource requirements due to the Mistral-3 text encoder:
VRAM Requirements¶
The 24B Mistral text encoder alone requires significant VRAM:
| Component | bf16 | int8 | int4 |
|---|---|---|---|
| Mistral-3 (24B) | ~48GB | ~24GB | ~12GB |
| FLUX.2 Transformer | ~24GB | ~12GB | ~6GB |
| VAE + overhead | ~4GB | ~4GB | ~4GB |
| Configuration | Approximate Total VRAM |
|---|---|
| bf16 everything | ~76GB+ |
| int8 text encoder + bf16 transformer | ~52GB |
| int8 everything | ~40GB |
| int4 text encoder + int8 transformer | ~22GB |
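These totals are back-of-the-envelope estimates: parameter count × bytes per weight, ignoring activations and optimizer state. For example:

```python
def weight_gb(params_billion, bits):
    """Rough weight-memory estimate in GB (decimal), ignoring activations/optimizer."""
    return params_billion * 1e9 * bits / 8 / 1e9

# Mistral-3 text encoder (24B) at bf16, int8, int4
print(round(weight_gb(24, 16)), round(weight_gb(24, 8)), round(weight_gb(24, 4)))  # 48 24 12
# FLUX.2 transformer (12B) at bf16, int8, int4
print(round(weight_gb(12, 16)), round(weight_gb(12, 8)), round(weight_gb(12, 4)))  # 24 12 6
```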
System RAM¶
- Minimum: 96GB system RAM (loading 24B text encoder requires substantial memory)
- Recommended: 128GB+ for comfortable operation
Recommended Hardware¶
- Minimum: 2x 48GB GPUs (A6000, L40S) with FSDP2 or DeepSpeed
- Recommended: 4x H100 80GB with fp8-torchao
- With heavy quantization (int4): 2x 24GB GPUs may work but is experimental
Multi-GPU distributed training (FSDP2 or DeepSpeed) is essentially required for FLUX.2-dev due to the combined size of the Mistral-3 text encoder and transformer.
Prerequisites¶
Python Version¶
FLUX.2 requires Python 3.10 or later with recent transformers:
Model Access¶
FLUX.2 models require access approval on Hugging Face:
For dev:

1. Visit black-forest-labs/FLUX.2-dev
2. Accept the license agreement

For klein models:

1. Visit black-forest-labs/FLUX.2-klein-base-9B or black-forest-labs/FLUX.2-klein-base-4B
2. Accept the license agreement
Ensure you're logged in to the Hugging Face CLI: `huggingface-cli login`
Installation¶
```shell
pip install 'simpletuner[cuda]'

# CUDA 13 / Blackwell users (NVIDIA B-series GPUs)
pip install 'simpletuner[cuda13]' --extra-index-url https://download.pytorch.org/whl/cu130
```
For development setup:
Configuration¶
Web Interface¶
Access http://localhost:8001 and select FLUX.2 as the model family.
Manual Configuration¶
Create config/config.json:
```json
{
  "model_type": "lora",
  "model_family": "flux2",
  "model_flavour": "dev",
  "pretrained_model_name_or_path": "black-forest-labs/FLUX.2-dev",
  "output_dir": "/path/to/output",
  "train_batch_size": 1,
  "gradient_accumulation_steps": 1,
  "gradient_checkpointing": true,
  "mixed_precision": "bf16",
  "learning_rate": 1e-4,
  "lr_scheduler": "constant",
  "max_train_steps": 10000,
  "validation_resolution": "1024x1024",
  "validation_num_inference_steps": 20,
  "flux_guidance_mode": "constant",
  "flux_guidance_value": 1.0,
  "lora_rank": 16
}
```
Key Configuration Options¶
Guidance Configuration¶
Note: Klein models (`klein-4b`, `klein-9b`) do not have guidance embeddings. The following guidance options only apply to `dev`.
FLUX.2-dev uses guidance embedding similar to FLUX.1:
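For constant guidance, matching the example config above:

```json
{
  "flux_guidance_mode": "constant",
  "flux_guidance_value": 1.0
}
```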
Or for random guidance during training:
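A sketch of a random-range setup; the `flux_guidance_min`/`flux_guidance_max` option names are assumptions here - confirm them against the OPTIONS reference for your SimpleTuner version:

```json
{
  "flux_guidance_mode": "random-range",
  "flux_guidance_min": 1.0,
  "flux_guidance_max": 4.0
}
```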
Quantization (Memory Optimization)¶
For reduced VRAM usage:
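A minimal sketch, assuming SimpleTuner's quantization option names (`base_model_precision`, `text_encoder_1_precision`; the `int8-quanto`/`int4-quanto` levels and `quantize_via` appear in the Troubleshooting section below) - verify against the OPTIONS reference:

```json
{
  "base_model_precision": "int8-quanto",
  "text_encoder_1_precision": "int8-quanto",
  "quantize_via": "cpu"
}
```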
TREAD (Training Acceleration)¶
FLUX.2 supports TREAD for faster training:
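A sketch of a `tread_config` block; the route keys match those described under Advanced: TREAD Configuration, but the exact nesting is an assumption - check the TREAD documentation:

```json
{
  "tread_config": {
    "routes": [
      {
        "selection_ratio": 0.5,
        "start_layer_idx": 2,
        "end_layer_idx": -2
      }
    ]
  }
}
```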
Advanced Experimental Features¶
SimpleTuner includes experimental features that can significantly improve training stability and performance.

- **[Scheduled Sampling (Rollout)](../experimental/SCHEDULED_SAMPLING.md):** reduces exposure bias and improves output quality by letting the model generate its own inputs during training.

> ⚠️ These features increase the computational overhead of training.

Dataset Configuration¶
Create config/multidatabackend.json:
```json
[
  {
    "id": "my-dataset",
    "type": "local",
    "crop": true,
    "crop_aspect": "square",
    "crop_style": "center",
    "resolution": 1024,
    "minimum_image_size": 1024,
    "maximum_image_size": 1024,
    "resolution_type": "pixel_area",
    "cache_dir_vae": "cache/vae/flux2/my-dataset",
    "instance_data_dir": "datasets/my-dataset",
    "caption_strategy": "textfile",
    "metadata_backend": "discovery",
    "repeats": 10
  },
  {
    "id": "text-embeds",
    "type": "local",
    "dataset_type": "text_embeds",
    "default": true,
    "cache_dir": "cache/text/flux2",
    "write_batch_size": 64
  }
]
```
See caption_strategy options and requirements in DATALOADER.md.
Optional edit / reference conditioning¶
FLUX.2 can train either plain text-to-image (no conditioning) or with paired reference/edit images. To add conditioning, pair your main dataset to one or more conditioning datasets using conditioning_data and choose a conditioning_type:
```json
[
  {
    "id": "flux2-edits",
    "type": "local",
    "instance_data_dir": "/datasets/flux2/edits",
    "caption_strategy": "textfile",
    "resolution": 1024,
    "conditioning_data": ["flux2-references"],
    "cache_dir_vae": "cache/vae/flux2/edits"
  },
  {
    "id": "flux2-references",
    "type": "local",
    "dataset_type": "conditioning",
    "instance_data_dir": "/datasets/flux2/references",
    "conditioning_type": "reference_strict",
    "resolution": 1024,
    "cache_dir_vae": "cache/vae/flux2/references"
  }
]
```
- Use `conditioning_type=reference_strict` when you need crops aligned 1:1 with the edit image; `reference_loose` allows mismatched aspect ratios.
- File names must match between edit and reference datasets; each edit image should have a corresponding reference file.
- When supplying multiple conditioning datasets, set `conditioning_multidataset_sampling` (`combined` vs `random`) as needed; see OPTIONS.
- Without `conditioning_data`, FLUX.2 falls back to standard text-to-image training.
LoRA Targets¶
Available LoRA target presets:
- `all` (default): All attention and MLP layers
- `attention`: Only attention layers (qkv, proj)
- `mlp`: Only MLP/feed-forward layers
- `tiny`: Minimal training (just qkv layers)
Training¶
Login to Services¶
Start Training¶
Or via script:
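For example (the `train.sh` entry point is an assumption - use whichever launcher your SimpleTuner install provides):

```shell
# Log in so gated model downloads and (optionally) experiment tracking work
huggingface-cli login
wandb login

# Launch training using the settings in config/config.json
bash train.sh
```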
Memory Offloading¶
For memory-constrained setups, FLUX.2 supports group offloading for both the transformer and optionally the Mistral-3 text encoder:
```shell
--enable_group_offload \
--group_offload_type block_level \
--group_offload_blocks_per_group 1 \
--group_offload_use_stream \
--group_offload_text_encoder
```
The --group_offload_text_encoder flag is recommended for FLUX.2 since the 24B Mistral text encoder benefits significantly from offloading during text embedding caching. You can also add --group_offload_vae to include the VAE in offloading during latent caching.
Validation Prompts¶
Create config/user_prompt_library.json:
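A minimal sketch, assuming the prompt library maps a short name (used to label validation outputs) to the prompt text:

```json
{
  "woman_portrait": "a photograph of a woman smiling in a sunlit park",
  "style_test": "an impressionist oil painting of a coastal village at dusk"
}
```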
Inference¶
Using Trained LoRA¶
FLUX.2 LoRAs can be loaded with the SimpleTuner inference pipeline or compatible tools once community support develops.
Guidance Scale¶
- Training with `flux_guidance_value=1.0` works well for most use cases
- At inference, use normal guidance values (3.0-5.0)
Differences from FLUX.1¶
| Aspect | FLUX.1 | FLUX.2-dev | FLUX.2-klein-9b | FLUX.2-klein-4b |
|---|---|---|---|---|
| Text Encoder | CLIP-L/14 + T5-XXL | Mistral-3 (24B) | Qwen3 (bundled) | Qwen3 (bundled) |
| Embedding Dim | CLIP: 768, T5: 4096 | 15,360 | 12,288 | 7,680 |
| Latent Channels | 16 | 32 (→128) | 32 (→128) | 32 (→128) |
| VAE | AutoencoderKL | Custom (BatchNorm) | Custom (BatchNorm) | Custom (BatchNorm) |
| Transformer Blocks | 19 joint + 38 single | 8 double + 48 single | 8 double + 24 single | 5 double + 20 single |
| Guidance Embeds | Yes | Yes | No | No |
Troubleshooting¶
Out of Memory During Startup¶
- Enable `--offload_during_startup=true`
- Use `--quantize_via=cpu` for text encoder quantization
- Reduce `--vae_batch_size`
Slow Text Embedding¶
Mistral-3 is large; consider:
- Pre-caching all text embeddings before training
- Using text encoder quantization
- Batch processing with larger write_batch_size
Training Instability¶
- Lower learning rate (try 5e-5)
- Increase gradient accumulation steps
- Enable gradient checkpointing
- Use `--max_grad_norm=1.0`
CUDA Out of Memory¶
- Enable quantization (`int8-quanto` or `int4-quanto`)
- Enable gradient checkpointing
- Reduce batch size
- Enable group offloading
- Use TREAD for token routing efficiency
Advanced: TREAD Configuration¶
TREAD (Token Routing for Efficient Architecture-agnostic Diffusion) speeds up training by selectively processing tokens:
- `selection_ratio`: Fraction of tokens to keep (0.5 = 50%)
- `start_layer_idx`: First layer to apply routing
- `end_layer_idx`: Last layer (negative = from end)
Expected speedup: 20-40% depending on configuration.
See Also¶
- FLUX.1 Quickstart - For FLUX.1 training
- TREAD Documentation - Detailed TREAD configuration
- LyCORIS Training Guide - LoRA and LyCORIS training methods
- Dataloader Configuration - Dataset setup