FLUX.2 Quickstart

This guide covers training LoRAs on FLUX.2, Black Forest Labs' latest image generation model family.

Note: The default model flavour is klein-9b, but this guide focuses on dev (the full 12B transformer paired with the 24B Mistral-3 text encoder) since it has the highest resource requirements. The Klein models are easier to run; see Model Variants below.

Model Variants

FLUX.2 comes in three variants:

| Variant  | Transformer | Text Encoder    | Total Blocks | Default |
|----------|-------------|-----------------|--------------|---------|
| dev      | 12B params  | Mistral-3 (24B) | 56 (8+48)    |         |
| klein-9b | 9B params   | Qwen3 (bundled) | 32 (8+24)    | ✓       |
| klein-4b | 4B params   | Qwen3 (bundled) | 25 (5+20)    |         |

Key differences:

  • dev: uses the standalone Mistral-Small-3.1-24B text encoder and has guidance embeddings
  • klein models: use the Qwen3 text encoder bundled in the model repo and have no guidance embeddings (guidance training options are ignored)

To select a variant, set model_flavour in your config:

{
  "model_flavour": "dev"
}

Important: For klein-4b and klein-9b, leave pretrained_text_encoder_model_name_or_path unset unless you intentionally want to replace the bundled Qwen3 text encoder. Setting that field overrides the Klein default and can trigger downloads of a different text encoder.

Model Overview

FLUX.2-dev introduces significant architectural changes from FLUX.1:

  • Text Encoder: Mistral-Small-3.1-24B (dev) or Qwen3 (klein)
  • Architecture: 8 DoubleStreamBlocks + 48 SingleStreamBlocks (dev)
  • Latent Channels: 32 VAE channels → 128 after pixel shuffle (vs 16 in FLUX.1)
  • VAE: Custom VAE with batch normalization and pixel shuffling
  • Embedding Dimension: 15,360 for dev (3×5,120), 12,288 for klein-9b (3×4,096), 7,680 for klein-4b (3×2,560)
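The embedding dimensions above follow directly from each variant's hidden size; a quick sketch of the arithmetic (hidden sizes taken from the figures listed above, the 3× factor as stated):

```python
# The listed embedding dims are 3x each variant's hidden size.
# Hidden sizes come from the list above; the 3x factor is as stated
# there (its architectural origin is not covered in this guide).
hidden_sizes = {"dev": 5120, "klein-9b": 4096, "klein-4b": 2560}
embed_dims = {name: 3 * h for name, h in hidden_sizes.items()}
print(embed_dims)  # {'dev': 15360, 'klein-9b': 12288, 'klein-4b': 7680}
```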

Hardware Requirements

Hardware requirements vary significantly by model variant.

Klein models are much more accessible:

| Variant  | bf16 VRAM | int8 VRAM | System RAM |
|----------|-----------|-----------|------------|
| klein-4b | ~12GB     | ~8GB      | 32GB+      |
| klein-9b | ~22GB     | ~14GB     | 64GB+      |

  • Recommended for klein-9b: a single 24GB GPU (RTX 3090/4090, A5000)
  • Recommended for klein-4b: a single 16GB GPU (RTX 4080, A4000)

FLUX.2-dev (Advanced)

FLUX.2-dev has significant resource requirements due to the Mistral-3 text encoder:

VRAM Requirements

The 24B Mistral text encoder alone requires significant VRAM:

| Component          | bf16  | int8  | int4  |
|--------------------|-------|-------|-------|
| Mistral-3 (24B)    | ~48GB | ~24GB | ~12GB |
| FLUX.2 Transformer | ~24GB | ~12GB | ~6GB  |
| VAE + overhead     | ~4GB  | ~4GB  | ~4GB  |

| Configuration                         | Approximate Total VRAM |
|---------------------------------------|------------------------|
| bf16 everything                       | ~76GB+                 |
| int8 text encoder + bf16 transformer  | ~52GB                  |
| int8 everything                       | ~40GB                  |
| int4 text encoder + int8 transformer  | ~22GB                  |

System RAM

  • Minimum: 96GB system RAM (loading the 24B text encoder requires substantial memory)
  • Recommended: 128GB+ for comfortable operation

GPUs

  • Minimum: 2x 48GB GPUs (A6000, L40S) with FSDP2 or DeepSpeed
  • Recommended: 4x H100 80GB with fp8-torchao
  • With heavy quantization (int4): 2x 24GB GPUs may work, but this is experimental

Multi-GPU distributed training (FSDP2 or DeepSpeed) is essentially required for FLUX.2-dev due to the combined size of the Mistral-3 text encoder and transformer.

Prerequisites

Python Version

FLUX.2 requires Python 3.10 or later with recent transformers:

python --version  # Should be 3.10+
pip install 'transformers>=4.45.0'  # quote so the shell doesn't treat >= as a redirect

Model Access

FLUX.2 models require access approval on Hugging Face:

For dev:

  1. Visit black-forest-labs/FLUX.2-dev
  2. Accept the license agreement

For klein models:

  1. Visit black-forest-labs/FLUX.2-klein-base-9B or black-forest-labs/FLUX.2-klein-base-4B
  2. Accept the license agreement

Ensure you're logged in to the Hugging Face CLI: huggingface-cli login

Installation

pip install 'simpletuner[cuda]'

# CUDA 13 / Blackwell users (NVIDIA B-series GPUs)
pip install 'simpletuner[cuda13]' --extra-index-url https://download.pytorch.org/whl/cu130

For development setup:

git clone https://github.com/bghira/SimpleTuner
cd SimpleTuner
pip install -e ".[cuda]"

Configuration

Web Interface

simpletuner server

Access http://localhost:8001 and select FLUX.2 as the model family.

Manual Configuration

Create config/config.json:

{
  "model_type": "lora",
  "model_family": "flux2",
  "model_flavour": "dev",
  "pretrained_model_name_or_path": "black-forest-labs/FLUX.2-dev",
  "output_dir": "/path/to/output",
  "train_batch_size": 1,
  "gradient_accumulation_steps": 1,
  "gradient_checkpointing": true,
  "mixed_precision": "bf16",
  "learning_rate": 1e-4,
  "lr_scheduler": "constant",
  "max_train_steps": 10000,
  "validation_resolution": "1024x1024",
  "validation_num_inference_steps": 20,
  "flux_guidance_mode": "constant",
  "flux_guidance_value": 1.0,
  "lora_rank": 16
}
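Before launching a run it can help to sanity-check the config file. A minimal sketch, assuming only the key names used in this guide's example (the full option list lives in SimpleTuner's OPTIONS documentation):

```python
# Write an example config to a temp file, reload it, and verify that a
# few keys this guide relies on are present. The "required" set here is
# an assumption for illustration, not SimpleTuner's own validation.
import json
import tempfile

example = {
    "model_type": "lora",
    "model_family": "flux2",
    "model_flavour": "dev",
    "learning_rate": 1e-4,
    "lora_rank": 16,
}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(example, f)
    path = f.name

with open(path) as f:
    cfg = json.load(f)

for key in ("model_family", "model_flavour", "model_type"):
    assert key in cfg, f"missing required key: {key}"
print(cfg["model_flavour"])  # dev
```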

Key Configuration Options

Guidance Configuration

Note: Klein models (klein-4b, klein-9b) do not have guidance embeddings. The following guidance options only apply to dev.

FLUX.2-dev uses guidance embedding similar to FLUX.1:

{
  "flux_guidance_mode": "constant",
  "flux_guidance_value": 1.0
}

Or for random guidance during training:

{
  "flux_guidance_mode": "random-range",
  "flux_guidance_min": 1.0,
  "flux_guidance_max": 5.0
}
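One plausible reading of the two modes, sketched in plain Python (this mirrors the config semantics described above; SimpleTuner's actual sampling code may differ in detail):

```python
# constant: every training step sees the same guidance value.
# random-range: each step samples uniformly from [gmin, gmax].
import random

def sample_guidance(mode: str, value: float = 1.0,
                    gmin: float = 1.0, gmax: float = 5.0) -> float:
    if mode == "constant":
        return value
    if mode == "random-range":
        return random.uniform(gmin, gmax)
    raise ValueError(f"unknown flux_guidance_mode: {mode}")

g = sample_guidance("random-range", gmin=1.0, gmax=5.0)
assert 1.0 <= g <= 5.0
print(sample_guidance("constant", value=1.0))  # 1.0
```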

Quantization (Memory Optimization)

For reduced VRAM usage:

{
  "base_model_precision": "int8-quanto",
  "text_encoder_1_precision": "int8-quanto",
  "base_model_default_dtype": "bf16"
}

TREAD (Training Acceleration)

FLUX.2 supports TREAD for faster training:

{
  "tread_config": {
    "routes": [
      {"selection_ratio": 0.5, "start_layer_idx": 2, "end_layer_idx": -2}
    ]
  }
}

Advanced Experimental Features

SimpleTuner includes experimental features that can significantly improve training stability and performance:

  • Scheduled Sampling (Rollout) (../experimental/SCHEDULED_SAMPLING.md): reduces exposure bias and improves output quality by letting the model generate its own inputs during training.

⚠️ These features increase the computational overhead of training.

Dataset Configuration

Create config/multidatabackend.json:

[
  {
    "id": "my-dataset",
    "type": "local",
    "crop": true,
    "crop_aspect": "square",
    "crop_style": "center",
    "resolution": 1024,
    "minimum_image_size": 1024,
    "maximum_image_size": 1024,
    "resolution_type": "pixel_area",
    "cache_dir_vae": "cache/vae/flux2/my-dataset",
    "instance_data_dir": "datasets/my-dataset",
    "caption_strategy": "textfile",
    "metadata_backend": "discovery",
    "repeats": 10
  },
  {
    "id": "text-embeds",
    "type": "local",
    "dataset_type": "text_embeds",
    "default": true,
    "cache_dir": "cache/text/flux2",
    "write_batch_size": 64
  }
]

See caption_strategy options and requirements in DATALOADER.md.

Optional edit / reference conditioning

FLUX.2 can train either plain text-to-image (no conditioning) or with paired reference/edit images. To add conditioning, pair your main dataset to one or more conditioning datasets using conditioning_data and choose a conditioning_type:

[
  {
    "id": "flux2-edits",
    "type": "local",
    "instance_data_dir": "/datasets/flux2/edits",
    "caption_strategy": "textfile",
    "resolution": 1024,
    "conditioning_data": ["flux2-references"],
    "cache_dir_vae": "cache/vae/flux2/edits"
  },
  {
    "id": "flux2-references",
    "type": "local",
    "dataset_type": "conditioning",
    "instance_data_dir": "/datasets/flux2/references",
    "conditioning_type": "reference_strict",
    "resolution": 1024,
    "cache_dir_vae": "cache/vae/flux2/references"
  }
]
  • Use conditioning_type=reference_strict when you need crops aligned 1:1 with the edit image. reference_loose allows mismatched aspect ratios.
  • File names must match between edit and reference datasets; each edit image should have a corresponding reference file.
  • When supplying multiple conditioning datasets, set conditioning_multidataset_sampling (combined vs random) as needed; see OPTIONS.
  • Without conditioning_data, FLUX.2 falls back to standard text-to-image training.
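Since mismatched file names are an easy way to break paired conditioning, it can be worth checking the pairing before training. A minimal sketch (the directory layout and image extensions are assumptions for illustration):

```python
# List edit images that lack a same-named reference file.
import os
import tempfile
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}  # assumed extension set

def check_pairing(edits_dir: str, refs_dir: str) -> list[str]:
    edits = {p.name for p in Path(edits_dir).iterdir() if p.suffix.lower() in IMAGE_EXTS}
    refs = {p.name for p in Path(refs_dir).iterdir() if p.suffix.lower() in IMAGE_EXTS}
    return sorted(edits - refs)  # edit images with no matching reference

# Demo with throwaway directories standing in for the dataset paths above.
root = tempfile.mkdtemp()
edits, refs = os.path.join(root, "edits"), os.path.join(root, "references")
os.makedirs(edits); os.makedirs(refs)
for name in ("a.png", "b.png"):
    open(os.path.join(edits, name), "w").close()
open(os.path.join(refs, "a.png"), "w").close()

print(check_pairing(edits, refs))  # ['b.png'] is missing a reference
```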

LoRA Targets

Available LoRA target presets:

  • all (default): All attention and MLP layers
  • attention: Only attention layers (qkv, proj)
  • mlp: Only MLP/feed-forward layers
  • tiny: Minimal training (just qkv layers)
{
  "--flux_lora_target": "all"
}

Training

Login to Services

huggingface-cli login
wandb login  # optional

Start Training

simpletuner train

Or via script:

./train.sh

Memory Offloading

For memory-constrained setups, FLUX.2 supports group offloading for both the transformer and optionally the Mistral-3 text encoder:

--enable_group_offload \
--group_offload_type block_level \
--group_offload_blocks_per_group 1 \
--group_offload_use_stream \
--group_offload_text_encoder

The --group_offload_text_encoder flag is recommended for FLUX.2 since the 24B Mistral text encoder benefits significantly from offloading during text embedding caching. You can also add --group_offload_vae to include the VAE in offloading during latent caching.

Validation Prompts

Create config/user_prompt_library.json:

{
  "portrait_subject": "a professional portrait photograph of <subject>, studio lighting, high detail",
  "artistic_subject": "an artistic interpretation of <subject> in the style of renaissance painting",
  "cinematic_subject": "a cinematic shot of <subject>, dramatic lighting, film grain"
}
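The `<subject>` token is a placeholder filled in at validation time; a sketch of the substitution (the replacement rule is an assumption based on the placeholder syntax shown above):

```python
# Expand every <subject> placeholder in a prompt library.
library = {
    "portrait_subject": "a professional portrait photograph of <subject>, studio lighting, high detail",
    "cinematic_subject": "a cinematic shot of <subject>, dramatic lighting, film grain",
}

def expand(prompts: dict[str, str], subject: str) -> dict[str, str]:
    return {key: text.replace("<subject>", subject) for key, text in prompts.items()}

expanded = expand(library, "a corgi")
print(expanded["portrait_subject"])
# a professional portrait photograph of a corgi, studio lighting, high detail
```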

Inference

Using Trained LoRA

FLUX.2 LoRAs can be loaded with the SimpleTuner inference pipeline or compatible tools once community support develops.

Guidance Scale

  • Training with flux_guidance_value=1.0 works well for most use cases
  • At inference, use normal guidance values (3.0-5.0)

Differences from FLUX.1

| Aspect             | FLUX.1               | FLUX.2-dev           | FLUX.2-klein-9b      | FLUX.2-klein-4b      |
|--------------------|----------------------|----------------------|----------------------|----------------------|
| Text Encoder       | CLIP-L/14 + T5-XXL   | Mistral-3 (24B)      | Qwen3 (bundled)      | Qwen3 (bundled)      |
| Embedding Dim      | CLIP: 768, T5: 4096  | 15,360               | 12,288               | 7,680                |
| Latent Channels    | 16                   | 32 (→128)            | 32 (→128)            | 32 (→128)            |
| VAE                | AutoencoderKL        | Custom (BatchNorm)   | Custom (BatchNorm)   | Custom (BatchNorm)   |
| Transformer Blocks | 19 joint + 38 single | 8 double + 48 single | 8 double + 24 single | 5 double + 20 single |
| Guidance Embeds    | Yes                  | Yes                  | No                   | No                   |

Troubleshooting

Out of Memory During Startup

  • Enable --offload_during_startup=true
  • Use --quantize_via=cpu for text encoder quantization
  • Reduce --vae_batch_size

Slow Text Embedding

Mistral-3 is large; consider:

  • Pre-caching all text embeddings before training
  • Using text encoder quantization
  • Batch processing with a larger write_batch_size

Training Instability

  • Lower learning rate (try 5e-5)
  • Increase gradient accumulation steps
  • Enable gradient checkpointing
  • Use --max_grad_norm=1.0

CUDA Out of Memory

  • Enable quantization (int8-quanto or int4-quanto)
  • Enable gradient checkpointing
  • Reduce batch size
  • Enable group offloading
  • Use TREAD for token routing efficiency

Advanced: TREAD Configuration

TREAD (Token Routing for Efficient Architecture-agnostic Diffusion) speeds up training by selectively processing tokens:

{
  "tread_config": {
    "routes": [
      {
        "selection_ratio": 0.5,
        "start_layer_idx": 4,
        "end_layer_idx": -4
      }
    ]
  }
}
  • selection_ratio: Fraction of tokens to keep (0.5 = 50%)
  • start_layer_idx: First layer to apply routing
  • end_layer_idx: Last layer (negative = from end)

Expected speedup: 20-40% depending on configuration.
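A back-of-envelope for where that speedup comes from: with selection_ratio 0.5 applied from layer 4 through layer -4 of dev's 56 blocks, roughly half the tokens skip the routed layers. This sketch ignores attention's quadratic cost and non-transformer overhead, so it is an upper bound, not a benchmark:

```python
# Estimate the fraction of per-layer compute skipped by token routing.
def routed_fraction(num_layers: int, start: int, end: int, keep: float) -> float:
    end = end % num_layers          # support negative indices, e.g. -4 -> 52
    routed_layers = end - start     # layers where routing is active
    saved = routed_layers * (1.0 - keep)  # layer-equivalents of skipped work
    return saved / num_layers       # fraction of total layer compute saved

# dev: 56 blocks, routing layers 4..-4 with 50% of tokens kept.
print(round(routed_fraction(56, 4, -4, 0.5), 3))  # 0.429
```

The ~43% upper bound is broadly consistent with the observed 20-40% wall-clock speedup once the ignored costs are accounted for.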

See Also