FLUX.2 Quickstart

This guide covers training LoRAs on FLUX.2, Black Forest Labs' latest image generation model family.

Note: The default model flavour is klein-9b, but this guide focuses on dev (the full 12B transformer paired with the 24B Mistral-3 text encoder) since it has the highest resource requirements. The Klein models are easier to run; see Model Variants below.

Model Variants

FLUX.2 comes in three variants:

| Variant  | Transformer | Text Encoder    | Total Blocks | Default |
|----------|-------------|-----------------|--------------|---------|
| dev      | 12B params  | Mistral-3 (24B) | 56 (8+48)    |         |
| klein-9b | 9B params   | Qwen3 (bundled) | 32 (8+24)    | ✓       |
| klein-4b | 4B params   | Qwen3 (bundled) | 25 (5+20)    |         |

Key differences:

  • dev: uses the standalone Mistral-Small-3.1-24B text encoder and has guidance embeddings
  • klein models: use the Qwen3 text encoder bundled in the model repo and have no guidance embeddings (guidance training options are ignored)

To select a variant, set model_flavour in your config:

{
  "model_flavour": "dev"
}

Important: For klein-4b and klein-9b, leave pretrained_text_encoder_model_name_or_path unset unless you intentionally want to replace the bundled Qwen3 text encoder. Setting that field overrides the Klein default and can trigger downloads of a different text encoder.

Model Overview

FLUX.2-dev introduces significant architectural changes from FLUX.1:

  • Text Encoder: Mistral-Small-3.1-24B (dev) or Qwen3 (klein)
  • Architecture: 8 DoubleStreamBlocks + 48 SingleStreamBlocks (dev)
  • Latent Channels: 32 VAE channels → 128 after pixel shuffle (vs 16 in FLUX.1)
  • VAE: Custom VAE with batch normalization and pixel shuffling
  • Embedding Dimension: 15,360 for dev (3×5,120), 12,288 for klein-9b (3×4,096), 7,680 for klein-4b (3×2,560)
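The embedding dimensions above follow directly from each variant's hidden size; a quick sketch of the arithmetic (hidden sizes taken from the figures listed above, the 3× factor as stated):

```python
# The listed embedding dims are 3x each variant's hidden size.
# Hidden sizes come from the list above; the 3x factor is as stated
# there (its architectural origin is not covered in this guide).
hidden_sizes = {"dev": 5120, "klein-9b": 4096, "klein-4b": 2560}
embed_dims = {name: 3 * h for name, h in hidden_sizes.items()}
print(embed_dims)  # {'dev': 15360, 'klein-9b': 12288, 'klein-4b': 7680}
```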

Hardware Requirements

Hardware requirements vary significantly by model variant.

Klein models are much more accessible:

| Variant  | bf16 VRAM | int8 VRAM | System RAM |
|----------|-----------|-----------|------------|
| klein-4b | ~12GB     | ~8GB      | 32GB+      |
| klein-9b | ~22GB     | ~14GB     | 64GB+      |

  • Recommended for klein-9b: a single 24GB GPU (RTX 3090/4090, A5000)
  • Recommended for klein-4b: a single 16GB GPU (RTX 4080, A4000)

FLUX.2-dev (Advanced)

FLUX.2-dev has significant resource requirements due to the Mistral-3 text encoder:

VRAM Requirements

The 24B Mistral text encoder alone requires significant VRAM:

| Component          | bf16  | int8  | int4  |
|--------------------|-------|-------|-------|
| Mistral-3 (24B)    | ~48GB | ~24GB | ~12GB |
| FLUX.2 Transformer | ~24GB | ~12GB | ~6GB  |
| VAE + overhead     | ~4GB  | ~4GB  | ~4GB  |

| Configuration                         | Approximate Total VRAM |
|---------------------------------------|------------------------|
| bf16 everything                       | ~76GB+                 |
| int8 text encoder + bf16 transformer  | ~52GB                  |
| int8 everything                       | ~40GB                  |
| int4 text encoder + int8 transformer  | ~22GB                  |

System RAM

  • Minimum: 96GB system RAM (loading the 24B text encoder requires substantial memory)
  • Recommended: 128GB+ for comfortable operation

GPUs

  • Minimum: 2x 48GB GPUs (A6000, L40S) with FSDP2 or DeepSpeed
  • Recommended: 4x H100 80GB with fp8-torchao
  • With heavy quantization (int4): 2x 24GB GPUs may work, but this is experimental

Multi-GPU distributed training (FSDP2 or DeepSpeed) is essentially required for FLUX.2-dev due to the combined size of the Mistral-3 text encoder and transformer.

Prerequisites

Python Version

FLUX.2 requires Python 3.10 or later with recent transformers:

python --version  # Should be 3.10+
pip install 'transformers>=4.45.0'  # quote so the shell doesn't treat >= as a redirect

Model Access

FLUX.2 models require access approval on Hugging Face:

For dev:

  1. Visit black-forest-labs/FLUX.2-dev
  2. Accept the license agreement

For klein models:

  1. Visit black-forest-labs/FLUX.2-klein-base-9B or black-forest-labs/FLUX.2-klein-base-4B
  2. Accept the license agreement

Ensure you're logged in to the Hugging Face CLI: huggingface-cli login

Installation

pip install 'simpletuner[cuda]'

# CUDA 13 / Blackwell users (NVIDIA B-series GPUs)
pip install 'simpletuner[cuda13]' --extra-index-url https://download.pytorch.org/whl/cu130

For development setup:

git clone https://github.com/bghira/SimpleTuner
cd SimpleTuner
pip install -e ".[cuda]"

Configuration

Web Interface

simpletuner server

Access http://localhost:8001 and select FLUX.2 as the model family.

Manual Configuration

Create config/config.json:

{
  "model_type": "lora",
  "model_family": "flux2",
  "model_flavour": "dev",
  "pretrained_model_name_or_path": "black-forest-labs/FLUX.2-dev",
  "output_dir": "/path/to/output",
  "train_batch_size": 1,
  "gradient_accumulation_steps": 1,
  "gradient_checkpointing": true,
  "mixed_precision": "bf16",
  "learning_rate": 1e-4,
  "lr_scheduler": "constant",
  "max_train_steps": 10000,
  "validation_resolution": "1024x1024",
  "validation_num_inference_steps": 20,
  "flux_guidance_mode": "constant",
  "flux_guidance_value": 1.0,
  "lora_rank": 16
}
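Before launching a run it can help to sanity-check the config file. A minimal sketch, assuming only the key names used in this guide's example (the full option list lives in SimpleTuner's OPTIONS documentation):

```python
# Write an example config to a temp file, reload it, and verify that a
# few keys this guide relies on are present. The "required" set here is
# an assumption for illustration, not SimpleTuner's own validation.
import json
import tempfile

example = {
    "model_type": "lora",
    "model_family": "flux2",
    "model_flavour": "dev",
    "learning_rate": 1e-4,
    "lora_rank": 16,
}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(example, f)
    path = f.name

with open(path) as f:
    cfg = json.load(f)

for key in ("model_family", "model_flavour", "model_type"):
    assert key in cfg, f"missing required key: {key}"
print(cfg["model_flavour"])  # dev
```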

Key Configuration Options

Guidance Configuration

Note: Klein models (klein-4b, klein-9b) do not have guidance embeddings. The following guidance options only apply to dev.

FLUX.2-dev uses guidance embedding similar to FLUX.1:

{
  "flux_guidance_mode": "constant",
  "flux_guidance_value": 1.0
}

Or for random guidance during training:

{
  "flux_guidance_mode": "random-range",
  "flux_guidance_min": 1.0,
  "flux_guidance_max": 5.0
}
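One plausible reading of the two modes, sketched in plain Python (this mirrors the config semantics described above; SimpleTuner's actual sampling code may differ in detail):

```python
# constant: every training step sees the same guidance value.
# random-range: each step samples uniformly from [gmin, gmax].
import random

def sample_guidance(mode: str, value: float = 1.0,
                    gmin: float = 1.0, gmax: float = 5.0) -> float:
    if mode == "constant":
        return value
    if mode == "random-range":
        return random.uniform(gmin, gmax)
    raise ValueError(f"unknown flux_guidance_mode: {mode}")

g = sample_guidance("random-range", gmin=1.0, gmax=5.0)
assert 1.0 <= g <= 5.0
print(sample_guidance("constant", value=1.0))  # 1.0
```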

Quantization (Memory Optimization)

For reduced VRAM usage:

{
  "base_model_precision": "int8-quanto",
  "text_encoder_1_precision": "int8-quanto",
  "base_model_default_dtype": "bf16"
}

TREAD (Training Acceleration)

FLUX.2 supports TREAD for faster training:

{
  "tread_config": {
    "routes": [
      {"selection_ratio": 0.5, "start_layer_idx": 2, "end_layer_idx": -2}
    ]
  }
}

Advanced Experimental Features

SimpleTuner includes experimental features that can significantly improve training stability and performance:

  • Scheduled Sampling (Rollout) (../experimental/SCHEDULED_SAMPLING.md): reduces exposure bias and improves output quality by letting the model generate its own inputs during training.

⚠️ These features increase the computational overhead of training.

Dataset Configuration

Create config/multidatabackend.json:

[
  {
    "id": "my-dataset",
    "type": "local",
    "crop": true,
    "crop_aspect": "square",
    "crop_style": "center",
    "resolution": 1024,
    "minimum_image_size": 1024,
    "maximum_image_size": 1024,
    "resolution_type": "pixel_area",
    "cache_dir_vae": "cache/vae/flux2/my-dataset",
    "instance_data_dir": "datasets/my-dataset",
    "caption_strategy": "textfile",
    "metadata_backend": "discovery",
    "repeats": 10
  },
  {
    "id": "text-embeds",
    "type": "local",
    "dataset_type": "text_embeds",
    "default": true,
    "cache_dir": "cache/text/flux2",
    "write_batch_size": 64
  }
]

See caption_strategy options and requirements in DATALOADER.md.

Optional edit / reference conditioning

FLUX.2 can train either plain text-to-image (no conditioning) or with paired reference/edit images. To add conditioning, pair your main dataset to one or more conditioning datasets using conditioning_data and choose a conditioning_type:

[
  {
    "id": "flux2-edits",
    "type": "local",
    "instance_data_dir": "/datasets/flux2/edits",
    "caption_strategy": "textfile",
    "resolution": 1024,
    "conditioning_data": ["flux2-references"],
    "cache_dir_vae": "cache/vae/flux2/edits"
  },
  {
    "id": "flux2-references",
    "type": "local",
    "dataset_type": "conditioning",
    "instance_data_dir": "/datasets/flux2/references",
    "conditioning_type": "reference_strict",
    "resolution": 1024,
    "cache_dir_vae": "cache/vae/flux2/references"
  }
]
  • Use conditioning_type=reference_strict when you need crops aligned 1:1 with the edit image. reference_loose allows mismatched aspect ratios.
  • File names must match between edit and reference datasets; each edit image should have a corresponding reference file.
  • When supplying multiple conditioning datasets, set conditioning_multidataset_sampling (combined vs random) as needed; see OPTIONS.
  • Without conditioning_data, FLUX.2 falls back to standard text-to-image training.
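Since mismatched file names are an easy way to break paired conditioning, it can be worth checking the pairing before training. A minimal sketch (the directory layout and image extensions are assumptions for illustration):

```python
# List edit images that lack a same-named reference file.
import os
import tempfile
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}  # assumed extension set

def check_pairing(edits_dir: str, refs_dir: str) -> list[str]:
    edits = {p.name for p in Path(edits_dir).iterdir() if p.suffix.lower() in IMAGE_EXTS}
    refs = {p.name for p in Path(refs_dir).iterdir() if p.suffix.lower() in IMAGE_EXTS}
    return sorted(edits - refs)  # edit images with no matching reference

# Demo with throwaway directories standing in for the dataset paths above.
root = tempfile.mkdtemp()
edits, refs = os.path.join(root, "edits"), os.path.join(root, "references")
os.makedirs(edits); os.makedirs(refs)
for name in ("a.png", "b.png"):
    open(os.path.join(edits, name), "w").close()
open(os.path.join(refs, "a.png"), "w").close()

print(check_pairing(edits, refs))  # ['b.png'] is missing a reference
```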

LoRA Targets

Available LoRA target presets:

  • all (default): All attention and MLP layers
  • attention: Only attention layers (qkv, proj)
  • mlp: Only MLP/feed-forward layers
  • tiny: Minimal training (just qkv layers)
{
  "--flux_lora_target": "all"
}

Training

Login to Services

huggingface-cli login
wandb login  # optional

Start Training

simpletuner train

Or via script:

./train.sh

Memory Offloading

For memory-constrained setups, FLUX.2 supports group offloading for both the transformer and optionally the Mistral-3 text encoder:

--enable_group_offload \
--group_offload_type block_level \
--group_offload_blocks_per_group 1 \
--group_offload_use_stream \
--group_offload_text_encoder

The --group_offload_text_encoder flag is recommended for FLUX.2 since the 24B Mistral text encoder benefits significantly from offloading during text embedding caching. You can also add --group_offload_vae to include the VAE in offloading during latent caching.

Validation Prompts

Create config/user_prompt_library.json:

{
  "portrait_subject": "a professional portrait photograph of <subject>, studio lighting, high detail",
  "artistic_subject": "an artistic interpretation of <subject> in the style of renaissance painting",
  "cinematic_subject": "a cinematic shot of <subject>, dramatic lighting, film grain"
}
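The `<subject>` token is a placeholder filled in at validation time; a sketch of the substitution (the replacement rule is an assumption based on the placeholder syntax shown above):

```python
# Expand every <subject> placeholder in a prompt library.
library = {
    "portrait_subject": "a professional portrait photograph of <subject>, studio lighting, high detail",
    "cinematic_subject": "a cinematic shot of <subject>, dramatic lighting, film grain",
}

def expand(prompts: dict[str, str], subject: str) -> dict[str, str]:
    return {key: text.replace("<subject>", subject) for key, text in prompts.items()}

expanded = expand(library, "a corgi")
print(expanded["portrait_subject"])
# a professional portrait photograph of a corgi, studio lighting, high detail
```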

Inference

Using Trained LoRA

FLUX.2 LoRAs can be loaded with the SimpleTuner inference pipeline or compatible tools once community support develops.

Guidance Scale

  • Training with flux_guidance_value=1.0 works well for most use cases
  • At inference, use normal guidance values (3.0-5.0)

Differences from FLUX.1

| Aspect             | FLUX.1               | FLUX.2-dev           | FLUX.2-klein-9b      | FLUX.2-klein-4b      |
|--------------------|----------------------|----------------------|----------------------|----------------------|
| Text Encoder       | CLIP-L/14 + T5-XXL   | Mistral-3 (24B)      | Qwen3 (bundled)      | Qwen3 (bundled)      |
| Embedding Dim      | CLIP: 768, T5: 4096  | 15,360               | 12,288               | 7,680                |
| Latent Channels    | 16                   | 32 (→128)            | 32 (→128)            | 32 (→128)            |
| VAE                | AutoencoderKL        | Custom (BatchNorm)   | Custom (BatchNorm)   | Custom (BatchNorm)   |
| Transformer Blocks | 19 joint + 38 single | 8 double + 48 single | 8 double + 24 single | 5 double + 20 single |
| Guidance Embeds    | Yes                  | Yes                  | No                   | No                   |

Troubleshooting

Out of Memory During Startup

  • Enable --offload_during_startup=true
  • Use --quantize_via=cpu for text encoder quantization
  • Reduce --vae_batch_size

Slow Text Embedding

Mistral-3 is large; consider:

  • Pre-caching all text embeddings before training
  • Using text encoder quantization
  • Batch processing with a larger write_batch_size

Training Instability

  • Lower learning rate (try 5e-5)
  • Increase gradient accumulation steps
  • Enable gradient checkpointing
  • Use --max_grad_norm=1.0

CUDA Out of Memory

  • Enable quantization (int8-quanto or int4-quanto)
  • Enable gradient checkpointing
  • Reduce batch size
  • Enable group offloading
  • Use TREAD for token routing efficiency

Advanced: TREAD Configuration

TREAD (Token Routing for Efficient Architecture-agnostic Diffusion) speeds up training by selectively processing tokens:

{
  "tread_config": {
    "routes": [
      {
        "selection_ratio": 0.5,
        "start_layer_idx": 4,
        "end_layer_idx": -4
      }
    ]
  }
}
  • selection_ratio: Fraction of tokens to keep (0.5 = 50%)
  • start_layer_idx: First layer to apply routing
  • end_layer_idx: Last layer (negative = from end)

Expected speedup: 20-40% depending on configuration.
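A back-of-envelope for where that speedup comes from: with selection_ratio 0.5 applied from layer 4 through layer -4 of dev's 56 blocks, roughly half the tokens skip the routed layers. This sketch ignores attention's quadratic cost and non-transformer overhead, so it is an upper bound, not a benchmark:

```python
# Estimate the fraction of per-layer compute skipped by token routing.
def routed_fraction(num_layers: int, start: int, end: int, keep: float) -> float:
    end = end % num_layers          # support negative indices, e.g. -4 -> 52
    routed_layers = end - start     # layers where routing is active
    saved = routed_layers * (1.0 - keep)  # layer-equivalents of skipped work
    return saved / num_layers       # fraction of total layer compute saved

# dev: 56 blocks, routing layers 4..-4 with 50% of tokens kept.
print(round(routed_fraction(56, 4, -4, 0.5), 3))  # 0.429
```

The ~43% upper bound is broadly consistent with the observed 20-40% wall-clock speedup once the ignored costs are accounted for.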

See Also