Qwen Image Quickstart¶

🆕 Looking for the edit checkpoints? See the Qwen Image Edit quickstart for paired-reference training instructions.

In this example, we'll be training a LoRA for Qwen Image, a 20B parameter vision-language model. Due to its size, we'll need aggressive memory optimization techniques.

A 24GB GPU is the absolute minimum, and even then you'll need extensive quantization and careful configuration. 40GB+ is strongly recommended for a smoother experience.

When training on a 24GB card, validations will run out of memory unless you lower the validation resolution or quantize more aggressively than int8.

Hardware requirements¶

Qwen Image is a 20B parameter model with a sophisticated text encoder that alone consumes ~16GB VRAM before quantization. The model uses a custom VAE with 16 latent channels.
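To put the 20B figure in perspective, here is a rough weight-only estimate at several bit widths. It is a back-of-the-envelope sketch, not a measurement: it ignores the text encoder, activations, LoRA weights, optimizer state, and quantization metadata, and the `weight_gib` helper is hypothetical.

```python
# Rough weight-only memory estimate for the 20B-parameter model.
# Ignores activations, LoRA weights, optimizer state, and the extra
# scale/zero-point tensors that quantized formats carry.
PARAMS = 20e9  # "20B parameter" figure quoted in this guide

BITS_PER_WEIGHT = {"bf16": 16, "int8": 8, "int4": 4, "int2": 2}


def weight_gib(num_params: float, bits: int) -> float:
    """Size of num_params weights stored at `bits` bits each, in GiB."""
    return num_params * bits / 8 / 2**30


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:>5}: ~{weight_gib(PARAMS, bits):.1f} GiB")
```

Real usage will be higher than these numbers, which is why the quantization level matters so much on 24GB cards.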

Important limitations:
- Not supported on AMD ROCm or macOS due to the lack of an efficient flash attention implementation
- Batch size > 1 misbehaved on older builds; current SimpleTuner Qwen overrides support it, and gradient accumulation remains the low-VRAM alternative
- TREAD (Token Routing for Efficient Architecture-agnostic Diffusion training) is not yet supported

Prerequisites¶

Make sure that you have Python installed; SimpleTuner works well with versions 3.10 through 3.13.

You can check this by running:

python --version
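The same check can be done from inside the interpreter. This is a minimal sketch; `is_supported` is a hypothetical helper that encodes only the 3.10 through 3.13 range stated above.

```python
import sys

# Range quoted in this guide: 3.10 through 3.13, inclusive.
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 13)


def is_supported(version: tuple[int, int]) -> bool:
    """True when (major, minor) falls inside the documented range."""
    return MIN_VERSION <= version <= MAX_VERSION


current = (sys.version_info.major, sys.version_info.minor)
status = "OK" if is_supported(current) else "outside the documented range"
print(f"Python {current[0]}.{current[1]}: {status}")
```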

If you don't have Python 3.13 installed on Ubuntu, you can try the following:

apt -y install python3.13 python3.13-venv

Container image dependencies¶

For Vast, RunPod, and TensorDock (among others), the following will work on a CUDA 12.2-12.8 image to enable compiling of CUDA extensions:

apt -y install nvidia-cuda-toolkit

Installation¶

Install SimpleTuner via pip:

pip install 'simpletuner[cuda]'

# CUDA 13 / Blackwell users (NVIDIA B-series GPUs)
pip install 'simpletuner[cuda13]' --extra-index-url https://download.pytorch.org/whl/cu130

For manual installation or development setup, see the installation documentation.

Setting up the environment¶

To run SimpleTuner, you will need to set up a configuration file, the dataset and model directories, and a dataloader configuration file.

Configuration file¶

An experimental script, configure.py, may allow you to entirely skip this section through an interactive step-by-step configuration. It contains some safety features that help avoid common pitfalls.

Note: This doesn't configure your dataloader. You will still have to do that manually, later.

To run it:

simpletuner configure

⚠️ For users located in countries where Hugging Face Hub is not readily accessible, you should add HF_ENDPOINT=https://hf-mirror.com to your ~/.bashrc or ~/.zshrc depending on which $SHELL your system uses.

If you prefer to manually configure:

Copy config/config.json.example to config/config.json:

cp config/config.json.example config/config.json

There, you will likely need to modify the following variables:

model_type - Set this to lora.
lora_type - Set this to standard for PEFT LoRA or lycoris for LoKr.
model_family - Set this to qwen_image.
model_flavour - Set this to v1.0.
output_dir - Set this to the directory where you want to store your checkpoints and validation images. It's recommended to use a full path here.
train_batch_size - Set this based on your available VRAM. Current SimpleTuner Qwen overrides support batch sizes greater than 1.
gradient_accumulation_steps - Set this to 2-8 if you want a larger effective batch without increasing per-step VRAM.
validation_resolution - Set this to 1024x1024 or lower, depending on your memory constraints.
24GB cards cannot currently handle 1024x1024 validations; reduce the size on those cards.
Multiple resolutions may be specified, separated by commas: 1024x1024,768x768,512x512
validation_guidance - Use a value around 3.0-4.0 for good results.
validation_num_inference_steps - Use somewhere around 30.
use_ema - Setting this to true will help obtain smoother results but uses more memory.
optimizer - Use optimi-lion for good results, or adamw-bf16 if you have memory to spare.
mixed_precision - Must be set to bf16 for Qwen Image.
gradient_checkpointing - Required to be enabled (true) for reasonable memory usage.
base_model_precision - Strongly recommended to set to int2-quanto or nf4-bnb for 24GB cards; int8-quanto generally needs 40GB+.
quantize_via - Set to cpu to avoid OOM during quantization on smaller GPUs.
quantize_activations - Keep this false to maintain training quality.

Memory optimization settings for 24GB GPUs:
- lora_rank - Use 8 or lower.
- lora_alpha - Match this to your lora_rank value.
- flow_schedule_shift - Set to 1.73 (or experiment between 1.0-3.0).
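The comma-separated `validation_resolution` format described above can be illustrated with a small parser. This is a sketch for reasoning about the value, not SimpleTuner's own parsing code; `parse_validation_resolution` is a hypothetical helper and only handles the `WIDTHxHEIGHT` form shown in this guide.

```python
def parse_validation_resolution(value: str) -> list[tuple[int, int]]:
    """Split '1024x1024,768x768' into [(1024, 1024), (768, 768)]."""
    resolutions = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma
        width, height = item.lower().split("x")
        resolutions.append((int(width), int(height)))
    return resolutions


for width, height in parse_validation_resolution("1024x1024,768x768,512x512"):
    # Pixel count is a rough proxy for how much memory a validation needs.
    print(f"{width}x{height}: {width * height / 1e6:.2f} megapixels")
```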

Your config.json will look something like this for a minimal setup:

{
    "model_type": "lora",
    "model_family": "qwen_image",
    "model_flavour": "v1.0",
    "lora_type": "standard",
    "lora_rank": 8,
    "lora_alpha": 8,
    "output_dir": "output/models-qwen_image",
    "train_batch_size": 1,
    "gradient_accumulation_steps": 4,
    "validation_resolution": "1024x1024",
    "validation_guidance": 4.0,
    "validation_num_inference_steps": 30,
    "validation_seed": 42,
    "validation_prompt": "A photo-realistic image of a cat",
    "validation_step_interval": 100,
    "vae_batch_size": 1,
    "seed": 42,
    "resume_from_checkpoint": "latest",
    "resolution": 1024,
    "resolution_type": "pixel_area",
    "report_to": "tensorboard",
    "optimizer": "optimi-lion",
    "num_train_epochs": 0,
    "num_eval_images": 1,
    "mixed_precision": "bf16",
    "minimum_image_size": 0,
    "max_train_steps": 1000,
    "max_grad_norm": 0.01,
    "lr_warmup_steps": 100,
    "lr_scheduler": "constant_with_warmup",
    "learning_rate": "1e-4",
    "gradient_checkpointing": "true",
    "base_model_precision": "int2-quanto",
    "quantize_via": "cpu",
    "quantize_activations": false,
    "flow_schedule_shift": 1.73,
    "disable_benchmark": false,
    "data_backend_config": "config/qwen_image/multidatabackend.json",
    "checkpoints_total_limit": 5,
    "checkpoint_step_interval": 500,
    "caption_dropout_probability": 0.0,
    "aspect_bucket_rounding": 2
}
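Before launching a run, it can help to sanity-check a few of the settings this guide marks as required. The following is a minimal sketch: `check_config` is a hypothetical helper, not part of SimpleTuner, and it covers only a handful of the keys discussed above.

```python
def _is_true(value) -> bool:
    """Accept both JSON booleans and the string form used in the example."""
    return str(value).strip().lower() == "true"


def check_config(config: dict) -> list[str]:
    """Return human-readable warnings for settings this guide calls out."""
    warnings = []
    if config.get("model_family") != "qwen_image":
        warnings.append("model_family should be 'qwen_image'")
    if config.get("mixed_precision") != "bf16":
        warnings.append("mixed_precision must be 'bf16' for Qwen Image")
    if not _is_true(config.get("gradient_checkpointing", False)):
        warnings.append("gradient_checkpointing should be enabled")
    if _is_true(config.get("quantize_activations", False)):
        warnings.append("quantize_activations should stay false")
    return warnings


# A trimmed copy of the example config; in practice you would load
# config/config.json with json.load() and pass the resulting dict.
example = {
    "model_family": "qwen_image",
    "mixed_precision": "bf16",
    "gradient_checkpointing": "true",
    "quantize_activations": False,
}
print(check_config(example))
```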

ℹ️ Multi-GPU users can reference this document for information on configuring the number of GPUs to use.

⚠️ Critical for 24GB GPUs: The text encoder alone uses ~16GB VRAM. With int2-quanto or nf4-bnb quantization, this can be reduced significantly.

For a quick sanity check with a known working configuration:

Option 1 (Recommended - pip install):

pip install 'simpletuner[cuda]'

# CUDA 13 / Blackwell users (NVIDIA B-series GPUs)
pip install 'simpletuner[cuda13]' --extra-index-url https://download.pytorch.org/whl/cu130

simpletuner train example=qwen_image.peft-lora

Option 2 (Git clone method):

simpletuner train env=examples/qwen_image.peft-lora

Option 3 (Legacy method - still works):

ENV=examples/qwen_image.peft-lora ./train.sh

Advanced Experimental Features¶

SimpleTuner includes experimental features that can significantly improve training stability and performance.

* **[Scheduled Sampling (Rollout)](../experimental/SCHEDULED_SAMPLING.md):** reduces exposure bias and improves output quality by letting the model generate its own inputs during training.

> ⚠️ These features increase the computational overhead of training.

#### Validation prompts

Inside `config/config.json` is the "primary validation prompt", which is typically the main instance_prompt you are training on for your single subject or style. Additionally, a JSON file may be created that contains extra prompts to run through during validations.

The example config file `config/user_prompt_library.json.example` contains the following format:

{
  "nickname": "the prompt goes here",
  "another_nickname": "another prompt goes here"
}

The nicknames are used as the filenames for the validation images, so keep them short and compatible with your filesystem. To point the trainer to this prompt library, add it to your config.json:

  "validation_prompt_library": "config/user_prompt_library.json",

A set of diverse prompts will help determine whether the model is learning properly:

{
    "anime_style": "a breathtaking anime-style portrait with vibrant colors and expressive features",
    "chef_cooking": "a high-quality, detailed photograph of a sous-chef immersed in culinary creation",
    "portrait": "a lifelike and intimate portrait showcasing unique personality and charm",
    "cinematic": "a cinematic, visually stunning photo with dramatic and captivating presence",
    "elegant": "an elegant and timeless portrait exuding grace and sophistication",
    "adventurous": "a dynamic and adventurous photo captured in an exciting moment",
    "mysterious": "a mysterious and enigmatic portrait shrouded in shadows and intrigue",
    "vintage": "a vintage-style portrait evoking the charm and nostalgia of a bygone era",
    "artistic": "an artistic and abstract representation blending creativity with visual storytelling",
    "futuristic": "a futuristic and cutting-edge portrayal set against advanced technology"
}
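Because the nicknames end up in filenames, it can be worth checking them before a long run. This is a minimal sketch under a deliberately conservative assumption (letters, digits, underscores, and hyphens only); `unsafe_nicknames` is a hypothetical helper, and SimpleTuner itself may accept more than this pattern allows.

```python
import json
import re

# Conservative assumption: letters, digits, underscore, and hyphen only.
SAFE_NICKNAME = re.compile(r"[A-Za-z0-9_-]{1,64}")


def unsafe_nicknames(library: dict[str, str]) -> list[str]:
    """Return the nicknames that fall outside the conservative pattern."""
    return [name for name in library if not SAFE_NICKNAME.fullmatch(name)]


library = {
    "anime_style": "a breathtaking anime-style portrait",
    "chef cooking/2": "a detailed photograph of a sous-chef",
}
print("needs renaming:", unsafe_nicknames(library))

# Once the list is empty, the dict can be written out as the prompt library.
print(json.dumps(library, indent=2))
```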

#### CLIP score tracking

If you wish to enable evaluations to score the model's performance, see [this document](../evaluation/CLIP_SCORES.md) for information on configuring and interpreting CLIP scores.

#### Stable evaluation loss

If you wish to use stable MSE loss to score the model's performance, see [this document](../evaluation/EVAL_LOSS.md) for information on configuring and interpreting evaluation loss.

#### Validation previews

SimpleTuner supports streaming intermediate validation previews during generation using Tiny AutoEncoder models. This allows you to see validation images being generated step-by-step in real-time via webhook callbacks.

To enable:

{
  "validation_preview": true,
  "validation_preview_steps": 1
}
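To see how the preview interval thins out callbacks, here is a small sketch. It assumes a preview is sent whenever the 1-based step number is a multiple of `validation_preview_steps`; `preview_steps` is a hypothetical helper, not a SimpleTuner function.

```python
def preview_steps(num_inference_steps: int, preview_every: int) -> list[int]:
    """1-based step numbers at which a preview would be emitted."""
    if preview_every < 1:
        raise ValueError("preview_every must be at least 1")
    return list(range(preview_every, num_inference_steps + 1, preview_every))


print(preview_steps(20, 5))
print(len(preview_steps(30, 1)), "previews when every step is decoded")
```

A larger interval means fewer Tiny AutoEncoder decodes per validation image.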

**Requirements:**

- Webhook configuration
- Validation enabled

Set `validation_preview_steps` to a higher value (e.g., 3 or 5) to reduce Tiny AutoEncoder overhead. With `validation_num_inference_steps=20` and `validation_preview_steps=5`, you'll receive preview images at steps 5, 10, 15, and 20.

#### Flow schedule shifting

Qwen Image, as a flow-matching model, supports timestep schedule shifting to control which parts of the generation process are trained. The `flow_schedule_shift` parameter controls this:

- Lower values (0.1-1.0): Focus on fine details
- Medium values (1.0-3.0): Balanced training (recommended)
- Higher values (3.0-6.0): Focus on large compositional features

##### Auto-shift

You can enable resolution-dependent timestep shift with `--flow_schedule_auto_shift`, which uses higher shift values for larger images and lower shift values for smaller images. This can provide stable but potentially mediocre training results.

##### Manual specification

A `--flow_schedule_shift` value of 1.73 is recommended as a starting point for Qwen Image, though you may need to experiment based on your dataset and goals.

#### Dataset considerations

It's crucial to have a substantial dataset to train your model on. There are limitations on the dataset size, and you will need to ensure that your dataset is large enough to train your model effectively.

> ℹ️ With few enough images, you might see a message **no images detected in dataset** - increasing the `repeats` value will overcome this limitation.

> ⚠️ **Important**: Older builds mishandled `train_batch_size` above 1. Current SimpleTuner Qwen overrides support it, but on memory-constrained cards it is still simplest to keep `train_batch_size` at 1 and use `gradient_accumulation_steps` to simulate larger batch sizes.

Create a `--data_backend_config` document (`config/multidatabackend.json` here; the path must match the `data_backend_config` value in your config.json) containing this:

[
  {
    "id": "pseudo-camera-10k-qwen",
    "type": "local",
    "crop": true,
    "crop_aspect": "square",
    "crop_style": "center",
    "resolution": 1024,
    "minimum_image_size": 512,
    "maximum_image_size": 1024,
    "target_downsample_size": 1024,
    "resolution_type": "pixel_area",
    "cache_dir_vae": "cache/vae/qwen_image/pseudo-camera-10k",
    "instance_data_dir": "datasets/pseudo-camera-10k",
    "disabled": false,
    "skip_file_discovery": "",
    "caption_strategy": "filename",
    "metadata_backend": "discovery",
    "repeats": 0,
    "is_regularisation_data": true
  },
  {
    "id": "dreambooth-subject",
    "type": "local",
    "crop": false,
    "resolution": 1024,
    "minimum_image_size": 512,
    "maximum_image_size": 1024,
    "target_downsample_size": 1024,
    "resolution_type": "pixel_area",
    "cache_dir_vae": "cache/vae/qwen_image/dreambooth-subject",
    "instance_data_dir": "datasets/dreambooth-subject",
    "caption_strategy": "instanceprompt",
    "instance_prompt": "the name of your subject goes here",
    "metadata_backend": "discovery",
    "repeats": 1000
  },
  {
    "id": "text-embeds",
    "type": "local",
    "dataset_type": "text_embeds",
    "default": true,
    "cache_dir": "cache/text/qwen_image",
    "disabled": false,
    "write_batch_size": 16
  }
]
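A quick structural check of this file can catch typos before caching starts. The sketch below is an assumption-laden illustration: `check_backends` is a hypothetical helper, and its rules (unique ids, one default text-embed backend, an `instance_prompt` for the `instanceprompt` strategy) are inferred from the example above, not taken from SimpleTuner's validation code.

```python
def check_backends(backends: list[dict]) -> list[str]:
    """Return a list of problems found in a dataloader config."""
    problems = []
    ids = [backend.get("id") for backend in backends]
    if len(ids) != len(set(ids)):
        problems.append("backend ids must be unique")
    text_embeds = [b for b in backends if b.get("dataset_type") == "text_embeds"]
    if not any(b.get("default") for b in text_embeds):
        problems.append("no default text_embeds backend found")
    for backend in backends:
        if backend.get("dataset_type") == "text_embeds" or backend.get("disabled"):
            continue
        needs_prompt = backend.get("caption_strategy") == "instanceprompt"
        if needs_prompt and not backend.get("instance_prompt"):
            problems.append(f"{backend.get('id')}: instance_prompt is missing")
    return problems


# Trimmed version of the example; load the real file with json.load() instead.
backends = [
    {"id": "dreambooth-subject", "type": "local",
     "caption_strategy": "instanceprompt", "instance_prompt": "my subject"},
    {"id": "text-embeds", "type": "local",
     "dataset_type": "text_embeds", "default": True},
]
print(check_backends(backends))
```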

> ℹ️ Use `caption_strategy=textfile` if you have `.txt` files containing captions.

> See caption_strategy options and requirements in [DATALOADER.md](../DATALOADER.md#caption_strategy).

> ℹ️ Note the reduced `write_batch_size` for text embeds to avoid OOM issues.

Then, create a `datasets` directory:

mkdir -p datasets
pushd datasets
    huggingface-cli download --repo-type=dataset bghira/pseudo-camera-10k --local-dir=pseudo-camera-10k
    mkdir dreambooth-subject
    # place your images into dreambooth-subject/ now
popd

This will download about 10k photograph samples to your `datasets/pseudo-camera-10k` directory, which will be automatically created for you.

Your Dreambooth images should go into the `datasets/dreambooth-subject` directory.

#### Log in to WandB and Hugging Face Hub

You'll want to log in to WandB and HF Hub before beginning training, especially if you're using `--push_to_hub` and `--report_to=wandb`.

If you're going to be pushing items to a Git LFS repository manually, you should also run `git config --global credential.helper store`.

Run the following commands:

wandb login

and

huggingface-cli login

Follow the instructions to log in to both services.

Executing the training run¶

From the SimpleTuner directory, one simply has to run:

./train.sh

This will begin the text embed and VAE output caching to disk.

For more information, see the dataloader and tutorial documents.

Memory optimization tips¶

Lowest VRAM config (24GB minimum)¶

The lowest VRAM Qwen Image configuration requires approximately 24GB:

OS: Ubuntu Linux 24
GPU: A single NVIDIA CUDA device (24GB minimum)
System memory: 64GB+ recommended
Base model precision:
For NVIDIA systems: int2-quanto or nf4-bnb (required for 24GB cards)
int4-quanto can work but may have lower quality
Optimizer: optimi-lion or bnb-lion8bit-paged for memory efficiency
Resolution: Start with 512px or 768px, work up to 1024px if memory allows
Batch size: 1 (keeps per-step VRAM at its lowest; larger batches need more headroom)
Gradient accumulation steps: 2-8 to simulate larger batches
Enable --gradient_checkpointing (required)
Use --quantize_via=cpu to avoid OOM during startup
Use a small LoRA rank (1-8)
Setting the environment variable PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True helps minimize VRAM usage

NOTE: Pre-caching of VAE embeds and text encoder outputs will use significant memory. Enable offload_during_startup=true if you encounter OOM issues.
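If you drive PyTorch from your own Python script (for example the inference script later in this guide), the `PYTORCH_CUDA_ALLOC_CONF` setting listed above can be applied in code. A minimal sketch; the variable has to be set before the first CUDA allocation, so the simplest place is before `torch` is imported.

```python
import os

# setdefault keeps any value already exported in the shell.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

# Import torch only after this point so the allocator picks the setting up.
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```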

Running inference on the LoRA afterward¶

Since Qwen Image is a newer model, here's a functioning example for inference:

import torch
from diffusers import QwenImagePipeline

model_id = 'Qwen/Qwen-Image'
adapter_id = 'your-username/your-lora-name'

# Load the pipeline
pipeline = QwenImagePipeline.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16
)

# Load LoRA weights
pipeline.load_lora_weights(adapter_id)

# Optional: quantize the model to save VRAM
from optimum.quanto import quantize, freeze, qint8
quantize(pipeline.transformer, weights=qint8)
freeze(pipeline.transformer)

# Move to device
device = 'cuda' if torch.cuda.is_available() else 'cpu'
pipeline.to(device)

# Generate an image
prompt = "Your test prompt here"
negative_prompt = 'ugly, cropped, blurry, low-quality, mediocre average'

image = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=30,
    # QwenImagePipeline applies classifier-free guidance through true_cfg_scale
    true_cfg_scale=4.0,
    # Create the generator on the same device the pipeline runs on
    generator=torch.Generator(device=device).manual_seed(42),
    width=1024,
    height=1024,
).images[0]

image.save("output.png", format="PNG")

Notes & troubleshooting tips¶

Batch size limitations¶

Older diffusers Qwen builds had batch-size > 1 issues caused by text embed padding and attention-mask handling. Current SimpleTuner Qwen overrides patch both paths, so larger batches work if your VRAM allows them.
- Increase train_batch_size only after confirming your memory headroom.
- If you still see artifacts on an older install, update and regenerate any stale text embeds.
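Whichever route you take, what the optimizer sees is the effective batch size. The sketch below uses the usual data-parallel arithmetic; `effective_batch_size` is a hypothetical helper and the GPU multiplier assumes one process per GPU.

```python
def effective_batch_size(
    train_batch_size: int,
    gradient_accumulation_steps: int,
    num_gpus: int = 1,
) -> int:
    """Samples contributing to each optimizer step."""
    return train_batch_size * gradient_accumulation_steps * num_gpus


# Batch size 1 with 4 accumulation steps matches batch size 4 with none,
# but only holds one sample's activations in VRAM at a time.
print(effective_batch_size(1, 4))
print(effective_batch_size(4, 1))
```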

Quantization¶

int2-quanto provides the most aggressive memory savings but may impact quality
nf4-bnb offers a good balance between memory and quality
int4-quanto is a middle ground option
Avoid int8 unless you have 40GB+ VRAM

Learning rates¶

For LoRA training:
- Small LoRAs (rank 1-8): Use learning rates around 1e-4
- Larger LoRAs (rank 16-32): Use learning rates around 5e-5
- With Prodigy optimizer: Start with 1.0 and let it adapt
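The guidance above can be written down as a small lookup. This is only a sketch of the starting points listed here: `suggested_learning_rate` is a hypothetical helper, and it returns `None` for ranks the guide does not cover (for example 9 through 15) instead of guessing.

```python
from typing import Optional


def suggested_learning_rate(lora_rank: int, optimizer: str = "optimi-lion") -> Optional[float]:
    """Starting learning rate from this guide, or None when it gives no advice."""
    if optimizer.lower().startswith("prodigy"):
        return 1.0  # Prodigy adapts the step size itself
    if 1 <= lora_rank <= 8:
        return 1e-4
    if 16 <= lora_rank <= 32:
        return 5e-5
    return None


for rank in (8, 12, 16):
    print(rank, suggested_learning_rate(rank))
```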

Image artifacts¶

If you encounter artifacts:
- Lower your learning rate
- Increase gradient accumulation steps
- Ensure your images are high quality and properly preprocessed
- Consider using lower resolutions initially

Multiple-resolution training¶

Start training at lower resolutions (512px or 768px) to speed up initial learning, then fine-tune at 1024px. Enable --flow_schedule_auto_shift when training at different resolutions.
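To build intuition for what a shift value does, the sketch below applies the static shift formula used by diffusers' `FlowMatchEulerDiscreteScheduler`. Whether SimpleTuner applies exactly this formula during training is an assumption this guide does not confirm, and `shift_sigma` is a hypothetical helper.

```python
def shift_sigma(sigma: float, shift: float) -> float:
    """Map a noise level in [0, 1] through a static flow-matching shift."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


# shift > 1 pushes noise levels toward 1 (noisier, more compositional);
# shift < 1 pushes them toward 0 (cleaner, finer detail).
for shift in (0.5, 1.0, 1.73, 3.0):
    shifted = [round(shift_sigma(step / 4, shift), 3) for step in range(5)]
    print(f"shift={shift}: {shifted}")
```

Under this formula a shift of 1.0 leaves the schedule untouched, which matches the guide's description of values around 1.0-3.0 as the balanced range.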

Platform limitations¶

Not supported on:
- AMD ROCm (lacks efficient flash attention implementation)
- Apple Silicon/macOS (memory and attention limitations)
- Consumer GPUs with less than 24GB VRAM

Current known issues¶

Batch size > 1 can misbehave on older installs (update SimpleTuner, or use gradient accumulation)
TREAD is not yet supported
High memory usage from text encoder (~16GB before quantization)
Sequence length handling issues (upstream issue)

For additional help and troubleshooting, consult the SimpleTuner documentation or join the community Discord.