
Stable Diffusion XL Quickstart

In this example, we'll train a Stable Diffusion XL model with the SimpleTuner toolkit, using the lora model type.

Compared to modern, larger models, SDXL is quite modest in size, so full training may be possible, but it requires more VRAM than LoRA training along with other hyperparameter adjustments.

Prerequisites

Make sure that you have Python installed; SimpleTuner works well with 3.10 through 3.13 (AMD ROCm machines require 3.12 or 3.13).

You can check this by running:

python --version

If you don't have Python 3.13 installed on Ubuntu, you can try the following:

apt -y install python3.13 python3.13-venv

Container image dependencies

For Vast, RunPod, and TensorDock (among others), the following will work on a CUDA 12.2-12.8 image to enable compilation of CUDA extensions:

apt -y install nvidia-cuda-toolkit

Installation

Install SimpleTuner via pip:

pip install 'simpletuner[cuda]'

# CUDA 13 / Blackwell users (NVIDIA B-series GPUs)
pip install 'simpletuner[cuda13]' --extra-index-url https://download.pytorch.org/whl/cu130
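
If you'd like to keep SimpleTuner isolated from your system Python, you can install it into a virtual environment first. This is a minimal sketch assuming the python3.13-venv package from the prerequisites above; swap in whichever Python version you actually installed:

python3.13 -m venv .venv
source .venv/bin/activate
pip install 'simpletuner[cuda]'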

For manual installation or development setup, see the installation documentation.

Setting up the environment

To run SimpleTuner, you will need to set up a configuration file, the dataset and model directories, and a dataloader configuration file.

Configuration file

An experimental script, configure.py, may allow you to entirely skip this section through an interactive step-by-step configuration. It contains some safety features that help avoid common pitfalls.

Note: This doesn't fully configure your dataloader. You will still have to do that manually, later.

To run it:

simpletuner configure

⚠️ If you are located in a country where Hugging Face Hub is not readily accessible, add HF_ENDPOINT=https://hf-mirror.com to your ~/.bashrc or ~/.zshrc, depending on which $SHELL your system uses.
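
For example, on a bash shell you could append the variable and reload your configuration like so:

echo 'export HF_ENDPOINT=https://hf-mirror.com' >> ~/.bashrc
source ~/.bashrc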

If you prefer to manually configure:

Copy config/config.json.example to config/config.json:

cp config/config.json.example config/config.json

AMD ROCm follow-up steps

The following must be executed for an AMD MI300X to be usable:

apt install amd-smi-lib
pushd /opt/rocm/share/amd_smi
python3 -m pip install --upgrade pip
python3 -m pip install .
popd
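
If the install succeeded, the AMD SMI tooling should be able to enumerate your accelerators. As a quick sanity check (assuming the amd-smi CLI was placed on your PATH by the steps above):

amd-smi list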

Inside config/config.json, you will need to modify the following variables:

{
  "model_type": "lora",
  "model_family": "sdxl",
  "model_flavour": "base-1.0",
  "output_dir": "/home/user/output/models",
  "validation_resolution": "1024x1024,1280x768",
  "validation_guidance": 3.4,
  "use_gradient_checkpointing": true,
  "learning_rate": 1e-4
}
  • model_family - Set this to sdxl.
  • model_flavour - Set this to base-1.0, or use pretrained_model_name_or_path to point to a different model.
  • model_type - Set this to lora.
  • use_dora - Set this to true if you wish to train DoRA.
  • output_dir - Set this to the directory where you want to store your checkpoints and validation images. It's recommended to use a full path here.
  • validation_resolution - Set this to 1024x1024 for this example.
  • Additionally, Stable Diffusion XL was fine-tuned on multi-aspect buckets, and other resolutions may be specified using commas to separate them: 1024x1024,1280x768
  • validation_guidance - Use whatever value you are comfortable with for testing at inference time; somewhere between 4.2 and 6.4 works well.
  • use_gradient_checkpointing - This should probably be true unless you have a LOT of VRAM and want to sacrifice some to make it go faster.
  • learning_rate - 1e-4 is fairly common for low-rank networks, though 1e-5 might be a more conservative choice if you notice any "burning" or early overtraining.

There are a few more changes to make if you're using a Mac M-series machine; a sample config fragment follows the list:

  • mixed_precision should be set to no.
  • This was required as of PyTorch 2.4, but bf16 may now be usable on PyTorch 2.6 and newer.
  • attention_mechanism can be set to xformers to make use of it, though it is largely obsolete at this point.
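
As a reference, a minimal config.json fragment reflecting these Apple Silicon settings might look like the following; attention_mechanism is included only if you choose to enable xformers:

{
  "mixed_precision": "no",
  "attention_mechanism": "xformers"
}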

Quantised model training

Hugging Face Optimum-Quanto, tested on Apple and NVIDIA systems, can be used to reduce the precision and VRAM requirements of the UNet. However, it doesn't work as well for SDXL as it does for Diffusion Transformer models like SD3 or Flux, so it is not generally recommended.

If you're under tight resource constraints, however, you can still make use of it.

For config.json:

{
  "base_model_precision": "int8-quanto",
  "text_encoder_1_precision": "no_change",
  "text_encoder_2_precision": "no_change",
  "optimizer": "optimi-lion"
}

Advanced Experimental Features

SimpleTuner includes experimental features that can significantly improve training stability and performance, particularly for smaller datasets or older architectures like SDXL.

  • Scheduled Sampling (Rollout) (see ../experimental/SCHEDULED_SAMPLING.md): reduces exposure bias and improves output quality by letting the model generate its own inputs during training.
  • Diff2Flow (see ../experimental/DIFF2FLOW.md): allows training SDXL with a Flow Matching objective, potentially improving generation straightness and quality.

⚠️ These features increase the computational overhead of training.

Dataset considerations

It's crucial to have a substantial dataset to train your model on. Note that the bare minimum dataset size is TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS; for example, with a train batch size of 4 and 2 gradient accumulation steps, you need at least 8 images. If the dataset is smaller than this, it will not be discoverable by the trainer.

💡 Tip: For large datasets where disk space is a concern, you can use --vae_cache_disable to perform online VAE encoding without caching the results to disk. This is implicitly enabled if you use --vae_cache_ondemand, but adding --vae_cache_disable ensures nothing is written to disk.

Depending on the dataset you have, you will need to set up your dataset directory and dataloader configuration file differently. In this example, we will be using pseudo-camera-10k as the dataset.

In your OUTPUT_DIR directory, create a multidatabackend.json:

[
  {
    "id": "pseudo-camera-10k-sdxl",
    "type": "local",
    "crop": true,
    "crop_aspect": "square",
    "crop_style": "random",
    "resolution": 1.0,
    "minimum_image_size": 0.25,
    "maximum_image_size": 1.0,
    "target_downsample_size": 1.0,
    "resolution_type": "area",
    "cache_dir_vae": "cache/vae/sdxl/pseudo-camera-10k",
    "instance_data_dir": "/home/user/simpletuner/datasets/pseudo-camera-10k",
    "disabled": false,
    "skip_file_discovery": "",
    "caption_strategy": "filename",
    "metadata_backend": "discovery"
  },
  {
    "id": "text-embeds",
    "type": "local",
    "dataset_type": "text_embeds",
    "default": true,
    "cache_dir": "cache/text/sdxl/pseudo-camera-10k",
    "disabled": false,
    "write_batch_size": 128
  }
]

See caption_strategy options and requirements in DATALOADER.md.

Then, create a datasets directory:

mkdir -p datasets
huggingface-cli download --repo-type=dataset bghira/pseudo-camera-10k --local-dir=datasets/pseudo-camera-10k

This will download about 10k photograph samples to your datasets/pseudo-camera-10k directory, which will be automatically created for you.

Log in to WandB and Hugging Face Hub

You'll want to log in to WandB and HF Hub before beginning training, especially if you're using push_to_hub: true and --report_to=wandb.
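
For reference, those two options would typically sit in config.json alongside the other settings from this guide; the snippet below assumes the keys simply mirror the flag names:

{
  "push_to_hub": true,
  "report_to": "wandb"
}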

If you're going to be pushing items to a Git LFS repository manually, you should also run git config --global credential.helper store

Run the following commands:

wandb login

and

huggingface-cli login

Follow the instructions to log in to both services.

Executing the training run

From the SimpleTuner directory, simply run:

bash train.sh

This will begin the text embed and VAE output caching to disk.

For more information, see the dataloader and tutorial documents.

CLIP score tracking

If you wish to enable evaluations to score the model's performance, see this document for information on configuring and interpreting CLIP scores.

Stable evaluation loss

If you wish to use stable MSE loss to score the model's performance, see this document for information on configuring and interpreting evaluation loss.

Validation previews

SimpleTuner supports streaming intermediate validation previews during generation using Tiny AutoEncoder models. This allows you to see validation images being generated step-by-step in real-time via webhook callbacks.

To enable:

{
  "validation_preview": true,
  "validation_preview_steps": 1
}

Requirements:
  • Webhook configuration
  • Validation enabled

Set validation_preview_steps to a higher value (e.g., 3 or 5) to reduce Tiny AutoEncoder overhead. With validation_num_inference_steps=20 and validation_preview_steps=5, you'll receive preview images at steps 5, 10, 15, and 20.
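
Putting that together, the relevant portion of config.json might look like this (using the validation_num_inference_steps value described above):

{
  "validation_preview": true,
  "validation_preview_steps": 5,
  "validation_num_inference_steps": 20
}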