Distributed Training (Multi-node)

This document contains notes* on configuring a 4-way 8xH100 cluster for use with SimpleTuner.

*This guide does not contain full end-to-end installation instructions. Instead, these notes serve as considerations to keep in mind when following the INSTALL document or one of the quickstart guides.

Storage backend

By default, multi-node training requires shared storage between nodes for the output_dir.

Ubuntu NFS example

Just a basic storage example that will get you started.

On the 'master' node that will write the checkpoints

1. Install NFS Server Packages

sudo apt update
sudo apt install nfs-kernel-server

2. Configure the NFS Export

Edit the NFS exports file to share the directory:

sudo nano /etc/exports

Add the following line at the end of the file (replace slave_ip with the actual IP address of your slave machine):

/home/ubuntu/simpletuner/output slave_ip(rw,sync,no_subtree_check)

If you want to allow multiple slaves or an entire subnet, you can use:

/home/ubuntu/simpletuner/output subnet_ip/24(rw,sync,no_subtree_check)

3. Export the Shared Directory

sudo exportfs -a

4. Restart the NFS Server

sudo systemctl restart nfs-kernel-server

5. Verify NFS Server Status

sudo systemctl status nfs-kernel-server
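
You can also list the active exports and their options to confirm the directory is being shared as intended:

sudo exportfs -v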

On the slave nodes that write optimiser state and other data to the shared directory

1. Install NFS Client Packages

sudo apt update
sudo apt install nfs-common
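
With the client tools installed, you can optionally confirm that the master's export is visible from this node before mounting (replace master_ip with the master's IP address):

showmount -e master_ip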

2. Create the Mount Point Directory

Ensure the directory exists (it should already exist based on your setup):

sudo mkdir -p /home/ubuntu/simpletuner/output

Note: If the directory contains data, back it up, as mounting will hide existing contents.

3. Mount the NFS Share

Mount the master's shared directory to the slave's local directory (replace master_ip with the master's IP address):

sudo mount master_ip:/home/ubuntu/simpletuner/output /home/ubuntu/simpletuner/output

4. Verify the Mount

Check that the mount is successful:

mount | grep /home/ubuntu/simpletuner/output

5. Test Write Access

Create a test file to ensure you have write permissions:

touch /home/ubuntu/simpletuner/output/test_file_from_slave.txt

Then, check on the master machine if the file appears in /home/ubuntu/simpletuner/output.
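
For example, on the master:

ls -l /home/ubuntu/simpletuner/output/test_file_from_slave.txt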

6. Make the Mount Persistent

To ensure the mount persists across reboots, add it to the /etc/fstab file:

sudo nano /etc/fstab

Add the following line at the end:

master_ip:/home/ubuntu/simpletuner/output /home/ubuntu/simpletuner/output nfs defaults 0 0
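
You can apply the new fstab entry without rebooting and confirm the share is mounted:

sudo mount -a
df -h /home/ubuntu/simpletuner/output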

Additional Considerations:

  • User Permissions: Ensure that the ubuntu user has the same UID and GID on both machines so that file permissions are consistent. You can check UIDs with id ubuntu (see the sketch after this list).

  • Firewall Settings: If you have a firewall enabled, make sure to allow NFS traffic. On the master machine:

sudo ufw allow from slave_ip to any port nfs

  • Synchronize Clocks: It's good practice to have both systems' clocks synchronized, especially in distributed setups. Use ntp or systemd-timesyncd.

  • Testing DeepSpeed Checkpoints: Run a small DeepSpeed job to confirm that checkpoints are correctly written to the master's directory.
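
A minimal sketch covering the permissions and clock checks above; it assumes the ubuntu user and a systemd-based Ubuntu install, so adjust for your environment:

# Run on BOTH the master and each slave; the UID and GID values must match across machines.
id ubuntu

# Confirm the system clock reports as synchronised.
timedatectl status | grep -i synchronized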

Dataloader configuration

Very large datasets can be a challenge to manage efficiently. SimpleTuner will automatically shard datasets across the nodes and distribute pre-processing across every available GPU in the cluster, while using asynchronous queues and threads to maintain throughput.

Dataset sizing for multi-GPU training

When training across multiple GPUs or nodes, your dataset must contain enough samples to satisfy the effective batch size:

effective_batch_size = train_batch_size × num_gpus × gradient_accumulation_steps

Example calculations:

Configuration                                  Calculation   Effective Batch Size
1 node, 8 GPUs, batch_size=4, grad_accum=1     4 × 8 × 1     32 samples
2 nodes, 16 GPUs, batch_size=8, grad_accum=2   8 × 16 × 2    256 samples
4 nodes, 32 GPUs, batch_size=8, grad_accum=1   8 × 32 × 1    256 samples

Each aspect ratio bucket in your dataset must contain at least this many samples (accounting for repeats) or training will fail with a detailed error message.

Solutions for small datasets

If your dataset is smaller than the effective batch size:

  1. Reduce batch size - Lower train_batch_size to reduce memory requirements
  2. Reduce GPU count - Train on fewer GPUs (though this slows training)
  3. Increase repeats - Set repeats in your dataloader configuration
  4. Enable automatic oversubscription - Use --allow_dataset_oversubscription to automatically adjust repeats

The --allow_dataset_oversubscription flag (documented in OPTIONS.md) will automatically calculate and apply the minimum required repeats for your configuration, making it ideal for prototyping or small dataset experiments.
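
As a back-of-envelope check before launching a job, you can estimate the effective batch size and the repeats a small dataset would need. This is only a sketch: the values below are placeholders, and it assumes repeats adds that many extra copies of each sample (so repeats=1 doubles the effective dataset size):

# Placeholder values -- substitute your own configuration.
TRAIN_BATCH_SIZE=8
NUM_GPUS=32
GRAD_ACCUM=1
BUCKET_SIZE=200   # samples in your smallest aspect bucket

EFFECTIVE=$((TRAIN_BATCH_SIZE * NUM_GPUS * GRAD_ACCUM))
echo "effective batch size: $EFFECTIVE"

# Minimum repeats so that BUCKET_SIZE * (repeats + 1) >= EFFECTIVE (ceiling division).
REPEATS=$(( (EFFECTIVE + BUCKET_SIZE - 1) / BUCKET_SIZE - 1 ))
echo "minimum repeats: $(( REPEATS > 0 ? REPEATS : 0 ))"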

Slow image scan / discovery

The discovery backend currently restricts aspect bucket data collection to a single node. This can take an extremely long time with very large datasets, as each image has to be read from storage to retrieve its geometry.

To work around this problem, use the parquet metadata_backend, which allows you to preprocess your data in whatever manner is available to you. As outlined in the linked document section, the parquet table contains the filename, width, height, and caption columns to help quickly and efficiently sort the data into its respective buckets.

Storage space

Huge datasets, especially when using the T5-XXL text encoder, will consume enormous quantities of space for the original data, the image embeds, and the text embeds.

Cloud storage

Using a provider such as Cloudflare R2, one can generate extremely large datasets with very low storage fees.

See the dataloader configuration guide for an example of how to configure the aws type in multidatabackend.json

  • Image data can be stored locally or via S3
  • If images are in S3, preprocessing speed is limited by your network bandwidth
  • If images are stored locally, NVMe throughput is not leveraged during training, since training reads the cached embeds rather than the original images
  • Image embeds and text embeds can be separately stored on local or cloud storage
  • Placing embeds on cloud storage reduces the training rate very little, as they are fetched in parallel

Ideally, all images and all embeds are simply maintained in a cloud storage bucket. This greatly reduces the risk of issues during pre-processing and when resuming training.

On-demand VAE encoding

For large datasets where storing cached VAE latents is impractical due to storage constraints or slow shared storage access, you can use --vae_cache_disable. This disables the VAE cache entirely, forcing the VAE to encode images on-the-fly during training.

This increases GPU computation load but significantly reduces storage requirements and network I/O for cached latents.
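
For example, assuming the flag is set as a simple boolean in config.json (the same style as the compression option below):

{
    ...
    "--vae_cache_disable": true,
    ...
}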

Preserving filesystem scan caches

If your dataset is so large that scanning for new images becomes a bottleneck, adding preserve_data_backend_cache=true to each dataloader config entry will prevent the backend from being re-scanned for new images.

Note that you should then use the image_embeds data backend type (more information here) to allow these cache lists to live separately in case your pre-processing job is interrupted. This will prevent the image list from being re-scanned at startup.
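
For example, inside each entry of your multidatabackend.json (all other keys omitted here for brevity):

{
    ...
    "preserve_data_backend_cache": true,
    ...
}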

Data compression

Data compression should be enabled by adding the following to config.json:

{
    ...
    "--compress_disk_cache": true,
    ...
}

This will use inline gzip to reduce the amount of redundant disk space consumed by the rather-large text and image embeds.

Configuring via 🤗 Accelerate

When using accelerate config (/home/user/.cache/huggingface/accelerate/default_config.yaml) to deploy SimpleTuner, these options take priority over the contents of config/config.env.

An example default_config for Accelerate that does not include DeepSpeed:

# this should be updated on EACH node.
machine_rank: 0
# Everything below here is the same on EACH node.
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_config:
  dynamo_backend: NO
enable_cpu_affinity: false
main_process_ip: 10.0.0.100
main_process_port: 8080
main_training_function: main
mixed_precision: bf16
num_machines: 4
num_processes: 32
rdzv_backend: static
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

DeepSpeed

This document doesn't go into as much detail as the dedicated page.

When optimising multi-node training with DeepSpeed, using the lowest possible ZeRO level is essential.

For example, an 80G NVIDIA GPU can successfully train Flux with ZeRO level 1 offload, minimising overhead substantially.

Add the following lines to each node's Accelerate config:

# Update this from MULTI_GPU to DEEPSPEED
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  gradient_clipping: 0.01
  zero3_init_flag: false
  zero_stage: 1

torch compile optimisation

For extra performance (at the risk of compatibility issues), you can enable torch compile by adding the following lines to each node's YAML config:

dynamo_config:
  # Update this from NO to INDUCTOR
  dynamo_backend: INDUCTOR
  dynamo_mode: max-autotune
  dynamo_use_dynamic: false
  dynamo_use_fullgraph: false

Expected performance

  • 4x H100 SXM5 nodes connected via local network
  • 1TB of memory per node
  • Training cache streaming from shared S3-compatible data backend (Cloudflare R2) in same region
  • Batch size of 8 per accelerator, and no gradient accumulation steps
  • Total effective batch size is 256
  • Resolution is at 1024px with data bucketing enabled
  • Speed: 15 seconds per step with 1024x1024 data when full-rank training Flux.1-dev (12B)

Lower batch sizes, lower resolution, and enabling torch compile can bring the speed into iterations per second:

  • Reduce resolution to 512px and disable data bucketing (square crops only)
  • Swap DeepSpeed from AdamW to Lion fused optimiser
  • Enable torch compile with max-autotune
  • Speed: 2 iterations per second

GPU Health Monitoring

SimpleTuner includes automatic GPU health monitoring to detect hardware faults early, which is especially important in distributed training where a single GPU failure can waste compute time and money across the entire cluster.

GPU Circuit Breaker

The GPU circuit breaker is always enabled and monitors:

  • ECC errors - Detects uncorrectable memory errors (important for A100/H100 GPUs)
  • Temperature - Alerts when approaching thermal shutdown threshold
  • Throttling - Detects hardware slowdown from thermal or power issues
  • CUDA errors - Catches runtime errors during training

When a GPU fault is detected:

  1. A gpu.fault webhook is emitted (if webhooks are configured)
  2. The circuit opens to prevent further training on faulty hardware
  3. Training exits cleanly so orchestrators can terminate the instance

Webhook Configuration

Configure webhooks in your config.json to receive GPU fault alerts:

{
  "--webhook_config": "config/webhooks.json"
}

Example webhooks.json for Discord alerts:

{
  "webhook_url": "https://discord.com/api/webhooks/...",
  "webhook_type": "discord"
}

Or for raw HTTP webhooks to an orchestrator:

{
  "webhook_url": "https://your-orchestrator.com/webhook",
  "webhook_type": "raw"
}

The orchestrator will receive a gpu.fault event with full GPU metrics, allowing automatic instance termination or failover.

Multi-Node Considerations

In multi-node training:

  • Each node runs its own GPU health monitor
  • A GPU fault on any node will trigger a webhook from that node
  • The training job will fail on all nodes due to distributed communication failure
  • Orchestrators should monitor for faults from any node in the cluster

See Resilience Infrastructure for detailed webhook payload format and programmatic access.

Distributed training caveats

  • Every node must have the same number of accelerators available
  • Only LoRA/LyCORIS can be quantised, so full distributed model training requires DeepSpeed instead
  • Full model training at this scale is a very high-cost operation, and high batch sizes might slow you down more than you want, requiring you to scale up the number of GPUs in the cluster; budget carefully
  • (DeepSpeed) Validations might need to be disabled when training with DeepSpeed ZeRO 3
  • (DeepSpeed) Model saving creates awkward sharded copies when saving with ZeRO level 3, but levels 1 and 2 function as expected
  • (DeepSpeed) DeepSpeed's CPU-based optimisers become required, as they handle sharding and offloading of the optimiser states.