# Distributed Training (Multi-node)

This document contains notes* on configuring a 4-way 8xH100 cluster for use with SimpleTuner.

*This guide does not contain full end-to-end installation instructions. Instead, these are considerations to keep in mind while following the INSTALL document or one of the quickstart guides.
## Storage backend

By default, multi-node training requires shared storage between nodes for the `output_dir`.
### Ubuntu NFS example

A basic storage example that will get you started.
#### On the 'master' node that will write the checkpoints
1. Install NFS Server Packages
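On Ubuntu this is typically:

```shell
sudo apt update
sudo apt install -y nfs-kernel-server
```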
2. Configure the NFS Export
Edit the NFS exports file to share the directory:
Add the following line at the end of the file (replace slave_ip with the actual IP address of your slave machine):
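Assuming the shared directory is `/home/ubuntu/simpletuner/output` (the path used later in this guide), a typical export line looks like:

```
/home/ubuntu/simpletuner/output slave_ip(rw,sync,no_subtree_check)
```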
If you want to allow multiple slaves or an entire subnet, you can use:
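For example, to export to an entire /24 subnet (the CIDR below is a placeholder):

```
/home/ubuntu/simpletuner/output 10.0.0.0/24(rw,sync,no_subtree_check)
```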
3. Export the Shared Directory
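Apply the exports without a full restart:

```shell
sudo exportfs -a
```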
4. Restart the NFS Server
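```shell
sudo systemctl restart nfs-kernel-server
```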
5. Verify NFS Server Status
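Confirm the server is running and the directory is exported:

```shell
sudo systemctl status nfs-kernel-server
sudo exportfs -v
```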
#### On the slave nodes that send optimiser and other states
1. Install NFS Client Packages
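```shell
sudo apt update
sudo apt install -y nfs-common
```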
2. Create the Mount Point Directory
Ensure the directory exists (it should already exist based on your setup):
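```shell
mkdir -p /home/ubuntu/simpletuner/output
```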
Note: If the directory contains data, back it up, as mounting will hide existing contents.
3. Mount the NFS Share
Mount the master's shared directory to the slave's local directory (replace master_ip with the master's IP address):
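```shell
sudo mount -t nfs master_ip:/home/ubuntu/simpletuner/output /home/ubuntu/simpletuner/output
```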
4. Verify the Mount
Check that the mount is successful:
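```shell
df -h /home/ubuntu/simpletuner/output
mount | grep nfs
```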
5. Test Write Access
Create a test file to ensure you have write permissions:
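For example (the filename is arbitrary):

```shell
touch /home/ubuntu/simpletuner/output/test_from_slave.txt
```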
Then, check on the master machine if the file appears in /home/ubuntu/simpletuner/output.
6. Make the Mount Persistent
To ensure the mount persists across reboots, add it to the /etc/fstab file:
Add the following line at the end:
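Replace `master_ip` as before; `_netdev` delays mounting until the network is up:

```
master_ip:/home/ubuntu/simpletuner/output /home/ubuntu/simpletuner/output nfs defaults,_netdev 0 0
```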
#### Additional considerations

- **User permissions**: Ensure that the `ubuntu` user has the same UID and GID on both machines so that file permissions are consistent. You can check UIDs with `id ubuntu`.
- **Firewall settings**: If you have a firewall enabled, make sure to allow NFS traffic. On the master machine:
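With `ufw`, NFSv4 uses TCP port 2049; the source subnet below is a placeholder:

```shell
sudo ufw allow from 10.0.0.0/24 to any port 2049 proto tcp
sudo ufw reload
```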
- **Synchronize clocks**: It's good practice to have both systems' clocks synchronized, especially in distributed setups. Use `ntp` or `systemd-timesyncd`.
- **Testing DeepSpeed checkpoints**: Run a small DeepSpeed job to confirm that checkpoints are correctly written to the master's directory.
## Dataloader configuration

Very large datasets can be a challenge to manage efficiently. SimpleTuner will automatically shard datasets over each node and distribute pre-processing across every available GPU in the cluster, while using asynchronous queues and threads to maintain throughput.
### Dataset sizing for multi-GPU training

When training across multiple GPUs or nodes, your dataset must contain enough samples to satisfy the effective batch size: `train_batch_size × num_gpus × gradient_accumulation_steps`.
Example calculations:
| Configuration | Calculation | Effective Batch Size |
|---|---|---|
| 1 node, 8 GPUs, batch_size=4, grad_accum=1 | 4 × 8 × 1 | 32 samples |
| 2 nodes, 16 GPUs, batch_size=8, grad_accum=2 | 8 × 16 × 2 | 256 samples |
| 4 nodes, 32 GPUs, batch_size=8, grad_accum=1 | 8 × 32 × 1 | 256 samples |
Each aspect ratio bucket in your dataset must contain at least this many samples (accounting for repeats) or training will fail with a detailed error message.
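The arithmetic above can be sanity-checked directly in the shell; here is the 2-node row from the table:

```shell
# effective batch size = train_batch_size x total GPUs x grad accum steps
# 2 nodes x 8 GPUs = 16 accelerators, batch size 8, grad accum 2
echo $((8 * 16 * 2))
```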
### Solutions for small datasets
If your dataset is smaller than the effective batch size:
- **Reduce batch size** - Lower `train_batch_size` to reduce memory requirements
- **Reduce GPU count** - Train on fewer GPUs (though this slows training)
- **Increase repeats** - Set `repeats` in your dataloader configuration
- **Enable automatic oversubscription** - Use `--allow_dataset_oversubscription` to automatically adjust repeats
The --allow_dataset_oversubscription flag (documented in OPTIONS.md) will automatically calculate and apply the minimum required repeats for your configuration, making it ideal for prototyping or small dataset experiments.
### Slow image scan / discovery

The discovery backend currently restricts aspect-bucket data collection to a single node. This can take an extremely long time with very large datasets, as each image has to be read from storage to retrieve its geometry.

To work around this problem, use the parquet `metadata_backend`, which allows you to preprocess your data in any manner accessible to you. As outlined in the linked document section, the parquet table contains the filename, width, height, and caption columns to help quickly and efficiently sort the data into its respective buckets.
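A sketch of what such an entry might look like in `multidatabackend.json` — the field names and paths here are illustrative and should be verified against the dataloader configuration documentation:

```json
{
  "id": "my-dataset",
  "type": "local",
  "metadata_backend": "parquet",
  "parquet": {
    "path": "meta/dataset.parquet",
    "filename_column": "filename",
    "caption_column": "caption",
    "width_column": "width",
    "height_column": "height"
  }
}
```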
## Storage space
Huge datasets, especially when using the T5-XXL text encoder, will consume enormous quantities of space for the original data, the image embeds, and the text embeds.
### Cloud storage

Using a provider such as Cloudflare R2, one can generate extremely large datasets with very low storage fees.

See the dataloader configuration guide for an example of how to configure the `aws` type in `multidatabackend.json`.
- Image data can be stored locally or via S3
- If images are in S3, the preprocessing speed reduces according to network bandwidth
- If images are stored locally, this does not take advantage of NVMe throughput during training
- Image embeds and text embeds can be separately stored on local or cloud storage
- Placing embeds on cloud storage reduces the training rate very little, as they are fetched in parallel

Ideally, all images and all embeds are simply maintained in a cloud storage bucket. This greatly reduces the risk of issues during pre-processing and when resuming training.
### On-demand VAE encoding
For large datasets where storing cached VAE latents is impractical due to storage constraints or slow shared storage access, you can use --vae_cache_disable. This disables the VAE cache entirely, forcing the VAE to encode images on-the-fly during training.
This increases GPU computation load but significantly reduces storage requirements and network I/O for cached latents.
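In `config.json`, assuming the flag maps to a key of the same name:

```json
{
  "vae_cache_disable": true
}
```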
### Preserving filesystem scan caches

If your datasets are so large that scanning for new images becomes a bottleneck, adding `preserve_data_backend_cache=true` to each dataloader config entry will prevent the backend from being scanned for new images.
Note that you should then use the image_embeds data backend type (more information here) to allow these cache lists to live separately in case your pre-processing job is interrupted. This will prevent the image list from being re-scanned at startup.
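A minimal sketch of a dataloader entry with the flag set (other required fields elided):

```json
{
  "id": "my-dataset",
  "type": "local",
  "preserve_data_backend_cache": true
}
```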
### Data compression

Data compression is enabled by adding the following to `config.json`:
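Assuming the relevant key is `compress_disk_cache`:

```json
{
  "compress_disk_cache": true
}
```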
This will use inline gzip to reduce the amount of redundant disk space consumed by the rather large text and image embeds.
## Configuring via 🤗 Accelerate

When using `accelerate config` (stored at `/home/user/.cache/huggingface/accelerate/default_config.yaml`) to deploy SimpleTuner, these options take priority over the contents of `config/config.env`.
An example default_config for Accelerate that does not include DeepSpeed:
```yaml
# this should be updated on EACH node.
machine_rank: 0

# Everything below here is the same on EACH node.
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_config:
  dynamo_backend: NO
enable_cpu_affinity: false
main_process_ip: 10.0.0.100
main_process_port: 8080
main_training_function: main
mixed_precision: bf16
num_machines: 4
num_processes: 32
rdzv_backend: static
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
## DeepSpeed
This document doesn't go into as much detail as the dedicated page.
When optimising training on DeepSpeed for multi-node, using the lowest-possible ZeRO level is essential.
For example, an 80G NVIDIA GPU can successfully train Flux with ZeRO level 1 offload, minimising overhead substantially.
Adding the following lines to each node's Accelerate config enables DeepSpeed:

```yaml
# Update this from MULTI_GPU to DEEPSPEED
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  gradient_clipping: 0.01
  zero3_init_flag: false
  zero_stage: 1
```
### torch compile optimisation

For extra performance (with the drawback of potential compatibility issues), you can enable torch compile by adding the following lines to each node's YAML config:
```yaml
dynamo_config:
  # Update this from NO to INDUCTOR
  dynamo_backend: INDUCTOR
  dynamo_mode: max-autotune
  dynamo_use_dynamic: false
  dynamo_use_fullgraph: false
```
## Expected performance
- 4x H100 SXM5 nodes connected via local network
- 1TB of memory per node
- Training cache streaming from shared S3-compatible data backend (Cloudflare R2) in same region
- Batch size of 8 per accelerator, and no gradient accumulation steps
- Total effective batch size is 256
- Resolution is at 1024px with data bucketing enabled
- Speed: 15 seconds per step with 1024x1024 data when full-rank training Flux.1-dev (12B)
Lower batch sizes, lower resolution, and enabling torch compile can bring the speed into iterations per second:
- Reduce resolution to 512px and disable data bucketing (square crops only)
- Swap DeepSpeed from AdamW to Lion fused optimiser
- Enable torch compile with max-autotune
- Speed: 2 iterations per second
## GPU Health Monitoring
SimpleTuner includes automatic GPU health monitoring to detect hardware faults early, which is especially important in distributed training where a single GPU failure can waste compute time and money across the entire cluster.
### GPU Circuit Breaker
The GPU circuit breaker is always enabled and monitors:
- ECC errors - Detects uncorrectable memory errors (important for A100/H100 GPUs)
- Temperature - Alerts when approaching thermal shutdown threshold
- Throttling - Detects hardware slowdown from thermal or power issues
- CUDA errors - Catches runtime errors during training
When a GPU fault is detected:
- A `gpu.fault` webhook is emitted (if webhooks are configured)
- The circuit opens to prevent further training on faulty hardware
- Training exits cleanly so orchestrators can terminate the instance
### Webhook Configuration
Configure webhooks in your config.json to receive GPU fault alerts:
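Assuming the `config.json` key is `webhook_config`, pointing at a separate webhook definition file:

```json
{
  "webhook_config": "webhooks.json"
}
```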
Example webhooks.json for Discord alerts:
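A sketch — the exact field names should be verified against the webhooks documentation, and the URL is a placeholder:

```json
{
  "webhook_type": "discord",
  "webhook_url": "https://discord.com/api/webhooks/...",
  "message_prefix": "h100-cluster",
  "log_level": "warning"
}
```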
Or for raw HTTP webhooks to an orchestrator:
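Again a sketch with placeholder values; the endpoint is whatever your orchestrator exposes:

```json
{
  "webhook_type": "raw",
  "webhook_url": "https://orchestrator.example.com/simpletuner/events",
  "log_level": "critical"
}
```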
The orchestrator will receive a gpu.fault event with full GPU metrics, allowing automatic instance termination or failover.
### Multi-Node Considerations
In multi-node training:
- Each node runs its own GPU health monitor
- A GPU fault on any node will trigger a webhook from that node
- The training job will fail on all nodes due to distributed communication failure
- Orchestrators should monitor for faults from any node in the cluster
See Resilience Infrastructure for detailed webhook payload format and programmatic access.
## Distributed training caveats
- Every node must have the same number of accelerators available
- Only LoRA/LyCORIS can be quantised, so full distributed model training requires DeepSpeed instead
- This is a very high-cost operation, and high batch sizes might slow you down more than you want, requiring you to scale up the GPU count in the cluster; budget carefully
- (DeepSpeed) Validations might need to be disabled when training with DeepSpeed ZeRO 3
- (DeepSpeed) Model saving ends up creating weird sharded copies when saving with ZeRO level 3, but levels 1 and 2 function as expected
- (DeepSpeed) The use of DeepSpeed's CPU-based optimisers becomes required as it handles sharding and offload of the optim states.