Hunyuan Video 1.5 Quickstart¶
This guide walks through training a LoRA on Tencent's 8.3B Hunyuan Video 1.5 release (tencent/HunyuanVideo-1.5) using SimpleTuner.
Hardware requirements¶
Hunyuan Video 1.5 is a large model (8.3B parameters).
- Minimum: 24GB-32GB VRAM; this is comfortable for a rank-16 LoRA with full gradient checkpointing at 480p.
- Recommended: A6000 / A100 (48GB-80GB) for 720p training or larger batch sizes.
- System RAM: 64GB+ is recommended to handle model loading.
Memory offloading (optional)¶
Add the following to your config.json:
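A minimal sketch of the offloading keys, matching the names used in the example config later in this guide (verify against your SimpleTuner version):

```json
{
  "enable_group_offload": true,
  "group_offload_type": "block_level",
  "group_offload_use_stream": true
}
```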
- `--group_offload_use_stream`: only works on CUDA devices.
- Do not combine group offloading with `--enable_model_cpu_offload`.
Prerequisites¶
Make sure you have Python installed; SimpleTuner works well with versions 3.10 through 3.13.
You can check this by running:
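For example:

```shell
python3 --version   # or `python --version`, depending on your system
```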
If you don't have python 3.13 installed on Ubuntu, you can try the following:
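One common route (an assumption here, not the only option) is the third-party deadsnakes PPA:

```shell
# Assumes Ubuntu with sudo access; deadsnakes is a widely used
# third-party PPA for newer Python builds.
sudo apt -y install software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt -y update
sudo apt -y install python3.13 python3.13-venv
```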
Container image dependencies¶
For Vast, RunPod, and TensorDock (among others), the following will work on a CUDA 12.2-12.8 image to enable compiling of CUDA extensions:
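A sketch of the build dependencies typically needed (package names assumed; adjust for your base image):

```shell
# Compiler toolchain and headers for building CUDA extensions
apt -y update
apt -y install build-essential git python3-dev ninja-build
pip install --upgrade pip setuptools wheel
```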
AMD ROCm follow-up steps¶
The following must be executed for an AMD MI300X to be usable:
apt install amd-smi-lib
pushd /opt/rocm/share/amd_smi
python3 -m pip install --upgrade pip
python3 -m pip install .
popd
Installation¶
Install SimpleTuner via pip:
pip install 'simpletuner[cuda]'
# CUDA 13 / Blackwell users (NVIDIA B-series GPUs)
pip install 'simpletuner[cuda13]' --extra-index-url https://download.pytorch.org/whl/cu130
For manual installation or development setup, see the installation documentation.
Required checkpoints¶
The main tencent/HunyuanVideo-1.5 repo contains the transformer/vae/scheduler, but the text encoder (text_encoder/llm) and vision encoder (vision_encoder/siglip) live in separate downloads. Point SimpleTuner at your local copies before launching:
export HUNYUANVIDEO_TEXT_ENCODER_PATH=/path/to/text_encoder_root
export HUNYUANVIDEO_VISION_ENCODER_PATH=/path/to/vision_encoder_root
If these are unset, SimpleTuner tries to pull them from the model repo; most mirrors do not bundle them, so set the paths explicitly to avoid startup errors.
Setting up the environment¶
Web interface method¶
The SimpleTuner WebUI makes setup fairly straightforward. To run the server:
This will create a webserver on port 8001 by default, which you can access by visiting http://localhost:8001.
Manual / command-line method¶
To run SimpleTuner via command-line tools, you will need to set up a configuration file, the dataset and model directories, and a dataloader configuration file.
Configuration file¶
An experimental script, configure.py, may allow you to entirely skip this section through an interactive step-by-step configuration.
Note: This doesn't configure your dataloader. You will still have to do that manually, later.
To run it:
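Assuming a git checkout of SimpleTuner (the script lives in the repository root):

```shell
python configure.py
```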
If you prefer to manually configure:
Copy config/config.json.example to config/config.json:
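```shell
# Run from the SimpleTuner directory
cp config/config.json.example config/config.json
```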
Key configuration overrides for HunyuanVideo:
{
"model_type": "lora",
"model_family": "hunyuanvideo",
"pretrained_model_name_or_path": "tencent/HunyuanVideo-1.5",
"model_flavour": "t2v-480p",
"output_dir": "output/hunyuan-video",
"validation_resolution": "854x480",
"validation_num_video_frames": 61,
"validation_guidance": 6.0,
"train_batch_size": 1,
"gradient_accumulation_steps": 1,
"learning_rate": 1e-4,
"mixed_precision": "bf16",
"optimizer": "adamw_bf16",
"lora_rank": 16,
"enable_group_offload": true,
"group_offload_type": "block_level",
"dataset_backend_config": "config/multidatabackend.json"
}
- `model_flavour` options: `t2v-480p` (default), `t2v-720p`, `i2v-480p` (image-to-video), `i2v-720p` (image-to-video).
- `validation_num_video_frames`: must satisfy `(frames - 1) % 4 == 0`, e.g. 61 or 129.
Advanced Experimental Features¶
SimpleTuner includes experimental features that can significantly improve training stability and performance.

- [Scheduled Sampling (Rollout)](../experimental/SCHEDULED_SAMPLING.md): reduces exposure bias and improves output quality by letting the model generate its own inputs during training.

> ⚠️ These features increase the computational overhead of training.

Dataset considerations¶
Create a `--data_backend_config` (`config/multidatabackend.json`) document containing this:

[
{
"id": "my-video-dataset",
"type": "local",
"dataset_type": "video",
"instance_data_dir": "datasets/videos",
"caption_strategy": "textfile",
"resolution": 480,
"video": {
"num_frames": 61,
"min_frames": 61,
"frame_rate": 24,
"bucket_strategy": "aspect_ratio"
},
"repeats": 10
},
{
"id": "text-embeds",
"type": "local",
"dataset_type": "text_embeds",
"default": true,
"cache_dir": "cache/text/hunyuan",
"disabled": false
}
]
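With `caption_strategy` set to `textfile`, each video in `instance_data_dir` needs a same-named `.txt` caption beside it. A sketch of the expected layout (file names are illustrative):

```shell
mkdir -p datasets/videos
# clip_0001.mp4 would sit alongside its caption file:
printf 'a person walking along a beach at sunset' > datasets/videos/clip_0001.txt
```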
Executing the training run¶
From the SimpleTuner directory:
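Assuming a git checkout (a pip install may expose a different entry point), training has historically been launched with the bundled script, which reads `config/config.json`:

```shell
bash train.sh
```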
Notes & troubleshooting tips¶
VRAM Optimization¶
- Group offload: essential for consumer GPUs. Ensure `enable_group_offload` is `true`.
- Resolution: stick to 480p (`854x480` or similar) if you have limited VRAM. 720p (`1280x720`) increases memory usage significantly.
- Quantization: use `base_model_precision` (`bf16` by default); `int8-torchao` works for further savings at the cost of speed.
- VAE patch convolution: for HunyuanVideo VAE OOMs, set `--vae_enable_patch_conv=true` (or toggle it in the UI). This slices 3D conv/attention work to lower peak VRAM; expect a small throughput hit.
Image-to-Video (I2V)¶
- Use `model_flavour="i2v-480p"` or `"i2v-720p"`.
- SimpleTuner automatically uses the first frame of your video dataset samples as the conditioning image during training.
I2V Validation Options¶
For validation with i2v models, you have two options:
1. Auto-extracted first frame: by default, validation uses the first frame from video samples in your dataset.
2. Separate image dataset (simpler setup): use `--validation_using_datasets=true` with `--eval_dataset_id` pointing to an image dataset. This allows you to use any image dataset as the first-frame conditioning input for validation videos, without needing to set up the complex conditioning dataset pairing used during training.
Example config for option 2:
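A minimal sketch for `config/config.json`, assuming an image dataset registered in `multidatabackend.json` under the hypothetical id `my-image-dataset`:

```json
{
  "validation_using_datasets": true,
  "eval_dataset_id": "my-image-dataset"
}
```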
Text Encoders¶
Hunyuan uses a dual text encoder setup (LLM + CLIP). Ensure your system RAM can handle loading these during the caching phase.