ACE-Step Quickstart¶
In this example, we'll be training the ACE-Step audio generation model. SimpleTuner currently supports the original ACE-Step v1 3.5B path plus forward-compatible LoRA training for the ACE-Step v1.5 bundle.
Overview¶
ACE-Step is a transformer-based flow-matching audio model designed for high-quality synthesis. In SimpleTuner:
- `base` targets the original ACE-Step v1 3.5B training path.
- `v15-turbo`, `v15-base`, and `v15-sft` target the ACE-Step v1.5 bundle variants loaded from `ACE-Step/Ace-Step1.5`.
Hardware Requirements¶
ACE-Step is a 3.5B parameter model, making it relatively lightweight compared to large image generation models like Flux.
- Minimum: NVIDIA GPU with 12GB+ VRAM (e.g., 3060, 4070).
- Recommended: NVIDIA GPU with 24GB+ VRAM (e.g., 3090, 4090, A10G) for larger batch sizes.
- Mac: Supported via MPS on Apple Silicon (Requires ~36GB+ Unified Memory).
Storage Requirements¶
⚠️ Disk Usage Warning: The VAE cache for audio models can be substantial. For example, a single 60-second audio clip can produce an ~89MB cached latent file. This caching strategy is used to drastically reduce VRAM requirements during training. Ensure you have sufficient disk space for your dataset's cache.
💡 Tip: For larger datasets, you can use the `--vae_cache_disable` option to disable writing embeddings to disk. This implicitly enables on-demand caching, which saves disk space but increases training time and memory usage, as encodings are performed during the training loop.

💡 Tip: Using `int8-quanto` quantization allows training on GPUs with less VRAM (e.g., 12GB-16GB) with minimal quality loss.
Prerequisites¶
Ensure you have a working Python 3.10+ environment.
Configuration¶
It is recommended to keep your configurations organized. We'll create a dedicated folder for this demo.
Critical Settings¶
SimpleTuner currently supports these ACE-Step flavours:
- `base`: original ACE-Step v1 3.5B
- `v15-turbo`, `v15-base`, `v15-sft`: ACE-Step v1.5 bundle variants
Use the matching config for your target variant.
Ready-made example presets are available at:
- `simpletuner/examples/ace_step-v1-0.peft-lora`
- `simpletuner/examples/ace_step-v1-5.peft-lora`
You can launch them directly with `simpletuner train example=ace_step-v1-0.peft-lora` or `simpletuner train example=ace_step-v1-5.peft-lora`.
ACE-Step v1 example¶
Create config/acestep-training-demo/config.json with these values:
View example config
ACE-Step v1.5 example¶
For ACE-Step v1.5, keep model_family: "ace_step" but select a v1.5 flavour and point the checkpoint root at the shared v1.5 bundle:
View example config
```json
{
  "model_family": "ace_step",
  "model_type": "lora",
  "model_flavour": "v15-base",
  "pretrained_model_name_or_path": "ACE-Step/Ace-Step1.5",
  "trust_remote_code": true,
  "resolution": 0,
  "mixed_precision": "bf16",
  "base_model_precision": "int8-quanto",
  "data_backend_config": "config/acestep-training-demo/multidatabackend.json"
}
```
Validation Settings¶
Add these to your config.json to monitor progress:
- `validation_prompt`: A text description of the audio you want to generate (e.g., "A catchy pop song with upbeat drums").
- `validation_lyrics`: (Optional) Lyrics for the model to sing.
- `validation_audio_duration`: Duration in seconds for validation clips (default: 30.0).
- `validation_guidance`: Guidance scale (default: ~3.0 - 5.0).
- `validation_step_interval`: How often to generate samples (e.g., every 100 steps).
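Taken together, a validation block in config.json might look like the following sketch (the prompt, lyrics, and numeric values are illustrative placeholders, not tuned recommendations):

```json
{
  "validation_prompt": "A catchy pop song with upbeat drums",
  "validation_lyrics": "[verse]\nWalking down the street tonight",
  "validation_audio_duration": 30.0,
  "validation_guidance": 4.0,
  "validation_step_interval": 100
}
```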
ℹ️ ACE-Step v1.5 note: SimpleTuner now supports built-in v1.5 validation renders for prompt + optional lyrics conditioning. Loading the upstream v1.5 repository still requires `trust_remote_code: true`, and more advanced upstream editing/inference workflows are not exposed through the SimpleTuner validation pipeline yet.
Advanced Experimental Features¶
Show advanced experimental details
SimpleTuner includes experimental features that can significantly improve training stability and performance.

- **[Scheduled Sampling (Rollout)](../experimental/SCHEDULED_SAMPLING.md):** reduces exposure bias and improves output quality by letting the model generate its own inputs during training.

> ⚠️ These features increase the computational overhead of training.

Dataset Configuration¶
ACE-Step requires an audio-specific dataset configuration.
Option 1: Demo Dataset (Hugging Face)¶
For a quick start, you can use the prepared ACEStep-Songs preset.
Create config/acestep-training-demo/multidatabackend.json:
View example config
```json
[
  {
    "id": "acestep-demo-data",
    "type": "huggingface",
    "dataset_type": "audio",
    "dataset_name": "Yi3852/ACEStep-Songs",
    "metadata_backend": "huggingface",
    "caption_strategy": "huggingface",
    "cache_dir_vae": "cache/vae/{model_family}/acestep-demo-data"
  },
  {
    "id": "text-embeds",
    "dataset_type": "text_embeds",
    "default": true,
    "type": "local",
    "cache_dir": "cache/text/{model_family}"
  }
]
```
See caption_strategy options and requirements in DATALOADER.md.
Option 2: Local Audio Files¶
Create config/acestep-training-demo/multidatabackend.json:
View example config
```json
[
  {
    "id": "my-audio-dataset",
    "type": "local",
    "dataset_type": "audio",
    "instance_data_dir": "datasets/my_audio_files",
    "caption_strategy": "textfile",
    "metadata_backend": "discovery",
    "disabled": false
  },
  {
    "id": "text-embeds",
    "dataset_type": "text_embeds",
    "default": true,
    "type": "local",
    "cache_dir": "cache/text/{model_family}"
  }
]
```
Data Structure¶
Place your audio files in datasets/my_audio_files. SimpleTuner supports a wide range of formats including:
- Lossless: `.wav`, `.flac`, `.aiff`, `.alac`
- Lossy: `.mp3`, `.ogg`, `.m4a`, `.aac`, `.wma`, `.opus`
ℹ️ Note: To support formats like MP3, AAC, and WMA, you must have FFmpeg installed on your system.
For captions and lyrics, place corresponding text files next to your audio files:
- Audio: `track_01.wav`
- Caption (Prompt): `track_01.txt` (contains the text description, e.g., "A slow jazz ballad")
- Lyrics (Optional): `track_01.lyrics` (contains the lyrics text)
Example dataset layout
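Concretely, a minimal layout following the naming rules above might look like this (file names are illustrative):

```
datasets/my_audio_files/
├── track_01.wav
├── track_01.txt
├── track_01.lyrics
├── track_02.flac
└── track_02.txt
```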
💡 Advanced: If your dataset uses a different naming convention (e.g. `_lyrics.txt`), you can customize this in your dataset config.
⚠️ Note on Lyrics: If a `.lyrics` file is not found for a sample, the lyric embeddings will be zeroed out. ACE-Step expects lyric conditioning; training heavily on data without lyrics (instrumentals) may require more training steps for the model to learn to generate high-quality instrumental audio with zeroed lyric inputs.
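The zeroing behaviour can be illustrated with a minimal sketch. Everything here (function names, shapes, the stand-in encoder) is hypothetical and not SimpleTuner's actual API; it only demonstrates the fallback described above:

```python
def lyric_conditioning(lyrics_text, encode, seq_len=4, embed_dim=3):
    # Hypothetical sketch: when no .lyrics file exists for a sample,
    # substitute an all-zero embedding with the same shape the lyric
    # encoder would otherwise produce.
    if lyrics_text is None:
        return [[0.0] * embed_dim for _ in range(seq_len)]
    return encode(lyrics_text)

# Stand-in encoder for demonstration purposes only.
fake_encoder = lambda text: [[1.0] * 3 for _ in range(4)]

print(lyric_conditioning(None, fake_encoder))       # zeroed conditioning
print(lyric_conditioning("la la la", fake_encoder)) # encoded lyrics
```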
Training¶
Start the training run by specifying your environment:
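The exact entry point varies between SimpleTuner versions; assuming the `simpletuner` CLI shown earlier supports an environment selector (an assumption here, not a confirmed flag), the invocation would look roughly like:

```shell
# Hedged sketch: the env= selector is modeled on the example= form
# shown earlier; consult your SimpleTuner version's documentation.
simpletuner train env=acestep-training-demo
```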
This command tells SimpleTuner to look for config.json inside config/acestep-training-demo/.
💡 Tip (Continue Training): To continue fine-tuning from an existing LoRA (e.g. the official ACE-Step checkpoints or community adapters), use the `--init_lora` option:
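In config.json terms, this could be sketched as follows (the adapter path is a placeholder for your own checkpoint):

```json
{
  "init_lora": "output/previous-run/pytorch_lora_weights.safetensors"
}
```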
Training the Lyrics Embedder (upstream-style)¶
ℹ️ Version note: `lyrics_embedder_train` currently applies to the ACE-Step v1 training path. The v1.5 forward-compatible LoRA path in SimpleTuner is decoder-only.
The upstream ACE-Step trainer fine-tunes the lyrics embedder alongside the denoiser. To mirror that behaviour in SimpleTuner (full or standard LoRA only):
- Enable it: `lyrics_embedder_train: true`
- Optional overrides (otherwise the main optimizer/scheduler are reused): `lyrics_embedder_lr`, `lyrics_embedder_optimizer`, `lyrics_embedder_lr_scheduler`
Example snippet:
View example config
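A minimal sketch combining the options listed above (the learning rate, optimizer, and scheduler values are illustrative assumptions; use whatever names your main config already accepts):

```json
{
  "lyrics_embedder_train": true,
  "lyrics_embedder_lr": 1e-5,
  "lyrics_embedder_optimizer": "adamw_bf16",
  "lyrics_embedder_lr_scheduler": "constant"
}
```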
Embedder weights are checkpointed with LoRA saves and restored on resume.
Troubleshooting¶
- Validation Errors: Ensure you are not trying to use image-centric validation features like `num_validation_images` > 1 (conceptually mapped to batch size for audio) or image-based metrics (CLIP score).
- Memory Issues: If running OOM, try reducing `train_batch_size` or enabling `gradient_checkpointing`.
Migrating from Upstream Trainer¶
If you are coming from the original ACE-Step training scripts, here is how the parameters map to SimpleTuner's config.json:
| Upstream Parameter | SimpleTuner config.json | Default / Notes |
|---|---|---|
| `--learning_rate` | `learning_rate` | 1e-4 |
| `--num_workers` | `dataloader_num_workers` | 8 |
| `--max_steps` | `max_train_steps` | 2000000 |
| `--every_n_train_steps` | `checkpointing_steps` | 2000 |
| `--precision` | `mixed_precision` | "fp16" or "bf16" (use "no" for fp32) |
| `--accumulate_grad_batches` | `gradient_accumulation_steps` | 1 |
| `--gradient_clip_val` | `max_grad_norm` | 0.5 |
| `--shift` | `flow_schedule_shift` | 3.0 (Specific to ACE-Step) |
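As a worked example, an upstream run launched with the defaults from the table above maps onto this config.json fragment (every key and value is taken directly from the mapping; nothing here is new):

```json
{
  "learning_rate": 1e-4,
  "dataloader_num_workers": 8,
  "max_train_steps": 2000000,
  "checkpointing_steps": 2000,
  "mixed_precision": "bf16",
  "gradient_accumulation_steps": 1,
  "max_grad_norm": 0.5,
  "flow_schedule_shift": 3.0
}
```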
Converting Raw Data¶
If you have raw audio/text/lyrics files and want to use the Hugging Face dataset format (as used by the upstream convert2hf_dataset.py tool), you can use the resulting dataset directly in SimpleTuner.
The upstream converter produces a dataset with tags and norm_lyrics columns. To use these, configure your backend like this:
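A plausible backend sketch, modeled on the Hugging Face demo backend earlier in this guide. The dataset name is a placeholder, and whether the `huggingface` caption strategy picks up the `tags` and `norm_lyrics` columns automatically (or needs explicit column-mapping keys) depends on your SimpleTuner version; check DATALOADER.md before relying on this:

```json
[
  {
    "id": "converted-audio-data",
    "type": "huggingface",
    "dataset_type": "audio",
    "dataset_name": "your-username/your-converted-dataset",
    "metadata_backend": "huggingface",
    "caption_strategy": "huggingface",
    "cache_dir_vae": "cache/vae/{model_family}/converted-audio-data"
  }
]
```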