ACE-Step Quickstart¶
In this example, we'll be training the ACE-Step audio generation model. SimpleTuner currently supports the original ACE-Step v1 3.5B path plus forward-compatible LoRA training for the ACE-Step v1.5 bundle.
Overview¶
ACE-Step is a transformer-based flow-matching audio model designed for high-quality synthesis. In SimpleTuner:
- `base` targets the original ACE-Step v1 3.5B training path.
- `v15-turbo`, `v15-base`, and `v15-sft` target the ACE-Step v1.5 bundle variants loaded from `ACE-Step/Ace-Step1.5`.
Hardware Requirements¶
ACE-Step is a 3.5B parameter model, making it relatively lightweight compared to large image generation models like Flux.
- Minimum: NVIDIA GPU with 12GB+ VRAM (e.g., 3060, 4070).
- Recommended: NVIDIA GPU with 24GB+ VRAM (e.g., 3090, 4090, A10G) for larger batch sizes.
- Mac: Supported via MPS on Apple Silicon (Requires ~36GB+ Unified Memory).
Storage Requirements¶
⚠️ Disk Usage Warning: The VAE cache for audio models can be substantial. For example, a single 60-second audio clip can produce an ~89MB cached latent file. This caching strategy is used to drastically reduce VRAM requirements during training. Ensure you have sufficient disk space for your dataset's cache.
💡 Tip: For larger datasets, you can use the `--vae_cache_disable` option to disable writing embeddings to disk. This implicitly enables on-demand caching, which saves disk space but increases training time and memory usage, as encodings are performed during the training loop.

💡 Tip: Using `int8-quanto` quantization allows training on GPUs with less VRAM (e.g., 12GB-16GB) with minimal quality loss.
Prerequisites¶
Ensure you have a working Python 3.10+ environment.
Configuration¶
It is recommended to keep your configurations organized. We'll create a dedicated folder for this demo.
Critical Settings¶
SimpleTuner currently supports these ACE-Step flavours:
- `base`: original ACE-Step v1 3.5B
- `v15-turbo`, `v15-base`, `v15-sft`: ACE-Step v1.5 bundle variants
Use the matching config for your target variant.
Ready-made example presets are available at:
- `simpletuner/examples/ace_step-v1-0.peft-lora`
- `simpletuner/examples/ace_step-v1-5.peft-lora`
You can launch them directly with `simpletuner train example=ace_step-v1-0.peft-lora` or `simpletuner train example=ace_step-v1-5.peft-lora`.
ACE-Step v1 example¶
Create config/acestep-training-demo/config.json with these values:
View example config
ACE-Step v1.5 example¶
For ACE-Step v1.5, keep model_family: "ace_step" but select a v1.5 flavour and point the checkpoint root at the shared v1.5 bundle:
View example config
```json
{
  "model_family": "ace_step",
  "model_type": "lora",
  "model_flavour": "v15-base",
  "pretrained_model_name_or_path": "ACE-Step/Ace-Step1.5",
  "trust_remote_code": true,
  "resolution": 0,
  "mixed_precision": "bf16",
  "base_model_precision": "int8-quanto",
  "data_backend_config": "config/acestep-training-demo/multidatabackend.json"
}
```
Validation Settings¶
Add these to your config.json to monitor progress:
- `validation_prompt`: A text description of the audio you want to generate (e.g., "A catchy pop song with upbeat drums").
- `validation_lyrics`: (Optional) Lyrics for the model to sing.
- `validation_audio_duration`: Duration in seconds for validation clips (default: 30.0).
- `validation_guidance`: Guidance scale (default: ~3.0 - 5.0).
- `validation_step_interval`: How often to generate samples (e.g., every 100 steps).
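Taken together, a validation block in config.json might look like the following sketch (the prompt, lyrics, and numeric values are illustrative placeholders, not tuned recommendations):

```json
{
  "validation_prompt": "A catchy pop song with upbeat drums",
  "validation_lyrics": "[verse]\nWalking down the street tonight",
  "validation_audio_duration": 30.0,
  "validation_guidance": 4.0,
  "validation_step_interval": 100
}
```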
ℹ️ ACE-Step v1.5 note: SimpleTuner now supports built-in v1.5 validation renders for prompt + optional lyrics conditioning. Loading the upstream v1.5 repository still requires `trust_remote_code: true`, and more advanced upstream editing/inference workflows are not exposed through the SimpleTuner validation pipeline yet.
Advanced Experimental Features¶
Show advanced experimental details
SimpleTuner includes experimental features that can significantly improve training stability and performance.

- **[Scheduled Sampling (Rollout)](../experimental/SCHEDULED_SAMPLING.md):** reduces exposure bias and improves output quality by letting the model generate its own inputs during training.

> ⚠️ These features increase the computational overhead of training.

Dataset Configuration¶
ACE-Step requires an audio-specific dataset configuration.
Option 1: Demo Dataset (Hugging Face)¶
For a quick start, you can use the prepared ACEStep-Songs preset.
Create config/acestep-training-demo/multidatabackend.json:
View example config
```json
[
  {
    "id": "acestep-demo-data",
    "type": "huggingface",
    "dataset_type": "audio",
    "dataset_name": "Yi3852/ACEStep-Songs",
    "metadata_backend": "huggingface",
    "caption_strategy": "huggingface",
    "cache_dir_vae": "cache/vae/{model_family}/acestep-demo-data"
  },
  {
    "id": "text-embeds",
    "dataset_type": "text_embeds",
    "default": true,
    "type": "local",
    "cache_dir": "cache/text/{model_family}"
  }
]
```
See caption_strategy options and requirements in DATALOADER.md.
Option 2: Local Audio Files¶
Create config/acestep-training-demo/multidatabackend.json:
View example config
```json
[
  {
    "id": "my-audio-dataset",
    "type": "local",
    "dataset_type": "audio",
    "instance_data_dir": "datasets/my_audio_files",
    "caption_strategy": "textfile",
    "metadata_backend": "discovery",
    "disabled": false
  },
  {
    "id": "text-embeds",
    "dataset_type": "text_embeds",
    "default": true,
    "type": "local",
    "cache_dir": "cache/text/{model_family}"
  }
]
```
Data Structure¶
Place your audio files in datasets/my_audio_files. SimpleTuner supports a wide range of formats including:
- Lossless: `.wav`, `.flac`, `.aiff`, `.alac`
- Lossy: `.mp3`, `.ogg`, `.m4a`, `.aac`, `.wma`, `.opus`
ℹ️ Note: To support formats like MP3, AAC, and WMA, you must have FFmpeg installed on your system.
For captions and lyrics, place corresponding text files next to your audio files:
- Audio: `track_01.wav`
- Caption (Prompt): `track_01.txt` (contains the text description, e.g., "A slow jazz ballad")
- Lyrics (Optional): `track_01.lyrics` (contains the lyrics text)
Example dataset layout
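Concretely, a minimal layout following the naming rules above might look like this (file names are illustrative):

```
datasets/my_audio_files/
├── track_01.wav
├── track_01.txt
├── track_01.lyrics
├── track_02.flac
└── track_02.txt
```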
💡 Advanced: If your dataset uses a different naming convention (e.g. `_lyrics.txt`), you can customize this in your dataset config.
⚠️ Note on Lyrics: If a `.lyrics` file is not found for a sample, the lyric embeddings will be zeroed out. ACE-Step expects lyric conditioning; training heavily on data without lyrics (instrumentals) may require more training steps for the model to learn to generate high-quality instrumental audio with zeroed lyric inputs.
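The zeroing behaviour can be illustrated with a minimal sketch. Everything here (function names, shapes, the stand-in encoder) is hypothetical and not SimpleTuner's actual API; it only demonstrates the fallback described above:

```python
def lyric_conditioning(lyrics_text, encode, seq_len=4, embed_dim=3):
    # Hypothetical sketch: when no .lyrics file exists for a sample,
    # substitute an all-zero embedding with the same shape the lyric
    # encoder would otherwise produce.
    if lyrics_text is None:
        return [[0.0] * embed_dim for _ in range(seq_len)]
    return encode(lyrics_text)

# Stand-in encoder for demonstration purposes only.
fake_encoder = lambda text: [[1.0] * 3 for _ in range(4)]

print(lyric_conditioning(None, fake_encoder))       # zeroed conditioning
print(lyric_conditioning("la la la", fake_encoder)) # encoded lyrics
```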
Training¶
Start the training run by specifying your environment:
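The exact entry point varies between SimpleTuner versions; assuming the `simpletuner` CLI shown earlier supports an environment selector (an assumption here, not a confirmed flag), the invocation would look roughly like:

```shell
# Hedged sketch: the env= selector is modeled on the example= form
# shown earlier; consult your SimpleTuner version's documentation.
simpletuner train env=acestep-training-demo
```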
This command tells SimpleTuner to look for config.json inside config/acestep-training-demo/.
💡 Tip (Continue Training): To continue fine-tuning from an existing LoRA (e.g. the official ACE-Step checkpoints or community adapters), use the `--init_lora` option:
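In config.json terms, this could be sketched as follows (the adapter path is a placeholder for your own checkpoint):

```json
{
  "init_lora": "output/previous-run/pytorch_lora_weights.safetensors"
}
```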
Training the Lyrics Embedder (upstream-style)¶
ℹ️ Version note: `lyrics_embedder_train` currently applies to the ACE-Step v1 training path. The v1.5 forward-compatible LoRA path in SimpleTuner is decoder-only.
The upstream ACE-Step trainer fine-tunes the lyrics embedder alongside the denoiser. To mirror that behaviour in SimpleTuner (full or standard LoRA only):
- Enable it: `lyrics_embedder_train: true`
- Optional overrides (otherwise the main optimizer/scheduler are reused): `lyrics_embedder_lr`, `lyrics_embedder_optimizer`, `lyrics_embedder_lr_scheduler`
Example snippet:
View example config
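A minimal sketch combining the options listed above (the learning rate, optimizer, and scheduler values are illustrative assumptions; use whatever names your main config already accepts):

```json
{
  "lyrics_embedder_train": true,
  "lyrics_embedder_lr": 1e-5,
  "lyrics_embedder_optimizer": "adamw_bf16",
  "lyrics_embedder_lr_scheduler": "constant"
}
```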
Embedder weights are checkpointed with LoRA saves and restored on resume.
Troubleshooting¶
- Validation Errors: Ensure you are not trying to use image-centric validation features like `num_validation_images` > 1 (conceptually mapped to batch size for audio) or image-based metrics (CLIP score).
- Memory Issues: If running OOM, try reducing `train_batch_size` or enabling `gradient_checkpointing`.
Migrating from Upstream Trainer¶
If you are coming from the original ACE-Step training scripts, here is how the parameters map to SimpleTuner's config.json:
| Upstream Parameter | SimpleTuner config.json | Default / Notes |
|---|---|---|
| `--learning_rate` | `learning_rate` | 1e-4 |
| `--num_workers` | `dataloader_num_workers` | 8 |
| `--max_steps` | `max_train_steps` | 2000000 |
| `--every_n_train_steps` | `checkpointing_steps` | 2000 |
| `--precision` | `mixed_precision` | "fp16" or "bf16" (use "no" for fp32) |
| `--accumulate_grad_batches` | `gradient_accumulation_steps` | 1 |
| `--gradient_clip_val` | `max_grad_norm` | 0.5 |
| `--shift` | `flow_schedule_shift` | 3.0 (Specific to ACE-Step) |
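As a worked example, an upstream run launched with the defaults from the table above maps onto this config.json fragment (every key and value is taken directly from the mapping; nothing here is new):

```json
{
  "learning_rate": 1e-4,
  "dataloader_num_workers": 8,
  "max_train_steps": 2000000,
  "checkpointing_steps": 2000,
  "mixed_precision": "bf16",
  "gradient_accumulation_steps": 1,
  "max_grad_norm": 0.5,
  "flow_schedule_shift": 3.0
}
```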
Converting Raw Data¶
If you have raw audio/text/lyrics files and want to use the Hugging Face dataset format (as used by the upstream convert2hf_dataset.py tool), you can use the resulting dataset directly in SimpleTuner.
The upstream converter produces a dataset with tags and norm_lyrics columns. To use these, configure your backend like this:
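A plausible backend sketch, modeled on the Hugging Face demo backend earlier in this guide. The dataset name is a placeholder, and whether the `huggingface` caption strategy picks up the `tags` and `norm_lyrics` columns automatically (or needs explicit column-mapping keys) depends on your SimpleTuner version; check DATALOADER.md before relying on this:

```json
[
  {
    "id": "converted-audio-data",
    "type": "huggingface",
    "dataset_type": "audio",
    "dataset_name": "your-username/your-converted-dataset",
    "metadata_backend": "huggingface",
    "caption_strategy": "huggingface",
    "cache_dir_vae": "cache/vae/{model_family}/converted-audio-data"
  }
]
```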