HeartMuLa Quickstart¶
In this example, we'll be training the HeartMuLa oss 3B audio generation model.
Overview¶
HeartMuLa is a 3B parameter autoregressive transformer that predicts discrete audio tokens from tags and lyrics. The tokens are decoded with HeartCodec to produce waveforms.
Hardware Requirements¶
HeartMuLa is a 3B parameter model, making it relatively lightweight compared to large image generation models like Flux.
- Minimum: NVIDIA GPU with 12GB+ VRAM (e.g., 3060, 4070).
- Recommended: NVIDIA GPU with 24GB+ VRAM (e.g., 3090, 4090, A10G) for larger batch sizes.
- Mac: Supported via MPS on Apple Silicon (Requires ~36GB+ Unified Memory).
Storage Requirements¶
â ī¸ Token Dataset Warning: HeartMuLa trains on precomputed audio tokens. SimpleTuner does not build tokens during training, so your dataset must supply
audio_tokensoraudio_tokens_pathmetadata. Token files can be large, so plan disk space accordingly.đĄ Tip: Using
int8-quantoquantization allows training on GPUs with less VRAM (e.g., 12GB-16GB) with minimal quality loss.
Prerequisites¶
Ensure you have a working Python 3.10+ environment.
Configuration¶
It is recommended to keep your configurations organized. We'll create a dedicated folder for this demo.
Critical Settings¶
Create config/heartmula-training-demo/config.json with these values:
View example config
{
"model_family": "heartmula",
"model_type": "lora",
"model_flavour": "3b",
"pretrained_model_name_or_path": "HeartMuLa/HeartMuLa-oss-3B",
"resolution": 0,
"mixed_precision": "bf16",
"base_model_precision": "int8-quanto",
"data_backend_config": "config/heartmula-training-demo/multidatabackend.json"
}
Validation Settings¶
Add these to your config.json to monitor progress:
validation_prompt: Tags or a text description of the audio (e.g., "Upbeat pop with bright synths").validation_lyrics: (Optional) Lyrics for the model to sing. Use an empty string for instrumentals.validation_audio_duration: Duration in seconds for validation clips (default: 30.0).validation_guidance: Guidance scale (start around 1.5 - 3.0).validation_step_interval: How often to generate samples (e.g., every 100 steps).
Advanced Experimental Features¶
Show advanced experimental details
SimpleTuner includes experimental features that can significantly improve training stability and performance. * **[Scheduled Sampling (Rollout)](../experimental/SCHEDULED_SAMPLING.md):** reduces exposure bias and improves output quality by letting the model generate its own inputs during training. > â ī¸ These features increase the computational overhead of training.Dataset Configuration¶
HeartMuLa requires an audio-specific dataset with precomputed tokens.
Each sample must provide:
tags(string)lyrics(string; can be empty)audio_tokensoraudio_tokens_path
Token arrays must be 2D with shape [frames, num_codebooks] or [num_codebooks, frames].
đĄ Note: HeartMuLa does not use a separate text encoder, so a text-embeds backend is not required.
Option 1: Hugging Face Dataset (Tokens in Columns)¶
Create config/heartmula-training-demo/multidatabackend.json:
View example config
Make sure your dataset includes audio_tokens or audio_tokens_path columns alongside the text fields.
Option 2: Local Audio Files + Token Metadata¶
Create config/heartmula-training-demo/multidatabackend.json:
View example config
Ensure your metadata backend supplies audio_tokens or audio_tokens_path for every sample.
Data Structure¶
Place your audio files in datasets/my_audio_files. SimpleTuner supports a wide range of formats including:
- Lossless:
.wav,.flac,.aiff,.alac - Lossy:
.mp3,.ogg,.m4a,.aac,.wma,.opus
âšī¸ Note: To support formats like MP3, AAC, and WMA, you must have FFmpeg installed on your system.
For tags and lyrics, place corresponding text files next to your audio files if you use caption_strategy: textfile:
- Audio:
track_01.wav - Tags (Prompt):
track_01.txt(Contains the text description, e.g., "A slow jazz ballad") - Lyrics (Optional):
track_01.lyrics(Contains the lyrics text)
Provide token arrays via metadata (for example, audio_tokens_path entries pointing at .npy or .npz files).
Example dataset layout
â ī¸ Note on Lyrics: HeartMuLa expects a lyrics string for every sample. For instrumental data, provide an empty string rather than omitting the field.
Training¶
Start the training run by specifying your environment:
This command tells SimpleTuner to look for config.json inside config/heartmula-training-demo/.
đĄ Tip (Continue Training): To continue fine-tuning from an existing LoRA, use the
--init_loraoption:
Troubleshooting¶
- Validation Errors: Ensure you are not trying to use image-centric validation features like
num_validation_images> 1 (conceptually mapped to batch size for audio) or image-based metrics (CLIP score). - Memory Issues: If running OOM, try reducing
train_batch_sizeor enablinggradient_checkpointing.