Cloud Training System¶

Status: Experimental

Available in: Web UI (Cloud tab)

SimpleTuner's cloud training system enables you to run training jobs on cloud GPU providers without setting up your own infrastructure. The system is designed to be pluggable, allowing multiple providers to be added over time.

Overview¶

The cloud training system provides:

Unified job tracking - Track both local and cloud training jobs in one place
Automatic data packaging - Local datasets are packaged and uploaded automatically
Results delivery - Trained models can be sent to HuggingFace, S3, or downloaded locally
Cost tracking - Monitor spending and job costs per provider with spending limits
Config snapshots - Optionally version-control your training configs with git

Key Concepts¶

Before using cloud training, understand these three things:

1. What Happens to Your Data¶

When you submit a cloud job:

Datasets are packaged - Local datasets (type: "local") are zipped and you'll see a summary
Uploaded to provider - The zip goes directly to the cloud provider after consenting
Training runs - The model may have to download before your samples are trained on cloud GPUs
Data is deleted - After training, uploaded data is removed from the provider servers and your model is delivered

Upload limit (Replicate): Packaged archives must be 100 MiB or smaller. Larger uploads are blocked before submission.

Security notes: - Your API token never leaves your machine - Sensitive files (.env, .git, credentials) are automatically excluded - You review and consent to uploads before each job

2. How You Receive Trained Models¶

Training produces a model that needs somewhere to go. Configure one of:

Destination	Setup	Best For
HuggingFace Hub	Set `HF_TOKEN` env var, enable in Publishing tab	Sharing models, easy access
Local Download	Set webhook URL, expose server via ngrok	Privacy, local workflows
S3 Storage	Configure endpoint in Publishing tab	Team access, archival

See Receiving Trained Models for step-by-step setup.

3. Cost Model¶

Replicate bills per second of GPU time:

Hardware	VRAM	Cost	Typical LoRA (2000 steps)
L40S	48GB	~$3.50/hr	$5-15

Billing starts when training begins and stops when it completes or fails.

Protect yourself: - Set a spending limit in Cloud settings - Cost estimates shown before every submission - Cancel running jobs anytime (you pay for time used)

See Costs for pricing and limits.

Architecture¶

┌─────────────────────────────────────────────────────────────────┐
│                        Web UI (Cloud Tab)                       │
├─────────────────────────────────────────────────────────────────┤
│  Job List  │  Metrics/Charts  │  Actions/Config  │  Job Details │
└─────────────────────────────────────────────────────────────────┘
                               │
                    ┌──────────┴──────────┐
                    │   Cloud API Routes  │
                    │   /api/cloud/*      │
                    └──────────┬──────────┘
                               │
         ┌─────────────────────┼─────────────────────┐
         │                     │                     │
┌────────▼────────┐   ┌────────▼────────┐   ┌────────▼────────┐
│    JobStore     │   │ Upload Service  │   │    Provider     │
│  (Persistence)  │   │ (Data Packaging)│   │   Clients       │
└─────────────────┘   └─────────────────┘   └─────────────────┘
                                                     │
                             ┌───────────────────────┤
                             │                       │
                   ┌─────────▼─────────┐   ┌─────────▼─────────┐
                   │     Replicate     │   │  SimpleTuner.io   │
                   │    Cog Client     │   │   (Coming Soon)   │
                   └───────────────────┘   └───────────────────┘

Supported Providers¶

Provider	Status	Features
Replicate	Supported	Cost tracking, live logs, webhooks
Worker Orchestration	Supported	Self-hosted distributed workers, any GPU
SimpleTuner.io	Coming Soon	Managed training service by the SimpleTuner team

Worker Orchestration¶

For self-hosted distributed training across multiple machines, see the Worker Orchestration Guide. Workers can run on:

On-premise GPU servers
Cloud VMs (any provider)
Spot instances (RunPod, Vast.ai, Lambda Labs)

Workers register with your SimpleTuner orchestrator and receive jobs automatically.

Data Flow¶

Submitting a Job¶

Config Preparation - Your training config is serialized
Data Packaging - Local datasets (those with type: "local") are zipped
Upload - The zip is uploaded to Replicate's file hosting
Submission - Job is submitted to the cloud provider
Tracking - Job status is polled and updated in real-time

Receiving Results¶

Results can be delivered via:

HuggingFace Hub - Push trained model to your HuggingFace account
S3-Compatible Storage - Upload to any S3 endpoint (AWS, MinIO, etc.)
Local Download - SimpleTuner includes a built-in S3-compatible endpoint that receives uploads locally

When you submit a cloud job, SimpleTuner may upload:

Training datasets - Images/files from datasets with type: "local"
Configuration - Your training parameters (learning rate, model settings, etc.)
Captions/metadata - Any text files associated with your datasets

Data is uploaded directly to the cloud provider (e.g., Replicate's file hosting). It is not routed through SimpleTuner's servers.

In the Cloud tab, you can configure data upload behavior:

Setting	Behavior
Always Ask	Show a confirmation dialog listing datasets before each upload
Always Allow	Skip confirmation for trusted workflows
Never Upload	Disable cloud training (local only)

Local S3 Endpoint¶

SimpleTuner includes a built-in S3-compatible endpoint for receiving trained models:

PUT /api/cloud/storage/{bucket}/{key}
GET /api/cloud/storage/{bucket}/{key}
GET /api/cloud/storage/{bucket}  (list objects)
GET /api/cloud/storage  (list buckets)

Files are stored to ~/.simpletuner/cloud_outputs/ by default.

You can configure the credentials manually; if you don't, ephemeral credentials are auto-generated for each training job, which is the recommended approach.

This allows "download only" mode where you: 1. Set a webhook URL pointing to your local SimpleTuner instance 2. SimpleTuner auto-configures the S3 publishing settings 3. Trained models are uploaded back to your machine

Note: You'll need to expose your local SimpleTuner instance via ngrok, cloudflared, or similar for the cloud provider to reach it.

Adding New Providers¶

The cloud system is designed for extensibility. To add a new provider:

Create a new client class implementing CloudTrainerService:

from .base import CloudTrainerService, CloudJobInfo, CloudJobStatus

class NewProviderClient(CloudTrainerService):
    @property
    def provider_name(self) -> str:
        return "new_provider"

    @property
    def supports_cost_tracking(self) -> bool:
        return True  # or False

    @property
    def supports_live_logs(self) -> bool:
        return True  # or False

    async def validate_credentials(self) -> Dict[str, Any]:
        # Validate API key and return user info
        ...

    async def list_jobs(self, limit: int = 50) -> List[CloudJobInfo]:
        # List recent jobs from the provider
        ...

    async def run_job(self, config, dataloader, ...) -> CloudJobInfo:
        # Submit a new training job
        ...

    async def cancel_job(self, job_id: str) -> bool:
        # Cancel a running job
        ...

    async def get_job_logs(self, job_id: str) -> str:
        # Fetch logs for a job
        ...

    async def get_job_status(self, job_id: str) -> CloudJobInfo:
        # Get current status of a job
        ...

Register the provider in the cloud routes
Add UI elements for the new provider tab

Files & Locations¶

Path	Description
`~/.simpletuner/cloud/`	Cloud-related state and job history
`~/.simpletuner/cloud/job_history.json`	Unified job tracking database
`~/.simpletuner/cloud/provider_configs/`	Per-provider configuration
`~/.simpletuner/cloud_outputs/`	Local S3 endpoint storage

Troubleshooting¶

"REPLICATE_API_TOKEN not set"¶

Set the environment variable before starting SimpleTuner:

export REPLICATE_API_TOKEN="r8_..."
simpletuner --webui

Data upload fails¶

Check your internet connection
Verify the dataset paths exist
Look for errors in the browser console
Ensure you have adequate credits and permissions on the cloud provider

Webhook not receiving results¶

Ensure your local instance is publicly accessible
Check that the webhook URL is correct
Verify firewall rules allow incoming connections

Current Limitations¶

The cloud training system is designed for single-shot training jobs. The following features are not currently supported:

Workflow / Pipeline Jobs (DAGs)¶

SimpleTuner does not support job dependencies or multi-step workflows where one job's output feeds into another. Each job is independent and self-contained.

If you need workflows: - Use external orchestration tools (Airflow, Prefect, Dagster) - Chain jobs via the REST API from your pipeline - See ENTERPRISE.md for an Airflow integration example

Resuming Training Runs¶

There is no built-in support for resuming a training run that was interrupted, failed, or stopped early. If a job fails or is cancelled: - You must resubmit from the beginning - No automatic checkpoint recovery from cloud storage

Workarounds: - Configure frequent HuggingFace Hub pushes (--push_checkpoints_to_hub) to save intermediate checkpoints - Implement your own checkpoint management by downloading outputs and re-uploading as the starting point for a new job - For critical long-running jobs, consider breaking into smaller training segments

These limitations may be addressed in future releases.

Cloud Training System¶

Overview¶

Key Concepts¶

1. What Happens to Your Data¶

2. How You Receive Trained Models¶

3. Cost Model¶

Architecture¶

Supported Providers¶

Worker Orchestration¶

Data Flow¶

Submitting a Job¶

Receiving Results¶

Local S3 Endpoint¶

Adding New Providers¶

Files & Locations¶

Troubleshooting¶

"REPLICATE_API_TOKEN not set"¶

Data upload fails¶

Webhook not receiving results¶

Current Limitations¶

Workflow / Pipeline Jobs (DAGs)¶

Resuming Training Runs¶

See Also¶

Cloud Training¶

Multi-User Features (applies to local and cloud)¶

General¶

Cloud Training System¶

Overview¶

Key Concepts¶

1. What Happens to Your Data¶

2. How You Receive Trained Models¶

3. Cost Model¶

Architecture¶

Supported Providers¶

Worker Orchestration¶

Data Flow¶

Submitting a Job¶

Receiving Results¶

Data Privacy & Consent¶

Consent Settings¶

Local S3 Endpoint¶

Adding New Providers¶

Files & Locations¶

Troubleshooting¶

"REPLICATE_API_TOKEN not set"¶

Data upload fails¶

Webhook not receiving results¶

Current Limitations¶

Workflow / Pipeline Jobs (DAGs)¶

Resuming Training Runs¶

See Also¶

Cloud Training¶

Multi-User Features (applies to local and cloud)¶

General¶