Cloud Training Tutorial¶
This guide walks through running SimpleTuner training jobs on cloud GPU infrastructure. It covers both the Web UI and REST API workflows.
Prerequisites¶
- SimpleTuner installed and server running (see the local API tutorial)
- Datasets staged locally with captions (same dataset requirements as local training)
- A cloud provider account (see Supported Providers)
- For API usage: a shell with curl and jq
Provider Setup¶
Cloud training requires credentials for your chosen provider. Follow the setup guide for your provider:
| Provider | Setup Guide |
|---|---|
| Replicate | REPLICATE.md |
After completing provider setup, return here to submit jobs.
Quick Start¶
With your provider configured:
- Open http://localhost:8001 and go to the Cloud tab
- Verify your credentials in Settings (gear icon) → Validate
- Configure your training in the Model/Training/Dataloader tabs
- Click Train in Cloud
- Review the upload summary and click Submit
Upload limit (Replicate): Packaged archives must be 100 MiB or smaller. Larger uploads are blocked before submission.
Receiving Trained Models¶
After training completes, your model needs a destination. Configure one of these before your first job.
Option 1: HuggingFace Hub (Recommended)¶
Push directly to your HuggingFace account:
- Get a HuggingFace token with write access
- Set the environment variable:
- In the Publishing tab, enable "Push to Hub" and set your repo name
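A minimal sketch of the environment-variable step, assuming the server picks up the token from HF_TOKEN, the variable read by the huggingface_hub library. Both the variable name (as it applies to SimpleTuner) and the token value are assumptions; substitute your own token.

```shell
# Assumption: SimpleTuner reads the standard huggingface_hub variable HF_TOKEN.
# The value below is a placeholder; use your own token with write access.
export HF_TOKEN="hf_your_token_here"
```

Export the variable in the same shell that starts the SimpleTuner server so the server process inherits it.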
Option 2: Local Download via Webhook¶
Have models upload back to your machine. Requires exposing your server to the internet.
- Start a tunnel:
- Copy the public URL (e.g., https://abc123.ngrok.io)
- In Cloud tab → Settings → Webhook URL, paste the URL
- Models land in ~/.simpletuner/cloud_outputs/
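A minimal sketch of the tunnel step, assuming ngrok (suggested by the example URL) is installed and authenticated, and that the server listens on port 8001 as elsewhere in this guide. The helper name start_tunnel is hypothetical; any tunnelling tool that exposes the port over HTTPS should work the same way.

```shell
# Assumption: ngrok is installed and authenticated.
# 8001 is the port the SimpleTuner server uses in this guide.
SIMPLETUNER_PORT=8001

# start_tunnel is a hypothetical helper, not part of SimpleTuner.
# It runs in the foreground and prints the public forwarding URL.
start_tunnel() {
  ngrok http "$SIMPLETUNER_PORT"
}

# Usage: start_tunnel
```

Leave the tunnel running for the whole job; the trained model cannot be delivered if the public URL goes away first.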
Option 3: External S3¶
Upload to any S3-compatible endpoint (AWS S3, MinIO, Backblaze B2, Cloudflare R2):
- In the Publishing tab, configure S3 settings
- Provide endpoint, bucket, access key, secret key
Web UI Workflow¶
Submitting Jobs¶
- Configure your training in the Model/Training/Dataloader tabs
- Navigate to the Cloud tab and select your provider
- Click Train in Cloud to open the pre-submit dialog
- Review the upload summary—local datasets will be packaged and uploaded
- Optionally set a run name for tracking
- Click Submit
Monitoring Jobs¶
The job list shows all cloud and local jobs with:
- Status indicator: Queued → Running → Completed/Failed
- Live progress: Training step, loss values (when available)
- Cost tracking: Estimated cost based on GPU time
Click a job to see details:
- Job configuration snapshot
- Real-time logs (click View Logs)
- Actions: Cancel, Delete (after completion)
Settings Panel¶
Click the gear icon to access:
- API Key validation and account status
- Webhook URL for local model delivery
- Cost limits to prevent runaway spending
- Hardware info (GPU type, cost per hour)
API Workflow¶
Submit a Job¶
curl -s -X POST 'http://localhost:8001/api/cloud/jobs/submit?provider=PROVIDER' \
  -H 'Content-Type: application/json' \
  -d '{
    "config_name_to_load": "my-training-config",
    "tracker_run_name": "api-test-run"
  }' | jq
Replace PROVIDER with your provider name (e.g., replicate).
Or submit with inline config:
curl -s -X POST 'http://localhost:8001/api/cloud/jobs/submit?provider=PROVIDER' \
  -H 'Content-Type: application/json' \
  -d '{
    "config": {
      "--model_family": "flux",
      "--model_type": "lora",
      "--pretrained_model_name_or_path": "black-forest-labs/FLUX.1-dev",
      "--output_dir": "/outputs/flux-lora",
      "--max_train_steps": 1000,
      "--lora_rank": 16
    },
    "dataloader_config": [
      {
        "id": "training-images",
        "type": "local",
        "dataset_type": "image",
        "instance_data_dir": "/data/datasets/my-dataset",
        "caption_strategy": "textfile",
        "resolution": 1024
      }
    ]
  }' | jq
Monitor Job Status¶
# Get job details
curl -s http://localhost:8001/api/cloud/jobs/JOB_ID | jq
# List all jobs
curl -s 'http://localhost:8001/api/cloud/jobs?limit=10' | jq
# Sync status of active jobs from provider
curl -s 'http://localhost:8001/api/cloud/jobs?sync_active=true' | jq
Fetch Job Logs¶
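A minimal sketch based on the GET /api/cloud/jobs/{id}/logs endpoint listed in the API Reference. The helper name fetch_job_logs is hypothetical, and the response format is not documented here, so the output is left unparsed.

```shell
# fetch_job_logs is a hypothetical helper, not part of SimpleTuner.
# Argument: the job ID returned when the job was submitted.
fetch_job_logs() {
  curl -s "http://localhost:8001/api/cloud/jobs/$1/logs"
}

# Usage: fetch_job_logs JOB_ID
```

If the endpoint returns JSON, pipe the output through jq as in the other examples.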
Cancel a Running Job¶
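A minimal sketch based on the POST /api/cloud/jobs/{id}/cancel endpoint listed in the API Reference. The helper name cancel_job is hypothetical, and the sketch assumes the endpoint needs no request body.

```shell
# cancel_job is a hypothetical helper, not part of SimpleTuner.
# Argument: the ID of a job that is still queued or running.
# Assumption: the cancel endpoint takes no request body.
cancel_job() {
  curl -s -X POST "http://localhost:8001/api/cloud/jobs/$1/cancel"
}

# Usage: cancel_job JOB_ID
```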
Delete a Completed Job¶
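A minimal sketch based on the DELETE /api/cloud/jobs/{id} endpoint listed in the API Reference. The helper name delete_job is hypothetical.

```shell
# delete_job is a hypothetical helper, not part of SimpleTuner.
# Argument: the ID of a job that has already finished.
delete_job() {
  curl -s -X DELETE "http://localhost:8001/api/cloud/jobs/$1"
}

# Usage: delete_job JOB_ID
```

As the Web UI section notes, Delete is available after completion, so cancel a running job first.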
CI/CD Integration¶
Idempotent Job Submission¶
Prevent duplicate jobs with idempotency keys:
curl -s -X POST 'http://localhost:8001/api/cloud/jobs/submit?provider=PROVIDER' \
  -H 'Content-Type: application/json' \
  -d '{
    "config_name_to_load": "my-config",
    "idempotency_key": "ci-build-12345"
  }' | jq
If the same key is submitted again within 24 hours, you get back the existing job instead of creating a duplicate.
GitHub Actions Example¶
name: Cloud Training
on:
  push:
    branches: [main]
    paths:
      - 'training-configs/**'
jobs:
  train:
    runs-on: ubuntu-latest
    # Job-level env so that every step can reach the server, not only the first
    env:
      SIMPLETUNER_URL: ${{ secrets.SIMPLETUNER_URL }}
    steps:
      - uses: actions/checkout@v4
      - name: Submit Training Job
        run: |
          RESPONSE=$(curl -s -X POST "$SIMPLETUNER_URL/api/cloud/jobs/submit?provider=replicate" \
            -H 'Content-Type: application/json' \
            -d '{
              "config_name_to_load": "production-lora",
              "idempotency_key": "gh-${{ github.sha }}",
              "tracker_run_name": "gh-run-${{ github.run_number }}"
            }')
          JOB_ID=$(echo "$RESPONSE" | jq -r '.job_id')
          echo "Submitted job: $JOB_ID"
          # Make JOB_ID available to the following steps
          echo "JOB_ID=$JOB_ID" >> "$GITHUB_ENV"
      - name: Wait for Completion
        run: |
          while true; do
            STATUS=$(curl -s "$SIMPLETUNER_URL/api/cloud/jobs/$JOB_ID" | jq -r '.job.status')
            echo "Job status: $STATUS"
            case "$STATUS" in
              completed) exit 0 ;;
              failed|cancelled) exit 1 ;;
              *) sleep 60 ;;
            esac
          done
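The wait loop in the workflow above polls until the job reaches a terminal state, so a job that never leaves the queue keeps the runner busy until the CI timeout. A bounded variant might look like the sketch below. The helper name wait_for_job and its defaults (120 attempts, 60 seconds apart) are assumptions chosen for illustration; the status values and the .job.status field are taken from the workflow itself.

```shell
# wait_for_job is a hypothetical helper, not part of SimpleTuner.
# Arguments: job ID, max attempts (default 120), seconds between polls (default 60).
# Expects SIMPLETUNER_URL to be set, as in the workflow above.
# Returns 0 on completed, 1 on failed/cancelled, 2 if the attempts run out.
wait_for_job() {
  job_id=$1
  max_attempts=${2:-120}
  interval=${3:-60}
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    status=$(curl -s "$SIMPLETUNER_URL/api/cloud/jobs/$job_id" | jq -r '.job.status')
    echo "Job status: $status"
    case "$status" in
      completed) return 0 ;;
      failed|cancelled) return 1 ;;
    esac
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  echo "Timed out waiting for job $job_id" >&2
  return 2
}

# Usage inside a run: step
#   wait_for_job "$JOB_ID" 120 60
```

The distinct return code for a timeout lets the pipeline tell a job that failed apart from one that is still running on the provider.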
API Key Authentication¶
For automated pipelines, create an API key instead of relying on session authentication.
Via UI: Cloud tab → Settings → API Keys → Create New Key
Via API:
curl -s -X POST 'http://localhost:8001/api/cloud/auth/api-keys' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_SESSION_TOKEN' \
  -d '{
    "name": "ci-pipeline",
    "expires_days": 90,
    "scoped_permissions": ["job.submit", "job.view.own"]
  }'
The full key is only returned once. Store it securely.
Using an API key:
curl -s -X POST 'http://localhost:8001/api/cloud/jobs/submit?provider=PROVIDER' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer stk_abc123...' \
  -d '{...}'
Scoped permissions:
| Permission | Description |
|---|---|
| job.submit | Submit new jobs |
| job.view.own | View own jobs |
| job.cancel.own | Cancel own jobs |
| job.view.all | View all jobs (admin) |
Troubleshooting¶
For provider-specific issues (credentials, queuing, hardware), see your provider's documentation (for Replicate, REPLICATE.md).
General Issues¶
Data Upload Fails
- Verify dataset paths exist and are readable
- Check available disk space for zip packaging
- Look for errors in the browser console or API response
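The path check can be scripted before submitting. The sketch below uses a hypothetical helper, check_dataset_dir, that confirms a dataset directory exists and is readable, then compares its raw size with the 100 MiB Replicate archive limit mentioned in the Quick Start. The size comparison is only a rough proxy, since the limit applies to the packaged archive and not to the raw files.

```shell
# check_dataset_dir is a hypothetical helper, not part of SimpleTuner.
# Argument: path to a dataset directory.
# Returns 0 if usable, 1 if missing or unreadable, 2 if the raw size
# exceeds 100 MiB (the packaged archive may still be smaller or larger).
check_dataset_dir() {
  dir=$1
  limit_kib=$((100 * 1024))
  if [ ! -d "$dir" ] || [ ! -r "$dir" ]; then
    echo "Dataset path missing or unreadable: $dir" >&2
    return 1
  fi
  # du -sk reports the total size in KiB as the first field
  size_kib=$(du -sk "$dir" | awk '{print $1}')
  if [ "$size_kib" -gt "$limit_kib" ]; then
    echo "Dataset is ${size_kib} KiB raw; the archive may exceed 100 MiB" >&2
    return 2
  fi
  echo "OK: $dir (${size_kib} KiB)"
}

# Usage: check_dataset_dir /data/datasets/my-dataset
```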
Webhook Not Receiving Events
- Ensure your local instance is publicly accessible (tunnel running)
- Verify the webhook URL is correct (including https://)
- Check SimpleTuner's terminal output for webhook handling errors
API Reference¶
Provider-Agnostic Endpoints¶
| Endpoint | Method | Description |
|---|---|---|
| /api/cloud/jobs | GET | List jobs with optional filters |
| /api/cloud/jobs/submit | POST | Submit a new training job |
| /api/cloud/jobs/sync | POST | Sync jobs from provider |
| /api/cloud/jobs/{id} | GET | Get job details |
| /api/cloud/jobs/{id}/logs | GET | Fetch job logs |
| /api/cloud/jobs/{id}/cancel | POST | Cancel a running job |
| /api/cloud/jobs/{id} | DELETE | Delete a completed job |
| /api/metrics | GET | Get job and cost metrics |
| /api/cloud/metrics/cost-limit | GET | Get current cost limit status |
| /api/cloud/providers/{provider} | PUT | Update provider settings |
| /api/cloud/storage/{bucket}/{key} | PUT | S3-compatible upload endpoint |
For provider-specific endpoints, see the Replicate API Reference.
For full schema details, see the OpenAPI docs at http://localhost:8001/docs.
See Also¶
- README.md – Architecture overview and provider status
- REPLICATE.md – Replicate provider setup and details
- ENTERPRISE.md – SSO, approvals, and governance
- End-to-end cloud operations tutorial – Production deployment and monitoring
- End-to-end local API Tutorial – Complete local training via API
- Dataloader Configuration – Dataset setup reference