Worker Orchestration¶
SimpleTuner's worker orchestration lets you distribute training jobs across multiple GPU machines. Workers register with a central orchestrator, receive job dispatch events in real time, and report status back.
Overview¶
The orchestrator/worker architecture enables:
- Distributed training - Run jobs on any machine with GPUs, anywhere
- Auto-discovery - Workers self-register with GPU capabilities
- Real-time dispatch - Jobs dispatched via SSE (Server-Sent Events)
- Mixed fleet - Combine cloud-launched ephemeral workers with persistent on-prem machines
- Fault tolerance - Orphaned jobs are automatically requeued
Worker Types¶
| Type | Lifecycle | Use Case |
|---|---|---|
| Ephemeral | Shuts down after job completion | Cloud spot instances (RunPod, Vast.ai) |
| Persistent | Stays online between jobs | On-prem GPUs, reserved instances |
Quick Start¶
1. Start the Orchestrator¶
Run the SimpleTuner server on your central machine:
For production, enable SSL:
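A minimal sketch of both launch commands, assuming the panel is started through the `simpletuner` CLI (the `server` subcommand name shown here is an assumption; only the `--host`/`--ssl` flags are referenced elsewhere in this guide):

```bash
# Start the orchestrator panel, listening on all interfaces
# (hypothetical subcommand; check `simpletuner --help` for the exact name)
simpletuner server --host 0.0.0.0 --port 8001

# For production, enable SSL:
simpletuner server --host 0.0.0.0 --port 8001 --ssl
```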
2. Create a Worker Token¶
Via Web UI: Administration → Workers → Create Worker
Via API:
```bash
curl -s -X POST http://localhost:8001/api/admin/workers \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "gpu-worker-1",
    "worker_type": "persistent",
    "labels": {"location": "datacenter-a", "gpu_type": "a100"}
  }'
```
Response includes the token (shown only once):
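Roughly of this shape (field names are illustrative; only the `token` value matters, and it cannot be retrieved again later):

```json
{
  "id": "w_abc123",
  "name": "gpu-worker-1",
  "token": "wt_xxxxxxxxxxxx"
}
```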
3. Start the Worker¶
On the GPU machine:
```bash
simpletuner worker \
  --orchestrator-url https://orchestrator.example.com:8001 \
  --worker-token wt_xxxxxxxxxxxx \
  --name gpu-worker-1 \
  --persistent
```
Or via environment variables:
```bash
export SIMPLETUNER_ORCHESTRATOR_URL=https://orchestrator.example.com:8001
export SIMPLETUNER_WORKER_TOKEN=wt_xxxxxxxxxxxx
export SIMPLETUNER_WORKER_NAME=gpu-worker-1
export SIMPLETUNER_WORKER_PERSISTENT=true

simpletuner worker
```
The worker will:
- Connect to the orchestrator
- Report GPU capabilities (auto-detected)
- Enter the job dispatch loop
- Send heartbeats every 30 seconds
4. Submit Jobs to Workers¶
Via Web UI: Configure your training, then click Train in Cloud → select Worker as target.
Via API:
```bash
curl -s -X POST http://localhost:8001/api/queue/submit \
  -H "Content-Type: application/json" \
  -d '{
    "config_name": "my-training-config",
    "target": "worker"
  }'
```
Target options:
| Target | Behavior |
|---|---|
| `worker` | Dispatch only to remote workers |
| `local` | Run on orchestrator's GPUs |
| `auto` | Prefer worker if available, fall back to local |
CLI Reference¶
```
simpletuner worker [OPTIONS]

OPTIONS:
  --orchestrator-url URL   Orchestrator panel URL (or SIMPLETUNER_ORCHESTRATOR_URL)
  --worker-token TOKEN     Authentication token (or SIMPLETUNER_WORKER_TOKEN)
  --name NAME              Worker name (default: hostname)
  --persistent             Stay online between jobs (default: ephemeral)
  -v, --verbose            Enable debug logging
```
Ephemeral vs Persistent Mode¶
Ephemeral (default):
- Worker shuts down after completing one job
- Ideal for cloud spot instances that bill per minute
- Orchestrator cleans up offline ephemeral workers after 1 hour
Persistent (--persistent):
- Worker stays online waiting for new jobs
- Reconnects automatically if connection drops
- Use for on-prem GPUs or reserved instances
Worker Lifecycle¶
```
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│ CONNECTING  │ ──▶  │    IDLE     │ ──▶  │    BUSY     │
└─────────────┘      └─────────────┘      └─────────────┘
                            │                    │
                            ▼                    ▼
                     ┌─────────────┐      ┌─────────────┐
                     │  DRAINING   │      │   OFFLINE   │
                     └─────────────┘      └─────────────┘
```
| Status | Description |
|---|---|
| CONNECTING | Worker establishing connection |
| IDLE | Ready to receive jobs |
| BUSY | Currently running a job |
| DRAINING | Finishing current job, then shutting down |
| OFFLINE | Disconnected (heartbeat timeout) |
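The diagram reads as a small state machine. A hypothetical sketch of the allowed transitions (the real transition rules inside SimpleTuner may differ):

```python
# Hypothetical worker state machine; keys are current states,
# values are the states reachable from them.
ALLOWED_TRANSITIONS = {
    "connecting": {"idle", "offline"},
    "idle": {"busy", "draining", "offline"},
    "busy": {"idle", "draining", "offline"},
    "draining": {"offline"},
    "offline": {"connecting"},
}

def can_transition(current: str, target: str) -> bool:
    """Return True if a worker may move from `current` to `target`."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```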
Health Monitoring¶
The orchestrator monitors worker health:
- Heartbeat interval: 30 seconds (worker → orchestrator)
- Timeout threshold: 120 seconds without heartbeat → mark offline
- Health check loop: Runs every 60 seconds on orchestrator
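The offline check reduces to a timestamp comparison. A minimal sketch of the 120-second rule:

```python
from datetime import datetime, timedelta

# 120 seconds without a heartbeat marks the worker offline
HEARTBEAT_TIMEOUT = timedelta(seconds=120)

def is_offline(last_heartbeat: datetime, now: datetime) -> bool:
    """True once the worker has gone longer than the timeout without a heartbeat."""
    return now - last_heartbeat > HEARTBEAT_TIMEOUT
```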
Handling Failures¶
Worker goes offline during a job:
- Job marked as failed after heartbeat timeout
- If retries remaining (default: 3), job requeued
- Next available worker picks up the job
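The requeue decision is just a retry-budget check; a hypothetical sketch using the default of 3 retries:

```python
DEFAULT_MAX_RETRIES = 3

def should_requeue(attempts_failed: int, max_retries: int = DEFAULT_MAX_RETRIES) -> bool:
    """Requeue a job orphaned by an offline worker while retries remain."""
    return attempts_failed < max_retries
```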
Orchestrator restarts:
- Workers automatically reconnect
- Workers report any in-progress jobs
- Orchestrator reconciles state and resumes
GPU Matching¶
Workers report their GPU capabilities on registration:
```json
{
  "gpu_count": 2,
  "gpu_name": "NVIDIA A100-SXM4-80GB",
  "gpu_vram_gb": 80,
  "accelerator_type": "cuda"
}
```
Jobs can specify GPU requirements:
```bash
curl -s -X POST http://localhost:8001/api/queue/submit \
  -H "Content-Type: application/json" \
  -d '{
    "config_name": "my-config",
    "target": "worker",
    "worker_labels": {"gpu_type": "a100*"}
  }'
```
The scheduler matches jobs to workers based on:
- GPU count requirements
- Label matching (glob patterns supported)
- Worker availability (IDLE status)
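Glob-style label matching can be sketched with Python's `fnmatch` (illustrative only; the scheduler's actual matcher may differ):

```python
from fnmatch import fnmatch

def worker_matches(required_labels: dict, worker_labels: dict) -> bool:
    """Every required label must exist on the worker and match its glob pattern."""
    return all(
        key in worker_labels and fnmatch(worker_labels[key], pattern)
        for key, pattern in required_labels.items()
    )
```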
Labels¶
Labels provide flexible worker selection:
Assign labels on worker creation:
```bash
curl -s -X POST http://localhost:8001/api/admin/workers \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "worker-1",
    "labels": {
      "location": "us-west",
      "gpu_type": "a100",
      "team": "nlp"
    }
  }'
```
Select workers by label:
```bash
# Match workers with team=nlp
curl -s -X POST http://localhost:8001/api/queue/submit \
  -d '{"config_name": "my-config", "worker_labels": {"team": "nlp"}}'

# Match workers with gpu_type starting with "a100"
curl -s -X POST http://localhost:8001/api/queue/submit \
  -d '{"config_name": "my-config", "worker_labels": {"gpu_type": "a100*"}}'
```
Admin Operations¶
List Workers¶
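A sketch using the `GET /api/admin/workers` endpoint from the API Reference; `$ADMIN_TOKEN` is a placeholder for your admin credential:

```bash
curl -s http://localhost:8001/api/admin/workers \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```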
Response:
```json
{
  "workers": [
    {
      "id": "w_abc123",
      "name": "gpu-worker-1",
      "status": "idle",
      "worker_type": "persistent",
      "gpu_count": 2,
      "gpu_name": "A100",
      "labels": {"location": "datacenter-a"},
      "last_heartbeat": "2024-01-15T10:30:00Z"
    }
  ]
}
```
Drain a Worker¶
Gracefully finish current job and prevent new dispatch:
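A sketch using the drain endpoint from the API Reference (`w_abc123` is the worker id returned by the listing call):

```bash
curl -s -X POST http://localhost:8001/api/admin/workers/w_abc123/drain \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```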
The worker will:
- Complete any running job
- Enter DRAINING status
- Refuse new jobs
- Disconnect after job completion (ephemeral) or remain in draining state (persistent)
Rotate Token¶
Regenerate a worker's authentication token:
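A sketch using the token-rotation endpoint from the API Reference (worker id is illustrative):

```bash
curl -s -X POST http://localhost:8001/api/admin/workers/w_abc123/token \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```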
The old token is immediately invalidated. Update the worker with the new token.
Delete a Worker¶
Only works if the worker is offline.
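A sketch using the delete endpoint from the API Reference (worker id is illustrative):

```bash
curl -s -X DELETE http://localhost:8001/api/admin/workers/w_abc123 \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```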
Security¶
Token Authentication¶
- Workers authenticate via the `X-Worker-Token` header
- Tokens are SHA-256 hashed before storage
- Tokens never leave the orchestrator after creation
- Rotate tokens periodically for security
Network Security¶
For production:
- Use the `--ssl` flag or terminate TLS at a reverse proxy
- Restrict worker registration to trusted networks
- Use firewall rules to limit access to the `/api/workers/*` endpoints
Audit Logging¶
All worker actions are logged:
- Registration attempts (success/failure)
- Job dispatch events
- Status transitions
- Token rotations
- Admin operations
See Audit Guide for log access.
Troubleshooting¶
Worker Can't Connect¶
"Connection refused"
- Verify orchestrator URL and port
- Check firewall rules allow inbound connections
- Ensure the orchestrator is running with `--host 0.0.0.0`
"Invalid token" - Token may have been rotated—request a new one - Check for whitespace in token string
"SSL certificate verify failed"
- Use `--ssl-no-verify` for self-signed certs (development only)
- Or add the CA certificate to the system trust store
Worker Goes Offline Unexpectedly¶
Heartbeat timeout (120s)
- Check network stability between worker and orchestrator
- Look for resource exhaustion (CPU/memory) on worker
- Increase `SIMPLETUNER_HEARTBEAT_TIMEOUT` if on an unreliable network
Process crash
- Check worker logs for Python exceptions
- Verify GPU drivers are functioning (`nvidia-smi`)
- Ensure sufficient disk space for training
Jobs Not Dispatching to Workers¶
No idle workers
- Check worker status in the admin panel
- Verify workers are connected and IDLE
- Check for label mismatch between job and workers
GPU requirements not met
- Job requires more GPUs than any worker has
- Adjust `--num_processes` in the training config
API Reference¶
Worker Endpoints (Worker → Orchestrator)¶
| Endpoint | Method | Description |
|---|---|---|
| `/api/workers/register` | POST | Register and report capabilities |
| `/api/workers/stream` | GET | SSE stream for job dispatch |
| `/api/workers/heartbeat` | POST | Periodic keepalive |
| `/api/workers/job/{id}/status` | POST | Report job progress |
| `/api/workers/disconnect` | POST | Graceful shutdown notification |
Admin Endpoints (Requires admin.workers permission)¶
| Endpoint | Method | Description |
|---|---|---|
| `/api/admin/workers` | GET | List all workers |
| `/api/admin/workers` | POST | Create worker token |
| `/api/admin/workers/{id}` | DELETE | Remove worker |
| `/api/admin/workers/{id}/drain` | POST | Drain worker |
| `/api/admin/workers/{id}/token` | POST | Rotate token |
See Also¶
- Enterprise Guide - SSO, quotas, approval workflows
- Job Queue - Queue scheduling and priorities
- Cloud Training - Cloud provider integration
- API Tutorial - Local training via REST API