Job Progress and Sync API¶
This document covers SimpleTuner's mechanisms for monitoring cloud training job progress and keeping local job state synchronized with cloud providers.
Overview¶
SimpleTuner provides multiple approaches to track job status:
| Method | Use Case | Latency | Resource Usage |
|---|---|---|---|
| Inline Progress API | UI polling for running jobs | Low (5s default) | Per-job API calls |
| Job Sync (pull) | Discover jobs from provider | Medium (on-demand) | Batch API call |
| `sync_active` param | Refresh active job statuses | Medium (on-demand) | Per-active-job calls |
| Background Poller | Automatic status updates | Configurable (30s default) | Continuous polling |
| Webhooks | Real-time push notifications | Instant | No polling required |
Inline Progress API¶
Endpoint¶
`GET /api/cloud/jobs/{job_id}/inline-progress`
Purpose¶
Returns compact progress information for a single running job, suitable for displaying inline status updates in a job list without fetching full logs.
Response¶
{
  "job_id": "abc123",
  "stage": "Training",
  "last_log": "Step 1500/5000 - loss: 0.0234",
  "progress": 30.0
}
| Field | Type | Description |
|---|---|---|
| `job_id` | string | The job identifier |
| `stage` | string or null | Current training stage: Preprocessing, Warmup, Training, Validation, Saving checkpoint |
| `last_log` | string or null | Last log line (truncated to 80 chars) |
| `progress` | float or null | Progress percentage (0-100) based on step/epoch parsing |
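As a rough illustration of consuming this response, the sketch below parses the payload into a small Python dataclass and tolerates the nullable fields. The `InlineProgress` class and its `summary` helper are hypothetical names introduced here for illustration; they are not part of SimpleTuner.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class InlineProgress:
    """Hypothetical client-side view of the inline progress response."""

    job_id: str
    stage: Optional[str]
    last_log: Optional[str]
    progress: Optional[float]

    @classmethod
    def from_json(cls, payload: str) -> "InlineProgress":
        data = json.loads(payload)
        progress = data.get("progress")
        return cls(
            job_id=data["job_id"],
            # stage, last_log and progress may all be null in the response
            stage=data.get("stage"),
            last_log=data.get("last_log"),
            progress=float(progress) if progress is not None else None,
        )

    def summary(self) -> str:
        """Build a short label suitable for a job list card."""
        stage = self.stage or "Unknown"
        if self.progress is None:
            return stage
        return f"{stage} ({self.progress:.0f}%)"
```

Calling `InlineProgress.from_json(body).summary()` on the sample response above would give a label combining the stage and the rounded percentage.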
Stage Detection¶
The API parses recent log lines to determine the training stage:
- Preprocessing: Detected when logs contain "preprocessing" or "loading"
- Warmup: Detected when logs contain "warming up" or "warmup"
- Training: Detected when logs contain "step" or "epoch" patterns
- Validation: Detected when logs contain "validat"
- Saving checkpoint: Detected when logs contain "checkpoint"
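The keyword rules above can be sketched as a small matcher. This is an illustrative approximation, not SimpleTuner's implementation: the function name `detect_stage`, the choice to scan from the newest line backwards, and the precedence between overlapping keywords (for example a line containing both "checkpoint" and "step") are all assumptions.

```python
import re
from typing import Iterable, Optional

# (stage, pattern) pairs. The order is an assumption: more specific
# keywords are checked before the generic step/epoch pattern.
_STAGE_PATTERNS = [
    ("Saving checkpoint", re.compile(r"checkpoint", re.IGNORECASE)),
    ("Validation", re.compile(r"validat", re.IGNORECASE)),
    ("Warmup", re.compile(r"warming up|warmup", re.IGNORECASE)),
    ("Preprocessing", re.compile(r"preprocessing|loading", re.IGNORECASE)),
    ("Training", re.compile(r"\b(step|epoch)\b", re.IGNORECASE)),
]


def detect_stage(log_lines: Iterable[str]) -> Optional[str]:
    """Return the stage implied by the most recent matching log line."""
    for line in reversed(list(log_lines)):
        for stage, pattern in _STAGE_PATTERNS:
            if pattern.search(line):
                return stage
    # No recognizable keyword: the API reports a null stage.
    return None
```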
Progress Calculation¶
Progress percentage is extracted from log patterns like:
- step 1500/5000 -> 30%
- epoch 3/10 -> 30%
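A minimal sketch of this extraction, assuming a single regular expression covers both the step and epoch forms. The name `parse_progress`, the clamping to 100, and the rounding to one decimal place are assumptions made for the example.

```python
import re
from typing import Optional

# Matches "step 1500/5000" or "epoch 3/10", case-insensitively.
_PROGRESS_RE = re.compile(r"\b(?:step|epoch)\s+(\d+)\s*/\s*(\d+)", re.IGNORECASE)


def parse_progress(line: str) -> Optional[float]:
    """Return a 0-100 percentage parsed from a log line, or None."""
    match = _PROGRESS_RE.search(line)
    if match is None:
        return None
    current, total = int(match.group(1)), int(match.group(2))
    if total <= 0:
        return None  # avoid division by zero on malformed logs
    # Clamp in case the counter overshoots the announced total.
    return round(min(100.0, 100.0 * current / total), 1)
```

With the sample log line from the response above, `parse_progress("Step 1500/5000 - loss: 0.0234")` yields `30.0`.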
When to Use¶
Use the inline progress API when:
- Displaying compact status in job list cards
- Polling frequently (every 5 seconds) for running jobs only
- Minimizing data transfer per request
Client Example (JavaScript)
async function fetchInlineProgress() {
  const runningJobs = jobs.filter(j => j.status === 'running');
  for (const job of runningJobs) {
    try {
      const response = await fetch(
        `/api/cloud/jobs/${job.job_id}/inline-progress`
      );
      if (response.ok) {
        const data = await response.json();
        // Update job card with progress info
        job.inline_stage = data.stage;
        job.inline_log = data.last_log;
        job.inline_progress = data.progress;
      }
    } catch (error) {
      // Silently ignore - job may have completed
    }
  }
}

// Poll every 5 seconds
setInterval(fetchInlineProgress, 5000);
Job Sync Mechanisms¶
SimpleTuner provides two sync approaches for keeping local job state current with cloud providers.
1. Full Provider Sync¶
Endpoint¶
`POST /api/cloud/jobs/sync`
Purpose¶
Discovers jobs from the cloud provider that may not exist in the local store. This is useful when:
- Jobs were submitted outside SimpleTuner (directly via the Replicate API)
- The local job store was reset or corrupted
- You want to import historical jobs
Response¶
Behavior¶
- Fetches up to 100 recent jobs from Replicate
- For each job:
  - If not in local store: Creates a new `UnifiedJob` record
  - If already in store: Updates status, cost, and timestamps
- Returns count of newly discovered jobs
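The create-or-update behavior above can be sketched as a simple upsert over an in-memory store. `JobRecord` is a hypothetical stand-in for the `UnifiedJob` record, and `sync_provider_jobs` is an illustrative name; the real store and field set may differ.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class JobRecord:
    """Hypothetical, simplified stand-in for a UnifiedJob record."""

    job_id: str
    status: str
    cost_usd: Optional[float] = None
    started_at: Optional[str] = None
    completed_at: Optional[str] = None


def sync_provider_jobs(
    store: Dict[str, JobRecord], provider_jobs: Iterable[JobRecord]
) -> int:
    """Upsert provider jobs into the store; return how many were new."""
    discovered = 0
    for remote in provider_jobs:
        local = store.get(remote.job_id)
        if local is None:
            # Not in the local store: create a new record.
            store[remote.job_id] = remote
            discovered += 1
        else:
            # Already known: refresh status, cost, and timestamps.
            local.status = remote.status
            local.cost_usd = remote.cost_usd
            local.started_at = remote.started_at
            local.completed_at = remote.completed_at
    return discovered
```

Returning only the count of new records mirrors the documented response, which reports newly discovered jobs rather than every job touched.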
Client Example
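As one possible client, the sketch below triggers a sync from Python using only the standard library. The base URL is an assumption about where the server is listening, the helper names are hypothetical, and the response is returned as parsed JSON without assuming its exact fields. Whether the endpoint expects a request body or authentication headers is not covered here.

```python
import json
import urllib.request

# Assumption: adjust to wherever your SimpleTuner server is reachable.
BASE_URL = "http://localhost:8001"


def build_sync_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request for the full provider sync endpoint."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/cloud/jobs/sync",
        method="POST",
        headers={"Accept": "application/json"},
    )


def trigger_sync(base_url: str = BASE_URL, timeout: float = 30.0) -> dict:
    """Send the sync request and return the decoded JSON response."""
    request = build_sync_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```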
Web UI Sync Button¶
The Cloud dashboard includes a sync button for discovering orphaned jobs:
- Click the Sync button in the job list toolbar
- The button shows a loading spinner during sync
- On success, a toast notification shows: "Discovered N jobs from Replicate"
- The job list and metrics automatically refresh
Use Cases:
- Discovering jobs submitted directly via Replicate API or web console
- Recovering after a database reset
- Importing jobs from a shared team Replicate account
The sync button calls POST /api/cloud/jobs/sync internally and then reloads both the job list and dashboard metrics.
2. Active Job Status Sync (sync_active)¶
Endpoint¶
`GET /api/cloud/jobs?sync_active=true`
Purpose¶
Refreshes the status of all active (non-terminal) cloud jobs before returning the job list. This provides up-to-date status without waiting for background polling.
Active States¶
Jobs in these states are considered "active" and will be synced:
- pending - Job submitted but not yet started
- uploading - Data upload in progress
- queued - Waiting in provider queue
- running - Training in progress
Behavior¶
- Before listing jobs, fetches current status for each active cloud job
- Updates local store with:
  - Current status
  - `started_at`/`completed_at` timestamps
  - `cost_usd` (accumulated cost)
  - `error_message` (if failed)
- Returns updated job list
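The behavior above amounts to a selective refresh before listing. The following sketch assumes jobs are plain dictionaries and that a caller-supplied `fetch_status` function (hypothetical) returns the provider's current view of one job. `list_jobs` is an illustrative name, not the server's handler.

```python
from typing import Callable, Dict, List

# Non-terminal states listed in this document.
ACTIVE_STATES = {"pending", "uploading", "queued", "running"}
# Fields refreshed from the provider for each active job.
SYNCED_FIELDS = ("status", "started_at", "completed_at", "cost_usd", "error_message")


def list_jobs(
    store: Dict[str, dict],
    fetch_status: Callable[[str], dict],
    sync_active: bool = False,
) -> List[dict]:
    """Return all jobs, optionally refreshing active ones first."""
    if sync_active:
        for job_id, job in store.items():
            if job.get("status") not in ACTIVE_STATES:
                continue  # terminal jobs are never re-queried
            remote = fetch_status(job_id)
            for field in SYNCED_FIELDS:
                if field in remote:
                    job[field] = remote[field]
    return list(store.values())
```

Because only active jobs trigger a provider call, the cost of a request scales with the number of in-flight jobs rather than the size of the job history.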
Client Example (JavaScript)
// Load jobs with active status sync
async function loadJobs(syncActive = false) {
  const params = new URLSearchParams({
    limit: '50',
    provider: 'replicate',
  });
  if (syncActive) {
    params.set('sync_active', 'true');
  }
  const response = await fetch(`/api/cloud/jobs?${params}`);
  const data = await response.json();
  return data.jobs;
}

// Use sync_active during periodic refresh
setInterval(() => loadJobs(true), 30000);
Comparison: Sync vs sync_active¶
| Feature | `POST /jobs/sync` | `GET /jobs?sync_active=true` |
|---|---|---|
| Discovers new jobs | Yes | No |
| Updates existing jobs | Yes | Yes (active only) |
| Scope | All provider jobs | Only active local jobs |
| Use case | Initial import, recovery | Regular status refresh |
| Performance | Heavier (batch query) | Lighter (selective) |
Background Poller Configuration¶
The background poller automatically syncs active job statuses without client intervention.
Default Behavior¶
- Auto-enabled: If no webhook URL is configured, polling starts automatically
- Default interval: 30 seconds
- Scope: All active cloud jobs
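The enablement rule can be expressed as a small decision function. This is a sketch under assumptions: `resolve_poll_interval` and its parameters are hypothetical names, and an explicit enterprise interval is treated as taking precedence over the webhook check, as this document describes for server startup.

```python
from typing import Optional

DEFAULT_POLL_INTERVAL = 30.0  # seconds, the documented default


def resolve_poll_interval(
    enterprise_interval: Optional[float],
    webhook_url: Optional[str],
) -> Optional[float]:
    """Return the polling interval in seconds, or None if polling is off."""
    if enterprise_interval is not None:
        # Explicit configuration wins over auto-detection.
        return enterprise_interval
    if not webhook_url:
        # No webhook configured: fall back to automatic polling.
        return DEFAULT_POLL_INTERVAL
    # A webhook delivers updates, so polling is not auto-enabled.
    return None
```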
Enterprise Configuration
For production deployments, configure polling via `simpletuner-enterprise.yaml`.
Environment Variables
How It Works¶
- On server startup, `BackgroundTaskManager` checks:
  - If enterprise config explicitly enables polling, use that interval
  - Otherwise, if no webhook is configured, auto-enable with 30s interval
- Every interval, the poller:
  - Lists all jobs with active status
  - Groups by provider
  - Fetches current status from each provider
  - Updates local store
  - Emits SSE events for status changes
  - Updates queue entries for terminal states
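One polling cycle, as described above, might look like the following `asyncio` sketch. It is not the `BackgroundTaskManager` implementation: the dictionary-based store, the per-provider `fetch` coroutines, and the `emit` callback are hypothetical, and the queue-entry update for terminal states is omitted for brevity. The event shape follows the `job_status_changed` events shown in this document.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, Dict, List

ACTIVE_STATES = {"pending", "uploading", "queued", "running"}

# Hypothetical provider hook: takes job IDs, returns {job_id: status}.
ProviderFetch = Callable[[List[str]], Awaitable[Dict[str, str]]]


async def poll_once(
    store: Dict[str, dict],
    providers: Dict[str, ProviderFetch],
    emit: Callable[[dict], None],
) -> int:
    """Run one polling cycle; return the number of status changes."""
    # 1. List active jobs and group them by provider.
    by_provider: Dict[str, List[str]] = defaultdict(list)
    for job_id, job in store.items():
        if job["status"] in ACTIVE_STATES:
            by_provider[job["provider"]].append(job_id)

    changed = 0
    for provider, job_ids in by_provider.items():
        fetch = providers.get(provider)
        if fetch is None:
            continue  # no client registered for this provider
        # 2. Fetch current statuses from the provider.
        statuses = await fetch(job_ids)
        for job_id, status in statuses.items():
            if job_id in store and store[job_id]["status"] != status:
                # 3. Update the local store and announce the change.
                store[job_id]["status"] = status
                emit({"type": "job_status_changed", "job_id": job_id, "status": status})
                changed += 1
    return changed


async def run_poller(store, providers, emit, interval: float = 30.0) -> None:
    """Repeat polling cycles until the task is cancelled."""
    while True:
        await poll_once(store, providers, emit)
        await asyncio.sleep(interval)
```

Emitting events only when a status actually changes keeps SSE traffic proportional to real state transitions rather than to the polling frequency.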
SSE Events
When the background poller detects status changes, it broadcasts SSE events:
// Subscribe to SSE events
const eventSource = new EventSource('/api/events');
eventSource.addEventListener('message', (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'job_status_changed') {
    console.log(`Job ${data.job_id} is now ${data.status}`);
    // Refresh job list
    loadJobs();
  }
});
Programmatic Access
import asyncio

from simpletuner.simpletuner_sdk.server.services.cloud.background_tasks import (
    get_task_manager,
    start_background_tasks,
    stop_background_tasks,
)


async def main():
    # Get the manager
    manager = get_task_manager()

    # Check if running (_running is a private attribute and may change)
    if manager._running:
        print("Background tasks are active")

    # Manual start/stop (usually handled by app lifespan)
    await start_background_tasks()
    await stop_background_tasks()


# await is only valid inside a coroutine, so run the calls in an event loop
asyncio.run(main())
Best Practices¶
1. Choose the Right Sync Strategy¶
| Scenario | Recommended Approach |
|---|---|
| Initial page load | GET /jobs without sync (fast) |
| Periodic refresh (30s) | GET /jobs?sync_active=true |
| User clicks "Refresh" | POST /jobs/sync for discovery |
| Running job details | Inline progress API (5s) |
| Production deployment | Background poller + webhooks |
2. Avoid Over-Polling¶
Example
3. Use SSE for Real-Time Updates¶
Example
Instead of aggressive polling, subscribe to SSE events:
// Combine SSE with conservative polling
const eventSource = new EventSource('/api/events');
eventSource.addEventListener('message', (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'job_status_changed') {
    loadJobs(); // Refresh on status change
  }
});

// Fallback: poll every 30 seconds
setInterval(() => loadJobs(true), 30000);