API Training Tutorial¶

Introduction¶

This guide walks through running SimpleTuner training jobs entirely through the HTTP API while keeping setup and dataset management on the command line. It mirrors the structure of the other tutorials but skips the WebUI onboarding. You will:

install and start the unified server
discover and download the OpenAPI schema
create and update environments with REST calls
validate, launch, and monitor training jobs via /api/training
branch into two proven configurations: a PixArt Sigma 900M full fine-tune and a Flux Kontext LyCORIS LoRA run

Prerequisites¶

Python 3.10–3.13, Git, and pip
SimpleTuner installed in a virtual environment (pip install 'simpletuner[cuda]' or the variant that matches your platform)
CUDA 13 / Blackwell users (NVIDIA B-series GPUs): pip install 'simpletuner[cuda13]' --extra-index-url https://download.pytorch.org/whl/cu130
Access to required Hugging Face repos (huggingface-cli login before pulling gated models)
Datasets staged locally with captions (caption text files for PixArt, paired edit/reference folders for Kontext)
A shell with curl and jq

Start the server¶

From your SimpleTuner checkout (or the environment where the package is installed):

simpletuner server --port 8001

The API lives at http://localhost:8001. Leave the server running while you issue the following commands in another terminal.

Tip: If you have an existing configuration environment ready to train, you can start the server with --env to automatically begin training once the server is fully loaded:
simpletuner server --port 8001 --env my-training-config
This validates your configuration at startup and launches training immediately after the server is ready—useful for unattended or scripted deployments. The --env option works identically to simpletuner train --env.

Configuration & Deployment¶

For production usage, you can configure the bind address and port:

Option	Environment Variable	Default	Description
`--host`	`SIMPLETUNER_HOST`	`0.0.0.0`	Address to bind the server to (use `127.0.0.1` behind reverse proxy)
`--port`	`SIMPLETUNER_PORT`	`8001`	Port to bind the server to

Production Deployment Options (TLS, Reverse Proxy, Systemd, Docker)

For production deployments, it is recommended to use a reverse proxy for TLS termination. #### Nginx Configuration

server {
    listen 443 ssl http2;
    server_name training.example.com;

    # TLS configuration (example using Let's Encrypt paths)
    ssl_certificate /etc/letsencrypt/live/training.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/training.example.com/privkey.pem;

    # WebSocket support for SSE streaming (Critical for real-time logs)
    location /api/training/stream {
        proxy_pass http://127.0.0.1:8001;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        # SSE-specific settings
        proxy_buffering off;
        proxy_read_timeout 86400s;
    }

    # Main application
    location / {
        proxy_pass http://127.0.0.1:8001;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        # Large file uploads for datasets
        client_max_body_size 10G;
        proxy_request_buffering off;
    }
}

#### Caddy Configuration

training.example.com {
    reverse_proxy 127.0.0.1:8001 {
        # SSE streaming support
        flush_interval -1
    }
    # Large file uploads
    request_body {
        max_size 10GB
    }
}

#### systemd Service

[Unit]
Description=SimpleTuner Training Server
After=network.target

[Service]
Type=simple
User=trainer
WorkingDirectory=/home/trainer/simpletuner-workspace
Environment="SIMPLETUNER_HOST=127.0.0.1"
Environment="SIMPLETUNER_PORT=8001"
ExecStart=/home/trainer/simpletuner-workspace/.venv/bin/simpletuner server
Restart=on-failure

[Install]
WantedBy=multi-user.target

#### Docker Compose with Traefik

version: '3.8'
services:
  simpletuner:
    image: ghcr.io/bghira/simpletuner:latest
    command: simpletuner server --host 0.0.0.0 --port 8001
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.simpletuner.rule=Host(`training.example.com`)"
      - "traefik.http.services.simpletuner.loadbalancer.server.port=8001"

Authentication¶

SimpleTuner supports multi-user authentication. On first launch, you'll need to create an admin account.

First-time setup¶

Check if setup is needed:

curl -s http://localhost:8001/api/cloud/auth/setup/status | jq

If needs_setup is true, create the first admin:

curl -s -X POST http://localhost:8001/api/cloud/auth/setup/first-admin \
  -H 'Content-Type: application/json' \
  -d '{
    "email": "[email protected]",
    "username": "admin",
    "password": "your-secure-password"
  }'

API keys¶

For scripted access, generate an API key after logging in:

# Login first (stores session cookie)
curl -s -X POST http://localhost:8001/api/cloud/auth/login \
  -H 'Content-Type: application/json' \
  -c cookies.txt \
  -d '{"username": "admin", "password": "your-secure-password"}'

# Create an API key
curl -s -X POST http://localhost:8001/api/cloud/auth/api-keys \
  -H 'Content-Type: application/json' \
  -b cookies.txt \
  -d '{"name": "automation-key"}' | jq

Use the returned key (prefixed with st_) in subsequent requests:

curl -s http://localhost:8001/api/training/status \
  -H 'X-API-Key: st_your_key_here'

User management¶

Admins can create additional users via the API or the WebUI's Manage Users page:

# Create a new user (requires admin session)
curl -s -X POST http://localhost:8001/api/users \
  -H 'Content-Type: application/json' \
  -b cookies.txt \
  -d '{
    "email": "[email protected]",
    "username": "researcher",
    "password": "their-password",
    "level_names": ["researcher"]
  }'

Note: Public registration is disabled by default. Admins can enable it in Manage Users → Registration tab, but it's recommended to keep it disabled for private deployments.

Discover the API¶

FastAPI serves interactive docs and the OpenAPI schema:

# FastAPI Swagger UI
python -m webbrowser http://localhost:8001/docs

# ReDoc view
python -m webbrowser http://localhost:8001/redoc

# Download the schema for local inspection
curl -o openapi.json http://localhost:8001/openapi.json
jq '.info' openapi.json

Every endpoint used in this tutorial is documented there under the configurations and training tags.

Fast path: run without environments¶

If you prefer to skip config/environment management entirely, you can issue a one-off training run by posting the full CLI-style payload straight to the training endpoints:

Author or reuse a dataloader JSON that describes your dataset. The trainer only needs the path referenced by --data_backend_config.

cat <<'JSON' > config/multidatabackend-once.json
[
  {
    "id": "demo-images",
    "type": "local",
    "dataset_type": "image",
    "instance_data_dir": "/data/datasets/demo",
    "caption_strategy": "textfile",
    "resolution": 1024,
    "resolution_type": "pixel_area"
  },
  {
    "id": "demo-text-embeds",
    "type": "local",
    "dataset_type": "text_embeds",
    "default": true,
    "cache_dir": "/data/cache/text/demo"
  }
]
JSON

Validate the inline config. Provide every required CLI argument (--model_family, --model_type, --pretrained_model_name_or_path, --output_dir, --data_backend_config, and either --num_train_epochs or --max_train_steps):

curl -s -X POST http://localhost:8001/api/training/validate \
  -F __active_tab__=model \
  -F --model_family=pixart_sigma \
  -F --model_type=full \
  -F --model_flavour=900M-1024-v0.6 \
  -F --pretrained_model_name_or_path=terminusresearch/pixart-900m-1024-ft-v0.6 \
  -F --output_dir=/workspace/output/inline-demo \
  -F --data_backend_config=config/multidatabackend-once.json \
  -F --train_batch_size=1 \
  -F --learning_rate=0.0001 \
  -F --max_train_steps=200 \
  -F --num_train_epochs=0

A green “Configuration Valid” snippet confirms the trainer will accept the payload.

Launch training with the same form fields (you can add overrides such as --seed or --validation_prompt):

curl -s -X POST http://localhost:8001/api/training/start \
  -F __active_tab__=model \
  -F --model_family=pixart_sigma \
  -F --model_type=full \
  -F --model_flavour=900M-1024-v0.6 \
  -F --pretrained_model_name_or_path=terminusresearch/pixart-900m-1024-ft-v0.6 \
  -F --output_dir=/workspace/output/inline-demo \
  -F --data_backend_config=config/multidatabackend-once.json \
  -F --train_batch_size=1 \
  -F --learning_rate=0.0001 \
  -F --max_train_steps=200 \
  -F --num_train_epochs=0 \
  -F --validation_prompt='test shot of <token>'

The server automatically merges the submitted settings with its defaults, writes the resolved config into the active file, and begins training. You can reuse the same approach for any model family—the remaining sections cover a fuller workflow when you want reusable environments.

Monitoring ad-hoc runs¶

You can track progress through the same status endpoints used later in the guide:

Poll GET /api/training/status for high-level state, active job ID, and startup stage info.
Fetch incremental logs with GET /api/training/events?since_index=N or stream them via the WebSocket at /api/training/events/stream.

For push-style updates, supply webhook settings alongside your form fields:

curl -s -X POST http://localhost:8001/api/training/start \
  -F __active_tab__=model \
  -F --model_family=pixart_sigma \
  ... \
  -F --webhook_config='[{"webhook_type":"raw","callback_url":"https://example.com/simpletuner","log_level":"info","ssl_no_verify":false}]' \
  -F --webhook_reporting_interval=10

The payload must be JSON serialised as a string; the server posts job lifecycle updates to the callback_url. See the --webhook_config description in documentation/OPTIONS.md or the sample config/webhooks.json template for supported fields.

Webhook Configuration for Reverse Proxies

When using a reverse proxy with HTTPS, your webhook URL must be the public endpoint: 1. **Public Server:** Use `https://training.example.com/webhook/callback` 2. **Tunneling:** Use ngrok or cloudflared for local dev. **Troubleshooting Real-time Logs (SSE):** If `GET /api/training/events` works but the stream hangs: * **Nginx:** Ensure `proxy_buffering off;` and `proxy_read_timeout` is high (e.g., 86400s). * **CloudFlare:** Terminates long-lived connections; use CloudFlare Tunnel or bypass the proxy for the stream endpoint.

Trigger manual validation¶

If you want to force an evaluation pass between scheduled validation intervals, call the new endpoint:

curl -s -X POST http://localhost:8001/api/training/validation/run

The server responds with the active job_id.
The trainer queues a validation run that fires immediately after the next gradient synchronization (it does not interrupt the current micro-batch).
The run reuses your configured validation prompts/settings so the resulting images appear in the usual event/log streams.
To offload validation to an external executable instead of the built-in pipeline, set --validation_method=external-script in your config (or payload) and point --validation_external_script at your script. You can pass training context to the script with placeholders: {local_checkpoint_path}, {global_step}, {tracker_run_name}, {tracker_project_name}, {model_family}, {huggingface_path}, {remote_checkpoint_path} (empty for validation), plus any validation_* config values (e.g., validation_num_inference_steps, validation_guidance, validation_noise_scheduler). Enable --validation_external_background if you want the script to fire-and-forget without blocking training.
Want to trigger automation immediately after each checkpoint is written locally (even while uploads run in the background)? Configure --post_checkpoint_script='/opt/hooks/run_eval.sh {local_checkpoint_path} {global_step}'. It uses the same placeholders as validation hooks; {remote_checkpoint_path} resolves to empty for this hook.
Prefer to keep SimpleTuner's built-in uploads and hand the resulting remote URL to your own tool? Configure --post_upload_script instead; it fires once per publishing provider/Hugging Face Hub upload with {remote_checkpoint_path} (if provided by the backend) and the same context placeholders. SimpleTuner does not ingest results from your script, so log artifacts/metrics to your tracker yourself.
Example: --post_upload_script='/opt/hooks/notify.sh {remote_checkpoint_path} {tracker_project_name} {tracker_run_name}' where notify.sh calls your tracker API.
Working samples:
- simpletuner/examples/external-validation/replicate_post_upload.py triggers a Replicate inference using {remote_checkpoint_path}, {model_family}, {model_type}, {lora_type}, and {huggingface_path}.
- simpletuner/examples/external-validation/wavespeed_post_upload.py triggers a WaveSpeed inference and polls for completion using the same placeholders.
- simpletuner/examples/external-validation/fal_post_upload.py triggers a fal.ai Flux LoRA inference (requires FAL_KEY and model_family containing flux).
- simpletuner/examples/external-validation/use_second_gpu.py runs Flux LoRA inference on another GPU without requiring uploads.

If no job is active the endpoint returns HTTP 400, so check /api/training/status first when scripting retries.

Trigger manual checkpoint¶

To persist the current model state immediately (without waiting for the next scheduled checkpoint), hit:

curl -s -X POST http://localhost:8001/api/training/checkpoint/run

The server responds with the active job_id.
The trainer saves a checkpoint after the next gradient synchronization using the same settings as scheduled checkpoints (upload rules, rolling retention, etc.).
Rolling cleanup and webhook notifications behave exactly like a scheduled checkpoint.

As with validation, the endpoint returns HTTP 400 if no training job is running.

Stream validation previews¶

Models that expose Tiny AutoEncoder (or equivalent) hooks can emit per-step validation previews while an image/video is still being sampled. Enable the feature by adding the CLI flags to your payload:

curl -s -X POST http://localhost:8001/api/training/start \
  -F __active_tab__=validation \
  -F --validation_preview=true \
  -F --validation_preview_steps=4 \
  -F --validation_num_inference_steps=20 \
  …other fields…

--validation_preview (defaults to false) unlocks the preview decoder.
--validation_preview_steps determines how often to emit intermediate frames. With the example above, you receive events at steps 1,5,9,13,17,20 (the first step is always emitted, then every 4^th step).

Each preview is published as a validation.image event (see simpletuner/helpers/training/validation.py:899-929). You can consume them via raw webhooks, GET /api/training/events, or the SSE stream at /api/training/events/stream. A typical payload looks like:

{
  "type": "validation.image",
  "title": "Validation (step 5/20): night bench",
  "body": "night bench shot of <token>",
  "data": {
    "step": 5,
    "timestep": 563.0,
    "resolution": [1024, 1024],
    "validation_type": "intermediary",
    "prompt": "night bench shot of <token>",
    "step_label": "5/20"
  },
  "images": [
    {"src": "data:image/png;base64,...", "mime_type": "image/png"}
  ]
}

Video-capable models attach a videos array instead (GIF data URIs with mime_type: image/gif). Because these events stream in near-real-time, you can surface them directly in dashboards or send them to Slack/Discord via a raw webhook backend.

Common API workflow¶

Create an environment – POST /api/configs/environments
Populate the dataloader file – edit the generated config/<env>/multidatabackend.json
Update training hyperparameters – PUT /api/configs/<env>
Activate the environment – POST /api/configs/<env>/activate
Validate training parameters – POST /api/training/validate
Launch training – POST /api/training/start
Monitor or stop the job – /api/training/status, /api/training/events, /api/training/stop, /api/training/cancel

Each example below follows this flow.

Optional: upload datasets over the API (local backends)¶

If the dataset is not yet on the machine where SimpleTuner runs, you can push it over HTTP before wiring the dataloader. The upload endpoints respect the configured datasets_dir (set during WebUI onboarding) and are intended for local filesystems:

Create a target folder under your datasets root:

DATASETS_DIR=${DATASETS_DIR:-/workspace/simpletuner/datasets}
curl -s -X POST http://localhost:8001/api/datasets/folders \
  -F parent_path="$DATASETS_DIR" \
  -F folder_name="pixart-upload"

Upload files or a ZIP (images plus optional .txt/.jsonl/.csv metadata are accepted):

# Upload a zip (automatically extracted on the server)
curl -s -X POST http://localhost:8001/api/datasets/upload/zip \
  -F target_path="$DATASETS_DIR/pixart-upload" \
  -F file=@/path/to/dataset.zip

# Or upload individual files
curl -s -X POST http://localhost:8001/api/datasets/upload \
  -F target_path="$DATASETS_DIR/pixart-upload" \
  -F files[]=@image001.png \
  -F files[]=@image001.txt

Troubleshooting Uploads: If large uploads fail with a "Entity Too Large" error when using a reverse proxy, ensure you have increased the body size limit (e.g., client_max_body_size 10G; in Nginx or request_body { max_size 10GB } in Caddy).

After the upload finishes, point your multidatabackend.json entry at the new folder (for example, "/data/datasets/pixart-upload").

Example: PixArt Sigma 900M full fine-tune¶

1. Create the environment via REST¶

curl -s -X POST http://localhost:8001/api/configs/environments \
  -H 'Content-Type: application/json' \
  -d
```json
{
        "name": "pixart-api-demo",
        "model_family": "pixart_sigma",
        "model_flavour": "900M-1024-v0.6",
        "model_type": "full",
        "description": "PixArt 900M API-driven training"
      }

This creates config/pixart-api-demo/ and a starter multidatabackend.json.

2. Wire the dataset¶

Edit the dataloader file (replace paths with your actual dataset/cache locations):

cat <<'JSON' > config/pixart-api-demo/multidatabackend.json
[
  {
    "id": "pixart-camera",
    "type": "local",
    "dataset_type": "image",
    "instance_data_dir": "/data/datasets/pseudo-camera-10k",
    "caption_strategy": "filename",
    "resolution": 1.0,
    "resolution_type": "area",
    "minimum_image_size": 0.25,
    "maximum_image_size": 1.0,
    "target_downsample_size": 1.0,
    "cache_dir_vae": "/data/cache/vae/pixart/pseudo-camera-10k",
    "crop": true,
    "crop_style": "random",
    "crop_aspect": "square",
    "metadata_backend": "discovery"
  },
  {
    "id": "pixart-text-embeds",
    "type": "local",
    "dataset_type": "text_embeds",
    "default": true,
    "cache_dir": "/data/cache/text/pixart/pseudo-camera-10k",
    "write_batch_size": 128
  }
]
JSON

3. Update hyperparameters through the API¶

Grab the current config, merge overrides, and push the result back:

curl -s http://localhost:8001/api/configs/pixart-api-demo \
  | jq '.config + {
      "--output_dir": "/workspace/output/pixart900m",
      "--train_batch_size": 2,
      "--gradient_accumulation_steps": 2,
      "--learning_rate": 0.0001,
      "--optimizer": "adamw_bf16",
      "--lr_scheduler": "cosine",
      "--lr_warmup_steps": 500,
      "--max_train_steps": 1800,
      "--num_train_epochs": 0,
      "--validation_prompt": "a studio portrait of <token> wearing a leather jacket",
      "--validation_guidance": 3.8,
      "--validation_resolution": "1024x1024",
      "--validation_num_inference_steps": 28,
      "--cache_dir_vae": "/data/cache/vae/pixart",
      "--seed": 1337,
      "--resume_from_checkpoint": "latest",
      "--base_model_precision": "bf16",
      "--dataloader_prefetch": true,
      "--report_to": "none",
      "--checkpoints_total_limit": 4,
      "--validation_seed": 12345,
      "--data_backend_config": "pixart-api-demo/multidatabackend.json"
    }' > /tmp/pixart-config.json

jq '{
      "name": "pixart-api-demo",
      "description": "PixArt 900M full tune (API)",
      "tags": ["pixart", "api"],
      "config": .
    }' /tmp/pixart-config.json > /tmp/pixart-update.json

curl -s -X PUT http://localhost:8001/api/configs/pixart-api-demo \
  -H 'Content-Type: application/json' \
  --data-binary @/tmp/pixart-update.json

4. Activate the environment¶

curl -s -X POST http://localhost:8001/api/configs/pixart-api-demo/activate

5. Validate before launching¶

validate consumes form-encoded data. At minimum, ensure one of num_train_epochs or max_train_steps is 0:

curl -s -X POST http://localhost:8001/api/training/validate \
  -F __active_tab__=model \
  -F --num_train_epochs=0

A success block (Configuration Valid) means the trainer accepts the merged configuration.

6. Start training¶

curl -s -X POST http://localhost:8001/api/training/start \
  -F __active_tab__=model \
  -F --num_train_epochs=0

The response includes the job ID. Training runs with the parameters saved in step 3.

7. Monitor and stop¶

# Query coarse status
curl -s http://localhost:8001/api/training/status | jq

# Stream incremental log events
curl -s 'http://localhost:8001/api/training/events?since_index=0' | jq

# Cancel or stop
curl -s -X POST http://localhost:8001/api/training/stop
curl -s -X POST http://localhost:8001/api/training/cancel -F job_id=<JOB_ID>

PixArt notes:

Keep the dataset large enough for the chosen train_batch_size * gradient_accumulation_steps
Set HF_ENDPOINT if you need a mirror, and authenticate before downloading terminusresearch/pixart-900m-1024-ft-v0.6
Tune --validation_guidance between 3.6 and 4.4 depending on your prompts

Example: Flux Kontext LyCORIS LoRA¶

Kontext shares most of its pipeline with Flux Dev but needs paired edit/reference images.

1. Provision the environment¶

curl -s -X POST http://localhost:8001/api/configs/environments \
  -H 'Content-Type: application/json' \
  -d
```json
{
        "name": "kontext-api-demo",
        "model_family": "flux",
        "model_flavour": "kontext",
        "model_type": "lora",
        "lora_type": "lycoris",
        "description": "Flux Kontext LoRA via API"
      }

2. Describe the paired dataloader¶

Kontext needs edit/reference pairs plus a text-embed cache:

cat <<'JSON' > config/kontext-api-demo/multidatabackend.json
[
  {
    "id": "kontext-edit",
    "type": "local",
    "dataset_type": "image",
    "instance_data_dir": "/data/datasets/kontext/edit",
    "conditioning_data": ["kontext-reference"],
    "resolution": 1024,
    "resolution_type": "pixel_area",
    "caption_strategy": "textfile",
    "minimum_image_size": 768,
    "maximum_image_size": 1536,
    "target_downsample_size": 1024,
    "cache_dir_vae": "/data/cache/vae/kontext/edit",
    "crop": true,
    "crop_style": "random",
    "crop_aspect": "square"
  },
  {
    "id": "kontext-reference",
    "type": "local",
    "dataset_type": "conditioning",
    "instance_data_dir": "/data/datasets/kontext/reference",
    "conditioning_type": "reference_strict",
    "resolution": 1024,
    "resolution_type": "pixel_area",
    "cache_dir_vae": "/data/cache/vae/kontext/reference"
  },
  {
    "id": "kontext-text-embeds",
    "type": "local",
    "dataset_type": "text_embeds",
    "default": true,
    "cache_dir": "/data/cache/text/kontext"
  }
]
JSON

Ensure filenames match between edit and reference folders; SimpleTuner stitches embeddings based on names.

3. Apply Kontext-specific hyperparameters¶

curl -s http://localhost:8001/api/configs/kontext-api-demo \
  | jq '.config + {
      "--output_dir": "/workspace/output/kontext",
      "--train_batch_size": 1,
      "--gradient_accumulation_steps": 4,
      "--learning_rate": 0.00001,
      "--optimizer": "optimi-lion",
      "--lr_scheduler": "cosine",
      "--lr_warmup_steps": 200,
      "--max_train_steps": 12000,
      "--num_train_epochs": 0,
      "--validation_prompt": "a cinematic 1024px product photo of <token>",
      "--validation_guidance": 2.5,
      "--validation_resolution": "1024x1024",
      "--validation_num_inference_steps": 30,
      "--cache_dir_vae": "/data/cache/vae/kontext",
      "--seed": 777,
      "--resume_from_checkpoint": "latest",
      "--base_model_precision": "int8-quanto",
      "--dataloader_prefetch": true,
      "--report_to": "wandb",
      "--lora_rank": 16,
      "--lora_dropout": 0.05,
      "--conditioning_multidataset_sampling": "combined",
      "--clip_skip": 2,
      "--data_backend_config": "kontext-api-demo/multidatabackend.json"
    }' > /tmp/kontext-config.json

jq '{
      "name": "kontext-api-demo",
      "description": "Flux Kontext LyCORIS (API)",
      "tags": ["flux", "kontext", "api"],
      "config": .
    }' /tmp/kontext-config.json > /tmp/kontext-update.json

curl -s -X PUT http://localhost:8001/api/configs/kontext-api-demo \
  -H 'Content-Type: application/json' \
  --data-binary @/tmp/kontext-update.json

4. Activate, validate, and launch¶

curl -s -X POST http://localhost:8001/api/configs/kontext-api-demo/activate

curl -s -X POST http://localhost:8001/api/training/validate \
  -F __active_tab__=model \
  -F --num_train_epochs=0

curl -s -X POST http://localhost:8001/api/training/start \
  -F __active_tab__=model \
  -F --num_train_epochs=0

Kontext tips:

conditioning_type=reference_strict keeps crops aligned; switch to reference_loose if your datasets differ in aspect ratio
Quantise to int8-quanto to stay within 24 GB VRAM at 1024 px; full precision requires Hopper/Blackwell-class GPUs
For multi-node runs, set --accelerate_config or CUDA_VISIBLE_DEVICES before launching the server

Submit local jobs with GPU-aware queuing¶

When running on a multi-GPU machine, you can submit local training jobs through the queue API with GPU allocation awareness. Jobs are queued if required GPUs are unavailable.

Check GPU availability¶

curl -s "http://localhost:8001/api/system/status?include_allocation=true" | jq '.gpu_allocation'

Response shows which GPUs are available:

{
  "allocated_gpus": [0, 1],
  "available_gpus": [2, 3],
  "running_local_jobs": 1,
  "devices": [
    {"index": 0, "name": "A100", "memory_gb": 40, "allocated": true, "job_id": "abc123"},
    {"index": 1, "name": "A100", "memory_gb": 40, "allocated": true, "job_id": "abc123"},
    {"index": 2, "name": "A100", "memory_gb": 40, "allocated": false, "job_id": null},
    {"index": 3, "name": "A100", "memory_gb": 40, "allocated": false, "job_id": null}
  ]
}

You can also get queue statistics including local GPU info:

curl -s http://localhost:8001/api/queue/stats | jq '.local'

Submit a local job¶

curl -s -X POST http://localhost:8001/api/queue/submit \
  -H 'Content-Type: application/json' \
  -d '{
    "config_name": "my-training-config",
    "no_wait": false,
    "any_gpu": false
  }'

Options:

Option	Default	Description
`config_name`	required	Name of the training environment to run
`no_wait`	false	If true, reject immediately when GPUs unavailable
`any_gpu`	false	If true, use any available GPUs instead of configured device IDs

Response:

{
  "success": true,
  "job_id": "abc123",
  "status": "running",
  "allocated_gpus": [0, 1],
  "queue_position": null
}

The status field indicates the outcome:

running - Job started immediately with allocated GPUs
queued - Job queued, will start when GPUs become available
rejected - GPUs unavailable and no_wait was true

Configure local concurrency limits¶

Admins can limit how many local jobs and GPUs can be used via the queue concurrency endpoint:

# Get current limits
curl -s http://localhost:8001/api/queue/stats | jq '{local_gpu_max_concurrent, local_job_max_concurrent}'

# Update limits (alongside cloud limits)
curl -s -X POST http://localhost:8001/api/queue/concurrency \
  -H 'Content-Type: application/json' \
  -d '{
    "local_gpu_max_concurrent": 6,
    "local_job_max_concurrent": 2
  }'

Set local_gpu_max_concurrent to null for unlimited GPU usage.

CLI alternative¶

The same functionality is available via CLI:

# Submit with default queuing behavior
simpletuner jobs submit my-config

# Reject if GPUs unavailable
simpletuner jobs submit my-config --no-wait

# Use any available GPUs
simpletuner jobs submit my-config --any-gpu

# Preview what would happen (dry-run)
simpletuner jobs submit my-config --dry-run

Dispatch jobs to remote workers¶

If you have remote GPU machines registered as workers (see Worker Orchestration), you can dispatch jobs to them via the queue API.

Check available workers¶

curl -s http://localhost:8001/api/admin/workers | jq '.workers[] | {name, status, gpu_name, gpu_count}'

Submit to a specific target¶

# Prefer remote workers, fall back to local GPUs (default)
curl -s -X POST http://localhost:8001/api/queue/submit \
  -H 'Content-Type: application/json' \
  -d '{
    "config_name": "my-training-config",
    "target": "auto"
  }'

# Force dispatch to remote workers only
curl -s -X POST http://localhost:8001/api/queue/submit \
  -H 'Content-Type: application/json' \
  -d '{
    "config_name": "my-training-config",
    "target": "worker"
  }'

# Run only on orchestrator's local GPUs
curl -s -X POST http://localhost:8001/api/queue/submit \
  -H 'Content-Type: application/json' \
  -d '{
    "config_name": "my-training-config",
    "target": "local"
  }'

Select workers by label¶

Workers can have labels for filtering (e.g., GPU type, location, team):

curl -s -X POST http://localhost:8001/api/queue/submit \
  -H 'Content-Type: application/json' \
  -d '{
    "config_name": "my-training-config",
    "target": "worker",
    "worker_labels": {"gpu_type": "a100*", "team": "nlp"}
  }'

Labels support glob patterns (* matches any characters).

Useful endpoints at a glance¶

GET /api/configs/ – list environments (pass ?config_type=model for training configs)
GET /api/configs/examples – enumerate shipped templates
POST /api/configs/{name}/dataloader – regenerate a dataloader file if you want defaults
GET /api/training/status – high-level state, active job_id, and startup stage info
GET /api/training/events?since_index=N – incremental trainer log stream
POST /api/training/checkpoints – list checkpoints for the active job's output directory
GET /api/system/status?include_allocation=true – system metrics with GPU allocation info
GET /api/queue/stats – queue statistics including local GPU allocation
POST /api/queue/submit – submit a local or worker job with GPU-aware queuing
POST /api/queue/concurrency – update cloud and local concurrency limits
GET /api/admin/workers – list registered workers and their status

Where to go next¶

Explore specific option definitions in documentation/OPTIONS.md
Combine these REST calls with jq/yq or a Python client for automation
Hook WebSockets at /api/training/events/stream for real-time dashboards
Reuse the exported configs (GET /api/configs/<env>/export) to version-control working setups
Run training on cloud GPUs via Replicate—see the Cloud Training Tutorial

With these patterns you can fully script SimpleTuner training without touching the WebUI, while still relying on the battle-tested CLI setup process.