Enterprise Guide¶
This document covers deploying SimpleTuner in multi-user environments with authentication, approval workflows, and quota management.
1. Deployment & Infrastructure¶
Configuration Methods¶
Most enterprise features can be configured via the Web UI (Administration panel) or REST API. A few infrastructure-level settings require a config file or environment variables.
| Feature | Web UI | API | Config File |
|---|---|---|---|
| OIDC/LDAP providers | ✓ | ✓ | ✓ |
| Users & roles | ✓ | ✓ | |
| Approval rules | ✓ | ✓ | |
| Quotas | ✓ | ✓ | |
| Notifications | ✓ | ✓ | |
| Network bypass (trusted proxies) | | | ✓ |
| Background job polling | | | ✓ |
| TLS settings | | | ✓ |
Config file (simpletuner-enterprise.yaml or .json) is only needed for infrastructure settings that must be known at startup. SimpleTuner searches these locations:
- `$SIMPLETUNER_ENTERPRISE_CONFIG` (environment variable)
- `./simpletuner-enterprise.yaml` (current directory)
- `~/.config/simpletuner/enterprise.yaml`
- `/etc/simpletuner/enterprise.yaml`
The file supports environment variable interpolation with ${VAR} syntax.
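A minimal sketch of how the lookup and `${VAR}` interpolation could behave. The search order (environment variable first, then the listed paths top to bottom) and the choice to leave unset variables untouched are assumptions, and `find_config` and `interpolate` are hypothetical helper names, not SimpleTuner functions.

```python
import os
import re
from pathlib import Path
from typing import Mapping, Optional

# Matches ${NAME} where NAME is a conventional environment variable name.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def find_config(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return the first candidate path that exists, or None (order is assumed)."""
    candidates = []
    if env.get("SIMPLETUNER_ENTERPRISE_CONFIG"):
        candidates.append(Path(env["SIMPLETUNER_ENTERPRISE_CONFIG"]))
    candidates += [
        Path.cwd() / "simpletuner-enterprise.yaml",
        Path.home() / ".config" / "simpletuner" / "enterprise.yaml",
        Path("/etc/simpletuner/enterprise.yaml"),
    ]
    return next((p for p in candidates if p.is_file()), None)


def interpolate(text: str, env: Mapping[str, str] = os.environ) -> str:
    """Replace each ${VAR} with its value; unset variables stay as-is (assumption)."""
    return _VAR.sub(lambda m: env.get(m.group(1), m.group(0)), text)
```

Keeping secrets such as `client_secret` in the environment and referencing them with `${VAR}` means the file itself can be committed to version control.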
Quick Start Checklist¶
- Start SimpleTuner: `simpletuner server` (or `--webui` for local use)
- Configure via UI: Navigate to Administration panel to set up users, SSO, quotas
- Health Checks (for production):
  - Liveness: `GET /api/cloud/health/live` (200 OK)
  - Readiness: `GET /api/cloud/health/ready` (200 OK)
  - Deep Check: `GET /api/cloud/health` (includes provider connectivity)
Network Security & Authentication Bypass¶
Configuring trusted proxies and internal network bypass (config file required)
In corporate environments (VPNs, private VPCs), you may want to trust internal traffic or offload authentication to a gateway.

**simpletuner-enterprise.yaml:**

    network:
      # Trust headers from your load balancer (e.g., AWS ALB, Nginx)
      trust_proxy_headers: true
      trusted_proxies:
        - "10.0.0.0/8"
        - "192.168.0.0/16"
      # Optional: Trust specific internal subnets to bypass login
      bypass_auth_for_internal: true
      internal_networks:
        - "10.10.0.0/16" # VPN Clients
    auth:
      # Always allow health checks without auth
      bypass_paths:
        - "/health"
        - "/api/cloud/health"
        - "/api/cloud/metrics/prometheus"
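To make the interaction between these settings concrete, here is a rough sketch of the decision they describe, using the values from the configuration above. It assumes the forwarded client address is honored only when the direct peer is a trusted proxy, and that bypass paths match as prefixes; both are assumptions about behavior, and the function names are hypothetical.

```python
import ipaddress
from typing import Iterable, Optional

TRUSTED_PROXIES = ["10.0.0.0/8", "192.168.0.0/16"]
INTERNAL_NETWORKS = ["10.10.0.0/16"]
BYPASS_PATHS = ["/health", "/api/cloud/health", "/api/cloud/metrics/prometheus"]


def in_networks(ip: str, cidrs: Iterable[str]) -> bool:
    """True if the address falls inside any of the CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


def client_ip(peer_ip: str, x_real_ip: Optional[str]) -> str:
    """Use the forwarded address only when the direct peer is a trusted proxy."""
    if x_real_ip and in_networks(peer_ip, TRUSTED_PROXIES):
        return x_real_ip.strip()
    return peer_ip


def auth_bypassed(path: str, peer_ip: str, x_real_ip: Optional[str] = None) -> bool:
    """Bypass auth for listed paths (prefix match assumed) or internal clients."""
    if any(path == p or path.startswith(p + "/") for p in BYPASS_PATHS):
        return True
    return in_networks(client_ip(peer_ip, x_real_ip), INTERNAL_NETWORKS)
```

The trusted-proxy check matters because a client outside the proxy range could otherwise send its own `X-Real-IP` header and claim an internal address.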
Load Balancer & TLS Configuration¶
SimpleTuner expects an upstream reverse proxy for TLS termination.
nginx reverse proxy example

    server {
        listen 443 ssl http2;
        server_name trainer.internal;
        ssl_certificate /etc/ssl/certs/simpletuner.crt;
        ssl_certificate_key /etc/ssl/private/simpletuner.key;
        location / {
            proxy_pass http://127.0.0.1:8080;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            # WebSocket support for real-time logs/SSE
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_read_timeout 86400;
        }
    }
Observability (Prometheus & Logging)¶
Metrics:
Scrape GET /api/cloud/metrics/prometheus for operational insights.
* simpletuner_jobs_active: Current queue depth.
* simpletuner_cost_total_usd: Spend tracking.
* simpletuner_uptime_seconds: Availability.
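For ad-hoc checks outside Prometheus, the scraped text can be parsed directly. This is a small sketch that handles only unlabelled samples in the Prometheus text exposition format; `parse_metrics` is a hypothetical helper, and the response body is assumed to follow that format.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Collect unlabelled `name value` samples, skipping comments and labelled series."""
    samples: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        samples[parts[0]] = float(parts[1])
    return samples
```

Passing the body of `GET /api/cloud/metrics/prometheus` to this function yields a dictionary keyed by metric name, for example `simpletuner_jobs_active`.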
Logging:
Set SIMPLETUNER_LOG_FORMAT=json for ingestion into Splunk/Datadog/ELK.
Data Retention Configuration
Configure retention periods for compliance requirements via environment variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `SIMPLETUNER_JOB_RETENTION_DAYS` | 90 | Days to retain completed job records |
| `SIMPLETUNER_AUDIT_RETENTION_DAYS` | 90 | Days to retain audit log entries |

Setting to `0` disables automatic cleanup. Cleanup runs daily.

2. Identity & Access Management (SSO)¶
SimpleTuner supports OIDC (OpenID Connect) and LDAP for SSO with Okta, Azure AD, Keycloak, and Active Directory.
Configuring Providers¶
Via Web UI: Navigate to Administration → Auth to add and configure providers.
Via API: See the API Cookbook for curl examples.
Via config file (for IaC/GitOps workflows)
Add to your `simpletuner-enterprise.yaml`:

    oidc:
      enabled: true
      provider: "okta" # or "azure_ad", "google"
      client_id: "0oa1234567890abcdef"
      client_secret: "${OIDC_CLIENT_SECRET}"
      issuer_url: "https://your-org.okta.com/oauth2/default"
      scopes: ["openid", "email", "profile", "groups"]
      # Map Identity Provider groups to SimpleTuner Roles
      role_mapping:
        claim: "groups"
        admin_groups: ["ML-Platform-Admins"]
        user_groups: ["ML-Researchers"]
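As an illustration of what `role_mapping` expresses, the sketch below resolves a role from decoded ID-token claims. It assumes admin groups take precedence, that `user_groups` maps to the Researcher role, and that users in neither list get no role; `resolve_role` is a hypothetical name and none of these choices are confirmed SimpleTuner behavior.

```python
from typing import Any, Dict, Optional

ROLE_MAPPING = {
    "claim": "groups",
    "admin_groups": ["ML-Platform-Admins"],
    "user_groups": ["ML-Researchers"],
}


def resolve_role(claims: Dict[str, Any], mapping: Dict[str, Any] = ROLE_MAPPING) -> Optional[str]:
    """Map the identity provider's group claim to a role name (assumed precedence)."""
    groups = claims.get(mapping["claim"]) or []
    if isinstance(groups, str):  # some providers send a single group as a string
        groups = [groups]
    membership = set(groups)
    if membership & set(mapping["admin_groups"]):
        return "admin"
    if membership & set(mapping["user_groups"]):
        return "researcher"
    return None
```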
Cross-Worker OAuth State Validation
When using OIDC authentication in multi-worker deployments (e.g., behind a load balancer with multiple Gunicorn workers), OAuth state validation must work across all workers. SimpleTuner handles this automatically by storing OAuth state in the database.

**How it works:**

1. **State Generation**: When a user initiates OIDC login, a cryptographically random state token is generated and stored in the database with the provider name, redirect URI, and a 10-minute expiration.
2. **State Validation**: When the callback arrives (potentially to a different worker), the state is looked up and atomically consumed (single-use).
3. **Cleanup**: Expired states are automatically purged during normal operations.

No additional configuration is needed. OAuth state storage uses the same database as jobs and users.

**Troubleshooting "Invalid OAuth state" errors:**

1. Check if the callback arrived within 10 minutes of login initiation
2. Verify all workers share the same database path
3. Check database write permissions
4. Look for "Failed to store OAuth state" errors in logs

User Management & Roles¶
SimpleTuner uses a hierarchical role system. Users can be managed via GET/POST /api/users.
| Role | Priority | Description |
|---|---|---|
| Viewer | 10 | Read-only access to job history and logs. |
| Researcher | 20 | Standard access. Can submit jobs and manage their own API keys. |
| Lead | 30 | Can approve pending jobs and view team resource usage. |
| Admin | 100 | Full system access, including user management and rule configuration. |
3. Governance & Operations¶
Approval Workflows¶
Control costs and resource usage by requiring approvals for specific criteria. Rules are evaluated at submission time.
Workflow:
1. User submits job -> Status becomes pending_approval.
2. Leads check GET /api/approvals/requests.
3. Lead calls POST /.../approve or reject.
4. Job automatically proceeds to queue or is cancelled.
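The submission-time check in step 1 can be sketched as follows, assuming the first-match, priority-ordered evaluation that the rules engine section describes. Only three of the documented conditions are modelled; treating a lower priority number as "evaluated first" and the `queued` status name are assumptions, and the class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import List, Optional


@dataclass
class Rule:
    name: str
    priority: int  # lower number evaluated first (assumption)
    condition: str  # "cost_exceeds", "hardware_type", or "first_job"
    threshold: float = 0.0
    pattern: str = "*"
    exempt_levels: List[str] = field(default_factory=list)


@dataclass
class Submission:
    user_level: str
    estimated_cost_usd: float
    hardware: str
    prior_job_count: int


def matches(rule: Rule, job: Submission) -> bool:
    """Evaluate one rule; exempt levels never match."""
    if job.user_level in rule.exempt_levels:
        return False
    if rule.condition == "cost_exceeds":
        return job.estimated_cost_usd > rule.threshold
    if rule.condition == "hardware_type":
        return fnmatchcase(job.hardware, rule.pattern)
    if rule.condition == "first_job":
        return job.prior_job_count == 0
    return False


def first_matching_rule(rules: List[Rule], job: Submission) -> Optional[Rule]:
    """Return the first rule, in priority order, that requires approval."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if matches(rule, job):
            return rule
    return None


def initial_status(rules: List[Rule], job: Submission) -> str:
    return "pending_approval" if first_matching_rule(rules, job) else "queued"
```

With a `cost_exceeds` rule at threshold 50, a researcher's job estimated at $75 starts in `pending_approval`, while a $10 job goes straight to the queue.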
Approval Rules Engine
The rules engine evaluates job submissions against configured rules. Rules are processed in priority order; the first matching rule triggers the approval requirement.

**Available Rule Conditions:**

| Condition | Description |
|-----------|-------------|
| `cost_exceeds` | Triggers when estimated cost exceeds threshold (USD) |
| `hardware_type` | Matches hardware type (glob pattern, e.g., `a100*`) |
| `daily_jobs_exceed` | Triggers when user's daily job count exceeds threshold |
| `first_job` | Triggers for a user's very first job |
| `config_pattern` | Matches config name patterns |
| `provider` | Matches specific provider name |

**Example: Require approval for jobs over $50:**

Rules can specify `exempt_levels` to allow certain users to bypass approval, and `applies_to_provider`/`applies_to_level` to scope rules.

Email-Based Approval (IMAP Workflow)
For teams that prefer email-based workflows, SimpleTuner supports approval via email replies using IMAP IDLE.

**How It Works:**

1. Job submission triggers approval requirement
2. Notification email sent to approvers with unique response token
3. IMAP handler monitors inbox using IDLE (push notifications)
4. Approver replies with "approve" or "reject" (or aliases like `yes`, `lgtm`, `+1`)
5. System parses response and processes approval

Configure via **Administration → Notifications** or API. Response tokens expire after 24 hours and are single-use.

Job Queue & Concurrency¶
The scheduler manages fair usage of resources. See its dedicated documentation for details.
- Priority: Admins > Leads > Researchers > Viewers.
- Concurrency: Limits are enforced globally and per-user.
- Update limits via UI: Cloud tab → Job Queue panel (admin only)
- Update limits via API: `POST /api/queue/concurrency` with `{"max_concurrent": 10, "user_max_concurrent": 3}`
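A minimal sketch of how the two limits combine when deciding whether a queued job may start, using the `max_concurrent` and `user_max_concurrent` values from the payload above. `can_start` is a hypothetical helper; the real scheduler may weigh other factors such as role priority.

```python
from collections import Counter
from typing import Iterable


def can_start(
    user: str,
    running_owners: Iterable[str],
    max_concurrent: int = 10,
    user_max_concurrent: int = 3,
) -> bool:
    """A job may start only if both the global and the per-user limit have headroom."""
    counts = Counter(running_owners)
    total_running = sum(counts.values())
    return total_running < max_concurrent and counts[user] < user_max_concurrent
```

Here `running_owners` lists the owner of each currently running job, so a user with three running jobs is held back even when the global limit still has room.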
Job Status Polling (No Webhooks Required)¶
For secure environments where public webhooks are impossible, SimpleTuner includes a background poller.
Add to simpletuner-enterprise.yaml:
This service queries the provider API every 30s and updates the local database, emitting real-time events to the UI via SSE.
API Key Rotation¶
Securely manage cloud provider credentials. See API Cookbook for rotation scripts and provider-specific details in the Cloud Training documentation.
4. API Cookbook¶
OIDC/LDAP Configuration Examples
**Keycloak (OIDC):**

curl -X POST http://localhost:8080/api/cloud/external-auth/providers \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-d '{
"name": "keycloak",
"provider_type": "oidc",
"enabled": true,
"config": {
"issuer": "https://keycloak.example.com/realms/ml-training",
"client_id": "simpletuner",
"client_secret": "your-client-secret",
"scopes": ["openid", "email", "profile", "roles"],
"roles_claim": "realm_access.roles"
}
}'
**Active Directory (LDAP):**

curl -X POST http://localhost:8080/api/cloud/external-auth/providers \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-d '{
"name": "corporate-ad",
"provider_type": "ldap",
"enabled": true,
"level_mapping": {
"CN=ML-Admins,OU=Groups,DC=corp,DC=com": ["admin"]
},
"config": {
"server": "ldaps://ldap.corp.com:636",
"base_dn": "DC=corp,DC=com",
"bind_dn": "CN=svc-simpletuner,OU=Service Accounts,DC=corp,DC=com",
"bind_password": "service-account-password",
"user_search_filter": "(sAMAccountName={username})",
"use_ssl": true
}
}'
User Administration Examples
**Create a Researcher:**

curl -X POST http://localhost:8080/api/users \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-d '{
"email": "jsmith@example.com",
"username": "jsmith",
"password": "secure_password_123",
"level_names": ["researcher"]
}'
Credential Management
SimpleTuner includes credential lifecycle management for tracking, rotating, and auditing API credentials.

**Credential Resolution:** When submitting jobs, SimpleTuner checks per-user credentials first, then falls back to global credentials (environment variables).

| Scenario | Per-User | Global | Behavior |
|----------|----------|--------|----------|
| **Shared org key** | ❌ | ✅ | All users share the org's API key |
| **BYOK** | ✅ | ❌ | Each user provides their own key |
| **Hybrid** | Some | ✅ | Users with keys use theirs, others use global |

**Rotation:** Navigate to **Admin > Auth** → user → **Manage Credentials** → **Rotate**. Stale credentials (>90 days) display a warning badge.

External Orchestration¶
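Airflow example

    import os
    import time

    import requests

    # Hypothetical variable name; supply an API token however your deployment stores it.
    TOKEN = os.environ["SIMPLETUNER_API_TOKEN"]
    HEADERS = {"Authorization": f"Bearer {TOKEN}"}

    def submit_and_wait(job_config, provider="replicate", **context):
        # Submit the job, then poll every 30s until it reaches a terminal state.
        resp = requests.post(
            f"http://localhost:8080/api/cloud/{provider}/submit",
            json=job_config,
            headers=HEADERS,
            timeout=30,
        )
        resp.raise_for_status()
        job_id = resp.json()["job_id"]
        while True:
            status = requests.get(
                f"http://localhost:8080/api/cloud/jobs/{job_id}",
                headers=HEADERS,
                timeout=30,
            )
            status.raise_for_status()
            state = status.json()["status"]
            if state in ("completed", "failed", "cancelled"):
                return status.json()
            time.sleep(30)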
5. Troubleshooting¶
Health Check Failures
* 503 Service Unavailable: Check database connectivity.
* Degraded: Usually means an optional component (like a cloud provider API) is unreachable or unconfigured.
Authentication Issues
* OIDC Redirect Loop: Verify issuer_url matches exactly what the provider expects (trailing slashes matter!).
* Internal Auth Bypass: Check server logs for "Auth bypassed for IP..." to confirm your load balancer is passing the correct X-Real-IP.
Job Updates Stalled
* If webhooks are blocked, ensure Job Status Polling is enabled in simpletuner-enterprise.yaml.
* Check GET /api/cloud/metrics/prometheus for simpletuner_jobs_active to see if the internal state thinks jobs are running.
Missing Metrics
* Ensure your Prometheus scraper is configured to hit /api/cloud/metrics/prometheus and not just /metrics.
6. Organizations & Team Quotas¶
SimpleTuner supports hierarchical organizations and teams with ceiling-based quota enforcement.
Hierarchy¶
Organization (quota ceiling)
└── Team (quota ceiling, bounded by org)
└── User (limit, bounded by team and org)
Ceiling Model¶
Quotas use a ceiling model where org limits are absolute ceilings:

- Org quota: Absolute ceiling for all members
- Team quota: Ceiling for team members (cannot exceed org)
- User/Level quota: Specific limits (bounded by team and org)

Example:

- Org ceiling: 100 concurrent jobs
- Team ceiling: 20 concurrent jobs
- User limit: 50 concurrent jobs → Effective: 20 (team ceiling applies)

Enforcement Rules:

- Team quotas are validated at set-time: attempting to set a team quota higher than the org ceiling returns HTTP 400
- User quotas are validated at runtime: effective limit is the minimum of user, team, and org ceilings
- Reducing an org ceiling does not automatically reduce existing team ceilings (admin must update manually)
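The two enforcement paths can be sketched in a few lines. Treating `None` as "no limit set at this level" is an assumption, and both function names are hypothetical; the `ValueError` stands in for the HTTP 400 the API returns.

```python
from typing import Optional


def effective_limit(user: Optional[int], team: Optional[int], org: Optional[int]) -> Optional[int]:
    """Runtime check: the effective limit is the minimum of the limits that are set."""
    limits = [value for value in (user, team, org) if value is not None]
    return min(limits) if limits else None


def validate_team_quota(team_limit: int, org_ceiling: Optional[int]) -> None:
    """Set-time check: a team quota may not exceed the org ceiling."""
    if org_ceiling is not None and team_limit > org_ceiling:
        raise ValueError(f"team quota {team_limit} exceeds org ceiling {org_ceiling}")
```

For the example above, `effective_limit(50, 20, 100)` evaluates to 20 because the team ceiling is the smallest of the three values.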
API Examples
**Create Organization:**

curl -X POST http://localhost:8080/api/orgs \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-d '{"name": "ML Research", "slug": "ml-research"}'
**Set Organization Quota:**

curl -X POST http://localhost:8080/api/orgs/1/quotas \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-d '{"quota_type": "concurrent_jobs", "limit_value": 100, "action": "block"}'
Quota and Cost Limit Actions¶
When a quota or cost limit is reached, the configured action determines behavior:
| Action | Behavior |
|---|---|
| `warn` | Job proceeds with warning in logs/UI |
| `block` | Job submission rejected |
| `notify` | Job proceeds, admins alerted |
Cost Limit Configuration
Cost limits can be configured per-provider via **Cloud tab → Settings** or API.

Check status: `GET /api/cloud/metrics/cost-limit-status`

7. Limitations¶
Workflow / Pipeline Jobs (DAGs)¶
SimpleTuner does not support job dependencies or multi-step workflows where one job's output feeds into another. Each cloud job is independent.
Recommended approach: Use external orchestration tools like Airflow, Prefect, or Dagster to chain jobs via the REST API. See the Airflow example in the API Cookbook above.
Resuming Training Runs¶
There is no built-in support for resuming interrupted, failed, or cancelled training runs. Cloud jobs do not automatically recover from checkpoints.
Workarounds:
- Configure frequent HuggingFace Hub pushes (--push_checkpoints_to_hub) to save intermediate state
- Implement custom checkpoint management by downloading outputs and using them as starting points for new jobs
- For mission-critical workloads, consider breaking long training runs into smaller segments
UI Feature Reference
| Feature | UI Location | API |
|---------|-------------|-----|
| Organizations & Teams | Administration → Orgs | `/api/orgs` |
| Quotas | Administration → Quotas | `/api/orgs/{id}/quotas` |
| OIDC/LDAP | Administration → Auth | `/api/cloud/external-auth/providers` |
| Users | Administration → Users | `/api/users` |
| Audit Logs | Sidebar → Audit Log | `/api/audit` |
| Queue | Cloud tab → Job Queue | `/api/queue/concurrency` |
| Approvals | Administration → Approvals | `/api/approvals/requests` |

The Administration section is visible when no auth is configured (single-user mode) or the user has admin privileges.

Enterprise Onboarding Flow
The Admin panel includes a guided onboarding that helps set up authentication, organizations, teams, quotas, and credentials in order.

| Step | Feature |
|------|---------|
| 1 | Authentication (OIDC/LDAP) |
| 2 | Organization |
| 3 | Teams |
| 4 | Quotas |
| 5 | Credentials |

Each step can be completed or skipped. State persists in browser localStorage.

8. Notification System¶
SimpleTuner includes a multi-channel notification system for job status, approvals, quotas, and system events.
| Channel | Use Case |
|---|---|
| Email | Approval requests, job completion (SMTP/IMAP) |
| Webhook | CI/CD integration (JSON + HMAC signatures) |
| Slack | Team notifications (incoming webhooks) |
Configure via Administration → Notifications or API.
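On the receiving side of the Webhook channel, the HMAC signature should be checked before the payload is trusted. The sketch below assumes the `X-SimpleTuner-Signature` header (named in the channel configuration notes) carries a hex-encoded HMAC-SHA256 of the raw request body; the encoding is an assumption, and `sign` and `verify` are hypothetical helpers.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw body (encoding is an assumption)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, header_value: str) -> bool:
    """Compare in constant time against the received header value."""
    expected = sign(secret, body).encode("ascii")
    received = header_value.strip().encode("utf-8")
    return hmac.compare_digest(expected, received)
```

Verification must use the exact bytes received, before any JSON parsing or re-serialization, because even a whitespace change alters the digest.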
Event Types
| Category | Events |
|----------|--------|
| Approval | `approval.required`, `approval.granted`, `approval.rejected`, `approval.expired` |
| Job | `job.submitted`, `job.started`, `job.completed`, `job.failed`, `job.cancelled` |
| Quota | `quota.warning`, `quota.exceeded`, `cost.warning`, `cost.exceeded` |
| System | `system.provider_error`, `system.provider_degraded`, `system.webhook_failure` |
| Auth | `auth.login_failure`, `auth.new_device` |

Channel Configuration Examples
**Email:**

**Slack:**

**Webhook:** Payloads signed with HMAC-SHA256 (`X-SimpleTuner-Signature` header).

9. Resource Rules¶
Resource rules provide fine-grained access control for configs, hardware types, and output paths using glob patterns.
| Type | Example Pattern |
|---|---|
| `config` | `team-x-*`, `production-*` |
| `hardware` | `gpu-a100*`, `*-80gb` |
| `provider` | `replicate`, `runpod` |
Rules use allow/deny actions with "most permissive wins" logic. Configure via Administration → Rules.
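One way to read "most permissive wins" is that a resource is allowed whenever any matching rule allows it. The sketch below follows that reading; allowing resources that match no rule at all is an assumption, and `ResourceRule` and `is_allowed` are hypothetical names.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Iterable


@dataclass(frozen=True)
class ResourceRule:
    resource_type: str  # "config", "hardware", or "provider"
    pattern: str  # glob pattern, e.g. "team-x-*"
    action: str  # "allow" or "deny"


def is_allowed(rules: Iterable[ResourceRule], resource_type: str, name: str, default: bool = True) -> bool:
    """Allow if any matching rule allows; fall back to `default` when nothing matches."""
    matched = [
        rule for rule in rules
        if rule.resource_type == resource_type and fnmatchcase(name, rule.pattern)
    ]
    if not matched:
        return default
    return any(rule.action == "allow" for rule in matched)
```

Under this reading, team isolation is expressed as a broad deny on `*` plus an allow on `team-x-*`: configs matching both are allowed, and everything else is denied.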
Rule Examples
**Team Isolation:** Researchers can only use configs starting with "team-x-"

**Hardware Restrictions:** Researchers limited to T4/V100, leads can use any hardware

10. Permission Matrix¶
Full Permission Matrix
### Job Permissions

| Permission | Viewer | Researcher | Lead | Admin |
|------------|:------:|:----------:|:----:|:-----:|
| `job.submit` | | ✓ | ✓ | ✓ |
| `job.view.own` | ✓ | ✓ | ✓ | ✓ |
| `job.view.all` | | | ✓ | ✓ |
| `job.cancel.own` | | ✓ | ✓ | ✓ |
| `job.cancel.all` | | | | ✓ |
| `job.priority.high` | | | ✓ | ✓ |
| `job.bypass.queue` | | | | ✓ |
| `job.bypass.approval` | | | | ✓ |

### Config Permissions

| Permission | Viewer | Researcher | Lead | Admin |
|------------|:------:|:----------:|:----:|:-----:|
| `config.view` | ✓ | ✓ | ✓ | ✓ |
| `config.create` | | ✓ | ✓ | ✓ |
| `config.edit.own` | | ✓ | ✓ | ✓ |
| `config.edit.all` | | | | ✓ |

### Admin Permissions

| Permission | Viewer | Researcher | Lead | Admin |
|------------|:------:|:----------:|:----:|:-----:|
| `admin.users` | | | | ✓ |
| `admin.approve` | | | ✓ | ✓ |
| `admin.audit` | | | ✓ | ✓ |
| `admin.config` | | | | ✓ |
| `queue.approve` | | | ✓ | ✓ |
| `queue.manage` | | | | ✓ |

### Org/Team Permissions

| Permission | Viewer | Researcher | Lead | Admin |
|------------|:------:|:----------:|:----:|:-----:|
| `org.view` | | | ✓ | ✓ |
| `org.create` | | | | ✓ |
| `team.view` | | | ✓ | ✓ |
| `team.create` | | | ✓ | ✓ |
| `team.manage.members` | | | ✓ | ✓ |

Permission Overrides: Individual users can have permissions granted or denied via Administration → Users → Permission Overrides.
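A rough sketch of how role defaults and per-user overrides might combine, using only the job permissions from the matrix. It assumes an explicit override (grant or deny) always takes precedence over the role default; that precedence and the `has_permission` helper are assumptions, not documented behavior.

```python
from typing import Dict, Optional, Set

# Job permissions per role, copied from the matrix above.
_VIEWER: Set[str] = {"job.view.own"}
_RESEARCHER = _VIEWER | {"job.submit", "job.cancel.own"}
_LEAD = _RESEARCHER | {"job.view.all", "job.priority.high"}
_ADMIN = _LEAD | {"job.cancel.all", "job.bypass.queue", "job.bypass.approval"}

ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "viewer": _VIEWER,
    "researcher": _RESEARCHER,
    "lead": _LEAD,
    "admin": _ADMIN,
}


def has_permission(role: str, permission: str, overrides: Optional[Dict[str, bool]] = None) -> bool:
    """Per-user overrides win over role defaults (assumed precedence)."""
    if overrides and permission in overrides:
        return overrides[permission]
    return permission in ROLE_PERMISSIONS.get(role, set())
```

For instance, a viewer with the override `{"job.submit": True}` can submit jobs, while an admin with `{"job.cancel.all": False}` cannot cancel other users' jobs.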