Structured Logging and Background Tasks¶
This document covers the structured logging system and background task workers in SimpleTuner's cloud training feature.
Table of Contents¶
- Structured Logging
  - Configuration
  - JSON Log Format
  - LogContext for Field Injection
  - Correlation IDs
- Background Tasks
  - Job Status Polling Worker
  - Queue Processing Worker
  - Approval Expiration Worker
- Configuration Options
- Debugging with Logs
Structured Logging¶
SimpleTuner's cloud training uses a structured JSON logging system that provides consistent, parseable log output with automatic correlation ID tracking for distributed tracing.
Configuration¶
Configure logging via environment variables:
```bash
# Log level: DEBUG, INFO, WARNING, ERROR, CRITICAL
export SIMPLETUNER_LOG_LEVEL="INFO"

# Format: "json" (structured) or "text" (traditional)
export SIMPLETUNER_LOG_FORMAT="json"

# Optional: log to file in addition to stdout
export SIMPLETUNER_LOG_FILE="/var/log/simpletuner/cloud.log"
```
Programmatic configuration
```python
from simpletuner.simpletuner_sdk.server.services.cloud.structured_logging import (
    configure_structured_logging,
    init_from_env,
)

# Configure with explicit options
configure_structured_logging(
    level="INFO",
    json_output=True,
    log_file="/var/log/simpletuner/cloud.log",
    include_stack_info=False,  # set True to include stack traces for errors
)

# Or initialize from environment variables
init_from_env()
```
JSON Log Format¶
When JSON output is enabled, each log entry includes:
Example JSON log entry
```json
{
  "timestamp": "2024-01-15T10:30:00.000Z",
  "level": "INFO",
  "logger": "simpletuner.cloud.jobs",
  "message": "Job submitted successfully",
  "correlation_id": "abc123def456",
  "source": {
    "file": "jobs.py",
    "line": 350,
    "function": "submit_job"
  },
  "extra": {
    "job_id": "xyz789",
    "provider": "replicate",
    "cost_estimate": 2.50
  }
}
```
| Field | Description |
|---|---|
| `timestamp` | ISO 8601 timestamp in UTC |
| `level` | Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL) |
| `logger` | Logger name hierarchy |
| `message` | Human-readable log message |
| `correlation_id` | Request tracing ID (auto-generated or propagated) |
| `source` | File, line number, and function name |
| `extra` | Additional structured fields from LogContext |
LogContext for Field Injection¶
Use LogContext to automatically add structured fields to all logs within a scope:
LogContext usage example
```python
from simpletuner.simpletuner_sdk.server.services.cloud.structured_logging import (
    get_logger,
    LogContext,
)

logger = get_logger("simpletuner.cloud.jobs")

async def process_job(job_id: str, provider: str):
    # All logs within this block include job_id and provider
    with LogContext(job_id=job_id, provider=provider):
        logger.info("Starting job processing")

        # Nested context adds more fields
        with LogContext(step="validation"):
            logger.info("Validating configuration")

        with LogContext(step="submission"):
            logger.info("Submitting to provider")

        logger.info("Job processing complete")
```
Common fields to inject:
| Field | Purpose |
|---|---|
| `job_id` | Training job identifier |
| `provider` | Cloud provider (replicate, etc.) |
| `user_id` | Authenticated user |
| `step` | Processing phase (validation, upload, submission) |
| `attempt` | Retry attempt number |
Correlation IDs¶
Correlation IDs enable request tracing across service boundaries. They are:
- Auto-generated for each new request thread if not present
- Propagated via the `X-Correlation-ID` HTTP header
- Stored in thread-local storage for automatic log injection
- Included in outbound HTTP requests to cloud providers
Correlation ID flow diagram
Manual correlation ID management
```python
from simpletuner.simpletuner_sdk.server.services.cloud.http_client import (
    get_correlation_id,
    set_correlation_id,
    clear_correlation_id,
)

# Get current ID (auto-generates if none exists)
current_id = get_correlation_id()

# Set a specific ID (e.g., from incoming request header)
set_correlation_id("request-abc-123")

# Clear when request completes
clear_correlation_id()
```
Correlation ID in HTTP clients
The HTTP client factory automatically includes the correlation ID in outbound requests:

```python
from simpletuner.simpletuner_sdk.server.services.cloud.http_client import (
    get_async_client,
)

# Correlation ID is automatically added to the X-Correlation-ID header
async with get_async_client() as client:
    response = await client.get("https://api.replicate.com/v1/predictions")
    # Request includes: X-Correlation-ID: <current-id>
```
Background Tasks¶
The cloud training system runs several background workers to handle asynchronous operations.
Background Task Manager¶
All background tasks are managed by the BackgroundTaskManager singleton:
Task manager usage
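The manager's concrete API is not documented here, so the following is only an illustrative sketch of an asyncio-based singleton task manager; every name in it (`start`, `stop_all`, the internal `_tasks` dict) is hypothetical and stands in for whatever the real class exposes:

```python
import asyncio

class BackgroundTaskManager:
    """Illustrative singleton that owns long-running asyncio tasks."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._tasks = {}
        return cls._instance

    def start(self, name, coro_factory):
        # Create and track a named background task
        self._tasks[name] = asyncio.ensure_future(coro_factory())

    async def stop_all(self):
        # Cancel every tracked task and wait for shutdown
        for task in self._tasks.values():
            task.cancel()
        await asyncio.gather(*self._tasks.values(), return_exceptions=True)
        self._tasks.clear()
```

The singleton pattern matters here: every component that registers a worker gets the same manager, so shutdown can cancel all background tasks in one place.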
Job Status Polling Worker¶
The job polling worker synchronizes job statuses from cloud providers. This is useful when webhooks are not available (e.g., behind a firewall).
Purpose:

- Poll active jobs (pending, uploading, queued, running) from cloud providers
- Update the local job store with current status
- Emit SSE events when status changes
- Update queue entries for terminal statuses
Polling flow diagram
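One polling cycle as described above can be sketched roughly as follows; `provider.get_status`, the `job_store` methods, and `emit_sse` are hypothetical stand-ins for the real internals, which this document does not show:

```python
import asyncio

# Statuses the worker considers "active" (from the purpose list above)
ACTIVE_STATUSES = {"pending", "uploading", "queued", "running"}

async def poll_jobs_once(provider, job_store, emit_sse):
    """One polling cycle: sync active jobs and emit events on change."""
    for job in job_store.list_jobs():
        if job["status"] not in ACTIVE_STATUSES:
            continue  # terminal jobs are not re-polled
        remote_status = provider.get_status(job["id"])  # hypothetical provider call
        if remote_status != job["status"]:
            job_store.update(job["id"], remote_status)
            emit_sse({"job_id": job["id"], "status": remote_status})
```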
Auto-enable logic
The polling worker starts automatically if no webhook URL is configured.

Queue Processing Worker¶
Handles job scheduling and dispatch based on queue priority and concurrency limits.
Purpose:

- Process the job queue every 5 seconds
- Dispatch jobs according to priority
- Respect concurrency limits per user/organization
- Handle queue entry state transitions
Queue Processing Interval: 5 seconds (fixed)
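As an illustration of priority dispatch under per-user concurrency limits, here is a minimal sketch; the real scheduler's data model is not shown in this document, so the tuple layout and every name below are assumptions:

```python
import heapq

def dispatch(queue_entries, running_per_user, max_per_user, start_job):
    """Dispatch queued jobs in priority order, respecting per-user limits.

    queue_entries: list of (priority, user_id, job_id) tuples, lower = sooner.
    running_per_user: dict mapping user_id -> currently running job count.
    """
    heapq.heapify(queue_entries)
    dispatched = []
    while queue_entries:
        priority, user_id, job_id = heapq.heappop(queue_entries)
        if running_per_user.get(user_id, 0) >= max_per_user:
            continue  # user at concurrency limit; entry is skipped this pass
        running_per_user[user_id] = running_per_user.get(user_id, 0) + 1
        start_job(job_id)
        dispatched.append(job_id)
    return dispatched
```

A real scheduler would re-queue skipped entries for the next 5-second pass rather than drop them; the sketch only shows the priority/limit interaction.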
Approval Expiration Worker¶
Automatically expires and rejects pending approval requests that have passed their deadline.
Purpose:

- Check for expired approval requests every 5 minutes
- Auto-reject jobs with expired approvals
- Update queue entries to failed state
- Emit SSE notifications for expired approvals
Processing flow diagram
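The expiration pass described above amounts to a deadline sweep. A minimal sketch (the request shape and field names are assumptions, not the real data model):

```python
from datetime import datetime, timedelta, timezone

def expire_approvals(pending, now=None):
    """Auto-reject approval requests past their deadline (illustrative).

    pending: list of dicts with "deadline" (aware datetime) and "status".
    Returns the requests expired in this pass.
    """
    now = now or datetime.now(timezone.utc)
    expired = []
    for request in pending:
        if request["status"] == "pending" and request["deadline"] < now:
            request["status"] = "rejected"  # auto-reject on expiry
            expired.append(request)
    return expired
```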
Configuration Options¶
Environment Variable¶
Enterprise configuration file
Create `simpletuner-enterprise.yaml`.

Configuration Properties¶
| Property | Default | Description |
|---|---|---|
| `job_polling_enabled` | `false` (auto-enabled if no webhook) | Enable explicit polling |
| `job_polling_interval` | 30 seconds | Polling interval |
| Queue processing | Always enabled | Cannot be disabled |
| Approval expiration | Always enabled | Checks every 5 minutes |
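Putting the configurable properties from the table together, the enterprise configuration file might look like this; only the two polling keys come from the table above, and the exact file structure is an assumption:

```yaml
# simpletuner-enterprise.yaml (structure assumed for illustration)
job_polling_enabled: true
job_polling_interval: 30  # seconds
```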
Accessing configuration programmatically
Debugging with Logs¶
Finding Related Log Entries¶
Use the correlation ID to trace a request across all components:
Log filtering commands
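For example, grep the JSON logs for a correlation ID to collect every entry belonging to one request. The commands below run against an inline sample file so they are self-contained; in practice, point them at your configured `SIMPLETUNER_LOG_FILE`:

```shell
log=/tmp/cloud.log
printf '%s\n' \
  '{"correlation_id": "abc123def456", "message": "Job submitted successfully"}' \
  '{"correlation_id": "fff000", "message": "Unrelated request"}' > "$log"

# All log entries for one traced request
grep '"correlation_id": "abc123def456"' "$log"
```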
Filtering by job
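Since `job_id` is injected via LogContext into the `extra` block, a plain grep on the field works too (again shown against an inline sample rather than a real log file):

```shell
log=/tmp/cloud.log
printf '%s\n' \
  '{"message": "Job submitted successfully", "extra": {"job_id": "xyz789"}}' \
  '{"message": "Other job", "extra": {"job_id": "abc111"}}' > "$log"

# All entries mentioning a specific training job
grep '"job_id": "xyz789"' "$log"
```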
Monitoring background tasks
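To confirm the background workers came up, search for the startup messages listed under Common Log Messages below (inline sample shown for a self-contained example):

```shell
log=/tmp/cloud.log
printf '%s\n' \
  '{"level": "INFO", "message": "Starting job status polling"}' \
  '{"level": "INFO", "message": "Queue scheduler started"}' > "$log"

# Confirm the polling and queue workers have started
grep -E '"message": "(Starting job status polling|Queue scheduler started)"' "$log"
```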
Log Level Recommendations¶
| Environment | Level | Rationale |
|---|---|---|
| Development | DEBUG | Full visibility for troubleshooting |
| Staging | INFO | Normal operation with key events |
| Production | INFO or WARNING | Balance between visibility and volume |
Common Log Messages¶
| Message | Level | Meaning |
|---|---|---|
| "Starting job status polling" | INFO | Polling worker started |
| "Synced N active jobs" | DEBUG | Polling cycle completed |
| "Queue scheduler started" | INFO | Queue processing active |
| "Expired N approval requests" | INFO | Approvals auto-rejected |
| "Failed to sync job X" | DEBUG | Single job sync failed (transient) |
| "Error in job polling" | ERROR | Polling loop encountered error |
Integrating with Log Aggregators¶
The JSON log format is compatible with:
- Elasticsearch/Kibana: Direct ingestion of JSON logs
- Splunk: JSON parsing with field extraction
- Datadog: Log pipeline with JSON parsing
- Loki/Grafana: Use the `json` parser
Loki/Promtail configuration example
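A Promtail scrape config for these logs might look like the following sketch; the log path matches `SIMPLETUNER_LOG_FILE` from earlier, while the job names and extracted fields are assumptions:

```yaml
scrape_configs:
  - job_name: simpletuner_cloud
    static_configs:
      - targets:
          - localhost
        labels:
          job: simpletuner-cloud
          __path__: /var/log/simpletuner/cloud.log
    pipeline_stages:
      # Parse each line as JSON and extract fields for labeling/filtering
      - json:
          expressions:
            level: level
            correlation_id: correlation_id
      # Promote the log level to a Loki label
      - labels:
          level:
```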
Troubleshooting Checklist¶
- Request not being traced?
  - Check whether the `X-Correlation-ID` header is being set
  - Verify that `CorrelationIDFilter` is attached to the loggers
- Context fields not appearing?
  - Ensure the code runs inside a `LogContext` block
  - Check that JSON output is enabled
- Polling not working?
  - Check whether a webhook URL is configured (this disables auto-polling)
  - Verify the enterprise config if using explicit polling
  - Look for the "Starting job status polling" log message
- Queue not processing?
  - Check for the "Queue scheduler started" message
  - Look for errors near "Failed to start queue processing"