Workers
Distributed workers execute DAG tasks across multiple machines, enabling horizontal scaling and specialized hardware utilization.
Architecture
Workers connect to a coordinator service and poll for tasks via gRPC long-polling. The coordinator distributes tasks based on worker labels and availability.
┌─────────────────────────────────────────────────────────────┐
│                      Boltbase Instance                      │
├──────────────┬────────────────┬─────────────────────────────┤
│  Scheduler   │    Web UI      │    Coordinator Service      │
│              │                │      (gRPC Server)          │
└──────────────┴────────────────┴─────────────────────────────┘
                                               │
                                               │ gRPC (Long Polling)
                                               │
                 ┌─────────────────────────────┴────────────────┐
                 │                                              │
          ┌──────▼───────┐                             ┌────────▼──────┐
          │   Worker 1   │                             │   Worker N    │
          │              │                             │               │
          │ Labels:      │                             │ Labels:       │
          │ - gpu=true   │                             │ - region=eu   │
          │ - memory=64G │                             │ - cpu=high    │
          └──────────────┘                             └───────────────┘
How Workers Operate
- Polling: Each worker runs multiple concurrent pollers (configurable via max_active_runs, default: 100)
- Task Assignment: The coordinator matches tasks to workers based on worker_selector labels
- Heartbeat: Workers send heartbeats every 1 second to report health status
- Execution: Workers execute assigned DAGs using the same execution engine as the main instance
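The label-based task assignment described above can be sketched as follows. This is a minimal illustration, not Boltbase's actual matching code: the function names are hypothetical, and it assumes that a worker is eligible only when every key in the task's worker_selector equals the worker's label value, and that an empty selector matches any worker. The comma-separated key=value form is the one used by the BOLTBASE_WORKER_LABELS environment variable.

```python
def parse_labels(spec: str) -> dict:
    """Parse a 'key=value,key=value' string into a dict of labels."""
    labels = {}
    for pair in spec.split(","):
        pair = pair.strip()
        if not pair:
            continue
        key, _, value = pair.partition("=")
        labels[key.strip()] = value.strip()
    return labels


def selector_matches(selector: dict, labels: dict) -> bool:
    """Assumed rule: every selector entry must equal the worker's label."""
    return all(labels.get(key) == value for key, value in selector.items())


def eligible_workers(selector: dict, workers: dict) -> list:
    """Return the IDs of workers whose labels satisfy the selector."""
    return [wid for wid, labels in workers.items()
            if selector_matches(selector, labels)]


# Hypothetical workers mirroring the labels in the architecture diagram.
workers = {
    "worker-1": parse_labels("gpu=true,memory=64G"),
    "worker-n": parse_labels("region=eu,cpu=high"),
}
```

Under these assumptions, a task with the selector {"gpu": "true"} would be eligible only for worker-1, while a selector that asks for both gpu=true and region=eu would match neither worker.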
Worker Identification
Workers are identified by a unique ID that defaults to hostname@PID. This can be customized:
boltbase worker --worker.id=gpu-worker-01
Deployment Modes
Workers support two deployment modes based on your infrastructure:
| Feature | Shared Filesystem | Shared Nothing |
|---|---|---|
| Storage Requirement | NFS/shared volume | None |
| Service Discovery | File-based registry | Static coordinator list |
| Status Persistence | Direct file writes | gRPC ReportStatus |
| Log Storage | Direct file writes | gRPC StreamLogs |
| Zombie Detection | File-based heartbeats | Coordinator-based |
| Use Cases | Docker Compose, single-cluster | Kubernetes, multi-cloud |
Shared Filesystem Mode
Workers share filesystem access with the coordinator. Workers write status and logs directly to shared storage.
Shared Nothing Mode
Workers operate without any shared storage. All communication happens via gRPC to the coordinator.
Monitoring
Web UI Workers Page
The Workers page in the Web UI shows:
- Connected workers and their labels
- Worker health status
- Currently running tasks on each worker
- Task hierarchy (root/parent/sub DAGs)
Health Status
The coordinator tracks worker health based on heartbeat recency:
| Status | Condition |
|---|---|
| Healthy | Last heartbeat < 5 seconds ago |
| Warning | Last heartbeat 5-15 seconds ago |
| Unhealthy | Last heartbeat > 15 seconds ago |
| Offline | No heartbeat for > 30 seconds |
When a worker's heartbeat becomes stale (>30 seconds), the coordinator's zombie detector marks all running tasks from that worker as failed.
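As a rough illustration of the thresholds in the table above, the status can be derived from the age of the last heartbeat. The function name is hypothetical, and the handling of the exact boundary values (5, 15, and 30 seconds) is an assumption read off the table, not confirmed coordinator behavior.

```python
def classify_health(seconds_since_heartbeat: float) -> str:
    """Map heartbeat age to a status, checking the strictest threshold first."""
    if seconds_since_heartbeat > 30:
        return "Offline"      # stale: zombie detection applies
    if seconds_since_heartbeat > 15:
        return "Unhealthy"
    if seconds_since_heartbeat >= 5:
        return "Warning"
    return "Healthy"
```

Checking the 30-second threshold before the 15-second one resolves the overlap between the Unhealthy and Offline rows: any age above 30 seconds is reported as Offline.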
API Endpoint
# Get worker status via API
curl -H "Authorization: Bearer $TOKEN" \
  http://localhost:8080/api/v1/workers
Response:
{
  "workers": [
    {
      "id": "worker-gpu-01",
      "labels": {"gpu": "true", "memory": "64G"},
      "health_status": "HEALTHY",
      "last_heartbeat": "2024-02-11T12:00:00Z",
      "running_tasks": [
        {
          "dag_name": "ml-pipeline",
          "dag_run_id": "20240211_120000",
          "root_dag_run_name": "ml-pipeline",
          "started_at": "2024-02-11T12:00:00Z"
        }
      ]
    }
  ]
}
Configuration Reference
Worker Configuration
# config.yaml
worker:
  id: "worker-gpu-01"     # Defaults to hostname@PID
  max_active_runs: 100    # Number of concurrent pollers
  labels:
    gpu: "true"
    memory: "64G"
PostgreSQL Connection Pool
In shared-nothing mode (when worker.coordinators is configured), workers use a global PostgreSQL connection pool to prevent connection exhaustion when running multiple concurrent DAGs.
# config.yaml
worker:
  id: "worker-gpu-01"
  max_active_runs: 100
  postgres_pool:
    max_open_conns: 25       # Total connections across ALL PostgreSQL DSNs
    max_idle_conns: 5        # Idle connections per DSN
    conn_max_lifetime: 300   # Connection lifetime in seconds
    conn_max_idle_time: 60   # Idle connection timeout in seconds
This applies only in shared-nothing mode and only to PostgreSQL. SQLite always uses 1 connection per step.
See Shared Nothing Mode — PostgreSQL Connection Pool Management for detailed configuration guidance.
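To make the idea of a single limit shared across all DSNs concrete, here is a small sketch of a global connection budget built on a semaphore. It is an illustration of the concept only and not how Boltbase implements its pool: the class name, method name, and timeout behavior are all assumptions, and no real database connection is opened.

```python
import threading
from contextlib import contextmanager
from typing import Optional


class GlobalConnectionBudget:
    """Caps connections in use across every DSN (hypothetical helper)."""

    def __init__(self, max_open_conns: int = 25):
        # One semaphore shared by all DSNs, mirroring max_open_conns.
        self._slots = threading.BoundedSemaphore(max_open_conns)

    @contextmanager
    def connection(self, dsn: str, timeout: Optional[float] = None):
        """Hold one slot for the duration of the with-block."""
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError(f"no free connection slot for {dsn}")
        try:
            # A real pool would yield a live connection here.
            yield dsn
        finally:
            self._slots.release()
```

With a budget of 25, the 26th concurrent step waits (or times out) regardless of which DSN it targets, which is the property that keeps many concurrent DAG runs from exhausting the database's connection limit.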
Environment Variables
export BOLTBASE_WORKER_ID=worker-01
export BOLTBASE_WORKER_LABELS="gpu=true,region=us-east-1"
export BOLTBASE_WORKER_MAX_ACTIVE_RUNS=50
# PostgreSQL connection pool (shared-nothing mode only)
export BOLTBASE_WORKER_POSTGRES_POOL_MAX_OPEN_CONNS=25
export BOLTBASE_WORKER_POSTGRES_POOL_MAX_IDLE_CONNS=5
export BOLTBASE_WORKER_POSTGRES_POOL_CONN_MAX_LIFETIME=300
export BOLTBASE_WORKER_POSTGRES_POOL_CONN_MAX_IDLE_TIME=60
Technical Details
| Parameter | Value | Description |
|---|---|---|
| Heartbeat interval | 1 second | How often workers report health |
| Heartbeat backoff | 1s base, 1.5x factor, 15s max | Backoff on heartbeat failures |
| Poll backoff | 1s base, 2.0x factor, 1 minute max | Backoff on poll failures |
| Stale threshold | 30 seconds | When workers are considered offline |
| Default port | 50055 | Coordinator gRPC port |
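The backoff parameters in the table can be read as a capped exponential schedule. The sketch below computes the successive delays under two assumptions that the table does not state: the first retry waits exactly the base delay, and no jitter is added. The function name is hypothetical.

```python
def backoff_delays(base: float, factor: float, max_delay: float, attempts: int) -> list:
    """Return the delay (in seconds) before each of the first `attempts` retries."""
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(min(delay, max_delay))  # never exceed the cap
        delay *= factor
    return delays


# Heartbeat backoff: 1s base, 1.5x factor, 15s max
heartbeat = backoff_delays(1.0, 1.5, 15.0, 8)
# Poll backoff: 1s base, 2.0x factor, 1 minute max
poll = backoff_delays(1.0, 2.0, 60.0, 7)
```

Under these assumptions the heartbeat delay reaches its 15-second cap on the eighth consecutive failure, and the poll delay reaches its 60-second cap on the seventh.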
