Shared Nothing Mode
In shared nothing mode, workers operate without any shared filesystem access: all status updates and logs are transmitted to the coordinator via gRPC. No shared storage is required, but status reporting and log delivery depend on network connectivity to the coordinator.
Overview
```
┌──────────────────────────────────────────────────────────────┐
│                      Boltbase Instance                        │
│              (Scheduler + Web UI + Coordinator)               │
│                                                               │
│   ┌─────────────────────────────────────────────────────┐     │
│   │                    Local Storage                     │     │
│   │  ┌───────────────┬─────────────────┬────────────┐   │     │
│   │  │   dag-runs/   │      logs/      │   dags/    │   │     │
│   │  │   (status)    │ (execution logs)│            │   │     │
│   │  └───────────────┴─────────────────┴────────────┘   │     │
│   └─────────────────────────────────────────────────────┘     │
└──────────────────────────────────────────────────────────────┘
           ▲                                   │
           │ ReportStatus + StreamLogs         │ Task Dispatch
           │ (gRPC)                            │ (gRPC)
           │                                   ▼
┌──────────┴─────────┐              ┌────────────────────┐
│      Worker 1      │              │      Worker N      │
│  (No local state)  │              │  (No local state)  │
└────────────────────┘              └────────────────────┘
```

How It Works
Static Discovery
Workers connect directly to coordinators using explicit addresses:
```bash
boltbase worker --worker.coordinators=coordinator-1:50055,coordinator-2:50055
```

No service registry or shared storage is required.
Status Pushing
Workers send execution status to the coordinator via the `ReportStatus` gRPC call:

- Worker executes a DAG step
- Worker calls `ReportStatus` with the full `DAGRunStatus`
- Coordinator persists the status to its local `DAGRunStore`
- Web UI reads status from the coordinator's local storage
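As an illustration of this flow, here is a minimal worker-side sketch in Go. The `DAGRunStatus` struct and `CoordinatorClient` interface are hypothetical stand-ins for the generated gRPC types, and the timeout is an arbitrary choice; this is not the actual Boltbase API.

```go
// Illustrative sketch: DAGRunStatus and CoordinatorClient are hypothetical
// stand-ins for the generated gRPC types, not the real Boltbase API.
package worker

import (
	"context"
	"fmt"
	"time"
)

// DAGRunStatus stands in for the full status message pushed after each step.
type DAGRunStatus struct {
	DAGRunID string
	Status   string            // e.g. "running", "succeeded", "failed"
	Nodes    map[string]string // node name -> state
}

// CoordinatorClient stands in for the worker's gRPC client.
type CoordinatorClient interface {
	ReportStatus(ctx context.Context, st *DAGRunStatus) error
}

// reportAfterStep pushes the full DAG-run status to the coordinator after a
// step finishes; the coordinator persists it to its local DAGRunStore.
func reportAfterStep(ctx context.Context, c CoordinatorClient, st *DAGRunStatus) error {
	// Timeout value is arbitrary for the sketch.
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	if err := c.ReportStatus(ctx, st); err != nil {
		return fmt.Errorf("report status for run %s: %w", st.DAGRunID, err)
	}
	return nil
}
```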
Log Streaming
Workers stream stdout/stderr to the coordinator via the `StreamLogs` gRPC call:

- Worker buffers log output in 32KB chunks
- Worker sends `LogChunk` messages with sequence numbers
- Coordinator writes to local log files, flushing every 64KB
- Worker sends a final marker when execution completes
Log streaming is best-effort: failures don't fail the step execution. Some logs may be lost if network issues occur during streaming.
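A minimal sketch of the worker side of this protocol, assuming hypothetical `LogChunk` and `LogStream` types in place of the generated gRPC stream; only the 32KB chunk size, the sequence numbers, the final marker, and the best-effort behavior come from the description above.

```go
// Illustrative sketch: LogChunk and LogStream are hypothetical stand-ins
// for the generated gRPC stream types.
package worker

import (
	"bufio"
	"io"
	"log"
)

type LogChunk struct {
	DAGName  string
	DAGRunID string
	StepName string
	Stream   string // "STDOUT" or "STDERR"
	Sequence uint64
	Data     []byte
	Final    bool
}

type LogStream interface {
	Send(chunk *LogChunk) error
}

// streamLogs forwards step output to the coordinator in 32KB chunks with
// increasing sequence numbers, then sends a final marker. Send failures are
// logged and ignored: streaming is best-effort and never fails the step.
func streamLogs(r io.Reader, s LogStream, dag, runID, step, kind string) {
	br := bufio.NewReaderSize(r, 32*1024)
	buf := make([]byte, 32*1024)
	var seq uint64
	for {
		n, err := br.Read(buf)
		if n > 0 {
			chunk := &LogChunk{
				DAGName: dag, DAGRunID: runID, StepName: step, Stream: kind,
				Sequence: seq, Data: append([]byte(nil), buf[:n]...),
			}
			if sendErr := s.Send(chunk); sendErr != nil {
				log.Printf("dropping log chunk %d (best-effort): %v", seq, sendErr)
			}
			seq++
		}
		if err != nil { // io.EOF or a read error: emit the final marker and stop
			_ = s.Send(&LogChunk{DAGRunID: runID, StepName: step, Stream: kind, Sequence: seq, Final: true})
			return
		}
	}
}
```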
Zombie Detection
The coordinator monitors worker heartbeats and marks tasks as failed when workers become unresponsive:
| Parameter | Value |
|---|---|
| Heartbeat interval | 1 second |
| Stale threshold | 30 seconds |
| Detector interval | 45 seconds |
When a worker stops sending heartbeats:
- Coordinator detects the stale heartbeat (more than 30 seconds old)
- Coordinator marks all running tasks from that worker as `FAILED`
- Error message: `worker {workerID} became unresponsive`
- All running nodes within the task are also marked as `FAILED`
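A sketch of what such a detector loop can look like, using the documented intervals; the heartbeat source and task store interfaces are hypothetical, not Boltbase's internal types.

```go
// Illustrative sketch: the heartbeat source and TaskStore are hypothetical;
// only the intervals and the error message follow the documented behavior.
package coordinator

import (
	"fmt"
	"time"
)

const (
	staleThreshold   = 30 * time.Second // heartbeat older than this => worker is stale
	detectorInterval = 45 * time.Second // how often the detector scans heartbeats
)

type TaskStore interface {
	RunningTasks(workerID string) []string
	MarkFailed(taskID, reason string)
}

// runZombieDetector periodically scans the latest heartbeat per worker and
// fails every running task owned by a worker that has gone stale.
func runZombieDetector(lastHeartbeats func() map[string]time.Time, tasks TaskStore) {
	ticker := time.NewTicker(detectorInterval)
	defer ticker.Stop()
	for range ticker.C {
		now := time.Now()
		for workerID, last := range lastHeartbeats() {
			if now.Sub(last) <= staleThreshold {
				continue
			}
			reason := fmt.Sprintf("worker %s became unresponsive", workerID)
			for _, taskID := range tasks.RunningTasks(workerID) {
				tasks.MarkFailed(taskID, reason)
			}
		}
	}
}
```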
Configuration
Coordinator
```bash
# Bind to all interfaces
boltbase coordinator --coordinator.host=0.0.0.0 --coordinator.port=50055

# With advertise address for Kubernetes/Docker
boltbase coordinator \
  --coordinator.host=0.0.0.0 \
  --coordinator.advertise=boltbase-coordinator.default.svc.cluster.local \
  --coordinator.port=50055
```

Workers

```bash
# Connect to specific coordinators (no service registry)
boltbase worker \
  --worker.coordinators=coordinator-1:50055,coordinator-2:50055 \
  --worker.labels=gpu=true,region=us-east-1
```

Configuration File
```yaml
# Coordinator config.yaml
coordinator:
  host: 0.0.0.0
  port: 50055
  advertise: boltbase-coordinator.default.svc.cluster.local

paths:
  data_dir: "/var/lib/boltbase/data"   # Local storage for status
  log_dir: "/var/lib/boltbase/logs"    # Local storage for logs

---
# Worker config.yaml
worker:
  id: "worker-gpu-01"
  coordinators:
    - "coordinator-1:50055"
    - "coordinator-2:50055"
  labels:
    gpu: "true"
    region: "us-east-1"
  postgres_pool:
    max_open_conns: 25       # Total connections across ALL PostgreSQL DSNs
    max_idle_conns: 5        # Per-DSN idle connections
    conn_max_lifetime: 300   # Seconds
    conn_max_idle_time: 60   # Seconds
```

Environment Variables
```bash
# Worker
export BOLTBASE_WORKER_COORDINATORS="coordinator-1:50055,coordinator-2:50055"
export BOLTBASE_WORKER_ID=worker-01
export BOLTBASE_WORKER_LABELS="gpu=true,region=us-east-1"

# PostgreSQL connection pool (optional, defaults shown)
export BOLTBASE_WORKER_POSTGRES_POOL_MAX_OPEN_CONNS=25
export BOLTBASE_WORKER_POSTGRES_POOL_MAX_IDLE_CONNS=5
export BOLTBASE_WORKER_POSTGRES_POOL_CONN_MAX_LIFETIME=300
export BOLTBASE_WORKER_POSTGRES_POOL_CONN_MAX_IDLE_TIME=60
```

PostgreSQL Connection Pool Management
In shared-nothing mode, multiple DAGs run concurrently within a single worker process. Without global connection pool management, each DAG's PostgreSQL steps could create unlimited connections, leading to connection exhaustion.
How It Works
The global PostgreSQL connection pool:
- Limits total connections across ALL databases and DAG executions
- Shares connections between concurrent DAG runs
- Reuses connections across sequential DAG executions
- Manages per-DSN pools while enforcing a global limit
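One plausible shape for such a pool is a weighted semaphore that enforces the global cap on top of per-DSN `database/sql` pools. The sketch below is an assumption about how this could be built, not Boltbase's actual implementation; the `GlobalPGPool` type and its methods are hypothetical.

```go
// Illustrative sketch: a global cap across per-DSN pools via a weighted
// semaphore. GlobalPGPool and its methods are hypothetical, not Boltbase code.
package worker

import (
	"context"
	"database/sql"
	"sync"
	"time"

	"golang.org/x/sync/semaphore"
)

type GlobalPGPool struct {
	mu      sync.Mutex
	pools   map[string]*sql.DB  // one pool per DSN
	global  *semaphore.Weighted // hard limit across ALL DSNs
	maxIdle int                 // per-DSN idle connection limit
}

func NewGlobalPGPool(maxOpen, maxIdle int) *GlobalPGPool {
	return &GlobalPGPool{
		pools:   make(map[string]*sql.DB),
		global:  semaphore.NewWeighted(int64(maxOpen)),
		maxIdle: maxIdle,
	}
}

// Acquire returns a connection for the DSN, blocking while the global limit
// is exhausted. The returned release func must be called when the step is done.
func (p *GlobalPGPool) Acquire(ctx context.Context, driver, dsn string) (*sql.Conn, func(), error) {
	if err := p.global.Acquire(ctx, 1); err != nil {
		return nil, nil, err
	}
	db, err := p.pool(driver, dsn)
	if err != nil {
		p.global.Release(1)
		return nil, nil, err
	}
	conn, err := db.Conn(ctx)
	if err != nil {
		p.global.Release(1)
		return nil, nil, err
	}
	release := func() {
		conn.Close()
		p.global.Release(1)
	}
	return conn, release, nil
}

// pool lazily creates a per-DSN *sql.DB, reused across DAG runs. Lifetime
// values mirror the documented defaults (300s max age, 60s max idle).
func (p *GlobalPGPool) pool(driver, dsn string) (*sql.DB, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if db, ok := p.pools[dsn]; ok {
		return db, nil
	}
	db, err := sql.Open(driver, dsn)
	if err != nil {
		return nil, err
	}
	db.SetMaxIdleConns(p.maxIdle)
	db.SetConnMaxLifetime(300 * time.Second)
	db.SetConnMaxIdleTime(60 * time.Second)
	p.pools[dsn] = db
	return db, nil
}
```

In this shape, the semaphore blocks new acquisitions once `max_open_conns` connections are open across every DSN, while each per-DSN pool keeps its own small set of idle connections, mirroring the behavior described above.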
Configuration
```yaml
worker:
  postgres_pool:
    max_open_conns: 25       # Hard limit across ALL PostgreSQL DSNs
    max_idle_conns: 5        # Per-DSN idle connection limit
    conn_max_lifetime: 300   # Max connection age (seconds)
    conn_max_idle_time: 60   # Max idle time before closure (seconds)
```

Size `max_open_conns` based on your PostgreSQL server's `max_connections`:

```
worker.postgres_pool.max_open_conns = PostgreSQL max_connections / number_of_workers / 2
```

Example: PostgreSQL with `max_connections: 100` and 4 workers → per-worker limit: 100 / 4 / 2 = 12 (leaving headroom).
Global pool management applies only to PostgreSQL. SQLite steps always use 1 connection per step. When running DAGs directly (not via workers), PostgreSQL steps use fixed defaults: 1 max connection, 1 idle connection.
Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: boltbase-coordinator
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: boltbase
          image: boltbase:latest
          args:
            - "start-all"
            - "--host=0.0.0.0"
            - "--coordinator.host=0.0.0.0"
            - "--coordinator.advertise=boltbase-coordinator.default.svc.cluster.local"
          ports:
            - containerPort: 8080
              name: http
            - containerPort: 50055
              name: grpc
          volumeMounts:
            - name: data
              mountPath: /var/lib/boltbase
            - name: dags
              mountPath: /etc/boltbase/dags
      volumes:
        - name: data
          emptyDir: {}   # Local ephemeral storage
        - name: dags
          configMap:
            name: boltbase-dags
---
apiVersion: v1
kind: Service
metadata:
  name: boltbase-coordinator
spec:
  ports:
    - port: 8080
      name: http
    - port: 50055
      name: grpc
  selector:
    app: boltbase-coordinator
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: boltbase-worker
spec:
  replicas: 5
  template:
    spec:
      containers:
        - name: worker
          image: boltbase:latest
          args:
            - "worker"
            - "--worker.coordinators=boltbase-coordinator.default.svc.cluster.local:50055"
            - "--worker.labels=region=us-east-1"
          # No volume mounts needed - all state via gRPC
```

For Helm-based Kubernetes deployment, see Kubernetes (Helm).
Multi-Cluster Deployment
Workers can connect to coordinators across different clusters or clouds:
```yaml
# Cluster A - Coordinator
apiVersion: apps/v1
kind: Deployment
metadata:
  name: boltbase-coordinator
spec:
  template:
    spec:
      containers:
        - name: boltbase
          args:
            - "start-all"
            - "--coordinator.host=0.0.0.0"
            - "--coordinator.advertise=coordinator.cluster-a.example.com"
---
# Cluster B - Workers
apiVersion: apps/v1
kind: Deployment
metadata:
  name: boltbase-worker
spec:
  template:
    spec:
      containers:
        - name: worker
          args:
            - "worker"
            - "--worker.coordinators=coordinator.cluster-a.example.com:50055"
            - "--worker.labels=region=us-west-2,cluster=cluster-b"
```

TLS Configuration
For production deployments, enable TLS for gRPC communication:
```bash
# Coordinator with TLS
boltbase coordinator \
  --coordinator.host=0.0.0.0 \
  --peer.insecure=false \
  --peer.cert-file=/certs/server.crt \
  --peer.key-file=/certs/server.key

# Worker with TLS
boltbase worker \
  --worker.coordinators=coordinator:50055 \
  --peer.insecure=false \
  --peer.cert-file=/certs/client.crt \
  --peer.key-file=/certs/client.key \
  --peer.client-ca-file=/certs/ca.crt
```

Technical Details
Log Streaming Protocol
| Parameter | Value |
|---|---|
| Worker buffer size | 32KB |
| Coordinator flush threshold | 64KB |
| Stream type | Bidirectional gRPC stream |
Log chunks include:
- `dag_name`: Name of the DAG
- `dag_run_id`: Unique run identifier
- `step_name`: Name of the step producing logs
- `stream_type`: `STDOUT` or `STDERR`
- `sequence`: Ordering sequence number
- `data`: Log content bytes
- `final`: Marker for stream completion
Status Pushing Protocol
Workers send a `ReportStatusRequest` containing:

- The full `DAGRunStatus` protobuf message
- Status updates for all nodes
- Error messages and timestamps

The coordinator:

- Finds or opens the `DAGRunAttempt` for the run
- Writes the status to the local `DAGRunStore`
- Returns an acceptance confirmation
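A sketch of the coordinator side of this exchange, under the same caveat that the store and attempt interfaces are hypothetical stand-ins rather than the real internals:

```go
// Illustrative sketch: DAGRunStore and Attempt are hypothetical stand-ins for
// the coordinator's local storage interfaces.
package coordinator

import "context"

type DAGRunStatus struct {
	DAGName  string
	DAGRunID string
	// ...node statuses, error messages, timestamps
}

type Attempt interface {
	Write(ctx context.Context, st *DAGRunStatus) error
}

type DAGRunStore interface {
	FindOrOpenAttempt(ctx context.Context, dagName, runID string) (Attempt, error)
}

// handleReportStatus persists a pushed status to local storage and returns
// an acceptance confirmation to the worker.
func handleReportStatus(ctx context.Context, store DAGRunStore, st *DAGRunStatus) (accepted bool, err error) {
	attempt, err := store.FindOrOpenAttempt(ctx, st.DAGName, st.DAGRunID)
	if err != nil {
		return false, err
	}
	if err := attempt.Write(ctx, st); err != nil {
		return false, err
	}
	return true, nil
}
```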
Queue Dispatch with Previous Status
When the scheduler dispatches queued DAGs to workers:
- Scheduler reads the current status from the `DAGRunStore`
- Status is included in the task as the `previous_status` field
- Worker receives the status with the task (no local store access needed)
- Worker uses `previous_status` for retry operations
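A sketch of how the dispatched task can carry the prior status, with hypothetical types standing in for the real task and store definitions:

```go
// Illustrative sketch: Task, DAGRunStatus, and DAGRunStore are hypothetical.
package scheduler

// DAGRunStatus is the last persisted state of a DAG run.
type DAGRunStatus struct {
	DAGRunID string
	Nodes    map[string]string // node name -> last known state
}

// DAGRunStore is the scheduler's local status store.
type DAGRunStore interface {
	Latest(dagRunID string) (*DAGRunStatus, error)
}

// Task is what the scheduler dispatches to a worker over gRPC. It embeds the
// previous status so the worker never needs access to the store itself.
type Task struct {
	DAGRunID       string
	Definition     []byte        // DAG definition shipped with the task
	PreviousStatus *DAGRunStatus // corresponds to the previous_status field
}

// buildTask reads the current status and attaches it to the dispatched task.
func buildTask(store DAGRunStore, runID string, def []byte) (*Task, error) {
	prev, err := store.Latest(runID)
	if err != nil {
		return nil, err
	}
	return &Task{DAGRunID: runID, Definition: def, PreviousStatus: prev}, nil
}
```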
Temporary File Cleanup
Workers automatically clean up temporary files after each execution:
| File Type | Location | Cleaned After |
|---|---|---|
| DAG files | /tmp/boltbase/worker-dags/ | Each execution |
| Log directories | /tmp/boltbase/worker-logs/ | Each execution |
Workers are safe to run on ephemeral nodes without risk of disk accumulation.
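A sketch of such a cleanup step, assuming a per-run subdirectory under the documented temp locations (the per-run layout itself is an assumption, not documented above):

```go
// Illustrative sketch: the per-run subdirectory layout is an assumption; only
// the /tmp/boltbase/worker-dags and /tmp/boltbase/worker-logs roots come from
// the table above.
package worker

import (
	"log"
	"os"
	"path/filepath"
)

// cleanupRunArtifacts removes the temporary DAG file and log directory for a
// finished run so nothing accumulates on ephemeral worker nodes.
func cleanupRunArtifacts(runID string) {
	paths := []string{
		filepath.Join(os.TempDir(), "boltbase", "worker-dags", runID),
		filepath.Join(os.TempDir(), "boltbase", "worker-logs", runID),
	}
	for _, p := range paths {
		if err := os.RemoveAll(p); err != nil {
			log.Printf("cleanup %s: %v", p, err)
		}
	}
}
```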
