Shared Filesystem Mode
In shared filesystem mode, workers share storage access with the coordinator. All nodes must have access to the same filesystem (NFS, EFS, Azure Files, etc.). For environments without shared storage, see Shared Nothing Mode.
Overview
┌─────────────────────────────────────────────────────────────┐
│                      Boltbase Instance                       │
│             (Scheduler + Web UI + Coordinator)               │
└─────────────────────────────────────────────────────────────┘
          │                                   │
          │ File-based Service Registry       │ gRPC Task Dispatch
          │                                   │
          ▼                                   ▼
┌─────────────────────────────────────────────────────────────┐
│                        Shared Storage                        │
│  ┌───────────────┬─────────────────┬─────────────────────┐  │
│  │  service-     │   dag-runs/     │        logs/        │  │
│  │  registry/    │   (status)      │  (execution logs)   │  │
│  └───────────────┴─────────────────┴─────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
          ▲                                   ▲
          │                                   │
          │        Direct File Access         │
          │                                   │
 ┌────────┴───────────┐             ┌────────┴───────────┐
 │      Worker 1      │             │      Worker N      │
 └────────────────────┘             └────────────────────┘
How It Works
Service Registry
Workers automatically register themselves in a file-based service registry. Workers discover coordinators by scanning the registry directory — no manual coordinator address configuration is required.
Registry Directory Structure
{data}/service-registry/
├── coordinator/
│   └── coordinator-primary-host1-50055.json
└── worker/
    ├── worker-gpu-01-host2-1234.json
    └── worker-cpu-02-host3-5678.json
Registration Process
- Worker starts: creates a registry file containing its ID, host, port, PID, and timestamp.
- Heartbeat updates: workers update their registry files every 10 seconds with the current timestamp, health status, and active task count.
- Coordinator monitoring: the coordinator continuously scans registry files to track available workers, detect unhealthy workers, and remove stale entries (no heartbeat for 30+ seconds).
- Cleanup: registry files are removed when a worker shuts down gracefully or when the heartbeat timeout is exceeded.
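For illustration, a worker registry file might look like the sketch below. The documented contents are the worker ID, host, port, PID, timestamp, health status, and active task count; the exact field names and layout here are assumptions, not the real schema:

# worker-gpu-01-host2-1234.json — illustrative sketch; field names are assumed
{
  "id": "worker-gpu-01",
  "host": "host2",
  "port": 1234,
  "pid": 4321,
  "healthy": true,
  "active_tasks": 2,
  "updated_at": "2024-03-15T12:00:10Z"
}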
Configuring the Registry Directory
The registry directory must be accessible by all nodes:
# config.yaml
paths:
  service_registry_dir: "/nfs/shared/boltbase/service-registry"

# Or set it via environment variable
export BOLTBASE_PATHS_SERVICE_REGISTRY_DIR=/nfs/shared/boltbase/service-registry
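A quick way to verify that the registry directory is genuinely shared before starting any workers (plain shell, nothing Boltbase-specific):

# On node A: create a marker file
touch /nfs/shared/boltbase/service-registry/.shared-check

# On node B: the marker must be visible, then clean it up
ls -l /nfs/shared/boltbase/service-registry/.shared-check
rm /nfs/shared/boltbase/service-registry/.shared-check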
Status Persistence
Workers write execution status directly to the shared filesystem:
{data}/dag-runs/
└── my-workflow/
    └── dag-runs/
        └── 2024/03/15/
            └── dag-run_20240315_120000Z_abc123/
                └── attempt_20240315_120001_123Z_def456/
                    └── status.jsonl
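As a rough illustration only: status.jsonl holds one JSON record per line, but the actual record schema is not documented in this guide, so every field name below is hypothetical:

# Hypothetical status.jsonl line — field names are illustrative, not the real schema
{"dag_run_id": "abc123", "attempt_id": "def456", "status": "running", "updated_at": "2024-03-15T12:00:05Z"}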
Log Storage
Workers write execution logs directly to the shared log directory:
{logs}/dags/
└── my-workflow/
    └── 20240315_120000_abc123/
        ├── step1.stdout.log
        ├── step1.stderr.log
        └── status.yaml
Requirements
Shared storage must be mounted at the same path on all nodes, or configured via BOLTBASE_HOME to point to the shared location.
Required shared directories:
- {data}/service-registry/ — worker registration and discovery
- {data}/dag-runs/ — execution status
- {logs}/ — execution logs (required for Web UI log display)
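For example, each node could mount the same NFS export at an identical path and point BOLTBASE_HOME at it; the server name and export path below are placeholders:

# /etc/fstab on every node (placeholder server and export names)
nfs-server.example.com:/export/boltbase  /shared/boltbase  nfs  defaults,_netdev  0  0

# Point Boltbase at the shared location on every node
export BOLTBASE_HOME=/shared/boltbase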
NOTE
DAG definitions (dags/) do not need to be shared. Workers receive DAG definitions via gRPC when tasks are dispatched.
Configuration
Coordinator
# Start coordinator on the main server
boltbase start-all --coordinator.host=0.0.0.0 --port=8080
# Or start coordinator separately
boltbase coordinator --coordinator.host=0.0.0.0 --coordinator.port=50055
Workers
# Workers auto-discover coordinators via service registry
boltbase worker --worker.labels gpu=true,memory=64G
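Because registration is file-based, you can confirm that workers have found the shared registry simply by listing the worker/ subdirectory from any node (the path matches the configuration file below):

# One JSON file per running worker, e.g. worker-gpu-01-host2-1234.json
ls /shared/boltbase/service-registry/worker/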
Configuration File
# config.yaml (same on all nodes)
paths:
  data_dir: "/shared/boltbase/data"                           # Must be shared
  log_dir: "/shared/boltbase/logs"                            # Must be shared
  service_registry_dir: "/shared/boltbase/service-registry"   # Must be shared

worker:
  id: "worker-gpu-01"
  labels:
    gpu: "true"
    memory: "64G"

coordinator:
  host: 0.0.0.0
  port: 50055
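If you prefer environment variables over config.yaml, the registry variable shown earlier suggests a BOLTBASE_PATHS_* naming pattern; the data and log variables below follow that pattern but are an assumption, so verify them against your Boltbase version:

# BOLTBASE_PATHS_SERVICE_REGISTRY_DIR appears earlier in this guide;
# the other two names are assumed from the same pattern.
export BOLTBASE_PATHS_DATA_DIR=/shared/boltbase/data
export BOLTBASE_PATHS_LOG_DIR=/shared/boltbase/logs
export BOLTBASE_PATHS_SERVICE_REGISTRY_DIR=/shared/boltbase/service-registry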
Docker Compose Example
version: '3.8'

services:
  boltbase-main:
    image: boltbase:latest
    command: start-all --host=0.0.0.0 --coordinator.host=0.0.0.0
    ports:
      - "8080:8080"
      - "50055:50055"
    volumes:
      - ./dags:/etc/boltbase/dags       # DAG definitions (only main instance)
      - shared-data:/var/lib/boltbase   # Shared: data + logs + service registry

  worker-gpu:
    image: boltbase:latest
    command: worker --worker.labels=gpu=true,cuda=11.8
    volumes:
      - shared-data:/var/lib/boltbase   # Shared storage (no dags needed)
    deploy:
      replicas: 2
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  worker-cpu:
    image: boltbase:latest
    command: worker --worker.labels=cpu-only=true
    volumes:
      - shared-data:/var/lib/boltbase   # Shared storage (no dags needed)
    deploy:
      replicas: 5

volumes:
  shared-data:
    driver: local
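Assuming the file above is saved as docker-compose.yml, the stack can be started with standard Compose commands. Keep in mind that the local volume driver shares storage only between containers on the same host; for workers spread across multiple machines, back the volume with NFS or a similar shared filesystem.

docker compose up -d                 # start coordinator and worker replicas
docker compose ps                    # one boltbase-main plus the worker replicas
docker compose logs -f worker-gpu    # follow logs from the GPU workers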
Kubernetes with NFS
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: boltbase-shared
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs
  resources:
    requests:
      storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: boltbase-main
spec:
  replicas: 1
  selector:                # selector/labels are required by apps/v1 Deployments
    matchLabels:
      app: boltbase-main
  template:
    metadata:
      labels:
        app: boltbase-main
    spec:
      containers:
        - name: boltbase
          image: boltbase:latest
          args: ["start-all", "--host=0.0.0.0", "--coordinator.host=0.0.0.0"]
          volumeMounts:
            - name: shared
              mountPath: /var/lib/boltbase
            - name: dags
              mountPath: /etc/boltbase/dags
      volumes:
        - name: shared
          persistentVolumeClaim:
            claimName: boltbase-shared
        - name: dags
          configMap:
            name: boltbase-dags
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: boltbase-worker
spec:
  replicas: 3
  selector:
    matchLabels:
      app: boltbase-worker
  template:
    metadata:
      labels:
        app: boltbase-worker
    spec:
      containers:
        - name: worker
          image: boltbase:latest
          args: ["worker", "--worker.labels=region=us-east-1"]
          volumeMounts:
            - name: shared
              mountPath: /var/lib/boltbase
      volumes:
        - name: shared
          persistentVolumeClaim:
            claimName: boltbase-shared

For Helm-based Kubernetes deployment, see Kubernetes (Helm).
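If you apply the raw manifests above rather than installing via Helm, note that the boltbase-dags ConfigMap mounted by boltbase-main has to exist before the pods can start. A minimal sequence might look like this (the file name boltbase.yaml and the local dags/ directory are assumptions for the example):

# Create the ConfigMap that boltbase-main mounts at /etc/boltbase/dags
kubectl create configmap boltbase-dags --from-file=dags/

# Apply the PVC and both Deployments (assumed to be saved as boltbase.yaml)
kubectl apply -f boltbase.yaml

# Check that the workers came up and mounted the shared volume
kubectl get pods -l app=boltbase-worker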
