Architecture
This page describes the internal architecture and design of Workforce.
Overview
Workforce uses a client-server architecture to manage workflow execution. The system is built around a Flask API server that maintains workflow state and coordinates execution across multiple clients.
Backward Compatibility Note: Workforce 2.0 introduces non-blocking edges for more flexible workflow execution. Existing workflows remain fully compatible—all edges default to blocking edges, preserving the strict dependency semantics of earlier versions. New workflows can optionally mix blocking and non-blocking edges for advanced execution patterns.
Core Components
Server
The server component is the heart of Workforce’s execution engine. A single machine-wide server manages multiple workflows through isolated workspace contexts. It manages:
Workflow State - Tracks node and edge status during execution
Event System - Pub/sub event broadcasting for real-time updates
Execution Queue - Schedules nodes based on dependency resolution
Client Coordination - Handles multiple simultaneous clients
Workspace Contexts - Isolated execution environments per workfile
Server Startup Process
When starting a server (typically automatic when running wf commands):
Singleton Check: The system checks the PID file registry to detect any existing server
Server Discovery: If a server is already running:
The system logs the existing server location
Exits without starting a duplicate server
Enforces single machine-wide server policy
Port Configuration: If no server exists:
Uses explicitly configured port (default: 5049)
Port can be overridden via command-line argument or environment variable
Server fails if port is already in use
Server Initialization: Flask + Socket.IO server starts:
Listens on the configured port
Ready to accept workspace connections
Creates workspace contexts on-demand as clients connect
Server Ready: The server begins accepting API requests and client connections
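The singleton check in the first step can be sketched as follows. This is a minimal illustration, not Workforce's actual registry code: the JSON layout of the PID file and the helper names `process_is_alive` and `read_live_server` are assumptions, and the liveness probe relies on `os.kill(pid, 0)`, which behaves this way on POSIX systems only.

```python
import json
import os
from pathlib import Path


def process_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def read_live_server(pid_file: Path):
    """Return the registry entry if it points at a live process, else None.

    Hypothetical file format: {"pid": 1234, "url": "http://127.0.0.1:5049"}
    """
    try:
        entry = json.loads(pid_file.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    if not isinstance(entry, dict):
        return None
    pid = entry.get("pid")
    if isinstance(pid, int) and process_is_alive(pid):
        return entry
    return None  # stale or malformed entry: safe to start a new server
```

A caller would start a new server only when `read_live_server` returns `None`, and otherwise log the existing server's location and exit.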
Server Operations
Once running, the server exposes workspace-scoped APIs at http://host:port/workspace/{workspace_id}/...:
Edit API: Modifies the Workfile structure (add/remove nodes and edges)
Run API: Initiates workflow execution with parameters:
nodes: Specific nodes to include in execution
wrapper: Command prefix/suffix wrapper
Status API: Provides real-time status of nodes and edges via Socket.IO
Logs API: Returns stdout/stderr from completed nodes
Each workspace operates independently with isolated state and event streams.
Server Shutdown
Servers automatically shut down when idle (no clients and no active runs):
All running processes are gracefully terminated
Node statuses are updated to reflect termination
Workspace contexts are destroyed
Resources are cleaned up (Socket.IO connections closed)
Server process exits cleanly
Manual shutdown via wf server stop is also supported.
Client
Clients connect to workspace-scoped URLs to interact with workflows. Multiple types of clients exist:
GUI Client - Tkinter-based visual editor
Run Client - CLI-based workflow executor
Edit Client - Programmatic workflow modifier
All clients communicate with the server via:
HTTP API calls for workflow modifications (workspace-scoped endpoints)
Socket.IO connections for real-time status updates (workspace-specific rooms)
Multiple clients can connect to the same workspace context simultaneously, sharing state and receiving synchronized updates.
Workspace Management
Workforce uses a single machine-wide server that manages multiple workflows through isolated workspace contexts.
Server Discovery:
PID file registry tracks running server location and process ID
Detects existing server before attempting to start new one
Returns server URL for client connections (default: http://127.0.0.1:5049)
Workspace Identification:
Each workfile gets a deterministic workspace ID via compute_workspace_id()
The workspace ID is the SHA256 hash of the absolute file path
Ensures consistent identification across multiple sessions
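As an illustration of the identification scheme above, the sketch below hashes the absolute path with SHA256. The function name `compute_workspace_id` comes from this page, but its body here is an assumption: the real implementation may normalize the path differently or truncate the digest. `workspace_base_url` is a hypothetical helper added only to show how the ID fits into workspace-scoped URLs.

```python
import hashlib
import os


def compute_workspace_id(workfile_path: str) -> str:
    """Hash the absolute path so the same file always maps to the same ID."""
    absolute = os.path.abspath(workfile_path)
    # Assumption: the full hex digest is used as the ID.
    return hashlib.sha256(absolute.encode("utf-8")).hexdigest()


def workspace_base_url(server_url: str, workfile_path: str) -> str:
    """Build the workspace-scoped URL prefix for a workfile (hypothetical helper)."""
    workspace_id = compute_workspace_id(workfile_path)
    return f"{server_url.rstrip('/')}/workspace/{workspace_id}"
```

Because the path is made absolute before hashing, opening the same workfile through a relative or an absolute path from the same working directory yields the same workspace.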
Workspace Contexts:
Server maintains dict of ServerContext objects keyed by workspace_id
Each context created on-demand when first client connects
Context includes:
mod_queue - Serialized graph modification queue
EventBus - Domain event system for that workspace
Worker thread - Dedicated queue processor
Socket.IO room - Isolated event broadcasting
Active runs tracking - Per-run node sets and metadata
Context Lifecycle:
Created: When first client connects to workspace
Destroyed: When last client disconnects from workspace
Isolation: Each workspace operates independently
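The context lifecycle can be pictured with a small registry keyed by workspace ID. This is a simplified sketch: `ContextRegistry` and its methods are hypothetical names, each context is reduced to a set of connected clients, and the real `ServerContext` also carries the queue, event bus, worker thread, and run tracking listed above.

```python
import threading


class ContextRegistry:
    """Hypothetical registry mapping workspace_id to per-workspace state."""

    def __init__(self):
        self._lock = threading.Lock()
        self._contexts = {}  # workspace_id -> {"clients": set()}

    def connect(self, workspace_id, client_id):
        """Create the context on first connection, then register the client."""
        with self._lock:
            context = self._contexts.setdefault(workspace_id, {"clients": set()})
            context["clients"].add(client_id)
            return context

    def disconnect(self, workspace_id, client_id):
        """Remove the client and destroy the context once it has no clients."""
        with self._lock:
            context = self._contexts.get(workspace_id)
            if context is None:
                return
            context["clients"].discard(client_id)
            if not context["clients"]:
                del self._contexts[workspace_id]

    def active_workspaces(self):
        with self._lock:
            return set(self._contexts)
```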
This architecture allows:
Multiple workflows to run simultaneously on one server
Automatic workspace context creation and cleanup
Complete isolation between different workflows
Connection sharing across multiple GUI/CLI instances for same workflow
Execution Model
Workforce employs a unified execution model where every run is treated as a subset run.
Unified Subset Execution
Philosophy: Whether running the entire workflow or just a few nodes, the system treats all execution as subset operations. This provides consistency and prevents edge cases.
Node Selection Logic
When a workflow run is initiated:
Explicit Selection: If specific nodes are selected (via CLI --nodes flag or GUI selection):
Those nodes form an induced subgraph for execution
All edges and dependencies within this subgraph are preserved
Failed Node Selection: If no nodes are explicitly selected:
The system checks for nodes in a failed state
All failed nodes are automatically selected for re-execution
This enables the “resume” functionality
Full Workflow Selection: If no nodes are selected and none have failed:
All nodes with zero in-degree in the full workflow are selected
This effectively runs the entire workflow from the beginning
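The three-tier fallback above can be expressed as a single function. The sketch uses plain dictionaries and edge tuples rather than NetworkX to stay dependency-free; `select_run_nodes` is a hypothetical name, and the status string `fail` is taken from the status values listed under Data Flow.

```python
def select_run_nodes(explicit, statuses, edges):
    """Pick the nodes for a run following the three selection rules.

    explicit: iterable of node names chosen by the user (may be empty or None)
    statuses: dict mapping node name -> status string
    edges:    iterable of (source, target) tuples for the full workflow
    """
    if explicit:
        return set(explicit)  # rule 1: explicit selection wins

    failed = {node for node, status in statuses.items() if status == "fail"}
    if failed:
        return failed  # rule 2: resume by re-running failed nodes

    targets = {target for _, target in edges}
    # rule 3: nodes with zero in-degree in the full workflow
    return {node for node in statuses if node not in targets}
```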
Execution Initialization
Upon starting a run, the scheduler:
Subgraph Extraction: Extracts the target subset from the main workflow graph
Dependency Analysis: Identifies all nodes within the subset that have:
Zero in-degree relative only to that subset
(Not zero in-degree in the full graph)
Initial Scheduling: Transitions these zero-in-degree nodes to run state
Boundary Enforcement: Ensures nodes start immediately if their dependencies in the master workfile are omitted from the current run scope
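The distinction between in-degree in the subset and in-degree in the full graph is easiest to see in code. In this minimal sketch (the name `initial_nodes` is an assumption, and edges are plain tuples instead of a NetworkX graph), a node counts as a starting point if none of its parents are part of the run.

```python
def initial_nodes(subset, edges):
    """Return nodes in `subset` with no incoming edge from inside `subset`."""
    subset = set(subset)
    has_parent_in_subset = {
        target
        for source, target in edges
        if source in subset and target in subset
    }
    return subset - has_parent_in_subset
```

For a chain a -> b -> c -> d and the subset {b, c}, node b starts immediately even though it depends on a in the full workfile, because a is outside the run scope.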
Subgraph Boundary Enforcement
To prevent execution from bleeding into the rest of the workfile:
The scheduler strictly enforces subnetwork boundaries
Propagation is confined entirely to the active selection
When a node completes:
Only outgoing edges within the filtered subnetwork are evaluated
Edges leading to nodes outside the original subset are ignored
This effectively “caps” the execution at the subset boundary
Execution Loop
The execution loop follows this pattern:
Node Execution
Node command is executed via subprocess
stdout and stderr are captured in real-time
Outputs are stored as node attributes in the graph
Logs are viewable from GUI (press ‘s’)
Event Emission
Upon completion, a NODE_FINISHED event is emitted (or NODE_FAILED on error)
Event includes client ID for multi-client coordination
Server broadcasts event to all connected clients via WebSocket
Scheduler Update
Emission triggers the scheduler to retrieve the filtered subnetwork map
All valid outgoing edges (within the subnetwork) are updated to to_run status
A GRAPH_UPDATED event is broadcast
Dependency Check
Status change prompts target nodes to check dependencies
Node transitions to run state only if:
ALL incoming edges (within the subnetwork context) are marked to_run
Once satisfied:
Node clears the statuses from incoming edges
Begins execution
Loop returns to step 1
This mechanism ensures the engine only advances when subset-specific dependencies are fully met.
Dependency Resolution
The dependency resolution system is the core of Workforce’s scheduling logic. It determines which nodes are ready to execute based on the type (blocking or non-blocking) of each incoming edge and the current status of those edges.
Edge Types in Dependency Resolution:
Workforce supports two edge types that affect dependency checking:
Blocking Edges - The target node waits for ALL incoming blocking edges to be to_run
Non-Blocking Edges - The target node runs immediately when ANY incoming non-blocking edge becomes to_run
Resolution Algorithm:
When an upstream node completes (status becomes ran), the scheduler:
Marks all outgoing edges as to_run
For each downstream node, checks its incoming edges:
If the incoming edge is BLOCKING:
Check if ALL incoming blocking edges are to_run
Only if true: Set target node status to run
Non-blocking edges do not affect blocking edge checks
If the incoming edge is NON-BLOCKING:
Immediately set target node status to run
Do not wait for other incoming edges
Allows target to execute (or re-execute) from this single trigger
All propagation respects Subset Run boundaries:
Only edges within the active subset are processed
Edges leading to nodes outside the subset are ignored
Ensures execution remains confined to selected scope
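The resolution algorithm above can be sketched as one propagation step. This is an illustrative reading of the listed rules, not the scheduler's actual code: `on_node_finished` is a hypothetical name, and edges are held in plain dictionaries keyed by (source, target) rather than as NetworkX edge attributes.

```python
def on_node_finished(node, edges, edge_status, subset):
    """Propagate completion of `node` and return targets that may now run.

    edges:       dict mapping (source, target) -> "blocking" or "non-blocking"
    edge_status: dict mapping (source, target) -> "" or "to_run" (mutated)
    subset:      set of node names in the active run
    """
    ready = set()
    for (source, target), kind in edges.items():
        if source != node or target not in subset:
            continue  # edges leaving the subset are ignored
        edge_status[(source, target)] = "to_run"
        if kind == "non-blocking":
            ready.add(target)  # a single trigger is enough
        elif all(
            edge_status.get(edge) == "to_run"
            for edge, edge_kind in edges.items()
            if edge[1] == target and edge[0] in subset and edge_kind == "blocking"
        ):
            ready.add(target)  # every incoming blocking edge is satisfied

    # Nodes that start clear the statuses of their incoming edges.
    for edge in edges:
        if edge[1] in ready and edge[0] in subset:
            edge_status[edge] = ""
    return ready
```

Only blocking edges whose source is inside the subset are counted, which matches the rule that dependencies omitted from the run scope do not hold a node back.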
Visual Representation:
Here’s how blocking and non-blocking edges interact during execution:
Initial State:
┌─────┐ ┌─────┐
│ A │ (ran) │ B │ (waiting)
└─────┘ └─────┘
│ blocked
│ to_run
With all blocking edges to_run and no non-blocking incoming:
┌─────┐ ┌─────┐
│ A │ (ran) │ B │ (run) ──> executes once
└─────┘ └─────┘
│ blocked
│ to_run
With non-blocking edge triggering:
┌─────┐ ┌─────┐
│ A │ (ran) │ B │ (run) ──> re-executes
└─────┘ └─────┘
│ non-blocked
│ to_run
Mixed Dependencies Example:
Consider a node C with one blocking edge from A and one non-blocking edge from B:
┌─────┐         ┌─────┐
│  A  │         │  B  │
└─────┘         └─────┘
   │ (blocking)    │ (non-blocking)
   └───────┬───────┘
           │
        ┌─────┐
        │  C  │
        └─────┘
Execution sequence:
A completes: A→C marked to_run, C checks dependencies
A→C is blocking and to_run ✓
Need to wait for B (no edges from B are to_run yet)
C status remains waiting
B completes: B→C marked to_run
B→C is non-blocking, immediately set C to run
C executes (does not wait for A→C to be to_run)
If B runs again: B→C marked to_run again
C immediately runs again (re-triggers)
A→C status does not affect re-triggering
Cycle Detection:
Before execution begins, Workforce checks for cycles using only blocking edges:
Constructs a subgraph containing only blocking edges
Checks if the subgraph is a directed acyclic graph (DAG)
Non-blocking edges are ignored for cycle detection
This allows workflows with non-blocking cycles (safe, no infinite loops)
Blocking cycles cause an error before run initiation
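A cycle check restricted to blocking edges might look like the sketch below. It uses Kahn's algorithm on plain dictionaries to stay self-contained; the function name is hypothetical, and the real implementation may instead build a NetworkX subgraph and call `networkx.is_directed_acyclic_graph`.

```python
from collections import deque


def blocking_edges_form_dag(nodes, edges):
    """Return True if the blocking-only subgraph has no cycle.

    nodes: iterable of unique node names (must include every edge endpoint)
    edges: dict mapping (source, target) -> "blocking" or "non-blocking"
    """
    indegree = {node: 0 for node in nodes}
    children = {node: [] for node in indegree}
    for (source, target), kind in edges.items():
        if kind != "blocking":
            continue  # non-blocking edges are ignored for cycle detection
        indegree[target] += 1
        children[source].append(target)

    # Kahn's algorithm: repeatedly remove nodes with no remaining parents.
    queue = deque(node for node, degree in indegree.items() if degree == 0)
    visited = 0
    while queue:
        current = queue.popleft()
        visited += 1
        for child in children[current]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return visited == len(indegree)  # unvisited nodes sit on a cycle
```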
Subset Run Propagation:
During dependency resolution, the system:
Maintains an active set of nodes for the current run (the subset)
Only propagates edges where both endpoints are in the subset
Ignores edges leading to nodes outside the subset
Prevents execution from “leaking” beyond the intended scope
Enables safe subset runs and node recovery without side effects
Resume Functionality
The resume feature (Shift+R in GUI, or re-running failed nodes) handles workflow recovery:
How Resume Works:
Failed Node Detection: System identifies nodes in failed state
Status Reset: Failed node status is replaced with run
Event Loop Trigger: Status change re-triggers the event loop
Dependency Re-check: Scheduler re-evaluates dependencies for the failed node
Queue for Execution: If dependencies are met, node is queued for execution
Pipeline Continuation: Remainder of pipeline proceeds through normal dependency checking
Boundary Enforcement:
Resume is strictly bounded by the original subset
Resume never propagates to nodes outside the original selection
Ensures nodes do not remain in a running state indefinitely
Clean status management prevents zombie processes
Event System
Workforce uses a publish-subscribe event system for coordinating workflow execution.
Event Types
Node Events:
NODE_READY - Node is ready to execute (all dependencies met, status set to run)
NODE_STARTED - Node execution has begun (status set to running)
NODE_FINISHED - Node finished successfully (status set to ran)
NODE_FAILED - Node execution failed (status set to fail)
Workflow Events:
RUN_COMPLETE - All nodes in the run have completed or failed
GRAPH_UPDATED - Graph structure or attributes were modified
Event Flow
Event Generation: Server generates events during workflow execution
Event Broadcasting: Events are broadcast via WebSocket to all connected clients
Client Handling: Each client receives events and updates its local state
Client ID Tagging: Events are tagged with originating client ID to prevent conflicts
State Synchronization: All clients maintain synchronized view of workflow state
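The publish-subscribe pattern behind this flow can be illustrated with a small in-process bus. This is a sketch of the general technique only: the class below is not Workforce's EventBus, its method names are assumptions, and the Socket.IO broadcast to remote clients is left out.

```python
from collections import defaultdict


class EventBus:
    """Minimal in-process pub/sub bus (illustrative, not the real EventBus)."""

    def __init__(self):
        self._handlers = defaultdict(list)  # event type -> list of callbacks

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload, client_id=None):
        """Tag the event with its type and originating client, then dispatch."""
        event = {**payload, "type": event_type, "client_id": client_id}
        for handler in list(self._handlers[event_type]):
            handler(event)
        return event


# Usage: a subscriber that records finished nodes.
finished = []
bus = EventBus()
bus.subscribe("NODE_FINISHED", lambda event: finished.append(event["node"]))
bus.publish("NODE_FINISHED", {"node": "build"}, client_id="cli-1")
```

Tagging each event with the originating client ID lets a client recognize and skip updates that it caused itself.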
Multi-User Support
The event system enables true multi-user collaboration:
Multiple GUI clients can connect to the same workspace context simultaneously
Changes made in one client are broadcast to all others in real-time via Socket.IO rooms
Execution initiated by one client is visible to all connected clients
Each workspace maintains isolated event streams preventing cross-workspace interference
Client connections are workspace-scoped, ensuring proper event routing
Data Flow
Workflow File (GraphML)
The workflow is stored as a GraphML file with:
Nodes: Represent bash commands with attributes (id, label, status, log, x, y)
id - UUID for the node
label - The bash command to execute
status - Current state: "" (empty), run, running, ran, fail
log - Combined stdout/stderr from execution
x, y - Position coordinates (stored as strings)
Edges: Represent dependencies with attributes (id, status)
id - UUID for the edge
status - Either "" (empty) or to_run (source completed)
Graph: Graph-level attributes
wrapper - Command template with {} placeholder (e.g., bash -c '{}')
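To show how the wrapper template relates to a node's label, here is a minimal sketch of the substitution. The helper name `apply_wrapper` and the use of plain string replacement are assumptions; the actual implementation may quote or escape the command differently.

```python
def apply_wrapper(wrapper: str, command: str) -> str:
    """Substitute the node command into the wrapper's {} placeholder."""
    if not wrapper:
        return command  # no wrapper configured: run the command as-is
    return wrapper.replace("{}", command)


# Usage with the example template from the attribute list above.
wrapped = apply_wrapper("bash -c '{}'", "echo hello")
```

With plain substitution, a command that itself contains single quotes would break a template such as bash -c '{}', so real workflows need either escaping or a wrapper that avoids nested quoting.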
File Loading and Saving
load_graph(path): Loads GraphML into a NetworkX DiGraph
save_graph(graph, path): Writes a NetworkX DiGraph to GraphML using an atomic temp file + os.replace
Graph operations like add_node_to_graph() automatically save changes
Concurrency Safety: All graph mutations are serialized through the server’s single-threaded queue worker (one per workspace), preventing concurrent writes
Crash Safety: Atomic file replacement (temp file + os.replace) ensures files are never partially written
The singleton server architecture with queue-based serialization eliminates the need for file locking
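The temp file + os.replace technique mentioned above follows a common pattern, sketched here for arbitrary text. The function name `atomic_write_text` is hypothetical, and the real `save_graph` writes GraphML rather than a plain string.

```python
import os
import tempfile


def atomic_write_text(path: str, text: str) -> None:
    """Write `text` to `path` so readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the target directory so that os.replace
    # stays on one filesystem, which is what makes the rename atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push data to disk before the rename
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If the process crashes before `os.replace`, the original file is untouched; if it crashes after, the new file is complete.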
Network Communication
HTTP API:
RESTful endpoints for workflow modification
JSON request/response format
Workspace-scoped URLs: /workspace/{workspace_id}/...
Server URL discovered via find_running_server()
Socket.IO:
Real-time bidirectional communication
Event broadcasting from server to clients via workspace-specific rooms
Status updates and log streaming
Persistent connection during workflow execution
Room-based isolation ensures events only reach relevant clients
Process Management
Command Execution
Commands run via Python’s subprocess module
Separate process for each node
stdout/stderr captured in real-time
Exit codes determine success/failure
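A simplified version of this execution step is shown below. It is a sketch under stated assumptions: `run_node` is a hypothetical name, the command is run through the system shell, and output is collected after the process exits rather than streamed in real time as the server does. The status strings follow the values listed under Data Flow.

```python
import subprocess


def run_node(command: str) -> dict:
    """Run one node's command and map its exit code to a node status."""
    result = subprocess.run(
        command,
        shell=True,           # node labels are shell commands
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode output as text
    )
    return {
        "status": "ran" if result.returncode == 0 else "fail",
        "log": result.stdout + result.stderr,
        "returncode": result.returncode,
    }
```

Real-time capture would instead use `subprocess.Popen` with reader threads on the output pipes, at the cost of more bookkeeping.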
Process Lifecycle
Spawn: Process created when node transitions to run state
Monitor: Output streams monitored via threads
Complete: Process terminates, exit code checked
Cleanup: Resources released, status updated
Parallel Execution
Multiple nodes can run simultaneously
Limited only by available system resources
Dependency constraints prevent invalid parallelism
No explicit parallelism limit (user-controlled via workflow design)
Error Handling
Node Failures
When a node fails:
Node status set to failed
Error information captured in stderr attribute
NODE_FAILED event broadcast to clients
Execution continues for independent branches
Failed node prevents downstream execution
Workflow Failures
Failed nodes do not stop the entire workflow
Independent branches continue executing
Workflow completes when all executable nodes finish
Resume functionality allows recovery from failures
Server Failures
Server auto-discovery detects running servers via health checks
Idle servers automatically shut down (no clients + no active runs)
Deferred shutdown with 1-second delay prevents race conditions
Clients detect disconnection and notify user
Manual cleanup via wf server stop if needed
Security Considerations
Commands execute with user’s shell permissions
No authentication currently implemented (local use)
Registry file permissions control access
WebSocket connections not encrypted (localhost)
Command injection risks if workflow files untrusted
Performance Considerations
Graph size limited by memory
NetworkX provides efficient graph operations
WebSocket events add minimal overhead
Subprocess spawning has system-dependent limits
Large stdout/stderr captured in memory (consider log rotation)