Architecture

This page describes the internal architecture and design of Workforce.

Overview

Workforce uses a client-server architecture to manage workflow execution. The system is built around a Flask API server that maintains workflow state and coordinates execution across multiple clients.

Backward Compatibility Note: Workforce 2.0 introduces non-blocking edges for more flexible workflow execution. Existing workflows remain fully compatible: all edges default to blocking, which preserves the strict dependency semantics of earlier versions. New workflows can optionally mix blocking and non-blocking edges for advanced execution patterns.

Core Components

Server

The server component is the heart of Workforce’s execution engine. A single machine-wide server handles multiple workflows through isolated workspace contexts. It is responsible for:

Workflow State - Tracks node and edge status during execution
Event System - Pub/sub event broadcasting for real-time updates
Execution Queue - Schedules nodes based on dependency resolution
Client Coordination - Handles multiple simultaneous clients
Workspace Contexts - Isolated execution environments per workfile

Server Startup Process

When starting a server (typically automatic when running wf commands):

Singleton Check: The system checks the PID file registry to detect any existing server
Server Discovery: If a server is already running:
- The system logs the existing server location
- Exits without starting a duplicate server
- Enforces single machine-wide server policy
Port Configuration: If no server exists:
- Uses explicitly configured port (default: 5049)
- Port can be overridden via command-line argument or environment variable
- Server fails if port is already in use
Server Initialization: Flask + Socket.IO server starts:
- Listens on the configured port
- Ready to accept workspace connections
- Creates workspace contexts on-demand as clients connect
Server Ready: The server begins accepting API requests and client connections
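
The singleton check can be sketched roughly as follows. This is an illustration only, not Workforce's actual code: the registry format (a JSON file with pid and url keys) and the helper name read_live_server are assumptions, and the os.kill(pid, 0) liveness probe is POSIX-only.

```python
import json
import os
from pathlib import Path
from typing import Optional


def read_live_server(registry: Path) -> Optional[str]:
    """Return the URL stored in a PID registry file if that process is alive.

    Hypothetical registry format: {"pid": 1234, "url": "http://127.0.0.1:5049"}
    """
    try:
        entry = json.loads(registry.read_text())
        pid, url = int(entry["pid"]), str(entry["url"])
    except (OSError, ValueError, KeyError, TypeError):
        return None  # missing or unreadable registry: no known server
    if pid <= 0:
        return None  # not a valid process ID
    try:
        os.kill(pid, 0)  # signal 0 only probes for existence (POSIX)
    except ProcessLookupError:
        return None  # stale entry: the recorded process is gone
    except PermissionError:
        pass  # the process exists but belongs to another user
    return url
```

A caller would start a new server only when this returns None, and otherwise log the returned URL and exit.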

Server Operations

Once running, the server exposes workspace-scoped APIs at http://host:port/workspace/{workspace_id}/...:

Edit API: Modifies the Workfile structure (add/remove nodes and edges)
Run API: Initiates workflow execution with parameters:
- nodes: Specific nodes to include in execution
- wrapper: Command prefix/suffix wrapper
Status API: Provides real-time status of nodes and edges via Socket.IO
Logs API: Returns stdout/stderr from completed nodes

Each workspace operates independently with isolated state and event streams.

Server Shutdown

The server automatically shuts down when idle (no connected clients and no active runs). During shutdown:

All running processes are gracefully terminated
Node statuses are updated to reflect termination
Workspace contexts are destroyed
Resources are cleaned up (Socket.IO connections closed)
Server process exits cleanly

Manual shutdown via wf server stop is also supported.

Client

Clients connect to workspace-scoped URLs to interact with workflows. Multiple types of clients exist:

GUI Client - Tkinter-based visual editor
Run Client - CLI-based workflow executor
Edit Client - Programmatic workflow modifier

All clients communicate with the server via:

HTTP API calls for workflow modifications (workspace-scoped endpoints)
Socket.IO connections for real-time status updates (workspace-specific rooms)

Multiple clients can connect to the same workspace context simultaneously, sharing state and receiving synchronized updates.

Workspace Management

Workforce uses a single machine-wide server that manages multiple workflows through isolated workspace contexts.

Server Discovery:

PID file registry tracks running server location and process ID
Detects existing server before attempting to start new one
Returns server URL for client connections (default: http://127.0.0.1:5049)

Workspace Identification:

Each workfile gets a deterministic workspace ID via compute_workspace_id()
Workspace ID is SHA256 hash of absolute file path
Ensures consistent identification across multiple sessions
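
A minimal sketch of how such an ID could be derived and turned into a workspace URL. The function names workspace_id_for and workspace_url are hypothetical stand-ins, and the choices of UTF-8 encoding and the full hex digest are assumptions; the real compute_workspace_id() may differ in both.

```python
import hashlib
import os


def workspace_id_for(workfile: str) -> str:
    """SHA-256 hex digest of the absolute path (assumed: UTF-8, full digest)."""
    absolute = os.path.abspath(workfile)
    return hashlib.sha256(absolute.encode("utf-8")).hexdigest()


def workspace_url(server_url: str, workfile: str) -> str:
    """Base URL for a workspace, following /workspace/{workspace_id}/..."""
    return f"{server_url.rstrip('/')}/workspace/{workspace_id_for(workfile)}"
```

Because the hash input is the absolute path, a relative and an absolute reference to the same workfile map to the same workspace.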

Workspace Contexts:

Server maintains dict of ServerContext objects keyed by workspace_id
Each context created on-demand when first client connects
Context includes:
- mod_queue - Serialized graph modification queue
- EventBus - Domain event system for that workspace
- Worker thread - Dedicated queue processor
- Socket.IO room - Isolated event broadcasting
- Active runs tracking - Per-run node sets and metadata
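
The context structure above might look something like the sketch below. WorkspaceContext, get_context, and the applied list (a stand-in for the graph) are illustrative names, not the actual ServerContext API; only the queue-plus-worker pattern is taken from the description.

```python
import queue
import threading
from typing import Dict, List


class WorkspaceContext:
    """Per-workspace state: a modification queue drained by one worker thread."""

    def __init__(self, workspace_id: str) -> None:
        self.workspace_id = workspace_id
        self.applied: List[str] = []  # stand-in for the graph being edited
        self.mod_queue = queue.Queue()
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def _drain(self) -> None:
        while True:
            task = self.mod_queue.get()
            if task is None:  # sentinel: stop the worker
                break
            task()  # mutations run one at a time, in submission order

    def close(self) -> None:
        self.mod_queue.put(None)
        self.worker.join()


contexts: Dict[str, WorkspaceContext] = {}


def get_context(workspace_id: str) -> WorkspaceContext:
    """Create the context on first use, then reuse it for later clients."""
    if workspace_id not in contexts:
        contexts[workspace_id] = WorkspaceContext(workspace_id)
    return contexts[workspace_id]
```

Funnelling every mutation through one queue per workspace is what lets several clients edit the same workfile without explicit locks.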

Context Lifecycle:

Created: When first client connects to workspace
Destroyed: When last client disconnects from workspace
Isolation: Each workspace operates independently

This architecture allows:

Multiple workflows to run simultaneously on one server
Automatic workspace context creation and cleanup
Complete isolation between different workflows
Connection sharing across multiple GUI/CLI instances for same workflow

Execution Model

Workforce employs a unified execution model where every run is treated as a subset run.

Unified Subset Execution

Philosophy: Whether running the entire workflow or just a few nodes, the system treats all execution as subset operations. This keeps behavior consistent and avoids special-case handling for full runs.

Node Selection Logic

When a workflow run is initiated:

Explicit Selection: If specific nodes are selected (via CLI --nodes flag or GUI selection):
- Those nodes form an induced subgraph for execution
- All edges and dependencies within this subgraph are preserved
Failed Node Selection: If no nodes are explicitly selected:
- The system checks for nodes in a failed state
- All failed nodes are automatically selected for re-execution
- This enables the “resume” functionality
Full Workflow Selection: If no nodes are selected and none have failed:
- All nodes with zero in-degree in the full workflow are selected
- This effectively runs the entire workflow from the beginning
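
The three-step fallback above can be expressed compactly. This is a sketch over plain dictionaries and edge tuples rather than the real graph object, and select_nodes is a hypothetical name.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

Edge = Tuple[str, str]  # (source node, target node)


def select_nodes(
    statuses: Dict[str, str],
    edges: Iterable[Edge],
    explicit: Optional[Iterable[str]] = None,
) -> Set[str]:
    """Pick the nodes for a run using the fallback order described above."""
    chosen = set(explicit or ())
    if chosen:
        return chosen  # 1. explicit selection wins
    failed = {node for node, status in statuses.items() if status == "fail"}
    if failed:
        return failed  # 2. otherwise resume the failed nodes
    targets = {dst for _, dst in edges}
    return {node for node in statuses if node not in targets}  # 3. graph roots
```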

Execution Initialization

Upon starting a run, the scheduler:

Subgraph Extraction: Extracts the target subset from the main workflow graph
Dependency Analysis: Identifies all nodes within the subset that have:
- Zero in-degree relative only to that subset
- (Not zero in-degree in the full graph)
Initial Scheduling: Transitions these zero-in-degree nodes to run state
Boundary Enforcement: Dependencies that exist in the master workfile but fall outside the current run scope are ignored, so a node whose upstream dependencies are all out of scope starts immediately
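
To illustrate the difference between in-degree in the full graph and in-degree relative to the subset, here is a small sketch (initial_nodes is a hypothetical helper, and edges are plain tuples):

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]  # (source node, target node)


def initial_nodes(edges: Iterable[Edge], subset: Iterable[str]) -> Set[str]:
    """Nodes in the subset with no incoming edge from another subset member."""
    members = set(subset)
    has_parent_in_subset = {
        dst for src, dst in edges if src in members and dst in members
    }
    return members - has_parent_in_subset
```

With edges a→b and b→c and a subset of {b, c}, node b is an initial node even though it has an incoming edge in the full graph.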

Subgraph Boundary Enforcement

To prevent execution from bleeding into the rest of the workfile:

The scheduler strictly enforces subnetwork boundaries
Propagation is confined entirely to the active selection
When a node completes:
- Only outgoing edges within the filtered subnetwork are evaluated
- Edges leading to nodes outside the original subset are ignored
- This effectively “caps” the execution at the subset boundary

Execution Loop

The execution loop follows this pattern:

Node Execution
- Node command is executed via subprocess
- stdout and stderr are captured in real-time
- Outputs are stored as node attributes in the graph
- Logs are viewable from GUI (press ‘s’)
Event Emission
- Upon completion, a NODE_FINISHED event is emitted (or NODE_FAILED on error)
- Event includes client ID for multi-client coordination
- Server broadcasts event to all connected clients via WebSocket
Scheduler Update
- Emission triggers the scheduler to retrieve the filtered subnetwork map
- All valid outgoing edges (within the subnetwork) are updated to to_run status
- A GRAPH_UPDATED event is broadcast
Dependency Check
- Status change prompts target nodes to check dependencies
- Node transitions to run state only if:
  - ALL incoming edges (within the subnetwork context) are marked to_run
- Once satisfied:
  - Node clears the statuses from incoming edges
  - Begins execution
  - Loop returns to Node Execution for the newly started node

This mechanism ensures the engine only advances when subset-specific dependencies are fully met.
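
The loop can be simulated in a few lines. The sketch below is a simplification under stated assumptions: it handles blocking edges only, runs nodes sequentially rather than in parallel, and replaces subprocess execution and event broadcasting with an execute callback. The name run_subset is hypothetical.

```python
from collections import deque
from typing import Callable, Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (source node, target node)


def run_subset(
    edges: Iterable[Edge],
    subset: Iterable[str],
    execute: Callable[[str], bool],
) -> Dict[str, str]:
    """Sequentially simulate the loop for blocking edges inside `subset`."""
    nodes: Set[str] = set(subset)
    inner = [(s, d) for s, d in edges if s in nodes and d in nodes]
    edge_status = {edge: "" for edge in inner}
    status = {node: "" for node in nodes}
    ready = deque(sorted(n for n in nodes if all(d != n for _, d in inner)))
    while ready:
        node = ready.popleft()
        status[node] = "running"
        ok = execute(node)                      # node execution
        status[node] = "ran" if ok else "fail"  # NODE_FINISHED / NODE_FAILED
        if not ok:
            continue                            # failures do not propagate
        for edge in inner:                      # scheduler update
            if edge[0] == node:
                edge_status[edge] = "to_run"
        for target in sorted({d for s, d in inner if s == node}):
            incoming = [e for e in inner if e[1] == target]
            if all(edge_status[e] == "to_run" for e in incoming):
                for e in incoming:
                    edge_status[e] = ""         # clear consumed statuses
                status[target] = "run"          # dependency check passed
                ready.append(target)
    return status
```

Edges that leave the subset never enter inner, so completion of a boundary node triggers nothing outside the selection.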

Dependency Resolution

The dependency resolution system is the core of Workforce’s scheduling logic. It determines which nodes are ready to execute based on the type of each incoming edge (blocking or non-blocking) and the current status of those edges.

Edge Types in Dependency Resolution:

Workforce supports two edge types that affect dependency checking:

Blocking Edges - The target node waits until ALL incoming blocking edges are to_run
Non-Blocking Edges - The target node runs immediately when ANY incoming non-blocking edge becomes to_run

Resolution Algorithm:

When an upstream node completes (status becomes ran), the scheduler:

Marks all outgoing edges as to_run
For each downstream node, checks its incoming edges:

If the incoming edge is BLOCKING:
- Check if ALL incoming blocking edges are to_run
- Only if true: Set target node status to run
- Non-blocking edges do not affect blocking edge checks
If the incoming edge is NON-BLOCKING:
- Immediately set target node status to run
- Do not wait for other incoming edges
- Allows target to execute (or re-execute) from this single trigger
All propagation respects Subset Run boundaries:
- Only edges within the active subset are processed
- Edges leading to nodes outside the subset are ignored
- Ensures execution remains confined to selected scope
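
A sketch of the resolution step for one finished node, following the rules listed above. The function name targets_to_run and the edge type strings "blocking" and "non-blocking" are assumptions made for illustration; the stored attribute values may differ.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (source node, target node)


def targets_to_run(
    finished: str,
    edge_type: Dict[Edge, str],
    edge_status: Dict[Edge, str],
    subset: Iterable[str],
) -> Set[str]:
    """Mark outgoing edges of `finished` as to_run; return nodes to set to run."""
    nodes = set(subset)
    # Only edges with both endpoints in the active subset take part.
    inner = [e for e in edge_type if e[0] in nodes and e[1] in nodes]
    ready: Set[str] = set()
    for edge in inner:
        if edge[0] != finished:
            continue
        edge_status[edge] = "to_run"
        target = edge[1]
        if edge_type[edge] == "non-blocking":
            ready.add(target)  # fires on this single trigger
            continue
        blocking_in = [
            e for e in inner if e[1] == target and edge_type[e] == "blocking"
        ]
        if all(edge_status.get(e) == "to_run" for e in blocking_in):
            ready.add(target)  # every blocking dependency is satisfied
    return ready
```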

Visual Representation:

Here’s how blocking and non-blocking edges interact during execution:

Initial State:
┌─────┐         ┌─────┐
│ A   │ (ran)   │ B   │ (waiting)
└─────┘         └─────┘
  │ blocking
  │ to_run

With all blocking edges to_run and no non-blocking incoming:
┌─────┐         ┌─────┐
│ A   │ (ran)   │ B   │ (run) ──> executes once
└─────┘         └─────┘
  │ blocking
  │ to_run

With non-blocking edge triggering:
┌─────┐         ┌─────┐
│ A   │ (ran)   │ B   │ (run) ──> re-executes
└─────┘         └─────┘
  │ non-blocking
  │ to_run

Mixed Dependencies Example:

Consider a node C with one blocking edge from A and one non-blocking edge from B:

┌─────┐      ┌─────┐
│ A   │      │ B   │
└─────┘      └─────┘
  │ (blocking)  │ (non-blocking)
  └────┬────────┘
       │
     ┌─────┐
     │ C   │
     └─────┘

Execution sequence:

A completes: A→C marked to_run, C checks dependencies
- A→C is blocking and to_run ✓
- A→C is C’s only incoming blocking edge, and the non-blocking edge B→C does not affect the blocking check
- C is set to run and executes
B completes: B→C marked to_run
- B→C is non-blocking, so C is immediately set to run
- C executes again (independent of the status of A→C)
If B runs again: B→C marked to_run again
- C immediately runs again (re-triggers)
- A→C status does not affect re-triggering

Cycle Detection:

Before execution begins, Workforce checks for cycles using only blocking edges:

Constructs a subgraph containing only blocking edges
Checks if the subgraph is a directed acyclic graph (DAG)
Non-blocking edges are ignored for cycle detection
This allows workflows with non-blocking cycles (safe, no infinite loops)
Blocking cycles cause an error before run initiation
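
The check can be illustrated without any graph library using Kahn's algorithm on the blocking-only edge set. This is a sketch, not the project's implementation (NetworkX's is_directed_acyclic_graph would serve the same purpose); has_blocking_cycle and the edge type strings are hypothetical.

```python
from collections import deque
from typing import Dict, Tuple

Edge = Tuple[str, str]  # (source node, target node)


def has_blocking_cycle(edge_type: Dict[Edge, str]) -> bool:
    """True if the subgraph made of blocking edges contains a cycle."""
    blocking = [e for e, kind in edge_type.items() if kind == "blocking"]
    nodes = {node for edge in blocking for node in edge}
    indegree = {node: 0 for node in nodes}
    for _, dst in blocking:
        indegree[dst] += 1
    pending = deque(node for node in nodes if indegree[node] == 0)
    visited = 0
    while pending:
        node = pending.popleft()
        visited += 1
        for src, dst in blocking:
            if src == node:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    pending.append(dst)
    # Nodes on a cycle never reach in-degree zero, so they are never visited.
    return visited < len(nodes)
```

A loop closed by a non-blocking edge passes this check, while the same loop built entirely from blocking edges is rejected.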

Subset Run Propagation:

During dependency resolution, the system:

Maintains an active set of nodes for the current run (the subset)
Only propagates edges where both endpoints are in the subset
Ignores edges leading to nodes outside the subset
Prevents execution from “leaking” beyond the intended scope
Enables safe subset runs and node recovery without side effects

Resume Functionality

The resume feature (Shift+R in GUI, or re-running failed nodes) handles workflow recovery:

How Resume Works:

Failed Node Detection: System identifies nodes in failed state
Status Reset: Failed node status is replaced with run
Event Loop Trigger: Status change re-triggers the event loop
Dependency Re-check: Scheduler re-evaluates dependencies for the failed node
Queue for Execution: If dependencies are met, node is queued for execution
Pipeline Continuation: Remainder of pipeline proceeds through normal dependency checking

Boundary Enforcement:

Resume is strictly bounded by the original subset
Resume never propagates to nodes outside the original selection
Ensures nodes do not remain in a running state indefinitely
Clean status management prevents zombie processes

Event System

Workforce uses a publish-subscribe event system for coordinating workflow execution.

Event Types

Node Events:

NODE_READY - Node is ready to execute (all dependencies met, status set to run)
NODE_STARTED - Node execution has begun (status set to running)
NODE_FINISHED - Node finished successfully (status set to ran)
NODE_FAILED - Node execution failed (status set to fail)

Workflow Events:

RUN_COMPLETE - All nodes in the run have completed or failed
GRAPH_UPDATED - Graph structure or attributes were modified
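
For readers unfamiliar with the pattern, a publish-subscribe bus can be as small as the sketch below. The subscribe and emit method names and the payload shape are assumptions; Workforce's own EventBus may expose a different interface.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List

Handler = Callable[[Dict[str, Any]], None]


class EventBus:
    """Minimal in-process pub/sub: handlers are registered per event type."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def emit(self, event_type: str, **payload: Any) -> None:
        # Copy the list so handlers may subscribe or unsubscribe while running.
        for handler in list(self._handlers[event_type]):
            handler({"type": event_type, **payload})
```

A scheduler would subscribe to NODE_FINISHED to advance the run, while a Socket.IO bridge could subscribe to every type and forward events to the workspace room.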

Event Flow

Event Generation: Server generates events during workflow execution
Event Broadcasting: Events are broadcast via WebSocket to all connected clients
Client Handling: Each client receives events and updates its local state
Client ID Tagging: Events are tagged with originating client ID to prevent conflicts
State Synchronization: All clients maintain synchronized view of workflow state

Multi-User Support

The event system enables true multi-user collaboration:

Multiple GUI clients can connect to the same workspace context simultaneously
Changes made in one client are broadcast to all others in real-time via Socket.IO rooms
Execution initiated by one client is visible to all connected clients
Each workspace maintains isolated event streams preventing cross-workspace interference
Client connections are workspace-scoped, ensuring proper event routing

Data Flow

Workflow File (GraphML)

The workflow is stored as a GraphML file with:

Nodes: Represent bash commands with attributes (id, label, status, log, x, y)
- id - UUID for the node
- label - The bash command to execute
- status - Current state: "" (empty), run, running, ran, fail
- log - Combined stdout/stderr from execution
- x, y - Position coordinates (stored as strings)
Edges: Represent dependencies with attributes (id, status)
- id - UUID for the edge
- status - Either "" (empty) or to_run (source completed)
Graph: Graph-level attributes
- wrapper - Command template with {} placeholder (e.g., bash -c '{}')
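
As an illustration of the wrapper placeholder, the sketch below substitutes a node's label into the template. It assumes plain text substitution with no shell escaping and assumes that a wrapper without a placeholder leaves the command unchanged; apply_wrapper is a hypothetical name and the real behavior may differ.

```python
def apply_wrapper(wrapper: str, command: str) -> str:
    """Substitute the node command for the first {} placeholder (no escaping)."""
    if "{}" not in wrapper:
        return command  # assumption: no placeholder means no wrapping
    return wrapper.replace("{}", command, 1)
```

For example, apply_wrapper("bash -c '{}'", "echo hi") yields bash -c 'echo hi'. Commands that themselves contain single quotes would need escaping, which this sketch does not attempt.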

File Loading and Saving

load_graph(path): Loads GraphML into NetworkX DiGraph
save_graph(graph, path): Writes NetworkX DiGraph to GraphML using atomic temp file + os.replace
Graph operations like add_node_to_graph() automatically save changes
Concurrency Safety: All graph mutations are serialized through the server’s single-threaded queue worker (one per workspace), preventing concurrent writes
Crash Safety: Atomic file replacement (temp file + os.replace) ensures files are never partially written
The singleton server architecture with queue-based serialization eliminates the need for file locking
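
The temp file plus os.replace pattern mentioned above looks roughly like this. The sketch writes plain text instead of serializing a NetworkX graph, and atomic_write_text is a hypothetical helper, not the save_graph implementation.

```python
import os
import tempfile


def atomic_write_text(path: str, text: str) -> None:
    """Write `text` to `path` so readers see the old or new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for os.replace to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to disk before the swap
        os.replace(tmp_path, path)     # atomic rename over the destination
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)        # do not leave partial temp files behind
        raise
```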

Network Communication

HTTP API:

RESTful endpoints for workflow modification
JSON request/response format
Workspace-scoped URLs: /workspace/{workspace_id}/...
Server URL discovered via find_running_server()

Socket.IO:

Real-time bidirectional communication
Event broadcasting from server to clients via workspace-specific rooms
Status updates and log streaming
Persistent connection during workflow execution
Room-based isolation ensures events only reach relevant clients

Process Management

Command Execution

Commands run via Python’s subprocess module
Separate process for each node
stdout/stderr captured in real-time
Exit codes determine success/failure

Process Lifecycle

Spawn: Process created when node transitions to run state
Monitor: Output streams monitored via threads
Complete: Process terminates, exit code checked
Cleanup: Resources released, status updated
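
The lifecycle can be sketched with the standard library alone. The example below merges stderr into stdout, collects lines on a reader thread as they arrive, and maps the exit code to a status; run_command is a hypothetical helper and the real runner may separate the streams or stream lines to clients.

```python
import subprocess
import threading
from typing import List, Tuple


def run_command(command: str) -> Tuple[str, str]:
    """Run a shell command; return (status, combined stdout/stderr log)."""
    lines: List[str] = []
    process = subprocess.Popen(        # spawn
        command,
        shell=True,
        text=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,      # merge stderr into the same stream
    )

    def pump() -> None:
        assert process.stdout is not None
        for line in process.stdout:    # monitor: yields lines as they arrive
            lines.append(line)

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()
    exit_code = process.wait()         # complete
    reader.join()                      # cleanup: drain remaining output
    return ("ran" if exit_code == 0 else "fail"), "".join(lines)
```

Since the command string comes straight from the workfile and is passed to the shell, this pattern inherits the command injection risk noted under Security Considerations.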

Parallel Execution

Multiple nodes can run simultaneously
Limited only by available system resources
Dependency constraints prevent invalid parallelism
No explicit parallelism limit (user-controlled via workflow design)

Error Handling

Node Failures

When a node fails:

Node status set to fail
Error output captured in the node’s log attribute (combined stdout/stderr)
NODE_FAILED event broadcast to clients
Execution continues for independent branches
Failed node prevents downstream execution
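
To see which nodes a failure holds back, one can walk the edges downstream from the failed node. The sketch assumes all edges are blocking (a non-blocking edge from another branch could still trigger a downstream node), and blocked_by_failure is a hypothetical helper.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # (source node, target node)


def blocked_by_failure(edges: Iterable[Edge], failed: str) -> Set[str]:
    """Nodes reachable from `failed`; they cannot start until it succeeds."""
    edge_list: List[Edge] = list(edges)
    blocked: Set[str] = set()
    stack = [failed]
    while stack:
        node = stack.pop()
        for src, dst in edge_list:
            if src == node and dst not in blocked:
                blocked.add(dst)
                stack.append(dst)
    return blocked
```

Every node outside the returned set belongs to an independent branch and keeps executing.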

Workflow Failures

Failed nodes do not stop the entire workflow
Independent branches continue executing
Workflow completes when all executable nodes finish
Resume functionality allows recovery from failures

Server Failures

Server auto-discovery detects running servers via health checks
Idle servers automatically shut down (no clients + no active runs)
Deferred shutdown with 1-second delay prevents race conditions
Clients detect disconnection and notify user
Manual cleanup via wf server stop if needed

Security Considerations

Commands execute with user’s shell permissions
No authentication currently implemented (local use)
Registry file permissions control access
WebSocket connections not encrypted (localhost)
Command injection risks if workflow files untrusted

Performance Considerations

Graph size limited by memory
NetworkX provides efficient graph operations
WebSocket events add minimal overhead
Subprocess spawning has system-dependent limits
Large stdout/stderr captured in memory (consider log rotation)