Merged
34 changes: 32 additions & 2 deletions CLAUDE.md
@@ -316,6 +316,33 @@ performance reasons.
4. Update client API in `src/client/apis/`
5. Add CLI command handler if needed in `src/client/commands/`

### Adding a New CLI Subcommand

For a new subcommand (e.g., `torc workflows correct-resources`):

1. **Implement handler function** in `src/client/commands/{module}.rs`
- Follow existing pattern from other commands in the same file
- Use `#[command(...)]` attributes for clap configuration

2. **Add to enum variant** in the `#[derive(Subcommand)]` enum
- Add field struct with `#[arg(...)]` attributes for options/flags
- Use `#[command(name = "...")]` to set the subcommand name

3. **Update help template** (if applicable)
- For `workflows` commands: Update `WORKFLOWS_HELP_TEMPLATE` constant at top of file
- Add entry to the appropriate category with description (format: `command_name Description`)
- Use ANSI color codes for consistency: `\x1b[1;36m` for command, `\x1b[1;32m` for category

4. **Remove `hide = true`** if command should be visible
- Default behavior shows command in help unless explicitly hidden

5. **Add well-formatted help text** in `#[command(...)]` attribute
- Use `after_long_help = "..."` for detailed examples
- Examples are shown when user runs `torc workflows command-name --help`

6. **Wire up in match statement**
- Add case in the match block that calls your handler function (usually around line 3400+)

### Creating a Workflow from Specification

1. Write workflow spec file (JSON/JSON5/YAML) following `WorkflowSpec` format
@@ -448,9 +475,12 @@ unified CLI.
- `torc submit-slurm --account <account> <spec_file>` - Submit with auto-generated Slurm schedulers
- `torc tui` - Interactive terminal UI

**Utilities**:
**Reports & Analysis**:

- `torc reports <subcommand>` - Generate reports
- `torc reports check-resource-utilization <id>` - Check which jobs exceeded resource limits
- `torc reports check-resource-utilization <id> --correct` - Automatically fix resource violations
- `torc reports results <id>` - Get detailed job execution results
- `torc reports summary <id>` - Get workflow completion summary

**Global Options** (available on all commands):

6 changes: 6 additions & 0 deletions api/openapi.yaml
@@ -5992,6 +5992,12 @@ components:
timestamp:
description: Timestamp of workflow creation
type: string
project:
description: Project name or identifier for grouping workflows
type: string
metadata:
description: Arbitrary metadata as JSON string
type: string

compute_node_expiration_buffer_seconds:
default: 180
38 changes: 35 additions & 3 deletions docs/src/core/how-to/check-resource-utilization.md
@@ -36,22 +36,54 @@ For workflows that have been reinitialized multiple times:
torc reports check-resource-utilization <workflow_id> --run-id 2
```

## Adjusting Requirements
## Automatically Correcting Requirements

When jobs exceed their limits, update your workflow specification with a buffer:
Instead of manually adjusting requirements, use the `--correct` flag to automatically increase
resource allocations based on actual usage:

```bash
torc reports check-resource-utilization <workflow_id> --correct
```

This will:

- Detect all resource over-utilization violations (memory, CPU, runtime)
- Calculate new requirements with a 1.2x multiplier for safety margin
- Update the workflow's resource requirements immediately

Example:

```
⚠ Found 1 resource over-utilization violations:
Job 15 (train_model): Memory over-utilization detected, peak 10.5 GB → allocating 12.6 GB (1.2x)
✓ Updated 1 resource requirements
```
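The correction rule shown above (peak usage × 1.2) can be sketched as a pure function. The function name and GB units are illustrative only; the real logic lives inside torc's reports code:

```rust
/// Sketch of the corrective calculation: scale the observed peak by a
/// 1.2x safety multiplier, as in the example output above.
fn corrected_requirement_gb(peak_gb: f64) -> f64 {
    const SAFETY_MULTIPLIER: f64 = 1.2;
    peak_gb * SAFETY_MULTIPLIER
}

fn main() {
    // Mirrors the example: peak 10.5 GB -> 12.6 GB allocation.
    println!("{:.1} GB", corrected_requirement_gb(10.5));
}
```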

### Preview Changes Without Applying

Use `--dry-run` to see what changes would be made:

```bash
torc reports check-resource-utilization <workflow_id> --correct --dry-run
```

## Manual Adjustment

For more control, update your workflow specification with a buffer:

```yaml
resource_requirements:
- name: training
memory: 12g # 10.5 GB peak + 15% buffer
runtime: PT3H # 2h 45m actual + buffer
num_cpus: 7 # Enough for peak CPU usage
```

**Guidelines:**

- Memory: Add 10-20% above peak usage
- Runtime: Add 15-30% above actual duration
- CPU: Round up to next core count
- CPU: Round up to accommodate peak percentage (e.g., 501% CPU → 6 cores)
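Assuming the guidelines above, the buffer arithmetic looks like this (hypothetical helpers, not part of torc):

```rust
/// Memory: add a fractional buffer above peak usage
/// (10-20% per the guidelines, i.e. buffer_fraction of 0.10-0.20).
fn buffered_memory_gb(peak_gb: f64, buffer_fraction: f64) -> f64 {
    peak_gb * (1.0 + buffer_fraction)
}

/// CPU: round peak utilization percentage up to whole cores,
/// e.g. 501% -> 6 cores.
fn cores_for_peak(peak_cpu_percent: f64) -> u32 {
    (peak_cpu_percent / 100.0).ceil() as u32
}

fn main() {
    // 10.5 GB peak with a 15% buffer.
    println!("{:.2} GB", buffered_memory_gb(10.5, 0.15));
    println!("{} cores", cores_for_peak(501.0)); // 6 cores
}
```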

## See Also

3 changes: 3 additions & 0 deletions docs/src/core/reference/cli.md
@@ -1967,6 +1967,9 @@ Check resource utilization and report jobs that exceeded their specified requirements
- `-r`, `--run-id <RUN_ID>` — Run ID to analyze (optional - analyzes latest run if not provided)
- `-a`, `--all` — Show all jobs (default: only show jobs that exceeded requirements)
- `--include-failed` — Include failed and terminated jobs in the analysis (for recovery diagnostics)
- `--correct` — Automatically correct resource requirements for over-utilized jobs
- `--min-over-utilization <MIN_OVER_UTILIZATION>` — Minimum over-utilization percentage to flag as
violation (default: 1.0%)

## `torc reports results`

40 changes: 40 additions & 0 deletions docs/src/core/reference/workflow-spec.md
@@ -12,6 +12,8 @@ The top-level container for a complete workflow definition.
| `name` | string | _required_ | Name of the workflow |
| `user` | string | current user | User who owns this workflow |
| `description` | string | none | Description of the workflow |
| `project` | string | none | Project name or identifier for grouping workflows |
| `metadata` | string | none | Arbitrary metadata as JSON string |
| `parameters` | map\<string, string\> | none | Shared parameters that can be used by jobs and files via `use_parameters` |
| `jobs` | [[JobSpec](#jobspec)] | _required_ | Jobs that make up this workflow |
| `files` | [[FileSpec](#filespec)] | none | Files associated with this workflow |
@@ -29,6 +31,44 @@ The top-level container for a complete workflow definition.
| `compute_node_wait_for_healthy_database_minutes` | integer | none | Compute nodes wait this many minutes for database recovery |
| `jobs_sort_method` | [ClaimJobsSortMethod](#claimjobssortmethod) | `none` | Method for sorting jobs when claiming them |

### Examples with project and metadata

The `project` and `metadata` fields are useful for organizing and categorizing workflows. For more
detailed guidance on organizing workflows, see
[Organizing and Managing Workflows](../workflows/organizing-workflows.md).

**YAML example:**

```yaml
name: "ml_training_workflow"
project: "customer-churn-prediction"
metadata: '{"environment":"staging","version":"1.0.0","team":"ml-engineering"}'
description: "Train and evaluate churn prediction model"
jobs:
- name: "preprocess"
command: "python preprocess.py"
- name: "train"
command: "python train.py"
depends_on: ["preprocess"]
```

**JSON example:**

```json
{
"name": "data_pipeline",
"project": "analytics-platform",
"metadata": "{\"cost_center\":\"eng-data\",\"priority\":\"high\"}",
"description": "Daily data processing pipeline",
"jobs": [
{
"name": "extract",
"command": "python extract.py"
}
]
}
```

## JobSpec

Defines a single computational task within a workflow.
7 changes: 7 additions & 0 deletions docs/src/core/workflows/index.md
@@ -2,8 +2,15 @@

This section covers how to create, configure, and manage workflows.

## Creating and Formatting

- [Creating Workflows](./creating-workflows.md) - Getting started with workflow creation
- [Workflow Specification Formats](./workflow-formats.md) - JSON, YAML, and other formats

## Organization and Management

- [Organizing and Managing Workflows](./organizing-workflows.md) - Using project and metadata fields
to categorize and track workflows
- [Visualizing Workflow Structure](./visualizing-workflows.md) - Viewing workflow graphs
- [Exporting and Importing Workflows](./export-import-workflows.md) - Moving workflows between
systems