Merged

Changes from 5 commits
34 changes: 32 additions & 2 deletions CLAUDE.md
@@ -316,6 +316,33 @@ performance reasons.
4. Update client API in `src/client/apis/`
5. Add CLI command handler if needed in `src/client/commands/`

### Adding a New CLI Subcommand

For a new subcommand (e.g., `torc workflows correct-resources`):

1. **Implement handler function** in `src/client/commands/{module}.rs`
- Follow existing pattern from other commands in the same file
- Use `#[command(...)]` attributes for clap configuration

2. **Add to enum variant** in the `#[derive(Subcommand)]` enum
- Add field struct with `#[arg(...)]` attributes for options/flags
- Use `#[command(name = "...")]` to set the subcommand name

3. **Update help template** (if applicable)
- For `workflows` commands: Update `WORKFLOWS_HELP_TEMPLATE` constant at top of file
- Add entry to the appropriate category with description (format: `command_name Description`)
- Use ANSI color codes for consistency: `\x1b[1;36m` for command, `\x1b[1;32m` for category

4. **Remove `hide = true`** if the command should be visible
   - Commands appear in help output by default unless explicitly hidden

5. **Add well-formatted help text** in `#[command(...)]` attribute
- Use `after_long_help = "..."` for detailed examples
- Examples are shown when user runs `torc workflows command-name --help`

6. **Wire up in match statement**
- Add case in the match block that calls your handler function (usually around line 3400+)
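Steps 1–6 can be sketched with a plain Rust enum and match. This is only a shape sketch: the clap derive attributes are elided, and names like `CorrectResources`, `handle_correct_resources`, and `dispatch` are illustrative, not the actual torc definitions:

```rust
// Step 2: the subcommand enum variant. In the real code this enum derives
// clap's `Subcommand` and the variant carries `#[command(name = "...")]`
// and `#[arg(...)]` attributes; those are elided here.
enum WorkflowsCommand {
    CorrectResources { workflow_id: String, dry_run: bool },
}

// Step 1: the handler function in src/client/commands/{module}.rs
// (illustrative signature; the real handler talks to the server API).
fn handle_correct_resources(workflow_id: &str, dry_run: bool) -> String {
    if dry_run {
        format!("would correct resources for workflow {workflow_id}")
    } else {
        format!("corrected resources for workflow {workflow_id}")
    }
}

// Step 6: the match block that dispatches each variant to its handler.
fn dispatch(cmd: WorkflowsCommand) -> String {
    match cmd {
        WorkflowsCommand::CorrectResources { workflow_id, dry_run } => {
            handle_correct_resources(&workflow_id, dry_run)
        }
    }
}

fn main() {
    let out = dispatch(WorkflowsCommand::CorrectResources {
        workflow_id: "5".to_string(),
        dry_run: true,
    });
    println!("{out}");
}
```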

### Creating a Workflow from Specification

1. Write workflow spec file (JSON/JSON5/YAML) following `WorkflowSpec` format
@@ -448,9 +475,12 @@ unified CLI.
- `torc submit-slurm --account <account> <spec_file>` - Submit with auto-generated Slurm schedulers
- `torc tui` - Interactive terminal UI

**Utilities**:
**Reports & Analysis**:

- `torc reports <subcommand>` - Generate reports
- `torc reports check-resource-utilization <id>` - Check which jobs exceeded resource limits
- `torc reports check-resource-utilization <id> --correct` - Automatically fix resource violations
- `torc reports results <id>` - Get detailed job execution results
- `torc reports summary <id>` - Get workflow completion summary

**Global Options** (available on all commands):

6 changes: 6 additions & 0 deletions api/openapi.yaml
@@ -5992,6 +5992,12 @@ components:
timestamp:
description: Timestamp of workflow creation
type: string
project:
description: Project name or identifier for grouping workflows
type: string
metadata:
description: Arbitrary metadata as JSON string
type: string

compute_node_expiration_buffer_seconds:
default: 180
64 changes: 61 additions & 3 deletions docs/src/core/how-to/check-resource-utilization.md
@@ -36,22 +36,80 @@ For workflows that have been reinitialized multiple times:
torc reports check-resource-utilization <workflow_id> --run-id 2
```

## Adjusting Requirements
## Automatically Correct Requirements

When jobs exceed their limits, update your workflow specification with a buffer:
Use the separate `correct-resources` command to automatically adjust resource allocations based on
actual resource measurements:

```bash
torc workflows correct-resources <workflow_id>
```

This analyzes both completed and failed jobs to detect:

- **Memory violations** — Jobs using more memory than allocated
- **CPU violations** — Jobs using more CPU than allocated
- **Runtime violations** — Jobs running longer than allocated time

The command will:

- Calculate new requirements using actual peak usage data
- Apply a 1.2x safety multiplier to each resource
- Update the workflow's resource requirements for future runs

Example:

```
Analyzing and correcting resource requirements for workflow 5
✓ Resource requirements updated successfully

Corrections applied:
memory_training: 8g → 10g (+25.0%)
cpu_training: 4 → 5 cores (+25.0%)
runtime_training: PT2H → PT2H30M (+25.0%)
```
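The numbers in the example are consistent with multiplying by the 1.2x safety factor and then rounding up to an allocation granularity. A minimal sketch of that arithmetic, with two stated assumptions: the old allocations stand in for peak usage, and the granularities (whole GB, whole cores, 30-minute runtime blocks) are inferred from the example output rather than documented torc behavior:

```rust
// Apply the safety multiplier, then round up to the allocation granularity.
// Granularities (1 GB, 1 core, 30-minute runtime blocks) are assumptions
// inferred from the example output above, not taken from the torc source.
fn corrected(peak: f64, multiplier: f64, granularity: f64) -> f64 {
    ((peak * multiplier) / granularity).ceil() * granularity
}

fn main() {
    // 8 GB * 1.2 = 9.6 -> rounds up to 10 GB (+25% vs. the 8 GB allocation)
    println!("memory: {} GB", corrected(8.0, 1.2, 1.0));
    // 4 cores * 1.2 = 4.8 -> 5 cores
    println!("cpus: {}", corrected(4.0, 1.2, 1.0));
    // 120 min * 1.2 = 144 -> next 30-minute block is 150 min (PT2H30M)
    println!("runtime: {} min", corrected(120.0, 1.2, 30.0));
}
```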

### Preview Changes Without Applying

Use `--dry-run` to see what changes would be made:

```bash
torc workflows correct-resources <workflow_id> --dry-run
```

### Correct Only Specific Jobs

To update only certain jobs (by ID):

```bash
torc workflows correct-resources <workflow_id> --job-ids 15,16,18
```

### Custom Correction Multiplier

Adjust the safety margin (default 1.2x):

```bash
torc workflows correct-resources <workflow_id> --memory-multiplier 1.5 --runtime-multiplier 1.4
```

## Manual Adjustment

For more control, update your workflow specification with a buffer:

```yaml
resource_requirements:
- name: training
memory: 12g # 10.5 GB peak + 15% buffer
runtime: PT3H # 2h 45m actual + buffer
num_cpus: 7 # Enough for peak CPU usage
```

**Guidelines:**

- Memory: Add 10-20% above peak usage
- Runtime: Add 15-30% above actual duration
- CPU: Round up to next core count
- CPU: Round up to accommodate peak percentage (e.g., 501% CPU → 6 cores)
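The CPU guideline can be sketched as follows, assuming 100% represents one fully used core (the function name is illustrative):

```rust
// Round a peak CPU percentage up to whole cores (100% == one full core).
fn cores_for_peak(peak_percent: f64) -> u32 {
    (peak_percent / 100.0).ceil() as u32
}

fn main() {
    // 501% peak CPU -> 6 cores, matching the guideline above
    println!("{}", cores_for_peak(501.0));
}
```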

## See Also

1 change: 1 addition & 0 deletions docs/src/core/reference/cli-cheatsheet.md
@@ -43,6 +43,7 @@
| `torc watch <id> --recover --auto-schedule` | Full production recovery mode |
| `torc workflows sync-status <id>` | Fix orphaned jobs (stuck in "running") |
| `torc reports check-resource-utilization <id>` | Check memory/CPU/time usage |
| `torc workflows correct-resources <id>` | Auto-correct resource requirements |
| `torc reports summary <id>` | Workflow completion summary |
| `torc reports results <id>` | JSON report of job results with log paths |
| `torc slurm sacct <wf_id>` | Get Slurm accounting data |
13 changes: 3 additions & 10 deletions docs/src/core/reference/cli.md
@@ -1,15 +1,5 @@
# CLI Reference

This documentation is automatically generated from the CLI help text.

To regenerate, run:

```bash
cargo run --bin generate-cli-docs --features "client,tui,plot_resources"
```

# Command-Line Help for `torc`

This document contains the help content for the `torc` command-line program.

**Command Overview:**
@@ -1967,6 +1957,9 @@ Check resource utilization and report jobs that exceeded their specified requirements
- `-r`, `--run-id <RUN_ID>` — Run ID to analyze (optional - analyzes latest run if not provided)
- `-a`, `--all` — Show all jobs (default: only show jobs that exceeded requirements)
- `--include-failed` — Include failed and terminated jobs in the analysis (for recovery diagnostics)
- `--correct` — Automatically correct resource requirements for over-utilized jobs
- `--min-over-utilization <MIN_OVER_UTILIZATION>` — Minimum over-utilization percentage to flag as
violation (default: 1.0%)

## `torc reports results`

40 changes: 40 additions & 0 deletions docs/src/core/reference/workflow-spec.md
@@ -12,6 +12,8 @@ The top-level container for a complete workflow definition.
| `name` | string | _required_ | Name of the workflow |
| `user` | string | current user | User who owns this workflow |
| `description` | string | none | Description of the workflow |
| `project` | string | none | Project name or identifier for grouping workflows |
| `metadata` | string | none | Arbitrary metadata as JSON string |
| `parameters` | map\<string, string\> | none | Shared parameters that can be used by jobs and files via `use_parameters` |
| `jobs` | [[JobSpec](#jobspec)] | _required_ | Jobs that make up this workflow |
| `files` | [[FileSpec](#filespec)] | none | Files associated with this workflow |
@@ -29,6 +31,44 @@ The top-level container for a complete workflow definition.
| `compute_node_wait_for_healthy_database_minutes` | integer | none | Compute nodes wait this many minutes for database recovery |
| `jobs_sort_method` | [ClaimJobsSortMethod](#claimjobssortmethod) | `none` | Method for sorting jobs when claiming them |

### Examples with project and metadata

The `project` and `metadata` fields are useful for organizing and categorizing workflows. For more
detailed guidance on organizing workflows, see
[Organizing and Managing Workflows](../workflows/organizing-workflows.md).

**YAML example:**

```yaml
name: "ml_training_workflow"
project: "customer-churn-prediction"
metadata: '{"environment":"staging","version":"1.0.0","team":"ml-engineering"}'
description: "Train and evaluate churn prediction model"
jobs:
- name: "preprocess"
command: "python preprocess.py"
- name: "train"
command: "python train.py"
depends_on: ["preprocess"]
```

**JSON example:**

```json
{
"name": "data_pipeline",
"project": "analytics-platform",
"metadata": "{\"cost_center\":\"eng-data\",\"priority\":\"high\"}",
"description": "Daily data processing pipeline",
"jobs": [
{
"name": "extract",
"command": "python extract.py"
}
]
}
```

## JobSpec

Defines a single computational task within a workflow.
7 changes: 7 additions & 0 deletions docs/src/core/workflows/index.md
@@ -2,8 +2,15 @@

This section covers how to create, configure, and manage workflows.

## Creating and Formatting

- [Creating Workflows](./creating-workflows.md) - Getting started with workflow creation
- [Workflow Specification Formats](./workflow-formats.md) - JSON, YAML, and other formats

## Organization and Management

- [Organizing and Managing Workflows](./organizing-workflows.md) - Using project and metadata fields
to categorize and track workflows
- [Visualizing Workflow Structure](./visualizing-workflows.md) - Viewing workflow graphs
- [Exporting and Importing Workflows](./export-import-workflows.md) - Moving workflows between
systems