Skip to content

Track azure resource costs #155

Open
kanghengliu wants to merge 7 commits intomainfrom
feat/tracking-az-costs
Open

Track azure resource costs #155
kanghengliu wants to merge 7 commits intomainfrom
feat/tracking-az-costs

Conversation

@kanghengliu
Copy link
Copy Markdown
Collaborator

This PR adds an Azure Function to track daily azure resource costs.

Feature

  • azure-hosted daily cost tracking for hal_group
  • slack integration with webhook

@kanghengliu kanghengliu force-pushed the feat/tracking-az-costs branch from 9156d77 to 45269bb Compare March 10, 2026 13:15
@kanghengliu
Copy link
Copy Markdown
Collaborator Author

@schwartzadev Could we merge this PR soon to keep better check on Azure costs?

@ksayash
Copy link
Copy Markdown
Collaborator

ksayash commented Mar 11, 2026

@codex review

@ksayash
Copy link
Copy Markdown
Collaborator

ksayash commented Mar 11, 2026

@claude review

@claude
Copy link
Copy Markdown

claude bot commented Mar 11, 2026

Claude finished @ksayash's task in 1m 55s —— View job


Code Review: Track Azure Resource Costs

  • Gather context and read changed files
  • Review azure_functions/ (new files)
  • Review hal/utils/ changes
  • Review other changes (.env.template, pyproject.toml, agents/)
  • Post comprehensive review

Overall the feature is well-structured and sensible. A few issues worth addressing before merge, ranging from a latent bug to minor concerns.


Issues

Bug: Unguarded e.response.headers access

azure_functions/azure_cost_report.py:69

retry_after = int(e.response.headers.get("Retry-After", 0))

e.response can be None on certain network-level HttpResponseErrors, which would raise AttributeError and crash the retry loop. This should be:

retry_after = int((e.response.headers.get("Retry-After", 0) if e.response else 0))

Fix this →

Security: Webhook URL in process args

azure_functions/deploy.sh:52

--settings \
    SLACK_WEBHOOK_URL="${SLACK_WEBHOOK_URL}"

Passing secrets as CLI arguments exposes them in ps output and shell history. The az functionapp config appsettings set command supports reading from a JSON file via --settings @file.json, or you can pipe the secret via az keyvault secret references. At minimum, add a comment warning operators not to log CI output.

Hardcoded globally-unique resource names

azure_functions/deploy.sh:19-20

FUNC_APP_NAME="hal-cost-report"
STORAGE_ACCOUNT="halcostreport"

Azure storage account names must be globally unique across all Azure tenants. halcostreport is 13 characters and likely already taken (or will conflict if anyone forks this). Consider deriving the name from a unique suffix (e.g. the subscription ID short hash) or making it a required env var. The function app name has the same naming-conflict concern.

Fragile column index assumptions

azure_functions/azure_cost_report.py:83-88

cost = float(row[0])
user = row[2] if row[2] else "untagged"
cur = row[3]

The column order is inferred from a comment and assumed stable. It would be more robust to extract column names from result.columns and build an index map. If the API or query shape ever changes, this will silently produce wrong results.


Observations / Minor Issues

azure_functions/function_app.py:26-29 — The error handler calls post_to_slack(webhook_url, ...) without a try/except. If Slack itself is unreachable (network issue, rotated webhook), the error handler will raise an unhandled exception, hiding the root cause. Consider wrapping in a bare except.

azure_functions/local.settings.json — The file only contains FUNCTIONS_WORKER_RUNTIME and AzureWebJobsStorage. A developer checking this out will get a confusing KeyError on AZURE_SUBSCRIPTION_ID or SLACK_WEBHOOK_URL during local testing. Consider adding the required keys with empty values (matching the pattern of .env.template).

azure_functions/azure_cost_report.py:108 — The Slack message uses a fixed 16-char column width for usernames. Usernames longer than 16 chars will break the table alignment. This is cosmetic but worth noting.

pyproject.toml — The removal of ruff==0.15.4 from [project.optional-dependencies].dev seems unintentional (or belongs in a separate PR). If ruff is still used for linting (it's referenced in CI), dropping it from the dev deps may cause confusion.


Unrelated change in this PR

agents/scicode_codex_agent/main.py (+30/-14) — This is listed as changed but appears unrelated to Azure cost tracking. Could you clarify what changed here, or split it into a separate PR? Including unrelated logic changes makes this harder to review and revert if needed.


What looks good

  • The exponential backoff with Retry-After header respect in azure_cost_report.py is well-implemented.
  • The EXECUTED_BY tag enforcement in VirtualMachineManager is a good addition for cost attribution.
  • The Azure Function timer trigger with WEBSITE_TIME_ZONE for DST-aware scheduling is the correct approach.
  • .env.template update is clear and helpful.
  • The --dry-run mode in azure_cost_report.py is a nice touch for local testing.

Copy link
Copy Markdown
Collaborator

@schwartzadev schwartzadev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants