feat(triggers): DLQ classification + UI Replay + instrumentation (PR-3)#80
Open
mt-alarcon wants to merge 4 commits into
Conversation
…ps 1+2)

Step 1 — Migration (models.py + app.py):
- TriggerExecution gains 3 nullable columns: idempotency_key, error_category, last_replay_at
- to_dict() exposes the 3 new fields
- Idempotent auto-migrate at startup: ALTER TABLE + IF NOT EXISTS in each block
- Partial unique index uq_trigger_idem (trigger_id, idempotency_key) WHERE NOT NULL
- Basic index ix_trigger_executions_idem_key for lookups by key
- SQLite 3.51 confirmed — native partial index support; EXPLAIN QUERY PLAN confirms the index is used

Step 2 — Silent dedup (triggers.py):
- webhook_receiver extracts idem_key from idempotency_key / messageId / data.messageId
- If the key already exists: log idempotent_replay + silent 200 OK (pattern F6)
- Race condition: IntegrityError on db.commit() → rollback + silent 200 OK
- Legacy sources (GitHub, Stripe, Linear): no key → idem_key=None → normal flow unchanged
- Cleanup: current_app moved to a top-of-file import; inline imports removed

Tests passed: idempotent migration up/down, partial-index uniqueness, unconstrained NULLs, key extraction (6 cases), race condition via IntegrityError, EXPLAIN QUERY PLAN.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…PR-2)

Adds resilient retry logic to the Evolution Go API client. Transient HTTP errors (5xx, URLError, socket.timeout) now retry up to 3 times with exponential backoff + jitter; HTTP 4xx errors raise immediately (no retry, deterministic client errors).

## Changes

**`.claude/skills/int-evolution-go/scripts/evolution_go_client.py`:**
- New helper `_retry_http_call_client(do_call, max_attempts=3, base_delay=2.0, max_delay=8.0)` — generic retry wrapper with exponential backoff + jitter. Logs structured JSON events (`api_request_retry` / `api_request_failed`).
- Refactor `api_request` to wrap the actual HTTP call in `_do_call` and delegate to the retry helper. Removes inline try/except/sys.exit so library callers can handle errors; the CLI `main()` catches and exits as before.
- Refactor `main()` to wrap handler invocation in try/except for `urllib.error.HTTPError` / `URLError` / `socket.timeout` — preserves the exact CLI exit-1-on-failure behavior while letting library users propagate.

**`tests/whatsapp/test_retry_backoff.py` (new):** 4 synthetic tests covering:
1. HTTP 500 x3 → retries then raises (TestApiRequestRetry)
2. HTTP 400 → raises immediately, no retry
3. URLError x3 → retries then raises
4. Success on third attempt → returns result, call_count=3

All 4 tests pass against the modified `api_request`.

## Worst-case latency

3 attempts with all 5xx: sleep ≤ (2·2^0 + 0.5) + (2·2^1 + 0.5) = 7s total (each delay capped at max_delay=8s). Acceptable for an API call retry.

## Series
- PR-1 (evolution-foundation#78 — merged or pending): idempotency_key migration + silent dedup
- **PR-2 (this):** exponential backoff + jitter
- PR-3 (next): DLQ classification + UI Replay + instrumentation
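The backoff pattern described above can be sketched as follows. `TransientError` and `PermanentError` are stand-ins for the real 5xx/URLError/timeout vs. 4xx distinction, and the function name is illustrative, not the client's actual `_retry_http_call_client`:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for HTTP 5xx / URLError / socket.timeout."""

class PermanentError(Exception):
    """Stand-in for deterministic HTTP 4xx client errors."""

def retry_call(do_call, max_attempts=3, base_delay=2.0, max_delay=8.0, sleep=time.sleep):
    """Retry transient failures with exponential backoff + jitter.

    PermanentError is deliberately NOT caught: deterministic client
    errors propagate immediately, with no retry.
    """
    last_exc = None
    for attempt in range(max_attempts):
        try:
            return do_call()
        except TransientError as e:
            last_exc = e
            if attempt < max_attempts - 1:
                # 2s, 4s, ... capped at max_delay, plus up to 0.5s jitter
                delay = min(base_delay * (2 ** attempt), max_delay)
                sleep(delay + random.uniform(0, 0.5))
    raise last_exc

# Demo: succeed on the third attempt (sleep stubbed out for the example)
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("HTTP 500")
    return "ok"

print(retry_call(flaky, sleep=lambda _: None))  # ok
```

Injecting `sleep` as a parameter is what makes the four synthetic tests in `test_retry_backoff.py`-style suites fast: they can assert on attempt counts without real delays.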
…teps 4+5+6)

Step 4: _classify_error helper in triggers.py; _execute_trigger populates error_category (transient | permanent) and uses status failed_retryable for timeouts and HTTP 5xx, status failed for permanent errors.

Step 5: POST /api/triggers/executions/<id>/replay with a 60s rate limit, confirmation modal in the frontend (preview: recipient/command/timestamp), Replay button shown only when status=failed_retryable, original execution set to status replayed once the replay has started.

Step 6: GET /api/triggers/stats returns 8 metrics (total, by_status, dlq_size, idempotent_replays, wpp_command_count, retries_observed, circuit_breaker_watermark_hit). Badge on the /triggers UI with a yellow alert when watermark_hit=true (>50 WPP/day). WARNING log when it flips to True. Structured evt= logs in webhook_receiver, _execute_trigger, and replay.

Synthetic tests: 24/24 pass (unit: _classify_error 13 scenarios, watermark 5 scenarios, rate-limit 4 scenarios; integration: auth guard + stats shape against the real dashboard.db).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
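The Step 4 classification can be illustrated with a substring-based sketch. The real helper lives in triggers.py and its exact marker lists are not shown in this PR text, so the markers below are assumptions:

```python
# Illustrative markers only; the real _classify_error inspects stderr and
# exception types, not just message substrings.
TRANSIENT_MARKERS = ("timeout", "timed out", "connection reset", "http 5",
                     "temporarily unavailable")

def classify_error(err_msg: str) -> str:
    """Map an error message to 'transient' or 'permanent' (Step 4 sketch)."""
    msg = (err_msg or "").lower()
    if any(marker in msg for marker in TRANSIENT_MARKERS):
        return "transient"   # stored as status=failed_retryable
    return "permanent"       # ValueError/KeyError/auth errors etc. → status=failed

print(classify_error("Read timed out after 30s"))      # transient
print(classify_error("HTTP 503 Service Unavailable"))  # transient
print(classify_error("KeyError: 'recipient'"))         # permanent
```

Defaulting to `permanent` is the conservative choice: an unrecognized error never silently re-enters the retry path.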
Reviewer's Guide

Implements DLQ-style error classification for trigger executions, adds a replay endpoint and UI for retryable failures, and exposes a stats endpoint plus UI badge for trigger observability, along with necessary schema/migration updates and retry-client tests.

Sequence diagram for the trigger execution replay flow with DLQ classification:

```mermaid
sequenceDiagram
    actor User
    participant Frontend as TriggersPage
    participant Backend as replay_execution
    participant DB as TriggerExecution_DB
    participant Worker as _execute_trigger
    User->>Frontend: Click Replay (failed_retryable row)
    Frontend->>Backend: POST /api/triggers/executions/exec_id/replay
    Backend->>DB: SELECT TriggerExecution BY id
    DB-->>Backend: execution (status=failed_retryable)
    Backend->>DB: UPDATE execution.status = replayed
    Backend->>DB: INSERT TriggerExecution (status=pending, idempotency_key)
    DB-->>Backend: new_execution_id
    Backend-->>Frontend: 200 { new_execution_id }
    Backend->>Worker: start thread _execute_trigger(trigger_id, new_execution_id, original_event_data)
    Worker->>DB: SELECT TriggerExecution BY new_execution_id
    Worker->>Worker: run subprocess / action
    Worker->>Worker: _classify_error(stderr, exc)
    alt success
        Worker->>DB: UPDATE status = completed, error_category = null
    else transient failure
        Worker->>DB: UPDATE status = failed_retryable, error_category = transient
    else permanent failure
        Worker->>DB: UPDATE status = failed, error_category = permanent
    end
```
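The replay transition in the diagram can be condensed into a few lines, using a dict-based stand-in for TriggerExecution rows (field names assumed; the real Flask endpoint also handles auth and rate limiting):

```python
def replay(executions, exec_id, next_id):
    """Replay one execution per the diagram: only failed_retryable rows
    are eligible; the original is marked replayed and a fresh pending
    execution is queued with the same idempotency_key."""
    ex = executions.get(exec_id)
    if ex is None:
        return {"error": "not_found"}
    if ex["status"] != "failed_retryable":
        return {"error": "not_replayable"}  # completed/replayed rows are final
    ex["status"] = "replayed"               # mark the original as consumed
    executions[next_id] = {                 # queue a fresh pending execution
        "trigger_id": ex["trigger_id"],
        "idempotency_key": ex["idempotency_key"],
        "status": "pending",
    }
    return {"new_execution_id": next_id}

rows = {1: {"trigger_id": 10, "idempotency_key": "msg-1", "status": "failed_retryable"}}
print(replay(rows, 1, 2))   # {'new_execution_id': 2}
print(rows[1]["status"])    # replayed
print(replay(rows, 1, 3))   # {'error': 'not_replayable'}
```

Flipping the original to `replayed` before inserting the new row is what makes a second Replay click fail fast instead of fanning out duplicate executions.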
Sequence diagram for triggers stats badge loading:

```mermaid
sequenceDiagram
    actor User
    participant Frontend as TriggersPage
    participant Backend as trigger_stats
    participant DB as TriggerExecution_DB
    User->>Frontend: Open /triggers
    Frontend->>Backend: GET /api/triggers/stats?days=1
    Backend->>DB: aggregate SELECTs on trigger_executions
    DB-->>Backend: by_status, dlq_size, counts
    Backend-->>Frontend: 200 { dlq_size, idempotent_replays, wpp_command_count, circuit_breaker_watermark_hit, ... }
    Frontend->>Frontend: update stats badge and watermark banner
```
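The aggregation behind the stats call can be sketched with stdlib sqlite3, with the schema trimmed to the status column (the real queries also filter by date window and compute the WPP counters):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trigger_executions (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO trigger_executions (status) VALUES (?)",
    [("completed",), ("completed",), ("failed_retryable",), ("failed",)],
)

# One GROUP BY gives the per-status breakdown; the derived fields fall out.
by_status = dict(conn.execute(
    "SELECT status, COUNT(*) FROM trigger_executions GROUP BY status"
))
stats = {
    "total": sum(by_status.values()),
    "by_status": by_status,
    # DLQ = executions parked as retryable failures, awaiting replay
    "dlq_size": by_status.get("failed_retryable", 0),
}
print(stats)
```

Deriving `dlq_size` from the same `by_status` result keeps the endpoint to a handful of scans instead of one query per metric.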
Hey - I've found 2 issues, and left some high level feedback:
- In `_classify_error` and the `TriggerExecution.error_category` comments you mention/expect categories like `unknown` (and `validation`), but the implementation only ever returns `transient` or `permanent`; consider either adding the missing categories in code paths or tightening the comments/typing so consumers don't rely on categories that can't occur.
- In `trigger_stats`, the WPP `IN` clause is built via string interpolation (`id_list`) and embeds the `since_clause` directly into the SQL string; even though these values are currently derived from integers and a constrained `days` param, it would be more robust and maintainable to switch to parameterized queries or SQLAlchemy `in_` expressions to avoid subtle SQL injection or quoting bugs if this logic evolves.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `_classify_error` and the `TriggerExecution.error_category` comments you mention/expect categories like `unknown` (and `validation`), but the implementation only ever returns `transient` or `permanent`; consider either adding the missing categories in code paths or tightening the comments/typing so consumers don’t rely on categories that can’t occur.
- In `trigger_stats`, the WPP `IN` clause is built via string interpolation (`id_list`) and embeds the `since_clause` directly into the SQL string; even though these values are currently derived from integers and a constrained `days` param, it would be more robust and maintainable to switch to parameterized queries or SQLAlchemy `in_` expressions to avoid subtle SQL injection or quoting bugs if this logic evolves.
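One way to apply the second suggestion using only stdlib sqlite3 placeholders (the PR itself later switches to SQLAlchemy's expanding `bindparam`; this sketch just illustrates the principle that only the placeholder list is built dynamically, never the values):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trigger_executions (trigger_id INTEGER, created_at TEXT)")
conn.execute("INSERT INTO trigger_executions VALUES (1, ?)",
             (datetime.now(timezone.utc).isoformat(),))
conn.execute("INSERT INTO trigger_executions VALUES (9, ?)",
             (datetime.now(timezone.utc).isoformat(),))

wpp_ids = [1, 2, 3]
cutoff = (datetime.now(timezone.utc) - timedelta(days=1)).isoformat()

# Only the "?" list is interpolated; every value travels as a bind
# parameter, so no id or timestamp ever lands inside the SQL string.
placeholders = ",".join("?" for _ in wpp_ids)
sql = (f"SELECT COUNT(*) FROM trigger_executions "
       f"WHERE trigger_id IN ({placeholders}) AND created_at >= ?")
count = conn.execute(sql, [*wpp_ids, cutoff]).fetchone()[0]
print(count)  # 1 (trigger_id 9 is excluded by the IN clause)
```

The SQLAlchemy equivalent, `bindparam("wpp_ids", expanding=True)`, does the same placeholder expansion at execution time.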
## Individual Comments
### Comment 1
<location path="dashboard/backend/routes/triggers.py" line_range="387-353" />
<code_context>
+ """
+ last_exc = None
+ for attempt in range(max_attempts):
+ try:
+ return do_call()
+ except urllib.error.HTTPError as e:
</code_context>
<issue_to_address>
**issue (bug_risk):** Catching all IntegrityError on commit can hide non-idempotency-related DB issues
The `except IntegrityError:` in `webhook_receiver` treats all integrity violations as the `(trigger_id, idempotency_key)` uniqueness race and responds 200. If any other constraint (e.g., NOT NULL, FK, new unique index) fails, it would be misclassified as an idempotent replay and silently masked. Please narrow the handler to only the idempotency constraint (e.g., by checking constraint name/message or isolating that insert in its own transaction) and let other `IntegrityError`s propagate.
</issue_to_address>
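A sketch of the narrowed handler this comment asks for, using sqlite3. Note that SQLite's IntegrityError message names the violated columns (so matching on `idempotency_key` works there), while other drivers may report the constraint name instead; the index/table names mirror the migration but are assumptions here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE trigger_executions (
    trigger_id INTEGER NOT NULL, idempotency_key TEXT)""")
conn.execute("""CREATE UNIQUE INDEX uq_trigger_idem
    ON trigger_executions (trigger_id, idempotency_key)
    WHERE idempotency_key IS NOT NULL""")

def insert_execution(trigger_id, idem_key):
    """Absorb only the idempotency-uniqueness race; re-raise everything else."""
    try:
        with conn:
            conn.execute("INSERT INTO trigger_executions VALUES (?, ?)",
                         (trigger_id, idem_key))
        return "created"
    except sqlite3.IntegrityError as e:
        is_idem_race = idem_key is not None and (
            "uq_trigger_idem" in str(e) or "idempotency_key" in str(e))
        if is_idem_race:
            return "idempotent_replay"   # benign duplicate: answer 200
        raise                            # NOT NULL / FK / other: surface it

print(insert_execution(1, "k1"))   # created
print(insert_execution(1, "k1"))   # idempotent_replay
```

With this shape, a NOT NULL or foreign-key violation propagates to the caller instead of being masked as a silent 200.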
### Comment 2
<location path="dashboard/frontend/src/pages/Triggers.tsx" line_range="262-271" />
<code_context>
+ })
+ }
+
+ const confirmReplay = async () => {
+ if (!replayPreview) return
+ setReplaying(true)
+ try {
+ const result = await api.post(`/triggers/executions/${replayPreview.execId}/replay`)
+ toast.success(`Replay iniciado — nova execução #${result.new_execution_id}`)
+ setReplayPreview(null)
+ // Refresh executions list
+ if (execModal) {
+ const data = await api.get(`/triggers/${execModal.triggerId}/executions`)
+ setExecutions(data.executions || [])
+ }
+ } catch (e: unknown) {
+ const msg = e instanceof Error ? e.message : String(e)
+ if (msg.includes('rate_limited') || msg.includes('429')) {
+ toast.error('Rate limit: aguarde 60s antes de fazer replay novamente')
+ } else {
</code_context>
<issue_to_address>
**issue (bug_risk):** Replay error handling relies on substring matches in Error.message, which may be brittle
In `confirmReplay`, rate limiting is inferred by checking `Error.message` for `rate_limited` or `429`. Depending on how `api` throws (axios, fetch, custom wrapper), the HTTP status or backend `error` may not appear in `.message`, so rate-limited responses could fall through to the generic error toast. Prefer checking structured fields from the client (e.g. `e.response.status` or `e.response.data.error`) instead of parsing the message string.
</issue_to_address>
mt-alarcon pushed a commit to mt-alarcon/evo-nexus that referenced this pull request on May 13, 2026:
…iggers

Addresses the Sourcery review on PR evolution-foundation#80 (Comment 1) and extends the same protection to the replay endpoint (not flagged by Sourcery, but it carries the same latent bug — a generic `except IntegrityError`).

Two spots fixed in dashboard/backend/routes/triggers.py:
1. `webhook_receiver` (~line 389): same bug as in PR evolution-foundation#78. It caught any IntegrityError and answered with a silent 200 OK. Now it only absorbs the error when the driver message mentions `uq_trigger_idem` or `idempotency_key` and `idem_key` is set. Otherwise it logs the full error and re-raises.
2. `/triggers/executions/<id>/replay` endpoint (~line 476): creates a new execution reusing `ex.idempotency_key`. It had the same generic catch, masking other IntegrityErrors as "idempotent_skip". The same protection was applied, using `ex.idempotency_key` in the check.

No dedicated test for this path on this branch — syntax validated with py_compile.
mt-alarcon pushed a commit to mt-alarcon/evo-nexus that referenced this pull request on May 13, 2026:
…urado

Polish from the Sourcery review on PR evolution-foundation#80 (code-quality suggestions, 3 fronts):

1. `_classify_error` — tighten typing
- Defines the alias `ErrorCategory = Literal["transient", "permanent"]`
- Updates the return type to reflect what the function actually returns
- Updates the comment on `TriggerExecution.error_category` (models.py), which listed 4 categories (validation/unknown) never emitted by the classifier — consumers will no longer depend on values that don't exist in the code.

2. `trigger_stats` — parameterize SQL
- The interpolated `datetime('now', '-{days} days')` becomes a bound `:cutoff` (cutoff computed in Python with timedelta).
- `WHERE trigger_id IN ({id_list})` becomes `IN :wpp_ids` with `bindparam("wpp_ids", expanding=True)` — SQLAlchemy expands it to `(:id_1, :id_2, ...)` and binds each value.
- Today's inputs are "safe" (ints + a constrained days param), but parameterizing is more robust against regressions and more readable.

3. `confirmReplay` (Triggers.tsx) — structured error handling
- Adds `ApiError` in `lib/api.ts` carrying `status` (HTTP number) and `code` (string from the backend JSON, e.g. `rate_limited`).
- `buildError` now returns `ApiError` instead of a plain `Error`.
- `Triggers.tsx::confirmReplay` checks `e instanceof ApiError && (e.status === 429 || e.code === 'rate_limited')` instead of `msg.includes('429')`. More resilient to message-format changes (e.g. translation, prefixes).
- `.message` kept in the same format (`"<status> <text>: <detail>"`) so the 3 other call sites that use `msg.includes(...)` (Backups.tsx, StepBrainConnect.tsx, RestoreSelectRepo.tsx) don't break — migrating those is left for a dedicated PR.

No functional change on a healthy runtime; structure/robustness only.
mt-alarcon pushed a commit to mt-alarcon/evo-nexus that referenced this pull request on May 13, 2026:
… index + logs

Consolidation of the fixes requested by Sourcery in the PR evolution-foundation#78 review:

1. `except IntegrityError` restricted to the idempotency violation (dashboard/backend/routes/triggers.py)
Previously, any IntegrityError in `webhook_receiver` became a silent 200 OK — treated as an "idempotent replay". Other problems (NOT NULL, broken FK, a unique constraint added later) would be masked as success and the client would never know the event wasn't processed. Now the catch only absorbs the error if:
a) `idem_key` is set (without an idempotency key there is no way it's a uq_trigger_idem violation)
b) the driver's error message mentions `uq_trigger_idem` or `idempotency_key` (the specific constraint)
For all other IntegrityErrors, it logs with full context (trigger_id, key, original message) and re-raises — the webhook returns 500 and the error shows up in logs instead of becoming a silent 200.

2. Remove redundant index (models.py)
`TriggerExecution.idempotency_key` was creating 3 indexes on the same column:
- `ix_trigger_executions_idempotency_key` (auto-generated via index=True on the model)
- `ix_trigger_executions_idem_key` (raw SQL at startup)
- `uq_trigger_idem` (partial unique, raw SQL)
We keep only the last two — explicitly created by the startup migration, with versioned names in the rollback plan. `index=True` removed from the model definition.

3. Log CREATE INDEX failures instead of silencing them (app.py)
Replaces `except Exception: pass` with structured logging in both `CREATE INDEX` blocks (basic and partial unique). Previously, any failure (e.g. a SQLite version without partial-index support) was swallowed silently — the operator would only find out at the first race. Now it logs `sqlite_lib`, `sqlite_runtime`, and the original exception.

Syntax validated with `python3 -m py_compile`. No dedicated test for this path on this branch (WPP tests only land in PRs evolution-foundation#79/evolution-foundation#80).
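Point 3 (log instead of swallow) can be sketched like this, with the index and table names assumed from the commit above:

```python
import logging
import sqlite3

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("startup_migration")

def create_partial_index(conn):
    """Startup-migration sketch: log index-creation failures with
    context instead of swallowing them with a bare `except: pass`."""
    try:
        conn.execute("""CREATE UNIQUE INDEX IF NOT EXISTS uq_trigger_idem
            ON trigger_executions (trigger_id, idempotency_key)
            WHERE idempotency_key IS NOT NULL""")
        return True
    except sqlite3.OperationalError as e:
        # e.g. an old SQLite without partial-index support (< 3.8.0),
        # or a missing table. The operator sees it at startup, not at
        # the first idempotency race.
        log.error("create_index_failed index=uq_trigger_idem sqlite_runtime=%s err=%s",
                  sqlite3.sqlite_version, e)
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trigger_executions (trigger_id INTEGER, idempotency_key TEXT)")
print(create_partial_index(conn))  # True
```

Returning a boolean (and logging on failure) lets the caller decide whether a missing index is fatal, instead of discovering it later as mysterious duplicate rows.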
… SQL, ApiError

Consolidation of the fixes requested by Sourcery in the PR evolution-foundation#80 review:

1. `except IntegrityError` restricted to the idempotency violation (dashboard/backend/routes/triggers.py — 2 spots)
Same bug as PR evolution-foundation#78, extended to this PR. Two spots in triggers.py:
a) `webhook_receiver` (~line 389): same fix as PR evolution-foundation#78 — only absorbs the error if the driver message mentions `uq_trigger_idem` or `idempotency_key` AND `idem_key` is set; otherwise logs with full context and re-raises.
b) `/triggers/executions/<id>/replay` endpoint (~line 476): creates a new execution reusing `ex.idempotency_key`. It had the same generic catch, masking other IntegrityErrors as "idempotent_skip". The same protection was applied, using `ex.idempotency_key` in the check. (Sourcery didn't flag this second spot, but it's the same bug — fixed for symmetry.)

2. `_classify_error` — tighten typing
- Defines the alias `ErrorCategory = Literal["transient", "permanent"]`.
- Updates the return type to reflect what the function actually returns.
- Updates the comment on `TriggerExecution.error_category` (models.py), which listed 4 categories (validation/unknown) never emitted by the classifier — consumers will no longer depend on values that don't exist.

3. `trigger_stats` — parameterize SQL
- The interpolated `datetime('now', '-{days} days')` becomes a bound `:cutoff` (cutoff computed in Python with `timedelta`).
- `WHERE trigger_id IN ({id_list})` becomes `IN :wpp_ids` with `bindparam("wpp_ids", expanding=True)` — SQLAlchemy expands it to `(:id_1, :id_2, ...)` and binds each value.
- Today's inputs are "safe" (ints + a constrained days param), but parameterizing is more robust against regressions and more readable.

4. `confirmReplay` (Triggers.tsx) — structured error handling
- Adds `ApiError` in `lib/api.ts` carrying `status` (HTTP number) and `code` (string from the backend JSON, e.g. `rate_limited`).
- `buildError` now returns `ApiError` instead of a plain `Error`.
- `Triggers.tsx::confirmReplay` checks `e instanceof ApiError && (e.status === 429 || e.code === 'rate_limited')` instead of `msg.includes('429')`. More resilient to message-format changes (e.g. translation, prefixes).
- `.message` kept in the same format (`"<status> <text>: <detail>"`) so the 3 other call sites that use `msg.includes(...)` (Backups.tsx, StepBrainConnect.tsx, RestoreSelectRepo.tsx) don't break — migrating those is left for a dedicated PR.

Python syntax validated with `python3 -m py_compile`. Frontend typecheck left to the PR's CI (deps not installed locally).
da958c0 to 935d4f9 (compare)
Context
Third and final of a 3-PR series implementing the WhatsApp retry pattern. Builds on #78 (PR-1: idempotency_key + silent dedup) and #79 (PR-2: backoff + jitter on api_request) by adding:
/stats) for observability

Diff includes PR-1 + PR-2 + PR-3 because they stack. Reviewer can focus on the PR-3-specific changes (the 3 files in this PR's commit):
- `dashboard/backend/routes/triggers.py` (+263)
- `dashboard/frontend/src/pages/Triggers.tsx` (+173)
- `dashboard/tests/test_wpp_retry_pr3.py` (+203)

Changes
Step 4 — `_classify_error` + `failed_retryable` classification

New `_classify_error(err_msg, exc)` returns one of:
- `transient` — timeouts, connection errors, HTTP 5xx, retry markers in stdout
- `permanent` — `ValueError`, `KeyError`, missing-key/auth/forbidden errors
- `unknown` — everything else

`_execute_trigger` now records the classification in `result_summary["error_category"]` so the UI can filter retryable failures.

Step 5 — `POST /api/triggers/executions/<id>/replay`

Endpoint replays a failed execution:
- returns `not_replayable` if the execution succeeded or was already replayed
- rate-limited to one replay per `id` per 60s (in-memory cache; multi-instance would need Redis)

Step 6 — `GET /api/triggers/stats`

Returns a JSON snapshot for observability:
- `dlq_size > 10` → `"dlq_growing"` warning

UI shows a small stats banner above the executions table when the watermark trips.
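The Step 5 rate limit can be sketched as a process-local map. The helper name `allow_replay` is an assumption; the PR only describes the behavior (one replay per execution id per 60 s, and, as noted above, a multi-instance deployment would need Redis instead):

```python
import time

_last_replay = {}  # exec_id -> monotonic timestamp of last accepted replay

def allow_replay(exec_id, window=60.0, now=time.monotonic):
    """Return True if a replay of exec_id is allowed in this window."""
    t = now()
    last = _last_replay.get(exec_id)
    if last is not None and t - last < window:
        return False        # caller should answer HTTP 429 / rate_limited
    _last_replay[exec_id] = t
    return True

print(allow_replay(7))  # True
print(allow_replay(7))  # False (second call inside the 60 s window)
```

Injecting `now` keeps the limiter testable without sleeping, the same trick the PR's rate-limit scenarios would rely on.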
Tests

`dashboard/tests/test_wpp_retry_pr3.py` (203 lines): `_classify_error` for timeouts, ValueError, KeyError, HTTP 5xx markers, etc. Tests use a fresh `dashboard.db` per test and run end-to-end against the Flask app (no mocking of the DB layer).

Verification
- `ast.parse`
- `tsc` (existing project config) accepts `Triggers.tsx` changes

Series
Summary by Sourcery
Introduce DLQ-style classification for trigger execution failures, add UI-driven replay of retryable executions, and expose operational stats for WhatsApp-trigger workloads.
New Features:
Enhancements:
Tests: