
Add workspace typing-lag stress test#1201

Open
lawrencecchen wants to merge 4 commits into main from task-workspace-typing-lag
Conversation

@lawrencecchen
Contributor

@lawrencecchen lawrencecchen commented Mar 11, 2026

Summary

  • add a visibility-aware regression test that creates many workspaces, splits, and Bonsplit tabs, then types into every visible terminal and reports lag
  • reduce Bonsplit tab-bar layout churn by only publishing the selected tab frame preference
  • validate on cmux-macmini with tagged builds so the test run does not steal local focus

Profiling

Before the Bonsplit change, sample showed SelectedTabFramePreferenceKey.reduce, TabBarView.tabItem, and GeometryProxy.frame(in:) dominating the main thread during dense workspace churn.

After the change, the full-load run no longer showed that path dominating the captured sample.

Testing

  • python3 -m py_compile tests/test_workspace_split_tab_typing_lag.py
  • remote: CMUX_SOCKET=/tmp/cmux-debug-task-workspace-typing-lag.sock CMUX_TYPING_LAG_TOTAL_WORKSPACES=2 python3 -u tests/test_workspace_split_tab_typing_lag.py
  • remote: CMUX_SOCKET=/tmp/cmux-debug-task-workspace-typing-lag.sock python3 -u tests/test_workspace_split_tab_typing_lag.py
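The remote runs above point `CMUX_SOCKET` at a tagged debug socket rather than a main session. A minimal sketch of such a guard follows. The `cmux-debug-` basename prefix and the `CMUX_TYPING_LAG_ALLOW_MAIN_SOCKET` override name are assumptions made for illustration, not the harness's confirmed rules.

```python
import os


def resolve_test_socket(env=None):
    """Return the socket path to test against, refusing untagged sockets.

    Assumptions (not taken from the PR): a "tagged" socket has a basename
    starting with "cmux-debug-", and the opt-in override variable is named
    CMUX_TYPING_LAG_ALLOW_MAIN_SOCKET.
    """
    env = os.environ if env is None else env
    path = env.get("CMUX_SOCKET", "")
    if not path:
        raise RuntimeError("CMUX_SOCKET must be set")

    tagged = os.path.basename(path).startswith("cmux-debug-")
    allow_main = env.get("CMUX_TYPING_LAG_ALLOW_MAIN_SOCKET") == "1"
    if not tagged and not allow_main:
        # The test steals focus and rebuilds workspaces, so refuse by default.
        raise RuntimeError(f"refusing to run against untagged socket: {path}")
    return path
```

Passing the environment as a plain dict keeps the check testable without touching the real process environment.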

Dependency


Summary by cubic

Adds a visibility-aware typing-lag regression test that spins up many workspaces, splits, and bonsplit tabs, types in each visible terminal, and fails on regressions. Also updates vendor/bonsplit to reduce tab-bar layout churn and lower main-thread use.

  • New Features

    • Adds tests/test_workspace_split_tab_typing_lag.py to measure shortcut and visible typing latency vs a clean baseline.
    • Counts only visible terminals (selected workspace/tab, focused terminal, pixels changed).
    • Enforces ratio and delta thresholds on p95 and average latency; prints stats and failures; captures a sample on failure; continues when the cmux PID is unavailable (with failure sampling disabled); adds snapshot retries and focus recovery; refuses main sockets by default.
  • Dependencies

Written for commit e2fc0d4. Summary will update on new commits.
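To make the ratio and delta thresholds mentioned above concrete, here is a rough sketch of a baseline-versus-stress comparison. The names `LatencyStats` and `compute_stats` appear in the review summaries, but the fields, the nearest-rank p95, the default limits, and the rule that a metric fails only when both the ratio and the delta are exceeded are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class LatencyStats:
    avg_ms: float
    p95_ms: float


def compute_stats(samples_ms):
    """Summarize latency samples with the mean and a nearest-rank p95."""
    ordered = sorted(samples_ms)
    if not ordered:
        raise ValueError("no latency samples collected")
    # Nearest-rank percentile: ceil(0.95 * n), done in integer arithmetic.
    rank = (95 * len(ordered) + 99) // 100
    return LatencyStats(avg_ms=sum(ordered) / len(ordered), p95_ms=ordered[rank - 1])


def find_regressions(baseline, stress, max_ratio=2.0, max_delta_ms=25.0):
    """Return failure messages for metrics that regressed under load.

    Assumed rule: a metric fails only if it exceeds BOTH the ratio and the
    absolute delta limit, so tiny absolute changes on a fast baseline pass.
    """
    failures = []
    for name in ("avg_ms", "p95_ms"):
        base = getattr(baseline, name)
        load = getattr(stress, name)
        ratio = load / base if base > 0 else float("inf")
        delta = load - base
        if ratio > max_ratio and delta > max_delta_ms:
            failures.append(f"{name}: {base:.1f} -> {load:.1f} ms (x{ratio:.2f}, +{delta:.1f} ms)")
    return failures
```

Requiring both limits is one possible design choice; a harness could equally fail on either limit alone.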

Summary by CodeRabbit

  • Tests

    • Added comprehensive regression testing for typing latency in workspace, split, and tab scenarios, including visibility validation and performance measurement.
  • Chores

    • Updated bonsplit vendor dependency.

@vercel

vercel bot commented Mar 11, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| cmux | Ready | Preview, Comment | Mar 14, 2026 1:45am |

@coderabbitai

coderabbitai bot commented Mar 11, 2026

Walkthrough

A new regression testing harness (tests/test_workspace_split_tab_typing_lag.py) is introduced to measure typing latency across multiple workspaces, panes, and Bonsplit tabs. It provides socket communication, workspace management, latency statistics collection, and baseline versus stress testing workflows. The Bonsplit vendor submodule is also updated to a newer commit.
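As a rough illustration of how a raw socket client can time a single command round trip, consider the sketch below. It assumes a newline-delimited request/reply protocol, which the PR does not confirm; the class body is a hypothetical stand-in and not the harness's actual `RawSocketClient`.

```python
import socket
import time


class RawSocketClient:
    """Hypothetical line-oriented client that times each command round trip."""

    def __init__(self, sock):
        self._sock = sock
        self._reader = sock.makefile("r", encoding="utf-8", newline="\n")

    @classmethod
    def connect(cls, path, timeout=5.0):
        # Open a Unix domain stream socket at the given filesystem path.
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        sock.connect(path)
        return cls(sock)

    def command(self, line):
        """Send one line, wait for one reply line, return (reply, elapsed_ms)."""
        start = time.perf_counter()
        self._sock.sendall((line + "\n").encode("utf-8"))
        reply = self._reader.readline()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if not reply:
            raise ConnectionError("socket closed before a reply arrived")
        return reply.rstrip("\n"), elapsed_ms

    def close(self):
        self._reader.close()
        self._sock.close()
```

Timing around the send and the blocking read measures the command acknowledgement only; whether the typed text became visible needs a separate check such as the snapshot verification described above.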

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Regression Testing Harness: tests/test_workspace_split_tab_typing_lag.py | New 622-line test file with RawSocketClient class for Unix socket communication, LatencyStats and SurfaceTarget dataclasses, utility functions for workspace/pane creation and management, latency collection and statistical reporting, terminal visibility verification, pixel-based snapshot validation, and orchestration logic for baseline and stress scenario execution with configurable parameters and failure diagnostics. |
| Vendor Dependencies: vendor/bonsplit | Submodule pointer updated to commit 085411e6b19ee0d60a535651efad1a90b2659e91 (from fa452db181f361514087558a29204bda7e38218f). |

Sequence Diagram

sequenceDiagram
    participant Test as Test Harness
    participant RawSocket as RawSocketClient
    participant Cmux as cmux Client
    participant Terminal as Terminal Surface
    participant Snapshot as Pixel Verification

    Test->>RawSocket: connect()
    RawSocket-->>Test: socket connected
    
    Test->>Cmux: reset_to_fresh_workspace()
    Cmux-->>Test: workspace_id
    
    Test->>Cmux: build_workspace_grid()
    Cmux-->>Test: list[workspace_ids]
    
    Test->>Cmux: create_surface_targets()
    Cmux-->>Test: list[SurfaceTarget]
    
    loop For Each Target
        Test->>Cmux: wait_for_visible_terminal(target)
        Cmux-->>Test: terminal visible
        
        Test->>RawSocket: command(type_token)
        RawSocket->>Terminal: send typing input
        Terminal-->>RawSocket: command response
        RawSocket-->>Test: latency captured
        
        Test->>Snapshot: panel_snapshot_retry()
        Snapshot-->>Test: snapshot dict (pixel change verified)
        
        Test->>Test: collect latency value
    end
    
    Test->>Test: compute_stats(baseline_latencies)
    Test->>Test: compute_stats(stress_latencies)
    
    Test->>Test: compare results with thresholds
    Test-->>Test: regression decision
    
    Test->>RawSocket: close()

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Poem

🐰 twitches whiskers with testing delight

Through tabs and splits the typing flows fast,
With sockets so sturdy, stress tests are cast!
Latencies captured, percentiles all found,
This regression harness hops all around.
Bonsplit updated, the metrics ring clear—
A feast for QA engineers to cheer! 🎯

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding a workspace typing-lag stress test, which aligns with the primary objective of introducing a regression testing harness. |
| Description check | ✅ Passed | The pull request description covers all required sections: Summary explains what changed and why, Testing details how it was validated, and a comprehensive Checklist is provided. Demo Video section is N/A for this backend test change. |


@greptile-apps
Contributor

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR adds a visibility-aware typing-latency regression harness (test_workspace_split_tab_typing_lag.py) that creates many workspaces, pane splits, and Bonsplit tabs, types into every visible terminal, and compares shortcut and end-to-end latencies against a clean single-workspace baseline. It also bumps the vendor/bonsplit submodule to a commit that limits SelectedTabFramePreferenceKey publishing to the selected tab, directly addressing the main-thread layout churn identified in profiling.

Key observations:

  • Logic bug: build_workspace_grid hardcodes a 2×2 layout (always 4 panes) but asserts len(panes) == PANES_PER_WORKSPACE; setting CMUX_TYPING_LAG_PANES_PER_WORKSPACE to anything other than 4 causes a guaranteed timeout with no helpful error message.
  • Measurement accuracy: visible_ms is measured via a wait_for poll loop with a 50 ms step, so the reported value can be up to ~50 ms higher than actual render latency. This inflates absolute thresholds (MAX_VISIBLE_P95_MS) but does not affect ratio comparisons.
  • Dead code: make_token contains an unreachable second padding branch for typical TOKEN_LENGTH values.
  • The socket-protection logic (ALLOW_MAIN_SOCKET + tagged socket check) is a good guard against accidentally running the disruptive test against a production session.
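The measurement-accuracy point can be seen in a small polling helper. This is a sketch with a hypothetical `wait_for` signature (the harness's real one may differ); the clock and sleep functions are injectable so the overshoot is reproducible without real delays.

```python
import time


def wait_for(predicate, timeout_s=5.0, step_s=0.05,
             clock=time.monotonic, sleep=time.sleep):
    """Poll `predicate` until it is true and return the elapsed seconds.

    The returned value is what the poller observed, so it can exceed the
    true latency by up to one `step_s`.
    """
    start = clock()
    while True:
        if predicate():
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError(f"condition not met within {timeout_s} s")
        sleep(step_s)
```

With a 50 ms step, an event that really happens 12 ms after the start is first observed at the 50 ms poll; with a 5 ms step it is observed at about 15 ms. Shrinking the step tightens the absolute numbers at the cost of more polling traffic, while ratio comparisons are affected less because both scenarios carry the same bias.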

Confidence Score: 4/5

  • Safe to merge — the test file is additive and the submodule bump is a narrow, profiling-validated fix; no production logic is changed.
  • The changes are a new test file and a submodule bump; neither affects production code paths. The one logic bug (PANES_PER_WORKSPACE mismatch) only causes a test timeout when someone customises the env var away from its default value — it does not affect correctness of the Bonsplit fix itself. The measurement inaccuracy is minor and does not make the test produce false positives at its current thresholds.
  • tests/test_workspace_split_tab_typing_lag.py — specifically build_workspace_grid (PANES_PER_WORKSPACE mismatch) and visible_ms measurement accuracy.

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/test_workspace_split_tab_typing_lag.py | New 622-line regression harness that stress-tests typing latency across many workspaces, pane splits, and Bonsplit tabs. Contains a logic bug where PANES_PER_WORKSPACE is configurable but build_workspace_grid always creates exactly 4 panes, causing silent timeout on any non-default value. visible_ms measurement also inherits up to 50ms polling overhead from wait_for. |
| vendor/bonsplit | Submodule pointer bumped from fa452db to 085411e to pull in the Bonsplit change that limits SelectedTabFramePreferenceKey publishing to the selected tab only, reducing layout churn during workspace switching. |

Sequence Diagram

sequenceDiagram
    participant Test as test script
    participant cmux as cmux client
    participant Raw as RawSocketClient
    participant App as cmux app

    Test->>cmux: connect()
    Test->>Raw: connect() [same socket]

    Note over Test,App: Baseline scenario
    Test->>cmux: reset_to_fresh_workspace()
    cmux->>App: new_workspace / close_workspace
    Test->>Raw: simulate_shortcut(ch) × TOKEN_LENGTH × BASELINE_TOKEN_COUNT
    Raw-->>Test: OK (shortcut_latency_ms each)
    Test->>cmux: read_terminal_text() [poll until token visible]
    Test->>cmux: panel_snapshot() [verify changed_pixels]

    Note over Test,App: Build stress targets
    Test->>cmux: reset_to_fresh_workspace()
    loop TOTAL_WORKSPACES
        Test->>cmux: new_workspace / select_workspace
        Test->>cmux: new_pane("right/down") × 3
        loop PANES_PER_WORKSPACE panes
            loop until TABS_PER_PANE tabs
                Test->>cmux: new_surface(terminal)
            end
        end
    end

    Note over Test,App: Stress scenario
    loop each SurfaceTarget
        Test->>cmux: select_workspace / focus_pane / focus_surface
        Test->>cmux: wait_for_visible_terminal()
        Test->>Raw: simulate_shortcut(ch) × TOKEN_LENGTH
        Raw-->>Test: OK (shortcut_latency_ms each)
        Test->>cmux: read_terminal_text() [poll until token visible]
        Test->>cmux: panel_snapshot() [verify changed_pixels ≥ MIN]
    end

    Note over Test: Compare baseline vs stress latency stats → PASS/FAIL

Last reviewed commit: cda2950


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tests/test_workspace_split_tab_typing_lag.py (1)

612-618: Consider logging cleanup failures instead of silently ignoring.

The try-except-pass at lines 616-617 silently swallows cleanup exceptions. While this is intentional to avoid masking the original test result, completely silent failures can hide environmental issues.

♻️ Optional: Add minimal logging for cleanup failures
     finally:
         if client is not None:
             try:
                 reset_to_fresh_workspace(client)
-            except Exception:
-                pass
+            except Exception as cleanup_exc:
+                print(f"Warning: cleanup failed: {cleanup_exc}", file=sys.stderr)
             client.close()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_workspace_split_tab_typing_lag.py` around lines 612 - 618, The
cleanup block currently swallows exceptions silently; change the except block to
catch Exception as e and log the failure (e.g., logger.warning or
logger.exception) including the exception message/traceback so cleanup errors
are visible while still not failing the test; update the finally to log errors
from reset_to_fresh_workspace and/or client.close (reference
reset_to_fresh_workspace and client.close) and keep the test behavior of not
re-raising the exception.
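For reference, the logger-based variant described in this prompt might look like the sketch below. The `run_with_cleanup` wrapper and the logger name are hypothetical; the callables stand in for the test body, `reset_to_fresh_workspace`, and `client.close`.

```python
import logging

logger = logging.getLogger("typing_lag_test")  # hypothetical logger name


def run_with_cleanup(run_test, reset, close):
    """Run the test body, then clean up without masking the test result.

    Cleanup failures are logged with a traceback and never re-raised, so the
    outcome of `run_test` (return value or exception) is preserved.
    """
    try:
        return run_test()
    finally:
        try:
            reset()
        except Exception:
            logger.exception("cleanup failed while resetting workspace")
        try:
            close()
        except Exception:
            logger.exception("cleanup failed while closing client")
```

Wrapping `reset` and `close` separately means a failed reset still lets the socket close.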
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/test_workspace_split_tab_typing_lag.py`:
- Around line 612-618: The cleanup block currently swallows exceptions silently;
change the except block to catch Exception as e and log the failure (e.g.,
logger.warning or logger.exception) including the exception message/traceback so
cleanup errors are visible while still not failing the test; update the finally
to log errors from reset_to_fresh_workspace and/or client.close (reference
reset_to_fresh_workspace and client.close) and keep the test behavior of not
re-raising the exception.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e509676b-3ff5-4634-a1bb-c42d35938b79

📥 Commits

Reviewing files that changed from the base of the PR and between 18bdbef and cda2950.

📒 Files selected for processing (2)
  • tests/test_workspace_split_tab_typing_lag.py
  • vendor/bonsplit


@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 2 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="tests/test_workspace_split_tab_typing_lag.py">

<violation number="1" location="tests/test_workspace_split_tab_typing_lag.py:264">
P2: `build_workspace_grid` hard-codes a 4-pane layout, so `CMUX_TYPING_LAG_PANES_PER_WORKSPACE` values other than 4 will time out.</violation>
</file>
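One low-risk way to address this violation is to fail fast with a clear message instead of timing out. The sketch below assumes the harness keeps its fixed 2×2 grid; `read_panes_per_workspace` is a hypothetical helper, not a function from the PR.

```python
import os

# The review describes build_workspace_grid as always producing a 2x2 layout.
SUPPORTED_PANES_PER_WORKSPACE = 4


def read_panes_per_workspace(env=None):
    """Read CMUX_TYPING_LAG_PANES_PER_WORKSPACE and reject unsupported values."""
    env = os.environ if env is None else env
    raw = env.get("CMUX_TYPING_LAG_PANES_PER_WORKSPACE",
                  str(SUPPORTED_PANES_PER_WORKSPACE))
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(
            f"CMUX_TYPING_LAG_PANES_PER_WORKSPACE must be an integer, got {raw!r}"
        ) from None
    if value != SUPPORTED_PANES_PER_WORKSPACE:
        raise ValueError(
            f"only {SUPPORTED_PANES_PER_WORKSPACE} panes per workspace are "
            f"supported by the fixed 2x2 grid, got {value}"
        )
    return value
```

The alternative is to generalize the grid builder to arbitrary pane counts, which is a larger change than this validation.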

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cda2950813

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 44c5d14fde

