Add workspace typing-lag stress test by lawrencecchen · Pull Request #1201 · manaflow-ai/cmux

lawrencecchen · 2026-03-11T22:22:38Z

Summary

- Add a visibility-aware regression test that creates many workspaces, splits, and Bonsplit tabs, then types into every visible terminal and reports lag.
- Reduce Bonsplit tab-bar layout churn by only publishing the selected tab frame preference.
- Validate on cmux-macmini with tagged builds so the test run does not steal local focus.

Profiling

Before the Bonsplit change, sample showed SelectedTabFramePreferenceKey.reduce, TabBarView.tabItem, and GeometryProxy.frame(in:) dominating the main thread during dense workspace churn.

After the change, the full-load run no longer showed that path dominating the captured sample.

Testing

python3 -m py_compile tests/test_workspace_split_tab_typing_lag.py
remote: CMUX_SOCKET=/tmp/cmux-debug-task-workspace-typing-lag.sock CMUX_TYPING_LAG_TOTAL_WORKSPACES=2 python3 -u tests/test_workspace_split_tab_typing_lag.py
remote: CMUX_SOCKET=/tmp/cmux-debug-task-workspace-typing-lag.sock python3 -u tests/test_workspace_split_tab_typing_lag.py
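
Both remote runs point CMUX_SOCKET at a tagged debug socket rather than the main cmux socket. A minimal sketch of a guard that could enforce this before the harness connects is shown here. The `cmux-debug-<tag>.sock` naming rule, the `CMUX_TYPING_LAG_ALLOW_MAIN_SOCKET` override, and both helper names are assumptions for illustration; the actual check in the test file may differ.

```python
import os

# Assumption: tagged debug sockets are named cmux-debug-<tag>.sock,
# matching the path used in the remote commands above.
TAGGED_PREFIX = "cmux-debug-"
TAGGED_SUFFIX = ".sock"


def is_tagged_socket(path: str) -> bool:
    """Return True when the socket file name carries a non-empty debug tag."""
    name = os.path.basename(path)
    if not (name.startswith(TAGGED_PREFIX) and name.endswith(TAGGED_SUFFIX)):
        return False
    tag = name[len(TAGGED_PREFIX):-len(TAGGED_SUFFIX)]
    return bool(tag)


def resolve_socket_path(env: dict) -> str:
    """Pick the socket to test against, refusing untagged sockets by default."""
    path = env.get("CMUX_SOCKET", "")
    if not path:
        raise RuntimeError("CMUX_SOCKET must be set")
    # Hypothetical opt-in override; the real variable name is not shown in this thread.
    allow_main = env.get("CMUX_TYPING_LAG_ALLOW_MAIN_SOCKET") == "1"
    if not is_tagged_socket(path) and not allow_main:
        raise RuntimeError(f"refusing to run against untagged socket: {path}")
    return path
```

In a real run the caller would pass `os.environ`; taking the mapping as a parameter keeps the guard easy to exercise without touching the process environment.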

Dependency

bonsplit PR: Only measure selected bonsplit tab frame bonsplit#22

Summary by cubic

Adds a visibility-aware typing-lag regression test that spins up many workspaces, splits, and bonsplit tabs, types in each visible terminal, and fails on regressions. Also updates vendor/bonsplit to reduce tab-bar layout churn and lower main-thread use.

New Features
- Adds tests/test_workspace_split_tab_typing_lag.py to measure shortcut and visible typing latency vs a clean baseline.
- Counts only visible terminals (selected workspace/tab, focused terminal, pixels changed).
- Enforces p95/avg ratio and delta thresholds; prints stats and failures; captures sample on failure; continues without cmux PID (disables failure sampling); adds snapshot retries and focus recovery; refuses main sockets by default.
Dependencies
- Bumps vendor/bonsplit to include reduced tab-bar layout churn (only publishes the selected tab frame preference).
- Upstream PR: Only measure selected bonsplit tab frame bonsplit#22
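
The ratio and delta thresholds mentioned in the summary can be sketched roughly as follows. The names LatencyStats and compute_stats appear in this thread, but the fields, the nearest-rank percentile, the threshold values, and the rule that a metric fails only when both its ratio and its delta are exceeded are assumptions made for illustration, not the harness's confirmed behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyStats:
    count: int
    avg_ms: float
    p95_ms: float


def compute_stats(samples_ms):
    """Summarize latency samples with their mean and nearest-rank p95."""
    if not samples_ms:
        raise ValueError("no latency samples collected")
    ordered = sorted(samples_ms)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = max(1, (95 * len(ordered) + 99) // 100)
    return LatencyStats(len(ordered), sum(ordered) / len(ordered), ordered[rank - 1])


def regression_failures(baseline, stress, max_ratio=2.0, max_delta_ms=20.0):
    """Return one message per metric that regressed past both thresholds.

    Assumption: requiring both the ratio and the delta to be exceeded keeps
    a very small baseline from failing on a few milliseconds of noise.
    """
    failures = []
    for name in ("avg_ms", "p95_ms"):
        base = getattr(baseline, name)
        current = getattr(stress, name)
        ratio = current / base if base > 0 else float("inf")
        delta = current - base
        if ratio > max_ratio and delta > max_delta_ms:
            failures.append(
                f"{name}: {current:.1f} ms vs baseline {base:.1f} ms "
                f"(ratio {ratio:.2f}, delta {delta:.1f} ms)"
            )
    return failures
```

Calling `regression_failures(compute_stats(baseline_samples), compute_stats(stress_samples))` returns an empty list when the stress run stays within both limits.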

Written for commit e2fc0d4. Summary will update on new commits.

Summary by CodeRabbit

Tests
- Added comprehensive regression testing for typing latency in workspace, split, and tab scenarios, including visibility validation and performance measurement.
Chores
- Updated bonsplit vendor dependency.

vercel · 2026-03-11T22:22:43Z

Project	Deployment	Actions	Updated (UTC)
cmux	Ready	Preview, Comment	Mar 14, 2026 1:45am

coderabbitai · 2026-03-11T22:23:03Z

Walkthrough

A new regression testing harness (tests/test_workspace_split_tab_typing_lag.py) is introduced to measure typing latency across multiple workspaces, panes, and Bonsplit tabs. It provides socket communication, workspace management, latency statistics collection, and baseline versus stress testing workflows. The Bonsplit vendor submodule is also updated to a newer commit.

Changes

Cohort / File(s)	Summary
Regression Testing Harness `tests/test_workspace_split_tab_typing_lag.py`	New 622-line test file with RawSocketClient class for Unix socket communication, LatencyStats and SurfaceTarget dataclasses, utility functions for workspace/pane creation and management, latency collection and statistical reporting, terminal visibility verification, pixel-based snapshot validation, and orchestration logic for baseline and stress scenario execution with configurable parameters and failure diagnostics.
Vendor Dependencies `vendor/bonsplit`	Submodule pointer updated to commit 085411e6b19ee0d60a535651efad1a90b2659e91 (from fa452db181f361514087558a29204bda7e38218f).

Sequence Diagram

sequenceDiagram
    participant Test as Test Harness
    participant RawSocket as RawSocketClient
    participant Cmux as cmux Client
    participant Terminal as Terminal Surface
    participant Snapshot as Pixel Verification

    Test->>RawSocket: connect()
    RawSocket-->>Test: socket connected
    
    Test->>Cmux: reset_to_fresh_workspace()
    Cmux-->>Test: workspace_id
    
    Test->>Cmux: build_workspace_grid()
    Cmux-->>Test: list[workspace_ids]
    
    Test->>Cmux: create_surface_targets()
    Cmux-->>Test: list[SurfaceTarget]
    
    loop For Each Target
        Test->>Cmux: wait_for_visible_terminal(target)
        Cmux-->>Test: terminal visible
        
        Test->>RawSocket: command(type_token)
        RawSocket->>Terminal: send typing input
        Terminal-->>RawSocket: command response
        RawSocket-->>Test: latency captured
        
        Test->>Snapshot: panel_snapshot_retry()
        Snapshot-->>Test: snapshot dict (pixel change verified)
        
        Test->>Test: collect latency value
    end
    
    Test->>Test: compute_stats(baseline_latencies)
    Test->>Test: compute_stats(stress_latencies)
    
    Test->>Test: compare results with thresholds
    Test-->>Test: regression decision
    
    Test->>RawSocket: close()

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Split CI: GitHub runners for tests, Depot for perf regression #773: Updates CI to run workspace churn typing-lag regression test on depot runner, providing infrastructure integration for the new regression harness.
Add workspace-churn typing lag regression and fix #767: Introduces similar/overlapping typing-lag regression harness components (RawSocketClient, LatencyStats, compute_stats, run_baseline_scenario), indicating shared test tooling patterns.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: adding a workspace typing-lag stress test, which aligns with the primary objective of introducing a regression testing harness.
Description check	✅ Passed	The pull request description covers all required sections: Summary explains what changed and why, Testing details how it was validated, and a comprehensive Checklist is provided. Demo Video section is N/A for this backend test change.

greptile-apps · 2026-03-11T22:25:57Z

Greptile Summary

This PR adds a visibility-aware typing-latency regression harness (test_workspace_split_tab_typing_lag.py) that creates many workspaces, pane splits, and Bonsplit tabs, types into every visible terminal, and compares shortcut and end-to-end latencies against a clean single-workspace baseline. It also bumps the vendor/bonsplit submodule to a commit that limits SelectedTabFramePreferenceKey publishing to the selected tab, directly addressing the main-thread layout churn identified in profiling.

Key observations:

- Logic bug: build_workspace_grid hardcodes a 2×2 layout (always 4 panes) but asserts len(panes) == PANES_PER_WORKSPACE; setting CMUX_TYPING_LAG_PANES_PER_WORKSPACE to anything other than 4 causes a guaranteed timeout with no helpful error message.
- Measurement accuracy: visible_ms is measured via a wait_for poll loop with a 50 ms step, so the reported value can be up to ~50 ms higher than the actual render latency. This inflates the values checked against absolute thresholds (MAX_VISIBLE_P95_MS) but does not affect ratio comparisons.
- Dead code: make_token contains a second padding branch that is unreachable for typical TOKEN_LENGTH values.
- Socket protection: the guard (ALLOW_MAIN_SOCKET plus a tagged-socket check) is a good safeguard against accidentally running the disruptive test against a production session.
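
The polling overhead noted in the observations above can be reproduced with a simulated clock. The sketch below uses a hypothetical stand-in for wait_for (its real signature is not shown in this thread) and works in integer milliseconds so the result is deterministic. It shows how a 50 ms step rounds a short render latency up to the next poll boundary.

```python
class FakeClock:
    """Simulated millisecond clock that only advances when the poller sleeps."""

    def __init__(self):
        self.now_ms = 0

    def __call__(self):
        return self.now_ms

    def sleep(self, duration_ms):
        self.now_ms += duration_ms


def wait_for(predicate, clock_ms, sleep_ms, timeout_ms=5000, step_ms=50):
    """Poll predicate every step_ms and return the elapsed time the poller saw.

    Hypothetical stand-in for the harness helper of the same name.
    """
    start = clock_ms()
    while True:
        if predicate():
            return clock_ms() - start
        if clock_ms() - start >= timeout_ms:
            raise TimeoutError("condition not met before timeout")
        sleep_ms(step_ms)


def observed_latency_ms(true_latency_ms, step_ms):
    """Latency a poller reports for content that appears after true_latency_ms."""
    clock = FakeClock()
    return wait_for(
        lambda: clock() >= true_latency_ms,
        clock,
        clock.sleep,
        step_ms=step_ms,
    )
```

With a 12 ms true latency, `observed_latency_ms(12, 50)` reports 50 ms while `observed_latency_ms(12, 5)` reports 15 ms, so a smaller step narrows the overestimate at the cost of more polling.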

Confidence Score: 4/5

Safe to merge — the test file is additive and the submodule bump is a narrow, profiling-validated fix; no production logic is changed.
The changes are a new test file and a submodule bump; neither affects production code paths. The one logic bug (PANES_PER_WORKSPACE mismatch) only causes a test timeout when someone customises the env var away from its default value — it does not affect correctness of the Bonsplit fix itself. The measurement inaccuracy is minor and does not make the test produce false positives at its current thresholds.
The file that needs the closest review is tests/test_workspace_split_tab_typing_lag.py, specifically build_workspace_grid (PANES_PER_WORKSPACE mismatch) and the visible_ms measurement accuracy.
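
One way to turn the PANES_PER_WORKSPACE mismatch into an immediate, readable failure is to validate the environment variable before any workspace is created. This is a minimal sketch, assuming the grid builder supports only the 2×2 layout described above; the helper name, constant, and error text are hypothetical and not taken from the test file.

```python
import os

# Assumption: the grid builder only knows how to produce a 2x2 layout.
SUPPORTED_PANES_PER_WORKSPACE = 4
ENV_NAME = "CMUX_TYPING_LAG_PANES_PER_WORKSPACE"


def read_panes_per_workspace(env=None):
    """Read the pane count and fail fast on values the grid cannot build."""
    env = os.environ if env is None else env
    raw = env.get(ENV_NAME, str(SUPPORTED_PANES_PER_WORKSPACE))
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"{ENV_NAME} must be an integer, got {raw!r}") from None
    if value != SUPPORTED_PANES_PER_WORKSPACE:
        raise ValueError(
            f"{ENV_NAME}={value} is not supported: the grid is a fixed 2x2 "
            f"layout with {SUPPORTED_PANES_PER_WORKSPACE} panes"
        )
    return value
```

The alternative fix is to make the grid builder honor arbitrary pane counts; the guard is simply the smaller change if only the default layout is ever exercised.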

Important Files Changed

Filename	Overview
tests/test_workspace_split_tab_typing_lag.py	New 622-line regression harness that stress-tests typing latency across many workspaces, pane splits, and Bonsplit tabs. Contains a logic bug where PANES_PER_WORKSPACE is configurable but build_workspace_grid always creates exactly 4 panes, causing silent timeout on any non-default value. visible_ms measurement also inherits up to 50ms polling overhead from wait_for.
vendor/bonsplit	Submodule pointer bumped from fa452db to 085411e to pull in the Bonsplit change that limits SelectedTabFramePreferenceKey publishing to the selected tab only, reducing layout churn during workspace switching.

Sequence Diagram

sequenceDiagram
    participant Test as test script
    participant cmux as cmux client
    participant Raw as RawSocketClient
    participant App as cmux app

    Test->>cmux: connect()
    Test->>Raw: connect() [same socket]

    Note over Test,App: Baseline scenario
    Test->>cmux: reset_to_fresh_workspace()
    cmux->>App: new_workspace / close_workspace
    Test->>Raw: simulate_shortcut(ch) × TOKEN_LENGTH × BASELINE_TOKEN_COUNT
    Raw-->>Test: OK (shortcut_latency_ms each)
    Test->>cmux: read_terminal_text() [poll until token visible]
    Test->>cmux: panel_snapshot() [verify changed_pixels]

    Note over Test,App: Build stress targets
    Test->>cmux: reset_to_fresh_workspace()
    loop TOTAL_WORKSPACES
        Test->>cmux: new_workspace / select_workspace
        Test->>cmux: new_pane("right/down") × 3
        loop PANES_PER_WORKSPACE panes
            loop until TABS_PER_PANE tabs
                Test->>cmux: new_surface(terminal)
            end
        end
    end

    Note over Test,App: Stress scenario
    loop each SurfaceTarget
        Test->>cmux: select_workspace / focus_pane / focus_surface
        Test->>cmux: wait_for_visible_terminal()
        Test->>Raw: simulate_shortcut(ch) × TOKEN_LENGTH
        Raw-->>Test: OK (shortcut_latency_ms each)
        Test->>cmux: read_terminal_text() [poll until token visible]
        Test->>cmux: panel_snapshot() [verify changed_pixels ≥ MIN]
    end

    Note over Test: Compare baseline vs stress latency stats → PASS/FAIL

Last reviewed commit: cda2950

coderabbitai

🧹 Nitpick comments (1)

tests/test_workspace_split_tab_typing_lag.py (1)

612-618: Consider logging cleanup failures instead of silently ignoring.

The try-except-pass at lines 616-617 silently swallows cleanup exceptions. While this is intentional to avoid masking the original test result, completely silent failures can hide environmental issues.

♻️ Optional: Add minimal logging for cleanup failures

     finally:
         if client is not None:
             try:
                 reset_to_fresh_workspace(client)
-            except Exception:
-                pass
+            except Exception as cleanup_exc:
+                print(f"Warning: cleanup failed: {cleanup_exc}", file=sys.stderr)
             client.close()

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/test_workspace_split_tab_typing_lag.py` around lines 612 - 618, The
cleanup block currently swallows exceptions silently; change the except block to
catch Exception as e and log the failure (e.g., logger.warning or
logger.exception) including the exception message/traceback so cleanup errors
are visible while still not failing the test; update the finally to log errors
from reset_to_fresh_workspace and/or client.close (reference
reset_to_fresh_workspace and client.close) and keep the test behavior of not
re-raising the exception.

📥 Commits

Reviewing files that changed from the base of the PR and between 18bdbef and cda2950.

📒 Files selected for processing (2)

tests/test_workspace_split_tab_typing_lag.py
vendor/bonsplit

cubic-dev-ai

1 issue found across 2 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="tests/test_workspace_split_tab_typing_lag.py">

<violation number="1" location="tests/test_workspace_split_tab_typing_lag.py:264">
P2: `build_workspace_grid` hard-codes a 4-pane layout, so `CMUX_TYPING_LAG_PANES_PER_WORKSPACE` values other than 4 will time out.</violation>
</file>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cda2950813

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 44c5d14fde

lawrencecchen added 2 commits March 11, 2026 15:22

Add workspace split tab typing lag regression

9ea0f9e

Reduce typing lag from bonsplit tab layout

cda2950

vercel bot deployed to Preview March 11, 2026 22:23

coderabbitai bot reviewed Mar 11, 2026

greptile-apps bot reviewed Mar 11, 2026

cubic-dev-ai bot reviewed Mar 11, 2026

chatgpt-codex-connector bot reviewed Mar 11, 2026

Fix typing lag test review feedback

44c5d14

vercel bot deployed to Preview March 14, 2026 01:21

chatgpt-codex-connector bot reviewed Mar 14, 2026

Keep typing lag regression running without PID lookup

e2fc0d4

vercel bot deployed to Preview March 14, 2026 01:45