Whiteboard layout investigation: from community reports to a measured prompt overhaul #489
wyuc started this conversation in Show and tell.
How the whiteboard system is designed (and why it constrains the optimization space)
A few design choices shape what we could and couldn't try in this investigation. Sharing them here so the trade-offs in the rest of the post make sense:
- Responses are parsed as interleaved `[text, action, text, action, ...]` items. Actions execute in emission order after the response is parsed. This keeps latency tight (one round-trip per turn) and matches the natural rhythm of "speak while drawing" in a real classroom — but it means an agent cannot observe the result of its own action 3 before deciding action 4.
- Tool surface (`wb_draw_text` / `wb_draw_shape` / `wb_draw_latex` / ...): each `wb_draw_*` is a thin wrapper that the action engine converts into a `PPTTextElement` / `PPTLatexElement` / etc. — the same DSL the offline slide generator uses. The wrapper exists for permission gating, fade-in animation timing, and to hide a few fields the agent shouldn't touch (`rotate` defaults to 0, the font name is fixed). This adds a small abstraction layer at the cost of two slightly different surfaces sharing one renderer.

These constraints define the "feasible region" of this work. Several of the open questions at the bottom point at relaxing them.
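To make the wrapper-to-DSL relationship concrete, here is a minimal sketch. The type and function names beyond `wb_draw_text` and `PPTTextElement` are hypothetical, as are the default values; the real schema carries more fields than shown:

```typescript
// Hypothetical shapes illustrating the thin-wrapper design described above.
interface WbDrawTextArgs {
  id: string;
  left: number;     // agent-supplied coordinates (the unreliable part)
  top: number;
  content: string;
  fontSize?: number;
}

interface PPTTextElement {
  type: "text";
  id: string;
  left: number;
  top: number;
  content: string;
  fontSize: number;
  rotate: number;   // hidden from the agent, always 0
  fontName: string; // fixed, not agent-controllable
}

// The action engine converts each wb_draw_text call into the shared slide DSL.
function toTextElement(args: WbDrawTextArgs): PPTTextElement {
  return {
    type: "text",
    id: args.id,
    left: args.left,
    top: args.top,
    content: args.content,
    fontSize: args.fontSize ?? 18, // illustrative default, not the real one
    rotate: 0,                     // default the agent cannot override
    fontName: "default",           // placeholder for the fixed font name
  };
}
```

The conversion is where permission gating and animation timing hook in; the agent only ever sees the narrow `WbDrawTextArgs`-style surface.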
Where this started: community reports
We've received a steady stream of issues about whiteboard content quality. The most directly relevant ones: #38, #74, #208, #314, and #424.

These all point at the same underlying problem: the LLM produces unreliable `left`/`top` coordinates, leading to overlapping elements, off-canvas content, and inconsistent font sizing. This PR (#485) is a systematic response to that whole class of issues — instead of patching individual symptoms, we built a measurable eval and only shipped what moved the numbers.

TL;DR
- With `gemini-3.1-pro` as a VLM judge, `gemini-3-flash` went from baseline overall 5.4 / overlap 6.3 to 8.0 / 9.6.
- `flash + thinking` ≥ `pro + default`. You may not need to upgrade to a top-tier model — flipping the right flag on a cheaper one can match it.

Setup
- `eval/whiteboard-layout/` contains 6 scenario JSONs spanning primary-school math, junior-high physics, high-school math, medical compliance, development economics, and corporate finance. Each scenario has an initial slide plus several user turns.
- Each scenario is run against the `/api/chat` endpoint via `runAgentLoop`, simulating a live classroom session.
- Results are scored by `gemini-3.1-pro-preview`, which returns 6 dimensions (readability, overlap, rendering_correctness, content_completeness, layout_logic, overall).

Pointers to the relevant commits and code are in the PR description.
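The post doesn't spell out the scenario file format, but to give a feel for its size, here is a sketch of what one scenario plausibly contains. All field names here are guesses for illustration, not the actual `eval/whiteboard-layout` schema:

```typescript
// Hypothetical scenario shape -- field names are illustrative only.
interface EvalScenario {
  name: string;
  subject: string;                       // e.g. "junior-high physics"
  initialSlide: { elements: unknown[] }; // starting whiteboard in the slide DSL
  userTurns: string[];                   // messages replayed via runAgentLoop
}

const example: EvalScenario = {
  name: "jh-physics-convex-lens",
  subject: "junior-high physics",
  initialSlide: { elements: [] },
  userTurns: [
    "Draw the principal axis and a convex lens.",
    "Add the three principal rays for an object beyond 2F.",
  ],
};
```

The judge sees only the final rendered board per turn, so a scenario is fully described by its starting state and the turn script.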
Findings
1. Prompt overhaul + programmatic conflict detection: overlap +1.8 (clear win)
Prompt-side changes (most of the diff):
- Rewrote the whiteboard role templates (`lib/prompts/templates/agent-system-wb-{teacher,assistant,student}/system.md` plus a shared `snippets/whiteboard-reference.md`).

Geometric conflict detection (new module `lib/orchestration/summarizers/whiteboard-conflicts.ts`): after each turn, overlaps between drawn elements are detected and surfaced back to the agents as a conflict list.

With these changes, 5 out of 6 scenarios reached near-perfect overlap on the final turn. The one that didn't budge was `med-gcp-compliance` (multi-agent, dense tables) — that scenario became our "stress test" benchmark.

2. Image feedback loop: dropped
We invested significant effort in a "feed the rendered whiteboard from the previous turn back as a `file` part on the next user message" mechanism. The intuition was that an agent seeing what it had just drawn would self-correct.

We measured 2×2 (model × image flag) at N=3:
Verdict: weak models barely benefit (they see the image but can't act on it); strong models actively over-correct ("this could be tidier — let me `wb_clear` and redo"), making clean layouts worse. Net ROI is approximately zero, and is negative on the model tier you'd actually deploy in production.
The supporting plumbing (image part extraction in `message-converter.ts`, multimodal passthrough in `ai-sdk-adapter.ts`, the "Prior-state image" sections in role templates) has all been reverted in this PR. It can be reintroduced cleanly from git history if someone wants to ship user-uploaded images later.

3. Thinking: largest single lever, but the latency cost is real
We added an opt-in `thinking` field to the chat API (default `enabled: false`). When enabled on `gemini-3-flash-preview`, this sets `thinkingLevel: 'high'` provider-side.

The `med-gcp-compliance` scenario that we couldn't budge with prompt alone (3.3 baseline) jumped to 8.3 with thinking enabled. The hard scenario isn't a prompt problem — it's a reasoning capacity problem.

But p50 = 63 seconds means a teacher pause of a full minute between sentences. For a live classroom, this is unusable. We exposed `EVAL_ENABLE_THINKING` as an opt-in env variable; the production chat path stays on `enabled: false`.

Worth recording: `flash + thinking` outscores `pro + default` on every metric in our eval. For cost-sensitive deployments where latency can be amortized (e.g., offline pre-generation), enabling thinking on a cheaper model may dominate upgrading to a more expensive one.

Open questions (we'd love your input)
A few directions we considered but haven't measured. If you have data, opinions, or related work, please drop a comment:
- When an agent emits 5 `wb_draw_*` calls in one response, collisions among those 5 are invisible to it. Of the available approaches — server-side action interception, CoT-style "plan-then-commit", or true agentic loops (one draw, observe, next draw) — which has the best ROI?
- Whiteboard state is grouped by `agentName`, but the conflict list is not. Would surfacing ownership in the conflict output (`OVERLAP: text [id:e3, by Prof. Lin] and table [id:e7, by TA Su]`) help downstream agents distinguish their own work from teammates'?
- The `wb_draw_*` actions are a thin abstraction over the underlying `PPTTextElement / PPTLatexElement / ...` schema, but they hide useful fields (`rotate`, `lineHeight`, `opacity`) and force us to maintain layout knowledge in two places (slide-content prompt and whiteboard-reference snippet). Has anyone benchmarked "thin action wrapper" vs "agent emits raw DSL JSON"?
- When `image off` configurations produce sparse boards (sometimes a single element), the scorer awards them 10/10 on overlap. This effectively lets the lazy mode cheat. Are there published whiteboard / canvas layout rubrics that handle this better?

Reproduce / Get involved
- `lib/orchestration/summarizers/whiteboard-conflicts.ts` + `tests/orchestration/whiteboard-conflicts.test.ts` (17 cases)
- `eval/whiteboard-layout/`
- `eval/whiteboard-layout/scenarios/*.json` — PRs adding new scenarios are welcome

A direct ask: where we'd love community input
This investigation only scratched the surface, and we know the most useful next steps will come from people who think about layout / multi-agent orchestration / eval design differently than we do. Concretely, we'd love help with any of the following — comments on this discussion, draft PRs, or even just "we tried X in a sibling project and it did/didn't work" reports are all welcome:
The current scenarios (`eval/whiteboard-layout/scenarios/`) cover K-12 plus a few professional domains in Chinese. Scenarios in other languages, scripts (Arabic / Devanagari / Thai), or subject matter (humanities, art, music) would expose failure modes we can't see today. The JSON format is tiny — adding one is a 10-line PR.

If you're not sure whether your input fits, post anyway. The cost of an "off-topic" comment is tiny; the cost of an unshared insight is real.
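For contributors eyeing the conflict summarizer and its test suite, the core of that kind of module is plain axis-aligned rectangle geometry. A minimal sketch of the idea (the real `whiteboard-conflicts.ts` surely tracks more metadata, and every name below is hypothetical):

```typescript
// Axis-aligned bounding box for a placed whiteboard element.
interface Box {
  id: string;
  left: number;
  top: number;
  width: number;
  height: number;
}

// Two boxes overlap iff their intervals intersect on both axes.
function overlaps(a: Box, b: Box): boolean {
  return (
    a.left < b.left + b.width &&
    b.left < a.left + a.width &&
    a.top < b.top + b.height &&
    b.top < a.top + a.height
  );
}

// Pairwise scan; O(n^2) is fine for the handful of elements on one slide.
function findConflicts(boxes: Box[]): Array<[string, string]> {
  const conflicts: Array<[string, string]> = [];
  for (let i = 0; i < boxes.length; i++) {
    for (let j = i + 1; j < boxes.length; j++) {
      if (overlaps(boxes[i], boxes[j])) {
        conflicts.push([boxes[i].id, boxes[j].id]);
      }
    }
  }
  return conflicts;
}
```

New test cases for tricky geometries (touching edges, zero-area elements, rotated text once `rotate` is exposed) would be easy wins for outside contributors.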
Special thanks to the reporters of #38, #74, #208, #314, and #424 — those issues were the starting point for this whole investigation.