Whiteboard layout investigation: from community reports to a measured prompt overhaul #489
wyuc started this conversation in Show and tell.
How the whiteboard system is designed (and why it constrains the optimization space)
A few design choices shape what we could and couldn't try in this investigation. Sharing them here so the trade-offs in the rest of the post make sense:
- Responses are parsed as interleaved `[text, action, text, action, ...]` items. Actions execute in emission order after the response is parsed. This keeps latency tight (one round-trip per turn) and matches the natural rhythm of "speak while drawing" in a real classroom — but it means an agent cannot observe the result of its own action 3 before deciding action 4.
- Tool surface (`wb_draw_text` / `wb_draw_shape` / `wb_draw_latex` / ...): each `wb_draw_*` is a thin wrapper that the action engine converts into a `PPTTextElement` / `PPTLatexElement` / etc. — the same DSL the offline slide generator uses. The wrapper exists for permission gating, fade-in animation timing, and to hide a few fields the agent shouldn't touch (`rotate` defaults to 0, the font name is fixed). This adds a small abstraction layer at the cost of two slightly different surfaces sharing one renderer.

These constraints define the "feasible region" of this work. Several of the open questions at the bottom point at relaxing them.
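To make the wrapper-to-DSL relationship concrete, here is a minimal sketch. The type and function names beyond `wb_draw_text` and `PPTTextElement` are hypothetical, as are the default values; the real schema carries more fields than shown:

```typescript
// Hypothetical shapes illustrating the thin-wrapper design described above.
interface WbDrawTextArgs {
  id: string;
  left: number;     // agent-supplied coordinates (the unreliable part)
  top: number;
  content: string;
  fontSize?: number;
}

interface PPTTextElement {
  type: "text";
  id: string;
  left: number;
  top: number;
  content: string;
  fontSize: number;
  rotate: number;   // hidden from the agent, always 0
  fontName: string; // fixed, not agent-controllable
}

// The action engine converts each wb_draw_text call into the shared slide DSL.
function toTextElement(args: WbDrawTextArgs): PPTTextElement {
  return {
    type: "text",
    id: args.id,
    left: args.left,
    top: args.top,
    content: args.content,
    fontSize: args.fontSize ?? 18, // illustrative default, not the real one
    rotate: 0,                     // default the agent cannot override
    fontName: "default",           // placeholder for the fixed font name
  };
}
```

The conversion is where permission gating and animation timing hook in; the agent only ever sees the narrow `WbDrawTextArgs`-style surface.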
Where this started: community reports
We've received a steady stream of issues about whiteboard content quality. The most directly relevant ones: #38, #74, #208, #314, and #424.

These all point at the same underlying problem: the LLM produces unreliable `left`/`top` coordinates, leading to overlapping elements, off-canvas content, and inconsistent font sizing. This PR (#485) is a systematic response to that whole class of issues — instead of patching individual symptoms, we built a measurable eval and only shipped what moved the numbers.

TL;DR
- With `gemini-3.1-pro` as a VLM judge, `gemini-3-flash` went from baseline overall 5.4 / overlap 6.3 to 8.0 / 9.6.
- `flash + thinking` ≥ `pro + default`. You may not need to upgrade to a top-tier model — flipping the right flag on a cheaper one can match it.

Setup
- `eval/whiteboard-layout/` contains 6 scenario JSONs spanning primary-school math, junior-high physics, high-school math, medical compliance, development economics, and corporate finance. Each scenario has an initial slide plus several user turns.
- Each scenario is run against the `/api/chat` endpoint via `runAgentLoop`, simulating a live classroom session.
- Results are scored by `gemini-3.1-pro-preview`, which returns 6 dimensions (readability, overlap, rendering_correctness, content_completeness, layout_logic, overall).

Pointers to the relevant commits and code are in the PR description.
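The post doesn't spell out the scenario file format, but to give a feel for its size, here is a sketch of what one scenario plausibly contains. All field names here are guesses for illustration, not the actual `eval/whiteboard-layout` schema:

```typescript
// Hypothetical scenario shape -- field names are illustrative only.
interface EvalScenario {
  name: string;
  subject: string;                       // e.g. "junior-high physics"
  initialSlide: { elements: unknown[] }; // starting whiteboard in the slide DSL
  userTurns: string[];                   // messages replayed via runAgentLoop
}

const example: EvalScenario = {
  name: "jh-physics-convex-lens",
  subject: "junior-high physics",
  initialSlide: { elements: [] },
  userTurns: [
    "Draw the principal axis and a convex lens.",
    "Add the three principal rays for an object beyond 2F.",
  ],
};
```

The judge sees only the final rendered board per turn, so a scenario is fully described by its starting state and the turn script.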
Findings
1. Prompt overhaul + programmatic conflict detection: overlap +1.8 (clear win)
Prompt-side changes (most of the diff):
- Rewrote the whiteboard role templates (`lib/prompts/templates/agent-system-wb-{teacher,assistant,student}/system.md` plus a shared `snippets/whiteboard-reference.md`).

Geometric conflict detection (new module `lib/orchestration/summarizers/whiteboard-conflicts.ts`): after each turn, overlaps between drawn elements are detected and surfaced back to the agents as a conflict list.

With these changes, 5 out of 6 scenarios reached near-perfect overlap on the final turn. The one that didn't budge was `med-gcp-compliance` (multi-agent, dense tables) — that scenario became our "stress test" benchmark.

2. Image feedback loop: dropped
We invested significant effort in a "feed the rendered whiteboard from the previous turn back as a `file` part on the next user message" mechanism. The intuition was that an agent seeing what it had just drawn would self-correct.

We measured 2×2 (model × image flag) at N=3:
Verdict: weak models barely benefit (they see the image but can't act on it); strong models actively over-correct ("this could be tidier — let me `wb_clear` and redo"), making clean layouts worse. Net ROI is approximately zero, and is negative on the model tier you'd actually deploy in production.
The supporting plumbing (image part extraction in `message-converter.ts`, multimodal passthrough in `ai-sdk-adapter.ts`, the "Prior-state image" sections in role templates) has all been reverted in this PR. It can be reintroduced cleanly from git history if someone wants to ship user-uploaded images later.

3. Thinking: largest single lever, but the latency cost is real
We added an opt-in `thinking` field to the chat API (default `enabled: false`). When enabled on `gemini-3-flash-preview`, this sets `thinkingLevel: 'high'` provider-side.

The `med-gcp-compliance` scenario that we couldn't budge with prompt alone (3.3 baseline) jumped to 8.3 with thinking enabled. The hard scenario isn't a prompt problem — it's a reasoning capacity problem.

But p50 = 63 seconds means a teacher pause of a full minute between sentences. For a live classroom, this is unusable. We exposed `EVAL_ENABLE_THINKING` as an opt-in env variable; the production chat path stays on `enabled: false`.

Worth recording: `flash + thinking` outscores `pro + default` on every metric in our eval. For cost-sensitive deployments where latency can be amortized (e.g., offline pre-generation), enabling thinking on a cheaper model may dominate upgrading to a more expensive one.

Open questions (we'd love your input)
A few directions we considered but haven't measured. If you have data, opinions, or related work, please drop a comment:
- When an agent emits 5 `wb_draw_*` calls in one response, collisions among those 5 are invisible to it. Of the available approaches — server-side action interception, CoT-style "plan-then-commit", or true agentic loops (one draw, observe, next draw) — which has the best ROI?
- Whiteboard state is grouped by `agentName`, but the conflict list is not. Would surfacing ownership in the conflict output (`OVERLAP: text [id:e3, by Prof. Lin] and table [id:e7, by TA Su]`) help downstream agents distinguish their own work from teammates'?
- The `wb_draw_*` actions are a thin abstraction over the underlying `PPTTextElement / PPTLatexElement / ...` schema, but they hide useful fields (`rotate`, `lineHeight`, `opacity`) and force us to maintain layout knowledge in two places (slide-content prompt and whiteboard-reference snippet). Has anyone benchmarked "thin action wrapper" vs "agent emits raw DSL JSON"?
- When `image off` configurations produce sparse boards (sometimes a single element), the scorer awards them 10/10 on overlap. This effectively lets the lazy mode cheat. Are there published whiteboard / canvas layout rubrics that handle this better?

Reproduce / Get involved
- `lib/orchestration/summarizers/whiteboard-conflicts.ts` + `tests/orchestration/whiteboard-conflicts.test.ts` (17 cases)
- `eval/whiteboard-layout/`
- `eval/whiteboard-layout/scenarios/*.json` — PRs adding new scenarios are welcome

A direct ask: where we'd love community input
This investigation only scratched the surface, and we know the most useful next steps will come from people who think about layout / multi-agent orchestration / eval design differently than we do. Concretely, we'd love help with any of the following — comments on this discussion, draft PRs, or even just "we tried X in a sibling project and it did/didn't work" reports are all welcome:
The current scenarios (`eval/whiteboard-layout/scenarios/`) cover K-12 plus a few professional domains in Chinese. Scenarios in other languages, scripts (Arabic / Devanagari / Thai), or subject matter (humanities, art, music) would expose failure modes we can't see today. The JSON format is tiny — adding one is a 10-line PR.

If you're not sure whether your input fits, post anyway. The cost of an "off-topic" comment is tiny; the cost of an unshared insight is real.
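For contributors eyeing the conflict summarizer and its test suite, the core of that kind of module is plain axis-aligned rectangle geometry. A minimal sketch of the idea (the real `whiteboard-conflicts.ts` surely tracks more metadata, and every name below is hypothetical):

```typescript
// Axis-aligned bounding box for a placed whiteboard element.
interface Box {
  id: string;
  left: number;
  top: number;
  width: number;
  height: number;
}

// Two boxes overlap iff their intervals intersect on both axes.
function overlaps(a: Box, b: Box): boolean {
  return (
    a.left < b.left + b.width &&
    b.left < a.left + a.width &&
    a.top < b.top + b.height &&
    b.top < a.top + a.height
  );
}

// Pairwise scan; O(n^2) is fine for the handful of elements on one slide.
function findConflicts(boxes: Box[]): Array<[string, string]> {
  const conflicts: Array<[string, string]> = [];
  for (let i = 0; i < boxes.length; i++) {
    for (let j = i + 1; j < boxes.length; j++) {
      if (overlaps(boxes[i], boxes[j])) {
        conflicts.push([boxes[i].id, boxes[j].id]);
      }
    }
  }
  return conflicts;
}
```

New test cases for tricky geometries (touching edges, zero-area elements, rotated text once `rotate` is exposed) would be easy wins for outside contributors.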
Special thanks to the reporters of #38, #74, #208, #314, and #424 — those issues were the starting point for this whole investigation.