Interactivity Overhaul (User Interface & Model Instrumentation & Network Comms) #1054

Open: wants to merge 162 commits into base: main

Collaborator

@nopdive nopdive commented Oct 22, 2024

Interactivity Overhaul

What you want, when you want.

-- some guidance developer (circa 2024)

Screenshare of updated UI in Jupyter notebook

Overview

This PR is the first of many focusing on interactivity. It introduces an updated user interface for notebooks, new instrumentation for models, and a network layer to handle bidirectional communication between the IPython kernel and the JavaScript client. To support this, model rendering has been reworked and tracing logic has been added to better support replays where required.

This PR also functions as a foundational step towards near-future work, including rendering across various environments (e.g. terminal support as a TUI and append-only outputs), upgraded benchmarking and model inspection.

TL;DR

We added a lot of code to support better model metrics and visualization. We are getting ready for multimedia streaming, and want users to be able to deeply inspect all the models without overheating the computer.

Acknowledgements

Big shoutouts to:

  • Loc (co-developed this PR): model instrumentation & metrics.
  • Jingya: consult & sketches on enhanced UI design.
  • Harsha: overall feedback & collab on prototypes.

Running this PR

  • cd packages/python/stitch && pip install -e .
  • Go run a notebook.

User Interface

Design principle: All visibility. No magic.

Overall we're trying to show as much as we can on model outputs. When debugging outputs, there can be real ugliness that is often hidden away, including tokenization concerns and critical points that may dictate the rest of the output. This need for inspection increases as users begin to define their own structured decoding grammars, where unexpected overconstraints can occur in development.

The old user interface, which displays HTML as a side-effect in notebooks when models compute, has been replaced with a custom Jupyter Widget (see Network Communications for more detail) that hosts an interactive sandboxed iframe. We still support a legacy mode if users desire the previous UI.

Before
image

After
image

We're getting more information to the output at the expense of less text density. There is simply more going on, and in order to keep some legibility we've increased text size and spacing, compensating for two visual elements (highlighting and underlines) that are used to convey token info for scanning. A general metrics bar is also displayed for discoverability on token reduction and other efficiency metrics relevant when prompt engineering for reduced costs.

When users want further detail on tokens, we support a tooltip that contains the top 5 alternate token candidates alongside exact values for visual elements. Highlighting has been applied to candidates, accentuating tokens that include spaces.

We use a monospace typeface so that data format outputs can be inspected quicker (e.g. verticality can matter for balancing braces and indentation).

As users learn a system: a UI with easier discoverability can come at the cost of productivity. We've made all visual components optional to keep our power users in the flow, and in the future we intend to allow users to define defaults to fully support this.

For legacy mode (modeled after the previous UI), users can execute guidance.legacy_mode(True) at the start of their notebook.
image
Old school cool.

The Code

  • Added

    • guidance.visual module. Handles renderer creation (stitch or HTML display) and all required messaging. This also handles Jupyter cell change detection for deciding when widgets need to be instantiated or reset.
    • guidance.trace module. Tracks model inputs & outputs of an engine. Important for replaying for clients.
    • graphpaper-inline NPM package has been added. This handles all client-side rendering and messaging. Written with Svelte/TypeScript/Tailwind/D3.
  • Changed

    • Rendering logic has been stripped from Model class and has been delegated to Renderer member where possible.
    • Relevant state logic has been augmented for inputs & outputs, and stored within engine for tracing across models.
    • Role processing across guidance has been thinned. Model class now generates role opener and closer text directly from its respective chat template.
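
To illustrate why storing inputs and outputs as a trace helps with replays, here is a minimal sketch of a trace tree that can be flattened into an ordered, JSON-serializable message list for a client. The names `TraceNode`, `add_child` and `replay` are hypothetical and are not taken from guidance.trace; the real module may be structured quite differently.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class TraceNode:
    """Hypothetical trace node: one step with an optional input and output."""
    node_input: Optional[str] = None
    node_output: Optional[str] = None
    children: List["TraceNode"] = field(default_factory=list)

    def add_child(self, node: "TraceNode") -> "TraceNode":
        self.children.append(node)
        return node


def replay(root: TraceNode) -> Iterator[dict]:
    """Yield messages in depth-first order, numbered from zero."""
    counter = 0
    stack = [root]
    while stack:
        node = stack.pop()
        # Nodes without any payload (such as an empty root) emit nothing.
        if node.node_input is not None or node.node_output is not None:
            yield {"id": counter, "input": node.node_input, "output": node.node_output}
            counter += 1
        # Push children reversed so the first child is visited first.
        stack.extend(reversed(node.children))


root = TraceNode()
prompt = root.add_child(TraceNode(node_input="Tell me a joke."))
prompt.add_child(TraceNode(node_output="Why did"))
messages = list(replay(root))
```

Because the messages are derived from the tree rather than stored as a side-effect of rendering, a client that connects late can be sent the same sequence again.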

Instrumentation

Instrumentation is key for model inspection, debugging and cost-sensitive prompt engineering. This includes backing the new UI. Metrics are now collected for both general compute resources (CPU/GPU/RAM) and model tokens (including token counts/reduction, latency, type, backtracking).

The Code

  • Added (metric collection feature)

    • Add Monitor class in _model.py to collect common metrics (CPU, RAM, GPU utilization, etc.)
      • Monitor runs in a separate process to avoid competing for resources with the model/engine process
    • Model now keeps stats of current input/output/backtrack tokens
    • At the end of a notebook cell's execution, we collect the probability of each token in the final model state, along with associated per-token stats such as
      • Latency
      • If token was generated, force-forwarded or from user input
  • Changed:

    • Replaced get_next_token with get_next_token_with_top_k to keep track of the issued token along with its associated top_k tokens (both constrained and unconstrained). Data is stored in the EngineOutput class
    • Model now has a VisBytesChunk object to keep track of which part of the chunk is from user input, generated by the engine or force-forwarded by the parser.
      VisBytesChunk also stores the list of EngineOutput objects generated by the engine during chunk generation.
      This makes it easier to check whether tokens from the final state were generated, force-forwarded or came from user input.
    • Add get_per_token_topk_probs function in Engine class to calculate probability of each token in the token list.
      This function is used at the end of the cell execution to calculate the probabilities of model state in unconstrained mode.
    • Add get_per_token_stats function in Model class to report stats for each token in model state in unconstrained mode.
      Stats include issued token, probability, latency, top-k, masked-top-k if available.
      Data from get_per_token_stats will be reported to the UI for new visualization.
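
As a rough illustration of tracking both unconstrained and constrained (masked) top-k candidates for a single step, the sketch below works on a plain list of logits. The function name `top_k_with_mask`, the returned keys, and the choice to renormalize probabilities over the allowed tokens are all assumptions made for this example; they are not the signature or behavior of get_next_token_with_top_k.

```python
import math
from typing import Dict, List, Optional, Set, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_with_mask(
    logits: List[float], k: int, allowed: Optional[Set[int]] = None
) -> Dict[str, Optional[List[Tuple[int, float]]]]:
    """Return top-k (token_id, prob) pairs with and without a grammar mask."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    unconstrained = [(i, probs[i]) for i in ranked[:k]]
    if not allowed:
        return {"top_k": unconstrained, "masked_top_k": None}
    # Assumption: constrained probabilities are renormalized over allowed tokens.
    mass = sum(probs[i] for i in allowed)
    masked = [(i, probs[i] / mass) for i in ranked if i in allowed][:k]
    return {"top_k": unconstrained, "masked_top_k": masked}


# Token ids 2 and 3 are the only ones the (hypothetical) grammar permits.
result = top_k_with_mask([2.0, 1.0, 0.5, -1.0], k=2, allowed={2, 3})
```

Keeping both lists side by side is what lets a UI show which high-probability tokens were ruled out by the constraint.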

Network Communications

We have two emerging requirements that will impact future guidance development. One, the emergence of streaming multimedia around language models (audio/video). Two, user interactivity within the UI, requesting more data or computation that may not be feasible to pre-fetch or pre-calculate for a static client.

For user interactivity from UI to Python, it's also important that we cover as many notebook environments as possible. Each cloud notebook provider has its own quirks, which complicates client development. Some providers love resizing cell outputs indefinitely; others refuse to display HTML unless it's secured away in an isolated iframe.

All in all, we need a solution that is isolated, somewhat available across providers and can allow streams of messages between server (Jupyter Python kernel) and client (cell output with a touch of JS).

Stitch

It's 3:15AM, bi-directional comms was a mistake.

-- some guidance developer, minutes prior to passing out (circa 2024)

stitch is an auxiliary package we've created that handles bi-directional communication between a web client and a Jupyter Python kernel. It does this by creating a thin custom Jupyter widget that handles messages between the kernel and a sandboxed iframe hosting the web client. It looks something like this:

python code -> kernel-side jupyter widget -> kernel comms (ZMQ) -> client-side jupyter widget -> window event message -> sandboxed iframe -> web client (graphpaper-inline)

This package drives messages between guidance.visual module and graphpaper-inline client. All messages are streamed to allow near-real-time rendering within a notebook. Bi-directional comms is used to repair the display if initial messages have been missed (client will request a full replay when it notices the first message it receives has a non-zero identifier).
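
The repair behavior described above can be sketched in plain Python without any Jupyter machinery. The `Kernel` and `Client` classes below are hypothetical stand-ins for the two ends of the channel, and the message shape (`id`, `payload`) is an assumption; only the rule "request a full replay when the first message seen has a non-zero identifier" comes from the description.

```python
from typing import Dict, List


class Kernel:
    """Hypothetical kernel side: keeps every message so it can replay them."""

    def __init__(self) -> None:
        self.history: List[Dict] = []

    def emit(self, payload: str) -> Dict:
        msg = {"id": len(self.history), "payload": payload}
        self.history.append(msg)
        return msg

    def replay(self) -> List[Dict]:
        return list(self.history)


class Client:
    """Hypothetical client side: renders messages strictly in id order."""

    def __init__(self, kernel: Kernel) -> None:
        self.kernel = kernel
        self.rendered: List[str] = []
        self.next_id = 0

    def receive(self, msg: Dict) -> None:
        if self.next_id == 0 and msg["id"] != 0:
            # First message seen is not message zero: earlier ones were missed.
            for old in self.kernel.replay():
                self._render(old)
            return
        self._render(msg)

    def _render(self, msg: Dict) -> None:
        # Ignore anything out of order, including duplicates after a replay.
        if msg["id"] == self.next_id:
            self.rendered.append(msg["payload"])
            self.next_id += 1


kernel = Kernel()
kernel.emit("a")          # sent before the client has loaded, so it is lost
kernel.emit("b")          # also lost
client = Client(kernel)
client.receive(kernel.emit("c"))  # triggers a full replay
client.receive(kernel.emit("d"))  # normal in-order delivery
```

In the real pipeline the replay request would travel back through the iframe and widget rather than being a direct method call.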

The Code

  • Added
    • stitch Python package. Can be found at packages/python/stitch.

Future work

We wanted to shoot for the stars, and ended up in the ocean. The following will occur after this PR.

Near future tasks:

  • User defaults for UI
  • Terminal support (non-interactive & shell)
  • Restyle
  • Richer visualizations
  • Memory re-architecture (broader than this PR)
  • Interactive support for multimedia
  • Guidance quality-of-life (visual diff testing)

nopdive and others added 30 commits September 24, 2024 10:01
Visualization components do better with state handled as traces that can
rewind. As such definitions and evaluation of a guidance grammar is
separated here while minimizing changes needed at the grammar level.
Probably need to have separate fields for tracking, input and output of a given node.
Trace can now handle capture groups. State module moved to trace module.
Documentation added and some type changes.
Trace nodes have light adjustments. HTML renderer is connected but not fully working yet due to role closers.
Old HTML display now fully replaced. Fixed some roles issues as well.
Uses stitch for kernel to client communication. Need to redesign and hook in instrumentation.
Tooling appears to create a nameless role. Fixed.
Kernel messages still need to be re-implemented.
This package is required for Jupyter kernel comms via
a custom ipywidget.
Copyright headers now correctly pointing to Guidance Contributors.
Trace messages are now JSON serializable. Some minor fixes like adding a manifest for package.
Client has a race condition where it skips messages that have been fired by stitch before it loads.
Had to send a heartbeat first then send all messages in buffer.
Client messages can be handled in engine. Output for print and log not working due to being in an ipywidget. Will need to re-implement with asyncio later.
Separate thread for send/recv on messages.
Final message sent on cell completion. Still needs further testing.
No more dictionaries to recv_msg!
This includes for HTML renderer.
Queue instantiation now deferred to asyncio background thread.
@hudson-ai
Collaborator

So stoked about this @nopdive! Will start reviewing today. Anywhere in particular you feel needs a closer look?

@hudson-ai
Collaborator

So stoked about this @nopdive! Will start reviewing today. Anywhere in particular you feel needs a closer look?

I'm going to assume the modifications to the parser and engine call loop are one such place I should spend some time with :)

BTW, what's the plan for packaging this? Is stitch standalone to be broken out into its own package? Will users be able to opt out of the additional dependencies if all they want is to use guidance as a library rather than as a visual runtime?

@Harsha-Nori
Collaborator

Harsha-Nori commented Oct 22, 2024

I'd love for rough focus areas to be:

@hudson-ai on Loc's instrumentation and changes within the model <> parser interop layers
@paulbkoch on general efficiency/memory leaks/etc, and on the correctness of instrumentation (i.e. are we presenting and displaying the right information, specifically around utilization metrics)
@riedgar-ms to think about a longer term visual testing strategy here -- don't think we need to gate the PR on this. Sam might have thoughts.
@nking-1 , time permitting, on the client side JS pieces.

Of course, everyone should feel free to play around with the whole PR, report bugs as we find them, etc. :).

fail-fast: false
matrix:
  os: [ubuntu-latest, windows-latest, macos-latest]
  python-version: ["3.7", "3.10"]
Collaborator

Python 3.7??????

Collaborator Author

The ipywidget (jupyter widgets) developers recommend building custom widgets via a cookie-cutter template they provide (https://github.com/jupyter-widgets/ipywidgets) to follow best practice for interactive widgets.

Looks like this is part of the template (https://github.com/jupyter-widgets/widget-ts-cookiecutter/blob/master/.github/workflows/build.yml).

I'm of a mind to delete this. We can also update it and hook it to the CI if needed. I'm not fussed either way.

Collaborator Author

Related note: hopefully there's a way we can hook playwright and backstopJS for clientside testing.

@nopdive
Collaborator Author

nopdive commented Oct 22, 2024

So stoked about this @nopdive! Will start reviewing today. Anywhere in particular you feel needs a closer look?

I'm going to assume the modifications to the parser and engine call loop are one such place I should spend some time with :)

BTW, what's the plan for packaging this? Is stitch standalone to be broken out into its own package? Will users be able to opt out of the additional dependencies if all they want is to use guidance as a library rather than as a visual runtime?

Glad you're excited! As @Harsha-Nori mentioned, eyes on instrumentation and model changes regarding interop would be great. We've cut down the visual aspects on Model._state (I'm under the impression HTML rendering was the sole consumer but I could be wrong), I'm not sure if that affects other sections of the code including parser interop.

TBD. We need it specific to guidance, but from an infra standpoint we'll need a separate PyPI package for it anyway. Good question on additional dependencies. I err towards a guidance-core that has zero dependencies, against which developers can provide their own, whereas guidance declares key dependencies for general usage. It's a bigger problem than the visualization component, of course.

"""
if ipython_imported and get_ipython() is not None:
    ipy = get_ipython()
    cell_msg_id = ipy.get_parent()["msg_id"]
Collaborator

Getting an exception here in terminal ipython:

AttributeError: 'TerminalInteractiveShell' object has no attribute 'get_parent'

Not to go down the rabbit-hole of environment detection too much, but some bare-bones handler for the terminal would be really nice (or at least an explicit exception telling me that echo=False must be used).
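
For reference, one possible shape for such a bare-bones guard is sketched below. The helper name `current_cell_msg_id` and the two stand-in shell classes are hypothetical, and this is not the fix that was eventually applied; it only shows checking for `get_parent` before calling it, so a terminal shell yields `None` instead of raising.

```python
from typing import Any, Optional


def current_cell_msg_id(shell: Any) -> Optional[str]:
    """Return the running cell's parent msg_id, or None outside a kernel."""
    if shell is None:
        return None
    # Terminal shells do not expose get_parent, so look it up defensively.
    get_parent = getattr(shell, "get_parent", None)
    if get_parent is None:
        return None
    parent = get_parent() or {}
    return parent.get("msg_id")


class FakeKernelShell:
    """Stand-in for a notebook kernel shell (illustration only)."""

    def get_parent(self) -> dict:
        return {"msg_id": "abc-123"}


class FakeTerminalShell:
    """Stand-in for a terminal shell, which lacks get_parent."""


kernel_id = current_cell_msg_id(FakeKernelShell())
terminal_id = current_cell_msg_id(FakeTerminalShell())
```

A caller could treat `None` as "no widget available" and either fall back to plain output or raise an explicit error suggesting echo=False.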

Collaborator Author

Good catch, this is an oversight on my part, I should have tested on a terminal and promptly forgot.

I'll try to get the fix in today.

Collaborator Author

@hudson-ai , there should be a fix for terminal now. No side-effect renders until some of the IPython shell event handling is figured out (I'll add this to the future work items list).

@hudson-ai
Collaborator

Feature request: in the case where constraints give us only low-probability tokens to choose from, it would be really nice if we could see a few other valid tokens (if there are any). I think this would give a clearer picture of what the constraint is actually doing. Currently it'll only show four high-probability (but invalid) tokens alongside the one that was actually generated.

@JC1DA

JC1DA commented Oct 22, 2024

I think this would give a clearer picture of what the co

We have both constrained & unconstrained probs for generated tokens, so it could be done.
In that case, should we show fewer (invalid unconstrained tokens) and add some more (valid constrained tokens) with lower probs than the generated one? I would like to keep the number of tokens on the UI low, otherwise it's kind of confusing to users. Thoughts?

@hudson-ai
Collaborator

We have both constrained & unconstrained probs for generated tokens, so it could be done.
In that case, should we show fewer (invalid unconstrained tokens) and add some more (valid constrained tokens) with lower probs than the generated one? I would like to keep the number of tokens on the UI low, otherwise it's kind of confusing to users. Thoughts?

Not really sure what the right UX is here. But in my opinion, showing valid tokens is more helpful than showing invalid tokens, so I would personally prioritize those. Any other opinions in the room? @Harsha-Nori ?

Fake an `EngineOutput` for `ByteParser`
@nopdive
Collaborator Author

nopdive commented Oct 22, 2024

We have both constrained & unconstrained probs for generated tokens, so it could be done.
In that case, should we show fewer (invalid unconstrained tokens) and add some more (valid constrained tokens) with lower probs than the generated one? I would like to keep the number of tokens on the UI low, otherwise it's kind of confusing to users. Thoughts?

Not really sure what the right UX is here. But in my opinion, showing valid tokens is more helpful than showing invalid tokens, so I would personally prioritize those. Any other opinions in the room? @Harsha-Nori ?

Reasonable use case, I think it's worth considering. Ideally we're 5 +/- 2 in entries (for UI tooltip, not API access). We can separate top (K-n) tokens and have n-1 neighboring/successive tokens near selected token visually.
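
One reading of that split is sketched below: keep the top K-n candidates by probability, then fill the remaining n slots with the selected token and the valid tokens that follow it in rank. The function `tooltip_entries`, the `(token, prob, valid)` tuple shape, and this interpretation of "neighboring" are all assumptions for discussion, not an agreed design.

```python
from typing import List, Tuple

# (token text, probability, whether the grammar allows it)
Candidate = Tuple[str, float, bool]


def tooltip_entries(
    ranked: List[Candidate], selected: str, k: int = 5, n: int = 2
) -> List[Candidate]:
    """Pick k tooltip rows from candidates sorted by descending probability."""
    head = ranked[: k - n]
    shown = {token for token, _, _ in head}
    valid = [c for c in ranked if c[2]]
    # Start from the selected token within the valid-only ranking.
    start = next((i for i, c in enumerate(valid) if c[0] == selected), 0)
    tail: List[Candidate] = []
    for cand in valid[start:]:
        if len(tail) == n:
            break
        if cand[0] not in shown:
            tail.append(cand)
            shown.add(cand[0])
    return head + tail


ranked = [
    ("the", 0.50, False),
    ("a", 0.20, False),
    ("an", 0.10, False),
    ("{", 0.08, True),
    ("[", 0.05, True),
    ('"', 0.04, True),
]
rows = tooltip_entries(ranked, selected="{", k=5, n=2)
```

With these made-up numbers the tooltip keeps three invalid high-probability tokens and adds the generated token plus one valid successor, staying at five entries.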

gen_tokens_indices.append(len(gen_tokens_infos) - 1)

text = self._state
token_ids = self.engine.tokenizer.encode(text.encode("utf-8"))
Collaborator

Be very careful here.

  1. We may need to ensure a bos_token prefixes this, as there is additional work being done in the parser to do this (seems like this and the parser should use a common function)
  2. In working through this PR, I have realized that we're not always sending proper tokenizations to the engine in our generate loop. Re-tokenizing like this at the end may therefore give tokens (and probabilities) that mismatch the actual ones that are seen during generation.
  3. Both points above additionally mean higher costs than otherwise expected since the KV cache may be invalidated.

The point about improper tokenization is something that we should probably handle lower down, either in llguidance or in the token parser. Need to chat with @Harsha-Nori and/or @mmoskal about this.

@JC1DA JC1DA Oct 31, 2024

great catch. Will fix this BOS issue.
For 3) We don't really use KV cache to calculate the final probs as the engine will not generate probs for cached tokens.

Contributor

@hudson-ai added a check to append BOS token into token_ids before calling get_per_token_topk_probs.
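
A check of this kind might look like the sketch below. The helper name `ensure_bos` is hypothetical and the actual change in the PR may differ; the idea is only that re-tokenized ids are prefixed with the BOS token when the tokenizer defines one and it is not already present.

```python
from typing import List, Optional


def ensure_bos(token_ids: List[int], bos_token_id: Optional[int]) -> List[int]:
    """Return token_ids with the BOS id as a prefix, without duplicating it."""
    if bos_token_id is None:
        # Some tokenizers have no BOS token; leave the ids untouched.
        return list(token_ids)
    if token_ids and token_ids[0] == bos_token_id:
        return list(token_ids)
    return [bos_token_id] + list(token_ids)


with_bos = ensure_bos([15, 27, 3], bos_token_id=1)
unchanged = ensure_bos([1, 15, 27, 3], bos_token_id=1)
```

Sharing one helper like this between the parser and the stats code would address the first point raised above about using a common function.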

6 participants