Speculative sampling #17
base: main
Conversation
…added args to script for sampling
Example outputs demonstrating new sampling capabilities (see the sketch after this list for what top_k and temperature control):
- Greedy baseline
- top_k=5, temperature=2: no slowdown
- top_k=5, temperature=5: slowdown due to low likelihood of (ridiculous) output
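For context, a minimal sketch of what top_k/temperature sampling over a logits tensor typically looks like. This is illustrative only, not the code added in this PR; the function name and tensor shapes are assumptions.

```python
import torch

def sample_top_k(logits: torch.Tensor, top_k: int = 5, temperature: float = 2.0) -> torch.Tensor:
    """Illustrative top-k / temperature sampling.

    logits: (batch, vocab) tensor of unnormalized scores.
    Returns: (batch,) tensor of sampled token ids.
    """
    # Temperature > 1 flattens the distribution, < 1 sharpens it
    scaled = logits / temperature
    # Keep only the k highest-scoring tokens per row
    topk_vals, topk_idx = scaled.topk(top_k, dim=-1)
    probs = torch.softmax(topk_vals, dim=-1)
    # Sample within the top-k set, then map back to vocabulary ids
    choice = torch.multinomial(probs, num_samples=1)
    return topk_idx.gather(-1, choice).squeeze(-1)
```

Very high temperatures flatten the distribution toward uniform over the top-k set, which is consistent with the low-likelihood output and speculator slowdown noted for temperature=5 above.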
fms_extras/utils/generation.py
Outdated
```python
# Composite greedy and non greedy outputs
greedy = logits.argmax(-1)
mask = do_sample[:, None, None].int()
return samples * mask + (1 - mask) * greedy
```
if the mask is really a mask and not a weighting, might be better to use torch.where.
we're calculating the sampled results even if we don't use them? I guess that's something to do with compilation but I would have thought the generation code would be outside the compile path?
Good point, I'll swap to torch.where. We are calculating the sampled result for every case, and while that will be useful for compile down the road, in this case it's mostly just for efficient GPU usage - pretty sure that partitioning the greedy/non-greedy lines and then re-mixing them after is more work than just sampling everything.
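For reference, a minimal sketch of what the torch.where variant could look like, assuming the same names and shapes as the quoted snippet; this is illustrative, not necessarily the committed change.

```python
import torch

def composite_outputs(logits: torch.Tensor, samples: torch.Tensor, do_sample: torch.Tensor) -> torch.Tensor:
    """Pick sampled tokens where do_sample is True, greedy argmax tokens otherwise."""
    greedy = logits.argmax(-1)
    # torch.where treats do_sample as a true selection mask rather than a 0/1 weighting,
    # so no integer multiply-and-blend is needed
    return torch.where(do_sample[:, None, None].bool(), samples, greedy)
```

The result matches the 0/1 arithmetic blend, but the intent (per-sequence selection) is explicit and the two intermediate products are never materialized.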
For example, if the base model predicts tokens A and B with equal 50% probability, and the speculator produces one candidate with A and another with B, with independent sampling there's a 25% chance of rejecting both, even though one must be correct. Consistent sampling allows us to avoid this.
if the goal is to speculate on a mutually exclusive set of possible continuations, why are we sampling at all and not just speculating on the top-k predictions?
We could do this, but we're more concerned with the ability to sample here than we are with the non-greediness of the approach. In this case "not greedy" is meant strictly literally, in that sampling involves not selecting greedily (assuming I'm understanding the question).
Implements simple speculative sampling via candidate-consistent ground-truth sampling. See #12 for a discussion on implementation details and why this is needed in the first place.
- Adds a __generate_targets() function, implementing both greedy and non-greedy selection, for use in speculative_generate()
- Adds sampling args to the paged_speculative_inference.py demo script

Notably, for low temperature and top_k, we anecdotally observe no reduction in speculator performance compared to the greedy case!
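To make the candidate-consistent idea concrete, here is a minimal, illustrative sketch (not the fms_extras implementation; the function name and shapes are assumptions): sample the ground-truth token once from the base model's distribution and accept whichever candidates match it, rather than running an independent accept/reject per candidate.

```python
import torch

def consistent_accept(base_probs: torch.Tensor, candidates: torch.Tensor) -> torch.Tensor:
    """Accept speculator candidates against a single ground-truth draw.

    base_probs: (vocab,) next-token distribution from the base model.
    candidates: (num_candidates,) token ids proposed by the speculator.
    Returns a boolean mask over candidates.
    """
    # One draw from the base model serves as the ground truth for all candidates...
    ground_truth = torch.multinomial(base_probs, num_samples=1)
    # ...so mutually exclusive candidates are never all rejected when one of them
    # matches the drawn token
    return candidates == ground_truth
```

In the A/B example quoted above (a 50/50 base distribution with candidates covering both tokens), exactly one candidate is accepted on every draw, whereas independent per-candidate sampling rejects both 25% of the time.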