Can't quantize kv cache: observer = self.k_observers[layer_idx] list index out of range #1295

Open · DreamGenX opened this issue Mar 28, 2025 · 4 comments · May be fixed by #1312
Labels: bug (Something isn't working)

DreamGenX commented Mar 28, 2025

Describe the bug
Using this recipe:

quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head", "re:.*layers.0..*", "re:.*layers.79..*"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: channel
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: token
                        dynamic: true
                        symmetric: true
                    targets: ["Linear"]
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true

This results in the error below; if I remove the kv_cache_scheme, it works.
The model I am quantizing is Llama 3.3 70B.
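
For reference, here is a minimal sketch of how such a recipe is typically applied through llmcompressor's oneshot entrypoint. The model ID, calibration dataset, sample counts, and output directory below are illustrative assumptions, not values taken from this report, and the exact import path can vary across llmcompressor versions:

# Illustrative sketch -- model ID, dataset, and calibration settings are
# assumptions, not values reported in this issue.
from llmcompressor.transformers import oneshot

oneshot(
    model="meta-llama/Llama-3.3-70B-Instruct",  # assumed model ID
    dataset="open_platypus",                    # assumed calibration dataset
    recipe="recipe.yaml",                       # the quant_stage recipe shown above
    output_dir="Llama-3.3-70B-FP8-KV",
    max_seq_length=2048,
    num_calibration_samples=512,
)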

Expected behavior
Quantization should complete successfully with the kv_cache_scheme applied, even when some layers are listed under ignore.

Environment
Include all relevant environment information:

  1. OS [e.g. Ubuntu 20.04]:
  2. Python version [e.g. 3.7]:
  3. LLM Compressor version or commit hash [e.g. 0.1.0, f7245c8]: 0.4.1
  4. ML framework version(s) [e.g. torch 2.3.1]:
  5. Other Python package versions [e.g. vLLM, compressed-tensors, numpy, ONNX]:
  6. Other relevant environment information [e.g. hardware, CUDA version]: 2x H100 SXM

To Reproduce
Exact steps to reproduce the behavior:

Errors
Full traceback:

  File "/root/venv/lib/python3.12/site-packages/llmcompressor/modifiers/quantization/quantization/base.py", line 112, in on_initialize
    self._calibrate_if_possible(module)
  File "/root/venv/lib/python3.12/site-packages/llmcompressor/modifiers/quantization/quantization/base.py", line 266, in _calibrate_if_possible
    self._calibrate(module)
  File "/root/venv/lib/python3.12/site-packages/llmcompressor/modifiers/quantization/quantization/base.py", line 314, in _calibrate
    run_calibration_forward(
  File "/root/venv/lib/python3.12/site-packages/llmcompressor/modifiers/utils/pytorch_helpers.py", line 82, in run_calibration_forward
    forward_fn(batch, module=model)
  File "/root/venv/lib/python3.12/site-packages/llmcompressor/pytorch/utils/helpers.py", line 394, in tensors_module_forward
    return module(**tensors)
           ^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/accelerate/hooks.py", line 176, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 853, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 601, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/accelerate/hooks.py", line 176, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 343, in forward
    hidden_states, self_attn_weights = self.self_attn(
                                       ^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1845, in _call_impl
    return inner()
           ^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1793, in inner
    result = forward_call(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/accelerate/hooks.py", line 176, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 287, in forward
    key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/llmcompressor/modifiers/quantization/cache.py", line 93, in update
    q_key_states = self._quantize(
                   ^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/llmcompressor/modifiers/quantization/cache.py", line 144, in _quantize
    observer = self.k_observers[layer_idx]
               ~~~~~~~~~~~~~~~~^^^^^^^^^^^
IndexError: list index out of range

DreamGenX added the bug label Mar 28, 2025

brian-dellabetta (Collaborator) commented Apr 1, 2025

Hi @DreamGenX, I think you have discovered an edge-case bug that appears when certain layers are ignored for kv cache quantization. We need to change the List[Observer] field to a Dict[LayerIdx, Observer], because the indices don't align when some layers are ignored.
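
To illustrate the mismatch, here is a minimal standalone sketch (names and structure are illustrative, not the actual llmcompressor cache.py implementation): if observers are stored in a plain list that only has entries for the quantized layers, indexing it with the model's absolute layer_idx runs past the end once some layers are ignored, whereas keying by layer index keeps lookups aligned.

# Standalone sketch of the index mismatch -- illustrative only.
num_layers = 80
ignored = {0, 79}  # mirrors the recipe's ignore of layers.0 and layers.79

# A list holding observers only for the quantized layers has 78 entries,
# so indexing it with an absolute layer_idx from the model can overflow.
k_observers_list = [f"obs_{i}" for i in range(num_layers) if i not in ignored]
# k_observers_list[79]  -> IndexError: list index out of range

# Keying observers by the model's layer index (the Dict[LayerIdx, Observer]
# direction described above) keeps lookups aligned regardless of ignores.
k_observers_dict = {i: f"obs_{i}" for i in range(num_layers) if i not in ignored}
observer = k_observers_dict.get(79)  # None for an ignored layer, no crash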

brian-dellabetta self-assigned this Apr 1, 2025

brian-dellabetta (Collaborator) commented Apr 1, 2025

Hi @DreamGenX, I have created PR #1312, which will hopefully solve your issue. Could you check out the branch bdellabe/kv_cache_ignore_layer_bugfix, install from source, and run again? If you are unsure how to do that, please provide the full script you used and I can try it myself. Thanks!

DreamGenX (Author) commented

Thanks @brian-dellabetta -- I will need some time before I can run it again, but you can easily reproduce it with the provided config and any of the Llama 3 models, for example.

brian-dellabetta (Collaborator) commented

Hi @DreamGenX -- I can confirm I see the error you are reporting when using your recipe with meta-llama/Meta-Llama-3-8B-Instruct, and that my PR resolves it. I am able to run lm_eval on the quantized model. I will move this to ready-for-review to get feedback.
