Can't quantize kv cache: observer = self.k_observers[layer_idx] list index out of range #1295

Open · DreamGenX opened this issue Mar 28, 2025 · 4 comments · May be fixed by #1312
Labels: bug (Something isn't working)

DreamGenX commented Mar 28, 2025

Describe the bug
Using this recipe:

quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head", "re:.*layers.0..*", "re:.*layers.79..*"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: channel
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: token
                        dynamic: true
                        symmetric: true
                    targets: ["Linear"]
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true

This results in the error below; if I remove the kv_cache_scheme, it works.
The model I am quantizing is Llama 3.3 70B.
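
For reference, here is a minimal sketch of how such a recipe is typically applied through llmcompressor's oneshot entrypoint. The model ID, calibration dataset, sample counts, and output directory below are illustrative assumptions, not values taken from this report, and the exact import path can vary across llmcompressor versions:

# Illustrative sketch -- model ID, dataset, and calibration settings are
# assumptions, not values reported in this issue.
from llmcompressor.transformers import oneshot

oneshot(
    model="meta-llama/Llama-3.3-70B-Instruct",  # assumed model ID
    dataset="open_platypus",                    # assumed calibration dataset
    recipe="recipe.yaml",                       # the quant_stage recipe shown above
    output_dir="Llama-3.3-70B-FP8-KV",
    max_seq_length=2048,
    num_calibration_samples=512,
)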

Expected behavior
Quantization should complete successfully with the kv_cache_scheme applied, even when some layers are listed under ignore.

Environment
Include all relevant environment information:

  1. OS [e.g. Ubuntu 20.04]:
  2. Python version [e.g. 3.7]:
  3. LLM Compressor version or commit hash [e.g. 0.1.0, f7245c8]: 0.4.1
  4. ML framework version(s) [e.g. torch 2.3.1]:
  5. Other Python package versions [e.g. vLLM, compressed-tensors, numpy, ONNX]:
  6. Other relevant environment information [e.g. hardware, CUDA version]: 2x H100 SXM

To Reproduce
Exact steps to reproduce the behavior:

Errors
Full traceback:

  File "/root/venv/lib/python3.12/site-packages/llmcompressor/modifiers/quantization/quantization/base.py", line 112, in on_initialize
    self._calibrate_if_possible(module)
  File "/root/venv/lib/python3.12/site-packages/llmcompressor/modifiers/quantization/quantization/base.py", line 266, in _calibrate_if_possible
    self._calibrate(module)
  File "/root/venv/lib/python3.12/site-packages/llmcompressor/modifiers/quantization/quantization/base.py", line 314, in _calibrate
    run_calibration_forward(
  File "/root/venv/lib/python3.12/site-packages/llmcompressor/modifiers/utils/pytorch_helpers.py", line 82, in run_calibration_forward
    forward_fn(batch, module=model)
  File "/root/venv/lib/python3.12/site-packages/llmcompressor/pytorch/utils/helpers.py", line 394, in tensors_module_forward
    return module(**tensors)
           ^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/accelerate/hooks.py", line 176, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 853, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 601, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/accelerate/hooks.py", line 176, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 343, in forward
    hidden_states, self_attn_weights = self.self_attn(
                                       ^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1845, in _call_impl
    return inner()
           ^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1793, in inner
    result = forward_call(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/accelerate/hooks.py", line 176, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 287, in forward
    key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/llmcompressor/modifiers/quantization/cache.py", line 93, in update
    q_key_states = self._quantize(
                   ^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/llmcompressor/modifiers/quantization/cache.py", line 144, in _quantize
    observer = self.k_observers[layer_idx]
               ~~~~~~~~~~~~~~~~^^^^^^^^^^^
IndexError: list index out of range

DreamGenX added the bug label Mar 28, 2025

brian-dellabetta (Collaborator) commented Apr 1, 2025

Hi @DreamGenX, I think you have discovered an edge-case bug that appears when certain layers are ignored for kv cache quantization. We need to change the List[Observer] field to a Dict[LayerIdx, Observer], because the indices don't align when some layers are ignored.
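
To illustrate the mismatch, here is a minimal standalone sketch (names and structure are illustrative, not the actual llmcompressor cache.py implementation): if observers are stored in a plain list that only has entries for the quantized layers, indexing it with the model's absolute layer_idx runs past the end once some layers are ignored, whereas keying by layer index keeps lookups aligned.

# Standalone sketch of the index mismatch -- illustrative only.
num_layers = 80
ignored = {0, 79}  # mirrors the recipe's ignore of layers.0 and layers.79

# A list holding observers only for the quantized layers has 78 entries,
# so indexing it with an absolute layer_idx from the model can overflow.
k_observers_list = [f"obs_{i}" for i in range(num_layers) if i not in ignored]
# k_observers_list[79]  -> IndexError: list index out of range

# Keying observers by the model's layer index (the Dict[LayerIdx, Observer]
# direction described above) keeps lookups aligned regardless of ignores.
k_observers_dict = {i: f"obs_{i}" for i in range(num_layers) if i not in ignored}
observer = k_observers_dict.get(79)  # None for an ignored layer, no crash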

brian-dellabetta self-assigned this Apr 1, 2025

brian-dellabetta (Collaborator) commented Apr 1, 2025

Hi @DreamGenX, I have created PR #1312, which will hopefully solve your issue. Could you check out the branch bdellabe/kv_cache_ignore_layer_bugfix, install from source, and run again? If you are unsure how to do that, please provide the full script you used and I can try it myself. Thanks!

DreamGenX (Author) commented

Thanks @brian-dellabetta -- I will need some time before I can run it again, but you can easily reproduce it with the provided config and any of the Llama 3 models, for example.

brian-dellabetta (Collaborator) commented

Hi @DreamGenX -- I can confirm I see the error you are reporting when using your recipe with meta-llama/Meta-Llama-3-8B-Instruct, and that my PR resolves it. I am able to run lm_eval on the quantized model. I will move this to ready-for-review to get feedback.
