
CUDA error: an illegal memory access was encountered #9

Open
CUHKSZzxy opened this issue May 5, 2024 · 2 comments

CUHKSZzxy commented May 5, 2024

Thank you for your excellent work!

I am currently trying to reproduce KVQuant but have run into an error. Your assistance with this would be appreciated.

1. Reproduce the bug

I followed the provided instructions and set up the environments for gradient/quant/deployment. The gradient and quantization steps ran without issues: I successfully computed the gradients and built the quantizer. However, when I tested the deployment code with the following commands, I got the error "CUDA error: an illegal memory access was encountered".

cp ../quant/quantizers.pickle .

CUDA_VISIBLE_DEVICES=1 python llama.py JackFram/llama-160m wikitext2 \
    --abits 4 \
    --include_sparse \
    --sparsity-threshold 0.99 \
    --quantizer-path quantizers.pickle \
    --benchmark 128 \
    --check

2. Error logs

The detailed error logs are shown as follows:

/root/anaconda3/envs/deploy/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
/root/anaconda3/envs/deploy/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
splitting into 1 GPUs
/root/anaconda3/envs/deploy/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Load quantizers.
k:  model.layers.0.self_attn.k_proj
/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py:449: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.outlier_threshold_upper = torch.tensor(quantizer[0]).cuda().half().flatten()
/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py:450: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.outlier_threshold_lower = torch.tensor(quantizer[1]).cuda().half().flatten()
/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py:484: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  lut_tmp = torch.tensor(self.lut)
k:  model.layers.0.self_attn.v_proj
k:  model.layers.1.self_attn.k_proj
k:  model.layers.1.self_attn.v_proj
k:  model.layers.2.self_attn.k_proj
k:  model.layers.2.self_attn.v_proj
k:  model.layers.3.self_attn.k_proj
k:  model.layers.3.self_attn.v_proj
k:  model.layers.4.self_attn.k_proj
k:  model.layers.4.self_attn.v_proj
k:  model.layers.5.self_attn.k_proj
k:  model.layers.5.self_attn.v_proj
k:  model.layers.6.self_attn.k_proj
k:  model.layers.6.self_attn.v_proj
k:  model.layers.7.self_attn.k_proj
k:  model.layers.7.self_attn.v_proj
k:  model.layers.8.self_attn.k_proj
k:  model.layers.8.self_attn.v_proj
k:  model.layers.9.self_attn.k_proj
k:  model.layers.9.self_attn.v_proj
k:  model.layers.10.self_attn.k_proj
k:  model.layers.10.self_attn.v_proj
k:  model.layers.11.self_attn.k_proj
k:  model.layers.11.self_attn.v_proj
Model type : llama
Benchmarking ...
Traceback (most recent call last):
  File "/root/KVQuant/deployment/llama.py", line 224, in <module>
    benchmark(model, input_ids, check=args.check)
  File "/root/KVQuant/deployment/llama.py", line 82, in benchmark
    out = model(
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py", line 2683, in forward
    outputs = self.model(
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py", line 2565, in forward
    layer_outputs = decoder_layer(
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1582, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py", line 2250, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py", line 1965, in forward
    attn_weights = self.kcache.forward_fused_sparse(query_states, key_states)
  File "/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py", line 710, in forward_fused_sparse
    outliers_rescaled = outliers_rescaled.cpu()
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

From my understanding, the error appears to be related to the CUDA kernel "vecquant4appendvecKsparse", which modifies the variable "outliers_rescaled".
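
For reference, below is a minimal debugging sketch (nothing KVQuant-specific is assumed, only the standard CUDA_LAUNCH_BLOCKING environment variable). Because CUDA kernel launches are asynchronous, the line shown in the traceback may only be where the illegal access is first reported, not where it actually occurs; forcing synchronous launches makes the traceback stop at the real offending kernel.

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before the first CUDA call

import torch

# Re-run the same benchmark / forward pass as above; with synchronous launches
# the traceback should now point at the offending kernel launch (e.g. inside
# forward_fused_sparse) rather than at the later outliers_rescaled.cpu() copy.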

3. Environment

  • OS: Ubuntu 20.04 LTS
  • GPU: Tesla P100-PCIE-16GB
  • Packages (pip list):
Package                  Version     Editable project location
------------------------ ----------- -------------------------------------
accelerate               0.29.3
aiohttp                  3.9.5
aiosignal                1.3.1
async-timeout            4.0.3
attrs                    23.2.0
certifi                  2024.2.2
charset-normalizer       3.3.2
datasets                 2.19.0
dill                     0.3.8
einops                   0.8.0
filelock                 3.14.0
flash-attn               2.5.8
frozenlist               1.4.1
fsspec                   2024.3.1
huggingface-hub          0.23.0
idna                     3.7
Jinja2                   3.1.3
kvquant                  0.1.0       /root/KVQuant/deployment
MarkupSafe               2.1.5
mpmath                   1.3.0
multidict                6.0.5
multiprocess             0.70.16
networkx                 3.2.1
ninja                    1.11.1.1
numpy                    1.26.4
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.20.5
nvidia-nvjitlink-cu12    12.4.127
nvidia-nvtx-cu12         12.1.105
packaging                24.0
pandas                   2.2.2
pip                      23.3.1
protobuf                 5.26.1
psutil                   5.9.8
pyarrow                  16.0.0
pyarrow-hotfix           0.6
python-dateutil          2.9.0.post0
pytz                     2024.1
PyYAML                   6.0.1
quant-cuda               0.0.0
regex                    2024.4.28
requests                 2.31.0
safetensors              0.4.3
sentencepiece            0.2.0
setuptools               68.2.2
six                      1.16.0
sympy                    1.12
tokenizers               0.15.2
torch                    2.3.0
tqdm                     4.66.4
transformers             4.38.0.dev0 /root/KVQuant/deployment/transformers
triton                   2.3.0
typing_extensions        4.11.0
tzdata                   2024.1
urllib3                  2.2.1
wheel                    0.43.0
xxhash                   3.4.1
yarl                     1.9.4

Due to hardware constraints, I intend to run a quick test on the smaller model weights indicated above. I expect KVQuant to work properly on it, since the smaller model differs from Llama-7B only in weight size while sharing a similar architecture.

4. Related solutions that I have tried

As suggested in the discussion of this CUDA error at https://github.com/pytorch/pytorch/issues/21819, I have updated CUDA, torch, and the other relevant components to the latest versions. However, I still encounter the same error.

What could be the cause of this error, and how can I solve it?

Thanks in advance!


blueFeather111 commented Aug 19, 2024

Hi, I ran into the same problem. In my case, my tensor variables were not on the same device; after I moved them onto the same device (CPU or CUDA), the problem was solved. Maybe this case will help.
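
For reference, a minimal sketch of the device check being suggested here; the helper and the example tensors are illustrative placeholders, not the actual KVQuant variables:

import torch

def assert_same_device(**tensors):
    # Raise if the given tensors do not all live on the same device.
    devices = {name: t.device for name, t in tensors.items() if isinstance(t, torch.Tensor)}
    if len(set(devices.values())) > 1:
        raise RuntimeError(f"tensors span multiple devices: {devices}")

# Example: check the inputs to the fused attention call before invoking it.
q = torch.randn(1, 8, 64, device="cuda", dtype=torch.half)
k = torch.randn(1, 8, 64, device="cuda", dtype=torch.half)
assert_same_device(query_states=q, key_states=k)  # raises if one tensor was left on the CPU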

CUHKSZzxy (Author) commented

> Hi, I ran into the same problem. In my case, my tensor variables were not on the same device; after I moved them onto the same device (CPU or CUDA), the problem was solved. Maybe this case will help.

Thanks for your suggestions, I will give it a try!
