
Problem loading granite-3b in small MIG partitions #104

Open
ccamacho opened this issue Jun 24, 2024 · 1 comment
Labels: bug (Something isn't working)

Comments

@ccamacho

Describe the bug

TGIS reports a misleading error when a model is deployed in a MIG partition that is too small for it.

To Reproduce

  • Deploy TGIS in OpenShift AI.
  • Enable MIG (1g.5gb partitions).
  • Deploy granite-3b in TGIS standalone (a sketch for checking what the pod actually sees follows below).
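
A quick way to confirm what the TGIS pod actually sees is to query the visible device from Python. This is a generic diagnostic sketch, not part of TGIS; the "MIG 1g.5gb" name and the ~5 GB figure are the values expected for this partition profile, not measurements from the failing pod:

```python
import torch

# Inspect the device visible inside the pod. On a 1g.5gb MIG slice the
# device name should include "MIG 1g.5gb" and total_memory should report
# roughly 5 GB.
props = torch.cuda.get_device_properties(0)
print(f"name:   {torch.cuda.get_device_name(0)}")
print(f"memory: {props.total_memory / 1e9:.1f} GB")
```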

Expected output

The inference service comes up, or TGIS reports a detailed error explaining why the model cannot be loaded.

Actual error

2024-06-24T10:33:54.484964Z  INFO text_generation_launcher: TGIS Commit hash: 
2024-06-24T10:33:54.484984Z  INFO text_generation_launcher: Launcher args: Args { model_name: "/mnt/models/", revision: None, deployment_framework: "hf_transformers", dtype: None, dtype_str: None, quantize: None, num_shard: None, max_concurrent_requests: 512, max_sequence_length: Some(448), max_new_tokens: 384, max_batch_size: 64, max_prefill_padding: 0.2, batch_safety_margin: 20, max_waiting_tokens: 24, port: 3000, grpc_port: 8033, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, json_output: false, tls_cert_path: None, tls_key_path: None, tls_client_ca_cert_path: None, output_special_tokens: false, cuda_process_memory_fraction: 1.0, default_include_stop_seqs: true, otlp_endpoint: None, otlp_service_name: None }
2024-06-24T10:33:54.484997Z  INFO text_generation_launcher: Inferring num_shard = 1 from CUDA_VISIBLE_DEVICES/NVIDIA_VISIBLE_DEVICES
2024-06-24T10:33:54.485049Z  INFO text_generation_launcher: Saving fast tokenizer for `/mnt/models/` to `/tmp/74657ff2-73b1-45f2-b8d5-a7302a63f862`
/opt/tgis/lib/python3.11/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
2024-06-24T10:33:56.397996Z  INFO text_generation_launcher: Using configured max_sequence_length: 448
2024-06-24T10:33:56.398022Z  INFO text_generation_launcher: Setting PYTORCH_CUDA_ALLOC_CONF to default value: expandable_segments:True
2024-06-24T10:33:56.398340Z  INFO text_generation_launcher: Starting shard 0
Shard 0: /opt/tgis/lib/python3.11/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
Shard 0:   warnings.warn(
Shard 0: HAS_BITS_AND_BYTES=False, HAS_GPTQ_CUDA=True, EXLLAMA_VERSION=2, GPTQ_CUDA_TYPE=exllama
Shard 0: supports_causal_lm = True, supports_seq2seq_lm = False
Shard 0: Traceback (most recent call last):
Shard 0: 
Shard 0:   File "/opt/tgis/bin/text-generation-server", line 8, in <module>
Shard 0:     sys.exit(app())
Shard 0:              ^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/cli.py", line 75, in serve
Shard 0:     raise e
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/cli.py", line 56, in serve
Shard 0:     server.serve(
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/server.py", line 388, in serve
Shard 0:     asyncio.run(
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/asyncio/runners.py", line 190, in run
Shard 0:     return runner.run(main)
Shard 0:            ^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/asyncio/runners.py", line 118, in run
Shard 0:     return self._loop.run_until_complete(task)
Shard 0:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
Shard 0:     return future.result()
Shard 0:            ^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/server.py", line 267, in serve_inner
Shard 0:     model = get_model(
Shard 0:             ^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/models/__init__.py", line 126, in get_model
Shard 0:     return CausalLM(model_name, revision, deployment_framework, dtype, quantize, model_config, max_sequence_length)
Shard 0:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/models/causal_lm.py", line 558, in __init__
Shard 0:     inference_engine = get_inference_engine_class(deployment_framework)(
Shard 0:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/inference_engine/hf_transformers.py", line 76, in __init__
Shard 0:     self.model = model_class.from_pretrained(**kwargs).requires_grad_(False).eval()
Shard 0:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
Shard 0:     return model_class.from_pretrained(
Shard 0:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3375, in from_pretrained
Shard 0:     model = cls(config, *model_args, **model_kwargs)
Shard 0:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 1103, in __init__
Shard 0:     self.model = LlamaModel(config)
Shard 0:                  ^^^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 924, in __init__
Shard 0:     [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 924, in <listcomp>
Shard 0:     [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
Shard 0:      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 701, in __init__
Shard 0:     self.mlp = LlamaMLP(config)
Shard 0:                ^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 219, in __init__
Shard 0:     self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
Shard 0:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 98, in __init__
Shard 0:     self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
Shard 0:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/torch/utils/_device.py", line 77, in __torch_function__
Shard 0:     return func(*args, **kwargs)
Shard 0:            ^^^^^^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0: RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp":830, please report a bug to PyTorch. 
Shard 0: 
2024-06-24T10:34:00.379801Z ERROR text_generation_launcher: Shard 0 failed: ExitStatus(unix_wait_status(256))
2024-06-24T10:34:00.400918Z  INFO text_generation_launcher: Shutting down shards
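
One possible trigger worth ruling out (a hypothesis, not confirmed in this issue): the assert fires inside the CUDA caching allocator, and TGIS enables expandable segments by default (see the PYTORCH_CUDA_ALLOC_CONF line in the log above); expandable segments have been reported to misbehave on MIG devices in some PyTorch builds. A minimal sketch that overrides the allocator mode before the first CUDA allocation, to check whether the assert is tied to it:

```python
import os

# Must be set before torch initializes CUDA; expandable_segments:False
# reverts to the classic caching-allocator behavior.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:False"

import torch

# First CUDA allocation; with expandable segments enabled, this is where
# the NVML assert in CUDACachingAllocator.cpp would surface on an
# affected setup.
x = torch.empty((1024, 1024), device="cuda")
print("allocation succeeded on", torch.cuda.get_device_name(0))
```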

Workaround

Deploying the model in a larger MIG partition avoids the failure.
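
For context on why the slice size matters (assuming the underlying failure is memory pressure surfaced through the allocator, which the misleading assert would be hiding): the weights of a ~3B-parameter model in 16-bit precision alone need on the order of 6-7 GB, more than the 5 GB a 1g.5gb slice provides, before any KV cache or activations. A back-of-the-envelope sketch, with the parameter count as an assumption rather than a measured value:

```python
# Rough weight footprint vs. common A100 MIG profiles.
params = 3.5e9               # assumed parameter count for a ~3B model
bytes_per_param = 2          # fp16/bf16 weights
weights_gb = params * bytes_per_param / 1e9

for profile, mem_gb in [("1g.5gb", 5), ("2g.10gb", 10), ("3g.20gb", 20)]:
    verdict = "fits" if weights_gb < mem_gb else "does not fit"
    print(f"{profile}: ~{weights_gb:.1f} GB of weights -> {verdict}")
```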

ccamacho added the bug label on Jun 24, 2024
@sumaiya1996

🤸‍♀️
