
Unable to launch Triton server on fine-tuned Whisper model #568

Open

StephennFernandes opened this issue Apr 8, 2024 · 6 comments

@StephennFernandes

Hi there,
I have been fine-tuning Whisper models using Hugging Face. To convert a model to TensorRT-LLM format, I first use an HF script that converts it from the HF format back to the original OpenAI format. I then follow your instructions to convert the OpenAI model to TensorRT-LLM format, which succeeds.

However, when I follow the next steps and launch the Triton Inference Server with the launch_server.sh script, I get the following error:

I0408 20:10:02.904273 254 server.cc:345] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models

Here is the full log, including the stack trace, printed after running the bash script:

I0408 20:09:55.322866 254 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7c5bd2000000' with size 2048000000
I0408 20:09:55.324786 254 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 4096000000
I0408 20:09:55.330786 254 model_lifecycle.cc:461] loading: whisper:1
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024022700
I0408 20:09:58.959624 254 python_be.cc:2362] TRITONBACKEND_ModelInstanceInitialize: whisper_0_0 (CPU device 0)
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024022700
[04/08/2024-20:10:01] [TRT] [E] 6: The engine plan file is not compatible with this version of TensorRT, expecting library version 9.2.0.5 got 9.3.0.1, please rebuild.
[04/08/2024-20:10:01] [TRT] [E] 2: [engine.cpp::deserializeEngine::1148] Error Code 2: Internal Error (Assertion engine->deserialize(start, size, allocator, runtime) failed. )
I0408 20:10:02.021843 254 pb_stub.cc:346] Failed to initialize Python stub: AttributeError: 'NoneType' object has no attribute 'create_execution_context'

At:
  /usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/session.py(67): _init
  /usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/session.py(80): from_serialized_engine
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/whisper_trtllm.py(53): get_session
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/whisper_trtllm.py(33): __init__
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/whisper_trtllm.py(201): __init__
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/model.py(52): init_model
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/model.py(45): initialize

E0408 20:10:02.773438 254 backend_model.cc:691] ERROR: Failed to create instance: AttributeError: 'NoneType' object has no attribute 'create_execution_context'

At:
  /usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/session.py(67): _init
  /usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/session.py(80): from_serialized_engine
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/whisper_trtllm.py(53): get_session
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/whisper_trtllm.py(33): __init__
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/whisper_trtllm.py(201): __init__
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/model.py(52): init_model
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/model.py(45): initialize

E0408 20:10:02.773573 254 model_lifecycle.cc:630] failed to load 'whisper' version 1: Internal: AttributeError: 'NoneType' object has no attribute 'create_execution_context'

At:
  /usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/session.py(67): _init
  /usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/session.py(80): from_serialized_engine
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/whisper_trtllm.py(53): get_session
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/whisper_trtllm.py(33): __init__
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/whisper_trtllm.py(201): __init__
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/model.py(52): init_model
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/model.py(45): initialize

I0408 20:10:02.773614 254 model_lifecycle.cc:765] failed to load 'whisper'
I0408 20:10:02.773752 254 server.cc:606] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0408 20:10:02.773824 254 server.cc:633] 
+---------+---------------------------------------------------+---------------------------------------------------+
| Backend | Path                                              | Config                                            |
+---------+---------------------------------------------------+---------------------------------------------------+
| python  | /opt/tritonserver/backends/python/libtriton_pytho | {"cmdline":{"auto-complete-config":"true","backen |
|         | n.so                                              | d-directory":"/opt/tritonserver/backends","min-co |
|         |                                                   | mpute-capability":"6.000000","default-max-batch-s |
|         |                                                   | ize":"4"}}                                        |
|         |                                                   |                                                   |
+---------+---------------------------------------------------+---------------------------------------------------+

I0408 20:10:02.773886 254 server.cc:676] 
+---------+---------+---------------------------------------------------------------------------------------------+
| Model   | Version | Status                                                                                      |
+---------+---------+---------------------------------------------------------------------------------------------+
| whisper | 1       | UNAVAILABLE: Internal: AttributeError: 'NoneType' object has no attribute 'create_execution |
|         |         | _context'                                                                                   |
|         |         |                                                                                             |
|         |         | At:                                                                                         |
|         |         |   /usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/session.py(80): from_seriali |
|         |         | zed_engine                                                                                  |
|         |         |   /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtll |
|         |         | m/whisper/1/whisper_trtllm.py(33): __init__                                                 |
|         |         |   /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtll |
|         |         | m/whisper/1/model.py(52): init_model                                                        |
|         |         |   /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/whisper_trtllm.py(33): __init__ |
+---------+---------+---------------------------------------------------------------------------------------------+

I0408 20:10:02.888804 254 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA RTX A6000
I0408 20:10:02.903932 254 metrics.cc:770] Collecting CPU metrics
I0408 20:10:02.904225 254 tritonserver.cc:2498] 
+----------------------------------+------------------------------------------------------------------------------+
| Option                           | Value                                                                        |
+----------------------------------+------------------------------------------------------------------------------+
| server_id                        | triton                                                                       |
| server_version                   | 2.42.0                                                                       |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) |
|                                  |  schedule_policy model_configuration system_shared_memory cuda_shared_memory |
|                                  |  binary_tensor_data parameters statistics trace logging                      |
| model_repository_path[0]         | ./model_repo_whisper_trtllm                                                  |
| model_control_mode               | MODE_NONE                                                                    |
| strict_model_config              | 0                                                                            |
| rate_limit                       | OFF                                                                          |
| pinned_memory_pool_byte_size     | 2048000000                                                                   |
| cuda_memory_pool_byte_size{0}    | 4096000000                                                                   |
| min_supported_compute_capability | 6.0                                                                          |
| strict_readiness                 | 1                                                                            |
| exit_timeout                     | 30                                                                           |
| cache_enabled                    | 0                                                                            |
+----------------------------------+------------------------------------------------------------------------------+

I0408 20:10:02.904255 254 server.cc:307] Waiting for in-flight requests to complete.
I0408 20:10:02.904262 254 server.cc:323] Timeout 30: Found 0 model versions that have in-flight inferences
I0408 20:10:02.904268 254 server.cc:338] All models are stopped, unloading models
I0408 20:10:02.904273 254 server.cc:345] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
@csukuangfj
Collaborator

@yuekaizhang Could you have a look?

@yuekaizhang
Collaborator

The engine plan file is not compatible with this version of TensorRT, expecting library version 9.2.0.5 got 9.3.0.1, please rebuild.

@StephennFernandes It seems you built the engines in one environment and are running them in another. Would you mind building and running them in the same Docker container, e.g. soar97/triton-whisper:24.01.complete?
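The mismatch is stated directly in the log line quoted above. As a quick sanity check before rebuilding, a small hypothetical helper (not part of TensorRT-LLM; just a sketch using the standard library) can pull the expected and installed TensorRT versions out of the server log:

```python
import re

def find_trt_version_mismatch(log_text: str):
    """Scan a Triton/TensorRT log for an engine-vs-library version mismatch.

    Returns (expected, got) version strings, or None if no mismatch line
    is present. Matches lines like:
      "... expecting library version 9.2.0.5 got 9.3.0.1, please rebuild."
    """
    m = re.search(r"expecting library version ([\d.]+) got ([\d.]+)", log_text)
    return (m.group(1), m.group(2)) if m else None

log = ("[TRT] [E] 6: The engine plan file is not compatible with this version "
       "of TensorRT, expecting library version 9.2.0.5 got 9.3.0.1, please rebuild.")
print(find_trt_version_mismatch(log))  # ('9.2.0.5', '9.3.0.1')
```

If this returns a pair of differing versions, the engine was serialized by a different TensorRT build than the one loading it, and rebuilding the engine inside the serving container resolves it.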

@StephennFernandes
Author

@yuekaizhang I got it working, thanks a ton for your assistance.
Also, I noticed that we cannot run inference on audio files longer than 30 seconds.

@yuekaizhang
Collaborator

yuekaizhang commented Apr 12, 2024

@yuekaizhang i got it working thanks a ton for your assistance. also. noticed that we cannot do inference for longer audio files. beyond 30s

@StephennFernandes Since Whisper can only process audio shorter than 30 s, you need to implement a VAD segmenter, like this project: https://github.com/shashikg/WhisperS2T/tree/main. Contributions welcome :D
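As a placeholder until a real VAD segmenter is wired in, a naive fixed-window splitter gives the general shape of the pre-processing step: cut the recording into at-most-30 s spans with a small overlap so words at a boundary land in both chunks. This is only a sketch (no speech detection; WhisperS2T uses an actual VAD to cut at silences); the function name and parameters are made up for illustration:

```python
def chunk_spans(total_s: float, window_s: float = 30.0, overlap_s: float = 1.0):
    """Split a recording of total_s seconds into spans of at most window_s.

    Naive fixed-window fallback (no VAD): consecutive windows overlap by
    overlap_s so a word straddling a boundary appears in both chunks.
    Returns a list of (start, end) times in seconds.
    """
    if total_s <= window_s:
        return [(0.0, total_s)]
    spans, start = [], 0.0
    step = window_s - overlap_s
    while start < total_s:
        end = min(start + window_s, total_s)
        spans.append((start, end))
        if end >= total_s:
            break
        start += step
    return spans

print(chunk_spans(75.0))  # [(0.0, 30.0), (29.0, 59.0), (58.0, 75.0)]
```

Each span would then be sliced from the waveform and sent to the server as its own request, with the transcripts stitched back together (deduplicating the overlapped second).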

@StephennFernandes
Author

@yuekaizhang Thanks for the heads-up; already on it.

@StephennFernandes
Author

@yuekaizhang Hey, it doesn't look like this error breaks the deployment; as far as I can tell, my Triton deployment works fine. But this odd error log pops up whenever I deploy my model.
Do you happen to know what it means?

I0412 08:38:16.910492 1416 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x76d5da000000' with size 2048000000
I0412 08:38:16.911524 1416 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 4096000000
I0412 08:38:16.915451 1416 model_lifecycle.cc:469] loading: whisper:1
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024040200
free(): invalid pointer
[user-DSA7TGX-424R:01427] *** Process received signal ***
[user-DSA7TGX-424R:01427] Signal: Aborted (6)
[user-DSA7TGX-424R:01427] Signal code:  (-6)
[user-DSA7TGX-424R:01427] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7b8114a16520]
[user-DSA7TGX-424R:01427] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7b8114a6a9fc]
[user-DSA7TGX-424R:01427] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7b8114a16476]
[user-DSA7TGX-424R:01427] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7b81149fc7f3]
[user-DSA7TGX-424R:01427] [ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x89676)[0x7b8114a5d676]
[user-DSA7TGX-424R:01427] [ 5] /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa0cfc)[0x7b8114a74cfc]
[user-DSA7TGX-424R:01427] [ 6] /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa2a44)[0x7b8114a76a44]
[user-DSA7TGX-424R:01427] [ 7] /usr/lib/x86_64-linux-gnu/libc.so.6(free+0x73)[0x7b8114a79453]
[user-DSA7TGX-424R:01427] [ 8] /opt/tritonserver/backends/python/triton_python_backend_stub(+0x70064)[0x64e9f6599064]
[user-DSA7TGX-424R:01427] [ 9] /opt/tritonserver/backends/python/triton_python_backend_stub(+0x25e13)[0x64e9f654ee13]
[user-DSA7TGX-424R:01427] [10] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7b81149fdd90]
[user-DSA7TGX-424R:01427] [11] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7b81149fde40]
[user-DSA7TGX-424R:01427] [12] /opt/tritonserver/backends/python/triton_python_backend_stub(+0x26b75)[0x64e9f654fb75]
[user-DSA7TGX-424R:01427] *** End of error message ***
I0412 08:38:22.776815 1416 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: whisper_0_0 (CPU device 0)
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024040200
I0412 08:38:30.125002 1416 model_lifecycle.cc:835] successfully loaded 'whisper'
I0412 08:38:30.125259 1416 server.cc:607] 
