support zipformer2 offline triton recipe #639

Merged (1 commit) on Aug 23, 2024
45 changes: 14 additions & 31 deletions triton/Dockerfile/Dockerfile.server
@@ -1,41 +1,24 @@
FROM nvcr.io/nvidia/tritonserver:22.12-py3
FROM nvcr.io/nvidia/tritonserver:24.07-py3
# https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html
# Please choose previous tritonserver:xx.xx if you encounter cuda driver mismatch issue

LABEL maintainer="NVIDIA"
LABEL repository="tritonserver"

RUN apt-get update && apt-get -y install \
python3-dev \
cmake \
libsndfile1
RUN pip3 install \
torch==1.13.1+cu117 \
torchaudio==0.13.1+cu117 \
--index-url https://download.pytorch.org/whl/cu117
RUN pip3 install \
kaldialign \
tensorboard \
sentencepiece \
lhotse \
kaldifeat
RUN pip3 install \
k2==1.24.4.dev20240223+cuda11.7.torch1.13.1 -f https://k2-fsa.github.io/k2/cuda.html
# Dependency for client
RUN pip3 install soundfile grpcio-tools tritonclient pyyaml
RUN apt-get update && apt-get install -y cmake
RUN python3 -m pip install k2==1.24.4.dev20240725+cuda12.4.torch2.4.0 -f https://k2-fsa.github.io/k2/cuda.html && \
python3 -m pip install -r https://raw.githubusercontent.com/k2-fsa/icefall/master/requirements.txt && \
pip install -U "huggingface_hub[cli]" lhotse colored onnx_graphsurgeon polygraphy
# https://github.com/k2-fsa/k2/blob/master/k2/python/k2/__init__.py#L13 delete the cuda version check
RUN sed -i '/if (/,/^ )/d' /usr/local/lib/python3.10/dist-packages/k2/__init__.py
WORKDIR /workspace

# #install k2 from source
# #"sed -i ..." line tries to turn off the cuda check
# RUN git clone https://github.com/k2-fsa/k2.git && \
# cd k2 && \
# sed -i 's/FATAL_ERROR/STATUS/g' cmake/torch.cmake && \
# sed -i 's/in running_cuda_version//g' get_version.py && \
# python3 setup.py install && \
# cd -
RUN git clone https://github.com/csukuangfj/kaldifeat && \
cd kaldifeat && \
sed -i 's/in running_cuda_version//g' get_version.py && \
python3 setup.py install && \
cd -

RUN git clone https://github.com/k2-fsa/icefall.git
ENV PYTHONPATH "${PYTHONPATH}:/workspace/icefall"
# https://github.com/k2-fsa/icefall/issues/674
ENV PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION "python"

COPY ./scripts scripts
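
After building, a quick sanity check of the image can confirm that torch and k2 still import once the `sed` patch above removes k2's CUDA version check. This is only a sketch and assumes the `sherpa_triton_server:latest` tag used in the README below.

```bash
# Hedged sanity check; the image tag follows the README's docker build command.
docker run --rm --gpus all sherpa_triton_server:latest \
  python3 -c "import torch, k2; print(torch.__version__, torch.cuda.is_available())"
# k2 also ships a small version module that prints its full build configuration.
docker run --rm --gpus all sherpa_triton_server:latest python3 -m k2.version
```
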
34 changes: 11 additions & 23 deletions triton/README.md
@@ -34,19 +34,14 @@ Build the server docker image:
cd $SHERPA_SRC/triton
docker build . -f Dockerfile/Dockerfile.server -t sherpa_triton_server:latest --network host
```
Alternatively, you could directly pull the pre-built image based on tritonserver 22.12.
Alternatively, you could directly pull the pre-built image based on the tritonserver 24.07 image.
```
docker pull soar97/triton-k2:22.12.1
```

If you are planning to use TRT to accelerate inference, you can use the following prebuilt image:
```
docker pull wd929/sherpa_wend_23.04:v1.1
docker pull soar97/triton-k2:24.07
```

Start the docker container:
```bash
docker run --gpus all -v $SHERPA_SRC:/workspace/sherpa --name sherpa_server --net host --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -it soar97/triton-k2:22.12.1
docker run --gpus all -v $SHERPA_SRC:/workspace/sherpa --name sherpa_server --net host --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -it soar97/triton-k2:24.07
```
You should now be inside the container.
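
A minimal check inside the container, assuming the GPU is exposed and `$SHERPA_SRC` was mounted as shown above:
```bash
# Confirm the GPU is visible and the sherpa source tree is mounted.
nvidia-smi
ls /workspace/sherpa/triton
```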

@@ -69,8 +64,7 @@ apt-get install git-lfs
pip3 install -r ./requirements.txt
export CUDA_VISIBLE_DEVICES="your_gpu_id"

bash scripts/build_wenetspeech_pruned_transducer_stateless5_streaming.sh
bash scripts/build_librispeech_pruned_transducer_stateless3_streaming.sh
bash scripts/build_wenetspeech_zipformer_offline_trt.sh
```
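
The quick-start script hardcodes its checkpoint and icefall paths near the top, so adjust them before running if your layout differs. A sketch using `sed` (the values shown are just the script's own defaults):
```bash
# Hedged sketch: point the recipe at your own checkpoint and icefall checkout.
sed -i 's|^pretrained_model_dir=.*|pretrained_model_dir=/workspace/icefall-asr-zipformer-wenetspeech-20230615|' \
    scripts/build_wenetspeech_zipformer_offline_trt.sh
sed -i 's|^icefall_dir=.*|icefall_dir=/workspace/icefall|' \
    scripts/build_wenetspeech_zipformer_offline_trt.sh
```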

## Using TensorRT acceleration
@@ -83,26 +77,20 @@ You can directly use the following script to export TRT engine and start Triton
bash scripts/build_librispeech_pruned_transducer_stateless3_offline_trt.sh
```

### Export to TensorRT Step by Step

If you want to build TensorRT for your own model, you can try the following steps:
### Export to TensorRT

#### Preparation for TRT

First of all, you have to install TensorRT. We suggest using a docker container to run TRT. Just run the following command:

```bash
docker run --gpus '"device=0"' -it --rm --net host -v $PWD/:/k2 nvcr.io/nvidia/tensorrt:23.04-py3
```
You can also refer to [here](https://github.com/NVIDIA/TensorRT#build) to build TRT on your machine.
If you want to build TensorRT for your own service, you can try the following steps:

#### Model export

You have to prepare the ONNX model by referring [here](https://github.com/k2-fsa/sherpa/blob/master/triton/scripts/build_librispeech_pruned_transducer_stateless3_offline.sh#L41C1-L41C1) to export your models into ONNX format. Assume you have put your ONNX model in the `$model_dir` directory.
You have to prepare the ONNX model by referring to [here](https://icefall.readthedocs.io/en/latest/model-export/export-onnx.html#export-the-model-to-onnx) to export your models into ONNX format. Assume you have put your ONNX model in the `$model_dir` directory.
Then, just run the command:

```bash
bash scripts/build_trt.sh 128 $model_dir/encoder.onnx model_repo_offline/encoder/1/encoder.trt
# First, use polygraphy to simplify the onnx model.
polygraphy surgeon sanitize $model_dir/encoder.onnx --fold-constant -o $model_dir/encoder.onnx
# Then build the engine with the /usr/src/tensorrt/bin/trtexec tool shipped in the tritonserver docker image.
bash scripts/build_trt.sh 16 $model_dir/encoder.onnx model_repo_offline/encoder/1/encoder.trt
```

The generated TRT model will be saved into `model_repo_offline/encoder/1/encoder.trt`.
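
Before wiring the engine into Triton, it can be worth checking that TensorRT and ONNX Runtime agree on the encoder. The sketch below assumes the exported encoder exposes inputs named `x` and `x_lens` with 80-dim features; verify the names first with `polygraphy inspect model`.
```bash
# Hedged sketch: compare TensorRT vs ONNX Runtime outputs on random inputs.
# Input names and shapes are assumptions; check them with:
#   polygraphy inspect model $model_dir/encoder.onnx
polygraphy run $model_dir/encoder.onnx --trt --onnxrt \
    --input-shapes x:[2,1000,80] x_lens:[2] \
    --atol 1e-3 --rtol 1e-3
```
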
@@ -44,7 +44,7 @@ input [
},
{
name: "wav_lens"
data_type: TYPE_INT64
data_type: TYPE_INT32
dims: [1]
}
]
16 changes: 16 additions & 0 deletions triton/model_repo_offline/scorer/config.pbtxt.template
@@ -32,6 +32,22 @@ parameters [
{
key: "decoding_method",
value: { string_value: "greedy_search"}
},
{
key: "beam",
value: { string_value: "4"}
},
{
key: "max_contexts",
value: { string_value: "4"}
},
{
key: "max_states",
value: { string_value: "32"}
},
{
key: "temperature",
value: { string_value: "1.0"}
}
]
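
The parameters added above (beam, max_contexts, max_states, temperature) are presumably consumed by fast_beam_search and are ignored by the default greedy_search. A sketch of switching the generated scorer config over, in the same `sed` style the build scripts use (here `$model_repo_path` points at `model_repo_offline`):
```bash
# Hedged sketch: enable fast_beam_search in the generated scorer config.
sed -i 's|string_value: "greedy_search"|string_value: "fast_beam_search"|g' \
    $model_repo_path/scorer/config.pbtxt
```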

@@ -4,7 +4,7 @@ stop_stage=2

# change to your own model directory
pretrained_model_dir=/mnt/samsung-t7/wend/github/icefall/egs/librispeech/ASR/pruned_transducer_stateless7/exp/
model_repo_path=./zipformer/model_repo_offline
model_repo_path=./model_repo_offline

# modify model specific parameters according to $pretrained_model_dir/exp/onnx_export.log
VOCAB_SIZE=500
2 changes: 1 addition & 1 deletion triton/scripts/build_trt.sh
@@ -14,7 +14,7 @@

# parameters for TRT engines
MIN_BATCH=1
OPT_BATCH=32
OPT_BATCH=4
MAX_BATCH=$1
onnx_model=$2
trt_model=$3
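
For orientation, a hedged sketch of the kind of `trtexec` call these MIN/OPT/MAX batch values typically feed into; the real `build_trt.sh` is authoritative, and the input names `x`/`x_lens` and the frame/feature dimensions here are assumptions.
```bash
# Hedged sketch only; see build_trt.sh for the actual command.
/usr/src/tensorrt/bin/trtexec \
    --onnx=$onnx_model \
    --minShapes=x:${MIN_BATCH}x100x80,x_lens:${MIN_BATCH} \
    --optShapes=x:${OPT_BATCH}x1000x80,x_lens:${OPT_BATCH} \
    --maxShapes=x:${MAX_BATCH}x2000x80,x_lens:${MAX_BATCH} \
    --fp16 \
    --saveEngine=$trt_model
```
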
131 changes: 131 additions & 0 deletions triton/scripts/build_wenetspeech_zipformer_offline_trt.sh
@@ -0,0 +1,131 @@
#!/bin/bash
stage=-1
stop_stage=3

export CUDA_VISIBLE_DEVICES=1

pretrained_model_dir=/workspace/icefall-asr-zipformer-wenetspeech-20230615
model_repo_path=./model_repo_offline

# modify model specific parameters according to $pretrained_model_dir/exp/ log files
VOCAB_SIZE=5537

DECODER_CONTEXT_SIZE=2
DECODER_DIM=512
ENCODER_DIM=512 # max(_to_int_tuple(params.encoder_dim))


if [ -d "$pretrained_model_dir/data/lang_char" ]
then
echo "pretrained model using char"
TOKENIZER_FILE=$pretrained_model_dir/data/lang_char
else
echo "pretrained model using bpe"
TOKENIZER_FILE=$pretrained_model_dir/data/lang_bpe_500/bpe.model
fi

MAX_BATCH=16
# model instance num
FEATURE_EXTRACTOR_INSTANCE_NUM=2
ENCODER_INSTANCE_NUM=1
JOINER_INSTANCE_NUM=1
DECODER_INSTANCE_NUM=1
SCORER_INSTANCE_NUM=2


icefall_dir=/workspace/icefall
export PYTHONPATH=$PYTHONPATH:$icefall_dir
recipe_dir=$icefall_dir/egs/wenetspeech/ASR/zipformer

if [ ${stage} -le -2 ] && [ ${stop_stage} -ge -2 ]; then
if [ -d "$pretrained_model_dir" ]
then
echo "skip download pretrained model"
else
echo "downloading pretrained model"
cd /workspace
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/pkufool/icefall-asr-zipformer-wenetspeech-20230615
pushd icefall-asr-zipformer-wenetspeech-20230615
git lfs pull --include "exp/pretrained.pt"
ln -s ./exp/pretrained.pt ./exp/epoch-9999.pt
popd
cd -
fi
fi
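
# Hedged sketch: since GIT_LFS_SKIP_SMUDGE=1 is used above, it is worth confirming
# the checkpoint was really pulled and is not a small LFS pointer file, e.g.
#   ls -lh $pretrained_model_dir/exp/pretrained.pt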

if [ ${stage} -le -1 ] && [ ${stop_stage} -ge -1 ]; then
echo "export onnx"
cd ${recipe_dir}
# WAR: please comment out https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/zipformer/zipformer.py#L1422-L1427
# if you would like to use the exported onnx to build trt engine later.
python3 ./export-onnx.py \
--tokens $TOKENIZER_FILE/tokens.txt \
--use-averaged-model 0 \
--epoch 9999 \
--avg 1 \
--exp-dir $pretrained_model_dir/exp/ \
--num-encoder-layers "2,2,3,4,3,2" \
--downsampling-factor "1,2,4,8,4,2" \
--feedforward-dim "512,768,1024,1536,1024,768" \
--num-heads "4,4,4,8,4,4" \
--encoder-dim "192,256,384,512,384,256" \
--query-head-dim 32 \
--value-head-dim 12 \
--causal False || exit 1

cd -
fi
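
# Hedged sketch: before templating configs, confirm the exported encoder's
# input/output names and dynamic axes, e.g.
#   polygraphy inspect model $pretrained_model_dir/exp/encoder-epoch-9999-avg-1.onnx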

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
echo "auto gen config.pbtxt"
dirs="encoder decoder feature_extractor joiner scorer transducer"

if [ ! -d $model_repo_path ]; then
echo "Please cd to $model_repo_path"
exit 1
fi

cp -r $TOKENIZER_FILE $model_repo_path/scorer/
TOKENIZER_FILE=$model_repo_path/scorer/$(basename $TOKENIZER_FILE)
for dir in $dirs
do
cp $model_repo_path/$dir/config.pbtxt.template $model_repo_path/$dir/config.pbtxt

sed -i "s|VOCAB_SIZE|${VOCAB_SIZE}|g" $model_repo_path/$dir/config.pbtxt
sed -i "s|DECODER_CONTEXT_SIZE|${DECODER_CONTEXT_SIZE}|g" $model_repo_path/$dir/config.pbtxt
sed -i "s|DECODER_DIM|${DECODER_DIM}|g" $model_repo_path/$dir/config.pbtxt
sed -i "s|ENCODER_LAYERS|${ENCODER_LAYERS}|g" $model_repo_path/$dir/config.pbtxt
sed -i "s|ENCODER_DIM|${ENCODER_DIM}|g" $model_repo_path/$dir/config.pbtxt
sed -i "s|ENCODER_LEFT_CONTEXT|${ENCODER_LEFT_CONTEXT}|g" $model_repo_path/$dir/config.pbtxt
sed -i "s|ENCODER_RIGHT_CONTEXT|${ENCODER_RIGHT_CONTEXT}|g" $model_repo_path/$dir/config.pbtxt

sed -i "s|TOKENIZER_FILE|${TOKENIZER_FILE}|g" $model_repo_path/$dir/config.pbtxt

sed -i "s|MAX_BATCH|${MAX_BATCH}|g" $model_repo_path/$dir/config.pbtxt

sed -i "s|FEATURE_EXTRACTOR_INSTANCE_NUM|${FEATURE_EXTRACTOR_INSTANCE_NUM}|g" $model_repo_path/$dir/config.pbtxt
sed -i "s|ENCODER_INSTANCE_NUM|${ENCODER_INSTANCE_NUM}|g" $model_repo_path/$dir/config.pbtxt
sed -i "s|JOINER_INSTANCE_NUM|${JOINER_INSTANCE_NUM}|g" $model_repo_path/$dir/config.pbtxt
sed -i "s|DECODER_INSTANCE_NUM|${DECODER_INSTANCE_NUM}|g" $model_repo_path/$dir/config.pbtxt
sed -i "s|SCORER_INSTANCE_NUM|${SCORER_INSTANCE_NUM}|g" $model_repo_path/$dir/config.pbtxt
done
fi
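
# Hedged sketch: after stage 0, confirm no template placeholders were left
# unsubstituted in the generated configs, e.g.
#   grep -rn "INSTANCE_NUM\|MAX_BATCH\|TOKENIZER_FILE" $model_repo_path/*/config.pbtxt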

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
cp $pretrained_model_dir/exp/encoder-epoch-9999-avg-1.onnx $model_repo_path/encoder/1/encoder.onnx
cp $pretrained_model_dir/exp/decoder-epoch-9999-avg-1.onnx $model_repo_path/decoder/1/decoder.onnx
cp $pretrained_model_dir/exp/joiner-epoch-9999-avg-1.onnx $model_repo_path/joiner/1/joiner.onnx
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
echo "Buiding TRT engine..., skip the stage if you would like to use onnxruntime"
polygraphy surgeon sanitize $pretrained_model_dir/exp/encoder-epoch-9999-avg-1.onnx --fold-constant -o $pretrained_model_dir/exp/encoder.onnx
bash scripts/build_trt.sh $MAX_BATCH $pretrained_model_dir/exp/encoder.onnx $model_repo_path/encoder/1/encoder.trt || exit 1

sed -i "s|onnxruntime|tensorrt|g" $model_repo_path/encoder/config.pbtxt
sed -i "s|encoder.onnx|encoder.trt|g" $model_repo_path/encoder/config.pbtxt
fi

if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
tritonserver --model-repository=$model_repo_path --pinned-memory-pool-byte-size=512000000 --cuda-memory-pool-byte-size=0:1024000000 --http-port 10086
fi
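
Once stage 3 has launched the server, a quick hedged check that it is serving; port 10086 matches the `--http-port` above and both routes are standard Triton KServe v2 endpoints:
```bash
# Probe readiness and list the loaded models.
curl -s -o /dev/null -w "%{http_code}\n" localhost:10086/v2/health/ready
curl -s -X POST localhost:10086/v2/repository/index | python3 -m json.tool
```
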
44 changes: 0 additions & 44 deletions triton/zipformer/model_repo_offline/decoder/config.pbtxt.template

This file was deleted.
