Add more tests for pre-processing C APIs (microsoft#793)
* initial api for tokenizer
* More fixes and test data refinement
* add a simple wrapper for pre-processing APIs
* fix the test issues
* test if the tokenizer is spm based
* fix the failed test cases
* json pointer does not work
docs/c_api.md (+10 −1)
@@ -18,4 +18,13 @@ Most APIs accept raw data inputs such as audio, image compressed binary formats,
**Audio feature extraction:**`OrtxCreateSpeechFeatureExtractor` creates a speech feature extractor to obtain log mel spectrum data as input for the Whisper model. An example code snippet can be found [here](../test/pp_api_test/test_feature_extraction.cc#L16).
**NB:** To build onnxruntime-extensions as a shared library with full functionality, the OCOS_ENABLE_AUDIO, OCOS_ENABLE_CV2, OCOS_ENABLE_OPENCV_CODECS, and OCOS_ENABLE_GPT2_TOKENIZER build flags must be ON. Only the onnxruntime-extensions static library supports a minimal build with selected operators; in that case, the shared-library build can be switched off by `-DOCOS_BUILD_SHARED_LIB=OFF`.
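A minimal-build configure step might look like the following sketch. This is illustrative only: the source and build paths are placeholders, and only the `OCOS_*` flags named above are taken from this document.

```shell
# Sketch: configure a minimal static-library build of onnxruntime-extensions.
# OCOS_BUILD_SHARED_LIB=OFF selects the static library; the OCOS_ENABLE_*
# flags are the ones listed above, toggled here for a selected-operator build.
cmake -S . -B build \
  -DOCOS_BUILD_SHARED_LIB=OFF \
  -DOCOS_ENABLE_GPT2_TOKENIZER=ON \
  -DOCOS_ENABLE_AUDIO=OFF \
  -DOCOS_ENABLE_CV2=OFF \
  -DOCOS_ENABLE_OPENCV_CODECS=OFF
cmake --build build --config Release
```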
There is a simple Python wrapper over these C APIs in [pp_api](../onnxruntime_extensions/pp_api.py), which gives easy access to them from Python code, for example:
```Python
from onnxruntime_extensions.pp_api import Tokenizer
# the name can be the same one used by Huggingface transformers.AutoTokenizer
pp_tok = Tokenizer('google/gemma-2-2b')
print(pp_tok.tokenize("what are you? \n 给 weiss ich, über was los ist \n"))
```

docs/development.md (+1)
@@ -16,6 +16,7 @@ The package contains all custom operators and some Python scripts to manipulate
- no-azure: disable AzureOp kernel build in Python package.
- no-opencv: disable operators based on OpenCV in build.
- cc-debug: generate debug info for extensions binaries and disable C/C++ compiler optimization.
- pp_api: enable pre-processing C ABI Python wrapper, `from onnxruntime_extensions.pp_api import *`
- cuda-archs: specify the CUDA architectures (like 70, 85, etc.); multiple values can be combined with a semicolon. The default is taken from the `nvidia-smi` output for GPU-0.
- ort\_pkg\_dir: specify the ONNXRuntime package directory that the extension project depends on. This is helpful if you want to use a recent ONNXRuntime function that has not yet been included in an official build.
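For the `pp_api` option above, a quick way to check whether an installed wheel was built with it is to try the import named in the bullet (a minimal sketch; only the module path comes from this document):

```python
# Probe whether this onnxruntime-extensions build enables the pre-processing
# C ABI Python wrapper; the pp_api module is absent when the option is off.
try:
    import onnxruntime_extensions.pp_api  # noqa: F401
    pp_api_enabled = True
except ImportError:
    pp_api_enabled = False

print("pp_api enabled:", pp_api_enabled)
```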