2.3.0

broken released this 28 Jul 22:33

fa898b7

Release 2.3.0

Major Features and Improvements

Added UnicodeCharacterTokenizer
Tokenizers are now tf.Modules and can be saved from within Keras layers.
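
For readers unfamiliar with character-level tokenization, the idea can be sketched in plain Python. This is a conceptual illustration only, not the library's implementation or API: the function name `unicode_char_tokenize_with_offsets` is hypothetical, and the choice to report offsets in UTF-8 bytes is an assumption made for the example.

```python
def unicode_char_tokenize_with_offsets(text):
    """Split text into Unicode code points with UTF-8 byte offsets.

    Hypothetical helper for illustration; not part of tensorflow_text.
    """
    codepoints, starts, ends = [], [], []
    offset = 0
    for ch in text:
        width = len(ch.encode("utf-8"))  # bytes this character occupies
        codepoints.append(ord(ch))       # integer code point of the character
        starts.append(offset)            # inclusive start byte offset
        ends.append(offset + width)      # exclusive end byte offset
        offset += width
    return codepoints, starts, ends


# "h", "é" (U+00E9), and a smiling-face emoji (U+1F60A)
codepoints, starts, ends = unicode_char_tokenize_with_offsets("h\u00e9\U0001F60A")
print(codepoints)  # [104, 233, 128522]
print(starts)      # [0, 1, 3]
print(ends)        # [1, 3, 7]
```

Each token is a single code point, so multi-byte characters advance the byte offset by more than one.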

Bug Fixes and Other Changes

Allow wordpiece_tokenizer to output int32 tokens natively.
Tracks the Sentencepiece model resource via a TrackableResource.
oss-segmenter:
- Fix end-offset error in split_merge_tokenizer_kernel.
TensorFlow Text Python ops wordshape:
- More comprehensive emoji handling
Other:
- Unref lookup_table in wordpiece_kernel fixing a possible memory leak.
- Add missing LICENSE file for third_party/tensorflow_text/core/kernels.
- Add normalize kernels test.
- Fix Sentencepiece tests.
- Add some metric logs to tokenizers.
- Fix documentation formatting for SplitMergeTokenizer
- Bug fix: make sure tokenize() method does not ignore itself.
- Improve logging efficiency.
- Update tf.text's regression test model for model server. Without the asserts, errors are erroneously swallowed by TensorFlow. I also added a tf.unicode_script test just to ensure that ICU is working correctly from within model server.
- Add the ability to define a user-defined destination directory to make testing easier.
- Fix typo in documentation of BertTokenizer
- Clarify docstring of UnicodeScriptTokenizer about splitting on space
- Add executable flag to the run_build.sh script.
- Clarify docstring of WordpieceTokenizer on unknown_token:
- Update protobuf library and point HEAD to build on tf 2.3.0-rc0
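
As background for the WordpieceTokenizer items above, the role of an unknown token can be sketched with a greedy longest-match-first splitter in plain Python. This is a simplified illustration, not the library's kernel: the function `wordpiece_tokenize`, the toy vocabulary, and the defaults `"[UNK]"` and `"##"` are assumptions chosen for the example.

```python
def wordpiece_tokenize(word, vocab, unknown_token="[UNK]", suffix_indicator="##"):
    """Greedy longest-match-first split of one word into subword pieces.

    Hypothetical helper for illustration; not part of tensorflow_text.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the suffix indicator.
                candidate = suffix_indicator + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches: the whole word maps to the unknown token.
            return [unknown_token]
        pieces.append(match)
        start = end
    return pieces


vocab = {"un", "##want", "##ed"}  # toy vocabulary for the example
print(wordpiece_tokenize("unwanted", vocab))  # ['un', '##want', '##ed']
print(wordpiece_tokenize("xyz", vocab))       # ['[UNK]']
```

In this sketch a word that cannot be fully covered by vocabulary pieces is replaced as a whole by the unknown token rather than being partially split.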

Thanks to our Contributors
