Skip to content

2.3.0

Compare
Choose a tag to compare
@broken broken released this 28 Jul 22:33

Release 2.3.0

Major Features and Improvements

  • Added UnicodeCharacterTokenizer
  • Tokenizers are now tf.Modules and can be saved from within Keras layers.

Bug Fixes and Other Changes

  • Allow wordpiece_tokenizer to output int32 tokens natively.
  • Tracks the Sentencepiece model resource via a TrackableResource.
  • oss-segmenter:
    • fix end-offset error in split_merge_tokenizer_kernel.
  • TensorFlow text python ops wordshape:
    • More comprehensive emoji handling
  • Other:
    • Unref lookup_table in wordpiece_kernel fixing a possible memory leak.
    • Add missing LICENSE file for third_party/tensorflow_text/core/kernels.
    • add normalize kernals test
    • Fix Sentencepiece tests.
    • Add some metric logs to tokenizers.
    • Fix documentation formatting for SplitMergeTokenizer
    • Bug fix: make sure tokenize() method does not ignore itself.
    • Improve logging efficiency.
    • Update tf.text's regression test model for model server. Without the asserts, errors are erroneously swallowed by tensorflow. I also added tf.unicode_script test just to ensure that ICU is working correctly from within model server.
    • Add the ability to define a user-defined destination directory to make testing easier.
    • Fix typo in documentation of BertTokenizer
    • Clarify docstring of UnicodeScriptTokenizer about splitting on space
    • Add executable flag to the run_build.sh script.
    • Clarify docstring of WordpieceTokenizer on unknown_token:
    • Update protobuf library and point HEAD to build on tf 2.3.0-rc0

Thanks to our Contributors