Commit e1579c1

Fixes GitHub issue #158. This replaces the ICU character class for whitespace (WordShape.IS_WHITESPACE) with the plain regex \s+ and drops the :wordshape_ops dependency from the BUILD file.
If we want to support ICU character classes, we need to add the RE2_USE_ICU build flag and link ICU into the regex ops. I have a working patch, but am not convinced it is worth submitting.
PiperOrigin-RevId: 279791264
1 parent 17370ff commit e1579c1
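
Note on what this change may affect: as far as I know, RE2 built without ICU treats `\s` as the ASCII set `[\t\n\f\r ]`, so non-ASCII whitespace is presumably no longer matched by the whitespace delimiter. The sketch below illustrates that difference using Python's standard `re` module as a stand-in. It is not the RE2-backed regex op the tokenizer uses, and the helper name `is_whitespace_run` is hypothetical.

```python
import re

# Stand-ins only: Python's `re` approximates the two behaviours.
# With re.ASCII, \s is limited to ASCII whitespace (Python also includes \v,
# which RE2's \s does not, so \v is avoided in the checks below).
ASCII_WS = re.compile(r"\s+", re.ASCII)
# Without re.ASCII, \s on str patterns also matches Unicode whitespace.
UNICODE_WS = re.compile(r"\s+")


def is_whitespace_run(pattern, text):
    """Return True if `text` consists entirely of whitespace per `pattern`."""
    return pattern.fullmatch(text) is not None


# ASCII whitespace is matched either way.
print(is_whitespace_run(ASCII_WS, " \t\n"))      # True
print(is_whitespace_run(UNICODE_WS, " \t\n"))    # True

# U+3000 IDEOGRAPHIC SPACE is matched only by the Unicode-aware pattern.
print(is_whitespace_run(ASCII_WS, "\u3000"))     # False
print(is_whitespace_run(UNICODE_WS, "\u3000"))   # True
```

Whether this matters for BertTokenizer in practice depends on how text is normalized before the delimiter regex is applied, which this commit does not show.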

2 files changed, +2 -4 lines changed

tensorflow_text/BUILD

+0 -1
@@ -96,7 +96,6 @@ py_library(
         ":tokenization",
         ":unicode_script_tokenizer",
         ":wordpiece_tokenizer",
-        ":wordshape_ops",
         # python:array_ops tensorflow dep,
         # python:dtypes tensorflow dep,
         # python:math_ops tensorflow dep,

tensorflow_text/python/ops/bert_tokenizer.py

+2 -3
@@ -28,11 +28,10 @@
 from tensorflow_text.python.ops.tokenization import Tokenizer
 from tensorflow_text.python.ops.tokenization import TokenizerWithOffsets
 from tensorflow_text.python.ops.wordpiece_tokenizer import WordpieceTokenizer
-from tensorflow_text.python.ops.wordshape_ops import WordShape


 _DELIM_REGEX = [
-    WordShape.IS_WHITESPACE.value,
+    r"\s+",
     r"|".join([
         r"[!-/]",
         r"[:-@]",
@@ -54,7 +53,7 @@

 _DELIM_REGEX_PATTERN = "|".join(_DELIM_REGEX)
 _KEEP_DELIM_NO_WHITESPACE = copy.deepcopy(_DELIM_REGEX)
-_KEEP_DELIM_NO_WHITESPACE.remove(WordShape.IS_WHITESPACE.value)
+_KEEP_DELIM_NO_WHITESPACE.remove(r"\s+")

 _KEEP_DELIM_NO_WHITESPACE_PATTERN = "|".join(_KEEP_DELIM_NO_WHITESPACE)