Common SavedModel APIs for Text Tasks

This page describes how TF2 SavedModels for text-related tasks should implement the Reusable SavedModel API. (This replaces and extends the Common Signatures for Text for the now-deprecated TF1 Hub format.)


There are several APIs to compute text embeddings (also known as dense representations of text, or text feature vectors).

  • The API for text embeddings from text inputs is implemented by a SavedModel that maps a batch of strings to a batch of embedding vectors. This is very easy to use, and many models on TF Hub have implemented it. However, this does not allow fine-tuning the model on TPU.

  • The API for text embeddings with preprocessed inputs solves the same task, but is implemented by two separate SavedModels:

    • a preprocessor that can run inside an input pipeline and converts strings and other variable-length data into numeric Tensors,
    • an encoder that accepts the results of the preprocessor and performs the trainable part of the embedding computation.

    This split allows inputs to be preprocessed asynchronously before being fed into the training loop. In particular, it allows building encoders that can be run and fine-tuned on TPU.

  • The API for text embeddings with Transformer encoders extends the API for text embeddings from preprocessed inputs to the particular case of BERT and other Transformer encoders.

    • The preprocessor is extended to build encoder inputs from more than one segment of input text.
    • The Transformer encoder exposes the context-aware embeddings of individual tokens.

In each case, the text inputs are UTF-8 encoded strings, typically of plain text, unless the model documentation provides otherwise.

Regardless of API, different models have been pre-trained on text from different languages and domains, and with different tasks in mind. Therefore, not every text embedding model is suitable for every problem.

Text Embedding from Text Inputs

A SavedModel for text embeddings from text inputs accepts a batch of inputs in a string Tensor of shape [batch_size] and maps them to a float32 Tensor of shape [batch_size, dim] with dense representations (feature vectors) of the inputs.

Usage synopsis

import tensorflow_hub as hub

obj = hub.load("path/to/model")
text_input = ["A long sentence."]  # Batch of strings, shape [batch_size].
embeddings = obj(text_input)

Recall from the Reusable SavedModel API that running the model in training mode (e.g., for dropout) may require a keyword argument obj(..., training=True), and that obj provides attributes .variables, .trainable_variables and .regularization_losses as applicable.

In Keras, all this is taken care of by

embeddings = hub.KerasLayer("path/to/model", trainable=...)(text_input)

Distributed training

If the text embedding is used as part of a model that gets trained with a distribution strategy, the call to hub.load("path/to/model") or hub.KerasLayer("path/to/model", ...), respectively, must happen inside the DistributionStrategy scope so that the model's variables are created in the distributed way. For example:

  with strategy.scope():
    model = hub.load("path/to/model")


Text Embeddings with Preprocessed Inputs

A text embedding with preprocessed inputs is implemented by two separate SavedModels:

  • a preprocessor that maps a string Tensor of shape [batch_size] to a dict of numeric Tensors,
  • an encoder that accepts a dict of Tensors as returned by the preprocessor, performs the trainable part of the embedding computation, and returns a dict of outputs. The output under key "default" is a float32 Tensor of shape [batch_size, dim].

This allows running the preprocessor in an input pipeline while fine-tuning the embeddings computed by the encoder as part of a larger model. In particular, it allows building encoders that can be run and fine-tuned on TPU.

It is an implementation detail which Tensors are contained in the preprocessor's output, and which (if any) additional Tensors besides "default" are contained in the encoder's output.

The documentation of the encoder must specify which preprocessor to use with it. Typically, there is exactly one correct choice.
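The data flow between the two models can be pictured with toy stand-ins written in plain Python. This is only a sketch of the contract (strings in, dict of numeric features, dict of outputs with a "default" entry); the functions toy_preprocessor and toy_encoder and the key "toy_ids" are hypothetical and do not correspond to any published model.

```python
def toy_preprocessor(texts, seq_length=4):
    """Maps a batch of strings to a dict of fixed-size numeric features."""
    rows = []
    for text in texts:
        # Hypothetical "tokenization": one number per word (its length),
        # truncated and zero-padded to seq_length.
        row = [len(word) for word in text.split()][:seq_length]
        rows.append(row + [0] * (seq_length - len(row)))
    return {"toy_ids": rows}


def toy_encoder(encoder_inputs):
    """Maps the preprocessor's dict to a dict with a "default" entry of shape [batch_size, dim]."""
    default = [[float(sum(row)), float(max(row))]
               for row in encoder_inputs["toy_ids"]]
    return {"default": default}


encoder_outputs = toy_encoder(toy_preprocessor(["A long sentence.", "Hi"]))
embeddings = encoder_outputs["default"]
print(embeddings)  # [[14.0, 9.0], [2.0, 2.0]]
```

The caller never inspects the keys of the intermediate dict; it only passes the preprocessor's output on to the matching encoder and reads "default" from the result.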

Usage synopsis

import tensorflow as tf
import tensorflow_hub as hub

text_input = tf.constant(["A long sentence."])  # Batch of strings, shape [batch_size].
preprocessor = hub.load("path/to/preprocessor")  # Must match `encoder`.
encoder_inputs = preprocessor(text_input)

encoder = hub.load("path/to/encoder")
encoder_outputs = encoder(encoder_inputs)
embeddings = encoder_outputs["default"]

Recall from the Reusable SavedModel API that running the encoder in training mode (e.g., for dropout) may require a keyword argument encoder(..., training=True), and that encoder provides attributes .variables, .trainable_variables and .regularization_losses as applicable.

The preprocessor model may have .variables but is not meant to be trained further. Preprocessing is not mode-dependent: if preprocessor() has a training=... argument at all, it has no effect.

In Keras, all this is taken care of by

encoder_inputs = hub.KerasLayer("path/to/preprocessor")(text_input)
encoder_outputs = hub.KerasLayer("path/to/encoder", trainable=True)(encoder_inputs)
embeddings = encoder_outputs["default"]

Distributed training

If the encoder is used as part of a model that gets trained with a distribution strategy, the call to hub.load("path/to/encoder") or hub.KerasLayer("path/to/encoder", ...), resp., must happen inside

  with strategy.scope():

in order to re-create the encoder variables in the distributed way.

Likewise, if the preprocessor is part of the trained model (as in the simple example above), it also needs to be loaded under the distribution strategy scope. If, however, the preprocessor is used in an input pipeline (e.g., in a callable invoked by the pipeline), it must be loaded outside the distribution strategy scope, so that its variables (if any) are placed on the host CPU.


Text Embeddings with Transformer Encoders

Transformer encoders for text operate on a batch of input sequences, each sequence comprising n ≥ 1 segments of tokenized text, within some model-specific bound on n. For BERT and many of its extensions, that bound is 2, so they accept single segments and segment pairs.

The API for text embeddings with Transformer encoders extends the API for text embeddings with preprocessed inputs to this setting.


A preprocessor SavedModel for text embeddings with Transformer encoders implements the API of a preprocessor SavedModel for text embeddings with preprocessed inputs (see above), which provides a way to map single-segment text inputs directly to encoder inputs.

In addition, the preprocessor SavedModel provides callable subobjects tokenize for tokenization (separately per segment) and bert_pack_inputs for packing n tokenized segments into one input sequence for the encoder. Each subobject follows the Reusable SavedModel API.

Usage synopsis

As a concrete example for two segments of text, let us look at a sentence entailment task that asks whether a premise (first segment) does or does not imply a hypothesis (second segment).

preprocessor = hub.load("path/to/preprocessor")

# Tokenize batches of both text inputs.
text_premises = tf.constant(["The quick brown fox jumped over the lazy dog.",
                             "Good day."])
tokenized_premises = preprocessor.tokenize(text_premises)
text_hypotheses = tf.constant(["The dog was lazy.",  # Implied.
                               "Axe handle!"])       # Not implied.
tokenized_hypotheses = preprocessor.tokenize(text_hypotheses)

# Pack input sequences for the Transformer encoder.
seq_length = 128
encoder_inputs = preprocessor.bert_pack_inputs(
    [tokenized_premises, tokenized_hypotheses],
    seq_length=seq_length)  # Optional argument.

In Keras, this computation can be expressed as

tokenize = hub.KerasLayer(preprocessor.tokenize)
tokenized_hypotheses = tokenize(text_hypotheses)
tokenized_premises = tokenize(text_premises)

bert_pack_inputs = hub.KerasLayer(
    preprocessor.bert_pack_inputs,
    arguments=dict(seq_length=seq_length))  # Optional argument.
encoder_inputs = bert_pack_inputs([tokenized_premises, tokenized_hypotheses])

Details of tokenize

A call to preprocessor.tokenize() accepts a string Tensor of shape [batch_size] and returns a RaggedTensor of shape [batch_size, ...] whose values are int32 token ids representing the input strings. There can be r ≥ 1 ragged dimensions after batch_size but no other uniform dimension.

  • If r=1, the shape is [batch_size, (tokens)], and each input is simply tokenized into a flat sequence of tokens.
  • If r>1, there are r-1 additional levels of grouping. For example, tensorflow_text.BertTokenizer uses r=2 to group tokens by words and yields shape [batch_size, (words), (tokens_per_word)]. It is up to the model at hand how many of these extra level(s) exist, if any, and what groupings they represent.
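The two cases can be illustrated with nested Python lists standing in for a RaggedTensor. This is a toy sketch, not the behavior of any real tokenizer: TOY_VOCAB, the "|" marker used to pre-split words into pieces, and both helper functions are hypothetical.

```python
# Hypothetical vocabulary mapping word pieces to int token ids.
TOY_VOCAB = {"un": 1, "##believ": 2, "##able": 3, "good": 4, "day": 5}


def tokenize_r2(batch):
    """Returns nested lists shaped [batch_size, (words), (tokens_per_word)], i.e. r=2."""
    # Words are separated by whitespace; the pieces of each word are
    # separated by "|" purely for the sake of this illustration.
    return [[[TOY_VOCAB[piece] for piece in word.split("|")]
             for word in text.split()]
            for text in batch]


def flatten_to_r1(tokenized_r2):
    """Drops the word level: [batch_size, (words), (tokens_per_word)] -> [batch_size, (tokens)]."""
    return [[token for word in words for token in word]
            for words in tokenized_r2]


r2 = tokenize_r2(["un|##believ|##able", "good day"])
r1 = flatten_to_r1(r2)
print(r2)  # [[[1, 2, 3]], [[4], [5]]]
print(r1)  # [[1, 2, 3], [4, 5]]
```

Both results have the same batch dimension; only the number of ragged levels after it differs.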

The user can (but need not) modify tokenized inputs, e.g., to accommodate the seq_length limit that will be enforced in packing encoder inputs. Extra dimensions in the tokenizer output can help here (e.g., to respect word boundaries) but become meaningless in the next step.
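As one possible example of such a modification, the sketch below shortens a single tokenized input (nested lists shaped [(words), (tokens_per_word)]) to a token budget without cutting a word in half. The function truncate_whole_words and the budget max_tokens are hypothetical names chosen for this illustration; a real pipeline would apply an equivalent operation to RaggedTensors.

```python
def truncate_whole_words(words, max_tokens):
    """Keeps the longest prefix of whole words whose total token count fits max_tokens."""
    kept, used = [], 0
    for word in words:
        if used + len(word) > max_tokens:
            break  # Adding this word would exceed the budget; stop at the word boundary.
        kept.append(word)
        used += len(word)
    return kept


tokenized = [[7, 8], [9], [10, 11, 12], [13]]  # Four words, seven tokens.
truncated = truncate_whole_words(tokenized, max_tokens=4)
print(truncated)  # [[7, 8], [9]]
```

The third word is dropped entirely instead of being cut after its first token, which is the kind of decision the extra grouping level makes possible.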

In terms of the Reusable SavedModel API, the preprocessor.tokenize object may have .variables but is not meant to be trained further. Tokenization is not mode-dependent: if preprocessor.tokenize() has a training=... argument at all, it has no effect.

Details of bert_pack_inputs

A call to preprocessor.bert_pack_inputs() accepts a Python list of tokenized inputs (batched separately for each input segment) and returns a dict of Tensors representing a batch of fixed-length input sequences for the Transformer encoder model.

Each tokenized input is an int32 RaggedTensor of shape [batch_size, ...], where the number r of ragged dimensions after batch_size is either 1 or the same as in the output of preprocessor.tokenize(). (The latter is for convenience only; the extra dimensions are flattened out before packing.)

Packing adds special tokens around the input segments as expected by the encoder. The bert_pack_inputs() call implements exactly the packing scheme used by the original BERT models and many of their extensions: the packed sequence starts with one start-of-sequence token, followed by the tokenized segments, each terminated by one end-of-segment token. Remaining positions up to seq_length, if any, are filled up with padding tokens.

If a packed sequence would exceed seq_length, bert_pack_inputs() truncates its segments to prefixes of approximately equal sizes so that the packed sequence fits exactly within seq_length.
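The packing and truncation rules can be sketched in plain Python for a single example (no batch dimension). The special token ids below are assumptions for illustration, and the trimming policy (repeatedly shortening the currently longest segment) is just one way to obtain prefixes of approximately equal size; the policy of a particular preprocessor may differ in detail.

```python
# Hypothetical ids for the start-of-sequence, end-of-segment and padding tokens.
START_ID, END_ID, PAD_ID = 101, 102, 0


def trim_segments(segments, budget):
    """Shortens segments to prefixes whose total length is at most budget."""
    lengths = [len(segment) for segment in segments]
    while sum(lengths) > budget:
        # Remove one token from the currently longest segment (first one on ties).
        longest = max(range(len(lengths)), key=lambda i: lengths[i])
        lengths[longest] -= 1
    return [segment[:n] for segment, n in zip(segments, lengths)]


def pack_one(segments, seq_length):
    """Packs tokenized segments into one sequence of exactly seq_length ids."""
    budget = seq_length - 1 - len(segments)  # Room left after the special tokens.
    if budget < 0:
        raise ValueError("seq_length is too small for the special tokens")
    packed = [START_ID]
    for segment in trim_segments(segments, budget):
        packed.extend(segment)
        packed.append(END_ID)
    packed.extend([PAD_ID] * (seq_length - len(packed)))
    return packed


padded = pack_one([[11, 12], [21]], seq_length=8)
trimmed = pack_one([[11, 12, 13, 14, 15], [21, 22, 23]], seq_length=8)
print(padded)   # [101, 11, 12, 102, 21, 102, 0, 0]
print(trimmed)  # [101, 11, 12, 102, 21, 22, 23, 102]
```

In the second call the eight input tokens do not fit next to the three special tokens, so the segments are cut to prefixes of lengths 2 and 3 and the packed sequence fills seq_length exactly.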

Packing is not mode-dependent: if preprocessor.bert_pack_inputs() has a training=... argument at all, it has no effect. Also, preprocessor.bert_pack_inputs is not expected to have variables, or support fine-tuning.


The encoder is called on the dict of encoder_inputs in the same way as in the API for text embeddings with preprocessed inputs (see above), including the provisions from the Reusable SavedModel API.

Usage synopsis

encoder = hub.load("path/to/encoder")
encoder_outputs = encoder(encoder_inputs)

or equivalently in Keras:

encoder = hub.KerasLayer("path/to/encoder", trainable=True)
encoder_outputs = encoder(encoder_inputs)


The encoder_outputs are a dict of Tensors with the following keys.

  • "sequence_output": a float32 Tensor of shape [batch_size, seq_length, dim] with the context-aware embedding of each token of every packed input sequence.
  • "pooled_output": a float32 Tensor of shape [batch_size, dim] with the embedding of each input sequence as a whole, derived from sequence_output in some trainable manner.
  • "default", as required by the API for text embeddings with preprocessed inputs: a float32 Tensor of shape [batch_size, dim] with the embedding of each input sequence. (This might be just an alias of pooled_output.)
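A consumer of these outputs may want to verify the documented shapes before building on them. The following sketch does so for nested Python lists; shape_of and check_encoder_outputs are hypothetical helpers, and with real Tensors one would compare their .shape attributes instead.

```python
def shape_of(nested):
    """Returns the shape of a rectangular nested list, e.g. [[1, 2, 3]] -> [1, 3]."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0] if nested else None
    return shape


def check_encoder_outputs(outputs, batch_size, seq_length):
    """Raises ValueError if the documented shapes do not hold; returns dim otherwise."""
    seq_shape = shape_of(outputs["sequence_output"])
    if len(seq_shape) != 3 or seq_shape[:2] != [batch_size, seq_length]:
        raise ValueError(f"unexpected sequence_output shape {seq_shape}")
    dim = seq_shape[2]
    for key in ("pooled_output", "default"):
        if shape_of(outputs[key]) != [batch_size, dim]:
            raise ValueError(f"unexpected {key} shape {shape_of(outputs[key])}")
    return dim


toy_pooled = [[0.1, 0.2], [0.3, 0.4]]
toy_outputs = {
    "sequence_output": [[[0.0, 0.0]] * 3, [[1.0, 1.0]] * 3],  # [2, 3, 2]
    "pooled_output": toy_pooled,
    "default": toy_pooled,  # Here simply an alias of pooled_output.
}
dim = check_encoder_outputs(toy_outputs, batch_size=2, seq_length=3)
print(dim)  # 2
```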

The contents of the encoder_inputs are not strictly required by this API definition. However, for encoders that use BERT-style inputs, it is recommended to use the following names (from the NLP Modeling Toolkit of TensorFlow Model Garden) to minimize friction in interchanging encoders and reusing preprocessor models:

  • "input_word_ids": an int32 Tensor of shape [batch_size, seq_length] with the token ids of the packed input sequence (that is, including a start-of-sequence token, end-of-segment tokens, and padding).
  • "input_mask": an int32 Tensor of shape [batch_size, seq_length] with value 1 at the position of all input tokens present before padding and value 0 for the padding tokens.
  • "input_type_ids": an int32 Tensor of shape [batch_size, seq_length] with the index of the input segment that gave rise to the input token at the respective position. The first input segment (index 0) includes the start-of-sequence token and its end-of-segment token. The second and later segments (if present) include their respective end-of-segment token. Padding tokens get index 0 again.
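To make the three inputs concrete, the sketch below builds them for a single example (one row of the batch) from already-truncated token id lists. The special token ids and the function build_encoder_inputs are assumptions made for this illustration, not part of the API.

```python
# Hypothetical ids for the start-of-sequence, end-of-segment and padding tokens.
START_ID, END_ID, PAD_ID = 101, 102, 0


def build_encoder_inputs(segments, seq_length):
    """Builds one row of each BERT-style encoder input from tokenized segments."""
    word_ids, type_ids = [START_ID], [0]  # The start token belongs to segment 0.
    for index, segment in enumerate(segments):
        word_ids += list(segment) + [END_ID]
        type_ids += [index] * (len(segment) + 1)  # Includes the end-of-segment token.
    if len(word_ids) > seq_length:
        raise ValueError("segments must be truncated before packing")
    num_real = len(word_ids)
    padding = seq_length - num_real
    return {
        "input_word_ids": word_ids + [PAD_ID] * padding,
        "input_mask": [1] * num_real + [0] * padding,
        "input_type_ids": type_ids + [0] * padding,  # Padding gets index 0 again.
    }


row = build_encoder_inputs([[11, 12], [21]], seq_length=8)
print(row["input_word_ids"])  # [101, 11, 12, 102, 21, 102, 0, 0]
print(row["input_mask"])      # [1, 1, 1, 1, 1, 1, 0, 0]
print(row["input_type_ids"])  # [0, 0, 0, 0, 1, 1, 0, 0]
```

Note that input_type_ids alone cannot distinguish padding from the first segment; that is what input_mask is for.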

Distributed training

For loading the preprocessor and encoder objects inside or outside a distribution strategy scope, the same rules apply as in the API for text embeddings with preprocessed inputs (see above).
