I am trying to use Galactica as the embedding model for BERTopic. I have tried a variety of methods to load and use the model, but each one has produced an error.
Using the transformers library

I first tried to use the pipeline method from the transformers library with the following usage:
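A minimal sketch of that kind of setup (the feature-extraction task and the facebook/galactica-125m checkpoint are assumptions for illustration; the exact snippet may differ):

```python
# Sketch of the pipeline-based setup; the checkpoint size and task are assumptions.
from transformers import pipeline
from bertopic import BERTopic

# Wrap a Galactica checkpoint in a Hugging Face pipeline.
galactica_model = pipeline(
    task="feature-extraction",
    model="facebook/galactica-125m",  # assumed checkpoint size
)

# Hand the pipeline to BERTopic as its embedding backend.
topics, probs = BERTopic(
    embedding_model=galactica_model, nr_topics="auto", verbose=True
).fit_transform(docs)  # docs is a list of document strings
```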
This results in the following TypeError:
TypeError Traceback (most recent call last)
in
1 # NOT WORKING
----> 2 topics, probs = BERTopic(embedding_model=galactica_model ,nr_topics='auto', verbose = True).fit_transform(docs)
9 frames
/usr/local/lib/python3.8/dist-packages/bertopic/_bertopic.py in fit_transform(self, documents, embeddings, y)
337 self.embedding_model = select_backend(self.embedding_model,
338 language=self.language)
--> 339 embeddings = self._extract_embeddings(documents.Document,
340 method="document",
341 verbose=self.verbose)
/usr/local/lib/python3.8/dist-packages/bertopic/_bertopic.py in _extract_embeddings(self, documents, method, verbose)
2785 embeddings = self.embedding_model.embed_words(documents, verbose)
2786 elif method == "document":
-> 2787 embeddings = self.embedding_model.embed_documents(documents, verbose)
2788 else:
2789 raise ValueError("Wrong method for extracting document/word embeddings. "
/usr/local/lib/python3.8/dist-packages/bertopic/backend/_base.py in embed_documents(self, document, verbose)
67 that each have an embeddings size of `m`
68 """
---> 69 return self.embed(document, verbose)
/usr/local/lib/python3.8/dist-packages/bertopic/backend/_hftransformers.py in embed(self, documents, verbose)
58
59 embeddings = []
---> 60 for document, features in tqdm(zip(documents, self.embedding_model(dataset, truncation=True, padding=True)),
61 total=len(dataset), disable=not verbose):
62 embeddings.append(self._embed(document, features))
/usr/local/lib/python3.8/dist-packages/tqdm/std.py in __iter__(self)
1193
1194 try:
-> 1195 for obj in iterable:
1196 yield obj
1197 # Update and possibly print the progressbar.
/usr/local/lib/python3.8/dist-packages/transformers/pipelines/pt_utils.py in __next__(self)
122
123 # We're out of items within a batch
--> 124 item = next(self.iterator)
125 processed = self.infer(item, **self.params)
126 # We now have a batch of "inferred things".
/usr/local/lib/python3.8/dist-packages/transformers/pipelines/pt_utils.py in __next__(self)
123 # We're out of items within a batch
124 item = next(self.iterator)
--> 125 processed = self.infer(item, **self.params)
126 # We now have a batch of "inferred things".
127 if self.loader_batch_size is not None:
/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py in forward(self, model_inputs, **forward_params)
988 with inference_context():
989 model_inputs = self._ensure_tensor_on_device(model_inputs, device=self.device)
--> 990 model_outputs = self._forward(model_inputs, **forward_params)
991 model_outputs = self._ensure_tensor_on_device(model_outputs, device=torch.device("cpu"))
992 else:
/usr/local/lib/python3.8/dist-packages/transformers/pipelines/feature_extraction.py in _forward(self, model_inputs)
81
82 def _forward(self, model_inputs):
---> 83 model_outputs = self.model(**model_inputs)
84 return model_outputs
85
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
TypeError: forward() got an unexpected keyword argument 'token_type_ids'
Using galai load_model

When I tried to use the load_model function built into galai, the loading output indicated that the model was downloaded properly. However, the verbose output of BERTopic indicated that the default embedding model that comes with BERTopic was used instead of the loaded Galactica model. More examples of this can be seen in the GitHub issue I raised with BERTopic here. The author of the package indicated that using pipeline from the transformers library was the proper way to use language models from Hugging Face.
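A sketch of that attempt (the "mini" model size is an assumption; the point is that the galai model object is passed directly to BERTopic):

```python
# Sketch of the galai-based attempt; the "mini" size is an assumption.
import galai as gal
from bertopic import BERTopic

# Downloads the checkpoint on first use and prints loading output.
galactica_model = gal.load_model("mini")

# Passing the galai model object straight to BERTopic; the verbose log showed
# BERTopic falling back to its default sentence-transformers embedding model.
topics, probs = BERTopic(
    embedding_model=galactica_model, nr_topics="auto", verbose=True
).fit_transform(docs)
```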
Using flair

Finally, I tried to load the model via flair. Similar to the results when I used the transformers library, I was met with an error when fitting, but this time it was a ValueError.
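A sketch of the flair-based attempt (the TransformerDocumentEmbeddings wrapper and the checkpoint name are assumptions; only the flair_gal variable and the fit_transform call appear in the traceback below):

```python
# Sketch of the flair-based attempt; embedding class and checkpoint are assumptions.
from flair.embeddings import TransformerDocumentEmbeddings
from bertopic import BERTopic

# Wrap the Galactica checkpoint in a flair document embedding.
flair_gal = TransformerDocumentEmbeddings("facebook/galactica-125m")

# Fit BERTopic with the flair embedding as its backend.
galactica_topics, galactica_probs = BERTopic(
    embedding_model=flair_gal, nr_topics="auto", verbose=True
).fit_transform(clean_docs)  # clean_docs is a list of preprocessed strings
```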
This usage resulted in the following ValueError:

0%| | 0/499 [00:00<?, ?it/s]Using pad_token, but it is not set yet.
0%| | 0/499 [00:00<?, ?it/s]
ValueError Traceback (most recent call last)
in
1 # NOT WORKING
----> 2 galactica_topics, galactica_probs = BERTopic(embedding_model=flair_gal ,nr_topics='auto', verbose = True).fit_transform(clean_docs)
10 frames
/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py in _get_padding_truncation_strategies(self, padding, truncation, max_length, pad_to_multiple_of, verbose, **kwargs)
2423 # Test if we have a padding token
2424 if padding_strategy != PaddingStrategy.DO_NOT_PAD and (not self.pad_token or self.pad_token_id < 0):
-> 2425 raise ValueError(
2426 "Asking to pad but the tokenizer does not have a padding token. "
2427 "Please select a token to use as
pad_token
(tokenizer.pad_token = tokenizer.eos_token e.g.)
"ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as
pad_token
(tokenizer.pad_token = tokenizer.eos_token e.g.)
or add a new pad token viatokenizer.add_special_tokens({'pad_token': '[PAD]'})
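For reference, the suggestion in that message looks roughly like this in code (the checkpoint name is an assumption, and whether the Galactica tokenizer defines an eos_token to reuse may vary):

```python
# Illustrating the error message's suggestion; the checkpoint name is an assumption.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-125m")

# Register a dedicated padding token (a model using it would also need its
# embeddings resized to cover the new token)...
tokenizer.add_special_tokens({"pad_token": "[PAD]"})

# ...or, if the tokenizer already defines an end-of-sequence token, reuse it:
# tokenizer.pad_token = tokenizer.eos_token
```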