
hidden state of the eos token? #36

Open
katzurik opened this issue Nov 21, 2022 · 3 comments

Comments

@katzurik

katzurik commented Nov 21, 2022

Does the Galactica model output the hidden state of the EOS token? Would it be possible to get it somehow, either through Hugging Face's codebase or the original implementation, in a similar manner to OPT when doing sequence classification?

@Puzer

Puzer commented Nov 24, 2022

I'm using the approach below.
You can also compute the mean of the last hidden state, but don't forget to apply L2 normalization afterwards. It might work better than the EOS embedding for some use cases.

from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-1.3b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-1.3b", device_map="auto")

def get_embedding(s):
    # Append the EOS token; its final hidden state serves as the sentence embedding.
    input_text = s + "</s>"
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
    outputs = model.generate(input_ids, max_new_tokens=0,
                             return_dict_in_generate=True, output_hidden_states=True)

    # hidden_states[0] is the forward pass over the prompt; [-1] selects the last layer;
    # [0, -1] selects batch item 0 at the last position (the EOS token).
    sentence_representation = outputs["hidden_states"][0][-1][0, -1].to("cpu").numpy()
    return sentence_representation
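
A minimal sketch of that mean-pooling variant, reusing the tokenizer and model loaded above (the function name and the use of a plain forward pass instead of generate are my own choices, not part of the original snippet):

import numpy as np
import torch

def get_mean_embedding(s):
    # Same input format as get_embedding above.
    input_ids = tokenizer(s + "</s>", return_tensors="pt").input_ids.to("cuda")
    with torch.no_grad():
        outputs = model(input_ids, output_hidden_states=True)

    # outputs.hidden_states[-1] has shape (batch, seq_len, hidden_dim);
    # average over the sequence dimension instead of taking only the EOS position.
    mean_vec = outputs.hidden_states[-1][0].mean(dim=0).to("cpu").numpy()

    # Apply L2 normalization, as suggested above.
    return mean_vec / np.linalg.norm(mean_vec)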

@katzurik
Author

@Puzer Thanks. What's your take on the quality of the sentence representations produced by this method? I'm not sure the model manages to do that very well.

@Puzer

Puzer commented Nov 28, 2022

@Puzer Thanks. What's your take on the quality of the sentence representations produced by this method? I'm not sure the model manages to do that very well.

I tried to embed arXiv papers by their titles and then train a linear model to classify tags (a rough sketch of this setup follows the table):

Model                  F1 macro   AUC-ROC mean
galactica-6.7b         0.749      0.806
all-mpnet-base-v1      0.744      0.799
all-roberta-large-v1   0.738      0.8
galactica-1.3b         0.722      0.796
tf-idf                 0.697      0.763
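
For context, a rough sketch of this kind of linear probe, assuming a list of titles and a single binary labels array for one tag (the real setup is multi-label over many tags, and the choice of scikit-learn's LogisticRegression is my assumption):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical inputs: paper titles and a 0/1 label for one tag.
X = np.stack([get_embedding(t) for t in titles])
y = np.array(labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("F1:     ", f1_score(y_test, clf.predict(X_test)))
print("AUC-ROC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))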

I also tried to use this for semantic search of similar papers using cosine similarity of the embeddings. I don't have any metrics for that, but the results from all-mpnet-base-v1 subjectively look more reasonable to me.
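
A minimal cosine-similarity search over such embeddings might look like this (the embeddings matrix and titles list are assumed to be built with get_embedding above):

import numpy as np

def most_similar(query, embeddings, titles, top_k=5):
    # embeddings: (n_papers, hidden_dim) array of paper-title embeddings.
    q = get_embedding(query)
    # Cosine similarity = dot product of L2-normalized vectors.
    emb_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = emb_norm @ (q / np.linalg.norm(q))
    top = np.argsort(-scores)[:top_k]
    return [(titles[i], float(scores[i])) for i in top]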
