Is it expected the same sentence gives different features? #115

Open · towr opened this issue May 5, 2021 · 0 comments

towr commented May 5, 2021

I'm a bit puzzled by something I encountered while trying to encode sentences as embeddings: when I ran the sentences through the model one at a time, I got slightly different results than when I ran them in batches.

I've reduced an example down to:

from transformers import pipeline
import numpy as np

p = pipeline('feature-extraction', model='allenai/scibert_scivocab_uncased')
s = 'the scurvy dog walked home alone'.split()

# grow the sentence one word at a time and compare three runs:
# the same text twice on its own, plus once as a batch of two identical texts
for l in range(1, len(s) + 1):
    txt = ' '.join(s[:l])

    res1 = p(txt)           # single text, first run
    res2 = p(txt)           # single text, second run
    res1_2 = p([txt, txt])  # the same text twice, as one batch
    print(l, txt, len(res1[0]))
    print(all(np.allclose(i, j) for i, j in zip(res1[0], res2[0])),      # run 1 vs run 2
          all(np.allclose(i, j) for i, j in zip(res2[0], res1_2[0])),    # single run vs first batch element
          all(np.allclose(i, j) for i, j in zip(res1_2[0], res1_2[1])))  # first vs second batch element

The output I get is:

1 the 3
True False True
2 the scurvy 6
True True True
3 the scurvy dog 7
True False False
4 the scurvy dog walked 9
True False True
5 the scurvy dog walked home 10
True True False
6 the scurvy dog walked home alone 11
True True True

So running a single sentence through the model seems to give the same output each time, but if I run a batch containing the same sentence twice, the results are sometimes different, both between the two batch outputs and compared to the single-sentence case.

Is this expected/explainable?
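
For reference, here is a minimal sketch of the same kind of check done with the tokenizer and model directly rather than through the pipeline, just to illustrate what I mean by running the same sentence as a batch of two (this uses the standard AutoTokenizer/AutoModel API, nothing pipeline-specific):

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')
model.eval()

txt = 'the scurvy dog walked home alone'
with torch.no_grad():
    single = model(**tok(txt, return_tensors='pt')).last_hidden_state          # batch of 1
    batched = model(**tok([txt, txt], return_tensors='pt')).last_hidden_state  # batch of 2

print(torch.allclose(single[0], batched[0]),   # single run vs first batch element
      torch.allclose(batched[0], batched[1]))  # the two identical batch elements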

Further context: I'm running on CPU (a laptop), with Python 3.8.9 in a freshly installed venv.
The difference is usually confined to a few indices of the embedding and can be up to about 1e-3. It's negligible when comparing the embeddings by cosine distance, but I'd like to understand where it comes from before dismissing it.
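
For completeness, this is roughly how I'm comparing by cosine similarity (reusing res1 and res1_2 from the snippet above, i.e. the full six-word sentence; cosine here is just the dot product over the norms):

import numpy as np

def cosine(a, b):
    # plain cosine similarity between two embedding vectors
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# per-token similarity between the single run and the first batch element;
# these come out at ~1.0 even where np.allclose reports a difference
sims = [cosine(u, v) for u, v in zip(res1[0], res1_2[0])]
print(min(sims))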
