Wrong similarity score for identical embeddings #1345

Open
SaidKhudoyan opened this issue Jan 20, 2025 · 1 comment


@SaidKhudoyan

While testing the bge-m3 embedding model under various scenarios, I generated sparse embeddings, stored them as JSON, and then computed their similarity with the _compute_single_lexical_matching_score method, which is defined in FlagEmbedding/inference/embedder/encoder_only/m3.py.
However, comparing two identical sparse embeddings gave a score of only about 0.2376.

Here is the output from my terminal (translated from German):
Testing sparse similarity calculation with converted embeddings...
Sparse Similarity Score: 0.23759149310728778
Similarity calculation successful!
Sparse 1: {35542: 0.16986805200576782, 443: 0.1528966724872589, 599: 0.0936431884765625, 8647: 0.30713802576065063, 9: 0.04344563186168671, 174379: 0.2834935784339905}
Sparse 2: {35542: 0.16986805200576782, 443: 0.1528966724872589, 599: 0.0936431884765625, 8647: 0.30713802576065063, 9: 0.04344563186168671, 174379: 0.2834935784339905}

Maybe I'm wrong, but wouldn't we need some kind of normalization factor here? Currently only a plain dot product is computed.
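For reference, the reported score can be reproduced with a plain dot product over the shared token ids. The snippet below is a minimal sketch, assuming the method sums the products of weights for tokens present in both dictionaries, as the description above suggests; `lexical_dot` is a hypothetical stand-in written for illustration, not the library's own code.

```python
# Hypothetical stand-in for illustration; not the FlagEmbedding implementation.
def lexical_dot(weights_1, weights_2):
    """Sum the products of weights for token ids present in both dicts."""
    score = 0.0
    for token, weight in weights_1.items():
        if token in weights_2:
            score += weight * weights_2[token]
    return score


# Sparse weights copied from the terminal output in this issue.
sparse = {
    35542: 0.16986805200576782,
    443: 0.1528966724872589,
    599: 0.0936431884765625,
    8647: 0.30713802576065063,
    9: 0.04344563186168671,
    174379: 0.2834935784339905,
}

# Comparing a vector with itself yields the sum of its squared weights.
score = lexical_dot(sparse, sparse)
sum_of_squares = sum(w * w for w in sparse.values())
print(score, sum_of_squares)  # both are about 0.2376
```

So for identical inputs the score is the squared L2 norm of the weight vector, which equals 1 only if the weights happen to have unit norm.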

@545999961
Collaborator

Because sparse embeddings are not normalized, the similarity score between identical sparse embeddings is generally not 1.
This is expected, and no normalization is needed.
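If a score bounded by 1 is wanted for a sanity check like the one in this issue, the dot product can be divided by the two L2 norms on the caller's side. This is only a sketch under that assumption: `sparse_cosine` is a hypothetical helper, not part of FlagEmbedding, and the resulting cosine value is a different quantity from the lexical matching score the model produces.

```python
import math


# Hypothetical helper for illustration; not part of FlagEmbedding.
def sparse_cosine(weights_1, weights_2):
    """Cosine similarity between two {token_id: weight} dictionaries."""
    dot = sum(w * weights_2[t] for t, w in weights_1.items() if t in weights_2)
    norm_1 = math.sqrt(sum(w * w for w in weights_1.values()))
    norm_2 = math.sqrt(sum(w * w for w in weights_2.values()))
    if norm_1 == 0.0 or norm_2 == 0.0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_1 * norm_2)


# Sparse weights copied from the terminal output in this issue.
sparse = {
    35542: 0.16986805200576782,
    443: 0.1528966724872589,
    599: 0.0936431884765625,
    8647: 0.30713802576065063,
    9: 0.04344563186168671,
    174379: 0.2834935784339905,
}

# Identical vectors give 1.0 up to floating-point rounding.
cosine = sparse_cosine(sparse, sparse)
print(cosine)
```

Note that normalizing discards the magnitude of the weights, so it should be treated as a diagnostic rather than a replacement for the unnormalized score.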
