Wrong similarity score for identical embeddings #1345

SaidKhudoyan · 2025-01-20T11:03:14Z

While testing the bge-m3 embedding model, I wanted to see how it behaves in various scenarios. After generating sparse embeddings and storing them in a JSON file, I tried to calculate their similarity with the _compute_single_lexical_matching_score method, which is defined in FlagEmbedding/inference/embedder/encoder_only/m3.py.
However, I only got a score of about 0.23 when comparing identical sparse embeddings.

Here is the output from my terminal:
Testing sparse similarity computation with converted embeddings...
Sparse Similarity Score: 0.23759149310728778
Similarity computation successful!
Sparse 1: {35542: 0.16986805200576782, 443: 0.1528966724872589, 599: 0.0936431884765625, 8647: 0.30713802576065063, 9: 0.04344563186168671, 174379: 0.2834935784339905}
Sparse 2: {35542: 0.16986805200576782, 443: 0.1528966724872589, 599: 0.0936431884765625, 8647: 0.30713802576065063, 9: 0.04344563186168671, 174379: 0.2834935784339905}

Maybe I'm wrong, but wouldn't we need some kind of normalization factor for that? Currently, only a simple dot product is computed.


545999961 · 2025-01-23T09:30:07Z

Sparse embeddings are not normalized, so the similarity of an embedding with itself is the sum of its squared weights, which is generally not 1.
This is expected behavior; the score doesn't need normalization.
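To make the point above concrete, here is a minimal sketch that recomputes the score from the issue by hand. The helper names `lexical_dot` and `lexical_cosine` are hypothetical and are not part of FlagEmbedding; `lexical_dot` only mirrors the plain dot product described in the issue, and `lexical_cosine` is an optional user-side normalization, assumed here for illustration, for anyone who wants identical inputs to score close to 1.

```python
import math


def lexical_dot(w1: dict, w2: dict) -> float:
    """Plain dot product over the token ids present in both embeddings."""
    return sum(weight * w2[token] for token, weight in w1.items() if token in w2)


def lexical_cosine(w1: dict, w2: dict) -> float:
    """Hypothetical user-side normalization: dot product divided by both L2 norms."""
    norm1 = math.sqrt(sum(weight * weight for weight in w1.values()))
    norm2 = math.sqrt(sum(weight * weight for weight in w2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # an empty or all-zero embedding matches nothing
    return lexical_dot(w1, w2) / (norm1 * norm2)


# Sparse weights copied from the terminal output in the issue.
sparse = {
    35542: 0.16986805200576782,
    443: 0.1528966724872589,
    599: 0.0936431884765625,
    8647: 0.30713802576065063,
    9: 0.04344563186168671,
    174379: 0.2834935784339905,
}

score = lexical_dot(sparse, sparse)        # sum of squared weights, about 0.2376
cosine = lexical_cosine(sparse, sparse)    # approximately 1.0 for identical inputs

print(f"dot product: {score:.6f}")
print(f"cosine-style: {cosine:.6f}")
```

The unnormalized score is still suitable for ranking documents against a single query, since only the relative order matters there; the cosine-style variant is only useful if an absolute, bounded score is required.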
