Wrong distance result #1

Open
sup3rgiu opened this issue Mar 1, 2024 · 0 comments
sup3rgiu commented Mar 1, 2024

Hi there,
First of all, thanks for the project! I found it very fast compared to other GPU implementations.
However, I think the Levenshtein distance is not calculated correctly.

For instance, consider this example:

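from torch.nn.utils.rnn import pad_sequence
# Tokenizer and editdistance are the ones provided by this repository

tokenizer = Tokenizer()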
padToken = 10
device = 'cuda:0'
ref = ['aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaacaccccccccccbbbbbbbbbbbbbb']
hyp = ['bbbbbbbbbbcbbbbbbcbccccbbbbccbbcbbcbccacaaaaaaaaaaaaaaaaaaaaaaaa']

ref = [tokenizer.tokenize(w) for w in ref]
hyp = [tokenizer.tokenize(w) for w in hyp]

# padding
x = pad_sequence(ref, batch_first=True, padding_value=padToken)
y = pad_sequence(hyp, batch_first=True, padding_value=padToken)

x, y = x.to(device), y.to(device)
pred = editdistance(x, y, padToken).to('cpu')
print(pred)

I get pred -> 49

However, it should be 62, as you can see when calculating the same distance with the Levenshtein package from the RapidFuzz project:

from Levenshtein import distance
distance('bbbbbbbbbbcbbbbbbcbccccbbbbccbbcbbcbccacaaaaaaaaaaaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaacaccccccccccbbbbbbbbbbbbbb')

or directly here.

There also seems to be a second problem: Levenshtein distance is symmetric, so the order of the arguments (i.e. the strings) should not matter, yet running editdistance(y, x, padToken) returns 39 instead of the same value as editdistance(x, y, padToken).
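
For a dependency-free cross-check, here is a minimal sketch of the textbook dynamic-programming (Wagner-Fischer) recurrence in plain Python. The helper name `levenshtein` is my own and is not part of this repository or of the Levenshtein package; it is only meant as a slow CPU reference to compare the GPU result against and to confirm that swapping the arguments leaves the distance unchanged.

```python
def levenshtein(a, b):
    """Reference edit distance (insert, delete, substitute, each with cost 1)."""
    # prev[j] is the distance between the already processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


ref = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaacaccccccccccbbbbbbbbbbbbbb'
hyp = 'bbbbbbbbbbcbbbbbbcbccccbbbbccbbcbbcbccacaaaaaaaaaaaaaaaaaaaaaaaa'

forward = levenshtein(ref, hyp)
backward = levenshtein(hyp, ref)
# Both directions should print the same number for a correct implementation
print(forward, backward)
```

Any correct implementation should agree with this reference in both argument orders, so a mismatch between editdistance(x, y, padToken) and editdistance(y, x, padToken) points at a bug rather than at a difference in convention.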
