Issue with a name that includes "í" #7866

danyaljj · 2024-10-26T16:26:47Z

I was adding an entry for Mateo Díaz (https://dblp.org/pid/200/7297.html) and realized that it causes an error.

Processing Mateo Díaz,Johns Hopkins University,https://mateodd25.github.io/,F3cPGhsAAAAJ
  Checking https://dblp.org/search/author/api?q=author%3AMateo%20Diaz:$%3A&format=json&c=10
  WARNING: Possibly invalid name (Mateo Diaz). This may be a disambiguation entry.
  Checking homepage URL (https://mateodd25.github.io/

Here is what I found:

matching_name_with_dblp("Mateo Díaz") works just fine. It returns "1".
However, I realized that the code actually processes Mateo Diaz (note í changed to i). When I pass that to matching_name_with_dblp("Mateo Diaz"), it returns "2", which causes the error due to name ambiguity.
I dug in and found that the change (from í to i) is made by unidecode.unidecode(.)

import unidecode

name = unidecode.unidecode("Mateo Díaz")
print(f"Name after unicode normalization: {name}")
# prints:  Name after unicode normalization: Mateo Diaz
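
To make the difference between the two spellings concrete, here is a small sketch using only the standard library. It percent-encodes just the name, the way it appears inside the DBLP query in the log above; it does not reproduce the rest of the query string, and the variable names are only for illustration.

```python
from urllib.parse import quote

# The name as entered, and the name after the í -> i transliteration.
original = "Mateo Díaz"
transliterated = "Mateo Diaz"

# quote() encodes non-ASCII characters as UTF-8 percent escapes.
print(quote(original))        # Mateo%20D%C3%ADaz
print(quote(transliterated))  # Mateo%20Diaz
```

The second form matches what shows up in the "Checking ..." line of the log, so the accent is already gone by the time the query is built.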

I looked into whether í can be encoded in a way that unidecode.unidecode(.) does not strip, but nothing worked.
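
For reference, a minimal sketch of the two standard Unicode encodings of í, using only the standard library. The strip_marks helper is a hypothetical stand-in for accent stripping, not part of unidecode; I am assuming unidecode treats the decomposed form similarly, which is consistent with what I observed.

```python
import unicodedata

composed = "Mateo Díaz"                               # í as one code point (U+00ED)
decomposed = unicodedata.normalize("NFD", composed)   # i followed by U+0301

def strip_marks(text):
    """Hypothetical stand-in: drop combining marks (category Mn)."""
    return "".join(c for c in text if unicodedata.category(c) != "Mn")

print(len(composed), len(decomposed))   # 10 11
print(strip_marks(decomposed))          # Mateo Diaz
```

Both forms represent the same name, and the decomposed one loses its accent as soon as combining marks are dropped, so switching between them does not protect í.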

So in conclusion, my suggestion is to replace the unidecode.unidecode(.) call with a variant that does not strip the accent from í. For example:

import unicodedata

def custom_unidecode(text, keep_characters="í"):
    result = []
    for char in text:
        # If the character is in the keep list, add it directly
        if char in keep_characters:
            result.append(char)
        else:
            # Normalize and strip accents
            normalized_char = unicodedata.normalize('NFD', char)
            stripped_char = ''.join(c for c in normalized_char if unicodedata.category(c) != 'Mn')
            result.append(stripped_char)
    return ''.join(result)

# Example usage
text = "Café con piñata and ítem"
print(custom_unidecode(text))

text = "Mateo Díaz"
print(custom_unidecode(text))

# Would print: 
# Cafe con pinata and ítem
# Mateo Díaz

Happy to send a PR if you like the suggested change.
