Keep original order of pronunciation variants (#1) #2

Open
wants to merge 1 commit into
base: main
from

@dietmar dietmar commented Oct 25, 2023

Fixes #1

Rather than writing to an intermediate file and then sorting it with the sort command in the shell script extract_de_ipa.sh, we now collect the result rows in Python and sort them there. The sort key is only the text in the first column, not the IPA string, so pronunciation variants for the same text keep their original order from Wiktionary (Python's list.sort() is stable).
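To illustrate why the sort key matters, here is a minimal sketch with made-up sample rows (the data and the helper name sort_by_text are illustrative assumptions, not taken from the script). It sorts on the first column only and contrasts that with sorting on whole rows, which would reorder the variants by their IPA string.

```python
# Illustrative rows in the order they might appear on Wiktionary:
# [text, IPA, comment]. The data is a made-up sample, not script output.
rows = [
    ["Weg", "veːk", ""],
    ["Chemie", "çeˈmiː", ""],
    ["Chemie", "keˈmiː", "regional variant"],
    ["Apfel", "ˈapfl", ""],
]


def sort_by_text(rows):
    """Sort rows by the first column only.

    sorted() and list.sort() are stable, so rows with the same text
    keep their relative input order.
    """
    return sorted(rows, key=lambda row: row[0])


sorted_rows = sort_by_text(rows)

# Sorting whole rows instead also compares the IPA column, which
# moves the "k" variant of "Chemie" in front of the "ç" variant.
fully_sorted_rows = sorted(rows)
```

With the key restricted to row[0], the two "Chemie" rows stay in their input order; comparing whole rows does not preserve it.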

I also switched to writing the result file using Python's built-in csv library instead of just printing result lines, because:

  • We need to keep the columns separated in order to be able to sort at the end using only the first column. So printIpa is now called buildRow and returns a list of three strings (text, IPA, comments).
  • Using the csv library takes care of quoting for us, so we don't need something like return "\"" + s + "\"" if "," in s else s anymore.
  • Using the csv library makes the result file valid CSV. Until now, rows with a comment had three fields (two commas) and rows without a comment had only two (one comma). The new result can be read by, e.g., pandas.read_csv, which previously threw an error complaining about rows with an incorrect number of fields.
  • Writing the result file from Python frees stdout for printing status and progress information, some of which I added.
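The points above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: build_row stands in for buildRow with an assumed signature, and write_rows is a hypothetical helper. It shows that csv.writer quotes fields containing commas on its own and that every row ends up with exactly three fields.

```python
import csv
import io


def build_row(text, ipa, comment=""):
    """Return one result row as a list of three strings.

    Assumed stand-in for buildRow; the real signature may differ.
    """
    return [text, ipa, comment]


def write_rows(rows, out):
    """Write rows as CSV; csv.writer adds quotes only where needed."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(rows)


rows = [
    build_row("Weg", "veːk"),                    # no comment: third field is empty
    build_row("a, b", "x", "contains a comma"),  # first field needs quoting
]

# An in-memory buffer keeps the sketch self-contained; a real run would
# pass a file opened with open(path, "w", newline="", encoding="utf-8").
buffer = io.StringIO()
write_rows(rows, buffer)
csv_text = buffer.getvalue()

# Reading the text back yields the same three-field rows.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

Because the empty comment is still written as a third field, every line has the same number of columns, which is what readers such as pandas.read_csv expect.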

Finally, I also switched from plain open() to bz2.open() so we don't need to decompress the file downloaded from Wikimedia before running the script.
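As a rough sketch of that change (count_lines is a hypothetical helper, and the UTF-8 encoding is an assumption about the dump), bz2.open in text mode returns a file-like object that can be iterated line by line, just like the object returned by open():

```python
import bz2


def count_lines(path):
    """Count the lines in a bz2-compressed text file.

    Mode "rt" decompresses on the fly and decodes to str, so the
    loop body looks the same as with a plain open() call.
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as dump:
        return sum(1 for _ in dump)
```

The file is decompressed as it is read, so no decompressed copy of the download has to exist on disk.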

Development

Successfully merging this pull request may close these issues.

Order of pronunciation variants should be like on Wiktionary