Keep original order of pronunciation variants (#1) #2

dietmar · 2023-10-25T08:15:58Z

Fixes #1

Instead of writing to an intermediate file and then sorting it with sort in the shell script extract_de_ipa.sh, we now collect the result rows in Python and sort them there. The sort key is only the text in the first column, not the IPA string. Because Python's list.sort() is stable, pronunciation variants for the same text keep their original order from Wiktionary.
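
To illustrate the difference, here is a minimal sketch. The sample rows and variable names are made up for this example and are not taken from the script; only the use of a stable sort keyed on the first column reflects the change described above.

```python
# Hypothetical sample rows (text, IPA, comment), in the order they might
# appear on Wiktionary. The data is illustrative only.
original = [
    ["Weg", "veːk", ""],
    ["Chemie", "çeˈmiː", ""],
    ["Chemie", "keˈmiː", "southern German, Austrian"],
    ["weg", "vɛk", ""],
]

# Sort on the first column only. list.sort() is stable, so the two
# "Chemie" rows keep their relative order from the input.
by_text = list(original)
by_text.sort(key=lambda row: row[0])

# Sorting whole rows also compares the IPA strings, which reorders the
# variants ("k" sorts before "ç" by code point).
by_whole_row = sorted(original)

print([row[1] for row in by_text[:2]])
print([row[1] for row in by_whole_row[:2]])
```

With the key restricted to the first column, the "Chemie" variants stay in input order; sorting whole rows swaps them.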

I also switched to writing the result file using Python's built-in csv library instead of just printing result lines, because:

- We need to keep the columns separate so that we can sort at the end using only the first column. So printIpa is now called buildRow and returns a list of three strings (text, IPA, comments).
- The csv library takes care of quoting for us, so we no longer need something like return "\"" + s + "\"" if "," in s else s.
- The csv library makes the result file actually valid CSV. Until now, rows with a comment had three fields (two commas) and rows without a comment had only two fields (one comma). The new result can be read, e.g., by pandas.read_csv, which previously threw an error complaining about rows with an incorrect number of fields.
- Writing the result file from Python frees up stdout for printing status and progress information, some of which I added.
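
The points above can be sketched as follows. This is an assumption-laden illustration, not the code from this PR: build_row is a hypothetical stand-in for buildRow (whose real signature is not shown here), the sample data is invented, and an in-memory buffer replaces the result file.

```python
import csv
import io


def build_row(text, ipa, comment=""):
    """Hypothetical stand-in for buildRow: always return three fields."""
    return [text, ipa, comment]


# Illustrative rows; the second one has a comment containing a comma.
rows = [
    build_row("Chemie", "çeˈmiː"),
    build_row("Chemie", "keˈmiː", "southern German, Austrian"),
]

# io.StringIO stands in for the result file in this sketch.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)  # csv.writer quotes fields containing commas
csv_text = buffer.getvalue()

# Reading the text back yields three fields per row, comment or not.
parsed = list(csv.reader(io.StringIO(csv_text)))

# stdout stays free for status output.
print(f"Wrote {len(rows)} rows")
```

Because every row has the same number of fields and quoting is handled by the writer, a standard CSV reader can parse the output without special cases.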

Finally, I also switched from plain open() to bz2.open() so we don't need to decompress the file downloaded from Wikimedia before running the script.
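
As a rough sketch of what this looks like, bz2.open() in text mode can be iterated line by line just like a file opened with open(). The in-memory data below is a made-up stand-in for the downloaded dump; the script itself would pass the path of the .bz2 file instead.

```python
import bz2
import io

# Made-up stand-in for a compressed dump, built in memory for this example.
sample_xml = "<page>\n<title>Chemie</title>\n</page>\n"
compressed = bz2.compress(sample_xml.encode("utf-8"))

# bz2.open() accepts a file name or a file object; "rt" gives decoded
# text lines, so the reading loop looks the same as with open().
with bz2.open(io.BytesIO(compressed), mode="rt", encoding="utf-8") as dump:
    lines = [line.rstrip("\n") for line in dump]

print(lines)
```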

Keep original order of pronunciation variants (devio-at#1)

4b35bcd
