Keep original order of pronunciation variants (#1) #2
Fixes #1
Rather than writing to an intermediate file and then sorting it with `sort` in the shell script `extract_de_ipa.sh`, we now collect the result rows in Python and sort them there. The sort considers only the text in the first column (not the IPA string), which keeps the original Wiktionary order of pronunciation variants for the same text, because Python's `list.sort()` is stable.

I also switched to writing the result file with Python's built-in `csv` library instead of just `print`ing result lines, because:

- `printIpa` is now called `buildRow` and returns a list of three strings (text, IPA, comments).
- The `csv` library takes care of quoting for us, so we no longer need something like `return "\"" + s + "\"" if "," in s else s`.
- The `csv` library makes the result file actually valid CSV. Until now, rows with a comment had three fields (two commas) and rows without a comment had only two fields (one comma). The new result can be read by, e.g., `pandas.read_csv`, which previously threw an error complaining about rows with an incorrect number of fields.
- `print` is now available for status and progress information, some of which I added.

Finally, I also switched from plain `open()` to `bz2.open()`, so we no longer need to decompress the file downloaded from Wikimedia before running the script.
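
For reviewers, here is a minimal, self-contained sketch of the three ideas above: reading bz2-compressed text directly, a stable sort keyed on the first column only, and writing rows through `csv.writer`. It is not the script's actual code. The helper names `sort_rows` and `write_rows`, the sample rows, and the in-memory buffers are all made up for illustration.

```python
import bz2
import csv
import io


def sort_rows(rows):
    """Sort in place by the text column only.

    list.sort() is stable, so rows sharing the same text keep the
    relative order in which they were collected.
    """
    rows.sort(key=lambda row: row[0])
    return rows


def write_rows(rows, out):
    """Write rows as CSV; the csv module quotes fields containing commas."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(rows)


# Hypothetical sample rows in collection order: [text, IPA, comments].
rows = [
    ["Weg", "veːk", ""],
    ["China", "ˈçiːna", ""],
    ["China", "ˈkiːna", "southern German, Austrian"],
]
sort_rows(rows)

# Write to an in-memory buffer here; a real script would pass an open file.
buffer = io.StringIO()
write_rows(rows, buffer)
csv_text = buffer.getvalue()

# bz2.open() in text mode reads compressed data without a separate
# decompression step. A BytesIO stands in for the downloaded dump file.
compressed = bz2.compress("line one\nline two\n".encode("utf-8"))
with bz2.open(io.BytesIO(compressed), mode="rt", encoding="utf-8") as dump:
    lines = [line.rstrip("\n") for line in dump]
```

In this sample, sorting by the whole row would have swapped the two "China" variants, since `k` sorts before `ç`. Keying on `row[0]` alone leaves them in collection order. Every row is written with three fields, including an empty comment field where there is no comment.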