Nested lists for subdialect info #329
Comments
I'd rather just have them in separate files and if people want to combine
them they're welcome to. I can't think how I'd use the three-column file
for actual processing without just filtering it into separate dialects, so
we should do that from the jump.
That said, the pt issue you point out is important. We should fix that
upstream (probably).
On Sat, Jan 23, 2021 at 12:03 AM Hossep Dolatian ***@***.***> wrote:
I noticed this problem for Armenian
<https://en.wiktionary.org/wiki/%D5%A5%D6%80%D5%AF%D6%80%D5%B8%D6%80%D5%A4>
and a colleague told me it's also found in Portuguese
<https://en.wiktionary.org/wiki/afetar>. For some languages, the
pronunciation entry can use a nested list, such that
- The first level contains the main dialect name
- The second level contains a subdialect name
For example, Portuguese <afetar <https://en.wiktionary.org/wiki/afetar>>
has a level-1 entry for (standard) Brazilian Portuguese. But this entry has
3 level-2 entries for different regions of Brazil. As of now, WikiPron
scrapes all 4 pronunciations
<https://github.com/kylebgorman/wikipron/blob/master/data/tsv/por_bz_phonemic_filtered.tsv>
as part of "Brazilian Portuguese". But that obscures the fact that the 4
entries correspond to separate subdialects.
It would be nice if the script could 'fix' this somehow. Maybe you can add
an extra column to the scraped content, such that the new column would keep
the name of each line entry's (sub)dialect. For example, for afetar, maybe you could
return something like
afetar | a f e t a ɹ | Brazil
afetar | a f e t a ɻ | Paulista
afetar | a f e t a ʁ | South Brazil
afetar | a f e t a χ | Carioca
|
@jacksonllee would love your thoughts on this. Keeps coming up... |
I share the same intuition. In general, augmenting the scraped data beyond the current two-column format is something we should avoid within WikiPron. In this particular case, I agree that just having separate files for individual subdialectal varieties is the way to go. As for implementation -- if the Portuguese afetar example is representative (the Armenian example shows that it's not -- I'll come back to this below), it might be possible to tighten the IPA extraction so that it more narrowly targets the individual The Armenian երկրորդ example shows the embedded |
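As a rough illustration of what targeting individual list levels could look like, here is a minimal sketch that walks a nested list and records which level each pronunciation came from. The markup, the `label` and `IPA` class names, and the `iter_prons` helper are all assumptions made for this example; real Wiktionary HTML is more complex, and this is not how WikiPron currently extracts pronunciations. The pronunciations are the illustrative ones from the issue description.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for a nested pronunciation list.
# Real Wiktionary pages use different and messier HTML.
FRAGMENT = """
<ul>
  <li><span class="label">Brazil</span><span class="IPA">a f e t a ɹ</span>
    <ul>
      <li><span class="label">Paulista</span><span class="IPA">a f e t a ɻ</span></li>
      <li><span class="label">South Brazil</span><span class="IPA">a f e t a ʁ</span></li>
      <li><span class="label">Carioca</span><span class="IPA">a f e t a χ</span></li>
    </ul>
  </li>
</ul>
"""


def iter_prons(ul, parents=()):
    """Yield (label_path, ipa) pairs, recursing into nested lists.

    label_path is a tuple such as ("Brazil", "Paulista"), so its length
    tells the caller whether the item is a level-1 or level-2 entry.
    """
    # findall("li") matches direct children only, so each level is
    # visited exactly once.
    for li in ul.findall("li"):
        label = li.find("span[@class='label']").text
        ipa = li.find("span[@class='IPA']").text
        path = parents + (label,)
        yield path, ipa
        for nested in li.findall("ul"):
            yield from iter_prons(nested, path)


rows = list(iter_prons(ET.fromstring(FRAGMENT)))
for path, ipa in rows:
    print(" > ".join(path), ipa, sep="\t")
```

Keeping the full label path, rather than only the innermost label, leaves it up to the caller whether level-2 entries should be grouped under their parent dialect or written out separately.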
I managed to get the Armenian Wiktionary editors to upgrade their script so that the colloquial entries now have a more informative text, like Portuguese. |
Nice!
On Sun, Jan 24, 2021 at 2:52 AM Hossep Dolatian ***@***.***> wrote:
I managed to get the Armenian Wiktionary editors to upgrade their script
so that the colloquial entries now have a more informative text, like
Portuguese.
|
Did the upstream editors fix this as far as our code is concerned? |
Well, for Armenian, most of the entries now look like Portuguese. There are probably still some stragglers among the older entries, but I can find and fix those once I know how the WikiPron code handles the canonical cases like Portuguese. |
I noticed this problem for Armenian and a colleague told me it's also found in Portuguese. For some languages, the pronunciation entry can use a nested list, such that
- The first level contains the main dialect name
- The second level contains a subdialect name
For example, Portuguese <afetar> has a level-1 entry for (standard) Brazilian Portuguese. But this entry has 3 level-2 entries for different regions of Brazil. As of now, WikiPron scrapes all 4 pronunciations as part of "Brazilian Portuguese". But that obscures the fact that the 4 entries correspond to separate subdialects.
It would be nice if the script could 'fix' this somehow. Maybe you can add an extra column to the scraped content, such that the new column would keep the name of each line entry's (sub)dialect. For example, for afetar, maybe you could return something like
afetar | a f e t a ɹ | Brazil
afetar | a f e t a ɻ | Paulista
afetar | a f e t a ʁ | South Brazil
afetar | a f e t a χ | Carioca
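The three-column layout proposed here and the separate-files approach preferred in the comments carry the same information, since one can be filtered into the other. A minimal sketch of that filtering step follows, assuming tab-separated rows of word, pronunciation, and dialect; the `split_by_dialect` and `to_tsv` helpers are hypothetical and not part of WikiPron.

```python
from collections import defaultdict

# Three-column rows as proposed in the issue, tab-separated here
# (an assumption; the issue shows them with " | " for readability).
THREE_COLUMN = (
    "afetar\ta f e t a ɹ\tBrazil\n"
    "afetar\ta f e t a ɻ\tPaulista\n"
    "afetar\ta f e t a ʁ\tSouth Brazil\n"
    "afetar\ta f e t a χ\tCarioca\n"
)


def split_by_dialect(tsv_text):
    """Group (word, pron) pairs by the dialect named in the third column."""
    groups = defaultdict(list)
    for line in tsv_text.splitlines():
        word, pron, dialect = line.split("\t")
        groups[dialect].append((word, pron))
    return dict(groups)


def to_tsv(pairs):
    """Render (word, pron) pairs in the usual two-column layout."""
    return "".join(f"{word}\t{pron}\n" for word, pron in pairs)


groups = split_by_dialect(THREE_COLUMN)
for dialect, pairs in groups.items():
    # In practice each group would go to its own file; the naming
    # scheme for such files is left open here.
    print(f"# {dialect}")
    print(to_tsv(pairs), end="")
```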