Help wanted: Example sentences for testing languages #1107
Replies: 51 comments
-
In French:
News headlines:
-
Some questions in English and French:
en_examples = [
fr_examples = [
-
A related idea: you might want to test these strings as well to check that none of them causes a problem: https://github.com/minimaxir/big-list-of-naughty-strings
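To make the idea a bit more concrete, here is a minimal sketch of such a check. It assumes the tokenizer under test is a callable that returns token texts; `find_failures`, `placeholder_tokenize` and the sample strings are hypothetical names and data made up for illustration, not spaCy's API and not taken from the linked list.

```python
# Illustrative inputs in the spirit of the linked list; they are hand-picked
# here and not copied from that repository.
TRICKY_STRINGS = [
    "",
    "   ",
    "a\tb\nc",
    "Ω≈ç√∫",
    "👍🏽 emoji with a skin-tone modifier",
    "זהו משפט.",
    "１２３ full-width digits",
]


def strip_whitespace(text):
    """Remove all whitespace so only the visible characters are compared."""
    return "".join(text.split())


def find_failures(tokenize, strings):
    """Return (text, reason) pairs for inputs the tokenizer could not handle."""
    failures = []
    for text in strings:
        try:
            tokens = list(tokenize(text))
        except Exception as err:  # any crash counts as a failure
            failures.append((text, f"raised {type(err).__name__}"))
            continue
        # The tokens, glued back together, should contain the same characters.
        if strip_whitespace("".join(tokens)) != strip_whitespace(text):
            failures.append((text, "characters lost or changed"))
    return failures


def placeholder_tokenize(text):
    # Assumption: stands in for a real tokenizer that returns token texts.
    return text.split()


failures = find_failures(placeholder_tokenize, TRICKY_STRINGS)
print(f"{len(failures)} of {len(TRICKY_STRINGS)} strings caused problems")
```

Swapping `placeholder_tokenize` for a wrapper around a real tokenizer would be the interesting part; the comparison ignores whitespace on purpose, since tokenizers differ in how they treat it.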
-
Thanks for the examples so far! 👍
@oroszgy Thanks, that's a good idea for the tokenizer tests, especially the crazy Unicode stuff! I'm actually pretty curious to see if spaCy can handle all of them. I'll start adding a test with a selection of them on
-
@ines Hi, how could one contribute examples for Indian languages such as Hindi and Bengali?
-
@eldor4do There's no Hindi tokenization support yet (although it looks like some people have started working on it, see #641). Bengali tokenization is already supported, so a few sentences in Bengali would be great 👍
Here's what we have in the tokenizer tests so far. Just had a look at some of the test examples in Google Translate and they seem pretty good, so more examples like that would be nice.
-
For Spanish:
-
For Polish:
-
Hebrew examples:
he_this_is_a_sentence = 'זהו משפט.'
he_examples_news = [
'סין מקימה קרן של 440 מיליון דולר להשקעה בהייטק בישראל',
'רה"מ הודיע כי יחרים טקס בחסותו',
'הכנסת צפויה לאשר איכון אוטומטי של שיחות למוקד 100',
'תוכנית לאומית תהפוך את ישראל למעצמה דיגיטלית'
]
he_examples_misc = [
'סע לשלום, המפתחות בפנים.',
'מלצר, פעמיים טורקי!',
'ואהבת לרעך כמוך.',
'היום נעשה משהו בלתי נשכח.'
]
he_examples_questions = [
'איפה הילד?',
'מיהו נשיא צרפת?',
'מהי בירת ארצות הברית?',
"איך קוראים בעברית לצ'ופצ'יק של הקומקום?",
'מה הייתה הדקה?',
'מי אומר שלום ראשון, זה שעולה או זה שיורד?',
]
-
For Swedish:
-
“This is a sentence.” -> „To jest zdanie.” (Polish)
-
For Portuguese (BR):
pt_examples = [
-
For Danish:
-
For Norwegian NB:
“This is a sentence.” -> "Dette er en setning."
-
For Italian:
Edit: "This is a sentence" --> "Questa è una frase"
-
Arabic
-
@ines I would like to support spaCy by adding the Tamil language.
-
For Catalan:
-
Ukrainian:
Це речення.
On Sun, 17 Feb 2019 at 17:41, Marc Puig wrote:
… For Catalan:
This is a sentence -> Això és una frase
-
"एप्पल एक यूके की स्टार्टअप को १ अरब डॉलर मे खरीदने का सोच रहा है|"
should be स्वचालित गाड़ियां अब बीमा के दायित्व को अब उत्पादक के ऊपर डाल रही हैं|
"अमेरिका की राजधानी क्या है?"
यह एक वाक्य है|
-
I saw that an Indonesian model is in progress. Here is the Indonesian translation:
-
Albanian:
-
For Bulgarian:
There are some specifics about the use of Bulgarian.
spaCy is great, and I would like to contribute more to the Bulgarian pack in the future.
-
"This is a sentence" example sentence in Arabic
-
Do you need further help with Arabic, @ines? More examples, translation review, etc.?
-
Russian:
-
Translated the examples below into Tamil:
translates to
-
There is a problem with the Indonesian example sentences: they contain sensitive terms such as terrorism and bombing, even though plenty of other examples are available. I'm sending a pull request to fix it.
-
Simplified Chinese / zh_cn
->
-
Icelandic examples:
examples = [
'Apple íhugar að kaupa breskt sprotafyrirtæki fyrir 1 milljarð dollara',
'Sjálfstýrðir bílar færa tryggingarábyrgð yfir á framleiðendur',
'San Francisco íhugar að banna afhendingarvélmenni á gangstéttum',
'London er stór borg í Bretlandi.'
]
Note: Similarly to the Bulgarian translations (as with other translations that I see here),
As mentioned for other languages, here is the Icelandic translation of the "This is a sentence" example sentence: Þetta er setning
-
When testing models and spaCy's performance in general, it's always nice to have a few text examples handy – especially for the languages you don't speak. We noticed this while testing the new Spanish and multi-language NER models. It's not always easy to find good, representative sentences, and copy-pasting whatever is on the Wikipedia front page that day isn't ideal either.
So here's the idea: For each language, we want to add a simple list of text examples that can be imported and processed – for example, to test if a model is working as expected. Ideally, those examples should include different types of sentences, grammatical constructions and named entities. The examples could also be used in spaCy's test suite.
Examples could live with the language's data in spacy.lang.
✅ Languages with models (8/8)
en (can be covered by us)
de (can be covered by us)
fr (thanks @Gregory-Howard & @Tpt)
es (thanks @lucianosilvi)
pt (thanks @oximer)
it (thanks @MartinoMensio)
nl (thanks @redevries)
el (thanks @Eleni170 & @giannisdaras)
Languages with only tokenizers (29/40)
sq (thanks @AlbionaHoti)
ar (thanks @tzano)
bn (thanks @roshni-b)
bg (thanks @KristiyanVachev)
ca (thanks @mpuig)
zh (thanks @renewang)
hr (thanks @isaric)
da (thanks @mollerhoj)
fi (thanks @tjkemp)
he (thanks @beneyal)
hi (thanks @weavermonkey)
hu (thanks @oroszgy)
id (thanks @geovedi)
ja (thanks @tokestermw)
lt (thanks @rokasramas)
nb (thanks @luvogels)
fa (thanks @azarezade)
pl (thanks @adam-ra)
ro (thanks @janimo)
ru (thanks @gsoul)
si (thanks @keshan)
sv (thanks @RickardSjogren)
ta (thanks @loghijiaha)
tt (thanks @aliiae)
te (thanks @sainathadapa)
th (thanks @korakot & @wannaphongcom)
tr (thanks @DuyguA)
uk (thanks @juliamakogon)
ur (thanks @mirfan899)
af
cs
is
ga
kn
lv
lb
sr
sk
sl
vi
"This is a sentence" example sentence (28/49)
If you happen to speak one of the tokenization-only languages above, a translation of "This is a sentence." would also be very helpful! The new model docs have a "Quickstart" widget that shows the install command, plus an example sentence. If we ever add a model for one of the alpha languages, it'd be nice to have the example sentence available (and mistakes in the example would be pretty embarrassing, haha).
sq: "Kjo është një fjali." (thanks @AlbionaHoti)
ar: هذه جملة (thanks @Ahmed0Sultan)
bg: "Това е изречение" (thanks @KristiyanVachev)
da: "Dette er en sætning." (thanks @philiphoyos)
nl: "Dit is een zin." (thanks @redevries)
en: "This is a sentence."
fr: "C'est une phrase."
de: "Dies ist ein Satz."
el: "Αυτή είναι μια πρόταση." (thanks @Eleni170)
he: "זהו משפט." (thanks @beneyal)
hi: "यह एक वाक्य है।" (thanks @weavermonkey & @psmehta21)
hu: "Ez egy mondat." (thanks @oroszgy)
id: "Ini adalah sebuah kalimat." (thanks @prdx)
it: "Questa è una frase." (thanks @MartinoMensio)
ja: "これは文章です。" (thanks @tokestermw)
nb: "Dette er en setning." (thanks @luvogels)
pl: "To jest zdanie." (thanks @adam-ra)
pt: "Esta é uma frase."
ro: "Aceasta este o propoziție" (thanks @janimo)
ru: Это предложение. (thanks @buriy)
si: "මෙය වාක්යයකි." (thanks @keshan)
es: "Esto es una frase."
ta: "இது ஒரு வாக்கியம்." (thanks @arvindcheenu)
te: "ఇది ఒక వాక్యం." (thanks @sainathadapa)
th: "นี่คือประโยค" (thanks @cvorasucha)
tr: "Bu bir cümledir." (thanks @DuyguA)
uk: "Це речення." (thanks @juliamakogon)
ur: "یہ ایک جملہ ہے۔" (thanks @mirfan899)
af
bn
ca
zh
hr
cs
fi
is
ga
kn
lv
lt
lb
fa
sr
sk
sl
sv
tt
vi
We haven't fully decided how the text examples should be implemented. The examples will live in a file examples.py in the language data, for example lang/en/examples.py. Currently, there's only a sentences list, but more examples can be added in the future.
If you want to help out and contribute a few examples, you can submit a PR, or simply post them here! 🙂
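For anyone wondering what this could look like in practice, here's a rough sketch under a few assumptions. The `sentences` list below stands in for the contents of a per-language examples.py (using the Icelandic sentences posted in this thread), and `run_examples` and `whitespace_tokenize` are hypothetical helpers invented for illustration. With spaCy installed, the list would presumably be imported from the language data instead (e.g. from spacy.lang.en.examples) and passed to a real pipeline.

```python
# Stand-in for a per-language examples.py; these are the Icelandic
# sentences contributed earlier in this thread.
sentences = [
    "Apple íhugar að kaupa breskt sprotafyrirtæki fyrir 1 milljarð dollara",
    "Sjálfstýrðir bílar færa tryggingarábyrgð yfir á framleiðendur",
    "San Francisco íhugar að banna afhendingarvélmenni á gangstéttum",
    "London er stór borg í Bretlandi.",
]


def run_examples(examples, process):
    """Apply `process` to every example and collect (text, result) pairs."""
    return [(text, process(text)) for text in examples]


def whitespace_tokenize(text):
    # Placeholder only (assumption): a real check would call a model here.
    return text.split()


results = run_examples(sentences, whitespace_tokenize)
for text, tokens in results:
    # Print the token count next to each sentence for a quick manual check.
    print(len(tokens), text)
```

Keeping the examples as a plain list of strings means the same data can feed a quick manual check like this one as well as the test suite.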