docs: reorder datasets.csv and update datasets table
mariagrandury committed Aug 28, 2023
1 parent 799ee87 commit adae431
Showing 2 changed files with 29 additions and 10 deletions.
21 changes: 20 additions & 1 deletion README.md
@@ -14,4 +14,23 @@ Si no encuentras lo que estás buscando te animamos a unirte a Discord y pregunt

- [Versión web](https://somosnlp.org/recursos/open-source/datasets)

<!-- TABLE_CONTENT -->
| nombre | tareas | idioma | página_web | github | paper | hf_dataset_name | hf_contributor_handle | dominio | pais |
|:--------------------------------------------------|:-----------------------------------------------------|:-----------|:-----------------------------------------------------|:---------------------------------------------------------|:----------------------------------------------------|:-----------------------------------------------------------------------|:------------------------|:-----------|:-------|
| BasCrawl | modelado del lenguaje | eu | https://doi.org/10.5281/zenodo.7313092 | nan | nan | nan | nan | general | España |
| Biomedical Spanish CBOW Word Embeddings in Floret | modelado del lenguaje,CBOW (Continuous Bag Of Words) | es | https://doi.org/10.5281/zenodo.7314041 | https://arxiv.org/abs/2109.07765 | nan | nan | nan | clinico | España |
| CSIC Spanish Corpus | modelado del lenguaje | es | https://doi.org/10.5281/zenodo.7313126 | nan | nan | nan | nan | academico | España |
| Catalonia Independence Corpus | clasificación de sentimientos | ca, es | nan | https://github.com/ixa-ehu/catalonia-independence-corpus | https://www.aclweb.org/anthology/2020.lrec-1.171/ | catalonia_independence | lewtun | rrss | España |
| HEAD-QA | preguntas de opción múltiple | es | https://aghie.github.io/head-qa/ | https://github.com/aghie/head-qa | https://www.aclweb.org/anthology/P19-1092/ | head_qa | mariagrandury | clinico | España |
| InfoLibros Corpus | modelado del lenguaje | es | https://doi.org/10.5281/zenodo.7313105 | nan | nan | nan | nan | literatura | Varios |
| Large Spanish Corpus | modelado del lenguaje,pre-entrenamiento | es | nan | https://github.com/josecannete/spanish-corpora | nan | large_spanish_corpus | lewtun | general | Varios |
| Mucho Cine | clasificación de sentimientos | es | http://www.lsi.us.es/~fermin/index.php/Datasets | nan | nan | muchocine | mapmeld | general | ? |
| Spanish Billion Words | modelado del lenguaje,pre-entrenamiento | es | https://crscardellino.github.io/SBWCE/ | nan | nan | spanish_billion_words | mariagrandury | general | Varios |
| Spanish Biomedical Crawled Corpus | modelado del lenguaje | es | https://doi.org/10.5281/zenodo.5513237 | nan | https://arxiv.org/abs/2109.07765 | nan | nan | clinico | España |
| Spanish CBOW Word Embeddings in FastText | modelado del lenguaje,FastText | es | https://doi.org/10.5281/zenodo.5044988 | nan | nan | http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405 | nan | genera | España |
| Spanish CBOW Word Embeddings in Floret | modelado del lenguaje,CBOW (Continuous Bag Of Words) | es | https://doi.org/10.5281/zenodo.7314098 | nan | nan | nan | nan | general | España |
| Spanish Legal Domain Corpora | modelado del lenguaje | es | https://doi.org/10.5281/zenodo.5495529 | https://github.com/PlanTL-GOB-ES/lm-legal-es | https://arxiv.org/abs/2110.12201 | nan | nan | legal | España |
| Spanish Legal Domain Word & Sub-Word Embeddings | modelado del lenguaje | es | https://doi.org/10.5281/zenodo.5036147 | https://github.com/PlanTL-GOB-ES/lm-legal-es | https://arxiv.org/abs/2110.12201 | nan | nan | legal | España |
| Spanish Skip-Gram Word Embeddings in FastText | modelado del lenguaje,FastText | es | https://doi.org/10.5281/zenodo.5046525 | nan | nan | http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405 | nan | general | España |
| TDX Thesis Spanish Corpus | modelado del lenguaje | ca, es | https://doi.org/10.5281/zenodo.7313149 | nan | nan | nan | nan | academico | España |
| WikiCorpus | modelado del lenguaje,POS (Part of Speech) | ca, en, es | https://www.cs.upc.edu/~nlp/wikicorpus/ | nan | https://www.cs.upc.edu/~nlp/papers/reese10.pdf | wikicorpus | albertvillanova | general | Varios |
| eHealth-KD | NER (Named Entity Recognition) | es | https://knowledge-learning.github.io/ehealthkd-2020/ | https://github.com/knowledge-learning/ehealthkd-2020 | http://ceur-ws.org/Vol-2664/eHealth-KD_overview.pdf | ehealth_kd | mariagrandury | clinico | España |
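The table above looks like a direct rendering of `datasets.csv`, with empty cells shown as `nan`. The commit does not show the script that produced it, so the following is only a minimal sketch, using the Python standard library, of how such a left-aligned pipe table could be regenerated. The function name `csv_to_markdown` and the `empty` parameter are hypothetical.

```python
import csv
import io


def csv_to_markdown(csv_text: str, empty: str = "nan") -> str:
    """Render CSV text as a left-aligned Markdown pipe table (hypothetical helper)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    # Replace empty cells with a placeholder, mirroring the "nan" cells in the table above.
    body = [[cell if cell else empty for cell in row] for row in body]
    # Each column is as wide as its longest cell (at least 3 characters).
    widths = [max(3, *(len(r[i]) for r in [header] + body)) for i in range(len(header))]

    def fmt(row):
        return "| " + " | ".join(c.ljust(w) for c, w in zip(row, widths)) + " |"

    # A leading colon in the separator row marks a left-aligned column.
    separator = "|" + "|".join(":" + "-" * (w + 1) for w in widths) + "|"
    return "\n".join([fmt(header), separator] + [fmt(r) for r in body])


# Assumed sample: three columns taken from two rows of datasets.csv.
sample = 'nombre,idioma,github\nBasCrawl,eu,\nTDX Thesis Spanish Corpus,"ca, es",\n'
print(csv_to_markdown(sample))
```

Parsing with the `csv` module rather than splitting on commas keeps quoted fields such as `"ca, es"` in a single cell.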
18 changes: 9 additions & 9 deletions datasets.csv
@@ -1,19 +1,19 @@
 nombre,tareas,idioma,página_web,github,paper,hf_dataset_name,hf_contributor_handle,dominio,pais
-BasCrawl,modelado del lenguaje,eu,https://doi.org/10.5281/zenodo.7313092,,,,,general,"España"
-Biomedical Spanish CBOW Word Embeddings in Floret,"modelado del lenguaje,CBOW (Continuous Bag Of Words)","es",https://doi.org/10.5281/zenodo.7314041,https://arxiv.org/abs/2109.07765,,,,clinico,España
+BasCrawl,modelado del lenguaje,eu,https://doi.org/10.5281/zenodo.7313092,,,,,general,España
+Biomedical Spanish CBOW Word Embeddings in Floret,"modelado del lenguaje,CBOW (Continuous Bag Of Words)",es,https://doi.org/10.5281/zenodo.7314041,https://arxiv.org/abs/2109.07765,,,,clinico,España
 CSIC Spanish Corpus,modelado del lenguaje,es,https://doi.org/10.5281/zenodo.7313126,,,,,academico,España
-Catalonia Independence Corpus,clasificación de sentimientos,"ca, es",,https://github.com/ixa-ehu/catalonia-independence-corpus,https://www.aclweb.org/anthology/2020.lrec-1.171/,catalonia_independence,lewtun,rrss,"España"
-HEAD-QA,preguntas de opción múltiple,es,https://aghie.github.io/head-qa/,https://github.com/aghie/head-qa,https://www.aclweb.org/anthology/P19-1092/,head_qa,mariagrandury,clinico,"España"
+Catalonia Independence Corpus,clasificación de sentimientos,"ca, es",,https://github.com/ixa-ehu/catalonia-independence-corpus,https://www.aclweb.org/anthology/2020.lrec-1.171/,catalonia_independence,lewtun,rrss,España
+HEAD-QA,preguntas de opción múltiple,es,https://aghie.github.io/head-qa/,https://github.com/aghie/head-qa,https://www.aclweb.org/anthology/P19-1092/,head_qa,mariagrandury,clinico,España
 InfoLibros Corpus,modelado del lenguaje,es,https://doi.org/10.5281/zenodo.7313105,,,,,literatura,Varios
 Large Spanish Corpus,"modelado del lenguaje,pre-entrenamiento",es,,https://github.com/josecannete/spanish-corpora,,large_spanish_corpus,lewtun,general,Varios
-Mucho Cine,clasificación de sentimientos,"es",http://www.lsi.us.es/~fermin/index.php/Datasets,,,muchocine,mapmeld,general,?
-Spanish Billion Words,"modelado del lenguaje,pre-entrenamiento","es",https://crscardellino.github.io/SBWCE/,,,spanish_billion_words,mariagrandury,general,Varios
-Spanish Biomedical Crawled Corpus,modelado del lenguaje,"es",https://doi.org/10.5281/zenodo.5513237,,https://arxiv.org/abs/2109.07765,,,clinico,España
-Spanish CBOW Word Embeddings in FastText,"modelado del lenguaje,FastText","es",https://doi.org/10.5281/zenodo.5044988,,,http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405,,genera,España
+Mucho Cine,clasificación de sentimientos,es,http://www.lsi.us.es/~fermin/index.php/Datasets,,,muchocine,mapmeld,general,?
+Spanish Billion Words,"modelado del lenguaje,pre-entrenamiento",es,https://crscardellino.github.io/SBWCE/,,,spanish_billion_words,mariagrandury,general,Varios
+Spanish Biomedical Crawled Corpus,modelado del lenguaje,es,https://doi.org/10.5281/zenodo.5513237,,https://arxiv.org/abs/2109.07765,,,clinico,España
+Spanish CBOW Word Embeddings in FastText,"modelado del lenguaje,FastText",es,https://doi.org/10.5281/zenodo.5044988,,,http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405,,genera,España
 Spanish CBOW Word Embeddings in Floret,"modelado del lenguaje,CBOW (Continuous Bag Of Words)",es,https://doi.org/10.5281/zenodo.7314098,,,,,general,España
 Spanish Legal Domain Corpora,modelado del lenguaje,es,https://doi.org/10.5281/zenodo.5495529,https://github.com/PlanTL-GOB-ES/lm-legal-es,https://arxiv.org/abs/2110.12201,,,legal,España
 Spanish Legal Domain Word & Sub-Word Embeddings,modelado del lenguaje,es,https://doi.org/10.5281/zenodo.5036147,https://github.com/PlanTL-GOB-ES/lm-legal-es,https://arxiv.org/abs/2110.12201,,,legal,España
-Spanish Skip-Gram Word Embeddings in FastText,"modelado del lenguaje,FastText","es",https://doi.org/10.5281/zenodo.5046525,,,http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405,,general,España
+Spanish Skip-Gram Word Embeddings in FastText,"modelado del lenguaje,FastText",es,https://doi.org/10.5281/zenodo.5046525,,,http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405,,general,España
 TDX Thesis Spanish Corpus,modelado del lenguaje,"ca, es",https://doi.org/10.5281/zenodo.7313149,,,,,academico,España
 WikiCorpus,"modelado del lenguaje,POS (Part of Speech)","ca, en, es",https://www.cs.upc.edu/~nlp/wikicorpus/,,https://www.cs.upc.edu/~nlp/papers/reese10.pdf,wikicorpus,albertvillanova,general,Varios
 eHealth-KD,NER (Named Entity Recognition),es,https://knowledge-learning.github.io/ehealthkd-2020/,https://github.com/knowledge-learning/ehealthkd-2020,http://ceur-ws.org/Vol-2664/eHealth-KD_overview.pdf,ehealth_kd,mariagrandury,clinico,España
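A few cells in the rows above appear to be misplaced or mistyped: `genera` in `dominio`, an arXiv link in the `github` column, and a journal URL in `hf_dataset_name`. The check below is a sketch only, not part of the commit. The column roles, the list of accepted domains (collected from the values seen above) and the function name `find_issues` are all assumptions.

```python
import csv
import io

# Assumption: these columns are meant to hold URLs (inferred from the header names).
URL_COLUMNS = ("página_web", "github", "paper")
# Assumption: accepted domains are the values that occur in the rows above.
KNOWN_DOMAINS = {"general", "clinico", "academico", "rrss", "literatura", "legal"}
URL_PREFIXES = ("http://", "https://")


def find_issues(csv_text):
    """Return (nombre, column, message) tuples for cells that look misplaced."""
    issues = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["nombre"]
        for col in URL_COLUMNS:
            if row[col] and not row[col].startswith(URL_PREFIXES):
                issues.append((name, col, "expected a URL"))
        if row["github"] and "github.com" not in row["github"]:
            issues.append((name, "github", "not a github.com link"))
        if row["hf_dataset_name"].startswith(URL_PREFIXES):
            issues.append((name, "hf_dataset_name", "looks like a URL"))
        if row["dominio"] not in KNOWN_DOMAINS:
            issues.append((name, "dominio", "unknown domain " + repr(row["dominio"])))
    return issues


# Three rows copied from the new side of the diff above.
SAMPLE = "\n".join([
    "nombre,tareas,idioma,página_web,github,paper,hf_dataset_name,hf_contributor_handle,dominio,pais",
    "BasCrawl,modelado del lenguaje,eu,https://doi.org/10.5281/zenodo.7313092,,,,,general,España",
    'Biomedical Spanish CBOW Word Embeddings in Floret,"modelado del lenguaje,CBOW (Continuous Bag Of Words)",es,https://doi.org/10.5281/zenodo.7314041,https://arxiv.org/abs/2109.07765,,,,clinico,España',
    'Spanish CBOW Word Embeddings in FastText,"modelado del lenguaje,FastText",es,https://doi.org/10.5281/zenodo.5044988,,,http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405,,genera,España',
])

for issue in find_issues(SAMPLE):
    print(issue)
```

Whether these cells are actually wrong is for the maintainers to decide; the check only surfaces candidates for review.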
