COVID-19 corpus with annotated biomedical entities.

davidcampos/covid19-corpus

COVID-19 corpus

The COVID-19 corpus repository contains research articles annotated with the following biomedical entity types: Disorder, Species, Chemical or Drug, Gene or Protein, Enzyme, Anatomy, Biological Process, Molecular Function, Cellular Component, Pathway, and microRNA.

Two different datasets are provided:

  • CORD-19 full-text articles, with more than 31 million annotations.
  • PubMed abstracts, with more than 680 thousand annotations.

The annotated corpora are freely available and can be used to further research on topics related to COVID-19, helping to build a better understanding of the disease, find effective drugs, and reduce the impact of the pandemic.

A blog post is available at https://hands-on-tech.github.io/2020/03/28/covid19-corpus.html.

CORD-19

Full-text research articles related to COVID-19 topics. The raw text and a detailed description are available on the official CORD-19 corpus Kaggle page.

Download

Download the latest version of the CORD-19 annotated corpus.

Statistics

Overall corpus statistics:

  • Number of articles: 33 375
  • Number of entity annotation occurrences: 31 272 212
  • Number of unique entity annotations: 141 604

Number of annotations per entity type:

| Entity | # Occurrences | # Unique |
|---|---|---|
| Disorder | 5638277 | 18704 |
| Species | 5899678 | 30343 |
| Chemical or Drug | 4458126 | 11173 |
| Gene and Protein | 2013425 | 57738 |
| Enzyme | 372308 | 1480 |
| Anatomy | 5420584 | 10373 |
| Biological Process | 3701117 | 7765 |
| Molecular Function | 842418 | 1722 |
| Cellular Component | 2542276 | 1099 |
| Pathway | 382338 | 517 |
| microRNA | 1665 | 690 |
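
The per-type counts can be cross-checked against the overall corpus statistics with a few lines of Python. This is only a minimal sketch: the numbers are copied from the table above, and the variable names are illustrative and not part of the repository.

```python
# Per-entity counts copied from the CORD-19 table: (occurrences, unique).
cord19_counts = {
    "Disorder": (5638277, 18704),
    "Species": (5899678, 30343),
    "Chemical or Drug": (4458126, 11173),
    "Gene and Protein": (2013425, 57738),
    "Enzyme": (372308, 1480),
    "Anatomy": (5420584, 10373),
    "Biological Process": (3701117, 7765),
    "Molecular Function": (842418, 1722),
    "Cellular Component": (2542276, 1099),
    "Pathway": (382338, 517),
    "microRNA": (1665, 690),
}

# Sum each column across all entity types.
total_occurrences = sum(occ for occ, _ in cord19_counts.values())
total_unique = sum(uniq for _, uniq in cord19_counts.values())

print(total_occurrences)  # 31272212
print(total_unique)       # 141604
```

Both sums agree with the overall figures listed under "Overall corpus statistics".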

Technical description

Technical description of the annotated CORD-19 corpus is available here.

PubMed

Abstracts of research articles from PubMed related to COVID-19 topics. A blog post about building this corpus is available at https://hands-on-tech.github.io/2020/03/28/covid19-corpus.html.

Download

Download the latest version of the annotated PubMed corpus.

Statistics

Overall corpus statistics:

  • Number of abstracts: 17 740
  • Number of entity annotation occurrences: 683 349
  • Number of unique entity annotations: 29 423
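
To put the two datasets side by side, the overall statistics can be turned into an average number of annotations per document. The sketch below only uses the article, abstract, and occurrence counts quoted in the two "Statistics" sections; the dictionary layout and the function name are assumptions made for illustration.

```python
# Overall statistics copied from the CORD-19 and PubMed "Statistics" sections.
corpora = {
    "CORD-19": {"documents": 33_375, "occurrences": 31_272_212},
    "PubMed": {"documents": 17_740, "occurrences": 683_349},
}


def annotations_per_document(stats):
    """Average number of entity annotation occurrences per document."""
    return stats["occurrences"] / stats["documents"]


density = {name: annotations_per_document(stats) for name, stats in corpora.items()}

for name, value in density.items():
    print(f"{name}: {value:.1f} annotations per document")
```

With these figures, a full-text article averages roughly 937 annotations, against roughly 38.5 for an abstract.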

Number of annotations per entity type:

| Entity | # Occurrences | # Unique |
|---|---|---|
| Disorder | 183528 | 4477 |
| Species | 128356 | 2170 |
| Chemical or Drug | 70619 | 2768 |
| Gene and Protein | 51114 | 15025 |
| Enzyme | 7892 | 282 |
| Anatomy | 106401 | 2369 |
| Biological Process | 74286 | 1561 |
| Molecular Function | 15089 | 383 |
| Cellular Component | 39451 | 263 |
| Pathway | 6587 | 97 |
| microRNA | 26 | 28 |

Technical description

Technical description of the annotated PubMed corpus is available here.

Resources

The following resources were applied to annotate each entity type:

For more details, please check the article. Unfortunately, the dictionaries cannot be shared for download because of the UMLS usage license. Note, however, that Disorder and Species entities were extended to include COVID-19 entities of interest.

Annotation

Neji is the tool used for named entity recognition (NER) and normalization. It is optimized for biomedical scientific articles and provides an easy-to-use CLI. For more details, please check the article.

Changelog

04-04-2020:

  • CORD-19 annotated corpus.

29-03-2020:

  • Annotate "methods", "results" and "conclusions" sections from JSON files.

28-03-2020:

  • Initial release.

Next steps

Possible next steps to improve the COVID-19 corpus:

  • Annotate "methods", "results" and "conclusions" sections from JSON files;
  • Further optimize resources to target entities related to COVID-19;
  • Include additional entities of relevance;
  • Annotate PMC and Elsevier full-text articles;
  • Collect co-occurrences to understand which entities tend to appear together most often;
  • Index articles and annotations and provide access to a search tool.
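
As an illustration of the co-occurrence step, the sketch below counts how often two distinct entities are annotated in the same document. It is not part of the repository: the input format (one list of entity names per document), the toy data, and the function name `count_cooccurrences` are all assumptions, and a real implementation would read the entities from the annotated corpus files instead.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(documents):
    """Count how often each unordered pair of distinct entities shares a document."""
    pair_counts = Counter()
    for entities in documents:
        # Deduplicate so a pair counts once per document; sorting gives
        # every pair a canonical order, so (a, b) and (b, a) are the same key.
        unique_entities = sorted(set(entities))
        pair_counts.update(combinations(unique_entities, 2))
    return pair_counts


# Hypothetical toy input: entity names annotated in three documents.
documents = [
    ["COVID-19", "ACE2", "lung"],
    ["COVID-19", "ACE2"],
    ["COVID-19", "lung", "lung"],
]

counts = count_cooccurrences(documents)
for pair, n in counts.most_common():
    print(pair, n)
```

Counting once per document keeps long articles that repeat the same entities from dominating the result. Counting per sentence or per section instead would only change how the input lists are built.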

Contact

If you would like to know more or contribute, please send an e-mail to [email protected] or create a ticket on GitHub.

License

The annotations and scripts are free to use and released under the MIT license.
