MedTAG: An open-source biomedical annotation tool for diagnostic reports.
This repository contains the full source code of MedTAG, a biomedical annotation tool for tagging biomedical concepts in clinical reports.
MedTAG provides four annotation types:
-
Concepts: allows the user to specify which concepts are relevant for a document. Users can take advantage of auto-complete functionalities for searching the relevant concepts to assign to each document.
-
Labels: allows the user to assign one or more labels to a document by clicking on the check-boxes. Labels indicate properties of a report (e.g. the "Cancer" label indicates the presence of a cancer-related disease).
-
Mentions: shows the list of the mentions identified by the user in the report text.
-
Linking: allows the user to link the mentions identified with the corresponding concepts. Users can link the same mention to multiple concepts.
MedTAG provides the following functionalities:
-
a web-based collaborative annotation platform with support for users and roles
-
support for click-away mention annotation
-
support for mentions highlighting in different colors
-
automatic saving every time an action is performed
-
sorting of medical reports according to two different strategies: lexicographic order and “unannotated-first” policy
-
web responsive design to support mobile devices
-
download of annotations and ground truths in several formats (i.e., BioC/XML, CSV, JSON)
-
support for multi-label annotation
-
support for document-level annotations
-
multilingual support
-
support for overlapping mentions
-
support for ontologies/concepts to use for the annotation process
-
support for schema configuration, so that users can easily import data (i.e., reports, labels and concepts), as CSV files, and choose which report fields to annotate.
-
support for automatic annotation of all the annotation types for reports belonging to three use cases: colon, uterine cervix and lung. Automatic annotation is available for English reports.
-
support for annotation of PubMed articles
-
support for the upload and visualization of other team members' ground-truths
NOTE: MedTAG does not support discontinuous annotations.
The directory tree is organized as follows:
- The MedTAG_Dockerized directory contains the full source code of MedTAG.
- The example directory contains some instances of CSV files to work with.
- The templates directory contains some instances of CSV files with only the list of header columns.
- The img directory contains the project images, such as the screenshots of MedTAG.
Since MedTAG is provided as a Docker container, both docker and docker-compose are required; check out the installation procedure for your platform. The MedTAG Docker container also instantiates a PostgreSQL database, so if you plan to insert a large amount of data, make sure you have enough disk space. Chrome is the recommended browser for working with MedTAG, but Safari and Firefox are supported as well.
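The requirements above can be checked before starting the installation. The following is a minimal sketch, not part of MedTAG: the helper names `missing_tools` and `free_disk_gib` are hypothetical, and the script only looks for the two executables on the PATH and reports the free disk space of the current directory.

```python
import shutil

# Command-line tools named in the requirements above.
REQUIRED_TOOLS = ("docker", "docker-compose")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def free_disk_gib(path="."):
    """Return the free space, in GiB, on the filesystem that contains path."""
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing prerequisites:", ", ".join(missing))
    else:
        print("docker and docker-compose were found on the PATH")
    print(f"Free disk space here: {free_disk_gib():.1f} GiB")
```

How much free space is enough depends on the amount of data you plan to load, so the script reports the value without enforcing a threshold.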
If you already have both docker and docker-compose installed on your machine, you can skip the first two steps.
-
Install Docker. To this aim, check out the correct installation procedure for your platform.
-
Install Docker-compose. As in the first step, check out the correct installation procedure to get docker-compose installed for your platform.
-
Check that the Docker daemon (dockerd) is up and running.
-
Download or clone the medtag-core repository.
-
Open MedTAG_Dockerized/baseurl.txt file and put the baseURL of the server where MedTAG is deployed. If the server hosting MedTAG has the baseURL http://example.com/server/example/, specify this URL in place of the http://0.0.0.0:8000/ provided by default.
-
Open the MedTAG_Dockerized project folder and, in a new terminal session, type docker-compose up. Running this command installs the MedTAG dependencies.
NOTE: On Unix-like systems, docker-compose should be run without sudo, in a directory owned by the user.
-
MedTAG installation has completed and you can access it on your browser at http://0.0.0.0:8000/.
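The base URL written to MedTAG_Dockerized/baseurl.txt in the steps above can be sanity-checked before deployment. The sketch below is illustrative only: `normalize_base_url` is a hypothetical helper, and the trailing slash is an assumption based on the two example URLs given in this README.

```python
from urllib.parse import urlsplit


def normalize_base_url(url):
    """Validate an absolute http(s) URL and return it with one trailing slash.

    The trailing slash is an assumption taken from the examples in this
    README (http://0.0.0.0:8000/ and http://example.com/server/example/).
    """
    url = url.strip()
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not an absolute http(s) URL: {url!r}")
    return url.rstrip("/") + "/"


print(normalize_base_url("http://example.com/server/example"))
# prints: http://example.com/server/example/
```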
NOTE: If you want to shut down MedTAG, open a new terminal window and navigate to the project folder. Finally type docker-compose down
NOTE: If you want to redo the whole installation process and run MedTAG in Test Mode (i.e., with the provided sample data) open a new terminal and, inside the project folder, run the following commands:
docker-compose down
sudo rm -rf data
docker image ls
- Then select the IMAGE ID of the image whose name is medtag dockerized web and run:
docker image rm <IMAGE ID>
- Finally run
docker-compose up
NOTE: If you are running Docker on an operating system such as CentOS and the installation cannot complete because of Cython-related errors, you can install MedTAG with a different Dockerfile and docker-compose.yml, based on Ubuntu 20.04. These files can be found in medtag-core/docker_config_ubuntuOS/. Copy the Dockerfile and the docker-compose.yml files into medtag-core/MedTAG_Dockerized and make sure they overwrite the existing ones. Then stop the container, remove the images created by the installation that failed, and redo the entire installation.
The following procedure describes how to start using MedTAG in Test Mode, which allows you to try MedTAG with the pre-loaded dataset of reports. If you want to load and work with your own reports, you still have to complete the following steps and then jump to the Customize MedTAG section. In Test Mode only the Test user is enabled, and no other members can be added.
-
Open a new browser window and go to http://0.0.0.0:8000/. You will see the MedTAG web interface.
-
Log into MedTAG using "Test" as both username and password. You will enter MedTAG in Test Mode, which allows you to try the MedTAG features on the sample data we provide.
-
Once you have logged in, you will be asked to provide a first reports configuration. In particular, you have to provide:
-
Report type: this can be either MedTAG Reports, which indicates the reports the administrator uploaded, or PubMed articles, which indicates the PubMed articles you uploaded by giving their IDs.
-
Language: this is the language of the reports you will annotate.
-
Use case: this is the use case of the clinical reports (e.g. Colon cancer and Lung cancer).
-
Institute: this is the medical institute which provides the diagnostic reports.
-
Annotation mode: this can be Manual if the user creates the ground truths from scratch, or Automatic if the user edits automatically created ground truths. The Automatic option is available only if some automatically created ground truths exist.
NOTE: In Test Mode you can annotate a set of reports about colon we provided (this corresponds to the following combination: MedTAG reports, English, colon, default_hospital, Manual), or a set of PubMed articles (this corresponds to: PubMed articles, colon, Manual). Note that if you select PubMed articles you do not need to set language or institute because they are set by default.
-
When you customize MedTAG for the first time, the Test user, and all the documents, concepts and labels we provided for testing MedTAG are removed from the database and replaced with the documents, concepts, labels you uploaded.
In order to customize MedTAG with your own data, you need to provide three CSV files (i.e., reports_file, concepts_file, labels_file). Please make sure to use a comma as the separator in your CSV files, and to escape values that contain commas.
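As an illustration of the comma and escaping requirement, the sketch below builds a small reports_file with Python's standard csv module, which quotes any value containing a comma. The mandatory column names are the ones this README lists for reports_file; the `diagnosis` column, the sample row and the helper names are hypothetical.

```python
import csv
import io

# Mandatory reports_file columns, as listed in this README.
MANDATORY_COLUMNS = ["id_report", "language", "institute", "usecase"]


def write_reports_csv(rows, text_columns):
    """Serialize report rows as comma-separated text.

    The csv module quotes values that contain commas, so they stay in one field.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=MANDATORY_COLUMNS + text_columns)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def missing_mandatory_columns(csv_text):
    """Return the mandatory columns that are absent from the CSV header."""
    header = csv.DictReader(io.StringIO(csv_text)).fieldnames or []
    return [column for column in MANDATORY_COLUMNS if column not in header]


rows = [{
    "id_report": "report_001",
    "language": "English",
    "institute": "default_hospital",
    "usecase": "colon",
    # Hypothetical free-text column; note the comma inside the value.
    "diagnosis": "Tubular adenoma, low-grade dysplasia",
}]
csv_text = write_reports_csv(rows, text_columns=["diagnosis"])
print(csv_text)
print("Missing columns:", missing_mandatory_columns(csv_text))
```

Reading the text back with `csv.DictReader` returns the diagnosis as a single value, which is a quick way to confirm that the commas were escaped correctly before uploading the file.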
-
reports_file: this file contains the clinical reports to annotate. The csv header must contain the following columns:
-
id_report: the report unique identifier.
-
language: the language adopted for the report textual content.
-
institute: the health-care institute which provides the diagnostic reports.
-
usecase: the report use-case (e.g. colon cancer) indicates the clinical case the report refers to.
NOTE: if you are not interested in providing either the institute or the usecase you can assign them a default value of your choice, that holds for all the rows of the reports_file.
NOTE: In addition to the previous mandatory columns, you need to provide a set of additional columns to describe the actual textual content of your reports (e.g. the diagnosis text, the patient information and so on). You can specify as many columns as you want.
-
pubmed_file: this file contains the PubMed articles to annotate. The csv header must contain the following columns:
-
ID: the PubMed article's unique identifier.
-
usecase: the article's use-case (e.g. colon cancer), which indicates the clinical case the article refers to.
NOTE: if you are not interested in providing the usecase you can assign it a default value of your choice, that holds for all the rows of the pubmed_file.
NOTE: the language considered for PubMed articles is: English.
NOTE: PubMed articles are uploaded with a rate of 3 articles per second.
-
concepts_file: this file contains the concepts used for annotating the clinical reports. All the concepts must be identified with a concept_url which uniquely identifies the concept according to a reference ontology. The csv header must contain the following columns:
-
concept_url: the URL of the concept in the reference ontology.
-
concept_name: the name of the concept the concept_url points to.
-
area: this is a category associated to the concept.
-
usecase: the concept use-case (e.g. colon cancer) indicates the clinical case the concept refers to.
NOTE: if you are not interested in providing either the area or the usecase you can assign them a default value of your choice, that holds for all the rows of the concepts_file. It is worth noting that the usecase provided for the concepts should be coherent with the one provided for the reports.
-
labels_file: this file contains the labels used for annotating the clinical reports. A label describes a diagnostic property of a clinical report. For instance, the "Cancer" label describes the presence of a cancer-related disease. The csv header must contain the following columns:
The following procedure describes how to configure MedTAG in order to load your own reports and work with them. Note that only the admin user has the privileges to change the MedTAG configuration. Moreover, every time a new configuration is provided the previous one is overwritten, so the existing data and annotations are removed as well.
To start a new configuration follow the instructions below:
-
Open the Menu from the Test Mode and go to Configure.
-
Read and follow the instructions of the guided procedure.
-
Provide the CSV files.
NOTE: You can add one or more files from the same folder. If it is the first time you configure MedTAG, you are asked to provide both the username and the password that the admin user will use to log in to MedTAG. The admin user is the only one who can change the configuration files and access the data. If you do not have access to the Configure section (i.e., you do not see it in the side bar), you are not logged in as the admin user.
NOTE: It is mandatory to upload at least one of reports_file and pubmed_file. Once you have uploaded one or more reports_file, MedTAG automatically detects the columns which characterize your reports and asks you to choose which fields you want to hide, display or annotate. You need to set at least one field to be displayed. If you uploaded one or more pubmed_file instead, you do not have to set any field to be displayed or annotated: abstract and title can be annotated by default, while volume, journal, year and authors are only displayed.
NOTE: The concepts_file and labels_file are not mandatory. If you are not interested in label annotation and/or concept identification, you do not need to provide them. However, you must provide either the labels_file or the concepts_file, or set at least one field to Display and Annotate.
NOTE: If you uploaded the reports_file or the pubmed_file giving Colon, Uterine cervix or Lung as use-cases, you can rely on a set of concepts and labels we provide, without uploading your own ones. Remember that it is not allowed to upload new concepts (or labels) if you decided to rely on those we provide.
-
Check the format of the provided CSV files, by clicking on the Check button. Then, the automatic procedure will produce some state messages in different colors:
-
Green: messages in green color (i.e., success messages) mean that the provided CSV files are well-formatted.
-
Orange: messages in orange color (i.e., warning messages) mean that you should revise the format of the provided CSV files. Nevertheless, the provided CSV files are accepted anyway.
-
Red: messages in red color (i.e., error messages) mean that you must revise the format of the provided CSV files, since they are not well-formatted. Error messages describe the errors that occurred and suggest how to fix them.
-
When the procedure has ended, a notification of success or error will be provided. If you provided reports (or PubMed articles) whose use-cases are Colon, Uterine cervix or Lung, you will be notified that automatic annotation is available. This operation can be time consuming, so you can either annotate your reports automatically right away or log in and start the automatic annotation process at a later time. If you want to automatically annotate your reports, you have to select the fields from which the concepts, the mentions and the labels are extracted. In case of successful configuration of MedTAG, the login page will look like the screenshot below.
The following procedure describes how to provide additional data to the current configuration of MedTAG. Updating the configuration is possible only if you are not running MedTAG with the sample data we provided, that is, if MedTAG is not running in Test Mode. In order to update a configuration follow these steps:
-
Open the Menu and go to Configure and click on Update configuration.
-
Select what you want to update. You can add some reports, labels or concepts. You can also change the fields to display and annotate. If you want to update the fields to annotate and display, remember that you cannot set to Hide or Display the fields you previously decided to annotate, since this would affect the annotations that rely on those fields.
NOTE: If you decide to add reports having columns that MedTAG has never detected before, you will be asked to choose what columns to display, hide or annotate.
-
On this page you can also automatically annotate the reports whose use-cases are Colon, Uterine cervix or Lung (if any). For each use-case you can decide which fields to extract concepts, labels and mentions from. This process can be time and memory consuming, so we recommend using a sufficiently powerful machine (see the requirements section) to perform this task.
NOTE: If you want to automatically re-annotate reports belonging to a use-case, all the ground-truths previously automatically created for that use-case will be removed. The same holds for the automatic annotation of PubMed articles.
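The field rules stated in the configuration and update procedures (at least one field must be displayed, and a field that was set to be annotated cannot later be set to Hide or Display) can be expressed as a small check. This is a sketch under stated assumptions, not MedTAG's own validation code: the setting names `"hide"`, `"display"` and `"annotate"`, the function `check_field_settings` and the field names are hypothetical, and an annotated field is assumed to count as displayed.

```python
# Hypothetical spellings of the three field settings described in this README.
ALLOWED_SETTINGS = {"hide", "display", "annotate"}


def check_field_settings(new, previous=None):
    """Return a list of problems found in a field -> setting mapping.

    new:      the settings being submitted.
    previous: the settings of the current configuration, if any.
    """
    problems = []
    for field, setting in new.items():
        if setting not in ALLOWED_SETTINGS:
            problems.append(f"{field}: unknown setting {setting!r}")
    # Assumption: a field set to "annotate" is also visible to the annotator.
    if not any(setting in ("display", "annotate") for setting in new.values()):
        problems.append("at least one field must be displayed")
    # Fields annotated before must stay annotated, since annotations rely on them.
    for field, old_setting in (previous or {}).items():
        if old_setting == "annotate" and new.get(field, "annotate") != "annotate":
            problems.append(f"{field}: annotated fields must stay annotated")
    return problems


print(check_field_settings({"diagnosis": "annotate", "age": "hide"}))
print(check_field_settings({"diagnosis": "display"},
                           previous={"diagnosis": "annotate"}))
```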
MedTAG annotation performance has been assessed by means of an automatic agent, which simulated the annotation process for two specific use-cases:
- document-level annotation: this task concerns the annotation of documents with labels that describe the overall document content.
- mention-level annotation: this task concerns the identification of concept-related mentions, in the documents' textual content.
The annotation process performance has been evaluated in terms of:
- number of actions: number of user-required actions (e.g. clicks and keys pressed) to annotate documents according to the use-cases specified.
- time elapsed: the amount of time required to perform the whole annotation process (i.e. all the sample documents considered get annotated).
The analysis we conducted considers a sample of one hundred documents, randomly chosen from a real dataset concerning the digital pathology domain (i.e. colon cancer clinical reports). We assessed the performances of MedTAG and other annotation tools including ezTag, MyMiner and tagtog. We measured the number of actions and the time elapsed for each annotation tool. We computed the mean and the standard deviation over forty trials.
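The measurement scheme described above (count the actions, time the whole run, repeat for forty trials, then report mean and standard deviation) can be sketched as follows. This is not the authors' agent code: the real agents drive a browser, whereas here every action is a no-op placeholder, and `ActionCounter`, `run_trial`, `summarize` and `two_action_task` are hypothetical names. The sketch also assumes the sample standard deviation, which the text does not specify.

```python
import statistics
import time


class ActionCounter:
    """Counts simulated user actions such as clicks and key presses."""

    def __init__(self):
        self.count = 0

    def act(self, action):
        """Perform one action and record it."""
        self.count += 1
        action()


def run_trial(annotate_document, n_documents):
    """Annotate every document once; return (number of actions, elapsed seconds)."""
    counter = ActionCounter()
    start = time.perf_counter()
    for document_index in range(n_documents):
        annotate_document(counter, document_index)
    return counter.count, time.perf_counter() - start


def summarize(annotate_document, n_documents=100, n_trials=40):
    """Repeat the trial and summarize the elapsed times."""
    results = [run_trial(annotate_document, n_documents) for _ in range(n_trials)]
    times = [elapsed for _, elapsed in results]
    return {
        "actions": results[0][0],
        "mean_s": statistics.mean(times),
        "stdev_s": statistics.stdev(times),  # sample standard deviation (assumed)
    }


def two_action_task(counter, document_index):
    """Placeholder task: two no-op actions per document."""
    counter.act(lambda: None)
    counter.act(lambda: None)


summary = summarize(two_action_task)
print(summary)
```

With a deterministic agent the number of actions is the same in every trial, so only the elapsed time needs a mean and a standard deviation.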
The experiment results are summarized in the following tables:
Table 1: document-level annotation performance analysis
Tool | #Actions | Elapsed time in seconds (mean) | Standard deviation in seconds |
---|---|---|---|
MedTAG | 200 | 46.84 | 0.803 |
MyMiner | 100 | 56.677 | 0.416 |
tagtog | 400 | 205.74 | 5.471 |
Table 2: mention-level annotation performance analysis
Tool | #Actions | Elapsed time in seconds (mean) | Standard deviation in seconds |
---|---|---|---|
MedTAG | 519 | 159.337 | 0.479 |
ezTag | 307 | 260.34 | 0.576 |
teamTat | 307 | 271.577 | 1.542 |
tagtog | 404 | 304.692 | 10.067 |
MyMiner | 414 | 114.390 | 1.507 |
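To make the two tables easier to compare, the values can be normalized by the one hundred documents of the sample. The sketch below only re-expresses the numbers reported in Table 1 and Table 2; the helper names `per_document` and `fastest` are hypothetical.

```python
N_DOCUMENTS = 100  # size of the document sample used in the benchmark

# (number of actions, mean elapsed seconds), copied from Table 1 and Table 2.
document_level = {
    "MedTAG": (200, 46.84),
    "MyMiner": (100, 56.677),
    "tagtog": (400, 205.74),
}
mention_level = {
    "MedTAG": (519, 159.337),
    "ezTag": (307, 260.34),
    "teamTat": (307, 271.577),
    "tagtog": (404, 304.692),
    "MyMiner": (414, 114.390),
}


def per_document(table):
    """Return actions and seconds per document, plus seconds per action, per tool."""
    return {
        tool: {
            "actions_per_doc": actions / N_DOCUMENTS,
            "seconds_per_doc": seconds / N_DOCUMENTS,
            "seconds_per_action": seconds / actions,
        }
        for tool, (actions, seconds) in table.items()
    }


def fastest(table):
    """Return the tool with the lowest mean elapsed time."""
    return min(table, key=lambda tool: table[tool][1])


for name, table in (("document-level", document_level),
                    ("mention-level", mention_level)):
    print(name, "fastest tool:", fastest(table))
    for tool, stats in per_document(table).items():
        print(f"  {tool}: {stats['actions_per_doc']:.2f} actions/doc, "
              f"{stats['seconds_per_doc']:.3f} s/doc")
```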
The datasets considered for the benchmark experiments consist of a sample of one hundred documents from the digital pathology domain (i.e. anonymized colon cancer clinical reports). The datasets are available inside the datasets folder.
The benchmark experiments have been conducted using the Python Web automation library Selenium. The full source code of the automated agents implemented is available inside the folder automated_agents_selenium.
If you use MedTAG for your research work, please consider citing our paper:
@article{GiachelleMedTAG2021,
author = {Fabio Giachelle and
Ornella Irrera and
Gianmaria Silvello},
title = {MedTAG: a portable and customizable annotation tool for biomedical
documents},
journal = {{BMC} Medical Informatics Decis. Mak.},
volume = {21},
number = {1},
pages = {352},
year = {2021}
}
MedTAG has been developed by the Intelligent Interactive Information Access Hub (IIIA) of the Department of Information Engineering, University of Padua, Italy.
This work was partially supported by ExaMode, European Union Horizon 2020 program under Grant Agreement no. 825292.
Any questions? The authors are glad to answer your questions and receive your feedback or suggestions to further improve MedTAG.
- Fabio Giachelle · fabio.giachelle AT unipd.it
- Ornella Irrera · ornella.irrera AT unipd.it
- Gianmaria Silvello · gianmaria.silvello AT unipd.it