EdgeCare-masking-dataset

Overview

This repository contains a masking dataset designed for anonymizing personal information in medical text. The dataset consists of structured data where sensitive information is labeled and categorized for easy identification and removal.

Data Structure

The dataset is stored in Dataset.json and follows the structure below:

{
        "sentence": "The patient, Zoe Reed, has been receiving treatment since 03/12/2020. Contact: +1-555-678-2345.",
        "meta": {
            "note_id": "1750",
            "sl_context": false
        },
        "spans": [
            {
                "id": 0,
                "start": 13,
                "end": 21,
                "label": "PERSON",
                "text": "Zoe Reed"
            },
            {
                "id": 1,
                "start": 58,
                "end": 68,
                "label": "DATE",
                "text": "03/12/2020"
            },
            {
                "id": 2,
                "start": 79,
                "end": 94,
                "label": "PHONE",
                "text": "+1-555-678-2345"
            }
        ]
    }

Fields:

sentence: The original text containing sensitive medical information.
meta: Metadata related to the text sample.
- note_id: Unique identifier for the note.
- sl_context: Boolean flag indicating if the note includes Sri Lankan context.
spans: A list of entities detected in the sentence.
- id: Unique identifier for the entity.
- start: Start index of the entity in the sentence.
- end: End index of the entity in the sentence.
- label: Type of sensitive information.

Labels

The dataset includes the following entity labels:

ID	Label	Description
0	O	Non-entity tokens
1	PATIENT	Patient's name
2	STAFF	Medical staff name
3	AGE	Patient's age
4	BIRTHDATE	Patient's birthdate
5	DATE	Any date (admission, discharge, etc.)
6	PHONE	Phone number
7	ID	Identifiers (e.g. patient ID)
8	EMAIL	Email addresses
9	LOC	Location (city, state, address, etc.)
10	HOSP	Hospital or medical institution
11	ORG	Organizations or institutions
12	URL	URLs
13	TIMESTAMPS	Relative dates (e.g. last summer, yesterday)

Applications

This dataset can be used for:

De-identification tasks: Removing personal identifiers from medical text.
Named Entity Recognition (NER) models: Training machine learning models to detect sensitive information.
Privacy-preserving data processing: Ensuring compliance with HIPAA and other privacy regulations.

Contributing

If you would like to contribute to this dataset, feel free to:

Open an issue for any errors or improvements.
Submit a pull request with additional data or enhancements.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
Dataset.json		Dataset.json
Labels.json		Labels.json
Prompts_for_Masking_Dataset_Generation_Guide.pdf		Prompts_for_Masking_Dataset_Generation_Guide.pdf
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

EdgeCare-masking-dataset

Overview

Data Structure

Fields:

Labels

Applications

Contributing

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

EdgeCare-masking-dataset

Overview

Data Structure

Fields:

Labels

Applications

Contributing

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages