Inspired by the Kaggle Competition: The Learning Agency Lab - PII Data Detection
-
The goal of this model is to detect personally identifiable information (PII) in student writing. Automating the detection and removal of PII from educational data will lower the cost of releasing educational datasets, which will support learning science research and the development of educational tools.
-
The dataset comprises approximately 22,000 essays written by students enrolled in a massive open online course (MOOC). All of the essays were written in response to a single assignment prompt, which asked students to apply course material to a real-world problem. The task is to annotate the PII found within the essays.
-
To protect student privacy, the original PII in the dataset has been replaced by surrogate identifiers of the same type using a partially automated process. About 70% of the essays are reserved for the test set.
-
The model should be able to assign labels to the following seven types of PII:
- NAME_STUDENT: The full or partial name of a student that is not necessarily the author of the essay. This excludes instructors, authors, and other person names.
- EMAIL: A student's email address.
- USERNAME: A student's username on any platform.
- ID_NUM: A number or sequence of characters that could be used to identify a student, such as a student ID or a social security number.
- PHONE_NUM: A phone number associated with a student.
- URL_PERSONAL: A URL that might be used to identify a student.
- STREET_ADDRESS: A full or partial street address that is associated with the student, such as their home address.
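As a rough illustration, the seven types can be turned into the integer label set that a token classifier needs. The sketch below assumes the BIO prefixes described in this document and an arbitrary ordering with "O" at index 0; the helper name `build_label_maps` is hypothetical and not taken from the original project.

```python
# The seven PII types listed above.
PII_TYPES = [
    "NAME_STUDENT", "EMAIL", "USERNAME", "ID_NUM",
    "PHONE_NUM", "URL_PERSONAL", "STREET_ADDRESS",
]


def build_label_maps(pii_types):
    """Return (label2id, id2label) with "O" at index 0, then B-/I- pairs."""
    labels = ["O"]
    for pii_type in pii_types:
        labels.append(f"B-{pii_type}")
        labels.append(f"I-{pii_type}")
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


label2id, id2label = build_label_maps(PII_TYPES)
print(len(label2id))  # one "O" label plus a B-/I- pair for each of the seven types
```

Any other ordering works equally well, as long as the same mapping is used for training and for decoding predictions.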
-
The data is presented in JSON format, which includes a document identifier, the full text of the essay, a list of tokens, information about whitespace, and token annotations. The documents were tokenized using the SpaCy English tokenizer.
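To make the relationship between the fields concrete, here is a minimal sketch that rebuilds an essay's text from its tokens and whitespace flags. The record is made up for illustration (it is not real data), the helper `rebuild_text` is hypothetical, and the sketch assumes that each trailing whitespace is a single space, which may not hold for every essay.

```python
import json


def rebuild_text(tokens, trailing_whitespace):
    """Join tokens, adding one space after each token flagged as followed by whitespace."""
    return "".join(
        token + (" " if has_space else "")
        for token, has_space in zip(tokens, trailing_whitespace)
    )


# A made-up record shaped like the fields described above (not real data).
record = json.loads("""
{
  "document": 0,
  "full_text": "Hello, I am Jane Doe.",
  "tokens": ["Hello", ",", "I", "am", "Jane", "Doe", "."],
  "trailing_whitespace": [false, true, true, true, true, false, false],
  "labels": ["O", "O", "O", "O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O"]
}
""")

rebuilt = rebuild_text(record["tokens"], record["trailing_whitespace"])
print(rebuilt == record["full_text"])
```

Because the token, whitespace, and label lists are parallel, a check like this is a cheap way to catch misaligned records before training.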
-
Token labels are presented in BIO (Beginning, Inner, Outer) format. The PII type is prefixed with B- when it is the beginning of an entity. If the token is a continuation of an entity, it is prefixed with I-. Tokens that are not PII are labeled O.
-
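The BIO scheme can be decoded back into entity spans with a single pass over the labels. This is a minimal sketch rather than the project's own code: the function name `bio_to_spans` is hypothetical, and treating a stray I- tag as the start of a new entity is a lenient choice made here, not something the dataset specifies.

```python
def bio_to_spans(labels):
    """Return (pii_type, start, end) tuples; end is exclusive, indices are token positions."""
    spans = []
    current_type, start = None, None
    for i, label in enumerate(labels):
        prefix, _, pii_type = label.partition("-")
        # An I- tag continues the open entity only if the type matches.
        if prefix == "I" and pii_type == current_type:
            continue
        # Anything else closes the open entity, if there is one.
        if current_type is not None:
            spans.append((current_type, start, i))
            current_type, start = None, None
        # B- opens a new entity; a stray I- is treated as a start as well.
        if prefix in ("B", "I"):
            current_type, start = pii_type, i
    if current_type is not None:
        spans.append((current_type, start, len(labels)))
    return spans


labels = ["O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O", "B-EMAIL"]
print(bio_to_spans(labels))
```

Here the two name tokens collapse into one NAME_STUDENT span covering token positions 1 and 2, and the final token forms a one-token EMAIL span.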
- {test|train}.json - the test and training data; the test data given on this page is for illustrative purposes only, and will be replaced during Code rerun with a hidden test set.
  - (int): the index of the essay
  - document (int): an integer ID of the essay
  - full_text (string): a UTF-8 representation of the essay
  - tokens (list)
    - (string): a string representation of each token
  - trailing_whitespace (list)
    - (bool): a boolean value indicating whether each token is followed by whitespace
  - labels (list) [training data only]
    - (string): a token label in BIO format
- sample_submission.csv - An example of the correct submission format. See the Submission File section of the Overview page for details.
-
The data can be downloaded with the Kaggle CLI:
kaggle competitions download -c pii-detection-removal-from-educational-data
- ALBERT Base v2 is applied to this Named Entity Recognition problem
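ALBERT uses a subword tokenizer, so a single SpaCy token may be split into several pieces, and the token-level labels have to be aligned to those pieces before training. The source does not say how this was done here. The sketch below shows one common convention, under stated assumptions: `word_ids` is a list mapping each subword piece to its source token index (the shape returned by `word_ids()` on Hugging Face fast-tokenizer encodings), only the first piece of each token keeps its label, and everything else is masked with -100. The function name `align_labels` and the example values are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def align_labels(word_ids, word_label_ids):
    """Label the first piece of each word; mask special tokens and later pieces."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:            # special token such as [CLS] or [SEP]
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:      # first piece of a new word
            aligned.append(word_label_ids[word_id])
        else:                          # continuation piece of the same word
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Hypothetical example: three words, the second split into two pieces.
word_ids = [None, 0, 1, 1, 2, None]
word_label_ids = [0, 1, 0]  # e.g. O, B-NAME_STUDENT, O under some label map
print(align_labels(word_ids, word_label_ids))
```

Masking continuation pieces keeps the loss at one prediction per original token, which also makes it straightforward to map predictions back to the token indices used in the dataset.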