This is the official code for Deconstructing In-Context Learning: Understanding Prompts via Corruption. Arxiv link [TBD]
Accepted to LREC-COLING 2024 main conference.
- Including repeated text in the prompt (e.g., repeated inline instructions) boosts model performance.
- Adding both task and inline instructions improves model performance, even when these instructions consist of random words.
- Larger models exhibit higher sensitivity to prompt semantics and pay more attention to the semantically relevant prompt components.
```
pip install -r requirements.txt
```
We take the original task files from Super-NaturalInstructions and create a baseline file for each task. We also add inline instructions, which are not part of Super-NaturalInstructions. We used PromptSource to modify the instructions and label space so that the prompts are simple and understandable.
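Each original task file follows the Super-NaturalInstructions JSON layout. A minimal sketch of inspecting one (the file name below is illustrative):

```python
import json

# Illustrative path; any task file under src/data_generator/original_tasks works.
with open("src/data_generator/original_tasks/task1344_glue_entailment_classification.json") as f:
    task = json.load(f)

print(task["Definition"])        # the task instruction
instance = task["Instances"][0]
print(instance["input"])         # input text of the instance
print(instance["output"])        # list of acceptable gold answers
```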
- Get original task files - We copied the task files manually from the Super-NaturalInstructions dataset to `src/data_generator/original_tasks`.
- Reorder samples - We reorder the samples in the original task files so that the top 100 samples are label-balanced and can be used for evaluation (a minimal sketch of the balancing idea appears after this list). We do that with:
```
python src/data_generator/reorder_instances_for_testing.py
```
The reordered data files are saved to `src/data_generator/reordered_original_tasks`.
- Create baseline files - Baseline files contain the baseline (uncorrupted) prompt. One baseline file is created per task; the files have the suffix `_baseline` and can be found at `src/data_generator/processed_tasks`. Note: `_baseline` files include all prompt components: the task instruction, an inline instruction in each demonstration, and the demonstration inputs and labels. To create the baseline files, run:
```
python src/data_generator/create_baseline_data.py
```
- Add semantic corruptions - Run the following command to add semantic corruptions to the baseline files. The corruption name is added as a suffix to the task file names, and the corrupted files are stored in `src/data_generator/processed_tasks`:
```
python src/data_generator/create_corrupted_data.py
```
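The balanced reordering referenced above is implemented in `reorder_instances_for_testing.py`; a minimal sketch of the idea, assuming a round-robin interleaving over gold labels (the actual script may differ):

```python
import json
from collections import defaultdict
from itertools import zip_longest

def reorder_balanced(task_path: str) -> list:
    """Interleave instances by gold label so the earliest samples are label-balanced."""
    with open(task_path) as f:
        task = json.load(f)

    # Group instances by their (first) gold label.
    by_label = defaultdict(list)
    for inst in task["Instances"]:
        by_label[inst["output"][0]].append(inst)

    # Round-robin over the label groups: label A, label B, ..., label A, label B, ...
    return [inst for group in zip_longest(*by_label.values()) for inst in group if inst is not None]
```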
We evaluate 10 models on 10 datasets; 8 are classification tasks and 2 are generation tasks.
- Classification datasets are
- task1344_glue_entailment_classification
  - task843_financial_phrasebank_classification
- task1645_medical_question_pair_dataset_text_classification
- task116_com2sense_commonsense_reasoning
- task512_twitter_emotion_classification
- task379_agnews_topic_classification
- task828_copa_commonsense_cause_effect
- task1346_glue_cola_grammatical_correctness_classification
- Generation datasets are
- task1564_triviaqa_answer_generation
- task835_mathdataset_answer_generation
- Models are GPT-2-XL, GPT-J-6B, Pythia-12B, OPT-30B, OPT-30B-IML-MAX2, OPT-66B, Vicuna-33B, Llama-7B, Llama-2-70B, and Llama-2-70B-chat.
- Evaluation metrics - Exact match for classification tasks and ROUGE-L for generation tasks (a minimal sketch of both metrics follows this list).
- The datasets for evaluation are listed in `src/evaluation/test_tasks.txt`.
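For reference, a minimal sketch of the two metrics, assuming the `rouge_score` package; the official numbers come from `evaluation/compute_metrics.py`, so this is only illustrative:

```python
from rouge_score import rouge_scorer

def exact_match(prediction: str, reference: str) -> float:
    # Illustrative normalization; the repo's script defines the official variant.
    return float(prediction.strip().lower() == reference.strip().lower())

def rouge_l(prediction: str, reference: str) -> float:
    # ROUGE-L F1 between a prediction and a single reference.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(reference, prediction)["rougeL"].fmeasure
```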
- First, get the model predictions for a specific list of corruptions by running:
```
python evaluation/evaluate_model.py \
    --modelname [MODELNAME] \
    --corruptions_name_list [LIST OF CORRUPTION NAMES] \
    --task_set_no [SET NUMBER] \
    --max_target_len [MAXIMUM TARGET LENGTH] \
    --batch_size [EVALUATION BATCH SIZE]
```
e.g.
```
python evaluation/evaluate_model.py \
    --modelname facebook/opt-125m \
    --corruptions_name_list [empty,only_instruction] \
    --task_set_no set1 \
    --max_target_len 10 \
    --batch_size 16
```
The predictions are saved in `output_dir/`.
- Run the following script to compute metrics for the model predictions (the prediction files are named `predicted_examples_[CORRUPTION NAME].jsonl`):
```
python evaluation/compute_metrics.py --predictions [PREDICTION JSONL FILE PATH] --track default --compute_per_category_metrics --compute_per_task_metrics
```
e.g.
```
python evaluation/compute_metrics.py --predictions output/1.56B_gpt2-xl/set100/predicted_examples_only_instruction.jsonl --track default --compute_per_category_metrics --compute_per_task_metrics
```
- Jackknife variance estimation - Once we have the metric, we use the jackknife variance estimation method to compute the mean and variance of the metric. Select an input file and run the notebook `src/evaluation/jacknife_variance_estimation.ipynb` (a minimal sketch of the estimator is shown below).
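A minimal sketch of the leave-one-out jackknife estimator, assuming the per-task scores are already collected in a list (the notebook above is the authoritative implementation):

```python
import numpy as np

def jackknife_mean_variance(scores):
    """Leave-one-out jackknife estimate of the mean and its variance."""
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    loo_means = (scores.sum() - scores) / (n - 1)   # mean with the i-th score left out
    jack_mean = loo_means.mean()
    jack_var = (n - 1) / n * ((loo_means - jack_mean) ** 2).sum()
    return jack_mean, jack_var

# Example: exact-match scores for one model and corruption across tasks (illustrative numbers).
print(jackknife_mean_variance([0.62, 0.58, 0.71, 0.66, 0.60]))
```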
First, we pick 10 samples from each dataset where the model predicted correctly. For each of these samples, we compute the length of each prompt component and the average attention norm on that component, and save the complete jsonl file in `plot/collect_results/[modelname]_results`. Run the following command to get the complete jsonl file for a corruption:
```
python get_plots.py --file_path [PREDICTION JSONL FILEPATH] --corruption_name [CORRUPTION NAME]
```
e.g.
```
python get_plots.py --file_path ../predictions/6.05B_gpt-j-6B/all_corruptions/predicted_examples_all_instr_n_demo.jsonl --corruption_name all_instr_n_demo
```
Next, use the `plots/get_plot_pdf.ipynb` notebook to produce the attention plots presented in the paper.
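For intuition, a rough sketch of measuring how much attention a prompt component receives with Hugging Face `transformers`. Note that this averages raw attention weights, which is only a proxy for the attention-norm analysis in the paper, and the model and component span are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative; the paper analyzes larger models such as GPT-J-6B
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)

instruction = "Instruction: decide whether the sentence is grammatical.\n"
prompt = instruction + "Input: He go home.\nAnswer:"
enc = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**enc)

# out.attentions is a tuple over layers, each of shape (batch, heads, seq_len, seq_len).
attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]   # average over layers and heads

# Tokens belonging to the instruction component (spans would normally come from tokenizer offsets).
k = len(tokenizer(instruction)["input_ids"])
print(attn[-1, :k].mean().item())   # average attention from the final token to the instruction
```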
The corruption settings used across the scripts are:
- 'empty': only the test instance is in the prompt.
- 'only_demostration': the four demonstrations are added to the test instance.
- 'only_instruction': the task instruction is added to the test instance.
- 'only_inline': the inline instruction is added to the test instance instead of the task instruction.
- 'both_instructions': both the task and inline instructions are added to the test instance.
- 'instr_n_demo': the task instruction as well as four demonstrations are added to the test instance.
- 'inline_n_demo': an inline instruction is added to every demonstration, and no task instruction is included in the prompt.
- 'all_instr_n_demo': the baseline prompt, which includes all prompt components, i.e., the task instruction, demonstrations, inline instructions, and test instance.
- 'instr_randomwords': both task and inline instructions are corrupted with random words.
- 'labels_randomwords': demonstration labels are corrupted with random words.
- 'input_empty': demonstration inputs are removed from the baseline prompt.
- 'label_empty': demonstration labels are removed from the baseline prompt.
- 'input_ood': demonstration inputs are replaced with out-of-distribution text.
- 'input_randomwords': demonstration inputs are corrupted with random words [new!]
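For intuition, a rough sketch of how a prompt might be assembled for several of these settings; the exact separators and formatting used by the data-generation scripts are assumptions here:

```python
def build_prompt(setting, task_instruction, inline_instruction, demos, test_input):
    """Assemble a prompt for a few illustrative corruption settings.

    demos is a list of (demonstration_input, demonstration_label) pairs.
    """
    parts = []
    if setting in {"only_instruction", "both_instructions", "instr_n_demo", "all_instr_n_demo"}:
        parts.append(task_instruction)
    if setting in {"only_demostration", "instr_n_demo", "inline_n_demo", "all_instr_n_demo"}:
        for demo_input, demo_label in demos:
            if setting in {"inline_n_demo", "all_instr_n_demo"}:
                parts.append(inline_instruction)
            parts.append(f"{demo_input}\n{demo_label}")
    # The test instance is always present, optionally preceded by the inline instruction.
    if setting in {"only_inline", "both_instructions", "inline_n_demo", "all_instr_n_demo"}:
        parts.append(inline_instruction)
    parts.append(test_input)
    return "\n\n".join(parts)
```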
In the repeated-text corruptions, i.e., Inline instr. in [n] demos. and Rw inline instr. in [n] demos., we add a meaningful or a random-words inline instruction, respectively, to "n" of the demonstrations:
- 'inline_instr_in_0_demo'
- 'inline_instr_in_1_demo'
- 'inline_instr_in_2_demo'
- 'inline_instr_in_3_demo'
- 'randomwords_inline_instr_in_0_demo'
- 'randomwords_inline_instr_in_1_demo'
- 'randomwords_inline_instr_in_2_demo'
- 'randomwords_inline_instr_in_3_demo'
[TBD]